Scala: The powerhouse of Apache Spark

Scala, whose name is short for ‘Scalable Language’, is a multi-paradigm, statically typed, type-safe, general-purpose programming language. Widely used by data scientists today, its popularity is set to soar in the future because of the boom in the Big Data and data science domains.

The world is being flooded with data from a wide range of sources. The hottest trends in technology currently are Big Data and data science, both of which offer ways to cope with this data deluge. Many platforms have emerged in this space, but Apache Spark and Scala work in synergy to address the challenges this humongous volume of data throws up. They are being used at Facebook, Pinterest, Netflix, Conviva and TripAdvisor, among others, for Big Data and machine learning applications.

So what is Scala?

Scala stands for Scalable Language. It was developed as an object-oriented and functional programming language. Everything in Scala is an object, even its primitive data types. If you write a snippet of code in Scala, you will see that the style is similar to a scripting language. It is very powerful, yet compact, and requires only a fraction of the lines of code compared with other commercially used languages. Because of these characteristics and its support for distributed and concurrent programming, it is popularly used for data streaming, batch processing, AWS Lambda functions and analysis in Apache Spark.
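To illustrate this compactness, here is a minimal sketch (the Employee class and sample data are purely illustrative, not from the article): a data type, a collection and a filter-map pipeline in a handful of lines.

// A compact Scala sketch: define a data type and process a collection in a few lines.
case class Employee(name: String, age: Int)

val staff = List(Employee("Asha", 34), Employee("Ravi", 28), Employee("Meera", 45))

// Keep the employees older than 30, extract their names and print them.
staff.filter(_.age > 30).map(_.name).foreach(println)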

What is Apache Spark?

Apache Spark is a cluster computing framework that extends the MapReduce model popularised by Hadoop. Spark offers in-memory cluster computing, which speeds up computation by cutting down I/O transfer time. It is widely used to deal with Big Data problems because of its distributed architecture and parallel processing capabilities. It is preferred to Hadoop due to its stream processing and interactive query features. To provide a wide range of services, it has built-in libraries like GraphX, Spark SQL and MLlib. Spark supports Python, Scala, Java and R as programming languages, of which Scala is the most preferred.

Reasons to use Scala for Spark

Eighty-eight per cent of Spark users code in Scala, for the following reasons:
1) Apache Spark is written in Scala and runs on the JVM. Being proficient in Scala helps you dig into Spark's source code, so that you can easily access and implement its newest features.
2) Since Spark is implemented in Scala, it has the maximum number of features available at the earliest release. Features are then ported from Scala to support other languages like Python.
3) Scala's interoperability with Java is its biggest advantage, as experienced Java developers can quickly grasp its object-oriented concepts. You can also call Java code from within a Scala class.
4) Scala is a statically typed language. It looks like a dynamically typed language because it uses a sophisticated type inference mechanism, and this leads to better performance (see the short snippet after this list).
5) Scala renders the expressive power of a dynamic programming language without compromising on type safety.
6) It is designed for parallelism and concurrency to cater to Big Data applications. Scala has efficient built-in concurrency support and libraries like Akka, which allow you to build scalable applications.
7) Scala works well within the MapReduce framework because of its functional nature. Many Scala data frameworks follow abstract data types that are consistent with Scala's collections API. Developers just need to learn the basic standard collections, which makes it easy to get acquainted with other libraries.
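As a quick illustration of points 3 and 4 (a minimal sketch, not taken from the article), the snippet below shows type inference at work and Scala calling a standard Java collection class directly:

// Type inference: no explicit type annotations, yet everything is statically typed.
val count = 42                          // inferred as Int
val names = List("Spark", "Scala")      // inferred as List[String]

// Java interoperability: a java.util class used directly from Scala.
val javaList = new java.util.ArrayList[String]()
javaList.add("Hadoop")
println(javaList.size())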

Installing Scala

Scala can be installed on Windows or Linux-based systems. Java must be installed before Scala. The following steps install Scala 2.11.7 on Ubuntu 14.04 with Java 7. Type the following commands in a terminal.

1. To install Java, type:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

2. To install Scala, type:

$ cd ~/Downloads
$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
$ sudo dpkg -i scala-2.11.7.deb
$ scala --version

Working with Spark RDD using Scala

Resilient Distributed Datasets (RDDs) are the basic data abstraction in Spark. They can be created in two ways—from an existing collection in the driver program, or from an external data source.

Creating an RDD from an existing collection: First, switch to the Spark home directory and start the Spark shell, which loads a SparkContext as sc.

$ ./bin/spark-shell

Next, create an RDD from data already stored in the driver program. We first make an array and then use the parallelize method to create a Spark RDD from this iterable, using the following code:

val data = Array(2,4,6,8)
val distData = sc.parallelize(data)

To view the content of any RDD, use the collect method, as shown below:

distData.collect()

Creating an RDD from an external source: An RDD can also be created from external sources that provide a Hadoop InputFormat, such as a shared file system, HDFS, HBase, etc. First, load the desired file using the following syntax:

val lines = sc.textFile("text.txt")

To display the lines, use the command given below:

lines.take(2)

Basic transformations and actions

Transformations produce a new RDD from an existing one and are evaluated lazily. Actions return a result to the driver program; triggering an action causes all the queued-up transformations on the base RDD to execute, after which the action is applied to the resulting RDD.

Map transformation: Map applies a function to each element of the RDD. The following code computes the length of each line.

val lengths = lines.map(s => s.length)
lengths.collect()

Reduce action: Reduce aggregates the elements of an RDD using the given function. Applied to the output of the map above, the following code calculates the total number of characters in the file.

val totalLength = lengths.reduce((a, b) => a + b)
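To make the distinction between lazy transformations and actions concrete, here is one more small sketch that builds on the lines RDD created earlier (the 20-character threshold is arbitrary):

// filter is a transformation: it only records the computation, nothing runs yet.
val longLines = lines.filter(line => line.length > 20)

// count is an action: it triggers execution and returns a number to the driver.
longLines.count()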

The DataFrame API

DataFrame is a distributed collection of data organised into named columns, similar to a table in a relational database. DataFrames can be created from Hive tables, structured data files, external databases or existing RDDs. The DataFrame API uses a schema to describe the data, allowing Spark to manage the schema and pass only the data between nodes, which is more efficient than using Java serialisation. Spark can serialise the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, reducing the garbage-collection cost of constructing individual objects for each row in the data set. Because Spark understands the schema, there is no need to use Java serialisation to encode the data. The following code explores some functions related to DataFrames.

First, create an SQLContext object from the existing SparkContext (sc), as follows:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Then, read an external JSON file and store it in the DataFrame dfs. Next, show it and print its schema, using the following code:

val dfs = sqlContext.read.json("employee.json")
dfs.show()
dfs.printSchema()
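Beyond show() and printSchema(), DataFrames support relational-style operations such as select, filter and groupBy. The following sketch assumes that employee.json contains name and age columns, which the article does not show:

// Select a single column.
dfs.select("name").show()

// Keep only the rows where the assumed age column is greater than 25.
dfs.filter(dfs("age") > 25).show()

// Group by age and count the employees in each group.
dfs.groupBy("age").count().show()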

DataSet API

The DataSet API has encoders that translate between JVM representations (objects) and Spark's internal binary format. Spark has built-in encoders, which are powerful because they generate byte code to work with off-heap data and provide on-demand access to individual attributes without having to de-serialise an entire object. Moreover, the DataSet API is designed to work well with Scala. When working with Java objects, it is important that they are fully JavaBean-compliant. The following code explores some basic functions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class ScalaPerson(name: String, age: Int)  // sample case class; the fields are assumed
val conf = new SparkConf().setAppName("DatasetExample").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val sampleData: Seq[ScalaPerson] = Seq(ScalaPerson("Asha", 34), ScalaPerson("Ravi", 28))  // inline stand-in for ScalaData.sampleData()
val dataset = sqlContext.createDataset(sampleData)
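Once created, the Dataset supports typed operations that are checked at compile time; here is a brief sketch using the ScalaPerson fields assumed above:

// Typed transformations: the compiler verifies the field names and types.
dataset.filter(p => p.age > 30).show()
dataset.map(p => p.name).collect().foreach(println)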

The Scala vs Python debate

Scala and Python are both powerful and popular among data scientists, many of whom learn and work with both languages. Scala is faster and moderately easy to use, while Python is slower but very easy to use. Even so, Python usually ranks second, for the following reasons:

1. Scala is usually around ten times faster than Python for processing. Python code runs in a separate interpreter, and data has to be converted between it and the JVM, which creates a performance overhead. Scala also has an edge in performance when there are fewer processor cores.
2. Scala is better for concurrency due to its ability to easily integrate across several databases and services. It has asynchronous libraries and reactive cores (a short Future example follows this list). Python, by contrast, supports heavyweight process forking but not true multi-threading for parallel computing.
3. The Scala programming language has existential types, macros and implicits. Its advantages are evident when these powerful features are used in important frameworks and libraries. Scala is the best choice for Spark Streaming, because Python's Spark streaming support is not as advanced and mature as Scala's.
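As a small illustration of the concurrency point above (a sketch using only the Scala standard library, not code from the article), Futures make it straightforward to run tasks asynchronously and combine their results:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run two independent computations asynchronously.
val f1 = Future { (1 to 1000).sum }
val f2 = Future { (1001 to 2000).sum }

// Combine the results once both futures complete.
val combined = for (a <- f1; b <- f2) yield a + b
println(Await.result(combined, 10.seconds))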

Scala has fewer machine learning and natural language processing libraries than Python. Its libraries offer only a limited set of algorithms, but these are sufficient for Big Data applications. Scala also lacks good visualisation tools and local data transformations. Nevertheless, it is often preferred, since porting code to Python can introduce more issues and bugs, as translation between the two is difficult. Scala's winning combination of object-oriented and functional programming paradigms might surprise beginners, and they could take some time to pick up the new syntax. Scala might be a difficult language to master for Apache Spark, but the time spent learning it is worth the investment.
