The Complete Magazine on Open Source

Spark’s MLlib: Scalable Support for Machine Learning


Designated as Spark’s scalable machine learning library, MLlib consists of common algorithms and utilities as well as underlying optimisation primitives.

The world is being flooded with data from all sources. The hottest trend in technology is related to Big Data, and the evolving field of data science is a way to cope with this data deluge. Machine learning is at the heart of data science. The need of the hour is to have efficient machine learning frameworks and platforms to process Big Data. Apache Spark is one of the most powerful platforms for analysing Big Data. MLlib is its machine learning library, and is potent enough to process Big Data and apply a wide range of machine learning algorithms to it efficiently.

Apache Spark

Apache Spark is a cluster computing framework that generalises the MapReduce model popularised by Hadoop. Spark keeps intermediate data in memory across the cluster, which speeds up computation by reducing the IO transfer time between steps. It is widely used to deal with Big Data problems because of its distributed architectural support and parallel processing capabilities. Users often prefer it to Hadoop MapReduce on account of its stream processing and interactive query features. To provide a wide range of services, it has built-in libraries like GraphX, Spark SQL and MLlib. Spark supports Python, Scala, Java and R as programming languages, of which Scala, the language Spark itself is written in, has traditionally been the most preferred.

MLlib

MLlib is Spark’s machine learning library. It is predominantly used in Scala but it is compatible with Python and Java as well. MLlib was initially contributed by AMPLab at UC Berkeley. It makes machine learning scalable, which provides an advantage when handling large volumes of incoming data.

The main features of MLlib are listed below.

Machine learning algorithms: Regression, classification, collaborative filtering, clustering, etc

Featurisation: Selection, dimensionality reduction, transformation, feature extraction, etc

Pipelines: Construction, evaluation and tuning of ML pipelines

Persistence: Saving/loading of algorithms, models and pipelines

Utilities: Statistics, linear algebra, probability, data handling, etc

Some lower level machine learning primitives, like the generic gradient descent optimisation algorithm, are also present in MLlib. In recent releases, the primary MLlib API is based on DataFrames instead of RDDs, for better performance and a more uniform API across languages.
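
To make the idea of a gradient descent primitive concrete, here is a minimal sketch in plain Python that fits a straight line by batch gradient descent on the mean squared error. It is an illustration only and not MLlib's implementation, which runs over distributed data; the function name, step size and toy data are assumptions chosen for the example.

```python
def gradient_descent(xs, ys, step_size=0.05, iterations=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Residuals of the current model on every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

The same loop structure (compute a gradient over the data, then update the parameters) is what makes such algorithms iterative, and it is why they benefit from keeping the data in memory between passes.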

The advantages of MLlib

The true power of Spark lies in its extensive libraries, which cover a very wide range of data analysis tasks. MLlib is at the core of this functionality. It has several advantages.

Ease of use: MLlib integrates well with four languages: Java, R, Python and Scala. Programmers can work through the API for the language they already know, so they don’t need to learn a new one.

Easy to deploy: No preinstallation or conversion is required to use a Hadoop based data source such as HBase, HDFS, etc. Spark can also run standalone or on an EC2 cluster.

Scalability: The same code can work on small or large volumes of data without the need of changing it to suit the volume. As businesses grow, it is easy to expand vertically or horizontally without breaking down the code into modules for performance.

Performance: The ML algorithms can reportedly run up to 100X faster than MapReduce. Many ML algorithms are iterative, making repeated passes over the same data, and MLlib takes advantage of Spark’s support for iterative computation. The performance gain is attributed to in-memory computing, which is a speciality of Spark: data is kept in memory between passes instead of being written to and re-read from disk.

Algorithms: The main ML algorithms included in the MLlib module are classification, regression, decision trees, recommendation, clustering, topic modelling, frequent item sets, association rules, etc. ML workflow utilities included are feature transformation, pipeline construction, ML persistence, etc. Singular value decomposition, principal component analysis, hypothesis testing, etc, are also possible with this library.

Community: Spark is open source software under the Apache Software Foundation. It gets tested and updated by a vast contributing community. MLlib is a rapidly expanding component, and new features are added regularly. People contribute their own algorithms, and a large body of resources is available.
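
The performance argument in the list above rests on one idea: an iterative algorithm needs the whole data set on every pass, so re-reading it from storage each time is wasteful. The plain Python sketch below simulates this with a hypothetical `CountingSource` class that counts full scans; it is not Spark code, and the names are assumptions made for illustration. In Spark, the comparable step is calling `cache()` on a DataFrame or RDD before iterating over it.

```python
class CountingSource:
    """Hypothetical stand-in for slow storage; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1  # each call models one full scan from disk
        return list(self._values)


def iterate_mean(source, passes, cache):
    """Run `passes` iterations that each need the full data set."""
    cached = source.load() if cache else None
    estimate = 0.0
    for _ in range(passes):
        data = cached if cache else source.load()
        # Move the estimate halfway towards the data mean on every pass.
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


disk_bound = CountingSource([1.0, 2.0, 3.0, 4.0])
in_memory = CountingSource([1.0, 2.0, 3.0, 4.0])
result_disk = iterate_mean(disk_bound, passes=10, cache=False)
result_memory = iterate_mean(in_memory, passes=10, cache=True)
print(disk_bound.reads, in_memory.reads)  # 10 scans versus 1
```

Both runs produce the same estimate; only the number of scans differs, and that gap widens with the number of iterations.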

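MLlib and related libraries

SciKit-Learn: Strictly speaking, scikit-learn is a separate Python library and not a module of MLlib; it is listed here because it covers much of the same ground. It contains many basic ML algorithms that perform the various tasks listed below.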

Classification: Random forest, nearest neighbour, SVM, etc

Regression: Ridge regression, support vector regression, lasso, logistic regression, etc

Clustering: Spectral clustering, k-means clustering, etc

Decomposition: PCA, non-negative matrix factorisation, independent component analysis, etc

Mahout: Apache Mahout is a separate Apache project and not a module of MLlib, but it covers a similar set of tasks. It contains many basic ML algorithms that perform the tasks listed below.

Classification: Random forest, logistic regression, naive Bayes, etc

Collaborative filtering: ALS, etc

Clustering: k-means, fuzzy k-means, etc

Decomposition: SVD, randomised SVD, etc

Spark MLlib use cases

Spark’s MLlib is used frequently in marketing optimisation, security monitoring, fraud detection, risk assessment, operational optimisation, preventative maintenance, etc.

Here are some popular use cases.

NBC Universal: The international cable TV business stores huge volumes of media data. To reduce costs, NBC Universal takes its media offline when it is not in use. An SVM implemented with Spark’s MLlib is used to predict which files should be taken down.

ING: MLlib is used in its data analytics pipeline to detect anomalies. Decision trees and k-means, as implemented by MLlib, enable this.

Toyota: Toyota’s Customer 360 insights platform uses social media data in real-time to prioritise the customer reviews and categorise them for business insights.
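
As a rough illustration of how k-means can support anomaly detection, as in the ING use case above, the plain Python sketch below clusters "normal" points and then flags any new point that lies far from every cluster centre. This is a generic technique shown under stated assumptions, not ING's actual pipeline and not MLlib code; the data, the starting centroids and the distance threshold are all hypothetical.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm, started from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx, _ = nearest(p, centroids)
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


# Two tight groups of "normal" observations (hypothetical data).
normal = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids = kmeans(normal, centroids=[normal[0], normal[3]])


def is_anomaly(point, threshold=1.0):
    """Flag points farther than `threshold` from every cluster centre."""
    return nearest(point, centroids)[1] > threshold


print(is_anomaly((1.1, 0.9)), is_anomaly((9.0, 0.0)))  # False True
```

In practice the threshold would be tuned on historical data, and the clustering itself would run on the cluster through MLlib instead of in a single Python process.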

ML vs MLlib

There are two main machine learning packages: spark.mllib and spark.ml. The former is the original package and has its API built on top of RDDs. The latter has a newer, higher-level API built on top of DataFrames to construct ML pipelines. The newer package is recommended because DataFrames make it more versatile and flexible. Newer releases still ship the older package for backward compatibility, but since Spark 2.0 the RDD-based API has been in maintenance mode, with new features going into spark.ml. Having been in development longer, spark.mllib historically had more features, although spark.ml has since closed most of that gap. Spark ML allows you to create pipelines that chain data transformations and machine learning steps. In short, spark.ml is newer, is built on DataFrames and makes pipelines easier to construct, while spark.mllib is older, is built on RDDs and historically had more features.
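
The pipeline idea can be sketched without Spark. In spark.ml, a pipeline chains stages: transformers, which only transform data, and estimators, which are first fitted to data and then yield a fitted transformer. The plain Python classes below mimic that pattern on a list of dictionaries. All class names here (`SimpleTokenizer`, `VocabularyIndexer`, `SimplePipeline` and so on) are hypothetical stand-ins for illustration and are not Spark's API.

```python
class SimpleTokenizer:
    """Transformer: nothing to learn, so it only offers transform()."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class VocabularyIndexer:
    """Estimator: fit() learns a vocabulary and returns a fitted model."""

    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r["words"]})
        return VocabularyModel({w: i for i, w in enumerate(vocab)})


class VocabularyModel:
    """Fitted transformer produced by VocabularyIndexer.fit()."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Words not seen during fitting are dropped.
        return [
            dict(r, ids=[self.index[w] for w in r["words"] if w in self.index])
            for r in rows
        ]


class SimplePipeline:
    """Fits each stage in order, feeding it the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return SimplePipelineModel(fitted)


class SimplePipelineModel:
    """The fitted pipeline: applies every fitted stage in sequence."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "Spark is fast"}, {"text": "MLlib is scalable"}]
model = SimplePipeline([SimpleTokenizer(), VocabularyIndexer()]).fit(train)
result = model.transform([{"text": "Spark MLlib is new"}])
print(result[0]["ids"])  # [4, 2, 1]
```

The benefit of this design is that the fitted pipeline is a single object: the same sequence of steps learned on training data is applied unchanged to new data, and the whole thing can be saved, loaded and tuned as one unit.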

MLlib is one of the main reasons for the popularity and widespread use of Apache Spark in the Big Data world. Its compatibility, scalability, ease of use, good features and functionality have led to its success. It provides many inbuilt functions and capabilities, which makes life easier for machine learning programmers. Most machine learning algorithms in common use are either available in, or can be implemented with, either version of MLlib. In this era of data deluge, such libraries certainly are a boon to data science.