Deriving knowledge and insights from data empowers enterprises to make effective decisions, cut costs, speed up delivery, and enhance competitiveness and productivity. Enterprises that make data science an integral part of their business strategy outperform businesses that ignore it.
Data science is the systematic study and analysis of data from various sources for effective decision making and problem solving. Its objective is to support the full range of data-related processes: data procurement, pre-processing to remove noise, data representation, data evaluation, data analysis and the use of data to create knowledge for the enterprise. Data science contributes significantly to decision making at all levels: individual, organisational and global. The biggest emerging opportunity in data science is Big Data, which enables organisations to analyse the large data sets generated through website logs, sensors and transactions to gain insights and make effective decisions.
The term ‘data science’ has appeared in several contexts over the past few decades and has recently gained significant attention. An early reference goes back to Peter Naur in 1960, who also coined the related term ‘datalogy’. In 1974, Naur published his book, ‘A Concise Survey of Computer Methods’, which freely used the term data science in its survey of the contemporary data processing methods being used in a wide range of applications.
The importance of data science and analytics in enterprises
In the modern era, data is a key element in running any enterprise successfully. It is not enough for a business to just acquire data; to attain better results, the data must be used effectively. Data must be processed and analysed in proper ways to gain insights into a particular business. When data is acquired from multiple sources, with no specific format, it incorporates a lot of noise and redundant values. In order to make it suitable for decision making, data has to undergo the processes of cleaning, munging, analysing and modelling. This is where data science and analytics come into play. Big Data analytics has been a major revolution for enterprises, strengthening the roots of businesses. As of 2018, more than 50 per cent of the world’s enterprises are making use of Big Data analytics in business, as compared to 17 per cent in 2015, which marks tremendous growth in adoption.
Many companies are interested in data science to improve decision making, but businesses are mostly unaware of the need for planning and carrying out such activities. To make efficient use of data science, the primary requirement is skilled personnel, i.e., data scientists. Such professionals are responsible for collecting, analysing and interpreting large amounts of data to identify ways to help the business improve operations and gain a competitive edge over rivals. The best resources and infrastructure are also required to carry out all data analysis activities in a smooth manner. In addition, it is necessary to identify possible sources of data and permissions as well as the methods needed to acquire the data. The next step is learning and building data science skills. Statistical, modelling, programming, visualisation, machine learning and data mining skills are required to carry out data science activities. The final step is to take the relevant proactive action based on the insights from data analysis.
The major hurdle in data science is the availability of data. Data collection as well as data structuring is essential, before data can be made useful. Then the data has to be cleaned, analysed and processed into a suitable model with efficient presentation.
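The cleaning and munging steps described above can be sketched in a few lines of plain Python. This is a minimal illustration only; the field names, the deduplication rule and the mean-imputation strategy are all invented for the example.

```python
# Minimal sketch of a cleaning pipeline: deduplicate exact repeats and
# impute missing numeric values. All field names here are illustrative.
import csv
import io
from statistics import mean

RAW = """id,age,score
1,34,88
2,,91
1,34,88
3,29,
"""

def clean(raw_csv):
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    # 1. Drop exact duplicate records (common when sources are merged).
    seen, unique = set(), []
    for r in rows:
        key = tuple(r.items())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # 2. Impute missing numeric fields with the column mean.
    for col in ("age", "score"):
        vals = [float(r[col]) for r in unique if r[col]]
        avg = mean(vals)
        for r in unique:
            r[col] = float(r[col]) if r[col] else avg
    return unique

cleaned = clean(RAW)
```

Real pipelines would typically use a library such as pandas for these steps, but the logic is the same: remove redundancy first, then repair or fill the gaps before modelling.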
The following reasons highlight the importance of data science and Big Data analytics for today’s enterprises.
- Effective decision making: Data science and Big Data analytics act as trusted advisors to the organisation, particularly for effective strategic planning. They provide a strong base for the management to enhance its analytical abilities and the overall decision making process. They enable the measuring, recording and tracking of performance metrics, which allows the top management to set new goals and objectives for profit maximisation.
- Identification of competitive trends: Data analytics enables organisations to determine the patterns within large data sets. This is very useful in identifying new and upcoming market trends. Once the trends are identified, these can become a useful parameter for the enterprise to gain a competitive advantage by introducing new products and services in the market to gain an edge over rivals.
- Efficiency in handling core tasks and issues: By making the employees aware of the benefits of the organisation’s analytics product, data science can enhance job efficiency. Working closely with company goals, the employees will be able to direct more resources towards core tasks and issues at every stage, which will enhance the operational efficiency of the business.
- Promoting low risk data-driven action plans: Big Data analytics makes it possible for every SME to take action based on quantifiable, data-driven evidence. With such actions, businesses can reduce unnecessary tasks and curtail certain risks to a remarkable level.
- Selecting a target audience: Big Data analytics plays a key role in providing more insight into customer preferences and expectations. With deeper analysis, companies can identify target audiences and can propose new customer-driven and customer-oriented products and services.
The growing adoption of data science and analytics in enterprises
In the past three years, Big Data analytics has seen a significant increase in usage. The following are some interesting statistics:
- Reporting, dashboards, advanced visualisation, end-user ‘self-service’ and data warehousing are the top five technologies and strategic initiatives in business intelligence. Big Data currently ranks 20th across 33 key technologies. Big Data analytics has greater strategic importance than IoT, natural language processing, cognitive business intelligence (BI) and location intelligence.
- Among the 53 per cent of the enterprises worldwide that are currently using data science for decision making, telecom and financial sector companies are major adopters of Big Data analytics.
- Data warehouse optimisation is regarded as a very important aspect of Big Data analytics, followed by customer analysis and predictive maintenance.
- Spark, MapReduce and Yarn are the most popular data science analytics software frameworks today.
- The most popular Big Data access methods include Spark SQL, Hive, HDFS and Amazon S3. A 73 per cent share of this segment is held by Spark SQL, followed by 26 per cent by Hive and HDFS, and the rest by Amazon S3.
- Machine learning is gaining more industrial support with the Spark Machine Learning Library (MLlib), and the adoption rate is expected to rise by 60 per cent over the next year.
Considering the adoption rate of data science in enterprises, lots of commercial and open source tools are now available. The focus of this article is to cover the various open source tools for data science, especially those based on machine learning and deep learning.
1. Apache Mahout
The Apache Mahout project was started by a group of people involved in Apache Lucene, who conducted a lot of research in machine learning and wanted to develop robust, well documented, scalable implementations of common machine learning algorithms for clustering and categorisation. The primary objective behind the development of Mahout was to:
- Build and support a community of users to contribute source code to Apache Mahout.
- Focus on real-world, practical use cases as compared to bleeding-edge research or unproven techniques.
- Strongly support the software with documentation and examples.
Apache Mahout is a scalable machine learning library, available as open source software under the Apache Software Foundation. It supports algorithms for clustering, classification and collaborative filtering on distributed platforms. Mahout also provides Java libraries for common maths operations (focused on linear algebra and statistics) and primitive Java collections. A few of its algorithms are implemented as MapReduce jobs, which can run on Hadoop to exploit the parallelism of a distributed cluster. Once Big Data is stored in the Hadoop Distributed File System (HDFS), Mahout provides data science tools to automatically find meaningful patterns in big data sets. The key features of Mahout are listed below.
- Allows collaborative filtering, i.e., it mines user behaviour and makes probable product recommendations.
- Allows clustering, which organises items in a particular class, and naturally occurring groups in such a way that all the items belonging to the same group share similar characteristics.
- Allows automatic classification by learning from existing categories and accordingly assigns unclassified items to the best category.
- Supports frequent item set mining, which analyses the items in a group (e.g., the items in a shopping cart) and identifies which items typically appear together.
- Incorporates a number of pre-designed algorithms for Scala + Apache Spark, H2O, Apache Flink, etc.
- Contains Samsara, a vector maths experimentation environment with R-like syntax that works at scale.
- Includes several MapReduce enabled clustering implementations like k-means, fuzzy k-means, Canopy, Dirichlet and MeanShift.
- Comes with distributed fitness function capabilities for evolutionary programming.
- Includes matrix and vector libraries.
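The collaborative filtering and frequent item set ideas that Mahout implements at scale can be illustrated with a toy, single-machine sketch. The basket data and the co-occurrence scoring rule below are invented for the example; Mahout runs equivalent computations as distributed jobs.

```python
# Toy item co-occurrence recommender: items that often appear together in
# the same basket are recommended alongside each other.
from collections import Counter
from itertools import combinations

baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def cooccurrence(baskets):
    counts = Counter()
    for b in baskets:
        for a, c in combinations(sorted(b), 2):
            counts[(a, c)] += 1   # count the pair in both directions
            counts[(c, a)] += 1
    return counts

def recommend(item, counts, k=2):
    # Rank the items most often seen together with `item`.
    scores = {b: n for (a, b), n in counts.items() if a == item}
    return [i for i, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:k]

co = cooccurrence(baskets)
```

Calling `recommend("milk", co)` surfaces "bread" and "eggs", the items most frequently bought with milk in this toy data set.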
Official website: https://mahout.apache.org/
Latest version: 0.13.0
2. Apache SystemML
Apache SystemML was designed by researchers at the IBM Almaden Research Center in 2010. Previously, data scientists wrote machine learning algorithms in R and Python for small data, and rewrote them in Scala for Big Data. This process was long, iterative and sometimes error prone. SystemML was proposed to simplify it: its objective was to automatically scale an algorithm written in R or Python to handle Big Data, without errors, using a multi-iterative translation approach.
Apache SystemML is a flexible machine learning system that automatically scales to Spark and Hadoop clusters. It provides a high-level language to quickly implement and run machine learning algorithms on Spark. It advances machine learning in two important ways: Declarative Machine Learning (DML), the SystemML language, which includes linear algebra primitives, statistical functions and ML constructs; and automatic optimisation based on data and cluster characteristics, to ensure efficiency and scalability.
- Customisation of algorithms: All machine learning algorithms in SystemML are expressed in the high-level Declarative Machine Learning language (DML). This gives data scientists better productivity through full flexibility in designing custom analytics, and independence from input formats and physical data representations.
- Multiple execution modes: SystemML can be operated in standalone mode on a single machine, allowing data scientists to develop algorithms locally without the need of a distributed cluster. It features a Spark MLContext API that allows for programmatic interaction via Scala, Python, and Java. SystemML also features an embedded API for scoring models.
- Support for many machine learning algorithms: SystemML features a suite of production-level examples that can be grouped into six broad categories: descriptive statistics, classification, clustering, regression, matrix factorisation and survival analysis.
- Optimisation: Algorithms specified in DML are dynamically compiled and optimised based on data and cluster characteristics using rule-based and cost-based optimisation techniques.
Official website: https://systemml.apache.org/
Latest version: 1.1.0 for Spark 2.0.2
3. H2O
H2O was designed and developed by H2O.ai in 2011. It enables end users to test thousands of prediction models to discover varied patterns in data sets. H2O offers interfaces for R, Python, Scala, Java, JSON and the Flow notebook/Web interface, and works seamlessly with a large number of Big Data technologies like Hadoop and Spark. It provides a GUI-driven platform for companies to perform faster data computations. Currently, H2O supports all types of basic and advanced algorithms like deep learning, boosting, bagging, naïve Bayes, principal component analysis, time series, k-means, generalised linear models, etc.
Recently, APIs for R, Python, Spark and Hadoop have also been released by H2O, which provide data structures and methods suitable for Big Data. H2O allows users to analyse and visualise whole sets of data without using the Procrustean strategy of studying only a small subset with a conventional statistical package.
H2O uses iterative methods that provide quick answers using all of the client’s data. When a client cannot wait for an optimal solution, it can interrupt the computations and use an approximate solution. In its approach to deep learning, H2O divides all the data into subsets and then analyses each subset simultaneously using the same method. These processes are combined to estimate parameters by using the Hogwild scheme, a parallel stochastic gradient method. These methods allow H2O to provide answers that use all the client’s data, rather than throwing away most of it and analysing a subset with conventional software.
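The subset-and-combine idea described above can be shown with a deliberately simplified Python sketch. Real Hogwild-style training updates shared parameters from many workers in parallel without locks; this toy, sequential version just averages per-subset SGD estimates of a one-parameter linear model, so it illustrates the data-partitioning idea rather than H2O's actual implementation.

```python
# Fit y = w * x by stochastic gradient descent on each data subset,
# then combine the per-subset estimates (simplified stand-in for
# parallel parameter estimation over partitioned data).
def sgd_fit(pairs, lr=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            w -= lr * 2 * (w * x - y) * x   # gradient of squared error
    return w

data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # true slope is 3
subsets = [data[:2], data[2:]]                        # pretend partitions
w_avg = sum(sgd_fit(s) for s in subsets) / len(subsets)
```

Because every subset is drawn from the same underlying relationship, the combined estimate recovers the true slope while each worker only ever touches its own partition.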
- Very fast and accurate: H2O runs fast serialisation between nodes and clusters. It supports large data sets and is highly responsive.
- Scalable: Fine-grained distributed processing on Big Data, at speeds up to 100x faster, delivers optimal efficiency without degrading computational accuracy.
- Easy to use.
- Multi-platform: Can run on Windows, Linux and even Mac OS X.
- Excellent GUI: The GUI is compatible with all types of browsers like Safari, Firefox, Internet Explorer and Chrome.
Official website: https://www.h2o.ai
Latest version: Tutte
4. Apache Spark MLlib
Apache Spark MLlib is a machine learning library designed to make practical machine learning scalable and easy. It comprises common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as lower-level optimisation primitives and higher-level pipeline APIs.
Spark MLlib is regarded as a distributed machine learning framework on top of the Spark Core which, due to the distributed memory-based Spark architecture, is almost nine times as fast as the disk-based implementation used by Apache Mahout.
Given below are the common machine learning and statistical algorithms that have been implemented and included with MLlib.
- Classification: Logistic regression, naïve Bayes
- Regression: Generalised linear regression, survival regression
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: Alternating least squares (ALS)
- Clustering: K-means, Gaussian mixtures (GMMs)
- Topic modelling: Latent Dirichlet allocation (LDA)
- Frequent item sets, association rules and sequential pattern mining
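To make the list concrete, here is a toy, single-machine sketch of one of these algorithms, k-means, in plain Python. MLlib distributes the same assignment and update steps across a cluster; the points and starting centroids below are invented for the example.

```python
# Toy 1-D k-means: repeatedly assign points to their nearest centroid,
# then move each centroid to the mean of its assigned points.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:                              # assignment step
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else c   # update step
                     for c, ps in clusters.items()]
    return sorted(centroids)

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans_1d(pts, [0.0, 10.0])
```

On this data the centroids settle at roughly 1.0 and 9.0, the centres of the two obvious clusters.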
The key features of MLlib are as follows.
- Easy to use: MLlib fits easily into Spark’s APIs and interoperates with NumPy in Python and with R libraries.
- Best performance: Spark MLlib offers excellent performance, almost 100 times better than MapReduce. MLlib contains high-quality algorithms that leverage iteration and can give better results than one-pass approximations.
- Multi-platform support: MLlib can run on Hadoop, Apache Mesos, Kubernetes, a standalone system or in the cloud.
Official website: https://spark.apache.org/mllib/
Latest version: 2.1.3
5. Oryx 2
Oryx 2 is a realisation of the Lambda architecture built on Apache Spark and Apache Kafka for real-time, large-scale machine learning. It is designed for building applications, and includes packaged, end-to-end applications for collaborative filtering, classification, regression and clustering.
Oryx 2 comprises the following three tiers:
- General Lambda architecture tier: Provides batch, speed and serving layers, which are not specific to machine learning.
- Specialisation tier: Provides machine learning abstractions on top of the Lambda architecture, such as hyperparameter selection.
- End-to-end application tier: Implements the same standard machine learning algorithms (ALS, random decision forests, k-means) as complete applications.
Oryx 2 consists of the following layers of Lambda architecture as well as connecting elements.
- Batch layer: Used for computing new results from historical data and previous results.
- Speed layer: Produces and publishes incremental model updates from a stream of new data.
- Serving layer: Receives models and updates, and implements a synchronous API, exposing query operations on results.
- Data transport layer: Moves data between layers and takes input from external sources.
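The interplay of these layers can be sketched in a deliberately simplified, entirely in-process form. All class and function names below are invented for the illustration; Oryx 2 itself builds these layers on Spark and Kafka, with Kafka topics as the data transport.

```python
# Simplified Lambda architecture: batch recomputes a full model from
# history, the speed layer emits incremental updates, and the serving
# layer merges both and answers queries. Plain lists and dicts stand in
# for Kafka topics and real models.
class ServingLayer:
    def __init__(self):
        self.model = {}
    def update(self, partial):           # receives models and updates
        self.model.update(partial)
    def query(self, key):                # synchronous query API
        return self.model.get(key)

def batch_layer(history):
    # Recompute the full "model" (here: event counts) from all history.
    model = {}
    for event in history:
        model[event] = model.get(event, 0) + 1
    return model

def speed_layer(new_event, model):
    # Produce an incremental update from a single new record.
    return {new_event: model.get(new_event, 0) + 1}

history = ["click", "view", "click"]
serving = ServingLayer()
serving.update(batch_layer(history))                  # batch result
serving.update(speed_layer("click", serving.model))   # live update
```

After the batch pass the serving layer knows two clicks and one view; the speed layer then folds a third click in without waiting for the next full batch recomputation.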
Official website: http://oryx.io/
Latest version: 2.6.0
6. Vowpal Wabbit
Vowpal Wabbit is an open source, fast, out-of-core learning system sponsored by Microsoft and Yahoo! Research. It is an efficient, scalable implementation of online machine learning, with techniques such as online learning, hashing, allreduce, reductions, learning2search, and active and interactive learning.
The platform supports the following:
- Multiple supervised and semi-supervised learning problems – classification, regression and active learning
- Multiple learning options – OLS regression, single layer neural net, LDA, contextual-bandit, etc
- Multiple loss functions
- Multiple optimisation algorithms – SGD, BFGS, conjugate gradient
The key features of Vowpal Wabbit include the following.
- Input format: The input format is highly flexible for any learning algorithm.
- Speed: The learning algorithm is extremely fast and can be applied to learning problems with a sparse terafeature data set, i.e., 10^12 sparse features.
- Highly scalable.
- Feature pairing: Subsets of features can be internally paired so that the algorithm is linear in the cross-product of the subsets. This is useful for resolving ranking problems.
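The hashing technique mentioned above can be illustrated with a minimal Python sketch of the feature-hashing trick, which maps arbitrarily many named features into a fixed-size vector. The bucket count and feature names below are illustrative, and Vowpal Wabbit uses its own, much faster hash function.

```python
# Feature hashing: named features are bucketed by a stable hash so the
# model's weight vector stays a fixed size no matter how many distinct
# feature names appear in the input stream.
import hashlib

def hash_features(features, n_buckets=16):
    vec = [0.0] * n_buckets
    for name, value in features.items():
        # Stable hash: the same name always lands in the same bucket.
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        vec[h % n_buckets] += value
    return vec

v = hash_features({"word:machine": 1.0, "word:learning": 2.0})
```

Occasional bucket collisions are accepted as noise; in exchange, memory use is bounded up front, which is what makes terafeature-scale learning feasible.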
Official website: http://hunch.net/~vw/
Latest version: 8.5.0