The Complete Magazine on Open Source

Building Effective Machine Learning Solutions with Open Source Tools

Extracting hidden insights and patterns from data is of immense value to any business. Machine learning addresses this need effectively. But for it to work well, the right tools need to be used. Let’s take a look at a few open source tools that could be put to good use in this domain.

The adoption of machine learning is not only due to the availability of more cost-effective and powerful data acquisition technologies and processing capabilities in the hardware domain. The proliferation of many tools, especially in the open source domain, makes machine learning easier to implement.

With the right set of enablers, machine learning is becoming pervasive in both enterprise and consumer applications. Its uses are widespread, ranging from telematics to product recommendations, and from speech recognition to condition monitoring. Unsurprisingly, machine learning and the related technologies are rapidly becoming an integral part of the enterprise technology stack. Today, it is considered a critical skill for developers and architects alike.

While machine learning can act as a key enabler for any business, it works effectively only when the right set of tools is chosen. There is a wide range of tools available in the open source domain. Each of them has evolved to cater to some specific need and has its own set of use cases. It is not fair to do an apples-to-apples comparison of all the tools available today. Rather, it is necessary to analyse and understand the business need at hand and, accordingly, choose the tool that is the best fit to solve the problem.

In this article, with the help of some relevant examples, we will discuss some of the key machine learning use cases along with the open source tool sets suitable for each of them.

Machine learning: A quick preview

Before delving into the machine learning tool set, let us take a quick look at what machine learning is all about and how it can be useful for business. Machine learning involves a system that acquires intelligence from past data, experience, observation and training. As part of machine learning, statistical techniques are used to learn the patterns, trends and structure hidden in the data. The learnings or the insights are then leveraged to perform many business operations. For example, by identifying books that have been purchased by your friends, Amazon is able to recommend books to you. Similarly, if you have a habit of watching the latest action movies on Netflix, it will suggest other action movies for you, by using your past viewing history.
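To make the recommendation idea concrete, here is a minimal sketch in Python of suggesting titles from past viewing history. It is only an illustration of learning from past data, not how Amazon or Netflix actually build recommendations. The user names, titles and the recommend function are all hypothetical, and the scoring rule (weighting each unseen title by how many titles the two users share) is an assumption chosen for simplicity.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of titles watched.
HISTORY = {
    "asha": {"Action A", "Action B", "Drama X"},
    "ben": {"Action A", "Action B", "Action C"},
    "chen": {"Action B", "Action C", "Comedy Y"},
    "dee": {"Drama X", "Comedy Y"},
}


def recommend(user, history, top_n=2):
    """Suggest unseen titles watched by users with an overlapping history."""
    seen = history[user]
    scores = Counter()
    for other, titles in history.items():
        if other == user:
            continue
        # The number of shared titles acts as a crude similarity weight.
        overlap = len(seen & titles)
        if overlap == 0:
            continue
        for title in titles - seen:
            scores[title] += overlap
    # Sort by score (descending), then by title for a deterministic order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("asha", HISTORY))
```

In this toy data, the user who mostly watched action titles is first offered the action title they have not yet seen, because the users most similar to them watched it.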

Open source tools
Till recently, machine learning was confined more to the world of academics. That may be the reason for the plethora of open source tools and libraries around machine learning and related technologies. With the passage of time and the vast interest in this area, many open source tools and libraries have evolved. By simply using the most suitable implementation of machine learning algorithms, it is possible to derive insights fairly quickly, easily and in a cost-effective manner. Some of the popular tools are listed below.

  • R: This is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and non-linear regression, classical statistical tests, time-series analysis, classification, clustering, etc) and graphical techniques, and is highly extensible. The R language is often the vehicle of choice for research in statistical methodology, and it provides an open source route to participation in that activity.
  • Python: While R was created specifically for statistical analysis, Python also has a rich set of machine learning implementations. It is widely used in the scientific community. Being an interpreted, high-level programming language, Python is a good fit for machine learning implementations, as these quite often call for an agile and iterative approach.
  • Apache Mahout: This provides an environment for quickly creating scalable, performant machine learning applications.
  • H2O: This is for data scientists and application developers who need fast, in-memory scalable machine learning for smarter applications. H2O is an open source parallel processing engine for machine learning.
  • RapidMiner: This is a platform that provides an end-to-end development environment for machine learning. Through a wizard-driven approach, RapidMiner allows the user to quickly build the predictive analytics model.
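As a flavour of the statistical techniques these tools provide, the following is a minimal sketch of simple linear regression written in plain Python. It uses only the standard library rather than any of the tools listed above, and the function names (fit_line, predict) and the monthly sales figures are made up for illustration. In practice, one would normally rely on the tested implementations that R or Python's libraries ship with.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical data: month number versus units sold.
    months = [1, 2, 3, 4, 5]
    sales = [12, 14, 16, 18, 20]
    model = fit_line(months, sales)
    print(model, predict(model, 6))
```

Because the language is interpreted, a sketch like this can be changed and rerun in seconds, which suits the iterative way in which models are usually developed.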

Choosing the right tool: A key challenge

Choosing the appropriate tool is crucial to the successful implementation of a machine learning solution. As seen in the previous section, there is a wide range of open source machine learning tools available. Each of these tools has its own feature set, strengths and limitations. Because the feature sets vary so much, it is not possible to compare the tools directly or to suggest one as the best fit for all use cases. Hence, selecting the right tool set to solve the business problem at hand is often a challenging task. We first need a clear understanding of the problem. Based on that understanding, we can choose the tool set that is the best fit for that type of business problem.

As in any other domain of software architecture, selecting a machine learning tool should be driven by the business use case. To choose the right tool, we first need to understand the business context and the business driver. Once we define the problem and get consensus from all the stakeholders, we need to identify the machine learning technique that would be suitable to meet that business driver. The identified functional and non-functional requirements then form the basis for choosing the tool that caters to the business need.

Now, if we consider the entire spectrum of business use cases around machine learning, prima facie they can be grouped into three segments:

  • The classical use cases, which call for in-depth data analysis and very specific algorithms. In these cases we need to use tools like R, Python and Octave, as they offer implementations of a very large number of algorithms. These tools allow us to create a bespoke solution by selecting a specific algorithm. Also, with the basic building blocks given in these tools, the user can often customise the algorithm or the implementation to address a specific business need. As these tools have evolved mainly through the work of, and for, the data scientist community, they often lack a very user-friendly interface, and the learning curve may be relatively steep.
  • The second set of use cases has a strong focus on scalability and performance, along with unveiling hidden insights. This is popularly known as big data analytics. A default choice for these use cases would be suitable tools from the Hadoop ecosystem. A widely used tool in this category is Mahout, which is one of the pioneers in this league. Recently, we have been seeing many more improved and optimised tools in this category, such as H2O, Sparkling Water and MLlib. These tools leverage distributed computing and in-memory technology to offer faster and more scalable implementations of machine learning algorithms. The choice of algorithms is much narrower in these tools. Also, quite often, they require high-end infrastructure to deliver the desired results.
  • The third segment of use cases is primarily for the business users’ community. Here, users want quick insights, not only through visualisation but also through predictive analytics, without really getting into the details of the statistical techniques or algorithms. Of late, we are seeing a lot of development in this domain, with many vendors coming up with their own offerings, such as Microsoft Azure, Amazon Machine Learning and IBM Watson Analytics. Many of the tools in this space are in the commercial domain. Open source tools like RapidMiner and Weka can be extended, and would be a good fit to satisfy this requirement.

Moving ahead
To trigger wider adoption of machine learning, many of the tools are now being deployed on the cloud. These cloud based tools come with their own pros and cons. They reduce the concept-to-cash window even further, as with these tools, the user does not have to worry about the hardware and the deployment aspect. Here, the user can concentrate more on the business solution and can utilise resources more effectively. Today, most cloud based tools are available in the commercial domain. Some open source tools like RapidMiner are available as cloud based deployments. These tools would be a very good fit for the third category of use cases. It is a good idea to watch the developments in this domain.

With the wider adoption of machine learning, businesses are expecting value from their investments. For effective implementation, and to obtain the expected return on investment, the right choice of tools is crucial.