Data remains as raw text until it is mined and the information contained within it is harnessed. Mining data to make sense out of it has applications in varied fields of industry and academia. In this article, we explore the best open source tools that can aid us in data mining.
Data mining, also known as knowledge discovery in databases (KDD), is the process of analysing enormous amounts of data and extracting useful information from it. Data mining can quickly answer business questions that would otherwise consume a lot of time. Some of its applications include market segmentation – identifying the characteristics of customers who buy a certain product from a certain brand; fraud detection – identifying transaction patterns that could indicate online fraud; and market basket and trend analysis – discovering which products or services are frequently purchased together. This article focuses on the various open source options available and their significance in different contexts.
A brief look at mining tasks
For those who are new to data mining, let’s take a brief look at some of the common mining tasks.
Pre-processing: This involves all the preliminary tasks that help in getting started with any of the actual mining tasks. Pre-processing can include removing anomalies and noise from the data that is about to be mined, filling in missing values, normalising the data, or compressing it using techniques like generalisation and aggregation.
Clustering: This is partitioning a large data set into groups (clusters) of related items.
Classification: This is tagging or classifying data items into different user-defined categories.
Outlier analysis: This helps in identifying those data elements which are deviant or distant from the rest of the elements in a data set, which is useful for anomaly detection.
Associative analysis: This helps in bringing out hidden relationships among data items in a large data set, making it possible to predict the occurrence of a particular item in a transaction or event whenever some other item is present. You can think of this as a conditional probability.
Regression: This is used to predict the values of a dependent variable by constructing a model or mathematical function out of independent variables.
Summarisation: This helps in coming up with a compact description of the whole data set.
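To make the associative analysis idea concrete, here is a minimal sketch in plain Python – the transactions and item names are invented for illustration – that computes the support and confidence of a rule, where confidence is exactly the conditional probability mentioned above:

```python
# Toy associative analysis: support and confidence of the rule
# {bread} -> {butter}. The transactions below are invented data.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability P(consequent | antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"bread", "butter"}, transactions))       # 3/5 = 0.6
print(confidence({"bread"}, {"butter"}, transactions))  # 3/4 = 0.75
```

Real tools compute such rules over millions of transactions with algorithms like Apriori, but the underlying quantities are these two ratios.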
Data mining is a combination of various techniques like pattern recognition, statistics, machine learning, etc. While there is a good amount of intersection between machine learning and data mining, as both go hand in hand and machine learning algorithms are used for mining data, we will restrict ourselves in this article to only those tools specialised for data mining.
Weka is Java based free and open source software, licensed under the GNU GPL and available for Linux, Mac OS X and Windows. It comprises a collection of machine learning algorithms for data mining, and packages tools for data pre-processing, classification, regression, clustering, association rules and visualisation. It can be accessed in several ways – the Weka Knowledge Explorer, the Experimenter, Knowledge Flow and a simple command line interface (CLI). Explorer is a user-friendly graphical interface for two-dimensional visualisation of mined data. It lets you import raw data from various file formats, and supports well known algorithms for different mining actions like filtering, clustering, classification and attribute selection. However, when dealing with large data sets, it is best to use the CLI-based approach, as Explorer tries to load the whole data set into main memory, causing performance issues. The software also provides a Java API for use in applications, and can connect to databases using JDBC. Weka has proved to be an ideal choice for educational and research purposes, as well as for rapid prototyping.
Rapid Miner is available in both FOSS and commercial editions, and is a leading predictive analytics platform. Gartner, the US research and advisory firm, recognised Rapid Miner and Knime as leaders in its 2016 Magic Quadrant for advanced analytics platforms. Rapid Miner helps enterprises embed predictive analysis in their business processes through its user friendly, rich library of data science and machine learning algorithms, offered in all-in-one programming environments like Rapid Miner Studio. Besides standard data mining features like data cleansing, filtering and clustering, the software also offers built-in templates, repeatable work flows, a professional visualisation environment, and seamless integration of languages like Python and R into work flows, which aids rapid prototyping. The tool is also compatible with Weka scripts. Rapid Miner is used for business/commercial applications, research and education.
Python users playing around with data science might be familiar with Orange. It is a Python library that powers Python scripts with its rich compilation of mining and machine learning algorithms for data pre-processing, classification, modelling, regression, clustering and other miscellaneous functions. Orange also comes with a visual programming environment, and its workbench provides tools for importing data, and for dragging and dropping widgets and links to connect the widgets and complete a workflow. The visual programming environment has an easy-to-use UI, with plenty of online tutorials for assistance. Given its ease of programming and integration with Python, Orange can be a great starting point for novices and experts alike to plunge into data mining.
Knime is one of the leading open source analytics, integration and reporting platforms, available both as free software and in a commercial edition. Written in Java and built upon Eclipse, it is accessed through a GUI that provides options to create data flows and conduct data pre-processing, collection, analysis, modelling and reporting. A Gartner survey reveals that customers are happy with the platform’s flexibility, openness and smooth integration with other software like Weka and R. Despite the small size of the company, Knime has a large user base and an active community. It makes use of Eclipse’s extension mechanism to add plugins for required functionalities like text and image mining. This software is ideal for enterprise use.
DataMelt or DMelt does much more than just data mining. It is a computational platform, offering statistics, numeric and symbolic computations, scientific visualisation, etc. To avoid digressing from our topic, I’ll restrict myself to covering only its data mining capabilities. DMelt provides features like linear regression, curve fitting, cluster analysis, neural networks, fuzzy algorithms, analytic calculations and interactive visualisations using 2D/3D plots and histograms. One can play around with its IDE (integrated development environment), or its functions can be called from applications using its Java API. Both community and commercial editions of DMelt are available for the Linux, Mac OS, Windows and Android platforms. DMelt is a successor to the jHepWork and SCaVis programs, which some people working in data analysis might be familiar with. This software is well suited for students, engineers and scientists.
Mahout is primarily a library of machine learning algorithms that can help in clustering, classification and frequent pattern mining. It can be used in a distributed mode, which allows easy integration with Hadoop. Mahout is currently being used by some of the giants of the tech industry, like Adobe, AOL, Drupal and Twitter, and has also made an impact in research and academics. It can be a great choice for anyone looking for easy integration with Hadoop and for mining huge volumes of data.
ELKI is open source software written in Java and licensed under the AGPLv3. It focuses especially on cluster analysis and outlier detection, with a compilation of numerous algorithms from both these domains. The software is accessed through a GUI that displays the results once the selected algorithm is run. ELKI’s design goals are performance, scalability, completeness, extensibility and a modular design that welcomes contributions. ELKI currently doesn’t offer professional support, and the software is optimised for use in science and research; hence, this option works best for those in research.
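ELKI implements far richer outlier detectors than this, but the basic idea behind outlier analysis can be sketched in a few lines of plain Python (the readings below are invented data): flag values that lie more than a chosen number of standard deviations from the mean.

```python
import statistics

# Toy outlier detection: flag values more than k standard deviations
# from the mean. Invented data; real tools like ELKI implement far
# more sophisticated, density- and distance-based detectors.

def z_score_outliers(values, k=2.0):
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * stdev]

readings = [10.1, 9.8, 10.3, 9.9, 10.0, 25.0, 10.2]
print(z_score_outliers(readings))  # [25.0]
```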
Massive Online Analysis (MOA), as the name suggests, is primarily data stream mining software, well suited for applications that need to handle large volumes of real-time data streams at high speed. MOA is distributed under the GNU GPL, and can be used via the command line, a GUI or the Java API. It is a rich compilation of machine learning algorithms and has proved to be a great choice when designing real-time applications. Stream mining algorithms typically have to compute quickly, without storing the whole data set in memory, and must get the work done within a limited time; MOA is well suited to these requirements. Weka and MOA can be closely linked to each other, and the classifiers of either can be called from the other. For those looking to analyse and mine information from real-time data, MOA can be the best choice.
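The memory constraint mentioned above is easy to illustrate: a stream algorithm updates a compact summary one element at a time instead of keeping the whole data set around. Here is a minimal sketch in plain Python (a generic illustration using Welford’s method, not MOA code; the stream values are invented) of a running mean and variance held in constant memory:

```python
# Running mean/variance over a stream using Welford's method.
# Only three numbers are kept in memory, however long the stream is.
# Generic illustration of stream processing, not MOA code.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined only once two elements have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean)      # 5.0
print(stats.variance)  # 32/7, about 4.571
```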
KEEL (Knowledge Extraction for Evolutionary Learning) is a Java based open source tool distributed under GPLv3. It is powered by a well-organised GUI that lets you manage (import, export, edit and visualise) data in different file formats, and experiment with the data through its data pre-processing and statistical libraries and some standard data mining and evolutionary learning algorithms. Since KEEL is based on Java, a JVM has to be installed on the system to run its GUI and conduct data mining experiments. You may visit http://keel.es/ for the complete list of supported algorithms. KEEL is ideal for research and educational purposes, and serves as a useful aid for teachers.
Rattle, expanded to ‘R Analytical Tool To Learn Easily’, has been developed using the R statistical programming language. The software can run on Linux, Mac OS and Windows, and features statistics, clustering, modelling and visualisation with the computing power of R. Rattle is currently being used in business, commercial enterprises and for teaching purposes in Australian and American universities.
The tools and software discussed so far are by no means the only ones available – the list keeps growing. While I have covered only those tools exclusively meant for mining data, there are a few other machine learning, NLP and data analytics tools that could aid in mining, like scikit-learn, NLTK, GraphLab, Neural Designer, Pandas and SPMF, which readers could explore.
The author currently works as a software developer at Cisco Systems India. He is interested in open source technologies and can be reached at firstname.lastname@example.org.