Apache Mahout aims at building an environment for quickly creating scalable and performant machine learning applications.
Machine learning refers to the ability of software or embedded programs to adapt their responses to input data. It is a specialised domain within artificial intelligence that is used to make predictions and analyses. With this approach, computers do not need to be explicitly programmed for each specific application; instead, the programs learn from the data set, so that approximate (fuzzy) analysis of new data can be done, in some cases in real time. Programs developed with machine learning paradigms focus on the dynamic input and data set, so that customised and relevant output can be presented to the end user.
Machine learning approaches are widely used in a number of applications. These include fingerprint analysis, multi-dimensional biometric evaluation, image forensics, pattern recognition, criminal investigation, bioinformatics, biomedical informatics, computer vision, customer relationship management, data mining, email filtering, natural language processing, automatic summarisation, and automatic taxonomy construction. Machine learning also applies to robotics, dialogue systems, grammar checkers, language recognition, handwriting recognition, optical character recognition, speech recognition, machine translation, question answering, speech synthesis, text simplification, facial recognition systems, image recognition, search engine analytics, recommendation systems, etc.
There are a number of approaches to machine learning, though traditionally, supervised and unsupervised learning are the most widely used models. In supervised learning, the program is trained on a data set in which every input record is paired with a target value. After learning the relationship between the input data and the corresponding targets, the machine starts making predictions for new inputs. Common examples of supervised learning algorithms include artificial neural networks, support vector machines and other classifiers. In unsupervised learning, no target is assigned to the input data. Instead, the structure of the data itself is evaluated using clustering techniques such as k-means, or self-organising maps (SOM). Other prominent approaches and algorithms associated with machine learning include dimensionality reduction, decision trees, ensemble learning, regularisation and deep learning. Besides these, there are also instance-based algorithms, regression analysis, Bayesian statistics, linear classifiers, association rule learning, hierarchical clustering, anomaly detection, semi-supervised learning, reinforcement learning and many others.
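To make the unsupervised case concrete, here is a minimal sketch of k-means on one-dimensional points, written in plain Java with no Mahout dependency. The class name, the sample points and the starting centroids are all assumptions chosen for illustration; a production implementation would also need a convergence test and a smarter initialisation.

```java
import java.util.Arrays;

class KMeansSketch {
    // Runs a fixed number of k-means (Lloyd) iterations on 1-D points,
    // starting from the given centroids. No target values are involved.
    static double[] kMeans(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: attach every point to its nearest centroid.
            for (double p : points) {
                int best = nearest(centroids, p);
                sum[best] += p;
                count[best]++;
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return centroids;
    }

    // Index of the centroid closest to p.
    static int nearest(double[] centroids, double p) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 9.0, 9.5, 10.0}; // illustrative data
        double[] centroids = kMeans(points, new double[] {0.0, 5.0}, 10);
        System.out.println(Arrays.toString(centroids)); // prints [1.5, 9.5]
    }
}
```

The algorithm is never told which group a point belongs to; the two groups emerge from the data alone, which is the defining difference from supervised learning.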
Free and open source tools for machine learning include Apache Mahout, Scikit-Learn, OpenAI, TensorFlow, Char-RNN, PaddlePaddle, CNTK, Apache Singa, DeepLearning4J, H2O, etc.
Apache Mahout, a scalable high performance machine learning framework
Apache Mahout (mahout.apache.org) is a powerful, high performance framework for implementing machine learning algorithms. It has traditionally been used for supervised machine learning, in which a target value is assigned to each input record. Apache Mahout can be used for assorted research based applications including social media extraction and sentiment mining, user belief analytics, YouTube analytics and many related real-time applications.
A ‘mahout’ is a person who drives or rides an elephant. The name reflects the project's association with Apache Hadoop, whose logo is an elephant: Mahout acts as the driver of the Hadoop elephant. Apache Mahout runs on top of a base installation of Apache Hadoop, and provides the features needed to develop and deploy scalable machine learning algorithms. Common problem types such as recommender engines, classification and clustering can be solved effectively using Mahout.
Corporate users of Mahout include Adobe, Facebook, LinkedIn, FourSquare, Twitter and Yahoo.
Installing Apache Mahout
To start with the Mahout installation, Apache Hadoop has to be set up on a Linux distribution. To prepare the system for Hadoop on Ubuntu Linux, update the package list, create a dedicated Hadoop user and set up passwordless SSH to localhost as follows:
$ sudo apt-get update
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoopuser1
$ sudo adduser hadoopuser1 sudo
$ sudo apt-get install ssh
$ su hadoopuser1
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
Installing the latest version of Hadoop
Use the following commands to install the latest version of Hadoop, replacing HadoopVersion with the actual version number:
$ wget http://www-us.apache.org/dist/hadoop/common/hadoop-HadoopVersion/hadoop-HadoopVersion.tar.gz
$ tar xvzf hadoop-HadoopVersion.tar.gz
$ sudo mkdir -p /usr/local/hadoop
$ cd hadoop-HadoopVersion/
$ sudo mv * /usr/local/hadoop
$ sudo chown -R hadoopuser1:hadoop /usr/local/hadoop
$ hadoop namenode -format
$ cd /usr/local/hadoop/sbin
$ ./start-all.sh
The following configuration files then need to be updated:
- ~/.bashrc
- core-site.xml
- hadoop-env.sh
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
Web interfaces of Hadoop
Listed below are some of the Web interfaces of Hadoop.
NodeManager (runs the MapReduce tasks): http://localhost:8042/
NameNode daemon: http://localhost:50070/
Resource Manager: http://localhost:8088/
SecondaryNameNode: http://localhost:50090/status.html
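The NameNode Web interface is the usual entry point to Hadoop. In Hadoop 2.x it listens on port 50070 by default, so http://localhost:50070/ is opened in a Web browser. (In Hadoop 3.x, the default NameNode Web port changed to 9870.)
After installing Hadoop, Mahout is set up with the following commands: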
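Before opening these pages in a browser, it can help to confirm that the daemons are actually listening. The following is a small sketch in plain Java that tries a TCP connection to each port from the list above; the class name, the helper method and the 500 ms timeout are assumptions made for illustration and are not part of Hadoop.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class HadoopPortCheck {
    // Returns true if something accepts TCP connections on host:port
    // within timeoutMs milliseconds, and false otherwise.
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out or unreachable
        }
    }

    public static void main(String[] args) {
        // Ports taken from the list of Web interfaces above.
        int[] ports = {8042, 50070, 8088, 50090};
        for (int port : ports) {
            System.out.println(port + " listening: " + isListening("localhost", port, 500));
        }
    }
}
```

A port reported as not listening usually means the corresponding daemon did not start, or that the installed Hadoop version uses a different default port.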
$ wget http://mirror.nexcess.net/apache/mahout/0.9/mahout-Distribution.tar.gz
$ tar zxvf mahout-Distribution.tar.gz
Implementing the recommender engine algorithm
Nowadays, when we shop at online platforms like Amazon, eBay, SnapDeal, FlipKart and many others, we notice that most of these online shopping platforms give us suggestions or recommendations about the products that we like or had purchased earlier. This type of implementation or suggestive modelling is known as a recommender engine or recommendation system. Even on YouTube, we get a number of suggestions related to videos that we viewed earlier. Such online platforms integrate the approaches of recommendation engines, as a result of which the related best fit or most viewed items are presented to the user as recommendations.
Apache Mahout provides a platform to program and implement recommender systems. For example, the popularity of Twitter hashtags can be evaluated and ranked based on the visitor count or simply the number of hits by users. On YouTube, the number of viewers is the key value that determines the popularity of a particular video. Such ranking and recommendation algorithms can be implemented using Apache Mahout.
As another example, online shopping companies record the ratings that consumers give to products after purchase, so that the overall popularity of these products can be analysed. User ratings from 0 to 5 are logged so that the overall preference for each product can be evaluated. This data set can be evaluated using Apache Mahout in the Eclipse IDE.
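As an illustration of such a ratings table, the sketch below assumes the comma-separated layout that Mahout's file-based data model reads, with one `userID,itemID,rating` record per line, and computes the average rating of each product in plain Java. The class name and the sample records are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RatingsSketch {
    // Averages the ratings of each item from "userID,itemID,rating" lines.
    static Map<Long, Double> averageByItem(List<String> lines) {
        Map<Long, double[]> totals = new TreeMap<>(); // itemID -> {sum, count}
        for (String line : lines) {
            String[] fields = line.trim().split(",");
            long item = Long.parseLong(fields[1]);
            double rating = Double.parseDouble(fields[2]);
            double[] t = totals.computeIfAbsent(item, k -> new double[2]);
            t[0] += rating;
            t[1] += 1;
        }
        Map<Long, Double> averages = new TreeMap<>();
        totals.forEach((item, t) -> averages.put(item, t[0] / t[1]));
        return averages;
    }

    public static void main(String[] args) {
        // Hypothetical records: three users rating three products.
        List<String> lines = Arrays.asList(
                "1,101,5.0", "1,102,3.0", "2,101,4.0", "2,103,2.0", "3,102,4.0");
        System.out.println(averageByItem(lines)); // prints {101=4.5, 102=3.5, 103=2.0}
    }
}
```

An average per product only measures overall popularity; a recommender goes further and uses the same records to estimate what an individual user would rate a product they have not yet seen.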
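To integrate Java code with the Apache Mahout libraries in the Eclipse IDE, the Mahout JAR files have to be added to the project's build path, along with the JAR files of Simple Logging Facade for Java (SLF4J).
The following Java code fragment can be executed from the Eclipse IDE, with the Mahout JAR files on the build path, to implement the recommender algorithm:
// The classes used here live under org.apache.mahout.cf.taste (the model, similarity,
// neighborhood and recommender packages and their impl counterparts); java.io.File and
// java.util.List are also needed. ThresholdValue, UserID and Recommendations are
// placeholders for the similarity threshold, the user to recommend for and the item count.
DataModel dm = new FileDataModel(new File("inputdata")); // file of userID,itemID,rating records
UserSimilarity us = new PearsonCorrelationSimilarity(dm);
UserNeighborhood un = new ThresholdUserNeighborhood(ThresholdValue, us, dm);
UserBasedRecommender r = new GenericUserBasedRecommender(dm, un, us);
List<RecommendedItem> rs = r.recommend(UserID, Recommendations);
for (RecommendedItem rc : rs) {
    System.out.println(rc);
}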
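To show what a user-based recommender of this kind does conceptually, here is a plain-Java sketch with no Mahout dependency. It computes the Pearson correlation between two users over the items both have rated, keeps only neighbours whose similarity reaches a threshold, and predicts a rating as the similarity-weighted average of the neighbours' ratings. This is a simplified illustration of the idea, not Mahout's actual implementation; the class name, the method names and the sample ratings are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class UserBasedSketch {
    // Pearson correlation over the items rated by both users;
    // returns 0 when fewer than two items overlap or a variance is zero.
    static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
            n++;
        }
        if (n < 2) {
            return 0.0;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return 0.0;
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Predicts the rating of item by user as the similarity-weighted average over
    // neighbours with similarity >= threshold (threshold is assumed to be positive).
    // Returns NaN when no qualifying neighbour has rated the item.
    static double predict(Map<Integer, Map<Integer, Double>> ratings,
                          int user, int item, double threshold) {
        double weighted = 0, weights = 0;
        for (Map.Entry<Integer, Map<Integer, Double>> other : ratings.entrySet()) {
            if (other.getKey() == user) {
                continue;
            }
            Double rating = other.getValue().get(item);
            if (rating == null) {
                continue;
            }
            double sim = pearson(ratings.get(user), other.getValue());
            if (sim < threshold) {
                continue; // outside the neighbourhood
            }
            weighted += sim * rating;
            weights += sim;
        }
        return weights > 0 ? weighted / weights : Double.NaN;
    }

    public static void main(String[] args) {
        // Hypothetical ratings: userID -> (itemID -> rating).
        Map<Integer, Map<Integer, Double>> ratings = new HashMap<>();
        ratings.put(1, Map.of(101, 5.0, 102, 3.0, 103, 4.0));
        ratings.put(2, Map.of(101, 4.0, 102, 2.0, 103, 3.0, 104, 5.0));
        ratings.put(3, Map.of(101, 1.0, 102, 5.0, 103, 2.0, 104, 1.0));
        System.out.println(predict(ratings, 1, 104, 0.1)); // prints 5.0
    }
}
```

In the sample data, user 2 rates items in the same pattern as user 1 (correlation 1.0) while user 3 rates them in the opposite pattern (negative correlation), so only user 2 falls inside the neighbourhood and the prediction for item 104 follows user 2's rating.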
Apache Mahout and R&D
Research problems can be solved effectively using Apache Mahout with customised algorithms for multiple applications including malware predictive analytics, user sentiment mining, rainfall predictions, network forensics and network routing with deep analytics. Nowadays, deep learning approaches can also be integrated with the existing algorithms so that a higher degree of accuracy and optimisation can be achieved in the results.