Mahout is an open source machine learning library from the Apache Software Foundation. It implements many data mining algorithms, such as recommender engines, clustering and classification, and scales to very large data sets of up to terabytes and petabytes, which puts it in the Big Data realm.
In this article, I will focus on recommender systems in Mahout. Recommender engines are very popular machine learning algorithms that are used to recommend books, movies or articles based on the user's past actions and interests.
Recommender engine algorithms fall into two categories: user based and item based recommendations. User based recommendations predict what users will like based on their similarity to other users, and do not require the properties or descriptions of the items; Facebook's friend recommendations are an example. Item based recommendations suggest items to a user based on his preference for other, similar items, and this requires prior knowledge of each item's properties. For example, if a user likes one kind of comedy movie, he may like other similar comedy movies.
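The two views can be illustrated with a minimal sketch. It assumes a tiny, made-up ratings table and uses cosine similarity computed from ratings alone; the names ratings, cosine and by_item are hypothetical and are not part of Mahout.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob":   {"m1": 4.0, "m2": 2.0, "m3": 5.0},
    "carol": {"m1": 1.0, "m2": 5.0},
}

def cosine(a, b):
    """Cosine similarity over the keys that two rating dicts share."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(a[k] ** 2 for k in shared))
    norm_b = sqrt(sum(b[k] ** 2 for k in shared))
    return dot / (norm_a * norm_b)

def by_item(user_ratings):
    """Transpose user -> item ratings into item -> user ratings."""
    items = {}
    for user, prefs in user_ratings.items():
        for item, value in prefs.items():
            items.setdefault(item, {})[user] = value
    return items

# User based view: compare two users by the items both have rated.
user_sim = cosine(ratings["alice"], ratings["bob"])

# Item based view: compare two items by the users who rated both.
item_ratings = by_item(ratings)
item_sim = cosine(item_ratings["m1"], item_ratings["m3"])
```

The same similarity function serves both views; only the orientation of the table changes. A user based recommender would then look at what the most similar users liked, while an item based one would look for items similar to those the user already rated highly.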
The following are the main components of the recommendation system in Mahout.
- Data model interface: This converts the raw data to a Mahout-compliant data format. Data sources like MySQL, PostgreSQL, MongoDB, Cassandra, flat files, etc., are supported.
- User similarity interface: This contains methods to compute the similarity among users.
- Item similarity interface: This contains methods to compute the similarity between items. Many similarity metrics are available in Mahout, such as Pearson correlation and Euclidean distance.
- User neighbourhood interface: This constructs a neighbourhood around a given user, made up of the users that satisfy a similarity threshold or nearest neighbour criterion.
- Recommender interface: This produces the actual recommendations.
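How these components fit together can be sketched in a few lines of plain Python. This is only an illustration of the flow from data model to recommendation, under the assumption of a small in-memory table and Pearson correlation as the similarity metric; the function names below are hypothetical and do not correspond to Mahout's Java API.

```python
from math import sqrt

# Data model: user -> {item: preference}. Hypothetical toy values.
data_model = {
    1: {"A": 5.0, "B": 3.0, "C": 4.0},
    2: {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    3: {"A": 1.0, "B": 5.0, "C": 2.0, "E": 4.0},
}

def user_similarity(u, v):
    """User similarity: Pearson correlation over co-rated items."""
    shared = sorted(set(data_model[u]) & set(data_model[v]))
    if len(shared) < 2:
        return 0.0
    xs = [data_model[u][i] for i in shared]
    ys = [data_model[v][i] for i in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def neighbourhood(user, n):
    """User neighbourhood: the n most similar users with positive similarity."""
    scored = [(user_similarity(user, other), other)
              for other in data_model if other != user]
    positive = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
    return positive[:n]

def recommend(user, how_many, n=2):
    """Recommender: similarity-weighted average of neighbours' preferences."""
    totals, weights = {}, {}
    for sim, other in neighbourhood(user, n):
        for item, pref in data_model[other].items():
            if item in data_model[user]:
                continue  # skip items the user has already rated
            totals[item] = totals.get(item, 0.0) + sim * pref
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in ranked[:how_many]]

recommendations = recommend(1, how_many=5)
```

In this toy table, user 2 rates the shared items exactly one point lower than user 1, so the two are perfectly correlated, while user 3 is negatively correlated and is left out of the neighbourhood. The only item recommended to user 1 is therefore item D, taken from user 2.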
To process Big Data, Mahout can use HDFS. The prerequisites for installing Mahout are therefore JDK/Java 1.7 or later, Maven 3.0 or later and a Hadoop cluster. If you are using Ubuntu, you can use the following commands.
Install Maven from the repository:
sudo apt-get install maven
Download the latest distribution of Mahout from the site http://www.apache.org/dyn/closer.cgi/lucene/mahout/. Unzip it and copy it to the desired location using the following command:
cp -R SOURCE DESTINATION
I copied it to /usr/local.
Now, cd to /usr/local/mahout/distribution (or wherever you have copied Mahout) and run the following command:
sudo mvn install
(Install Maven 3.0.1 or above for the Mahout .20 distribution; otherwise, the build will throw an error.) After a successful installation, you will see the screen shown in Figure 1.
Mahout's recommenders expect interactions between users and items as input. I tested a sample data set of 568.3MB, which contained the fields userID, movieID and value. Here, userID and movieID (itemID) refer to a particular user and a particular item, and value denotes the strength of the interaction (e.g., the rating given to a movie).
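To make the input format concrete, here is a minimal sketch that reads userID, movieID and value triples from comma-separated text into a nested dictionary. The sample rows are made up, and the function name load_preferences is hypothetical. The sketch also assumes that the file may start with a header line and may carry extra trailing columns, both of which it ignores.

```python
import csv
import io

# Made-up sample in the userID,movieID,value layout described above.
sample = "userId,movieId,rating\n1,31,2.5\n1,1029,3.0\n2,31,4.0\n"

def load_preferences(text):
    """Parse userID,movieID,value rows into user -> {item: value}."""
    prefs = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        if not row[0].strip().isdigit():
            continue  # skip a header line such as "userId,movieId,rating"
        user_id, item_id, value = int(row[0]), int(row[1]), float(row[2])
        prefs.setdefault(user_id, {})[item_id] = value
    return prefs

prefs = load_preferences(sample)
```

A quick check like this on the first few lines of the ratings file is a cheap way to confirm that the columns are in the expected order before copying the file to HDFS.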
The following steps can be used to run the Recommendation algorithm.
Create a directory in the Hadoop file system to store the ratings file using the following command:
hadoop fs -mkdir /mahout_data/
Copy the downloaded file to HDFS using the following command:
hadoop fs -put /home/hduser/mydata/ml-latest/ratings.csv /mahout_data/
Go to the Mahout directory with cd /usr/local/mahout/bin/ and issue the following command (the output file should be unique and JAVA_HOME should be set properly):
./mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/mahout_data/ratings.csv -o hdfs://localhost:9000/ratings_test/ --numRecommendations 25
Here, -i hdfs://localhost:9000/mahout_data/ratings.csv denotes the input file and -o hdfs://localhost:9000/ratings_test/ denotes the output file.
The command will run for a couple of minutes and you can see your output from the Web interface as well, as shown in Figure 2.
Note: In the above command, recommenditembased means that we are creating an item based recommendation and not a user based recommendation. The difference between the two is that a user based recommendation finds similar users based on what they like, while an item based recommendation figures out what the user likes and finds items that match those preferences. Mahout's item-based recommendation algorithm takes customer preferences by item as input, and generates output recommending similar items, each with a score indicating how likely the customer is to like the recommended item.
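The -s SIMILARITY_LOGLIKELIHOOD option in the command selects a log-likelihood based similarity measure. The sketch below shows the general log-likelihood ratio test on a 2x2 table of co-occurrence counts, which is the idea behind such a measure. It is an illustration under stated assumptions and not Mahout's code: the function names are hypothetical, and how Mahout converts the ratio into a final similarity value is not shown. Here k11 is the number of users who interacted with both items, k12 and k21 are the numbers who interacted with only one of them, and k22 is the number who interacted with neither.

```python
from math import log

def x_log_x(x):
    """x * ln(x), with the convention that 0 * ln(0) is 0."""
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalised Shannon entropy of a group of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb small negative values from rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))

independent = log_likelihood_ratio(10, 10, 10, 10)   # no association
weak = log_likelihood_ratio(12, 8, 8, 12)            # mild association
strong = log_likelihood_ratio(20, 0, 0, 20)          # items always co-occur
```

The ratio is close to zero when the two items occur together about as often as chance would predict, and grows as the co-occurrence becomes harder to explain by chance. Because it works on counts of interactions, it does not use the rating values themselves.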
You can check the output file; it contains two columns: the userID and an array of itemIDs with their scores. This could, for instance, be used to recommend movies that a user may be interested in watching.
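For post-processing, each output line can be split back into a userID and a list of (itemID, score) pairs. The sketch below assumes that a line looks like userID, a tab, and then a bracketed, comma-separated list of itemID:score entries. That layout and the sample line are assumptions made for illustration, so check them against your own output before relying on this; the function name is hypothetical.

```python
def parse_recommendation_line(line):
    """Split 'userID<TAB>[itemID:score,...]' into (userID, [(itemID, score), ...])."""
    user_part, items_part = line.strip().split("\t", 1)
    items_part = items_part.strip().strip("[]")
    recommendations = []
    if items_part:
        for pair in items_part.split(","):
            item_id, score = pair.split(":")
            recommendations.append((int(item_id), float(score)))
    return int(user_part), recommendations

# Made-up sample line in the assumed layout.
user_id, recs = parse_recommendation_line("3\t[1029:4.5,31:4.0]")
```

From here, the itemIDs can be joined against a movie title table to show readable recommendations for each user.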
"We are leaving the age of information and entering the age of recommendation," says Chris Anderson, British entrepreneur and the curator of TED. The increasing adoption of the Web as a vehicle for business has changed the way in which businesses interact with the customer. Marketing teams and businesses are now using intelligent algorithms and social technologies to form meaningful, ongoing relationships with customers. It is here that technologies like the Mahout recommender will play a key role.