The Complete Magazine on Open Source

Use Hadoop to Handle Big Data

, / 280 0

hadoop1 big data

Big Data now means big business. The world is constantly accumulating volumes of raw data in various forms such as text, MP3 or Jpeg files, which need to be processed, if any value can be derived from them. Apache Hadoop is open source software that can handle Big Data. So here’s a chance to learn how to install Hadoop and play around with it.

Big Data is currently making waves across the tech field. Everyone knows that the volume of data is growing day by day. Old technology is unable to store and retrieve huge amounts of data sets. With a rapid increase in the number of mobile phones, CCTVs and the usage of social networks, the amount of data being accumulated is growing exponentially. But why is this data needed? The answer to this is that companies like Google, Amazon and eBay track their logs so that ads and products can be recommended to customers by analysing user trends. According to some statistics, the New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. The Big Data we want to deal with is of the order of petabytes— 1012 times the size of ordinary files. With such a huge amount of unstructured data, retrieval and analysis of it using old technology becomes a bottleneck.


Figure 1: Nodes

Figure 2

Figure 2: Cloudera-VM

Big Data is defined by the three Vs—volume, velocity and variety. We’re currently seeing exponential growth in data storage since it is now much more than just text. This large volume, indeed, is what represents Big Data. With the rapid increase in the number of social media users, the speed at which data from mobiles, logs and cameras is generated is what the second ‘v’(for velocity) is all about. Last of all, variety represents different types of data. Today data is in different formats like text, mp3, audio, video, binary and logs. This data is unstructured and not stored in relational databases.
In order to solve the problem of data storage and fast retrieval, data scientists have burnt the midnight oil to come up with a solution called Hadoop. It was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo at that time, named this solution after his son’s toy elephant.
So what is Hadoop? It is an open source framework that allows the storage and processing of Big Data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The core of Apache Hadoop consists of the storage part (Hadoop distributed file system) and its processing part (MapReduce). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. It should be noted that Hadoop is not OLAP (online analytical processing) but batch/offline oriented.
The challenge with Big Data is whether the data should be stored in one machine. Hard drives are approximately 500GB in size. Even if you add external hard drives, you can’t store the data in petabytes. Let’s say you add external hard drives and store this data, you wouldn’t be able to open or process those files because of insufficient RAM. As for processing, it would take months to analyse this data. So the HDFS feature comes into play. In HDFS, the data is distributed over several machines, and replicated (with the replication factor usually being 3) to ensure their durability and high availability even in parallel applications.
The advantage of HDFS is that it is scalable, i.e., any number of systems can be added at any point in time. It works on commodity hardware, so it is easy to keep costs low as compared to other databases. HDFS is mainly designed for large files, and it works on the concept of write once and read many times. In HDFS, individual files are broken into blocks of fixed size (typically 64MB) and stored across a cluster of nodes (not necessarily on the same machine). These files can be more than the size of an individual machine’s hard drive. The individual machines are called data nodes.

fig 3

Figure 3: Profile

fig 4

Figure 4: WordCount

Now, let’s move on to the installation and running of a program on a standalone machine. The prerequisites are:

First download the VM and install it on a Windows machine—it is as simple as installing any media player. Just click Next, Next and Finish. After installation, unzip and extract Cloudera-Udacity-4.1 in a folder and now double click on the VM player’s quick launcher; click on ‘Open Virtual Machine’ and select the extracted image file from the folder containing the vmx file. It will take some time to install. Do remember to set the RAM to 1GB or else your machine will be slow. After successful installation, the machine will start and you will find the screen shown in Figure 2.
Now, in order to interact with the machine, an SSH connection should be established; so in a terminal, type the following commands. First install the client, then the server. The SSH key will be generated by this and can be shared with other machines in the cluster to get the connection.

1. sudo dpkg -i openssh-client_1%3a5.3p1-3ubuntu7_i386.deb
2.sudo dpkg -i openssh-server_1%3a5.3p1-3ubuntu7_i386.deb

Now, to install Java on the UNIX side, download the JDK from

Create the directory in the root mode, install the JDK from the tar file, restart your terminal and append /etc/profile as shown in Figure 3.
After installing the VM and Java, let’s install Hadoop. The downloaded tar file can be unzipped using the command sudo tar vxzf hadoop-2.2.0.tar.gz –C/usr/local

Fig 5

Figure 5: SumReducer

Now, some configuration files need to be changed in order to execute Hadoop.
The files with the details are given below:
1. In add:

export JAVA_HOME=/usr/lib/jvm/jdk/<java version should be mentioned here>

2. In core-site.xml add the following between the configuration tabs:

<property> <name></name> <value>hdfs://localhost:9000</value></property>

3. In yarn-site.xml, add the following commands between the configuration tabs:

<property> <name> yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value> </property>

4. In mapred-site.xml, copy the mapred-site.xml.template and rename it as mapred-site.xml before adding the following between configuration tabs:

<property> <name></name> <value>yarn</value></property>

5. In hdfs-site.xml add the following between configuration tabs:

<property> <name>dfs.replication</name> <value>1</value></property>
<property> <name></name> <value>file:/home/hduser/mydata/hdfs/namenode</value>
<property> <name></name>

6. Finally, update your .bashrc file.
Append the following lines in the end, save and exit.

#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/<your java version>
export HADOOP_INSTALL=/usr/local/hadoop
fig 6

Figure 6: WordMapper

fig 7

Figure 7: OutputBrowser

After all this, let’s make the directory for the name node and data node, for which you need to type the command hdfs namenode –format in the terminal. To start Hadoop and Yarn services, type and

Now the entire configuration is done and Hadoop is up and running. We will write a Java file in Eclipse to find the number of words in a file and execute it through Hadoop. The three Java files are (Figures 4, 5, 6):


Now create the JAR for this project and move this to the Ubuntu side. On the terminal, execute the jar file with the following command hadoop jar new.jar WordCount example.txt Word_Count_sum. example.txt is the input file (its number of words need to be counted). The final output will be shown in the Word_count_sum folder as shown in Figure 7.

Finally, the word count example shows the number of times a word is repeated in the file. This is but a small example to demonstrate what is possible using Hadoop on Big Data.