HDFS is a core component of the Hadoop package. It is an open source distributed file system modelled on the Google File System (GFS), which was described in a paper published by Google. It has a wide range of uses that we will look at in this article.
Before explaining the Hadoop Distributed File System (HDFS), let's try to understand the basics of distributed file systems (DFS). In a DFS, one or more central servers store files that can be accessed by any client on the network with the proper access authorisation. You need a DFS for two purposes:
1. To hold large volumes of data
2. To share and use data among all users
NFS (Network File System) is one such ubiquitous DFS. While NFS is easy to use and configure, it has its own limitations, the primary one being that it stores data at a single location. Other limitations include (but are not limited to) scaling, performance and reliability issues. HDFS is the answer to all these limitations. HDFS is designed to store very large amounts of information, in the range of terabytes and petabytes. It supports storing files across the nodes throughout the cluster. It offers high reliability by storing replicas of the files (divided into blocks) across the cluster. It is highly scalable and configurable: administrators can control the block size, the replication factor and the location of a file in the cluster. HDFS is primarily designed to support MapReduce jobs in Hadoop. Users are expected to store large files and do more read operations than write operations on them. HDFS is also designed around long, sequential streaming reads from files.
HDFS is a block structured file system. A data block is 128 MB in size by default (64 MB in older Hadoop versions); hence, a 300 MB file will be split into 2 x 128 MB blocks and 1 x 44 MB block. All these blocks are copied 'N' times across the cluster, where 'N' is the replication factor (3, by default), which provides resilience and fault tolerance for data in the cluster. These blocks are not necessarily saved on the same machine. The individual machines in the cluster that store the blocks and form part of the HDFS group are called DataNodes.
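If a file is already stored in HDFS, you can inspect its block layout and replication factor with Hadoop's fsck tool. The path below is just an example; replace it with a file that actually exists in your HDFS:

$ bin/hadoop fsck /user/hduser/anyFile.txt -files -blocks -locations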
Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system. Thus, typing ls or other Linux commands on a machine running a DataNode daemon will display the contents of the ordinary Linux file system used to host the Hadoop services, but it will not include any of the files stored inside HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of local files. The files and blocks can only be accessed through the DataNode service, which runs as its own separate daemon. The complexity of accessing files does not end here. Data must be available to all compute nodes when requested. To enable this transparency, HDFS must maintain metadata about the data. In Hadoop, this metadata is stored on a single node called the NameNode. The size of the metadata on HDFS is very small, due to the large block size, and hence can be stored in the main memory of the NameNode. This allows fast access to the metadata.
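You can see the NameNode's view of the cluster (which DataNodes it knows about, their total capacity and how much of it is used) with the dfsadmin tool, assuming the HDFS daemons are already running:

$ bin/hadoop dfsadmin -report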
Configuring HDFS
Although Hadoop ships with default values that are good for most users, you may need to change these parameters to suit your application's needs. All HDFS configuration files are XML files located in the conf/ directory under the root of the Hadoop installation.
To change the default replication factor of HDFS, edit the file conf/hdfs-site.xml and set the value of the dfs.replication property to the desired number. The value is set to five in the example below:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>5</value>
    <description>
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>
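Note that this only changes the default. The replication factor of a file that is already stored in HDFS can also be changed from the command line; the path below is just an example:

$ bin/hadoop dfs -setrep -w 5 /user/hduser/anyFile.txt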
Similarly, you can edit conf/core-site.xml. The details of its tags are shown in Table 1.
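As an illustration, a minimal conf/core-site.xml for a single-node setup simply points clients at the NameNode. The host name and port below are assumptions; adjust them for your own cluster:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <!-- Replace localhost:9000 with your NameNode's host and port -->
  </property>
</configuration>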
HDFS as a general DFS for applications
HDFS was built to support Hadoop for file storage and access. However, HDFS adheres to the standards of a distributed file system and is on a par with many of the available alternative DFSs. In this section, let's explore how to use HDFS as a general DFS.
To use HDFS, first start the HDFS service. Follow the steps below to start and stop HDFS.
1. Format the HDFS file system. This command should be executed only once; otherwise, you will lose all the data stored on HDFS:
$ bin/hadoop namenode -format
2. Start HDFS:
$ bin/start-dfs.sh
3. To stop HDFS, issue the following command:
$ bin/stop-dfs.sh
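If the daemons have started correctly, the jps tool that ships with the JDK should list the NameNode, SecondaryNameNode and DataNode processes (the exact set depends on your configuration):

$ jps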
Interaction with the HDFS file system is not direct. All file system access must go through the Hadoop DFS service, as shown in the examples below.
1. Create a directory in HDFS, as follows:
$ bin/hadoop dfs -mkdir /user/hduser
2. Copy files from local directory to HDFS using the following command:
$ bin/hadoop dfs -put anyFile.txt /user/hduser/
3. List all the files of the HDFS directory, as follows:
$ bin/hadoop dfs -ls /user/hduser/
We can run several other commands in the same way as shown in the above examples, such as cat, rm, cp, get, copyFromLocal, moveFromLocal, chown, chmod, etc.
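For instance, using the file copied earlier (localCopy.txt below is just an example name for the local destination):

$ bin/hadoop dfs -cat /user/hduser/anyFile.txt
$ bin/hadoop dfs -get /user/hduser/anyFile.txt localCopy.txt
$ bin/hadoop dfs -rm /user/hduser/anyFile.txt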
Using HDFS from Java code
While HDFS can be manipulated explicitly through user commands, or implicitly as the input to or output from a Hadoop MapReduce job, you can also work with HDFS inside your own Java or C++ (via the libhdfs API) applications and use it as a general purpose file system for the cluster and the cloud.
In this section, we will write Java code that creates a 'hello world' text file in HDFS, using the Eclipse IDE. We will configure Eclipse for easy programming and compilation of the code.
1. Create a new Java project and make sure to use Oracle Sun Java and no other, as shown in Figure 1.
2. Click on Next.
3. Go to the Libraries tab and click on Add External JARs. Select all the jar files from the hadoop/lib directory and hadoop-core.jar from the Hadoop root directory, as shown in Figure 2.
4. Click on Finish and we are ready to create the class file to say ‘hello world’ to HDFS.
5. Create a new Java class file and name it helloWorld.
6. Write the following code in the helloWorld class:
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class helloWorld {

    public static final String theFilename = "hello.txt";
    public static final String message = "Hello, world!\n";

    public static void main(String[] args) throws IOException {
        // Load the Hadoop configuration (from conf/core-site.xml, etc.)
        // and obtain a handle to the file system it points to (HDFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path filenamePath = new Path(theFilename);

        try {
            if (fs.exists(filenamePath)) {
                // remove the file first if it already exists
                fs.delete(filenamePath);
            }

            // Write the message into hello.txt in HDFS
            FSDataOutputStream out = fs.create(filenamePath);
            out.writeUTF(message);
            out.close();

            // Read the same file back and print its contents
            FSDataInputStream in = fs.open(filenamePath);
            String messageIn = in.readUTF();
            System.out.print(messageIn);
            in.close();
        } catch (IOException ioe) {
            System.err.println("IOException during operation: " + ioe.toString());
            System.exit(1);
        }
    }
}
7. You must start the HDFS service before running this code.
8. Run the code, and check whether it was successful by listing your home directory in HDFS, as shown below. If, for some reason, the code does not work as intended when run from Eclipse, create a jar file and run it through Hadoop as $ hadoop jar <your jar> helloWorld:
$ bin/hadoop dfs -ls /user/hduser/
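The new hello.txt file should appear in the listing. Assuming your HDFS home directory is /user/hduser, you can also print its contents:

$ bin/hadoop dfs -cat /user/hduser/hello.txt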
In this article, we learnt about HDFS and its usage as a general distributed file system. We found out how to use HDFS for our own applications (apart from Hadoop applications), and for storing and accessing files where scalability, reliability and performance are critical to the application or service. HDFS can also be integrated with cloud services to provide storage facilities. Hence, it is an important package that will help you learn and understand distributed systems.