Data deduplication is a specialised technique for compressing data by removing repeated copies of the same data. It plays an important role in today’s world of rapid and massive data generation, as it helps to save resources, energy and costs. This article explains how Lessfs, a Linux-based file system, can be used for data deduplication.
The existence of copies of the same files in different locations creates various management problems. One of the main problems, which also affects simple storage systems, is data duplication: in most systems, a good share of the storage space is used up by copies of the same files. As an example, the WhatsApp messenger app saves a different copy of the same image when it is received from different chats or forwarded to different persons. This reduces the space available on a device. This is where data deduplication comes in.
Data deduplication is a data compression technique to eliminate redundant data and decrease the space used up on an enabled storage volume. A volume can refer to a disk device, a partition or a grouped set of disk devices—all represented as a single device. During the process, redundant data is deleted, and a single copy of the data is stored on the storage volume.
Necessity for and merits of data deduplication
The primary focus of data deduplication is to identify large sections of data (which can include entire files or large sections of files) that are identical, and store only one copy of this data. Other benefits are:
- Reduced cost of storage equipment
- Reduced cost of energy
- Reduced need for cooling
There are two types of data deduplication: post-process deduplication and inline deduplication.
Post-process deduplication: In this method, the deduplication process starts after the data is stored. Once the files have been written, the program checks for duplicated data throughout the file system and makes sure that only one copy exists. This becomes a problem when free space is already low, because every copy has to be stored in full until the deduplication process has run. On the flip side, this method doesn’t affect the speed or performance of the storage process.
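The scanning step of post-process deduplication can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular product works: the function names find_duplicates and file_digest are made up for this example, and SHA-256 is simply a convenient hash to compare file contents. The sketch only reports duplicates; a real tool would then replace the extra copies with references to a single stored copy.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files already stored under root by their content digest."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    # Keep only digests that occur more than once; sort for stable output.
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Because the scan runs over data that is already on disk, the write path itself is untouched, which matches the trade-off described above.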
Inline deduplication: In this method, deduplication runs in real time, as the data is written. Redundant copies never reach the disk, so less storage is required. However, since the incoming data has to be checked for redundant copies before it is stored, the speed of storage is affected.
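To make the contrast with the post-process method concrete, here is a toy in-memory store that deduplicates on the write path. The class InlineDedupStore and its methods are hypothetical names invented for this sketch, and a Python dictionary stands in for the storage volume.

```python
import hashlib


class InlineDedupStore:
    """Toy in-memory store that deduplicates while data is being written."""

    def __init__(self):
        self.objects = {}  # digest -> data, each distinct content stored once
        self.names = {}    # name -> digest of the content it refers to

    def write(self, name, data):
        # The lookup happens before anything is stored; this is the extra
        # work that slows down writes in inline deduplication.
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.objects
        if is_new:
            self.objects[digest] = data
        self.names[name] = digest
        return is_new

    def read(self, name):
        """Return the content that was written under this name."""
        return self.objects[self.names[name]]

    def stored_bytes(self):
        """Total size of the distinct contents actually kept."""
        return sum(len(d) for d in self.objects.values())
```

Writing the same image under two different names stores its bytes only once; the second write returns False and merely records another name for the existing content.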
Data deduplication in Linux
Data deduplication in Linux is affordable and requires less hardware. Some of the solutions work at the block level and see only streams of data blocks rather than individual files, because the logic is unable to recognise separate files over block protocols like SCSI, SAS, Fibre Channel and even SATA.
The file system we discuss here is Lessfs, a block level deduplication and FUSE-enabled Linux file system. FUSE (Filesystem in Userspace) is a kernel module found on UNIX-like operating systems, which lets users create their own file systems without touching the kernel code. In order to use these file systems, FUSE must be installed on the system. Most Linux distributions, such as Ubuntu and Fedora, have the module pre-installed to support the ntfs-3g file system.
About Lessfs and Permabit (recently acquired by Red Hat)
Lessfs is a high performance inline data deduplication file system written for Linux. It also supports LZO, QuickLZ and Bzip compression.
While Lessfs is open source, the solution provided by Permabit was not openly available until recently, when the company was acquired by Red Hat. Albireo is block level data deduplication software, which was launched by Permabit back in 2010, and is available as an SDK.
Lessfs in detail
Lessfs aims to reduce disk usage where file system blocks are identical, by storing only one block and using pointers to the original block for copies. This method of storage is becoming popular in enterprise solutions, in particular for reducing the size of disk backups and minimising virtual machine storage.
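The "one block plus pointers" idea can be illustrated with a short sketch. It is not taken from the Lessfs source code: the 4096-byte block size, the use of SHA-256 and the function names store_file and load_file are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def store_file(data, block_store, block_size=BLOCK_SIZE):
    """Split data into blocks and return the file as a list of pointers."""
    pointers = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        # An identical block is stored only once; later copies just
        # reuse the digest as a pointer to the original block.
        block_store.setdefault(digest, block)
        pointers.append(digest)
    return pointers


def load_file(pointers, block_store):
    """Rebuild the original data by following the pointers in order."""
    return b"".join(block_store[p] for p in pointers)
```

A file made of three identical blocks followed by one different block ends up as four pointers but only two stored blocks, and load_file still returns the original bytes.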
Lessfs also compresses the blocks with LZO or QuickLZ compression. Combining compression with deduplication leads to higher overall compression rates than either method alone.
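The effect of stacking compression on top of deduplication can be estimated with a small experiment. In this sketch, zlib from the Python standard library stands in for LZO or QuickLZ, which are not part of the standard library; the function name space_usage and the 4096-byte block size are likewise assumptions made for illustration.

```python
import hashlib
import zlib


def space_usage(data, block_size=4096):
    """Return (raw, deduplicated, deduplicated and compressed) sizes in bytes."""
    unique = {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # Keep one copy of each distinct block, keyed by its digest.
        unique.setdefault(hashlib.sha256(block).digest(), block)
    deduped = sum(len(b) for b in unique.values())
    # zlib is a stand-in here for the LZO or QuickLZ compressors.
    compressed = sum(len(zlib.compress(b)) for b in unique.values())
    return len(data), deduped, compressed
```

For data with repeated blocks, the second number is smaller than the first, and for compressible blocks the third is smaller again, which is the combined saving described above.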
Set up and installation
First, make sure that the requirements are all installed. These are:
- mhash
- tokyocabinet
- fuse
Go to http://sourceforge.net/projects/mhash/files/mhash to download the latest version of mhash. Then unpack, build and install the package.
$ tar xvzf mhash-0.X.X.X.tar.gz
$ cd mhash-0.X.X.X/
$ ./configure
$ make
$ sudo make install
Tokyo Cabinet is the main database that Lessfs relies on. To build Tokyo Cabinet, you need to have zlib1g-dev and libbz2-dev already installed.
Download and install FUSE from http://sourceforge.net/projects/fuse. Now download the latest version of Lessfs from http://sourceforge.net/projects/lessfs/files/lessfs.
Before we start using Lessfs, its configuration file must be put in place. From the top of the Lessfs source directory, copy the configuration file found in its etc sub-directory into the system’s /etc directory.
sudo cp etc/lessfs.cfg /etc/
Refer to the SourceForge Lessfs page for the documentation, which is well written for any user to understand.
Demerits
Even though Lessfs provides fast compression and data deduplication when large files have to fit into little space, it proves to be slow in every other case. Also, the data security it provides, even though impressive in theory, has been proven to be less effective than the solutions offered by IBM’s ProtecTIER or Sepaton’s DeltaStor.