This month’s column continues the discussion on data storage systems, with a focus on how file systems remain consistent even in the face of errors occurring in the storage stack.
Last month, we discussed the concepts of scale-up and scale-out storage and their relative merits. One of the most important parts of a storage stack is the file system. In Linux, there are a number of popular file systems such as ext3, ext4 and Btrfs. File systems hide the complex details of the underlying storage stack, such as where and how data is physically stored. They act as containers of user data and serve IO requests from user applications.
File systems are complex pieces of software. Traditionally, they have been a kernel component. Different file systems offer different functionality and performance. However, their interactions with user applications have been simplified through the Virtual File System (VFS), an abstraction layer that sits on top of concrete file system implementations like ext3, ext4 and ZFS. Client applications program against the APIs exposed by the VFS and do not need to worry about the internals of the underlying concrete file systems. Of late, there has been considerable interest in developing user-space file systems using the FUSE module available in mainstream Linux kernels (fuse.sourceforge.net/). The kernel FUSE module intercepts the calls arriving from the VFS layer and redirects them to file system code running in user space.
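To make the idea concrete, here is a minimal sketch of a user-space file system, modelled on the classic libfuse 'hello' example and assuming the FUSE 2.x high-level API; the file name hello_fs.c, the served path /hello and the compile command below are illustrative, not prescriptive.

```c
/* hello_fs.c: a single read-only file served entirely from user space via FUSE. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_str  = "Hello from a user-space file system!\n";

/* Report attributes for the root directory and for /hello. */
static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(hello_str);
        return 0;
    }
    return -ENOENT;
}

/* List the single file in the root directory. */
static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);
    return 0;
}

/* Serve read() requests out of the in-memory string. */
static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(hello_str);
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return size;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* The kernel FUSE module forwards VFS calls to these user-space callbacks. */
    return fuse_main(argc, argv, &hello_oper, NULL);
}
```

On most distributions, something like 'gcc hello_fs.c $(pkg-config fuse --cflags --libs) -o hello_fs' followed by './hello_fs /tmp/mnt' should work; a subsequent 'cat /tmp/mnt/hello' then travels through the kernel VFS, into the FUSE module, and back into this user-space code.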
In this month’s column, we look at the challenges in ensuring the safety and reliability of data in file systems.
Keeping data safe
Given the explosion of data and its massive importance in this era of data-driven businesses, file systems are entrusted with the responsibility of keeping data safe, correct and consistent, forever, while ensuring that it remains highly accessible. Data loss is unthinkable, and even short periods of unavailability can have disastrous consequences for businesses. While data has been growing at an exponential rate, the reliability of the file systems that act as its containers has not kept pace with the demands of always-on, always-consistent data. Unfortunately, storage systems do fail, and the manner in which they fail is complex: they can exhibit partial failures, failures that are recognised long after they occur, failures that are transient, and so on. Each of these failure types imposes different reliability requirements on file systems, which need to detect, handle and repair such failures.
Figure 1 shows a simple model of the storage system, in which the bottommost layer is the magnetic media on which data is physically stored, and the topmost layer is the generic file system with which the client application interacts. Each component of the storage system can fail in different ways, leading to data unavailability and inconsistency. As data flows through the storage stack, it is vulnerable to errors at every point. An error can occur in any of the layers and propagate up to the file system, resulting in inconsistent or incorrect data being written to disk, or being read by the user through the file system interfaces.
At the bottommost layer, disks can fail in complex ways. When building reliability mechanisms for file systems, it is no longer possible to make the simple assumption that disks either work or fail outright. While all-or-nothing disk failures are easier to understand and protect against, the more insidious disk failures are partial ones, such as:
- Latent sector errors, where a disk block or a set of blocks becomes inaccessible
- Silent disk block corruption
- Disk firmware bugs that result in misdirected or torn disk writes
As data flows from the disk through the transport (such as the SCSI bus or the network), errors in the transport layer can result in incorrect data being propagated to the host. Next in the chain are the hardware controller and the device driver software on the host, which are again potential sources of errors. Device drivers can be quite complex and can contain bugs that lead to silent data corruption. The next component in the chain is the generic block IO interface of the operating system, which is also quite complex and can contain latent bugs that result in data corruption.
While these errors are external to the file system, the file system itself is a complex piece of software, often running to millions of lines of code, and is prone to insidious bugs of its own that lead to corruption and inconsistency. Such internal file system errors are difficult to find and fix during development and can remain latent for a long time, only to rear their heads in production, leaving the file system inconsistent or losing the end user's data. Given this potential for errors arising in every part of the storage stack, file systems have the enormously complex task of ensuring the highest degree of data availability to the user.
While failures can make any data block unavailable or inconsistent, the failure of certain blocks that contain critical file system metadata can render the entire file system unavailable, cutting off vast amounts of data from the end user's perspective. Hence, certain data, such as file system metadata, needs much higher levels of integrity than application data. The file system, therefore, needs special mechanisms in place:
- To prevent the corruption of metadata
- To detect any corruption quickly, should it occur, and recover while in operation
- In the worst case of a crash, to achieve a consistent file system state quickly, without excessive downtime of data for users
In everyday English, the term resilience means the ability to recover quickly from an illness or a misfortune. File systems need to be resilient to errors, no matter where in the storage stack the error occurs. A resilient file system needs mechanisms to detect errors or corruption in metadata and to fix or recover from such errors wherever possible; and in the event of unavoidable crashes, it must be able to return to a consistent state without excessive downtime. However, this is not as simple a task as it seems. Detecting faults in file system metadata can be a complex process, and file systems employ increasing levels of detection:
0. Assume that the disks are perfect and that no errors occur (that is, perform no detection at all)
1. Check error codes returned by lower level components in the storage stack below the file system
2. Perform sanity checks in the file system, such as verifying magic numbers and header information of important file system data structures (a small sketch of such a check follows this list)
3. Checksumming of metadata
4. Detect failures through metadata replication
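To make level 2 concrete, here is a small, self-contained sketch of a magic-number and header sanity check; the struct myfs_superblock layout, the MYFS_MAGIC value and the bounds being checked are invented for illustration and do not correspond to any real file system.

```c
#include <stdint.h>
#include <stdio.h>

/* A hypothetical on-disk superblock layout, used only for illustration. */
#define MYFS_MAGIC 0x4D594653u   /* spells "MYFS" */

struct myfs_superblock {
    uint32_t magic;        /* must equal MYFS_MAGIC */
    uint32_t version;      /* on-disk format version */
    uint64_t block_count;  /* total blocks in the file system */
    uint64_t free_blocks;  /* summary information maintained by the file system */
};

/* Level-2 style sanity check: reject a superblock whose header fields
 * are obviously impossible, before trusting anything else on the disk. */
static int myfs_check_superblock(const struct myfs_superblock *sb)
{
    if (sb->magic != MYFS_MAGIC)
        return -1;                      /* wrong or corrupted magic number */
    if (sb->version == 0 || sb->version > 3)
        return -1;                      /* unknown on-disk format version */
    if (sb->free_blocks > sb->block_count)
        return -1;                      /* summary count cannot exceed the total */
    return 0;
}

int main(void)
{
    struct myfs_superblock sb = { MYFS_MAGIC, 1, 1024, 100 };
    printf("superblock is %s\n",
           myfs_check_superblock(&sb) == 0 ? "sane" : "corrupt");
    return 0;
}
```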
Checksums have been employed widely in storage systems to detect corruption. A checksum is a hash computed over a block's contents, often with a strong or collision-resistant hash function, and is used to verify data integrity. For on-disk data integrity, checksums are stored or updated on disk during write operations and read back to verify the block or sector contents during reads.
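As an illustration of this write-path/read-path pairing, here is a minimal sketch that stores a CRC32 (computed with zlib's crc32()) alongside a block and verifies it on read; the struct disk_block layout is hypothetical, and real file systems often keep checksums in metadata structures rather than inline with the data block.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>   /* crc32(); link with -lz */

#define BLOCK_SIZE 4096

/* A hypothetical on-disk block: payload plus its stored checksum. */
struct disk_block {
    unsigned char data[BLOCK_SIZE];
    uint32_t      stored_crc;
};

/* Compute a CRC32 over the block payload. */
static uint32_t block_crc(const struct disk_block *b)
{
    uLong crc = crc32(0L, Z_NULL, 0);
    return (uint32_t)crc32(crc, b->data, BLOCK_SIZE);
}

/* Write path: compute and store the checksum alongside the data. */
static void block_write_prepare(struct disk_block *b)
{
    b->stored_crc = block_crc(b);
}

/* Read path: recompute and compare; a mismatch signals corruption. */
static int block_read_verify(const struct disk_block *b)
{
    return block_crc(b) == b->stored_crc ? 0 : -1;
}

int main(void)
{
    struct disk_block b;
    memset(b.data, 0xAB, sizeof(b.data));
    block_write_prepare(&b);

    b.data[100] ^= 0x01;   /* simulate a silent single-bit corruption */
    printf("block is %s\n",
           block_read_verify(&b) == 0 ? "intact" : "corrupt");
    return 0;
}
```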
Many file systems, such as ext4 and ZFS, use checksums for on-disk data integrity. Checksumming, if not carefully integrated into the storage system, can fail to protect against complex failures such as lost writes and misdirected writes. While checksums are typically useful in detecting issues in components below the file system, such as a storage controller that loses or corrupts writes, they cannot detect errors that originate in the file system code itself. So, given that file systems can become inconsistent due to errors occurring outside the file system, elsewhere in the storage stack, or due to buggy code in the file system itself, how do we check the consistency of a file system and, if it is not consistent, bring it back to a consistent state?
File system consistency checking
FSCK or file system consistency check is a utility that is traditionally used to perform a check on the consistency of the file system; if inconsistencies are found, it can repair them automatically or, in certain cases, with the help of the user. Windows users would know it by its avatar, chkdsk.
File system inconsistencies can arise due to: (a) an unclean shutdown of the file system, typically caused by a power failure or by the user not following the proper shutdown procedure; or (b) hardware failures, which leave the file system metadata on disk in an inconsistent state. Allowing a corrupted file system to be used can lead to further inconsistencies and, in certain cases, even to permanent data loss. Hence, when systems are brought back up after a crash, operators typically run fsck before the file system is brought online and user IO operations are allowed on it.
FSCK independently builds its own view of the structure and layout of the file system from the various data structures on disk, and corroborates it with the summary/computed information maintained by the file system. If the two pieces of information don't match, an inconsistency has been detected and FSCK tries to repair it. If automatic repair is not possible, the problem is reported to the user. A good overview of FSCK can be found at http://lwn.net/Articles/248180/.
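Conceptually, the corroboration that FSCK performs looks something like the following sketch, which independently recounts free blocks from a block bitmap and compares the result with the superblock's summary count; the structures, sizes and values here are invented purely for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define MYFS_BLOCKS 64   /* hypothetical, tiny file system */

struct myfs_superblock {
    uint32_t free_blocks;              /* summary count maintained by the FS */
};

/* One bit per block: 1 = allocated, 0 = free. */
static uint8_t block_bitmap[MYFS_BLOCKS / 8];

/* FSCK-style cross-check: independently recount free blocks from the
 * bitmap, to be corroborated against the superblock's summary. */
static uint32_t count_free_blocks(void)
{
    uint32_t free = 0;
    for (int i = 0; i < MYFS_BLOCKS; i++)
        if (!(block_bitmap[i / 8] & (1u << (i % 8))))
            free++;
    return free;
}

int main(void)
{
    /* Stale summary, e.g., the superblock was not updated before a crash. */
    struct myfs_superblock sb = { .free_blocks = 64 };
    block_bitmap[0] = 0x0F;            /* blocks 0-3 allocated, rest free */

    uint32_t computed = count_free_blocks();
    if (computed != sb.free_blocks) {
        /* A real fsck would now repair the summary, or ask the user. */
        printf("inconsistency: superblock says %u free, bitmap says %u\n",
               sb.free_blocks, computed);
        sb.free_blocks = computed;
    } else {
        printf("free block count is consistent (%u)\n", computed);
    }
    return 0;
}
```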
Here is a question for our readers. Do all file systems need an FSCK utility? For instance, there are file systems that support journaling or write-ahead logging, wherein all changes to metadata are first logged to a journal on persistent media before the metadata itself is updated in place. File system inconsistencies created by partial metadata writes, resulting from a sudden crash while operations are in mid-flight, are addressed by means of the journal, which can be replayed on restart to recover to a consistent state. Do such journaling file systems need FSCK?
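The essential ordering behind write-ahead logging can be sketched as follows. This is only an illustration of the ordering guarantee, not of any real file system's journal format: the file names journal.log and metadata.img, the record layout and the single-record "transaction" are all invented for the example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* A single metadata update expressed as a journal record:
 * "write these bytes at this offset in the metadata area". */
struct journal_record {
    off_t offset;
    char  payload[32];
};

/* Write-ahead ordering: (1) append the record to the journal and flush it
 * to stable storage; (2) only then update the metadata in place;
 * (3) finally record that the transaction committed. After a crash,
 * recovery replays committed records to restore consistency. */
static int journaled_update(int journal_fd, int meta_fd,
                            const struct journal_record *rec)
{
    /* 1. Log the intent first, and make sure it is on disk. */
    if (write(journal_fd, rec, sizeof(*rec)) != sizeof(*rec))
        return -1;
    if (fsync(journal_fd) != 0)
        return -1;

    /* 2. Apply the change to the metadata area in place. */
    if (pwrite(meta_fd, rec->payload, sizeof(rec->payload), rec->offset) < 0)
        return -1;
    if (fsync(meta_fd) != 0)
        return -1;

    /* 3. Mark the transaction as committed, so replay can skip it. */
    const char commit = 'C';
    if (write(journal_fd, &commit, 1) != 1 || fsync(journal_fd) != 0)
        return -1;
    return 0;
}

int main(void)
{
    int journal_fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int meta_fd = open("metadata.img", O_WRONLY | O_CREAT, 0644);
    if (journal_fd < 0 || meta_fd < 0) {
        perror("open");
        return 1;
    }

    struct journal_record rec = { .offset = 128 };
    strcpy(rec.payload, "inode 42: size=4096");

    if (journaled_update(journal_fd, meta_fd, &rec) != 0)
        perror("journaled_update");

    close(journal_fd);
    close(meta_fd);
    return 0;
}
```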
My ‘must-read book’ for this month
This month’s must-read book suggestion comes from one of our readers, Sonia George. She recommends ‘In Search of Clusters’ (2nd Edition) by Gregory F Pfister, which discusses the internals of cluster computing and provides details of different clustering implementations. Thanks, Sonia, for your suggestion.
If you have a favourite programming book or article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful so that I can mention it in this column. This would help many readers who want to improve their software skills.
If you have any favourite programming questions or software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again, happy programming and here's wishing you the very best!