In any IT organisation, whether a start-up, a small firm or a large enterprise, data is the most important functional asset of the system. Hence, there is a need to have an excellent storage system in place to effectively maintain, manage, monitor and secure data. There are a variety of open source and proprietary storage solutions that can cater to such business needs. This article walks you through the most important storage technologies available to us and the open source software that can be used to manage them.
Let's look at the case of Eve, an entrepreneur who needs to build a storage solution for her start-up. She consults her friend Bob, who is a storage expert.
Eve: Hey Bob, I have recently launched a start-up called XYZ and really need expert advice on how to build a storage solution for it.
Bob: The first thing to do is analyse the various requirements of your company from the data storage perspective. The key factors would be the capacity required for storage, the performance requirements in terms of bandwidth and IOPS (input/output operations per second), downtime restrictions on various applications, scalability, etc.
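The requirements analysis Bob describes can be sketched as a back-of-the-envelope calculation. The functions and every figure below are hypothetical placeholders for illustration, not numbers from this conversation:

```python
# Back-of-the-envelope sizing helpers; all inputs are hypothetical examples.

def required_capacity_tb(current_tb, annual_growth, years, redundancy_factor=2.0):
    """Capacity needed after compound growth, multiplied by a redundancy
    factor (e.g. 2.0 if every byte is mirrored or replicated once)."""
    return current_tb * ((1 + annual_growth) ** years) * redundancy_factor

def required_iops(peak_ops_per_sec, read_fraction, read_cost=1.0, write_cost=2.0):
    """Weighted IOPS estimate; writes are costed higher than reads to
    reflect typical write penalties (e.g. parity updates in RAID)."""
    return peak_ops_per_sec * (read_fraction * read_cost +
                               (1 - read_fraction) * write_cost)

# Example: 5 TB today, 50 per cent yearly growth over a three-year
# horizon, with mirrored storage.
print(required_capacity_tb(5, 0.5, 3))    # 33.75 (TB)
print(round(required_iops(2000, 0.7)))    # 2600
```

Even a rough estimate like this makes it clear whether a handful of disks will do, or whether a scale-out design is needed from day one.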
Eve: Since we are a small start-up, our overall data requirements will not exceed a few terabytes of storage. We deal with unstructured data like audio/video files and raw or text based data. Scalability is an important factor, as the solution must be able to handle a multi-fold increase in data as the business expands.
Bob: Keeping your basic requirements in mind, I can provide you with two alternatives. You can go with on-premise solutions or you can opt for cloud based storage. Both the solutions have their own pros and cons.
Eve: So what is the basic difference between the two options? How do I decide which one is better for my organisation? There has been a lot of buzz about choosing the right storage environment. In case I do choose an on-premise option for data storage, I have heard about SAN, NAS and DAS (storage area network, network attached storage and directly attached storage, respectively) solutions. What are the specific features provided by each of these?
Bob: Okay, now let me walk you through a few simple scenarios. Let's start with on-premise solutions, with which all the components remain under your own supervision and control, at a local or a remote site.
Let's take an example: since yours is a start-up, it's advisable to build a computer/server with an array of storage disks, let's say about 5-10TB of hard drive space. These additional hard drives would act as DAS, which is a simple storage system like a hard disk or any block device connected directly to the computer/server. DAS can be daisy-chained if you plan to spend some additional money later to expand your storage. So Eve, for your business, DAS is preferred for the following reasons: lower latency and cost, as well as higher speed when compared to accessing data over a network. DAS is also easy to configure and deploy when compared to SAN or NAS. It is the preferred solution for databases, and for file and disk intensive operations.
Eve: Adding more storage, i.e., additional disks, to our existing solution would definitely increase capacity, but what about redundancy or data backup? How do I manage that in the case of DAS?
Bob: I would suggest adding a RAID card to your newly built computer/server and configuring the desired RAID level.
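The trade-off between the common RAID levels Bob alludes to can be illustrated with a simplified sketch. This assumes identical disks and no hot spares, and ignores controller-specific behaviour:

```python
def raid_usable(level, disks, disk_tb):
    """Return (usable capacity in TB, disk failures survived) for common
    RAID levels. Simplified: identical disks, no hot spares."""
    if level == 0:                      # striping only: no redundancy
        return disks * disk_tb, 0
    if level == 1:                      # two-way mirroring
        return disks * disk_tb / 2, 1
    if level == 5:                      # one disk's worth of parity
        return (disks - 1) * disk_tb, 1
    if level == 6:                      # two disks' worth of parity
        return (disks - 2) * disk_tb, 2
    raise ValueError("unsupported RAID level")

# A hypothetical array of six 2TB disks:
for level in (0, 1, 5, 6):
    usable, survives = raid_usable(level, disks=6, disk_tb=2)
    print(f"RAID {level}: {usable} TB usable, survives {survives} failure(s)")
```

The choice of level is exactly this trade: RAID 0 maximises capacity with no protection, while RAID 6 gives up two disks' worth of space to survive a double failure.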
Eve: That's good, Bob, but still, the solution is centralised. What if I want distributed computing and availability over the network?
Bob: In such a scenario, you can use your DAS array and share it over the network, so that it acts as a file server. Unlike SAN and NAS, which require a computer with a processor, memory and an operating system, DAS comprises hardware devices only. Network attached storage (NAS) provides fast, simple, reliable access to data in an IP networking environment. Though NAS is good for collaboration and sharing, it's not ideal for mission critical applications like heavily used databases, and since it relies on the network, that adds to the latency.
Eve: So then what is SAN? How is it different?
Bob: SAN is an option when your applications require the performance of DAS and the benefits of NAS. At the data centre level, and in virtual or cloud computing environments, SAN is the perfect solution for high-end computing, but it requires complex configuration, especially in the case of virtual networks, where different switch rules are used. Now, coming to the differences: NAS creates a shared folder, which can be accessed by any user over the network, but we cannot install an OS on that folder. In the case of SAN, on the other hand, we create logical partitions and share them over the network, which in turn allows us to install an OS on such a logical partition. SAN provides higher hardware utilisation and speed, and is the best solution for virtualisation and cloud computing scenarios.
Some open source solutions to manage NAS and SAN are OpenFiler (https://www.openfiler.com/), FreeNAS (http://www.freenas.org/) and Gluster (http://www.gluster.org/).
Eve: But with the ever-increasing complexity and variety of data, the rising need for zero-downtime applications, etc., can these traditional solutions still provide the same efficiency and functional advantages?
Bob: Though these legacy approaches are still prevalent, there has been a lot of innovation in the field of storage technology over the past decade. As you might have noticed, in all the traditional solutions, the management functionality is provided or built in at the hardware level itself.
Software-defined storage delivers automated, policy-driven, application-aware storage services by abstracting the underlying storage infrastructure, in support of an overall software-defined environment. Open Source Software Defined Storage (OpenSDS) enables companies to achieve massive, virtually limitless scalability even while using commodity hardware, thereby virtually eliminating the vast silos of distributed data locked behind traditional arrays. A few examples of OpenSDS are Skylable (http://www.skylable.com/) and StorMax SDS (http://www.amax.com/enterprise/sds.asp).
Eve: That's interesting, but are there any downsides to SDS?
Bob: A few of the limitations of SDS are that there are more components to manage (due to the modular architecture), less hardened solutions because of the limited innovation on this front so far, and the need for more infrastructure to achieve performance that's comparable to existing distributed storage arrays.
Eve: So are there any innovative solutions that overcome these drawbacks?
Bob: In SDS, the compute, network and storage are treated as separate layers, which induces a management overhead. So an alternative approach called converged storage, which consolidates all these into one layer to simplify management, comes into the picture. Converged storage systems provide a single resource pool, and hence make deployment faster and lower administration costs. With a virtualised storage layer, it is easier to use such systems in a virtualisation platform. With these systems, overall resource utilisation is higher than with legacy infrastructure, wherein the storage and the servers are separated.
Eve: Due to consolidated layers, is the flexibility in choosing the suitable layers also reduced?
Bob: Not only is there reduced flexibility due to the use of pre-configured systems; there are other limitations too, like vendor lock-in and the higher price of the infrastructure, which make this option suitable for organisations with certain fixed application requirements.
Eve: I have been hearing of one more option called the hyper-converged storage solution. Does it overcome the challenges faced in converged storage?
Bob: In hyper-converged storage, the underlying data architecture has been completely reinvented, allowing data management to be simplified. It also carries forward the benefits of convergence, including a single shared resource pool. Unlike converged systems where each component can be used for different purposes, in a hyper-converged storage solution, each component is used in a dedicated fashion for a specific purpose. Hyper-convergence goes far beyond servers and storage, bringing into the convergence fold many services and features including data protection features (backup and replication), de-duplication, wide area network (WAN) optimisation, solid state drive (SSD) arrays, SSD cache arrays, replication appliances or software, etc.
Stratoscale lets you build your own hyper-converged OpenStack cloud, letting you converge your networking, storage and computing at a very low cost. Refer to http://www.stratoscale.com/solutions/hyper-convergence/ for more information.
Eve: So these are the various on-premise storage options available. You had mentioned alternative cloud based solutions. How are they different from on-premise solutions in terms of features offered?
Bob: Cloud based storage solutions basically provide you various storage management services in a 'pay for what you use' fashion. They manage your data at a third party site, and hence save you staffing and hardware costs, as all these tasks are handled by the service provider. Not only do they store and manage your data, a number of cloud storage services also provide features like data redundancy, availability, scalability, etc.
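The 'pay for what you use' model can be contrasted with an up-front purchase in a simple break-even sketch. All the rates below are invented placeholders, not real cloud or hardware prices:

```python
def cloud_cost(tb, months, price_per_tb_month=25.0):
    """Cumulative pay-as-you-go cost at a hypothetical $25/TB/month rate."""
    return tb * months * price_per_tb_month

def on_prem_cost(tb, months, capex_per_tb=100.0, monthly_opex=50.0):
    """Up-front hardware cost plus hypothetical ongoing staffing/power."""
    return tb * capex_per_tb + monthly_opex * months

# Find the first month at which owning 5 TB becomes cheaper than renting it.
month = 1
while cloud_cost(5, month) <= on_prem_cost(5, month):
    month += 1
print(month)   # → 7
```

The point of such a sketch is not the exact numbers but the shape of the decision: the cloud avoids capital expenditure, while on-premise hardware amortises its cost only if the workload is stable and long-lived.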
Security is one of the biggest concerns, though there are now cloud implementations (like private clouds) that can mitigate such risks. There are other downsides too, like service outages, limited control and flexibility, the risk of Internet based security breaches and vendor lock-in, which might impact your applications and services.
Eve: I see. So which open source cloud storage solutions are available?
Bob: I will just list a few important and popular ones. OpenStack is a highly distributed, cost-effective option for building a cloud based storage solution, and it has the huge community support typical of an open source project. It offers a wide variety of storage features like scalability, redundant object storage, data integrity, replication, durability and the use of clusters to store petabytes of data, about which you can learn more at https://www.openstack.org/software/. According to OpenStack, the top 10 auto makers in the world use OpenStack to turn customer insights into action (http://www.openstack.org/enterprise/big-data/).
CloudStack, launched by Citrix, is an alternative to OpenStack. But the monolithic architecture of CloudStack, unlike the modular one of OpenStack, limits scalability. Moreover, CloudStack supports only block based storage and hence is limited to specific application scenarios. For more information on CloudStack, visit https://cloudstack.apache.org/.
Ceph is another widely adopted open source storage solution. It is a unified solution that provides block, file as well as object storage capabilities. Other highlights of this tool are its scale-out design and its applicability to a variety of use cases, since different hardware can be used for different workloads, rendering it ideal for enterprises as well. Visit http://ceph.com/ceph-storage/ for more information on Ceph.
Eve: So, which of these would be the best data storage solution?
Bob: The choice very much depends on the exact application and customer requirements. For example, for unstructured data, object storage turns out to be the best solution. If you are providing services to customers, you can opt to build an on-premise cloud, or provide a storage solution stack (converged or hyper-converged infrastructure) to your customer. If your customers have doubts or concerns about security in the cloud, you can host a private or hybrid cloud for a specific set of customers.
While all these storage solutions are aimed at reducing costs and making set-up and deployment as customer friendly as possible, the choice of tool should only be made after analysing the organisation's business goals and the value addition that the tool's features provide to potential customers.