Storage systems are a central concern in today's booming cloud computing era. With the numerous tools and systems out there, it can be daunting to know what to choose for which purpose. This guide aims to clear up that confusion with an overview of the most common storage systems available, diving into a comparison of Ceph vs GlusterFS vs MooseFS vs HDFS vs DRBD.
Ceph is a robust storage system that uniquely delivers object, block (via RBD), and file storage in one unified system. Whether you wish to attach block devices to your virtual machines or to store unstructured data in an object store, Ceph delivers it all in one platform, which gives it remarkable flexibility. Everything in Ceph is stored in the form of objects, and the RADOS object store is responsible for storing these objects, irrespective of their data type. The RADOS layer makes sure that data always remains in a consistent state and is reliable. To that end, it performs data replication, failure detection, and recovery, as well as data migration and rebalancing across cluster nodes.
Ceph provides a POSIX-compliant network file system (CephFS) that aims for high performance, large data storage, and maximum compatibility with legacy applications. Seamless access to objects is provided through native language bindings or through radosgw (RGW), a REST interface compatible with applications written for S3 and Swift. Access to block device images that are striped and replicated across the entire storage cluster, meanwhile, is provided by Ceph’s RADOS Block Device (RBD).
Features of Ceph
- A single, open, and unified platform: block, object, and file storage combined into one platform, including the most recent addition of CephFS.
- Interoperability: You can use Ceph Storage to deliver one of the most compatible Amazon Web Services (AWS) S3 object store implementations available.
- Thin Provisioning: Allocation of space is only virtual and actual disk space is provided as and when needed. This provides a lot more flexibility and efficiency.
- Replication: In Ceph Storage, all data that gets stored is automatically replicated from one node to multiple other nodes. By default, three copies of your data are present in the cluster at any one time.
- Self-healing: The cluster constantly monitors your data sets. If one of the copies goes missing, a new copy is generated automatically to ensure that the configured number of copies (three by default) is always available.
- High availability: Because every data set is automatically replicated from one node to multiple other nodes, if a given data set on a given node gets compromised or is deleted accidentally, two more copies of the same data remain, keeping your data highly available.
- Ceph is robust: your cluster can be used for just about anything. Whether you wish to store unstructured data, provide block storage, expose a file system, or have your applications talk to the storage directly via librados, you have it all in one platform.
- Scalability: Ceph works in clusters which can be increased when needed hence catering for future needs of scale.
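Ceph's real placement logic is the CRUSH algorithm, which deterministically maps every object to a set of OSDs so that any client can locate data without a central lookup table. As a loose, hypothetical illustration of that idea (not CRUSH itself), hash-based placement of three replicas might be sketched like this:

```python
import hashlib

def place_replicas(object_name: str, nodes: list[str], copies: int = 3) -> list[str]:
    """Deterministically pick `copies` distinct nodes for an object.

    NOT the real CRUSH algorithm -- just an illustration of hash-based
    placement: every client recomputes the same mapping independently,
    so no central metadata server is needed.
    """
    # Rank every node by a hash of (object, node); take the top `copies`.
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{object_name}:{n}".encode()).hexdigest(),
    )
    return ranked[:copies]

nodes = ["osd-1", "osd-2", "osd-3", "osd-4", "osd-5"]  # hypothetical node names
primary, *replicas = place_replicas("my-object", nodes)
print(primary, replicas)
```

Because the ranking depends on both the object name and the node name, losing one node only reshuffles the objects that lived on it, which is the same property that lets Ceph rebalance incrementally.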
Ceph is best suited for block storage, big data, or any other application that communicates with librados directly. Find out more about Ceph in the Ceph Documentation.
Installation: How to Install Ceph Cluster on Ubuntu 18.04
MooseFS, introduced around 12 years ago as a spin-off of Gemius (a leading European company that measures the internet in over 20 countries), is a breakthrough concept in the Big Data storage industry. It allows you to combine data storage and data processing in a single unit using affordable commodity hardware.
Features of MooseFS
- Redundancy: All the system components are redundant and in case of a failure, there is an automatic failover mechanism that is transparent to the user.
- Computation on Nodes: Support for scheduling computation on data nodes for better overall system TCO by utilizing idle CPU and memory resources.
- Atomic Snapshots: Instantaneous and uninterrupted provisioning of file system at any particular point in time. This feature is ideal for online backup solutions.
- Tiered Storage: The assignment of different categories of data to various types of storage media to reduce total storage cost. Hot data can be stored on fast SSD disks and infrequently used data can be moved to cheaper, slower mechanical hard disk drives.
- Native Clients: Enhanced performance achieved through dedicated client (mount) components specially designed for Linux, FreeBSD and macOS systems.
- Global Trash: A virtual, global space for deleted objects, configurable for each file and directory. With the help of this advantageous feature, accidentally deleted data can be easily recovered.
- Quota Limits: The system administrator has the flexibility to set limits to restrict the data storage capacity per directory.
- Rolling Upgrades: Ability to perform one-node-at-a-time upgrades, hardware replacements and additions, without disruption of service. This feature allows you to keep the hardware platform up to date with no downtime.
- Fast Disk Recovery: In case of hard disk or hardware failure, the system instantly initiates parallel data replication from redundant copies to other available storage resources within the system. This process is much faster than the traditional disk-rebuild approach.
- Parallelism: Performs all I/O operations in parallel threads of execution to deliver high performance read/write operations.
- Management Interfaces: Provides a rich set of administrative tools such as command line based and web-based interfaces.
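The atomic snapshot feature above rests on a copy-on-write idea: a snapshot only records references to the current state, so taking one is instantaneous, and data is duplicated lazily when it later changes. A toy sketch of that principle (purely illustrative, not MooseFS's implementation):

```python
class SnapshotStore:
    """Toy copy-on-write store: snapshots copy only the reference
    table, not the data, so snapshotting is O(number of entries)
    and effectively instantaneous. Illustration only -- not how
    MooseFS implements snapshots internally."""

    def __init__(self):
        self.files = {}       # live view: name -> content
        self.snapshots = {}   # snapshot id -> frozen {name: content}

    def write(self, name, content):
        self.files[name] = content

    def snapshot(self, snap_id):
        # Copy references, not data; later writes replace references
        # in self.files and leave the snapshot untouched.
        self.snapshots[snap_id] = dict(self.files)

    def read_snapshot(self, snap_id, name):
        return self.snapshots[snap_id][name]

store = SnapshotStore()
store.write("report.txt", "v1")
store.snapshot("nightly")
store.write("report.txt", "v2")  # live copy moves on...
print(store.read_snapshot("nightly", "report.txt"))  # ...snapshot still sees v1
```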
More on MooseFS can be found on MooseFS Pages.
Gluster is a free and open-source scalable network filesystem. Using common off-the-shelf hardware, you can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks. Scale-out storage systems based on GlusterFS are suitable for unstructured data such as documents, images, audio and video files, and log files. Traditionally, distributed filesystems rely on metadata servers, but Gluster does away with those. Metadata servers are a single point of failure and can be a bottleneck for scaling. Instead, Gluster uses a hashing mechanism to find data.
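Gluster's distribute translator locates files by hashing the file name and mapping the hash to a brick (in practice via hash ranges assigned per directory). A simplified sketch of the principle, with made-up brick paths and a plain hash-modulo in place of Gluster's actual scheme:

```python
import hashlib

def brick_for(path: str, bricks: list[str]) -> str:
    """Pick the brick that stores `path` purely from a hash of the name.

    Simplified illustration of Gluster-style elastic hashing: because
    every client computes the same answer from the name alone, no
    metadata server has to be consulted to find a file.
    """
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return bricks[h % len(bricks)]

# Hypothetical brick names, in Gluster's server:/export notation.
bricks = ["server1:/export/brick1", "server2:/export/brick1", "server3:/export/brick1"]
print(brick_for("/videos/intro.mp4", bricks))
```

The trade-off this sketch makes visible: lookups need no extra network hop, but changing the brick count remaps names, which is why Gluster assigns hash *ranges* and rebalances lazily rather than using a bare modulo.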
Features of Gluster
- Scalability: scalable storage system that provides elasticity and quotas.
- Snapshots: Volume and file-level snapshots are available and those snapshots can be requested directly by users, which means users won’t have to bother administrators to create them.
- Archiving: Archiving is supported with both read-only volumes and write once read many (WORM) volumes.
- For better performance, Gluster does caching of data, metadata, and directory entries for readdir().
- Integrations: Gluster is integrated with the oVirt virtualization manager as well as the Nagios monitor for servers among others.
- Big Data: For those wanting to do data analysis using the data in a Gluster filesystem, there is a Hadoop Distributed File System (HDFS) support.
- libgfapi: Applications can use libgfapi to bypass the other access methods and talk to Gluster directly. This is good for workloads that are sensitive to context switches or to copies from and to kernel space.
Other details about Gluster are found at Gluster Docs
Hadoop Distributed File System (HDFS) is a distributed file system that allows multiple files to be stored and retrieved concurrently at high speed. It conveniently runs on commodity hardware and provides the functionality of processing unstructured data. It offers high-throughput access to application data and is suitable for applications that have large data sets. HDFS is one of the core components of the Hadoop framework, alongside Hadoop YARN, Hadoop MapReduce, and Hadoop Common.
Features of HDFS
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance.
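The block layout just described (every block the same size except possibly the last) is easy to sketch. HDFS defaults to 128 MB blocks in modern releases; a tiny block size is used here purely for illustration:

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split a file's bytes HDFS-style: every block is exactly
    block_size long except possibly the last one. (HDFS uses 128 MB
    blocks by default; 128 bytes here is just for demonstration.)"""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 300, block_size=128)
print([len(b) for b in blocks])  # -> [128, 128, 44]
```

Each of these blocks would then be replicated independently across DataNodes, which is what makes the fault tolerance described above work at block granularity rather than whole-file granularity.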
File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas. HDFS does not support hard links or soft links.
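The namespace operations listed above (create, remove, move, rename) can be pictured as edits to a table of absolute paths kept by the NameNode. A deliberately minimal sketch, nothing like the real NameNode's implementation:

```python
class Namespace:
    """Minimal in-memory sketch of a hierarchical file namespace
    supporting the operations the text lists: create and remove
    files, and move/rename them. Purely illustrative."""

    def __init__(self):
        self.entries = set()  # absolute paths of existing files

    def create(self, path):
        self.entries.add(path)

    def remove(self, path):
        self.entries.discard(path)

    def rename(self, old, new):
        # Moving a file to another directory is just a path rename.
        self.entries.remove(old)
        self.entries.add(new)

ns = Namespace()
ns.create("/user/alice/data.csv")
ns.rename("/user/alice/data.csv", "/user/bob/data.csv")
print(sorted(ns.entries))
```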
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
As noted above, HDFS reliably stores very large files across the machines of a large cluster, and the cluster can be grown or shrunk depending on the needs at the time. Because every block of a file is replicated, data remains highly available in case of failures.
Find out more about HDFS on HDFS pages.
DRBD is a distributed replicated storage system implemented as a kernel driver, several userspace management applications, and some shell scripts. The Distributed Replicated Block Device (a logical block device in a logical volume scheme) mirrors block devices among multiple hosts to achieve highly available clusters. DRBD-based clusters are often employed for adding synchronous replication and high availability to file servers, relational databases (such as MySQL), and many other workloads. A DRBD device can essentially be used as the basis of a shared-disk file system, another logical block device (e.g. LVM), a conventional file system, or any application that needs direct access to a block device.
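To make this concrete, a DRBD resource is typically described in a small config file (e.g. under /etc/drbd.d/). The sketch below uses the classic DRBD 8-style syntax; the resource name, hostnames, devices, and addresses are all placeholders you would replace for your own cluster:

```
resource r0 {
  protocol C;                 # synchronous replication: writes complete on both nodes
  device    /dev/drbd0;       # the replicated block device presented to applications
  disk      /dev/sdb1;        # local backing disk on each node
  meta-disk internal;
  on node1 {
    address 10.0.0.1:7789;    # placeholder host and address
  }
  on node2 {
    address 10.0.0.2:7789;    # placeholder host and address
  }
}
```

With such a resource defined on both hosts, applications simply use /dev/drbd0 like any other block device while DRBD mirrors every write to the peer.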
Features of DRBD
- DRBD has shared-secret authentication
- It is compatible with LVM (Logical Volume Manager)
- There is support for heartbeat/pacemaker resource agent integration
- There is support for load balancing of read requests
- Automatic detection of the most up-to-date data after complete failure
- Delta resynchronisation
- Existing deployment can be configured with DRBD without losing data
- Automatic bandwidth management
- Customisable tuning parameters
- Online data verification with peer
- High Availability: Mirroring block devices among multiple hosts keeps data available even when a node fails, enabling highly available clusters.
- It integrates with virtualization solutions such as Xen, and may be used both below and on top of the Linux LVM stack
DRBD has other details not covered here. Find them at DRBD Online Docs.
The above systems and their features give an at-a-glance overview of what each one is and how it works. More details can be found on the respective web pages referenced in each section. Thank you for reading, and we hope it was helpful.