Monday, July 24, 2023

High Availability Storage

RAID Concepts

The acronym RAID stands for Redundant Array of Inexpensive Disks. It is a technology that provides increased storage functions and reliability through redundancy. It was developed by linking a large number of low-cost hard drives together to form a single large-capacity storage device that offered superior performance, storage capacity and reliability over older storage systems. This was achieved by combining multiple disk drive components into a logical unit, where data was distributed across the drives in one of several ways called "RAID levels".
This concept is a form of storage virtualization. It was first defined as Redundant Array of Inexpensive Disks, but the term later evolved into Redundant Array of Independent Disks as a means of dissociating a low-cost expectation from RAID technology.
There are two primary reasons that RAID was implemented:

Redundancy: This is the most important factor in the development of RAID for server environments. A typical RAID system assures some level of fault tolerance by providing real-time data recovery with uninterrupted access when a hard drive fails.

Increased Performance: Increased performance is only found when specific RAID levels are used. Performance also depends on the number of drives in the array and on the controller.

Hardware-based RAID

When using hardware RAID controllers, all RAID calculations are performed on the RAID controller board, thus freeing the server CPU. On a desktop system, a hardware RAID controller may be a PCI or PCIe expansion card or a component integrated into the motherboard. Hardware implementations are more robust and fault tolerant than software RAID but require a dedicated RAID controller to work.

Hardware implementations provide predictable performance, add no computational overhead to the host computer, and can support many operating systems; the controller simply presents the RAID array to the operating system as another logical drive.

Software-based RAID

Many operating systems provide functionality for implementing software-based RAID, where the OS performs the RAID calculations using the server CPU. The burden of RAID processing is therefore borne by the host computer's central processing unit rather than by a dedicated RAID controller, which can limit RAID performance on a busy server.

Although cheap to implement, software RAID offers no protection against failure of the host itself; should the server or its operating system fail, the whole RAID system becomes unavailable.

Hot spare drive

Both hardware and software RAIDs with redundancy may support the use of hot spare drives. A hot spare is a drive physically installed in the array which is inactive until an active drive fails. The system then automatically replaces the failed drive with the spare, rebuilding the array with the spare drive included. This reduces the mean time to recovery (MTTR), but does not completely eliminate it. Subsequent additional failure(s) in the same RAID redundancy group before the array is fully rebuilt can result in data loss. Rebuilding can take several hours, especially on busy systems.
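
As a rough illustration of why a rebuild can take hours, the sketch below divides drive capacity by a sustained rebuild rate. The function name and the figures used (a 4000 GB drive rebuilt at 100 MB/s) are assumptions chosen for the example, not measurements; real rebuild rates drop further when the array is also serving normal I/O.

```python
def rebuild_hours(capacity_gb, rebuild_rate_mb_s):
    """Estimate the time to rebuild one drive, in hours.

    Assumes decimal units (1 GB = 1000 MB) and a constant rebuild rate,
    which is a simplification: busy arrays rebuild more slowly.
    """
    if capacity_gb <= 0 or rebuild_rate_mb_s <= 0:
        raise ValueError("capacity and rate must be positive")
    seconds = (capacity_gb * 1000) / rebuild_rate_mb_s
    return seconds / 3600


# Hypothetical example: a 4000 GB drive at a sustained 100 MB/s.
print(f"{rebuild_hours(4000, 100):.1f} hours")
```

During this whole window the array runs without its normal redundancy, which is why a second failure in the same group is so dangerous.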

Standard RAID Levels

RAID 0

Striped Disk Array without Fault Tolerance: Provides data striping, spreading the blocks of each file across multiple disk drives. The data is broken down into blocks and each block is written to a separate disk drive, so I/O performance is greatly improved by spreading the I/O load across many channels and drives. Strictly speaking this is not a true RAID level because it provides no redundancy and is not fault tolerant; the failure of just one drive will result in the loss of all data in the array.

RAID 0 should only be used for tasks requiring fast access to a large capacity of temporary disk storage (such as video/audio post-production, multimedia imaging, CAD, data logging, etc.) where, in case of a disk failure, the data can be easily reloaded without impacting the business.

RAID 0 offers low cost and maximum performance, and it has no capacity overhead because all of the available drive space is used for data.
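
The block distribution described above can be sketched as a simple round-robin mapping. This is a minimal illustration, assuming one block per stripe unit and disks numbered from 0; the function names are hypothetical and real controllers work with configurable stripe sizes.

```python
def stripe_location(block_index, num_disks):
    """Map a logical block number to (disk, offset) in a RAID 0 set."""
    if num_disks < 1:
        raise ValueError("need at least one disk")
    disk = block_index % num_disks      # blocks rotate across the disks
    offset = block_index // num_disks   # stripe (row) number on that disk
    return disk, offset


def stripe_blocks(blocks, num_disks):
    """Distribute a list of blocks round-robin over num_disks lists."""
    disks = [[] for _ in range(num_disks)]
    for index, block in enumerate(blocks):
        disk, _ = stripe_location(index, num_disks)
        disks[disk].append(block)
    return disks


# Seven blocks spread over three disks.
for number, contents in enumerate(stripe_blocks(list("ABCDEFG"), 3)):
    print(f"disk {number}: {contents}")
```

Because consecutive blocks land on different disks, a large sequential transfer keeps every drive busy at once; it also shows why losing any one disk leaves gaps in every file.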

RAID 1

Mirroring and Duplexing: Provides disk mirroring, allowing up to twice the read transaction rate of a single disk and the same write transaction rate as a single disk. The transfer rate per block is equal to that of a single disk. Mirroring provides 100% redundancy of data, which means that no parity-based reconstruction is necessary after a disk failure; the data is simply copied to the replacement disk.

RAID 1 provides cost-effective, high fault tolerance; however, it has the highest disk overhead of all the standard RAID levels because the usable capacity is only 50% of the available drives in the RAID set.

RAID 2

Error Correcting Coding: Not a typical implementation and rarely used, RAID 2 stripes data at the bit level rather than the block level, writing each bit of a data word to its own data disk drive (4 in this example: 0 to 3).

Each data word has its Hamming Code ECC word recorded on the ECC disks. On read, the ECC code verifies that the data is correct or corrects single-disk errors. This "on the fly" data error correction allows for extremely high data transfer rates. However, RAID 2 is no longer used because all modern disks have built-in error correction.
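
To make the ECC idea concrete, here is a minimal sketch of a Hamming(7,4) code, in which a 4-bit data word (one bit per data disk, as in the example above) gets 3 check bits. The choice of Hamming(7,4), the three ECC disks it implies, and the bit ordering are assumptions made for illustration; the text does not say which Hamming code a given RAID 2 array used.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                         # simulate one disk returning a bad bit
recovered, where = hamming74_decode(stored)
print(recovered == word, "error at position", where)
```

Each of the seven codeword bits would live on its own disk, so a single bad disk shows up as a single flipped bit that the syndrome locates and corrects.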

RAID 3

Bit-Interleaved Parity (or Parallel Transfer With Parity): The data block is subdivided ("striped") at the byte level and written to the data disks, while stripe parity is written to a dedicated parity disk.

Because every request involves all of the disks, this level cannot service multiple requests simultaneously, and it is also rarely used.

RAID 4

Dedicated Parity Drive: Level 4 provides block-level striping (like Level 0) with a dedicated parity disk. Each entire block is written onto a data disk. Parity for same-rank blocks is generated on writes, recorded on the parity disk and checked on reads. If a data disk fails, the parity data is used to create a replacement disk.

A disadvantage of Level 4 is that the parity disk can create write bottlenecks, because the parity has to be updated on every write.
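
The parity used by the parity-based levels is commonly a byte-wise XOR of the blocks in the same stripe. The sketch below shows how that parity is generated and how a lost block is recreated from the survivors; the function names and the tiny 4-byte blocks are hypothetical, chosen only for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recreate the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data disks plus one parity disk; the second data disk then fails.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)
print(rebuild_missing([data[0], data[2]], parity))
```

XOR works here because XOR-ing a value in twice cancels it out, so combining the parity with every surviving block leaves exactly the missing one. It also shows the bottleneck: any change to a data block forces the parity block to be rewritten.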

RAID 5

Block Interleaved Distributed Parity: Provides data striping at the block level together with stripe error correction (parity) information. Data is striped across all of the drives in the array, but for each stripe through the array (one stripe unit from each disk) one stripe unit is reserved to hold parity data calculated from the other stripe units in the same stripe. Because the parity unit moves from disk to disk, no single drive becomes a parity bottleneck. This results in excellent read performance and good fault tolerance. Level 5 is one of the most popular implementations of RAID.

RAID 5 parity consumes the equivalent of one disk drive per RAID set, so usable capacity is always one disk drive less than the number of available disks.
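
The way the parity unit moves from disk to disk can be sketched as follows. The rotation order used here (parity starting on the last disk and moving one disk to the left on each stripe) is an assumption for illustration; implementations differ in the exact layout they use.

```python
def raid5_layout(num_disks, num_stripes):
    """Return one row per stripe, marking each unit "D" (data) or "P" (parity)."""
    if num_disks < 3:
        raise ValueError("this sketch assumes at least three disks")
    layout = []
    for stripe in range(num_stripes):
        # Assumed rotation: parity starts on the last disk and moves left.
        parity_disk = (num_disks - 1 - stripe) % num_disks
        layout.append(["P" if disk == parity_disk else "D"
                       for disk in range(num_disks)])
    return layout


for row in raid5_layout(4, 4):
    print(" ".join(row))
```

Every stripe holds exactly one parity unit, which is where the one-disk capacity overhead comes from, yet no single drive holds all of the parity.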

RAID 6

Independent Data Disks with Double Parity: Provides block-level striping with parity data distributed across all disks. To prevent data loss if a second drive fails, a second, independent set of parity information is calculated for each stripe and also distributed across the disks.

RAID 6 provides extremely high data fault tolerance and can sustain up to two simultaneous drive failures, at the cost of two disk drives' worth of capacity per RAID set.
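
The capacity overheads quoted for the standard levels can be compared with a small calculation. This is a sketch that assumes equal-sized disks, treats RAID 1 as a single mirror set in which every disk holds the same data, and uses conventional minimum disk counts; the function and table names are hypothetical.

```python
# Disks' worth of space consumed by parity for each striped level.
PARITY_DISKS = {0: 0, 5: 1, 6: 2}
# Conventional minimum disk counts (an assumption; implementations vary).
MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4}


def usable_capacity(level, num_disks, disk_size):
    """Return the usable capacity of an array of equal-sized disks."""
    if level not in MIN_DISKS:
        raise ValueError(f"unsupported RAID level: {level}")
    if num_disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == 1:
        return disk_size  # every disk in the mirror holds the same data
    return (num_disks - PARITY_DISKS[level]) * disk_size


# Four 2 TB disks (two disks for the RAID 1 case).
for level, disks in [(0, 4), (1, 2), (5, 4), (6, 4)]:
    print(f"RAID {level}: {usable_capacity(level, disks, 2)} TB usable")
```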

Non Standard RAID Levels

There are several non-standard RAID levels, some of which are proprietary systems developed and sold by only a single company. Here are the most common ones:

RAID 0+1

Mirror of Stripes: RAID 0+1 is implemented as a mirrored array whose segments are RAID 0 arrays. It is used for both replicating and sharing data among disks.

RAID 0+1 can sustain a single drive failure, but that failure takes one whole stripe set offline, so the array becomes, in essence, a single RAID 0 array with no redundancy. It is very expensive, with a high overhead.
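
The failure behaviour of a mirror of stripes can be checked with a short sketch. It assumes exactly two stripe sets of equal size, with disks numbered so that the first half forms one set and the second half the other; this numbering and the function name are hypothetical.

```python
def raid01_survives(failed_disks, disks_per_stripe_set):
    """Return True if a two-way RAID 0+1 array still holds a full data copy.

    Disks 0..n-1 form stripe set A and disks n..2n-1 form stripe set B.
    A stripe set is usable only if none of its disks has failed.
    """
    failed = set(failed_disks)
    set_a = set(range(disks_per_stripe_set))
    set_b = set(range(disks_per_stripe_set, 2 * disks_per_stripe_set))
    return not (failed & set_a) or not (failed & set_b)


print(raid01_survives({0}, 2))     # one failure: set B is still intact
print(raid01_survives({0, 1}, 2))  # both failures fall in set A
print(raid01_survives({0, 2}, 2))  # one failure in each set: data is lost
```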

RAID 10

Stripe of Mirrors: RAID 10 is implemented as a striped array whose segments are RAID 1 mirrors. This level provides the improved performance of striping while still providing the redundancy of mirroring. RAID 10 has the same fault tolerance as RAID level 1; up to one disk of each sub-array may fail without causing loss of data.

This level has the same overhead as mirroring alone, so the usable capacity of RAID 10 is 50% of the available disk drives, making it very expensive.
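
The rule that one disk of each mirrored sub-array may fail can be expressed as a short check. The sketch assumes two-disk mirrors, with disks 2k and 2k+1 forming mirror pair k; that numbering and the function name are hypothetical.

```python
def raid10_survives(failed_disks, num_pairs):
    """Return True if every mirror pair still has at least one working disk.

    Disks 2k and 2k+1 form mirror pair k (an assumed numbering).
    """
    failed = set(failed_disks)
    for pair in range(num_pairs):
        if 2 * pair in failed and 2 * pair + 1 in failed:
            return False  # both halves of one mirror are gone
    return True


print(raid10_survives({0, 2}, 2))  # one disk lost from each pair
print(raid10_survives({0, 1}, 2))  # a whole mirror pair lost
```

With four disks, only the failure combinations that take out both halves of the same mirror are fatal, which is why RAID 10 tolerates more two-disk failures than a mirror of stripes built from the same drives.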

RAID 50

RAID 50 comprises RAID 0 striping across lower-level RAID 5 arrays. High data transfer rates are achieved thanks to its RAID 5 array segments, while the spanned RAID 0 allows many more disks to be incorporated into a single logical drive. RAID 50 is more fault tolerant than RAID 5, but it gives up one disk drive of capacity to parity in each RAID 5 sub-array (twice the parity overhead of a single RAID 5 set when two sub-arrays are used), which makes it expensive to implement.

The most common RAID 50 implementation is illustrated above. Failure of two drives in one of the RAID 5 segments renders the whole array unusable, but the whole set can sustain the failure of one disk in each sub-array without data loss.
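
The RAID 50 failure rule and its parity overhead can be sketched in the same way. The example assumes equal-sized RAID 5 sub-arrays with disks numbered consecutively within each sub-array; the function names and the 2 x 3 disk configuration are hypothetical.

```python
def raid50_survives(failed_disks, num_subarrays, disks_per_subarray):
    """Return True if no RAID 5 sub-array has lost more than one disk."""
    failures = [0] * num_subarrays
    for disk in set(failed_disks):
        if not 0 <= disk < num_subarrays * disks_per_subarray:
            raise ValueError(f"no such disk: {disk}")
        failures[disk // disks_per_subarray] += 1
    return all(count <= 1 for count in failures)


def raid50_usable_disks(num_subarrays, disks_per_subarray):
    """Each RAID 5 sub-array gives up one disk's worth of space to parity."""
    return num_subarrays * (disks_per_subarray - 1)


# Two sub-arrays of three disks each: disks 0-2 and disks 3-5.
print(raid50_survives({0, 3}, 2, 3))  # one failure per sub-array
print(raid50_survives({0, 1}, 2, 3))  # two failures in the same sub-array
print(raid50_usable_disks(2, 3), "of 6 disks usable")
```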
