Part 1 Fundamentals
|Originally published November, 1995|
|© 1995, 2005 Carlo Kopp|
The RAID model of storage architecture is an evolutionary rather than revolutionary change in how a host system's main disk storage is accessed. Like all technological solutions to problems, it has some strengths and some weaknesses, but what is significant is that it needs to be well understood before it can be productively fitted into a solution. This two part series will look at the technical fundamentals of the RAID model, and then more closely examine its attributes from a performance engineering perspective.
The Origins of RAID
RAID is an abbreviation for Redundant Array of Inexpensive Disks, and has its origins in a late eighties series of papers produced at the University of California at Berkeley (UCB). Funded by the US NSF and the US computer industry, the research team at UCB EECS developed the model, prototyped the first RAID storage subsystem and produced most of the theoretical substance behind the idea.
RAID took a number of years to become accepted by the industry; at the beginning of the nineties only a handful of manufacturers were producing RAID storage subsystems. The author's experience at the time, working for a third party vendor, was that users and system managers were simply not very receptive to the idea. This has since changed, and a market survey by a US trade journal in early 1994 contained listings for no fewer than 44 vendors of RAID products.
The driving force behind the development of RAID was the quest for higher performance. Processors, memory and busses exhibit the behaviour described by Moore's law, the rule of thumb which states that performance will monotonically increase as time progresses. Interestingly, experience suggests that Moore's law is very accurate, in spite of its gross approximation of component behaviour. The reason for this falls out of the core technologies used, which in all instances are electrical and, in the instance of memory and CPUs, driven by the performance of monolithic integrated circuits. As you shrink a monolithic integrated circuit, you reduce both electrical capacitance and propagation delays, thereby making the device faster.
Disks on the other hand, while employing significant amounts of electronics, are constrained in performance primarily by Newtonian mechanics (see OSR July/August 95). As a result, improvements in disk access time have been very modest, and by no means comparable to the performance gains achieved over time by silicon devices. While increasing density has vastly improved storage capacity, and appreciably increased the rate at which heads can read from a disk's platters, these performance gains fall far behind those of CPU technology.
The central idea behind RAID, that of increasing performance through aggregating an array of disks, is not new. Mainframes and supercomputers have exploited this for many years by using the technique of striping, whereby a block or several blocks of data accessed on an I/O operation are spread over multiple drive spindles. In this fashion, the achievable aggregate head bandwidth of an N-drive storage array is approximately N times the bandwidth of a single storage device.
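The block-to-spindle mapping behind striping can be sketched in a few lines. This is a minimal illustration only; the drive count, stripe unit and function names are all hypothetical, not taken from any particular product.

```python
# Hypothetical sketch: mapping a logical block number onto an N-drive
# striped array. Parameters are illustrative only.

N_DRIVES = 4          # spindles in the array
STRIPE_UNIT = 8       # blocks written to one drive before moving on

def locate(logical_block):
    """Return (drive, block_on_drive) for a logical block address."""
    stripe_no, offset = divmod(logical_block, STRIPE_UNIT)
    drive = stripe_no % N_DRIVES
    block_on_drive = (stripe_no // N_DRIVES) * STRIPE_UNIT + offset
    return drive, block_on_drive

# A 32-block sequential transfer lands on all four drives, so up to
# four heads can be streaming data concurrently.
drives_touched = {locate(b)[0] for b in range(32)}
print(sorted(drives_touched))   # all four drives
```

The choice of STRIPE_UNIT is the tuning knob alluded to later in this article: too small and every request fans out to all spindles, too large and small requests see no concurrency at all.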
The obvious drawback of this scheme lies in reliability. A 1 GB array of N drives will have about N times the failure rate of a single 1 GB drive. Drive failures are usually mechanical or electromechanical in nature and, much like mechanical performance, failure rates do not vary greatly across drives of a given technology, and hence density and performance.
It follows that the use of a striping (or logical volume scheme) over an array of drives incurs a reliability penalty proportional to the granularity of the array, in comparison with the use of a single drive for the given purpose.
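The reliability penalty can be put in rough numbers. A minimal sketch, assuming independent drive failures with a constant failure rate; the 500,000 hour MTBF figure is illustrative only, not a quote for any real drive.

```python
# Rough arithmetic for the reliability penalty of a non-redundant
# array: any single failure loses the array, so under a constant
# failure rate assumption the array MTBF scales as 1/N.

drive_mtbf_hours = 500_000   # illustrative figure for one drive

for n in (1, 4, 8, 16):
    array_mtbf = drive_mtbf_hours / n
    print(f"{n:2d} drives: array MTBF ~ {array_mtbf:,.0f} hours")
```

A sixteen-spindle striped volume built from half-million-hour drives is, by this crude estimate, down to about three and a half years between data-losing failures.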
This is where RAID departed from the established paradigm of array storage. RAID introduced redundancy, whereby every atomic access of data was spread over N storage devices, and M redundancy devices. The redundancy scheme most commonly used is one or another form of parity, and every write to the N-drive storage array will also require the calculation of the contents of the M redundancy drives, which must also be written. Every RAID array will thus experience some performance penalty, visible or invisible, on writes, in comparison with the conventional striped array. Read performance in a RAID array is unhindered.
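The parity idea itself is simple enough to demonstrate directly: the parity block is the bytewise XOR of the data blocks, and any single lost block can be regenerated by XORing the survivors. A minimal sketch with hypothetical block contents:

```python
# Parity as used by RAID: P = D0 xor D1 xor ... xor Dn-1.
# XORing P with all surviving data blocks regenerates the lost one.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]       # three data drives
parity = xor_blocks(data)                 # contents of the parity drive

# Simulate losing drive 1, then rebuilding it from the survivors:
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])                 # True
```

This also shows why writes carry a penalty: the parity must be recomputed and rewritten on every update, while reads of intact data never need to touch it.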
RAID architecture thus exploits an array of drives to improve performance, and redundancy to provide acceptable reliability. As is intuitively obvious, there are a multiplicity of schemes which can be built using these ideas, and each of these schemes will have performance and reliability attributes which are better or worse for any given application. We will now examine these more closely.
RAID Organisations and Nomenclature
The emergence of RAID has had the interesting side effect of creating a new nomenclature for disk storage architectures, indeed even established schemes which predate RAID have been shoehorned into the new naming scheme, no doubt to the chagrin of storage architects of the pre-RAID era. The RAID model will define a series of architectures, and RAID implementations should properly be fitted into this model - recent experience however suggests that our brethren in the sales community have been more than imaginative in this matter, so don't be surprised if a vendor one day offers you a RAID-12 box. An observation made to the author by a salesman was that "customers think that the higher the RAID number, the better the array". Alas, reality is somewhat more complex, as we will find. An array with good performance for one task may be a dog for another.
RAID Level 0 (Non Redundant Array)
The conventional striped disk array, typically implemented with a logical volume driver, is referred to as RAID Level 0. No redundancy is employed (the RAID description is a bit of a misnomer here), and thus this class of array suffers the full reliability penalty of a large spindle count. Performance can be very good, if the stripe size is properly adjusted to the application in use. Striping is a popular technique in both mainframe (database) and supercomputing systems.
RAID Level 1 (Mirrored Array)
Conventional Mirroring, where a drive or array of drives is "mirrored" by a drive/array holding an exact copy, is termed RAID Level 1. Every write to the drive/array is copied onto both. Should either drive fail, the system will continue to operate while the dead spindle is replaced, and "synchronised" with the running spindle. Mirroring provides good reliability, at a significant cost penalty, because twice the storage capacity is required. Performance can be very good if two striped arrays are mirrored, although the stripe size will need to be matched to the application. Most contemporary Mirror/Stripe schemes are implemented in software, and are commonly available on larger Unix systems aimed at database applications.
RAID Level 2 (Memory-Style ECC Coding)
RAID Level 2 is based upon the same idea as Error Correcting Code (ECC) memory controllers, which employ Hamming Codes to introduce redundant data into the array. Hamming Codes, first used in communications, fall into the category of Forward Error Control (FEC) codes, and allow both the detection of an error, as well as its correction. The penalty to be paid is the need to encode additional data bits (nothing comes free in nature); another limitation of these codes is that they can only ever correct a smaller number of erroneous bits than the number which can be detected.
A typical Hamming coded RAID Level 2 scheme will employ four data disks and three redundancy disks. As the number of redundant disks needed is proportional to the logarithm of the total number of array disks, RAID Level 2 becomes increasingly efficient with larger array sizes. As in practice most storage arrays have a modest number of drives, RAID Level 2 is not commonly used.
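The logarithmic growth of the redundancy overhead follows from the single-error-correcting Hamming bound: c check disks can cover d data disks whenever 2^c >= d + c + 1. A small sketch of that arithmetic (the function name is illustrative):

```python
# Check disks needed for a single-error-correcting Hamming code over
# an array of data disks: smallest c with 2**c >= d + c + 1.

def check_disks_needed(data_disks):
    c = 1
    while 2**c < data_disks + c + 1:
        c += 1
    return c

for d in (4, 11, 26, 57):
    print(d, "data disks ->", check_disks_needed(d), "check disks")
```

The first row reproduces the four-data, three-redundancy configuration mentioned above; note how 57 data disks still need only six check disks, which is why the scheme pays off only at array sizes rarely seen in practice.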
RAID Level 3 (Bit Interleaved Parity)
Because a failed drive is very easy to detect, and a drive can itself flag a medium error (bad block), the multiple redundancy requirements of a RAID Level 2 scheme can be simplified to a single parity disk. A bit interleaved N+1 array using a single parity disk is termed a RAID Level 3 array.
In a RAID Level 3 system, data is interleaved bitwise across the array, and the parity disk only ever holds parity data. Any given access will result in accesses to each and every drive, although in practice Level 3 read operations will ignore the parity drive, unless an error is detected in a data drive and the parity information needs to be recovered to regenerate the data.
In practice, RAID Level 3 arrays often employ the technique of spindle synchronisation, whereby all drives are synchronised in rotation, thus producing the illusion of a single larger drive with N times the head bandwidth of a single drive. RAID Level 3 thus exhibits excellent bandwidth, but cannot carry out concurrent accesses to different blocks of data in the array.
RAID Level 4 (Block Interleaved Parity)
RAID Level 4 uses a scheme analogous to Level 3, but with a much coarser granularity of block level interleaving rather than bit level interleaving. The parity disk holds parity blocks, each of which is associated with its set of corresponding data blocks resident on the data drives.
Should an access to a Level 4 array involve less than a block worth of data, it will touch only a single data disk; should the access involve more than one block, the corresponding number of drives will be accessed. A small write request in a Level 4 array will typically require four I/O operations: two to read the old data and old parity in order to generate the new parity, one to write the new data, and one to write the new parity to the parity disk. As is evident, the Level 4 array will tend to concentrate parity writes on the single parity drive, which can become saturated and thus a bottleneck to overall array performance. It is for this reason that Level 4 arrays are not used in practice.
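The four-I/O figure rests on a useful XOR identity: the new parity can be computed from the old data, the old parity and the new data alone, without reading the other data drives. A minimal sketch with hypothetical one-byte blocks:

```python
# Read-modify-write parity update: P' = P xor D_old xor D_new.

def new_parity(old_data, old_parity, new_data):
    return bytes(p ^ od ^ nd
                 for p, od, nd in zip(old_parity, old_data, new_data))

# Cross-check against a full recomputation over all data drives.
d0, d1, d2 = b"\x0f", b"\x33", b"\x55"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

d0_new = b"\xf0"
updated = new_parity(d0, parity, d0_new)                    # shortcut
full = bytes(a ^ b ^ c for a, b, c in zip(d0_new, d1, d2))  # brute force
print(updated == full)   # True
```

The same identity underlies the small-write penalty of Level 5 and Level 6 arrays discussed below: every small write still implies the read-old, recompute, write-new cycle.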
RAID Level 5 (Block Interleaved Distributed Parity)
To overcome the parity drive bottleneck in Level 4 RAID, the Level 5 scheme was devised. In a Level 5 array, the parity block is interleaved between all of the disks in the array. There is no parity drive in a Level 5 array, all drives will service both parity and data accesses, in the ratio of parity to data blocks in the system.
For load conditions characterised by frequent reads to small files, the RAID Level 5 scheme is generally considered to be the best performer, and it is also considered to be a good performer for large reads and writes. Small write performance is compromised by the need to do in effect a read-modify-write operation on every access.
Needless to say, there are a multiplicity of schemes which can be used for parity block placement in a RAID Level 5 array, and the performance achieved with any of these schemes can vary significantly with the statistical properties of the block addresses used by any given application. Another factor which can significantly influence performance in Level 5 arrays is whether the array interleaves by blocks alone, or by stripes of multiple blocks. The simplistic notion that the RAID Level 5 array is "best" is exactly that, simplistic, and should not be taken at face value.
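One of the placement schemes alluded to above, often called left-symmetric, simply rotates the parity block by one drive per stripe. The sketch below is illustrative of the idea only, not of any vendor's actual layout.

```python
# Rotating parity placement for a Level 5 array: the parity block
# moves one drive to the left on each successive stripe, so parity
# traffic is spread evenly over all spindles.

N = 5   # drives in the array

def parity_drive(stripe):
    return (N - 1 - stripe) % N

for stripe in range(5):
    row = ["P" if d == parity_drive(stripe) else "D" for d in range(N)]
    print("stripe", stripe, row)
```

Over any N consecutive stripes every drive holds parity exactly once, which is precisely the property that removes the Level 4 parity-drive bottleneck.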
RAID Level 6 (P+Q Redundancy)
An important limitation of the RAID Level 5 array becomes evident with large array sizes, and that is the scheme's inability to handle multiple drive failures. Loss of more than one drive defeats the parity scheme and causes as a result a catastrophic failure, as the array will lose data. An instance of such a failure would be a situation where a drive experiences a hardware fault which renders it inaccessible, followed by a medium error (bad block) read on one of the remaining drives. For applications where data integrity is paramount, the protection offered by Level 5 is simply inadequate and a more powerful coding scheme must be used.
Level 6 schemes will employ one of the family of Reed-Solomon codes to protect against up to two failures, with two redundancy drives in the array. A Level 6 array will be similar to a Level 5 array, and suffers a similar performance penalty in small writes, in that it must do a read-modify-write operation on every access. Six accesses will be required, as both "P" and "Q" redundancy drives must be updated.
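The flavour of the P+Q scheme can be shown in miniature: P is the plain XOR of the data bytes, while Q weights each byte by a distinct power of a generator in the finite field GF(2^8), so that P and Q together contain enough information to recover any two lost drives. This is a toy illustration of the coding idea under those assumptions, not a production Reed-Solomon implementation (no decoder is shown).

```python
# P+Q redundancy in miniature over GF(2**8), reduction polynomial 0x11d.

def gf_mul(a, b):
    """Multiply two field elements in GF(2**8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def pq(data_bytes):
    """Return the (P, Q) redundancy bytes for one byte per data drive."""
    p = q = 0
    g = 1                        # successive powers of the generator 2
    for d in data_bytes:
        p ^= d                   # P: plain parity
        q ^= gf_mul(g, d)        # Q: generator-weighted parity
        g = gf_mul(g, 2)
    return p, q

p, q = pq([0x0f, 0x33, 0x55, 0xaa])
print(hex(p), hex(q))
```

Because each data drive carries a distinct weight in Q, the two syndromes yield two independent equations, which is what allows two erasures to be solved for where plain parity alone could handle only one.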
The reliability achievable with any RAID array is a function of both the RAID organisation used (ie Level 0 through 6), as well as the physical implementation of the equipment, and the implementation of the RAID controller function. There are a number of tradeoffs possible, and these offer a range of reliability performance to price ratios.
The RAID controller function is the mechanism which maps a block I/O request against the array into the series of I/O requests which are issued to individual drives, and the function of merging or splitting data in either direction. This functional entity will also be responsible for managing array state, as well as regenerating drives which have been inserted raw into the array, to replace a dead drive.
The RAID controller function may be implemented in software or hardware. In software, it will take the form of a specialised RAID device driver, which will appear to the higher layers in the operating system as another disk, albeit with a somewhat expanded set of ioctl() state and status management functions. The first RAID prototype built at UCB used this scheme, as it allowed for easy modification of array organisation.
Implemented in hardware, the RAID controller function is usually embedded in a board or array of boards sharing a common backplane. The controller board will employ a SCSI (or Fibre Channel) interface to the host, and a SCSI controller chip for each and every drive, or chain of drives in the array. Special purpose hardware is often included to carry out block or bit level operations on the data stream, and a microprocessor will usually be employed to manage the internal functions and configuration of the device. Such arrays may appear to a host as a large and fast, but otherwise standard disk.
The RAID controller function is important as it represents a potential single point of failure for the whole array. Should it cease to behave itself, the whole array becomes inaccessible. Importantly, should its power supply or fan fail, then it also takes down the whole array. No matter how elaborate and reliable the organisation scheme may be, and regardless of the level of redundancy in the disks and their supporting hardware, the loss of the array's controller is a catastrophic failure for the whole system.
The implementation of the storage array, comprised of disks, fans, power supplies and cables, can also have an important bearing upon the reliability of the system. Power supplies today often have failure rates greater than disks; indeed modern switch-mode supplies are probably the most highly electrically and thermally stressed components in any computer system. Fans on the other hand, whilst not very prone to random failure, are very much prone to wearout, and after several years of sustained operation tend to wear down their bearings and lose performance, eventually seizing up.
The effect is that the unreliability of the supporting hardware may compromise the reliability achieved by the RAID organisation scheme, particularly where drives share supporting hardware. With the improving reliability of modern drives, this is becoming an ever more important factor in assessing a RAID product's reliability performance.
The best way to appreciate this is to look at the two extremes in this situation. Should we choose to design an array product for maximum hardware reliability, regardless of coding scheme used, we would put each and every disk into a storage module which contained its own power supply, fan, and SCSI controller board, all interfaced to a backplane which distributes power and the RAID controller's internal multiplex bus. This backplane would in turn host multiple host interface controllers, each with a SCSI or other channel to the host, and its own fan and power supply. Like this, the failure of any single element other than the backplane itself cannot disable the whole array. The penalty will be a significant cost, and a high incidence of non-fatal failures of individual components, which in operation would be hot swapped with spares.
The other extreme is to build an array with the purpose of achieving the lowest possible cost. In this arrangement, we will use a single main power supply, single main fan, single main controller board, and all drives will be cabled to the controller and the supply. In this arrangement, single point failure of the controller, fan or supply will take down the whole array.
In practice, RAID products will span both of these extremes, and various compromise arrangements will exist, where the array is "sliced" across multiple fans and multiple power supplies, and where controllers use motherboards with plug in SCSI controller modules for individual drives or chains of drives. Achievable reliability will thus vary significantly, as will cost, and system managers will need to do some careful analysis if they need to meet particular performance requirements. Should you wish to analyse the reliability performance of any such product, reading Mil-Std-756B is a good first step.
Part 2 will analyse performance issues in RAID arrays, and make some comparisons between the more popular RAID schemes.
|$Revision: 1.1 $|
|Last Updated: Sun Apr 24 11:22:45 GMT 2005|
|Artwork and text © 2005 Carlo Kopp|