Part 2 Performance
|Originally published December, 1995|
|© 1995, 2005 Carlo Kopp|
The emergence of RAID technology into the wider marketplace has been accompanied by much commercial hype, but sadly very little substantial discussion. This is unfortunate, because RAID arrays can provide an enormous gain in I/O rate performance if properly integrated into a system. With the emergence of multi-media applications, as well as the ever increasing proliferation of database products, the demand for high performance I/O will not abate.
The key to successfully integrating RAID into a system lies in understanding the central performance issues surrounding this technology, as well as being able to identify the potential pitfalls. Those who can do so will benefit significantly from this technology, those who do not will expend a lot of capital for no measurable performance gain. As always, insight is worth more than hype.
Quantifying RAID Array Performance
As with all disk based storage architectures, RAID system performance must be quantified in a number of areas. Reliability performance is determined by Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) - the higher the system level MTBF and the lower the MTTR, the better. Significantly, reliability performance is determined both by the hardware implementation and by the RAID organisation, and any quoted MTBF figures should be very carefully analysed against the manufacturer's definitions of how the MTBF is modelled. Should the application be reliability critical, then a copy of Mil-Std-756B should be used to provide a reference model.
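A common first-order reliability model makes the MTBF/MTTR tradeoff concrete. The sketch below uses the textbook approximation for a single-parity array (the array is lost only if a second disk fails within the repair window); the failure and repair figures are hypothetical, and a real analysis per Mil-Std-756B would also model the controller, power supplies and cabling.

```python
# Simplified reliability model for an N-disk array. Assumes
# independent, exponentially distributed disk failures; the figures
# are illustrative only -- a real model must include the controller,
# power supplies and cabling as well.

def array_mtbf_no_redundancy(disk_mtbf_hours, n_disks):
    """MTBF of a serial (non-redundant) string of disks."""
    return disk_mtbf_hours / n_disks

def array_mtbf_single_parity(disk_mtbf_hours, n_disks, mttr_hours):
    """Classic approximation for a single-parity array: data is lost
    only if a second disk fails during the repair window."""
    return disk_mtbf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)

if __name__ == "__main__":
    mtbf, n, mttr = 200_000.0, 9, 24.0   # hypothetical figures
    print(array_mtbf_no_redundancy(mtbf, n))       # ~22,000 hours
    print(array_mtbf_single_parity(mtbf, n, mttr)) # ~23 million hours
```

Note how strongly the result depends on MTTR: halving the repair time doubles the array MTBF, which is why hot spares and fast rebuilds matter.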
Access time performance is analogous to disk access time, and is a measure of the latency between the issue of an I/O request and the RAID system's response. RAID arrays will exhibit similar access time latencies to the disks they are comprised of; indeed the only major difference seen in this area is the additional time taken by a RAID controller to calculate parity on writes, or to recover data should an error be detected. As parity calculations are usually done by dedicated hardware, and the operations used (e.g. XOR) are simple, this time is usually two orders of magnitude lower than the mechanically constrained access time. RAID schemes which are implemented in the host operating system, using a RAID device driver, may however incur appreciable CPU overheads as well as time delays to calculate parity on writes, and it is fair to say that software implementations of RAID controller functions perform worse than dedicated hardware controllers.
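The parity operation itself is simple enough to sketch in a few lines of Python - illustrative only, as real controllers do this in dedicated hardware, a word or a block at a time:

```python
# XOR parity as used by most RAID controllers: the parity block is the
# bytewise XOR of the data blocks, and any single missing block can be
# rebuilt by XORing the surviving blocks with the parity block.

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Simulate the loss of the second data block and recover it:
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```

The same XOR also explains the write penalty of parity schemes: a small write must read the old data and old parity back before the new parity can be computed.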
Sustained transfer rate (bandwidth) performance is the area where a RAID system excels in comparison with a single disk, as the controller is in effect merging the data streams from all accessed disks in the array. The larger the number of disks concurrently accessed in the array, the greater the achieved array transfer rate. In some configurations, array transfer rates can be as high as N times the transfer rate of an individual disk. Consider a RAID Level 3 system with 9 spindles, each with a head transfer rate (bandwidth) of 5 Megabytes/s. The aggregate transfer rate for the 8 data disks in the array is 40 Megabytes/s, which is twice the bandwidth of a fully loaded 16 bit Wide SCSI-2 bus. Indeed the proliferation of high performance RAID arrays has in turn driven the development of standards such as 20 MHz clock-speed ANSI Fast-20 SCSI / Ultra-SCSI, the 16-bit wide incarnation of which will provide a 40 Mbytes/s peak bandwidth.
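The bandwidth arithmetic of that example can be restated as a short sketch:

```python
# Level 3 aggregate bandwidth arithmetic from the example in the text:
# 9 spindles, one of which carries parity, each with a 5 Mbyte/s head
# transfer rate.
spindles = 9
parity_spindles = 1
per_disk_mb_s = 5

aggregate = (spindles - parity_spindles) * per_disk_mb_s
print(aggregate)   # 40 Mbytes/s -- twice a loaded 20 Mbyte/s Wide SCSI-2 bus
```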
Host Interface and Controller Performance
The availability of a RAID array capable of delivering in excess of 20 Mbytes/s sustained transfer rate will require both a RAID controller and a host interface capable of supporting this performance. At this time, the host interface is in many instances the bottleneck to performance, and will prevent the system from fully exploiting the potential of the RAID system.
In sizing a host interface to a RAID array it is imperative that we establish what the peak and sustained transfer rate of the array is. As the controller is typically an integral part of the RAID equipment, the only choice we may have in this area is the selection of interface, needless to say the faster the better. Should a wide 20 MHz Fast-20 interface be available, it should be used.
At the host end of the interface the issue is somewhat more complex than merely acquiring an I/O board with the appropriate flavour of SCSI, in that we should also give some consideration to the underlying I/O bus topology in the system. Because a RAID array will transiently saturate most I/O busses, there is a good case to be made for dedicating an I/O bus, SCSI controller board and SCSI bus to the RAID array, so as to provide a dedicated datapath between the system's main bus and the RAID array. Otherwise we run the risk of impairing the performance of any other devices which may share an I/O datapath with the RAID device.
Another important issue is the presence or absence of caching in the RAID controller. Many recent controller designs will include several Megabytes of battery backed DRAM in the RAID controller, performing a function analogous to the embedded cache in a disk drive. The hit rate achieved by such a cache will be dependent upon access statistics, but should it prove to be significant, the array will, regardless of its physical access performance, saturate the SCSI bus on cache hits. Under sustained load, however, the cache will saturate, and the array's performance will be determined by the organisation used and by disk performance.
An anecdotal note of interest here is a case the author recalls, when some years ago a Unix host attached to a RAID array experienced intermittent hangs on its SCSI interface. The problem was eventually isolated to the device driver, which was written on the assumption that the host would always be faster than the SCSI device. Alas in this instance the RAID array pumped data so quickly that the host couldn't keep up, and the device driver entered an unrecoverable state. As it turned out, this contingency had never been tested for.
File System Issues
An important issue which has received no coverage as yet in the wider RAID debate is the issue of filesystem interaction with RAID array block address mapping. Modern Unix filesystems will often employ elaborate block placement optimisation strategies to minimise access times - these strategies are universally based upon the model of a conventional disk drive and its geometrical dependency of access time.
Many RAID schemes will scatter data blocks across multiple disks in the array, so as to allow the heads on each disk to independently read off part of the data to be extracted. This has the advantage of allowing concurrent seeks as well as concurrent reads or writes. Consider however the situation where the filesystem has tried to be clever and put blocks down into consecutive locations in what it believes to be a cylinder group. The set of block addresses which would map into a single cylinder on one drive may in fact map into very different cylinders on each of the array drives. The effect may result in a situation where each of the drives in the array must seek to very different cylinders to find their respective blocks of data. As a result, the optimisation by the filesystem may actually increase the range of cylinders which the heads must seek to, thereby compromising access time, as the array cannot complete the I/O until the last of the blocks is read or written.
This is an unfortunate aspect of block interleaved RAID schemes, and those users who may have experienced poor performance from a block interleaved scheme (eg Level 5) may have run into exactly this effect. The solution is to choose a stripe size suitable for the file sizes in use, as this allows consecutive block addresses in the filesystem to map into consecutive blocks on array drives more frequently. Should the array controller not support striping, you are out of luck.
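The effect can be sketched with a hypothetical block-interleaved mapping function: with a stripe unit of S blocks spread over N data disks, consecutive filesystem block addresses rotate across spindles, and only a larger stripe unit keeps runs of consecutive addresses on one drive. The mapping below is an illustration, not any particular vendor's scheme.

```python
def map_block(addr, n_data_disks, stripe_unit):
    """Map a linear block address onto (disk, block offset on that
    disk) in a hypothetical block-interleaved array."""
    stripe = addr // (n_data_disks * stripe_unit)
    within = addr % (n_data_disks * stripe_unit)
    disk = within // stripe_unit
    offset = stripe * stripe_unit + within % stripe_unit
    return disk, offset

# With a 1-block stripe unit, eight "consecutive" addresses rotate
# across all four spindles; with a 4-block stripe unit, the first four
# stay together on one drive.
print([map_block(a, 4, 1) for a in range(8)])
print([map_block(a, 4, 4) for a in range(8)])
```

This is why a stripe size matched to typical file sizes lets filesystem block placement survive the mapping more often.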
RAID Level 3 Performance
RAID Level 3 has been the victim of much unjustified negative marketing hype, which is most unfortunate because in many applications it is a superlative performer. Level 3 arrays are bit interleaved, as a result of which they will produce the illusion of a single large disk with N times the transfer rate of a single disk. Ideally a Level 3 array should employ spindle synchronisation, which causes all of the drives in the array to rotate in lockstep. As a result, all drive heads will arrive on cylinder at the same time, and all drive heads will experience the same rotational latency - in an unsynchronised array the controller must wait for the drive with the greatest rotational latency before it can begin to read from the medium.
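The cost of forgoing spindle synchronisation can be estimated with a simple model: if each drive's rotational latency is uniformly distributed over one revolution of period T, the expected wait for the worst of N drives is T·N/(N+1), against T/2 for a single (or synchronised) drive. The figures below assume a hypothetical 5400 RPM drive.

```python
# Expected rotational latency: a single (or spindle-synchronised)
# drive averages half a revolution; an unsynchronised N-drive array
# must wait for the slowest drive, averaging T * N / (N + 1).

def expected_max_latency(n_disks, rev_ms):
    """Mean of the maximum of n_disks independent latencies, each
    uniformly distributed on [0, rev_ms)."""
    return rev_ms * n_disks / (n_disks + 1)

T = 60_000.0 / 5400            # ms per revolution at 5400 RPM (~11.1 ms)
print(expected_max_latency(1, T))   # ~5.6 ms, the single-disk average
print(expected_max_latency(9, T))   # ~10.0 ms, nearly a full revolution
```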
The behaviour of a synchronised Level 3 array is most interesting, as it has the property of preserving the block address mapping of a single disk. In turn, this means that any filesystem-produced optimisations in block placement will be preserved, particularly if we tune the filesystem rotational parameters accordingly. Where we are running with a high performance Unix filesystem, such as the FFS/UFS or 4.4 BSD EFS, filesystem performance will be preserved, in comparison with a conventional disk. The gain in transfer rate performance should be very close to the ratio of transfer rates between the array and a single disk.
Where Level 3 is most useful is in applications which require high sustained transfer rates, typically involving small numbers of large or very large files, and where access times characteristic of conventional disks are acceptable. Remote Sensing, GIS and multimedia are all areas which can benefit from using Level 3 array organisation.
Because Level 3 array organisation preserves the behaviour of the filesystem, applications which do not interact favourably with a Unix filesystem are less suitable for Level 3 systems, and this should be carefully considered when integrating a system.
RAID Level 5 and 6 Performance
The two principal block interleaved array organisations, RAID Level 5 and Level 6, may be implemented in a number of forms, and the performance characteristics of these implementations may vary appreciably. The central issue in block interleaved schemes is that of where to put the parity block. Constraints are usually applied before we can start evaluating parity block placement schemes. A common constraint, in an N x M array, is to ensure that there is only ever one parity block per row or column. This prevents a shared failure across any row or column in the array from disabling the whole array. Other constraints may also be applied, the typical objective of which is to ensure that block aligned and block sized writes do not require readback of other blocks to compute the parity block.
Serious RAID theorists divide parity placement schemes into classes, such as symmetric placement and a family of asymmetric placements. The number of possible placement schemes is staggering, and readers who are really interested are directed to UCB Technical Report CSD 90/573, which provides a good discussion of the subject.
One of the conclusions from the theoretical work at UCB was that Left Symmetric placement schemes generally performed best, although significant differences were found in relative performance across this class of placement schemes. Flat Left Symmetric placement demonstrated best read performance, and also the worst write performance, whereas Left Symmetric placement demonstrated best write performance and worst read performance. Extended Left Symmetric placement provided a reasonable compromise.
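The Left Symmetric layout referred to above can be sketched for an N-disk Level 5 array: the parity block rotates one disk to the left per stripe, and data blocks follow it so that a full-stripe read touches every disk exactly once. The indexing below follows the commonly published form of the layout; an actual controller's block numbering may differ.

```python
def left_symmetric_row(stripe, n_disks):
    """Return one stripe of a Left Symmetric RAID 5 layout as a list
    indexed by disk number: 'P' for the parity block, otherwise a
    data block label."""
    parity_disk = (n_disks - 1 - stripe) % n_disks
    row = ["P"] * n_disks
    base = stripe * (n_disks - 1)       # first data block of this stripe
    for i in range(n_disks - 1):
        disk = (parity_disk + 1 + i) % n_disks
        row[disk] = f"D{base + i}"
    return row

# Five stripes over five disks: parity rotates left, and consecutive
# data blocks D0..D3 in any stripe fall on distinct disks.
for s in range(5):
    print(left_symmetric_row(s, 5))
```

Over N stripes every disk carries parity exactly once, which spreads the parity-update traffic evenly rather than concentrating it on a dedicated parity spindle as Level 4 does.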
What is important from the perspective of an integrator is that there are a multiplicity of ways in which a RAID level 5 parity placement may be implemented, and each of these has idiosyncrasies in performance. The nature of the user application, and its access statistics should therefore be carefully examined. In a typical Unix environment with properly sized buffer cache, and good cache hit rates, traffic will be dominated by writes, and therefore schemes with better write performance should be favoured. However, many database applications may exhibit poor locality in read I/Os and thus poor cache hit rates, thereby biasing I/O traffic toward read operations.
Striping in Block Interleaved RAID Schemes
One of the problems in block interleaved RAID schemes is an inherent paradox, in that schemes which do a good job of spreading I/O operations across all disks will also reduce the efficiency of transfers by reducing the average size of a transfer to a disk. Conversely, schemes which maximise the amount transferred per I/O may cause hot spots by concentrating I/O traffic on some disks, while others may remain idle. The central challenge is therefore finding the best tradeoff in I/O granularity for the application in use.
A good RAID Level 5 implementation will provide for an adjustable stripe size, as this mechanism can be used most effectively for tuning the performance of a Level 5 array to the application in use. A substantial amount of research has been done in this area by Chen and Lee, who concluded that optimal stripe size is sensitive to workload. Thus it follows that where workload is well known, through an existing application or relevant benchmark, there is a good case for benchmarking array performance over a range of stripe sizes to find the optimal value. Where the workload is not known a priori, a number of rules of thumb do exist for calculating reasonable stripe size values, although these figures can only ever be an estimate in the absence of real workload figures.
General conclusions from existing research into striping of Level 5 arrays suggest that write intensive workloads prefer smaller stripe sizes than read intensive workloads, and that read intensive workloads exhibit similar behaviour to non-redundant (Level 0) striped arrays. Optimal stripe sizes for the tests conducted varied between 20 kByte and 40 kByte, with disk access time and transfer rate performance being important factors. It was found that the optimal stripe size is usually proportional to the product of average disk access time and disk transfer rate, and will therefore vary significantly across disk types. Mixing dissimilar disks in an array is therefore not a clever idea, as it will be exceedingly difficult to tune the array's performance to an optimum.
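The rule of thumb cited above can be turned into a back-of-envelope estimator. Note that 1 Mbyte/s equals 1 kbyte/ms, so the access-time-bandwidth product needs no unit conversion; the constant of proportionality is workload dependent, and the value of 1/2 used below is purely illustrative.

```python
def stripe_size_estimate(access_ms, transfer_mb_s, k=0.5):
    """Rough per-disk stripe unit in kbytes, proportional to the
    product of average access time and transfer rate. The constant
    k is workload dependent; 0.5 is an illustrative guess."""
    return k * access_ms * transfer_mb_s   # ms * kbyte/ms = kbytes

# A hypothetical 12 ms, 5 Mbyte/s drive lands inside the 20-40 kByte
# band reported for the tests above:
print(stripe_size_estimate(12.0, 5.0))   # 30.0
```

Such a figure is only a starting point for benchmarking over a range of stripe sizes, not a substitute for measurement against the real workload.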
As is evident, there is more to RAID than necessarily meets the eye. RAID arrays, where properly integrated and tuned to an application, can provide a significant increase in performance over conventional alternatives, without the painful cost penalties of methods such as mirroring. To properly integrate and tune a RAID array to an application is however not a trivial task, and requires a very good insight into the behaviour of the RAID organisation to be used, the application to be used, and the performance characteristics of the host interface.
Most of the mythology concerning RAID the author has heard was at odds with technical reality, and it follows that those who choose to use RAID without understanding its complexities will run the risk of expending additional resources for no measurable gain in performance.
|$Revision: 1.1 $|
|Last Updated: Sun Apr 24 11:22:45 GMT 2005|
|Artwork and text © 2005 Carlo Kopp|