Lecture course Speichersysteme (Storage Systems), Summer Semester 2009. Chapter 5: RAID. André Brinkmann
Introduction to Disk Arrays
Contents:
- Why disk arrays?
- MTTF, MTTR, MTTDL
- RAID 0, RAID 1, RAID 5
- Multiple disk failures and RAID 6
Use Arrays of Small Disks
Katz and Patterson asked in 1987: can smaller disks be used to close the gap in performance between disks and CPUs?
- Conventional: 4 disk designs (3.5", 5.25", 10", 14"), from low end to high end
- Disk array: 1 disk design (3.5")
Slides based on a lecture by Prof. D. Patterson (Berkeley)
Replacing a Small Number of Large Disks with a Large Number of Small Disks (1988 example)

                      IBM 3390K    IBM 3.5" 0061    70 small disks    Improvement
  Capacity            20 GB        320 MB           23 GB             1x
  Volume              97 ft^3      0.1 ft^3         11 ft^3           9x
  Power consumption   3 kW         11 W             1 kW              3x
  Bandwidth           15 MB/s      1.5 MB/s         120 MB/s          8x
  IO rate             600 IO/s     55 IO/s          3900 IO/s         6x
  MTTF                250 kHrs     50 kHrs          ?? Hrs
  Cost                $250,000     $2,000           $150,000          1.6x

Disk arrays offer potentially high performance, with many MB per volume and MB per kW. But what about reliability?
MTTF, MTTR, MTTDL
- Mean Time To Failure (MTTF): expected time until a disk fails. MTTF can be defined in terms of the expected value of the failure density function f:
  MTTF = ∫₀^∞ t · f(t) dt
- Mean Time To Repair (MTTR): expected time from the failure of a disk until its recovery is complete (if recovery is possible)
- Mean Time To Data Loss (MTTDL): expected time from starting a storage system until the loss of data
Bathtub Curve
- The MTTF is calculated for the normal operating period of a disk
- The failure rate is nearly constant over a period of 3 to 4 years
- The failure rate before and afterwards is significantly higher
(Figure: bathtub curve with an early period of infant failures, a constant-failure-rate region during normal operation, and a wear-out failure period at the end of life)
RAID
- Parallel disk arrays provide high bandwidth and good properties concerning MB per volume and MB per kW, but what about reliability?
- MTTF of an array of n disks: MTTF_array = MTTF_disk / n
- In the example: MTTF_70 = 50,000 h / 70 ≈ 700 h
- The MTTF of the disk array decreases from roughly six years to about one month
- Arrays (without redundancy) are too unreliable to be useful!
Reliability and Availability
- Reliability: the ability of a system or system component to perform its function under defined conditions for a specified period of time. Measured as MTTF (Mean Time To Failure).
- Availability: the degree to which a system or system component is accessible. Assumption here: the system is not accessible during reconstruction.
Slide based on the IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries, and on Torell / Avelar: Mean Time Between Failure: Explanations and Standards
Redundant Array of Inexpensive Disks
- Files are striped across multiple disks
- Redundancy yields high data availability. Availability: service is still provided to the user, even if some components have failed
- Disks will still fail, but their contents can be reconstructed from data redundantly stored in the array
- Capacity penalty to store the redundant information
- Bandwidth penalty to update the redundant information
See D. A. Patterson, G. A. Gibson, R. H. Katz: A Case for Redundant Arrays of Inexpensive Disks (RAID)
Berkeley History: RAID I (1989)
- SUN 4/280 workstation with 128 MByte DRAM, four SCSI controllers, 28 5.25" disks, and special disk-striping software
- Today RAID is a $27 billion industry; 80% of all non-PC disks are sold as RAID systems
RAID Levels
RAID = Redundant Array of Independent Disks. Well-known RAID levels:
- 0: no redundancy (JBOD)
- 1: mirroring
- 10: striped mirrors
- 2: Hamming codes / ECC (not used)
- 3: byte-interleaved parity
- 4: block-interleaved parity
- 5: rotated block-interleaved parity
- 6: double parity (rare)
RAID 0
- RAID 0 stripes data over a set of disks; the size of each data block is several KByte
- Increased bandwidth for big accesses, or for many parallel but small accesses
- RAID 0 does not include redundancy information: no protection against single disk failures

Layout (legend: x-y means block x from stripe y):

  Disk:   0    1    2    3    4
         0-0  1-0  2-0  3-0  4-0
         0-1  1-1  2-1  3-1  4-1
         0-2  1-2  2-2  3-2  4-2
         0-3  1-3  2-3  3-3  4-3

- The location can be efficiently calculated for n disks: stripe address y = address / n, disk number x = address % n
- Logical address 12 is mapped to stripe 2 on disk 2
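The address mapping above can be sketched in a few lines of Python (a minimal illustration; the function name is hypothetical):

```python
def raid0_map(address: int, n_disks: int) -> tuple[int, int]:
    """Map a logical block address to (disk number x, stripe address y) in RAID 0."""
    stripe = address // n_disks   # stripe address y = address / n
    disk = address % n_disks      # disk number x = address % n
    return disk, stripe

# Logical address 12 on a 5-disk array maps to disk 2, stripe 2:
print(raid0_map(12, 5))  # → (2, 2)
```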
RAID 1
- Every disk is fully mirrored; very high availability can be achieved
- Bandwidth sacrifice on write: one logical write = two physical writes
- Reads can be optimized
- Most expensive RAID solution: 100% capacity overhead

Layout (legend: x-y means block x on disk y):

  Disk:   0    1
         0-0  0-1
         1-0  1-1
         2-0  2-1
         3-0  3-1
Parity RAID
Properties of the previous RAID levels:
- Mirroring produces high overhead
- Striping does not include failure correction
A scheme is required with low capacity overhead, good failure-protection properties, and low computing costs.
Idea of RAID 3 and RAID 4: use striping plus a parity computation; the parity is computed using XOR.
Example: divide the data block 1101 into 4 sub-blocks plus one parity block (1 XOR 1 XOR 0 XOR 1 = 1):

  Disk:     0    1    2    3    4 (parity)
  Block:   0-0  1-0  2-0  3-0  4-0
  Value:    1    1    0    1    1
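The parity computation and its use for reconstruction can be illustrated with a short Python sketch (the function name is hypothetical):

```python
from functools import reduce

def parity(blocks):
    """XOR parity over the blocks of a stripe."""
    return reduce(lambda a, b: a ^ b, blocks)

data = [1, 1, 0, 1]        # the 4 sub-blocks of the example block 1101
p = parity(data)           # 1 ^ 1 ^ 0 ^ 1 = 1

# If one block is lost, XORing the survivors with the parity recovers it:
lost = 2
survivors = [b for i, b in enumerate(data) if i != lost]
recovered = parity(survivors + [p])
print(p, recovered)  # → 1 0
```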
Terminology
- Striping unit: block used to distribute data
- Stripe: set of striping units that share a parity computation
- Parity block: block that keeps the parity of a stripe; same size as a striping unit
Small Write Problem
The read performance of a parity RAID is nearly as good as that of RAID 0, but each small write to a single block x needs to update the parity block!
Solution 1:
- Read all other data blocks of the stripe except the changed one
- Calculate the new parity
- Write the new data block and the parity block
- Overhead is proportional to the stripe size
Solution 2: use the properties of the XOR function: x XOR x = 0, and XOR is associative and commutative.
- Read only the ONE old data block x_old and the old parity block p_old
- We know the new data block x_new
- Just calculate the new parity block as: p_new = p_old XOR x_old XOR x_new
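Solution 2 can be verified with a small sketch in which blocks are modeled as integers (real implementations XOR whole byte buffers; the function name is hypothetical):

```python
def small_write_update(p_old: int, x_old: int, x_new: int) -> int:
    """Incremental parity update: p_new = p_old XOR x_old XOR x_new."""
    return p_old ^ x_old ^ x_new

data = [0b1010, 0b0110, 0b1100]          # data blocks of one stripe
p_old = data[0] ^ data[1] ^ data[2]      # initial parity

x_new = 0b0001                           # small write to block 1
p_new = small_write_update(p_old, data[1], x_new)
data[1] = x_new

# The incrementally updated parity equals the fully recomputed parity:
assert p_new == data[0] ^ data[1] ^ data[2]
```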
RAID 5
- RAID 4 produces a bottleneck at the parity disk: each write access to an arbitrary block produces one write request at the parity disk
- RAID 4 therefore does not scale with the stripe size
- Idea of RAID 5: distribute the function of the parity disk over all disks

Layout (legend: x-y means block x from stripe y, P-y is the parity block of stripe y; the parity position rotates over the disks):

  Disk:   0    1    2    3    4
         0-0  1-0  2-0  3-0  P-0
         0-1  1-1  2-1  P-1  3-1
         0-2  1-2  P-2  2-2  3-2
         0-3  P-3  1-3  2-3  3-3
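The rotation of the parity block can be sketched as follows. This shows one simple rotation scheme; real controllers use several layout variants (left/right, symmetric/asymmetric), and the helper name is hypothetical:

```python
def raid5_stripe(stripe: int, n_disks: int) -> list[str]:
    """Contents of one stripe row with the parity block shifted by one
    disk position per stripe (a simple rotated layout)."""
    parity_disk = (n_disks - 1 - stripe) % n_disks
    row, block = [], 0
    for disk in range(n_disks):
        if disk == parity_disk:
            row.append(f"P-{stripe}")
        else:
            row.append(f"{block}-{stripe}")
            block += 1
    return row

for s in range(4):
    print(raid5_stripe(s, 5))
# → ['0-0', '1-0', '2-0', '3-0', 'P-0'], ['0-1', '1-1', '2-1', 'P-1', '3-1'], …
```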
Does Redundancy Help?
- For a 1-error-correcting code over n disks, data is lost only if a second disk fails before the first failure has been repaired
- The failure probability during recovery depends on the time to recover the data and on the MTTF of the remaining devices
Assumptions
- Errors do not occur in the wear-out phase; the MTTF (of one disk) is constant
- An exponential failure distribution leads to the failure density function: f(t) = λ · e^(−λt)
- λ is the rate at which an element fails, expressed in failures per unit of time
- It holds that MTTF = ∫₀^∞ t · λ · e^(−λt) dt = 1/λ
Probability of a Second Failure
The probability of a second failure within the repair time is the integral over the density function:
  P(t ≤ MTTR) = ∫₀^MTTR f(t) dt = ∫₀^MTTR λ · e^(−λt) dt = 1 − e^(−λ·MTTR)
Here λ = 1/MTTF, and therefore
  P(t ≤ MTTR) = 1 − e^(−MTTR/MTTF)
Probability of a Second Failure
The exponential function can be expanded as a series:
  e^(−x) = 1 − x + x²/2! − x³/3! + …
With MTTR << MTTF this yields
  P(t ≤ MTTR) = 1 − e^(−MTTR/MTTF) ≈ MTTR/MTTF
and, with n − 1 remaining disks after the first failure, the MTTDL can be calculated as
  MTTDL ≈ (MTTF/n) · (MTTF / ((n−1) · MTTR)) = MTTF² / (n · (n−1) · MTTR)
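The quality of the first-order series approximation for MTTR << MTTF can be checked numerically. The numbers below are illustrative, not from the slides:

```python
import math

def p_second_failure(mttr: float, mttf: float, remaining: int) -> float:
    """Exact probability that one of the remaining disks fails within MTTR,
    assuming independent exponential failure times with rate 1/MTTF each."""
    return 1.0 - math.exp(-remaining * mttr / mttf)

mttf, mttr, n = 50_000.0, 24.0, 10        # hours; illustrative values
exact = p_second_failure(mttr, mttf, n - 1)
approx = (n - 1) * mttr / mttf            # first-order series term
print(f"{exact:.6f} vs {approx:.6f}")     # nearly identical since MTTR << MTTF
```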
Does Redundancy Help?
Standard RAID schemes increase the MTTDL for n disks from MTTF/n without data protection to MTTF² / (n · (n−1) · MTTR).
Drawbacks:
- Writing becomes slower
- The complexity of implementation and administration increases significantly
Is This Safe Enough?
Assumptions for a storage cluster environment:
- 1 PByte of data stored on 2000 computers
- The environment is grouped into 200 RAID 5 systems with 10 disks each
- The MTTF of each computer (including disks) is 1000 days
- The recovery time of a computer is 1 day
MTTDL = (1/200) · (1000 d)² / (10 · 9 · 1 d) ≈ 55 d
Protection against single disk failures is not enough in large-scale environments.
Example taken from the Lustre Manual v1.6, August 2007
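The calculation above can be reproduced directly (a minimal sketch; the function name is hypothetical):

```python
def mttdl_raid5_cluster(mttf_days: float, mttr_days: float,
                        disks_per_group: int, n_groups: int) -> float:
    """MTTDL of n_groups independent RAID 5 groups.

    Per group: MTTDL = MTTF^2 / (n * (n-1) * MTTR). Data is lost as soon as
    any one group loses data, so divide by the number of groups."""
    per_group = mttf_days ** 2 / (disks_per_group * (disks_per_group - 1) * mttr_days)
    return per_group / n_groups

# 200 RAID 5 groups of 10, MTTF = 1000 days, MTTR = 1 day:
print(round(mttdl_raid5_cluster(1000, 1, 10, 200), 1))  # → 55.6 days
```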