LVM on RAID
The Problem
- I want a lot of storage for my TV!
- I don't want to lose data if a disk crashes
- I want to add more disks later
The Answer
- Buy disks - lots of them
- Use RAID
- Use LVM
RAID (Redundant Arrays of Inexpensive Disks) is about putting your data on 2 or more disks in such a way that if one of them crashes you don't lose your data (it's as if everything is stored twice - but cleverer!)
[comment: since I did this I've had one of my disks die and be RMA'd. No lost data :) OTOH, during the month that the drive was away I was 'unprotected' and had a minor read error which caused much trouble - but thankfully no data loss]
- Note that you should always have one spare matched drive in house for any RAID system, so if one fails, you can replace it immediately. --04:58, 27 January 2006 (UTC)
You always lose some space - but we don't want to lose half (which is what happens with simple 'mirroring') so we want to use raid5. That way we can buy 4 disks (or more!) and only 'lose' 1 of them.
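To put numbers on the "only lose 1 of them" claim, here is a rough sketch of the arithmetic. The function name raid5_usable is made up for this page, and it assumes all disks are the same size.

```shell
#!/bin/sh
# RAID5 keeps one disk's worth of parity, so usable space is
# (number of disks - 1) x size of one disk.
# raid5_usable is a hypothetical helper, not part of mdadm.
raid5_usable() {
    disks=$1
    per_disk=$2
    if [ "$disks" -lt 3 ]; then
        echo "raid5 needs at least 3 disks" >&2
        return 1
    fi
    echo $(( ($disks - 1) * $per_disk ))
}

# Four 250 GB disks: prints the usable size in GB.
raid5_usable 4 250
# The same sum in 1K blocks, using the per-device size that
# mdadm --detail reports further down this page.
raid5_usable 4 245111552
```

With four disks you keep three quarters of what you bought; a mirror of the same four disks would keep only half.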
Why not just use RAID?
If we run out of room then we need to allocate more. You can't (currently) add disks to a RAID5 array.
This is no longer true for users of Linux kernel 2.6.17 and above. To grow an existing RAID array you have to enable the following kernel option:
Device Drivers->Multiple devices driver support (RAID and LVM)->RAID support->RAID-4/RAID-5/RAID-6 mode-> Support adding drives to raid-5 array (experimental)
Then to resize your raid5 simply run:
mdadm --add /dev/md1 /dev/sdb1
mdadm --grow /dev/md1 --raid-disks=4    (new number of disks)
See Growing a raid5 for more details
What LVM lets you do is collect all your disks, RAID arrays and what-not into a big 'box' of storage (called a volume group). You can then make logical volumes from this space (think partitions). The good thing is that you can then simply add more disks to the big collection and allocate space to existing logical volumes, i.e. grow them. Essentially you can have a partition that starts on one disk and ends on another (and may include another in the middle!)
Why not just use LVM?
LVM only lets you mirror or stripe - we want resilience but mirroring is bad - we need to buy twice as much storage as we want (or more realistically - we only end up using half of what we can afford ;) ).
How many disks?
For raid5 you need at least 3 disks. You can go up from there, and every additional drive adds that much more space. You need to do a little pre-planning, as it is not necessarily easy to add drives to the array after it is constructed, and LVM adds to this problem. There is a utility, raidreconf, but it is not well supported and cannot operate while the array is mounted. It's also doubtful that you can resize the LVM physical volume. As such, to add drives you will have to copy the data off and re-create the array. Keep this in mind before you store 600 GB of data that you will later need to copy off somewhere.
Do both :)
- Create a raid5 array and add it to our 'box'.
- Then we make a filesystem from the space in the box.
- In the future we can add another array and add that to the box too.
- Then we can 'grow' the filesystem.
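The last two steps could look roughly like the sketch below. It only prints the commands (a dry run), so it is safe to paste. The second array /dev/md1 and the +200G figure are made up for illustration; the volume names and the /huge mount point are the ones used later on this page.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for adding a second array to the
# volume group and growing the volume. Nothing is executed.
VG=video_vg            # volume group used on this page
LV=video_lv            # logical volume used on this page
NEW_PV=/dev/md1        # hypothetical second RAID array
GROW_BY=200G           # hypothetical amount to add
MOUNTPOINT=/huge

plan_growth() {
    echo "pvcreate $NEW_PV"                     # label the new array for LVM
    echo "vgextend $VG $NEW_PV"                 # add it to the 'box'
    echo "lvextend -L +$GROW_BY /dev/$VG/$LV"   # give the volume more room
    echo "xfs_growfs $MOUNTPOINT"               # grow XFS while mounted
}

plan_growth
```

On ReiserFS the last step would use resize_reiserfs on the logical volume device instead of xfs_growfs on the mount point.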
Oh yes - that's another thing. You must use a filesystem that can grow - so either Reiser or XFS. XFS can't shrink so (despite having worked for SGI myself) I'd use Reiser. (You need to set data=ordered as kernel parameter on pre-2.6.6 kernels so a power outage won't cause data loss.) [Later edit: I changed my mind after having had a few corruptions - I now have pure XFS]
Reiser is not a good choice for MythTV, as it is good with small files, not very large ones. XFS is the best choice.
Doing it...
I have three 250 GB SATA disks (the fourth is on order):
/dev/sda, /dev/sdb & /dev/sdc.
Normal IDE disks will be called things like
/dev/hdd, /dev/hde & /dev/hdf.
I'm using 2.6.6 with libata and SATA. I've compiled in the raid extensions.
You also need
mdadm - I'm using 1.5.0
Prepare The Disks
Use all the disks - but create a partition so you can specify the type. This lets the kernel pick up your raid array when it boots.
- Partition the disks (repeat for each disk):
% fdisk /dev/sda
The number of cylinders for this disk is set to 30515.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-30515, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-30515, default 30515):
Using default value 30515
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
Make The Array
- Wipe everything:
% mdadm --stop /dev/md0
% mdadm --zero-superblock /dev/sda1
% mdadm --zero-superblock /dev/sdb1
% mdadm --zero-superblock /dev/sdc1
- Make the array:
% mdadm -v --create /dev/md0 --chunk=128 --level=raid5 --raid-devices=4 /dev/sda1 \
    /dev/sdb1 /dev/sdc1 missing
mdadm: layout defaults to left-symmetric
mdadm: size set to 245111616K
mdadm: array /dev/md0 started.
- What have we got?
% mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Thu Jun 3 20:24:17 2004
     Raid Level : raid5
     Array Size : 735334656 (701.27 GiB 752.98 GB)
    Device Size : 245111552 (233.76 GiB 250.99 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Thu Jun 3 20:24:17 2004
          State : clean, no-errors
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 128K
    Number   Major   Minor   Raid Device   State
       0       8        1        0         active sync   /dev/sda1
       1       8       17        1         active sync   /dev/sdb1
       2       8       33        2         active sync   /dev/sdc1
       3       0        0       -1         removed
           UUID : d6ac1605:db6659e1:6460b9c0:a451b7c8
         Events : 0.5078
I've used 'missing' for one of my disks (it's in the post). This has created the array as if one of the disks is dead. When the new one arrives I'll hot-add it and watch the array rebuild itself.
- Update the config file (do this whenever you change the configuration, e.g. add a spare or mark a disk as faulty):
% echo 'DEVICE /dev/sd*' >/etc/mdadm/mdadm.conf
% echo 'PROGRAM /bin/echo' >>/etc/mdadm/mdadm.conf
% echo 'MAILADDR firstname.lastname@example.org' >>/etc/mdadm/mdadm.conf
% mdadm --detail --scan >>/etc/mdadm/mdadm.conf
% cat /etc/mdadm/mdadm.conf
LVM2 : Logical Volumes
- Now make the array device usable by LVM2:
% pvcreate /dev/md0
No physical volume label read from /dev/md0
Physical volume "/dev/md0" successfully created
- Create a volume group that we can use to make logical volumes from:
% vgcreate -s 64M video_vg /dev/md0
-s 64M means that we're 'limited' to 4 TB - sounds mad, but I can see us keeping all our DVDs etc. on a volume.
Adding physical volume '/dev/md0' to volume group 'video_vg'
Archiving volume group "video_vg" metadata.
Creating volume group backup "/etc/lvm/backup/video_vg"
Volume group "video_vg" successfully created
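The 4 TB figure quoted for -s 64M can be checked with a quick sum. This sketch assumes the ceiling comes from the 65534-extents-per-logical-volume limit that the vgcreate man page gives for the older lvm1 metadata format; that reading of the text is an assumption, and max_lv_gib is a made-up helper name.

```shell
#!/bin/sh
# Largest logical volume = extent size x maximum extents per volume.
# 65534 is the per-volume extent limit documented for lvm1-format
# metadata (assumed here to be the limit the text refers to).
max_lv_gib() {
    extent_mib=$1
    max_extents=${2:-65534}
    # Integer division, so the result is rounded down to whole GiB.
    echo $(( $extent_mib * $max_extents / 1024 ))
}

max_lv_gib 64   # 64 MiB extents, as passed to vgcreate -s above
max_lv_gib 4    # 4 MiB extents, the vgcreate default
```

With 64 MiB extents the ceiling lands just under 4 TiB; with the 4 MiB default it would be just under 256 GiB, which is less than the 600 GB volume created here.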
I only want to use 600 GB to start with...
- Finally create a volume:
% lvcreate -L600G -nvideo_lv video_vg
Logical volume "video_lv" created
- Format it:
% mkfs -treiserfs /dev/video_vg/video_lv
All data on /dev/video_vg/video_lv will be lost. Do you really want to create reiser filesystem (v3.6) (y/n) y
Creating reiser filesystem (v3.6) with standard journal on /dev/video_vg/video_lv
initializing skiped area: done
initializing journal: done
syncing...done
- Mount it:
% mount /dev/video_vg/video_lv /huge
- See your results:
cu:~# df -k
Filesystem                     1K-blocks    Used  Available Use% Mounted on
/dev/hda2                       19283776  697988   18585788   4% /
/dev/hda1                          97826   13003      79604  15% /boot
/dev/mapper/video_vg-video_lv  629126396   32840  629093556   1% /huge
Performance Enhancements / Tuning
- There are reports on numerous mailing lists about performance issues with LVM on RAID. The commands below attempt to address the majority of these.
Warning! Caveat Emptor!
The following, while well-documented and as stable as the rest of the MD / LVM subsystem, can lock up your machine and destroy data. This is most likely to occur if you go overboard with your settings (setting the cache larger than your physical RAM, etc). Have fun, benchmark things out, but DO NOT attempt these settings on arrays already containing important data unless you're confident that your settings are appropriately conservative. Don't say you weren't warned.
If you still want to try it...
This seems to be correctable by setting the read-ahead values. Ideally, this should be done to both your md device and the logical volume:
% blockdev --setra 4096 /dev/md0
% blockdev --setra 4096 /dev/video_vg/video_lv
You can also set the read-ahead on the drive device itself, but not everyone has seen an improvement here (possibly because of the read-ahead configured by hdparm?). The numerical value is in 512-byte sectors, not bytes, so 4096 means only 2 MB of read-ahead and larger values are relatively safe to attempt. On DirkGecko's 4-drive RAID 5 SATA array, setting the read-ahead boosted the read performance by a full 50MB/sec. Understandably, it had no significant effect on writes.
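To get a feel for what a given --setra number means, here is a small conversion helper. It assumes the 512-byte sector unit that the blockdev man page documents for --setra; the function name is made up for this page.

```shell
#!/bin/sh
# blockdev --setra takes a count of 512-byte sectors.
# setra_to_kib (hypothetical helper) converts that count to KiB.
setra_to_kib() {
    sectors=$1
    echo $(( $sectors * 512 / 1024 ))
}

setra_to_kib 4096    # the value used above: 2048 KiB, i.e. 2 MiB
setra_to_kib 16384   # a larger value you might benchmark: 8 MiB
```

You can read back the current setting of a device with blockdev --getra, which reports the value in the same unit.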
If you have RAM to burn, you can also increase the size of the software RAID MD cache. This is directly analogous to the cache RAM on hardware RAID cards. However, the burst performance available via Linux's software RAID significantly outstrips that available to hardware RAID due to the PCI bus that a "relatively inexpensive" RAID card must contend with.
% echo 8192 > /sys/block/md0/md/stripe_cache_size
This value is in pages per device, which, for a 4-drive array, comes out to 128MB. On DirkGecko's array, this increased write performance on an 8GB file from 52MB/sec to 62MB/sec. On files less than the size of the cache, performance was WELL ABOVE 250MB/sec. This will be most useful if you find yourself I/O bound on writes.
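The 128MB figure can be reproduced with the sum below, which is worth redoing before you pick a value for your own array. It assumes a 4096-byte page, which is typical on x86 (getconf PAGESIZE will tell you for sure); stripe_cache_mib is a made-up helper name.

```shell
#!/bin/sh
# Memory used by the md stripe cache:
#   stripe_cache_size (pages per device) x page size x number of devices
# stripe_cache_mib is a hypothetical helper; the third argument is the
# page size in bytes and defaults to 4096 (an assumption).
stripe_cache_mib() {
    pages=$1
    devices=$2
    page_bytes=${3:-4096}
    echo $(( $pages * $page_bytes * $devices / 1024 / 1024 ))
}

stripe_cache_mib 8192 4    # the setting above on a 4-drive array
stripe_cache_mib 8192 8    # same setting on an 8-drive array
```

Note how the cost scales with the number of drives, so a value that is comfortable on four drives may be too greedy on eight.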
- In order for these parameters to be restored later, you'll want to add them to your rc.local or similar bootup script.
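A minimal sketch of what such rc.local lines might look like, using the device names and values from this page. Each setting is only applied if its target exists, so the fragment does nothing on a machine without the array. The variable and function names are made up; adjust the values to whatever your own benchmarks favoured.

```shell
#!/bin/sh
# Re-apply the read-ahead and stripe cache settings at boot.
MD=md0                       # md array from this page
LV=/dev/video_vg/video_lv    # logical volume from this page
READAHEAD=4096               # in 512-byte sectors
STRIPE_CACHE=8192            # in pages per device

apply_tuning() {
    if [ -b "/dev/$MD" ]; then
        blockdev --setra "$READAHEAD" "/dev/$MD"
    fi
    if [ -b "$LV" ]; then
        blockdev --setra "$READAHEAD" "$LV"
    fi
    cache="/sys/block/$MD/md/stripe_cache_size"
    if [ -w "$cache" ]; then
        echo "$STRIPE_CACHE" > "$cache"
    fi
    # Never let a failed tweak abort the rest of the boot script.
    return 0
}

apply_tuning
```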