LVM on RAID


This article shows how to set up LVM file storage partitions on top of multiple disk drives configured as redundant RAID arrays.

The Problem

  • I want a lot of storage for my TV!
  • I don't want to lose data if a disk crashes
  • I want to add more disks later

The answers:

  • Buy disks - lots of them
  • Use RAID
  • Use LVM

What's RAID?

RAID (Redundant Arrays of Inexpensive Disks) is about putting your data on 2 or more disks in such a way that if one of them crashes you don't lose your data (it's as if everything is stored twice - but cleverer!)

[comment: since I did this I've had one of my disks die and be RMA'd. No lost data :) OTOH, during the month that the drive was away I was 'unprotected' and had a minor read error which caused much trouble - but thankfully no data loss]

Note that you should always have one spare matched drive in house for any RAID system, so if one fails, you can replace it immediately.

You always lose some space - but we don't want to lose half (which is what happens with simple 'mirroring') so we want to use raid5. That way we can buy 4 disks (or more!) and only 'lose' 1 of them.

Why not just use RAID?

If we run out of room then we need to allocate more. You can't (currently) add disks to a RAID5 array.

This is no longer true for users of Linux kernel 2.6.17 and above. To grow an existing raid you have to enable the following kernel option:

Device Drivers->Multiple devices driver support (RAID and LVM)->RAID support->RAID-4/RAID-5/RAID-6 mode->Support adding drives to raid-5 array (experimental)

Then to resize your raid5 simply run:

  mdadm --add /dev/md1 /dev/sdb1
  mdadm --grow /dev/md1 --raid-disks=4 (new number of disks)

Note: This causes the raid to restripe itself, which took about 11 hours on my machine.

See Growing a raid5 (http://scotgate.org/?p=107) for more details.
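
While the reshape runs you can keep an eye on its progress with the standard md status file (nothing here is specific to this setup):

% cat /proc/mdstat
% watch -n 60 cat /proc/mdstat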

What's LVM?

What LVM lets you do is collect all your disks, raid arrays and what-not into one big 'box' of storage (called a volume group). You can then make logical volumes from this space (think partitions). The good thing is that you can later simply add more disks to the collection and allocate the extra space to existing logical volumes - i.e. grow them. Essentially you can have a partition that starts on one disk and ends on another (and may include another in the middle!)

Why not just use LVM?

LVM on its own only lets you mirror or stripe - we want resilience, but mirroring is wasteful: we'd need to buy twice as much storage as we want (or, more realistically, we'd only end up using half of what we can afford ;) ).

How many disks?

For raid5 you need at least 3 disks. You can go up from there, and every additional drive adds that much more usable space (with N drives you get the capacity of N-1 of them, so four 250GB drives give roughly 750GB). You need to do a little pre-planning, as it is not necessarily easy to add drives to the array after it is constructed (though see the note above about growing a raid5 on 2.6.17+ kernels), and LVM adds to this problem. There is a utility called raidreconf, but it is not well supported and cannot operate while the filesystem is mounted, and it's doubtful that you can resize the LVM physical volume with it. To add drives that way you would have to copy the data off and re-create the array - keep this in mind before you store 600GB of data and then need to copy it off somewhere.

The Answer

Do both :)

  • Create a raid5 array and add it to our 'box'.
  • Then we make a filesystem from the space in the box.
  • In the future we can add another array and add that to the box too.
  • Then we can 'grow' the filesystem.

Oh yes - that's another thing. You must use a filesystem that can grow - so either Reiser or XFS. XFS can't shrink, so (despite having worked for SGI myself) I'd use Reiser. (You need to set data=sorted as a kernel parameter on pre-2.6.6 kernels so a power outage won't cause data loss.) [Later edit: I changed my mind after having had a few corruptions - I now have pure XFS]

Reiser is not a good choice for MythTV, as it is better suited to small files than to very large ones. XFS is the best choice.
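
For what it's worth, the later 'grow' step usually looks something like this (just a sketch - /dev/md1 here stands for a hypothetical second array, the volume names match the ones created later in this article, and xfs_growfs applies if you chose XFS; reiserfs users would run resize_reiserfs instead):

% pvcreate /dev/md1
% vgextend video_vg /dev/md1
% lvextend -L +200G /dev/video_vg/video_lv
% xfs_growfs /huge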

Doing it...

My Setup

I have three 250GB SATA disks (the fourth is on order).

They're called /dev/sda /dev/sdb & /dev/sdc.

Normal IDE disks will be called things like /dev/hdd, /dev/hde & /dev/hdf.

The Software

I'm using kernel 2.6.6 with libata and SATA. I've compiled in the RAID extensions. You also need mdadm - I'm using 1.5.0.

Prepare The Disks

Use all of each disk - but create a partition on it so you can specify the partition type (fd, Linux raid autodetect). This lets the kernel pick up your raid array when it boots.

  • Format the disks:
% fdisk /dev/sda
 
The number of cylinders for this disk is set to 30515.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-30515, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-30515, default 30515):
Using default value 30515

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
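
The other disks need the same single type-fd partition. If the drives are identical you can save yourself repeating the interactive session by copying the partition table across with sfdisk (double-check the device names first - this overwrites the target's partition table):

% sfdisk -d /dev/sda | sfdisk /dev/sdb
% sfdisk -d /dev/sda | sfdisk /dev/sdc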

Make The Array

  • Wipe everything:
% mdadm --stop /dev/md0
% mdadm --zero-superblock /dev/sda1
% mdadm --zero-superblock /dev/sdb1
% mdadm --zero-superblock /dev/sdc1
  • Make the array:
% mdadm -v --create /dev/md0 --chunk=128 --level=raid5 --raid-devices=4 /dev/sda1 \
             /dev/sdb1 /dev/sdc1 missing
mdadm: layout defaults to left-symmetric
mdadm: size set to 245111616K
mdadm: array /dev/md0 started.
  • What have we got?
% mdadm --detail /dev/md0

/dev/md0:
        Version : 00.90.01
  Creation Time : Thu Jun  3 20:24:17 2004
     Raid Level : raid5
     Array Size : 735334656 (701.27 GiB 752.98 GB)
    Device Size : 245111552 (233.76 GiB 250.99 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Jun  3 20:24:17 2004
          State : clean, no-errors
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K 

    Number   Major   Minor   Raid Device State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       0        0       -1      removed
           UUID : d6ac1605:db6659e1:6460b9c0:a451b7c8
         Events : 0.5078

Note:
I've used 'missing' for one of my disks (it's in the post). This has created the array as if one of the disks is dead. When the new one arrives I'll hot-add it and watch the array rebuild itself.
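
For reference, when the fourth disk does arrive the hot-add itself is a one-liner (assuming it shows up as /dev/sdd and has been given the same type-fd partition as the others):

% mdadm --add /dev/md0 /dev/sdd1

The array then rebuilds onto the new disk in the background; /proc/mdstat shows the progress.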

  • Update the config file (do this whenever you change the configuration, e.g. add a spare, mark a disk as faulty, etc.):
% echo 'DEVICE /dev/sd*' >/etc/mdadm/mdadm.conf
% echo 'PROGRAM /bin/echo' >>/etc/mdadm/mdadm.conf
% echo 'MAILADDR david@dgreaves.com' >>/etc/mdadm/mdadm.conf
% mdadm --detail --scan >>/etc/mdadm/mdadm.conf
% cat /etc/mdadm/mdadm.conf
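
The MAILADDR and PROGRAM lines only take effect when mdadm is running in monitor mode; if your distribution doesn't already start that for you, something like this in a boot script will get you mailed when a disk fails:

% mdadm --monitor --scan --daemonise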

LVM2 : Logical Volumes

  • Now make the array device usable by LVM2:
% pvcreate /dev/md0
  No physical volume label read from /dev/md0
  Physical volume "/dev/md0" successfully created
  • Create a volume group that we can use to make logical volumes from:
% vgcreate -s 64M video_vg /dev/md0

-s 64M sets the physical extent size to 64MB, which means we're 'limited' to 4TB - sounds mad but I can see us keeping all DVDs, etc. on a volume.

 Adding physical volume '/dev/md0' to volume group 'video_vg'
 Archiving volume group "video_vg" metadata.
 Creating volume group backup "/etc/lvm/backup/video_vg"
 Volume group "video_vg" successfully created

I only want to use 600GB to start with...

  • Finally create a volume:
% lvcreate -L600G -nvideo_lv video_vg

Logical volume "video_lv" created
  • Format it:
% mkfs -treiserfs /dev/video_vg/video_lv

All data on /dev/video_vg/video_lv will be lost. Do you really want to create reiser filesystem
(v3.6) (y/n) y
Creating reiser filesystem (v3.6) with standard journal on /dev/video_vg/video_lv
initializing skiped area: done
initializing journal: done
syncing...done


  • Mount it:
% mount /dev/video_vg/video_lv /huge
  • See your results:
cu:~# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda2             19283776    697988  18585788   4% /
/dev/hda1                97826     13003     79604  15% /boot
/dev/mapper/video_vg-video_lv
                     629126396     32840 629093556   1% /huge
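
To get it mounted automatically at boot, add a line along these lines to /etc/fstab (assuming the reiserfs filesystem created above - substitute xfs if that's what you used):

/dev/video_vg/video_lv  /huge  reiserfs  defaults  0  2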

Performance Enhancements / Tuning

  • There are reports on numerous mailing lists about performance issues with LVM on RAID. The commands below attempt to address the majority of these.

Warning! Caveat Emptor!

The following settings, while as well-documented and stable as the rest of the MD / LVM subsystem, can lock up your machine and destroy data. This is most likely to occur if you go overboard (setting the cache larger than your physical RAM, etc.). Have fun and benchmark things, but DO NOT attempt these settings on arrays already containing important data unless you're confident that your settings are appropriately conservative. Don't say you weren't warned.

If you still want to try it...

This seems to be correctable by setting the read-ahead values. Ideally, this should be done to both your md device and the logical volume:

% blockdev --setra 4096 /dev/md0
% blockdev --setra 4096 /dev/video_vg/video_lv

You can also set the read-ahead on the drive device itself, but not everyone has seen an improvement here (possibly because of the read-ahead already configured by hdparm?). The value is in 512-byte sectors rather than bytes, so 4096 corresponds to 2MB of read-ahead; moderately larger values are still relatively safe to attempt. On DirkGecko's 4-drive RAID 5 SATA array, setting the read-ahead boosted read performance by a full 50MB/sec. Understandably, it had no significant effect on writes.
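
You can check what read-ahead a device currently has with the matching --getra option, which makes before/after benchmarking easier:

% blockdev --getra /dev/md0
% blockdev --getra /dev/video_vg/video_lv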

If you have RAM to burn, you can also increase the size of the software RAID MD cache. This is directly analogous to the cache RAM on hardware RAID cards. However, the burst performance available via Linux's software RAID significantly outstrips that available to hardware RAID due to the PCI bus that a "relatively inexpensive" RAID card must contend with.

% echo 8192 > /sys/block/md0/md/stripe_cache_size

This value is in pages per device, which for a 4-drive array comes out to 128MB (8192 pages × 4KB per page × 4 devices). On DirkGecko's array, this increased write performance on an 8GB file from 52MB/sec to 62MB/sec. On files smaller than the cache, performance was WELL ABOVE 250MB/sec. This will be most useful if you find yourself I/O bound on writes.

  • In order for these parameters to be restored later, you'll want to add them to your rc.local or similar bootup script.
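
For example, something like the following at the end of /etc/rc.local (the exact file varies by distribution; device names match the examples above) reapplies the settings at boot:

# reapply the RAID/LVM tuning from this section
blockdev --setra 4096 /dev/md0
blockdev --setra 4096 /dev/video_vg/video_lv
echo 8192 > /sys/block/md0/md/stripe_cache_size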