Linux LVM in 2012

Just some notes about sort of newish stuff (depends on your distro) in Linux LVM. It’s August 2012. I’m mainly using Redhat 6.3/Fedora 17 at the moment.

Snapshot Merging into Original (ie. drop the snapshot and revert the origin to the state it was in when the snapshot was taken)

If you dig through my old posts you’ll see some convoluted method of doing this using low level device mapper commands. But now you can do it with a simple command. First let’s say you create a snapshot:

  lvcreate -s -n lvsnap -L1G /dev/vgcrap/lvhome

Then make some modifications to whatever is in lvhome, and decide you want to ‘roll back’ to what is in the snapshot. So you just do this:

  lvconvert --merge /dev/vgcrap/lvsnap

Cool stuff. What I thought was cooler is that it even works on mounted volumes, with one caveat. Here’s what happened when I tried to do this on a root logical volume:

  Can't merge over open origin volume
  Merging of snapshot lvsnap will start next activation.

So basically, if you reboot now, the snapshot gets merged back in when the LVM config is first loaded (which logically is a safe time to do it).

I’m not too sure when snapshot merging came into the more stable distros. I don’t think Redhat 6.1 had it, but it’s definitely there in 6.3.

Auto extension of snapshots

(NB: Apologies if this has been in LVM for ages. I only just noticed it). One of the semi-annoying things when creating snapshots of logical volumes is that you need to ‘guess’ how much change you expect to occur during the life of the snapshot. So in my example earlier I used ‘-L1G’ to allow for up to a gig of changes. I think you’ve always been able to manually extend the size of a snapshot, so I could have written a job that regularly runs ‘lvs’, looks for snapshots that are getting close to 100% full, then does an ‘lvextend’ on the snapshot volume to ‘give it a bit more space’ depending on the space left in your VG.
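
Something along those lines might look like the rough, untested sketch below (the 80% trigger and 512M increment are numbers I made up, it ignores how much free space is actually left in the VG, and it assumes your lvs supports the ‘snap_percent’ field):

  #!/bin/bash
  # Rough sketch only: extend any snapshot that is over 80% full by another 512M.
  lvs --noheadings -o vg_name,lv_name,snap_percent 2>/dev/null |
  while read vg lv pct; do
      [ -z "$pct" ] && continue      # not a snapshot, no Snap% value
      pct=${pct%%.*}                 # drop the decimal part
      if [ "$pct" -ge 80 ]; then
          lvextend -L +512M "/dev/$vg/$lv"
      fi
  done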

Well it turns out there are some config params in /etc/lvm/lvm.conf that will do this auto-extension for you. For example, if a snapshot gets to 70% full, I can have it auto-extended by another 20%:

 snapshot_autoextend_threshold = 70
 snapshot_autoextend_percent = 20

One thing I immediately noticed when I tried this out myself is that the detection is still ‘delayed’. ie. if your snapshot fills up too quickly and hits 100% before the autoextend detector kicks in, then you’re stuffed (I gather the monitoring is done by dmeventd, which only checks usage every so often). My test case was with a very small snapshot; logically, as you use larger snapshots at the outset, the jump from say 70% to 100% takes longer anyway.

Thin Pools and Thin Volumes

If you have used any kind of storage appliance or ESXi or Xenserver you will know all about ‘thin volumes’. The basic premise is that if you create a logical volume of, say, 100GB, you might not be using all of it initially. Perhaps you are using it for a VM, and the initial install only takes 2GB or so, so that’s 98GB you have sort of tied up for a rainy day. The idea behind thin volumes is that you only really use space out of a logical volume when you write to a block of it. Unwritten blocks are considered to be zeroed, so if you read from the other 98GB (in my example) you just get zeroes from the LVM driver and no disk read occurs. This way ‘that other 98GB’ can be used for other volumes.

Anyway, you can do this with lvcreate now. You need to create a thin pool first, so you still have to set aside a bit of disk space out of your VG, but you can make it on the small side and either manually extend it or (just like with snapshots) configure lvm.conf to auto-extend it once it gets to a certain percentage full (or so I thought). Anyway, let’s say I want a thin pool called ‘thinpool’ of 90GB in size:

  lvcreate -T -L 90G /dev/vgcrap/thinpool

So that ‘ties up’ 90GB that you cannot use for regular volumes in your VG.

Then you create a thin volume within your thin pool:

  lvcreate -T -V 100G -n lvthintest /dev/vgcrap/thinpool

Notice how the size is now specified with ‘-V’ (for virtual size) and notice how it can be bigger than your thin pool. You just have to remember that disk space is only used as you write to it. You can now create a file system on your thin volume, write stuff to it, and the filesystem will ‘think’ it is on a 100GB partition. Of course, the thin volume will only keep working so long as the thin pool does not fill up. So don’t dd /dev/zero to the entire device: even though you are writing zeros, they are still ‘writes’, so you will fill the thin pool before you fill the thin volume itself.
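
As a quick sanity check (the /mnt/thintest mount point is just something I made up), you can watch how much real space is being chewed up via the Data% column that lvs shows for thin pools and thin volumes:

  mkfs.ext4 /dev/vgcrap/lvthintest
  mount /dev/vgcrap/lvthintest /mnt/thintest
  lvs vgcrap        # the Data% column shows actual allocation, not the 100GB virtual size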

Just like with regular logical volumes, you can make snapshots of thin volumes, and the cool thing is that you do not have to specify a size.

  lvcreate -s -n lvsnap2 /dev/vgcrap/lvthintest

Regarding the auto-extend functionality of thin pools, well I could not get it to work (so far I’ve tried Redhat 6.3 and Fedora 17). The details of how to use it are in the comments of /etc/lvm/lvm.conf on most distros, but for whatever reason it never auto-extends for me:

 thin_pool_autoextend_threshold = 70
 thin_pool_autoextend_percent = 20

Another very cool feature of thin volumes is that they can actually shrink themselves under certain circumstances. Basically what you do is create a thin volume, put an ext4 file system on it, and mount the ext4 filesystem with the ‘discard’ option, just as you would if you were using an SSD drive with TRIM support. What the discard option actually does is pass ‘delete’ requests down to the next block layer. In the case of an ext4 filesystem on a standard partition on an SSD drive, the ‘deletes’ might get passed to the kernel SATA driver, and hence on to the SSD drive. With my example, the ‘deletes’ are now passed to the underlying LVM layer, and LVM knows it can shrink a thin volume if it receives appropriate ‘delete’ requests. This is regardless of whether you are using an SSD or not. In fact, I think the ‘deletes’ are passed all the way down the chain, so if you were using an SSD these TRIM requests would also end up being sent to the SSD.
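
For example (again, the mount point and fstab entry are just illustrative):

  mount -o discard /dev/vgcrap/lvthintest /mnt/thintest
  # or the equivalent /etc/fstab entry:
  # /dev/vgcrap/lvthintest  /mnt/thintest  ext4  defaults,discard  0 2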

One area where this ‘pass down’ of deletes does not appear to be implemented is in qemu-kvm. So if you create a thin volume, use that thin volume as the virtual disk for qemu-kvm, install Linux in the VM and have its root fs as an ext4 filesystem with ‘discard’ set on the mount point, the ‘deletes’ don’t appear to get passed any further down than qemu-kvm. The net result is that your thin volume grows and grows. (I did figure out a terrible way around this: stop your VM, use kpartx on the thin volume to create partition mappings in the host OS so that you can see your guest’s root partition, run ‘e2fsck -f -E discard’ against that mapped partition device, remove the kpartx devices, then start your guest up again.)
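
A rough sketch of that workaround, with the guest shut down (the mapped partition name below is a guess; use whatever names kpartx actually reports on your system):

  kpartx -av /dev/vgcrap/lvthintest                     # map the guest's partitions in the host
  e2fsck -f -E discard /dev/mapper/vgcrap-lvthintest1   # discard free blocks on the guest's root fs
  kpartx -dv /dev/vgcrap/lvthintest                     # remove the mappings again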

Better LVM RAID devices

If you’ve used other logical volume systems like Veritas or the built-in one with HP-UX, you would have noticed that software RAID is built right in. For example, if you created a simple logical volume with no redundancy, it was a simple logical volume command to dynamically add a mirror in. You could dynamically break mirrors, mount them elsewhere, etc. All very, very useful. Compare that with the common usage cases for LVM on Linux, and a quick google will show that you either layer LVM on top of a hardware RAID controller solution, or you layer it on top of MD-style Linux software RAID. In fact there are lots of recipes for creating RAID1 or RAID5 redundancy using the md driver, then using these md devices as PVs in a volume group.

However, LVM on Linux has seemingly had some kind of software RAID ability for some time … but if you bothered to read the docs on it, you’d come away thinking it was a bit odd. For example, the older-style LVM RAID1 wanted you to use three disks. Have a look at this redhat page on the older style LVM mirroring. I’m sure quite a few admins have gone ‘well I’ve got my two drives, and I want to create a mirror, let’s check the docs online … and WTF? I need a third disk????’ If you dig a bit, you’ll see some people put this ‘extra log’ data on one of the two disks in the mirror (and hence you don’t need a third disk), or they put the ‘extra log’ in memory, which means that your entire mirror needs to resync every time your box reboots!

So I was quite pleasantly surprised when going through the Redhat 6.3 doco to discover the ‘RAID Logical Volumes’ page. So now there is a ‘raid1’ type as well as the older ‘mirror’ type. Strangely, the LVM commands will still let you create either the older-style mirrors with their ‘extra log’ or the newer ‘raid1’ ones that work the way you want them to work (ie. you use up two disks and you can lose either disk).
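
Creating one looks something like this (the size and names are made up; the Redhat 6.3 doco covers the exact options):

  lvcreate --type raid1 -m 1 -L 10G -n lvmirror vgcrap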

You can also split a RAID1 mirror into two separate logical volumes. I think there is even some clever stuff that lets you track changes so that a split mirror can be merged back together again later without having to do a full resync.
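
I haven’t exercised this much, but going by the same doco the commands look roughly like the following (names made up; the ‘_rimage_1’ sub-volume name is how the docs refer to the split-off leg):

  lvconvert --splitmirrors 1 --trackchanges vgcrap/lvmirror   # split off one leg but keep tracking changes
  lvconvert --merge vgcrap/lvmirror_rimage_1                  # merge it back in; only changed bits resync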

You can also do RAID 4, 5 or 6 with LVM, though admittedly I have not tested them out.

One thing that I tried to do with the new raid1 type is get GRUB2 to boot off it. I’ve been using Fedora 16 and 17 a bit lately, and even though GRUB2 seems relatively annoying, it has the great feature of being able to boot your kernel off a logical volume. That means you don’t need a separate /boot anymore. BUT, it does not work with these new ‘raid1’ type LVM volumes. So that means you still probably need to put /boot on an md device.