- backingupvms
- Jonathan Haack
- Haack’s Networking
- webmaster@haacksnetworking.org
//backingupvms//
Warning: This blog post will age. To find the latest script, go here.
This tutorial is for users of Debian GNU/Linux who want to back up, manage space for, and keep versioned backups of KVM virtual machines (VMs from here on), assuming a pure virsh and qemu setup. The preconditions are that unmap is set on the virsh hypervisor and that fstrim runs inside the guest OS, with the timer enabled via systemctl enable --now fstrim.timer. You can optionally create a small script instead. With these requirements in place, the hypervisor should in theory receive the unused blocks from the guest OS and unmap them on the host OS, so the space is reclaimed on the host. Indeed, I tested this on small VMs by using dd to write some files, removing them, and checking the host OS, and it does work. However, there are at least two issues:
- It only partially works: there are leftover blocks that I can confirm are unused on the guest OS but that still consume space on the host OS.
- This results in bloat, and since cp --sparse=always has no way of detecting the untrimmed blocks, backups necessarily take longer.
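One way to observe this kind of bloat on the host is to compare a file's apparent size with the space it actually occupies. The sketch below is not part of my original workflow: image_usage is a hypothetical helper name, and it assumes GNU stat as shipped with Debian. For qcow2 images, qemu-img info reports similar figures as "virtual size" and "disk size".

```shell
#!/bin/sh
# Hypothetical helper (not from the original scripts): print the apparent
# size and the allocated size of a file, in bytes, using GNU stat.
image_usage() {
    file="$1"
    apparent=$(stat -c %s "$file")     # logical file size in bytes
    blocks=$(stat -c %b "$file")       # number of blocks allocated on disk
    block_size=$(stat -c %B "$file")   # size in bytes of each block reported by %b
    allocated=$((blocks * block_size))
    echo "apparent=$apparent allocated=$allocated"
}
```

Running image_usage on an image before and after a cleanup pass gives a rough idea of how much space was actually given back to the host.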
So, after scouring posts online, I found at least 5 tutorials recommending trim with unmap, but they were identical to my workflow, and I already knew from a week of testing that although this stops the images from growing indefinitely (which will happen without it), they still had the aforementioned bloat. Finally, I stumbled upon a frustrated VMware user who noticed the same behavior and decided the only workaround was to power down the VMs and use qemu-img convert instead. Ideally, I would like trim and unmap to work; however, I already perform powered-down backups on a regular cycle, so I figured it would not hurt to rewrite my backup scripts to leverage this approach. My original full backup script powered down the VM, copied the qcow2 image with cp --sparse=always, restarted the VM, then tarballed and compressed the copy. I figured it would be harmless to rewrite this to power down the VM and qemu-img convert the qcow2 image instead of using cp. The issue, however, was that I wanted the production VM to run on the non-bloated version, and since I could not figure out a way to get qemu-img convert to write two target images simultaneously, I had to add a separate line afterwards that copies the trimmed image once it is created. Here’s what I came up with (remember, always check the repo in case there were updates):
#!/bin/sh
DATE=$(date +"%Y%m%d-%H:%M:%S")
IMG="guest1.qcow2 guest2.qcow2"
for i in $IMG;
do
    virsh shutdown "$i"
    #give the guest time to shut down
    sleep 2m
    #only proceed if the qemu log confirms the guest shut down
    if tail -n 2 "/var/log/libvirt/qemu/$i.log" | grep "reason=shutdown"
    then
        START1="$(date +%s)"
        touch "/root/$i-loop-c.log"
        cd /mnt/vms/production || exit 1
        #write a fresh image without the bloat
        qemu-img convert -O qcow2 "$i" "trimmed.$i"
        #move the bloated original to a temp location for recovery
        mv "$i" "/mnt/vms/production/1tempstore/$i.$DATE.bloated.bak"
        #rename the trimmed image to the original vdisk
        mv "trimmed.$i" "$i"
        #make a copy of the newly trimmed image to the staging area
        cp -ar --sparse=always "$i" "/mnt/vms/production/2backstage/$i.bak"
        #once copy is made, restart the vm
        virsh start "$i"
        #move to the staging area and compress it with tar
        cd /mnt/vms/production/2backstage/ || exit 1
        tar --use-compress-program=pbzip2 -Scf "$i.bak.tar.bz2" "$i.bak"
        mv "/mnt/vms/production/2backstage/$i.bak.tar.bz2" "/mnt/vms/production/3tarballs/$i:$DATE.bak.tar.bz2"
        mv "/mnt/vms/production/2backstage/$i.bak" "/mnt/vms/production/2backstage/$i.$DATE.trimmed.bak"
        #prune tempstore, backstage, and tarballs appropriately
        find /mnt/vms/production/3tarballs/ -type f -mtime +120 -delete
        find /mnt/vms/production/1tempstore/ -type f -mtime +10 -delete
        find /mnt/vms/production/2backstage/ -type f -mtime +10 -delete
        END1="$(date +%s)"
        #POSIX arithmetic; the older $[ ] form is not understood by dash, Debian's /bin/sh
        DURATION1=$((END1 - START1))
        MINUTES=$((DURATION1 / 60))
        echo "$(date) Jonathan, the $i backup took exactly ${DURATION1} seconds and around ${MINUTES} minutes to complete." | tee -a "/root/$i-loop-c.log"
        mail -s "[loop-community-success]-$(hostname -f)-$(date)" alerts@haacksnetworking.org < "/root/$i-loop-c.log"
        rm "/root/$i-loop-c.log"
    else
        echo "The VM $i was either already off or failed to shutdown at $(date)." | mail -s "[loop-community-failed]-$(hostname -f)-$(date)" alerts@haacksnetworking.org
    fi
done
At first, I was a bit worried this would double the run time of the full backup script. But here’s the thing, lol: since qemu-img convert removes the bloat when it writes the target image, the overall time actually went down. Of course, if the VM in question grows and those are actually consumed blocks, not empty ones, then copying the image after the convert will ultimately take longer than not having to do that. But since the images I was originally copying were bloated, i.e., they had stubborn empty blocks that were not honoring trim and unmap, this new script actually reduced the total time by about 2 hours, taking only 97 minutes to complete. Now, if I could just find out how to make qemu-img convert write two images simultaneously, I could make this even quicker!
Lastly, while testing this last night, I had a stubborn VM that refused to shut down. The rest of the loop completed on the machines that did shut down, while the stubborn machine kept running even though its qcow2 image had been moved (presumably because the running qemu process kept its open handle on the renamed file). This was pretty scary, and I was surprised I did not lose any data. So, to address this, I added a conditional to the script that gives the guest OS two minutes to shut down, checks the qemu logs to make sure the guest OS has actually shut down, and fails otherwise. I also added some find commands that flush the tempstore and backstage directories of files older than 10 days, yet keep 120 days of tarballs, which I pull to an off-site backup with rsync over ssh.
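Since find with -delete is irreversible, it can help to preview what a retention rule would remove before trusting it in a nightly job. A minimal sketch, assuming GNU find; list_expired is a hypothetical helper and not part of the script above. It uses the same -type f -mtime +N test as the prune lines, but prints instead of deleting.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory whose
# modification time is more than the given number of days in the past.
# Same test as the prune lines in the backup script, without -delete.
list_expired() {
    dir="$1"
    days="$2"
    find "$dir" -type f -mtime +"$days" -print
}
```

For example, list_expired /mnt/vms/production/1tempstore/ 10 would show what the tempstore prune is about to drop. Note that find ignores fractional days, so -mtime +10 only matches files that are at least 11 days old.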
Note: If the machine is already powered off and backed up, the script will still run, so either remove the unused machine from the IMG list, or accept that it will make a new backup.
Update: I briefly took this post down in a moment of rage and spent 5 hours problem solving. I thought it was not working, but it turned out to be a typo in a VM name. While problem solving, I wrote another script using while instead (based on some great Stack Exchange posts here). Here’s the other script I came up with:
#!/bin/bash
DATE=$(date +"%Y%m%d-%H:%M:%S")
#IMG="gnulinux.social.qcow2 gnulinux.club.qcow2"
IMG="hackingclub.org.qcow2 hackingclub2.org.qcow2"
for i in $IMG;
do
    virsh shutdown "$i"
    #poll the domain state every 10 seconds until it is no longer running
    STATE=$(virsh dominfo "$i" | grep -w "State:" | awk '{ print $2}')
    while [ "$STATE" = "running" ]; do
        sleep 10
        STATE=$(virsh dominfo "$i" | grep -w "State:" | awk '{ print $2}')
    done
    START0="$(date +%s)"
    touch "/root/$i-loop-c.log"
    cd /mnt/vms/production || exit 1
    qemu-img convert -O qcow2 "$i" "trimmed.$i"
    #move the bloated original to a temp location for recovery
    mv "$i" "/mnt/vms/production/1tempstore/$i.$DATE.bloated.bak"
    #rename the trimmed image to the original vdisk
    mv "trimmed.$i" "$i"
    #make a copy of the newly trimmed image to the staging area
    cp -ar --sparse=always "$i" "/mnt/vms/production/2backstage/$i.bak"
    END0="$(date +%s)"
    DURATION0=$((END0 - START0))
    MINUTES0=$((DURATION0 / 60))
    START1="$(date +%s)"
    #once copy is made, restart the vm
    virsh start "$i"
    #move to the staging area and compress it with tar
    cd /mnt/vms/production/2backstage/ || exit 1
    tar --use-compress-program=pbzip2 -Scf "$i.bak.tar.bz2" "$i.bak"
    mv "/mnt/vms/production/2backstage/$i.bak.tar.bz2" "/mnt/vms/production/3tarballs/$i:$DATE.bak.tar.bz2"
    mv "/mnt/vms/production/2backstage/$i.bak" "/mnt/vms/production/2backstage/$i.$DATE.trimmed.bak"
    #prune tempstore, backstage, and tarballs appropriately
    find /mnt/vms/production/3tarballs/ -type f -mtime +120 -delete
    find /mnt/vms/production/1tempstore/ -type f -mtime +10 -delete
    find /mnt/vms/production/2backstage/ -type f -mtime +10 -delete
    END1="$(date +%s)"
    DURATION1=$((END1 - START1))
    MINUTES1=$((DURATION1 / 60))
    echo "At $(date) the $i image conversion took ${DURATION0} secs & ${MINUTES0} mins and the tarballing took ${DURATION1} secs & ${MINUTES1} mins to complete." | tee -a "/root/$i-loop-c.log"
    mail -s "[loop-community]-$(hostname -f)-$(date)" alerts@haacksnetworking.org < "/root/$i-loop-c.log"
    rm "/root/$i-loop-c.log"
done
Note: The disadvantage of this script is that it will hang indefinitely, without notifying me, if the VM fails to shut down. There are ways to handle this with timers on while, or with exit and the -e option, which I will explore later. For now, the tail script, although basic, is extremely reliable and powerful.
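One possible shape for a timer on the while loop is sketched below. This is an untested idea rather than something from my production scripts: wait_for_shutoff and domain_state are hypothetical helper names, the 300 second timeout and 10 second poll interval are arbitrary defaults, and it assumes virsh domstate prints "shut off" for a stopped domain.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original scripts.

# Print the current state of a domain (e.g. "running" or "shut off").
# Prints nothing if virsh cannot find the domain.
domain_state() {
    virsh domstate "$1" 2>/dev/null
}

# Wait until the domain reports "shut off".
# $1 = domain name, $2 = timeout in seconds (default 300),
# $3 = poll interval in seconds (default 10).
# Returns 0 once the domain is off, 1 if the timeout is reached first.
wait_for_shutoff() {
    domain="$1"
    timeout="${2:-300}"
    interval="${3:-10}"
    waited=0
    while [ "$(domain_state "$domain")" != "shut off" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

Inside the loop, the open-ended while could then become a single check after virsh shutdown: if wait_for_shutoff "$i" 300 10 fails, mail a failure notice and continue to the next image instead of touching the disk.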
Kindly,
oemb1905