Backing Up VMs – Haack's Networking

backingupvms
Jonathan Haack
Haack’s Networking
webmaster@haacksnetworking.org

//backingupvms//

Warning: This blog post will age. To find the latest script, go here.

This tutorial is for users of Debian GNU/Linux who want to back up, manage space for, and keep versioned copies of KVM virtual machines (VMs from here on out), managed purely with virsh and qemu. First of all, the preconditions: I am using fstrim with unmap set on the virsh hypervisor, and fstrim.timer is enabled inside the guest OS with systemctl enable --now fstrim.timer. You can optionally create a small script that runs fstrim instead of using the timer. So long as these requirements are in place, the hypervisor should theoretically receive the unused blocks from the guest OS and remove, or unmap, them on the host OS, meaning that you should ideally have reclaimed the space on the host OS. And, indeed, I've tested this on small VMs by using dd to write some files, removing them, and then checking the host OS, and it does work. However, there are at least two issues:

It only partially works: there are leftover blocks that I can confirm are unused on the guest OS but that still consume space on the host OS.
This results in bloat, and since cp --sparse=always has no way of detecting the untrimmed blocks, backups necessarily take longer.
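One way to observe the bloat described above is to look at how much space an image really occupies on the host and compare that with what the guest reports as used (for example with df inside the guest). The sketch below is not part of my backup scripts; the function name image_usage is made up for illustration, and it assumes GNU coreutils du.

```shell
#!/bin/sh
# Hypothetical helper: print the apparent size and the allocated size of a
# file in bytes. Assumes GNU du (--apparent-size and -B1).
image_usage() {
    img="$1"
    # size the file claims to have
    apparent=$(du --apparent-size -B1 "$img" | cut -f1)
    # blocks actually allocated on the host file system
    allocated=$(du -B1 "$img" | cut -f1)
    echo "apparent=$apparent allocated=$allocated"
}
```

For example, image_usage /mnt/vms/production/guest1.qcow2 prints both numbers for one image. If the allocated figure stays well above what the guest says it is using, even after fstrim has run, the image is carrying the kind of leftover blocks described above. qemu-img info reports a similar "disk size" figure.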

So, after scouring through some posts online, I found at least 5 tutorials recommending how to use trim with unmap, but they were identical to my workflow, and I already knew from a week of testing that although it stopped the images from growing indefinitely (which will happen without it), they still had the aforementioned bloat. Finally, I stumbled upon a frustrated vmware user who noticed the same behavior and decided the only workaround was to power down the VMs and use qemu-img convert instead. Ideally, I would like trim and unmap to work. However, I already perform powered-down backups on a regular cycle, so I figured it would not hurt to rewrite my backup scripts to leverage this approach.

My original full backup script powered down the VM, copied the qcow2 image with cp --sparse=always, restarted the VM, then tarballed and compressed the qcow2 image copy. I figured that it would be harmless to rewrite this to power down the VM and qemu-img convert the qcow2 image instead of using cp. The issue, however, was that I wanted the production VM to be the non-bloated version. Since I could not figure out a way to get qemu-img convert to write two target images simultaneously, I had to add a separate line afterwards that copies the trimmed image once it has been created. Here's what I came up with (remember, always check the repo in case there were updates):

#!/bin/sh
# Powered-down backup: shut down each VM, rewrite its image with
# qemu-img convert to drop unused blocks, then copy and compress it.
DATE=$(date +"%Y%m%d-%H:%M:%S")
IMG="guest1.qcow2 guest2.qcow2"
for i in $IMG; do
    virsh shutdown "$i"
    #give the guest two minutes to power off
    sleep 2m
    #only continue if the qemu log confirms the shutdown
    if tail -n 2 "/var/log/libvirt/qemu/$i.log" | grep -q "reason=shutdown"; then
        START1="$(date +%s)"
        touch "/root/$i-loop-c.log"
        cd /mnt/vms/production
        #write a trimmed copy of the image
        qemu-img convert -O qcow2 "$i" "trimmed.$i"
        #move the bloated original to a temp location for recovery
        mv "$i" "/mnt/vms/production/1tempstore/$i.$DATE.bloated.bak"
        #rename the trimmed image to the original vdisk
        mv "trimmed.$i" "$i"
        #make a copy of the newly trimmed image to the staging area
        cp -ar --sparse=always "$i" "/mnt/vms/production/2backstage/$i.bak"
        #once copy is made, restart the vm
        virsh start "$i"
        #move to the staging area and compress it with tar
        cd /mnt/vms/production/2backstage/
        tar --use-compress-program=pbzip2 -Scf "$i.bak.tar.bz2" "$i.bak"
        mv "/mnt/vms/production/2backstage/$i.bak.tar.bz2" "/mnt/vms/production/3tarballs/${i}:$DATE.bak.tar.bz2"
        mv "/mnt/vms/production/2backstage/$i.bak" "/mnt/vms/production/2backstage/$i.$DATE.trimmed.bak"
        #prune tempstore, backstage, and tarballs appropriately
        find /mnt/vms/production/3tarballs/ -type f -mtime +120 -delete
        find /mnt/vms/production/1tempstore/ -type f -mtime +10 -delete
        find /mnt/vms/production/2backstage/ -type f -mtime +10 -delete
        END1="$(date +%s)"
        #POSIX arithmetic; the older $[ ] form is not supported by dash, Debian's /bin/sh
        DURATION1=$((END1 - START1))
        MINUTES=$((DURATION1 / 60))
        echo "$(date) Jonathan, the $i backup took exactly ${DURATION1} seconds and around ${MINUTES} minutes to complete." | tee -a "/root/$i-loop-c.log"
        mail -s "[loop-community-success]-$(hostname -f)-$(date)" alerts@haacksnetworking.org < "/root/$i-loop-c.log"
        rm "/root/$i-loop-c.log"
    else
        echo "The VM $i was either already off or failed to shutdown at $(date)." | mail -s "[loop-community-failed]-$(hostname -f)-$(date)" alerts@haacksnetworking.org
    fi
done
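Because the script above swaps the production image for the converted copy, it may be worth confirming that both images hold the same guest-visible data before the swap. The following is only a sketch of that idea and is not part of my script: the function name safe_swap, its arguments, and the QEMU_IMG override variable are all hypothetical. It relies on qemu-img compare, which exits with 0 when the two images have identical content.

```shell
#!/bin/sh
# Hypothetical override so the helper can be exercised without qemu installed.
QEMU_IMG=${QEMU_IMG:-qemu-img}

# Hypothetical helper: replace the original image ($1) with the trimmed one
# ($2) only if qemu-img reports identical contents. The original is kept
# at the path given in $3.
safe_swap() {
    orig="$1"; trimmed="$2"; keep="$3"
    if "$QEMU_IMG" compare "$orig" "$trimmed" >/dev/null; then
        mv "$orig" "$keep" && mv "$trimmed" "$orig"
    else
        echo "compare failed for $orig, leaving it in place" >&2
        return 1
    fi
}
```

In the loop, this could stand in for the two mv lines, e.g. safe_swap "$i" "trimmed.$i" "/mnt/vms/production/1tempstore/$i.$DATE.bloated.bak". The trade-off is that the comparison reads both images in full, so it adds to the time the VM is powered down.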

At first, I was a bit worried this would double the run time of the full backup script. But here's the thing, lol: since qemu-img convert removes the bloat when it writes the target image, the overall time actually went down. Now, of course, if the instance or VM in question scales, and those are actually consumed blocks, not empty blocks, then yes, having to copy the image after the convert will ultimately take longer than not having to do that. But since the images I was originally copying were bloated, i.e., they had stubborn empty blocks that were not honoring trim and unmap, this new script actually reduced the total time by about 2 hours, taking only 97 minutes to complete. Now, if I could just find out how to make qemu-img convert write two images simultaneously, I could make this even quicker!

Lastly, while testing this last night, I had a stubborn VM that refused to shut down. The rest of the loop completed on the machines that did shut down, while the stubborn machine kept running even though its qcow2 image had been moved. This was pretty scary, and I was surprised I did not lose any data. So, to address this, I added a conditional to the script that gives the guest OS two minutes to shut down, checks the qemu log to make sure the guest OS has actually shut down, and sends a failure email otherwise. I also added some find commands that prune the tempstore and backstage directories after roughly ten days (so, before two weeks), yet keep about four months of tarballs, which I pull to an off-site backup with rsync over ssh.
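The retention windows follow from how find counts -mtime +N: a file's age is truncated to whole 24-hour periods and must be greater than N, so +10 only matches files that are at least 11 days old. Below is a small sketch for previewing what a prune would remove before trusting -delete; the function name list_prunable is hypothetical and not part of my scripts.

```shell
#!/bin/sh
# Hypothetical helper: list the files under directory $1 that
# "find -mtime +$2 -delete" would remove, without deleting anything.
list_prunable() {
    dir="$1"; days="$2"
    find "$dir" -type f -mtime +"$days" -print
}
```

For example, list_prunable /mnt/vms/production/1tempstore/ 10 shows the bloated originals that the next run would delete.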

Note: If the machine is already powered off and backed up, the script will still run, so either remove the unused machine from the IMG list, or accept that it will make a new backup.

Update: I briefly took this post down in a moment of rage and spent 5 hours problem solving. I thought the script was not working, but it turned out to be a typo in a VM name. During the process of problem solving, I wrote another script using while instead (based on some great Stack Exchange posts here). Here's the other script I came up with:

#!/bin/bash
DATE=$(date +"%Y%m%d-%H:%M:%S")
#IMG="gnulinux.social.qcow2 gnulinux.club.qcow2"
IMG="hackingclub.org.qcow2 hackingclub2.org.qcow2"

for i in $IMG; do

    virsh shutdown "$i"

    #poll every 10 seconds until the domain is no longer reported as running
    STATE=$(virsh dominfo "$i" | grep -w "State:" | awk '{ print $2 }')
    while [ "$STATE" = "running" ]; do
        sleep 10
        STATE=$(virsh dominfo "$i" | grep -w "State:" | awk '{ print $2 }')
    done

    START0="$(date +%s)"
    touch "/root/$i-loop-c.log"
    cd /mnt/vms/production
    #write a trimmed copy of the image
    qemu-img convert -O qcow2 "$i" "trimmed.$i"
    #move the bloated original to a temp location for recovery
    mv "$i" "/mnt/vms/production/1tempstore/$i.$DATE.bloated.bak"
    #rename the trimmed image to the original vdisk
    mv "trimmed.$i" "$i"
    #make a copy of the newly trimmed image to the staging area
    cp -ar --sparse=always "$i" "/mnt/vms/production/2backstage/$i.bak"
    END0="$(date +%s)"
    DURATION0=$((END0 - START0))
    MINUTES0=$((DURATION0 / 60))

    START1="$(date +%s)"
    #once copy is made, restart the vm
    virsh start "$i"
    #move to the staging area and compress it with tar
    cd /mnt/vms/production/2backstage/
    tar --use-compress-program=pbzip2 -Scf "$i.bak.tar.bz2" "$i.bak"
    mv "/mnt/vms/production/2backstage/$i.bak.tar.bz2" "/mnt/vms/production/3tarballs/${i}:$DATE.bak.tar.bz2"
    mv "/mnt/vms/production/2backstage/$i.bak" "/mnt/vms/production/2backstage/$i.$DATE.trimmed.bak"
    #prune tempstore, backstage, and tarballs appropriately
    find /mnt/vms/production/3tarballs/ -type f -mtime +120 -delete
    find /mnt/vms/production/1tempstore/ -type f -mtime +10 -delete
    find /mnt/vms/production/2backstage/ -type f -mtime +10 -delete
    END1="$(date +%s)"
    DURATION1=$((END1 - START1))
    MINUTES1=$((DURATION1 / 60))

    echo "At $(date) the $i image conversion took ${DURATION0} secs & ${MINUTES0} mins and the tarballing took ${DURATION1} secs & ${MINUTES1} mins to complete." | tee -a "/root/$i-loop-c.log"
    mail -s "[loop-community]-$(hostname -f)-$(date)" alerts@haacksnetworking.org < "/root/$i-loop-c.log"
    rm "/root/$i-loop-c.log"

done

Note: The disadvantage of this script is that it will just hang indefinitely, without notifying me, if the VM fails to shut down. There are ways to handle this, such as putting a timer on the while loop or using exit and the -e option, which I will explore later. For now, the tail script, although basic, is extremely reliable and powerful.
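One possible way to put a timer on the while loop is sketched below. This is not something I have put into production: the function name wait_for_shutdown, its arguments, and the VIRSH override variable are hypothetical. It polls virsh domstate, which prints the domain's state (for example running or shut off), and gives up once the timeout is reached.

```shell
#!/bin/sh
# Hypothetical override so the loop can be exercised without libvirt.
VIRSH=${VIRSH:-virsh}

# Hypothetical helper: wait until domain $1 is no longer running.
# Gives up after $2 seconds, polling every $3 seconds (default 10).
# Returns 0 once the domain has stopped, 1 on timeout.
wait_for_shutdown() {
    vm="$1"; timeout="$2"; step="${3:-10}"
    waited=0
    while [ "$("$VIRSH" domstate "$vm" 2>/dev/null)" = "running" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$step"
        waited=$((waited + step))
    done
    return 0
}
```

Inside the for loop, the bare while could then be replaced with a check such as: if ! wait_for_shutdown "$i" 300; then send the failure email and continue to the next VM; fi. That keeps the polling behavior of this script while restoring the failure notification that the tail script provides.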

Kindly,

oemb1905
