I’m moving large-ish (half-TB or larger) files between hosts. It’s important to avoid extra copies in this workflow, since each pass over one of the larger files to read the whole thing (and there are several of them) takes hours. I managed to decrypt VM disk images, transform them from one disk image format to another, copy them from one host to another, calculate SHA-256 hashes on both sides to verify data integrity, compress and encrypt them on the destination, and display a progress bar, all without any additional copies. One big block-device read on the source end and one big block-device write on the destination end is all of the disk I/O that’s happening.
See below for how I did this.
I have an oldish home file server, consisting of a 2015-era Xeon CPU, 32 GiB of RAM, and a heap of RAID-mirrored hard disks and a couple of SSDs. It’s filling a role that I didn’t have a name for before, but apparently the right name these days is a “homelab”: built for virtualization, it hosts a bunch of small guest VMs doing various things and not interfering with each other the way the do-everything predecessor of this server did. It lets me experiment with different operating systems and software in a sandboxed environment, and lets me isolate changes so upgrades to one service don’t break everything else.
Well, that was the idea, anyway. I got about 50% of the way through migrating everything off of the host OS and onto VM guests, and I was enjoying the resulting newness of guest OSs and the new tools that were available (example: newer borgmatic releases will directly ping monitoring services like healthchecks.io to say that the backup succeeded).
But the pain of managing VMs with the Virtualbox GUI made me reconsider, and eventually life happened and I got distracted from this project and just left it half-finished. Accessing the Virtualbox GUI is done via X11, and somehow the latest version of Virtualbox that I can use with the host OS has some serious redrawing issues, that leave windows half-drawn in unpredictable ways. It’s nearly unusable, and I’ve fiddled with it a lot to try and get it to work properly, with no luck. Also, sometimes the GUI just locks up and I have to kill the GUI app. Rarely, the GUI locks up and kill/relaunch doesn’t fix the stuckness (the re-run GUI just locks up immediately), and my only recourse is to reboot the host. (I could live without the GUI, but then I can’t see the VM console anymore. I could use RDP instead of X11, but that doesn’t work properly either, even after quite a bit of futzing.) It’s just janky as hell. Also, even when I’m not managing a VM and I’m just leaving it running, Virtualbox on this host occasionally makes my guest VMs go crazy with “access beyond end of device” kernel errors, and it seems I’m not the only one experiencing this. I don’t know why it’s doing this but it sure would be nice to have a stable and usable virtualization setup. I’ve lived with this for several years, because it’s a time sink and I have other things to do. It’s not great.
Well, now it has a disk failing (it’s a disk in a smallish RAID mirror, so it’s not a huge problem to just move the data off and stop using the remaining drive) and I have some free time to try again, and I also have a gaming PC that can be used for experimental installations and migrations while the existing server just keeps on truckin’. So I’m trying again, with the notion of abandoning the setup of Virtualbox VMs on manually-mounted Veracrypt-encrypted disk images, in favor of something newer and better and easier to manage. I had previously planned to just use LUKS to manage full-disk encryption, but as it turns out, ZFS is finally available for Linux, and Proxmox will happily manage VMs atop a ZFS-based system, so I decided to try that.
It’s great so far. I like ZFS, a lot. It solves a bunch of problems that would otherwise need to be solved by a stack of software, such as Linux Software RAID + LUKS + LVM + ext4.
It took extra work to get the whole zpool to be encrypted, since you can only do that at zpool creation time and the Proxmox installer doesn’t support ZFS encryption (even though the underlying OS, Debian 11 “bullseye”, does support ZFS encryption). Fortunately, I found an excellent guide to setting up a ZFS encrypted root on Debian Bullseye and a guide to installing Proxmox on an existing Debian Bullseye host, so that’s done now.
However, the VM disk images on the source side are .vdi disk image files located on Veracrypt disk images, whereas the destination VM disk images need to be in either qcow2 format in the filesystem, or RAW format in an encrypted and compressed ZFS zvol. So the files will need to go through a lot of processing between origin and destination:
- The files need to be converted from VDI to RAW format.
- I want to have a SHA-256 hash computed on the source, and again on the destination, so that I can be certain that the file copy actually succeeded without introducing any data corruption.
- They need to be copied from the source to destination.
- I want them to be encrypted at rest on both sides.
- Finally, I’d like a progress bar that tells me how the process is going.
The decryption and encryption part is pretty straightforward: Veracrypt volumes mounted as a block device do that transparently, as do ZFS zvols if they are encrypted.
ZFS also offers transparent compression, and zstd is available to make the compression so fast that it barely adds any overhead (smaller I/Os with a bit of added CPU overhead are generally faster than uncompressed I/O), so I’m using that with the zvols too. Just mark it as compression=zstd at creation time, and everything in it gets compression, unless you mark it as compression=off. (Incompressible data is automatically detected and left uncompressed, also transparent to the user.) It’s very, very slick, and apparently all the ZFS cool kids are just turning compression on for everything, unless they have a good reason not to (such as “it’s a huge volume of very incompressible data”, like a file server for compressed video). You can mark the top-level (zpool) as being compressed, and then later mark a nested dataset as not-compressed, and the nested dataset won’t be compressed. So it’s a no-brainer to just turn compression on via compression=zstd, and override that choice only on a per-dataset basis later.
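Here’s a rough sketch of what that inherit-then-override setup could look like. This is not the exact set of commands I ran: the `run` helper and the `mypool/video` dataset are made up for illustration, and the script only prints the commands unless you set DRY_RUN=0 (which needs root and an existing pool named mypool).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to run for real (assumes root and an existing pool).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

POOL=mypool   # pool name as used elsewhere in this post

# Turn compression on at the top level; child datasets and zvols inherit it.
run zfs set compression=zstd "$POOL"

# Hypothetical dataset for already-compressed video that opts out.
run zfs create -o compression=off "$POOL/video"

# Show the effective value for each dataset and where it came from.
run zfs get -r -o name,value,source compression "$POOL"
```

The last command is the useful one for sanity checking: the source column shows whether a dataset set the property locally or inherited it from its parent.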
The VDI to RAW conversion is done on the fly by qemu-nbd, used like qemu-nbd --read-only -c /dev/nbd0 somefile.vdi, exposing the data in the VDI as a read-only block device in RAW format.
That avoids something gross like making a local copy of the file with qemu-img or VBoxManage clonehd, which would take hours and require a ton of disk space. When I say “a ton” I mean more than the .vdi, since .vdi files can be created to expand and shrink based on how much data is stored on them in the same way that a .qcow2 file can, whereas RAW files are allocated up-front to their full size. The RAW files can be sparse files, but even with a sparse .vdi file, VBoxManage clonehd doesn’t preserve the sparseness in the output RAW image file, so you’d have to re-run virt-sparsify on the non-sparse RAW file, producing a third, sparse RAW file in addition to the non-sparse RAW file and the original sparse VDI file. Also, virt-sparsify needs a temporary directory with enough space to store a temporary full copy of the original image, so that’s a total of 2 sparse and 2 non-sparse copies of the image file that you have to find space for, and there’s extra I/O to make all those copies, which eats up a huge amount of time (in my case, it would add several days of non-stop I/O). virt-sparsify can run with --in-place, but then you risk data loss if something goes wrong.
So, anyway, using qemu-nbd --read-only avoids all of that: it’s a read-only view of the contents of the .vdi file. No extra copies are made on disk; all of the transformation happens in RAM.
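For reference, the attach/detach cycle around that read-only view might look roughly like this. Treat it as a sketch: the `run` helper is invented for illustration, loading the nbd module and passing --format=vdi are assumptions about your setup (I only showed the -c invocation above), and nothing is executed unless you set DRY_RUN=0 as root.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to run for real (assumes root and qemu-nbd installed).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

VDI=somefile.vdi   # placeholder path, as in the text
NBD=/dev/nbd0

run modprobe nbd                                        # make /dev/nbd* devices available
run qemu-nbd --read-only --format=vdi -c "$NBD" "$VDI"  # expose the VDI contents as a raw block device
run blockdev --getsize64 "$NBD"                         # virtual disk size in bytes, handy for pv -s
# ... stream from "$NBD" here ...
run qemu-nbd -d "$NBD"                                  # detach the device when finished
```

Naming the format explicitly is optional, but it saves qemu-nbd from having to probe the image type.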
Progress output is done with pv (pipe viewer), which can read from a block device. It gives a nice progress display as it copies data. If you’re not familiar with pv, it’s darn handy.
Now all you need to do is avoid reading the RAW data twice (once to copy it, then once again to compute the SHA-256 hash), which is easy with tee and Bash process substitution. This is a trick I haven’t used before: process substitution directs file-output from one command to STDIN of another command, and combined with tee, you can copy tee’s STDIN to two subcommands’ STDIN, so they operate on the data in parallel.
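If you want to try the tee trick on something small before pointing it at a half-TB disk, here’s a minimal sketch. It uses a 1 MiB temp file as a stand-in for the block device and sha256sum in place of shasum (an assumption that coreutils is installed); the variable names are just for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
src="$workdir/src.img"
dst="$workdir/dst.img"

# Stand-in for the block device: 1 MiB of random data.
head -c 1048576 /dev/urandom > "$src"

# One read of $src feeds both consumers: tee writes the copy to $dst and
# also sends the same bytes into sha256sum via process substitution.
# The substitution's stdout is still the $(...) pipe, so the shell waits
# for the hash before continuing.
stream_sum=$(tee >(sha256sum | cut -d' ' -f1) < "$src" > "$dst")

# Independent check: hash the copy that landed on disk.
disk_sum=$(sha256sum < "$dst" | cut -d' ' -f1)

echo "streamed: $stream_sum"
echo "on disk:  $disk_sum"
rm -rf "$workdir"
```

The two hashes should be identical. The second sha256sum is only there for the demo; in the real pipeline the whole point is to skip that extra read.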
Copying to the other server is streamed over SSH, with “Compression no” in the SSH config since they are on the LAN next to each other, and OpenSSH’s compression options are not terribly efficient so they would actually bottleneck the transfer.
On the destination side, using tee with Bash process substitution in the same way again allows computation of the SHA-256 to happen in parallel with the writing of the file to the ZFS zvol block device, with no additional I/O.
So, in summary: decryption is done on the fly by Veracrypt since the .vc image is mounted as a block device. Transformation from .vdi to raw is done on the fly by qemu-nbd which mounts the .vdi image as a block device. tee running on the source allows streaming copy-and-also-hash to happen on the source side, and another tee running on the destination allows streaming write-and-also-hash to happen on the destination side. SSH does the actual transferring of bytes from source to destination. I saw data rates of ~100MiB/s which is about the limit of the write speed of the destination zvols, since they are compressed and encrypted and reside on mirrored 5200 RPM disks currently attached via external USB 3 disk enclosures. (dd if=/dev/zero of=/the/zvol/path on the destination didn’t do any better than that, so I know that all the source transformations and network copying aren’t causing a bottleneck.)
Here’s the actual command line, redacted appropriately:
SIZEBYTES=$(blockdev --getsize64 /dev/nbd0) ; pv -s $SIZEBYTES /dev/nbd0 | \
  tee >(shasum -a 256 --tag >> /root/shasums.txt ) | \
  ssh root@dest 'tee >(dd of=/dev/zvol/mypool/vms/myvmname/mydiskname conv=fsync) >(shasum -a 256 --tag >> /root/shasums.txt ) | cat > /dev/null'
It’s a doozy, but you get the SHA-256 values in /root/shasums.txt on both sides, a progress display thanks to pv, and a transformation from Veracrypt+VDI to ZFS compressed-encrypted-zvol-RAW and a network transfer, all with one big read and one big write. It still takes hours, but you only have to do this to each image file once and it’s all migrated over, with a data integrity check so you know it worked.
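Comparing the two hash logs afterwards can be scripted too. This is a sketch rather than what I actually ran: `last_hash` is a made-up helper, it assumes each log line looks like `SHA256 (-) = <hash>` (the --tag style), and the demo uses two throwaway files instead of the real logs.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the hash from the last line of a --tag style log, assuming
# lines of the form: SHA256 (-) = <64 hex digits>
last_hash() {
    tail -n 1 "$1" | awk '{ print $NF }'
}

# Demo with two stand-in log files (hypothetical contents; the digest
# below is the SHA-256 of empty input).
demo=$(mktemp -d)
line='SHA256 (-) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
echo "$line" > "$demo/source.txt"
echo "$line" > "$demo/dest.txt"

if [ "$(last_hash "$demo/source.txt")" = "$(last_hash "$demo/dest.txt")" ]; then
    verdict="hashes match"
else
    verdict="HASH MISMATCH"
fi
echo "$verdict"
rm -rf "$demo"
```

On the real hosts you’d feed it the actual logs, for example by saving the output of `ssh root@dest tail -n 1 /root/shasums.txt` to a local file first.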
Smash that like and subscribe button. Let me know in the replies below if this was helpful, if you find an error, or if you just have suggestions.