chunksync - space-efficient incremental (remote) backups of large files


chunksync { -h | -V | [ -q ] [ -l <linkdir> [ ... ] ] <chunk size> <source> <destination> }


ChunkSync creates space-efficient incremental backups of large files or block devices (encrypted disks, in particular). It splits the data into a directory structure of chunk files. A chunk whose contents have not changed since an earlier backup (as judged by a SHA1 sum of the contents) is hard-linked into the new backup generation instead of being copied again. This is similar to the way rsync's --link-dest option works, but a lot faster than using rsync on ChunkFS.

In case of remote sources and/or destinations, ssh is used for invoking ChunkSync backends on the remote machines.

Size changes of the source between backups are handled correctly.

The chunks themselves contain the bare backup data, so the original file's/device's contents can be restored by simply concatenating all the chunks from a backup tree. The layout of the tree is ChunkFS compatible, though, so the image can also be reconstructed using UnChunkFS, which is handy for restoring single files from a large filesystem image without first having to copy huge amounts of data, for example.
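The "simply concatenate all the chunks" restore can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of ChunkSync: restore_image is a hypothetical helper, it assumes that sorting the chunk paths component by component yields the chunks in their original order (which holds for fixed-width names), and it assumes that every symlink in the tree is a checksum link that can be skipped.

```python
import os
import shutil


def restore_image(backup_root: str, output_path: str) -> int:
    """Concatenate all chunk files below backup_root into output_path.

    Assumptions (not confirmed by the ChunkSync documentation):
    - path components sort lexicographically into chunk order,
    - symlinks only carry checksums and hold no backup data.
    Returns the number of bytes written.
    """
    chunk_paths = []
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip checksum symlinks and anything that is not a plain file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            chunk_paths.append(path)

    # Sort by path components so that "00/01" comes before "01/00".
    chunk_paths.sort(key=lambda p: os.path.relpath(p, backup_root).split(os.sep))

    total = 0
    with open(output_path, "wb") as out:
        for path in chunk_paths:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
            total += os.path.getsize(path)
    return total
```

For large images, UnChunkFS avoids this full copy, as described above.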

In addition to the chunk itself, ChunkSync creates a symlink that stores the SHA1 sum of the chunk, so it doesn't have to re-read all of the old data when creating an incremental backup (it uses symlinks because ext[234] stores the contents of short symlinks within the inode, which avoids every checksum occupying a complete filesystem block).
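The symlink trick can be illustrated with a few lines of Python: a symlink target is just a short string, so a dangling symlink can serve as a tiny key/value record. The function names and the idea of passing the checksum as a plain string are assumptions made for this sketch; the actual link names and target format used by ChunkSync are not specified here.

```python
import os


def store_checksum(link_path: str, checksum: str) -> None:
    """Store a short string as the target of a (dangling) symlink."""
    os.symlink(checksum, link_path)


def load_checksum(link_path: str) -> str:
    """Read the stored string back without following the link."""
    return os.readlink(link_path)
```

Reading the value back only needs the inode of the link, never the chunk data it describes.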

When creating an incremental backup, ChunkSync reads the SHA1 sums and chunk sizes from the old backup and computes the new checksums from the source. Chunks for which neither the checksum nor the size differs are not copied, but hard-linked directly from the old backup into the new backup. In order to be able to reliably back up SHA1 collision examples, a chunk-specific random 64-bit prefix is fed into SHA1 before the actual chunk data. Those random bits are also stored as a prefix to the hash in the symlink.
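A minimal Python sketch of the prefixed checksum and the resulting link-or-copy decision might look like this. The function names are hypothetical, and the stored format (hex-encoded prefix followed by the hex SHA1 digest) is an assumption for illustration; only the idea of hashing a random 64-bit prefix before the chunk data comes from the description above.

```python
import hashlib
import os

PREFIX_BYTES = 8  # 64 random bits per chunk


def new_prefix() -> bytes:
    """Pick a fresh random prefix for a chunk that has to be copied."""
    return os.urandom(PREFIX_BYTES)


def prefixed_sha1(prefix: bytes, chunk: bytes) -> str:
    """Hash prefix + chunk and return prefix and digest as one string.

    Assumed encoding: hex prefix followed by hex digest.
    """
    h = hashlib.sha1()
    h.update(prefix)
    h.update(chunk)
    return prefix.hex() + h.hexdigest()


def can_hardlink(stored: str, old_size: int, chunk: bytes) -> bool:
    """Decide whether the old chunk can be reused for the new backup."""
    if old_size != len(chunk):
        return False
    # Reuse the old chunk's prefix so that the digests are comparable.
    prefix = bytes.fromhex(stored[:PREFIX_BYTES * 2])
    return prefixed_sha1(prefix, chunk) == stored
```

Because the prefix is random and differs per chunk, two inputs constructed to collide under plain SHA1 are very unlikely to collide once the prefix is hashed first.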

The protocol and the code are completely pipelined: the different threads/processes (in particular those on different machines) do not process chunks in lockstep, but all work in parallel. They are limited only by the pipe buffers and TCP windows between them filling up or running empty, and by the I/O subsystem getting things done, which ensures high throughput.
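The pipelining idea can be illustrated with two threads connected by bounded queues, which play the role of the pipe buffers and TCP windows. This is a toy model in Python, not ChunkSync's actual architecture; the stage names and the queue depth are made up for the example.

```python
import hashlib
import queue
import threading


def run_pipeline(chunks, depth=4):
    """Read and hash chunks in separate stages that run concurrently.

    A stage only waits when its input queue is empty or its output
    queue is full; otherwise all stages make progress in parallel.
    """
    to_hash = queue.Queue(maxsize=depth)
    to_collect = queue.Queue(maxsize=depth)
    done = object()  # end-of-stream marker

    def reader():
        for chunk in chunks:
            to_hash.put(chunk)  # blocks only while the buffer is full
        to_hash.put(done)

    def hasher():
        while True:
            chunk = to_hash.get()  # blocks only while the buffer is empty
            if chunk is done:
                to_collect.put(done)
                return
            to_collect.put(hashlib.sha1(chunk).hexdigest())

    threads = [threading.Thread(target=reader), threading.Thread(target=hasher)]
    for t in threads:
        t.start()

    results = []
    while True:
        item = to_collect.get()
        if item is done:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

The reader can already fetch chunk n+1 while the hasher is still busy with chunk n, which is the property the paragraph above describes.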

The speedup in comparison to using rsync on ChunkFS comes not only from not having to read the complete old backup, but also from a protocol design that requires the source to be read only once for both the checksumming and the actual copying--rsync, in contrast, reads the data once for checksumming and then reads the changed chunks a second time in order to copy them.

The <chunk size> specifies the size in bytes of the chunks <source> is to be split into. <destination> specifies where the result is to be stored. If the source is not a multiple of <chunk size> in size, the last chunk will be correspondingly smaller.
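As a quick worked example of the chunk arithmetic, the following Python sketch computes how many chunks a source of a given size produces and how large the last one is. chunk_layout is a hypothetical helper; the result for an empty source (zero chunks) is an assumption.

```python
def chunk_layout(source_size: int, chunk_size: int) -> tuple:
    """Return (number_of_chunks, size_of_last_chunk) in bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    if source_size == 0:
        return 0, 0  # assumption: an empty source yields no chunks
    full, rest = divmod(source_size, chunk_size)
    if rest == 0:
        return full, chunk_size
    # The remainder forms one additional, smaller chunk.
    return full + 1, rest
```

With a chunk size of 1048576 (1 MiB), a source of 10 MiB plus 5 bytes is split into 11 chunks, the last of which is 5 bytes long.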

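Both <source> and <destination> consist of an optional connection specifier, terminated by a colon, followed by a path. The connection specifier may be enclosed in square brackets, which is required if it contains a colon itself (as in an IPv6 address). The two parts are interpreted as follows: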

The connection specifier specifies the machine to connect to, as well as potentially the port to connect to or the username to log in as.

If the connection specifier is empty, ssh is not used, but instead the filesystem is accessed directly.

Otherwise, if there is a forward slash (/) in the connection specifier, the part after the last slash is split off and passed to ssh as the port to connect to (ssh option -p). The remainder (or the complete string, if there is no forward slash in it) is passed to ssh as its connection specifier.

The path is then interpreted relative to either the current local working directory (in the non-ssh case) or the working directory that the remote ChunkSync backend finds itself in (this only really affects relative paths, of course).
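The rules above can be summarized in a small Python parser. It is a sketch under assumptions, not ChunkSync's own code: Endpoint, parse_endpoint and ssh_prefix are hypothetical names, an argument without any colon is treated as a plain local path, and the first colon outside of square brackets is taken as the end of the connection specifier. The remote command that ChunkSync passes to ssh is not documented here and is therefore left out.

```python
from typing import List, NamedTuple, Optional


class Endpoint(NamedTuple):
    ssh_target: Optional[str]  # None means direct filesystem access
    port: Optional[str]
    path: str


def parse_endpoint(spec: str) -> Endpoint:
    """Split '<connection specifier>:<path>' into its parts."""
    if spec.startswith("["):
        end = spec.find("]:")
        if end < 0:
            raise ValueError("missing ']:' after connection specifier")
        conn, path = spec[1:end], spec[end + 2:]
    elif ":" in spec:
        conn, path = spec.split(":", 1)
    else:
        conn, path = "", spec  # assumption: no colon means a local path

    if not conn:
        return Endpoint(None, None, path)
    if "/" in conn:
        # The part after the last slash is the port (ssh option -p).
        target, port = conn.rsplit("/", 1)
        return Endpoint(target, port, path)
    return Endpoint(conn, None, path)


def ssh_prefix(endpoint: Endpoint) -> List[str]:
    """Build the start of an ssh command line for a remote endpoint."""
    if endpoint.ssh_target is None:
        return []
    argv = ["ssh"]
    if endpoint.port is not None:
        argv += ["-p", endpoint.port]
    return argv + [endpoint.ssh_target]
```

For example, parse_endpoint("root@backupserver/42:/backups/00001") yields the ssh target root@backupserver, port 42 and path /backups/00001, while parse_endpoint("[]:/dev/hda1") yields a local endpoint.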

Note that the destination directory must not exist beforehand.



-h

Displays a short usage summary.


-V

Prints the program version.


-q

Suppresses progress indication and statistics, so that output is produced only in case of errors (useful for cron jobs, in particular).

-l <linkdir>

Specifies a directory to use as a base backup for an incremental backup. It is interpreted in the same way as the path part of <destination> (relative to the same directory). You may specify up to 64 base backups by repeating this option. ChunkSync will use all of them as potential link targets for each chunk. If no -l options are specified, a full copy is created.

For reduced network traffic, it is recommended to put closely related base backups (that is, base backups with many identical checksums) next to each other on the command line.

There are no restrictions on mixing base backups. However, mixing base backups that are not incrementally based on one another has a performance penalty, as they probably will have different random chunk prefixes, which means that the checksums of the source chunks will have to be computed multiple times, once for each prefix. Putting base backups with many expected hits first can help alleviate the effect, as further checksum computations for a chunk are skipped once a matching chunk has been found.
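The cost of mixing unrelated base backups can be made concrete with a Python sketch of the link target search for a single chunk. find_link_target and the tuple layout of the candidates are hypothetical; the sketch only models the behavior described above (one SHA1 computation per distinct prefix, stopping at the first match), not the real implementation.

```python
import hashlib


def find_link_target(chunk: bytes, candidates):
    """Search the base backups, in command-line order, for a match.

    candidates: list of (backup_name, prefix, stored_digest, stored_size)
    describing the same chunk position in each base backup.
    Returns (matching backup name or None, number of SHA1 computations).
    """
    digests = {}  # prefix -> digest of prefix + chunk, computed lazily
    for name, prefix, stored_digest, stored_size in candidates:
        if stored_size != len(chunk):
            continue  # a size mismatch needs no hashing at all
        if prefix not in digests:
            digests[prefix] = hashlib.sha1(prefix + chunk).hexdigest()
        if digests[prefix] == stored_digest:
            return name, len(digests)  # skip all remaining candidates
    return None, len(digests)
```

Base backups that are incrementally based on one another share prefixes for unchanged chunks, so they hit the cache in this model. Listing the most likely match first avoids the extra computations entirely.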


ChunkSync exits with a zero exit status if the copy has been successfully created (note that this does not include any guarantee that it has actually hit some physical medium, just that all the objects in the virtual file system tree have been created).

If an error occurs, ChunkSync exits with a non-zero exit status--currently, it's always 1, without any distinction of different causes, but it is not guaranteed to stay that way in future versions.
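Since the specific non-zero value is not guaranteed, a wrapper script should only distinguish zero from non-zero. A minimal Python sketch of such a wrapper follows; run_backup is a hypothetical helper, and the command line in the usage note is taken from the examples in this document.

```python
import subprocess


def run_backup(argv) -> bool:
    """Run a backup command and report success.

    Any non-zero exit status counts as failure; the concrete value
    is deliberately not inspected.
    """
    result = subprocess.run(argv)
    return result.returncode == 0
```

A cron job could call run_backup(["chunksync", "-q", "1048576", "/dev/hda1", "root@backupserver/42:/backups/00001"]) and raise an alert if it returns False. Note that a True result still carries no guarantee that the data has reached a physical medium.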


	chunksync 1048576 /dev/hda1 root@backupserver/42:/backups/00001

Create a backup of local /dev/hda1 split into 1 MiB chunks in /backups/00001 on backupserver, connecting to port 42, logging in as root.

	chunksync 1048576 localhost:/dev/hda1 root@backupserver/42:/backups/00001

Basically the same, but the source is accessed through ssh to localhost as well, which causes some slowdown.

	chunksync 1048576 [::1]:/dev/hda1 root@backupserver/42:/backups/00001

Another version of the above. Note that the square brackets are required here in order to avoid the first colon of the IPv6 address being interpreted as the end of the connection specifier.

	chunksync -l /backups/00001 1048576 /dev/hda1 root@backupserver/42:/backups/00002

Create an incremental backup based on the above backup.

	chunksync -l /backups/00001 1048576 []:/dev/hda1 [root@backupserver/42]:/backups/00002

Different syntax, same effect.
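For regular backup runs, the command lines above can be generated automatically. The following Python sketch builds the argument list for the next backup generation from the names of the existing ones. next_backup_argv is a hypothetical helper, and the five-digit, zero-padded generation names are merely the convention used in the examples, not something ChunkSync requires.

```python
def next_backup_argv(existing, chunk_size, source, dest_conn, dest_dir):
    """Build a chunksync command line for the next backup generation.

    existing: names of the generations already present in dest_dir,
    e.g. ["00001", "00002"] (assumed naming convention).
    The newest generation is used as the base backup (-l); without
    any existing generation, a full copy is requested.
    """
    numbers = sorted(int(name) for name in existing)
    new_number = numbers[-1] + 1 if numbers else 1

    argv = ["chunksync"]
    if numbers:
        # -l takes a plain path, relative to the same directory as
        # the path part of the destination.
        argv += ["-l", "%s/%05d" % (dest_dir, numbers[-1])]
    argv += [
        str(chunk_size),
        source,
        "%s:%s/%05d" % (dest_conn, dest_dir, new_number),
    ]
    return argv
```

Because the destination directory must not exist beforehand, always counting upwards from the newest generation avoids reusing a name.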