November 27, 2011

Rsync Snapshots: Space Saving & Fast Recovery

De-duplication! That is the keyword, and if your backup storage hardware does not support it natively you still need it. That is the focus of this article. I wrote earlier about rsync backups and the strategies used in the backintime software. Having had some time to work with it, I want to outline a method for doing “snapshot” style backups using rsync. This gives us the ability to quickly and easily “roll back” to a previous date and time. Additionally, managing our backups becomes vastly simpler, and the de-duplication this brings saves a ton of space.

The Problem With Backups

We create backups to ensure the integrity and availability of our data. A “full” backup is the easiest to explain, as it is an exact copy of our data. If something happens to our source data we can restore it from our full backup. Using rsync we can make a full backup with something similar to this:

rsync -a /path/to/source/ /path/to/destination/

Here the -a or “archive” option gives us a practically identical copy (see the rsync man page for additional information on its vast number of options). If our source data were static (non-changing) we would be done. In the more common scenario, however, the data changes over time, which means we have to continue making backups. Continuing to make full backups would take a great amount of storage space. We could delete the original backup and run the above command again, but what if we accidentally deleted a file on the source, or what if some files on the source or destination became corrupted? These kinds of “mistakes” on the source will be transmitted to the destination, in effect “corrupting” our backups. Clearly we have to keep multiple backups, based on the amount of data we can afford to lose (risk), but we need a method that does this without exhausting our available storage.

Mitigating Storage Requirements (Incremental & Differential)

To handle these growing storage requirements we have, historically, incremental and differential backups. To make a differential backup we start with a full one. Each consecutive backup then only copies the changes made since that full backup. When we wish to restore our data we first restore the full backup, followed by the changes captured in the latest differential backup.

An incremental backup is very similar to a differential. Both start with a full backup, and both then only copy changes. Unlike a differential, however, an incremental backup only captures the changes since the last incremental backup. When we wish to restore our data we first restore the full backup and then, consecutively, the changes captured by each and every incremental backup. Incremental backups take up less space than differential ones, but traditionally take longer to restore, as the restore sequences sketched below show.
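
As a rough sketch of that difference (the folder names /backups/full, /backups/diff_latest and /backups/inc_1 through /backups/inc_3 are hypothetical):

# Differential restore: the full backup plus only the latest differential.
rsync -a /backups/full/ /restore/
rsync -a /backups/diff_latest/ /restore/

# Incremental restore: the full backup plus every increment, oldest first.
rsync -a /backups/full/ /restore/
for inc in /backups/inc_1/ /backups/inc_2/ /backups/inc_3/; do rsync -a "$inc" /restore/; done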

Rsync Snapshot Solution

The method of taking rsync backups outlined here is an incremental backup strategy. However, these rsync snapshots allow us to very quickly “roll back” to a given day and time. First we start with a full backup. Each subsequent backup is again an exact image of our data, but files which are unchanged from the previous backup are created as hard links into that previous backup instead of new copies. This is best understood via example:

Our source has one file, which we back up to our destination folder using an rsync command similar to the full backup command listed above. We might want each backup to go into a folder named after the date and time at which the backup was taken:

rsync -a /path/to/source/  /path/to/destination/`date +%Y.%m.%d_%H:%M:%S`/

Now the source and backup both contain the same file. All consecutive backups will be run with the -u (update) and --link-dest (hard-link to files in the given directory when unchanged) options. The link destination directory will always be the folder that contains the previous backup (like an incremental backup). Programmatically we can find the previous backup folder and store it in a variable ($link_dest) using a find and sort command like this:

link_dest=`find /path/to/destination -type d | sort | tail -n 1`
# Thanks to trg in the comments section below for the correction!
link_dest=`find /path/to/destination -maxdepth 1 -type d | sort | tail -n 1`

Our next update will look something like:

rsync -au --link-dest=${link_dest} /path/to/source/ /path/to/destination/`date +%Y.%m.%d_%H:%M:%S`/

Let us stop there and examine what has happened. Our source still contains only one file. Our first backup contains an exact copy of this file. Since nothing has changed on the source, our second backup contains one file which is a hard link to the file in our first backup. In other words, we have captured two backups of our source without needing twice the storage space. Suppose we now add a second file to the source and run our rsync command again. Now we have a backup with a hard link to file 1 and an exact copy of file 2.
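
Putting the pieces together, here is a minimal sketch of a snapshot script. The paths are the same placeholders used above, and error handling is omitted:

#!/bin/bash
# Minimal rsync snapshot sketch. Paths are placeholders; no error handling.
src="/path/to/source/"
dest="/path/to/destination"

# Most recent backup folder to hard-link against. With -maxdepth 1 the sorted
# list starts with $dest itself, so tail -n 1 falls back to $dest when no
# backup exists yet.
link_dest=`find "$dest" -maxdepth 1 -type d | sort | tail -n 1`

# New snapshot folder named after the current date and time.
stamp=`date +%Y.%m.%d_%H:%M:%S`

if [ "$link_dest" = "$dest" ]; then
    # First run: take a full backup.
    rsync -a "$src" "$dest/$stamp/"
else
    # Subsequent runs: hard-link unchanged files to the previous snapshot.
    rsync -au --link-dest="$link_dest" "$src" "$dest/$stamp/"
fi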

Understanding Hard Links

When we store a file on a computer we have the data and a “pointer” to where that data is stored. When we make a hard link to a file we add another “pointer” instead of a second copy of the data. This makes it look like the file exists in multiple locations when in actuality it is only stored on the hard disk once. Unlike a symbolic link, when we delete the original file the data remains on the disk as long as there is at least one hard link to it. In this fashion we can delete or archive off old backups without affecting our next backup folder.
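
You can see this behavior for yourself with ln and ls -li, which shows each name's inode number and hard-link count (the file names here are just examples):

echo "some data" > file1     # create a file
ln file1 file2               # add a second hard link to the same data
ls -li file1 file2           # same inode number, link count of 2 on both
rm file1                     # delete the original name
cat file2                    # the data is still on disk via the remaining link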

How does this fix corruption or accidental deletion?

When a file becomes corrupt on the source, the next backup will detect the change and copy the corrupt version into the new snapshot, but the non-corrupt version still exists in the earlier snapshots. If the corruption is a problem we simply go to a previous backup and copy back the non-corrupt version of the file. This also gives us basic file versioning, as it is now possible to restore previous versions of any file (not just corrupt ones). Likewise, when a file is accidentally deleted on the source it might still exist in a previous backup.
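
Restoring is then just a copy out of whichever snapshot folder you want. For example (the snapshot folder and file name below are illustrative):

# Pull a single file back from a known-good snapshot.
cp /path/to/destination/2011.11.20_02:00:00/important.txt /path/to/source/

# Or roll the entire source back to that point in time.
rsync -a /path/to/destination/2011.11.20_02:00:00/ /path/to/source/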

Following the above example, you can see that our three backups should take up 665MB of space but actually take up the exact same space as the source.
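
A quick way to verify this yourself is du, which counts hard-linked data only once when the snapshot folders are measured together (the paths and the example snapshot name are placeholders):

du -sh /path/to/source                     # size of the live data
du -sh /path/to/destination                # all snapshots together: roughly the same size
du -sh /path/to/destination/2011.11.27_*   # one snapshot on its own still reports full size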

4 Comments
  1. gdr
    Nov 29 2011

    rsnapshot implements this strategy in Perl

    http://rsnapshot.org

  2. Nov 29 2011

    gdr – Thanks for the link. The paper by Mike Rubel is very well done.

  3. trg
    Dec 24 2011

    Just wanted to add one caveat/issue with this:

    Doing “link_dest=`find /path/to/destination -type d | sort | tail -n 1`” can sometimes have an unintended consequence… If what you’re backing up has subdirectories, find will recurse into the newest backup and can pick one of its subdirectories as the link destination, which is not what you want…

    To prevent that, instead make it (at least on a GNU find system)
    “link_dest=`find /path/to/destination -maxdepth 1 -type d | sort | tail -n 1`”

  4. Dec 24 2011

    @trg – great catch!

    In writing this I definitely typed faster than I thought. Your method works great and showed me that I need to re-read some man pages. I checked one of my scripts and was actually using:

    link_dest=`find /path/to/destination/* -type d -prune | sort | tail -n1`

    …which also works but the -maxdepth option is definitely more intuitive. Thanks for the contribution.
