Let’s talk about backup and sync. In this modern, ever-connected world of ours, we are constantly consuming and generating digital content, so it is of the utmost importance that the most recent version of your data is always accessible and that your data is resilient to hardware failure. It is simply a fact of computing that your hardware will fail, hard drives doubly so. The solution is to replicate your data and backup, backup, backup!

Note: If I have made any assumptions or omitted anything that makes this post hard to follow, please let me know in the comments or ask me for help via email; I’m glad to help you set this up.

There are many options for synchronizing your files across several machines, but you may be limited by the operating systems you need to support, cost, security concerns, bandwidth, and whether the computers are publicly accessible from the internet. Depending on how much data you plan to synchronize, you may also want a service that stores your data in the cloud for free or for a monthly fee. Personally, my needs are:
- Customizable and simple
- Supports Windows and Linux
- Capable of synchronizing larger amounts of data
- Capable of performing true multi-way sync (not just mirror copying)
For me, Unison fits the bill perfectly. Unison is a free and open-source (GPL) application for *nix (Unix / Linux / Mac) and Windows. It’s fast, can run over ssh, and performs only delta updates. It’s available in the repositories of most major Linux distributions; on Windows it can be installed directly or as part of cygwin. On Windows, I recommend cygwin, since it also gives you ssh and makes it easy to set up automatic login. Additionally, if your Windows machine ends up being the server, you can run the OpenSSH server through cygwin.
Under Ubuntu, install it via:
sudo apt-get install unison
Synchronizing two locations with Unison is as easy as typing:
unison /path/to/folderone /path/to/foldertwo
Or to transfer over ssh (the double slash after the hostname marks an absolute path on the remote machine):
unison /path/to/folderone ssh://firstname.lastname@example.org//foldertwo
In its default configuration, Unison recursively calculates all of the differences between the paths you give it, then presents the differences one by one and asks you what to do. It suggests an action for anything it can resolve on its own, so you can often just hit enter all the way through. Lastly, it asks whether you would like to proceed with the changes and starts copying the files.
By default, Unison will copy as many files in parallel as possible, which is great for saturating a high-bandwidth connection; for connections over the internet through ssh, however, I recommend setting the -maxthreads flag to 1 to force Unison to transfer one file at a time. Additionally, if you are copying between *nix and Windows, or even between two *nix systems with mostly incompatible permissions setups, I recommend setting the -perms flag to 0 to disable permission synchronization. Last of all, this wouldn’t be very helpful if you had to sit through the process every time, so I recommend the -batch flag, which makes Unison decide what to do with each difference and transfer automatically without asking; if it comes across a difference it can’t resolve, it simply skips it. Conflict resolution is one of the areas where I think Unison really shines and what puts it above rsync: Unison is about synchronizing and merging, while most other tools are about replicating.
I use Unison over ssh, but it can use other network protocols as well. To automate the process, it is important to enable public-key authentication for ssh (also useful in general if you use ssh much) so that you don’t have to enter a password for Unison to work its magic. Once you can configure and use Unison, I recommend deciding which of your computers makes the best server (a home fileserver, if you have one) and setting up a star topology of Unison connections around it. The server will need to be (almost) always on and available, and reachable from the internet, so forward the necessary ports to it (22 for ssh) and give it a dynamic DNS address. Then, on each computer other than the server, set up a profile (.prf) file in a folder called .unison under your home directory with the configuration (roots to synchronize, flags) needed to automate Unison. My file .unison/default.prf:
# Unison preferences file
root = /home/users/rmccoll/Documents
root = ssh://email@example.com//media/data/Documents
perms = 0
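Getting the public-key login working might look something like the sketch below; the key path and the remote host are placeholders, so adjust them to your setup (in practice the key usually lives under ~/.ssh, where ssh picks it up automatically):

```shell
# Generate a dedicated, passphrase-less key pair (placeholder path).
rm -f /tmp/unison_sync_key /tmp/unison_sync_key.pub
ssh-keygen -t ed25519 -N "" -f /tmp/unison_sync_key
# Install the public key on the server so ssh (and therefore Unison)
# can log in without a password (placeholder host):
# ssh-copy-id -i /tmp/unison_sync_key.pub firstname.lastname@example.org
```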
To run this synchronization automatically, I have set up a script as a cron job on Linux and as a scheduled task on Windows, running every 5 minutes on each machine that I want to synchronize with the server. The script checks whether a previous copy is still running and exits if so, to keep from overloading the connection when I have recently added a large file.
#!/bin/sh
if ps ax | grep -v grep | grep unison > /dev/null
then
    echo "Previous unison running..."
else
    echo "unison is not running"
    # Roots and other settings come from ~/.unison/default.prf
    unison -maxthreads 1 -perms 0 -batch
fi
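Note that grepping ps can match any process whose name contains "unison". If util-linux’s flock is available, a lock file is a tighter guard; this is a sketch rather than the script I actually run, and the lock path is a placeholder:

```shell
# Hold an exclusive, non-blocking lock on fd 9 for the duration of the sync.
# /tmp/unison-sync.lock is a placeholder path, not from my setup.
(
  flock -n 9 || { echo "Previous unison running..."; exit 1; }
  echo "unison is not running"
  # unison -maxthreads 1 -perms 0 -batch   # the real invocation would go here
) 9> /tmp/unison-sync.lock
```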
The cron job looks like:
*/5 * * * * sh /home/users/rmccoll/Documents/Scripts/Bash/docdyns.sh
At this point, you have a local copy of all your files on every machine, synchronized everywhere you go, securely, for free, and without going through a third party.
For backing up, I use a customized version of the rsync tutorial found on Mike Rubel’s website. Note that this is really a *nix-only solution, as it depends on file-system features. His system lets you create a script / cron job that automatically makes incremental backups of a directory tree using rsync and hard-linked copies, keeping a base revision plus delta updates that are all instantly accessible and store only the differences. I’ll leave the full explanation of how this works to that tutorial, but the end result is that I have a directory called backup with folders for 1 day ago, 2 days ago, and so on, each a snapshot of what the directory tree looked like at that time. If a file hasn’t changed, then 1day/file and 2days/file actually link to the same file on disk. Like I said, he does a much better job of explaining it and walking you through it than I can, so read his site. I will post my script here, though:
#!/bin/bash
# Rotate hard-linked snapshots through 3-day, 5-day, 1-week, 2-week, and 4-week tiers.
# All snapshot directories live under /media/backup; the live data is /media/data.
# $day holds an arithmetic expression (e.g. "(15 * 365 + 123)") that bc evaluates below.
day="(`date +%y` * 365 + `date +%j`)"
if [ $(echo "$day % 28" | bc) -eq 0 ]
then
    echo "Backup: 4 weeks"
    rm -rf /media/backup/4.weeks
    mv /media/backup/2.weeks /media/backup/4.weeks
else
    if [ $(echo "$day % 14" | bc) -eq 0 ]
    then
        rm -rf /media/backup/2.weeks
    fi
fi
if [ $(echo "$day % 14" | bc) -eq 0 ]
then
    echo "Backup: 2 weeks"
    mv /media/backup/1.weeks /media/backup/2.weeks
else
    if [ $(echo "$day % 7" | bc) -eq 0 ]
    then
        rm -rf /media/backup/1.weeks
    fi
fi
if [ $(echo "$day % 7" | bc) -eq 0 ]
then
    if [ $(echo "$day % 5" | bc) -eq 0 ]
    then
        echo "Backup: 1 weeks with 5 days"
        mv /media/backup/5.days /media/backup/1.weeks
        mv /media/backup/3.days /media/backup/5.days
    else
        echo "Backup: 1 weeks without 5 days"
        rm -rf /media/backup/3.days
    fi
else
    if [ $(echo "$day % 5" | bc) -eq 0 ]
    then
        echo "Backup: only 5 days"
        rm -rf /media/backup/5.days
        mv /media/backup/3.days /media/backup/5.days
    else
        echo "Backup: no week, no 5 days"
        rm -rf /media/backup/3.days
    fi
fi
echo "Backup: 3 days"
mv /media/backup/2.days /media/backup/3.days
echo "Backup: 2 days"
mv /media/backup/1.days /media/backup/2.days
echo "Backup: 1 days"
cp -al /media/backup/0.days /media/backup/1.days
echo "Backup: 0 days"
rsync -a -v --delete /media/data/ /media/backup/0.days/
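The hard-link trick the script relies on can be seen in miniature below; the /tmp paths are throwaway examples, not the backup layout above:

```shell
# Create a tiny "data" tree and snapshot it with hard links.
rm -rf /tmp/snapdemo
mkdir -p /tmp/snapdemo/src
echo "unchanged" > /tmp/snapdemo/src/file
cp -al /tmp/snapdemo/src /tmp/snapdemo/1day    # link every file instead of copying it
# Both directory entries point at the same inode, so the snapshot
# consumes almost no extra space until a file actually changes:
stat -c %i /tmp/snapdemo/src/file /tmp/snapdemo/1day/file
```

When rsync later rewrites a changed file in 0.days, the older snapshots keep their own link to the previous version, which is how each day’s view stays intact while sharing everything unchanged.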