Secured, De-duplicated Backup Using Python Tools

October 24, 2015

Critical data on any system can be safeguarded against accidental erasure, corruption, damage, etc, by an efficient backup system. Backup and recovery are integral to any well-managed computer system, be it for personal computing or enterprise computing. Explore the essentials of a backup system in this article and take a tutorial on the Python tools, Bakthat and Attic.

Backing up and recovering heterogeneous, unstructured data is a time-consuming process. Many software products and tools have been developed for this purpose, yet backup tools and repository maintenance continue to evolve in pursuit of higher efficiency, integrity, security and de-duplication of content.
During routine office or personal work on a laptop or desktop, only a small percentage of the data on the hard disk is modified. Let's suppose a laptop has a 500GB hard disk. In the course of regular office or personal tasks, some files are changed, and some files and documents are downloaded from the Internet and stored on the local system. This typically amounts to a change of only a few MBs per day. But when we take a backup, all the drives are copied to a portable hard disk or some other storage medium. This type of backup, which is very time consuming and therefore rarely performed on a regular and frequent basis, is known as a full backup.
There are a number of mechanisms by which only the modified or newly created files are backed up, which takes very little time.
First of all, let's get familiar with the basic taxonomy of backups. Broadly, there are three types of backups in classical computing.
First, there is differential backup, in which only the data that has changed since the last full backup is stored. If a full backup is taken on July 1, the differential backup on July 2 will copy only the files changed since July 1, and a differential backup on July 3 will again copy everything changed since July 1. All other files are not copied. Such a backup is much faster and more space-efficient than a full backup. Differential backup is also known as cumulative incremental backup.
Incremental backup, the second type, copies the files that were modified since the last backup, whether that was a full backup or an earlier incremental backup. For example, if a full backup was done on July 27, 2015, the incremental backup on July 28 will copy all the files changed since July 27, 2015. Similarly, if an incremental backup is performed on July 29, only the files changed since the July 28 backup will be copied. This is also known as differential incremental backup.
The third variety, full backup, is performed periodically, such as once a week or twice a month. It is also taken when major changes are made to the disk, such as a software install or uninstall, an upgrade or a similar task. In a full backup, all the files are copied, which makes the process very time consuming. Some full backup methods even copy regions of the disk that hold no data, which wastes processor time and adds overhead.
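The difference between the differential and incremental schemes comes down to which reference time the file modification times are compared against. The following is a minimal sketch of that selection logic, assuming a hypothetical in-memory catalogue of file names and modification times; the file names, dates and the files_to_copy helper are illustrative and do not come from any particular backup tool.

```python
from datetime import datetime


def files_to_copy(mtimes, reference_time):
    """Return the sorted names of files modified after reference_time."""
    return sorted(name for name, mtime in mtimes.items() if mtime > reference_time)


# Hypothetical catalogue: file name -> last-modified time.
mtimes = {
    "report.doc": datetime(2015, 7, 1, 18, 0),  # changed after the full backup
    "budget.xls": datetime(2015, 7, 2, 9, 30),  # changed after the latest backup
    "notes.txt": datetime(2015, 6, 20, 8, 0),   # unchanged since before the full backup
}

last_full_backup = datetime(2015, 7, 1, 12, 0)  # time of the last full backup
last_any_backup = datetime(2015, 7, 2, 0, 0)    # time of the most recent backup of any type

# Differential: everything changed since the last FULL backup.
differential = files_to_copy(mtimes, last_full_backup)
# Incremental: everything changed since the most recent backup.
incremental = files_to_copy(mtimes, last_any_backup)

print(differential)  # ['budget.xls', 'report.doc']
print(incremental)   # ['budget.xls']
```

The differential set grows every day until the next full backup, whereas the incremental set stays small, at the cost of needing every incremental backup in the chain at restore time.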

Features of an effective backup tool
Effective and efficient software is required for backing up data. Some of the major features a backup tool for personal as well as corporate computing must have are:

Non-redundant data/de-duplication
Security with multi-layered encryption
Dynamic compression and archiving
Dynamic repository management
Automatic backups
Cloud based backup and recovery
Import/export for multiple platforms
Version tracking
Configurability, with the scope for extensions using plugins
Command line as well as GUI panels for better usability
Transaction fault tolerance to avoid data loss
Volume oriented to support compression, splitting and merging for multiple devices and platforms
Malware scanning
Universal view and updates
Cross-platform
Support for multiple data formats
Support for multiple databases
Generation of reports, alerts and logs

Free and open source Python tools
Python is a powerful programming language used in many computing domains including cloud computing, parallel architectures, Big Data, data mining, machine learning, grid computing, heterogeneous databases, and many others.
A number of Python libraries and tools support de-duplicated, secured and cloud based backups. Bakthat and Attic are two effective Python based tools that can be used to build a general-purpose and highly efficient backup system.

Bakthat, a local and cloud based backup tool
Bakthat is a Python based framework developed for secured and efficient backup. It is available both as a command line tool and as a Python module, and supports cloud backups to Amazon S3 and Amazon Glacier as well as OpenStack Swift. The tool can compress, encrypt and upload documents to cloud servers. Bakthat is released under the MIT licence.
Features of Bakthat include the following:

Dynamic compression
Encryption using Beefish
Uploading/downloading to Glacier or S3 using the Python Boto library
Maintaining the local backups in the SQLite database using Peewee
Deletion of older files and documents as per the grandfather-father-son method
Ability to sync the backups database between multiple clients via a centralised server
Exclude files using .gitignore
Usage of plugins and extensions for better performance
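The grandfather-father-son method mentioned in the list above keeps a few recent daily backups (sons), a few weekly backups (fathers) and a few monthly backups (grandfathers), and lets everything else expire. The sketch below illustrates the idea only; it is not Bakthat's implementation, and the gfs_keep function, its parameters and the choice of Sunday and the first of the month as rotation points are assumptions made for the example.

```python
from datetime import date, timedelta


def gfs_keep(backup_dates, sons=7, fathers=4, grandfathers=12):
    """Return the set of backup dates a simple GFS rotation would keep."""
    ordered = sorted(backup_dates, reverse=True)  # newest first
    daily = ordered[:sons]                        # sons: most recent daily backups
    # Fathers: most recent Sunday backups (weekday() == 6 means Sunday).
    weekly = [d for d in ordered if d.weekday() == 6][:fathers]
    # Grandfathers: most recent first-of-month backups.
    monthly = [d for d in ordered if d.day == 1][:grandfathers]
    return set(daily) | set(weekly) | set(monthly)


# Hypothetical history: one backup per day from 1 June to 31 July 2015.
start = date(2015, 6, 1)
backups = [start + timedelta(days=n) for n in range(61)]

keep = gfs_keep(backups, sons=3, fathers=2, grandfathers=2)
expired = sorted(set(backups) - keep)

for kept_date in sorted(keep):
    print("keep", kept_date.isoformat())
print(len(expired), "backups can be deleted")
```

With these settings, seven of the 61 backups survive: the last three days, the last two Sundays, and the backups taken on 1 June and 1 July.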

Installing Bakthat: Bakthat can be installed very easily on a Linux distribution using Pip:

$ pip install bakthat

You can auto configure Bakthat as shown below:

$ bakthat configure

Then, create a backup, as follows:

$ bakthat backup MyBackup
Backing up MyBackup
Password (blank to disable encryption):
Password confirmation:
Compressing...
Encrypting...
Uploading...
Upload completion: 0%
Upload completion: 100%

There is an alternate method too:

$ cd MyDirectory
$ bakthat backup
INFO: Backing up MyDirectory

You can show and restore the backup, as follows:

$ bakthat show
<TimeStamp> MyBackup.tgz.enc
$ bakthat restore MyBackup

You can also back up to Amazon Glacier (Bakthat on Amazon Cloud Infrastructure):

$ bakthat backup -d glacier
INFO: Backing up MyDirectory
Password (blank to disable encryption):

The latest version of a backup can be restored from Amazon Glacier by specifying the beginning of the file name, as shown below:

$ bakthat restore -f bak
INFO: Restoring MyBackUp.tgz.enc
Password:
INFO: Downloading...
INFO: Decrypting...
INFO: Uncompressing...
$ bakthat restore -f MyRemoteFile -d glacier
INFO: Restoring MyRemoteFile20150808.tgz
INFO: Downloading...
INFO: Job ArchiveRetrieval: InProgress

You can delete a backup as follows:

$ bakthat delete MyBackup.tgz.enc
INFO: Deleting MyBackup.tgz.enc
$ bakthat delete -f MyRemoteFile
INFO: Deleting MyRemoteFile201520274848.tgz.enc

You can view backup information by using the following command:

$ bakthat info
INFO: Last backup date: <Backup TimeStamp> (1 versions)

Attic, a de-duplicating backup tool
Attic is a cross-platform backup tool for Linux distributions, Mac OS X and FreeBSD. It provides secured, de-duplicated storage of data, so that only modified files and changes are stored each time, rather than a full backup.
Attic is written in Python, with some parts implemented in Cython and C.
Its features include:

De-duplication/non-redundant backup
Encryption of data
Backup on remote hosts
Backups that can be mounted as file systems
Backup validation, verification and restore
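To make the de-duplication feature concrete, the following sketch stores data as chunks keyed by their SHA-256 hash, so that a chunk already present in the store is never written twice. This is an illustration of the general technique, not Attic's actual algorithm or storage format: the fixed 4-byte chunk size, the in-memory dictionary and the store and restore functions are all assumptions chosen to keep the example short.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def store(data, chunk_store):
    """Split data into fixed-size chunks, save each unseen chunk once,
    and return the list of chunk hashes needed to rebuild the data."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # only new chunks consume space
        recipe.append(digest)
    return recipe


def restore(recipe, chunk_store):
    """Rebuild the original data from its list of chunk hashes."""
    return b"".join(chunk_store[digest] for digest in recipe)


chunk_store = {}
sunday = store(b"AAAABBBBCCCC", chunk_store)
monday = store(b"AAAABBBBDDDD", chunk_store)  # only the last chunk differs

print(len(sunday) + len(monday), "chunks referenced")  # 6 chunks referenced
print(len(chunk_store), "chunks actually stored")      # 4 chunks actually stored
print(restore(monday, chunk_store))                    # b'AAAABBBBDDDD'
```

The second backup adds only one new chunk to the store, which is why the de-duplicated size of an archive can be far smaller than its original size.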

Attic installation: To install Attic, Python 3.2 or later is required. To install Attic using Pip, use the following command:

$ pip3 install Attic

To install from a source tarball, use the following commands:

$ curl -O https://pypi.python.org/packages/source/A/Attic/Attic-0.16.tar.gz
$ tar -xvzf Attic-0.16.tar.gz
$ cd Attic-0.16
$ python3 setup.py install

To install from git, use the following commands:

$ git clone https://www.github.com/jborg/attic.git
$ cd attic
$ python3 setup.py install

Creating the repository: First of all, a new backup repository is initialised, as follows:

$ attic init /MyDirectory/my-repository.attic

After initialising the repository, create archives in it:

$ attic create /MyDirectory/my-repository.attic::SundayEvening ~/MyDocuments
$ attic create --stats /MyDirectory/my-repository.attic::MondayEvening ~/MyDocuments
Archive name: MondayEvening
Archive fingerprint: 178a5e3f9b0e792e91ce98673b0f4bfe45213d9248cb7879f3fbf3a8e679808a
Start time: Mon Jul 27 12:00:10 2015
End time: Mon Jul 27 12:00:10 2015
Number of files: 1000
                 Original size   Compressed size   Deduplicated size
This archive:         97.36 MB          66.28 MB           298.17 kB
All archives:         82.12 MB          80.42 MB            38.29 MB

Another example of initialising a backup repository and creating a backup archive is shown below:

$ attic init /MyUSBDrive/MyBackUp.attic
$ attic create -v /MyUSBDrive/MyBackUp.attic::documents ~/AllDocuments

Restore the MondayEvening archive, as follows:

$ attic extract /MyDirectory/my-repository.attic::MondayEvening

Recover disk space by manually deleting the MondayEvening archive, using the following command:

$ attic delete /MyDirectory/my-repository.attic::MondayEvening

Initialise and create a remote repository, as follows:

$ attic init user@hostname:mybackuprepo.attic

OR

$ attic init user@hostname:repository.attic

OR

$ attic init ssh://user@hostname:port/repository.attic

If Attic cannot be installed on the remote host, the host can still be used to store the repository by mounting its file system locally with sshfs, as shown below:

$ sshfs username@hostname:/PathToDirectory/ /tmp/MountLocation
$ attic init /tmp/MountLocation/repository.attic
$ fusermount -u /tmp/MountLocation

Encrypt a remote repository, using the following command:

$ attic init --encryption=passphrase user@remotehostname:backuprepository.attic

Create a backup of the root filesystem in an archive named YYYY-MM-DD-MySystem, using the commands shown below:

$ NAME=`date +%Y-%m-%d`-MySystem
$ attic create /MyDirectory/MyRepository.attic::$NAME / --do-not-cross-mountpoints
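When the backup is driven from a Python script rather than from the shell, the dated archive name can be built with the datetime module. The sketch below only assembles the command line; the build_create_command helper and its parameters are hypothetical names introduced for this example, and the only Attic options used are the ones already shown above.

```python
from datetime import date


def build_create_command(repository, source, suffix, today=None):
    """Build an 'attic create' command whose archive is named YYYY-MM-DD-<suffix>."""
    today = today or date.today()
    archive = "{}-{}".format(today.strftime("%Y-%m-%d"), suffix)
    return [
        "attic", "create",
        "{}::{}".format(repository, archive),
        source,
        "--do-not-cross-mountpoints",
    ]


# A fixed date is passed in so the result is reproducible.
command = build_create_command(
    "/MyDirectory/MyRepository.attic", "/", "MySystem", today=date(2015, 7, 27)
)
print(" ".join(command))
# attic create /MyDirectory/MyRepository.attic::2015-07-27-MySystem / --do-not-cross-mountpoints
```

To actually run the backup, the list can be passed to subprocess.check_call(command) on a machine where Attic is installed. Passing the arguments as a list avoids shell quoting problems with unusual paths.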

Extract an entire archive, as follows:

$ attic extract /MyDirectory/MyRepository::MyDataFiles

Extract an entire archive and list files while processing, using the command given below:

$ attic extract -v /MyDirectory/MyRepository::my-files

Extract the src directory using the following command:

$ attic extract /MyDirectory/MyRepository::my-files home/USERNAME/src

To extract the src directory but exclude object files, use the following command:

$ attic extract /MyDirectory/MyRepository::my-files home/USERNAME/src --exclude '*.o'

Repository pruning: Pruning removes the archives in a repository that do not match any of the specified retention options, so that only selected historic backups are kept. For example, -d 2 keeps the two most recent daily archives.
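The effect of a daily retention rule can be modelled in a few lines of Python. This is a simplified sketch of the idea and not Attic's own code: it assumes that the rule keeps the newest archive from each of the most recent days that have at least one archive, and the keep_daily function and the archive names and times are made up for the example.

```python
from datetime import datetime


def keep_daily(archives, days):
    """archives maps archive name -> creation time.
    Return the sorted names of the archives a keep-daily rule would retain."""
    kept = {}  # calendar day -> name of the newest archive on that day
    newest_first = sorted(archives.items(), key=lambda item: item[1], reverse=True)
    for name, created in newest_first:
        day = created.date()
        if day in kept:
            continue            # a newer archive from the same day is already kept
        if len(kept) == days:
            break               # every remaining archive belongs to an older day
        kept[day] = name
    return sorted(kept.values())


# Hypothetical repository contents.
archives = {
    "SundayEvening": datetime(2015, 7, 26, 20, 0),
    "MondayMorning": datetime(2015, 7, 27, 8, 0),
    "MondayEvening": datetime(2015, 7, 27, 20, 0),
    "TuesdayEvening": datetime(2015, 7, 28, 20, 0),
}

kept = keep_daily(archives, days=2)
pruned = sorted(set(archives) - set(kept))
print(kept)    # ['MondayEvening', 'TuesdayEvening']
print(pruned)  # ['MondayMorning', 'SundayEvening']
```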