Backup and recovery of heterogeneous, unstructured data is a time-consuming process. A number of software products and tools have been developed for this purpose, yet backing up files and folders and maintaining repositories remains an ongoing task if higher efficiency, integrity, security and de-duplication of content are to be achieved.
Whenever routine office or personal work is done on our laptops or systems, only a small percentage of the data on the hard disk is modified. Let's suppose a laptop has 500GB of hard disk space. After regular office or personal tasks, some of the files change, and a few files and documents may be downloaded from the Internet and stored on the local system. This amounts to a change of only a few MBs per day. Yet when we take a backup, all the drives are copied to a portable hard disk or some other storage medium. This type of backup, known as a full backup, is very time consuming and is therefore rarely performed on a regular, frequent basis.
There are a number of mechanisms by which only the modified or newly created files on your system can be backed up, in very little time.
First of all, let's get familiar with the basic taxonomy of the backup paradigm. Broadly, there are three types of backups in classical computing, across multiple domains.
First, there is differential backup, during which only the data that was not present in the last full backup is stored. If a full backup is taken on July 1, the differential backup on July 2 will copy only the files changed since July 1; all other files are skipped. Such a backup is more flexible, efficient and much faster than a full backup. Differential backup is also known as cumulative incremental backup.
Incremental backup, the second type, copies only the files that have been modified since the last backup of any kind, whether that was a full, differential or incremental backup. For example, if a full backup was done on July 27, 2015, the incremental backup on July 28 will copy all the files changed since July 27, 2015. Similarly, if an incremental backup is performed on the next day, only the files changed since July 28 will be copied. This is also known as differential incremental backup.
The third variety, full backup, is performed periodically, such as once a week or twice a month. This type of backup is done when major changes are made to the disk, such as a software installation or uninstallation, an upgrade, or any similar task. In a full backup, all the files are copied, which makes the process very time consuming. Even segments of the disk that hold no data are read and backed up, which wastes processor time and adds overhead.
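As a quick illustration of the difference between full and incremental backups, the following GNU tar commands can be used (tar is shown here purely as an example and is not one of the Python tools covered below; the directory and file names are placeholders):

$ tar --listed-incremental=backup.snar -czf full-backup.tar.gz /home/user/Documents
$ tar --listed-incremental=backup.snar -czf incremental-backup.tar.gz /home/user/Documents

The first command creates a full archive and records file metadata in the backup.snar snapshot file; the second command, run later with the same snapshot file, archives only the files that have changed since then.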
Features of an effective backup tool
Effective and efficient software is required for backing up data. Some of the major features a backup tool for personal as well as corporate computing must have are:
- Non-redundant data/de-duplication
- Security with multi-layered encryption
- Dynamic compression and archiving
- Dynamic repository management
- Automatic backups
- Cloud based backup and recovery
- Import/export for multiple platforms
- Version tracking
- Configurability, with the scope for extensions using plugins
- Command line as well as GUI panels for better usability
- Transaction fault tolerance to avoid data loss
- Volume oriented to support compression, splitting and merging for multiple devices and platforms
- Malware scanning
- Universal view and updates
- Cross-platform
- Support for multiple data formats
- Support for multiple databases
- Generation of reports, alerts and logs
Free and open source Python tools
Python is a powerful programming language used in many high performance computing domains, including cloud computing, parallel architectures, Big Data, data mining, machine learning, grid computing, heterogeneous databases and many others.
It is enriched by a number of libraries that enable de-duplicated, secure and cloud based backup implementations. Bakthat and Attic are two very effective Python-based tools that can be used to build a general-purpose, highly efficient backup system.
Bakthat, a local and cloud based backup tool
Bakthat is a Python based framework developed for secure and efficient backups. It is available both as a command line tool and as a Python module, and supports cloud backups to Amazon infrastructure such as Amazon S3 and Amazon Glacier, as well as to OpenStack Swift. The tool can compress, encrypt and upload documents to cloud servers. Bakthat is released under the MIT licence.
Features of Bakthat include the following:
- Dynamic compression
- Encryption using Beefish
- Uploading to/downloading from Glacier or S3 using the Python Boto library
- Maintenance of local backup metadata in an SQLite database using Peewee
- Deletion of older files and documents according to the grandfather-father-son rotation scheme
- Ability to sync the backups database between multiple clients via a centralised server
- Exclusion of files using a .gitignore-style file
- Use of plugins and extensions for better performance
Installing Bakthat: Bakthat can be installed very easily using Pip on a Linux distribution:
$ pip install bakthat
You can configure Bakthat as shown below:
$ bakthat configure
Then, create a backup, as follows:
$ bakthat backup MyBackup
Backing up MyBackup
Password (blank to disable encryption):
Password confirmation:
Compressing...
Encrypting...
Uploading...
Upload completion: 0%
Upload completion: 100%
There is an alternate method too:
$ cd MyDirectory
$ bakthat backup
INFO: Backing up MyDirectory
You can show and restore the backup, as follows:
$ bakthat show
<TimeStamp> MyBackup.tgz.enc
$ bakthat restore mydir
You can also back up to Amazon Glacier (Bakthat on Amazon Cloud Infrastructure):
$ bakthat backup -d glacier
INFO: Backing up MyDirectory
Password (blank to disable encryption):
The latest version of a backup can be restored by specifying the beginning of the file name; to restore from Amazon Glacier, add -d glacier, as shown below:
$ bakthat restore -f bak
INFO: Restoring MyBackUp.tgz.enc
Password:
INFO: Downloading...
INFO: Decrypting...
INFO: Uncompressing...

$ bakthat restore -f MyRemoteFile -d glacier
INFO: Restoring MyRemoteFile20150808.ttgz
INFO: Downloading...
INFO: Job ArchiveRetrieval: InProgress
You can delete a backup as follows:
$ bakthat delete MyBackup.tgz.enc
INFO: Deleting MyBackup.tgz.enc

$ bakthat delete -f MyRemoteFile
INFO: Deleting MyRemoteFile201520274848.ttgz.enc
You can view backup information by using the following command:
$ bakthat info
INFO: Last backup date: <Backup TimeStamp> (1 versions)
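Because Bakthat is driven from the command line, the 'automatic backups' requirement listed earlier can be met by scheduling it with cron. The entry below is only a sketch; the directory, the destination and the path to the bakthat binary are placeholders that you would adjust for your own system:

$ crontab -e
# add a line such as the following to run bakthat every day at 2 a.m.
0 2 * * * /usr/local/bin/bakthat backup /home/user/Documents -d glacier

Since the encryption password prompt cannot be answered from cron, a scheduled backup like this is easiest to run without encryption.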
Attic, a de-duplicating backup tool
Attic is a cross-platform backup tool for Linux distributions, Mac OS X and FreeBSD. It is developed and used for the secure, de-duplicated storage of data, so that only the modified files and changes are stored each time, rather than a full backup.
Attic is written in the Python programming language, with performance-critical parts implemented in Cython and C.
Its features include:
- De-duplication/non-redundant backup
- Encryption of data
- Backup on remote hosts
- Mounting of backups as file systems
- Backup validation, verification and restore
Attic installation: To install Attic, Python 3.2 or later is required. To install Attic using Pip, use the following command:
$ pip3 install Attic
To install from source tarballs, use the following code:
$ curl -O https://pypi.python.org/packages/source/A/Attic/Attic-0.16.tar.gz
$ tar -xvzf Attic-0.16.tar.gz
$ cd Attic-0.16
$ python setup.py install
To install from git, use the following commands:
$ git clone https://www.github.com/jborg/attic.git
$ cd attic
$ python setup.py install
Creating the repository: First of all, a new backup repository is initialised, as follows:
$ attic init /MyDirectory/my-repository.attic

After the repository has been initialised, backup archives can be created in it:

$ attic create /MyDirectory/my-repository.attic::SundayEvening ~/MyDocuments
$ attic create --stats /MyDirectory/my-repository.attic::MondayEvening ~/MyDocuments
Archive name: MondayEvening
Archive fingerprint: 178a5e3f9b0e792e91ce98673b0f4bfe45213d9248cb7879f3fbf3a8e679808a
Start time: Mon Jul 27 12:00:10 2015
End time: Mon Jul 27 12:00:10 2015
Number of files: 1000

                     Original size    Compressed size    Deduplicated size
This archive:             97.36 MB           66.28 MB            298.17 kB
All archives:             82.12 MB           80.42 MB             38.29 MB
Another example of code for initialising the backup repository and creating a backup archive is shown below:
$ attic init /MyUSBDrive/MyBackUp.attic
$ attic create -v /MyUSBDrive/MyBackUp.attic::documents ~/AllDocuments
Restore the MondayEvening archive, as follows:
$ attic extract /MyDirectory/MyBackup.attic::MondayEvening
Recover disk space by manually deleting the MondayEvening archive, using the following command:
$ attic delete /MyDirectory/MyBackup.attic::MondayEvening
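An archive can also be browsed in place before anything is extracted or deleted, using the 'mounting of backups as file systems' feature listed earlier. If your build of Attic includes FUSE support, a sketch of this, reusing the example archive and mount point names from this article, looks as follows:

$ attic mount /MyDirectory/MyBackup.attic::MondayEvening /tmp/MountLocation
$ ls /tmp/MountLocation
$ fusermount -u /tmp/MountLocation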
Initialise and create a remote repository, as follows:
$ attic init user@hostname:mybackuprepo.attic
OR
$ attic init user@hostname:repository.attic
OR
$ attic init ssh://user@hostname:port/repository.attic
If Attic cannot be installed on a remote host, that host can still be used to store the repository by mounting the remote file system locally with sshfs, as shown below:
$ sshfs username@hostname:/PathToDirectory/ /tmp/MountLocation
$ attic init /tmp/MountLocation/repository.attic
$ fusermount -u /tmp/MountLocation
Encrypt a remote repository, using the following command:
$ attic init --encryption=passphrase user@remotehostname:backuprepository.attic
Create a backup of the root file system in an archive named YYYY-MM-DD-MySystem, using the commands shown below:
$ NAME=`date +%Y-%m-%d`-MySystem
$ attic create /MyDirectory/MyRepository.attic::$NAME / --do-not-cross-mountpoints
Extract an entire archive, as follows:
$ attic extract /MyDirectory/MyRepository::MyDataFiles
Extract an entire archive and list files while processing, using the command given below:
$ attic extract -v /MyDirectory/MyRepository::my-files
Extract the src directory using the following command:
$ attic extract /MyDirectory/MyRepository::my-files home/USERNAME/src
To extract the src directory but exclude object files, use the following command:
$ attic extract /MyDirectory/MyRepository::my-files home/USERNAME/src --exclude '*.o'
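Before relying on a repository for restores, it is also worth validating it. The check subcommand, which covers the 'backup validation and verification' feature listed earlier, verifies the consistency of a repository; the command below is a sketch using the example repository from this article:

$ attic check /MyDirectory/MyRepository.attic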
Repository pruning: The repository is pruned to remove archives that do not match any of the specified retention options. Pruning is used to keep only a chosen set of historic backups. For example, -d 2 keeps only the backups of the last two days, as shown in the example below.
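A pruning command along those lines, using the example repository from this article, might look like the following (the retention values shown are only an illustration):

$ attic prune -v /MyDirectory/MyRepository.attic -d 2
$ attic prune -v /MyDirectory/MyRepository.attic --keep-daily=7 --keep-weekly=4 --keep-monthly=6

The first command keeps only the last two days' backups; the second keeps seven daily, four weekly and six monthly archives and removes everything else.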