Duplication and De-duplication - Latest News from Backup Technology

Organisational databases are not created by a single individual with a single access device. They grow as multiple users input data from multiple devices in diverse locations. Users attempting to collaborate often share data across devices, downloading and storing it locally for instant access and use. The result is disorganised duplication: identical, similar or slightly modified versions of the same information end up stored at multiple locations.

The IT Administrator entrusted with consolidating, backing up and recovering the organisation's information is often flummoxed by the sheer number of times a single piece of information is duplicated across the organisation. If each piece of information had to be checked manually for duplication before being dropped into the backup basket, the task would be gruelling to say the least, and would assume nightmarish proportions over time. De-duplication technologies automate the task of identifying and eliminating duplicates during consolidation.

Most cloud backup and recovery software comes with integrated de-duplication technology. The IT Administrator begins the process of consolidation by identifying a primary backup set for seeding the backup repository. A hash algorithm is applied to each file, folder or block of information seeded, producing a hash value that is, for practical purposes, unique to that content. Data backed up from every other device connecting to the enterprise network is hashed with the same algorithm, and the hash values are compared to identify any duplicate information that may exist in the current backup set. Duplicates are then eliminated and references to the original information are stored in their place, so that the data can still be recovered to a new device with all duplicates intact.
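The hash comparison described above can be illustrated with a minimal Python sketch. It is not how any particular backup product works: the `DedupStore` class, the device names and the file paths are all hypothetical, and SHA-256 is simply assumed as the hash algorithm. Real products usually hash at block level and handle far more edge cases.

```python
import hashlib


class DedupStore:
    """Toy repository that keeps one copy of each distinct piece of content."""

    def __init__(self):
        self.blocks = {}   # hash value -> content, stored only once
        self.catalog = {}  # (device, path) -> hash value (the reference)

    def backup(self, device, path, data):
        """Record a file; return True only if its content was new to the repository."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blocks
        if is_new:
            self.blocks[digest] = data
        # A reference is always recorded, even when the content is a duplicate.
        self.catalog[(device, path)] = digest
        return is_new

    def restore(self, device, path):
        """Follow the stored reference back to the original content."""
        return self.blocks[self.catalog[(device, path)]]


store = DedupStore()
report = b"Q3 sales report"

# Seed the repository, then back up two other (hypothetical) devices.
seeded = store.backup("server", "/shared/report.doc", report)
duplicate = store.backup("laptop-1", "/home/a/report.doc", report)
modified = store.backup("laptop-2", "/home/b/report.doc", report + b" (edited)")

print(len(store.catalog), "files catalogued,", len(store.blocks), "copies stored")
```

Three files are catalogued, but only two copies are held, because the laptop-1 file hashes to the same value as the seeded file. Restoring the laptop-1 file follows its reference and returns the full original content.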

De-duplication is often described as a compression function. This is because the removal of duplicate data reduces the volume of information that is ultimately stored in the cloud database. Moreover, compression functions in a sense remove duplicates of information at a granular level within a file or folder. For instance, compression typically replaces repeated sequences of characters with shorter references, reducing the space the data occupies in the storage repository. However, the two functions differ in purpose and scope. De-duplication removes duplicate copies of information to rationalise the data stored in the database. Compression is purely a means of saving space. Both processes have to be reversed at the time of recovery in order to obtain the complete data set from storage.
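The difference in scope, and the need to reverse both steps at recovery, can be seen in a short Python sketch. The sample data is invented for illustration, and SHA-256 and zlib are assumed stand-ins for whatever hash and compression functions a real product uses.

```python
import hashlib
import zlib

# Three identical copies of one highly repetitive (made-up) document.
doc = b"backup " * 1000
copies = [doc, doc, doc]

# De-duplication works across files: keep one copy per distinct hash value.
unique = {hashlib.sha256(c).hexdigest(): c for c in copies}

# Compression works within a file: shrink each surviving copy.
packed = {h: zlib.compress(c) for h, c in unique.items()}

raw_size = sum(len(c) for c in copies)
dedup_size = sum(len(c) for c in unique.values())
packed_size = sum(len(p) for p in packed.values())
print(raw_size, dedup_size, packed_size)

# Recovery reverses both steps: decompress the stored copy, then follow
# each file's reference to rebuild every original, duplicates included.
refs = [hashlib.sha256(c).hexdigest() for c in copies]
restored = [zlib.decompress(packed[h]) for h in refs]
```

De-duplication cuts the three copies down to one, and compression then shrinks that one copy further. The restored list matches the original three copies exactly.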