What is Pre-Process De-Duplication?

Pre-process de-duplication is, in most cases, also known as source de-duplication. It de-duplicates data before the data is transmitted to the storage device: all data is channelled through the source de-dupe software or hardware before it is sent to the device where it will be stored. The main objective of source de-duplication is to avoid sending duplicate data across the network. The source de-dupe system establishes a connection with the designated storage device and evaluates the data before it starts the de-duplication process. It stays synchronised with the target disk throughout the process, so that files matching data already held on the target are removed at the source. The main advantage is that this saves bandwidth for the user.
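The idea of checking with the target before sending anything can be sketched in a few lines of Python. This is a minimal illustration and not any particular product's implementation: the `TargetStore` class, the `backup` and `restore` functions, the fixed-size chunking, and the use of SHA-256 digests are all assumptions made for the example, and the "network" is simply a method call on an in-memory object.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny for illustration; an assumed value


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class TargetStore:
    """Hypothetical stand-in for the target storage device."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def has(self, digest: str) -> bool:
        return digest in self.chunks

    def put(self, digest: str, chunk: bytes) -> None:
        self.chunks[digest] = chunk


def backup(data: bytes, target: TargetStore) -> tuple[list[str], int]:
    """Send only chunks the target lacks; return (digest list, bytes sent)."""
    recipe: list[str] = []
    sent = 0
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if not target.has(digest):   # only a digest lookup crosses the "network"
            target.put(digest, chunk)  # unseen chunk: transmit the actual bytes
            sent += len(chunk)
        recipe.append(digest)          # a pointer is recorded either way
    return recipe, sent


def restore(recipe: list[str], target: TargetStore) -> bytes:
    """Rebuild the original data by following the recorded digests."""
    return b"".join(target.chunks[d] for d in recipe)


target = TargetStore()
first_recipe, first_sent = backup(b"AAAABBBBAAAA", target)
second_recipe, second_sent = backup(b"AAAABBBBCCCC", target)
```

In this sketch the first backup transmits only two of its three chunks, because the third repeats the first, and the second backup transmits only the one chunk the target has not seen before. That is where the bandwidth saving comes from.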

To identify changed bytes, the source de-dupe software or hardware has to perform byte-level scans. To make recovery easy for the user, only the changed bytes are transferred to the destination or target device, with pointers linking them to the original indexes and files, which are updated accordingly. The entire operation runs quickly, without compromising the accuracy or efficiency of the process. Source de-dupe is light on processing power when compared to post-process de-dupe. Source de-duplication can also categorise data in real time. Policy-based device configurations can classify data at a granular level and filter data out as it passes through the source de-dupe device. Files can be included or excluded on the basis of group, domain, user, owner, age, path, file type, or storage type, or even on the basis of RPO (recovery point objective) or retention periods.
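As a rough illustration of policy-based filtering, the sketch below decides which files are passed on for de-duplication based on a few of the attributes listed above (file type, path, and age). The `FileInfo` and `Policy` classes, their field names, and the sample values are hypothetical. Real products expose their own policy formats.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical metadata record for one file seen at the source."""
    path: str
    owner: str
    file_type: str
    age_days: int


@dataclass
class Policy:
    """Hypothetical exclusion policy; an empty policy allows everything."""
    excluded_types: set[str] = field(default_factory=set)
    excluded_path_prefixes: tuple[str, ...] = ()
    max_age_days: Optional[int] = None

    def allows(self, f: FileInfo) -> bool:
        if f.file_type in self.excluded_types:
            return False
        # str.startswith accepts a tuple of prefixes (False for an empty tuple)
        if f.path.startswith(self.excluded_path_prefixes):
            return False
        if self.max_age_days is not None and f.age_days > self.max_age_days:
            return False
        return True


def select_for_backup(files: list[FileInfo], policy: Policy) -> list[FileInfo]:
    """Keep only the files the policy allows through the source de-dupe step."""
    return [f for f in files if policy.allows(f)]


files = [
    FileInfo("/home/ann/report.doc", "ann", "doc", 10),
    FileInfo("/home/ann/scratch.tmp", "ann", "tmp", 1),
    FileInfo("/var/cache/app/index.db", "root", "db", 3),
    FileInfo("/home/bob/old-notes.txt", "bob", "txt", 900),
]
policy = Policy(
    excluded_types={"tmp"},
    excluded_path_prefixes=("/var/cache/",),
    max_age_days=365,
)
selected = [f.path for f in select_for_backup(files, policy)]
```

With these sample values only `/home/ann/report.doc` passes the filter. The other three are excluded by file type, path prefix, and age respectively.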

Despite these advantages, source de-dupe has some disadvantages. Source de-duplication does reduce the bandwidth you need to transmit data or files to the destination or target; however, it imposes a higher processing load on the clients, because the entire de-duplication process runs at the source. In addition, the central processing unit (CPU) power consumed by your device will rise by about 25% to 50% during source de-duplication, which may not be favourable for you. Source-based de-dupe nodes may also need to be installed at each connected location. This adds cost and is more expensive than target de-duplication, where all de-duplication is carried out on a single de-duplication device at a nodal point of the network.

Lastly, if the existing software does not support the de-duplication hardware or algorithms, the software may need to be redesigned. This is not a problem in target de-duplication, where the de-dupe hardware and software are isolated from the organisation’s hardware and software, so no changes are needed at the source.