Overview of Data Deduplication Technology

Disk to Disk technology

The backup and disaster recovery industry is experiencing a renaissance with the advent of Disk to Disk backup technology, which has enabled a complete rethinking of the backup process and provided the foundation for new features that make an administrator’s life much easier.

One bright example of this renaissance is the arrival of Data Deduplication technology. Deduplication has been getting a lot of press and attention lately, and rightly so: it is one of the biggest industry-changing features to date.

Benefits of Data Deduplication

In a nutshell, data deduplication is the process of identifying duplicate copies of data, recording where that data is stored, and finally storing a single copy of that data. This process can result in large space and cost savings for customers.
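
The three steps above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular product works: the `DedupStore` class, the fixed-size chunking, and the use of SHA-256 to identify duplicates are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy store that keeps a single copy of each unique chunk (hypothetical)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes (the single stored copy)
        self.files = {}   # file name -> ordered digests (where the data lives)

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()  # identify duplicates
            self.chunks.setdefault(digest, chunk)       # store only if unseen
            digests.append(digest)                      # record the location
        self.files[name] = digests

    def get(self, name):
        # Rebuild the file from its recorded chunk references.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore(chunk_size=4)
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")
```

Here the two files hold 24 bytes between them, but only four unique 4-byte chunks (16 bytes) are kept, and `store.get("a.txt")` still returns the original contents.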

The typical space savings depend heavily on the data being processed. For example, large encrypted files with little duplicate data will not benefit much from deduplication. However, a team of developers working on the same code, or a team of graphic designers working on the same project, will generally have a lot of duplicate data.
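
To make that dependence concrete, the sketch below measures what share of a data set's chunks are unique. It assumes fixed-size chunks and uses random bytes as a stand-in for well-encrypted data; `unique_fraction` is a hypothetical helper written for this example.

```python
import hashlib
import os


def unique_fraction(data, chunk_size=4096):
    """Return the share of fixed-size chunks in `data` that are unique."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        return 0.0
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(unique) / len(chunks)


# Sixteen identical chunks: highly redundant data.
repetitive = b"\x00" * 4096 * 16
# Random bytes: an assumed stand-in for encrypted data with no repeats.
random_like = os.urandom(4096 * 16)

repetitive_share = unique_fraction(repetitive)
random_share = unique_fraction(random_like)
```

For the repetitive data only one chunk in sixteen needs to be stored, while essentially every chunk of the random data is unique and has to be kept.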

A process such as Backup and Archiving, which generally has a lot of redundant data (how many copies of Windows or Microsoft Office does your organization have?), can benefit immensely from data deduplication. Some vendors claim space savings of up to a factor of 300x, but a factor of 20x is more realistic.
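
It helps to translate those factors into percentages. The sketch below assumes that a "factor" means the size of the original data divided by the size actually stored, which is our reading rather than a vendor definition; `space_savings` is a hypothetical helper.

```python
def space_savings(ratio):
    """Fraction of storage saved, assuming ratio = original size / stored size."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1 - 1 / ratio


realistic = space_savings(20)   # the 20x figure
claimed = space_savings(300)    # the 300x figure

print(f"20x  saves {realistic:.2%} of the space")
print(f"300x saves {claimed:.2%} of the space")
```

Under this assumption, 20x already saves about 95% of the space, and 300x saves about 99.67%, so the gap between the realistic figure and the headline claim is smaller than the raw factors suggest.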

As mentioned earlier, data deduplication is a new feature born from the paradigm shift from tape to disk as a backup medium. Unlike disk, tape is a sequential (non-random-access) medium; data can only be read or written in sequence.

This sequential characteristic of tape means that it generally takes longer to access specific files. Disk to Disk backup enables immediate access to specific files without having to read through the data that precedes them. This greatly speeds up both backup and recovery and can save valuable time. It also makes the system more efficient: it can back up the data and move on to the next task.
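
The difference between the two access patterns can be illustrated with an in-memory stream. This is a simplified model, not a description of real tape or disk hardware: the fixed `RECORD_SIZE` and both helper functions are assumptions made for the example.

```python
import io

RECORD_SIZE = 8  # hypothetical fixed record size, for illustration only


def read_sequential(stream, index):
    """Tape-like access: read every preceding record to reach the target."""
    stream.seek(0)  # "rewind" to the start
    bytes_read = 0
    record = b""
    for _ in range(index + 1):
        record = stream.read(RECORD_SIZE)
        bytes_read += len(record)
    return record, bytes_read


def read_random(stream, index):
    """Disk-like access: jump straight to the target record."""
    stream.seek(index * RECORD_SIZE)
    record = stream.read(RECORD_SIZE)
    return record, len(record)


# A pretend backup set of 1000 eight-byte records.
backup = io.BytesIO(b"".join(f"file{i:04d}".encode() for i in range(1000)))

tape_record, tape_bytes = read_sequential(backup, 999)
disk_record, disk_bytes = read_random(backup, 999)
```

Both calls return the same record, but the tape-like read has to pass over 8000 bytes to get there while the disk-like read touches only 8.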

Client Side Deduplication

In this scenario, the agent residing on the client handles the deduplication process. The client is the ultimate authority on what data resides on it and what data has changed. For remote offices especially, where network bandwidth is at a premium, being able to deduplicate the data before sending it to the backup appliance at the main office saves valuable bandwidth. The downside is that processing the data consumes client processor time. Depending on the circumstances, trading client processing for network bandwidth could be well worth it.
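
A minimal sketch of that exchange follows, assuming the client hashes fixed-size chunks and asks the server which ones it is missing before sending anything. The `BackupServer` class and `client_backup` function are hypothetical and stand in for whatever protocol a real agent and appliance would use.

```python
import hashlib


class BackupServer:
    """Stand-in for the backup appliance at the main office (hypothetical)."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def missing(self, digests):
        # Tell the client which chunks the server does not yet hold.
        return {d for d in digests if d not in self.chunks}

    def upload(self, chunks):
        self.chunks.update(chunks)


def client_backup(server, data, chunk_size=4096):
    """Hash on the client, then send only chunks the server lacks."""
    order, local = [], {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()  # costs client CPU time
        local[digest] = chunk
        order.append(digest)
    payload = {d: local[d] for d in server.missing(order)}
    server.upload(payload)
    bytes_sent = sum(len(c) for c in payload.values())
    return order, bytes_sent


server = BackupServer()
_, first_sent = client_backup(server, b"AAAABBBB", chunk_size=4)
_, second_sent = client_backup(server, b"AAAACCCC", chunk_size=4)
```

The first backup sends all 8 bytes; the second sends only the 4 bytes of the chunk the server has not seen. The hashing is the client processor cost described above, and the smaller payload is the bandwidth saved.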

In-Band Deduplication

For appliance-based products that use the in-band approach, data is deduplicated before it is written to disk. This has the advantage that the data is only touched once. However, the in-band approach adds overhead to the backup process itself and can slow it down.
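
As a rough sketch of the in-band write path, assume the appliance hashes each incoming chunk and checks it against an index before touching the disk. The `InBandAppliance` class is hypothetical and a dictionary stands in for the disk.

```python
import hashlib


class InBandAppliance:
    """Toy appliance that deduplicates on the backup path (hypothetical)."""

    def __init__(self):
        self.disk = {}          # digest -> chunk bytes (stand-in for disk)
        self.bytes_written = 0

    def ingest(self, chunk):
        # The hash and lookup happen while the backup is running:
        # this is the overhead the in-band approach adds.
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.disk:
            self.disk[digest] = chunk  # each unique chunk is written once
            self.bytes_written += len(chunk)
        return digest


in_band = InBandAppliance()
for incoming in (b"AAAA", b"BBBB", b"AAAA"):
    in_band.ingest(incoming)
```

Twelve bytes arrive but only 8 are written, because the repeated chunk is recognized before it reaches the disk.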

Out-of-Band Deduplication

For appliance-based products that use the out-of-band approach, backup data is first written to disk as-is during the backup process. After the backup is finished, the data is processed and duplicate data is discarded. Since the data is not processed in-line, there is no overhead penalty during the backup. The trade-off is that extra storage is necessary while the backup data is being post-processed, but there is the assurance that the backup data is captured as quickly as possible.
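
The same idea in sketch form, assuming the appliance lands raw chunks in a staging area and deduplicates them in a later pass. The `OutOfBandAppliance` class is hypothetical, and a list and a dictionary stand in for the staging area and the deduplicated store.

```python
import hashlib


class OutOfBandAppliance:
    """Toy appliance that deduplicates after the backup (hypothetical)."""

    def __init__(self):
        self.staging = []  # raw chunks, written as they arrive
        self.disk = {}     # digest -> chunk bytes, filled by post-processing

    def ingest(self, chunk):
        self.staging.append(chunk)  # no hashing on the backup path

    def post_process(self):
        refs = []
        for chunk in self.staging:
            digest = hashlib.sha256(chunk).hexdigest()
            self.disk.setdefault(digest, chunk)  # duplicates are discarded
            refs.append(digest)
        self.staging = []  # release the temporary space
        return refs

    def used_bytes(self):
        staged = sum(len(c) for c in self.staging)
        stored = sum(len(c) for c in self.disk.values())
        return staged + stored


out_of_band = OutOfBandAppliance()
for incoming in (b"AAAA", b"BBBB", b"AAAA"):
    out_of_band.ingest(incoming)
before = out_of_band.used_bytes()
refs = out_of_band.post_process()
after = out_of_band.used_bytes()
```

Storage use is 12 bytes right after the backup and drops to 8 once post-processing finishes, which is the temporary extra space the trade-off refers to.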

Best Approach

In our view, the biggest advantage of disk-to-disk technology and data deduplication is being able to combine all three approaches to leverage the strengths of each while mitigating their weaknesses. For example, Continuous Data Protection (CDP) is a client-side approach that makes a lot of sense. A CDP-based client knows which data has been modified and can keep track of it on the fly. Newly modified data is rarely a duplicate and can be confidently sent to the backup server appliance. When the back-end server is coordinated with smart clients, duplicated data can be identified on the fly, greatly reducing overall network congestion while minimizing client-side processing.
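
A very small sketch of the change tracking a CDP-style client might do is shown below. It is an assumption-laden toy: real CDP agents typically hook into the file system or block layer, and the `TrackedVolume` class and its block layout are hypothetical.

```python
class TrackedVolume:
    """Toy volume that remembers which blocks changed (hypothetical)."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks  # zero-filled blocks
        self.dirty = set()  # indexes modified since the last drain

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        if self.blocks[index] != data:  # ignore writes that change nothing
            self.blocks[index] = data
            self.dirty.add(index)

    def drain_changes(self):
        """Return the modified blocks and reset the tracking."""
        changes = {i: self.blocks[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return changes


volume = TrackedVolume(num_blocks=8)
volume.write(2, b"AAAA")
volume.write(5, b"BBBB")
first_batch = volume.drain_changes()
second_batch = volume.drain_changes()
```

Only blocks 2 and 5 appear in the first batch, and the second batch is empty, so the client never has to rescan or rehash data that has not changed.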

Data deduplication is a big win in any backup environment, as is replacing tape with disk as an archival medium. We recommend that you consider incorporating Disk to Disk and Data Deduplication technologies into your backup processes.

About Max Ackerman

Steve "Max" Ackerman, Founder and CTO, Revinetix

Max Ackerman formerly served as co-founder and CTO of Phobos Corp., where, in just 3 years, his extensive entrepreneurial, management, team building, and engineering skills led to SonicWALL's $279 million acquisition of Phobos in 2000. Max has a history of innovative technology product development spanning 13 years for various companies. His past successful product developments include: SonicWALL's and Cisco's internet security offloading devices; fiber channel net card - 3M; gigabit Ethernet - 3Com; Fast Ethernet - Silicon Graphics. Max holds a Bachelor of Science in Computer Science from the University of Vermont.