Demystifying Data Deduplication

Written by Bryan Antepara
April 22, 2010
9:26 am

Stop repeating yourself – How to eliminate redundant data from backups
reprinted with permission from HP

Data deduplication is a storage technology for managing explosive data growth and providing data protection. There’s a lot of talk about the technique, but little clarity on details and practical applications.

What is it?
Data deduplication is a method for eliminating redundant data from storage, especially from backups. It works by saving a single copy of identical data, replacing any further instances with pointers back to that one copy.

Here’s a simple example: Say 500 people receive a company-wide e-mail with a 1 megabyte attachment. If each recipient saves that attachment locally, it is replicated 500 times on desktops around the network. During backup, a system without data deduplication would then store the data in that one attachment 500 times — consuming 499 MB more backup space than necessary.

Data deduplication backs up just one instance of the attachment’s data and replaces the other 499 instances with pointers back to that copy.
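
The e-mail scenario can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the `SingleInstanceStore` class, its method names and the file paths are made up for the example, and it assumes identical content is detected by comparing SHA-256 digests.

```python
import hashlib


class SingleInstanceStore:
    """Toy backup store that keeps one copy of each distinct file body."""

    def __init__(self):
        self.blobs = {}     # digest -> file content, stored once
        self.pointers = {}  # backup path -> digest of its content

    def backup(self, path, content):
        digest = hashlib.sha256(content).hexdigest()
        # Store the bytes only the first time this content is seen.
        self.blobs.setdefault(digest, content)
        # Every path, including the first, is recorded as a pointer.
        self.pointers[path] = digest

    def restore(self, path):
        return self.blobs[self.pointers[path]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


MB = 1024 * 1024
attachment = bytes(MB)  # stand-in for the 1 MB attachment

store = SingleInstanceStore()
for user in range(500):
    store.backup(f"user{user:03d}/attachment.ppt", attachment)

logical_mb = len(store.pointers) * len(attachment) // MB
print(logical_mb, "MB backed up,", store.stored_bytes() // MB, "MB stored")
```

All 500 paths can still be restored, because each pointer resolves to the single stored copy.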

The technology also works at a second level: If a change is made to the original file, then data deduplication saves only the block or blocks of data actually altered. (A block is typically tiny, somewhere between 2 kilobytes and 10 KB of data.)

So let’s say our 1 MB attachment is a presentation and its title is changed. Data deduplication would save only the new title, usually in a 4 KB data block, with pointers back to the first iteration of the file. Thus, only 4 KB of new backup data is retained.
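
A rough sketch of that block-level behavior follows. It assumes fixed 4 KB blocks identified by SHA-256 digests, and the helper names (`split_blocks`, `backup`, `restore`) are invented for the illustration. The "title change" is modeled as overwriting a few bytes in place at the start of the file.

```python
import hashlib

BLOCK_SIZE = 4 * 1024  # 4 KB, matching the block size in the example above


def split_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(data, block_store):
    """Store unseen blocks; return (recipe, bytes newly written)."""
    recipe, new_bytes = [], 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:
            block_store[digest] = block
            new_bytes += len(block)
        recipe.append(digest)  # pointer to the stored block
    return recipe, new_bytes


def restore(recipe, block_store):
    return b"".join(block_store[digest] for digest in recipe)


# Build a 1 MB file whose 256 blocks are all distinct.
original = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(256))
# Overwrite a few bytes at the start, as a stand-in for editing the title.
edited = b"New title" + original[9:]

block_store = {}
recipe_v1, written_v1 = backup(original, block_store)
recipe_v2, written_v2 = backup(edited, block_store)
print(written_v1, written_v2)  # bytes written by the first and second backup
```

The second backup writes a single 4 KB block; the other 255 entries in its recipe point at blocks saved by the first backup. Note that fixed-size blocks only line up when an edit leaves the file length unchanged. An insertion would shift every later block, which is one reason variable-size chunking exists.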

When used in conjunction with other methods of data reduction, such as conventional data compression, data deduplication can cut data volume even further.
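
As a small illustration of combining the two, the sketch below deduplicates some sample data and then compresses each remaining unique block with `zlib`. The sample data is made up and deliberately repetitive, and the 4 KB block size is an assumption, so the sizes it prints say nothing about real-world ratios.

```python
import hashlib
import zlib

BLOCK_SIZE = 4 * 1024  # assumed block size


def dedupe(data):
    """Return {digest: block} holding each distinct block once."""
    unique = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        unique.setdefault(hashlib.sha256(block).hexdigest(), block)
    return unique


# Made-up sample: 8 distinct, repetitive text blocks, each appearing 16 times.
distinct = [(f"log line from server {n}\n".encode() * 200)[:BLOCK_SIZE]
            for n in range(8)]
data = b"".join(distinct) * 16

unique = dedupe(data)
raw_size = len(data)
deduped_size = sum(len(block) for block in unique.values())
compressed_size = sum(len(zlib.compress(block)) for block in unique.values())

print("raw bytes:", raw_size)
print("after deduplication:", deduped_size)
print("after deduplication and compression:", compressed_size)
```

Deduplication removes the repeated blocks, and compression then shrinks what is left.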

Now extrapolate that scenario beyond e-mail to thousands of gigabytes of data every month or year. That’s a lot of storage that data deduplication could help you to free up, allowing you to retain more backups for a longer time on a given amount of space. And the benefits can go even further. Data deduplication can also help:

Save money with lower disk investments
Free up bandwidth
Rely less on tape backup
Recover faster after an outage

A little myth-busting
It might seem that squeezing more data into less space would mean there’s more room to cram in new data, but that’s not how data deduplication works. Because the technology uses pointers to locate repeated data, the ratio of backed-up data to the space it actually occupies increases with each backup you make.

However, adding more unique data doesn’t take advantage of the space-saving pointers. What the technique does make possible is storing more backups for a longer time in the same amount of space.

That means a faster recovery when you need an older version of data (as opposed to retrieving a tape from a remote site). But it doesn’t necessarily translate into freeing up room for more unique data.
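
A quick simulation makes the point concrete. It is a simplified model, not a measurement: it counts equal-sized blocks rather than bytes, assumes a full backup every night, and the default of 2 changed blocks out of 100 per night is an arbitrary choice. The function name `run_backups` is made up for the example.

```python
import hashlib


def digest(block):
    return hashlib.sha256(block).hexdigest()


def run_backups(num_backups, blocks_per_backup=100, changed_per_backup=2):
    """Return the cumulative deduplication ratio after each full backup."""
    current = [f"block-{i}-v0".encode() for i in range(blocks_per_backup)]
    unique = set()  # digests of blocks actually stored
    logical = 0     # blocks backed up so far, counting repeats
    ratios = []
    for night in range(1, num_backups + 1):
        # Simulate a small daily change by rewriting a few blocks.
        for j in range(changed_per_backup):
            idx = (night * changed_per_backup + j) % blocks_per_backup
            current[idx] = f"block-{idx}-v{night}".encode()
        unique.update(digest(block) for block in current)
        logical += len(current)
        ratios.append(logical / len(unique))
    return ratios


mostly_repeated = run_backups(30)
all_unique = run_backups(30, changed_per_backup=100)
print("first and last ratio, 2% daily change:",
      round(mostly_repeated[0], 2), round(mostly_repeated[-1], 2))
print("last ratio, 100% daily change:", round(all_unique[-1], 2))
```

With mostly repeated data, the ratio climbs with every backup retained. When every block is new each night, the ratio stays at 1 and deduplication saves nothing.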

Comparing technologies
When it comes to data deduplication, one size does not fit all. That’s why it is important to consider a solution’s approach from the following three levels before making a decision:

Where does data deduplication occur? Does it occur at the source (a server, for example) or at the target that stores the data (a virtual tape library, for example)? A source-based approach results in less data being sent across the wire for backup, potentially shortening backup windows. A target-based approach is well-suited for a virtual tape device and therefore can replace tape backup and speed up data retrieval processes.
When does deduplication happen? In target-based implementations, data can either be backed up first, then deduplicated (post-process), or deduplication can be executed during the backup process (inline). Each method has pros and cons: Post-process deduplication may result in a faster backup, but with inline deduplication the data can be replicated immediately after a backup concludes.
How does it happen? Object-level differencing reduces data by storing only the changes that occur, while hash-based chunking products locate global redundancies that occur among all files in a backup. Some technologies even difference at the file level, a technique with so many drawbacks there’s little point in considering it here.
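
To see why the "where" question affects network traffic, here is a minimal sketch comparing a source-based and a target-based backup that both use hash-based chunking. Everything in it is hypothetical: the `Target` class, the function names, the fixed 4 KB blocks, and the choice to count only block payloads sent (the small digest exchange is ignored).

```python
import hashlib

BLOCK_SIZE = 4 * 1024  # assumed block size


class Target:
    """Hypothetical backup target holding deduplicated blocks."""

    def __init__(self):
        self.blocks = {}  # digest -> block

    def missing(self, digests):
        return {d for d in digests if d not in self.blocks}

    def put(self, digest, block):
        self.blocks[digest] = block


def blocks_of(data):
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        yield hashlib.sha256(block).hexdigest(), block


def target_side_backup(data, target):
    """Send everything; the target deduplicates on arrival."""
    for d, block in blocks_of(data):
        if d not in target.blocks:
            target.put(d, block)
    return len(data)  # payload bytes sent over the wire


def source_side_backup(data, target):
    """Ask the target which blocks it lacks and send only those."""
    pending = dict(blocks_of(data))
    sent = 0
    for d in target.missing(pending):
        target.put(d, pending[d])
        sent += len(pending[d])
    return sent  # payload bytes sent over the wire


# Two versions of a 256 KB data set that differ in one block.
data_v1 = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(64))
data_v2 = b"\xff" * BLOCK_SIZE + data_v1[BLOCK_SIZE:]

target_a, target_b = Target(), Target()
sent_target_side = [target_side_backup(d, target_a) for d in (data_v1, data_v2)]
sent_source_side = [source_side_backup(d, target_b) for d in (data_v1, data_v2)]
print("target-based, bytes sent per backup:", sent_target_side)
print("source-based, bytes sent per backup:", sent_source_side)
```

Both targets end up storing the same set of blocks. The difference is in the second backup, where the source-based run sends only the one changed block.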

Which approach is best for my organization?
The best approach to data deduplication depends on your size and backup needs.

Deduplication for enterprises: Object-level differencing, or accelerated deduplication, is a good choice for enterprise customers because it focuses on performance and scalability. It delivers the fastest restores, as well as the fastest possible backup by deduplicating data after it has been written to disk. You can scale up to increase performance simply by adding extra nodes.
Deduplication for midsize businesses and remote enterprise sites: Hash-based chunking, or dynamic deduplication, is a good choice for small and midsize businesses or large enterprises with remote sites because it focuses on compatibility and cost. It delivers a low-cost, small-footprint, format-independent solution.

The importance of options
Some companies offer only one method or the other — object-level differencing or hash-based chunking. However, the two technologies offer different strengths and weaknesses in differing environments. That’s why HP now offers both options in configurations tailored to the needs of different business environments:

For enterprises, HP’s Virtual Library Systems family of products offers accelerated deduplication on a proven platform that integrates into existing backup applications and processes to accelerate backups in complex SAN environments, all while improving reliability.
For small and midsize businesses (and remote enterprise offices), HP delivers dynamic deduplication in simple, self-managing, reliable and low-cost solutions: the HP StorageWorks D2D Backup System family.

No matter your needs, HP puts a range of data deduplication options at your disposal, not just one that’s been scaled up or down.

Bryan Antepara: IT Specialist

Bryan Antepara is a leader in Cloud engagements with a demonstrated history of digital transformation of business processes with the use of Microsoft Technologies, powered by the team of eMazzanti Technologies engineers.

Bryan has strong experience working with Office 365 cloud solutions, Business Process, Internet Information Services (IIS), Microsoft Office Suite, Exchange Online, SharePoint Online, and Customer Service.

His ability to handle the complexity of moving data in and out of containers and cloud sessions makes him the perfect candidate to help organizations large and small migrate to new and more efficient platforms. Bryan is a graduate of the University of South Florida and is a Microsoft Certification holder.