What is Data Domain deduplication?

Deduplication is the main benefit of Data Domain appliances. The basic premise is that we eliminate redundant data, storing only one instance of each segment and using pointers in place of the duplicates. Pointers consume much less space than the actual data, so there is a significant reduction in the amount of disk required.

Deduplication uses a hashing algorithm to generate a hash value for each data segment. An index of the hash values is kept so that new data can be quickly compared to existing data. If a match is found in the index, only a pointer to the original segment is kept on disk. "Hash," "fingerprint" and "checksum" are all synonyms in EMC terminology.
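
As a rough illustration, here is a minimal Python sketch of the idea, assuming SHA-256 as the fingerprint and in-memory dictionaries standing in for the on-disk segment store and index; Data Domain's actual fingerprinting and index structures are internal to the product.

    import hashlib

    segment_store = {}   # fingerprint -> unique segment data (the actual bytes on disk)
    file_recipe = []     # ordered fingerprints that stand in for the file's contents

    def write_segment(segment: bytes) -> None:
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in segment_store:      # new data: store the segment itself
            segment_store[fp] = segment
        file_recipe.append(fp)           # duplicate or not, only a pointer is recorded

    for seg in (b"A" * 4096, b"B" * 4096, b"A" * 4096):
        write_segment(seg)
    print(len(segment_store), len(file_recipe))   # 2 unique segments, 3 pointers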

File-based deduplication compares entire files and keeps only one instance of each file. There is a slight reduction in space when there are multiple copies of the same file in a filesystem; however, once a change is made to one of those copies, the entire file is stored again. This is an inefficient method of deduplication.
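
File-level dedupe is easy to sketch, and the sketch also shows why it breaks down: the fingerprint covers the whole file, so a one-byte edit produces a brand-new fingerprint and the whole file lands on disk again. The sample data below is made up for illustration.

    import hashlib

    stored_files = {}   # whole-file fingerprint -> file contents

    def backup_file(contents: bytes) -> None:
        fp = hashlib.sha256(contents).hexdigest()
        stored_files.setdefault(fp, contents)

    report = b"quarterly report" * 1000
    backup_file(report)          # first copy is stored
    backup_file(report)          # exact duplicate: nothing new stored
    backup_file(report + b".")   # one-byte change: the entire file is stored again
    print(len(stored_files))     # 2 full copies on disk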

Fixed-length deduplication breaks data into fixed-length segments and replaces duplicate segments with pointers. Because the comparison happens at a more granular level than file-based dedupe, the reduction in actual data storage is significant. However, when data is added or modified, the segment boundaries shift and the data on disk changes, which requires reprocessing to accommodate the new data. While an improvement over file-based dedupe, it is still somewhat inefficient. "Fixed-length" is also known as "block based" or "fixed-length segment" deduplication, and is the method most deduplication products employ today.
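
The boundary-shift problem is easy to demonstrate. The 8 KiB segment size below is purely illustrative, not Data Domain's internal value:

    SEGMENT = 8192   # illustrative fixed segment size

    def fixed_segments(data: bytes):
        return [data[i:i + SEGMENT] for i in range(0, len(data), SEGMENT)]

    original = bytes(range(256)) * 400      # ~100 KB of sample data
    modified = b"X" + original              # a single byte inserted at the front
    matches = sum(a == b for a, b in
                  zip(fixed_segments(original), fixed_segments(modified)))
    print(matches)                          # 0 -- every boundary shifted, nothing dedupes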

Variable-length deduplication is a more efficient means of deduplicating data: each data stream is analyzed, common patterns are found, and only unique patterns are stored on disk. The duplicate patterns are replaced with pointers. This is the method Data Domain and Avamar use for storing data. "Variable-length," "variable segment size" and "variable block" are all synonyms.
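
A toy content-defined chunking routine shows the general idea; the rolling-style hash and the constants below are illustrative and are not Data Domain's or Avamar's actual algorithm.

    import hashlib, os

    MIN_LEN, MASK = 2048, 0x1FFF   # toy values giving roughly 8-10 KiB average segments

    def variable_segments(data: bytes):
        # Declare a boundary wherever a running hash of recent bytes hits a
        # chosen bit pattern, so boundaries follow content rather than offsets.
        segments, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF
            if i - start + 1 >= MIN_LEN and (h & MASK) == 0:
                segments.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            segments.append(data[start:])
        return segments

    data = os.urandom(200_000)
    before = {hashlib.sha256(s).hexdigest() for s in variable_segments(data)}
    after = {hashlib.sha256(s).hexdigest() for s in variable_segments(b"X" + data)}
    print(len(before & after), "of", len(before), "segments unchanged after the insert")

Because the boundaries stick to the content, the same one-byte insert that wrecked the fixed-length example only disturbs the segment around the edit; everything downstream still dedupes.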

Inline deduplication takes place in real time, before the data is written to disk, whereas post-process deduplication takes place once the data is already on disk. Inline is more efficient from a disk-utilization perspective but requires more CPU and memory. Post-process dedupe requires more disk space and more administrative overhead, since the system's staging area also has to be monitored for capacity.
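
The difference can be caricatured in a few lines, with in-memory objects standing in for the disk and the staging area (all names below are made up for illustration):

    import hashlib

    def fp(seg: bytes) -> str:
        return hashlib.sha256(seg).hexdigest()

    def inline_backup(segments, disk: dict) -> None:
        # Dedupe happens in CPU/RAM first; only unique segments ever reach "disk".
        for seg in segments:
            disk.setdefault(fp(seg), seg)

    def post_process_backup(segments, staging: list, disk: dict) -> None:
        staging.extend(segments)           # everything lands in staging at full size...
        for seg in staging:                # ...and a later pass dedupes it into the store
            disk.setdefault(fp(seg), seg)
        staging.clear()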

Source-based dedupe uses a client or piece of software on the system being backed up to hash data segments before sending them. This requires more processing on the client, but saves considerably on network utilization. Target-based dedupe sends all data from the backup client to the backup device, where data segments are analyzed and only unique data is written to disk. This is easier on the system being backed up, but requires more network bandwidth. Data Domain natively uses target-based dedupe, but with the addition of DD Boost it can also accommodate source-based deduplication.
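
Here is a rough sketch of the kind of exchange source-based dedupe implies. The real DD Boost protocol is proprietary, so the flow and function names below are purely illustrative:

    import hashlib

    target_store = {}   # what the backup appliance already holds

    def fingerprint(seg: bytes) -> str:
        return hashlib.sha256(seg).hexdigest()

    def target_missing(fps):
        # The target answers a small fingerprint query instead of receiving all data.
        return {f for f in fps if f not in target_store}

    def source_backup(segments) -> int:
        fps = [fingerprint(s) for s in segments]
        missing = target_missing(fps)          # only hashes cross the wire here
        sent = 0
        for seg, f in zip(segments, fps):
            if f in missing:
                target_store[f] = seg          # only unique segments use bandwidth
                missing.discard(f)             # never ship the same segment twice
                sent += 1
        return sent

    print(source_backup([b"A" * 4096, b"B" * 4096, b"A" * 4096]))   # 2 sent, not 3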

Global compression is not really compression; it is what Data Domain calls deduplication. Local compression is ordinary on-disk compression, using the lz, gz or gzfast algorithms in Data Domain. Delta compression is what takes place during replication: the source sends a hash list of changed data to the replication target, the target sends back the list of segments it does not already have, and the source then sends only those new segments. In effect, Data Domain identifies data similar to what already exists on disk and sends only the changes.
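
The replication exchange can be sketched the same way, with zlib standing in for the lz/gz/gzfast local compression choices; everything below is illustrative rather than Data Domain's actual implementation:

    import hashlib, zlib

    def fingerprint(seg: bytes) -> str:
        return hashlib.sha256(seg).hexdigest()

    def replicate(changed_segments, target_index: set, target_store: dict) -> int:
        # 1. The source sends the fingerprints of its changed data.
        fps = {fingerprint(s): s for s in changed_segments}
        # 2. The target replies with the fingerprints it does not already hold.
        needed = [f for f in fps if f not in target_index]
        # 3. The source ships only those segments, locally compressed for the trip.
        for f in needed:
            target_store[f] = zlib.compress(fps[f])
            target_index.add(f)
        return len(needed)   # segments actually sent over the WAN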

DD boasts a 10-30x reduction in disk utilization, but I can tell you that I have seen small, simple CIFS shares reduce by better than 90% (better than 10x) in the wild.