Data domain cleaning phases overview

Cleaning (also called garbage collection, or GC) is an important process on a Data Domain system: because the filesystem never overwrites data in place, cleaning is needed to reclaim the space occupied by deleted data. Unfortunately, this process can impact system performance, and it can take more than 24 hours to complete.

This post will help you identify what is happening during a particular phase. To get an idea of how long previous cleaning sessions have taken, search the messages log for "cleaning completed".
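As a rough illustration, past cleaning runs could be pulled from a saved copy of the messages log with a short script like the one below. The log-line format here is an assumption for illustration; on a live system the log is viewed through the DD OS CLI.

```python
# Sketch only: scan saved messages-log lines for cleaning completion
# entries. The exact log-line format is a hypothetical example.
def find_cleaning_completions(lines):
    """Return the log lines that record a completed cleaning run."""
    return [line.rstrip() for line in lines if "cleaning completed" in line.lower()]

sample_log = [
    "Jan 10 03:12:44 ddr kernel: filesys cleaning started",
    "Jan 11 01:47:02 ddr kernel: filesys cleaning completed",
    "Jan 11 01:47:05 ddr kernel: unrelated message",
]
print(find_cleaning_completions(sample_log))
# Comparing consecutive start/completed timestamps gives a feel for
# how long cleaning typically runs on a given system.
```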

Below is an explanation of the cleaning phases for DD OS 4.3 and later:

Cleaning Phases Explanation

  • Beginning in DD OS 4.3, the cleaning process takes one of two paths, depending on the number of containers in use. This is due to a limit on the number of containers that can be cleaned in a single cleaning run.
  • Sampling is required for a filesystem that uses more containers than the limit. In that case, the cleaning process performs focused cleaning on the subset of containers that have the most reclaimable space. All of the cleaning phases below, including phases 5-8, are run. Note that phases 6-9 restrict their working set to the candidate containers selected in phase 5.
  • Different DDR models have different amounts of memory, so the amount of physical space that can be cleaned in a single cleaning run varies by model. On systems that are fairly empty, with the number of containers in use below 25-30% of the total container set, all of the physical space can be cleaned in a single run. Cleaning completes much more quickly on these systems because the process skips directly from phase 4 to phase 9 (the copy phase), eliminating phases 5-8. Note that the skipped phases are displayed as 100% complete.
    1. pre-enumerate - enumerate all the files in the logical space. It may only sample part of the data to help with estimating where live data is located in physical space.
    2. pre-merge - do an index merge to flush index data to disk.
    3. pre-filter - if duplicate data has been written, find out where it is.
    4. pre-select - select the physical space that has the most dead data. This is what we want to clean. At this point the cleaning process will follow one of the two paths described above, depending on the number of containers in the filesystem.
    5. candidate - due to memory limitations, only a fraction of physical space can be cleaned in each cleaning run. The candidate phase is run to select a subset of data to clean and remember what's in the data.
    6. enumerate - enumerate all the files in the logical space and remember what data is active.
    7. merge - do an index merge to flush index data to disk.
    8. filter - determine what duplicate data has been written and find out where it is.
    9. copy - copy live data forward and free the space it previously occupied.
    10. summary - create a summary of the live data that's on the system.

Phase 9 generally takes the longest, as this is where deleting and copying take place; it is also where the results of the cleaning can be observed with the "df" command. Phase 1, and sometimes phase 2, can take a long time if replication lag is involved.
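The branching between the two paths, and the phase-5 candidate selection, can be sketched as follows. This is an illustrative model only, not the DD OS implementation; the container limit and the per-container fields are hypothetical.

```python
# Illustrative model of the two cleaning paths described above.
# `container_limit` stands in for the model-dependent memory limit.
def plan_cleaning(containers, container_limit):
    """Pick the cleaning path and, on the sampling path, select the
    candidate containers with the most reclaimable (dead) space."""
    if len(containers) <= container_limit:
        # Fairly empty system: clean everything, skipping phases 5-8.
        return "full", list(containers)
    # Large system: focused cleaning on the best candidates (phase 5).
    ranked = sorted(containers, key=lambda c: c["dead_bytes"], reverse=True)
    return "sampled", ranked[:container_limit]

containers = [
    {"id": 1, "dead_bytes": 10},
    {"id": 2, "dead_bytes": 90},
    {"id": 3, "dead_bytes": 50},
]
path, candidates = plan_cleaning(containers, container_limit=2)
print(path, [c["id"] for c in candidates])  # sampled [2, 3]
```

On the sampled path, phases 6-9 would then operate only on the returned candidates, which is the working-set restriction noted above.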

Sample output:

# filesys clean start
 
 Cleaning started.  Use 'filesys clean watch' to monitor progress.
 
# filesys clean watch
Beginning 'filesys clean' monitoring.  Use Control-C to stop monitoring.
 
Cleaning: phase 1 of 10 (pre-enumeration)
100.0% complete, 18810 GiB free; time: phase  0:00:08, total  0:00:08
 
Cleaning: phase 2 of 10 (pre-merge)
100.0% complete, 18810 GiB free; time: phase  0:00:17, total  0:00:26
 
Cleaning: phase 3 of 10 (pre-filter)
100.0% complete, 18810 GiB free; time: phase  0:00:25, total  0:00:51
 
Cleaning: phase 4 of 10 (pre-select)
100.0% complete, 18810 GiB free; time: phase  0:00:13, total  0:01:05
 
Cleaning: phase 5 of 10 (candidate)
0.0% complete, 18810 GiB free; time: phase  0:00:02, total  0:01:08
 
Cleaning: phase 6 of 10 (enumeration)
0.0% complete, 18810 GiB free; time: phase  0:00:02, total  0:01:10
 
Cleaning: phase 7 of 10 (merge)
100.0% complete, 18810 GiB free; time: phase  0:00:02, total  0:01:12
 
Cleaning: phase 8 of 10 (filter)
100.0% complete, 18810 GiB free; time: phase  0:00:02, total  0:01:14
 
Cleaning: phase 9 of 10 (copy)
100.0% complete, 18810 GiB free; time: phase  0:00:19, total  0:01:33
 
Cleaning: phase 10 of 10 (summary)
100.0% complete, 18810 GiB free; time: phase  0:00:29, total  0:02:03
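If you want to track progress across runs, the watch output can be parsed with a small script. The regular expressions below are derived from the sample output above; the exact format is an assumption and may vary between DD OS releases.

```python
import re

# Matches the header and progress lines shown in the sample output above.
PHASE_RE = re.compile(r"Cleaning: phase (\d+) of (\d+) \((.+)\)")
PROGRESS_RE = re.compile(
    r"([\d.]+)% complete, (\d+) GiB free; time: phase\s+(\S+), total\s+(\S+)"
)

def parse_watch(text):
    """Return a list of (phase number, phase name, percent complete) tuples."""
    phases = []
    current = None
    for line in text.splitlines():
        m = PHASE_RE.search(line)
        if m:
            current = (int(m.group(1)), m.group(3))
            continue
        m = PROGRESS_RE.search(line)
        if m and current:
            phases.append((current[0], current[1], float(m.group(1))))
            current = None
    return phases

sample = """Cleaning: phase 9 of 10 (copy)
100.0% complete, 18810 GiB free; time: phase  0:00:19, total  0:01:33"""
print(parse_watch(sample))  # [(9, 'copy', 100.0)]
```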

Other information about GC/cleaning

  1. Phase 1 (pre-enumeration) and 6 (enumeration) can take a long time when the following conditions are present:
    1. Poor Lp locality
    2. Very high global compression. If two DDRs consume the same amount of physical space (i.e., the same number of containers), and one DDR has 50x global compression while the other has 100x, enumerating the second DDR takes longer because there is a much larger logical space to traverse.
    3. Many small files
  2. The runtime of phase 1 and phase 6 depends on the logical size of the filesystem (i.e. Logical Bytes).
  3. The runtime of other phases depends on the physical size of the filesystem (i.e. # of containers in use).
  4. Performance bottleneck
    1. Before DD OS 4.5, index merge could be a performance bottleneck. This has been fixed in DD OS 4.5 and beyond.
    2. Pre-enumeration/enumeration/copy phases are the most time-consuming phases in GC/cleaning.
  5. Phase 9 (copy) can take a long time for the following cases:
    1. A high live percentage in the containers selected for copy forward, i.e. not enough physical data was deleted before running GC/cleaning. This may indicate that GC/cleaning is being run more often than it should be.
    2. Additional processing - Re-encryption, recompression, features, sketching
      1. Features are introduced in DD OS 5.0 and beyond. An upgrade from pre-5.0 to 5.0 or later will experience slowness in the first round of GC/cleaning, since features must be computed for each container.
      2. Enabling delta replication requires sketches. GC/cleaning needs an extra cycle to recompute sketches during the copy phase.
      3. Gz local compression is significantly more expensive than Lz.
      4. The cost of encryption and key rotation (5.2) is significant.
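A quick back-of-the-envelope calculation makes the global-compression point above concrete. The figures are hypothetical: two DDRs with identical physical footprints but different global compression ratios.

```python
# Hypothetical numbers: two DDRs consuming the same physical space.
# Phase 1/6 runtime scales with logical size, so the 100x system has
# roughly twice as much logical space to traverse as the 50x system.
physical_tib = 10
for compression in (50, 100):
    logical_tib = physical_tib * compression
    print(f"{compression}x compression -> {logical_tib} TiB logical to enumerate")
```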