Windows Storage - Deduplication and Compression

Windows Deduplication

Data deduplication is an extension of compression. Compression works on whole files, whereas deduplication happens at block level and eliminates duplicate copies of repeating data, whether that is duplicate files or duplicate data within several files. Data deduplication works by splitting data into small, variable-sized chunks and then comparing those chunks with existing chunks to identify duplicates. If a chunk is a duplicate, it is replaced with a reference so that only a single copy of each chunk is stored.
Windows Deduplication is post-process, so files are initially created at full size and are not deduplicated straight away; they are retained at full size for a minimum amount of time before they are processed. Deduplication has a setting called MinimumFileAgeDays that controls how old a file must be before it is processed. The default setting is 5 days. This setting is configurable and can be set to '0' to process files regardless of how old they are. Chunks are stored in container files, and new chunks are appended to the current chunk store container. When its size reaches about 1 GB, that container file is sealed and a new container file is created.
Deduplicated data chunks are not deleted immediately when a file is deleted. The references (or reparse points) to the deduplicated chunks are deleted, and a garbage collection job runs later to reclaim the space used by obsolete chunks.
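If you want to experiment with these behaviours, the Deduplication PowerShell module exposes both controls. A minimal sketch, assuming a deduplicated volume E: (the drive letter is only an example):

# Process files regardless of age (0 = ignore the MinimumFileAgeDays threshold)
Set-DedupVolume -Volume "E:" -MinimumFileAgeDays 0

# Manually start a garbage collection job to reclaim space from obsolete chunks
Start-DedupJob -Volume "E:" -Type GarbageCollection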

Data Deduplication was introduced in Windows Server 2012 and carried forward into Windows Server 2012 R2, but it had some limitations. Volumes above 10 TB were not considered good candidates for deduplication, the very large files that are typical of backup workloads were not good candidates, and the Data Deduplication process used a single thread and I/O queue for each volume.

Data Deduplication in Windows Server 2016 supports volume sizes up to 64 TB and files up to 1 TB. The Data Deduplication process now runs multiple threads in parallel, using multiple I/O queues for each volume, which speeds up the post-processing operations.

INSTALLATION

To install Data Deduplication, run the following PowerShell command as an administrator:

Install-WindowsFeature -Name FS-Data-Deduplication
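
If you want to confirm that the feature installed correctly, a quick follow-up check is:

Get-WindowsFeature -Name FS-Data-Deduplication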

ENABLING DEDUPLICATION

Before you enable Deduplication, you need to analyse your workloads, as not all of them are suitable. You will most likely get storage savings, but the performance overhead might be prohibitive. If you are using all-flash disks, they should be able to cope with the extra I/O that deduplication generates. If you are using spinning disks, Microsoft recommends that the following workloads should work fine:

  • General purpose file servers hosting things like team shares, user home folders, work folders, and software development shares.
  • Virtualized desktop infrastructure (VDI) servers.
  • Virtualized backup applications, such as Microsoft Data Protection Manager (DPM).

For other workloads, try to test deduplication to see how it performs. If you cannot do this, then consider the following:

Because Windows Deduplication is post-process, write operations are by definition not affected. Read operations could be affected, as some file content is moved into the Chunk Store, where it is organised by file as much as possible. This means that read I/Os work best if they are sequential. High-performance database workloads have a high percentage of random read patterns, which can mean reads are slower than they would have been from a non-deduplicated volume. You also need to factor in that the system will need a 'quieter' time, maybe an overnight slot, to run the deduplication task, as this will use system resources (see the scheduling sketch after the list below). You can also choose the type of deduplication that best fits your workload. There are three Usage Types included with Data Deduplication:

  • Default - tuned specifically for general purpose file servers
  • Hyper-V - tuned specifically for VDI servers
  • Backup - tuned specifically for virtualized backup applications
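
As a sketch of how you might carve out that quieter overnight window, the Deduplication module lets you define when the optimization job runs; the schedule name, start time and duration below are purely illustrative.

# Run the optimization job overnight, starting at 23:00 for up to 6 hours
New-DedupSchedule -Name "NightlyOptimization" -Type Optimization -Start 23:00 -DurationHours 6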

Applications whose files are constantly open, or rapidly changing, are not good candidates. Also consider that a deduplicated chunk could contain parts of a hundred or more files, so if you lose that chunk you lose a lot of data. Speak to your Backup and Recovery product vendor and make sure that they 'rehydrate' the deduplicated files completely during the backup.

The good news is that there is a tool to help you with these decisions, called DDPEval. It is installed with the feature in the \Windows\System32\ directory. The command syntax is:

DDPEval.exe {Volume-Path}
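
For example, to estimate the savings on a data volume (the drive letter here is only an illustration):

DDPEval.exe E:\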

Once you decide which drives are suitable for deduplication, and what type to use, you enable it with the command:

Enable-DedupVolume -Volume {Volume-Path} -UsageType {Selected-Usage-Type}
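
As a concrete sketch, assuming drive E: holds VDI virtual disks, you might enable deduplication and later check the savings like this (valid -UsageType values are Default, HyperV and Backup):

# Enable deduplication on E: with the Hyper-V (VDI) tuning
Enable-DedupVolume -Volume "E:" -UsageType HyperV

# Review the space savings once optimization jobs have run
Get-DedupStatus -Volume "E:"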

Compression

You use compression to save disk space, but the act of compressing data to store it, then decompressing it to read it, uses CPU power. It's a trade-off: you need to decide which is more important, saving CPU cycles or saving disk storage. Generally, you would compress files which are not used much and leave very active files uncompressed. Compression could reduce your disk usage by about 60%. Compression does not work well on files that are already compressed, for instance .jpg, .mp3 or .mpg files. Microsoft also does not recommend that you compress files bigger than 30 MB, as the files become fragmented and performance suffers.

So, compression saves some disk space but burns CPU and does not work well for some files. Is compression worthwhile? In my opinion, no, as it is cheap enough to add more disk space to a server. However, if you have an old folder of historical text files that you rarely use, it could be worthwhile to compress it. Another useful reason to compress is if you are sending a few text files by e-mail.

On NTFS volumes, you can compress individual files, folders, or entire drives by right-clicking the object, selecting 'Properties' and then the 'Advanced' button, which opens the Advanced Attributes dialog. Just tick the 'Compress contents to save disk space' check box. When you compress a whole folder, any new files added to that folder are automatically compressed as well.
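
The same NTFS compression can also be applied from the command line with the built-in compact utility, which is handy for scripting; the folder path below is only an example.

# Compress a folder and everything beneath it using NTFS compression
compact /C /S:"D:\Archive\OldReports"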



How do you know if compression is active? In Windows 10, compressed files and folders show two small blue arrows in the top right corner of their icon. Alternatively, open any folder window, choose 'View' then 'Options', and on the View tab tick the box labelled 'Show encrypted or compressed NTFS files in color'.

Older releases of Windows gave you the option to compress old files with the 'Disk Cleanup' utility, but that was removed in Windows 7, which probably indicates how useful Microsoft thinks compression is.

The other useful option is to zip up a folder or a group of files to send by e-mail. To do this, either select the files you want by clicking on them while holding the Ctrl key down, or select an entire folder. Right-click the selection, choose 'Send to', then 'Compressed (zipped) folder'. This will create a new, zipped folder while leaving your original data intact.
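
PowerShell can do the same job without the GUI via the Compress-Archive cmdlet (available from PowerShell 5.0 onwards); the paths here are purely illustrative.

# Zip the contents of a folder into a single archive, leaving the originals intact
Compress-Archive -Path "C:\Reports\*" -DestinationPath "C:\Temp\Reports.zip"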

Remote Differential Compression

If you are working with a very distributed infrastructure, maybe using Distributed File System (DFS), then you could end up transferring a lot of data around your network.
Remote Differential Compression (RDC) is intended to help manage this data transfer over limited-bandwidth networks. If a file is updated, RDC transfers only the changed parts of the file, called deltas, instead of the whole file. Microsoft claims that RDC can reduce bandwidth requirements by as much as 400:1.
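
On Windows Server, RDC ships as a feature that components such as DFS Replication can use. If it is not already present, it can be added with the usual feature cmdlet (RDC is the feature name listed by Get-WindowsFeature on recent Server releases):

# Install the Remote Differential Compression feature
Install-WindowsFeature -Name RDC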
