Windows Storage - Deduplication and Compression

Windows Deduplication

It is advisable to check whether your storage hardware supports hardware-level deduplication before considering the Windows version. Hardware dedup runs on the storage array, so deduplication does not consume CPU cycles on your server.

Data deduplication is an extension of compression. Compression works on individual files, whereas deduplication happens at block level and eliminates duplicate copies of repeating data, whether that is duplicate files or duplicate data within several files. Data deduplication works by splitting data up into small, variable-sized chunks, then comparing these chunks with existing chunks to identify duplicates. If a chunk is a duplicate, it is replaced with a reference so that only a single copy of each chunk is stored.
Windows Deduplication is post-process, so files are initially created at full size and are not deduplicated at once, but are retained at full size for a minimum amount of time before they are processed. Deduplication has a setting called MinimumFileAgeDays that controls how old a file must be before it is processed. The default setting is 5 days. This setting is configurable and can be set to '0' to process files regardless of how old they are. Chunks are stored in container files, and new chunks are appended to the current chunk store container. When its size reaches about 1 GB, that container file is sealed and a new container file is created.
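If you want to check or change the MinimumFileAgeDays setting from PowerShell, a minimal sketch (assuming deduplication is already enabled on an E: volume) would be:

# Show the current minimum file age setting for the volume
Get-DedupVolume -Volume "E:" | Select-Object Volume, MinimumFileAgeDays

# Process files immediately, regardless of age
Set-DedupVolume -Volume "E:" -MinimumFileAgeDays 0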
Deduplicated data chunks are not deleted immediately when a file is deleted. The references (or reparse points) to the deduplicated chunks are deleted, and a garbage collection job runs later to reclaim the space held by obsolete chunks.

Data deduplication is transparent to the end user; the path to a deduplicated file remains the same, and a user who requests a deduplicated file retrieves it in exactly the same manner as if deduplication was not enabled.
Deduplication will try to keep the deduplicated chunks together on disk, so sequential access should not be affected. Deduplication also has its own cache, so if a file is requested repeatedly, it will not be necessary to reconstitute the file for every request.
The downside of deduplication is that if a file chunk is lost, several files can be corrupted, including file backups held elsewhere on the same disk. RAID protection will help reduce this risk, and off-disk backups are essential.

Data Deduplication was introduced in Windows Server 2012, but early versions had some limitations. Volumes above 10 TB were not considered good candidates for deduplication, the very large files that are typical of backup processes were also poor candidates, and the Data Deduplication process used a single thread and a single I/O queue for each volume.

Data Deduplication in Windows Server 2016 supports volume sizes up to 64 TB and files up to 1 TB. The Data Deduplication process now runs multiple threads in parallel, using multiple I/O queues for each volume, which speeds up the post-processing operations.
Data Deduplication is fully supported on Nano Server, and the configuration for Virtualized Backup Applications is simplified, as the requirement for manual tuning of the deduplication settings has been replaced by a predefined Usage Type option.
Starting with Windows Server 2016, the cluster rolling upgrade functionality allows a cluster to run in mixed mode, that is, the Windows OS versions in the cluster are no longer required to be identical. Data Deduplication supports this mixed-mode cluster configuration to enable full data access during a cluster rolling upgrade.

Installing Deduplication

Microsoft provides a tool that you can use to check the savings that deduplication might provide, but to use it you need to install the Data Deduplication component. You can either do this from the Server Manager GUI by expanding the 'File and Storage Services' role and then selecting the 'Data Deduplication' component from within the 'File and iSCSI Services' container, or you can use this PowerShell command:

Install-WindowsFeature -Name FS-Data-Deduplication
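To confirm the feature installed correctly, one possible check (a sketch using the standard Server Manager cmdlet) is:

# The 'Install State' column should show as Installed
Get-WindowsFeature -Name FS-Data-Deduplication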

Once you install the component, start the tool by running DDPEval.exe from C:\Windows\System32 as shown below, then select the drives and paths that you want to check. If you see projected savings of 10% or less, then it is arguably better not to use deduplication, but simply to add more disk space.

DDPEval.exe <VolumePath>

Enabling Deduplication

Before you enable Deduplication, you need to analyse your workloads, as not all are suitable. You will most likely get storage savings, but the performance overhead might be prohibitive. If you are using all-flash disks, then they should be able to cope with any deduplicated data. If you are using spinning disk, then Microsoft recommends that the following workloads should work fine: general purpose file servers (team shares, home folders and work folders), VDI servers, and backup targets such as virtualized backup applications.

For other workloads, try to test deduplication out to see how it works. If you cannot do this, then consider the following:

Because Windows Deduplication is post-process, write operations are, by definition, not affected. Read operations could be affected, as file content is moved into the Chunk Store and then organised by file as much as possible, which means that read I/Os work best if they are sequential. High performance database workloads have a high percentage of random read patterns, which can mean reads are slower than they would have been from a non-deduplicated volume. Also, you need to factor in that the system will need a 'quieter' time, maybe an overnight time slot, to run the deduplication task, as this will use system resources. You can also choose the type of deduplication that best fits your workload. There are three Usage Types included with Data Deduplication: Default, intended for general purpose file servers; Hyper-V, intended for VDI servers; and Backup, intended for virtualized backup applications.

Applications which use files that are constantly open or rapidly changing are not good candidates. Also, consider that a deduplicated chunk could contain parts of a hundred or more files, so if you lose that chunk you lose a lot of data. Speak to your Backup and Recovery product vendor and make sure that they 'rehydrate' deduplicated files completely during the backup.

Now you are almost ready to enable deduplication, but first, make sure that all available Windows Server patches are applied, especially KB4025334. Second, make sure you take a full backup of any disk before enabling deduplication. This should be standard practice for any change, of course, but it is especially important when you are making wholesale changes to your data.
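One quick way to check for that particular patch, as a sketch (Get-HotFix reports an error if the update is not present):

# Look for the KB mentioned above on the local server
Get-HotFix -Id KB4025334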
Once you decide which drives are suitable for deduplication, and what type to use, you enable it with the command:

Enable-DedupVolume -Volume "E:","F:" -UsageType Default

This example shows deduplication enabled on two drives, 'E' and 'F', with the usage type 'Default'.
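Once a volume is enabled, you can check what is happening with the Get-DedupVolume and Get-DedupStatus cmdlets, and you can kick off an optimization job manually rather than waiting for a schedule. A minimal sketch, assuming the E: volume from the example above:

# Confirm deduplication is enabled and show the savings so far
Get-DedupVolume -Volume "E:"
Get-DedupStatus -Volume "E:"

# Run an optimization job immediately instead of waiting for a schedule
Start-DedupJob -Volume "E:" -Type Optimization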

Once you enable deduplication you need to schedule the tasks. Here are two examples. The first one will 'optimize', or run the deduplication task; the second one is a garbage collection job, which cleans up any data chunks that are no longer referenced and recovers the space. Garbage collection uses a lot of CPU, so it needs to be scheduled at a quiet time.

New-DedupSchedule -Name "OverNightOpt" -Type Optimization -Start 23:00 -DurationHours 7 -Days Monday,Tuesday,Wednesday,Thursday,Friday,Saturday -Priority Normal

New-DedupSchedule -Name "SundayGC" -Type GarbageCollection -Start 01:00 -DurationHours 5 -Days Sunday -Priority Normal

There are lots of other deduplication PowerShell commands, and more parameters for the cmdlets above. Check out this link for more details. https://docs.microsoft.com/en-us/powershell/module/deduplication/?view=win10-ps

So, you have enabled deduplication on your F: drive, and performance is suffering. How do you back it out? First use the Disable-DedupVolume cmdlet, which will stop any new data deduplication activity on the selected volumes. However, existing data on the volume will still be deduplicated. To back it out completely, use the command:

Start-DedupJob -Volume "F:" -Type Unoptimization
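The first step, and a way to keep an eye on the unoptimization job while it runs, might look like this sketch for the F: drive. Note that unoptimization needs enough free space on the volume to hold the fully rehydrated data.

# Stop any new deduplication activity on the volume
Disable-DedupVolume -Volume "F:"

# Check the progress of the running unoptimization job
Get-DedupJob -Volume "F:"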

Windows Compression

You use compression to save disk space, but the act of compressing data to store it, then decompressing it to read it, uses CPU power. It's a trade-off; you need to decide which is more important, saving CPU cycles or saving disk storage. Generally, you would compress files which are not used much, and not compress very active files. Compression could reduce your disk usage by about 60%. Compression does not work well on files that are already compressed, for instance .jpg, .mp3 or .mpg files. Microsoft also does not recommend that you compress files bigger than 30 MB, as the files become fragmented and performance suffers.

So, compression saves some disk space but burns your CPU and does not work well for some files. Is compression worthwhile? In my opinion, no, as it is cheap enough to add more disk space to a server. However, if you have an old, historical text folder lying about that you rarely use, it could be worthwhile to compress it. Another useful reason to compress is if you are sending a few text files by e-mail.

On NTFS volumes, you can compress individual files, folders, or entire drives by simply right-clicking on the object, selecting 'Properties' and then the 'Advanced' button. In the 'Advanced Attributes' box that appears, tick the 'Compress contents to save disk space' checkbox. When you compress a whole folder, any new files added to that folder are automatically compressed as well.
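If you prefer the command line, the built-in compact.exe tool does the same job. A small sketch, using a hypothetical archive folder:

# Compress everything in the folder and its subfolders
compact /c /s:"D:\OldReports"

# Query the compression state of the same files
compact /q /s:"D:\OldReports"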



How do you know if compression is active? In Windows 10, compressed files and folders show two small blue arrows in the top right corner of their icon. Alternatively, open any folder window; choose 'View', 'Options'; and on the View tab, check the box labelled 'Show encrypted or compressed NTFS files in color'.

Older releases of Windows gave you the option to compress old files with the 'Disk Cleanup' utility, but that was removed in Windows 7, which probably indicates how useful Microsoft thinks compression is.

The other useful option is being able to zip up a folder or a group of files to send by email. To do this, either select the files you want by clicking on them with your mouse while holding the Ctrl key down, or select an entire folder. Right-click on them, then take the 'Send to' option, then 'Compressed (zipped) folder'. This will create a new, zipped folder while leaving your original data intact.
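The same thing can be scripted with the Compress-Archive cmdlet (PowerShell 5.0 or later); the paths here are just placeholders:

# Zip a set of text files into a single archive, leaving the originals intact
Compress-Archive -Path "C:\Reports\*.txt" -DestinationPath "C:\Reports\Reports.zip"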

Remote Differential Compression

If you are working with a very distributed infrastructure, maybe using Distributed File System, then you could end up transferring a lot of data around your network.
Remote Differential Compression (RDC) is intended to help manage this data transfer over limited-bandwidth networks. If a file is updated, RDC transfers only the changed parts of the file, called deltas, instead of the whole file. Microsoft claims that RDC can reduce bandwidth requirements by as much as 400:1.

If you use DFS Replication on a LAN with speeds of 1 Gbps or faster on Windows Server 2008 R2, the processing overhead of Remote Differential Compression can actually slow replication down, so it is best disabled. To do this in the DFS Management GUI (a PowerShell alternative is sketched after these steps):
Open 'DFS Management'
Click the 'Replication' node
Select the replication group
Select the 'Connections' tab
Right-click the sending member
Choose 'Properties', then the 'General' tab
Unselect 'Use remote differential compression (RDC)'
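If the DFSR PowerShell module is installed (Windows Server 2012 R2 or later), the same change can be scripted; the group and server names below are just placeholders:

# Disable RDC on the connection from the sending member to the receiving member
Set-DfsrConnection -GroupName "BranchDocs" -SourceComputerName "SRV-SEND" -DestinationComputerName "SRV-RECV" -DisableRDC $true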


