Windows ReFS

Microsoft is developing Windows as a server operating system capable of hosting the most demanding applications, and one of the obstacles it faced was NTFS. NTFS is the principal Windows file system, but it has a couple of serious limitations. It cannot easily handle the multi-terabyte drives that are now in common use, and when the file system breaks, it needs a lengthy disk check operation to fix it. Most companies cannot afford to have business-critical applications down for an extended period while disk issues are being fixed.

To fix these issues, Microsoft developed a new file system called ReFS, or 'Resilient File System', which was first introduced with Windows Server 2012. Microsoft has a 'statement of intent' to move to ReFS as the default file system, but there is no timescale for this as yet. The fact that we cannot yet boot from an ReFS system is an immediate show stopper. ReFS was designed to support most of the NTFS features, so it does not need new system APIs and most file system filters will continue to work with ReFS volumes.
The NTFS features that are supported include Access Control Lists, BitLocker encryption, change notifications, file IDs, USN journals, junction points, symbolic links, mount points, reparse points, volume snapshots and oplocks.
Some NTFS features were removed in the initial release of ReFS, then restored in later editions. These include alternate data streams and automatic correction of corruption when integrity streams are used on parity spaces. Alternate data streams were required to allow ReFS to support Microsoft SQL Server.
Some features were dropped and have not been reinstated so far. These are object IDs, short 8.3 filenames, NTFS compression, file-level encryption (EFS), user data transactions, sparse files, hard links, extended attributes, and quotas.
The major issues are that ReFS does not offer data deduplication, does not support SAN-attached volumes, and cannot be used as a Windows boot volume.

Coping with large volumes

So how does ReFS cope with large, multi-terabyte volumes? NTFS uses a file called the MFT (Master File Table) to hold all the metadata about files and directories, and the MFT restricts the practical size that disks can become. ReFS abandoned the MFT and borrowed structures called B+ trees from relational database architecture. B+ trees are very flexible: a tree can be very large and multi-level, or really compact with just a few keys, and trees can be embedded within other trees. This makes it possible to build massive file systems on very large disks without a significant performance overhead.

Metadata and file data are organized into tables, similar to a relational database. A disk consists of a table of directories, where the metadata for each directory is itself a table. Files correspond to rows within each directory table, and each file's metadata is in turn a table whose rows describe the file's attributes.
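
The 'tables within tables' idea can be pictured with a small in-memory model. This is only a sketch: plain Python dictionaries stand in for the on-disk B+ trees, and the directory names, file names, attribute names and the lookup helper are all made up for illustration.

```python
# Hypothetical model: every level is a key -> value table.
# Real ReFS stores these tables as B+ trees on disk; dicts are a stand-in.
volume = {
    "reports": {                                          # row in the table of directories
        "q1.xlsx": {"size": 48213, "read_only": False},   # a file is a table of attributes
        "q2.xlsx": {"size": 51090, "read_only": True},
    },
    "logs": {},                                           # an empty directory table
}


def lookup(table, *keys):
    """Walk the nested tables one key at a time, like descending embedded trees."""
    for key in keys:
        table = table[key]
    return table


size = lookup(volume, "reports", "q2.xlsx", "size")
print(size)
```

The same lookup routine works at every level because a directory, a file and an attribute set all have the same shape, which is the property the table design relies on.
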
Free space is managed by a hierarchical allocator which includes three separate tables for large, medium, and small chunks. Splitting the allocations like this improves performance as related metadata is naturally collocated.

ReFS still has theoretical capacity limits, as shown below. Practical hardware limitations mean that actual capacity limits are lower than these.

Maximum size of a single file - 2^64-1 bytes = 16 EiB (about 18.4 exabytes)
Maximum size of a single volume - 2^78 bytes with 16KB cluster size = 256 ZiB (about 302 zettabytes)
Maximum number of files in a directory - 2^64
Maximum number of directories in a volume - 2^64
Maximum file name length - 255 Unicode characters
Maximum path length - 32K
Maximum size of any storage pool - 4 PB
Maximum number of storage pools in a system - No limit
Maximum number of spaces in a storage pool - No limit
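
The file and volume size figures can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 EiB = 2^60 bytes, 1 ZiB = 2^70 bytes) and assumes that the volume limit comes from 2^64 clusters of 16 KB each; the variable names are illustrative only.

```python
# Sanity check of the theoretical ReFS size limits, in binary units.
EIB = 2**60  # bytes in one exbibyte
ZIB = 2**70  # bytes in one zebibyte

max_file_bytes = 2**64 - 1                 # largest single file
cluster_bytes = 16 * 1024                  # 16 KB cluster size
max_volume_bytes = 2**64 * cluster_bytes   # assumption: 2^64 clusters per volume

print(f"max file   ~ {max_file_bytes / EIB:.0f} EiB")
print(f"max volume = 2^{max_volume_bytes.bit_length() - 1} bytes "
      f"= {max_volume_bytes // ZIB} ZiB")
```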

Storage Spaces and ReFS

Microsoft intended that ReFS would be used in conjunction with Storage Spaces, as this would combine the data mirroring functionality of Storage Spaces with the data checking facilities of ReFS, making data doubly secure. Storage Spaces is a virtualisation product that combines physical disks into storage pools, then lets you create virtual volumes from those pools. The virtual volumes can be mirrored and the data on them striped.

Because Storage Spaces maintains copies of data on multiple disks, if it cannot read data from one disk due to a hardware failure then it can read it from an alternate disk. If Storage Spaces encounters a write failure then it can reallocate data transparently. However, if the data itself is corrupt then Storage Spaces will not detect it. This is where ReFS comes in, as it can detect corruption using checksums. Once ReFS detects such a failure, it works with Storage Spaces to read all available copies of the data and chooses the correct one based on checksum validation. It then tells Storage Spaces to fix the bad copies from the good copy. All of this happens transparently from the point of view of the application. If ReFS is not running on top of a mirrored Storage Space, then it has no means to repair the corruption automatically. In that case it simply logs an event indicating that corruption was detected, and fails the read if it is for file data.
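
The read-and-repair behaviour described above can be sketched in a few lines. This is a toy model, not the ReFS implementation: CRC-32 from Python's zlib module is used as a stand-in checksum, and the function names and the list of bytearrays that plays the role of the mirror copies are assumptions made for illustration.

```python
import zlib


def checksum(data) -> int:
    """Stand-in checksum (CRC-32); not the algorithm ReFS uses on disk."""
    return zlib.crc32(data)


def read_with_repair(copies, expected):
    """Return data matching the expected checksum and fix any bad copies.

    Raises OSError if no copy validates, which models 'fail the read'.
    """
    good = next((bytes(c) for c in copies if checksum(c) == expected), None)
    if good is None:
        raise OSError("corruption detected and no valid copy available")
    for c in copies:
        if checksum(c) != expected:
            c[:] = good  # overwrite the bad mirror copy with the good data
    return good


# Usage: two mirror copies, one of them silently corrupted.
original = b"payroll records"
stored_checksum = checksum(original)
mirrors = [bytearray(b"payroll rec0rds"), bytearray(original)]

data = read_with_repair(mirrors, stored_checksum)
print(data, [bytes(m) for m in mirrors])
```

With only a single, corrupted copy the same function raises an error instead of returning data, which corresponds to the unmirrored case where the read is failed.
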

Other stuff

ReFS uses VDL (valid data length) to speed up initialising large volumes.

There is no way to convert data in place from NTFS to ReFS. Data must be copied from one file system to the other.

Failover clustering is supported, whereby individual volumes can fail over across machines. In addition, shared storage pools in a cluster are supported.

ReFS and VSS work together to provide snapshots in a manner consistent with NTFS in Windows environments, but writable snapshots or snapshots larger than 64TB are not supported yet.

Because ReFS was designed not to fail, no system tools are provided to repair it if a failure does occur. ReclaiMe is a popular third-party tool that can recover data from damaged ReFS volumes. It exploits the fact that ReFS uses copy-on-write when updating data. Copy-on-write does not update the original entry; instead it writes the original data block, together with the updates, to a new location. When the copy operation is complete, the internal file system links are modified to point to the new copy of the data, not the old one. NTFS updates the old block in place, so if there is a power failure while an NTFS write is in progress, the update is incomplete and the data is corrupt. ReFS is not supposed to suffer from this kind of corruption because it uses copy-on-write, but there are many documented cases of corruption happening.
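
The difference between updating in place and copy-on-write can be shown with a toy simulation. Everything here is hypothetical: blocks are bytearrays in a dictionary, the pointer table maps a file name to a block number, and a 'power failure' is an exception raised part-way through a write. It is a sketch of the general technique, not of the ReFS or NTFS on-disk logic.

```python
class CrashDuringWrite(Exception):
    """Simulated power failure."""


def write_bytes(target, new, crash_after=None):
    """Copy `new` into `target` byte by byte, optionally 'losing power' part-way."""
    for i, b in enumerate(new):
        if crash_after is not None and i == crash_after:
            raise CrashDuringWrite
        target[i] = b


def update_in_place(blocks, pointers, name, new, crash_after=None):
    # Overwrites the live block directly: a crash leaves a mix of old and new bytes.
    write_bytes(blocks[pointers[name]], new, crash_after)


def update_copy_on_write(blocks, pointers, name, new, crash_after=None):
    # Writes to a fresh block first; the pointer moves only after the write completes.
    new_id = max(blocks) + 1
    blocks[new_id] = bytearray(len(new))
    write_bytes(blocks[new_id], new, crash_after)
    pointers[name] = new_id  # final step that makes the new data visible


def run(update):
    """Apply an update that crashes after two bytes, then read the file back."""
    blocks = {0: bytearray(b"AAAA")}
    pointers = {"file": 0}
    try:
        update(blocks, pointers, "file", b"BBBB", crash_after=2)
    except CrashDuringWrite:
        pass
    return bytes(blocks[pointers["file"]])


in_place_result = run(update_in_place)    # torn: part old data, part new
cow_result = run(update_copy_on_write)    # still the old, consistent data
print(in_place_result, cow_result)
```

After the simulated crash the copy-on-write version leaves an unused, half-written block behind, and the file still points at the old block. Leftover old versions of this kind are what a recovery tool can look for.
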

Ensuring Data Integrity

Metadata is 'data about data' and is used to describe disks, directories and files. As such it is vital that the metadata does not get corrupted. ReFS uses a number of techniques to make sure the metadata stays valid, including independently stored 64-bit checksums and ensuring that metadata is not written in place to avoid the possibility of 'torn writes'.
All ReFS metadata is checksummed at the level of a B+ tree page, and the checksum is stored independently from the page itself. This allows ReFS to detect all forms of disk corruption, including lost writes, misdirected writes and bit rot (the gradual degradation of data on the media).
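
Storing the checksum away from the page matters because a stale page left behind by a lost write is still internally consistent, so a checksum kept inside the page would not reveal the problem. The sketch below illustrates this with a parent that keeps the page number and checksum of its child. The dictionary used as a 'disk', the function names and the use of CRC-32 are all assumptions for illustration.

```python
import zlib

disk = {}  # page number -> page contents (toy stand-in for the media)


def write_page(page_no, data):
    """Write a page and return its checksum, to be kept by the parent page."""
    disk[page_no] = data
    return zlib.crc32(data)


def read_page(page_no, expected):
    """Read a page and verify it against the checksum held outside the page."""
    data = disk[page_no]
    if zlib.crc32(data) != expected:
        raise OSError(f"checksum mismatch on page {page_no}")
    return data


# The parent records (page number, checksum) for its child page.
child_ref = (7, write_page(7, b"version 1"))

# Lost write: the parent now expects version 2, but the disk silently kept version 1.
child_ref = (7, zlib.crc32(b"version 2"))

try:
    read_page(*child_ref)
    lost_write_detected = False
except OSError:
    lost_write_detected = True

print(lost_write_detected)
```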

The same technique can be used for file data, but this is optional. This feature is called 'integrity streams', and when it is enabled ReFS always writes file changes to a location different from the original one. This allocate-on-write technique ensures that pre-existing data is not lost due to the new write. The action of writing the update and writing the checksum is atomic, that is, they must both be completed as a single transaction. An important result of this is that if a file does get corrupted, say by a power failure during a write, then that file can be deleted, then either restored from backup or re-created.
Older versions of NTFS could not open or delete corrupted files, so the only resolution for a corrupt file was to run chkdsk against the whole volume. The ReFS solution ensures that if a single file does get corrupted, then access to the rest of the good data is not affected. This is especially important as volume sizes get ever larger and volume checks take longer and longer to run.