Storage Spaces Direct

Overview

Storage Spaces Direct is the next step in the evolution from Storage Spaces in Windows Server 2012 and the Virtual Disk Service before Windows 8. It has evolved into a lot more than just software RAID. The basic idea is that you have a cluster of servers connected together with a Software Storage Bus. Each server has a mixture of SSD and HDD storage attached, but every server can see all the storage via the Storage Bus, so the storage is held in common. Data is duplicated over the cluster network, either as simple mirrors or in a RAID-like parity format for resilience, so the cluster can lose at least one server or one disk without losing access to data. Performance is achieved and sustained by spreading workloads over the servers, with caching and tiering software keeping current data on the faster storage.

Storage Spaces Direct Components

Clusters, Servers and Network

Storage Spaces Direct uses one or more clusters of Windows servers to host the storage. As is standard in a Windows cluster, if one node in the cluster fails, then all processing is swapped over to another node in the cluster. The individual servers communicate over Ethernet using the SMB3 protocol, including SMB Direct and SMB Multichannel. Microsoft recommends that you use 10 GbE or faster networking with remote direct memory access (RDMA), which gives one server direct access to the memory of another server without involving either server's operating system.
The individual servers use the ReFS filesystem, as it is optimised for virtualization. ReFS also lets Storage Spaces automatically move data in real time between faster and slower storage, based on how the data is currently being used.
The storage on the individual servers is pulled together with the Cluster Shared Volumes file system, which unifies all the ReFS volumes into a single namespace. This means that every server in the cluster can access every ReFS volume in the cluster, as though it were mounted locally.

Scale-Out File Server

If you are using a converged deployment, then you need a Scale-Out File Server layer to provide remote file access, using the SMB3 protocol, to clients such as another cluster running Hyper-V, over the network.

Storage

The physical storage is attached to, and distributed over, all the servers in the cluster. Each server must have at least 2 NVMe-attached solid-state drives and at least 4 slower drives, which can be SSDs or spinning drives with SATA or SAS connections, but these must sit behind a host-bus adapter (HBA) and SAS expander. All these physical drives are presented as JBOD, in non-RAID format. All the drives are collected together into a storage pool, which is created automatically as the correct types of drive are discovered. Microsoft recommends that you take the default settings and just have one storage pool per cluster.

Although the physical disks are not configured in any RAID format, Storage Spaces itself provides fault tolerance by duplicating data between the different servers in the cluster in a similar way to RAID. The different duplication options are 'mirroring' and 'parity'.

Mirroring is similar to RAID-1, with complete copies of the data stored on different drives that are hosted on different servers. It can be implemented as 2-way or 3-way mirroring, which requires twice or three times as much physical hardware to store the data. However, the data is not simply replicated onto another server. Storage Spaces splits the data up into 256 MB 'slabs', then writes 2 or 3 copies of each slab to different disks on different servers. A large file in a 2-way mirror is not written to just 2 volumes, but is spread over every volume in the pool, with each pair of 'mirrored' slabs sitting on separate disks hosted on separate servers. The advantage of this is that a large file can be read in parallel from multiple volumes, and if one volume is lost, it can be quickly reconstructed by reading and rebuilding the missing data from all the other volumes in the cluster.
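To make the slab placement idea more concrete, the short Python sketch below splits a file into 256 MB slabs and places each copy on a different server in a simple round-robin fashion. The function and server names are invented for illustration; the real Storage Spaces allocator is far more sophisticated than this toy model.

    # Toy model of 2-way mirrored slab placement (illustration only,
    # not the real Storage Spaces allocation algorithm).

    SLAB_SIZE = 256 * 1024 * 1024  # 256 MB slabs, as described above

    def place_slabs(file_size_bytes, servers, copies=2):
        """Split a file into slabs and place each copy on a different server."""
        slab_count = -(-file_size_bytes // SLAB_SIZE)   # ceiling division
        placement = []
        for slab in range(slab_count):
            # Round-robin the copies so no two copies of a slab share a server.
            targets = [servers[(slab + c) % len(servers)] for c in range(copies)]
            placement.append((slab, targets))
        return placement

    # A 1 GB file spread over a four-server pool as a 2-way mirror:
    for slab, targets in place_slabs(1024 * 1024 * 1024, ["S1", "S2", "S3", "S4"]):
        print(f"slab {slab}: copies on {targets}")
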
Storage Spaces mirroring does not use dedicated or 'hot' spare drives to rebuild a failed drive. Because the capacity is spread over all the drives in the pool, the spare capacity for a rebuild must also be spread over all the drives. If you are using 2 TB drives, then you have to maintain at least 2 TB of spare capacity in your pool so that a rebuild can take place.

A Parity configuration comes in two flavours, Single Parity and Dual Parity, which can be considered equivalent to RAID-5 and RAID-6. You need some expertise in maths to fully understand how these work, but in simple terms, for single parity, the data is split up into chunks and then some chunks are combined together to create a parity chunk. All these chunks are then written out to different disks. If you then lose one chunk, it can be recreated by manipulating the remaining chunks and the parity chunk.
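As a simplified illustration of the single parity idea, the Python sketch below combines two data chunks with XOR to create a parity chunk, then rebuilds a lost chunk from the surviving chunk and the parity. The real erasure coding used by Storage Spaces is more elaborate, but the recovery principle is the same.

    # Simplified single-parity demonstration using XOR (illustration only).

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    chunk_a = b"AAAAAAAA"                  # first data chunk
    chunk_b = b"BBBBBBBB"                  # second data chunk
    parity = xor_bytes(chunk_a, chunk_b)   # parity chunk, written to a third disk

    # Suppose the disk holding chunk_b fails; rebuild it from the survivors.
    rebuilt_b = xor_bytes(chunk_a, parity)
    assert rebuilt_b == chunk_b
    print("chunk_b recovered:", rebuilt_b)
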
Single parity can only tolerate one failure at a time and needs at least three servers with associated disks (each server with its disks is a hardware fault domain). Three-way mirroring also needs three servers, but it tolerates two failures at once and performs better, so while single parity is supported, it would usually be better to use three-way mirroring.
Dual parity can recover from up to two failures at once, but with better storage efficiency than a three-way mirror. It needs at least four servers, and with 4 servers you just need to double up on the amount of allocated storage. So you get the resilience benefit of three-way mirroring for the storage overhead of two-way mirroring. The minimum storage efficiency of dual parity is 50%, so to store 2 TB of data, you need 4 TB of physical storage capacity. However, as you add more hardware fault domains, or servers with storage, the storage efficiency increases, up to a maximum of 80%. For example, with seven servers the storage efficiency is 66.7%, so to store 4 TB of data you need just 6 TB of physical storage capacity.
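The physical capacity required follows directly from the storage efficiency figure, as this quick Python check of the two worked examples above shows (the function name is just for illustration):

    # Physical capacity required = usable data / storage efficiency.

    def raw_capacity_tb(usable_tb, efficiency):
        return usable_tb / efficiency

    print(raw_capacity_tb(2, 0.50))    # 4 servers, 50% efficient   -> 4.0 TB
    print(raw_capacity_tb(4, 2 / 3))   # 7 servers, 66.7% efficient -> 6.0 TB
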
Storage Spaces Direct also introduced an advanced technique called 'local reconstruction codes' (LRC). For large disks, dual parity uses LRC to split its encoding and decoding into a few smaller groups, which reduces the overhead required to make writes or to recover from failures.
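The Python sketch below illustrates the local reconstruction idea in a heavily simplified form: the data chunks are split into small groups, each with its own local XOR parity, so a single lost chunk can be rebuilt by reading only its own group rather than the whole stripe. The real LRC scheme also keeps global parities and uses proper erasure coding, so treat this purely as a conceptual model.

    # Conceptual sketch of local reconstruction codes (illustration only):
    # each small group of data chunks gets a local parity, so one lost chunk
    # is rebuilt from its own group alone.

    from functools import reduce

    def xor_all(chunks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

    data = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4, b"\x44" * 4]
    groups = [data[0:2], data[2:4]]                 # two local groups
    local_parity = [xor_all(g) for g in groups]     # one local parity per group

    # Lose data[3]; only its own group (data[2] plus the group parity) is read.
    rebuilt = bytes(x ^ y for x, y in zip(data[2], local_parity[1]))
    assert rebuilt == data[3]
    print("rebuilt from local group only:", rebuilt.hex())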

The final piece in the Storage jigsaw is the Software Storage Bus. This is a software-defined storage fabric that connects all the servers together so they can see all of each other's local drives, a bit like a Software SAN. The Software Storage Bus is essential for caching, as described next.

Cache

What Microsoft calls a server-side cache is essentially a top-most disk tier, usually consisting of NVMe-connected SSD drives. When you enable Storage Spaces Direct, it goes out and discovers all the available drives, then automatically selects the fastest drives as the 'cache' or top tier. The lower tier is called the 'capacity' tier. Caching has a storage overhead which will reduce your usable capacity.
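The Python sketch below is a rough model of that automatic selection, assuming a simple ranking of bus types: the fastest drive type present is claimed as the cache tier and everything else becomes capacity, with no cache created when only one drive type is found. The drive names, ranking table and function are invented for illustration; the real discovery logic is internal to Storage Spaces Direct.

    # Rough model of automatic cache/capacity selection (illustration only).
    # The fastest bus type present becomes the cache tier; the rest is capacity.

    BUS_SPEED_RANK = {"NVMe": 0, "SATA/SAS SSD": 1, "HDD": 2}   # assumed ordering

    def split_tiers(drives):
        """drives: list of (name, bus_type). Returns (cache, capacity) lists."""
        bus_types = {bus for _, bus in drives}
        if len(bus_types) == 1:
            return [], list(drives)          # single drive type: no automatic cache
        fastest = min(bus_types, key=lambda bus: BUS_SPEED_RANK[bus])
        cache = [d for d in drives if d[1] == fastest]
        capacity = [d for d in drives if d[1] != fastest]
        return cache, capacity

    drives = [("pd0", "NVMe"), ("pd1", "NVMe"), ("pd2", "HDD"), ("pd3", "HDD")]
    cache, capacity = split_tiers(drives)
    print("cache:", cache)
    print("capacity:", capacity)
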
The different drive type options are:

  • All NVMe SSD; the best for performance, and if the drives are all NVMe, then there is no cache. NVMe is a fast SSD protocol where the drives are attached directly to the PCIe bus
  • NVMe + SSD; the NVMe drives are used as cache, and the SSD drives as capacity. Writes are staged to the cache, and reads are served from the SSDs unless the data has not been de-staged yet
  • All SAS/SATA-attached SSD; there is no automatically configured cache, but you can decide to configure one manually. If you run without a cache, then you get more usable capacity
  • NVMe + HDD; both reads and writes are cached for performance, and data is de-staged to the HDD capacity drives as it ages
  • SSD + HDD; as above, both reads and writes are cached for performance. If you have a requirement for large-capacity archive data, then you can use this option with a small number of SSDs and a lot of HDDs. This gives you adequate performance at a reasonable price.

When Storage Spaces Direct de-stages data, it uses an algorithm to de-randomize the data, so that the IO pattern looks sequential even if the original writes were random. The idea is that this improves the write performance of the HDDs.
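A minimal Python sketch of that idea, with invented names: writes accumulate in the cache in arrival order and are sorted by target offset when they are de-staged, so the capacity HDDs see a largely sequential stream. The actual de-stage algorithm is internal to Storage Spaces Direct; this only illustrates the concept.

    # Conceptual sketch: random writes buffered in cache are sorted by offset
    # before de-staging, so the HDDs see a near-sequential write pattern.

    import random

    cached_writes = [(random.randrange(0, 10_000) * 4096, b"data") for _ in range(8)]

    def destage(writes):
        # Sort buffered writes by their target offset before flushing to HDD.
        for offset, payload in sorted(writes, key=lambda w: w[0]):
            print(f"write {len(payload)} bytes at offset {offset}")

    destage(cached_writes)
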
It is possible to have a configuration with all three types of drive: NVMe, SSD and HDD. If you implement this, then the NVMe drives become a cache for both the SSDs and the HDDs. The system will only cache writes for the SSDs, but will cache both reads and writes for the HDDs.
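The caching behaviour with all three drive types can be summarised as a small policy table; the sketch below is an assumed illustration of the rules described above, not an actual API.

    # Cache policy when NVMe, SSD and HDD are all present, as described above:
    # NVMe caches writes for the SSD tier, and both reads and writes for the HDD tier.

    CACHE_POLICY = {
        "SSD": {"read": False, "write": True},
        "HDD": {"read": True, "write": True},
    }

    def is_cached(capacity_tier, operation):
        return CACHE_POLICY[capacity_tier][operation]

    print(is_cached("SSD", "read"))    # False: reads served directly from the SSDs
    print(is_cached("HDD", "read"))    # True:  reads cached on the NVMe tier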

Deployment options

There are two different ways to implement Storage Spaces Direct, called 'Converged' and 'Hyper-Converged'. If you are not keen on those names, then you could call the Converged option 'Disaggregated' instead.
Storage Spaces Direct uses a lot of file servers to host the physical disks. If you work in an SME, it can be quite an overhead to dedicate those servers to just managing disks. If you work for an enterprise business, or as a service provider, it is a good idea to run your storage and application (or 'compute') servers in separate clusters, as the two different workloads can be scaled independently.

So the Converged deployment option means running the Storage and Compute servers in separate clusters. This needs an extra Scale-Out File Server (SoFS) layer to sit on top of Storage Spaces Direct, to provide network-attached storage over SMB3 file shares.
The Hyper-Converged option just uses a single cluster for compute and storage, and runs applications like Hyper-V virtual machines or SQL Server databases directly on the servers providing the storage.
