Storage Virtualization

Storage Virtualization is pretty much mainstream now. Several products were touted by different manufacturers, only to disappear after a few years, but some seem to be well established. This SAN virtualization section discusses IBM's SVC, EMC's VPLEX and VMware's Virtual SAN. This page looks at some of the promises that virtual SANs were going to deliver; check out the product pages to see what actually arrived.

Virtualization Defined

Few organisations or vendors can agree on a definition of Storage Virtualization.

SNIA came up with the following: "The act of integrating one or more (back end) services or functions with additional (front end) functionality for the purpose of providing useful abstractions. Typically virtualization hides some of the back end complexity, or adds or integrates new functionality with existing back end services."
The Enterprise Strategy Group defines Virtualization as "a technology that gathers data location information from physical storage devices, network services and applications, and then abstracts the locations into logical views for end users".
I like the VMware definition: "Virtualization is an abstraction layer that decouples the physical hardware from the operating system to deliver greater IT resource utilization and flexibility. Virtualization allows multiple virtual machines, with heterogeneous operating systems (e.g., Windows Server and Linux) and applications to run in isolation, side-by-side on the same physical machine."

If we take these definitions together, combine them with others, and don't worry too much about rigorous formality, Virtualization can be summarised in the following points.

  • Storage Virtualization simplifies the physical implementation of different devices by standardising them into one logical view (see the sketch after this list).
  • Virtualization divorces the storage from the server and lets the server concentrate on processing application requests.
  • Virtualization removes the requirement for technicians to have a good knowledge of each server platform to be able to manage and configure physical storage.
  • Virtualization adds value to a physical implementation by providing extra functionality.
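
As a rough illustration of that first point, here is a minimal Python sketch of the logical-to-physical mapping idea: one standard logical volume presented to the host, backed by devices from different vendors. All of the class and device names are hypothetical, invented purely for this example.

    class PhysicalDevice:
        def __init__(self, name, vendor, capacity_gb):
            self.name = name
            self.vendor = vendor
            self.capacity_gb = capacity_gb

    class LogicalVolume:
        """Presents one standard view; hides which vendor's box holds the data."""
        def __init__(self, name, extents):
            # extents: list of (device, offset_gb, size_gb) tuples
            self.name = name
            self.extents = extents

        def describe(self):
            total = sum(size for _, _, size in self.extents)
            return f"{self.name}: {total} GB (backing devices hidden from the host)"

    flash = PhysicalDevice("array-A", "Vendor X", 500)
    sata = PhysicalDevice("array-B", "Vendor Y", 4000)

    # One logical volume spanning two different vendors' hardware
    vol = LogicalVolume("LUN-01", [(flash, 0, 100), (sata, 0, 400)])
    print(vol.describe())   # the host sees only 'LUN-01: 500 GB ...'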

The advantages of Virtualization

Common tool set

Storage software can now cost more than the hardware it supports. Each storage vendor has its own brand of software, which requires a different skill set to manage. Virtualization permits centralised and consistent management of all volumes within a data centre using the same methods and products, no matter what the platform. This reduces staff costs, as the process of managing and relocating data is standardised. Virtualization also cuts the cost of replication software, as one set of software tools manages all storage.

Data placement within storage tiers

Tiered storage can be defined as a set of storage pools with different performance and availability characteristics, the most important point being that each tier has a different cost. In the past, most sites adopted a 'one size fits all' approach, and either used expensive storage for all their data or went for a middle-ground compromise. This data placement function has now moved into the storage subsystem, for example with IBM's Easy Tier. Very active data is held on flash drives for fast access, and moved to spinning disk if it becomes less active.
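
The tiering decision itself is conceptually simple. The following Python sketch shows one plausible promote/demote policy based on recent IO activity; the thresholds, names and data structure are invented for illustration, and real products use far richer heuristics.

    FLASH, SPINNING = "flash", "spinning"
    PROMOTE_IOPS, DEMOTE_IOPS = 100, 10   # hypothetical activity thresholds

    def retier(extents):
        """extents: list of dicts with 'id', 'tier' and recent 'iops'."""
        for ext in extents:
            if ext["tier"] == SPINNING and ext["iops"] >= PROMOTE_IOPS:
                ext["tier"] = FLASH      # hot data moves up for fast access
            elif ext["tier"] == FLASH and ext["iops"] <= DEMOTE_IOPS:
                ext["tier"] = SPINNING   # idle data moves to the cheap tier
        return extents

    extents = [
        {"id": 1, "tier": SPINNING, "iops": 250},  # busy: will be promoted
        {"id": 2, "tier": FLASH, "iops": 2},       # idle: will be demoted
    ]
    for ext in retier(extents):
        print(ext["id"], "->", ext["tier"])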

Data Replication

Data replication involves taking an instant copy of a disk, a file space, or maybe a file. The benefits are a combination of fast backup, to eliminate the backup window, and a point-in-time disk copy for fast recovery from virus attacks and rapid creation of test data. It has been possible to do this for some time with host and subsystem based virtualization systems, but only to disks of the same type. The Replication section has more details.
Virtualization combines this function with tiered storage and enables replication from flash drives to spinning disk. The financial benefits of using cheaper storage tiers for these applications can be considerable.
Cross-site replication has always required two disk subsystems of similar type and cost. Virtualization permits cross-site replication to unlike, and potentially much cheaper, devices.
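
One common way to take an 'instant' copy is copy-on-write: the replica exists immediately, and old blocks are only copied aside when the live volume overwrites them. The Python sketch below is a deliberately simplified, hypothetical illustration of that mechanism.

    class Volume:
        def __init__(self, blocks):
            self.blocks = dict(blocks)   # block number -> data

    class Snapshot:
        """Freezes the volume's state; only changed blocks are copied later."""
        def __init__(self, source):
            self.source = source
            self.frozen = {}             # old blocks saved before overwrite

        def read(self, blk):
            return self.frozen.get(blk, self.source.blocks.get(blk))

    def write(volume, snapshot, blk, data):
        # Copy the old block into the snapshot before the first overwrite
        if blk not in snapshot.frozen and blk in volume.blocks:
            snapshot.frozen[blk] = volume.blocks[blk]
        volume.blocks[blk] = data

    vol = Volume({0: "alpha", 1: "beta"})
    snap = Snapshot(vol)                # 'instant' copy: no data moved yet
    write(vol, snap, 0, "gamma")        # the live volume changes...
    print(vol.blocks[0], snap.read(0))  # gamma alpha - snapshot keeps old data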

Data Migration

Data migration is similar in principle to replication, except that the data is moved between devices, not copied. Because the file spaces seen by applications are virtualized, then mapped to a physical implementation, that physical implementation can be changed by copying the data and then adjusting the mapping pointers. If you need to free up older devices for disposal, the data can be moved off them transparently, with no need to stop the applications. Also, if a file space has a performance problem, it can be moved to a quieter disk without stopping applications, a process known as 'hot file reallocation'. The benefits are reduced application downtime and fewer requirements to work unsocial hours. The VMware vMotion function is an extension of this: because the server is also virtualised, the whole server can be migrated to a new location.
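
The 'copy the data then adjust the mapping pointers' step can be sketched as follows. This hypothetical Python example copies the data to the new device, then switches the virtual LUN's backing pointer; a real implementation would also have to track writes that arrive during the copy.

    class Device:
        def __init__(self, name):
            self.name, self.blocks = name, {}
        def read(self, block):
            return self.blocks.get(block)

    class VirtualLUN:
        def __init__(self, name, backing):
            self.name = name
            self.backing = backing        # current physical device mapping
        def read(self, block):
            return self.backing.read(block)

    def migrate(lun, new_device):
        # 1. Copy all data to the new device (applications keep running)
        new_device.blocks = dict(lun.backing.blocks)
        # 2. Switch the mapping pointer - no application outage
        lun.backing = new_device

    old, new = Device("old-array"), Device("new-array")
    old.blocks = {0: "data"}
    lun = VirtualLUN("LUN-07", old)
    migrate(lun, new)
    print(lun.read(0), "now served from", lun.backing.name)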

Capacity management

The process of managing LUNs, volumes and files in the Open Systems area is still a major issue. If you need to change a LUN size or add new servers, a lot of complicated manual effort is required, and the processes differ from platform to platform, which makes the issue worse. To make the effort more manageable, storage managers often make LUNs big enough to cope with several months of growth, and that can be expensive. The physical/logical separation that virtualization introduces means you can expand, define and delete LUNs without affecting the rest of the system, and use free space more efficiently.
Virtualization facilitates more efficient capacity management, as it allows you to share physical disk capacity between different server platforms. It means the end of the days when you had gigabytes of free space on Windows while your Unix systems were struggling to find space. Virtualization also makes capacity management easier, as it can automatically expand disks, file systems and databases when they hit a space threshold.
Another benefit of virtualization is that the physical storage implementation is separated from the logical view at the file servers. This allows you to mix and match many kinds of physical storage, from several vendors, while hiding the detail from the application servers.
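
As a hypothetical illustration of the shared-pool and auto-expand ideas above, the Python sketch below lets thin LUNs for different platforms draw from one physical pool and grow automatically when they cross a space threshold. The figures and the 80% policy are invented for this example.

    class StoragePool:
        def __init__(self, free_gb):
            self.free_gb = free_gb

        def allocate(self, gb):
            if gb > self.free_gb:
                raise RuntimeError("pool exhausted - add physical capacity")
            self.free_gb -= gb
            return gb

    class ThinLUN:
        GROW_THRESHOLD = 0.8   # expand when 80% full (hypothetical policy)

        def __init__(self, pool, size_gb):
            self.pool = pool
            self.size_gb = pool.allocate(size_gb)
            self.used_gb = 0

        def write(self, gb):
            self.used_gb += gb
            if self.used_gb > self.size_gb * self.GROW_THRESHOLD:
                self.size_gb += self.pool.allocate(10)  # grow in 10 GB steps

    pool = StoragePool(free_gb=1000)          # one pool shared by all platforms
    windows_lun = ThinLUN(pool, 100)
    unix_lun = ThinLUN(pool, 100)
    windows_lun.write(85)                     # crosses threshold, auto-expands
    print(windows_lun.size_gb, pool.free_gb)  # 110 790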

The three types of Virtualization

While there are lots of different virtualization products around, they are all variants of three basic architectures: host based, controller based and network based. Network based virtualization itself has variants and combinations: in-band and out-band, and switch based and server based. Some of these architectures are mutually exclusive and some are complementary. The following summaries may help clear up the confusion.

Host-based Virtualization

Host based virtualization has been around for years and involves splitting up a physical volume into virtual disks or LUNs using volume manager type software. Examples are Logical Volume Manager for UNIX, Veritas Storage Foundation for Windows, and VMware. The function of the LVM is to intercept IO requests from applications, work out which physical storage subsystem they should be directed to, then translate the IOs into a format that makes sense to those physical boxes. As this virtualization runs on the host, it consumes host CPU. VMware's Virtual SAN is an example of host based virtualization, as the virtualization runs in the hypervisors.
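
The interception and translation an LVM performs can be sketched very simply: a virtual block address is resolved through a mapping table to a physical subsystem and a physical block address. The mapping, extent size and array names in this Python sketch are all hypothetical.

    EXTENT_SIZE = 1024   # blocks per extent (illustrative figure)

    # Virtual extent number -> (physical subsystem, physical extent number)
    mapping = {
        0: ("array-A", 7),
        1: ("array-B", 42),
    }

    def translate(virtual_block):
        """Turn a virtual block address into (subsystem, physical block)."""
        extent, offset = divmod(virtual_block, EXTENT_SIZE)
        subsystem, phys_extent = mapping[extent]
        return subsystem, phys_extent * EXTENT_SIZE + offset

    print(translate(10))      # ('array-A', 7178)
    print(translate(1500))    # ('array-B', 43484)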

Controller-based Virtualization

Application servers usually write data out to file spaces. Controller-based virtualization creates virtual images of those file spaces in the storage subsystem and maps them to pools of physical disks.
Virtualization in the controllers or storage subsystems began with cached and RAID storage controllers. Neither of these simplified storage subsystems, but they both added value: faster responses and increased resilience.
We can now install very large storage controllers with hundreds of terabytes of internal capacity and petabytes of external storage connection. The application servers are connected through a SAN to the virtualization controller, which hosts the top-tier storage directly; third party mid and low tier disks can then be attached to it. This is an in-band solution with the architecture scaled to a single subsystem managing petabytes of storage. Even though it will have lots of redundant internal components, the subsystem is a Single Point Of Failure, and it might need lots of channels for performance.
The main problem with controller based virtualization is that it is difficult to share an FC SAN between different controllers, especially from different vendors. Controller based virtualization will pretty much lock you into one vendor.

Network-based Virtualization

If host based virtualization lies in the host, and controller based virtualization lies in the subsystem, it should be obvious that network based virtualization lies in the SAN that connects the hosts to the storage servers. However, there are a few different implementations, and to some extent these depend on how they manage the application data.

There are two types of information that pass between the hosts and the storage: data and metadata. The data is the blocks of information that make up files and records; the metadata contains information about the data, including the location of the blocks of data.

If the virtualization is 'in-band', then it lies in the data path, so all the data and the metadata pass through it. The virtualization 'appliance' will create and allocate virtual volumes on the storage subsystems as required. It presents these back to the hosts, and when it receives an IO request from an application, the virtualization server translates the IO from the virtual file system request to the physical disk IO and passes it on to the correct disk array.
If the virtualization is out-band, then it traps the metadata IO and uses that to set up a path for the data IO. Once the path is defined, the appliance takes no further part in the operation, so the metadata passes through the virtualization appliance but the data does not. The virtualization appliance will still create and allocate virtual volumes on the storage subsystems, but it requires agent software in the data path to do the virtual/physical IO translation and to present virtual volume information to the operating system. If an application requests an IO, the software agent performs the virtual/physical translation and directs the IO request to the appropriate storage subsystem.
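
The difference between the two models comes down to who sits in the data path. The Python sketch below contrasts them: in-band, where the appliance performs every IO itself, and out-band, where the appliance only serves the mapping metadata and a host-side agent performs the IO directly. All the classes are hypothetical simplifications.

    class Array:
        def __init__(self): self.blocks = {}
        def write(self, block, data): self.blocks[block] = data

    class Appliance:
        def __init__(self, mapping):
            self.mapping = mapping                 # virtual LUN -> physical array

        # In-band: data and metadata both flow through the appliance
        def io_inband(self, lun, block, data):
            array = self.mapping[lun]
            array.write(block, data)               # appliance is in the data path

        # Out-band: only metadata is served; the host agent moves the data
        def lookup(self, lun):
            return self.mapping[lun]

    class HostAgent:
        """Runs on the host in the out-band model."""
        def __init__(self, appliance):
            self.appliance = appliance
            self.path_cache = {}

        def io_outband(self, lun, block, data):
            if lun not in self.path_cache:               # metadata request
                self.path_cache[lun] = self.appliance.lookup(lun)
            self.path_cache[lun].write(block, data)      # direct data IO

    array = Array()
    appliance = Appliance({"LUN-01": array})
    appliance.io_inband("LUN-01", 0, "x")              # in-band path
    HostAgent(appliance).io_outband("LUN-01", 1, "y")  # out-band path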

The virtualization code can either run on a dedicated server or 'appliance', or it can run inside the SAN switches. It might seem intuitive that virtualization in the SAN switches must be in-band, but in fact the virtualization software can run on blades inside the switch and be out-band.

The IBM SAN Volume Controller (SVC) and the EMC VPLEX are both in-band virtualization products that run on dedicated servers. It seems that all the major virtual SAN implementations run in-band.
