What is HSM?

The concept behind HSM is simple. When you create a file, you allocate it on expensive disk or maybe SSD. When a file is new, you would expect to be updating or accessing it quite often, but after a while the file becomes 'stale' and is hardly ever looked at. The data is still important, but it does not need to be kept on expensive primary disk. An HSM product automatically moves stale data to less expensive, slower secondary storage devices such as cheap SATA disk or tape libraries.
When you access the file again, it is automatically and transparently retrieved back to the expensive primary disk. In theory, users never run out of storage and have constant access to their data regardless of where it is stored.

The HSM principle is explained in a bit more detail in the GIF below.
The customer sees the left-hand disk, in which the dark blue boxes represent normal data files that are currently being used. In this context, 'currently' probably means that these files have been looked at in the past 3 months.
The rest of the disk is filled with older files, shown in yellow, which typically use 80% of the disk space. The HSM product migrates the older files off to cheaper storage, but leaves a small green stub file behind in the same directory as the original. This means that it's easy to find the file, as the stub is where you expect it to be.

Animation showing migration and recall

The stub files use a lot less space than the originals, so they represent a considerable space saving. If you try to access a stub file, the HSM software intercepts the open request and holds it while it retrieves the file from the near-line storage. Once it has copied the data back to the on-line disk, it releases the open request and you get your file. This typically takes a few seconds for a recall from cheap disk, while a recall from tape can take a minute or two. Some products give you a warning message that the recall is in progress, and some give you the option to cancel the recall.
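The migrate and recall cycle described above can be sketched as a toy in-memory model. This is only an illustration, not how any particular HSM product works: the `TieredStore` class and its method names are hypothetical, dictionaries stand in for the primary and secondary devices, and the secondary copy is assumed to be kept after a recall.

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    """Toy model of a primary disk backed by cheaper secondary storage."""
    primary: dict = field(default_factory=dict)    # path -> file contents
    secondary: dict = field(default_factory=dict)  # path -> migrated contents
    stubs: set = field(default_factory=set)        # paths that hold only a stub

    def migrate(self, path: str) -> None:
        """Copy the data to secondary storage and leave a stub in its place."""
        self.secondary[path] = self.primary[path]
        self.primary[path] = b""   # the stub keeps the name, not the data
        self.stubs.add(path)

    def open(self, path: str) -> bytes:
        """Return the contents, recalling from secondary storage if needed."""
        if path in self.stubs:
            # In a real product the open request is held here
            # until the recall completes.
            self.primary[path] = self.secondary[path]
            self.stubs.discard(path)
        return self.primary[path]


store = TieredStore()
store.primary["/data/report.doc"] = b"quarterly figures"
store.migrate("/data/report.doc")         # primary now holds only the stub
data = store.open("/data/report.doc")     # transparent recall
```

The caller of `open` never needs to know whether the file was a stub, which is the 'transparent' part of the recall.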

HSM can work on two retention parameters: the amount of time data is allowed to live on primary storage before it is moved to secondary storage, and the amount of time it can live on secondary storage before it is deleted. This means that you can use HSM to automate your deletion policies, but to be honest the big challenge is to analyse your data and decide what the deletion policies should be before you can automate them.
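As a rough sketch, the two retention parameters can be expressed as a small decision function. The function name, the parameter names and the example values of 90 and 365 days are all assumptions made for illustration. The sketch also assumes that age is measured from the last access date and that the secondary retention clock starts when the primary retention period ends.

```python
from datetime import date


def retention_action(last_access: date, today: date,
                     primary_days: int, secondary_days: int) -> str:
    """Decide what a retention-based policy would do with one file."""
    idle = (today - last_access).days
    if idle <= primary_days:
        return "keep on primary"
    if idle <= primary_days + secondary_days:
        return "migrate to secondary"
    return "delete"


today = date(2024, 1, 1)
# Hypothetical policy: 90 days on primary, then 365 days on secondary.
recent = retention_action(date(2023, 12, 1), today, 90, 365)
stale = retention_action(date(2023, 6, 1), today, 90, 365)
expired = retention_action(date(2020, 1, 1), today, 90, 365)
```

With these dates the first file stays on primary, the second is a migration candidate and the third is old enough to be deleted.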

Alternatively, HSM can work on disk thresholds. It will start migrating older data when the disk occupancy exceeds a high-water threshold, and stop migrating once the occupancy drops to a low-water threshold. The intention here is that you never run out of disk space.
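Threshold-driven migration can be sketched in a few lines. The 90% and 70% thresholds, the function name and the tuple layout are illustrative assumptions, and picking the longest-idle files first is just one plausible selection order; real products may weigh file size and other factors too.

```python
def files_to_migrate(files, capacity, high=0.90, low=0.70):
    """Pick files to migrate, longest-idle first.

    files    -- list of (name, size, idle_days) tuples
    capacity -- disk capacity in the same units as size
    """
    used = sum(size for _, size, _ in files)
    if used <= high * capacity:
        return []                       # below the high-water mark: do nothing
    selected = []
    for name, size, _ in sorted(files, key=lambda f: f[2], reverse=True):
        if used <= low * capacity:
            break                       # reached the low-water mark: stop
        selected.append(name)
        used -= size
    return selected


disk = [("a.dat", 30, 400), ("b.dat", 30, 200), ("c.dat", 20, 10), ("d.dat", 15, 5)]
chosen = files_to_migrate(disk, capacity=100)
```

Here the disk is 95% full, so migration starts, and moving the single oldest file brings occupancy down to 65%, below the low-water mark.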

HSM Advantages

  • The customer gets a 'virtual volume' that has more or less no size limit.
  • HSM allows a 'halfway house' between having the data on fast disk and deleting the data. This allows you to keep data available for longer, without the problems of maintaining old data on expensive primary disk.
  • HSM can also speed up disk recovery. If you need to recover a whole disk, it will take a lot longer if you have to recover all the old, stale data alongside the current data. Even if you protect your disks from hardware failure by data mirroring, this does not protect the data from logical errors. An accidental or deliberate disk reformat, or file corruption caused by a virus, could corrupt all mirrored copies of the data, and then the disk would have to be recovered from a tape backup. Some products can take up to 24 hours to recover a 1TB disk. If you move all that older data off the primary disk, you could reduce the primary data from 1TB to 200GB, which you could recover much faster.
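To put a rough number on the recovery saving, the arithmetic can be worked through as follows. This assumes that restore time scales linearly with the amount of data, that 1TB is 1000GB, and that the space used by the stub files is small enough to ignore; real restore times depend on the product, the media and the number of files.

```python
def estimated_restore_hours(data_gb: float, rate_gb_per_hour: float) -> float:
    """Estimate restore time, assuming a constant restore rate."""
    return data_gb / rate_gb_per_hour


# Restore rate implied by "24 hours to recover a 1TB disk".
rate = 1000 / 24

full = estimated_restore_hours(1000, rate)     # whole 1TB disk
reduced = estimated_restore_hours(200, rate)   # 200GB of current data only

print(f"Full disk: {full:.1f} hours, after migration: {reduced:.1f} hours")
```

Under these assumptions the recovery drops from about 24 hours to a little under 5 hours.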

The issue that HSM software providers face now is that hardware storage tiering provides an HSM-like function without any special software, and works in a much more dynamic fashion. The enterprise disk page has some information about hardware tiering.

What is the difference between HSM Migration and Archiving?

An archive is typically a point-in-time application backup, which is retained for several years. After a verified successful backup, selected files can be deleted from hard disk to free up space. They can then be recovered from the archive later if required.
The advantage of archiving is that it can be achieved using existing hardware and software. The only cost is the archive tape.
The disadvantage is that the process is all manual. You may have to delete the files manually, and you will have to keep a manual record of the deleted files and their associated archive tapes. If a file is required again, then a manual restore is needed.

Archiving is suitable for remote file servers connected by narrow-bandwidth networks. Archiving could also be considered for project-based servers, where an end-of-project backup could be taken and filed away for reference before the system is cleaned up for the next project.
