HSM Components

The main HSM components are illustrated below.
[Figure: HSM components]

The white boxes are the data functions, the yellow ellipses are the control datasets, the blue boxes are the data stores, and the green boxes are the recovery logs and trace datasets.

Backup and Migration are the two main functions of HSM.

HSM recognises data in three places. ML0 (Migration level 0) is the on-line data that is accessed by applications and users. Data is usually moved to ML1 (Migration level 1) first, according to SMS management class rules. ML1 is a dedicated pool of non-SMS-managed disks, and the data held there is compressed.
The minimum space a dataset can occupy is one track, about 53K on 3390 track geometry, so a small dataset migrated to ML1 on its own would still use a full track. To avoid this, small datasets are held as records in Small Data Set Packing datasets (SDSPs). These are standard VSAM KSDS files.
To use small dataset packing, you have to tell DFHSM how big a small dataset is. You need an entry in your ARCCMDxx member like


As a guide, with half-track blocking and 3:1 compression, about 160KB of original data will fit on a single ML1 track. If these assumptions hold for your site, then 160KB is a reasonable cutoff point, as anything smaller would not occupy a whole track.
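
The arithmetic behind that guide figure can be checked with a short sketch. It assumes the usual 3390 half-track block size of 27,998 bytes (two blocks per track) and the 3:1 compression ratio quoted above; the constant and function names are illustrative only and are not part of HSM.

```python
# Rough SDSP cutoff estimate. The figures are assumptions, not HSM internals.
HALF_TRACK_BLOCK_BYTES = 27_998   # largest block that fits twice on a 3390 track
BLOCKS_PER_TRACK = 2


def sdsp_cutoff_kb(compression_ratio: float = 3.0) -> float:
    """Original data (in KB) that compresses down to one full ML1 track."""
    track_bytes = HALF_TRACK_BLOCK_BYTES * BLOCKS_PER_TRACK
    return track_bytes * compression_ratio / 1024


if __name__ == "__main__":
    print(f"Estimated cutoff: {sdsp_cutoff_kb():.0f} KB")
```

Under these assumptions the estimate comes out at roughly 164KB, which is in line with the 160KB guide. A site that sees a different compression ratio can pass its own value to get a matching cutoff.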

If a dataset continues to be unused, it will eventually be migrated off to ML2 (migration level 2), which is usually high capacity cartridge. Large datasets are often migrated straight off to ML2.
Migrated datasets are given a special catalog entry, with a volser of MIGRAT, to indicate that the dataset is migrated. The MCDS (Migration Control Data Set) keeps a record of what has been migrated, and where the migrated data is held. If you try to access a migrated dataset, it will be automatically recalled back to ML0.

There are claims that ML1 is not relevant anymore. Disk subsystems with tiered Flash / Fast Disk / Slow Disk storage and automatic tiering software provide the same functionality as a compressed disk pool, and arguably manage things better. The mainframe CPU saving in not needing to manage the ML1 data can be considerable, but then again, HSM housekeeping normally runs at quiet times when there are free CPU cycles. I guess that this is something that each site needs to evaluate for themselves, then decide what is best for them.

There are three variations of space management: primary space management, secondary space management and interval migration.

Primary Space Management

To run primary space management you need to issue commands like these


This means run primary space management on Mondays and Fridays, starting at 01:00, and do not start processing any new volumes after 03:00. The cycle begins on a Monday because the cycle start date, January 23rd 2017, was a Monday. You would typically enter these commands once when setting up HSM, and enter them again only if you wanted to change the parameters.
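
The link between the run days and the start date can be illustrated with a small sketch. It assumes a seven-day cycle written as one Y or N per day, counted from the cycle start date. The helper name is hypothetical, and the sketch only models the calendar arithmetic, not HSM itself.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def cycle_pattern(cycle_start: date, run_days: set) -> str:
    """Build a 7-character Y/N pattern whose first position is cycle_start."""
    first = cycle_start.weekday()          # Monday == 0
    pattern = ""
    for offset in range(7):
        day_name = WEEKDAYS[(first + offset) % 7]
        pattern += "Y" if day_name in run_days else "N"
    return pattern


if __name__ == "__main__":
    start = date(2017, 1, 23)
    print(WEEKDAYS[start.weekday()])                    # Monday
    print(cycle_pattern(start, {"Monday", "Friday"}))   # YNNNYNN
```

With a Monday start date the first and fifth positions are the run days. If the start date were moved to a Wednesday, the same Monday and Friday schedule would need the pattern NNYNNYN, which is why the start date matters.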

Primary space management does all the space management functions on the primary, or ML0, disks. If allowed by parameters, it will work its way through each ML0 volume, processing the largest dataset first. It deletes temporary and expired datasets, releases unused space, then migrates data to ML1 or ML2 as appropriate, until all volumes are below their SMS thresholds or there are no more datasets eligible for processing.
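
The order of processing described above can be modelled roughly as follows. This is a simplified sketch, not DFHSM's actual algorithm: the DataSet fields, the single occupancy threshold and the function name are all assumptions made for illustration, and the release of unused space is left out.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    size_mb: int
    expired: bool = False      # temporary or past its expiry date
    eligible: bool = True      # management class allows migration


def space_manage_volume(datasets, capacity_mb, threshold_pct):
    """Return (deleted, migrated, used_mb) after one pass over an ML0 volume."""
    by_size = sorted(datasets, key=lambda d: d.size_mb, reverse=True)
    used = sum(d.size_mb for d in by_size)

    # Step 1: delete temporary and expired datasets.
    deleted = [d.name for d in by_size if d.expired]
    used -= sum(d.size_mb for d in by_size if d.expired)

    # Step 2: migrate eligible datasets, largest first, until the
    # volume is at or below its threshold or nothing eligible is left.
    migrated = []
    for d in by_size:
        if used * 100 / capacity_mb <= threshold_pct:
            break
        if d.eligible and not d.expired:
            migrated.append(d.name)
            used -= d.size_mb
    return deleted, migrated, used
```

For example, a 1000MB volume holding datasets of 400MB (eligible), 300MB (not eligible), 200MB (expired) and 50MB (eligible) with a 50% threshold would have the expired dataset deleted and only the 400MB dataset migrated, because that alone brings occupancy down to 35%.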

Secondary Space Management

Secondary space management needs an initial command, similar to primary space management. If you never enter this command, then secondary space management will never run.


Secondary space management basically looks after the ML1 and ML2 pools. If the management class criteria are met, it moves data from ML1 to ML2. It also runs TAPECOPY commands if they are needed and deletes expired migrated datasets.

The end of that last sentence needs a bit of expansion. HSM can delete migrated datasets once they are no longer required, based on the retention policies set in the DFSMS management class. Normally, HSM will not delete a dataset unless it has a current backup of it. This is an issue if you do not use DFHSM to back up your data, so it is possible to apply a patch to HSM that allows it to delete data that it has not backed up.
So if you are expecting HSM to delete data and this is not happening, one possibility is that HSM requires a backup before it will delete the data. Other things to check are that, for SMS volumes, the storage group containing the volumes is defined with AM=Y, and that the HSM parameter 'Scratch expired Data Sets' is set to YES. If it is not, change it with the command


Interval Migration

You run interval migration on one LPAR, so for that LPAR you specify


in the ARCCMDxx Parmlib member, and in all other LPARs you specify


Interval migration runs every hour and checks each volume's occupancy against the SMS threshold settings for the volume's storage pool. If the high threshold is exceeded, DFHSM will migrate eligible datasets until the low threshold is reached or no more datasets are eligible. It will also delete temporary and expired datasets.
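
The hourly check can be pictured with the following sketch. The volume serials, the threshold values and the function itself are hypothetical. It only shows how a high and low threshold pair decides which volumes are selected and how much space would need to be freed, and it ignores the case where eligible datasets run out first.

```python
def interval_migration_targets(volumes, high_pct, low_pct):
    """Map each over-threshold volume to the percentage of space to free.

    volumes: dict of volume serial -> current occupancy in percent.
    """
    targets = {}
    for volser, occupancy in volumes.items():
        if occupancy > high_pct:
            # Migration continues until occupancy drops to the low threshold.
            targets[volser] = occupancy - low_pct
    return targets


if __name__ == "__main__":
    sample = {"PRIM01": 92, "PRIM02": 70, "PRIM03": 86}
    print(interval_migration_targets(sample, high_pct=85, low_pct=60))
```

Note that a volume sitting between the two thresholds, like PRIM02 here, is left alone: the high threshold triggers the work and the low threshold is only the stopping point.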

Automatic Recall

You can restrict the number of recall tasks with the following SETSYS parameter.


The tape recall tasks are a subset of the max recall tasks, so n1 must be smaller than n2.
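
The relationship between the two limits can be sketched as a small admission check. Here n1 is the tape recall limit and n2 the overall recall limit, as in the sentence above. The class and method names are invented for illustration and do not correspond to anything in DFHSM.

```python
class RecallLimits:
    """Tracks active recalls against an overall limit and a tape sub-limit."""

    def __init__(self, n1_tape: int, n2_total: int):
        # Following the text: the tape limit must be smaller than the total.
        if n1_tape >= n2_total:
            raise ValueError("n1 (tape) must be smaller than n2 (total)")
        self.n1_tape = n1_tape
        self.n2_total = n2_total
        self.active_tape = 0
        self.active_total = 0

    def try_start(self, from_tape: bool) -> bool:
        """Start a recall if the limits allow it; return True on success."""
        if self.active_total >= self.n2_total:
            return False
        if from_tape and self.active_tape >= self.n1_tape:
            return False
        self.active_total += 1
        if from_tape:
            self.active_tape += 1
        return True
```

With n1=1 and n2=3, a second tape recall has to wait even though two recall slots are still free, while disk recalls can use the remaining slots.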


Data needs to be backed up on a regular basis, in case it is accidentally deleted or corrupted. Hardware failure is very rare these days. HSM can stage backups to ML1, or write them straight to tape. Backups are recorded in the BCDS (Backup Control Data Set), which makes recovery very easy. The OCDS (Offline Control Data Set) keeps a record of all tapes used by HSM, both backup and migration.

To schedule HSM backups to run automatically, you need to add lines like these to your ARCCMDxx member


What this says is that the backups will start between 01:00 and 02:00, and no new volume backups will start after 06:00. Up to three concurrent backup tasks can run on this host. If you are running in a sysplex with several LPARs, it's best to run several concurrent backup tasks from a single LPAR, rather than spreading the tasks between LPARs.
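
The three times mentioned above can be read as a start window plus a cutoff. The sketch below models that reading. The function names are hypothetical, the times are those quoted above, and a window that spans midnight is not handled.

```python
from datetime import time

WINDOW_OPENS = time(1, 0)    # automatic backup may begin from 01:00
WINDOW_CLOSES = time(2, 0)   # and must have begun by 02:00
QUIESCE = time(6, 0)         # no new volume backups start after 06:00


def may_begin_autobackup(now: time) -> bool:
    """True if automatic backup is allowed to kick off at this time."""
    return WINDOW_OPENS <= now <= WINDOW_CLOSES


def may_start_next_volume(now: time) -> bool:
    """True if a running backup cycle may pick up another volume."""
    return WINDOW_OPENS <= now <= QUIESCE
```

So a host that comes up at 02:30 has missed the start window for that night, while a cycle that began at 01:00 carries on picking up new volumes until 06:00 and then lets the volumes already in progress finish.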

Fast Replication uses FlashCopy to create almost instant volume-level backups of disks, storage groups or sets of storage groups. The backup includes catalog information for consistent recovery. Recovery can be by entire pool, volume or individual datasets. To use this, you must define a copy pool in DFSMS (option 'P' from the primary DFSMS menu) and add some volumes to it. Each copy pool is associated with one or more primary storage pools in SMS. Here are two sample HSM commands that you can use to invoke Fast Replication. The first command will back up whichever storage pools are associated with copy pool CP004. The second command will recover a dataset from a fast replication dump.


Log & PDA files

The CDS files are critical to HSM, so updates to them can be logged. If a CDS fails, it can be recovered from backup, and the logged updates then applied to bring it back to the point of failure. The PDA (Problem Determination Aid) files are trace datasets. The CDS Recovery section explains how the logs can be used to fix CDS errors.


When several HSM instances run in a Sysplex they are called an HSMplex. You can run more than one HSM instance on a single LPAR, as well as running instances on separate LPARs, with a maximum of 39 instances in a single plex. All the instances must share the MCDS, BCDS, OCDS and JRNL datasets.
As these datasets are shared, you need to consider how you will manage dataset integrity by managing ENQs between instances and LPARs. Check out the available IBM manuals and Redbooks, which detail how to do this.

One instance must be defined as the Primary instance, by placing PRIMARY=YES in the startup parms for that instance. The Primary instance is responsible for running the CDS backups, pre-migration backups, moving backups from ML1 to the backup volumes, and deleting expired dumps and extra VTOC copy datasets.
If there is no primary instance running, then those functions do not run. You can set several of the other instances to be primary standbys, so they will be automatically eligible to be promoted if the original primary fails. If this happens, the first standby that manages to take over becomes the new primary, and the others revert to being standbys again.

For LPARs that are able to access the same set of data, you can use a Common Recall Queue to balance out recall work between LPARs. If some of your LPARs can access all the data and some cannot, you can exclude the instances that do not share data from the common recall queue with startup parameters like these:

