The problem with moving disks

One of the problems that the storage person has to deal with from time to time is moving disk volumes. In the past two or three years I've had to do this when installing new disk subsystems and moving the data off the old ones, and when relocating to a new machine hall. Another possible reason is to relocate a volume that is performing badly onto a less busy disk string.
What makes this a problem is that it is almost impossible to get system downtime these days to expedite the move. I used to be able to book out exclusive Sunday day shift for disk maintenance. Now, even getting a couple of hours on a Saturday night shift is almost impossible.

Fortunately, there are products out there that will move disks without needing system downtime and with minimal impact on user applications. One of these is FDRPAS.

FDRPAS supports the big four mainframe storage hardware vendors: IBM, EMC, SUN and Hitachi, and will move disks between different vendors' hardware. FDRPAS will handle every type of volume, including SYSRES volumes; the only exception is volumes containing local page or swap datasets.

There are two general types of FDRPAS processing, single system image and multiple images in a sysplex. Before we take a look at these, here are a few general comments about FDRPAS.

  • Because FDRPAS is working alongside active applications on potentially disparate hardware, it is essential that you have the correct levels of microcode and software PTFs applied. InnovationDP, the FDRPAS suppliers, will advise you here.
  • All the currently active volumes ('Source volumes') must be online, and all the volumes that you are going to move to ('Target volumes') must be offline to all LPARs. At the end of the job, FDRPAS will vary the Source volume offline, then vary the Target volume online.
  • One of the issues with doing a disk-to-disk copy with the VOLSER preserved is that you end up with two identical disks on the system, though only one of them can be online. This causes problems at IPL time, as z/OS does not know which one to mount and will ask the operator to decide. If the original volume is mounted by mistake then the data will be back-leveled, and this can be catastrophic. FDRPAS automatically modifies the label of the Source volume when it varies it offline, so this situation cannot happen.
  • FDRPAS can be initiated with batch jobs, from a started task or with ISPF panels. The examples below assume that you will use batch jobs.
  • Before you start a swap, it is a good idea to check that the various disk structures are not faulty. This means checking the VTOC, the VTOCIX and the VVDS. The catalog section shows you how to check the VVDS. FDRCPK is a good tool for checking the integrity of the VTOC and VTOC index.
  • After the swap has completed, it is a good idea to wipe the data from the Source volume with FDRERASE.

Running FDRPAS on a single LPAR

All the volumes to be moved must be accessed by just a single z/OS image.

You start the process by running a batch job like this.

//SWAP EXEC PGM=FDRPAS,REGION=0M
//STEPLIB DD DISP=SHR,DSN=your.fdrpas.loadlib
//SYSPRINT DD SYSOUT=*
//FDRSUMM DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//SYSIN DD *
 SWAP TYPE=FULL,MAXTASKS=3,CHECKTARGET=YES
 MOUNT VOL=D30001,SWAPUNIT=FC01
 MOUNT VOL=D30002,SWAPUNIT=FA20
 MOUNT VOL=D30003,SWAPUNIT=F134

This job will concurrently move the three volumes identified by the VOL= parameters to the addresses identified in the SWAPUNIT= parameters. CHECKTARGET=YES is an optional safety feature used to make sure that the target volumes are empty. Each job can handle up to 64 pairs of volumes in parallel and you can run several jobs concurrently. In this example, MAXTASKS=3 means move three volumes in parallel.

The move process goes through five phases:

Phase1 Validation
FDRPAS checks that the conditions are correct for the move: both devices must be the same type (3380 or 3390), the target device must be offline and at least as big as the source, and the source device must not contain an active page dataset.

Phase2 Install IO intercept
You are moving the volumes while they are active and the move will take some time, so FDRPAS needs to be able to detect IO activity on the source volumes. To do this, it briefly suspends IO to the source volume while it installs an IO intercept. The suspension is short enough that it will not affect active applications.

Phase3 Copy
FDRPAS will now copy data to the target volume. All used tracks (for inactive data sets) and all allocated tracks (for active data sets) are copied, while FDRPAS simultaneously detects updates to the Source.

Phase4 Consolidate
FDRPAS will now re-copy any tracks that were updated after the copy process started. If more than 150 tracks have changed, it will copy the changed tracks with IO still active. It repeats this process until there are fewer than 150 tracks left to copy, at which point it suspends IO again while it copies the last few tracks.

Phase5 Swap completion
At the end of the consolidation process, the Source and Target volumes are identical. All I/O activity to the Source is now quiesced for a second or so, the Source volume is taken offline and the Target varied online, then the active applications are swapped across to the Target volume.
This process is illustrated in the movie below.

With FDRPAS 5.4, the volsers of the target and source volumes can be flipped once a SWAP has completed, to work better with products such as Global Mirror for zSeries (XRC), GDPS, and others. This is accomplished with the COPYVOLID=YES,VOLRESET=NO operands on the SWAP command.
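
As a rough sketch, those operands might be coded on the SWAP statement of the single-LPAR job as shown here. The operand names are the ones quoted above; combining them with TYPE=FULL on one statement, and reusing volume D30001 and unit FC01 from the earlier example, are assumptions made for illustration, so check the syntax against the FDRPAS manual for your release.

```jcl
//SWAP EXEC PGM=FDRPAS,REGION=0M
//STEPLIB DD DISP=SHR,DSN=your.fdrpas.loadlib
//SYSPRINT DD SYSOUT=*
//FDRSUMM DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//* ASSUMPTION: COPYVOLID AND VOLRESET ARE CODED ALONGSIDE TYPE=FULL
//SYSIN DD *
 SWAP TYPE=FULL,COPYVOLID=YES,VOLRESET=NO
 MOUNT VOL=D30001,SWAPUNIT=FC01
```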

Running FDRPAS on multiple LPARs

Most z/OS sites run with several LPARs accessing disks concurrently, usually in a parallel sysplex. In this case, you need to ensure the starting conditions are correct for all LPARs and that FDRPAS intercepts active IO from every LPAR that is accessing the Source disk. To do this, FDRPAS has two types of task: Swap tasks and Monitor tasks.

A Swap task can process 64 volumes in parallel, and up to 5,000 Swap tasks can run on a single LPAR. Up to 15,000 volumes can be swapped at the same time, which means that FDRPAS works well with GDPS, and could even be used to swap out an entire datacenter.

Monitor tasks run on all the other LPARs. Their job is to check that the source device is online and the target offline to their LPAR, and to intercept all active IO on that LPAR. Each Monitor task also performs the physical disk swap on its own LPAR when the copy is complete. Monitor tasks can run as batch jobs or started tasks.

The multi-LPAR move process goes through eight phases:

Phase1 Start monitor tasks
Monitor tasks must be started on every LPAR that has access to the Target disks, including the one that will run the swap task. The sample JCL below could be used to run a monitor in batch. Note that the JCL just identifies the Target units.

//MONITOR EXEC PASPROC
//SYSIN DD *
 MONITOR TYPE=SWAP
 MOUNT SWAPUNIT=1100
 MOUNT SWAPUNIT=1101
 MOUNT SWAPUNIT=1102

The monitor will watch for a SWAP task starting on a different LPAR that is swapping to the same devices. The monitor task will run until a swap to the target has completed.

Phase2 Start a Swap task
The swap task should be started on the busiest LPAR. This task details both the Source and Target devices. Sample JCL for a Swap task is

//SWAP EXEC PGM=FDRPAS,REGION=0M
//STEPLIB DD DISP=SHR,DSN=your.fdrpas.loadlib
//SYSPRINT DD SYSOUT=*
//FDRSUMM DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//SYSIN DD *
 SWAP TYPE=FULL,MAXTASKS=3,CHECKTARGET=YES
 MOUNT VOL=D30001,SWAPUNIT=1100
 MOUNT VOL=D30002,SWAPUNIT=1101
 MOUNT VOL=D30003,SWAPUNIT=1102

Phase3 Validate the swap request
FDRPAS checks that the conditions are correct for the move on all LPARs.

Phase4 Check Monitor tasks status
The Swap task initiates this by issuing a 'swap pending' message. The monitor tasks intercept this message and reply back if they are ready to participate.

Phase5 Install IO intercept
The Swap task signals to the Monitors that the swap process has started, suspends IO and installs the IO intercept. The Monitor tasks also suspend IO to the source volume while they install an IO intercept on every LPAR.

Phase6 Copy
Once the Monitors tell the Swap task that all IO intercepts are in place, the SWAP task will start to copy data to the target volume. While the copy is in progress, the monitor tasks trap IO updates and pass the list of updated tracks to the Swap task.

Phase7 Consolidate
The SWAP task will now re-copy any tracks that were updated after the copy process started. The Monitor tasks will continue to trap IO updates and pass them to the Swap task, which repeats the process until the two disks are identical.

Phase8 Swap
The Swap task and the Monitor tasks will suspend all I/O activity to the Source on their respective LPARs while the source volume is taken offline and the target varied online everywhere, then the active applications are swapped over.
FDRPAS has always worked with PAV, and FDRPAS 5.4 improved the channel program processing to reduce I/O wait time during the SWAP operation for PAV volumes.

It is vitally important that a correctly defined monitor is run on every LPAR defined to the IO configuration in the sysplex. If any LPAR is missed, then IO activity could be missed and the copied data could be corrupt. FDRPAS can interrogate newer storage subsystems like the 3990-6 or 2105 to find out how many LPARs are attached to them. With older disk subsystems, you have to tell FDRPAS how many LPARs are contributing using the parameter #SYSTEMS=. **WARNING** IN THIS CASE IT IS YOUR RESPONSIBILITY TO GET THE NUMBER OF SYSTEMS RIGHT.
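
For an older subsystem, a sketch of the SWAP control statements with the system count coded explicitly might look like this. The value 4 is purely illustrative, and placing #SYSTEMS= on the SWAP statement is an assumption; the rest of the JCL would be the same as for the Swap task shown in Phase2.

```jcl
//* ASSUMPTION: FOUR LPARS SHARE THE SOURCE VOLUME (ILLUSTRATIVE VALUE)
//SYSIN DD *
 SWAP TYPE=FULL,MAXTASKS=3,CHECKTARGET=YES,#SYSTEMS=4
 MOUNT VOL=D30001,SWAPUNIT=1100
```

Count every LPAR that can reach the Source volume, including the one running the Swap task if your release expects that; the manual should say which convention applies.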

The simulation feature of FDRPAS will display all of the systems that have access to the Source volumes specified.

//SIMSWAP EXEC PGM=FDRPAS,REGION=0M
//STEPLIB DD DISP=SHR,DSN=fdrpas.loadlib
//SYSPRINT DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//SYSIN DD *
 SIMSWAP TYPE=FULL,MAXTASKS=3,CHECKTARGET=YES
 MOUNT VOL=D30001,SWAPUNIT=1100
 MOUNT VOL=D30002,SWAPUNIT=1101
 MOUNT VOL=D30003,SWAPUNIT=1102

Let's start by assuming that you are either running with a newer disk subsystem, or that you specified the correct number of systems on an older subsystem. If you forget to start a monitor on one system, or if the monitor has a target device coded incorrectly, FDRPAS will wait for a while to get responses from all its monitor tasks, then issue an FDRW68 WTOR, with the options 'reply RETRY,NO,YES'. If you get this message, try to correct the condition that caused it (e.g. monitors not started on all relevant systems) and reply 'RETRY'. If the message is then issued again, the vendors recommend that you contact them.
If you reply 'YES', then the Swap will proceed with no IO protection on one LPAR, and at the end of the Swap, that device will not be swapped around on that LPAR. Data corruption is almost guaranteed to happen.

An FDRPAS copy process can be terminated at any time before the final SWAP has completed, either through the ISPF panels, or with the z/OS STOP command. This can be done without affecting the original device or any applications using it.

If you stop a SWAP task, then any active SWAPs will be allowed to complete, but any pending SWAPs requested in that same task will not start. If you specify the CANCELPROT=YES parameter (the default is NO), then a cancel command will be treated like a stop command and active SWAPs will run to completion. This can be overridden by issuing the Cancel command twice.
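
A minimal sketch of a Swap task protected against an accidental cancel follows. CANCELPROT=YES is the parameter named above; coding it on the SWAP statement is an assumption, and the volume and unit are simply reused from the earlier examples.

```jcl
//* ASSUMPTION: CANCELPROT IS CODED AS AN OPERAND OF THE SWAP STATEMENT
//SYSIN DD *
 SWAP TYPE=FULL,CANCELPROT=YES
 MOUNT VOL=D30001,SWAPUNIT=1100
```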

If you use the LARGERSIZE=YES parameter, then FDRPAS will move the data to a larger capacity disk, for example a model 9 to a model 27. z/OS records the free space on a disk in the indexed VTOC. If you are moving from a small disk to a larger disk then you need that free space map updated to show the extra free space. FDRPAS will do that for you automatically.
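
As an illustration only, a swap from a smaller volume to a larger target might be requested like this. The operand and its value are taken as written above, its position on the SWAP statement is an assumption, and the volume and unit are hypothetical, so confirm the exact keyword value in the FDRPAS manual before relying on it.

```jcl
//* ASSUMPTION: TARGET UNIT 1100 IS A LARGER MODEL THAN SOURCE D30001
//SYSIN DD *
 SWAP TYPE=FULL,LARGERSIZE=YES
 MOUNT VOL=D30001,SWAPUNIT=1100
```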

FDRPAS can send messages on Swap task completion, either by e-mail using an FDREMAIL DD statement or as TSO notifies.

FDRPAS can also work with FDRinstant to run point-in-time backups, but that is outside the scope of this page.

FDRPAS V 5.4 L80 introduced a new feature called GENSWAP which can be used to generate control statements to migrate a complete SSID or Control Unit. GENSWAP generates the necessary control statements for the SIMSWAP, SIMSWAPMON, MONITOR, and SWAP processes. The GENSWAP command sorts all the specified devices by size, control unit, then by SSID and spreads the jobs across the SSIDs to reduce contention.
The GENSWAP command creates the JCL and control statements for both the main task and the MONITOR task to ensure they are consistent and in agreement.

FDRPAS 5.4 allows you to run the MONITOR tasks as Started Tasks, so you can run more MONITOR tasks simultaneously without tying up initiators.

GDPS issues

GDPS is a set of controlling scripts that manages remote data mirroring. GDPS HyperSwap is a feature that can automatically switch all disk processing from the primary site to the secondary site in response to an event, while leaving the CEC processing intact. GDPS identifies volumes by UCB number, not volser, so if you have GDPS HyperSwap active, this is an issue for FDRPAS (and TDMF), as there is an IBM rule that a SWAP cannot be done while a volume is eligible for HyperSwap.

FDRPAS can dynamically switch HyperSwap off and on, but to keep HyperSwap downtime to a minimum, FDRPAS keeps HyperSwap enabled while the data is being copied to the target volumes, disables HyperSwap during the actual UCB SWAPs, and then re-enables HyperSwap immediately afterwards.
While this keeps downtime to a minimum, you need to be aware of what can happen if you get an I/O error while HyperSwap is down.

What happens when HyperSwap detects a primary disk failure is determined by a couple of parameters set in the GEOPLEX PRIMARYFAILURE option. The first parameter, usually set to 'SWAP', determines what happens with HyperSwap active. The second parameter determines what happens if GDPS detects a primary disk failure while HyperSwap is disabled, and it can be 'STOP' or 'GO'. Both options freeze the secondary disks by suspending PPRC, but 'GO' allows the production systems to continue to run using the primary disks, while 'STOP' will stop all of the production systems.
InnovationDP states that there 'is a possibility of a false trigger associated with FDRPAS processing' and so recommends that you use SWAP,GO. This means you get full HyperSwap protection during the copy phase, and will not bring all your sysplex down if you do get a spurious error while FDRPAS has HyperSwap disabled for the UCB swap.
If you run HyperSwap then you need to carefully investigate and analyse this situation then take the action that is best for your site.
