TSM - General Tape Tips



Calculating Tape requirements

A method to forecast future tape growth. This assumes that you had steady growth in the last month, and you expect this to continue. It also assumes that you are not using collocation. If you know you are going to add 200 more servers next month, then they represent a step growth, which will be over and above the incremental growth forecast here. This method also assumes that you copy your tape storage pools.

Query the TSM server to get statistics on data copied from the primary storage pool in the past month using

Q ACTL BEGINDATE=-30 SEARCH=ANR1214

Add up the daily amounts of archive and backup data copied from primary to copy pool storage and divide by 30; this gives you the average amount of data backed up per day. Double the value if you have both primary and copy storage pools, as each new byte of data ends up in both. Call this value 'DailyData'.

Now find out how much data is held per full tape, on average, with the following query

SELECT STGPOOL_NAME AS STGPOOL, CAST(MEAN(EST_CAPACITY_MB/1024) AS DECIMAL(5,2)) AS GB_PER_FULL_VOL FROM VOLUMES WHERE STATUS='FULL' GROUP BY STGPOOL_NAME

Add up the averages for the storage pools and divide by the number of storage pools; this should give you an average capacity for a full tape across all data types, allowing for compression. Call this value 'AvCap'.

DailyData / AvCap = TAPES TSM WANTS PER DAY!

Now we need to calculate how many tapes TSM frees up each day. Use the query

Q ACTL BEGINDATE=-30 SEARCH=ANR1341

Count these messages and divide by 30 days. This will give you the average number of tapes reclaimed per day.

The difference between the number of tapes used and the number of tapes reclaimed is your daily growth rate. It's unlikely to be a negative number.
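
As a rough worked example, using invented numbers: suppose the ANR1214 messages add up to 9,000 GB copied over the month, or 300 GB per day, which doubles to a DailyData of 600 GB per day once the primary pool is included. If the SELECT shows that an average full volume holds 150 GB, then AvCap is 150 GB and TSM wants 600 / 150 = 4 tapes per day. If the ANR1341 messages show an average of 3 tapes returned to scratch per day, the library is growing by roughly one tape a day, or about 30 tapes a month.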



Collocation

Setup

With collocation on, when TSM starts a backup or migration for a given client, it tries to put the data on a 'filling' tape first, where data already exists for that client. If there isn't one, it selects a scratch tape, unless the pool has already reached MAXSCRATCH, in which case it puts the data on the least-full tape available in the pool. There is some detail about collocation in the TSM Backup section.

Restores run significantly faster with collocated tapes, as fewer tape mounts are required.

Problems

One of the problems with collocation is that you can end up with very little data on very high capacity tapes. If TSM has a new client, and a scratch tape available, it will use the scratch tape, rather than a 'filling' tape with very little data on it. To efficiently fill your tapes, you need MAXSCRATCH set to fewer tapes than you have clients.

The other side of this is that your tapes will start to fill up over time, and there will be tapes that are full but not yet eligible for reclaim, so you will have fewer FILLING tapes in the pool. So you either have to increase MAXSCRATCH or reduce RECLAIM%, or you have fewer and fewer filling tapes and gradually lose the benefits of collocation. You really need the 'suck it and see' approach to find the best balance between MAXSCRATCH and RECLAIM% for your site.
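
If you want to experiment, both values can be changed on the storage pool; the numbers below are purely illustrative.

UPDate STGpool pool_name MAXSCRATCH=200 RECLAIM=60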

If you do set MAXSCRATCH to less than your number of clients, then you need to realize that you will never have any scratch tapes in your pool. Your tapes will always be either 'full' or 'filling'. If you use MOVE DATA to free up a tape, you will be back to square one after the next backup run. That's the way collocation works.

Another possible problem is that with collocation on you get a LOT more tape mounts during migration. Also, if you copy your onsite collocated pool to an offsite non-collocated copy pool, you will get more mounts during reclaim of offsite storage pools, and during backup stgpool. Some tape drives cannot handle all that activity. You can reduce the amount of extra tape activity somewhat by trying to schedule your copy storage pool before migration happens, so that most of the data goes from disk to tape, rather than tape to tape.

The term 'Imperfect collocation' is sometimes used to describe the situation that occurs when collocation is enabled, but there are insufficient scratch tapes to ensure that each node stores its data on different tapes. Some nodes will have their own tapes, and some will share, so some collocation will happen.



Using Collocation to speed up restores

When migrating data from disk to tape, or backing up direct to tape, TSM will normally use any tape in that storage pool that is in 'filling' status first, before it asks for a scratch tape. The consequence of this is that data for any given node can be spread over a lot of tapes. This is not an issue for the odd file restore, but it can be a major problem if you want to recover a full disk, or even a large directory, as TSM will spend a lot of time mounting and dismounting tapes. To resolve this issue, IBM introduced Collocation, where every node had its own dedicated set of tapes.
The consequence of this was that TSM then used a lot more tapes than it used to, so then IBM introduced Group Collocation, which is a sort of 'half way house' that is useful for grouping together smaller clients so that they can share tapes, without interfering with the storage for the larger clients. It is also possible to collocate by filespace for very large clients.

Collocation is enabled by storage pool, and the type of collocation specified applies to all the nodes in that pool. To enable collocation you use one of the following commands

UPDate STGpool pool_name COLlocate=NODe
UPDate STGpool pool_name COLlocate=FIlespace
UPDate STGpool pool_name COLlocate=GRoup

to turn collocation off, use

UPDate STGpool pool_name COLlocate=None

If you use group collocation then you need to define some collocation groups and add nodes to them with commands

DEFine COLLOCGroup group_name DESCription=description
DEFine COLLOCMember group_name node_name,node_name,...

Collocation groups do let you be a bit granular with the nodes in a storage pool. For example, say you classify your nodes as small, medium and large. You then define two collocation groups, one for the small nodes and one for the medium nodes, and add each node to the appropriate group. These nodes will then be group collocated, but as the large nodes do not belong to a group, they will each be collocated onto their own tapes.
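
A sketch of that scenario might look like the commands below; the group, node and pool names are invented for illustration.

DEFine COLLOCGroup SMALL_GRP DESCription="small clients sharing tapes"
DEFine COLLOCGroup MEDIUM_GRP DESCription="medium clients sharing tapes"
DEFine COLLOCMember SMALL_GRP nodea,nodeb,nodec
DEFine COLLOCMember MEDIUM_GRP noded,nodee
UPDate STGpool TAPEPOOL COLlocate=GRoup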

However, be aware that collocation affects the way that TSM does disk to tape migration. Migration processes each collocation entity as a separate transaction, so if you use Group Collocation, then all files in a specific group will be migrated in a single transaction. Collocation by Node means a separate transaction for each node, and Collocation by Filespace means a separate migration transaction for each filespace. The processing time for migration depends on the number of files being migrated, and also on the number of transactions used, as more transactions mean more commit time. So migration of files that are collocated by filespace could take longer than migration of files that are collocated by group.

There is another potential issue with Filespace Collocation. When TSM initiates a backup, it builds up a list of all the files it needs to back up, but it runs multiple sessions on the client to do this, so the resulting list will not be arranged in filespace order. Instead, the files from different filespaces will be interleaved in the list. Now if you are backing up direct to tape, and are using filespace collocation, TSM needs to write the data from each filespace to a different tape. The result is that TSM will mount a tape for filespace1, write out some files, find the next set is for filespace2, dismount the tape and mount another tape for filespace2, write out some files until it finds files that belong to filespace3, dismount the tape for filespace2, mount a tape for filespace3, write out some files, and keep 'thrashing' around different tape volumes until the backup is complete.

You stop this from happening by simply adding the line

COLLOCATEBYFILESPEC YES

to the dsm.opt file on the client. TSM will then change the way it builds the list, so that all the files required for each filespace are listed in turn.

Using Collocation with deduplication is a no-no. A deduplicated backup will have links to bits of data on other volumes, and so will call for other volumes during a restore. This defeats the purpose of collocation. It is not that it won't work, it is just that there is no point in doing it.



Controlling wait times for tape mounts

By default, a process will wait 60 minutes for a tape mount, before it gets cancelled. To change this, set the MOUNTWAIT parameter on the device class

UPDate DEVclass devclass_name MOUNTWAIT=n

where n is a number of minutes from 0 to 9999.
The parameter will only start counting once the process gets the mount message, so if a process is waiting because all the drives are in use, it will not time out.
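
For example, to let processes wait up to two hours for a mount on a hypothetical device class called LTOCLASS:

UPDate DEVclass LTOCLASS MOUNTWAIT=120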



End of Volume behaviour

When a TSM session is writing to sequential media such as tape and it reaches end of volume, it will fail the current transaction, and that forces the client to resend all of the data for that transaction. This can be a real issue for a very large database backup piece, as some of these can be bigger than 1 TB.

This happens if the client is not running in quiet mode, and it is a TSM design limitation: the TSM server has to interrupt the client's sending of data by failing the transaction, so that it can notify the client that End-of-Volume (EOV) has been reached. This lets the client post a 'waiting for a mount of offline media' message; if it could not do that, the session would appear to be hung. Because the server must fail the transaction in order to give this notification to the client, the client in turn has to resend all the data that was previously sent in that transaction.



Auditing Tape Libraries

You need to run an audit occasionally to make sure that what TSM thinks is in your library matches reality.
The audit command is

AUDIT LIBRARY library-name CHECKLABEL=BARCODE

The CHECKLABEL=BARCODE switch is optional, but it will make the audit go pretty fast, say 5 minutes. With that switch, all the audit involves is your robot scanning the barcode labels of all the tapes. Without that switch, the default action is to mount each tape, which will take a long time.
The audit may wait until all tapes are dismounted from drives, so it could take a while for an audit to start. Consider canceling tape processes if your audit is waiting, and you need it in a hurry.



Auditing Tape Volumes

You can audit a tape volume with the command

audit volume volser fix=yes

and this will check all the backups on the tape and fix the database entries for any that are damaged. It is also possible to audit all the volumes in a storage pool with a single command

audit volume stgpool=pool-name fix=yes

This could audit a lot of tapes and take a long time, so you can restrict it by date. For example, say you had a tape drive that went faulty on March 25th 2017 and was fixed on April 1st 2017, and you want to check all the tapes that were written in that period for errors. Use the command

audit volume stgpool=pool-name fromdate=03/25/2017 todate=04/01/2017

If you choose a volume that is part of a volume set, because it contains files that span volumes, TSM will select the first volume in the set and scan them all, even if you pick a volume in the middle. If you just want to audit one specific volume in a set, then you need to use the skippartial parameter

audit volume volser fix=yes skippartial=yes



Mount Point and Volume Access Preemption

What does TSM do if an urgent request comes along and all the tape drives are in use, or the required volume is being used by another task?
The answer is that in some circumstances TSM will cancel a lower priority task to free up the resource. This is called preemption.

If you think this might be happening, you can investigate by searching the activity log for messages that report sessions or processes being preempted.
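
A simple way to do that is to search the activity log for the word 'preempted', which the relevant messages contain; the 30 day window below is just an example.

Q ACTL BEGINDATE=-30 SEARCH=preempted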

If you are not happy that processes and sessions can be cancelled, you can disable this function by adding NOPREEMPT in the server options file. If you do this, the BACKUP DB command and the EXPORT and IMPORT commands will still be able to preempt other operations but everything else will have to wait for the resources to be freed up.

Operations that cannot preempt other operations or be preempted themselves, for either mount points or volumes are:

Audit Volume
Restore from a copy storage pool or an active-data pool
Prepare a recovery plan
Store data using a remote data mover

For mount points, operations that can preempt other operations, in pecking order, are

Backup database
Export
HSM recall
Import
Node replication
Restore
Retrieve

and operations that can be preempted are

Move data
Migration from disk to sequential media
Backup, archive, or HSM migration
Migration from sequential media to sequential media
Reclamation

These are listed in order of priority, so, for example, reclamation would be preempted first if it is running.

For volume access the following high priority operations can preempt operations for access to a specific volume:

HSM recall
Node replication
Restore
Retrieve

and the following operations can be preempted, with items at the bottom of the list being preempted first.

Move data
Migration from disk to sequential media
Backup, archive, or HSM migration
Migration from sequential media
Reclamation



ANR8779E error 16/170 EBUSY failures on drive OPEN

ANR8779E Unable to open drive /dev/rmtX, error number = 16
ANR8779E Unable to open drive mtx.y.z.n error number=170

Basically, these errors mean that TSM could not open a tape drive because it was in use by, or reserved by, another host. The first message comes from AIX, the second from Windows, but they are essentially the same. Some possible causes of this problem, and fixes for it, are:

Non-IBM drives can end up being reserved after an otherwise successful Windows cluster failover. This happens with older releases of TSM and the recommendation is to upgrade storage agents, library managers, and library clients to levels 6.3.4.0 or 7.1.0.0 or higher.

AIX 6.1 has a known defect (IV05718) that can leave tape drives in an inaccessible state after running cfgmgr, which then presents an EBUSY error to TSM. The solution is to review and apply the AIX APAR.

Check your current microcode/firmware levels on any IBM manufactured HBAs and upgrade to the latest level, and also upgrade all IBMtape/Atape/lin_tape device drivers to current levels.

Older versions of SAN status/health monitoring utilities such as SanSurfer and HBAExplorer have been known to place reserves on devices. Also HP DDMI (Discovery and Dependency Mapping Inventory, part of the HP OpenView suite) may place reserves on drives during scans. The recommendation is to upgrade any SAN monitoring utilities to current levels to avoid known defects that can place reserves. Alternatively, just stop using these utilities.

Watch out for home grown scripts that use utilities like tapeutil and ITDT to check on the status of devices. These scripts can place reserves on devices and should be disabled permanently. In fact, tapeutil has been superseded by ITDT (IBM Tape Diagnostic Tool), so it is best to disable tapeutil and use only the latest available version of ITDT to avoid known defects.

On AIX, check to see if any drives have 'retain_reserve' set to 'yes'. You do this using

lsattr -El rmtxx

where xx is the number of the drive. If it is set to 'yes', use the chdev command to reset it to 'no'.
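
For example, assuming the drive in question is rmt3 and is not currently in use, something like this should reset the attribute:

chdev -l rmt3 -a retain_reserve=no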

The HP-UX Ignite-UX system management utility can incorrectly obtain device reservations on tape drives, even if that system is not actually using the drives. The recommendation is to remove Ignite-UX from any systems that have access to SAN attached tape drives and/or libraries.

Under some circumstances, ProtecTIER VTL devices can fail to respond to a drive inquiry request from an application within 30 seconds. This can lead to stale or orphaned drive reservations. This issue was resolved in ProtecTIER version 3.3.4, so the recommendation is to upgrade ProtecTIER to version 3.3.4 or higher.

Consider enabling the RESETDRIVES parameter on the library definition, if possible. This option allows the device driver to attempt, though not guarantee, to break a reservation. It is important to note that if the Persistent Reservation option is enabled on the HBA, RESETDRIVES cannot send a LUN reset to break a reservation.
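
For a SCSI library this is set on the library definition itself; the library name below is just a placeholder.

UPDate LIBRary library_name RESETDRIVES=YES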

Enable SANDISCOVERY to rediscover devices that have disappeared from the SAN. This can often self-correct pathing issues.

Investigate using the SANREFRESHTIME TSM server option.
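
A minimal sketch of both options as they might appear in the server options file; the 300 second refresh interval is just an illustrative value.

SANDISCOVERY ON
SANREFRESHTIME 300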

If a drive is stuck in a reserved state, power cycling it at the physical hardware level can often free the reservation, but this might not be a permanent solution.



Understanding TSM wait times

A TSM session spends some of its time waiting, either for the client or for tape drives, and some of it processing data, and all of these times are recorded in the accounting records. Some of the times are related and some overlap.

'idlewaittime' is the time a session waits for the client to send another work request, once the previous work request completes. A work request is a client command, for instance backup, archive or query.

'commwaittime' is the time the server waited to receive data from or send data to a client.

'mediawaittime' is the time the session waited for tapes to be located, moved to a drive then mounted and made ready for input or output. Mediawaittime can be larger than duration, idlewaittime, commwaittime and process time because of the overlap in processing that takes place within the server for the session.

'duration' is the sum of 'idlewaittime', 'commwaittime', and 'processtime'. 'total processing time' can be determined by subtracting the sum of idlewaittime and commwaittime from duration. Note that duration does not include mediawaittime.
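
As a hypothetical example with invented figures: a session with a duration of 600 seconds, an idlewaittime of 250 seconds and a commwaittime of 150 seconds spent 600 - (250 + 150) = 200 seconds actually processing. Because of the overlap described above, its mediawaittime could still be reported as larger than any of these figures.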

back to top