A method to forecast future tape growth. This assumes that you had steady growth in the last month, and you expect this to continue. It also assumes that you are not using collocation. If you know you are going to add 200 more servers next month, then they represent a step growth, which will be over and above the incremental growth forecast here. This method also assumes that you copy your tape storage pools.
Query the TSM server to get statistics on data copied from the primary storage pool in the past month using
Q ACTL BEGINDATE=-30 SEARCH=ANR1214
Add up the daily archive and backup data copied from primary to copy pool storage and divide by 30; this gives the average amount of data backed up per day. Double the value, because the data is held in both primary and copy storage pools. Call this value 'DailyData'.
Now find out how much data is held per full tape, on average, with the following query
SELECT STGPOOL_NAME AS STGPOOL, CAST(MEAN(EST_CAPACITY_MB/1024) AS DECIMAL(5,2)) AS GB_PER_FULL_VOL FROM VOLUMES WHERE STATUS='FULL' GROUP BY STGPOOL_NAME
Add the averages for each storage pool and divide by the number of storage pools; this gives an average capacity per tape across all data types, with compression taken into account. Call this value 'AvCap'.
DailyData / AvCap = TAPES TSM WANTS PER DAY!
Now we need to calculate how many tapes TSM frees up each day. Use the query
Q ACTL BEGINDATE=-30 SEARCH=ANR1341
Add these lines up and divide by 30 days. This will give you the number of tapes reclaimed a day.
The difference between the number of tapes used and the number of tapes reclaimed is your growth rate. It's unlikely to be a negative number.
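The arithmetic above can be sketched as a small Python function. This is only an illustration: the function name, the parameter names and all the figures are made up for the example, and you would substitute the totals you get from your own activity log and SELECT output.

```python
def tapes_growth_per_day(copied_gb_last_30_days, avg_full_tape_gb_by_pool,
                         tapes_reclaimed_last_30_days, days=30):
    """Return (tapes wanted per day, tapes reclaimed per day, net growth per day)."""
    # DailyData: average daily copy volume, doubled for primary + copy pools.
    daily_data = 2 * copied_gb_last_30_days / days
    # AvCap: plain average of the per-pool GB_PER_FULL_VOL figures.
    av_cap = sum(avg_full_tape_gb_by_pool.values()) / len(avg_full_tape_gb_by_pool)
    wanted = daily_data / av_cap
    reclaimed = tapes_reclaimed_last_30_days / days
    return wanted, reclaimed, wanted - reclaimed


# Hypothetical figures: 36,000 GB copied in 30 days, two pools, 90 tapes reclaimed.
wanted, reclaimed, growth = tapes_growth_per_day(
    copied_gb_last_30_days=36000.0,
    avg_full_tape_gb_by_pool={"TAPEPOOL": 800.0, "COPYPOOL": 400.0},
    tapes_reclaimed_last_30_days=90,
)
print(f"wanted={wanted:.2f} reclaimed={reclaimed:.2f} growth={growth:.2f} tapes/day")
```

With these invented numbers TSM wants 4 tapes a day and frees 3, so the library grows by about one tape a day.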
With collocation on, when TSM starts a backup or migration for a given client, it tries to put the data on a 'filling' tape first, where data already exists for that client. If there isn't one, it selects a scratch tape unless the pool has already reached MAXSCRATCH, in which case it puts the data on the least-full tape available in the pool. There is some detail about collocation in the TSM Backup section.
Restores run significantly faster with collocated tapes, as fewer tape mounts are required.
One of the problems with collocation is that you can end up with very little data on very high capacity tapes. If TSM has a new client, and a scratch tape available, it will use the scratch tape, rather than a 'filling' tape with very little data on it. To efficiently fill your tapes, you need MAXSCRATCH set to fewer tapes than you have clients.
The other side of this is that your tapes will start to fill up over time, and there will be tapes that are full but not yet eligible for reclaim, so you will have fewer FILLING tapes in the pool. So you either have to increase MAXSCRATCH or reduce RECLAIM%, or you have fewer and fewer filling tapes and gradually lose the benefits of collocation. You really need the 'suck it and see' approach to find out the best balance between MAXSCRATCH vs. RECLAIM% for your site.
If you do set MAXSCRATCH to less than your number of clients, then you need to realize that you will never have any scratch tapes in your pool. Your tapes will always be either 'full' or 'filling'. If you use MOVE DATA to free up a tape, you will be back to square one after the next backup run. That's the way collocation works.
Another possible problem is that with collocation on you get a LOT more tape mounts during migration. Also, if you copy your onsite collocated pool to an offsite non-collocated copy pool, you will get more mounts during reclaim of offsite storage pools, and during backup stgpool. Some tape drives cannot handle all that activity. You can reduce the amount of extra tape activity somewhat by trying to schedule your copy storage pool before migration happens, so that most of the data goes from disk to tape, rather than tape to tape.
The term 'Imperfect collocation' is sometimes used to describe the situation that occurs when collocation is enabled, but there are insufficient scratch tapes to ensure that each node stores its data on different tapes. Some nodes will have their own tapes, and some will share, so some collocation will happen.
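The tape selection rules described in this section can be modelled roughly as follows. This is a simplified sketch for reasoning about MAXSCRATCH, not how the TSM server is implemented; the data layout, the function name and the use of the pool's tape count as the MAXSCRATCH test are all assumptions made for the example.

```python
def pick_volume(node, volumes, scratch_available, maxscratch):
    """Choose a tape for a node's data in a collocated pool.

    volumes: list of dicts with 'name', 'status', 'pct_util' and 'nodes' (a set).
    Assumption: every tape in the pool came from scratch, so len(volumes)
    is compared against MAXSCRATCH.
    """
    # 1. Prefer a filling tape that already holds data for this node.
    for vol in volumes:
        if vol["status"] == "FILLING" and node in vol["nodes"]:
            return vol["name"]
    # 2. Otherwise take a scratch tape, unless the pool is at MAXSCRATCH.
    if scratch_available > 0 and len(volumes) < maxscratch:
        return "SCRATCH"
    # 3. Otherwise share the least-full filling tape in the pool.
    filling = [v for v in volumes if v["status"] == "FILLING"]
    if not filling:
        return None  # nothing left to write to
    return min(filling, key=lambda v: v["pct_util"])["name"]


pool = [
    {"name": "A00001", "status": "FILLING", "pct_util": 10.0, "nodes": {"NODE_A"}},
    {"name": "A00002", "status": "FULL", "pct_util": 100.0, "nodes": {"NODE_B"}},
    {"name": "A00003", "status": "FILLING", "pct_util": 5.0, "nodes": {"NODE_C"}},
]
print(pick_volume("NODE_A", pool, scratch_available=5, maxscratch=10))
print(pick_volume("NODE_D", pool, scratch_available=5, maxscratch=10))
print(pick_volume("NODE_D", pool, scratch_available=5, maxscratch=3))
```

The third call shows 'imperfect collocation': the pool is at MAXSCRATCH, so the new node ends up sharing the least-full filling tape.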
By default, a process will wait 60 minutes for a tape mount, before it gets cancelled. To change this, use the server option
where n is a number in minutes from 0 to 9999.
The parameter will only start counting once the process gets the mount message, so if a process is waiting because all the drives are in use, it will not time out.
When a TSM session is writing to sequential media such as tape and reaches end of volume, the server fails the current transaction, which forces the client to resend all of the data for that transaction. This can be a real issue for a very large database piece; some of these can be bigger than 1 TB.
This happens if the client is not running in quiet mode, and it is a TSM design limitation. The TSM server has to interrupt the client's data transfer by failing the transaction so that it can notify the client that End-of-Volume (EOV) has been reached. This lets the client post a 'waiting for a mount of offline media' message; without that notification, the session would appear to be hung. Because the server must fail the transaction to give this notification, the client in turn has to resend all the data that was previously sent in that transaction.
When you define a Windows storage agent, it is necessary to match the serial numbers of drives when defining the drive paths, between the Windows server and the TSM server. A typical define path command will look something like
define path agent_name tape_name srct=server destt=drive libr=library_name device=mt_name
where agent_name is the name of the storage agent, tape_name is the name of a tape drive as defined to your TSM server, and mt_name is the name of the mt device on your Windows client. So the first thing you need to know is how to map between physical device names and mt names at the Windows end. IBM provides a utility called tsmdlst to do this, and since TSM 6.3 it can be used to define the paths automatically. The tsmdlst command can be found in C:\Program Files (x86)\Tivoli\TSM\storageagent and the command to define the paths automatically is
tsmdlst /genmacropathsync /addpaths /execmacropathsync /id=userid /pass=password /tcps=ip address of server /tcpp=port of server /server=server name /stagent=storage agent name
The userid and password need to be for a user on the TSM server that has system rights. The 'ip address of server' and 'port of server' values are for the TSM server; the default port is 1500. The server name is the name of the TSM server and the storage agent name is the name of the storage agent running on the Windows machine.
If you are running TSM 6.2 or earlier, then you need to run the tsmdlst command piping the output to a file (tsmdlst > dlist.out) then use that output to relate the WWN names of the drives on the Windows client to the WWN names of the drives on your TSM server. From this, you can get the mt name and corresponding TSM drive name that you need to build the define paths command.
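Once you have worked out which mt device corresponds to which TSM drive, the define path commands can be generated rather than typed by hand. The sketch below assumes you have already built that mapping yourself from the tsmdlst output; the function name, the agent, library and drive names, and the mt device names are all hypothetical.

```python
def build_define_paths(agent_name, library_name, drive_to_mt):
    """Return one DEFINE PATH command per (TSM drive name, Windows mt name) pair."""
    return [
        f"define path {agent_name} {drive} srctype=server desttype=drive "
        f"library={library_name} device={mt}"
        for drive, mt in sorted(drive_to_mt.items())
    ]


# Hypothetical mapping, matched up by drive serial number or WWN.
commands = build_define_paths(
    agent_name="STA1",
    library_name="LIB1",
    drive_to_mt={"DRIVE2": "mt1.0.0.3", "DRIVE1": "mt0.0.0.3"},
)
for cmd in commands:
    print(cmd)
```

The output can be pasted into an administrative command line or saved as a macro. The keywords are written out in full here to avoid any doubt over abbreviations.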
You need to run an audit occasionally to make sure that what TSM thinks is in your library, matches reality.
The audit command is
AUDIT LIBRARY library-name CHECKLABEL=BARCODE
The CHECKLABEL=BARCODE switch is optional, but it will make the audit go pretty fast, say 5 minutes. With that switch, all the audit involves is your robot scanning the barcode labels of all the tapes. Without that switch, the default action is to mount each tape, which will take a long time.
The audit may wait until all tapes are dismounted from drives, so it could take a while for an audit to start. Consider canceling tape processes if your audit is waiting, and you need it in a hurry.
You can audit a tape volume with the command
audit volume volser fix=yes
and this will check all the backups on the tape and fix the database entries for any that are damaged. It is also possible to audit all the volumes in a storage pool with a single command
audit volume stgpool=pool-name fix=yes
This could audit a lot of tapes and take a long time so you can restrict it by date. For example say you had a tape drive that went faulty on March 25th 2005 and was fixed on April 1st 2005. You want to check all the tapes that were written in that period for errors. Use the command
audit volume stgpool=pool-name fromdate=03/25/2005 todate=04/01/2005
If you choose a volume that is part of a volume set because it contains files that span volumes, TSM will select the first volume in the set and scan them all, even if you pick a volume in the middle. If you just want to audit one specific volume in a set, then you need to use the skippartial parameter
audit volume volser fix=yes skippartial=yes
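If you script date-restricted audits, it helps to build the command from real dates so that the range is always valid. The following is a minimal sketch; the function name and pool name are invented, and it assumes the server accepts dates in MM/DD/YYYY form.

```python
from datetime import date


def audit_volume_by_date(pool_name, from_date, to_date, fix=False):
    """Build an AUDIT VOLUME command restricted to tapes written in a date range."""
    if to_date < from_date:
        raise ValueError("to_date must not be earlier than from_date")
    # Assumption: the server expects MM/DD/YYYY dates.
    cmd = (f"audit volume stgpool={pool_name} "
           f"fromdate={from_date:%m/%d/%Y} todate={to_date:%m/%d/%Y}")
    if fix:
        cmd += " fix=yes"
    return cmd


# The faulty-drive example: 25 March 2005 to 1 April 2005.
print(audit_volume_by_date("TAPEPOOL", date(2005, 3, 25), date(2005, 4, 1)))
```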
What does TSM do if an urgent request comes along and all the tape drives are in use, or the required volume is being used by another task?
The answer is that in some circumstances TSM will cancel a lower priority task to free up the resource. This is called preemption.
If you think this might be happening, you can investigate by searching the activity log for any of the following messages.
If you are not happy that processes and sessions can be cancelled, you can disable this function by adding NOPREEMPT in the server options file. If you do this, the BACKUP DB command and the EXPORT and IMPORT commands will still be able to preempt other operations but everything else will have to wait for the resources to be freed up.
Operations that cannot preempt other operations or be preempted themselves, for either mount points or volumes are:
Restore from a copy storage pool or an active-data pool
Prepare a recovery plan
Store data using a remote data mover
For mount points, operations that can preempt other operations, in pecking order, are
and operations that can be preempted are
Migration from disk to sequential media
Backup, archive, or HSM migration
Migration from sequential media to sequential media
These are listed in order of priority, so for example if it is running, reclamation would be preempted first.
For volume access the following high priority operations can preempt operations for access to a specific volume:
and the following operations can be preempted, with items at the bottom of the list being preempted first.
Migration from disk to sequential media
Backup, archive, or HSM migration
Migration from sequential media
ANR8779E Unable to open drive /dev/rmtX, error number = 16
ANR8779E Unable to open drive mtx.y.z.n error number=170
Basically, these errors mean that TSM could not open a tape because the drive was in use by another host. The first message comes from AIX, the second message from Windows but they are essentially the same. Some possible causes of this problem, and fixes for it are:
Non-IBM drives can end up being reserved after an otherwise successful Windows cluster failover. This happens with older releases of TSM and the recommendation is to upgrade storage agents, library managers, and library clients to levels 18.104.22.168 or 22.214.171.124 or higher.
AIX 6.1 has a known defect (IV05718) that can leave tape drives in an inaccessible state after running cfgmgr, which then presents an EBUSY error to TSM. The solution is to review and apply the AIX APAR.
Check your current microcode/firmware levels on any IBM manufactured HBAs and upgrade to the latest level and also upgrade all IBMtape/Atape/lin_tape device drivers to current levels.
Older versions of SAN status/health monitoring utilities such as SanSurfer and HBAExplorer have been known to place reserves on devices. Also HP DDMI (Discovery and Dependency Mapping Inventory, part of the HP OpenView suite) may place reserves on drives during scans. The recommendation is to upgrade any SAN monitoring utilities to current levels to avoid known defects that can place reserves. Alternatively, just stop using these utilities.
Watch out for home-grown scripts that use utilities like tapeutil and ITDT to check on the status of devices. These scripts can place reserves on devices and should be disabled permanently. In fact, tapeutil has been superseded by ITDT (IBM Tape Diagnostic Tool), so it is best to disable tapeutil and use only the most current version of ITDT to avoid known defects.
On AIX, check to see if any drives have 'retain_reserve' set to 'yes'. You do this using
lsattr -El rmtxx
where xx is the number of the drive. If it is set to 'yes', use the chdev command to reset it to 'no'.
The HP-UX Ignite-UX system management utility can incorrectly obtain device reservations on tape drives even if that system is not actually using the drives. The recommendation is to remove Ignite-UX from any systems that have access to SAN attached tape drives and/or libraries.
Under some circumstances, ProtecTIER VTL devices can fail to respond to a drive inquiry request from an application within 30 seconds. This can lead to stale or orphaned drive reservations. This issue was resolved in ProtecTIER version 3.3.4, so the recommendation is to upgrade ProtecTIER to version 3.3.4 or higher.
Consider enabling the RESETDRIVES parameter on the library definition, if possible. This option can allow the device driver to attempt(!) to break a reservation. It is important to note that if the Persistent Reservation option is enabled on the HBA, RESETDRIVES cannot send a LUN reset to break a reservation.
Enable SANDISCOVERY to rediscover devices that have disappeared from the SAN. This can often self-correct pathing issues.
Investigate using the SANREFRESHTIME TSM option.
If a drive is stuck in a reserved state, power cycling it at the physical hardware level can often free the reservation, but this might not be a permanent solution.
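Before working through the list above, it is useful to know which drives are affected and how often. The sketch below scans activity log text for ANR8779E messages and picks out the ones that report a busy device. The regular expression is based only on the two message layouts quoted earlier, and the function name and the third sample line are made up for the example.

```python
import re

# 16 is EBUSY on AIX; 170 is ERROR_BUSY on Windows.
BUSY_ERROR_NUMBERS = {16, 170}

# Matches both layouts: "... drive /dev/rmtX, error number = 16"
# and "... drive mtx.y.z.n error number=170".
ANR8779E = re.compile(
    r"ANR8779E Unable to open drive ([^\s,]+),?\s+error number\s*=\s*(\d+)"
)


def busy_drives(log_lines):
    """Return the drive names from ANR8779E lines that report a busy device."""
    found = []
    for line in log_lines:
        match = ANR8779E.search(line)
        if match and int(match.group(2)) in BUSY_ERROR_NUMBERS:
            found.append(match.group(1))
    return found


sample = [
    "ANR8779E Unable to open drive /dev/rmtX, error number = 16",
    "ANR8779E Unable to open drive mtx.y.z.n error number=170",
    "ANR8779E Unable to open drive /dev/rmt9, error number = 2",  # not a busy error
]
print(busy_drives(sample))
```

Counting how often each drive appears in the result can show whether the problem is confined to one drive or spread across every drive on a particular host.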
TSM will spend some time waiting for or processing tape drives, and all these times are recorded in the accounting records. Some times are related and some overlap.
'idlewaittime' is the time a session waits for the client to send another work request, once the previous work request completes. A work request is a client command, for instance backup, archive or query.
'commwaittime' is the time the server waited to receive data from or send data to a client.
'mediawaittime' is the time the session waited for tapes to be located, moved to a drive, then mounted and made ready for input or output. Mediawaittime can be larger than duration, idlewaittime, commwaittime and process time because of the overlap in processing that takes place within the server for the session.
'duration' is the sum of 'idlewaittime', 'commwaittime', and 'processtime'. 'total processing time' can be determined by subtracting the sum of idlewaittime and commwaittime from duration. Note that duration does not include mediawaittime.
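The relationship between these fields can be written as a one-line calculation. The function name and the figures below are invented for illustration, and all values are assumed to be in seconds.

```python
def processing_time(duration, idlewaittime, commwaittime):
    """Total processing time = duration - (idlewaittime + commwaittime)."""
    return duration - (idlewaittime + commwaittime)


# Hypothetical session: 1 hour long, 10 minutes idle, 15 minutes waiting on comms.
print(processing_time(duration=3600, idlewaittime=600, commwaittime=900))
```

Mediawaittime is deliberately left out of the calculation, because it overlaps with the other times and is not part of duration.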