TSM Performance Tuning

General performance considerations

If your TSM performance is poor, there could be many reasons why. Here is a list of some of the factors involved.

  • size of files being backed up
  • number of files being backed up
  • rate that files can be read from disk
  • concurrent read datastreams from the same disk
  • rate that client can send data
  • network topology between client/server
  • rate that server can receive data
  • concurrent data streams into the server
  • tape drive speed (streaming, start/stop)
  • bus speed to tape drive
  • concurrent data streams for multiple tape drives sharing a single bus
  • compressibility of data
  • I/O capability of tsm server
  • cpu speed of tsm server
  • anti-virus software can slow backups down

There is no easy way to work out exactly what is causing your problem. A good starting point is to find out whether the problem is with TSM, the hardware, or the network. FTP a big file from the affected client to the TSM server disk, and see how long it takes. FTP will always be a bit faster than TSM, as it has no database overhead. However, if the FTP times are slow, the problem is probably outside TSM. A common network problem is a mixed full duplex/half duplex environment.

IBM publishes performance-related documentation for each TSM release. It can be found here:
For V6.3 http://publib.boulder.ibm.com/infocenter/tsminfo/v6r3/topic/com.ibm.itsm.nav.doc/c_performance.html
For V6.4 http://pic.dhe.ibm.com/infocenter/tsminfo/v6r4/topic/com.ibm.itsm.nav.doc/c_performance.html
For V7.1 http://pic.dhe.ibm.com/infocenter/tsminfo/v7r1/topic/com.ibm.itsm.perf.doc/c_performance.html

If you think that your problems may be down to a subset of clients, then you can investigate the performance of all the clients on your server by checking the server accounting records.
Accounting records are held on the TSM server host, and a record is written for every client when a session ends. Accounting is switched off by default, and can be switched on with the SET ACCOUNTING ON command. The accounting log is called dsmacct.log and is stored in the server directory by default, but this can be changed by setting the DSMSERV_ACCOUNTING_DIR environment variable.

Each accounting record consists of 31 fields separated by commas, which makes it easy to load into a spreadsheet. Some of the useful fields are
4 - date
5 - time
6 - client name
17 - data backed up in KB
20 - data sent from client to server in KB
21 - session duration in seconds
22 - idle wait time in seconds
23 - comms wait time in seconds
24 - media wait time in seconds
The IBM manuals tell you what the rest of the fields are.

'idlewaittime' is the time a session waits to receive a work request from the client after a previous work request completes. A work request is a backup, archive, query, or any other client command. For example, a client connects to the server and submits a request to do a selective backup. The backup completes and the server sends completion status to the client. The server then waits for the next request to be submitted. If the client responds 6 minutes later with a request, then idlewaittime for this segment of the session will be 6 minutes.

'commwaittime' is the time the server waited to receive data from or send data to a client. This occurs within a work request. For example, a client submits a request for a backup of a 1M file. The file data has to be sent in small chunks to the server, requiring a number of receives to get the data from the client and a number of sends to acknowledge its receipt. Commwait begins when a receive or send data request is made to the communications layer and stops when the receive or send completes. For example, after the client sends some data, it will wait for an acknowledgement. If it takes 5 seconds until this acknowledgement is received, the commwaittime will be 5 seconds.

'mediawaittime' is the time the session waited for tapes to be mounted and made ready for input or output. Mediawaittime is independent of idlewaittime, commwaittime, and process time. It can be larger than duration, idlewaittime, commwaittime and process time because of the overlap in processing that takes place within the server for the session.
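The field positions listed above can be pulled out of dsmacct.log with a few lines of Python. This is a minimal sketch, not an IBM tool: it assumes the 1-based field numbers given above, the dictionary key names are made up for illustration, and the sample record is synthetic rather than real server output.

```python
import csv

# 1-based field positions as listed above; the key names are illustrative only.
FIELDS = {
    "date": 4, "time": 5, "client": 6,
    "backup_kb": 17, "sent_kb": 20, "duration_s": 21,
    "idle_wait_s": 22, "comm_wait_s": 23, "media_wait_s": 24,
}
NUMERIC = ("backup_kb", "sent_kb", "duration_s",
           "idle_wait_s", "comm_wait_s", "media_wait_s")


def parse_accounting(lines):
    """Yield one dict per accounting record, skipping short or malformed lines."""
    for row in csv.reader(lines):
        if len(row) < 24:
            continue
        rec = {name: row[pos - 1].strip() for name, pos in FIELDS.items()}
        try:
            for key in NUMERIC:
                rec[key] = int(rec[key])
        except ValueError:
            continue
        yield rec


def summarise(rec):
    """Return throughput in KB/s and each wait time as a fraction of duration."""
    duration = rec["duration_s"]
    if duration <= 0:
        return None
    return {
        "kb_per_s": rec["sent_kb"] / duration,
        "idle_wait_frac": rec["idle_wait_s"] / duration,
        "comm_wait_frac": rec["comm_wait_s"] / duration,
        # Media wait can exceed the session duration, so this may be above 1.
        "media_wait_frac": rec["media_wait_s"] / duration,
    }


# Synthetic 31-field record for illustration only, not real server output.
fields = ["0"] * 31
fields[3:6] = ["03/25/2013", "02:15:00", "NODE_A"]   # fields 4, 5, 6
fields[16] = "2048000"                               # field 17
fields[19] = "2048000"                               # field 20
fields[20:24] = ["1000", "100", "250", "0"]          # fields 21 to 24
sample = ",".join(fields)

for rec in parse_accounting([sample]):
    print(rec["client"], summarise(rec))
```

To use it against a real log, pass an open file object for dsmacct.log to parse_accounting instead of the sample list. A client with a high comms wait fraction is a candidate for network investigation, while a high idle wait fraction points at the client itself.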

IBM provides a Perl script to collect TSM V6 and newer server monitoring data. This can be useful for collecting ongoing server performance data to detect when problems appear, and also for providing support data to IBM. Details of the script can be found here


TSM and DB2 issues

Resource Contention

It is essential to avoid resource contention between TSM backups and TSM maintenance tasks. Schedule server maintenance operations to run outside of the client backup window, with as little overlap as possible. Rather than let these processes start automatically, schedule them to run at specific times. The exact times for each step will depend on your site and how much work is happening, so you might need a little trial and error to get the best times for you. Some tasks need to run to completion, and some can be stopped before the next task starts. However, be aware that if you consistently stop tasks like expiration and pool migration, you are likely to run out of storage space. Here is a suggested schedule.

08:00 - storage pool backup - run to completion
11:00 - expiration, halt at 14:00
14:00 - storage pool migration, halt at 16:00
16:00 - reclamation, halt at 18:00
18:00 - database, volhist and device config backup - run to completion
20:00 - client backup - run to completion
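When you adjust these times for your own site, it is easy to introduce an overlap by accident. The sketch below encodes the suggested schedule as data and checks that no halt time runs past the start of the next task. The structure and function name are illustrative assumptions, and tasks that run to completion cannot be checked this way because their end time is not known in advance.

```python
from datetime import time

# (task, start, halt) from the suggested schedule; None means run to completion.
SCHEDULE = [
    ("storage pool backup", time(8, 0), None),
    ("expiration", time(11, 0), time(14, 0)),
    ("storage pool migration", time(14, 0), time(16, 0)),
    ("reclamation", time(16, 0), time(18, 0)),
    ("database, volhist and device config backup", time(18, 0), None),
    ("client backup", time(20, 0), None),
]


def find_overlaps(schedule):
    """Return (task, next_task) pairs where a halt time is later than the next start."""
    ordered = sorted(schedule, key=lambda entry: entry[1])
    overlaps = []
    for (name, _start, halt), (next_name, next_start, _halt) in zip(ordered, ordered[1:]):
        if halt is not None and halt > next_start:
            overlaps.append((name, next_name))
    return overlaps


print(find_overlaps(SCHEDULE))
```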

DB2 INTRA_PARALLEL

If the DB2 INTRA_PARALLEL option is mistakenly set to YES, this can degrade your database transaction performance. If it is set to YES, then run the following commands from the DB2 command line to fix it, substituting the name of your own TSM server instance in the first command.

db2 attach to TSM server instance
db2 update dbm cfg using INTRA_PARALLEL NO
db2 terminate

DB2 and LDAP

DB2 can be configured to use LDAP user authentication, and this can slow the authentication between the Server and DB2, especially if LDAP is broken. This will especially slow down processes like inventory file expiration that are heavy database users. The recommendation is that you should consider disabling LDAP user authentication if issues cannot be fixed.

Table Reorgs and RUNSTATS

The DB2 system that underlies TSM should automatically reorganise the database tables and indexes, and also run RUNSTATS to optimise the paths through the database. If this stops working it will slow your server down. Potential issues and fixes for reorgs are discussed on the TSM Database and Log page.

'RUNSTATS' is used to optimise the access paths to a DB2 database. TSM should run RUNSTATS regularly, but how can you check when RUNSTATS last ran? This could be important if you see database performance starting to suffer. To find out, run the following

Start up a DB2 command line.
On Windows, go to Start-Programs-IBM DB2-Command Line tools-Command Window.
On UNIX, su - db2inst1 (db2inst1 is the default instance; if you changed the instance name or have multiple instances, you need to su to the correct userid for your instance). Then type 'db2' to open the DB2 command line.
At the db2 => prompt, type
select stats_time,SUBSTR(TABNAME,1,40) from syscat.tables where tabschema='TSMDB1' AND stats_time is not null order by stats_time
The output should look something like the listing below, where the first column contains the date and time when RUNSTATS last ran against the table named in column 2.

STATS_TIME                 2
-------------------------- ----------------------------------------
.....
2012-12-17-02.21.04.813068 DF_MIGRBITFILES
2012-12-17-02.21.05.619488 ACTIVITY_LOG
2012-12-17-02.21.05.798333 GLOBAL_ATTRIBUTES
2012-12-17-02.21.06.148028 AF_VOL_CLUSTERS
2012-12-17-03.23.16.309308 BACKUP_OBJECTS
.....
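If you capture that output to a file, a short script can flag tables whose statistics look old. This is a sketch only: it assumes the two-column layout shown above and the DB2 timestamp format seen in the sample, and the 7-day threshold is an arbitrary illustration rather than an IBM recommendation.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the sample output above.
DB2_TS = "%Y-%m-%d-%H.%M.%S.%f"


def parse_stats(output):
    """Return {table: datetime} from 'STATS_TIME TABNAME' lines; other lines are ignored."""
    stats = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            stats[parts[1]] = datetime.strptime(parts[0], DB2_TS)
        except ValueError:
            continue  # header, separator or filler line
    return stats


def stale_tables(stats, now, max_age_days=7):
    """Return the sorted names of tables whose stats are older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(table for table, ts in stats.items() if ts < cutoff)


sample_output = """\
STATS_TIME                 2
-------------------------- ----------------------------------------
2012-12-17-02.21.04.813068 DF_MIGRBITFILES
2012-12-17-02.21.05.619488 ACTIVITY_LOG
2012-12-17-02.21.05.798333 GLOBAL_ATTRIBUTES
2012-12-17-02.21.06.148028 AF_VOL_CLUSTERS
2012-12-17-03.23.16.309308 BACKUP_OBJECTS
"""

stats = parse_stats(sample_output)
print(stale_tables(stats, now=datetime(2012, 12, 24, 3, 0, 0)))
```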

DBMEMPERCENT

Ideally, a TSM server should run as a single instance on its own physical server. There is a DBMEMPERCENT parameter that determines how much system memory TSM can use, and the default value is 'AUTO', which means the database manager sets the percentage automatically to a value of approximately 70 to 80 percent of system RAM.

Multi-instance machines require DBMEMPERCENT configuration for best performance. If you run multiple server instances on the same machine, then you should set the DBMEMPERCENT option for each instance to dedicate a portion of memory to it. If you run other applications besides TSM on a machine, you will need to lower DBMEMPERCENT to allow those applications to get adequate memory.

Volume History File too large

You need to prune your volume history file regularly. If it gets too large, it can degrade the performance of backups and of sequential media operations.

External System Issues

DNS problems

If domain name resolution is not correctly configured and responding quickly, that can cause slow server connect times from clients. If this happens, speak to your system and network administrators and get the issue fixed.

AIX issues

If you are running TSM 6.3 on an AIX POWER7 system, make sure that you are not running a GSKIT level lower than 8.0.14.32.

The overhead required for the AIX Active Memory Expansion feature (AME) can degrade memory access performance on AIX systems. Consider disabling it if it is enabled.

If AIX is using a single heap memory allocator, then this can cause slow server and client backup performance. Consider enabling a multiheap memory allocator for applications on AIX.

Linux and UNIX issues

Under-sized kernel configuration can cause resource starvation on Linux, HP-UX, and Solaris systems. Ensure that the kernel parameters have been properly tuned per documented recommendations.

Trace Parameters

You can home in on a problem by adding a trace parameter to your dsm.opt file. The parameter is

traceflags instr_client_detail

In addition to the summary information you usually get after backup you'll find something like this:

You need to bounce DSMCAD to start, stop or change trace parameters.

You can use this information to get an idea of where most of the time is being spent. If just some of your clients are performing badly, then compare traces between good and bad clients.

Other trace options are

tracefile output-file-name.txt
traceflag perform

It is possible to some extent to stop and start tracing without bouncing the client process, which can be very useful if you just want to investigate part of a long running process. Issues are that you must have started the client with a listener thread, which might not work for some clients that use the TSM API, and the client must have been started with the DSMTRACELISTEN YES option set in dsm.sys (UNIX) or dsm.opt (Windows). If you meet these conditions, then you can change the trace parameters 'traceflags', 'tracefile', 'tracemax' and 'tracesegsize'. As the default is 'DSMTRACELISTEN NO', this would not normally happen.

With DSMTRACELISTEN YES set, if you change the trace parameters in the option file, they will be picked up the next time the client process starts a new session thread, and tracing will start. However, if you are concerned about a space-consuming trace file, it is not easy to stop tracing while the session is active. You cannot just switch tracing off, and if you delete the trace file, TSM will continue to write data to the file handle. The only way to reduce space consumption in the trace file is to switch off all the trace flags in the dsm.opt file with

traceflag -all

which will reduce the amount of trace data written. Also, make sure you remove the trace options from the dsm.opt file before another client process runs.


Client Side buffer parameters

TCPNOdelay
Set this to YES

USELARGEBUFFERS
The default setting is USELARGEBUFFERS YES. Make sure it is set on both the server and the clients.

DISKBUFFSIZE
DISKBUFFSIZE sets the maximum disk I/O buffer size, in KB, that the client uses when reading files. It takes a number rather than YES, and it replaces the older LARGECOMMBUFFERS YES option. For large buffers to take effect, every single link in your network must also be configured for large buffers. If you have fast ethernet, make sure you explicitly configure the speeds on the switch ports rather than setting them to autodetect, to prevent transfer size mismatches.

TXNGROUPMAX
TXNBYTELIMIT

This pair of parameters is used to batch up small file transfers, so the transfer overhead on an individual file is shared out. The default setting for TXNGroupmax is 4096 and for TXNBytelimit it is 25600. TXNGROUPMAX refers to the number of objects transferred, like files or directories. TXNBYTELIMIT refers to the amount of data transferred. A batch will be transmitted as soon as one of these limits is reached.
If you increase TXNGroupmax and TXNBytelimit, keep an eye on your recovery logs, as they will need more space. If you find that performance actually gets worse, it is possible that this is due to faults on your network which are causing a lot of retries. Retries will take longer with bigger data chunks, which can totally offset the benefits of lower transport overheads.
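The "whichever limit is reached first" rule can be illustrated with a small simulation. This is a simplified model for reasoning about the two limits, not the client's actual algorithm: it assumes TXNBYTELIMIT is expressed in KB, the function and parameter names are invented for the example, and the file sizes are made up.

```python
def batch_transactions(sizes_kb, txn_group_max=4096, txn_byte_limit_kb=25600):
    """Group object sizes (in KB) into batches, closing a batch when either limit is reached."""
    batches, current, current_kb = [], [], 0
    for size in sizes_kb:
        current.append(size)
        current_kb += size
        if len(current) >= txn_group_max or current_kb >= txn_byte_limit_kb:
            batches.append(current)
            current, current_kb = [], 0
    if current:
        batches.append(current)  # final partial batch
    return batches


# Many tiny files: the object-count limit closes each batch first.
small = batch_transactions([4] * 10000)
print(len(small), [len(b) for b in small])

# Larger files: the size limit closes each batch long before 4096 objects.
large = batch_transactions([100] * 10000)
print(len(large), len(large[0]))
```

With the default values, a backup dominated by very small files is governed by TXNGROUPMAX, while one made up of larger files is governed by TXNBYTELIMIT, so it helps to know which limit your workload is hitting before raising either one.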

TCPWindowsize
TCPBUFFSIZE

The best setting depends on your TSM server platform. A TCPWindowsize of 63 is best for Windows servers, and 64 for UNIX servers. If a Windows 2000 server is communicating with Windows 2xxx clients only, then the TCPWindowsize parameter can be larger, as Win2k supports TCP window scaling. Try a value of 512 for TCPBUFFSIZE, as this seems to work well for Win2k clients.


Multi-streaming

RESOURCEUTILIZATION is an option which you set in the client options file to enable multiple backup streams. The resources are the number of control sessions (sessions that figure out what to back up) and the number of transfer sessions (sessions that actually back up or archive the data). If you set RESOURCEUTILIZATION to 8 on a client, it will not necessarily use 4 concurrent data transfer sessions and 4 control sessions. RESOURCEUTILIZATION just provides a guideline for the number of resources the client should use. The number of concurrent sessions you get will be based on the real-time performance characteristics of the client and the value of RESOURCEUTILIZATION. The higher the RESOURCEUTILIZATION value, the more producer/consumer sessions the client may use. However, if the system is starved of other resources, or the number of files to process does not warrant it, then a larger number of sessions may not be used, even with a large RESOURCEUTILIZATION value.


Directory structure restores

If you're recovering a Windows server, or even a large directory structure, then the restore goes a lot faster if the directories are held in a separate disk storage pool.
Set up a disk storage pool for directories, and allocate a management class which sends the directories to it. This disk storage pool should not require a lot of space, since directories are typically very small.
Then specify option DIRMC directory_mgmtclass_name.


Server tasks

EXPinterval
This parameter specifies the interval between automatic expiration runs for backup and archive files. This process is very CPU intensive, and needs to run at a quiet time. It is best to set EXPinterval to 0, and run expiration from an Admin schedule.

Logpool size

This parameter determines the size of the recovery log buffer pool. If the buffer pool is not big enough, transactions will wait while recovery records are written to the log. You can see if this is a problem by using the command q log format=detail. The command shows the wait percentage, which ideally should be 0. If it is not 0, try increasing the Logpoolsize parameter, but make sure it does not affect overall system memory usage. A Logpoolsize of about 4096 is fairly standard.

TSM Server caching is designed to optimize restore times but sites have experienced slow migration times with caching active. If you are having problems with migration, consider turning caching off, but be aware that this could affect restore speeds.


Indications of disk problems

The following two SQL queries are based on an IBM white paper and are intended to help you decide if your TSM server disks need tuning. The basic idea is to look at how fast your database backups and expire inventory are going, and if they are below 'normal' figures then you might have disk issues.

Database Backups

Run the following SQL query on your server. The query is just shown as one long line so you can cut and paste it without having to remove end-of-line markers.
select activity, ((bytes/1048576)/cast ((end_time-start_time) seconds as decimal(18,13))*3600) "MB/Hr" from summary where activity='FULL_DBBACKUP' and days(end_time) - days(start_time)=0
output looks something like

IBM states that if the backup is processing less than about 28 GB per hour, this might indicate a disk problem and further investigation is advised.
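The arithmetic in the database backup query can be reproduced outside TSM to check a single backup by hand. The sketch below mirrors the query's bytes to MB/Hr conversion and compares the result with the 28 GB per hour figure quoted above. The function names are illustrative, treating 1 GB as 1024 MB is an assumption, and the 50 GB in two hours example is invented.

```python
from datetime import datetime

THRESHOLD_GB_PER_HOUR = 28  # figure quoted in the text above


def mb_per_hour(bytes_moved, start, end):
    """Mirror the query: (bytes / 1048576) / elapsed seconds * 3600."""
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        raise ValueError("end must be later than start")
    return bytes_moved / 1048576 / seconds * 3600


def below_threshold(rate_mb_per_hour, threshold_gb=THRESHOLD_GB_PER_HOUR):
    """True if the rate is under the threshold, taking 1 GB as 1024 MB."""
    return rate_mb_per_hour < threshold_gb * 1024


# Invented example: a 50 GB full database backup that took two hours.
rate = mb_per_hour(50 * 1024 ** 3,
                   datetime(2013, 2, 3, 18, 0, 0),
                   datetime(2013, 2, 3, 20, 0, 0))
print(round(rate), below_threshold(rate))
```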
Another possible indication is expire inventory processing. Try the following SQL query
select activity, cast((end_time) as date) as "Date", (examined/cast ((end_time-start_time) seconds as decimal(24,2))*3600) "Objects Examined Up/Hr" from summary where activity='EXPIRATION'
output looks something like

It is difficult to say what is acceptable with this query as so many factors can affect the throughput. If the throughput drops suddenly then this may indicate possible disk problems. The query above is clearly indicating a potential problem after Feb 02.
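Since there is no fixed "acceptable" figure, one way to spot a sudden drop is to compare each run with the recent runs before it. The sketch below flags any run whose rate falls below half the median of the previous five runs. The window and the 50% cut-off are arbitrary assumptions for illustration, not IBM guidance, and the rates are made-up numbers.

```python
from statistics import median


def sudden_drops(rates, window=5, drop_fraction=0.5):
    """Return indexes where a rate is below drop_fraction times the median of the previous window."""
    flagged = []
    for i in range(window, len(rates)):
        baseline = median(rates[i - window:i])
        if baseline > 0 and rates[i] < drop_fraction * baseline:
            flagged.append(i)
    return flagged


# Made-up "objects examined per hour" figures, in thousands, one per expiration run.
rates = [100, 110, 95, 105, 100, 102, 40, 38]
print(sudden_drops(rates))
```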

Hardware Configuration

Try to spread your database and log volumes across SCSI controllers.

Use several small volumes for disk pools rather than a small number of large volumes. Sessions lock volumes so more volumes means more simultaneous sessions.


Tape to Tape copy performance

The options which most affect tape-to-tape copy are movebatchsize, movesizethresh and bufpoolsize. Bufpoolsize is explained above.

Movebatchsize and Movesizethresh determine how many files are grouped together and moved in a single transaction. Movebatchsize is the number of files which will be grouped, movesizethresh is the cumulative size of all the files that will be moved. Files are batched up until one of these thresholds is reached, then the files are sent. The default for movebatchsize is 1000, which is the maximum, and the default for movesizethresh is 4096. It is possible to increase movesizethresh up to as far as 32768. However, if the numbers are set high, then you will need more space in the recovery log. If you change the settings, keep an eye on the log for a while, and make sure it is not getting too full.
These parameters, and TXNGROUPMAX, can be dynamically changed by TSM, if SELFTUNETXNsize is set to YES.