Reliability

A batch job can fail for a variety of reasons: bad input data, program errors, hardware errors and space problems are just some. The real issue is that batch jobs are usually linked together, so if one fails, the rest wait until it is fixed. The odd failure here and there is not much of an issue, but if your failure rate is 2% or more, then you probably have an underlying problem. One way to investigate this is to use the process below.

  • Analyse the data
    If you record job failures in a problem management system, then try running reports against the database. If not, you may be reduced to logging failures as they happen in your favorite data repository: an Excel spreadsheet, an Access database, or a piece of paper.
  • Look for trends
    Apart from the simple "do we get a lot of the same abend code" analysis, also look for time patterns. Do you get the same failures at month end or weekend? If you work on credit card systems, do they get fragile on Public Holidays?
  • Determine root causes
    So now you've got a trend. Why is it happening? If your credit card systems go wild on Public Holidays, it's almost certainly because the masses are out spending loads of money. Before you go off on your holiday -
  • Apply fixes
    To continue with the credit card example, you don't really want to discourage people from using their cards, so make sure you provide enough spare resources on public holidays to cope.
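
The first two steps above can be sketched in a few lines of Python. This is only an illustration, assuming you can export job runs as simple records of job name, run date and abend code; the job names, dates and function names here are all hypothetical, not taken from any real scheduler or problem management product.

```python
import calendar
from collections import Counter
from datetime import date

# Hypothetical export of job runs: (job name, run date, abend code or None).
# None means the job ended cleanly.
RUNS = [
    ("CARDPOST", date(2024, 1, 31), "SB37"),
    ("CARDPOST", date(2024, 2, 3), None),
    ("CARDSTMT", date(2024, 2, 4), "S0C7"),
    ("CARDPOST", date(2024, 2, 5), None),
    ("CARDSTMT", date(2024, 2, 29), "SB37"),
]


def failure_rate(runs):
    """Fraction of runs that ended with an abend (0.0 for an empty list)."""
    if not runs:
        return 0.0
    failed = sum(1 for _, _, abend in runs if abend is not None)
    return failed / len(runs)


def abend_counts(runs):
    """How often each abend code appears among the failed runs."""
    return Counter(abend for _, _, abend in runs if abend is not None)


def day_type(run_date):
    """Classify a date as 'month end', 'weekend' or 'normal'."""
    last_day = calendar.monthrange(run_date.year, run_date.month)[1]
    if run_date.day == last_day:
        return "month end"
    if run_date.weekday() >= 5:  # Monday is 0, so 5 and 6 are Sat and Sun
        return "weekend"
    return "normal"


def failures_by_day_type(runs):
    """Count failed runs by the kind of day they ran on."""
    return Counter(day_type(d) for _, d, abend in runs if abend is not None)


if __name__ == "__main__":
    rate = failure_rate(RUNS)
    print(f"Failure rate: {rate:.1%}")
    if rate >= 0.02:  # the 2% rule of thumb from the text
        print("Failure rate is 2% or more, so look for an underlying problem")
    print("By abend code:", dict(abend_counts(RUNS)))
    print("By day type:", dict(failures_by_day_type(RUNS)))
```

A public holiday check would need a calendar for your own country, so it is left out here; it could be added as one more branch in day_type.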

Another example might be cartridge failures that trace back to data written on one particular cartridge drive. Get your maintainer in to fix it!

The one storage item that is worth keeping an eye on is space allocations. There are three certain things in life: death, taxes and data growth! If you allocate space through Data Classes, you need to review the values from time to time to see if they are still appropriate. Your data will be growing, so make sure that your data classes keep pace. You should get very few space abends with a well-designed DFSMS system, but keep an eye on the numbers and be prepared to tune DFSMS if you see too many SB37 or SE37 errors.
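
As a rough way to judge when a data class is due for review, you can project data growth against the space it allows. The sketch below assumes steady compound monthly growth, which real data rarely follows exactly, and the function name and its parameters are hypothetical. Here allocated_mb simply stands in for whatever total space your data class settings permit.

```python
import math


def months_until_full(current_mb, allocated_mb, monthly_growth):
    """Smallest whole number of months after which data growing at a
    compound monthly rate reaches or passes the allocated space.

    monthly_growth is a fraction, e.g. 0.05 for 5% a month.
    """
    if current_mb <= 0 or allocated_mb <= 0:
        raise ValueError("sizes must be positive")
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    if current_mb >= allocated_mb:
        return 0  # already at or over the limit
    # Solve current * (1 + g) ** n >= allocated for n, then round up.
    months = math.log(allocated_mb / current_mb) / math.log(1 + monthly_growth)
    return math.ceil(months)


if __name__ == "__main__":
    # Hypothetical figures: 500 MB used, 1000 MB allowed, 5% growth a month.
    print(months_until_full(500, 1000, 0.05))
```

If the answer comes out at only a few months, that is a hint to review the data class values before the space abends start to appear.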

This all sounds like simple common sense, but there's little point spending a fortune tuning a batch run if all you do is speed it on its way to the next failure.

However, with the best will in the world, you will still get batch failures from time to time. To make sure they are fixed correctly and quickly, you need to develop good, effective recovery procedures. With these in place, you should be able to cope with the occasional failure without it delaying your systems too much.
