Reliability

A batch job can fail for a variety of reasons; bad input data, program errors, hardware errors and space problems are just some of them. The real issue is that batch jobs are usually linked together, so if one fails, the rest wait for it to be fixed. The odd failure here and there is not too much of an issue, but if your failure rate is 2% or more, then you probably have an underlying problem. One way to investigate this is to record your failures, then trend them to find common causes.

Another example might be cartridge failures which, when trended, all point back to data created on a particular cartridge drive. Get your maintainer in to fix it!
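
There is no standard tool for this trending, so here is a minimal sketch in Python of the idea. It assumes a hypothetical CSV extract of failure records with jobname, run_date, abend_code and cause columns; the file name, column layout and 2% threshold are illustrative assumptions, not a real product format.

```python
import csv
from collections import Counter

def trend_failures(path, total_jobs_run):
    """Summarise batch failures from a CSV extract with columns:
    jobname, run_date, abend_code, cause (hypothetical layout)."""
    with open(path, newline="") as f:
        failures = list(csv.DictReader(f))

    # Overall failure rate: 2% or more suggests an underlying problem.
    rate = 100.0 * len(failures) / total_jobs_run
    print(f"Failure rate: {rate:.1f}% ({len(failures)} of {total_jobs_run} jobs)")
    if rate >= 2.0:
        print("Rate is 2% or more - look for an underlying problem.")

    # Trend by cause and by job: repeat offenders point to the real issue.
    for field in ("cause", "jobname"):
        print(f"\nTop values for {field}:")
        for value, count in Counter(row[field] for row in failures).most_common(5):
            print(f"  {value}: {count}")

trend_failures("batch_failures.csv", total_jobs_run=5000)
```

The point of the sketch is the grouping, not the code: once failures are counted by cause and by job, patterns like the faulty cartridge drive above stand out quickly.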

The one Storage item that is worth keeping an eye on is space allocations. There are three certain things in life: death, taxes and data growth! If you allocate space through Data Classes, you need to review the values from time to time to see if they are still appropriate. Your data will be growing, so make sure that your data classes keep pace. You should get very few space abends with a well designed DFSMS system, but keep an eye on the numbers and be prepared to tune DFSMS if you see too many SB37 or SE37 errors.
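
As a simple way to keep an eye on those numbers, the sketch below tallies SB37 and SE37 abends per dataset from a plain-text log extract. The log line layout, file name and review threshold are assumptions made for illustration; adapt them to whatever your site's reporting actually produces.

```python
import re
from collections import Counter

def count_space_abends(log_path, threshold=5):
    """Tally space abends per dataset from a plain-text log extract.
    Assumes lines like '2024-05-01 SB37 PROD.PAYROLL.MASTER' (hypothetical)."""
    pattern = re.compile(r"\b(S[BE]37)\b\s+(\S+)")  # SB37 / SE37 then dataset
    hits = Counter()
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                hits[m.group(2)] += 1

    # Datasets hitting the threshold are candidates for data class tuning.
    for dataset, count in hits.most_common():
        flag = "  <- review its data class" if count >= threshold else ""
        print(f"{dataset}: {count} space abend(s){flag}")

count_space_abends("space_abends.log")
```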

This all sounds like simple common sense, but there's little point spending a fortune tuning a batch run if all you do is speed it on its way to the next failure.

However, with the best will in the world, you will still get batch failures from time to time. To make sure they are fixed quickly and correctly, you need to develop good, effective recovery procedures. With these in place, you should be able to cope with the occasional failure without it delaying your systems too much.
