Admin Alert: Quick and Dirty Ways to Find Jobs Gone Wild
December 6, 2006 Joe Hertvik
i5, iSeries, and AS/400 administrators dread the call telling them that a) the system has ground to a halt; b) they had better figure out why, because c) invoices are not printing and orders are not being entered; and d) the company president wants to talk to them. Fortunately, there are some simple i5/OS tools that you can use to find runaway jobs that cause performance problems.
When i5 systems choke up like this, chances are good that the system is slowing down because one or more jobs are using too much CPU time. Such a job could be in an infinite loop, it could be a spider crawling every page in your Web site and forcing a read of every single record in your database, or it could be a programmer who is interactively updating a 10-million-record sales detail file. Many times, these problems are exacerbated by aging equipment that is now undersized for current workloads.
To quickly identify and stop misbehaving jobs before they stop my system, I find that the following i5/OS and OS/400 tools are the most effective in determining who’s doing what to my CPU: the Work with Active Jobs command (WRKACTJOB), the Work with System Activity command (WRKSYSACT), and the Active Jobs display in iSeries Operations Navigator (OpsNav).
Each tool has its own pluses and minuses in identifying rogue jobs. Here’s how I use these tools to quickly figure out which jobs are slowing down my production.
WRKACTJOB–The Old Standby
Old when the System/38 was young, the text-based Work with Active Jobs command’s purpose is to display performance and status information for currently active jobs. Information is gathered for each job and, when called with its default settings, WRKACTJOB presents each job’s information alphabetically by job name within the subsystems that the jobs are running in.
To display a WRKACTJOB screen, you simply enter the following command from a 5250 green-screen terminal session.
WRKACTJOB
To find a hungry CPU-eating job, you can easily sort the display by moving your cursor under the CPU % column and pressing the F16 key (the CPU % column shows what percentage of CPU processing capacity each active job is using). Before you do that, however, you may want to press the F14 key on this screen to include hidden jobs, such as disconnected jobs, that are not listed on the default WRKACTJOB screen.
I have found that disconnected jobs can sometimes chew up excessive CPU time, especially when they are disconnected in the middle of a data operation. Once these jobs are cancelled, the system usually goes back to normal. So in a runaway CPU situation, I always check to see if these jobs are up to any mischief.
By sorting my WRKACTJOB screen by CPU %, I find that it’s usually easy to spot jobs that are using too much CPU time, but there are a few gotchas. First, IBM’s help text reports that there are certain situations where an active job may not be shown on a WRKACTJOB display. This is because WRKACTJOB may misread the job’s status indicator and fail to display that job under the following circumstances: when there are too many jobs in the system; when there are no available activity levels to start a new job in a subsystem; or when the subsystem running the job is interrupted by other requests. So while it may not be highly probable that a rogue job will disappear from the WRKACTJOB screen, it is possible, and you should refresh your sorted WRKACTJOB screen (by using the F5 key) several times just to make sure a rogue job isn’t slipping by undetected.
Second, it’s important to understand that WRKACTJOB displays job status based on the elapsed time since WRKACTJOB was first run inside your session. If you ran WRKACTJOB an hour ago and you now want to see which job is currently consuming too much CPU time, the CPU % statistics will be diluted because they were taken over too long a period. Averaging statistics over time can easily hide a rogue job’s CPU-eating binge, so you have to remember to sort your active jobs by CPU % and then press the F10 key to reset your statistics to an elapsed time of zero seconds. Only then will you be able to tell which jobs are currently using the most CPU.
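To see why a long elapsed time hides a runaway job, here is a minimal Python sketch of the averaging effect. It is only an illustration under simplifying assumptions (a single processor, and a hypothetical job that sat idle for 59 minutes before looping flat out for 60 seconds); it is not how i5/OS calculates its figures internally.

```python
def cpu_percent(cpu_seconds_used: float, elapsed_seconds: float) -> float:
    """Average CPU % over a measurement window (single processor assumed)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * cpu_seconds_used / elapsed_seconds

# Hypothetical job: idle for 59 minutes, then busy for the last 60 seconds.
diluted = cpu_percent(cpu_seconds_used=60, elapsed_seconds=3600)
fresh = cpu_percent(cpu_seconds_used=60, elapsed_seconds=60)

print(f"Averaged over the last hour:  {diluted:.1f}%")
print(f"Averaged since a fresh reset: {fresh:.1f}%")
```

The same 60 seconds of CPU time looks trivial when spread across an hour and alarming when measured over the minute in which it was actually consumed, which is why resetting the elapsed time matters.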
If you want a running total of which jobs are tops in CPU consumption over the short run, you can press the F19 key to tell WRKACTJOB to automatically refresh its statistics every 300 seconds (its default). If you want to set WRKACTJOB’s Automatic Refresh Interval parameter (INTERVAL) to something quicker (say every five seconds), you can exit and then restart WRKACTJOB by using the following command.
WRKACTJOB INTERVAL(5)
Once you are back inside the WRKACTJOB screen, be sure to sort your screen by CPU % again and turn on Automatic Refresh. If you’re examining the situation for a long time, the only thing you have to remember is to periodically press the F10 key to reset the statistics while the automatic refresh is running.
WRKSYSACT–Better than WRKACTJOB?
As opposed to WRKACTJOB, the Work with System Activity command (WRKSYSACT) allows you to collect and view job performance data on a real-time basis without any regard to the subsystem each job is running in. Also unlike WRKACTJOB, there is no reset statistics function in WRKSYSACT because every WRKSYSACT refresh shows only the activity that was reported since the previous view. The command’s default setting is to sort active jobs in descending order according to how much CPU activity each job is currently utilizing. So you may be able to quickly pinpoint which job is hogging the system by running WRKSYSACT and pressing the F5=Refresh key a number of times.
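The "activity since the previous view" idea can be sketched in a few lines of Python. This is a rough model rather than WRKSYSACT’s actual implementation: the job names and CPU counters are hypothetical, and a single processor is assumed.

```python
def interval_usage(previous, current, interval_seconds):
    """Return (job, cpu_percent) pairs for one interval, busiest first.

    previous and current map job name -> cumulative CPU seconds
    at the start and end of the interval.
    """
    usage = []
    for job, cpu_now in current.items():
        cpu_before = previous.get(job, 0.0)  # jobs new in this interval start at 0
        usage.append((job, 100.0 * (cpu_now - cpu_before) / interval_seconds))
    return sorted(usage, key=lambda pair: pair[1], reverse=True)

# Hypothetical job names and counters, taken five seconds apart.
snapshot_1 = {"ORDENTRY": 120.0, "INVPRINT": 300.0, "WEBCRAWL": 40.0}
snapshot_2 = {"ORDENTRY": 120.5, "INVPRINT": 300.25, "WEBCRAWL": 44.0}

ranking = interval_usage(snapshot_1, snapshot_2, interval_seconds=5)
for job, pct in ranking:
    print(f"{job:<10} {pct:5.1f}%")
```

Because only the difference between two snapshots is reported, a job with a large lifetime CPU total (INVPRINT here) drops down the list if it is quiet right now, while a job that has just started misbehaving rises to the top.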
To display a WRKSYSACT screen, you simply enter the following command from a 5250 green-screen terminal session.
WRKSYSACT
Because CPU job utilization often happens in short bursts, you can use WRKSYSACT’s Automatic Refresh function to view which jobs are consuming the most CPU over a shorter amount of time. As with WRKACTJOB, you can press the F19 key to activate auto-refresh and (by default) it will refresh your screen data every five seconds. Because of this active update, if you see the same job consistently using the highest amount of CPU on your system, that job may be your performance problem. If you want to change the auto refresh time to another value (say 10 seconds), you can exit WRKSYSACT and re-enter the command in the following way:
WRKSYSACT INTERVAL(10)
The Interval Length parameter (INTERVAL) specifies the number of seconds to wait before automatically refreshing the screen when F19 is pressed. Besides automatically sorting and resetting individual user CPU consumption data, WRKSYSACT also shows overall CPU utilization for the system, as well as individual and overall database CPU utilization. So you can use this screen to look for bottlenecks caused by jobs performing excessive database activity.
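Watching for a job that stays at the top of the list across several automatic refreshes amounts to a simple tally. The Python sketch below shows the idea with made-up job names; it assumes you have noted the busiest-first ordering at each refresh, and it does not read anything from the system itself.

```python
from collections import Counter

def top_job_counts(rankings):
    """Count how often each job was the busiest across a series of refreshes.

    rankings is a list of per-refresh job lists, each sorted busiest first.
    """
    return Counter(ranking[0] for ranking in rankings if ranking)

# Hypothetical job names observed over five refreshes.
refreshes = [
    ["WEBCRAWL", "ORDENTRY", "INVPRINT"],
    ["WEBCRAWL", "INVPRINT", "ORDENTRY"],
    ["INVPRINT", "WEBCRAWL", "ORDENTRY"],
    ["WEBCRAWL", "ORDENTRY", "INVPRINT"],
    ["WEBCRAWL", "INVPRINT", "ORDENTRY"],
]

counts = top_job_counts(refreshes)
suspect, times_on_top = counts.most_common(1)[0]
print(f"{suspect} led {times_on_top} of {len(refreshes)} refreshes")
```

A job that leads only once may simply be in a short burst; one that leads nearly every refresh is the better candidate for a closer look.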
There are two downsides to running WRKSYSACT to check system performance. The first is that the command can only be run by one user at a time. If you start the command on one machine and then try to start the command a second time in another session, the system will lock you out of WRKSYSACT until the first iteration of the command ends. The second problem with WRKSYSACT is that it may not necessarily show disconnected, inactive, and suspended jobs. So if the problem lies with one of those jobs, you may not find it by using this screen.
Text Commands Versus Graphical Processing
So far, I have talked about WRKACTJOB and WRKSYSACT as two command line ways to discover and deal with jobs gone wild. They are not the most efficient tools, but they do the job. Another advantage of using green-screen tools is that when a runaway job is hogging the system and slowing down processing, you have a better chance of getting a command line solution going than you do with a graphical solution, such as you might find with iSeries Operations Navigator. If nothing else, you can always run green-screen commands on your system console, which you can’t do with a graphical tool. Graphical tools like OpsNav usually need slightly higher resources to run, and they may even hinder you in a tight processing environment.
Going Graphical with Active Jobs
OpsNav offers a graphical version of the WRKACTJOB screen called Active Jobs. To start Active Jobs, open the Work Management -> Active Jobs node listed under the OpsNav partition that you want to examine. You can also reach Active Jobs by right-clicking on the partition name in the OpsNav tree and selecting System Status from the pop-up menu that appears. Once you’re inside the System Status window, open the Jobs tab in that window and click on the Active Jobs button on that screen. Both techniques bring up the same Active Jobs screen.
Once inside the Active Jobs screen, you can sort your active jobs in descending order by clicking on the CPU % column twice (once to sort jobs in ascending order, a second time to sort them in descending order). You can also auto refresh your screen by clicking on View -> Customize this view -> Auto Refresh from the Active Jobs screen’s Windows toolbar. But you should note that the Auto Refresh for Active Jobs isn’t as flexible as those used in the WRKACTJOB and WRKSYSACT commands. This is because you cannot specify a timed refresh value of less than one minute for Active Jobs’ auto refresh function (valid values range from 1 minute to 1440 minutes, which really is of little use when trying to identify jobs that are slowing down the system). However, one of the nicer Active Jobs features is that you can also customize your screens to include or exclude certain types of jobs running in the system. So if you suspect an interactive or communications job is slowing down your system, you can press the F11 key from the Active Jobs screen and tell the display to only show those types of jobs. This feature is also called from the Active Jobs Windows toolbar by clicking on View -> Customize this view -> Include, and it can be helpful in cutting down on the clutter when reading Active Jobs screens.
As opposed to WRKACTJOB and WRKSYSACT, there is no obvious way to see the overall CPU or Database utilization percentages for the entire system with Active Jobs. Like WRKACTJOB, Active Jobs gathers and reports on cumulative job statistics since the first time the panel was opened, which can mask a rogue job if the panel has been running for a long time without having its statistics reset. You can reset system job statistics by clicking on the Reset Statistics icon (the circular arrow) in the Windows toolbar. And like the two green-screen commands, you can also refresh (but not reset) the statistics at any time by pressing the F5 key.
A Difficult Situation
It’s always difficult when jobs go crazy and start slowing down the system. Although there is no perfect tool for identifying rogue jobs when they occur, these three techniques can help you narrow down the field in order to bring your system back into harmony as quickly as possible.
About Our Testing Environment
All configurations described in this article were tested on an i5 box running i5/OS V5R3. The iSeries Navigator configuration (OpsNav) was tested by using the OpsNav version that comes with iSeries Access for Windows V5R3M0. The WRKACTJOB and WRKSYSACT commands have been included with i5/OS and OS/400 for a very long time now, so they should be available in your version. However, you may notice minor variations in pre-V5R3 copies of these features. These differences may be due to incremental command improvements that have occurred from release to release.