Admin Alert: When System Job Tables Attack, Part III

October 8, 2008 Joe Hertvik

In the last two issues, I discussed i5/OS problems that occur with system job table overflows, and how overflows can hurt your system. This week, I’m wrapping up the series by going over my own recent experience with job table overflows. I’ll also share some excellent reader advice on job tables that’s been coming into my mailbox lately.

What Are System Job Tables? The Short Version

System job tables are internal system objects that i5/OS uses to track partition jobs. There can be up to 10 job tables on a partition, and the maximum number of system jobs for a partition is designated in the Maximum Number Of Jobs (QMAXJOB) system value. As the number of partition jobs approaches the system’s QMAXJOB value, your system job tables start to fill up and a number of issues can occur. For more information about how these problems occur and how to take care of excessive job entries, see part one and part two of this series.
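To see where a partition currently stands, a minimal CL sketch follows. It assumes you are signed on with enough authority to display system values, and it uses the Display Job Tables (DSPJOBTBL) command, which is not otherwise covered in this article.

```cl
/* Show the current Maximum Number Of Jobs setting */
DSPSYSVAL SYSVAL(QMAXJOB)

/* Show each system job table and its entry counts */
DSPJOBTBL
```

Comparing the entry counts that DSPJOBTBL reports against the QMAXJOB value gives a rough idea of how much room is left before the tables fill up.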

Joe in Job Table Overflow Land: A Case Study

I started investigating system job table issues when I began receiving CPI1468 messages (System job tables nearing capacity). The first occurrence was a mild attack that didn’t really affect system processing. Because the tables were filling slowly, it was easy to correct by clearing out some output queues containing a large number of spooled files. No big deal.

However, as I was writing this article, one of my Capacity BackUp (CBU) partitions started sending out CPI1468 messages faster than a banker asking for a bailout. By the time I could check the machine, I was unable to start an interactive 5250 session and I couldn’t even sign on to the system console. My partition was no longer starting any new work.

Horrified! I had become the subject of my own article.

The first decision was what to do with my running partition. My Hardware Management Console (HMC) was showing a normally running system with no errors. But nobody could use the partition and I had to do something to open up the system again…quick.

Here’s what I did after consulting with IBM:

1. Using the HMC, I put the partition into manual startup mode. I did this by opening the Server and Partition: Server Management node and right-clicking on my partition name. I selected Properties from the pop-up menu that appeared. I then selected the Settings tab from the Partition Properties screen. On the Settings screen, I changed the Keylock Position dropdown in the Boot area from Normal to Manual. I then clicked OK to save my Settings change.

2. I proceeded to restart my system by right-clicking the partition name and selecting Restart Partition from the pop-up menu that appeared. It’s important to note that because I wasn’t able to bring down my system cleanly, this would end all my system jobs immediately. However, since I had double-confirmation that my job tables were filled (the CPI1468 messages and the fact that I couldn’t start the system console), I realized that my system was gridlocked and I needed to restart it.

3. Before IPLing, the HMC asked me if I wanted to perform an Immediate IPL or if I wanted to perform a System Dump. Since I knew what the problem was, I took the option to immediately IPL. Producing a system dump would add a lot of time to my system IPL.

4. When you restart in manual mode, the restart brings the console to the IPL or Install the System screen. To IPL the system from this screen, I took option 1, Perform an IPL. As the system IPLed, it asked me to sign on and it eventually displayed the IPL Options screen, which has a number of different IPL options that affect how the system starts up.

5. Here are the Yes/No IPL options listed on this screen that would affect my job table overflow.

Start system to restricted state–I set this option to “Y”. Since the system wasn’t currently starting jobs, I didn’t want to start my subsystems. I wanted the system to start in restricted state, where I could clean up my jobs and take care of the issue. I could always restart my subsystems later.

Define or change system at IPL–If I changed this option to “Y”, the system would take the system console to a special screen where I could change or display system values as the IPL continued. IBM recommended that I go to this screen and change my QMAXJOB system value to a higher value than it was currently set at. The theory was that by increasing QMAXJOB, i5/OS would extend my job tables’ permanent entries structure, which would allow the partition to start running system jobs again.

6. At this point, I continued the IPL and when the system allowed me to work with my system values, I doubled my QMAXJOB value from 163520 to 327040. This should have allowed the system to start up in restricted mode so that I could clear out the excessive jobs that were clogging up my job tables. Unfortunately, my system hung during the rest of the IPL and eventually produced a B9003610 error on the HMC for the partition. I called IBM again. They said that error specifically pointed to the system job table problem. So I wouldn’t be able to solve my problem by expanding the tables and going into restricted state.

7. I manually IPLed the system again. When I got to the same IPL options screen that I mentioned in step 4, I turned on the following IPL options that would have an effect on the permanent job structures in my system job tables.

Clear job queues–When set to “Y”, this clears out any jobs that are sitting in my job queues waiting to be run.

Clear output queues–When set to “Y”, the system will clear out all output queues during the IPL.

Clear incomplete job logs–When set to “Y”, the system will delete the job logs for all jobs that were active the last time the system was powered down.

These values take effect immediately and they will be in force during the next IPL steps. After the IPL, all three values are reset to “N” (off) again.

8. As the IPL progressed, it cleared all the output queues and cleaned up my system job tables. The system started in restricted mode, and I was able to verify that my system job tables were cleared and compressed. The bad news was that I had to sacrifice my spooled files to get the system back in shape. Perhaps if I had been able to IPL into restricted state without clearing the output queues, I could have saved some of my spooled files.

Proactively Maintaining Your System Job Tables

Aside from performing triage when your system job tables overflow, here are some simple steps you can take to maintain your system job tables.

1. If you’re on i5/OS V5R4 or above, apply the appropriate PTFs and lower the threshold value at which the CPI1468 message is sent. (See the last issue for details.)

2. Change your clean up options to delete job logs and other system output on a more regular basis (again, see last issue).

3. When you receive a CPI1468 warning, delete unnecessary and excessive spooled file output as much as you can.

4. To improve performance when you have a large number of unused permanent job table entries in system job tables, IPL your system to compress the tables. You can do this by entering the following Change IPL Attributes (CHGIPLA) command and pressing F4 to prompt for its parameters:

CHGIPLA

On the Change IPL Attributes screen that appears, change the Compress Job Tables (CPRJOBTBL) parameter to *NEXT, which tells the system to compress the job tables during the next IPL. Then IPL your system at the next available opportunity. During the IPL, i5/OS will compress the tables and remove unnecessary available entries. Theoretically, this should help your performance. However, be warned that changing this value can add a significant amount of time to your IPL.

5. Occasionally check your system job tables for damage. You can do this by going into CHGIPLA again and changing the Check Job Tables (CHKJOBTBL) parameter to *ALL. This will check your job tables for damage during the next IPL. After your next IPL completes, set this value back to *ABNORMAL (its default), which only checks the job tables during an abnormal IPL.
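Steps 2 through 5 above can be sketched as CL commands. This is only an illustration under stated assumptions: the seven-day retention value and the MYLIB/OLDRPTS output queue name are hypothetical, and the Change Cleanup (CHGCLNUP), Work with Output Queue (WRKOUTQ), and Clear Output Queue (CLROUTQ) commands are one reasonable way to carry out steps 2 and 3, not the only way.

```cl
/* Step 2: keep job logs and other system output for fewer days. */
/* The 7-day value is only an illustration.                      */
CHGCLNUP ALWCLNUP(*YES) SYSPRT(7)

/* Step 3: look for crowded output queues, then clear one. */
/* MYLIB/OLDRPTS is a hypothetical queue name.             */
WRKOUTQ OUTQ(*ALL)
CLROUTQ OUTQ(MYLIB/OLDRPTS)

/* Step 4: compress the job tables during the next IPL. */
CHGIPLA CPRJOBTBL(*NEXT)

/* Step 5: check all job tables for damage during the next IPL. */
CHGIPLA CHKJOBTBL(*ALL)

/* After that IPL completes, restore the default setting. */
CHGIPLA CHKJOBTBL(*ABNORMAL)
```

Be careful with CLROUTQ: it removes the spooled files in the named queue, so confirm that nobody still needs that output before running it.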

Readers Chime In On System Job Tables

While I’m finished with everything I have to say about job tables, our regular Admin Alert readers still have something to say about the subject. Here are some additional tips and corrections that came in over the last two weeks.

Reader Richard Shearwood pointed out that in part two of this series, I said that an i5/OS admin could alleviate system job table problems “by simply clearing overcrowded job queues when you find them.” Richard questioned whether that was a “little harsh.” And he’s right. I made a mistake when I said that you had to clear overcrowded job queues. I meant to say that the best way to alleviate system job table problems is to clear overcrowded output queues, not job queues. I apologize for the mistake.

Reader Dawn May from IBM wrote in to let me know that IBM recently published a Redpaper that complements my recent columns on system job tables. The Redpaper is called Best Practices for Managing IBM i Jobs and Output (and a few other special tips), and it covers a lot of the same ground that I’ve gone over in the last few weeks, as well as some other items that I haven’t touched on. (Dawn is also one of the authors.) Like these articles, it’s a good primer on everything you need to understand and manage your system job tables.

Finally, reader Bill Bremerman points out one significant flaw with increasing the Maximum Number Of Jobs (QMAXJOB) system value. He recommended that you NEVER change QMAXJOB to its maximum value, 485000. I agree. If you max out your QMAXJOB value and then hit a system job table overflow, you will not have the option of manually IPLing your system and increasing QMAXJOB to get out of the problem, as described above. So if you’re increasing QMAXJOB to handle job tables, be careful not to increase it too much. You may regret it.
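A moderate increase that leaves headroom might look like the following sketch. It assumes the partition is still accepting commands, which was not true in the case study above, where the value had to be changed during a manual IPL. The 327040 figure is simply the value from that case study, not a recommendation.

```cl
/* Check the current value before changing anything */
DSPSYSVAL SYSVAL(QMAXJOB)

/* Raise the limit while staying well below the 485000 maximum */
CHGSYSVAL SYSVAL(QMAXJOB) VALUE(327040)
```

Whatever value you choose, leave enough room below 485000 that a further increase is still possible in an emergency.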