Admin Alert: Looking for i5/OS Trouble, Part II
January 14, 2009 Joe Hertvik
Last week, I discussed the best ways to automatically monitor iSeries, System i, and Power i systems for hidden signs of trouble, and I focused on monitoring the QSYSMSG and QSYSOPR message queues for developing problems. This week, I’m changing direction to discuss several specific i5/OS monitoring situations that can also help you detect system problems as they occur.
As I mentioned last week, the following are the best places to look for trouble in your OS/400, i5/OS, and i operating systems: the QSYSMSG and QSYSOPR message queues; sudden changes in disk storage usage; damaged objects; jobs and subsystems that must always be running; problems with interactive and batch jobs; backed up job queues; and output queues with excessive numbers of spooled files.
I covered QSYSMSG and QSYSOPR monitoring last time, so today I’ll look at what to monitor for in the other areas of my list. IMHO, the best way to monitor for and be alerted to problem situations on your partition is to use an automated system monitoring tool, such as Bytware’s MessengerConsole, Help/Systems’ Robot/ALERT, or CCSS’ QSystem Monitor package. Each of these products can look for system problems as they occur and let you know via pages, emails, text messages, etc., when something is going awry. They also allow you to set up custom monitors that are specific to your own environment. While it’s true that you can build your own monitoring system without purchasing a package, I passionately believe that it’s generally cheaper in the long run and more effective to base your monitoring platform on one of the established products listed here.
When I’m using one of these tools, these are the items I am most likely to monitor for.
Sudden Changes In Disk Storage Usage
In addition to looking for the serious storage overflow conditions that I talked about last time, you can also use a monitoring package to detect more subtle shifts in storage usage. On my systems, I usually set up a monitor to alert the staff when there are spikes in disk drive activity.
Disk utilization spike detection (such as when ASP disk utilization goes up 5 percent in an hour) is valuable because it can indicate unusual activity on the system, which may require someone to investigate. The disk space may be going up because a program is in a loop and chewing up drive space. Or you could have a job gone wild that’s spitting out thousands of joblog entries. Either way, early detection of large changes in drive usage can be valuable in locating developing problems.
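The spike check itself is just threshold arithmetic on successive utilization samples. Here is a minimal sketch in Python (the function name, the 5 percent threshold, and the hourly sampling interval are my own illustrative assumptions, not values from any particular monitoring package):

```python
def disk_spike(samples, threshold=5.0):
    """Given successive ASP utilization samples (percent used,
    one per hour), return (earlier, later) pairs where usage
    jumped by at least `threshold` points in one interval."""
    spikes = []
    for earlier, later in zip(samples, samples[1:]):
        if later - earlier >= threshold:
            spikes.append((earlier, later))
    return spikes

# A looping program chewing up drive space shows as a sudden jump.
print(disk_spike([61.2, 61.4, 61.5, 67.0, 67.1]))  # [(61.5, 67.0)]
```

A real monitor would feed this from periodic ASP utilization readings and page someone when the result is non-empty.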
Where’s the Damage?
It’s also helpful to set up your monitoring tool to look for damaged objects as they occur. A damaged object can not only corrupt other data, it can stop objects from being backed up. Damaged objects may also prevent an object from being correctly replicated to a Capacity BackUp (CBU) system.
Whenever the system detects a damaged object, it sends a system message out to the QSYSOPR message queue with a message ID in the CPF8100 to CPF8299 range. If you set up your monitoring system to alert your staff when one of these messages enters the QSYSOPR message queue, you will immediately know when a damaged object has been detected and you can take measures to find and fix the object.
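Because IBM numbers these messages in a contiguous CPF8100 to CPF8299 block, a monitor can recognize them with a simple range test on the message ID. A hypothetical sketch in Python (the sample queue contents are chosen for illustration):

```python
def is_damage_message(msg_id):
    """True if the message ID falls in the CPF8100-CPF8299 block
    that i5/OS uses for damaged-object messages."""
    return "CPF8100" <= msg_id.upper() <= "CPF8299"

# Sample QSYSOPR message queue contents (IDs chosen for illustration).
queue = ["CPF1164", "CPF8113", "CPA5305", "CPF8201"]
print([m for m in queue if is_damage_message(m)])  # ['CPF8113', 'CPF8201']
```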
Jobs and Subsystems That Must Always Be Running
For a typical environment, certain subsystems and jobs must always be running or severe problems can occur with your data. Examples might include the QINTER subsystem, as well as server-based subsystems and jobs that provide critical round-the-clock functions (such as validating credit cards). If you have an automated monitoring program, I recommend that you find all your critical jobs and subsystems and set up monitors to alert the staff when any of them are not running. By setting up these monitors, your staff will immediately know when a critical application job has stopped working.
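The check itself reduces to a set difference between the jobs that should be running and the jobs that are. A minimal Python sketch (QINTER is the standard interactive subsystem; CREDITAUTH is a hypothetical name standing in for your own round-the-clock work):

```python
# QINTER is the standard interactive subsystem; CREDITAUTH is a
# hypothetical always-on job that validates credit cards.
REQUIRED = {"QINTER", "CREDITAUTH"}

def missing_critical(active, required=REQUIRED):
    """Return the required subsystems/jobs absent from the
    active-job snapshot, sorted for stable alert text."""
    return sorted(required - set(active))

print(missing_critical(["QINTER", "QBATCH", "QSYSWRK"]))  # ['CREDITAUTH']
```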
Problems with Interactive Jobs
When monitoring interactive jobs, I usually set up an automated monitor to detect the following two conditions.
1. An interactive job that is using more than 30 percent of the available CPU–This could be indicative of an interactive job that has disconnected from the system (as sometimes happens with scanners and other devices that connect to the operating system through wireless connections). This may also indicate an interactive job that has problems or that is running in a loop.
2. Interactive jobs running at less than priority 20–This monitor might detect a case where a user has lowered his run priority in order to process his job faster.
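Both interactive conditions can be expressed as simple comparisons against a snapshot of active jobs. A hypothetical sketch in Python (the tuple layout and job names are illustrative; remember that on i5/OS a numerically lower run priority means a higher priority):

```python
def interactive_alerts(jobs, cpu_limit=30.0, base_priority=20):
    """jobs: (name, cpu_pct, run_priority) tuples from an
    active-job snapshot. Flags jobs over the CPU ceiling and
    jobs whose priority number is below the interactive
    default of 20 (lower number = higher priority)."""
    alerts = []
    for name, cpu, prio in jobs:
        if cpu > cpu_limit:
            alerts.append((name, "high CPU"))
        if prio < base_priority:
            alerts.append((name, "priority better than 20"))
    return alerts

snapshot = [("SCANNER01", 45.0, 20), ("DSP05", 2.0, 10)]
print(interactive_alerts(snapshot))
# [('SCANNER01', 'high CPU'), ('DSP05', 'priority better than 20')]
```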
Problems with Batch Jobs
For batch jobs, the following situations might indicate problems with the system and should be monitored for.
1. Batch jobs running at priority 20 or less–This may indicate a problem in batch submission, scheduling, or even a user who doesn’t appreciate why batch jobs should never run at interactive priorities.
2. Long-running batch jobs–Depending on how long it takes critical processes to run in your shop, you may want to set up monitors that alert you when a batch job runs longer than a set amount of time. In my shop, one of our monitors alerts staff when application-oriented batch jobs run longer than one hour. Long-running batch jobs can indicate a programming issue, an inquiry job that is processing all the records in an absurdly large file, or a poorly written query running out of control.
When monitoring for long-running batch jobs, however, you have to be careful to only monitor finitely running application-oriented batch jobs instead of nearly infinite running server jobs. Application-oriented jobs usually run in batch subsystems, such as QBATCH or QPGMR, and complete in a relatively short amount of time. Server-oriented jobs usually run in their own subsystem or a system subsystem, such as QSYSWRK, and they usually remain active as long as the machine is running. To effectively set up this type of monitor, you have to exclude server jobs from examination.
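To sketch that exclusion logic, the monitor keeps a list of server subsystems to skip and flags everything else that exceeds the time limit. A hypothetical Python example (NIGHTPOST and INVRPT are made-up application job names; the subsystem list is typical but should be tuned to your shop):

```python
# Typical server subsystems hosting near-infinite jobs; tune per shop.
SERVER_SUBSYSTEMS = {"QSYSWRK", "QSERVER", "QUSRWRK"}

def long_runners(jobs, max_minutes=60, exclude=SERVER_SUBSYSTEMS):
    """jobs: (name, subsystem, minutes_running) tuples. Returns
    application batch jobs over the limit, skipping server jobs
    by subsystem."""
    return [name for name, sbs, mins in jobs
            if sbs not in exclude and mins > max_minutes]

jobs = [("NIGHTPOST", "QBATCH", 95),       # hypothetical batch job
        ("QZDASOINIT", "QUSRWRK", 4000),   # database server job, excluded
        ("INVRPT", "QPGMR", 20)]           # hypothetical, under the limit
print(long_runners(jobs))  # ['NIGHTPOST']
```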
Backed Up Job Queues
Many batch job queues are single-threaded, meaning that the job queue’s associated subsystem will only accept one job at a time for processing from that queue. You can sometimes detect long-running jobs or jobs with error messages by looking at the number of jobs waiting to be run in a single threaded job queue. If the job queue has an unusually high number of jobs waiting to run, it may indicate a long-running job was submitted from that job queue or the current job is stuck waiting for a reply to an error message. Like the long-running job monitors, a judiciously used job queue monitor can also spotlight potential trouble.
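The queue depth test is a straightforward threshold on the count of waiting jobs. A minimal Python sketch (NIGHTQ and the limit of 25 waiting jobs are illustrative assumptions; a sensible limit varies per queue):

```python
def backed_up_queues(depths, limit=25):
    """depths: {job_queue_name: jobs_waiting}. Flags queues with
    an unusually deep backlog; the right limit varies per queue."""
    return {q: n for q, n in depths.items() if n > limit}

print(backed_up_queues({"QBATCH": 3, "NIGHTQ": 41}))  # {'NIGHTQ': 41}
```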
Depending on how you process work, you may have set up one or more job streams that always start and end at certain times of the day. A good example is a backup job stream where the backup generally starts at midnight and can usually be counted on to complete at 2:00 a.m. You could set up a monitor to send an alert if the backup job is still running at 2:30 a.m., which may indicate a problem with the backup. These types of monitors require a lot of analysis and awareness of how your job streams run, but if the monitor is set up correctly, it can show you where a problem is occurring before it turns into a full-blown crisis.
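That kind of time-window monitor boils down to checking a job's status against a deadline. A simplified Python sketch (the 2:30 a.m. deadline comes from the example above; for simplicity this ignores the midnight wraparound and assumes the check runs between midnight and the deadline window):

```python
from datetime import time

def backup_overrun(still_running, now, deadline=time(2, 30)):
    """True if the backup job is still active past the time it
    can normally be counted on to finish. Ignores the midnight
    wraparound, so it assumes the check runs in the early
    morning hours after the stream starts."""
    return still_running and now >= deadline

print(backup_overrun(True, time(2, 45)))   # True
print(backup_overrun(True, time(1, 50)))   # False
```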
Excessive Numbers of Spooled Files In An Output Queue
Similar to a backed up job queue, an output queue with too many spooled files waiting to be printed may indicate a printer problem. Many packages offer output queue monitors where an alert can be sent out if there is a high number of released spooled files waiting to print from a queue. This situation may indicate a printer that is out of paper, not running, or waiting for someone to answer a printer message.
Places to Start
During the last two weeks, I’ve offered up a number of situations where you can monitor for real and potential problems on your iSeries, System i, and Power i machines. These ideas are meant to provide suggestions for items that you may want to monitor for on your system. System monitoring isn’t a static process that you set up once and never touch again. Rather, monitors need tending to in order to weed out unreliable situations and to add newer items that should be monitored. The key is to get started with a few good ideas and then adjust your monitors to provide the maximum benefit to your system.