Admin Alert: Seven Things You Should Be Monitoring On Your System

October 3, 2012 Joe Hertvik

Last year, I wrote a two-part article outlining a basic plan for monitoring and answering IBM i error messages. But while it’s important to detect and answer error messages that require a response right now, it’s equally important to detect developing situations that will cause system problems if left alone. This week, I’ll discuss seven other things besides error messages that you should be monitoring for on your IBM i systems.

The Basics

For this article, let’s assume you are already using a system monitoring product to send out pager, email, or text alerts whenever an error message shows up on your system. You can perform this monitoring by using one of the more common IBM i system monitoring products, including:

Bytware MessengerConsole
CCSS QSystem Monitor
Halcyon Software IBM i (i5/OS, System i, iSeries, AS/400) Monitoring, Scheduling & Automation Software
Help/Systems Robot/ALERT
SEA absMessage

If you’re new to system monitoring, check out part 1 and part 2 of my earlier articles on i/OS error monitoring and response strategies for a primer on setting up basic monitoring.

Once your monitoring system is set up, you’ll want to go beyond basic error message monitoring and use your monitoring software to look for developing problems on your IBM i partitions. You’ll want to find silent, non-obvious issues that can cause disruptions if not resolved.

To detect silent trouble in your system, here are seven of the most common situations that you should be monitoring for on your IBM i partition.

Long-running batch jobs.
Excessive number of jobs in job queues.
Jobs that should be running, but aren’t.
Critical lines, controllers, or devices that aren’t active.
IP interfaces not active.
Interactive users using a large amount of CPU.
Interactive response time spiking.

I’ll look at each situation in turn, and explain why you should be monitoring for them.

Situation #1: Long-running batch jobs.

It’s a good idea to set up monitors to look for jobs that are running much longer than usual. In my shop, I set up my job performance monitors to notify a tech on duty when any of the following situations occur.

Jobs that are running more than 30 minutes. I use this sparingly, looking for specific critical jobs that usually finish in a short amount of time, rather than having it monitor all jobs on the system that have been running more than 30 minutes.
Jobs that are running more than four hours. Some legitimate jobs, such as file reorganizations, may trigger this monitor while still running within guidelines. But overall, if any batch job is running more than four hours, that job may be looping or running into another problem.
Jobs that are running more than eight hours. This is a definite red alert and should be investigated.

Long-running batch jobs by themselves aren’t necessarily a problem. But they can represent an unusual situation and should be investigated to ensure that everything is running correctly on your system.
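
As a rough illustration, the three tiers above could be expressed as a small check like the one below. This is only a sketch in Python, not the configuration of any particular monitoring product; the function name, the alert labels, and the idea of passing in a set of critical job names are all assumptions made for the example.

```python
from datetime import timedelta

# Thresholds taken from the three tiers described above.
CRITICAL_JOB_LIMIT = timedelta(minutes=30)
WARNING_LIMIT = timedelta(hours=4)
RED_ALERT_LIMIT = timedelta(hours=8)


def classify_runtime(job_name, elapsed, critical_jobs):
    """Return an alert level for a running batch job, or None for no alert.

    job_name: name of the job (hypothetical input from your monitor)
    elapsed: timedelta the job has been running
    critical_jobs: set of job names that normally finish quickly
    """
    if elapsed > RED_ALERT_LIMIT:
        return "red alert"
    if elapsed > WARNING_LIMIT:
        return "investigate"
    # The 30-minute check applies only to hand-picked critical jobs.
    if job_name in critical_jobs and elapsed > CRITICAL_JOB_LIMIT:
        return "critical job running long"
    return None
```

Checking the longest threshold first means a job that has run nine hours is reported once, at the most severe level, rather than tripping all three monitors.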

Situation #2: Excessive number of jobs in a job queue.

If you have single-threaded job queues feeding work to a batch subsystem, an excessive number of jobs waiting to run may be an indicator that a problem is occurring. I like to set my paging system to alert me when more than seven or eight jobs are lined up in a batch job queue waiting to run. A backlog like that may indicate that the jobs are stuck behind a long-running batch job.
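
The queue-depth rule can be sketched in a few lines of Python. The input format (pairs of job name and job queue name) and the function name are hypothetical, and the limit of seven is simply the low end of the range mentioned above; adjust it to what is normal in your shop.

```python
from collections import Counter

# Assumed limit: alert when more than seven jobs are waiting.
QUEUE_DEPTH_LIMIT = 7


def overloaded_queues(waiting_jobs, limit=QUEUE_DEPTH_LIMIT):
    """Return {queue_name: waiting_count} for queues above the limit.

    waiting_jobs: iterable of (job_name, job_queue) pairs for jobs
                  that are waiting to run (hypothetical input format)
    """
    depth = Counter(queue for _job, queue in waiting_jobs)
    return {queue: count for queue, count in depth.items() if count > limit}
```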

Situation #3: Jobs that should be running, but aren’t.

For our nightly end-of-day batch jobs, we use monitoring to alert us when a job has not started by its usual start time. We have a stable schedule, and if a critical job doesn’t kick off by its target time, it may be an indication that a system problem is preventing it from running. These monitors are a good heads-up that the system may not be functioning the way it’s supposed to.

If you’re planning on monitoring for late jobs, beware of the following issues when setting up timing monitors.

Put a little play in your start job monitoring. If your target job usually starts running at 1:30 a.m., you may want to set your timing monitor to go off at 2:15 a.m. or 2:30 a.m. if the job hasn’t started by then. This is because there may be a valid reason the job didn’t kick off on time, and you don’t want to set off a false alarm. Give your system some time to right itself before you wake up a technician.
Be careful setting up timing monitors for jobs that usually run between 11 p.m. and midnight. A delay could push these jobs to run into the next day and that might set off a false alarm. IMHO, it’s best to stay away from start time monitoring for jobs that usually begin close to midnight.
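
Both cautions above can be sketched in Python as follows. The function names and the 45-minute default grace period are assumptions for illustration (45 minutes matches the 1:30 a.m. to 2:15 a.m. example). Using full date-and-time values rather than bare clock times is one way to keep a job that slips past midnight from confusing the comparison.

```python
from datetime import datetime, time, timedelta


def start_is_overdue(expected_start, now, has_started,
                     grace=timedelta(minutes=45)):
    """True when a job still has not started after the grace period.

    expected_start and now are datetime values, so a deadline that
    falls on the next calendar day is still compared correctly.
    """
    return (not has_started) and now >= expected_start + grace


def too_close_to_midnight(expected_start, cutoff=time(23, 0)):
    """True for jobs that usually start at or after 11 p.m."""
    return expected_start.time() >= cutoff
```

For a job expected at 1:30 a.m., `start_is_overdue` stays quiet at 2:00 a.m. and fires at 2:15 a.m. if the job has still not started. A setup script could use `too_close_to_midnight` to skip start-time monitors for late-evening jobs.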

Situation #4: Critical lines, controllers, or devices that aren’t active.

If your monitoring software allows it, monitor for critical lines, controllers, and devices (such as printers) being off-line. You can easily do this by monitoring for when these objects are varied off, not active, or in a recovery pending state. This can give you an early warning that a critical system resource, such as a shipping label printer or an Ethernet line, is not available.
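
A status check of this kind might look like the Python sketch below. The status labels are illustrative placeholders, not necessarily the exact text your system or monitoring product reports, and the object names and input format are hypothetical.

```python
# Illustrative status labels only; substitute the exact text that your
# monitoring product or system reports.
PROBLEM_STATUSES = {"VARIED OFF", "NOT ACTIVE", "RECOVERY PENDING"}


def unavailable_devices(statuses, critical):
    """Return {name: status} for critical objects that are unavailable.

    statuses: dict mapping object name to its current status string
    critical: iterable of object names that must stay available
    """
    problems = {}
    for name in critical:
        # An object missing from the status data is also reported.
        status = statuses.get(name, "MISSING")
        if status in PROBLEM_STATUSES or status == "MISSING":
            problems[name] = status
    return problems
```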

Situation #5: IP interfaces not active.

Most monitoring packages have an option to ping an IP address and send out an alert if that address doesn’t answer. You can use this to test whether your partitions’ IP addresses are working, whether companion servers are on the network, or whether other IBM i partitions are active. Be careful with these monitors, however, as you can sometimes get a false alarm on a ping test. You may want to fiddle with your ping monitor parameters, such as number of pings to send or the wait time for a return response, to cut down on the number of false alarms.
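
The idea of tuning the number of pings and the wait time can be sketched in Python as a retry loop that raises an alarm only when every attempt fails. Here `probe` is a hypothetical callable that performs one ping (however your environment does that) and returns True on a reply; the parameter names and defaults are assumptions.

```python
import time


def host_is_down(probe, attempts=3, wait_seconds=2.0, sleep=time.sleep):
    """Return True only if every probe attempt fails.

    probe: callable that sends one ping and returns True on a reply
    attempts: how many pings to try before declaring the host down
    wait_seconds: pause between failed attempts
    sleep: injectable for testing; defaults to time.sleep
    """
    for attempt in range(attempts):
        if probe():
            return False  # any single reply means the host is reachable
        if attempt < attempts - 1:
            sleep(wait_seconds)
    return True
```

Requiring several consecutive failures trades a slightly slower alert for fewer false alarms from a single dropped packet.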

Situation #6: Interactive users using a large amount of CPU.

Interactive CPU monitors can help you determine when an interactive job is experiencing a problem or when a user is doing something that he’s not supposed to (like running batch work in QINTER when he should be submitting it to batch). I like to set up a monitor to detect when an interactive job is using more than 25 percent of the available CPU. I find I get too many false alarms if I set the monitor to detect jobs that are using less than 25 percent CPU.
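
In Python, the 25 percent rule could be sketched as a simple filter. The dictionary keys and the function name are hypothetical, and the sketch assumes interactive jobs can be recognized by running in QINTER, which may not hold if your shop uses other interactive subsystems.

```python
def cpu_hogs(jobs, threshold_pct=25.0, subsystem="QINTER"):
    """Return names of interactive jobs using more than threshold_pct CPU.

    jobs: iterable of dicts with hypothetical keys
          "name", "subsystem", and "cpu_pct"
    """
    return [
        job["name"]
        for job in jobs
        if job["subsystem"] == subsystem and job["cpu_pct"] > threshold_pct
    ]
```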

Situation #7: Interactive response time spiking.

Trouble can sometimes be detected when interactive users start experiencing increased response time. This may indicate that a runaway job on your system is taking necessary resources away from other jobs. Monitoring for response time spikes can help alert you to a developing situation.
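
One possible way to define a "spike" is to compare the latest response time against a recent average, as in this Python sketch. The approach, the factor of three, and the minimum sample count are all assumptions chosen for illustration; your monitoring product may measure and flag response time differently.

```python
from statistics import mean


def is_spike(recent_samples, latest, factor=3.0, min_samples=5):
    """True when the latest response time far exceeds the recent average.

    recent_samples: earlier response times, in seconds
    latest: the newest response time, in seconds
    factor: how many times the average counts as a spike (assumed)
    min_samples: do not alert until enough history exists
    """
    if len(recent_samples) < min_samples:
        return False
    return latest > factor * mean(recent_samples)
```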

More Than Just Error Messages

There’s more to effective system monitoring than just looking for error messages. Once you get your monitoring software in place, be sure to start creating effective system monitoring that can detect developing problems, even when it doesn’t look like your system has an issue.

Follow Me On My Blog, On Twitter, And On LinkedIn

Check out my blog at joehertvik.com, where I focus on computer administration and news (especially IBM i); vendor, marketing, and tech writing news and materials; and whatever else I come across.

You can also follow me on Twitter @JoeHertvik and on LinkedIn.

Joe Hertvik is the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002.