Admin Alert: Seven Things You Should Be Monitoring On Your System
October 3, 2012 Joe Hertvik
Last year, I wrote a two-part article outlining a basic plan for monitoring and answering IBM i error messages. But while it’s important to detect and answer error messages that require a response right now, it’s equally important to detect developing situations that will cause system problems if left alone. This week, I’ll discuss seven other things besides error messages that you should be monitoring for on your IBM i systems.
For this article, let’s assume you are already using a system monitoring product to send out pager, email, or text alerts whenever an error message shows up on your system. You can perform this monitoring by using one of the more common IBM i system monitoring products, including:
Once your monitoring system is set up, you’ll want to go beyond basic error message monitoring and use your monitoring software to look for developing problems on your IBM i partitions. You’ll want to find silent, non-obvious issues that can cause disruptions if not resolved.
To detect silent trouble in your system, here are seven of the most common situations that you should be monitoring for on your IBM i partition.
I’ll look at each situation in turn, and explain why you should be monitoring for them.
Situation #1: Long running batch jobs.
It’s a good idea to set up monitors to look for jobs that are running much longer than usual. In my shop, I set up my job performance monitors to notify a tech on duty when any of the following situations occur.
Long-running batch jobs by themselves aren’t necessarily a problem. But they can represent an unusual situation and should be investigated to ensure that everything is running correctly on your system.
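One way to picture a "running much longer than usual" check is to compare each job's elapsed time against a baseline. The following is a minimal Python sketch, not the logic of any particular monitoring product. The `BatchJob` fields, the job names, and the 2x multiplier are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatchJob:
    name: str
    elapsed_minutes: float   # how long the job has been running so far
    typical_minutes: float   # assumed baseline, e.g. an average of past runs


def find_long_running(jobs, factor=2.0):
    """Return names of jobs whose elapsed time exceeds factor x their typical runtime."""
    return [j.name for j in jobs if j.elapsed_minutes > factor * j.typical_minutes]


# Hypothetical sample data
jobs = [
    BatchJob("NIGHTLY_POST", elapsed_minutes=95, typical_minutes=40),
    BatchJob("INVOICE_RUN", elapsed_minutes=30, typical_minutes=25),
]
print(find_long_running(jobs))
```

A multiplier rather than a fixed limit lets short and long jobs share one rule; the right value depends on how much your run times normally vary.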
Situation #2: Excessive number of jobs in a job queue.
If you have single-threaded job queues feeding work to a batch subsystem, an excessive number of jobs waiting to run may indicate that a problem is occurring. I like to set my paging system to alert me when more than seven or eight jobs are lined up in a batch job queue waiting to run. A backlog like that may mean the jobs are stuck behind a long-running batch job.
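The queue-depth rule above amounts to a simple threshold test. Here is a small Python sketch of it, assuming you already have a snapshot of how many jobs are waiting in each queue. The queue names and the `queues_over_limit` helper are hypothetical; how the counts are collected is left to your monitoring tool.

```python
def queues_over_limit(waiting_jobs, limit=7):
    """Return the job queues with more than `limit` jobs waiting.

    waiting_jobs: mapping of job queue name -> number of jobs waiting to run.
    """
    return sorted(q for q, n in waiting_jobs.items() if n > limit)


# Hypothetical snapshot of waiting-job counts
snapshot = {"QBATCH": 9, "NIGHTQ": 2, "REPORTQ": 7}
print(queues_over_limit(snapshot))
```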
Situation #3: Jobs that should be running, but aren’t.
For our nightly end-of-day batch jobs, we use monitoring to alert us when a job has not started by its usual start time. We have a stable schedule, and if a critical job doesn’t kick off by its target time, it may be an indication that a system problem is preventing it from running. These monitors are a good heads-up that the system may not be functioning the way it’s supposed to.
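A late-start check can be sketched as a comparison between the current time and each job's target start time. The Python below is only an illustration under stated assumptions: the job names, the schedule, and the ten-minute grace period are invented, and the set of already-started jobs would come from your own job log or monitoring tool.

```python
from datetime import datetime, timedelta


def late_jobs(expected_starts, started, now, grace=timedelta(minutes=10)):
    """Return jobs that have not started although their target time plus grace has passed.

    expected_starts: mapping of job name -> target start (datetime)
    started: set of job names that have already started
    """
    return sorted(
        job
        for job, target in expected_starts.items()
        if job not in started and now > target + grace
    )


# Hypothetical end-of-day schedule
schedule = {
    "EOD_CLOSE": datetime(2012, 10, 3, 22, 0),
    "EOD_BACKUP": datetime(2012, 10, 3, 23, 30),
}
now = datetime(2012, 10, 3, 22, 15)
print(late_jobs(schedule, started=set(), now=now))
```

Using full dates rather than bare clock times keeps the comparison correct for schedules that run past midnight.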
If you’re planning on monitoring for late jobs, beware of the following issues when setting up timing monitors.
Situation #4: Critical lines, controllers, or devices that aren’t active.
If your monitoring software allows it, monitor for critical lines, controllers, and devices (such as printers) that are off-line. You can easily do this by monitoring for when these objects are varied off, not active, or in a recovery pending state. This can give you an early warning that a critical system resource, such as a shipping label printer or an Ethernet line, is not available.
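In code terms, this kind of monitor is a filter over a list of object statuses. The Python sketch below assumes the statuses have already been collected into a dictionary. The device names and the exact status strings are illustrative assumptions, not values taken from any specific product.

```python
# Status strings here are assumed for illustration; use whatever your tool reports.
PROBLEM_STATES = {"VARIED OFF", "INACTIVE", "RECOVERY PENDING"}


def unavailable_devices(statuses, critical):
    """Return critical devices whose reported status is one of the problem states.

    statuses: mapping of device/line/controller name -> status string
    critical: names you care about; anything else is ignored
    """
    return sorted(
        name for name in critical
        if statuses.get(name, "").upper() in PROBLEM_STATES
    )


# Hypothetical status snapshot
statuses = {
    "SHIPLBL01": "VARIED OFF",
    "ETHLINE": "ACTIVE",
    "PRT02": "RECOVERY PENDING",
    "TESTPRT": "VARIED OFF",   # not on the critical list, so not reported
}
print(unavailable_devices(statuses, critical=["SHIPLBL01", "ETHLINE", "PRT02"]))
```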
Situation #5: IP interfaces not active.
Most monitoring packages have an option to ping an IP address and send out an alert if that address doesn’t answer. You can use this to test whether your partitions’ IP addresses are working, whether companion servers are on the network, or whether other IBM i partitions are active. Be careful with these monitors, however, as you can sometimes get a false alarm on a ping test. You may want to fiddle with your ping monitor parameters, such as number of pings to send or the wait time for a return response, to cut down on the number of false alarms.
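The false-alarm tuning described above boils down to two knobs: how many times to try and how long to wait for each answer. The Python sketch below shows one way to express that, alerting only when every attempt fails. It is an assumption-laden illustration: `address_is_down` and `tcp_probe` are hypothetical helpers, and `tcp_probe` uses a TCP connection as a stand-in for a real ICMP ping.

```python
import socket


def address_is_down(probe, attempts=3):
    """Return True only if every one of `attempts` probes fails.

    probe: zero-argument callable returning True when the address answered.
    """
    for _ in range(attempts):
        if probe():
            return False   # one answer is enough to call the address up
    return True


def tcp_probe(host, port, timeout=2.0):
    """Stand-in for a ping: True if a TCP connection opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Simulated probe results: two dropped replies, then an answer
replies = [False, False, True]
print(address_is_down(lambda: replies.pop(0), attempts=3))
```

Raising `attempts` or `timeout` trades slower detection for fewer false alarms, which is the balance you are adjusting when you change the ping monitor parameters.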
Situation #6: Interactive users using a large amount of CPU.
Interactive CPU monitors can help you determine when an interactive job is experiencing a problem or when a user is doing something that he’s not supposed to (like running batch work in QINTER when he should be submitting it to batch). I like to set up a monitor to detect when an interactive job is using more than 25 percent of the available CPU. I find I get too many false alarms if I set the threshold lower than 25 percent.
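Expressed as code, the 25 percent rule is a single comparison per job. This short Python sketch assumes you already have each interactive job's CPU percentage; the job names and figures are invented for the example.

```python
def cpu_hogs(interactive_jobs, threshold_pct=25.0):
    """Return interactive jobs using more than `threshold_pct` percent of CPU.

    interactive_jobs: mapping of job name -> CPU percentage currently in use.
    """
    return [name for name, pct in interactive_jobs.items() if pct > threshold_pct]


# Hypothetical CPU readings for jobs in QINTER
readings = {
    "QPADEV0001/ALICE": 3.2,
    "QPADEV0007/BOB": 41.5,
    "QPADEV0012/CAROL": 25.0,   # exactly at the threshold, so not reported
}
print(cpu_hogs(readings))
```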
Situation #7: Interactive response time spiking.
Trouble can sometimes be detected when interactive users start experiencing increased response time. This may indicate a situation where there’s a runaway job on your system that’s taking away necessary resources from other jobs in the system. Looking for these jobs can help alert you to a developing situation.
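One simple way to define a "spike" is to compare the latest response time with the average of recent samples. The Python sketch below takes that approach. The 3x multiplier, the sample values, and the `is_spike` helper are assumptions for illustration, and real monitoring products may define spikes differently.

```python
from statistics import mean


def is_spike(recent_seconds, current_seconds, factor=3.0):
    """Return True when the current response time exceeds factor x the recent average."""
    if not recent_seconds:
        return False   # no baseline yet, so nothing to compare against
    return current_seconds > factor * mean(recent_seconds)


# Hypothetical response-time samples, in seconds
recent = [0.4, 0.5, 0.6]
print(is_spike(recent, 2.0))
print(is_spike(recent, 1.0))
```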
More Than Just Error Messages
There’s more to effective system monitoring than just looking for error messages. Once you get your monitoring software in place, be sure to start creating effective system monitoring that can detect developing problems, even when it doesn’t look like your system has an issue.
Follow Me On My Blog, On Twitter, And On LinkedIn
Check out my blog at joehertvik.com, where I focus on computer administration and news (especially IBM i); vendor, marketing, and tech writing news and materials; and whatever else I come across.
Joe Hertvik is the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002.