Volume 12, Number 24 -- October 3, 2012

Admin Alert: Seven Things You Should Be Monitoring On Your System

Published: October 3, 2012

by Joe Hertvik

Last year, I wrote a two-part article outlining a basic plan for monitoring and answering IBM i error messages. But while it's important to detect and answer error messages that require a response right now, it's equally important to detect developing situations that will cause system problems if left alone. This week, I'll discuss seven other things besides error messages that you should be monitoring for on your IBM i systems.

The Basics

For this article, let's assume you are already using a system monitoring product to send out pager, email, or text alerts whenever an error message shows up on your system. You can perform this monitoring by using one of the more common IBM i system monitoring products.

If you're new to system monitoring, check out part 1 and part 2 of my earlier articles on i/OS error monitoring and response strategies for a primer on setting up basic monitoring.

Once your monitoring system is set up, you'll want to go beyond basic error message monitoring and use your monitoring software to look for developing problems on your IBM i partitions. You'll want to find silent, non-obvious issues that can cause disruptions if not resolved.

To detect silent trouble in your system, here are seven of the most common situations that you should be monitoring for on your IBM i partition.

  1. Long-running batch jobs.
  2. Excessive number of jobs in job queues.
  3. Jobs that should be running, but aren't.
  4. Critical lines, controllers, or devices that aren't active.
  5. IP interfaces not active.
  6. Interactive users using a large amount of CPU.
  7. Interactive response time spiking.

I'll look at each situation in turn, and explain why you should be monitoring for them.

Situation #1: Long-running batch jobs.

It's a good idea to set up monitors to look for jobs that are running much longer than usual. In my shop, I set up my job performance monitors to notify a tech on duty when any of the following situations occur.

  • Jobs that are running more than 30 minutes. I use this sparingly, looking for specific critical jobs that usually finish in a short amount of time, rather than having it monitor all jobs on the system that have been running more than 30 minutes.
  • Jobs that are running more than four hours. There are some legitimate jobs, such as file reorganizations, that may trigger this monitor while still running within guidelines. But overall, if any batch job is running more than four hours, that job may be looping or running into another problem.
  • Jobs that are running more than eight hours. This is a definite red alert and should be investigated.

Long-running batch jobs by themselves aren't necessarily a problem. But they can represent an unusual situation and should be investigated to ensure that everything is running correctly on your system.
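To make the three tiers concrete, here is a minimal Python sketch of how a monitor might classify a running job by elapsed time. The 30-minute, four-hour, and eight-hour thresholds come from the list above. The function name, the severity labels, and the idea of passing in a set of critical job names are illustrative assumptions, not features of any particular monitoring product.

```python
from datetime import timedelta

# Thresholds from the article; the names and labels are hypothetical.
CRITICAL_JOB_LIMIT = timedelta(minutes=30)  # only for named critical jobs
ANY_JOB_LIMIT = timedelta(hours=4)
RED_ALERT_LIMIT = timedelta(hours=8)


def classify_runtime(job_name, elapsed, critical_jobs=frozenset()):
    """Return 'red', 'warn', or None for a running batch job."""
    if elapsed > RED_ALERT_LIMIT:
        return "red"
    if elapsed > ANY_JOB_LIMIT:
        return "warn"
    # The 30-minute check applies only to jobs explicitly listed as critical.
    if job_name in critical_jobs and elapsed > CRITICAL_JOB_LIMIT:
        return "warn"
    return None
```

Keeping the 30-minute check tied to an explicit list of job names mirrors the advice to use that monitor sparingly.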

Situation #2: Excessive number of jobs in a job queue.

If you have single-threaded job queues feeding work to a batch subsystem, excessive jobs waiting to run may be an indicator that a problem is occurring. I like to set my paging system to alert me when more than seven or eight jobs are lined up in a batch job queue waiting to run. It may indicate that the jobs are stuck behind a long-running batch job.
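As a rough sketch of that check in Python, assume your monitoring tool can hand you a mapping of job queue names to the number of jobs waiting in each. The function name, the mapping, and the queue names in the test data are hypothetical. The default limit of seven reflects the "seven or eight" guideline above.

```python
def queues_over_limit(queue_depths, limit=7):
    """Return the names of job queues with more than `limit` jobs waiting.

    `queue_depths` maps a job queue name to its count of waiting jobs.
    """
    return sorted(
        name for name, waiting in queue_depths.items() if waiting > limit
    )
```

Any queue this returns is worth a look for a long-running job at the front of the line.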

Situation #3: Jobs that should be running, but aren't.

For our nightly end-of-day batch jobs, we use monitoring to alert us when a job has not started by its usual start time. We have a stable schedule and if a critical job doesn't kick off by its target time, it may be an indication a system problem is preventing it from running. These monitors are a good heads-up that the system may not be functioning the way it's supposed to.

If you're planning on monitoring for late jobs, beware of the following issues when setting up timing monitors.

  1. Put a little play in your start job monitoring. If your target job usually starts running at 1:30 a.m., you may want to set your timing monitor to go off at 2:15 a.m. or 2:30 a.m. if the job hasn't started by then. This is because there may be a valid reason the job didn't kick off on time, and you don't want to set off a false alarm. Give your system some time to right itself before you wake up a technician.
  2. Be careful setting up timing monitors for jobs that usually run between 11 p.m. and midnight. A delay could push these jobs to run into the next day and that might set off a false alarm. IMHO, it's best to stay away from start time monitoring for jobs that usually begin close to midnight.
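Both cautions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any monitoring product works: the function name and parameters are hypothetical, the 45-minute default grace period comes from the 1:30 a.m. to 2:15 a.m. example above, and the check assumes it runs on the same calendar day as the job's usual start time.

```python
from datetime import datetime, time, timedelta


def should_alert_late_job(now, usual_start, has_started,
                          grace=timedelta(minutes=45)):
    """Return True if a job is overdue and the grace period has expired."""
    if has_started:
        return False
    # Jobs that usually start between 11 p.m. and midnight can slip into
    # the next day, so this sketch simply does not monitor them.
    if usual_start >= time(23, 0):
        return False
    deadline = datetime.combine(now.date(), usual_start) + grace
    return now >= deadline
```

With a usual start of 1:30 a.m., the alert fires at 2:15 a.m. or later, which gives the system some time to right itself first.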

Situation #4: Critical lines, controllers, or devices that aren't active.

If your monitoring software allows it, monitor for critical devices such as printers, controllers, or other devices being off-line. You can easily do this by monitoring for when these devices are varied off, not active, or in recovery pending state. This can give you an early warning that a critical system resource, such as a shipping label printer or an Ethernet line, is not available.

Situation #5: IP interfaces not active.

Most monitoring packages have an option to ping an IP address and send out an alert if that address doesn't answer. You can use this to test whether your partitions' IP addresses are working, whether companion servers are on the network, or whether other IBM i partitions are active. Be careful with these monitors, however, as you can sometimes get a false alarm on a ping test. You may want to fiddle with your ping monitor parameters, such as number of pings to send or the wait time for a return response, to cut down on the number of false alarms.
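One common way to cut down on ping false alarms is to declare an address down only when several probes in a row fail. The Python sketch below shows that idea. The `probe` argument is a hypothetical callable that sends one ping (with whatever wait time you choose) and returns True if the address answered. The default of three attempts is an assumption, not a recommendation from any vendor.

```python
def host_is_down(probe, attempts=3):
    """Report a host as down only if every one of `attempts` probes fails.

    `probe` is a zero-argument callable that returns True when the
    address answers. Probing stops at the first successful answer.
    """
    return not any(probe() for _ in range(attempts))
```

Raising `attempts` trades slower detection for fewer false alarms, which is the same trade-off you make when tuning the ping parameters in a monitoring package.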

Situation #6: Interactive users using a large amount of CPU.

Interactive CPU monitors can help you determine when an interactive job is experiencing a problem or when a user is doing something that he's not supposed to (like running batch work in QINTER when he should be submitting it to batch). I like to set up a monitor to detect when an interactive job is using more than 25 percent of the available CPU. I find I get too many false alarms if I set the monitor to detect jobs that are using less than 25 percent CPU.
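Here is a small Python sketch of that filter, assuming you can collect a list of (job name, user, CPU percent) tuples for the interactive jobs on the partition. The tuple layout, the function name, and the sample names in the test data are hypothetical. Only the 25 percent threshold comes from the text above.

```python
def interactive_cpu_hogs(jobs, threshold_pct=25.0):
    """Return (job, user, cpu_pct) tuples for jobs above the CPU threshold.

    `jobs` is an iterable of (job_name, user, cpu_pct) tuples describing
    interactive jobs. Results are sorted with the heaviest user first.
    """
    hogs = [job for job in jobs if job[2] > threshold_pct]
    return sorted(hogs, key=lambda job: job[2], reverse=True)
```

Including the user in each result tells the tech on duty who to call, not just which job to look at.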

Situation #7: Interactive response time spiking.

Trouble can sometimes be detected when interactive users start experiencing increased response time. This may indicate a situation where there's a runaway job on your system that's taking away necessary resources from other jobs in the system. Monitoring for response time spikes can help alert you to a developing situation.
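What counts as a spike depends on your shop, and monitoring products define it in their own ways. As one possible illustration, the Python sketch below compares the newest response time sample against the average of the samples just before it. The function name, the factor of two, and the ten-sample baseline are all assumptions chosen for the example.

```python
from statistics import mean


def is_spiking(samples, factor=2.0, baseline_size=10):
    """Return True if the newest sample exceeds `factor` times the baseline.

    `samples` is a list of response times in seconds, oldest first. The
    baseline is the mean of the `baseline_size` samples that come just
    before the newest one.
    """
    if len(samples) < baseline_size + 1:
        return False  # not enough history to judge
    baseline = mean(samples[-(baseline_size + 1):-1])
    return samples[-1] > factor * baseline
```

Comparing against recent history rather than a fixed number lets the same check work on partitions with very different normal response times.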

More Than Just Error Messages

There's more to effective system monitoring than just looking for error messages. Once you get your monitoring software in place, be sure to start creating monitors that can detect developing problems, even when it doesn't look like your system has an issue.

Follow Me On My Blog, On Twitter, And On LinkedIn

Check out my blog at joehertvik.com, where I focus on computer administration and news (especially IBM i); vendor, marketing, and tech writing news and materials; and whatever else I come across.

You can also follow me on Twitter @JoeHertvik and on LinkedIn.


Joe Hertvik is the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002.


RELATED STORIES

Basic i/OS Error Monitoring and Response, Part 2

Basic i/OS Error Monitoring and Response, Part 1





Senior Technical Editor: Ted Holt
Technical Editor: Joe Hertvik
Contributing Technical Editors: Edwin Earley, Brian Kelly, Michael Sansoterra
Publisher and Advertising Director: Jenny Thomas
Advertising Sales Representative: Kim Reed
Contact the Editors: To contact anyone on the IT Jungle Team
Go to our contacts page and send us a message.


Copyright © 1996-2012 Guild Companies, Inc. All Rights Reserved.
Guild Companies, Inc., 50 Park Terrace East, Suite 8F, New York, NY 10034
