Volume 9, Number 2 -- January 14, 2009

Admin Alert: Looking for i5/OS Trouble, Part II

Published: January 14, 2009

by Joe Hertvik

Last week, I discussed the best ways to automatically monitor iSeries, System i, and Power i systems for hidden signs of trouble, and I focused on monitoring the QSYSMSG and QSYSOPR message queues for developing problems. This week, I'm changing direction to discuss several specific i5/OS monitoring situations that can also help you detect system problems as they occur.

Trouble-Finding Tools

As I mentioned last week, the following are the best places to look for trouble in your OS/400, i5/OS, and i operating systems:

  • The QSYSMSG message queue
  • The System Operator (QSYSOPR) message queue
  • Disk drive statistics
  • Active job entries
  • Job queues and output queues

I covered QSYSMSG and QSYSOPR monitoring last time, so today I'll look at what to monitor for in the other areas on my list. IMHO, the best way to monitor for and be alerted to problem situations on your partition is to use an automated system monitoring tool, such as Bytware MessengerConsole, Help/Systems' Robot/ALERT, or CCSS' QSystem Monitor package. Each of these products can look for system problems as they occur and let you know via pages, emails, text messages, etc., when something is going awry. They also allow you to set up custom monitors that are specific to your own environment. While it's true that you can build your own monitoring system without purchasing a package, I passionately believe that it's generally cheaper in the long run and more effective to build a monitoring platform based on the established products listed here.

When I'm using one of these tools, these are the items I am most likely to monitor for.

Sudden Changes In Disk Storage Usage

In addition to looking for the serious storage overflow conditions that I talked about last time, you can also use a monitoring package to detect more subtle shifts in storage usage. On my systems, I usually set up a monitor to alert the staff when there are spikes in disk storage usage.

Disk utilization spike detection (such as when ASP disk utilization goes up 5 percent in an hour) is valuable because it can indicate unusual activity on the system, which may require someone to investigate. The disk space may be going up because a program is stuck in a loop and is chewing up drive space. Or you could have a job gone wild that's spitting out thousands of joblog entries. Either way, early detection of large changes in drive usage can be valuable in locating developing problems.
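
As a rough illustration of the spike check described above, the following Python sketch compares ASP utilization samples taken within a one-hour window. The sample format, the function name find_spikes, and the reading of "5 percent" as five percentage points are all assumptions made for this example; a commercial monitoring package would handle the collection and comparison internally.

```python
from datetime import datetime, timedelta

def find_spikes(samples, threshold_pct=5.0, window=timedelta(hours=1)):
    """Return (start_time, end_time, increase) tuples where ASP utilization
    rose by at least threshold_pct percentage points within the window.

    samples: list of (datetime, percent_used) pairs, sorted by time.
    (Hypothetical format; collect the data however your tooling allows.)
    """
    spikes = []
    for i, (end_time, end_pct) in enumerate(samples):
        # Compare this sample against every earlier sample inside the window.
        for start_time, start_pct in samples[:i]:
            in_window = end_time - start_time <= window
            if in_window and end_pct - start_pct >= threshold_pct:
                spikes.append((start_time, end_time, round(end_pct - start_pct, 2)))
                break  # one alert per sample is enough
    return spikes

# Made-up readings for illustration only.
readings = [
    (datetime(2009, 1, 14, 9, 0), 71.2),
    (datetime(2009, 1, 14, 9, 30), 72.0),
    (datetime(2009, 1, 14, 10, 0), 77.5),
]
for start, end, rise in find_spikes(readings):
    print(f"ASP usage rose {rise} points between {start:%H:%M} and {end:%H:%M}")
```

The threshold and window are parameters because the right values depend on how quickly storage normally grows in your shop.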

Where's the Damage?

It's also helpful to set up your monitoring tool to look for damaged objects as they occur. A damaged object can not only corrupt other data, it can also stop objects from being backed up. Damaged objects may also prevent an object from being correctly replicated to a Capacity BackUp (CBU) system.

Whenever the system detects a damaged object, it sends a system message out to the QSYSOPR message queue with a message ID in the range that starts at CPF81xx and runs through CPF8299. If you set up your monitoring system to alert your staff when one of these messages enters the QSYSOPR message queue, you will immediately know when a damaged object has been detected and you can take measures to find and fix the object.
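
To show what such a filter might look like, here is a minimal Python sketch that picks damage-related message IDs out of a list of messages. It assumes the range can be read as "any ID that starts with CPF81 or CPF82", and it assumes the messages have already been retrieved from QSYSOPR as (ID, text) pairs by some other means. The function name and the sample message text are invented for the example.

```python
import re

# Assumption: the damage-related range is treated as any message ID that
# starts with CPF81 or CPF82; the last two characters may be hex digits.
DAMAGE_ID = re.compile(r"^CPF8[12][0-9A-F]{2}$")

def damaged_object_messages(messages):
    """Return the messages whose ID falls in the assumed damage range.

    messages: iterable of (message_id, text) pairs already read from
    QSYSOPR (hypothetical input format).
    """
    return [(msg_id, text) for msg_id, text in messages if DAMAGE_ID.match(msg_id)]

# Placeholder data; the text fields are not real system message text.
sample = [
    ("CPF8198", "placeholder text"),
    ("CPF0907", "placeholder text"),
    ("CPF8299", "placeholder text"),
]
for msg_id, text in damaged_object_messages(sample):
    print(f"ALERT {msg_id}: {text}")
```

If your monitoring package lets you specify a generic message ID or a range, that built-in filter is the better choice; the pattern above only illustrates the idea.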

Jobs and Subsystems That Must Always Be Running

For a typical environment, certain subsystems and jobs must always be running or severe problems can occur with your data. Examples might include the QINTER subsystem, as well as server-based subsystems and jobs that provide critical round-the-clock functions (such as validating credit cards). If you have an automated monitoring program, I recommend that you find all your critical jobs and subsystems and set up monitors to alert the staff when any of them is not running. By setting up these monitors, your staff will immediately know when a critical application job has stopped working.
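
The core of this kind of monitor is a comparison between a list of names that must be active and a list of names that are active. The Python sketch below shows that comparison only; how the active list is gathered is left out, and the subsystem name CARDAUTH is made up for the example.

```python
def missing_critical(required, active):
    """Return the required names that are absent from the active list, sorted."""
    return sorted(set(required) - set(active))

# CARDAUTH is a hypothetical subsystem name used for illustration.
required_subsystems = ["QINTER", "CARDAUTH"]
active_subsystems = ["QINTER", "QBATCH", "QSYSWRK"]

for name in missing_critical(required_subsystems, active_subsystems):
    print(f"ALERT: critical subsystem {name} is not running")
```

The same function works for job names; keeping the required list in one place makes it easy to review when applications are added or retired.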

Problems with Interactive Jobs

When monitoring interactive jobs, I usually set up an automated monitor to detect the following two conditions.

1. An interactive job that is using more than 30 percent of the available CPU--This could be indicative of an interactive job that has disconnected from the system (as sometimes happens with scanners and other devices that connect to the operating system through wireless connections). This may also indicate an interactive job that has problems or that is running in a loop.

2. Interactive jobs running at a run priority value lower than 20--Because a lower number means a higher priority, this monitor might detect a case where a user has changed his run priority to a smaller number in order to process his job faster.
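
The two interactive checks above can be sketched in Python as follows. The Job record, the "INT" and "BCH" type codes, and the sample job names are assumptions for illustration; the 30 percent and priority 20 thresholds come from the text.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    job_type: str      # "INT" for interactive, "BCH" for batch (assumed codes)
    cpu_pct: float     # share of available CPU the job is using
    run_priority: int  # lower number means higher priority

def interactive_alerts(jobs, cpu_limit=30.0, normal_priority=20):
    """Return alert strings for interactive jobs that break either rule."""
    alerts = []
    for job in jobs:
        if job.job_type != "INT":
            continue  # batch jobs are handled by a separate monitor
        if job.cpu_pct > cpu_limit:
            alerts.append(f"{job.name}: using {job.cpu_pct}% of CPU")
        if job.run_priority < normal_priority:
            alerts.append(f"{job.name}: run priority {job.run_priority}")
    return alerts

# Made-up jobs for illustration.
jobs = [
    Job("QPADEV0001", "INT", 42.0, 20),
    Job("QPADEV0002", "INT", 1.5, 10),
    Job("NIGHTLY", "BCH", 55.0, 50),
]
for line in interactive_alerts(jobs):
    print("ALERT", line)
```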

Problems with Batch Jobs

For batch jobs, the following situations might indicate problems with the system and should be monitored for.

1. Batch jobs running at priority 20 or less--This may indicate a problem in batch submission, scheduling, or even a user who doesn't appreciate why batch jobs should never run at interactive priorities.

2. Long-running batch jobs--Depending on how long it takes critical processes to run in your shop, you may want to set up monitors that alert you when a batch job runs longer than a set amount of time. In my shop, one of our monitors alerts staff when application-oriented batch jobs run longer than one hour. Long-running batch jobs can indicate a programming issue, an inquiry job that is processing every record in an absurdly large file, or a poorly written query running out of control.

When monitoring for long-running batch jobs, however, you have to be careful to monitor only application-oriented batch jobs that are expected to finish, and not server jobs that run almost indefinitely. Application-oriented jobs usually run in batch subsystems, such as QBATCH or QPGMR, and complete in a relatively short amount of time. Server-oriented jobs usually run in their own subsystem or a system subsystem, such as QSYSWRK, and they usually remain active as long as the machine is running. To effectively set up this type of monitor, you have to exclude server jobs from examination.
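
Here is a minimal Python sketch of the two batch checks, including the exclusion of server jobs. It assumes that an allow-list of application subsystems (QBATCH and QPGMR, as named above) is an acceptable way to leave server jobs out, and for simplicity it applies both checks only to jobs in those subsystems. The BatchJob record and the sample job names are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class BatchJob:
    name: str
    subsystem: str
    run_priority: int
    started: datetime

# Only jobs in these subsystems are examined; everything else is
# treated as a server job and skipped (an assumption for this sketch).
APPLICATION_SUBSYSTEMS = {"QBATCH", "QPGMR"}

def batch_alerts(jobs, now, max_runtime=timedelta(hours=1), interactive_priority=20):
    """Return alert strings for application batch jobs that break either rule."""
    alerts = []
    for job in jobs:
        if job.subsystem not in APPLICATION_SUBSYSTEMS:
            continue
        if job.run_priority <= interactive_priority:
            alerts.append(f"{job.name}: batch job at priority {job.run_priority}")
        if now - job.started > max_runtime:
            alerts.append(f"{job.name}: running longer than {max_runtime}")
    return alerts

# Made-up jobs for illustration.
now = datetime(2009, 1, 14, 12, 0)
jobs = [
    BatchJob("MONTHEND", "QBATCH", 50, datetime(2009, 1, 14, 10, 15)),
    BatchJob("ADHOCQRY", "QPGMR", 20, datetime(2009, 1, 14, 11, 50)),
    BatchJob("SERVERJOB", "QSYSWRK", 20, datetime(2009, 1, 1, 0, 0)),
]
for line in batch_alerts(jobs, now):
    print("ALERT", line)
```

An allow-list is used instead of an exclusion list so that a new server subsystem does not start generating false alerts until someone decides it belongs in the monitor.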

Backed Up Job Queues

Many batch job queues are single-threaded, meaning that the job queue's associated subsystem will only accept one job at a time for processing from that queue. You can sometimes detect long-running jobs or jobs with error messages by looking at the number of jobs waiting to be run in a single-threaded job queue. If the job queue has an unusually high number of jobs waiting to run, it may indicate that a long-running job is currently running from that job queue or that the current job is stuck waiting for a reply to an error message. Like the long-running job monitors, a judiciously used job queue monitor can also spotlight potential trouble.
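
A job queue monitor boils down to comparing the number of waiting jobs against a limit that is normal for that queue. The Python sketch below assumes the waiting counts have already been collected into a dictionary; the queue names and limits are placeholders.

```python
def backed_up_queues(queue_depths, limits, default_limit=10):
    """Return {queue: waiting_jobs} for queues over their limit.

    queue_depths: dict of job queue name -> number of jobs waiting to run.
    limits: dict of job queue name -> highest count considered normal.
    Queues without an entry in limits use default_limit.
    """
    return {
        queue: waiting
        for queue, waiting in queue_depths.items()
        if waiting > limits.get(queue, default_limit)
    }

# Placeholder counts and limits for illustration.
depths = {"QBATCH": 14, "NIGHTJOBQ": 3, "QPGMR": 11}
limits = {"QBATCH": 5, "NIGHTJOBQ": 5}

for queue, waiting in backed_up_queues(depths, limits).items():
    print(f"ALERT: {waiting} jobs waiting in job queue {queue}")
```

Per-queue limits matter because a count that is alarming for a single-threaded queue may be routine for a queue that is loaded in bulk by a scheduler.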

Timing Issues

Depending on how you process work, you may have set up one or more job streams that always start and end at certain times of the day. A good example is a backup job stream where the backup generally starts at midnight and can usually be counted on to complete at 2:00 a.m. You could set up a monitor to send an alert if the backup job is still running at 2:30 a.m., which may indicate a problem with the backup. These types of monitors require a lot of analysis and awareness of how your job streams run, but if the monitor is set up correctly, it can show you where a problem is occurring before it turns into a full-blown crisis.
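
The backup example can be expressed as a simple deadline check, sketched here in Python. The job name NIGHTLYBKP is hypothetical, and the sketch assumes the job stream starts and is expected to end on the same calendar day, so a plain time-of-day comparison is enough.

```python
from datetime import datetime, time

def job_overdue(active_job_names, job_name, deadline, now):
    """Return True when job_name is still active at or after the deadline.

    Assumes the job starts after midnight and should finish the same day.
    """
    return job_name in active_job_names and now.time() >= deadline

# NIGHTLYBKP is a made-up job name used for illustration.
active = {"NIGHTLYBKP", "QINTER"}
check_time = datetime(2009, 1, 14, 2, 35)

if job_overdue(active, "NIGHTLYBKP", time(2, 30), check_time):
    print("ALERT: NIGHTLYBKP is still running past 02:30")
```

A job stream that crosses midnight would need a check based on its start time plus an allowed duration instead.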

Excessive Numbers of Spooled Files In An Output Queue

Similar to a backed up job queue, an output queue with too many spooled files waiting to be printed may indicate a printer problem. Many packages offer output queue monitors where an alert can be sent out if there are a high number of released spooled files waiting to print from a queue. This situation may indicate a printer that is out of paper, not running, or waiting for someone to answer a printer message.
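
As a small illustration, this Python sketch counts only the released spooled files in each output queue and reports queues that exceed a limit. It assumes the spooled file list is available as (output queue, status) pairs and that a status of "RDY" marks a file that is released and waiting to print; the queue names and the limit are placeholders.

```python
from collections import Counter

def clogged_output_queues(spooled_files, limit=50):
    """Return {output_queue: ready_count} for queues with more than
    limit released spooled files waiting to print.

    spooled_files: iterable of (output_queue, status) pairs.
    Assumption: status "RDY" means released and waiting to print.
    """
    waiting = Counter(queue for queue, status in spooled_files if status == "RDY")
    return {queue: count for queue, count in waiting.items() if count > limit}

# Placeholder spooled file list for illustration.
files = [("PRT01", "RDY")] * 3 + [("PRT01", "HLD")] * 5 + [("PRT02", "RDY")]

for queue, count in clogged_output_queues(files, limit=2).items():
    print(f"ALERT: {count} spooled files waiting in output queue {queue}")
```

Held files are left out of the count on purpose, since a held file is not evidence that the printer has stopped.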

Places to Start

During the last two weeks, I've offered up a number of situations where you can monitor for real and potential problems on your iSeries, System i, and Power i machines. These ideas are meant as suggestions for items that you may want to monitor for on your system. System monitoring isn't a static process that you set up once and never touch again. Rather, monitors need tending in order to weed out unreliable alerts and to add newer items that should be monitored. The key is to get started with a few good ideas and then adjust your monitors to provide the maximum benefit to your system.


RELATED STORIES

Looking for i5/OS Trouble, Part I

When Batch Meets Interactive



Senior Technical Editor: Ted Holt
Technical Editor: Joe Hertvik
Contributing Technical Editors: Edwin Earley, Brian Kelly, Michael Sansoterra
Publisher and Advertising Director: Jenny Thomas
Advertising Sales Representative: Kim Reed
Copyright © 1996-2009 Guild Companies, Inc. All Rights Reserved.
Guild Companies, Inc., 50 Park Terrace East, Suite 8F, New York, NY 10034
