Admin Alert: Looking for i5/OS Trouble, Part I
January 7, 2009 Joe Hertvik
I love stories with Cassandra characters, the slightly crazed player who accurately foresees oncoming doom, only to be mocked or ignored. Like Cassandra, i5/OS systems can also see events that portend system problems, clear omens that can easily be missed or ignored. This week and next, I’ll discuss avoiding doom by monitoring several i5/OS situations that should be checked early and often. Ignore these warnings at your own peril!
Finding Doom On Your Local i Partition
Omen hunting on iSeries, System i, and Power i machines is a lot easier if you know where to look. I’ve generally found the following locations to be the best places to find trouble.
This information forms the raw material for finding developing (or developed) problems on your system. The trouble is that it’s an incredible hassle to manually monitor these areas for developing issues. That’s why I highly recommend using a system monitoring tool such as Bytware’s MessengerConsole, Help/Systems’ Robot/ALERT, or CCSS’ QSystem Monitor package. All of these tools can automate problem monitoring and immediately alert you via email, page, etc., when an issue occurs. You can also set up custom monitors to watch the system for situations that are specific to your own environment.
Now that we know which general areas to watch, let’s look at the specifics of what we should be looking for.
In the i5/OS, i, and OS/400 operating systems, IBM allows you to create a critical message queue called QSYSMSG. QSYSMSG does not come preconfigured with your operating system. It is an optional message queue that must be created in the QSYS library. According to IBM, you can create QSYSMSG by using the following Create Message Queue (CRTMSGQ) command.
CRTMSGQ MSGQ(QSYS/QSYSMSG) TEXT('Optional MSGQ to receive specific system messages')
Once QSYSMSG exists on your system, the operating system will automatically copy critical system messages directly to that queue for analysis. In general, only system messages are sent to QSYSMSG. More routine messages such as programming and printer alert messages are usually sent to the System Operator message queue, QSYSOPR.
You can deal with QSYSMSG messages in several different ways. First, you can set your monitoring software to alert you whenever any new critical messages are sent to QSYSMSG. When monitoring QSYSMSG, you may also want to create filters to exclude certain messages from triggering alerts. I created QSYSMSG on a development system and it started adding the following message each time a user disabled his sign on device by typing in the wrong password three times.
CPF1397 - Subsystem QINTER varied off work station device_name for user user_name.
Since I usually only look for urgent messages and we have other procedures for handling this situation, I set up my monitoring software to ignore this message. As you set up your QSYSMSG alert infrastructure, you may also choose to ignore less urgent messages.
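The filtering idea above can be sketched in a few lines of Python. This is only an illustration of the exclusion logic, not the configuration syntax of any monitoring product: the QueueMessage record and the should_alert and filter_alerts functions are hypothetical names, and the only message ID on the ignore list is the CPF1397 example from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Message IDs that should never trigger an alert.
# CPF1397 is the example from the article; add your own as needed.
IGNORED_MESSAGE_IDS: Set[str] = {"CPF1397"}


@dataclass(frozen=True)
class QueueMessage:
    """Hypothetical record for one message read from QSYSMSG."""
    message_id: str
    text: str


def should_alert(message: QueueMessage,
                 ignored_ids: Set[str] = IGNORED_MESSAGE_IDS) -> bool:
    """Alert on every QSYSMSG message unless its ID is on the ignore list."""
    return message.message_id not in ignored_ids


def filter_alerts(messages: Iterable[QueueMessage]) -> List[QueueMessage]:
    """Keep only the messages that should raise an alert."""
    return [m for m in messages if should_alert(m)]
```

Starting from "alert on everything" and subtracting known noise is the safer default for QSYSMSG, because a message ID you have never seen before still reaches you.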
The second way to monitor the queue is to set up a program to read new QSYSMSG messages and to automatically perform certain actions when a new message arrives. A third way is to monitor QSYSMSG in break mode on either the system console or on a designated machine that someone is always watching. You would do this by running the following Change Message Queue (CHGMSGQ) command on the workstation that you will be monitoring the queue from.
CHGMSGQ MSGQ(QSYSMSG) DLVRY(*BREAK)
Whenever a new message appears in the queue, it will pop up (break) on the display session where the CHGMSGQ command was run. The only downside to having QSYSMSG messages pop up in break mode is that this solution depends on someone always being near the breaking terminal when the message occurs. If the problem happens in off-hours or when no one is near the monitoring terminal, the message will be missed.
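For the second approach, a program that reads new QSYSMSG messages and acts on them, the core is a lookup from message ID to action. The Python sketch below shows only that dispatch step. On a real partition the reader would more likely be a CL or RPG program receiving messages from the queue, and every name here (handle_message, notify_on_call, log_only, actions_taken) is hypothetical.

```python
from typing import Callable, Dict, List

# Stand-in for real side effects (email, page, job log entry).
actions_taken: List[str] = []


def notify_on_call(message_id: str, text: str) -> None:
    """Default action: treat the message as urgent."""
    actions_taken.append(f"ALERT {message_id}: {text}")


def log_only(message_id: str, text: str) -> None:
    """Record the message without waking anyone up."""
    actions_taken.append(f"LOG {message_id}: {text}")


# Message IDs with a specific, non-default action.
# CPF1397 is the article's example of a less urgent QSYSMSG message.
HANDLERS: Dict[str, Callable[[str, str], None]] = {
    "CPF1397": log_only,
}


def handle_message(message_id: str, text: str) -> None:
    """Run the action registered for this ID, or alert if none is registered."""
    handler = HANDLERS.get(message_id, notify_on_call)
    handler(message_id, text)
```

Unlike break mode, a handler like this does not rely on someone sitting at a particular terminal, which is the main reason to consider it for off-hours coverage.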
The QSYSOPR message queue is trickier to monitor than QSYSMSG because it contains generic messages alongside any critical messages that may occur. This makes it tougher to either write a program to process new QSYSOPR messages or to put QSYSOPR in break mode, because you will get a lot of irrelevant messages along with the ones you need to know about. The other issue is that since many shops no longer have dedicated system operators, it’s not as feasible as it used to be to have someone watch QSYSOPR all day.
As with the QSYSMSG message queue, your best bet is to use a system monitoring tool to look for critical messages. My starting recommendations for automatically monitoring the QSYSOPR message queue to alert you to problem situations are the following:
1. Monitor for any inquiry messages that require someone to type in a response before the program will continue. Inquiry messages require an operator to enter a reply (such as ‘C’, ‘D’, ‘I’, or ‘R’) when a job needs specific information to keep processing. It goes without saying that these messages need to be attended to as soon as possible.
2. Filter out any inquiry messages that are associated with the QSPLJOB user. These messages are for items such as loading different form types on a printer or aligning a form. Form alignment and loading messages are fairly common and they are usually handled by the user who is working with the printer, not by staff that are monitoring for system problems.
3. Monitor for any QSYSOPR messages that have a severity code of 80 or above (excluding messages generated by the QSPLJOB user). Severity code 80 generally distinguishes messages that must be dealt with immediately or messages that signify that something is going wrong with the system.
4. Monitor for jobs that did not complete normally. Watch for any message that indicates a job did not complete successfully. Some of these messages include:
CPF1240 - Job &3/&2/&1 ended abnormally
CPC1234 - Job &3/&2/&1 ended from job queue by user &4
CPC1125 - Job &3/&2/&1 was ended by user &4
CPC1126 - Job &3/&2/&1 was ended by user &4
5. Monitor for serious storage conditions. Watch for the following storage error message:
CPF0907 - Serious storage condition may exist. Press HELP.
This message appears when the ASP storage threshold has been breached. The ASP threshold is a user-defined setting for each auxiliary storage pool (ASP). It indicates what percentage of ASP storage must be filled before you consider the ASP’s storage to be full. The CPF0907 message is sent after storage use in the ASP exceeds the threshold value. For more information on ASP threshold values, see my previous article on protecting your system from critical storage errors.
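The five QSYSOPR recommendations above combine into a single decision rule, sketched here in Python. It is an illustration of the logic only, not any monitoring tool's rule syntax. The OperatorMessage fields and the needs_attention function are hypothetical, and the sketch assumes your monitor can tell you each message's ID, severity, type, and sending user.

```python
from dataclasses import dataclass

# Message IDs named in recommendations 4 and 5.
WATCHED_IDS = {"CPF1240", "CPC1234", "CPC1125", "CPC1126", "CPF0907"}
SEVERITY_THRESHOLD = 80      # recommendation 3
PRINTER_USER = "QSPLJOB"     # recommendation 2


@dataclass(frozen=True)
class OperatorMessage:
    """Hypothetical record for one QSYSOPR message."""
    message_id: str
    severity: int
    is_inquiry: bool
    sending_user: str


def needs_attention(msg: OperatorMessage) -> bool:
    """Return True if the message matches any of the five recommendations."""
    # Recommendations 4 and 5: abnormal job ends and storage warnings.
    if msg.message_id in WATCHED_IDS:
        return True
    # Recommendation 2 (and the exclusion in 3): skip printer messages.
    if msg.sending_user == PRINTER_USER:
        return False
    # Recommendation 1: any inquiry message; 3: severity 80 or above.
    return msg.is_inquiry or msg.severity >= SEVERITY_THRESHOLD
```

The watched IDs are checked before the QSPLJOB exclusion so that a named problem message is never suppressed. That ordering is a design choice in this sketch, not something the recommendations themselves prescribe.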
There are other QSYSOPR messages that you can monitor for, but these particular situations will cover many of the general problems that occur in most shops and get you started on automated monitoring. As you become more comfortable with the tools and the processes involved, you can add more monitors to your system. I’ll also review a few other messages to monitor for next issue.
Coming Next Issue
Besides monitoring QSYSMSG and QSYSOPR, next week I’ll look at some other critical areas to monitor for on an i5/OS partition, including:
See you in seven.