Admin Alert: Basic i/OS Error Monitoring and Response, Part 1
January 5, 2011 Joe Hertvik
I once worked for an insurance company that had a 24x7x365 operations staff. When iSeries error messages occurred, the operations staff handled most problem resolution, and other staff members were rarely alerted. Sadly, that world no longer exists. It is now left to i/OS administrators to detect system errors on a 24x7 basis and to react to them quickly so that production isn’t delayed.
Toward that goal, this week and next, I will describe a basic game plan for managing i/OS error messages so that all system and programming issues are quickly handled without dedicated 24x7x365 coverage on site. While most shops may already have pieces of this plan in place, it’s a good idea to periodically review the basic setup and goals of monitoring to ensure you are not missing anything. This plan is also helpful if your organization is cutting down on its iSeries, System i, or Power i staff, as the remaining staff can use these articles as a primer for creating their own 24x7x365 error monitoring and response plan.
The Basic Structure of i/OS Error Monitoring and Response
An i/OS error monitoring and response system contains the following components:
Once these components are in place, you’ll have a robust system that doesn’t require someone to constantly be watching the system. Your monitoring goals are threefold:
Let’s look at each component in turn.
Detection: The Root of the System
You can’t fix a problem if you don’t know it exists. The first question in putting together an automatic error detection and response system is: What do you want to detect?
There are literally thousands of situations that you may need to handle to keep your system going. As a baseline, you should at least be monitoring for the following inquiry and error messages in each production system’s operator message queue (QSYSOPR).
• RPG program messages that start with the literal RPG*. RPG messages are defined in the QRPGMSGE message file in the QSYS library.
• Allocation and I/O errors that start with the RNQ* or RNX* literals, as defined in the QRNXMSG message file.
• The following general purpose error messages as defined in the QCPFMSG file:
The text for these messages can be viewed in the QCPFMSG message file by entering the following Work with Message Description (WRKMSGD) command from a 5250 green screen.
• Any QSYSOPR error message that has a severity code of 81 or above. These indicate critical errors on the system.
• Make sure that your monitoring system ignores any severity 81 or higher error messages that come from the QSPLJOB user. QSPLJOB user error messages usually indicate printer messages, including printer out of paper, check printer alignment, change form type, etc. During off-hours, you want your monitoring system to check only for critical messages; you will quickly become inundated with messages if you send out alerts for standard printer messages.
• Any application specific critical messages that may be issued from your third-party software.
Other messages can be monitored as needed, but this basic list will take care of a number of issues, including system issues such as hardware errors.
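The baseline rules above can be expressed as a single filter. What follows is a minimal Python sketch, not a feature of i/OS or of any monitoring product. The OperatorMessage record, the should_alert function, and the sample message IDs are hypothetical, and the sketch assumes some other tool has already extracted each QSYSOPR message's ID, severity, and sending user.

```python
from dataclasses import dataclass

# Prefixes from the baseline list: RPG* program messages and RNQ*/RNX* messages.
WATCHED_PREFIXES = ("RPG", "RNQ", "RNX")

# Placeholders (assumptions): fill these with the QCPFMSG IDs and the
# third-party application IDs that matter in your own shop.
WATCHED_CPF_IDS = frozenset()
APP_CRITICAL_IDS = frozenset({"APP9001"})  # hypothetical application message ID

CRITICAL_SEVERITY = 81                     # "severity 81 or above" rule
IGNORED_USERS = frozenset({"QSPLJOB"})     # printer-related messages


@dataclass(frozen=True)
class OperatorMessage:
    """One message as read from QSYSOPR by some external tool (assumed)."""
    msg_id: str
    severity: int
    user: str
    text: str = ""


def should_alert(msg: OperatorMessage) -> bool:
    """Return True if the message matches one of the baseline rules."""
    msg_id = msg.msg_id.upper()

    # Rule 1: RPG, RNQ, and RNX messages are always of interest.
    if msg_id.startswith(WATCHED_PREFIXES):
        return True

    # Rule 2: explicitly listed general purpose or application messages.
    if msg_id in WATCHED_CPF_IDS or msg_id in APP_CRITICAL_IDS:
        return True

    # Rule 3: any critical-severity message, unless it comes from QSPLJOB.
    if msg.severity >= CRITICAL_SEVERITY:
        return msg.user.upper() not in IGNORED_USERS

    return False
```

In this sketch the QSPLJOB exclusion applies only to the severity rule, so a message ID you have listed explicitly still raises an alert even when a spool job sends it. Whether that is the right behavior for your shop is a judgment call.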
The next step is selecting how your system is automatically going to monitor for messages after your staff has gone home for the night or weekend. For message monitoring, you generally have three choices for monitoring software that can be configured to look for these errors.
Notification: Who Gets the Message When?
In your message monitoring system, you will generally be sending out alerts to mobile device users via text messaging when an error occurs. As mentioned above, it’s easy to send out a text message through email software, and the assumption here is that all your on-call resources are reachable through text messages on their mobile phones. However, there are a number of management issues you should consider as you configure your notification system. Among these issues are:
Do you have a list of defined responders who will be responsible for ensuring all issues are resolved? What compensation are you offering them for responding to off-hours system issues? What procedure will they follow when they receive an alert? What happens if the off-hours responder doesn’t receive the alert because of mobile device issues, such as a dead battery, out of cell phone range, etc.?
If you have more than one off-hours responder, do you have a published schedule for who’s on duty each night? Is your software set up to follow the schedule and only send to the on-call person or does everyone get the alerts regardless of whether they are on duty that night or not? How do you handle vacations or business travel when the scheduled off-hours responder may not be available?
If you’re requiring people to drop everything and answer a call, do they at least have the required company equipment to do the job? Are they aware that they will have to answer the call whenever they are on duty?
If the off-hours responder can’t resolve the issue, what escalation procedure should he follow and who should he call? Have you defined your subject matter experts (SMEs) who need to resolve certain types of issues, such as ERP system errors, hardware errors, Web site errors, etc.? What happens if the designated SME isn’t available? A negotiated and published call tree can alleviate many of these issues.
In my experience, these issues can be just as tricky as the software configuration. You need to carefully define your call trees and ensure that everyone knows what needs to be done in case of a problem. Nobody likes to be on call during off-hours, so proceed carefully and make sure your responders are taken care of in some way, shape, or form for their trouble.
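To make the schedule and call tree questions concrete, here is a minimal Python sketch of an on-call lookup with vacation overrides and an escalation chain per issue type. Everything in it is an assumption made for illustration: the responder names, the weekly rotation, the override dates, and the issue categories are hypothetical, and a real monitoring package will have its own way of storing this information.

```python
from datetime import date

# Hypothetical weekly rotation: weekday number (Monday=0) -> responder.
WEEKLY_ROTATION = {
    0: "alice", 1: "alice",
    2: "bob", 3: "bob",
    4: "carol", 5: "carol", 6: "carol",
}

# Hypothetical date-specific overrides for vacations or business travel.
OVERRIDES = {date(2011, 1, 12): "carol"}

# Hypothetical call tree: issue category -> subject matter experts, in order.
CALL_TREE = {
    "erp": ["dave", "erin"],
    "hardware": ["frank"],
    "web": ["grace", "dave"],
}

# Last resort when nobody else in the chain can resolve the issue.
DEFAULT_ESCALATION = ["it_manager"]


def on_call_for(day):
    """Return the scheduled responder for a date, honoring overrides."""
    return OVERRIDES.get(day, WEEKLY_ROTATION[day.weekday()])


def escalation_chain(day, category, unavailable=frozenset()):
    """Return who to alert, in order, for an issue category on a given date.

    The on-call responder comes first, then the SMEs for the category,
    then the default escalation contact. People marked unavailable are
    skipped, and nobody is listed twice.
    """
    candidates = [on_call_for(day)]
    candidates += CALL_TREE.get(category, [])
    candidates += DEFAULT_ESCALATION

    chain = []
    for person in candidates:
        if person in unavailable or person in chain:
            continue
        chain.append(person)
    return chain


if __name__ == "__main__":
    # Example: an ERP error on a Wednesday night with one SME unavailable.
    print(escalation_chain(date(2011, 1, 5), "erp", unavailable={"dave"}))
```

Keeping the schedule, the overrides, and the call tree as plain data makes it easy to publish the same information to your responders that the software uses to route alerts.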
Still To Come
Once you have your detection and notification configuration and procedures set up, your basic monitoring system is in place. Next issue, we’ll discuss more advanced techniques for enhancing that system and freeing up responder time, including i/OS techniques for automatically answering error messages, ways to keep your tech resources from being tethered to their computers on nights they are monitoring the system, and things you should do after the problem is resolved.