Admin Alert: Basic i/OS Error Monitoring and Response, Part 1
Published: January 5, 2011
by Joe Hertvik
I once worked for an insurance company that had a 24x7x365 operations staff. When iSeries error messages occurred, the operations staff mostly handled problem resolution and other staff members were rarely alerted. Sadly, that world no longer exists, and it now falls to i/OS administrators to detect system errors around the clock and to react to them in a timely manner so that production isn't delayed.
Toward that goal, this week and next I will describe a basic game plan for managing i/OS error messages so that all system and programming issues are handled quickly without dedicated 24x7x365 coverage on site. While most shops may already have pieces of this plan in place, it's a good idea to periodically review the basic setup and goals of monitoring to ensure you are not missing anything. This plan is also helpful if your organization is cutting down on its iSeries, System i, or Power i staff, because the remaining staff can use these articles as a primer for creating their own 24x7x365 error monitoring and response plan.
The Basic Structure of i/OS Error Monitoring and Response
An i/OS error monitoring and response system contains the following components:
- Detection--Knowing when there's a problem
- Notification--Alerting the on-call staff that something is wrong so they can mobilize the proper resources
- Mobility--Ensuring that on-call staff can handle issues wherever they are
- Automatic Answering--Configuring the system to automatically handle problems without outside input, wherever possible
- Resolution--Analyzing the issue to ensure that the problem doesn't occur again
Once these components are in place, you'll have a robust system that doesn't require someone to constantly be watching the system. Your monitoring goals are threefold:
- You must be responsive and quickly attend to the problem
- You must maintain system production and availability
- You must still be able to get a good night's sleep every night
Let's look at each component in turn.
Detection: The Root of the System
You can't fix a problem if you don't know it exists. The first question in putting together an automatic error detection and response system is: What do you want to detect?
There are literally thousands of situations that you may need to handle to keep your system going. As a baseline, you should at least be monitoring for the following inquiry and error messages in each production system's operator message queue (QSYSOPR):
• RPG program messages that start with the literal RPG*. RPG messages are defined in the QRPGMSGE message file in the QSYS library.
• Allocation and I/O errors that start with the RNQ* or RNX* literals, as defined in the QRNXMSG message file.
• The following general purpose error messages as defined in the QCPFMSG file:
- CPA0701 - Error message in job stream
- CPA0702 - An error occurred in a procedure
- CPA4072 - File full
- CPA5305 - File member full
The text for these messages can be viewed by running the Work with Message Description (WRKMSGD) command against the QCPFMSG message file from a 5250 green screen.
• Any QSYSOPR error message that has a severity code of 81 or above. These indicate critical errors on the system.
• Make sure that your monitoring system ignores any severity 81 or higher error messages that come from the QSPLJOB user. QSPLJOB user error messages usually indicate printer messages, including printer out of paper, check printer alignment, change form type, etc. You want to set up your monitoring system to only check for critical messages during off-hours, and you will quickly become inundated with messages if you send out alerts for standard printer messages.
• Any application specific critical messages that may be issued from your third-party software.
Other messages can be monitored as needed, but this basic list will take care of a number of issues, including system issues such as hardware errors.
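To make the baseline concrete, here is a minimal sketch of the decision a monitor has to make for each QSYSOPR message. It is written in Python purely for illustration and is not taken from any of the products discussed in this article. The OperatorMessage record, its field names, and the should_alert function are assumptions; the message IDs, prefixes, severity threshold, and QSPLJOB exclusion come from the list above.

```python
from dataclasses import dataclass

# Prefixes and message IDs from the baseline list in this article.
ALERT_PREFIXES = ("RPG", "RNQ", "RNX")
ALERT_IDS = {"CPA0701", "CPA0702", "CPA4072", "CPA5305"}
CRITICAL_SEVERITY = 81                   # severity 81 or above is critical
IGNORED_CRITICAL_USERS = {"QSPLJOB"}     # printer chatter, not worth a page


@dataclass(frozen=True)
class OperatorMessage:
    """Hypothetical record for one QSYSOPR message (field names are assumptions)."""
    msg_id: str
    severity: int
    user: str
    text: str = ""


def should_alert(msg, extra_ids=frozenset()):
    """Return True if the message matches the baseline monitoring rules.

    extra_ids holds any application-specific message IDs you want to add.
    """
    msg_id = msg.msg_id.upper()
    if msg_id.startswith(ALERT_PREFIXES):
        return True
    if msg_id in ALERT_IDS or msg_id in extra_ids:
        return True
    if msg.severity >= CRITICAL_SEVERITY:
        # Critical messages page someone unless they come from QSPLJOB.
        return msg.user.upper() not in IGNORED_CRITICAL_USERS
    return False
```

In this sketch the QSPLJOB exclusion applies only to the severity rule, which is how I read the list above; adjust it if your shop also wants to suppress specific message IDs from that user.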
The next step is selecting how your system is automatically going to monitor for messages after your staff has gone home for the night or weekend. You generally have three choices for monitoring software that can be configured to look for these errors:
- A third-party package such as Help/Systems' Robot/ALERT, Bytware MessengerConsole, CCSS QSystem Monitor, Halcyon Software's IBM i Monitoring, Scheduling & Automation Software, or SEA's absMessage package. All of these packages provide the capability for monitoring many different types of i/OS messages and for alerting other devices (such as cell phones or email accounts) when an error is found.
- iSeries Navigator (OpsNav) provides message monitoring and notification services. See the IBM i and System i Information Center for more information on how to set up message monitoring inside OpsNav.
- Custom-written software that monitors the QSYSOPR message queue and sends out messages when it finds an error. You can roll your own solution by dumping all the current QSYSOPR inquiry messages into a printer spooled file, reading that file, and then using the Send Distribution (SNDDST) command to send email alerts out as text messages to on-call support cell phones. For an example of how to dump all your QSYSOPR inquiry messages to a spooled file, see this article on determining which locked object is holding up a job. For an example of how to use the SNDDST command to send emails to users when an error message occurs, check out this article on monitoring whether a specific subsystem is up. For information on sending out text messages as email messages from your i/OS partitions, see this article on configuring messaging software for overnight monitoring.
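For the roll-your-own option, the alert itself is just a short email addressed to a carrier's email-to-text gateway. The following sketch shows the shape of such a message using Python's standard email library rather than the SNDDST command described above, so treat it as an illustration of the idea and not as the CL technique itself. The build_text_alert function, the 160-character limit, and the example addresses are all assumptions; check your carrier's gateway address and length limits.

```python
from email.message import EmailMessage

SMS_LIMIT = 160  # assumed length of a single text message; carriers vary


def build_text_alert(system, msg_id, msg_text, to_addr, from_addr):
    """Build a short email suitable for an email-to-text gateway.

    The body is trimmed so the whole alert fits in one text message.
    """
    body = f"{system} {msg_id}: {msg_text}"
    if len(body) > SMS_LIMIT:
        body = body[:SMS_LIMIT - 3] + "..."
    alert = EmailMessage()
    alert["From"] = from_addr
    alert["To"] = to_addr
    alert["Subject"] = f"{system} alert {msg_id}"
    alert.set_content(body)
    return alert


# Example with placeholder values (hypothetical system name and addresses).
alert = build_text_alert(
    "PROD1", "CPA5305", "File member full",
    to_addr="5555550100@sms.example.com",
    from_addr="monitor@example.com",
)
```

Sending is left out on purpose. If you have an SMTP relay available, smtplib.SMTP(host).send_message(alert) is one way to hand the message off, where host is whatever relay your shop uses.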
Notification--Who Gets the Message When?
In your message monitoring system, you will generally be sending out alerts to mobile device users via text messaging when an error occurs. As mentioned above, it's easy to send out a text message through email software, and the assumption here is that all your on-call resources are reachable through text messages on their mobile phones. However, there are a number of management issues you should consider as you configure your notification system. Among these issues are:
- Who are your responders? Do you have a list of defined responders who will be responsible for ensuring all issues are resolved? What compensation are you offering them for responding to off-hours system issues? What procedure will they follow when they receive an alert? What happens if the off-hours responder doesn't receive the alert because of mobile device issues, such as a dead battery, out of cell phone range, etc.?
- Do you have a responder rotation? If you have more than one off-hours responder, do you have a published schedule for who's on duty each night? Is your software set up to follow the schedule and only send to the on-call person or does everyone get the alerts regardless of whether they are on duty that night or not? How do you handle vacations or business travel when the scheduled off-hours responder may not be available?
- Do the responders have mobile devices they can carry with them and is your company handling the cost of the mobile device? If you're requiring people to drop everything and answer a call, do they at least have the required company equipment to do the job? Are they aware that they will have to answer the call whenever they are on duty?
- Do you have a call tree? If the off-hours responder can't resolve the issue, what escalation procedure should he follow and who should he call? Have you defined your subject matter experts (SMEs) who need to resolve certain types of issues, such as ERP system errors, hardware errors, Web site errors, etc.? What happens if the designated SME isn't available? A negotiated and published call tree can alleviate many of these issues.
In my experience, these issues can be just as tricky as the software configuration. You need to carefully define your call trees and ensure that everyone knows what needs to be done in case of a problem. Nobody likes to be on-call during off-hours, so proceed carefully and make sure your responders are taken care of in some way, shape, or form for their trouble.
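The rotation and call tree questions above come down to two lookups: who is on duty tonight, and who gets called next if that person can't resolve the issue. Here is a minimal Python sketch of both. It assumes a simple nightly round-robin rotation, a dictionary of date overrides for vacations or travel, and a call tree keyed by issue type; the function names and data layout are hypothetical, and your own schedule may rotate weekly or follow some other pattern.

```python
from datetime import date


def on_call_for(day, rotation, start, overrides=None):
    """Return the responder on duty for `day`.

    rotation  - list of responder names, one per night, repeating
    start     - date on which rotation[0] is on duty
    overrides - optional {date: name} swaps for vacations or travel
    """
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]
    return rotation[(day - start).days % len(rotation)]


def escalation_order(day, rotation, start, call_tree, issue_type, overrides=None):
    """Return who to contact, in order: on-call responder first, then the SMEs."""
    order = [on_call_for(day, rotation, start, overrides)]
    for person in call_tree.get(issue_type, []):
        if person not in order:   # don't page the same person twice
            order.append(person)
    return order


# Example with made-up names and dates.
rotation = ["alice", "bob", "carol"]
start = date(2011, 1, 3)
call_tree = {"hardware": ["carol", "bob"], "erp": ["alice"]}
tonight = escalation_order(date(2011, 1, 4), rotation, start, call_tree, "hardware")
```

Whatever form the schedule takes, the point is that the monitoring software and the published call tree should agree on the answer.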
Still To Come
Once you have your detection and notification configuration and procedures set up, the core of your monitoring system is in place. Next issue, we'll discuss more advanced topics for enhancing that system and freeing up more responder time, including i/OS techniques for automatically answering error messages, freeing up your tech resources so that they don't have to be tethered to their computers on nights they are monitoring the system, and things you should do after the problem is resolved.
Related Stories
Determining Which Locked Object is Holding Up a Job
Configuring Messaging Software for Overnight Monitoring
Monitoring the Monitors