Admin Alert: Basic i/OS Error Monitoring and Response, Part 2
Published: January 12, 2011
by Joe Hertvik
Last week, I began outlining a basic plan for monitoring and answering i/OS error messages, allowing you to resolve system and programming issues without having dedicated 24x7x365 personnel on site. That article focused on detection and notification of errors. This week, I'll extend the plan to cover three more areas: letting responders correct issues from anywhere, answering error messages automatically without human input, and following up after an error message is resolved.
The Basic Structure of i/OS Error Monitoring and Response
As outlined in Part 1, an i/OS error monitoring and response system contains the following components.
- Detection--Knowing when there's a problem.
- Notification--Alerting the on-call staff that something is wrong so they can mobilize the proper resources.
- Mobility--Ensuring that on-call staff can handle issues wherever they are located.
- Automatically answering error messages--Configuring the system to automatically handle problems without outside input, when possible.
- Resolution--Analyzing the issue to ensure that the problem doesn't occur again.
Having these components in place accomplishes three goals:
- Your organization becomes more responsive and quickly attends to the problem.
- You can maintain system productivity and availability.
- Your staff can still get a reasonable night's sleep.
Last time, I covered the detection and notification pieces. In this issue, I'll explain how mobility, automatically answering messages, and resolution fill in the missing pieces of the plan.
A basic i/OS monitoring and response plan involves having scheduled responders on duty who can attend to system issues and programming problems as they occur. But since you're trying to avoid dedicated 24x7x365 resources, you have to rely on various IT personnel (somewhat) voluntarily monitoring the system wherever they are. Once the call comes in, they need to quickly diagnose and respond to the issue or contact a Subject Matter Expert (SME) who can get the system working correctly again.
The only problem is that they can't do this if they don't have the proper tools. It doesn't do your shop any good if a responder gets the call and they aren't able to connect to the system to solve the issue. (Remember, system issues often happen during off-hours when people are usually living their lives.)
To help off-site monitoring resources watch the system WHILE they are living their lives, every off-site monitoring resource on duty should have one or more of the following tools available to get the job done.
- Laptop computer with a wireless card and all their work software--If you're expecting people to monitor the system in off-hours, they need a laptop rather than a desktop computer. They should also be able to contact your network through a VPN from their home. The laptop should be their work computer with all their work software on it so in the event of an emergency, they have all the necessary tools to diagnose and fix an issue. It should also be understood that if your organization provides them with a laptop, they have a responsibility to take it home with them every night, so they have the proper desktop environment to resolve issues remotely.
- An air card for their laptop--While most laptops now have wireless radios, responders should also have access to a cellular air card with Internet access so they can take their laptop with them and contact the network when they are away from home. This enables the responder to roam at will and still correct an issue from any location. To cut down on the expense, you can purchase only as many air cards as there are responders on duty each night. If your geography allows it, each responder can hand off the air card to the next night's responder, reducing the number of air cards your organization has to purchase.
- Cell phone 5250 emulation--For quick fixes, the responder should be able to access their iSeries, System i, and Power i partitions from their cell phones. This will allow them to solve many issues on the go, when they may not have access to their laptop computer. There are dozens of 5250 emulators for cell phones, many of which are freeware that your remote responders can use. To start looking for an emulator, type "5250 emulator cell phone" into Google search and you'll get dozens of links to emulators that are available for immediate download.
While it's not pleasant to be on call, having the right tools handy will give your responders more freedom to carry on as they normally would without having to worry about how to access the system if a problem occurs.
Automatically Answering Important Messages While Ignoring Annoying Ones
To cut down on manual intervention when a problem occurs, there are a few techniques for automatically answering i/OS error messages. For many inquiry messages, the system can enter a response on its own, which may eliminate the need for an external responder to sign on and handle the message. Here are some techniques for handling error messages without manual intervention.
- Printer error messages--Most printer inquiry messages deal with items such as loading paper, specifying form types, etc., and they can easily be ignored in your off-hours monitoring software. If there are other printer errors that you may want to automatically answer, you can add entries to your operating system's System Reply List that will feed in the required response. For an example of how to do this for printer messages, check out this article on auto-answering printer load form messages.
- File and record allocation errors where a program is trying to allocate a record/file that is in use by someone else--In these cases, you may be able to set up your system to automatically send a set number of retry replies ('R') to an allocation error message before the system calls out for help. We recently set up automatic retries for allocation errors in our shop using Bytware's MessengerConsole, and you'd be surprised how well this can take care of allocation issues without intervention.
- Setting up automatically scheduled jobs to complete processing when an error occurs--In our shop, we use Help/Systems' Robot/SCHEDULE software to automatically run scheduled jobs. The scheduler has a feature that allows you to specify that if certain processes in a job stream fail, the entire job stream can complete without sending out an i/OS error message. In situations where a job can safely be allowed to fail, you can set up specific off-hours batch jobs to run to completion without sending out an error message. When a failure occurs, an IT analyst can examine the job log the next morning to determine what happened.
- Making judicious use of the System Reply List entries--Reply list entries are an old and well-tested part of the operating system that can be used to automatically reply to an i/OS inquiry message. If you have recurring messages on off-hours jobs that you want to enter a specific reply for, you would configure an entry for that particular message ID in the System Reply List, configure your job to use the reply list to answer messages, and if the error message occurs, the operating system would automatically answer it with the designated reply. I'll explain how to use the System Reply List in a future article, but if you want to start working with it right away, call up the IBM i and System i Information Center and search for the System Reply List entry.
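To make the logic behind these techniques concrete, here is a minimal sketch in Python of the decision an auto-answering setup makes: look up the message ID in a table of canned replies, send the reply if one exists, and call out to a human once a retry limit is used up. This is only a model of the idea, not how the System Reply List or any monitoring product is actually implemented. The message IDs, the class name, and every reply value other than the 'R' retry mentioned above are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical message IDs, used only for illustration. Look up the real IDs
# your off-hours jobs raise before building rules around them.
PRINTER_LOAD_FORM = "PRT0001"
RECORD_IN_USE = "ALC0001"


@dataclass
class AutoResponder:
    # Message ID -> reply to send automatically (modeled on a reply list).
    reply_list: Dict[str, str]
    # Message ID -> maximum automatic replies per job; no entry means no limit.
    max_auto_replies: Dict[str, int] = field(default_factory=dict)
    # (job name, message ID) -> automatic replies already sent.
    sent: Dict[Tuple[str, str], int] = field(default_factory=dict, init=False)

    def handle(self, job: str, msg_id: str) -> Optional[str]:
        """Return the reply to send, or None when a person must be notified."""
        reply = self.reply_list.get(msg_id)
        if reply is None:
            return None  # No rule for this message: escalate to the responder.

        key = (job, msg_id)
        used = self.sent.get(key, 0)
        limit = self.max_auto_replies.get(msg_id)
        if limit is not None and used >= limit:
            return None  # Retry budget exhausted: escalate to the responder.

        self.sent[key] = used + 1
        return reply


# Example: retry an allocation error three times, then call out for help.
responder = AutoResponder(
    reply_list={PRINTER_LOAD_FORM: "G", RECORD_IN_USE: "R"},  # "G" is a placeholder
    max_auto_replies={RECORD_IN_USE: 3},
)
replies = [responder.handle("NIGHTLY01", RECORD_IN_USE) for _ in range(4)]
```

In this sketch, the first three entries in `replies` are the retry reply and the fourth is `None`, which is the point at which your notification process would page the on-call responder. The retry limit is tracked per job, so one stuck job does not use up the retries for another.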
After the Problem is Resolved
Once your off-hours responder has answered the message and resolved the situation, make sure they understand the importance of reporting the problem and its resolution to IT management. It's my opinion that every off-hours issue should be thoroughly analyzed to correct the underlying issue that caused the problem. It doesn't do any good to force a responder to get out of bed to answer a message if nothing is being done to prevent the message from occurring again.
This is where the political part of off-hours answering comes in. If possible, the people who have the power to solve off-hours issues should also be involved in answering the error messages during off-hours so they have a greater incentive for providing a permanent fix, rather than just relying on the responder to handle it while they sleep peacefully through the night. If you want the number of off-hours issues to decrease, you will have to draw attention to how frequently problems are occurring so that other staff members will take action to reduce them. It's the old squeaky wheel gets the oil routine. If the SME is required to attend to off-hours issues, they then have a responsibility to help you do everything possible to reduce the number of off-hours issues that occur.
Basic i/OS Error Monitoring and Response, Part 1
How to Auto-Answer Printer Load Form Messages