Admin Alert: Basic i/OS Error Monitoring and Response, Part 2
January 12, 2011 Joe Hertvik
Last week, I began outlining a basic plan for monitoring and answering i/OS error messages, allowing you to resolve system and programming issues without having dedicated 24x7x365 personnel on site. That article focused on detection and notification of errors. This week, I’ll extend the plan to cover three more areas: letting responders correct issues from anywhere, automatically answering error messages without human input, and following up after the error message is resolved.
The Basic Structure of i/OS Error Monitoring and Response
As outlined in Part 1, an i/OS error monitoring and response system contains the following components.
Having these components in place accomplishes three goals:
Last time, I covered the detection and notification pieces. In this issue, I’ll explain how mobility, automatic message answering, and resolution fill in the missing pieces of the plan.
A basic i/OS monitoring and response plan involves having scheduled responders on duty who can attend to system issues and programming problems as they occur. But since you’re trying to avoid dedicated 24x7x365 resources, you have to rely on various IT personnel (somewhat) voluntarily monitoring the system wherever they are. Once the call comes in, they need to quickly diagnose and respond to the issue or contact a Subject Matter Expert (SME) who can get the system working correctly again.
The only problem is that they can’t do this if they don’t have the proper tools. It doesn’t do your shop any good if a responder gets the call and they aren’t able to connect to the system to solve the issue. (Remember, system issues often happen during off-hours when people are usually living their lives.)
To help off-site monitoring resources watch the system WHILE they are living their lives, every off-site monitoring resource on duty should have one or more of the following tools available to get the job done.
While it’s not pleasant to be on call, having the right tools handy will give your responders more freedom to carry on as they normally would without having to worry about how to access the system if a problem occurs.
Automatically Answering Important Messages While Ignoring Annoying Ones
To cut down on manual intervention, there are a few techniques for automatically answering i/OS error messages when they occur. For many inquiry messages, the system can enter a response automatically, which may eliminate the need for an external responder to sign on and handle the message. Here are some techniques for handling error messages without manual intervention.
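To make the idea of automatic answering concrete, here is a minimal Python sketch of a reply lookup, similar in spirit to what the system reply list does. It is an illustration only, not i/OS code: the message IDs (APP0001, APP0002), the reply values, the "*ANY" wildcard, and the names ReplyRule and auto_reply are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReplyRule:
    """One automatic-answer rule (hypothetical structure for illustration)."""
    sequence: int      # lower numbers are checked first
    message_id: str    # exact ID, or "*ANY" to match every message
    reply: str         # reply to send, e.g. "I" (ignore) or "R" (retry)


# Made-up rules; real message IDs and valid replies depend on your system.
RULES = [
    ReplyRule(10, "APP0001", "I"),
    ReplyRule(20, "APP0002", "R"),
]


def auto_reply(message_id: str, rules: Sequence[ReplyRule]) -> Optional[str]:
    """Return the reply from the first matching rule, or None to escalate."""
    for rule in sorted(rules, key=lambda r: r.sequence):
        if rule.message_id in (message_id, "*ANY"):
            return rule.reply
    return None  # no rule matched: notify the on-call responder instead


if __name__ == "__main__":
    for msg in ("APP0001", "APP0002", "APP9999"):
        print(msg, "->", auto_reply(msg, RULES) or "escalate to responder")
```

The point of the sketch is the split it enforces: messages with a known, safe answer never reach a person, and anything without a rule falls through to the notification process instead of being answered blindly.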
After the Problem is Resolved
Once your off-hours responder has answered the message and resolved the situation, make sure they understand the importance of reporting the problem and its resolution to IT management. It’s my opinion that every off-hours issue should be thoroughly analyzed so that the underlying cause can be corrected. It doesn’t do any good to force a responder to get out of bed to answer a message if nothing is being done to prevent the message from occurring again.
This is where the political part of off-hours answering comes in. If possible, the people who have the power to solve off-hours issues should also be involved in answering the error messages during off-hours so they have a greater incentive to provide a permanent fix, rather than just relying on the responder to handle it while they sleep peacefully through the night. If you want the number of off-hours issues to decrease, you will have to draw attention to how frequently problems are occurring so that other staff members will take action to reduce them. It’s the old squeaky-wheel-gets-the-oil routine. If the SME is required to attend to off-hours issues, they then have a responsibility to help you do everything possible to reduce the number of off-hours issues that occur.
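One simple way to draw attention to how often problems occur is to keep a log of off-hours incidents and tally it by message. The Python sketch below shows the idea. The log layout, the sample entries, and the function name incidents_by_message are hypothetical, and in practice the data would come from whatever incident records your shop already keeps.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical off-hours incident log; the IDs and fields are made up.
INCIDENTS = [
    {"message_id": "APP0001", "answered_by": "responder A"},
    {"message_id": "APP0002", "answered_by": "responder B"},
    {"message_id": "APP0001", "answered_by": "responder A"},
]


def incidents_by_message(
    incidents: Iterable[Dict[str, str]],
) -> List[Tuple[str, int]]:
    """Count incidents per message ID, most frequent first."""
    counts = Counter(entry["message_id"] for entry in incidents)
    return counts.most_common()


if __name__ == "__main__":
    for message_id, count in incidents_by_message(INCIDENTS):
        print(f"{message_id}: {count} off-hours incident(s)")
```

A short list like this, sorted by frequency, gives IT management a concrete starting point for deciding which underlying problems to fix first.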