Admin Alert: Elements Of An IBM i Incident Management Plan, Part 1

April 2, 2014 Joe Hertvik

How do you react when there’s an issue with your IBM i partition? This isn’t an easy question as IBM i incidents can involve many different areas, including networking/infrastructure, applications, storage, performance, and third-party vendors. This issue and next, I’ll outline the critical elements of an IBM i incident management plan and how it helps you quickly determine what to do and who to call when problems come up.

The Elements Of IBM i Incident Management

Here are the critical elements I believe every IBM i incident management plan should include.

1. What type of monitoring are you doing: Manual, automatic, or hybrid?
2. What are you monitoring for?
3. Call trees: Who should be alerted when a problem occurs?
4. Call tree protocol: How do you contact responders?
5. Redundancy: What happens if your response protocol breaks down?
6. Who handles damage control and keeping users/management informed?
7. Recovery: What happens after the problem is over?

Only a few of these items are specifically IBM i related. Most of them have to do with setting up the right incident management protocol and executing that protocol. Many of these items can also be applied to network and server incident monitoring systems, so you can use this plan as a template for all your IT operations monitoring, not just your IBM i system.

Here’s my take on how incident management works with an IBM i. This issue, I’ll cover what type of monitoring you should be doing (item 1) through call tree protocols (item 4). Next issue, I’ll finish up by covering redundancy through recovery (items 5 through 7).

Here’s how you can approach setting up an IBM i incident monitoring plan, from start to finish.

Part 1: What type of monitoring are you doing: Manual, automatic, or hybrid?

Up through the early 2000s, it was common for companies to have round-the-clock system operators on staff. Besides handling administrative tasks such as printing, delivering forms and reports, running backups, and handling tapes, they often watched their pre-IBM i systems (usually an AS/400 or iSeries running OS/400) for problems. . . and they sometimes slept on the job;-). It’s getting rare to find these old types of pure system operators anymore. But they represent the first type of monitoring: manual monitoring.

In some larger installations, the operator roles have been maintained but they’ve been upgraded. These modern operators may still handle backups and print forms, but they are also now used as Level 1 Help Desk personnel who respond to and escalate issues with IBM i partitions, other servers, and user desktops. Data center personnel can also act as Level 2 Help Desk personnel and perform break/fix, configuration, software installation, or hardware repair on company devices (for more information on Help Desk roles, check out my post on What is Level 1, Level 2, and Level 3 Help Desk Support?).

Even if you still retain a data center with live personnel monitoring the system, there is still a need to provide IBM i system monitoring software, because your personnel are probably too busy handling other functions to constantly be watching your IBM i partitions. Having on-site personnel who take advantage of system monitoring constitutes a hybrid environment that combines old time system operator/help desk functions with automated monitoring.

The third type of monitoring is a lights out (or darkened) data center, where there aren’t any personnel on duty around the clock. In 2014, many of these old time operators have been completely replaced as report handling and backup needs have become more automated (or as reports migrated to emails and disk-based backups became more common) and modern IBM i system monitoring software has become more sophisticated. There are plenty of system monitoring packages in the market that can watch your system for you, alerting you by email, text messages, tweets, and pagers (for the one or two people out there who still have pagers) when there is a problem with the system. To find a complete list of vendors who supply monitoring software, check out the community list of IBM i system monitoring products I maintain on my website. In a lights out environment, it’s imperative you have reliable, well-configured IBM i monitoring software to alert you when there’s an issue.

The main point of this section is that no matter what type of data center you run (manual, hybrid, or lights out), one of the better investments you can make for your IBM i incident management plan is to purchase and configure a good IBM i system monitoring product. In certain instances, there may be a call to write your own monitoring packages, but the vendor packages are very good and cover such a wide variety of issues that you are generally better off if you buy an off-the-shelf package and save your programming skills for other business-oriented issues.

Part 2: What are you monitoring for?

There are a number of items that you could and should be monitoring for on your IBM i. To get a feel for what items you should be monitoring, check out the following articles that cover specific problems you can monitor for on your partition.

To find even more articles on this topic, check out my index of all the articles I’ve ever written on IBM i system monitoring, most of which have appeared in IT Jungle.

Part 3: Call trees: Who should be alerted when a problem occurs?

Once you’ve defined your monitoring type (part 1) and what you are monitoring for (part 2), the next step is to determine who will handle an issue when it comes up. To do this, you have to create a call tree for your monitoring software or data center personnel to use. The call tree should contain the following items for each type of IBM i issue you may encounter.

Expertise–Who should respond to each type of problem? Understand that you’ll need more than one expert from each area to cover vacations and other out of office situations when your primary expert isn’t available. In your IBM i environment, there may be different call trees for application support; job scheduling issues; administrators; storage (especially if you’re connecting via a Storage Area Network, SAN); and network support.
Contact information for each expert–How do you contact your experts? For redundancy’s sake, there should be two ways to contact each person in case the first method fails (as may occur when a cell phone dies).
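
As a rough illustration only, the two call tree items above could be recorded in a small data structure and checked for gaps. This is a minimal sketch, not the format of any particular monitoring package: the class names, the `problems` check, and all of the sample names, phone numbers, and addresses are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Responder:
    """One expert in a call tree (hypothetical structure)."""
    name: str
    contact_methods: list[str]  # ordered: primary method first, then backups

    def has_backup_contact(self) -> bool:
        # The text asks for two ways to reach each person.
        return len(self.contact_methods) >= 2


@dataclass
class CallTree:
    """The responders for one type of IBM i issue, in escalation order."""
    issue_type: str
    responders: list[Responder] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List the gaps measured against the two rules described above."""
        found = []
        if len(self.responders) < 2:
            found.append(f"{self.issue_type}: needs more than one expert")
        for responder in self.responders:
            if not responder.has_backup_contact():
                found.append(
                    f"{self.issue_type}: {responder.name} "
                    "needs a second contact method"
                )
        return found


# Hypothetical sample data, invented for illustration.
storage_tree = CallTree("storage", [
    Responder("Primary SAN admin", ["cell:555-0100", "email:san1@example.com"]),
    Responder("Backup SAN admin", ["cell:555-0101"]),
])
network_tree = CallTree("network", [
    Responder("Network engineer", ["cell:555-0102", "home:555-0103"]),
])

if __name__ == "__main__":
    for tree in (storage_tree, network_tree):
        for problem in tree.problems():
            print(problem)
```

Running a check like this whenever the call tree changes is one way to catch a tree that quietly depends on a single person or a single phone number.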

Part 4: Call tree protocol: How do you contact responders?

Different shops have different protocols for contacting incident responders. Some people like email notification. Some like tweets or texts. Some even prefer an old-fashioned phone call. Determine and document what the protocol is for contacting your experts when an issue needs to be worked on.

Most IBM i system monitoring packages allow you to put in call trees for different types of monitors. As mentioned above, different issues may call for different technical resources, so be sure to account for each problem type as you create your call trees.

The call tree is the list of responders plus the protocol that governs how long your monitoring software or on-site personnel wait for a technical resource to respond before they escalate an issue to the next person in the call tree.

A typical escalation tree might work something like this:

Send a message out to responder A and wait 30 minutes.
If there is no response or responder A hasn’t started working on the issue after a half-hour, escalate the message to responder B.
Set up escalation protocols for responders C, D, E, etc. in your call tree.

Once you define your escalation tree, you can encode it into your IBM i monitoring software or you can document it in a central location that all of your technical resources can use to contact the next highest expert for assistance in resolving the problem.
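
For readers who want to see the escalation logic spelled out, here is a minimal sketch of the notify, wait, and escalate loop described above. It assumes nothing about any real monitoring product: `escalate`, `notify`, and `wait_for_ack` are hypothetical names, and the two callbacks stand in for whatever your software or operators actually do to send a message and confirm that someone has picked up the issue.

```python
from typing import Callable, Optional, Sequence


def escalate(
    responders: Sequence[str],
    notify: Callable[[str], None],
    wait_for_ack: Callable[[str, float], bool],
    wait_minutes: float = 30,
) -> Optional[str]:
    """Work down the call tree until someone acknowledges the issue.

    notify(responder) sends the alert (hypothetical callback).
    wait_for_ack(responder, minutes) waits up to the given number of
    minutes and returns True if the responder has started on the issue.
    Returns the responder who took ownership, or None if nobody did.
    """
    for responder in responders:
        notify(responder)
        if wait_for_ack(responder, wait_minutes):
            return responder
        # No response within the window: fall through to the next responder.
    return None


if __name__ == "__main__":
    # Simulated run: responder A never answers, responder B does.
    sent = []
    owner = escalate(
        ["responder A", "responder B", "responder C"],
        notify=sent.append,
        wait_for_ack=lambda who, minutes: who == "responder B",
    )
    print("Owner:", owner)
    print("Notified:", sent)
```

A return value of None means the whole call tree was exhausted without a response, which is the situation a redundancy plan has to cover.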

These items are necessary regardless of whether you’re using manual, hybrid, or lights out monitoring for your incident management.

To be concluded in our next issue on April 16th.

Joe Hertvik is an IBM i subject matter expert (SME) and the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago, featuring multiple IBM i ERP systems. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002. Check out his blog where he features practical information for tech users at joehertvik.com.