Admin Alert: Elements Of An IBM i Incident Management Plan, Part 2

April 16, 2014 Joe Hertvik

Last issue, I started outlining how to set up an IBM i incident management plan, going through four of the seven elements that are crucial for IBM i monitoring and response. This issue, let’s finish up and discuss the final elements an IBM i incident management template should provide.

The Elements Of IBM i Incident Management, Revisited

As presented last time, here are the critical elements every IBM i incident management plan should include.

1. What type of monitoring are you doing: Manual, automatic, or hybrid?
2. What are you monitoring for?
3. Call trees: Who should be alerted when a problem occurs?
4. Call tree protocol: How do you contact responders?
5. Redundancy: What happens if your response protocol breaks down?
6. Who handles damage control and keeping users/management informed?
7. Recovery: What happens after the problem is over?

Last issue, I covered items 1 through 4. Today, let’s look at what you can do to add redundancy, damage control, and recovery planning to the list (items 5 through 7).

Part 5: Redundancy: What happens if your response protocol breaks down?

It’s important to plan for what happens if your notification system breaks down.

Let’s say your notification protocol calls for having your IBM i server send out an email message and a text message to your responders when a problem occurs, and it uses the company’s email server to deliver those messages. (Check out this article for how to deliver text messages via email.) But what if the email system is down or the TCP/IP network hosting your IBM i is unavailable? How do your responders receive alert messages in those cases?

One way to answer this question is with a two-pronged approach that takes advantage of email and an old-fashioned modem. With this approach, every IBM i alert is sent out through two different transmission methods.

The first alert is sent out through the company email system as both an email and a text message.
The second alert is sent out as a text message through an analog modem and a phone line.

This setup takes advantage of TAP paging terminal phone numbers. Many telecommunications companies still supply their own dial-up number for sending out text messages. This means you can send all your alerts out through standard email AND through an analog phone line to your cell phone provider’s TAP numbers. Doing this, you can ensure that your automated IBM i text alerts still go out, even if your email service is down.
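As a rough illustration of the two-pronged approach, here is a minimal Python sketch that sends every alert through all configured channels and only complains if none of them works. Everything in it is an assumption made for illustration: the names send_alert, via_email, and via_tap_modem are hypothetical, the email outage is simulated, and the modem function only records the message where a real version would dial the carrier's TAP number.

```python
from typing import Callable, Dict, List

# A channel is any callable that takes the alert text and raises on failure.
Channel = Callable[[str], None]


def send_alert(message: str, channels: Dict[str, Channel]) -> List[str]:
    """Try every channel and return the names of the ones that succeeded."""
    delivered = []
    for name, send in channels.items():
        try:
            send(message)
            delivered.append(name)
        except Exception as exc:  # one dead channel must not stop the others
            print(f"channel {name} failed: {exc}")
    if not delivered:
        raise RuntimeError("alert could not be delivered on any channel")
    return delivered


# Hypothetical stand-ins for the real transports.
def via_email(message: str) -> None:
    # Simulate the email server being down.
    raise ConnectionError("email server unreachable")


modem_log: List[str] = []


def via_tap_modem(message: str) -> None:
    # A real version would dial the carrier's TAP number; here we only record it.
    modem_log.append(message)


result = send_alert(
    "QSYSOPR: job ended abnormally",
    {"email": via_email, "tap_modem": via_tap_modem},
)
print(result)
```

The point of the design is that the channels are independent: a failure in the email path is logged but never prevents the modem path from being tried.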

See this article I previously wrote for more information on setting up an IBM i modem to use TAP in conjunction with email messages for IBM i system monitoring.

Part 6: Who handles damage control and keeping users/management informed?

When you’re in the middle of handling an IT emergency, it’s easy to forget there are people who may be unable to work because the system is down. In addition, other systems may not be working because of the IBM i problem you’re working on.

So any good IBM i incident management plan should also specify people who play the following roles:

User liaison–Keeps your users informed about what’s happening and how soon a fix will be implemented. Ideally, this should be an IT manager or someone else who isn’t involved with solving the actual issue. The help desk manager is also a good candidate for user liaison.

The user liaison’s job is to get the latest information on progress for an incident fix and to tell affected users how the fix is going. The user liaison’s other job is to keep the pressure off the responders, so they have time to troubleshoot and fix the issue. Depending on how widespread the issue is, the user liaison may need to notify the following groups when a problem occurs.

Management–Depending on proximity and company preferences, notification can be accomplished through an email, but you may also have to make a personal phone call or visit.
Users–Can generally be notified by email. If the issue only affects one department or a small set of users, you may also want to discuss it by phone call or personal visit.
Business partners–Call or email.
Customers–May need to be contacted either by the IT department or the business owner of the customer relationship.

It can be helpful to use a form email that can be updated as problem resolution proceeds. Any email notification you send out should include time of notification; a short description of the problem; the expected fix; the expected time the fix will be implemented; and the expected time you’ll send out the next notification email. An hour is a reasonable amount of time between updated notifications, and it’s important to keep users updated on a regular basis for an extended issue.
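The form email described above could be generated with something like the following Python sketch. The function name build_notification, the field labels, and the sample incident are all hypothetical. The one-hour default for the next update simply follows the interval suggested above.

```python
from datetime import datetime, timedelta


def build_notification(now, problem, expected_fix, fix_eta,
                       update_interval=timedelta(hours=1)):
    """Return the body of a status email containing the five recommended fields."""
    next_update = now + update_interval
    fmt = "%Y-%m-%d %H:%M"
    return "\n".join([
        f"Notification time: {now.strftime(fmt)}",
        f"Problem: {problem}",
        f"Expected fix: {expected_fix}",
        f"Expected fix time: {fix_eta.strftime(fmt)}",
        f"Next update by: {next_update.strftime(fmt)}",
    ])


# Example incident (made-up values for illustration only).
text = build_notification(
    now=datetime(2014, 4, 16, 9, 0),
    problem="Order entry is unavailable on the production partition",
    expected_fix="Restart the affected subsystem",
    fix_eta=datetime(2014, 4, 16, 9, 45),
)
print(text)
```

Calling the function again with a new timestamp and updated estimates produces the next update in the same layout, so users always know where to look for the fix time and the next notification time.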

Damage control–A production outage may also affect other IT processing or production functions. Aside from your responders, you may need someone to gather a team to devise workarounds for the affected systems. Again, this should be someone besides the people working on the problem, though you might employ the user liaison team to perform this function.

Part 7: Recovery: What happens after the problem is over?

Once the problem is resolved, you need to perform the following tasks:

Final notification to users that the problem is fixed–This notification should include any special instructions the users need to follow or items they need to be aware of.
Clean-up–Determine who needs to perform follow-up work to correct any additional issues that occurred because of the original problem. If the issue happened during off-hours, you may need to call in a crew to perform the cleanup. You may also use the damage control crew from Part 6 for this function.
Setting things straight–Reverse any temporary changes that were put in during the fix period, such as holding reactive jobs, limiting customer or employee access to affected functions, etc.
Lessons learned–Analyze the root cause of the problem and determine whether additional items need to be changed to prevent the issue from occurring again. Both the responders and IT management should participate in this phase. New projects may need to be created and approved because of this phase.
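One way to make the "setting things straight" step harder to forget is to record each temporary change, together with the action that undoes it, at the moment the change is made. The Python sketch below shows that idea under stated assumptions: the TemporaryChanges class, the state dictionary, and the two sample changes are hypothetical, and real undo actions would release held jobs or restore access rather than flip a flag.

```python
from typing import Callable, List, Tuple


class TemporaryChanges:
    """Keeps a stack of temporary changes and the actions that reverse them."""

    def __init__(self) -> None:
        self._undo_stack: List[Tuple[str, Callable[[], None]]] = []

    def record(self, description: str, undo: Callable[[], None]) -> None:
        """Register a temporary change and how to reverse it."""
        self._undo_stack.append((description, undo))

    def reverse_all(self) -> List[str]:
        """Undo changes newest first and return the descriptions in that order."""
        reversed_items = []
        while self._undo_stack:
            description, undo = self._undo_stack[-1]
            undo()  # if this raises, the entry stays on the stack for a retry
            self._undo_stack.pop()
            reversed_items.append(description)
        return reversed_items


# Made-up example: two temporary changes put in place during the fix period.
state = {"batch_jobs_held": True, "customer_access_limited": True}
changes = TemporaryChanges()
changes.record("held batch jobs", lambda: state.update(batch_jobs_held=False))
changes.record("limited customer access",
               lambda: state.update(customer_access_limited=False))

undone = changes.reverse_all()
print(undone)
```

Reversing in last-in, first-out order is a design choice, not a requirement; it fits cases where a later change depended on an earlier one.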

This completes my template on setting up an IBM i incident management plan. If you have any comments or other items to add to the plan, email me at joe@joehertvik.com and I may use them in a future column.

Joe Hertvik is an IBM i subject matter expert (SME) and the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago, featuring multiple IBM i ERP systems. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002. Check out his blog where he features practical information for tech users at joehertvik.com.