Admin Alert: Elements Of An IBM i Incident Management Plan, Part 2
April 16, 2014 Timothy Prickett Morgan
Last issue, I started outlining how to set up an IBM i incident management plan, going through four of the seven elements that are crucial for IBM i monitoring and response. This issue, let’s finish up and discuss the final elements an IBM i incident management template should provide.
The Elements Of IBM i Incident Management, Revisited
As presented last time, here are the critical elements every IBM i incident management plan should include.
Last issue, I covered items 1 through 4. Today, let’s look at what you can do to add redundancy, damage control, and recovery planning to the list (items 5 through 7).
Part 5: Redundancy: What happens if your response protocol breaks down?
It’s important to plan for what happens if your notification system breaks down.
Let’s say your notification protocol calls for your IBM i server to send an email message and a text message to your responders when a problem occurs, using the company’s email server to deliver both. (Check out this article for how to deliver text messages via email.) But what if the email system is down, or the TCP/IP network hosting your IBM i is unavailable? How do your responders receive alert messages in those cases?
One way to answer this question is with a two-pronged approach that uses both email and an old-fashioned modem. With this approach, every IBM i alert is sent out through two different transmission methods.
This setup takes advantage of TAP (Telocator Alphanumeric Protocol) paging terminal phone numbers. Many telecommunications companies still supply their own dial-up numbers for sending text messages. This means you can send every alert out through standard email AND through an analog phone line to your cell phone provider’s TAP number. That way, your automated IBM i text alerts still go out even if your email service is down.
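To make the idea of sending every alert on both paths concrete, here is a minimal Python sketch. It is an illustration, not the author’s IBM i implementation. It assumes each transmission path is wrapped in a hypothetical sender function that raises an exception when delivery fails; the point is that a failure on one path is logged and never stops the other path from being tried.

```python
import logging
from typing import Callable, Dict

# Hypothetical sender type: a callable that delivers the alert text over one
# transmission path (SMTP email, TAP dial-up, ...) and raises on failure.
# Each sender is assumed to already know its own recipients.
Sender = Callable[[str], None]


def send_alert_redundantly(message: str, channels: Dict[str, Sender]) -> Dict[str, bool]:
    """Attempt delivery on every channel, independently of the others.

    Returns a map of channel name -> True if that channel reported success.
    """
    results: Dict[str, bool] = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = True
        except Exception as exc:
            # One broken path (e.g., the email server is down) is logged,
            # but the loop still tries the remaining paths.
            logging.warning("alert channel %s failed: %s", name, exc)
            results[name] = False
    return results


if __name__ == "__main__":
    delivered = []

    def email_sender(message: str) -> None:
        # Simulate the company email server being unavailable.
        raise ConnectionError("SMTP server unreachable")

    def tap_sender(message: str) -> None:
        # Stand-in for dialing the carrier's TAP number over a modem.
        delivered.append(message)

    print(send_alert_redundantly("PRODSYS: QSYSOPR message needs reply",
                                 {"email": email_sender, "tap": tap_sender}))
    # prints {'email': False, 'tap': True}
```

Sending on both paths every time, rather than falling back to the modem only after email fails, matches the approach described above and avoids having to detect an email failure before the backup path kicks in.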
See this article I previously wrote for more information on setting up an IBM i modem to use TAP in conjunction with email messages for IBM i system monitoring.
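For readers curious about what actually travels over the phone line, the Python sketch below frames a single page as a TAP transaction block: STX, pager ID, CR, message text, CR, ETX, then a three-character checksum and a final CR. The function names are hypothetical. It assumes plain printable ASCII and a short single-block message, and it leaves out the modem dialing, the login handshake with the carrier’s paging terminal, acknowledgements, and message splitting, all of which a real sender needs.

```python
STX, ETX, CR = "\x02", "\x03", "\r"


def tap_checksum(block: str) -> str:
    """TAP checksum: sum the 7-bit values of every character in the block
    (STX through ETX inclusive), keep the low 12 bits, and encode each of the
    three 4-bit nibbles (high to low) as a character by adding 0x30."""
    total = sum(ord(ch) & 0x7F for ch in block) & 0xFFF
    return "".join(chr(0x30 + ((total >> shift) & 0xF)) for shift in (8, 4, 0))


def tap_block(pager_id: str, message: str) -> str:
    """Frame one page as a single TAP transaction block."""
    body = STX + pager_id + CR + message + CR + ETX
    return body + tap_checksum(body) + CR


if __name__ == "__main__":
    print(repr(tap_block("123", "ABC")))
    # prints '\x02123\rABC\r\x0317;\r'
```

As a worked check, the characters of the example block sum to 379 (0x17B), which splits into the nibbles 1, 7, and 11 and encodes as the characters 1, 7, and ;.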
Part 6: Who handles damage control and keeping users/management informed?
When you’re in the middle of handling an IT emergency, it’s easy to forget there are people who may be unable to work because the system is down. In addition, the IBM i problem you’re working on may be keeping other parts of the system from working.
So any good IBM i incident management plan should also specify people who play the following roles:
User liaison: Keeps your users informed about what’s happening and how soon a fix will be implemented. Ideally, this should be an IT manager or someone else who isn’t involved with solving the actual issue. The help desk manager is also a good candidate for user liaison.
The user liaison’s job is to get the latest progress information on the fix and pass it on to affected users. The user liaison’s other job is to keep the pressure off the responders, so they have time to troubleshoot and fix the issue. Depending on how widespread the issue is, the user liaison may need to notify the following groups when a problem occurs.
It can be helpful to use a template email that you update as problem resolution proceeds. Every notification you send should include the time of notification; a short description of the problem; the expected fix; the expected time the fix will be implemented; and the time you’ll send the next notification. An hour is a reasonable interval between updates, and it’s important to keep users updated regularly during an extended outage.
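A notification template along these lines can be sketched in Python. The five fields and the one-hour default interval come from the checklist above; the class name, the sample incident, and the date format are illustrative assumptions, and actually sending the email is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentUpdate:
    # Hypothetical structure holding the items every notification should carry.
    problem: str
    expected_fix: str
    expected_fix_time: datetime
    sent_at: datetime
    update_interval: timedelta = timedelta(hours=1)  # one hour between updates

    @property
    def next_update_at(self) -> datetime:
        # Promise the next update one interval after this one is sent.
        return self.sent_at + self.update_interval

    def render(self) -> str:
        """Produce the plain-text body of the notification email."""
        fmt = "%Y-%m-%d %H:%M"
        return "\n".join([
            f"Time of notification: {self.sent_at.strftime(fmt)}",
            f"Problem: {self.problem}",
            f"Expected fix: {self.expected_fix}",
            f"Expected fix time: {self.expected_fix_time.strftime(fmt)}",
            f"Next update by: {self.next_update_at.strftime(fmt)}",
        ])


if __name__ == "__main__":
    # Illustrative incident; replace with real details during an outage.
    update = IncidentUpdate(
        problem="Order entry screens are not responding",
        expected_fix="Restart the affected subsystem",
        expected_fix_time=datetime(2014, 4, 16, 10, 30),
        sent_at=datetime(2014, 4, 16, 9, 15),
    )
    print(update.render())
```

For each follow-up, the user liaison would create a new update with a fresh sent_at time, so the promised next-update time always moves forward with the incident.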
Damage control: A production bust may also affect other IT processing or production functions. Aside from your responders, you may need someone to gather a team to devise workarounds for the affected systems. Again, this should be someone other than the people working on the problem, though you might have the user liaison perform this function.
Part 7: Recovery: What happens after the problem is over?
After the problem is resolved, you need to perform the following functions:
This completes my template on setting up an IBM i incident management plan. If you have any comments or other items to add to the plan, email me at firstname.lastname@example.org and I may use them in a future column.
Joe Hertvik is an IBM i subject matter expert (SME) and the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago, featuring multiple IBM i ERP systems. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002. Check out his blog where he features practical information for tech users at joehertvik.com.