Admin Alert: Prepping For And Responding To An Unheard Of IBM i #FAIL

May 9, 2012 Joe Hertvik

In the past six months, I’ve had three occasions where two of the six IBM i partitions I manage have needed emergency IPLs to restart their systems. This is unusual for an IBM i installation, and it got me thinking about what could have been done to avoid the issues and how I could have reacted better after they occurred. This article summarizes what I’ve learned from these experiences.

This Ain’t Windows, You Know

These emergency situations occurred on a Power 720 i 6.1 partition (one emergency) and a Power 6 550 partition (two emergencies). While it’s not as unusual for a Windows box or partition to need rebooting one to two times a year, it’s almost unheard of for it to happen on an IBM i. If that kind of instability were normal, IBM i administrators wouldn’t be able to gloat about the system’s renowned reliability the way we do. We expect at least 99.99 percent reliability in our world, and when we don’t get it, we should seriously ask why.
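To put the 99.99 percent figure in perspective, here is a minimal sketch of the arithmetic, assuming a 365-day year and treating "reliability" as simple availability (the function name is made up for this illustration):

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


for pct in (99.9, 99.99):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes/year")
```

At 99.99 percent, that works out to a little under an hour of downtime per year, so a single emergency IPL with troubleshooting can easily use up the whole budget.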

That said, here are the situations I ran into over the last six months and what I learned from each situation:

Situation 1: Ending jobs without reason.

In November, in the middle of a production day, our i 6.1 Power 720 started dropping IP connections (including PC5250 sessions) and ending batch and remote jobs. But the system wouldn’t close the jobs out: no jobs were starting or ending.

We immediately called IBM and were walked through a number of drills. We went into a semi-restricted mode to download some PTFs that might have been relevant to our situation. The box started working correctly again after an IPL.

Situation 2: Spontaneous IPL.

In March, without warning in the middle of a production day, an i5/OS V5R4 Power 6 550 production partition spontaneously IPLed. In this case, we had the option to take a main storage dump so that IBM could analyze what happened. Needing to get the machine back online, we skipped the storage dump and brought the machine up again. Unfortunately, that also meant that our troubleshooting with IBM was finished for this issue.

By sheerest coincidence, we had already loaded the latest cumulative PTF package along with DB, HIPER, and group PTFs onto the machine in anticipation of a weekend IPL to apply them.

So when the machine IPLed, it did so with the latest PTFs on it, bringing us up to date on fixes. But without a storage dump, we had no root cause.

Situation 3: The known Telnet bug.

In the last week of April, the i5/OS V5R4 production partition started dropping the PC5250 sessions that were connecting to the machine through QINTER. This was on a machine that had redundant IP interfaces. We tried to ping the machine, but it wouldn’t answer our pings. We turned off the primary IP interface and we were able to ping the machine again. Every other function that relied on TCP/IP was still working (HTTP, FTP, ODBC, etc.); only interactive sessions running in QINTER were affected. We ended the QINTER subsystem and restarted it, but we still weren’t able to start any interactive sessions.

We contacted IBM to troubleshoot the issue. IBM identified a known bug in the i5/OS Telnet server and told us we may have exacerbated the situation by bouncing QINTER. To get our QINTER capability back, we ended all subsystems, downloaded some recommended PTFs from IBM, and restarted the system.

Lessons Learned From Three IBM i Failures

While these three situations were different, they did share some similarities, and they pointed to steps that may or may not have helped us avoid the problems in the first place. They also taught me some lessons in crisis management for when an IBM i machine fails. Here’s what I learned.

1. Keep your PTFs current–In two of the situations, the latest cumulative PTFs were more than a year old. While it’s possible that more current PTFs may not have prevented the fails, PTF application was ultimately the solution that IBM recommended to solve the problem. Regular PTF application (once or twice a year) should be a goal for any shop. And make sure that you apply cumulative, HIPER, DB, and all relevant group PTFs for your partition.

2. Create a response team and a crisis-management team–When a failure occurs, your most technically adept technician (or team of techs) should be analyzing the situation and calling IBM tech support, if possible. There should also be a manager functioning as a crisis-management coordinator. This person works with the tech team, the management team, and the users to relay technical information about the issue, help devise a timetable for when the system will return, keep management away from the tech team so they can do their work, and issue updates to the user community. If at all possible, isolate your technical resources so they can focus on getting the machine back up again instead of doing crisis management.

3. If you need to apply PTFs, set your system up for semi-restricted state–Semi-restricted state is nothing more than restarting your system with only TCP/IP and your IP interfaces active. You don’t execute your startup program and you don’t start any other subsystems for processing. Period. It allows you to download PTF fixes straight to your IBM i partition without starting up any other function. It’s a handy processing state if you haven’t already tried it. See the Related Stories section below for more references about semi-restricted state.

4. Decide your timetable and next step soon after you start troubleshooting–Determine your organization’s tolerance for downtime and plan accordingly. If your fail is software-related rather than a physical component going down, chances are good IBM will help you identify relevant PTFs to apply and have you IPL your machine. The question becomes how long should you troubleshoot before you have to bring production up again? It may ease tensions if you set a deadline at the beginning of the crisis as to how long you and your technical support will troubleshoot until you just have to bring the machine back up again.

5. If you have a Capacity BackUp (CBU) system, decide how long you’ll wait until you switch processing to the CBU–This goes hand-in-hand with deciding on a troubleshooting timetable and next steps. This should be done by the crisis management team. Decide how long your organization can wait for troubleshooting to yield results before it activates and switches processing over to its CBU box.

6. Decide if you want to perform a system dump to help IBM troubleshoot the issue–In certain instances, you may have an option to perform a full system dump. A system dump can be valuable in helping IBM troubleshoot what actually happened on your system. But a system dump has two serious disadvantages when you’re trying to get your production system back online. First, it can take a fair amount of time: time that your company may be losing in processing orders and servicing the customer. Second, it may take a lot of disk space. If your machine is running close to capacity (90 percent of available system storage), the additional disk space needed for the system dump may be enough to put you over the threshold. So decide if it’s worth your while to attempt a system dump or whether you just want to bring the system back up.

7. Decide how to validate that everything is still working after your machine comes back up–In most cases, IBM i partitions are IPLed so infrequently that you may not have a current list of all the programs that should be running. As part of your crash kit, you should always have a recent Work with Active Jobs (WRKACTJOB) listing ready to validate that your system has come up correctly. Once the machine finishes IPLing and the system is running again, you can compare the subsystems and jobs in the WRKACTJOB listing against the jobs running on your system to make sure all your critical subsystems and jobs are running again. If you have a CBU run book, you may also be able to use any validation checklists documented in your switchback procedures to ensure that all critical subsystems, features, and jobs are running again.

8. Check the Work with Problems (WRKPRB) screen to determine if there’s a hardware issue involved–In some cases, a hardware problem could be the underlying issue. For example, my problem with users disconnecting could have been due to a faulty Ethernet card. Check WRKPRB to determine if there are any hardware alerts on your system while the problem is occurring.
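The comparison described in item 7 can be sketched in a few lines of Python run on a PC. This is only an illustration under stated assumptions: it assumes you have saved your baseline and post-IPL job lists as plain text with one "SUBSYSTEM JOBNAME" pair per line, which is a hypothetical simplified format and not the actual WRKACTJOB output layout. The function names and job names below are likewise made up for the example.

```python
def parse_listing(text):
    """Parse a saved job listing into a set of (subsystem, job) pairs.

    Assumes a simplified, hypothetical export with one job per line in the
    form "SUBSYSTEM JOBNAME"; real WRKACTJOB output needs its own parser.
    """
    jobs = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            jobs.add((parts[0], parts[1]))
    return jobs


def missing_jobs(baseline_text, current_text):
    """Return jobs present in the baseline but absent after the IPL."""
    return sorted(parse_listing(baseline_text) - parse_listing(current_text))


# Illustrative data only: the job names are made up for this example.
BASELINE = """\
QBATCH ORDERPOST
QHTTPSVR WEBADMIN
QINTER MENUJOB
"""

AFTER_IPL = """\
QINTER MENUJOB
QHTTPSVR WEBADMIN
"""

for subsystem, job in missing_jobs(BASELINE, AFTER_IPL):
    print(f"Not running: {job} in subsystem {subsystem}")
```

A set difference keeps the check independent of the order in which jobs appear, so the only thing reported is what was running before the failure and is not running now.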

While it’s rare for an IBM i machine to fail, it does happen occasionally. Hopefully, you can use the information in this article to help make these events even rarer and to recover quickly after a software failure.

Joe Hertvik is the owner of Hertvik Business Services, a service company that provides written marketing content and presentation services for the computer industry, including white papers, case studies, and other marketing material. Email Joe for a free quote for any upcoming projects. He also runs a data center for two companies outside Chicago. Joe is a contributing editor for IT Jungle and has written the Admin Alert column since 2002.