Admin Alert: Preparing Your CBU For a Real Emergency
March 3, 2010 Joe Hertvik
Alongside human tragedies, computer tragedies can also occur when a critical i/OS production system stops working and a shop must unexpectedly rely on its Capacity BackUp (CBU) system as its production box. Because a hard CBU cutover poses more challenges than a planned switch test, here are some additional configuration tasks that can help you account for the unexpected problems that go along with an emergency CBU switchover.
When Those Who Stand and Wait No Longer Wait
Like a good insurance policy, you hopefully never have to use your CBU. Unfortunately, there are a number of situations where the CBU can be pressed into service for real, not just for a test. Here are just a few of the scenarios that have caused organizations to suddenly lose their production iSeries, System i, or Power i box, forcing an emergency CBU activation.
It may not take much to put your System i out of commission, forcing your CBU into service. To prepare for an unexpected changeover, you should modify your CBU run book to include the following measures, which need to occur during an emergency switchover. Some of these steps may not be obvious when performing routine switch exercises, but they can become critical during a real emergency.
Call Your HA Vendor
For an emergency fail-over, be sure to include a run book step to contact your high availability software vendor as soon as possible. If your CBU suddenly takes over processing from a crashed production box, there may be replication updates that haven't yet been applied to the CBU. Your vendor can help you work through any issues involving corrupted or incompletely replicated data. They should also be able to counsel you on how to recover from an interrupted replication.
What Was Running When Production Crashed?
The first few moments after your production system disappears can be critical in getting your CBU system correctly back up and running. If the main system went down quickly, it may also have abnormally ended critical batch or interactive jobs. This may result in corrupted data or unfinished jobs that need to be restarted to process customer orders correctly.
To get information on which jobs were running when your main system crashed, you will need to take frequent snapshots of your system and make them instantly available on the CBU. On our production box, we set up a snapshot job that runs the following Work with Active Jobs (WRKACTJOB) command every 30 seconds and immediately sends that output to a replicated output queue on the CBU system.

WRKACTJOB OUTPUT(*PRINT)
If you have this information and your production box crashes, the tech staff can bring up the latest WRKACTJOB printout and see what was running within 30 seconds of when the machine went down. Once you have the automated snapshot process set up, you can place a step in the run book to remind the recovery team to check the latest WRKACTJOB printout.
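The snapshot job described above can be sketched as a small never-ending CL program. The library, program, and output queue names here (MYLIB, CBUOUTQ) are placeholders, not from the article, so substitute your own; QPDSPAJB is the printer file WRKACTJOB uses with OUTPUT(*PRINT).

             PGM
             /* Route the WRKACTJOB listing to the replicated output queue */
             OVRPRTF    FILE(QPDSPAJB) OUTQ(MYLIB/CBUOUTQ) OVRSCOPE(*JOB)
 LOOP:
             WRKACTJOB  OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000)  /* Ignore errors; keep snapping   */
             DLYJOB     DLY(30)         /* Wait 30 seconds, then repeat   */
             GOTO       CMDLBL(LOOP)
             ENDPGM

You would submit this once to a batch subsystem, for example with SBMJOB CMD(CALL PGM(MYLIB/SNAPACT)) JOB(ACTSNAP), and let it run around the clock.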
What Was Waiting To Run When Production Crashed?
In addition to knowing what was running, you also need to know what was waiting to run when the production system went down. This is important for several reasons, including knowing how many orders were waiting to process; restarting automated job streams that contain programs that are dependent on other programs completing properly; and knowing how many and which jobs need to be resubmitted.
Similar to how you can track which jobs were running on your production system when the failure occurred, you can write a program to run the following commands to keep track of any jobs that were waiting to run when the system crashed. This program needs to cover the following steps.
1. Use the Work with Job Queue (WRKJOBQ) command to create a spooled file that lists out how many entries were in each of your job queues. Run this command every 30 seconds to get a complete view of what your job queues looked like before the crash. To get this information, run the following WRKJOBQ command.
WRKJOBQ JOBQ(*ALL) OUTPUT(*PRINT)
Like the WRKACTJOB command discussed above, place this report into a replicated output queue so that it will immediately be transferred and saved on your CBU.
2. Cycle through the list of job queue entries created in step 1 and, for every job queue that had entries waiting to run, take a picture of the jobs that were waiting in that job queue. Again, place each spooled file output in an output queue that is automatically replicated to your CBU.
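Steps 1 and 2 can be combined into one CL sketch. Here the job queue names (ORDERQ, NIGHTQ), library (MYLIB), and replicated output queue (CBUOUTQ) are illustrative assumptions; a production version might retrieve the live job queue list with an API such as QSPRJOBQ instead of hardcoding the queues you care about.

             PGM
             /* Send all *PRINT output from this job to the replicated queue */
             CHGJOB     OUTQ(MYLIB/CBUOUTQ)
 LOOP:
             /* Step 1: entry counts for every job queue on the system      */
             WRKJOBQ    JOBQ(*ALL) OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000)
             /* Step 2: detail listings for the queues that hold your work  */
             WRKJOBQ    JOBQ(MYLIB/ORDERQ) OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000)
             WRKJOBQ    JOBQ(MYLIB/NIGHTQ) OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000)
             DLYJOB     DLY(30)
             GOTO       CMDLBL(LOOP)
             ENDPGM

Hardcoding the detail queues keeps the program simple, at the cost of having to update it whenever you add a job queue that matters to recovery.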
By keeping track of which jobs were running when the production system stopped working and which jobs were waiting in job queues, you can give your recovery team some basic tools to determine how much damage may have been done to your system integrity as they restart production on the CBU.
You should also note that when you run automated jobs every 30 seconds to produce printouts of your active jobs and jobs waiting to run, you will slowly fill your production and CBU systems with spooled file output. To avoid this situation, you should have another automated program in place that deletes spooled files older than a given number of days. In an earlier column, I demonstrated how to create a job that automatically deletes spooled files in an output queue that meet certain deletion criteria. You can use that program or another spooled file deletion system to clear out old WRKACTJOB and WRKJOBQ spooled files so that they don't clog up your system.
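If you don't have an age-based spooled file cleaner handy, one blunt alternative is simply to clear the dedicated snapshot output queue once a day. This is safe only under the assumption that the queue holds nothing but the WRKACTJOB and WRKJOBQ snapshots, which the snapshot job recreates within 30 seconds anyway. The names MYLIB, CBUOUTQ, and CLRSNAP are placeholders.

             PGM
             /* Empty the snapshot queue; the snapshot job repopulates   */
             /* it on its next 30-second cycle                           */
             CLROUTQ    OUTQ(MYLIB/CBUOUTQ)
             MONMSG     MSGID(CPF0000)
             ENDPGM

You could schedule this nightly with something like ADDJOBSCDE JOB(CLRSNAP) CMD(CALL PGM(MYLIB/CLRSNAP)) FRQ(*WEEKLY) SCDDAY(*ALL) SCDTIME(0200). An age-based cleaner is still the better choice if you want to keep a few days of history.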
Synchronizing Your Production Scheduling System With Your CBU
If you haven’t already done so, you should implement two procedures involving your automated job scheduling software.
1. On a regular basis, preferably several times a day, save and transfer your production job schedule from your production machine to your CBU. Depending on how your CBU is set up, you may even be able to directly replicate your automated job schedule to the CBU machine. In addition to the run schedule, be sure to also transfer over any processing history associated with the schedule (including the dates and times that each job ran) so that the recovery team will be able to determine which production jobs did and did not run when the system crashed. By replicating the schedule and its history on a regular basis (the more often the better), you should be able to capture all the relevant scheduling activity that occurs during the day.
2. Incorporate steps into your run book so that if necessary, you can recreate and restart your production job schedule and run history on the CBU. If your run book has been thoroughly tested, this procedure may already be available.
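If you use the basic i/OS job scheduler (the WRKJOBSCDE entries) rather than a third-party package, one way to sketch step 1 is to save the scheduler's object to a save file several times a day and let your replication software carry that save file to the CBU. The basic scheduler keeps its entries in object QDFTJOBSCD (type *JOBSCD) in library QUSRSYS; the library and save file names MYLIB and SCDSAVF are placeholders. Note that the basic scheduler keeps only limited run history, which is one reason third-party schedulers need their own vendor-specific procedures.

             PGM
             /* Create the save file on first use; ignore "already exists" */
             CRTSAVF    FILE(MYLIB/SCDSAVF)
             MONMSG     MSGID(CPF0000)
             CLRSAVF    FILE(MYLIB/SCDSAVF)
             /* Save the job schedule object for replication to the CBU   */
             SAVOBJ     OBJ(QDFTJOBSCD) LIB(QUSRSYS) DEV(*SAVF) +
                          OBJTYPE(*JOBSCD) SAVF(MYLIB/SCDSAVF)
             ENDPGM

Scheduling this program to run several times a day gives the recovery team a reasonably fresh copy of the schedule to restore on the CBU.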
For information on how to replicate or restore your production job schedule and its attendant history, contact a third-party software vendor.