Volume 10, Number 8 -- March 3, 2010

Admin Alert: Preparing Your CBU For a Real Emergency

Published: March 3, 2010

by Joe Hertvik

Alongside human tragedies, computer tragedies can also occur when a critical i/OS production system stops working and a shop must unexpectedly rely on its Capacity BackUp (CBU) system as its production box. Because a hard CBU cutover can pose more challenges than a planned switch test, here are some additional configuration tasks that can help you prepare for the unexpected problems that come with an emergency CBU switchover.

When Those Who Stand and Wait No Longer Wait

Like a good insurance policy, your CBU is something you hope you never have to use. Unfortunately, there are a number of situations where the CBU can be pressed into service for real, not just for a test. Here are just a few of the scenarios that have caused organizations to suddenly lose their production iSeries, System i, or Power i box, forcing an emergency CBU activation.

  • Floods, which include traditional flooding such as happened in the Midwest in 2009, as well as non-traditional flooding situations such as happened in 1992, when buildings in the Chicago Loop flooded after a utility tunnel that ran under the Chicago River was breached.
  • Other acts of nature such as earthquakes and hurricanes.
  • Major power outages. If the utility company isn't able to get power up and running for a few days, the CBU may need to be turned on to take over processing.
  • Power i hardware problems, such as a backplane error that takes the machine down unexpectedly.
  • A collapsed roof in the computer room due to snow or rain buildup.
  • A fire in the computer room.

It may not take much to put your System i out of commission, forcing your CBU into service. To prepare for an unexpected change, you should modify your CBU run book to include the following measures that need to occur during an emergency switchover. Some of these steps may not be obvious when performing routine switch exercises, but they can become critical during a real emergency.

Call Your HA Vendor

For an emergency fail-over, be sure to include a run book step to contact your high availability software vendor as soon as possible. If your CBU suddenly takes over processing from a crashed production box, there may be replication updates that were never applied to the CBU. Your vendor can help you work through any issues involving corrupted or incompletely replicated data. They should also be able to counsel you on how to recover from an interrupted replication.

What Was Running When Production Crashed?

The first few moments after your production system disappears can be critical in getting your CBU system correctly back up and running. If the main system went down quickly, it may also have abnormally ended critical batch or interactive jobs. This may result in corrupted data or unfinished jobs that need to be restarted to process customer orders correctly.

To get information on which jobs were running when your main system crashed, you will need to take frequent snapshots of your system and make them instantly available on the CBU. On our production box, we set up a snapshot job that runs the following Work with Active Jobs (WRKACTJOB) command every 30 seconds and immediately sends that output to a replicated output queue on the CBU system.

WRKACTJOB OUTPUT(*PRINT)

If you have this information and your production box crashes, the tech staff can bring up the latest WRKACTJOB printout and see what was running within 30 seconds of when the machine went down. Once you have the automated snapshot process set up, you can place a step in the run book to remind the recovery team to check the latest WRKACTJOB printout.
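
As a rough illustration of the snapshot job described above, here is a minimal CL sketch. It is not the exact program we run; the program name SNAPJOBS and the output queue HALIB/SNAPSHOTS are hypothetical placeholders, and the sketch assumes that your high availability software is already replicating that output queue to the CBU.

```cl
/* SNAPJOBS: hypothetical snapshot program (all names are assumed)  */
PGM
             DCL        VAR(&ENDSTS) TYPE(*CHAR) LEN(1)

             /* Send this job's spooled output to the output queue  */
             /* that is replicated to the CBU (placeholder name).   */
             CHGJOB     OUTQ(HALIB/SNAPSHOTS)

             /* Print the active job list to a spooled file.        */
 LOOP:       WRKACTJOB  OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000) /* Keep going if one fails   */

             /* Leave cleanly if a controlled end was requested.    */
             RTVJOBA    ENDSTS(&ENDSTS)
             IF         COND(&ENDSTS *EQ '1') THEN(RETURN)

             /* Wait 30 seconds, then take the next snapshot.       */
             DLYJOB     DLY(30)
             GOTO       CMDLBL(LOOP)
ENDPGM
```

A program like this would typically be submitted to batch as a never-ending job, and ended with a controlled End Job (ENDJOB) request when it needs to come down.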

What Was Waiting To Run When Production Crashed?

In addition to knowing what was running, you also need to know what was waiting to run when the production system went down. This is important for several reasons, including knowing how many orders were waiting to process; restarting automated job streams that contain programs that are dependent on other programs completing properly; and knowing how many and which jobs need to be resubmitted.

Just as you can track which jobs were running on your production system when the failure occurred, you can write a program that keeps track of any jobs that were waiting to run when the system crashed. This program needs to cover the following steps.

1. Use the Work with Job Queue (WRKJOBQ) command to create a spooled file that lists out how many entries were in each of your job queues. Run this command every 30 seconds to get a complete view of what your job queues looked like before the crash. To get this information, run the following WRKJOBQ command.

WRKJOBQ JOBQ(*ALL) OUTPUT(*PRINT)

Like the WRKACTJOB command discussed above, place this report into a replicated output queue so that it will immediately be transferred and saved on your CBU.

2. Cycle through the list of job queue entries created in step 1 and for every job queue that had entries in it that were waiting to run, take a picture of the jobs that were waiting in that job queue. Again, place each spooled file output in an output queue that is automatically replicated to your CBU.
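
The two steps above might be sketched in CL as follows. This is a simplified version under stated assumptions: instead of reading the step 1 listing to find which job queues have entries, it prints a fixed list of job queues every cycle. QGPL/QBATCH is the standard batch job queue; PRODLIB/ORDERS and the output queue HALIB/SNAPSHOTS are hypothetical names you would replace with your own.

```cl
/* SNAPJOBQ: hypothetical job queue snapshot program (names assumed) */
PGM
             DCL        VAR(&ENDSTS) TYPE(*CHAR) LEN(1)

             /* Route spooled output to the replicated output queue. */
             CHGJOB     OUTQ(HALIB/SNAPSHOTS)

             /* Step 1: summary of all job queues and entry counts.  */
 LOOP:       WRKJOBQ    JOBQ(*ALL) OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000)

             /* Step 2: list the jobs waiting in each critical job   */
             /* queue. Add one WRKJOBQ per queue you care about.     */
             WRKJOBQ    JOBQ(QGPL/QBATCH) OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000)
             WRKJOBQ    JOBQ(PRODLIB/ORDERS) OUTPUT(*PRINT)
             MONMSG     MSGID(CPF0000)

             /* Leave cleanly if a controlled end was requested.     */
             RTVJOBA    ENDSTS(&ENDSTS)
             IF         COND(&ENDSTS *EQ '1') THEN(RETURN)

             /* Wait 30 seconds before the next set of snapshots.    */
             DLYJOB     DLY(30)
             GOTO       CMDLBL(LOOP)
ENDPGM
```

Printing a fixed list produces more spooled files than printing only the queues that have entries, but it avoids having to parse the step 1 report, whose layout is not covered here.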

By keeping track of which jobs were running when the production system stopped working and which jobs were waiting to be run in job queues, you can give your recovery team some basic tools to determine how much damage may have been done to your system integrity as they restart production on the CBU.

You should also note that when you run automated jobs to produce printouts of your active jobs and jobs waiting to run every 30 seconds, you will slowly start to fill your production and CBU systems up with spooled file output. To avoid this situation, you should have another automated program in place to delete spooled files that are more than x days old. In an earlier column, I demonstrated how to create a job that automatically deleted spooled files in an output queue that meet certain deletion criteria. You can use that program or another spooled file deletion system to clear out excess WRKACTJOB and WRKJOBQ spooled files so that they don't clog up your system.
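
If your release supports spooled file expiration, one alternative to a custom deletion program (and a different technique from the one in my earlier column) is to stamp each snapshot with an expiration date when it is created and let the Delete Expired Spooled Files (DLTEXPSPLF) command remove it later. The sketch below assumes that support is present on your system; the two-day retention period, the output queue HALIB/SNAPSHOTS, and the job schedule entry name are hypothetical choices.

```cl
/* In the snapshot program: give each printout a 2-day lifetime.     */
/* FILE(*PRTF) applies the override to every printer file this job   */
/* opens, so the same override also covers WRKJOBQ output.           */
             OVRPRTF    FILE(*PRTF) OUTQ(HALIB/SNAPSHOTS) +
                          EXPDATE(*DAYS) DAYS(2)
             WRKACTJOB  OUTPUT(*PRINT)
             DLTOVR     FILE(*PRTF)

/* Run once, interactively: schedule a daily cleanup at 1:00 a.m.    */
/* that deletes every spooled file whose expiration date has passed. */
             ADDJOBSCDE JOB(DLTEXPSPLF) CMD(DLTEXPSPLF ASPGRP(*ALL)) +
                          FRQ(*WEEKLY) SCDDATE(*NONE) SCDDAY(*ALL) +
                          SCDTIME(0100)
```

Whether the expiration date carries over to the replicated copies on the CBU depends on your high availability software, so you may still need a separate cleanup job on that side.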

Synchronizing Your Production Scheduling System With Your CBU

If you haven't already done so, you should implement two procedures involving your automated job scheduling software.

1. On a regular basis, preferably several times a day, save and transfer your production job schedule from your production machine to your CBU. Depending on how your CBU is set up, you may even be able to directly replicate your automated job schedule to the CBU machine. In addition to the run schedule, be sure to also transfer over any processing history involved with the schedule (which includes dates and times that each package ran) so that the recovery team will be able to determine which production jobs did and did not run when the system crashed. By replicating the schedule and its history on a regular basis (the more often the better), you should be able to capture all the relevant scheduling information that occurred during the day.

2. Incorporate steps into your run book so that if necessary, you can recreate and restart your production job schedule and run history on the CBU. If your run book has been thoroughly tested, this procedure may already be available.

For information on how to replicate or restore your production job schedule and its attendant history, contact a third-party software vendor.
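
If you happen to use the native i/OS job scheduler (the Work with Job Schedule Entries, or WRKJOBSCDE, facility) rather than a third-party package, procedure 1 might be sketched as follows. This is only an illustration under that assumption. The library HALIB, the output queue SNAPSHOTS, and the save file JOBSCDSAVF are hypothetical names, and the sketch assumes HALIB is replicated to the CBU; otherwise you would need your own transfer step.

```cl
/* SAVSCHED: hypothetical program to capture the native job schedule */
PGM
             /* Print a detailed listing of every schedule entry to  */
             /* the replicated output queue (placeholder name).      */
             CHGJOB     OUTQ(HALIB/SNAPSHOTS)
             WRKJOBSCDE OUTPUT(*PRINT) PRTFMT(*FULL)

             /* Create the save file on first use, then empty it.    */
             CHKOBJ     OBJ(HALIB/JOBSCDSAVF) OBJTYPE(*FILE)
             MONMSG     MSGID(CPF9801) EXEC(CRTSAVF +
                          FILE(HALIB/JOBSCDSAVF))
             CLRSAVF    FILE(HALIB/JOBSCDSAVF)

             /* Save the job schedule object used by WRKJOBSCDE.     */
             SAVOBJ     OBJ(QDFTJOBSCD) LIB(QUSRSYS) DEV(*SAVF) +
                          OBJTYPE(*JOBSCD) SAVF(HALIB/JOBSCDSAVF)
ENDPGM
```

Restoring QDFTJOBSCD on the CBU replaces the schedule that is already there, so that step belongs in the run book as a deliberate recovery action rather than in an automated job. For a third-party scheduler, the objects to save and the history files involved will be different, which is why the vendor is the right source for those details.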


RELATED STORY

Selectively Deleting OS/400 Spool Files




Senior Technical Editor: Ted Holt
Technical Editor: Joe Hertvik
Contributing Technical Editors: Erwin Earley, Brian Kelly, Michael Sansoterra
Publisher and Advertising Director: Jenny Thomas
Advertising Sales Representative: Kim Reed

Copyright © 1996-2010 Guild Companies, Inc. All Rights Reserved.
Guild Companies, Inc., 50 Park Terrace East, Suite 8F, New York, NY 10034
