Admin Alert: The Road to Live CBU Fail Over, Part 1

September 2, 2009 Joe Hertvik

One of the companies I work with performed its first live Capacity BackUp (CBU) switch test last month, where they switched over and used their CBU system as their live production system for several days. In the next few issues, I’ll use their experience in prepping for a live switch as a possible guide for others trying to ensure that their CBU can substitute for a live system.

CBU 101: Understanding the CBU

A CBU is an i, System i, or iSeries machine that is an exact duplicate of a live production system. CBUs generally contain the same amount of memory, disk, and CPU activations as their source counterparts. With the help of replication software from companies such as Vision Solutions, IBM’s DataMirror division, or Bug Busters Software Engineering, production information (including databases, programs, and system objects) is automatically replicated from the source machine to the target CBU. In the event of an emergency where the production system is not available, you can keep your business moving by quickly switching processing over to the target box. See the “Related Stories” section for more articles describing i5/OS high availability and CBUs.

While the CBU is an i5/OS machine in waiting, many companies consider it a big step to actually switch over and run the target machine as a production system substitute for any appreciable amount of time. Our example company took the following steps to reach this goal.

Certifying the CBU Switch-Over Process

A live CBU switch-over doesn’t happen overnight. It takes a great deal of planning and testing to gain confidence that if you switch live processing to the CBU, you are not putting business processing at risk. To allay this fear, the staff developed the idea of certifying the CBU for use as a production machine.

CBU certification evolved because switching live production processing to a duplicate machine was a scary thought to both management and IT staff. Imagine what might happen if you were processing orders and a key data library was out of sync, such that thousands of orders were filled, delivered, and invoiced to customers with incorrect pricing? Or what would happen if you switched over and your key application wouldn’t work, holding up your production and shipping line for days? Company executives and the IT staff were looking for a comfort level that the business would continue to function efficiently if they lost the production machine.

The certification process encompassed a series of switch tests and accompanying documentation that tested the critical processing features the company relied on every day. To this end, CBU deployment was subdivided into the following certification steps.

Initial CBU configuration and infrastructure certification–Determine that the CBU itself is set up correctly to impersonate the production machine. This step tests the basic mechanics of switching over to the CBU.
Application certification–Determine whether all the critical third-party and homegrown applications can function on the CBU. This includes obtaining software licensing and license keys, and testing the applications to see whether they work as intended on the machine.
User certification–Determine whether the user community can perform its essential business processing on the CBU.
Process certification–Determine whether critical automated processing can run on the machine.
Audit certification–Confirm with an outside authority that the company’s CBU configuration was correct and that no key pieces were missing.
Extended switch-over certification–Determine whether the company can actually switch processing over to and run its business on the CBU.
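
As a rough illustration of how the steps above build on one another, the short Python sketch below treats them as a strict sequence and reports which certification comes next. The step names come from the list; the strict ordering rule, the function name next_step, and the error handling are assumptions made for this example, not something the company is described as having automated.

```python
# Step names are taken from the list above; the strict-sequence rule is an assumption.
CERTIFICATION_STEPS = [
    "Initial CBU configuration and infrastructure",
    "Application",
    "User",
    "Process",
    "Audit",
    "Extended switch-over",
]


def next_step(completed):
    """Return the first step not yet certified, or None when all are done.

    Raises ValueError if a later step is marked complete while an earlier
    one is still pending, or if an unknown step name is supplied.
    """
    done = set(completed)
    unknown = done - set(CERTIFICATION_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    pending = None
    for step in CERTIFICATION_STEPS:
        if step in done:
            if pending is not None:
                raise ValueError(f"{step!r} certified before {pending!r}")
        elif pending is None:
            pending = step
    return pending
```

Calling next_step(["Initial CBU configuration and infrastructure", "Application"]) returns "User", while marking "Application" complete on its own raises an error because the infrastructure step has not been certified yet.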

Each completed step led to the next step and cumulatively, all the steps would give the company confidence and documentation that the CBU would perform correctly in a crisis. The group felt that by certifying CBU fitness for duty this way, they could reap the following benefits.

Certification by step would slowly build confidence that the CBU would work as intended. IT, management, and users could watch the progress as the CBU was readied for usage.
Segmentation would create ownership and comfort that each group’s particular needs were being addressed. The system administrators would ensure the infrastructure worked correctly. The applications people would tend to application configuration. The users would directly test that their needs were being met.
Documentation after each step would create a reporting system for CBU progress. It would produce accountability and motivation for each group to ensure that they tested thoroughly before they gave the go-ahead to move on.
Certification would provide flexibility to reconfigure and retest. The company could identify problems and ensure that each step was perfected as much as possible before moving on to the next step. It also provided structure for how to deploy the CBU.

It’s also worth noting that this framework didn’t appear overnight. It was the result of two or three earlier switch tests where the company worked with the CBU and determined that this was the best course of action to follow. In particular, most of the initial CBU configuration and infrastructure certification and the entire application certification were completed before the company determined that the other steps were needed. Once all the steps were identified, the rest of the certifications proceeded as presented here.

Initial CBU Configuration and Infrastructure Certification

After the CBU was purchased, the company hired an outside consultant to perform the initial configuration. The company used Vision Solutions’ MIMIX HA software as its high availability solution. The consultant worked with the company to install the software, determine what information (data, programs, and system objects) should be replicated to the CBU, set up the replication configuration, and start replicating information from one machine to the other. He helped the company create its initial “run book,” which is the set of instructions the company follows to switch processing from the production machine to the target machine and back again. The consultant also helped set up HA audits that would alert staff by email when libraries or objects were out of sync between the machines and when libraries were added to the production box that were not available on the CBU.
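
The logic behind such an audit can be sketched in a few lines of Python. This is not how MIMIX implements its audits; it is a minimal illustration that assumes each machine’s inventory has already been exported as a dictionary of libraries, objects, and checksums. The function names, the inventory layout, and the sample library names are all hypothetical.

```python
def audit_replication(production, cbu):
    """Compare two inventories of the form {library: {object_name: checksum}}.

    Returns (missing_libraries, out_of_sync), where out_of_sync lists
    (library, object_name) pairs that are absent or differ on the CBU.
    """
    missing_libraries = sorted(lib for lib in production if lib not in cbu)
    out_of_sync = []
    for lib, objects in sorted(production.items()):
        if lib not in cbu:
            continue  # already reported as a missing library
        for name, checksum in sorted(objects.items()):
            if cbu[lib].get(name) != checksum:
                out_of_sync.append((lib, name))
    return missing_libraries, out_of_sync


def format_alert(missing_libraries, out_of_sync):
    """Build the text of an alert message, or return None when nothing is wrong."""
    if not missing_libraries and not out_of_sync:
        return None
    lines = []
    for lib in missing_libraries:
        lines.append(f"Library {lib} exists on production but not on the CBU")
    for lib, name in out_of_sync:
        lines.append(f"Object {lib}/{name} is out of sync")
    return "\n".join(lines)


# Hypothetical sample inventories for illustration only.
production = {"ORDERS": {"PRICES": "a1", "CUSTMAST": "b2"}, "NEWLIB": {"X": "c3"}}
cbu = {"ORDERS": {"PRICES": "a1", "CUSTMAST": "zz"}}
missing, stale = audit_replication(production, cbu)
alert_text = format_alert(missing, stale)
```

In this sample, NEWLIB is reported as missing from the CBU and ORDERS/CUSTMAST is reported as out of sync; the resulting text could then be handed to whatever email mechanism the shop already uses.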

When dealing with high availability scenarios, one of the hardest situations is performing the first switch-over test. This test does nothing more than run the procedures for switching processing from the production machine to the CBU and back again. When switching over in this test, the CBU performed little information processing. Rather, this exercise tested the mechanics of switching over and switching back again to see if it was possible to perform the switch using their existing run book.

The first test also helped the company understand if their replication scheme was valid. When processing was temporarily switched over to the CBU, the company shut off all normal information processing functions (interactive jobs, Website updating, remote updates, batch jobs, etc.). The testers had to remember that a CBU switch-over is a fundamentally different animal than a traditional disaster recovery test. In a switch-over, the CBU is functioning as the production machine and any processing that occurs will be replicated back to the source production system at the end of the test (i.e., all CBU testing uses live data).

To check that changed CBU data would be replicated correctly back to the production machine at the end of the test, the testers changed data in only a few insignificant files on the CBU. When the testers switched processing back to the production machine, they checked the test files on the source box to ensure that any changes made on the target machine during the switch test had been replicated back to the production machine.
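
That round-trip check amounts to keeping a list of the test changes made on the CBU and comparing it with what the production machine holds after the switch back. The Python sketch below shows the comparison under the assumption that both sides can be expressed as simple dictionaries; the function name and the file and key names in the usage note are hypothetical.

```python
def verify_switch_back(changes_made_on_cbu, production_after_switch_back):
    """Return the test changes that did not arrive on the production machine.

    Both arguments map (file_name, record_key) to the record value.
    An empty result means every test change was replicated back.
    """
    not_replicated = {}
    for key, expected in changes_made_on_cbu.items():
        actual = production_after_switch_back.get(key)
        if actual != expected:
            not_replicated[key] = {"expected": expected, "found": actual}
    return not_replicated
```

For example, if the testers changed records ("TESTFILE", 1) and ("TESTFILE", 2) on the CBU and only the first shows the new value on production, the function returns an entry for ("TESTFILE", 2) with the expected and found values, which tells the testers exactly which change to investigate.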

The goal of the first test was to create and test the basic structure of a switch-over, including basic data update and replication. The testers wanted to be comfortable enough with the exercise that they could perform this switch again and again as needed for later tests. The initial test answered a few simple questions:

Can a switch-over be performed?
Is data replicated correctly from the production system to the target system and back again?
What steps should be taken to make succeeding switch-over tests more successful?

The first test was the building block on which all of the other CBU testing would rest. Until the CBU infrastructure was correct, the company couldn’t move on to the more complicated CBU functions.

For the example company, it was necessary to run two tests to make sure the basic infrastructure of the CBU was correct. That is, the CBU needed to totally impersonate the production machine so that the outside world (including network equipment, DNS servers, communications partners, printers, etc.) couldn’t tell the difference between the two machines. After two tests, the testers felt that they could move on to the next certification step.

Between Tests: Tweaking the Run Book

During each switch test, the testers took detailed notes in the run book as to what went right with each step, what went wrong during the test, and what they did to fix it. After each test, those notes formed the basis of the next run book. The previous run book was archived for reference and a new run book was created.

The new run book contained all the fixes, shortcuts, and expansions needed to make the next test more successful. It became mandatory to update the run book during the first few days after the test completed, while all the events were still fresh in the testers’ minds. If the run book sat for a few weeks before being updated, the testers could misunderstand some of their own notes and accidentally omit important changes that were needed for the next test.
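
For shops that keep their run book in electronic form, the archive-and-revise cycle described above could be modeled along the following lines. This Python sketch is only one possible arrangement: the Step and RunBook classes, the version counter, and the rule of folding each note into the step text are assumptions for illustration, since the company’s actual run book format is not described.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    instruction: str
    # What went right or wrong during the test, and what was done to fix it.
    notes: List[str] = field(default_factory=list)


@dataclass
class RunBook:
    version: int
    steps: List[Step]


def next_run_book(current):
    """Create the next run book from the annotated one.

    Each step's notes are folded into its instruction text so the next test
    starts from the corrected procedure. The annotated book is left unchanged
    so it can be archived for reference.
    """
    new_steps = []
    for step in current.steps:
        text = step.instruction
        if step.notes:
            text += " (" + "; ".join(step.notes) + ")"
        new_steps.append(Step(instruction=text))
    return RunBook(version=current.version + 1, steps=new_steps)
```

In practice a tester would rewrite each instruction by hand rather than append the raw notes, but the sketch shows the key property: the previous version survives untouched while the new version starts with a clean set of notes.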

More To Come

As I mentioned, this company approached CBU certification as a series of steps. Next week, I’ll look at what was required for the next certification steps and how they led up to the ultimate goal of a live switch-over.