Admin Alert: Common Mistakes When Failing Over to a CBU
August 13, 2008 Joe Hertvik
My shop ran its first high availability failover test in 2007. We’ve since run many more tests where we configured two different System i 550 Capacity BackUp (CBU) units to impersonate other systems. We’ve also made many mistakes in the process. Today, I’m focusing on common CBU failover problems and how to avoid them. Hopefully, this information will help make your failover process run smoothly.
Meet the CBU
In the i5/OS world, a CBU system is a specially configured Power i, System i, or iSeries machine that communicates with your production machine to continuously replicate production data and applications using high availability software. In case of disaster, the CBU can be switched over to “impersonate” the production box, servicing your users, devices, and companion servers with very little delay. When the main production machine comes back up, the CBU relinquishes its role and production is switched back to the regular system.
Because it’s delicate work to configure a real-time replacement machine that your entire organization may have to run on, most CBUs are exhaustively configured and tested to ensure that stupid mistakes don’t wreck your processing if the CBU ever has to pinch hit for the production server.
Enter Stupid Mistakes
There are three critical areas that either allow or prevent a CBU’s successful impersonation of a production i5/OS server:
Given this broad outline, here are some simple mistakes that can blow your CBU implementation right out of the water. Avoid these issues and you’ll be sitting pretty. Trip over them, and you’ll have a good-sized problem on your hands.
Replication and Auditing–Eyes On the Prize
When setting up replication, make sure that you’re replicating all the necessary objects your system relies upon, not just the data and programming libraries. These objects include:
The first common mistake is neglecting to audit your object replication. If your objects get out of sync, you will lose necessary data or program objects when you fail over. So take full advantage of your replication software’s auditing functions, and check your audit reports every day. There is no other way to ensure that your objects remain in sync. When auditing, watch for two things in particular: objects that are out of sync, and production system libraries that are not present on the CBU partition.
New libraries are especially critical to replicate, as applications programmers frequently add new functionality that must be ported over and set up on the CBU. On a regular basis, you may want to cross-reference the list of libraries and IFS directories that you are replicating to the CBU with the list of all the libraries and IFS directories on your production box. This comparison will help you catch any new libraries or directories that need to be added to your replication list.
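The cross-reference described above amounts to a set difference between two lists. Here is a minimal sketch in Python, assuming you have already exported both lists to plain names by whatever means your shop and your replication software provide. The function name `find_unreplicated` and the sample library names are hypothetical and exist only for illustration.

```python
def find_unreplicated(production_names, replicated_names, case_sensitive=False):
    """Return the production names that are missing from the replication list.

    Library names are compared case-insensitively by default. Pass
    case_sensitive=True for paths where case matters.
    """
    def normalize(name):
        name = name.strip()
        return name if case_sensitive else name.upper()

    # Build the set of names that are already being replicated.
    replicated = {normalize(n) for n in replicated_names if n.strip()}

    # Keep every production name that does not appear in that set.
    missing = {
        normalize(n)
        for n in production_names
        if n.strip() and normalize(n) not in replicated
    }
    return sorted(missing)


# Hypothetical sample data, not taken from any real system.
production_libraries = ["PRODDATA", "PRODPGM", "NEWAPP", "QGPL"]
replicated_libraries = ["proddata", "PRODPGM", "QGPL"]

missing_libraries = find_unreplicated(production_libraries, replicated_libraries)
print(missing_libraries)  # prints ['NEWAPP']
```

Anything the comparison reports is a candidate for your replication list, although some entries may be deliberate exclusions. The same function can be run a second time against your IFS directory lists.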
While auditing helps ensure that you have a complete replicated copy of your database and application software, it is also possible to replicate too many objects, particularly for third-party software packages. Be careful to exclude any such objects from your replication scheme, or they will cause a problem when you fail over.
Run Book Issues
As you put together the run book (your bible for failing over), you may wind up working with a consultant who will provide you with a basic version. This book will contain all the information you need to fail over to the CBU, have your CBU impersonate your production machine, and then fail back to the production box.
However, you’re going to need more than the basic run book to fully fail over to the CBU. Once the initial run book is finished, you’re going to have to add your own instructions for setting up the CBU exactly the way it must look to successfully impersonate the production box. Some of the additional items you may need to add to the run book include:
There’s a lot more to failing over than just changing your CBU’s identity to impersonate the production box. Hopefully these tips will help you understand and avoid some key mistakes that can prevent your failover process from running successfully.