Admin Alert: The Road to Live CBU Fail Over, Part 2
Published: September 16, 2009
by Joe Hertvik
Last week, I began describing the strategy and methodology that a company I work with used in preparing for an i5/OS Capacity BackUp (CBU) system to switch over and impersonate their live production system for several days. This week, I'll continue the example by describing how they prepped their applications, users, batch processes, and the system itself for a live switch-over.
The Framework for Making the Switch
A CBU is an i, System i, or iSeries machine in waiting, receiving replicated data and other system objects from a production partition. In an emergency, the CBU can be activated to stand in for its partner system, minimizing the amount of downtime the company experiences. The CBU uses high availability software to keep system objects and files current with its source system, so that system data are never out of sync with the source box by more than a minute or two. See the Related Stories section below for other articles describing i5/OS high availability and CBUs.
Our example company created a CBU certification process to ensure the CBU could stand in for production in an emergency. Certification encompassed a series of steps and documentation that helped guarantee that most functions running on the production machine would run correctly on the CBU when switched over. To ensure that the CBU would function as advertised in a switch-over, the company performed multiple switch-over exercises using these certification steps.
- Initial CBU configuration and infrastructure certification--Determined if the CBU was set up correctly to impersonate the production machine. Tested the basic mechanics of switching over to the CBU.
- Application certification--Determined if critical custom-written, third-party, and IBM applications could run correctly on the CBU.
- User certification--Determined if the user community could perform essential business processing on the CBU.
- Process certification--Determined whether critical automated processing could run on the CBU.
- Audit certification--Used an outside authority to certify that the company's CBU configuration was correct.
- Extended switch-over certification--Determined if the company could actually run the business live on the CBU for an extended period.
Last issue, I introduced the certification framework and initial CBU configuration and infrastructure certification. This week, I'll cover the other steps.
Application Certification
In the application certification step, the intent was to ensure each custom-written, third-party, and IBM application ran correctly on the CBU. To meet this goal, the company contacted each application vendor to get answers to the following questions.
1. What does it take to run your software on a different machine? Does the company require an additional license to run the software on the CBU? Does the company have to buy a full license, or is there a lower-cost CBU-only option? The answers varied. A few packages didn't require any licensing fees. Other packages allowed the company to buy lower-cost disaster recovery or high availability licenses for CBU usage only. A few vendors required a full-blown license to run their packages on the CBU.
It was also important not to forget about the IBM licensed programs that many applications needed to run correctly, such as the Java Developer Kit (5722JV1), IBM Portable Utilities for iSeries (5733SC1), and the IBM HTTP Server for i5/OS (5722DG1). If these IBM programs aren't licensed correctly for the CBU, the applications that depend on them may fail.
Once the software list was compiled, the company asked for authorization to spend additional money, and bought whatever software licenses were missing from the CBU.
2. What key objects need to be excluded from replication? All license key objects that third-party software uses for verification had to be excluded from replication, lest the CBU key be overwritten by the production key. To do this, the administrators tweaked their replication configurations to exclude machine-specific licensing objects before applying the keys.
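As a rough illustration of the two questions above, here is a minimal Python sketch (the article itself uses no code, so the language is an assumption). It compares the licensed-program lists of the two machines and filters license key objects out of a list of replication candidates. The product IDs are the ones named in this article; the object names, the `*/LICKEY` pattern, and both function names are hypothetical. In practice, exclusions are configured inside the replication product itself, not in a script.

```python
from fnmatch import fnmatchcase


def missing_licenses(production_products, cbu_products):
    """Return product IDs present on production but absent on the CBU, sorted."""
    return sorted(set(production_products) - set(cbu_products))


def replicable(objects, exclude_patterns):
    """Keep only objects whose name matches none of the exclusion patterns."""
    return [
        obj for obj in objects
        if not any(fnmatchcase(obj, pattern) for pattern in exclude_patterns)
    ]


if __name__ == "__main__":
    # Product IDs taken from the article; which ones the CBU lacks is made up.
    production = ["5722JV1", "5733SC1", "5722DG1"]
    cbu = ["5722JV1"]
    print(missing_licenses(production, cbu))

    # Hypothetical library/object names and exclusion pattern.
    candidates = ["VENDORLIB/LICKEY", "VENDORLIB/ORDERS"]
    print(replicable(candidates, ["*/LICKEY"]))
```

The set difference gives the shopping list for the license purchase, and the filter shows why the exclusion has to be in place before the CBU keys are applied: anything still on the replication list would be overwritten from production.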
Once the CBU licensing issues were handled, the IT department ran preliminary application testing, which consisted of switching over to the CBU and running the applications with small batches of test data. This certification had two final results: a) all identified custom-written, third-party, and IBM applications would run on the CBU; and b) preliminary test data would be posted correctly to the CBU database and replicated back to the production machine after switch back. The testers were required to sign off on their results before the certification process could move on to the next step. It took several switch tests to reach this objective.
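The second result, that test data entered on the CBU shows up on production after switch back, can be checked mechanically. The sketch below is only an illustration in Python: it assumes the test records from each machine have already been exported into dictionaries keyed by a record ID, and the function name and sample record IDs are hypothetical.

```python
_MISSING = object()  # sentinel so a missing record never equals a stored value


def unreplicated(cbu_records, production_records):
    """Return IDs of CBU test records that are missing or different on production."""
    return sorted(
        key for key, value in cbu_records.items()
        if production_records.get(key, _MISSING) != value
    )


if __name__ == "__main__":
    # Hypothetical test orders entered on the CBU during the switch test.
    entered_on_cbu = {"T1": "10 widgets", "T2": "5 gadgets"}
    # What was found on production after switch back.
    found_on_production = {"T1": "10 widgets"}
    print(unreplicated(entered_on_cbu, found_on_production))
```

An empty result would support sign-off for that test batch; any IDs returned point at records to investigate before the next switch test.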
User and Process Certification
To get the users involved, the company decided to perform a limited live switch exercise in which representatives from each department entered data on the CBU during off-hours (early Sunday morning).
To do this, the IT staff first had to get management buy-in to perform the exercise. This was simple since upper management had authorized the CBU purchase and the company had performed many CBU exercises in the past. The IT staff held a series of meetings to educate the corporate staff about the exercise and to obtain commitments to test critical applications. Each user department created a CBU test plan for their area. Each test plan included a limited set of live production data that could be verified both during the switch-over and after the switch back to the production system.
The departments executed their test plans, verified the results during and after the test, and signed off (certified) on whether they could run critical processes during a CBU switch-over. The company performed as many exercises as needed until all critical user departments certified that the CBU could fill in for production during a switch-over.
The company also made arrangements to test critical batch processing while the system was switched over. They had identified a series of overnight batch jobs that processed orders. These processes also needed to be run on the CBU. The company ran the processes on the CBU, checked the results, and certified that they would run when switched over. For management purposes, the testers documented that they believed the CBU would correctly run critical batch processing during a switch-over.
Audit Certification
The last certification before the live switch-over was to audit the company's high availability configuration with an outside expert. Management requested the audit to ensure that all critical objects or libraries were included in the replication scheme. Although the replication software had been configured by an outside consultant, the company did not want to use the same consultant to certify the work, because they were auditing both the consultant's and the IT staff's work. So they contacted the replication software manufacturer, who offered an audit service to ensure that the software was configured and working correctly. The manufacturer audited the software and found several areas that needed improvement, including:
- Configuration changes to reduce replication software overhead on the production machine
- Best practices to ensure that all relevant objects would be replicated, especially core operating system items such as subsystem descriptions, job queues, and output queues residing in the QUSRSYS, QSYS, and QGPL libraries
- Operating system PTFs and software levels that should be applied for maximum efficiency
- Run book changes that took advantage of less obvious features of the replication software
The manufacturer found no significant problems with the replication configuration and issued a report. The company applied many of the operational recommendations to their configuration, and saved the more comprehensive changes for future implementation.
Until this point, the CBU had only been switched over in a controlled off-hours environment with limited processing. Company management now reviewed all the completed certifications to ensure that the CBU would work correctly when impersonating the production machine in a live environment.
- The administrators had certified that the CBU infrastructure worked correctly and that the CBU could be switched over for production processing (Initial CBU configuration and infrastructure certification)
- The IT staff had certified that the applications would run correctly when production processing was switched over to the CBU (Application certification)
- The users had certified that their departments could correctly process critical functions during switch-over (User certification)
- The IT staff had certified that critical batch processes would work correctly (Process certification)
- The replication software manufacturer had certified that the run book and replication configuration were set up correctly (Audit certification)
The CBU certification process convincingly made the case that the CBU was able to function as the company's production machine for any length of time. The only thing left to do was to run an extended switch exercise.
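The management review described above amounts to a simple gate: the extended switch exercise goes ahead only when every earlier certification has been signed off. A minimal Python sketch of that gate follows. The step names come from this article; the function names and the idea of recording sign-offs in a dictionary are assumptions made for illustration.

```python
# Certification steps that precede the extended switch-over, in article order.
REQUIRED_STEPS = [
    "Initial CBU configuration and infrastructure certification",
    "Application certification",
    "User certification",
    "Process certification",
    "Audit certification",
]


def outstanding(signoffs):
    """Return the required steps that have not been signed off yet, in order."""
    return [step for step in REQUIRED_STEPS if not signoffs.get(step, False)]


def ready_for_extended_switch(signoffs):
    """True only when every required certification has been signed off."""
    return not outstanding(signoffs)


if __name__ == "__main__":
    signoffs = {step: True for step in REQUIRED_STEPS}
    signoffs["Audit certification"] = False  # hypothetical: audit still pending
    print(outstanding(signoffs))
    print(ready_for_extended_switch(signoffs))
```

Keeping the list of outstanding steps, not just a yes/no answer, mirrors how the company repeated exercises until each group was willing to certify.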
Extended Switch-Over Certification
The company performed another switch-over after close of business in the middle of the week. For the most part, the CBU worked as planned during this certification.
Most switch-over issues were communications issues, where the CBU was having trouble talking to a partner machine or remote location using FTP or Secure Sockets Layer (SSL), or to warehouse devices through a remote controller. None of these issues was a showstopper, and all of them were quickly fixed. As with the other exercises, issues were documented so that they could be fixed either in the run book or in the configuration before the next test. While not perfect, the live switch exercise was a success. The company was able to process all orders and financial functions correctly, and many of the company employees didn't even notice that the machine was switched over.
The company ran production processing for three days on the CBU, and then successfully switched back to the production machine. No production data was lost. The result is that management now has confidence in the CBU's ability to take over for production in an emergency.
The certification process also produced two side benefits. The company can now take advantage of its live switch-over capability when it needs to perform other tasks that require production system downtime, such as operating system upgrades and hardware upgrades. The certification process can also help the company in other auditing or regulatory situations, where the company may have to prove that CBU processing identically mirrors production system processing.
Not Perfect, But. . .
Over the last two weeks, I presented this example as a template for how a company could approach certifying that a CBU setup is ready for the big time: live substitution for a production machine. These steps aren't meant as the definitive effort for CBU implementation. Rather, they are a general guideline for how a company could approach the task. Feel free to borrow or modify these steps for your own particular situation. Also, please feel free to email me through the IT Jungle Contact Web page with your own CBU questions or stories.
Related Stories
The Road to Live CBU Fail Over, Part 1
Common Mistakes When Failing Over to a CBU
Beyond Replication in an i5/OS High-Availability Environment
How System i Boxes Impersonate Each Other, Part 2
How System i Boxes Impersonate Each Other, Part 1
Five Benefits of a High Availability System
The System i High Availability Roadmap