Admin Alert: High Availability Eliminates Disaster Recovery. . . Right?

June 30, 2010 Joe Hertvik

Imagine you’re a flea with one limitation. Whenever you want to go anywhere, you can only jump half the distance to your objective and no further. The second hop halves the remaining distance. The third halves that again, and so on. How long will it take to reach your goal? Administrators dealing with High Availability (HA) and Disaster Recovery (DR) are a lot like that flea.

I thought of the jumping flea after we started revisiting our DR plan. I was feeling pretty smug about things. After all, I have an off-site HA setup for my production Power i machine, I’ve performed at least 20 HA switchover exercises since 2007, my system has been certified to work correctly by both the Applications staff and the user community, and I’ve even run on my HA box for a week when we had trouble completing a production system upgrade. My disaster recovery work is done, I thought. I am ready to go if the big one hits. Or was I?

The truth is that for all our HA work, we were still only in a good position, not a great position. There was still work to be done. Here are some of the things I discovered when I started exploring the ground where high availability meets disaster recovery.

Let’s Define

Before I start the discussion, it’s helpful to have a good working definition of HA and DR.

A High Availability system uses a capacity backup (CBU) system to provide near-continuous availability for an iSeries, System i, or Power i box. A CBU is a system in waiting, continuously receiving replicated data and other system objects from a production partition. In an emergency, the CBU can be quickly activated to stand in for its partner system, minimizing the amount of downtime the company experiences. See the Related Stories section for other articles describing i5/OS HA and CBUs.

A Disaster Recovery plan can take advantage of your HA system, but it contains more than just HA. (Very) loosely speaking, a DR plan is a preset group of instructions for what your IT department does to restore computer capabilities when a disaster takes out your network and i/OS capabilities. It answers the (not so) simple question of what you will do when everything falls apart due to fire, earthquake, tornado, electrical outage, terrorist attack, etc.

Where HA Ends

As I said in the intro, I get a little cocky when HA plans come up. I’ve done and tested everything I can possibly think of to ensure my HA solution will be up and ready if the unthinkable occurs. However, when I started rewriting our DR plan, I ran into the following additional items that I had never even thought about with HA. If you’re looking at HA in relation to your DR plan, you might also want to think about these issues.

• Where do your HA and DR plans live?–If your HA and DR plans live on your computer network, what happens if the network is destroyed along with your i/OS machine? Have you printed out a paper copy of each plan, and do these copies exist off-site where your DR team can retrieve them in an emergency? It’s wise to keep multiple copies of your HA and DR documents both on-site and off-site. You may even want to require that your key personnel keep copies of each document in their cars.

• Passwords and contact numbers for retrieving recovery media from an off-site vendor–If the building burns down, do you know who to call to retrieve backup tapes, as needed? This is another item that is worth keeping off-site.

• Companion servers–For all our HA technology, we found that a lot of our purchase orders, invoices, and other documents were still being faxed to our customers from a companion fax server. In addition to having a plan for restarting i/OS processing, you also need a plan for companion servers that may or may not be needed during a disaster. For each one, ask yourself: can we do without this capability? If not, what is the backup plan for replacing it? The same goes for other companion servers or capabilities that connect you to valuable customers or business partners. (Anybody still have dial-up modems?) How will you compensate for losing those connections and functionality in a disaster, where recovery may take several days, weeks, or even months?

• Development i/OS partitions–In our shop, i/OS software changes are managed and promoted to production through a secondary partition using Aldon Lifecycle Manager. If the entire computer room is taken down for a month or two, do you know how your developers will keep working? This is another issue beyond HA that your DR plan should cover in case a real disaster occurs.

• Going home–In our situation, all of our HA scenarios have been run according to the following sequence:

1. Switch production processing over to the CBU
2. Run production on the CBU for a specified period of time
3. Bring up the production machine as the target machine and synchronize processed data back from the CBU (which is functioning as the source machine) to the production machine (which is functioning as the target CBU machine)
4. Switch processing back to the production Power i machine

This is fine for limited types of disasters, such as an electrical outage or any disaster that takes out user connectivity to the computer room without destroying your i/OS machine. But what happens when a tornado, hurricane, explosion, or any other type of disaster destroys your computer room along with your i/OS machine(s)? Eventually, you will probably replace your wrecked machines with new Power i systems. When going home to a new machine that has never been configured for HA, you would then also need instructions for:

• Restoring and resetting your new production system as the CBU–This is tricky because any full system backup tapes you have were most likely taken when your old system was configured for production. So after restoring the production box from tape to a new system, you would also need to flip your new production box from functioning as a production system to functioning as a CBU system before putting it back on the network.

• Synchronizing data from the CBU with your production system–This is also trickier than it sounds. If you’ve been running production on the CBU for several weeks or months, it may be difficult to resynchronize the replacement machine with the CBU because a) there may not be a sync point to start data resynchronization from; and b) it may take too long to process every transaction by just using file replication. To fill this hole in your HA and DR plan, you’ll have to contact your HA vendor to determine what their best practices are for returning production to a brand new machine after an extended period without replication. Given this, it’s wise to have a written plan for how you will resynchronize production data with your CBU after an extended outage.
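To make the role juggling in the "going home" scenario a little more concrete, here is a minimal Python sketch of the idea. It is only a thinking aid under stated assumptions: the names `Role`, `System`, `switch_roles`, `safe_to_connect`, and `go_home_to_new_machine` are hypothetical, they do not correspond to any HA vendor's commands or APIs, and the actual data resynchronization (which is vendor-specific) is not modeled at all. The sketch only illustrates why a machine restored from a production-era backup should be flipped to the target role before it rejoins the network.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SOURCE = "source"  # runs production and sends replicated data
    TARGET = "target"  # receives replicated data (the CBU role)


@dataclass
class System:
    name: str
    role: Role
    on_network: bool = True


def switch_roles(a: System, b: System) -> None:
    """Swap replication roles, as in a planned switchover or switchback."""
    a.role, b.role = b.role, a.role


def safe_to_connect(restored: System, running: System) -> bool:
    """Assumption for this sketch: only one machine may act as the source."""
    return not (restored.role is Role.SOURCE and running.role is Role.SOURCE)


def go_home_to_new_machine() -> list[str]:
    """Walk through returning production to a replacement machine."""
    log = []

    # The CBU has been running production since the disaster.
    cbu = System("CBU", Role.SOURCE)

    # The replacement box is restored from a backup taken while the old
    # machine was production, so it comes up believing it is the source.
    # It is kept off the network until its role is corrected.
    new_prod = System("NEWPROD", Role.SOURCE, on_network=False)
    log.append(f"after restore, safe to connect: {safe_to_connect(new_prod, cbu)}")

    # Flip the restored box to the target (CBU) role before connecting it.
    new_prod.role = Role.TARGET
    log.append(f"after flip to target, safe to connect: {safe_to_connect(new_prod, cbu)}")
    new_prod.on_network = True

    # Resynchronization from the CBU would happen here (not modeled).

    # Finally, switch production back to the replacement machine.
    switch_roles(new_prod, cbu)
    log.append(
        f"final roles: {new_prod.name}={new_prod.role.value}, "
        f"{cbu.name}={cbu.role.value}"
    )
    return log


if __name__ == "__main__":
    for line in go_home_to_new_machine():
        print(line)
```

Writing the sequence down this way, even as a toy, tends to expose the steps your DR document is missing, such as who flips the role on the restored machine and when it is allowed back on the network.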

To complete my analogy, running HA with DR is similar to being our jumping flea. You can continue to refine your procedures, but you may never get 100 percent of the way to your goal. Something will always come up when you are faced with a real disaster. However, if you continue studying the problem and keep identifying and solving the issues involved, you can get very, very close.