Admin Alert: The System i High Availability Roadmap

October 24, 2007 Joe Hertvik

I’ve recently been involved in a high availability (HA) project whose goal is to switch processing from a failed System i box to a Capacity Backup (CBU) machine within one hour of failure. Growing from that experience, this week’s column is the first of an occasional series where I’ll discuss HA concepts and how i5/OS administrators can expedite continuous availability procedures in their shops.

What Does High Availability Mean To a System i Shop?

Although there are more complicated definitions for high availability, my favorite explanation is that high availability consists of whatever you do to provide continuous access to computer resources in the event of component failures on the system or a total system failure.

Given this broad definition, there are many different ways that i5/OS techniques and procedures can contribute to continuous system processing, many of which you might not consider high availability by today’s standards. It’s possible to make an administrative roadmap of sorts that provides a rudimentary checklist of common techniques and methods that support high availability functions in System i, iSeries, and AS/400 shops. To that end, I humbly submit the following techniques and methods that System i shops can use to start and then refine their high availability strategies. Some of these techniques are elementary and common to all shops. Others require more planning and investment capital.

Daily and Weekly Backups: Backups provide a way to recover data and applications in the event of file corruption or destruction. Regular valid backups are the core component of a high availability system. For most organizations, not much can be done in terms of restoring a system unless a good backup strategy is in place.

Backup Media and Equipment Maintenance, and Backup System Auditing: The best backup in the world will not do you any good if your tape drive is malfunctioning and needs to be cleaned. In addition to backing up the data, you should audit your backups and perform test restores on a regular basis to make sure that what you’re backing up can be restored. If your company falls under Sarbanes-Oxley compliance, backup monitoring and test restores may be an auditing requirement.
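One way to make a test restore more than a visual check is to compare checksums of the original files against the restored copies. The following Python sketch assumes the data in question can be reached as ordinary stream files in two directory trees (for example, a source directory and a scratch directory that received the test restore). The function names and directory layout are hypothetical, and library objects on i5/OS would need a different verification approach.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_mismatches(source_dir: Path, restored_dir: Path) -> list:
    """List relative paths that are missing or different in the restored tree."""
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(str(rel))
    return problems
```

An empty result means every source file has an identical restored copy. Files that legitimately changed after the backup ran will also show up as mismatches, so this kind of check works best against data that was static between the save and the test restore.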

Off-Site Backup Media Storage: Backup media should always be physically separated from the systems where their source data resides. And the tapes should be stored in a protected environment with limited access, preferably in a location that is far enough away from the data source so that a disaster can’t take out both locations. Secured storage of backup tapes protects your system from total calamity if a disaster takes out your entire computer room. Auditing requirements may also require your company to keep a log of all tapes that are kept off site and record the movement of each tape to and from the storage facility.
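To illustrate what a tape movement log needs to capture, here is a minimal Python sketch. The record fields, volume IDs, and site names are all hypothetical. Many shops keep this log in a spreadsheet or in a media management product, so treat this only as a picture of the data involved, not as a recommended tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class TapeMovement:
    """One audited movement of a backup tape (all fields are illustrative)."""
    volume_id: str      # hypothetical tape volume label
    moved_on: date
    origin: str
    destination: str
    handled_by: str


def current_locations(log: List[TapeMovement]) -> Dict[str, str]:
    """Replay the log in date order and return each tape's last known location."""
    locations: Dict[str, str] = {}
    for move in sorted(log, key=lambda m: m.moved_on):
        locations[move.volume_id] = move.destination
    return locations


def tapes_at(log: List[TapeMovement], site: str) -> List[str]:
    """Return the volume IDs currently recorded at the given site."""
    return sorted(vol for vol, loc in current_locations(log).items() if loc == site)


log = [
    TapeMovement("WK4201", date(2007, 10, 15), "computer room", "off-site vault", "operator A"),
    TapeMovement("WK4202", date(2007, 10, 22), "computer room", "off-site vault", "operator A"),
    TapeMovement("WK4201", date(2007, 10, 23), "off-site vault", "computer room", "operator B"),
]
print(tapes_at(log, "off-site vault"))  # ['WK4202']
```

Because every movement is an immutable record, the log itself doubles as the audit trail, and the current location of any tape can always be rebuilt from it.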

Save-While-Active Backups (SWA): If you’re able to run them, SWA backups save data while applications and system processes are still using it, so users keep access to the information during the backup. Properly executed, an SWA strategy extends your processing window while still backing up your data. In a 24×7 environment where Web users and business partners from around the world may access your site at any hour of the day or night, continuous access to data is a key requirement. To make sure that you always have a complete backup of your entire system, you may want to supplement regular backups with a full system backup once a quarter or whenever you perform system maintenance.

Uninterruptible Power Supply (UPS) Monitoring Systems on Your System i Box and All Attached Components: UPS systems provide continuous power so that short-term power outages or full-fledged blackouts don’t unexpectedly take down your system, possibly damaging data and applications in the process. During a short-term outage, a UPS helps your system keep working during the time that it takes for the power to stabilize. In an extended outage, UPS systems provide administrators with a chance to get to the computer room and take down the system in an orderly fashion to avoid a system crash. These systems can also keep your systems going long enough for a secondary power supply, such as a generator, to kick in and keep the system running for a longer period of time. The last benefit of a good UPS system is that it can absorb and block electrical spikes coming down the power line.
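The timing involved can be checked with simple arithmetic. The Python sketch below assumes you know three numbers for your own site: how long the UPS can carry the load, how long an orderly power-down takes, and how long the generator needs to start. The function names, the five-minute safety margin, and the sample figures are my own assumptions for illustration.

```python
def latest_shutdown_start(runtime_min: float, shutdown_min: float,
                          margin_min: float = 5.0) -> float:
    """Minutes after power loss by which an orderly power-down must begin."""
    slack = runtime_min - shutdown_min - margin_min
    if slack < 0:
        raise ValueError("UPS runtime is too short for an orderly shutdown")
    return slack


def bridges_to_generator(runtime_min: float, generator_start_min: float,
                         margin_min: float = 5.0) -> bool:
    """True if the UPS can carry the load until the generator takes over."""
    return runtime_min >= generator_start_min + margin_min


# Hypothetical figures: 30 minutes of battery, a 15-minute power-down,
# and a generator that needs 2 minutes to start and stabilize.
print(latest_shutdown_start(30, 15))   # 10.0
print(bridges_to_generator(30, 2))     # True
```

With these sample numbers, an administrator without a generator would have about 10 minutes after the outage begins to decide whether to start taking the system down.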

Computer Room Generator: In an extended outage, a generator can power an entire computer room or section of the building until the power returns, avoiding the disruption that a shutdown would otherwise cause. Once in place, a generator allows your System i to continue with its daily processing (batch, interactive, and server), even if there aren’t any on-site users to take advantage of it.

Disaster Recovery Contracts, Services, and Testing: Detailed plans and contracts to restore company and System i processing in an off-site location when a disaster occurs are staples in many System i shops. A good disaster recovery contract includes an off-site facility where you can restore and restart system processing and where users can access the system until the main facility is available again. Disaster recovery plans should be tested at least once a year. In an ideal situation, they should also be combined with a business continuity plan, which is a logistical plan for how the organization functions (with and without its computer systems) after an extended disruption or a disaster.

Capacity BackUp Systems (CBU): The CBU is the Cadillac of the high availability world. A CBU is basically a system in waiting on your network. A System i CBU communicates with your main production system and replicates system data through the use of high availability software, such as the different HA solutions provided by Vision Solutions (MIMIX HA, iTera HA, and ORION HA) or the DataMirror products offered by IBM. When replication is correctly performed, the database on your CBU is a duplicate of the database on your production system. If your main production system becomes unavailable when you have a CBU, you would manually initiate a switching process to reconfigure your CBU system to impersonate your production system, complete with almost up-to-date replicated data. If configured correctly, users, devices, and companion servers will then be able to log on to and interact with the CBU as if it were the production server. When your production box is ready to come back on line, the CBU resynchronizes its data with production, repopulating the production box with all the database changes that occurred while production was down. After system synchronization, the CBU is quickly reconfigured and restarted as the backup replicated system again, and the production system is restarted and resumes servicing all system users and devices.
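The switch and switch-back cycle described above can be pictured as a small state machine. The Python sketch below is only a model of that description: the phase and event names are my own labels rather than terms from any HA product, and I am assuming that the CBU keeps serving users while it resynchronizes data back to production.

```python
from enum import Enum


class Phase(Enum):
    REPLICATING = "production active, CBU receiving replicated changes"
    CBU_ACTIVE = "CBU impersonating production"
    RESYNCING = "CBU sending accumulated changes back to production"


# Allowed transitions, keyed by (current phase, event). Event names are hypothetical.
TRANSITIONS = {
    (Phase.REPLICATING, "switch_initiated"): Phase.CBU_ACTIVE,
    (Phase.CBU_ACTIVE, "production_repaired"): Phase.RESYNCING,
    (Phase.RESYNCING, "resync_complete"): Phase.REPLICATING,
}


def next_phase(current: Phase, event: str) -> Phase:
    """Return the phase that follows an event, rejecting out-of-order events."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid during {current.name}") from None


def serving_users(phase: Phase) -> str:
    """Which box users are connected to (assumes the CBU serves during resync)."""
    return "production" if phase is Phase.REPLICATING else "CBU"


phase = Phase.REPLICATING
for event in ("switch_initiated", "production_repaired", "resync_complete"):
    phase = next_phase(phase, event)
    print(f"{event}: {phase.name}, users on {serving_users(phase)}")
```

Rejecting out-of-order events mirrors the point that the switch back to production should not happen until resynchronization has finished.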

While it seems like an expensive proposition to dedicate an entire System i box to doing nothing but wait for your production box to fail, IBM offers several System i Capacity Backup editions that are significantly cheaper than its Enterprise Edition products, though they are not free. The cost of these servers must be weighed against what the business would lose in the event of a severe disaster, such as what happened when Hurricane Katrina hit the Gulf Coast in 2005.

It’s also worth noting that there are several other costs involved with staging a CBU in addition to the cost of the backup System i box itself. To protect the box from disasters that take out a company’s entire computer infrastructure, the CBU should be accessible on the same subnet as your production system, but it should be physically housed in another location at a respectable distance from the production system’s location. Some people house their CBUs at sister locations and others co-locate them at outside vendor centers that are specifically set up for high availability processing. In either case, there will be additional telecom and infrastructure costs to connect a server in a remote environment as part of your network. Besides co-location costs, you need to purchase high availability software as well as acquire some experienced help to set up and configure your HA replication strategy.

In future issues, I’ll explain some of the other costs and responsibilities you may incur when setting up a high availability system as well as some of the other unexpected benefits that grow out of these systems. But the main point is this:

While a high availability system will be invaluable to any company that chooses to implement one, it is a relatively expensive undertaking. For most organizations, a good business case must be made before undertaking the project. For banks, insurance companies, and other large organizations that have a critical need for close to zero downtime, it may be an easy sell to implement a CBU. Smaller organizations will have to evaluate carefully whether a dedicated solution is worth it to the business.
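A first pass at that business case can be as simple as comparing the downtime cost you expect to avoid against what the HA setup costs each year. The Python sketch below shows the arithmetic. Every figure in it is hypothetical, including the outage frequency, the hourly cost of downtime, and the annual HA cost, and a real evaluation would use your own numbers and probably a more careful risk model.

```python
def expected_annual_downtime_cost(outages_per_year: float,
                                  hours_per_outage: float,
                                  cost_per_hour: float) -> float:
    """Expected yearly loss: outage frequency x outage length x hourly cost."""
    return outages_per_year * hours_per_outage * cost_per_hour


def net_annual_benefit(outages_per_year: float,
                       hours_without_ha: float,
                       hours_with_ha: float,
                       cost_per_hour: float,
                       annual_ha_cost: float) -> float:
    """Downtime cost avoided by HA, minus what HA costs per year."""
    without_ha = expected_annual_downtime_cost(outages_per_year, hours_without_ha, cost_per_hour)
    with_ha = expected_annual_downtime_cost(outages_per_year, hours_with_ha, cost_per_hour)
    return (without_ha - with_ha) - annual_ha_cost


# Hypothetical: one major outage every two years, 24 hours to recover from
# tape versus a 1-hour switch to a CBU, $10,000 lost per hour of downtime,
# and $100,000 per year for the CBU, software, and co-location.
print(net_annual_benefit(0.5, 24, 1, 10_000, 100_000))  # 15000.0
```

A positive result suggests the investment pays for itself on downtime alone. A negative one does not settle the question, since it leaves out harder-to-price factors such as reputation and regulatory exposure.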

Regardless of where you are on the high availability roadmap, it’s important to review your options every so often to ensure that system availability will be as high as possible in your organization. You should also make a point of looking at different alternatives to keep the business going in the event of disaster. While this roadmap is a good start, I’ll flesh out the challenges associated with implementing high availability in future issues. For information on some of the other options that IT Jungle has covered in previous issues, see the Related Stories section below.