Choose Wisely: High Availability Performance and Reliability Issues
December 6, 2004 Dan Burger
How will your company deal with downtime? The same as it always has? Maintaining the status quo might be the goal for many companies, where not falling behind is considered getting ahead and business demands are not all that demanding. But that type of business operation simply will not fly for many companies. The speed of decision-making, coupled with e-business, has made terms such as high availability and managed availability commonplace in large enterprises, and increasingly in small and midsized businesses as well.
Availability is all about developing a strategy for dealing with planned or unplanned downtime. It usually means adding another server with your database and applications that mirrors your production server (often called the source machine). If the production server goes down, the mirrored server (often called the target machine) is ready in an instant to pick up the production workloads and run with them. Depending on the business, the objective may be to recover in seconds, minutes, or hours when a production server is taken offline, either by your own planning or by an unplanned event, such as a natural disaster, a power failure, or even human error. Keeping business running even when your production server isn’t, by mirroring machines, usually also means mirroring in a different geographic location, to avoid having both systems hit at once by the same disaster. At many companies, however, source and target machines are kept in the same facility, and a separate disaster recovery server is set up in a remote location.
The cost of downtime is playing an increasing role in calculating the cost of high availability ownership. In reality, many companies face data recovery times measured in days when unplanned downtime occurs and in hours while backups are being done. That, however, is not high availability. And for a growing number of companies, it is not acceptable. According to the best educated guesses, 4,000 to 5,000 iSeries shops have implemented high availability solutions.
There are many debates over what level of availability is needed in any given business situation. Out in the real world, where companies are deciding on high availability projects, it’s the business units of those companies that are driving the demand, and those business units are only willing to pay for certain levels of availability. Some companies may have an eight-hour high availability requirement, not a 24-hour requirement. They are willing to pay for that and to put certain measures in place.
In its early days, high availability software was a very complex proposition, making it time-intensive and expensive. This is no surprise: almost every new technology follows a similar curve, beginning as an exotic product that is expensive and tough to use and gradually maturing into something cheaper and easier. High availability software in the IBM midrange has a 15-year history, and it has undoubtedly evolved considerably in that time and along such a curve. High availability is nowhere near as complex as it once was, but it is also nowhere near as simple as some would make it out to be. False impressions about high availability are rampant. (See “Myths, Misconceptions Run Wild in World of High Availability,” which sorts through this to bring some clarity.)
Complexity is proportional to the degree of availability. If the circumstance calls for 99.999 percent availability, the resulting systems will be more complex than an implementation calling for a lesser degree of availability. The level of complexity also changes if companies are using multiple platforms (and they usually want to manage them from a single interface) or require many levels of managerial reporting. The complexity of an HA setup will also vary depending on whether a company is building a cluster of many servers, a one-to-one cluster, or a single-machine replication.
Having said that, companies can attain a level of availability that is comfortable for them with far less complexity than was required only five years ago in the AS/400 market, and this is real progress. That reduction in complexity has come with a decrease in the cost of HA solutions. Whereas high availability was once the exclusive advantage of the world’s largest corporations, its advantages are now within the budgetary reaches of small and midsized companies.
The big architectural change in high availability came with the development of remote journaling and the inclusion of that technology in the guts of the OS/400 operating system. Because of remote journaling, performance gains have been made, while the costs of implementing an HA solution have plummeted.
Briefly explained, remote journaling is a process that writes transaction and file data to the target as well as to the source. As a comparison, local journaling writes data only to the production server, and the HA software is required to move that data to the target server. Because remote journaling is part of the operating system, it performs the transfer of data and objects at the operating system level, which increases the speed compared with the previous methodology, and it also reduces the system overhead.
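The distinction between the two write paths can be sketched in a toy model. This is illustration only: the class and function names below are invented for the sketch and do not correspond to actual OS/400 APIs.

```python
# Toy model contrasting local vs. remote journaling write paths.
# All names are hypothetical; none are real OS/400 interfaces.

class Journal:
    """A journal receiver, modeled as a simple list of entries."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)

def local_journaling_write(entry, source_journal, ha_software_queue):
    """Local journaling: the OS writes only to the source journal;
    separate HA software must later read and ship the entry."""
    source_journal.append(entry)
    ha_software_queue.append(entry)  # extra, application-level hop

def remote_journaling_write(entry, source_journal, target_journal):
    """Remote journaling: the OS itself writes the entry to both
    the source and the target journals, with no middleman."""
    source_journal.append(entry)
    target_journal.append(entry)

src, tgt, ha_queue = Journal(), Journal(), []
local_journaling_write({"txn": 1}, src, ha_queue)   # HA software must ship it
remote_journaling_write({"txn": 2}, src, tgt)       # the OS ships it itself
print(len(ha_queue), len(tgt.entries))  # 1 1
```

The point of the comparison is the missing middleman: in the remote case the transfer happens inside the operating system's own write path, which is where the speed and overhead gains described above come from.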
Before remote journaling, heavy system overhead from the HA software put a strain on production servers. In worst-case scenarios, production systems were bogged down by HA, and to some extent the stigma of HA being a resource hog still persists. That problem can still occur, but the likelihood is relatively small. Many of the high availability vendors allow companies to “test drive” their software to address performance concerns head on before contracts are signed. If CPU capacity is going to be an issue at your installation, you’ll want to know that and adjust accordingly.
The speed at which a system operates is the primary performance issue. This is particularly true as you go up the high availability ladder to “five nines” (99.999 percent availability), which means your target server operates with near synchronization with the production server. Generally speaking, the financial industry is the most concerned with attaining this level of availability, but any company with employees, customers, and suppliers that can add productivity at any time of the day or night could be planning for this Holy Grail of high availability.
A potential glitch in any managed availability system, but one that is particularly annoying when trying to dial in true high availability, is an issue called latency. When production systems can’t keep up with handling, processing, and sending data changes to the target server, a backlog can occur on the production side. That puts the two systems out of synch and, depending on the workload and the resources on the machine, it could be out of synch by a matter of minutes or hours. The difference between the two servers is referred to as latency. If the production server goes down while out of synch with the target server, data could be lost or simply inaccessible until the production server is brought back online.
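Latency, in other words, is measurable: it is the gap between the last change generated on the source and the last change applied on the target. A minimal sketch, with invented entry counts and timestamps purely for illustration:

```python
# Hypothetical sketch: latency as the gap between the newest journal
# entry generated on the source and the newest one applied on the target.

def replication_latency(source_entries, target_entries):
    """Return the backlog (entry count) and the time lag (seconds)
    between the source and target journals."""
    backlog = len(source_entries) - len(target_entries)
    if backlog <= 0:
        return 0, 0.0
    last_applied_ts = target_entries[-1]["ts"] if target_entries else 0.0
    lag_seconds = source_entries[-1]["ts"] - last_applied_ts
    return backlog, lag_seconds

# Invented workload: 100 entries generated, only 40 applied so far.
source = [{"seq": i, "ts": float(i)} for i in range(1, 101)]
target = source[:40]
backlog, lag = replication_latency(source, target)
print(backlog, lag)  # 60 60.0
```

If the production server fails while that backlog exists, the 60 unapplied entries are exactly the data that could be lost or inaccessible until it comes back online.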
Depending on the goals of the high availability system that was implemented, this could be a huge problem or not a significant issue. To eliminate latency, vendors use synchronous remote journaling, which prevents the loss of data from transactions that occur while the production machine is lagging. Synchronous remote journaling sends transactions to the target journal receiver first, then the local journal receiver, and finally to the production database. That means the target is always at least as current as the production server.
A company that is not particularly concerned with zero latency could choose asynchronous remote journaling and not worry (it’s a relative thing) about latency. With asynchronous remote journaling, changes are written first to the local journal, then the database is updated, and then the updates are sent to the target system.
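The write orderings described above can be made concrete with a small sketch. This is illustrative only; the function names are hypothetical and the three destinations are reduced to labels in a log:

```python
# Illustrative-only sketch of the two write orderings. The functions
# and destination names are invented, not OS/400 interfaces.

def synchronous_write(entry, log):
    """Target journal receiver first, then local journal, then the
    production database: the target can never fall behind."""
    log.append(("target", entry))
    log.append(("local", entry))
    log.append(("database", entry))

def asynchronous_write(entry, log):
    """Local journal and database first; the target is updated
    afterward and may briefly lag (latency)."""
    log.append(("local", entry))
    log.append(("database", entry))
    log.append(("target", entry))

sync_log, async_log = [], []
synchronous_write("txn-1", sync_log)
asynchronous_write("txn-1", async_log)
print([dest for dest, _ in sync_log])   # ['target', 'local', 'database']
print([dest for dest, _ in async_log])  # ['local', 'database', 'target']
```

The trade-off falls out of the ordering: the synchronous path makes every production write wait on the target, while the asynchronous path keeps production fast at the price of a window in which the target lags.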
Performance issues also crop up concerning the bandwidth connecting the source and target machines to each other, whether you are talking about either type of remote journaling (synchronous or asynchronous). Without enough bandwidth to push the changes across to the target machine, latency again can become a problem. Companies need to consider the size of the pipe between the two machines with regard to how much data is going to be pushed through the pipe and how much it costs. It’s not always necessary to send every piece of data and every application. Stripping out some of the information pushed between source and target affects the size of the pipe that is needed, and that may have its advantages in terms of performance and expense. This is less of a problem in the United States, where an excellent communications network is in place, than in other parts of the world. The declining cost of bandwidth in the United States, combined with the lower cost of an iSeries backup server, has helped put high availability within reach of more companies (see the article “IBM’s iSeries for HA, CBU Editions Gain Traction”).
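Sizing the pipe is back-of-the-envelope arithmetic: change rate times entry size, plus headroom. The workload numbers and the 30 percent overhead factor below are assumptions for illustration, not vendor guidance:

```python
# Back-of-the-envelope pipe sizing, using invented workload figures.

def required_bandwidth_kbps(changes_per_sec, avg_entry_bytes, overhead=1.3):
    """Rough bandwidth needed to ship journal entries to the target,
    with a fudge factor for protocol overhead and traffic bursts."""
    return changes_per_sec * avg_entry_bytes * 8 * overhead / 1000

# Assume 500 changed records per second averaging 400 bytes each:
print(round(required_bandwidth_kbps(500, 400)))  # 2080 (Kbps)

# Stripping out objects that need not be replicated shrinks the pipe:
print(round(required_bandwidth_kbps(300, 400)))  # 1248 (Kbps)
```

The second call shows the point made above: filtering what gets sent between source and target directly reduces the bandwidth, and therefore the cost, of the link.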
Companies doing high-volume transactions should also take into account multithreading, which enhances the speed with which information is taken off the target system’s journal and written to the database. With two processes running simultaneously (one reading from the journal and one writing to the database), multithreading hides the I/O wait time that the operating system imposes.
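The reader/writer split can be sketched as two threads joined by a queue. This is a minimal, hypothetical model of the pattern, not any vendor's implementation:

```python
# Minimal sketch of the reader/writer split: one thread reads entries
# off the target's journal while another applies them to the database,
# overlapping the I/O waits. Workload details are invented.

import queue
import threading

entries = queue.Queue()
applied = []
DONE = object()  # sentinel telling the writer to stop

def journal_reader(journal):
    """Reads entries off the target system's journal."""
    for entry in journal:
        entries.put(entry)
    entries.put(DONE)

def database_writer():
    """Applies entries to the database while the reader keeps reading."""
    while True:
        entry = entries.get()
        if entry is DONE:
            break
        applied.append(entry)

journal = [f"txn-{i}" for i in range(5)]
reader = threading.Thread(target=journal_reader, args=(journal,))
writer = threading.Thread(target=database_writer)
reader.start(); writer.start()
reader.join(); writer.join()
print(applied)  # ['txn-0', 'txn-1', 'txn-2', 'txn-3', 'txn-4']
```

Because the queue decouples the two sides, neither thread sits idle waiting for the other's I/O to complete, which is where the throughput gain on high-volume workloads comes from.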
Synchronization of the production and the target machines is a critical feature, and the ease with which this takes place and can be monitored is an important point in the high availability purchasing decision. Having confidence in the integrity of the data and applications on the target server allows companies to handle planned and unplanned downtime without concerns of inaccessibility. Differences among systems relate to issues such as full or partial replication, whether all objects (profiles, data areas, data queues, and so forth) are being replicated, and whether the replication includes the Integrated File System within OS/400. If the journals on the two systems (assuming a one-to-one cluster) don’t match up, having a feature that explains what went wrong and how to get it back in synch is very useful. (Next week, we’ll give an overview of the various OS/400 HA products and their features.)
Reliability and readiness directly relate to testing the process of moving operations to the backup server, commonly called role swapping (also referred to as switchover or rollover). If this can’t be done easily and with confidence, it is unlikely to be used to reduce downtime in conjunction with system maintenance, and that greatly diminishes the value of the product. Role swapping should be tested regularly and used to its maximum benefit. Companies get a return on investment on HA by role swapping during hardware or operating system upgrades, with users working on the target server while the upgrade takes place.
The time it takes to complete the role swap includes preparation time, executing the role swap, and verification.
Prep time relates to latency and whether the source machine is delayed. As has always been the case, the best time to be caught up is on a weekend or at night, when transaction processing is slowest. Naturally, before the swap is made the data on both machines should be verified for accuracy.
The role swap itself may take as little as 20 seconds. After the swap, there should be additional verification that all the interfaces and devices have moved and user profiles and security authorities have been activated or inactivated.
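The timing breakdown above (preparation, execution, verification) can be summed up in a small sketch. Everything here, from the check names to the per-check timing, is an invented illustration of the process, not a real procedure:

```python
# Hypothetical sketch: total role-swap time is preparation plus the
# swap itself plus post-swap verification, and the swap should not be
# declared complete until every check passes. All names are invented.

def role_swap(prep_seconds, swap_seconds, checks):
    """Return total elapsed seconds if every post-swap check passes;
    raise if any verification step fails."""
    elapsed = prep_seconds + swap_seconds
    for name, passed in checks.items():
        if not passed:
            raise RuntimeError(f"role swap verification failed: {name}")
        elapsed += 1  # assume roughly one second per verification step
    return elapsed

checks = {
    "interfaces_moved": True,
    "devices_moved": True,
    "user_profiles_activated": True,
    "source_profiles_inactivated": True,
}
# Two minutes of prep (draining the backlog), a 20-second swap:
print(role_swap(prep_seconds=120, swap_seconds=20, checks=checks))  # 144
```

The structure mirrors the article's point: the 20-second swap is the smallest part of the job, while preparation (catching up latency) and verification dominate the total.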
Monitoring features that provide these assurances are offered with green-screen or browser interfaces, depending on the HA vendor and the preferences of the user. Some monitoring systems include a paging system or e-mail to contact the manager if action needs to be taken. The bottom line is that no one will do a role swap until he is comfortable that everything needed is on the backup.