Make High Availability Work for You
December 13, 2004 Dan Burger
What is high availability and what can it do for companies that are searching for ways to better handle downtime and make their data accessible around the clock to employees, customers, suppliers, distributors, and others? Much of the background on high availability and the market in the OS/400 arena has been covered in the past two issues of this newsletter. This week, we take a look at the primary high availability software vendors and their products.
DataMirror
DataMirror has designed a large degree of autonomic computing into its iCluster HA software for the iSeries. Although these features require some examination to determine their full extent, iCluster does provide self-correcting, self-configuring, self-protecting, and self-optimizing capabilities. It also is designed to scale to environments of any size. To address the issue of data integrity between system audit journals and database journal transactions, DataMirror uses a method that matches and merges, in real time, records that are transmitted from the source to the target system so that object- and record-level events are applied in the exact sequence in which they occurred on the primary system. This functionality allows iCluster to mirror complex batch jobs without requiring the user to write customized CL code. Additional functionality, which DataMirror calls Real-Time Auto Registration, is designed to aid the administration and immediate handling of new objects as they come into the replication process.
For larger companies that maintain millions of large byte stream files (BSFs) within the OS/400 Integrated File System that are created and deleted but are never actually updated, iCluster replicates these BSF objects without the additional overhead of journaling. (The option of using remote journaling still exists.) Users can also perform sync checks to ensure that BSF objects are synchronized between primary and recovery systems.
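The match-and-merge idea described above, in which audit journal and database journal records are applied on the target in the order they occurred on the source, can be illustrated with a small sketch. This is not DataMirror's implementation. It assumes each stream arrives already ordered and that every entry carries a comparable timestamp; the `JournalEntry` type and its fields are hypothetical names chosen for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    timestamp: float  # hypothetical ordering key shared by both journals
    source: str       # "audit" (object-level) or "database" (record-level)
    event: str


def merge_in_order(audit_entries, database_entries):
    """Merge two already-sorted journal streams into one stream in timestamp order."""
    # heapq.merge reads both inputs lazily, so it also works on long-running streams.
    return heapq.merge(audit_entries, database_entries, key=lambda e: e.timestamp)


audit = [
    JournalEntry(1.0, "audit", "create file ORDERS"),
    JournalEntry(4.0, "audit", "delete file ORDERS"),
]
database = [
    JournalEntry(2.0, "database", "insert record into ORDERS"),
    JournalEntry(3.0, "database", "update record in ORDERS"),
]

for entry in merge_in_order(audit, database):
    print(entry.timestamp, entry.source, entry.event)
```

Applying the merged stream in this order means a record insert is never attempted before the file that holds it has been created, which is the kind of sequencing a batch job depends on.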
Proven real-time mirroring of both journaled and non-journaled IFS files streamlines the processing of image files and document management applications such as those used by the healthcare and insurance industries. To deal with triggers used in the DB2/400 database, iCluster keeps them current on the backup system in a disabled state. They are enabled whenever a rollover occurs. In terms of performance, DataMirror has benchmarked its XtremeCache for iCluster in a customer setting where over 100 million operations per hour was achieved. XtremeCache is designed to optimize the flow into and out of high speed software cache. To maximize bandwidth efficiency, iCluster is designed to filter out temporary files on the primary system. Additionally, it propagates changes only to objects being replicated, rather than sending data for objects that are not in replication scope. It also takes advantage of the iSeries’ Minimize the Entry-Specific Data (MINENTDTA) journaling option to reduce network traffic. DataMirror sources claim iCluster can be installed and basic mirroring established within an hour. However, they also note that the average physical implementation typically takes five to 17 days, depending on the complexity of the environment. An additional three days, on average, is necessary for training system administrators to use iCluster. This includes testing after the physical implementation to ensure that large volumes of data are properly supported within the users’ environment. In a complex user environment, the implementation could take longer, depending on the number of systems, their geographical location, and the extent of the users’ replication needs. To aid with configuration, iCluster includes a GUI, green-screen menu, and command line configuration options. 
It eliminates the need to specify individual objects for replication, allows single configuration for both files and regular objects, and lets users add new libraries into replication while the replication is still active, without the need to end and restart replication. Wizards are used to guide configuring operations. iCluster combines both high availability and clustering functionality into a single solution. File and object configuration and monitoring are integrated.
iTera
The iTera Echo2 software has made great inroads into the OS/400 high availability market in the past five years. The company took the remote journaling that IBM built into the OS/400 operating system and used it as the basis for a product that was less expensive and less complex than traditional high availability products. That is not to say that iTera was the only company to do this; it’s just to give it credit for carrying the remote journaling flag into the small and midsized marketplace and building a great deal of success there. In most instances, remote journaling is recognized as being faster than local journaling, the standard method for doing high availability from HA’s conception. “Vendors are often close to one another in speed of replication,” says iTera’s Bill Rice. “But speed is not that high of a priority. Yes, some companies want to be current or have no more than a minute of latency. Others may have a lesser machine, and if they are five or 10 minutes behind real-time, it is not important to them. It’s all about how quickly you want to recover your data. Cost goes up with speed and accuracy.” Echo2 makes use of autonomic features to reduce the time requirement of system monitoring. “The software does what it needs to do,” Rice says, “and if intervention is needed, the software will tell you. 
When there are issues, the software resolves the issues, and then it tells you what it did.” An example is an automatic history of what has been synchronized or resynched. Auditing processes in Echo2 are automated, and the detection of constraints, as one example, can be managed automatically. Its real-time audits run continuously, and its parallel audits can be set up to run at user-chosen intervals to verify and validate replication. Based on results of audits, objects are automatically repaired or resynchronized, and, if necessary, alerts are sent to the operator. When an audit does detect an out-of-sync condition, it is automatically resolved. Ease of management features, like these and similar features from other vendors, factor into how much time the staff spends managing the system and are among the considerations most companies look at when comparing high availability solutions. Comparing the number of processes that require manual attention, the capability of software scalability as a company grows, and the ease with which a user can enable and disable features are also significant considerations. Other features of note in Echo2 are the automation of operational processes, the drill-down capabilities of monitor screens, the capability of IFS replication to handle DLO objects, and the WebSphere and MQSeries object replication capacities. Echo2, as described by iTera, is a full-blown product that can be used as much or as little as you need. This comment is pointed toward a comparison with its HA competitors that sell product components, but the bottom line is recognizing which features are useful to you and understanding what you are getting for your money. An example of what iTera is describing is that features such as replication of the iSeries IFS or spool files can be turned on or off at the user’s option. 
“As a customer, knowing what you need, and how quickly and completely you need to recover, sets the benchmark for software costs,” Rice says. “You don’t have to buy the second machine to gain the benefits of high availability. You can use logical partitioning and eliminate downtime from tape saves right away. That allows you to role-swap to some extent. You can’t move workloads to the backup. If you want a second machine, to move workloads to the backup and have full control over the backup, that’s another level and a higher cost.”
Lakeview Technology
Because Lakeview Technology is one of the vendors that has been around since the beginning of high availability, it is sometimes regarded as a vendor that works only with enterprise-level companies. However, it offers two high availability products: an entry-level choice called MIMIX ha Lite, for basic replication, and MIMIX ha1, for multiple-server high availability. Both products offer support for OS/400; i5/OS; Windows NT/2000/XP; Windows OS on an Integrated xSeries Adapter (IXA) or Integrated xSeries Server (IXS); z/OS; HP-UX; AIX; and Linux. While MIMIX ha1 can support those diverse platforms, it is made for the iSeries and requires OS/400. MIMIX ha Lite is designed for the small or midsized company that wants less-costly high availability software than the full-bore MIMIX ha1. Like other HA software, MIMIX ha Lite makes use of remote journaling in OS/400 and places an emphasis on autonomic features and auditing capabilities. It supports logical partitions, one-to-one replication, and mutual backup replication. A basic MIMIX ha Lite implementation ranges from 40 to 80 hours, including planning, installing, and configuring the software and training administrators. Most vendors are hesitant to talk about implementation times because they vary from customer to customer, and stories about installation times are often used against vendors by competitors. 
MIMIX ha1 has features not included in MIMIX ha Lite, such as local journaling (in addition to remote journaling), stand-alone or switched iASP, and library mapping. It was also designed to handle clustering, distributed parallel database, and a mix of broadcast, cascading, or hybrid replication. MIMIX ha1 typically involves a more complex IT environment and a desire for advanced features. The company provided an estimation of implementation times that range from 80 to 120 hours. It’s important to note that you have to contact each vendor regarding its implementation times and never rely on one vendor’s explanation of another’s implementation time. Also, be sure to talk to the customers of high availability vendors that have iSeries setups and workloads most like yours and have installed the products you are thinking of buying. There are significant differences as well as similarities between the HA requirements of small and midsized companies and large enterprises, says Terry Lewis, vice president of marketing at Lakeview. “There is a need for different levels of availability that depends on a company’s business plan and the industry it is in. Smaller companies are playing on a global scale and competing against larger vendors; they need 24/7 for that. Some enterprises can take advantage of lower-level solutions. Divisions within those businesses, for instance, may not need the high-level requirements. “The typical small or midsized company is focused on being more productive. Planned downtime used to tolerate having servers down for four hours each night. Now they can use an HA tool to stay online during those four hours and bring in additional business.” Both MIMIX products include multi-threading, a separate target database reader and apply processes, and adaptive cache, a technology that informs the operating system of upcoming I/O requests to improve performance. In managing the synchronization process, it eliminates redundant I/O. 
For scalability and efficiency, ha Lite and ha1 use separate target DB reader and apply processes, so the system does not have to wait for journal entry read or DB apply to finish in order to process the other. Lakeview’s Adaptive Cache technology, designed cooperatively with the IBM AS/400 development lab, is a performance feature that informs the operating system in advance of upcoming I/O requests. Autonomic features in both MIMIX products provide capabilities such as the creation of local and remote journals and the communications paths between them; automatic change and deletion of journal receivers; and replication checks for source and target synchronization. Both products support more than 60 object types, including user profiles, configuration objects, authorization lists, message files and queues, and query definitions.
Maximum Availability
Telecom New Zealand is one of the largest replication sites in the world, with approximately 1 billion transactions per day replicated between two iSeries servers positioned in separate locales. Maximum Availability’s *noMAX is making it happen. The *noMAX is implemented in three configurations: single machine, which is often used to replicate databases between ASPs or logical partitions on one server; one-way replication, primarily used to run simple replication in one direction between two machines; and two-way replication, designed for high availability and role swaps. “We have customers replicating across the Internet at the small end, others using a standard phone line, still others on a T1 line, and the larger ones using an optical link,” says Simon O’Sullivan, sales director at Maximum Availability. In explaining the role swap procedure, O’Sullivan says some customers choose to load the software on a single machine and duplicate the database across ASPs. They then take tape backups from the second copy of the database. “This gives them the capability to take backups at any time, without affecting the users. 
Miami Children’s Hospital is one *noMAX customer that does this. It is high availability at the very low end,” O’Sullivan says. According to statistics cited by Maximum Availability, 95 percent of all iSeries shops are backing up to tape as their only means of disaster recovery. In O’Sullivan’s view, they are not ready to replicate data and objects and do role swaps on their first encounter with the high availability concept. Maximum Availability’s plan is to offer levels of disaster recovery and high availability for those customers that want to start out slowly. “If later they want to add user profiles and other objects,” O’Sullivan says, “then, okay, let’s do that.” It’s a step-by-step process that may or may not lead to two-way dynamic replication. “This way we can work to their timeframe, budget, and skill set,” he says. Although the price of these disaster recovery/high availability solutions has declined dramatically since the introduction of remote journaling, O’Sullivan predicts further declines in 2005. “With the price of a second iSeries shrinking, the cost of bandwidth shrinking, the cost of HA and DR solutions shrinking, there is a lot of life and growth in this market.” The company gets a lot of mileage out of its offer of a free 30-day, fully supported trial of its software. It is installed remotely; the Maximum Availability support people turn on journaling and remote journaling and then work with the company to develop the program. Maximum Availability also gets traction because it publishes its software prices, which leads to an important point: the pricing of HA software. The iSeries HA vendors, with the exception of Maximum Availability, do not publish list prices for their software. This can make negotiations somewhat tough. So you need to have competitive bids from as many HA suppliers as possible, and for a broad range of solutions, to get the right kind of product at a price you can afford. 
“The market is changing,” O’Sullivan says, “and businesses want to see the product work in their environments before they buy it. Some have been stung by paying for software, loading it up, and finding it doesn’t work as they expected it to. Then the buyer has to admit a mistake was made and the purchase was a bad idea.” Lower costs for hardware and software have helped companies move to high availability. And with an entry-level i5 Model 520 for around $11,000 that offers 500 CPWs of processing power and 70 GB of disk capacity, O’Sullivan claims it can be less expensive to use *noMAX and get an entire copy of the database than to buy a high-speed tape drive. The *noMAX uses a PC-based GUI to control its HA environment. It has the capability to replicate data and objects, user profiles, IFS, data areas, and data queues, and it has dynamic replication features and uses the “before image” and “after image” methodology to perform real-time integrity checking. As an analogy, O’Sullivan says that *noMAX is like a Model T Ford compared with a Rolls Royce. “We don’t have a lot of bells and whistles,” he says. “If you want to replicate your data and your objects from one machine to another, and you want to do role swaps, we can do all that. The Rolls Royce HA products might replicate things that we can’t. But that’s not to say that we don’t have very large customers.”
OS Solutions
“You can put any high availability vendor in any category you want, but what it comes down to is reliability, quickness to implement, and ease of use.” You can tell by this comment from Pete Massiello, at OS Solutions, that his message is going to be that simple is better than complex. And this is a big issue in the HA marketplace. There’s an ongoing war of words over which product is best able to eliminate complexity while delivering reliability. “Companies want to know about price and functionality. Those are factors in the decision-making process,” Massiello says. 
“All the vendors are using the same infrastructure: remote journaling. There’s no differentiating in that area. And it is remote journaling that allowed HA products to be price-competitive and more affordable in the SMB [small and midsized businesses].” OS Solutions offers data replication and high availability capabilities within its OS Director systems management product, which manages and enhances the performance of OS/400 applications and systems. The software supports the mirroring of objects, data areas, data queues, and IFS objects. The OS Solutions approach to high availability is a familiar theme. Wizards aid in the configuration process, and it was designed to require a minimum amount of staff time. Like all of the HA products, the amount of management time is directly proportional to the amount of complexity required by the company implementing the solution and the amount of data being pushed through the replication pipeline. And like the other vendors, OS Solutions will provide bare-bones high availability (which is basic replication and more commonly referred to as disaster recovery) or a true high availability, automated failover solution. OS Director will support fully manual, semi-automatic, and fully automatic failovers. “High availability is for companies that don’t want to be down while doing a backup,” Massiello says. “Being down means potentially losing customers to a competitor. That’s a lost opportunity cost. To avoid it, you must keep the source system up at all times. The world has changed. People are understanding how important it is to have computer systems running 24/7. They understand the consequences of not being able to get to their data.” Massiello says that if the HA implementation is done right, there could be fewer things for the IT staff to do than before. For instance, it is no longer necessary to take down the system for an extended period. 
It’s a matter of moving everyone to the target machine, upgrading the source machine, and then role-swapping it back. “It’s a total systems implementation,” Massiello says. “We have a solutions management product as well, and that integrates with our HA environment. It helps clean up the system, optimize it, and get things in good shape before starting HA.” Massiello’s advice to those considering an HA implementation is to begin planning by asking the following questions. Do you want this for role-swapping and continuing operations? What do you want to replicate and move from one side to another? What is the impact on the business? Do you understand the business needs, and can you translate those into what you need to set up? Do you have a clear plan for the implementation timeframe? “HA isn’t as complex as many people think it is. It’s not that you can take the shrink wrap off and go, like PC software, but it is very easy to implement,” he says.
Trader’s
Remote journaling is an option, not a requirement. With all the buzz surrounding remote journaling and how it has opened the doors to high availability for many small and midsized businesses (as well as opening the eyes of many enterprise level companies), it’s a bit startling to read such a statement. It comes from Trader’s, a company that has also been doing quite well selling high availability to OS/400 users. However, the first thing you notice about Trader’s is that its customer base is in Europe, mostly in France. That may soon change. Trader’s HA product is called QUICK EDD. General Manager Thierry Roux explains that it is capable of fast replication with less resource utilization than remote journaling. Not that Roux has anything against remote journaling: EDD allows customers to decide whether to use it or not. 
“The potential for an HA system to experience a target system that is out of synch with the source system is pretty good and becomes more likely with a greater amount of information being transferred,” Roux says. “There can be communication and network problems and damage on files or objects. Being able to quickly resynch the target with the source is critical. QUICK EDD is designed to make the switch as fast as possible without any mandatory prerequisite from OS/400 or the application.” All vendors say their software is easy to install and maintain, and Trader’s is no different in that regard. Simplification is a much-appreciated commodity for any user. Roux says EDD has one-step replication, which may lead to thinking that HA is actually easier than it is in real life; nonetheless, it does put all applications at the same level when you have to switch from the source to the target system. You’ll save time during the switch because you won’t have to wait to “reorganize” your target. Quickly resynching the target with the source is a critical process that is handled in different ways by the various HA solutions. With the synch checker embedded in EDD, users can initiate a process that compares each object on the source and target system and only retransmits the records/objects from the source system if the target records/objects aren’t identical to the source system. “The emphasis of any HA buying decision should be to have a synch check that compares each object on the source and the target system and only retransmits the records/objects that are not identical. Synch check should be an embedded function,” Roux says. He adds that, during the real-time replication process from production to target, EDD requires only four steps for object replication and acknowledgement, instead of the eight steps required with remote journaling. In terms of performance, this yields faster replication with less resource utilization. 
“An average of 40 percent of volume transmitted by communication will be saved,” Roux says. “After a disaster, the HA software has to manage the transfer of production jobs and users to the recovery system. If the outage is a short one, the HA administrator will calmly make the switch back to the production system and the replication will quickly synchronize the two systems. However, in disaster situations where the production system is unavailable for an extended period, the HA administrator may see dangerously high disk utilization on the recovery system. The risk of high disk utilization is much greater when the HA solution requires journaling on the recovery system in order to replicate back to the production system. EDD allows you to perform your processing on the recovery system without activating journaling. Processing without journaling on the recovery system decreases the CPU utilization and disk utilization requirements on the target system.” QUICK EDD needs a disk space of 1.5 MB and is downloadable from the Internet. One of Trader’s customers uses the HA product on its Model i595. The typical implementation involves two days of consultancy to audit the customer’s system and organization; three days to install, set up the system, educate the users, and begin tests; and one day to test a failover situation. Roux says companies are asking for two things when it comes to high availability. The first is return on investment and to not just have “a sleeping system” in case of a disaster. The second is to avoid having IT staff tied to the system and taken away from the main business of the company. The majority of iSeries customers (those using servers in the P10 or P20 class) have to keep their HA software costs in line with what they are gaining in reduced downtime in order to make the return on investment pencil out in their favor. If the company can reduce its planned downtime to less than two hours, the ROI looks pretty good. 
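The synch check Roux describes in this section, comparing each object on the source and target and retransmitting only what differs, can be sketched in a few lines. This is a generic illustration, not QUICK EDD's logic: it assumes objects can be read as bytes and compared by a SHA-256 digest, and the dictionaries standing in for the two systems, along with the function names, are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Fingerprint an object's contents so the two systems can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def objects_to_resend(source: dict, target: dict) -> list:
    """Return names of source objects that are missing or different on the target."""
    return sorted(
        name
        for name, data in source.items()
        if name not in target or digest(target[name]) != digest(data)
    )


def resync(source: dict, target: dict) -> list:
    """Retransmit only the out-of-synch objects and report which ones were sent."""
    resent = objects_to_resend(source, target)
    for name in resent:
        target[name] = source[name]
    return resent


# Hypothetical object contents keyed by object name.
source_system = {"ORDERS": b"row1,row2", "CUSTOMERS": b"acme,globex", "INVOICES": b"inv-1"}
target_system = {"ORDERS": b"row1,row2", "CUSTOMERS": b"acme"}

print(resync(source_system, target_system))
```

In this toy run only CUSTOMERS and INVOICES would be retransmitted; ORDERS already matches and stays off the wire, which is the bandwidth argument behind retransmitting only non-identical objects.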
Vision Solutions
At Vision Solutions, there are four interrelated HA products with a common engine and interface. There is also a large support staff that comes from being one of the “old timers” in the high availability market. The HA vision at Vision Solutions is that most small to midsized companies are not yet moving the amount of data and applications that require high availability solutions with all the bells and whistles that enterprise companies require. However, they are finding disaster recovery options useful and cost effective when applied to reducing planned downtime. There is a trend among companies to begin with disaster recovery and evolve into high availability as the need to make applications available 24/7 comes into play. The Vision product line is a reflection of the diverse needs of the iSeries HA market and the competitive pressures that have recently shaped it.
The Vision product line begins with Orion Express, which is a disaster recovery solution. It does replication between two servers or two nodes and has manual switching. Orion Express was designed to replicate journals to the database, and it can be used to perform tape backups, while minimizing downtime. It shares the same Java/XML-based replication engine as the entire lineup of Vision HA products. One step up the product ladder is Orion Professional, aimed at the company that wants a relatively simple high availability solution that goes beyond disaster recovery. This product primarily appeals to a company that wants to mirror its production server database and one or two applications. Full replication is available with Orion Professional, which means that, compared with Orion Express, the user gets features such as IFS support, automated or manual switch over, a graphical interface, and a choice of local or remote journaling. 
The usefulness of local journaling has been a hotly debated topic among the vendors since the introduction of remote journaling, but there are, as Vision executives will point out, instances where local journaling can be useful. These are rare, but Vision can assist if it works best in a company’s implementation. Orion Enterprise is a higher level of high availability designed for use with multiple servers and multi-node clustering for companies moving high-volume data. It provides advanced synch-checking features that offer more granular views and can run live, while the product is running, with the capability to sample from journals and databases. Vision notes that one of its customers has 30 terabytes of data that is kept in synch across multiple servers. At the top of the Orion product lineup is Orion Advance Enterprise edition, designed for highly customized situations involving cutting-edge technologies such as independent ASPs and cross-site mirroring. “The most critical aspect of HA is data integrity,” says Vision’s vice president of marketing, David Wegman. “You can’t swap without integrity. And you get integrity with features such as advanced synch checkers, file compare and repair, and merge and purge.” All of Vision’s products allow users to analyze the synchronization and determine whether the data is important to the role swap. The user decides whether to apply it before the swap, repair it, or roll it back. It provides options. Vision’s HA products include the monitoring of held objects, which can be a stumbling block in the replication process. “There may be reasons why some things are not ready to apply to the target side,” Wegman says. “There could be constraints or triggers that didn’t get moved over, or a degradation in the journal. There has to be held object analysis to know these things. In those cases, it has to be sent again or repaired. 
Users have to be aware of these things before doing a role swap.” The Orion products include this capability, beginning with the Express level. “Some products don’t have synch checkers,” Wegman says. “And some will say remote journaling is so good you don’t need a synch checker. That’s baloney.”
Related Articles
“Choose Wisely: High Availability Performance and Reliability Issues”
“Myths, Misconceptions Run Wild in World of High Availability”