Four Things That IBM i Shops Get Wrong About HA, DR, And Data
July 22, 2020 Timothy Prickett Morgan
It’s always a little hard to generalize, but IBM i shops tend to be conservative, thoughtful, and thorough when it comes to running IT. But no IT organization is perfect, and there are still issues when it comes to disaster recovery (DR) and high availability (HA).
We think this is more often a reflection of the situation than of the people running IBM i systems. And to get a sense of the HA and DR missteps that people are making, we had a chat with Lilac Schoenbeck, vice president of go-to-market strategy in the Power Systems division at Rocket Software, and Stephen Aldous, senior director of product management at the same company. Rocket, of course, bought the iCluster high-availability software suite from Big Blue in 2012; IBM had gained control of it in 2007 through the acquisition of its original creator, DataMirror. Rocket itself was acquired by Bain Capital in 2018, and counts Rocket Aldon, Seagull Software (Rocket LegaSuite), and Rocket BlueZone as other IBM i pillars among its broad and deep portfolio of software.
Timothy Prickett Morgan: Having established that nobody’s perfect, let’s take it from the top and talk about the first thing that IBM i shops get wrong when it comes to HA/DR and data protection.
Stephen Aldous: So this is a pretty obvious one, but it seems to trip everybody up. Too many shops don’t test the DR strategy, don’t test that the HA failover works, or don’t test that the backed-up data is actually usable and will work when it needs to be recovered. Many of the disaster recoveries that take place in real life fail precisely because no one went through the testing process.
We typically find that when disaster strikes, it’s unplanned. And if the employee who knows the system isn’t around, or is on vacation and unreachable, the IT staff trying to deal with the situation struggle to operate a system that may be more complex than it should be. Further, they don’t understand the processes and procedures because no one has tested them, or they aren’t documented anywhere. That blows out their recovery time objectives, which adds pressure in a situation where they are trying to get mission-critical platforms back online as quickly as possible.
And it’s not just about testing once and being done. This is something that needs to be done multiple times throughout the year.
Lilac Schoenbeck: That’s exactly right. Testing is obviously a part of the broader business continuity plan, and there is usually a fairly long runbook of things that have to happen in an emergency.
I saw a study a while back that said the primary reason HA and DR systems fail is that people simply don’t press the button, because they aren’t confident they know what will happen when they do. And that really comes back down to testing. If you run the test and you run the play, then when it’s game time, you know exactly what you’re doing. And if you haven’t, you’re going to have a lot of concern that you’re shutting down a primary system and may not have a recovery destination. That’s why it’s absolutely important that everybody involved is really comfortable with the plan.
TPM: I want to get some demographics here for IBM i in particular—I have my own sense of the first part of this question, and no idea on the second half. How many companies actually have HA or DR on their IBM i platforms? And then what percentage of them actually go through this testing process?
We’re all speaking anecdotally here, but my guess is somewhere around one-fifth to one-sixth of the base has some sort of proper HA/DR and then the rest of them are all using tape or something like Virtual Tape Libraries or disk-to-disk backup, and so forth. What I don’t know is once they get the HA or DR software, is it the 80/20 rule where 80 percent of them don’t hit the button? And what do you do to change that psychology?
Lilac Schoenbeck: It’s a really good question. We’ve asked a number of analyst firms that, but it’s hard to get a number for how many companies have actually implemented HA in their environment. And then the denominator shifts around, too: Is it on every system? Is it just a production system? Do you count LPARs? There are a lot of different sub-questions associated with that big question. I think that broadly, somewhere between a 20 percent and 30 percent penetration rate across the board wouldn’t be unusual. The fallback position may be tape, but there are also intermediate options, such as backup-based recovery and so on.
TPM: These backups are needed and useful, but they are awfully slow when the bits hit the fan.
Lilac Schoenbeck: That’s the issue. The time it takes to recover from a backup, particularly if you’ve got a hardware failure, is very protracted. And when you have a hardware failure on an interesting piece of hardware that isn’t lying around – people don’t have Power Systems servers lying around just waiting to boot up – then you have a much longer recovery time. So, I think you’re right; it’s a challenge to recover from any backup.
I think, however, that once you get people comfortable with an HA system, and they’ve actually tested it and feel good about it, they use it much, much more than you think. Partly because disasters are a lot more prevalent than anyone is willing to let on. Outages at the system level are much more prevalent than natural disasters. And the second machine in an HA setup is useful for other things, such as for testing or staging or compliance audits or all sorts of other things, not strictly recovering from a tornado or hurricane.
TPM: Let’s move on to mistake number two, then.
Lilac Schoenbeck: Number two is that when people think about what a critical system is, they have this idea that it’s only the “platinum production systems” that need this level of protection. And then at the same time, I think it’s difficult to find an IT manager that can’t tell you a story about a large DevOps system or testing environment that went down for a period of time, and therefore their software release slipped and their targets were missed on it by weeks, if not longer.
Losing weeks on a software release is actually quite impactful to a business. And we only think “critical” if it’s a revenue-generating transactional or operational system that actually costs the business when it’s down. That’s actually not true at all. I think the second mistake is not taking an expansive view of what systems ought to be protected.
Stephen Aldous: People know exactly what their core systems are, but there are a lot of other applications and services on the periphery that are accessed by this core system. Like an API gateway, for example, that drives the website. If the API gateway is not being protected, even when you’ve got the main system back up and running – that’s great – but your website is still broken.
It’s about making sure that an organization understands the full end-to-end implications of those different systems and what is critical, and not just protecting that central core in order to truly get them back up and running in the event of a disaster. Sure, you could claim they got the core system back up; but if nothing else is up and running, then that’s still not a good thing. Companies have to make sure that the entire end-to-end is considered in their planning for such scenarios.
TPM: It is hard to imagine systems that are not critical these days, either on premises or served as SaaS from the cloud. So, on to the third thing people get wrong.
Stephen Aldous: I would say number three is protecting or replicating in the same physical location as your primary source.
We understand that it’s tough sometimes – particularly for organizations that are smaller and don’t have a secondary location. But keeping your backup and HA platform in the same building as your primary system is as good as keeping two keys on the same keychain. It is likely that when a DR scenario impacts the building, such as a power outage or a natural disaster like flooding, it will also impact the secondary system. So, you’re out of luck in trying to get that back online, because both have been impacted at the same time! The same holds for keeping your tapes and backups in the same location as your source. Something can happen, and then you’ve lost both copies. Thus, enabling customers to replicate systems and perform real-time DR across different sites and locations ensures that a scenario that impacts one is not going to compromise the overall HA outcome.
TPM: When I think about this, I think HA local – with some distance, of course – but DR to the cloud. That DR in the cloud seems to be getting easier now that there are so many suppliers of Power Systems in the cloud running IBM i now. So, you can actually put your HA software of choice in the datacenter for local replication and you can replicate using the same or even a different piece of software to the cloud. Are you advising people to think about it that way?
Stephen Aldous: Certainly. We advise people not to keep everything in the same location. We already replicate across sites with Rocket’s iCluster platform, and we’re working on improved cloud support for customers who have Skytap or IBM Cloud as the service provider hosting their IBM i platform and want to replicate to another IBM i instance there.
TPM: Alright, let’s do one more then. Give me a fourth thing that IBM i shops get wrong with their HA, DR, and data protection strategies.
Stephen Aldous: This one plays into the testing. Things go wrong, and they do so all the time. You have network glitches, you have a storage-array failure or a RAID-array failure. But you have to verify and validate the data that you are backing up to make sure it represents exactly what was on the source machine so that, in the event of a disaster, it’s usable in the future.
Providing things like sync check and byte-by-byte verification, which let you certify that the backup copy is good and a true reflection of what’s on the source, is vital. You may think you’re safe, but things go wrong that are outside of your control, especially when you’re dealing with the cloud and latencies between systems; packets may get lost along the route. When disaster strikes and you hit that button, if you find there’s a problem with the data and something is missing or corrupted, you are in trouble. Then you are talking about going back in time with backups, which gives you a different recovery point, and that could have significant financial ramifications depending on the type of transactions that system is processing.
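The verification idea Aldous describes can be sketched in general terms: hash the source object and its replicated copy, and flag any mismatch before a disaster forces the question. The following is a minimal illustration of that checksum approach, not Rocket iCluster’s actual implementation; the function names and the chunk size are assumptions for the sake of the example.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks
    so that large objects do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_replica(source_path, replica_path):
    """True only if the replica is byte-for-byte identical to the source."""
    return file_digest(source_path) == file_digest(replica_path)
```

Run periodically against every replicated object, a check like this turns silent corruption (a dropped packet, a partial write) into an alert you can act on while the source is still healthy, rather than a surprise at failover time.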
This content is sponsored by Rocket Software.
Lilac Schoenbeck is the vice president of go-to-market strategy in the Power Systems division at Rocket Software. With two decades of experience in enterprise software, data center technology and cloud, she focuses on the IBM i and IBM Power and how Rocket can best meet the unique needs of these core IT markets.
Stephen Aldous is senior director of product management. Stephen has two decades of experience in product management of enterprise technologies and, prior to that, sales engineering roles that drove his empathy for customers and understanding of how products succeed within the sales process.