Thoroughly Modern: Time To Develop Your IBM i HA/DR Plan For 2022
December 6, 2021 Josh Osborne
Disasters and unplanned downtime can hit any company at any time – regardless of where its applications or data sit. Establishing a high availability and disaster recovery (HA/DR) plan is paramount to getting your mission-critical IBM i applications and your business back up and running – fast!
2021 saw its fair share of challenges: everything from bushfires to earthquakes to flash floods, plus the continuation of the pandemic. And those are just the ones Mother Nature sent our way. Man-made risks in the form of cyberattacks and ransomware also grew exponentially in 2021. With over 30,000 security breaches occurring every day, organizations need to protect themselves and be prepared to recover from anything and everything.
So, some of the big questions to ask yourself are:
- How quickly could you recover if an incident happened?
- What impact would a disaster have on your business?
- How confident are you that you could even restore all your systems and data if a disaster struck?
We always joke that putting your backup in the trunk of your car is not a plan. And even having a plan is not always enough to ensure recovery. Have you ever actually run a test and proven that you can recover? Does your plan minimize potential downtime, and is it optimized for recovery costs? Everyone needs a recovery plan, so how do you develop the right one for you?
There are a few key things you’ll need to consider, but the big ones are risk tolerance, recovery requirements, and understanding the different strategies and which ones will work best for you.
Risk tolerance can be uncovered by understanding your recovery point objective (RPO) and your recovery time objective (RTO). The RPO defines how much data the company is willing to have exposed or at risk; the RTO defines how much downtime the company is willing to endure. Both of these elements will drive the options and costs for your HA/DR strategy.
As you look at different strategies, it is safe to assume that the smaller the recovery point, the higher the cost – seconds will be more expensive than minutes, minutes more expensive than hours, and so on. That’s why it is critical to understand the real cost and impact of downtime to the business, identify which applications, data, and systems are critical, and agree with the business on what the recovery time needs to be.
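To make that tradeoff concrete, here is a hypothetical back-of-the-envelope comparison in Python. Every figure below – the tier names, solution costs, the $2,000-per-hour downtime cost, and the incident rate – is an illustrative assumption, not vendor pricing:

```python
# Illustrative sketch: expected yearly outage cost for three HA/DR tiers.
# All dollar figures and rates are made-up assumptions for the arithmetic.

def annual_risk_cost(rpo_hours, rto_hours, downtime_cost_per_hour,
                     incidents_per_year):
    """Each incident costs the recovery time (RTO) plus the lost-data
    window (RPO), valued at the business's hourly downtime cost."""
    return incidents_per_year * (rpo_hours + rto_hours) * downtime_cost_per_hour

tiers = {
    "weekly tape":       {"rpo": 168.0, "rto": 72.0, "solution_cost": 5_000},
    "daily VTL replica": {"rpo": 24.0,  "rto": 8.0,  "solution_cost": 25_000},
    "SAN replication":   {"rpo": 0.25,  "rto": 1.0,  "solution_cost": 80_000},
}

for name, t in tiers.items():
    risk = annual_risk_cost(t["rpo"], t["rto"],
                            downtime_cost_per_hour=2_000,
                            incidents_per_year=0.5)
    print(f"{name:>17}: risk ${risk:,.0f}/yr + solution ${t['solution_cost']:,}/yr")
```

Run something like this against your own downtime cost and incident estimates: the cheapest solution often stops being cheapest once the risk side of the ledger is counted.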
A few key things to think about:
- Which applications and data are critical to your business?
- Who uses them and how do they use them?
- Where do those applications and data sit?
- How are they being managed, accessed and kept secure?
- How are they backed up and what would be required to recover them?
- How long can they be down?
Understanding The Differences Between HA And DR
Many people look at disaster recovery as a baseline and high availability as the gold standard. The reality is that disaster recovery should be looked at as more of a “restore the whole system from media onto other media” scenario.
High availability is meant to keep the system available, but it doesn’t necessarily mean in a separate datacenter or a separate site. In fact, for a lot of companies, high availability is a local replication to ensure that machines can operate even if there is a crash of some sort or to allow for patching of live systems, and disaster recovery is how you get workloads running in a remote site if your primary site is wiped out by a natural disaster (hurricane or flood or fire) or unnatural disaster (like a ransomware hack).
If you’ve got two machines side by side in the same datacenter that let you move workloads back and forth for maintenance, that can be a high availability solution. You still need to have a disaster recovery solution in place. In this example, if the building where your servers reside gets hit by a storm, or the main lines go down and connectivity is lost, you still need a way to restore access.
However you want to define HA and DR, the point is this: everyone needs a plan for both. There is no excuse in 2022 for not having both, and every IBM i shop is by definition running mission-critical, business-critical applications that need to be protected in an increasingly hostile world.
Tape backups are the absolute minimum when it comes to having a disaster recovery plan. To be fair, tape solutions can be sophisticated, but most implementations are not. Many companies rotate the tapes through their tape drives on a weekly basis rather than daily: if they’ve got a library, they stick all seven tapes in, and the next week they take all seven tapes out. Those tapes sit there in that library, and if they’re not taken offsite, you’re not necessarily protected.
The recovery point objective of a tape-based solution is really defined by the frequency and the quality of the backups you’re doing. If you’re taking a full backup every day, then your RPO is 24 hours or less: whenever the last backup was taken is the point you could get back to. If your full saves are done once a week, that stretches out toward seven days. An RPO of seven days really means the business has decided it can survive losing as much as seven days’ worth of data, plus the rebuild time on the system.
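That arithmetic is simple enough to sketch. The worst-case data loss falls straight out of the backup interval, and the rebuild side depends on how fast you can pull data back off tape. The 300 GB/hour restore rate below is an illustrative assumption, not a benchmark:

```python
# Hypothetical sketch of tape RPO/RTO arithmetic; the restore
# throughput is an assumed figure, not a measured one.

def worst_case_rpo_hours(backup_interval_hours):
    # Worst case: the disaster strikes just before the next save,
    # so everything since the last backup is lost.
    return backup_interval_hours

def restore_time_hours(data_gb, restore_gb_per_hour=300):
    # Rebuild time scales with system size and tape throughput.
    return data_gb / restore_gb_per_hour

print(worst_case_rpo_hours(24))            # daily saves: up to 24 hours lost
print(worst_case_rpo_hours(168))           # weekly saves: up to 7 days lost
print(round(restore_time_hours(2000), 1))  # 2 TB system: 6.7 hours of restore alone
```

And that restore figure is before tape handling, courier time, and rebuilding the environment around the data, which is why tested plans matter.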
Here is a good customer example of what not to do. One of our clients who had a weekly backup strategy experienced a disaster. When they went to do a recovery, they realized they had many different tapes, and were unsure of what was on them, so they reached out to us for help.
They showed up with their backup machine and 200 backup tapes but no idea what was on them. Our team cataloged and looked through the contents of all the tapes, and what we found is that their backups hadn’t actually been running for eight months. Here we have a resource problem: because they didn’t have an IBM i resource, the backups hadn’t been running. Fortunately for everybody, the story has a happy ending: we were able to get the data off the hard drives and get the machine back up and running. But they lost 15 days’ worth of data because the system of record was down for 15 days. They didn’t have a DR plan that they had tested. They hadn’t validated it.
The ironic part is that once you get into the ransomware conversations, tape is one of the few solutions that actually satisfies the security officer’s requirement for an air gap: protecting the data in a way that it can’t be accessed over the network.
Virtual Tape Library
A VTL is basically a disk-to-disk backup where an appliance emulates a tape library. Your IBM i sees the VTL as a tape library with tapes in it, even though behind it sits a much faster disk array.
I’m not sure which part of the VTL I like more: the fact that there are no tapes to actually handle, or the replication. The VTL can automatically replicate the data offsite to another appliance using a reasonably low amount of bandwidth, because it deduplicates the data first. Replication finishes almost as soon as the backup is done. With the right appliance configuration, the data is replicated offsite, so now you get your data offsite every day with your backups.
Pair this with a disaster recovery contract or having another machine you can recover to and you have a pretty solid DR plan.
These are all things that Abacus and Fresche are doing for customers. We work with companies to help them build their plans, remotely manage the backups using the VTL, replicate the data over to our side, provide the infrastructure for recovery, and do an annual DR test on their behalf. It’s all turnkey.
This is where we really see the RPO/RTO numbers start to move: software-based replication. The software-based options for HA include solutions like MIMIX, iTera, Maxava, and QuickEDD, and companies like Abacus support all of them where appropriate for the customer. We believe in choice.
One big problem with software-based replication is that it’s expensive, although admittedly it has gotten a lot less expensive over time, and the cost of secondary machines has also come way, way down. Most of the solutions out there also consume a lot of overhead on the system, they require more expensive expertise to manage, and then there’s licensing. And you have to have two machines.
You have also got to have journal management, so it’s consuming more disk space, more processor, more licensing, and the network between the two machines. It’s not my favorite solution, but it definitely works and has a place in the market; it’s a good solution where appropriate. You can usually take backups from the target system to keep your production system up and running, but software replication doesn’t replicate everything: you have to tune and manage what gets replicated, and parts of the operating system and other objects don’t get replicated at all. The recovery point objective depends on how much bandwidth you have between the two machines and how busy the production machine is. If the machine is busy, the journal entries are buffered and the data is not pushed over the wire right away to the secondary machine. We see machines that get hours and hours behind during the day, and then at night when things are quiet, they catch up. If you have busy times at night, maybe it catches up during the day. But it’s not really the “less than one hour” that marketing often leads people to believe.
This is why one of my favorite HA solutions is hardware-based replication on the storage area network (SAN). This replication is done at the storage layer, so all of the data that is in the system – the operating system, the PTFs, every block of data that’s part of your machine – is replicated by the SAN under the covers. The IBM i platform has no idea this is going on.
You can replicate data to a local SAN, or you can push it to a remote SAN that is in turn attached to a backup server that can run workloads if need be, or push full backups to other media, all without impacting the performance of the production machine in the primary datacenter. A great deal of compression takes place to make this possible, and hardware replication is gaining in popularity because storage costs are coming down and the improved algorithms make it more attainable. For instance, you no longer need a 10 Gb/sec circuit to replicate data, which helps out a lot. The recovery point objective, depending on the configuration, is so good that you can get it down to synchronous mirroring, which means every block of data written to the production storage is also written to the target storage. Most customers go for a sub-15-minute recovery point instead, because synchronous mirroring can carry a performance hit depending on how far apart the two datacenters hosting the replicated SANs are.
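As a rough illustration of why the big circuit is no longer needed, here is a hypothetical bandwidth sizing sketch. The daily change rate and the 3:1 compression ratio are assumptions; real sizing depends on the workload and the array’s algorithms:

```python
# Rough sizing sketch for asynchronous SAN replication bandwidth.
# Change rate and compression ratio are illustrative assumptions.

def avg_replication_mbps(changed_gb_per_day, compression_ratio=3.0):
    """Average circuit bandwidth (megabits/second) needed to keep
    pace with a steady daily change rate after compression."""
    compressed_gb = changed_gb_per_day / compression_ratio
    megabits = compressed_gb * 8_000   # 1 GB = 8,000 megabits (decimal)
    return megabits / 86_400           # spread across 24 hours

print(round(avg_replication_mbps(500), 1))  # 500 GB/day changed: ~15.4 Mb/s average
```

Bursts during batch windows need headroom above that average, which is exactly why real-world recovery points drift during busy periods.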
Importantly, the SAN also allows for this other option: To air gap the storage. So we can take snapshots daily, hourly – pick your number of snapshots on the production machine – to give customers rollback points just in case there ever is a ransomware event.
Let’s say we take a daily snapshot on the production machine, and Monday morning we come in and find that we have been compromised. We have the previous six days of images of the machine, and we can turn on any one of those and then use the backups we did to extract the data that we may have put at risk. So there are very, very good options around the recovery point objective and the recovery time objective. There are other options too, such as data replication with Db2 Mirror, or things you can do around journaling or iASPs.
Here is a funny problem you must be aware of: the faster the replication, the faster the ransomware’s damage is replicated too. Hence the need to air gap early and air gap often. You need to keep lots of point-in-time replicas for a long time to ensure you can get back to a clean copy, and you have to have some way of determining that you have a clean set of data and applications to restore from. We can help with that.
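One way to keep lots of point-in-time replicas for a long time without keeping everything forever is a tiered retention policy. The sketch below is hypothetical, and the tier lengths (hourlies for a day, midnight dailies for a week, Sunday weeklies for 60 days) are illustrative, not a recommendation:

```python
from datetime import datetime, timedelta

# Hypothetical tiered snapshot retention: even ransomware that sat
# undetected for weeks still leaves an older, clean rollback point.

def should_retain(snap: datetime, now: datetime) -> bool:
    age = now - snap
    if age <= timedelta(days=1):
        return True                                    # keep every hourly snapshot
    if age <= timedelta(days=7):
        return snap.hour == 0                          # keep midnight dailies
    if age <= timedelta(days=60):
        return snap.hour == 0 and snap.weekday() == 6  # keep Sunday weeklies
    return False                                       # prune everything older

now = datetime(2022, 1, 10, 9, 0)                        # a Monday morning
print(should_retain(datetime(2022, 1, 10, 3, 0), now))   # True: 6-hour-old hourly
print(should_retain(datetime(2022, 1, 6, 0, 0), now))    # True: 4-day-old daily
print(should_retain(datetime(2022, 1, 6, 15, 0), now))   # False: pruned hourly
```

The same policy engine can then feed the validation step: the oldest retained snapshot sets how long an infection can sit undetected before you lose your last clean copy.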
IBM i shops have to take this seriously, starting right now. Ransomware is the most common disaster we will see in 2022. We have been doing disaster recovery and high availability for 30 years, and the number of real disasters we have handled – not tests, but real disasters for customers who have had buildings wiped out and things of that nature – is on the order of two dozen. In 2021 alone, we assisted in ransomware recoveries for 30 customers. This is the reality of the situation.
With Abacus, we know how to set up HA and DR systems, and we have managed services that can help you run this – we can even host IBM i in the cloud as part of your HA/DR setup. On the Fresche side, we can help you with the applications running there. We are happy to get on a call and walk through what we can do together to build you a better HA/DR strategy for 2022. Contact us at email@example.com
Josh Osborne is chief technology officer at Abacus Solutions.
This content was sponsored by Fresche Solutions.