DR Plans Keep Bad Things from Happening to Good Companies
May 16, 2011 Richard Dolewski
See a graphic overview of Richard’s Phased Approach to Disaster Recovery Planning here.
Some take it for granted that a disaster will never strike their organization. Rather than developing proactive solutions for such an event, their focus usually falls on other corporate IT deliverables. How about you? Does your company have a comprehensive IBM iSeries disaster recovery plan that would allow your business to function in the event of a disaster?
You can’t surf the Web or turn on the evening news without hearing about turbulent weather-related events. The disasters FEMA has declared over the past two years have included severe flooding in 14 states, tornadoes in record numbers, devastating tropical storms, and record winters, all of which should make companies realize just how vulnerable they are. This year, weather forecasters are again predicting severe weather that will no doubt become newsworthy. Let’s hope the weatherman is wrong on this one!
But there are other dangers to your company besides extreme weather. Everyday life is filled with incidents that can disrupt your business, and these typically come as a complete surprise. Power outages, broken water mains, power shortages and brownouts, and hardware failures (yes, even the IBM i does break down) can all have major impacts on today’s business needs.
IT: More Important than Ever
It was not long ago that most companies were open for business Monday through Friday between 9 a.m. and 5 p.m. Customer transactions were usually conducted in person, over the phone, or on paper. So if some form of disaster shut down computing services for a day or even a few days, we simply continued working in a manual business mode. In other words, business as usual.
Today we are all faced with new realities. Employees, customers, and suppliers are interrelated. With a global economy and the use of electronic commerce, there is no end to a business day; we’re always open for business! Systems are no longer isolated; instead they interact with other systems to complete a transaction, regardless of hardware platforms. Key business initiatives such as ERP, supply chain management, customer relationship management, email, and e-business have all made continuous access to information crucial. To successfully manage business continuity during a disaster and restore normal operations, organizations require a proper disaster recovery (DR) plan.
Defining Your DR Plan Objectives
As you document your disaster recovery plan, it’s important to understand the goals of your organization in the event of a major system interruption. Aligning your information technology deliverables with the requirements and objectives of your business is imperative to obtain a common recovery goal. Internal politics can complicate recovery efforts. Differing opinions about which information and supporting technologies are important can paralyze your DR planning initiatives at the outset, so everyone must be aligned with the common goals of the business.
Talk to your business executives and ensure they are fully aware of what IT can deliver. Don’t be afraid to ask the difficult questions before you document your plan. Disaster recovery is a combination of how long you wait to bring your business back up (Recovery Time Objective or RTO) after a failure, how much data the company is willing to lose in the process (Recovery Point Objective or RPO), and finally, how much system availability you are willing to pay for, which leads us to return on investment (ROI).
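As a rough illustration of how these objectives can be written down and checked, the following Python sketch compares the result of one recovery test against agreed targets. The class name, field names, and sample numbers are assumptions made for this example; they do not come from any particular plan.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    """Targets agreed with the business (hypothetical names and units)."""
    rto_hours: float  # maximum acceptable time to restore service
    rpo_hours: float  # maximum acceptable window of lost data


def meets_objectives(objectives, recovery_hours, data_loss_hours):
    """Return (rto_ok, rpo_ok) for one recovery test result."""
    rto_ok = recovery_hours <= objectives.rto_hours
    rpo_ok = data_loss_hours <= objectives.rpo_hours
    return rto_ok, rpo_ok


# Example: a 24-hour RTO and a 4-hour RPO, checked against a test in which
# recovery took 30 hours and 2 hours of data were lost.
targets = RecoveryObjectives(rto_hours=24, rpo_hours=4)
result = meets_objectives(targets, recovery_hours=30, data_loss_hours=2)
print(result)  # (False, True): the RTO was missed, the RPO was met
```

Keeping the two checks separate makes it clear which objective a test missed, which matters when deciding whether the fix lies in the recovery procedure or in the backup design.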
Planning objectives must:
Three Cs of a Successful DR Plan
Disasters come in many forms and will create numerous business interruption challenges. The goal for companies with no business tolerance for downtime is to achieve a state of business continuity where critical systems and networks are continuously available, no matter what happens.
At the end of the day, your disaster recovery plan has to be just three things:
Your DR plan must be detailed enough to pass the document over to any IBM i professional and enable them to recover your servers based on just the information supplied. The plan must spell out every step so it can be followed like a recovery roadmap. That means no implying or reading between the lines. It’s a mistake to rely on specific individuals during a disaster, as they may be unavailable. The plan must be able to stand on its own.
The plan must cover all critical aspects of your business, including hardware platforms, business processes, and network recovery elements required to meet business objectives.
Both technical recovery and management aspects should be clearly outlined, including answers to these questions:
Ensure that your plan is up-to-date. It is inevitable in the changing technology environment that a disaster recovery plan will become outdated and unusable unless someone keeps it updated with all the major technology changes in your organization. Changes that will likely affect the plan fall into several categories:
As changes occur in any of the areas mentioned above, all revisions must be incorporated into the body of the plan and distributed as required.
Backup and Recoverability
Everyone should be aware of the importance of backing up critical data. If you aren’t aware, you’re likely to become painfully aware on the day after a server crash, when you have no recoverable data. As a systems administrator, you need to bring all your key processes and procedures together through a backup solution that is reliable and recoverable. Adequate backups protect against permanent loss of data; however, the time it takes to rebuild and recover from a disaster can cause catastrophic business consequences if proper procedures are not in place.
A traditional DR plan focuses on the technology recovery. Most companies have some form of data backup methodology in place. Unfortunately, executives often assume their IT department’s current data backup solution is capable of supporting key business functions even when a disaster strikes. Nothing could be further from the truth. You need to ensure that all critical servers are being backed up on a reliable and consistent schedule. This process, though it seems obvious, is often not exercised in IT shops. Assigning backup responsibilities to an administrator is not enough. The IT department needs formal written schedules that describe which systems get backed up, when the backups are executed, whether they are full or incremental, and how backup completion is verified.
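One way to keep such a schedule checkable is to record it as structured data. The sketch below is a minimal example under that assumption; the record layout, the system names (PROD01, WEB01), and the verification wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackupScheduleEntry:
    system: str         # which system is backed up
    days: tuple         # days the backup runs, e.g. ("Mon", "Tue")
    start_time: str     # scheduled start, 24-hour clock
    backup_type: str    # "full" or "incremental"
    verification: str   # how completion is verified


def unscheduled_systems(critical_systems, schedule):
    """Return critical systems that have no entry in the written schedule."""
    covered = {entry.system for entry in schedule}
    return sorted(set(critical_systems) - covered)


schedule = [
    BackupScheduleEntry("PROD01", ("Sun",), "01:00", "full",
                        "operator reviews job log"),
    BackupScheduleEntry("PROD01", ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
                        "01:00", "incremental", "operator reviews job log"),
]

# WEB01 is listed as critical but never appears in the schedule.
missing = unscheduled_systems(["PROD01", "WEB01"], schedule)
print(missing)
```

A check like this only confirms that every critical system appears in the schedule; it says nothing about whether the backups actually ran or can be restored.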
Every backup strategy must be examined closely to ensure both the system and the data are indeed recoverable. Many companies design a backup program and then go no further. A natural tendency is to recover from last night’s backup tapes and then try to bridge the gaps. In other words, recovery by “reverse engineering.” A better approach is to design an effective recovery strategy first, followed by a backup strategy that will support the desired recovery objectives. Ensure your RPOs are examined from the perspective of both a mid-week and a weekend failure, as these will most certainly require different backup program designs.
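The difference between a mid-week and a weekend failure can be estimated with simple arithmetic. The sketch below assumes a hypothetical schedule in which backups complete at 02:00 Tuesday through Saturday and repeat every week; it reports the elapsed hours since the last completed backup, which is an upper bound on the window of changes that would have to be re-created.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
HOURS_PER_WEEK = 7 * 24


def week_hour(day, hour):
    """Convert a day name and hour of day to hours since Monday 00:00."""
    return DAYS.index(day) * 24 + hour


def hours_since_last_backup(backup_hours, failure_hour):
    """Elapsed hours between the most recent backup and the failure.

    Both arguments are week hours (0 to 167). The modulo handles a failure
    that occurs before the first backup of the week by wrapping around to
    the previous week's last backup.
    """
    return min((failure_hour - b) % HOURS_PER_WEEK for b in backup_hours)


# Hypothetical schedule: backups complete at 02:00, Tuesday to Saturday.
backups = [week_hour(d, 2) for d in ("Tue", "Wed", "Thu", "Fri", "Sat")]

midweek = hours_since_last_backup(backups, week_hour("Wed", 16))
weekend = hours_since_last_backup(backups, week_hour("Sun", 16))
print(midweek, weekend)  # 14 38
```

Under these assumptions, a failure on Sunday afternoon exposes far more elapsed time than one on Wednesday afternoon, which is why the two cases may call for different backup designs.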
It is imperative that your disaster recovery plan specify, in detail, when backups are executed, the complete contents of your backup, frequency of your backups, and your vaulting procedures. Supporting documentation should be included for each section. In addition, an effective DR plan should detail all the steps required to support the process of restoring data, within timeframes dictated by business objectives.
There are several recovery strategies that can be employed. The most reliable and cost-effective is either a vendor-supported hot site or an internal hot site in another location within the same organization. You must also comply with the regulations that govern your organization.
Traditional DR plans provide 24- to 48-hour disaster recovery for mission critical applications. With technology linked closer than ever to business processes, and with costs of outages escalating, many organizations are seeking shorter recovery times for critical applications. Use of high availability or clustering techniques is increasing, especially for ERP and e-business applications. RTOs and RPOs are now calculated in minutes and hours rather than days.
For high availability implementations, the recovery point is always what was last transmitted and received by the backup server. Any changes that haven’t been completely transmitted will be lost. One factor to consider is your line capacity; insufficient bandwidth can cause latency and a backlog of transactions on the source system. The use of object journaling with synchronous replication for logical copies can help ensure that changes to critical data are transmitted as they occur. If multiple techniques are used by the logical copy operation, delays can result in lost changes for objects that aren’t journaled. With 24/7 systems availability requirements, a hot site solution may not be good enough.
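The effect of line capacity can be approximated with a back-of-the-envelope model. The sketch below assumes asynchronous replication, constant rates, and an initially empty backlog; the function names and the sample figures are illustrative only, and real replication products behave in more complex ways.

```python
def replication_backlog_mb(change_rate_mb_s, line_capacity_mb_s, seconds):
    """Untransmitted data (MB) after a period of sustained change activity.

    Simplified model: constant rates, starting from an empty backlog.
    """
    surplus = max(change_rate_mb_s - line_capacity_mb_s, 0.0)
    return surplus * seconds


def drain_seconds(backlog_mb, change_rate_mb_s, line_capacity_mb_s):
    """Seconds needed to clear a backlog, or None if it can never clear."""
    spare = line_capacity_mb_s - change_rate_mb_s
    if spare <= 0:
        return None
    return backlog_mb / spare


# One hour of changes at 12 MB/s over a line that carries 10 MB/s.
backlog = replication_backlog_mb(12.0, 10.0, 3600)

# Afterwards the change rate drops to 4 MB/s on the same line.
catch_up = drain_seconds(backlog, 4.0, 10.0)
print(backlog, catch_up)  # 7200.0 1200.0
```

In this model the backlog at any moment is the data that has not yet reached the backup server, so it approximates what would be lost if the source system failed at that point.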
From a transaction perspective, the recovery point is the same for all solutions and is dependent on the application’s use of techniques such as commitment control to ensure transaction integrity. When commitment control boundaries are required, the recovery point includes only those transactions that were completed and, for logical copies, only those transactions that were successfully transmitted, received, and accepted on the backup server. Without commitment control, the recovery point may include changes from partial or incomplete transactions, which then must be manually reconciled before you restart your applications following an outage.
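The idea of a recovery point bounded by commit boundaries can be illustrated in a few lines. The sketch below is plain Python with a made-up entry format; it does not represent how IBM i journaling or commitment control is implemented, only the rule that changes count toward the recovery point when their transaction's commit has also been received.

```python
def committed_changes(received_entries):
    """Keep only changes whose transaction has a received commit entry."""
    committed = {e["txn"] for e in received_entries if e["type"] == "commit"}
    return [e for e in received_entries
            if e["type"] == "change" and e["txn"] in committed]


# Entries received by the backup server before the outage (hypothetical).
# Transaction 2 was still in flight, so its commit never arrived.
received = [
    {"txn": 1, "type": "change", "detail": "debit account A"},
    {"txn": 1, "type": "change", "detail": "credit account B"},
    {"txn": 1, "type": "commit"},
    {"txn": 2, "type": "change", "detail": "debit account C"},
]

recoverable = committed_changes(received)
print([e["detail"] for e in recoverable])
```

Here the partial change from transaction 2 is excluded. Without commit boundaries there would be no way to tell it apart from completed work, which is the manual reconciliation problem described above.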
The shift from an emphasis on data recoverability to continuous application availability is natural and logical. The traditional approach for deploying high availability has been to replicate data from one system to another with data recoverability as the primary objective. A common goal today, however, is 100 percent availability against any business disruption, and a positive ROI for your business.
Criticality of Servers
A hierarchy of all critical business service offerings and the infrastructure to support these applications must be determined. It’s a reality that not all applications or all servers that reside in your computer room are of equal importance to your business. To reflect that reality, you should assign a priority to your various applications.
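A priority assignment of this kind can be captured as a simple ranked list. The sketch below is illustrative only: the application names, priority numbers, and server names are hypothetical, and a lower number means the application is recovered earlier.

```python
def recovery_order(applications):
    """Sort applications by priority (1 = most critical), then by name."""
    return sorted(applications, key=lambda a: (a["priority"], a["name"]))


def servers_required(applications, max_priority):
    """Servers needed to run every application at or above a priority."""
    return sorted({a["server"] for a in applications
                   if a["priority"] <= max_priority})


applications = [
    {"name": "reporting", "priority": 3, "server": "RPT01"},
    {"name": "order entry", "priority": 1, "server": "PROD01"},
    {"name": "email", "priority": 2, "server": "MAIL01"},
    {"name": "warehouse", "priority": 1, "server": "PROD01"},
]

ordered = [a["name"] for a in recovery_order(applications)]
critical_servers = servers_required(applications, max_priority=1)
print(ordered)
print(critical_servers)
```

Listing the servers behind the top-priority applications gives a starting point for sizing the minimum equipment needed at a recovery site.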
One of the first steps in putting together a DR plan is to identify the criticality and number of servers required in a disaster. While you obviously want your mission-critical servers to run on the exact same equipment, in an emergency many people will suggest that any equipment is better than none. It’s important to define your minimum requirements specifications with your mission-critical applications in mind.
Many companies take their installed IT infrastructure for granted and assume that all major components of every computer room are correctly integrated, properly managed, fully compliant, and tested regularly. This is a common mistake. In an evaluation of your computing facility’s design, it’s crucial to look for all single points of failure. To do this, perform a site vulnerability assessment that examines the current operating environment, investigates all system-level components, infrastructure, environmental factors, and backup processes, and determines your capability to recover all critical systems.
The major components to investigate for a vulnerability study include:
Testing Your DR Plan
Once your organization’s DR plan documents have been assembled and distributed, the plan must be tested. Testing is an essential part of developing an effective disaster recovery strategy; it will identify where a company’s DR plan may fall short and assist you in finding ways to better prepare for possible future disruptions.
The worst way to test a disaster recovery plan is to wait for an actual event to occur. A real disaster, where stress and high emotions often rule the day, is not an environment conducive to learning or to gathering data.
Disaster testing is more than just going through the motions–it requires postmortem analysis of every test to identify where the plan has failed. The failure might not be due to a bad plan; it could be the result of changing business conditions or the performance of an outside organization, such as a hot site or communications provider. Testing is an exercise involving testing objectives, scenarios, evaluation, and remediation. Learn what’s important in a DR test and how to make such tests effective, and you’ll be well positioned to ensure your organization’s survival in a calamity.
In today’s time-sensitive, mission-critical economy, every organization should have a documented and tested DR plan. When building your DR solution, a management commitment to building a recovery strategy will protect your company against the greatest risk of all: complacency. Investing in DR keeps bad things from happening to good companies.
Richard Dolewski is chief technology officer and vice president of business continuity services for WTS. Richard is a certified systems integration specialist and disaster recovery planner and is globally recognized as a subject matter expert for business continuity for IBM iSeries and i5 environments. An author and frequent technical contributor to publications such as COMMON and iSeries Magazine, Richard is a winner of numerous speaking awards, including COMMON’s Impact Award, and is a member of COMMON’s Speaker Hall of Fame. Richard’s book, System i Disaster Recovery Planning, is available from MC Press and Amazon.