Weighing Your Disaster Recovery Test Plan: The Tale of Two Test Cases

Disaster recovery can be a business lifesaver. It can also lead to a realization that great technology alone isn’t enough without the right test plan.

Here we’ll share the story of two organizations that recently implemented the same disaster recovery solution, each one harnessing the same IBM i technology. Both organizations planned for a seamless failover with near-zero impact on users. Yet the results for each organization were drastically different. Why?

Missteps are common in the planning phase. IT teams have the difficult job of keeping everyone focused long after the technology budget is settled. What happens next is often the most critical and overlooked piece of the disaster recovery solution: testing, testing and more testing.

Extensive testing is essential to ensure that when disaster strikes, your organization achieves its recovery time objectives (RTOs) and recovery point objectives (RPOs).

Read more about RTO and RPO in our related blog, The Ultimate Guide to Disaster Recovery.

While every organization may think they’d like to have near-zero RTO or RPO for ALL applications, the decision should really be based on your business priorities. The key is determining application and data priority across your business units. Your RTO and RPO limits are best determined in discussion with business stakeholders. These discussions can have a significant impact on the results of any disaster recovery program. (Discovering an application's RTO during a disaster is not an ideal process for any organization!)

Yet despite the importance of a holistic approach to planning, time is a precious commodity – and asking business units to help plan for a potential disaster is challenging. Just remember, without involvement from your business, IT will need to make some hard choices about RTO and RPO thresholds. Some applications could be offline for days without a material impact on the company. In other cases, an application being unavailable during the workday could have a crippling effect. Here’s a look at those two different real-life outcomes in action.

Test Case 1: A Planned Approach Involving Business Stakeholders

In this case, a manufacturing organization allowed plenty of time for planning and testing. Even in the early phases, the disaster recovery approach involved a thoughtful discussion with business leaders. A framework was created and agreed upon before the disaster recovery (DR) program was finalized, leaving no question about what was expected from the DR program.

This organization set up short weekly tests during business hours that didn't impact production. Over four weeks, the IT team tested all elements of the DR plan against their established goals to understand all the scenarios and ensure they were meeting the set objectives.

When it came time for the business leaders to test the environment, the process was well understood. Interestingly, even after four weeks of testing and re-testing, one of the prioritized applications failed during the test. Had this been an actual disaster, the impact on the business could have been material. However, the test surfaced the errors, and corrections were made to protect the RTOs. The result? The organization is prepared and confident in its ability to rebound from a crisis.
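Checking test results against agreed objectives, as this team did each week, can be sketched in a few lines of Python. This is a minimal illustration, not the organization’s actual tooling: the names Objective, DrillResult and find_misses, the application names, and every number below are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    """Targets agreed with the business for one application (hypothetical)."""
    app: str
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable data loss, measured in time


@dataclass
class DrillResult:
    """What one DR test measured for one application (hypothetical)."""
    app: str
    recovery_minutes: int   # time until the application was usable again
    data_loss_minutes: int  # age of the most recent data that was recovered


def find_misses(objectives, results):
    """Return one message per missed RTO, missed RPO, or missing objective."""
    targets = {o.app: o for o in objectives}
    misses = []
    for r in results:
        target = targets.get(r.app)
        if target is None:
            # An application nobody prioritized is itself a planning gap.
            misses.append(f"{r.app}: no objective agreed with the business")
            continue
        if r.recovery_minutes > target.rto_minutes:
            misses.append(
                f"{r.app}: recovered in {r.recovery_minutes} min, "
                f"RTO is {target.rto_minutes} min"
            )
        if r.data_loss_minutes > target.rpo_minutes:
            misses.append(
                f"{r.app}: lost {r.data_loss_minutes} min of data, "
                f"RPO is {target.rpo_minutes} min"
            )
    return misses


# Made-up objectives and test measurements, for illustration only.
objectives = [
    Objective("order-entry", rto_minutes=60, rpo_minutes=15),
    Objective("reporting", rto_minutes=2880, rpo_minutes=1440),
]
results = [
    DrillResult("order-entry", recovery_minutes=95, data_loss_minutes=10),
    DrillResult("reporting", recovery_minutes=300, data_loss_minutes=60),
    DrillResult("badge-printing", recovery_minutes=20, data_loss_minutes=0),
]

for line in find_misses(objectives, results):
    print(line)
```

Running a comparison like this after every test keeps the conversation with business leaders concrete: each line in the output is either a missed target or an application that was never prioritized.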

Test Case 1 Takeaways

It is important to remember the reason for disaster recovery: to protect the business. A big part of protecting the company is understanding how to prioritize data and applications in a recovery scenario.

Good IT teams understand the business requirements, and great IT teams work hand-in-hand with business leaders. The same is true for disaster recovery programs. A solid disaster recovery program includes input from business operations, as shown in this case. The entire organization understands the importance of disaster recovery, is actively engaged in defining the process, and participates in making hard decisions on the RTOs and RPOs.

So what happens when the business leaders default to asking the IT team alone to define the program? You are about to find out.

Test Case 2: Disaster Recovery Decisions Made in an IT Silo

In our second scenario, an organization recognized that it needed a robust disaster recovery solution. Because the business could not dedicate time to outlining its requirements, the IT team was left to make isolated decisions on the importance of applications and data.

What is lost in this model is the complete documented view of the organization. Unless there is a defined map of the business requirements, RTO and RPO thresholds are set based on the best information IT has available – and even the best-effort business decisions are likely to yield at least a few surprises.

For Company 2, a few critical steps were missing from the planning phase of the DR program: the involvement of the business units and the users’ input. The IT team was in the position of making decisions without a clear set of business requirements. Additionally, the end users who relied on the data and applications weren’t aware of what would happen in a disaster.

What was the business impact of this lack of planning? By the time the first and only test was scheduled, there had been so much focus on the data center that no one had set up a way for employees to reach the restored applications. After a short scramble, a networking solution was implemented overnight to enable access.

The test was riddled with issues and exposed holes in the planning. The business was surprised by the results of some decisions. Overall, the company now needs to dedicate significantly more time to revisiting the previous planning stage and creating new tests.

Test Case 2 Takeaways

From this second case, there are a few critical reminders. For one, technology decisions always form the foundation – and while the importance of this cannot be discounted, this is only the first step.

The second part of any successful DR program is planning. The business needs to be part of the planning sessions. What applications are most important? How much data loss can any individual business unit within the company tolerate? Do the end users know what will happen and when? Time invested here will ultimately save you from revisiting previous decisions.

The third takeaway is to test, test and re-test. There is a reason we practice fire drills. The more you expose your team and systems to the DR program, the more likely it is that you’ll be ready when disaster does strike. And you never know when that will be. If you lose power and your systems go offline, how long until you recover? If you are hit with ransomware, can you recover without paying the ransom? What if lightning hits your data center? The situations are endless. The first organization outlined here feels confident it can recover without skipping a beat. You can do one of two things: hope that your business never faces these issues, or be ready.
