How to Build a Disaster Recovery Plan

Executive summary

How do you start building a DR plan? While there are lots of tools from vendors, it's hard to find a practical approach that comes from firsthand experience.

This 4-step DR planning framework – Business Impact Analysis, Risk Assessment, Risk Management, and DR Testing – was developed by Zetta's Director of Operations, Rich Webster, over 20 years of managing large-scale IT infrastructure environments at companies including Netscape, eBay, and Shutterfly.

How to Build a Disaster Recovery Plan

No amount of money or planning can stop some IT disasters from happening. But a good disaster recovery plan can reduce your downtime from a week or a day to hours or even minutes.

Like any important project, DR starts with planning, followed by best-practice templates and procedures, which in turn are implemented, in part, by the right tools.

In addition to identifying mission-critical applications and the infrastructure they rely on, identify the data these applications and tasks need access to.

This can include recent email, customer databases, and any documents, spreadsheets, presentations and other "unstructured" files used by project/product management, development, sales, manufacturing, etc.

Your company has accumulated a substantial amount of data over time – hundreds of gigabytes, perhaps terabytes or even petabytes. But only some – often a small fraction – of this data has to be made available again quickly.
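A rough calculation shows why the split between "all data" and "data needed quickly" matters. The sketch below estimates transfer time for a full archive versus a critical subset. The data sizes and the 100 Mbit/s link speed are illustrative assumptions, not figures from this paper, and the estimate ignores protocol overhead.

```python
def restore_hours(data_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to move data_bytes over a link, ignoring protocol overhead."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 3600


TB = 10**12  # decimal terabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes

# Assumed link speed (100 Mbit/s); substitute your own measured throughput.
LINK_BPS = 100 * 10**6

# Assumed sizes: everything accumulated over time vs. what is needed right away.
full_archive = restore_hours(5 * TB, LINK_BPS)
critical_subset = restore_hours(50 * GB, LINK_BPS)

print(f"Full archive:    {full_archive:.1f} hours")
print(f"Critical subset: {critical_subset:.1f} hours")
```

Under these assumptions the full archive takes days to bring back while the critical subset takes about an hour, which is why it pays to decide in advance which data belongs in the second group.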

Disaster recovery planning

STEP 1: Business Impact Analysis

A Business Impact Analysis (BIA) defines what capabilities your company can't operate without. This is the first step in creating a working disaster recovery plan.

A BIA must involve both top-level non-IT management, to identify and agree on the list of tasks that are considered essential, and IT management, to map these tasks against the applications, along with the associated infrastructure and other services needed to run and use those applications.

All top stakeholders must be involved in the analysis. You don't want to find out only after a site goes down that there was an additional application an executive considers essential.
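One simple way to record the result of a BIA is as two mappings: essential tasks to applications, and applications to the infrastructure they need. The sketch below is a minimal illustration; every task, application, and infrastructure name in it is hypothetical, and a real BIA would carry far more detail.

```python
# Hypothetical list of essential tasks, as agreed with non-IT management,
# mapped to the applications that support them.
TASK_TO_APPS = {
    "take customer orders": ["web store", "customer database"],
    "answer customer email": ["email"],
    "ship products": ["inventory system", "customer database"],
}

# Hypothetical mapping maintained by IT: what each application needs to run.
APP_TO_INFRA = {
    "web store": ["web server", "internet link"],
    "customer database": ["database server", "storage array"],
    "email": ["mail server", "internet link"],
    "inventory system": ["database server"],
}


def recovery_scope(task_to_apps, app_to_infra):
    """Return (applications, infrastructure, unmapped applications) for the essential tasks."""
    apps = sorted({app for needed in task_to_apps.values() for app in needed})
    # Applications named by the business that IT has not mapped yet.
    unmapped = [app for app in apps if app not in app_to_infra]
    infra = sorted({item for app in apps for item in app_to_infra.get(app, [])})
    return apps, infra, unmapped


apps, infra, unmapped = recovery_scope(TASK_TO_APPS, APP_TO_INFRA)
print("Applications to recover:", apps)
print("Infrastructure they need:", infra)
print("Not yet mapped by IT:", unmapped)
```

An application that shows up in the "not yet mapped" list is exactly the kind of gap you want to find during planning rather than after a site goes down.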

STEP 2: Risk Assessment

The second step toward a complete DR plan for your organization is to map the two types of IT infrastructure:

IT infrastructure you control, whether located in your offices or in colocation facilities.
IT infrastructure you don't control – like web and cloud services or web sites running in a hosting center.

Once the IT infrastructure has been mapped, look for single points of failure, like a server with only one network card.

These are your first places to consider "fortifying" with redundancy.
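Once the map exists, finding single points of failure can be as simple as listing every component that has only one instance. The sketch below assumes a hypothetical inventory format (a name, an instance count, and whether you control the component); it is an illustration, not a substitute for a real infrastructure review.

```python
# Hypothetical inventory: each component, how many independent instances
# exist, and whether it is infrastructure you control.
INVENTORY = {
    "database server": {"instances": 1, "controlled": True},
    "web server": {"instances": 2, "controlled": True},
    "network card (database server)": {"instances": 1, "controlled": True},
    "internet link": {"instances": 2, "controlled": False},
    "hosted web site": {"instances": 1, "controlled": False},
}


def single_points_of_failure(inventory):
    """Return the names of components that have fewer than two instances."""
    return sorted(name for name, info in inventory.items() if info["instances"] < 2)


spofs = single_points_of_failure(INVENTORY)
for name in spofs:
    owner = "you control" if INVENTORY[name]["controlled"] else "you don't control"
    print(f"Single point of failure: {name} ({owner})")
```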

STEP 3: Risk Management

To lower the risk of a data disaster occurring, fortify yourself against the most common issues and you will have protected yourself against 90%-95% of the small incidents that may impact you.

For example, says Webster, "Because of good DR planning by IT, Zetta's uptime, both for its own operations, and for the service that Zetta provides to its customers, is above 'five nines' – 99.999%, meaning total annual downtime of less than five and a half minutes – availability for over five years. From a customer perspective, we've never had more than a brief blip of unavailability."

Redundancy is one popular approach to avoiding or minimizing many IT disaster events. For example, servers, storage and network gear can be configured with two power supplies, connected in turn to separate power sources. Servers, firewalls, UPSs and other gear, even entire sites, can be duplicated. Network and electrical service can be supplied by two separate utilities, on separate cables.

Data can be stored across multiple hard drives.
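The arithmetic behind "five nines", and the effect of redundancy on availability, can be sketched in a few lines. The redundancy formula assumes that the duplicated components fail independently of each other, which real systems only approximate, and the 99% single-component figure is purely illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year for a given availability (0.0 to 1.0)."""
    return (1 - availability) * MINUTES_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    The group is unavailable only when every copy is down at the same time.
    """
    return 1 - (1 - single) ** copies


five_nines = annual_downtime_minutes(0.99999)
print(f"99.999% availability: {five_nines:.2f} minutes of downtime per year")

# Illustrative assumption: a single component that is available 99% of the time.
pair = redundant_availability(0.99, 2)
print(f"Two redundant 99% components: {pair:.4%} available")
```

At 99.999% the result is a little over five minutes a year, consistent with the "less than five and a half minutes" quoted above.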

STEP 4: DR Testing

There are only two ways to determine whether a DR plan works. One is when there's a disaster. This, of course, is the wrong time to discover that you chose wrong, or that one of your tools or services has failed, or that you didn't include a critical application.

The other way is to periodically conduct tests. "It is better to uncover a shortcoming in your infrastructure by testing failure scenarios under controlled circumstances," says Webster. "For example, in a controlled test, if you discover that a network card isn't working properly, you can halt the test, install a new card, and run the test again. If you don't discover this until a real event, you may miss your target time to restore IT service."

External audits can help identify whether there are any parts of your DR that still need work. One reason is that not all organizations will simulate a full disaster scenario, or carry through to confirm that a full recovery can be done.
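Whoever runs the test, it helps to record the measured restore time for each application next to its target, so that misses are obvious. The sketch below is one minimal way to do that; the `DrillResult` record, the application names, and all the times are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    """Outcome of one controlled recovery test (hypothetical record format)."""
    application: str
    target_minutes: float    # agreed target time to restore service
    measured_minutes: float  # time the restore actually took in the test


def missed_targets(results):
    """Return the tests in which the restore took longer than the target."""
    return [r for r in results if r.measured_minutes > r.target_minutes]


# Illustrative test results, not real measurements.
results = [
    DrillResult("email", target_minutes=60, measured_minutes=45),
    DrillResult("customer database", target_minutes=30, measured_minutes=75),
    DrillResult("web store", target_minutes=15, measured_minutes=15),
]

late = missed_targets(results)
for r in late:
    overrun = r.measured_minutes - r.target_minutes
    print(f"{r.application}: missed target by {overrun:.0f} minutes")
```

A restore that exactly meets its target is counted as a pass here; you may prefer a stricter rule that leaves some margin.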

"An external audit can hold you to a higher standard than your company may have set, and conduct full, rigorous tests, forcing you to follow the best IT practices," says Webster.

Offsite Backup Approaches

In most IT disaster events, disaster recovery involves restoring data, because the primary copy has been damaged, destroyed, or rendered inaccessible.

To ensure that a copy of your data is available if and when an IT disaster occurs, an offsite backup is critical. It should be geographically far enough away to ensure that a major event like fire, flood, power outage, explosion or earthquake doesn't damage or isolate the backup.

Tape ruled the offsite backup world for decades. But there are problems with tape-based backups:

Offsite tapes take time to request, find, and retrieve.
If a tape is faulty, you don't find out until you need it.
To read older-generation tapes, you need to have a working tape drive that supports them. Since your site may be inaccessible, you need one at your alternate location as well. This adds to infrastructure costs.
You may have to go through the entire tape just to retrieve a few files.
Many tape-oriented backups use proprietary formats, and require vendor software to be read – another recurring cost.

In today's online, 24x7x365 world, a backup that's not quickly and easily available may be good for preserving important company data – but it isn't useful for disaster recovery. Today's recovery time objectives (RTOs) are measured in hours or even minutes.
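Whatever medium you use, one way to avoid discovering a faulty backup only when you need it is to verify copies routinely. The sketch below compares SHA-256 checksums of an original file and its backup using Python's standard library. It is an illustration of the idea rather than a description of any particular backup product, and the file names and contents are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """True when the backup's contents are byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(backup)


# Demonstration with throwaway files; the names and contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "customers.db"
    good_copy = Path(tmp) / "customers.db.bak"
    bad_copy = Path(tmp) / "customers.db.bad"

    original.write_bytes(b"customer records")
    good_copy.write_bytes(b"customer records")
    bad_copy.write_bytes(b"customer rec")  # simulates a truncated backup

    good_ok = backup_matches(original, good_copy)
    bad_ok = backup_matches(original, bad_copy)

print("Good copy verified:", good_ok)
print("Truncated copy verified:", bad_ok)
```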

Free DR Plan Templates and Samples

PROVIDER	LINK
IBM	http://publib.boulder.ibm.com/iseries/v5r1/ic2924/index.htm?info/rzaj1/rzaj1sampleplan.htm
Texas A&M University	http://www.tamuct.edu/departments/informationtechnology/extras/ITDisasterRecoveryPlan.pdf

If you're lucky – and have fortified your IT infrastructure – your company may escape major IT-impacting disaster events.

"During my 20 years in IT, I haven't yet had to invoke a full DR plan – although I have come close," says Webster. "But I have had to invoke parts of my DR plan, about once a quarter. You have to be ready to do some level of DR periodically."