A disaster is an event that causes harm to or loss of life/property. This is what recently happened in Texas and Louisiana when Hurricane Harvey made landfall. In the case of IT services, a disaster is the loss of access to IT systems and networks. IT systems and networks are so important that their loss can be paralyzing to an organization. Can a disaster be prevented? If not, what can be recovered and how long does it take? Does the recovery restore everything or just the mission-critical functions?
Defining Disaster Recovery
Disaster recovery (D/R) includes policies, procedures, technologies, and tools that enable the recovery or continuation of technology infrastructure and systems following a natural or human-created disaster. The focus should be on the IT systems supporting critical business functions.
Business continuity means maintaining essential business elements despite significant disruptive events. D/R is a subset of business continuity.
Disaster Causes Are Everywhere
Internal disasters can be caused by poor design, malicious behavior, negligence, or ignorance. External disasters can be caused by weather, fire, earthquakes, volcanic activity, cyberattacks, and loss of utility services (power). A disaster will occur; it's how you prepare that makes a difference. One of the weakest points in planning for disasters is the set of assumptions you make about what will or will not work. One erroneous assumption is that the required IT staff will be available on-site or able to access the systems and network remotely. Such assumptions often depend on factors beyond the control of the organization affected by the disaster.
Survey Says
CloudEndure recently published its fourth annual "2017 Disaster Recovery Survey Report." One interesting conclusion from the survey is that 60% of enterprises surveyed have some form of disaster recovery capability for at least half of their production systems. This means not everyone is planning to back up every function in the business. An important question to answer is, "What functions do not warrant a disaster recovery investment?" The graphics in this blog are from that survey.
Disaster Sources
According to the survey, human errors account for 23% of the top risks to service availability. The next two most common problems are network failures at 17% and external threats at 13%. What I found surprising is that cloud provider downtime accounts for 12% of the problems.
IT infrastructure design and operations can create disasters, too. IT may learn that their designs are not effective. I wrote two blogs that provide many examples of design problems in D/R planning: "Disaster Recovery Duds" and "8 More Disaster Recovery Duds."
Meeting Availability Goals
Availability is that number, 99.XXX%, we all like. Twenty-eight percent of the enterprise respondents reported their goal is 99.999%, 15% have a goal of 99.99%, and 21% have a goal of 99.9%. This does not mean they satisfy their goals, either. A problem with the availability number is that it could be based on multiple short disasters or one huge disaster all producing the same number. Which is it?
As seen in the graphic below, about 41% of enterprises meet their availability goals consistently. Forty-three percent meet the goal most of the time, while 16% meet the goal some of the time or not at all.
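To make those availability percentages concrete, here is a minimal sketch that converts a goal into a yearly downtime budget and shows how very different outage patterns can produce the same number. The function names and the sample outage lists are hypothetical and are not taken from the survey; the calculation assumes a 365-day year.

```python
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability goal."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def measured_availability(outage_minutes: Iterable[float]) -> float:
    """Return the availability percentage for a year with the given outages."""
    total_down = sum(outage_minutes)
    return 100 * (1 - total_down / MINUTES_PER_YEAR)


# Downtime budgets for the three goals mentioned in the survey.
for goal in (99.9, 99.99, 99.999):
    print(f"{goal}% allows about {allowed_downtime_minutes(goal):.1f} minutes per year")

# Hypothetical outage histories: twelve 5-minute outages versus one 60-minute outage.
many_short = [5.0] * 12
one_long = [60.0]
print(measured_availability(many_short) == measured_availability(one_long))
```

Both hypothetical histories total 60 minutes of downtime, so the availability figure cannot tell them apart. Tracking the number and length of individual outages alongside the percentage answers the "which is it?" question.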
Recovery Point Objective (RPO)
You need to establish a recovery point. This is a checkpoint that you periodically store so you can roll back to a pre-existing operating condition. A recovery point objective of 6 to 59 minutes is the most common, at 26%. What is surprising is that 32% report an RPO of more than one hour, and 10% admit they don't know what their RPO is at all. The greater the RPO, the more information is lost.
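The relationship between checkpoints and lost information can be sketched in a few lines. This is an illustration only: the function names, the hourly checkpoint schedule, and the failure time are all hypothetical, and it assumes that everything written after the most recent checkpoint is lost.

```python
from datetime import datetime, timedelta


def data_loss_window(checkpoints, failure_time):
    """Return the time between the last checkpoint before the failure and the failure."""
    earlier = [c for c in checkpoints if c <= failure_time]
    if not earlier:
        raise ValueError("no recovery point exists before the failure")
    return failure_time - max(earlier)


def meets_rpo(checkpoints, failure_time, rpo):
    """Return True if the data lost at failure_time stays within the RPO."""
    return data_loss_window(checkpoints, failure_time) <= rpo


# Hypothetical schedule: hourly checkpoints from 08:00 through 12:00.
start = datetime(2017, 9, 1, 8, 0)
hourly = [start + timedelta(hours=h) for h in range(5)]

# Hypothetical failure at 11:45; the last usable checkpoint is 11:00.
failure = datetime(2017, 9, 1, 11, 45)

print(data_loss_window(hourly, failure))                  # 0:45:00
print(meets_rpo(hourly, failure, timedelta(minutes=59)))  # True
print(meets_rpo(hourly, failure, timedelta(minutes=30)))  # False
```

In this sketch the worst case is a failure just before the next checkpoint, so the checkpoint interval has to be no longer than the RPO you commit to.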
Recovery Time Objective (RTO)
The recovery time objective is the time it takes to get back in business. Only 4% of respondents said they have zero recovery time. (How do they do this?) The largest group, 30%, expects the recovery time to be from 6 to 59 minutes. Another 21% reported 1 to 6 hours. A total of 26% either don't know their RTO or their RTO is greater than seven hours. The RTO is not the outage time. It is the time after the outage has been resolved to bring systems and networks back into production.
Testing Your D/R Solutions
How often do you test your disaster recovery solutions? About 79% conduct disaster recovery drills at least once a year. Twenty-one percent test the D/R plan every few years or don't conduct drills at all. Testing once a year may not be enough. Your network and systems will probably change over the year. The once-a-year test may be testing an old configuration of systems and networks. It may not produce valid results.
Most organizations do not want to test the D/R plan because it interrupts their operations. I came across one case where everyone left the office when they knew that a disaster recovery test was going to be performed. Everyone downloaded their material onto laptops and went home. This did not test a real-world situation. The best disaster recovery testing is done without notifying anyone in advance.
Final Thoughts
When you do your disaster recovery planning, make sure you include non-IT personnel from various business units. You will have to compare the budget for D/R to the possible losses of the organization. Some of the losses may be measured in dollars, while other losses may be measured in reputation or lost opportunities. It can be difficult to convince C-level executives that the budget is adequate for disaster recovery. What are your RPO and RTO?
These are some other No Jitter blogs that can help you think through D/R planning: