Disaster Recovery Reality Check

Disaster recovery (DR) services based on the cloud have grown significantly over recent years. Enterprises have to study their operations and determine what is important to store for disaster recovery purposes, how often things need to be backed up, and how fast they will be able to recover.
Not only is each enterprise a bit different, but DR for each IT function within that enterprise may be different. The enterprise has to evaluate how to budget for DR and when to use the cloud (see "Ready for a Disaster?").
Disaster Recovery Report 2018
CloudEndure’s Disaster Recovery Survey Report resulted from a March online survey of 375 IT professionals from around the globe who are using or looking to implement disaster recovery. Although the survey covers a range of topics, this blog focuses on two recovery objectives and how well enterprises meet them. The graphics in this blog are from the report.
Recovery Point Objective (RPO)
The RPO is the maximum age of the files that must be recovered from the DR site for normal operations to resume after an outage caused by a hardware, program, power, AC, or communications failure. RPO can be set for the entire enterprise or for specific applications, and it’s measured backward in time from the instant at which the failure occurs. Once the RPO is defined, it dictates the minimum frequency with which backups must be made.
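As a rough illustration of how an RPO translates into a backup schedule, here is a minimal Python sketch. It assumes backups start at a fixed interval and become usable only once they finish; the function names and the sample figures are hypothetical and do not come from the survey.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Return the oldest the newest usable backup can be when a failure hits.

    Assumption: a backup starts every `backup_interval`, captures the state
    as of its start time, and is usable only after `backup_duration`.
    A failure just before a backup completes falls back to the previous one,
    so the worst case is one full interval plus the backup duration.
    """
    return backup_interval + backup_duration


def meets_rpo(rpo: timedelta,
              backup_interval: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """Check whether a backup schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=4)  # hypothetical target
    for interval_hours in (3, 4):
        ok = meets_rpo(rpo, timedelta(hours=interval_hours), timedelta(minutes=30))
        print(f"Backup every {interval_hours} h, 30 min to complete: meets RPO = {ok}")
```

In this simplified model, a four-hour RPO is not met by a backup every four hours once the time to complete each backup is counted, which is why the backup interval usually has to be shorter than the RPO itself.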
About one-fifth (21%) of this year’s survey respondents report RPOs of less than one minute. Compared with the 2017 survey, the number of companies expecting zero RPO has increased. RPOs of four hours or less were expected by 74% of enterprises. Unfortunately, 8% of respondents reported that they have not determined RPOs at all.
In one of my consultancy projects, the customer’s RPO had to be effectively zero. The system, built for a Federal Reserve Bank, had to recover all the money and security transfers as they stood at the point the outage occurred, no matter what state each transaction was in. We check-pointed every stage of a transaction to avoid creating any financial errors upon recovery.
Another project provided business information that only changed daily. In that case we could tolerate an RPO of hours, as long as it was less than 24 hours.
Recovery Time Objective (RTO)
RTO is the maximum allowable length of time that an IT function can be down after an outage occurs. The RTO is a measure of the extent to which the outage disrupts normal operations, and thus it must be weighed against the revenue, profit, fines, and fees lost per unit of time as a result of the outage. The RTO will vary based on the functions and applications affected.
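The trade-off between RTO and loss per unit of time can be sketched with simple arithmetic. The Python example below assumes a constant loss rate per hour, which is a simplification, and all names and dollar figures are hypothetical.

```python
def outage_cost(loss_per_hour: float, outage_hours: float) -> float:
    """Estimate the direct cost of an outage, assuming a constant hourly loss rate."""
    if loss_per_hour < 0 or outage_hours < 0:
        raise ValueError("loss rate and outage length must be non-negative")
    return loss_per_hour * outage_hours


def rto_for_budget(max_acceptable_loss: float, loss_per_hour: float) -> float:
    """Return the longest outage (in hours) that stays within an acceptable loss."""
    if loss_per_hour <= 0:
        raise ValueError("loss rate must be positive")
    return max_acceptable_loss / loss_per_hour


if __name__ == "__main__":
    # Hypothetical application losing $10,000 per hour of downtime.
    print(outage_cost(10_000, 4))          # cost of a four-hour outage
    print(rto_for_budget(50_000, 10_000))  # hours of downtime a $50,000 loss allows
```

An application with a higher hourly loss ends up with a shorter RTO for the same acceptable loss, which is one reason a single enterprise can have different RTOs for different IT functions.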
In the CloudEndure report, the majority of respondents (69%) report an RTO of four hours or less, with 6% of survey respondents having an RTO of zero. An additional 6% have an RTO goal of under one minute. A surprising 13% of respondents report being able to accept an RTO of more than 24 hours or no determined RTO at all.
In the case of the Federal Reserve Bank, RTO was one minute so that transactions would be delayed but not lost. The business information service could tolerate an RTO of up to 10 minutes without any loss of revenue or reduced profit.
Many studies have been conducted to determine the cost of downtime for various applications in enterprise operations. They indicate that outage cost depends both on immediate, short-term, tangible factors and on long-term, intangible effects (see “Cloud Out”).
Meeting Goals
Of course, every business wants to meet its DR goals, but many are hampered by budget and/or poor design and execution. Fewer than 43% of enterprises surveyed could meet their RPO consistently, and fewer still (37%) could meet their RTO consistently. Setting the goals is easy; meeting the goals is hard and may not be possible.
Other Survey Results
  • 47% of the enterprises use disaster recovery for at least half of their systems.
  • In 2018, 15% of enterprises aim for five-nines (99.999%) availability or better.
  • Almost half (47%) of all those surveyed use a public cloud as their disaster recovery target site. Only 15% use physical systems, and 39% use private clouds.
  • Sadly, only 7% perform a monthly disaster recovery drill, while 28% conduct drills quarterly. Some enterprises (15%) admitted that they never conduct disaster recovery drills, so they don’t really know if the DR works as planned.
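To put the five-nines availability target from the list above in perspective, the following Python sketch converts an availability percentage into allowed downtime per year. It assumes a 365-day year and treats all downtime as counting against the target; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime per year, which leaves almost no room for a recovery process that has never been rehearsed.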
In one of my consultancy projects, we were so successful at avoiding outages that the on-site backup system was never called on. When we finally tried to use it, we found that the personnel needed to run it had been reassigned without our knowledge, and the backup was no longer functional. The major lesson we learned was to exercise the backup periodically to ensure that it works correctly and that all the resources are available when you need them.