Are You Planning for a Disaster or Downtime?
Enterprises should continually review and update the procedures and technologies they need in order to mitigate downtime.
Though the human costs from megastorm Sandy continue to be severe, from the perspective of those charged with maintaining enterprise systems, the key issue for the network is downtime.
The Many Elements of Downtime
Downtime may be unplanned--for example, as a result of a disaster--or it may be planned. Planned downtime may result from implementing new or changed/upgraded hardware, new or modified/upgraded software, new or changed network connections, system/network testing, and physical site construction and modification.
As a rule, these planned downtimes are not included in the calculation of the benchmark 99.999% availability goal. The primary element for the 99.999% calculation is unexpected hardware failures.
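To make the 99.999% figure concrete, the short sketch below converts an availability target into the unplanned downtime it allows per year. It assumes a 365-day year, and the function name is only illustrative.

```python
# Downtime budget implied by an availability target.
# Assumptions: a 365-day year; only unplanned downtime counts against the target.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year that the target allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

Under these assumptions, 99.999% leaves a little over five minutes of unplanned downtime per year.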
Unexpected downtime can be due to:
* Natural disasters (severe weather, floods, earthquakes, wildfires, etc.) that knock out resources such as power and building access
* Power outages that are not related to natural disasters
* Software bugs
* Malicious behavior
* Security breaches
* Human error
There are also problems that may not be technically considered a failure but cause performance degradation. These include:
* Network overload and congestion
* Poor operating system performance
* Unstable applications and application configuration
* Data unavailability, corruption and access limitation
Whatever the downtime problem, the enterprise should establish key performance indicators (KPIs) that measure what matters most. KPIs for downtime include:
* Time to discover a problem
* Time to diagnose the problem
* Amount of the organization that is affected
* Amount of resources needed to stop the downtime
* Time to initiate the solution(s), measured from the time the problem is discovered
* Time to complete the resolution(s), measured from the time the solutions are initiated
* Cost to resolve the downtime
* Cost the downtime produces in lost productivity, customers, reputation, etc.
* The new availability figure once the downtime is included (i.e., how far the downtime drives you below the 99.999% goal)
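The time-based KPIs in this list can be derived from a handful of timestamps recorded for each incident. The sketch below is one possible way to do that. The `Incident` class and its field names are hypothetical, the example timestamps are made up, and measuring time to diagnose from the moment of discovery is an assumption rather than a rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # All field names are hypothetical; map them to your own incident records.
    started: datetime      # when the failure actually began
    discovered: datetime   # when IT or the users noticed it
    diagnosed: datetime    # when the cause was identified
    fix_started: datetime  # when the solution was initiated
    resolved: datetime     # when service was restored

    def kpis(self) -> dict[str, timedelta]:
        return {
            "time_to_discover": self.discovered - self.started,
            # Assumption: diagnosis time is measured from discovery.
            "time_to_diagnose": self.diagnosed - self.discovered,
            "time_to_initiate": self.fix_started - self.discovered,
            "time_to_complete": self.resolved - self.fix_started,
            "total_downtime": self.resolved - self.started,
        }


def availability_pct(incidents: list[Incident], period: timedelta) -> float:
    """Availability over the period once all incident downtime is included."""
    down = sum((i.resolved - i.started for i in incidents), timedelta())
    return 100 * (1 - down / period)


# Illustrative values only: a single three-hour outage in a 365-day year.
outage = Incident(
    started=datetime(2024, 1, 8, 2, 0),
    discovered=datetime(2024, 1, 8, 2, 20),
    diagnosed=datetime(2024, 1, 8, 3, 0),
    fix_started=datetime(2024, 1, 8, 3, 10),
    resolved=datetime(2024, 1, 8, 5, 0),
)
for name, value in outage.kpis().items():
    print(f"{name}: {value}")
print(f"availability: {availability_pct([outage], timedelta(days=365)):.4f}%")
```

With these made-up numbers, one three-hour outage pulls the yearly figure down to roughly 99.966%, well short of the 99.999% goal.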
Stages of a Downtime Problem
There are four stages to dealing with downtime:
1. A problem has occurred and no one knows yet.
2. IT and/or the users determine there is a problem. Knowing the extent of the problem is critical to its resolution.
3. IT decides how to respond to the problem: what resources are needed, what it will cost, what steps to take, and how long resolution will take. Alternative solutions should be investigated during this stage.
4. IT performs the failover and recovery procedures and notifies the users that the problem is resolved.
Is the Backup Working?
All too often I have encountered enterprises switching to a backup solution only to discover that the backup does not work. Those in charge of the backup systems made assumptions that proved incorrect. This happened frequently for enterprises and other organizations during and after Sandy. No one had tested the backup or it had been a long time since it was tested.
On one occasion, a backup carrier connection did not work. My client had had no need for the backup for the three years it was in place. We subsequently discovered that the backup had never worked. It had been installed improperly, never tested by the carrier, just checked off as working. My client was able to obtain a refund for the connection for its entire life span, but this was not much consolation.
On another occasion, the server backup had been taken offline for some work. No one who needed to know it was offline was informed of its status. As a result, no backup existed when the primary failed. It took a few hours to learn what was wrong and restore IP PBX service.
Oversights in Downtime Planning
Downtime can be caused by small events. You don't need a megastorm for a failure. Application and server failures are far more likely to occur than a natural disaster.
If you have set up the servers in an active/active configuration, ensure that when one server fails, the other server can handle the full workload. If not, users will see performance degradation or even terminated sessions.
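This capacity question can be checked with simple arithmetic before a failure ever happens. The following is a minimal sketch, assuming each server's capacity and the peak workload are expressed in the same unit (for example, concurrent sessions). The function name and the numbers are hypothetical.

```python
def survives_single_failure(capacities: list[float], peak_load: float) -> bool:
    """True if the remaining servers can carry peak_load after the largest one fails."""
    if len(capacities) < 2:
        return False  # nothing left to fail over to
    # Worst case: the server with the most capacity is the one that fails.
    remaining = sum(capacities) - max(capacities)
    return remaining >= peak_load


# Two servers rated for 1,000 sessions each (illustrative numbers).
print(survives_single_failure([1000, 1000], peak_load=900))   # enough headroom
print(survives_single_failure([1000, 1000], peak_load=1400))  # pair is oversubscribed
```

The second case is the trap: the pair handles 1,400 sessions comfortably while both servers are up, but the survivor cannot carry that load alone.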
Monitoring the KPIs means more than detecting failures quickly. Degrading KPIs should trigger alerts as they approach poor performance levels, before real problems develop. Acting on those alerts lets you solve the problem before it grows into a larger issue, and may prevent users from seeing any degradation at all.
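One common way to alert on degradation rather than outright failure is to set a warning threshold below the critical one and require several consecutive bad samples before alerting. The sketch below illustrates the idea. The threshold values, the `sustain` parameter, and the names are assumptions chosen for illustration, not settings from any particular monitoring product.

```python
from enum import Enum


class Level(Enum):
    OK = "ok"
    WARNING = "warning"    # degrading: act before users notice
    CRITICAL = "critical"  # users are likely affected already


def classify(samples: list[float], warn_at: float, critical_at: float,
             sustain: int = 3) -> Level:
    """Classify a metric where higher is worse (e.g. response time in ms).

    An alert level is reported only if the last `sustain` samples all
    reach it, which filters out one-off spikes.
    """
    if sustain < 1:
        raise ValueError("sustain must be at least 1")
    recent = samples[-sustain:]
    if len(recent) < sustain:
        return Level.OK  # not enough data yet
    if min(recent) >= critical_at:
        return Level.CRITICAL
    if min(recent) >= warn_at:
        return Level.WARNING
    return Level.OK


# Response times creeping upward (illustrative values, in milliseconds).
print(classify([120, 180, 260, 270, 310], warn_at=250, critical_at=500))
```

Here the last three samples all exceed the warning threshold but not the critical one, so the result is a warning: the point at which someone should investigate, before the degradation becomes an outage.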
The enterprise should also monitor for application software issues and data corruption. The backup hardware and software should be monitored as closely as the primary.
Testing the backup procedures and operation is extremely important; test frequently. I had one client that used their electrical generators for 24 hours every weekend as a live power source. Turning a backup on for a few minutes is not enough, because a backup may fail only once a full load is applied to the system.
Don't select cheaper backup components. They should be equal in quality and reliability to the primary systems. Fully understand the assumptions made when coping with downtime--for example, beware of assuming that the mobile network will be operating when the wired network fails. Mobile services failed over a wide area with Sandy.
Practice, Practice, Practice
Practice the backup procedures live, not on paper. Don't let anyone know that a backup test will be performed. I had a client that warned all the users that a backup test was to be performed the next Monday. On Friday, most of the employees downloaded their work onto their PCs so they would not be affected by the backup procedure. This defeated the purpose of the backup test, as it did not demonstrate that the backup and procedures would work properly under a full load.
Most of the time, enterprise management does not allow full failover operation, as it will disrupt the business. This is not so smart when a big problem occurs and the failover does not work as expected.