Cloud UC Availability: Is Five-Nines Real?

Subscribing to a cloud-based PBX, UC, or contact center service can be riskier than you think. I know of at least one cloud PBX service that went down because part of the Amazon EC2 cloud failed in April 2011. Whether 99.99+% availability is delivered depends on your perspective: cloud provider or customer. It also depends on what is and what is not included in the calculation. Is 99.999% real? Probably not from the end user's perspective. But a single data center, or multiple geographically distributed data centers, may reach 99.999% with a well-designed infrastructure and network. Does the availability metric include the reliability of the data stored in the cloud? Not likely.

What is Availability and 99.999%?
The talk about five-nines began with the central office telephone switch and the PBX. 99.999% refers to availability and is related to reliability. Availability is a function of two basic factors: Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR). Both are usually expressed in hours. Availability is described by the following equation:

Availability = MTBF / (MTBF + MTTR)

If there are no failures, then the MTTR is zero and therefore the availability is 100%. If there is any MTTR, then the availability has to be less than 100%. The term Repair is misleading; it should be called Mean Time To Restore Operation. There are five elements to MTTR:

* Failure Detection
* Failure Notification
* Response of the persons/systems to the failure notice
* Actual repair/replacement time
* Recovery/Restart/Reboot time

The first four elements can be significantly reduced or even eliminated by deploying redundant components and automatic reconfiguration.
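
To make the effect of the MTTR elements concrete, here is a minimal sketch that applies the standard relationship, availability = MTBF / (MTBF + MTTR), to a manual restore and to an automated failover. The MTBF value and the hours assigned to each element are assumptions chosen only for illustration; they are not measurements from any vendor or provider.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with both values in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical hours for each of the five MTTR elements (illustrative only).
mttr_elements = {
    "failure detection": 0.25,
    "failure notification": 0.25,
    "response": 1.0,
    "repair/replacement": 2.0,
    "recovery/restart": 0.5,
}

MTBF_HOURS = 10_000.0  # assumed value, not taken from the article

# Manual restore: all five elements contribute to MTTR.
manual_mttr = sum(mttr_elements.values())

# Redundancy with automatic reconfiguration: assume only restart time remains.
automated_mttr = mttr_elements["recovery/restart"]

print(f"manual restore:     {availability(MTBF_HOURS, manual_mttr):.5%}")
print(f"automated failover: {availability(MTBF_HOURS, automated_mttr):.5%}")
```

With the same MTBF, shrinking the MTTR is what moves the result closer to 100%, which is why redundant designs concentrate on the first four elements.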

MTBF provides a measure of system and network reliability. But over the course of time, the metric does not necessarily tell you what you need to know. The system or network could have 99.9% availability and suffer a disaster, one long outage, or multiple short outages. The availability metrics are indifferent to the impact of the outages. However, they still provide a useful function; they provide a frame of reference, as shown in Table 1.

Table 1: Translating the Metrics
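
A translation of this kind, from an availability percentage to the downtime it permits, can be computed directly. The sketch below assumes a 365-day year and a hypothetical helper name; the percentages listed are common reference points rather than values taken from Table 1.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the annual downtime allowed by a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:,.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Five-nines works out to a little over five minutes of downtime per year, however those minutes are distributed.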

Availability of Voice Communications
For years, the legacy PBX vendors not only created systems that delivered five-nines, they also performed on-site surveys to ensure that the rest of the telephone infrastructure did not reduce reliability. For most voice users, five-nines was related to dial tone, and they were not disappointed.

The measure of reliability was based on hardware calculations, not software. In addition, the phone, cabling and power were not included in the five-nines calculation. That does not mean that these components were not reliable, just that they were part of the customer premises, not the vendor's responsibility. The five-nines calculation was based on a hardware reliability prediction model called Parts Count. Bell Labs, and later Telcordia, refined this calculation method over the years and found it to be a very accurate predictor of hardware reliability, dubbing it MTBF.

Adding Unified Communications to the Mix
Over the years, hardware reliability has been high and continues to be delivered. However, with the advent of software-driven systems, software reliability is now the predominant factor in availability.

Unified Communications has many more media components besides voice. Further, the vendors keep upgrading the software, improving operation, adding features and functions, and fixing bugs. This means that software does not have the same level of stability as hardware.

There is also no accepted method for predicting software reliability. So when cloud providers predict their availability, they are depending upon infrastructure configurations that can compensate for failures, whether hardware or software. These high-availability infrastructure designs are not always successful.

There will always be events that the communications cloud provider either does not anticipate, or does not know how to avoid, or cannot afford the cost to mitigate. A Black Swan event is by definition a surprise.

A Black Swan event can occur in any computer and communications environment. It is a high-impact event that is hard to predict and is beyond the normal expectations of the cloud provider. Because these events are so rare, they cannot be assigned any quantitative metric. The occurrence of a Black Swan event is probably not even imagined by the cloud provider's designers, nor is there an approach to prevent such an event.

So what does this have to do with cloud availability? The answer is that these unlikely events are not part of the availability calculations and would not be covered by any service level agreement. These could include:

* An employee or contractor who does not follow procedures and causes a failure through software misconfiguration, improper installation, or a mistake implementing switchover procedures
* An employee or contractor who sabotages the data center
* A plane crashing into a data center
* A hostage situation where the data center power needs to be turned off
* A terrorist attack
* Acts of God such as an earthquake, volcanic eruption, flood, gas or fuel explosion, or fire

None of these possibilities can be calculated and therefore are excluded from the 99.999% availability calculation.

Black Swan events have occurred. Even the best data centers can experience outages that were either unpredictable or were considered so unlikely that there were no plans, procedures and facilities in place to deal with them. However, the outage can also be caused by poor design or not verifying that the design specifications were followed. Some examples of outages that had major impact:

* Rackspace--A major cloud service provider had two power outages at its Grapevine, TX data center, on June 29 and July 7, 2009. There was a 40-minute downtime for some customers on June 29, and another downtime of 15 to 20 minutes on July 7, affecting about 2,000 customers. The total of about one hour of downtime, out of 8,760 hours in a year, reduces annual availability to roughly 99.989% as a result of just these two events.

* Amazon Web Services (AWS)--AWS had a partial failure on April 21, 2011. AWS EC2 is built around availability zones, with multiple zones in each of the AWS regions. This design is supposed to prevent any single point of failure. One customer found that storing data in multiple zones in the same region did not protect the data. Multiple zones in the eastern U.S. region went down because AWS did not follow its own design specifications. It took about 3 hours to contain the problem, which was a network failure. Three hours of downtime results in an annual availability of about 99.97%. However, the failure continued into the next day, reducing the availability to about 99.5%.
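
The annual figures for these outages follow from dividing the downtime by the hours in a year. A minimal sketch, assuming a 365-day year and using the approximate downtime totals given above (the function name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def annual_availability_percent(downtime_hours: float) -> float:
    """Convert total downtime in one year into an availability percentage."""
    return (1 - downtime_hours / HOURS_PER_YEAR) * 100


# About one hour in total across the two Rackspace outages.
print(f"{annual_availability_percent(1):.3f}%")  # 99.989%

# About three hours to contain the AWS network failure.
print(f"{annual_availability_percent(3):.3f}%")  # 99.966%
```

Working the other way, an annual availability of 99.5% corresponds to roughly 44 hours of downtime, which shows how quickly a single multi-day event consumes the yearly allowance.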

Network Access and the Endpoint
Are you connected to the service provider over the Internet, or over MPLS? This does make a difference. Most Internet access technologies do not specify reliability metrics. ISPs set goals, but there are no guarantees that they will deliver 99.99+% availability.

The enterprise has a better chance of obtaining a network SLA if it is using a service like MPLS or VPLS. When the enterprise subscribes to a cloud service, the network access to the cloud service is not generally included in the cloud SLA. The cloud service does not include the enterprise's infrastructure in the SLA either, nor are the endpoints included. Therefore, the end user will see less availability than stated in the cloud SLA because there are other elements between the end user and the cloud service.

The enterprise will also be responsible for its own internal network and any wireless network connection reliability. All the endpoints, wired and wireless, are the enterprise's responsibility. It would be impossible for the cloud provider to be responsible for these elements' reliability unless the cloud provider made this part of their offering, which is highly unlikely.
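
The gap between the cloud SLA and what the end user sees can be estimated by treating each element in the path as a link in a serial chain, where the end-to-end availability is the product of the individual availabilities. This is a sketch under stated assumptions: failures are treated as independent, and every figure below is hypothetical rather than taken from any provider's SLA.

```python
from math import prod


def end_to_end_availability(component_availabilities) -> float:
    """Availability of a serial chain: every component must be up at once."""
    return prod(component_availabilities)


# Hypothetical availabilities for each element between user and service.
chain = {
    "cloud service (per SLA)": 0.99999,
    "WAN/Internet access": 0.999,
    "enterprise LAN/WLAN": 0.9995,
    "endpoint": 0.9999,
}

overall = end_to_end_availability(chain.values())
print(f"end-to-end availability: {overall:.4%}")
```

Under these assumed numbers the user experiences less than three-nines even though the cloud service itself is rated at five-nines, because the chain can never be better than its weakest link.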

Service Level Agreements (SLA)
When a Service Level Agreement is offered, read it carefully. The cloud providers have exclusions that may not be acceptable. If you accept the exclusions, you may find that you need to make some other investments if your goal is 99.99% availability for the entire user experience.

Reading the SLA can be informative. The exclusions are reasonable in most cases, but the enterprise should fully understand what the exclusions mean to their operation in the cloud. The Rackspace SLA states (emphasis is mine):

"We guarantee that our data center network will be available 100% of the time in a given month, excluding scheduled maintenance. The data center network means the portion of the Rackspace network extending from the outbound port on your edge device to the outbound port of the data center border router and includes Rackspace managed switches, routers, cabling.

We guarantee that data center HVAC and power will be functioning 100% of the time in a given month, excluding scheduled maintenance. Power includes UPSs, PDUs and cabling, but does not include the power supplies on your servers. Infrastructure downtime exists when a particular server is shut down due to power or heat problems.

We guarantee the functioning of all server hardware components and will replace any failed component at no cost. 'Hardware' means the processor(s), RAM, hard disk(s), motherboard, NIC card and other related hardware included with the server. Hardware replacement will begin once we identify the cause of the problem. Hardware replacement is guaranteed to be complete within one hour of problem identification."

The Amazon EC2 SLA exclusion statement has a different way of describing the exclusions:

"The Service Commitment does not apply to any unavailability, suspension or termination of Amazon EC2, or any other Amazon EC2 performance issues: (i) that result from a suspension described in Section 6.1 of the AWS Agreement; (ii) caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of Amazon EC2; (iii) that result from any actions or inactions of you or any third party; (iv) that result from your equipment, software or other technology and/or third party equipment, software or other technology (other than third party equipment within our direct control); (v) that result from failures of individual instances not attributable to Region Unavailability; or (vi) arising from our suspension and termination of your right to use Amazon EC2 in accordance with the AWS Agreement (collectively, the 'Amazon EC2 SLA Exclusions'). If availability is impacted by factors other than those explicitly listed in this agreement, we may issue a Service Credit considering such factors in our sole discretion."

Puny Penalties
The Rackspace guarantee says they will credit your account 5% of the monthly fee for each 30 minutes of network downtime, 30 minutes of infrastructure downtime, and/or each additional hour of downtime for hardware up to 100% of your monthly fee for the affected server.

The Amazon website states that "If the Annual Uptime Percentage for a customer drops below 99.95% for the Service Year, that customer is eligible to receive a Service Credit equal to 10% of their bill (excluding one-time payments made for Reserved Instances) for the Eligible Credit Period. We will apply any Service Credits only against future Amazon EC2 payments otherwise due from you."
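
To see how small these credits are in practice, the two rules can be expressed as simple functions. This is a rough sketch of the terms as summarized above, not the providers' actual billing logic: the function names are hypothetical, only the network-downtime portion of the Rackspace guarantee is modeled, and it assumes that only complete 30-minute blocks earn credit.

```python
def rackspace_network_credit_percent(downtime_minutes: int) -> int:
    """5% of the monthly fee per 30 minutes of network downtime, capped at 100%.

    Assumption: only complete 30-minute blocks count toward the credit.
    """
    return min(100, 5 * (downtime_minutes // 30))


def ec2_annual_credit_percent(annual_uptime_percent: float) -> int:
    """10% service credit if annual uptime drops below 99.95%, otherwise none."""
    return 10 if annual_uptime_percent < 99.95 else 0


print(rackspace_network_credit_percent(40))   # one full 30-minute block -> 5
print(rackspace_network_credit_percent(600))  # ten hours reaches the 100% cap
print(ec2_annual_credit_percent(99.97))       # above the threshold -> 0
print(ec2_annual_credit_percent(99.5))        # below the threshold -> 10
```

Note that a three-hour outage, at about 99.97% annual uptime, stays above the 99.95% threshold and would earn no credit at all under the quoted Amazon terms.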

None of the credit agreements covers the business losses or legal actions that the enterprise may incur because of a service failure. One open question that is not covered is the temporary loss of data access even if the service resumes, or the permanent loss of data. When communications functions are in the cloud, the data is in the cloud. One major consideration for the enterprise is how to back up that data: store cloud data on another service or on the enterprise premises?

Suppose that there is a failure that harms the enterprise. Read the SLA and you will probably find a statement to the effect that the cloud provider does not bear any liability for whatever occurs during the loss of service; the final responsibility rests with the enterprise customer, not the cloud provider.

In the end, the enterprise can only recover up to one year's service fees and nothing else. With these SLA provisions, it is unlikely that bringing suit against the cloud provider to gain greater compensation will be successful. Have your lawyers read the contract provisions. You may not be able to sign the contract because of legal objections.
