
Cloud Collapse

Have you heard that the sky is falling? Actually, part of the cloud collapsed on April 21, 2011, when a portion of the Amazon EC2 cloud service went down.

According to the Amazon Web Services site, "Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios." Yet a common failure scenario is exactly what happened.

My concern with this EC2 outage is that the cloud communications provider you use may implement part or all of its communications services in the cloud. If that cloud fails, what liabilities does the communications provider accept? Will the limitations of liability be dictated by the cloud infrastructure provider on which the communications applications operate?

When preparing my survey of cloud communications providers, "2011 Sourcebook of Hosted and Cloud-Based VoIP and UC Services", I discovered that some providers deliver some, if not all, of their communications services exclusively from the cloud. How does the customer deal with a cloud outage of their VoIP and UC services? Amazon posted a summary of what happened and the impact on its customers, "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region", at the Amazon Web Services site. The post offered this explanation:

The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within the US East Region that became unable to service read and write operations. In this document, we will refer to these as "stuck" volumes. This caused instances trying to use these affected volumes to also get "stuck" when they attempted to read or write to them. In order to restore these volumes and stabilize the EBS cluster in that Availability Zone, we disabled all control APIs (e.g. Create Volume, Attach Volume, Detach Volume, and Create Snapshot) for EBS in the affected Availability Zone for much of the duration of the event. For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region. As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.

The lengthy post offered a detailed explanation of the EBS architecture, the outage timeline, its impact, and the resolution of the problem.
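Amazon's description of "stuck" volumes is a reminder that applications sitting on top of a cloud block store should not assume every read or write will return. As a purely hypothetical sketch (my own illustration, not something taken from Amazon's post), the Python below bounds each disk write with a deadline so the application can detect an unresponsive volume and fall back rather than hang along with it; the file path and timeout value are assumptions.

```python
# Hypothetical sketch (not from Amazon's post): bound each disk operation with
# a deadline so the application can react to a "stuck" volume instead of
# hanging with it. The path and timeout value are illustrative assumptions.
import concurrent.futures

IO_DEADLINE_SECONDS = 10  # assumed threshold; tune to your workload


def write_record(path, data):
    """Blocking append that would hang if the underlying volume stops servicing I/O."""
    with open(path, "ab") as f:
        f.write(data)
        f.flush()


def guarded_write(path, data, deadline=IO_DEADLINE_SECONDS):
    """Run the write in a worker thread and give up after `deadline` seconds.

    Returns True on success, False if the volume did not respond in time, so
    the caller can fail over (for example, to storage in another Availability Zone).
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(write_record, path, data)
    try:
        future.result(timeout=deadline)
        return True
    except concurrent.futures.TimeoutError:
        return False  # the write may still be blocked in the background
    finally:
        pool.shutdown(wait=False)  # do not block the caller on a stuck thread


if __name__ == "__main__":
    # In production the path would sit on the EBS-backed mount; a local file is
    # used here only so the sketch runs as-is.
    if guarded_write("journal.log", b"order 1234 confirmed\n"):
        print("write completed")
    else:
        print("volume unresponsive; switching to the fallback storage path")
```

The same idea applies to reads and to calls against the EBS control APIs that Amazon disabled during the event: a bounded wait plus a fallback path keeps a storage problem from becoming an application-wide hang.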

The problem was a configuration error: traffic was routed onto a lower-capacity network instead of the primary network. This affected a single Availability Zone in the US East Region, and the outage lasted through the Easter weekend. The posted Amazon document stated:

The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving. As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, [a big error--GA] leaving the affected nodes completely isolated from one another.

What if your communications applications had been resident on those isolated nodes? This outage exposes the fact that cloud services, and the infrastructure they depend on, are still maturing. It also forces the enterprise to confront the need for its own disaster recovery plans, ones that do not depend on the same cloud infrastructure.
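To make that disaster recovery point concrete, here is a minimal, hypothetical sketch of an external health probe that an enterprise could run outside the primary cloud: if the primary UC endpoint stops answering, traffic is repointed to a standby hosted on different infrastructure. The endpoint URLs, thresholds, and the repoint_traffic() stub are my own illustrative assumptions, not anything prescribed by Amazon or by the cloud communications providers.

```python
# Hypothetical sketch (my own illustration): an external health probe, run
# outside the primary cloud, that repoints traffic to a standby service hosted
# on different infrastructure once the primary stops answering. The URLs,
# thresholds, and repoint_traffic() stub are assumptions for illustration only.
import time
import urllib.error
import urllib.request

PRIMARY = "https://uc-primary.example.com/health"  # hosted on the primary cloud
STANDBY = "https://uc-standby.example.net/health"  # hosted on separate infrastructure

PROBE_TIMEOUT_SECONDS = 5
PROBE_INTERVAL_SECONDS = 30
FAILURES_BEFORE_FAILOVER = 3


def healthy(url, timeout=PROBE_TIMEOUT_SECONDS):
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def repoint_traffic(target):
    """Placeholder: in practice this would update DNS, SIP trunk routes, or a load balancer."""
    print(f"failing over to {target}")


def monitor():
    """Probe the primary service and fail over after several consecutive misses."""
    failures = 0
    while True:
        if healthy(PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                repoint_traffic(STANDBY)
                return
        time.sleep(PROBE_INTERVAL_SECONDS)


if __name__ == "__main__":
    monitor()
```

Whether the standby sits in another region, another Availability Zone, or with another provider entirely is exactly the kind of decision the contract and service-level language discussed below should address.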

My article, "The Legal Side of the Cloud, Worrisome?" was submitted before the EC2 failure event occurred so the article did not cover this cloud nightmare. The article covers a number of issues that the enterprise should consider for its own protection. My article focuses on the contract and legal issues that should be analyzed before subscribing to cloud services. If you are considering cloud communications services, read my article. If you are already a cloud communications subscriber, use my article as a checklist when reviewing or renewing your cloud service contract.