Business Telecom Continuity Strategies in the Cloud Era
The Amazon EC2 "cloudpocalypse" taught us all some lessons, and tornadoes in Alabama had a specific impact on Digium's BC strategies.
I'd like to begin this conversation by recognizing that this is neither the first nor the last blog post that you're likely to read on this topic. The big "oops" in the Amazon EC2 cloud in early May of this year has caused many to consider the potential shortcomings of public clouds with respect to business continuity. While the EC2 cloud outage was due to an apparent "fat-fingering" of a single router port configuration, we at Digium were recently confronted with a six-day power outage at the hands of Mother Nature, who hammered our home town of Huntsville, Alabama with a barrage of tornadoes on April 28, 2011 that knocked down the Tennessee Valley Authority's power grid across the entire northern half of the state. While these two events were very different in type and duration, they both served as examples of how business continuity strategies work and do not work in disastrous circumstances.
Because open source IP communications systems like Asterisk and Asterisk SCF can be deployed as software, in almost any hardware and/or network configuration, they can be as attractive for cloud deployments as they are for premises deployments. However, the business continuity strategies for these two styles of deployment are decidedly different. Let's take a moment to look at both strategies through the lens of the Amazon EC2 outage and the Huntsville, Alabama storms respectively.
Surviving a "Cloudpocalypse"
Amazon's EC2 utility computing platform has become synonymous with cloud computing. It was one of the earliest and is one of the largest cloud computing infrastructures in existence. A wide variety of commercial web applications and services are hosted on EC2, including many VoIP telephony applications. The typical EC2 customer has come to expect that uptime and survivability are part of what they are buying in this environment.
The events in early May of this year proved that no infrastructure is foolproof against a sufficiently talented fool. An engineer hit the Enter key on an errant router configuration and triggered a cascading, systemic failure of Amazon's public cloud. For the duration of the disruption, any customer of the affected infrastructure who did not have redundant systems running on a different network or in different infrastructure was offline. The outage affected a number of notable web properties deployed on EC2, including Foursquare and Quora.
So the EC2 outage highlights the need for diversity at the infrastructure layer and at the network layer. If you're running your company's VoIP systems in a public cloud (not recommended) you must achieve infrastructure diversity by enlisting a backup cloud in separate infrastructure. An example would be to have an EC2 deployment backed up by a GoGrid deployment. It is important to confirm that the primary and the backup clouds are connected to the Internet through different primary providers in order to avoid a single point of failure at the network layer.
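The two diversity requirements above (different infrastructure, different upstream network) can be checked mechanically. What follows is only a minimal sketch in Python, not a description of any real tool: the Deployment record, the shared_failure_points helper, and the carrier names are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Deployment:
    """One place where the VoIP system runs (all values are illustrative)."""
    name: str
    cloud_provider: str    # who runs the servers
    upstream_carrier: str  # who connects that site to the Internet


def shared_failure_points(deployments):
    """Return (name_a, name_b, attribute) for every pair sharing a dependency."""
    findings = []
    for a, b in combinations(deployments, 2):
        for attribute in ("cloud_provider", "upstream_carrier"):
            if getattr(a, attribute) == getattr(b, attribute):
                findings.append((a.name, b.name, attribute))
    return findings


# Hypothetical example: two different clouds, but the same upstream carrier.
primary = Deployment("primary", "EC2", "carrier-a")
backup = Deployment("backup", "GoGrid", "carrier-a")

for first, second, attribute in shared_failure_points([primary, backup]):
    print(f"{first} and {second} share the same {attribute}")
```

In this made-up inventory the two clouds differ, but the check still flags the shared carrier, which is exactly the network-layer single point of failure described above.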
Surviving a Natural Disaster
A number of weather events in recent times have highlighted the importance of having geographic diversity between primary and backup infrastructure. When Digium's headquarters lost power as the result of the tornadoes that struck Huntsville, Alabama in late April, the company's business continuity plan was enacted to allow the company to continue to operate all of its critical systems, including its Switchvox VoIP appliances, using emergency power supplied by a diesel generator. However, the duration of the power event raised concerns that the generator would not have enough fuel to last until power was restored, and without power, gas stations can't pump more diesel fuel.
Digium was fortunate to be able to fuel its generator and stay in business for the entire six-day power outage. However, plenty of other companies in the area were not so lucky; they had to close their doors, and their telecom infrastructure stayed offline until the power came back on. The localized nature of the weather event emphasizes the importance of geographic diversity for primary and secondary critical systems in a business continuity strategy.
If your primary business telecommunications infrastructure is run within your company's headquarters, you need a backup solution run in a different geographic location. This can be accomplished by using a SIP PBX in a different company facility in a different state, trunked to the same SIP trunking provider; or by deploying a backup within a cloud. Alternatively, many modern VoIP PBX platforms allow for calls destined to extensions to be routed out to mobile devices when the user's desk phone is not available. As long as you can keep your VoIP switch online, you can get calls to where they need to go.
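The fallback order described above (desk phone on the primary PBX, then the backup PBX in another location, then the user's mobile device) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Switchvox, Asterisk, or any particular PBX implements it: pick_route, the ROUTES table, the destination labels, and the phone number are all hypothetical.

```python
def pick_route(extension, routes, is_reachable):
    """Return the first reachable destination for an extension, or None."""
    for destination in routes.get(extension, []):
        if is_reachable(destination):
            return destination
    return None


# Hypothetical routing table: destinations are tried in order of preference.
ROUTES = {
    "1001": [
        "desk-phone@hq-pbx",       # primary PBX at headquarters
        "desk-phone@backup-pbx",   # backup PBX in a different location
        "mobile:+15555550100",     # the user's mobile device
    ],
}

# Simulate an outage that takes both desk phone registrations offline.
offline = {"desk-phone@hq-pbx", "desk-phone@backup-pbx"}
print(pick_route("1001", ROUTES, lambda destination: destination not in offline))
```

The reachability check is passed in as a function so that the same ordering logic could sit in front of whatever health check a real deployment uses; here it is just a set lookup.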
I am sure that the Amazon EC2 "cloudpocalypse" has caused many of Amazon's customers to take a long, hard look at their business continuity strategies. The obvious issue for these customers is how to overcome their single-vendor dependency on Amazon without spending the same money twice every month. Additionally, I would expect that Amazon is scrambling to improve its own infrastructure to better cope with broad infrastructure disruptions, so that it can recover the trust of its customers.
At Digium, the weather event that took out power for almost a week opened our eyes to opportunities for improvement in our core telecommunications infrastructure. There are a number of ways that we can better utilize our San Diego facilities as backup capacity for our Huntsville headquarters in order to achieve geographic diversity. Digium can also make more extensive use of cloud operations as backup for our central communications platforms as a means of enhancing infrastructure diversity.
It is not possible to predict the future and see what kinds of issues might create the need for a solid business continuity strategy. However, if your company relies upon its telecommunications infrastructure to sell its products, communicate with its customers, or collaborate internally, you must ask yourself what it costs to be without these tools for hours, days, or even weeks, and then you must plan accordingly.