Where's My Cloud?

We've all heard about the mythical five nines of availability, the Holy Grail set decades ago for residential telephony.

Five nines availability (99.999%) translates to an average of about five minutes of unavailability (0.001%) per year. In other words, while some users may experience longer outages and others none at all, the average service unavailability doesn't exceed about five minutes per user per year.
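The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: it assumes a 365-day year, and the function name is mine rather than any standard API.

```python
# Rough sketch: convert a "number of nines" into an annual downtime budget.
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Return the allowed minutes of unavailability per year for N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001, i.e. 0.001%
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(3, 7):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes per year")
```

Each additional nine cuts the annual budget by a factor of 10, which is why moving from four nines to five is so much harder than it sounds.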

While premises-based TDM PBX solutions didn't always achieve five nines, they generally came very close, commonly achieving at least four nines through redundancy and reliability-focused manufacturing of line cards and phones for high mean time between failures (MTBF). Achieving that level of availability has proven a challenge with VoIP, given the underlying IP data network's complexity and lack of edge redundancy. Core routers, the open Internet, power, SIP trunking, and other factors can reduce VoIP availability to between three and four nines. For example, a non-redundant edge/desktop Ethernet switch with an MTBF of 10 years and a mean time to repair (MTTR) of four hours is unavailable 24 minutes per year (240 minutes in 10 years). That means a single device in the network contributes, on its own, nearly five times the roughly five minutes of annual unavailability that the mythical five nines allows.
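The switch example can be checked with the usual steady-state approximation, in which unavailability is MTTR divided by (MTBF + MTTR). The sketch below assumes a 365-day year and uses function names of my own choosing. It models only the single switch and ignores every other component in the path.

```python
# Rough sketch: annual downtime of one non-redundant device from MTBF and MTTR.
# Assumptions: 365-day year; steady-state unavailability = MTTR / (MTBF + MTTR).
HOURS_PER_YEAR = 365 * 24


def annual_downtime_minutes(mtbf_years: float, mttr_hours: float) -> float:
    """Expected minutes of unavailability per year for a single device."""
    mtbf_hours = mtbf_years * HOURS_PER_YEAR
    unavailability = mttr_hours / (mtbf_hours + mttr_hours)
    return unavailability * HOURS_PER_YEAR * 60


def five_nines_budget_minutes() -> float:
    """Minutes per year allowed at 99.999% availability."""
    return 1e-5 * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    switch = annual_downtime_minutes(mtbf_years=10, mttr_hours=4)
    budget = five_nines_budget_minutes()
    print(f"Edge switch downtime: {switch:.1f} minutes per year")
    print(f"Five nines budget:    {budget:.2f} minutes per year")
    print(f"Ratio:                {switch / budget:.1f}x")
```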

Redundant servers in distributed data centers can increase availability in the core (even to six nines or beyond). One of the key benefits of VoIP and centralized architectures is that users can have multiple devices, and that redundancy increases overall availability. If one device fails while the core is operating, the user can switch to an alternative device (phone, PC, mobile, etc.). However, should the core fail, every device that depends on it fails too.
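To see why redundancy helps in the core but a core failure still takes everything with it, consider a simple model, offered here as a sketch under strong assumptions: failures are independent, failover is instantaneous, and the availability figures are illustrative values of mine, not measurements. Redundant components combine in parallel, while a chain of dependencies (core, then network, then device) combines in series.

```python
# Rough sketch: parallel (redundant) vs. series (dependent) availability.
# Assumptions: independent failures and instant failover; all numbers
# below are illustrative, not measured values.
import math


def parallel_availability(availability: float, copies: int) -> float:
    """Available if at least one of `copies` identical components is up."""
    return 1 - (1 - availability) ** copies


def series_availability(availabilities: list[float]) -> float:
    """Available only if every component in the chain is up."""
    return math.prod(availabilities)


if __name__ == "__main__":
    # Two three-nines servers in parallel reach roughly six nines.
    core = parallel_availability(0.999, copies=2)
    print(f"Redundant core: {core:.6f}")

    # A user with two devices, each at three nines, behind that core and
    # a four-nines access network: the chain is weaker than its weakest link.
    devices = parallel_availability(0.999, copies=2)
    end_to_end = series_availability([core, 0.9999, devices])
    print(f"End to end:     {end_to_end:.6f}")
```

In this model, adding devices at the edge only improves the last term of the series product. Nothing at the edge can raise end-to-end availability above that of the core and the network in front of it.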

For cloud-delivered communications services, a range of events can affect service availability. Problems can arise in the data center or network, on servers, in virtualization or UCaaS software, or from administrative issues. What happens when the core cloud service itself goes offline? A core cloud service outage can impact many end users simultaneously. Without an operating core platform, alternatives to the IP endpoint, such as mobile integration, also fail. And because most UCaaS and CCaaS solutions are over-the-top (OTT) deployments that run across the open Internet, Internet issues can also impact availability from the end-user perspective.

If you're using or thinking about using cloud communications services, you need to know how often providers and end users experience outages and whether those events are limited or universal. While enterprises weigh factors such as cost, flexibility, continual upgrades, and self-service capabilities when considering cloud migrations, they tend to overlook availability. In the end, you need a clear understanding of cloud availability and its impact on your organization before deciding to make the move. It turns out there's an app for that.

Downdetector.com, a crowd-sourced site, tracks outages/availability of a wide range of Internet/cloud services. While Downdetector covers most of the usual cloud suspects, it also tracks UCaaS providers such as 8x8, Microsoft, Mitel, RingCentral, and others.

For example, I received an alert regarding a RingCentral outage on April 3. As can be seen in the screen capture below (ads redacted), users began reporting problems with the RingCentral service at 10:14 a.m. ET and continued reporting issues for a couple of hours. Based on comments on Downdetector, the outage appears to have impacted some users for up to five hours. However, based on the length of consistent problem reports, the outage appears to have had a major impact for just about one hour.

The online Downdetector graph shows the number of problem/outage reports received per 15-minute period. In looking at the chart and estimating the counts by 15-minute period, it seems that Downdetector received almost 1,500 separate reports of this outage.
