Zoom Outage Resurrects Availability Concerns
I wrote an article back in April 2018, “Where’s My Cloud?,” about the reality of cloud reliability and availability. The point of the article was how data on reliability was lacking, while claims of availability were questionable. In light of Zoom’s recent outage that impacted millions, the topic is relevant for enterprises deciding on their cloud direction.
The point of the previous article was that achieving five-nines (99.999%) availability in the cloud is challenging, and most UCaaS vendors were delivering a service that was closer to three-nines at best. The five-nines standard, established in the early days of telephony as a measure of a carrier’s performance, has become a benchmark for reliability in the communications world. The reliability of telephony and today’s premises-based systems led people in the industry to say that a “dial tone comes from God.”
Achieving five-nines implies about five minutes or less of unavailability (downtime) per year, a very challenging goal. Consider that an enterprise user is often connected through an Ethernet switch with an MTBF of 10 years and an MTTR of four hours, which results in about 24 minutes of downtime per year from that one device in the IP packet path. With WFH, we have now introduced the variability of in-home networks and users and an unknown ISP. Achieving a sustained five-nines, or even higher, may be possible in the cloud data center. However, when you include carriers, providers, and end networks, the resulting availability to the end user is reduced by each component’s individual availability, or lack thereof. In the previous article, I referenced several Downdetector cases where UCaaS vendors had issues. One of the issues examined was a relatively large RingCentral data center outage, which was caused by storms and peering issues, according to a RingCentral operations leader.
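The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough illustration, not a measurement method: it assumes a 365-day year, the standard steady-state formula availability = MTBF / (MTBF + MTTR), and that components in the path fail independently and sit in series. The function names are my own, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime, in minutes, for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial_availability(components) -> float:
    """Availability of a path where every component must be up (independence assumed)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


five_nines = 0.99999
# The Ethernet switch from the text: MTBF of 10 years, MTTR of four hours.
switch = availability_from_mtbf(mtbf_hours=10 * 365 * 24, mttr_hours=4)
# A five-nines cloud service reached through that single switch.
chain = serial_availability([five_nines, switch])

print(f"five-nines:       {downtime_minutes_per_year(five_nines):.1f} min/year")
print(f"switch alone:     {downtime_minutes_per_year(switch):.1f} min/year")
print(f"service + switch: {downtime_minutes_per_year(chain):.1f} min/year")
```

Under these assumptions, the switch alone accounts for roughly 24 minutes a year, so a five-nines service behind it already delivers closer to 29 minutes of yearly downtime to the end user, before any carrier, ISP, or home network is counted.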
What the Zoom Outage Means
Just as millions of students were heading back to school (virtually) on the morning of Aug. 24, a range of issues hit Zoom. As you can see below, the Downdetector reports exploded (normal outages have much smaller reporting numbers).
Before discussing the recent availability issue, it’s important to first look at what Zoom achieved in the last six months. They first scaled their capabilities by over 30 times to meet the demands. This rapid expansion in capacity has made life tolerable, saved many businesses, and educated students. If this had happened in 1999, the outcome would have been very different. Zoom also responded to the transformed need for security and privacy, caused by Zoom suddenly being used in every facet of our lives. For this effort, Zoom is to be commended and thanked. They have made the last five months much more tolerable for millions of people.
However, the outage brings back the main point of my previous article: guarantees of availability in UCaaS are both questionable and not verifiable. For example, leading up to the Zoom outage, there was an increased number of outages reported, most in the morning during the join-time crunch. Those reports topped out at just a few hundred for any 15-minute reporting period, according to Downdetector. During the Zoom outage, the peak was 17,000 outage reports in 15 minutes. This wasn’t a limited outage but rather a large-scale issue at some level of the Zoom architecture or data center structure. The early morning rush on Aug. 24 from students logging on was clearly overwhelming.
One factor that makes this outage even more topical is the position Zoom took on their availability when Brent Kelly, principal analyst at KelCor, and I interviewed them for our “Cisco v Microsoft v Zoom” session at Enterprise Connect Digital Conference & Expo 2020. We had several informative meetings with the Zoom team. While we were impressed with many aspects of Zoom, one area that was not examined in detail was availability. In the Zoom presentations, there was a claim that Zoom delivered 99.999% availability, which was made without any specific qualifiers. I noted the claim so I could ask them about it. Unfortunately, we ran out of time to discuss the topic, and I did not follow up before the session on how they were achieving this goal. Zoom declined to be interviewed for this post.
A key question is whether the outages Zoom experienced are common across UCaaS. The below Downdetector chart shows reported outages from April to August for five key vendors. RingCentral and Cisco Webex both had fewer reported outages than Zoom, and their outages impacted only a few hundred in most cases. Interestingly, Microsoft Teams had a larger number of outages as well.
While it is easy to compare the number of outage events, it is also important to consider the unprecedented growth in capacity that was happening. Across the board, utilization of all meetings services has increased since April. Since May, the higher utilization rates for Zoom and Microsoft Teams seem to have been difficult to accommodate at times. While Cisco Webex has seen a 4x growth from the pandemic, Microsoft Teams is over 10x, and Zoom is 30x. This unabated increase in demand has been challenging to accommodate.
3 Things to Consider with Cloud Availability
The ongoing development and use of cloud-based solutions make clear that availability is an important characteristic of a cloud solution and one that can’t be brushed aside with claims of five-nines. There are a few points that should be considered, including:
- Trust, but verify: Cloud services are ethereal things. You are paying for something that can’t be held or easily measured. It only exists in packets and data sent over networks. And communications are even more challenging to verify. Was bad quality the fault of the UCaaS vendor or of the network/intranet connections? Most cloud IT teams don’t hear about issues until long after they’ve been resolved, and tracking down root causes is challenging. Having issues reported with no way to resolve them becomes a major problem. Also, with pre-paid or time-based contracts, failing to meet the availability SLA is generally a breach of service that enables a vendor change, but only if the breach of terms is verified. New tools can manage cloud service SLAs and should be used in cloud migrations.
- Invest in success: Of the big three meetings companies, Cisco has clearly demonstrated an ability to manage its growth without causing issues for its users. For example, Cisco users experienced only 6% of the outages that Zoom users experienced and 12% of those that Microsoft Teams users experienced. Cisco has been the clear availability leader of the big three. While both RingCentral and Slack have similar outage levels, they haven’t seen the explosive growth in video collaborative meetings that Cisco has managed very well. This may be an indication that Cisco has a better capability/architecture to manage availability within its user base.
- Invest in experience: Admittedly, Zoom has been challenged by their 30x growth and Microsoft by their 10x. However, the lessons of that growth may be the key to future success: “what doesn’t kill you makes you stronger.” While Zoom and Microsoft have both had their challenges, the stress test they’ve been through on the growth curve has hardened their architecture and deployments, potentially resulting in long-term stability at scale, which will be harder for other players to achieve.
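To make the “trust, but verify” point concrete, here is a minimal Python sketch of how an IT team might check a claimed availability figure against its own outage log. Everything in it is an assumption for illustration: the function names, the idea that outages are logged as non-overlapping (start, end) pairs, and the sample three-hour outage, which is made up and does not describe any vendor’s actual incident.

```python
from datetime import datetime


def measured_availability(outages, period_start, period_end):
    """Fraction of the period with no logged outage.

    `outages` is a list of (start, end) datetime pairs, assumed not to
    overlap. Each outage is clipped to the reporting period.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 1.0 - down_seconds / period_seconds


def meets_sla(availability, target):
    """True when the measured availability reaches the contracted target."""
    return availability >= target


# Hypothetical 30-day reporting period with one made-up three-hour outage.
period_start = datetime(2020, 9, 1)
period_end = datetime(2020, 10, 1)
outages = [(datetime(2020, 9, 10, 8, 0), datetime(2020, 9, 10, 11, 0))]

availability = measured_availability(outages, period_start, period_end)
for target in (0.99, 0.999, 0.99999):
    verdict = "met" if meets_sla(availability, target) else "missed"
    print(f"target {target:.3%}: {verdict} (measured {availability:.3%})")
```

The hard part in practice is not this calculation but the input: the outage log has to come from measurements the enterprise controls or trusts, and the contract has to define what counts as an outage. Without both, the comparison can’t support a breach claim.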
As we move to the cloud for our communications and collaboration solutions, availability/usability will become more crucial. In the future, the relative availability of all solutions may be equal and approach the mythical five-nines, but in the interim, it is a distinguishing factor among UCaaS vendors. As enterprises consider longer-term options for an overall communications solution, the availability SLA, and how it is verified and enforced, should be a critical part of every cloud strategy.