Zoom Outage Resurrects Availability Concerns


Image: ra2 studio - stock.adobe.com

I wrote an article back in April 2018, “Where’s My Cloud?,” about the reality of cloud reliability and availability. The point of the article was that data on reliability was lacking, while claims of availability were questionable. In light of Zoom’s recent outage that impacted millions, the topic is relevant for enterprises deciding on their cloud direction.

The point of the previous article was that achieving five-nines (99.999%) availability in the cloud is challenging, and most UCaaS vendors were delivering a service that was closer to three-nines at best. The five-nines standard, established in the early days of telephony as a measure of a carrier’s performance, has become a benchmark for reliability in the communications world. The reliability of telephony and today’s premises-based systems led people in the industry to say that a “dial tone comes from God.”

Achieving five-nines implies roughly five minutes or less of unavailability (downtime) per year, a very challenging goal. Consider that an enterprise user is often connected through an Ethernet switch with an MTBF of 10 years and an MTTR of four hours, which results in about 24 minutes of downtime per year from that one device in the IP packet path. With WFH, we have now introduced the variability of in-home networks/users and an unknown ISP. Achieving a sustained five-nines, or even higher, may be possible in the cloud data center. However, when you include carriers, providers, and end networks, the resulting availability to the end user is reduced by each component’s individual availability, or lack thereof. In the previous article, I referenced several Downdetector cases where UCaaS vendors had issues. One of the issues examined was a relatively large RingCentral data center outage, which was caused by storms and peering issues, according to a RingCentral operations leader.
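The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are my own, a non-leap year is assumed, and the 99.9% ISP figure is a hypothetical value chosen to show how components in series drag down the end-to-end number.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime, in minutes, for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one device: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(*components: float) -> float:
    """Availability of a path where every component must be up (a series chain)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


five_nines = 0.99999
# Ethernet switch from the text: MTBF of 10 years, MTTR of four hours.
switch = availability_from_mtbf(mtbf_hours=10 * 365 * 24, mttr_hours=4)
# Hypothetical home ISP at three-nines (an assumed value, not a measured one).
isp = 0.999

end_to_end = series_availability(five_nines, switch, isp)

print(f"five-nines:  {downtime_minutes_per_year(five_nines):.2f} min/year")  # about 5.26
print(f"switch only: {downtime_minutes_per_year(switch):.2f} min/year")      # roughly 24
print(f"end to end:  {downtime_minutes_per_year(end_to_end):.0f} min/year")
```

Because the availabilities multiply, the end-to-end figure can never be better than the weakest component in the path, no matter how well the cloud data center itself performs.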

What the Zoom Outage Means

Just as millions of students were heading back to school (virtually) on the morning of Aug. 24, a range of issues hit Zoom. As you can see below, the Downdetector reports exploded (normal outages have much smaller reporting numbers).

Before discussing the recent availability issue, it’s important to first look at what Zoom achieved in the last six months. They first scaled their capabilities by over 30 times to meet the demands. This rapid expansion in capacity has made life tolerable, saved many businesses, and educated students. If this had happened in 1999, the outcome would have been very different. Zoom also responded to the transformed need for security and privacy, caused by Zoom suddenly being used in every facet of our lives. For this effort, Zoom is to be commended and thanked. They have made the last five months much more tolerable for millions of people.

However, the outage brings back the main point of my previous article: guarantees of availability in UCaaS are both questionable and not verifiable. For example, leading up to the Zoom outage, there was an increased number of outages reported, most in the morning during the join time crunch. Reported outages topped out at just a few hundred for any 15-minute reporting period, according to Downdetector. During the Zoom outage, the peak was 17,000 outage reports in 15 minutes. This wasn’t a limited outage but rather a large-scale issue at some level of the Zoom architecture or data center structure. The early morning rush on Aug. 24 from students logging on was clearly overwhelming.

One factor that makes this outage even more topical is the position Zoom took on their availability when Brent Kelly, principal analyst at KelCor, and I interviewed them for our “Cisco v Microsoft v Zoom” session at Enterprise Connect Digital Conference & Expo 2020. We had several informative meetings with the Zoom team. While we were impressed with many aspects of Zoom, one area that was not examined in detail was availability. In the Zoom presentations, there was a claim that Zoom delivered 99.999% availability, which was made without any specific qualifiers. I noted the claim so that I could ask them about it. Unfortunately, time ran out before we could discuss the topic, and I did not follow up before the session on how they were achieving this goal. Zoom declined to be interviewed for this post.

A key question is whether the outages Zoom experienced are common across UCaaS. The Downdetector chart below shows reported outages from April to August for five key vendors. RingCentral and Cisco Webex both had fewer reported outages than Zoom, and their outages impacted only a few hundred users in most cases. Interestingly, Microsoft Teams had a larger number of outages as well.

While it is easy to compare the number of outage events, it is also important to consider the unprecedented growth in capacity that was happening. Across the board, utilization of all meetings services has increased since April. Since May, the higher utilization rates for Zoom and Microsoft Teams seem to have been difficult to accommodate at times. While Cisco Webex has seen a 4x growth from the pandemic, Microsoft Teams is over 10x, and Zoom is 30x. This unabated increase in demand has been challenging to accommodate.

3 Things to Consider with Cloud Availability

The ongoing development and use of cloud-based solutions make clear that availability is an important characteristic of a cloud solution and one that can’t be brushed aside with claims of five-nines. There are a few points that should be considered, including:

Trust, but verify: Cloud services are ethereal things. You are paying for something that can’t be held or easily measured; it exists only in packets and data sent over networks. Communications are even more challenging to verify. Was bad quality the fault of the UCaaS vendor or of the network/intranet connection? Most cloud IT teams don’t hear about issues until long after they’ve been resolved, and tracking down root causes is challenging. Having issues reported with no way to resolve them becomes a major problem. Also, with pre-paid or time-based contracts, missing the availability SLA is generally a breach of service and enables a vendor change, but only if the breach of terms is verified. New tools can manage cloud service SLAs and should be used in cloud migrations.
Invest in success: Of the big three meetings companies, Cisco has clearly demonstrated an ability to absorb the growth it has been challenged with without creating issues for its users. For example, Cisco users experienced only 6% of the outages Zoom users experienced and 12% of those Microsoft Teams users experienced. Cisco has been the clear availability leader of the big three. While both RingCentral and Slack have similar outage levels, they haven’t seen the explosive growth in video collaborative meetings that Cisco has managed very well. This may be an indication that Cisco has a better capability/architecture to manage availability within its user base.
Invest in experience: Admittedly, Zoom has been challenged by their 30x growth and Microsoft by their 10x. However, the learnings of that growth may be the key to future success: “what doesn’t kill you makes you stronger.” While Zoom and Microsoft have both had their challenges, the stress test they’ve been through on the growth curve has hardened their architecture and deployments, potentially resulting in long-term stability at scale, which will be harder for other players to achieve.
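To make the “trust, but verify” point concrete, here is a minimal sketch of how an IT team might check a claimed availability figure against its own outage records. Everything in it is an assumption for illustration: the function name, the record format (start and end timestamps of non-overlapping outages), and the sample three-hour outage, which is a hypothetical value and not a measured duration for any vendor.

```python
from datetime import datetime


def measured_availability(outages, period_start, period_end):
    """Fraction of the period the service was up.

    `outages` is a list of (start, end) datetime pairs. Outages are assumed
    not to overlap; each one is clipped to the reporting period.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 1.0 - down_seconds / period_seconds


# Reporting period: one 31-day month.
period_start = datetime(2020, 8, 1)
period_end = datetime(2020, 9, 1)

# Hypothetical outage log: a single three-hour outage on one morning.
outage_log = [(datetime(2020, 8, 24, 8, 0), datetime(2020, 8, 24, 11, 0))]

sla_target = 0.99999  # the five-nines claim
observed = measured_availability(outage_log, period_start, period_end)
sla_met = observed >= sla_target

print(f"observed availability: {observed:.5%}")
print(f"SLA of {sla_target:.3%} met: {sla_met}")
```

For scale, five-nines over a 31-day month allows only about 27 seconds of downtime, so any outage long enough for users to notice would likely breach it. The hard part in practice is obtaining trustworthy outage records, which is where independent monitoring comes in.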

As we move to the cloud for our communications and collaboration solutions, availability/usability will become more crucial. In the future, the relative availability of all solutions may be equal and approach the mythical five-nines, but in the interim, it is a distinguishing factor among UCaaS vendors. As enterprises consider longer-term options for an overall communications solution, the availability SLA, and how it is verified and enforced, should be a critical part of every cloud strategy.

