Average Network Statistics Can Hide Big IT Problems


We like to use averages when analyzing network statistics, but unfortunately, those averages often hide a lot of bad news, like the worst response times. Let’s take a closer look below.

Network Statistics Averaging

It’s easy to apply what we think we know about averaging to network statistics to identify when our IT networks are not performing well. Unfortunately, averages hide a lot of news, both good and bad. The good news is great, but it’s not something that compels us to act. However, the bad news tells us about the parts of our IT systems where something is wrong, which is most likely impacting an employee’s or customer’s intended actions.

Take network availability, for example. What does it mean to say that the average network uptime is 99.999%? That’s the magic five-nines that many organizations strive for, which translates into about 5.26 minutes of downtime a year. But is it calculated by averaging the uptime of network devices across the entire network? And how should you account for redundant devices? Is it better to use average network connectivity availability across the network? These are very different measurements.
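
To make the arithmetic concrete, here is a minimal sketch of the five-nines calculation and of how differently a chain of devices and a redundant pair behave. The function names and the 99.9% device figures are illustrative assumptions, not measurements, and the redundancy formula assumes device failures are independent.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series_availability(availabilities):
    """Availability of a path where every device must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities):
    """Availability when any one device is enough (assumes independent failures)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Hypothetical devices, each with 99.9% uptime.
devices = [0.999, 0.999, 0.999]
average_device_uptime = sum(devices) / len(devices)

print(f"Five nines: {downtime_minutes_per_year(0.99999):.2f} minutes/year")
print(f"Average device uptime: {average_device_uptime:.4%}")
print(f"Path through all three in series: {series_availability(devices):.4%}")
print(f"Redundant pair: {redundant_availability(devices[:2]):.4%}")
```

The average device uptime stays at 99.9% in both cases, while the path through three devices in series is worse than that and the redundant pair is much better, which is why the measurement definition matters.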

Averages also don’t account for the relative importance of different systems. Parts of the network that support critical business functions like manufacturing, customer order taking, billing, and fulfillment are more important on a day-to-day basis than the parts of the network that support less critical functions.

But I Use Percentiles

For some network statistics, percentiles can provide better visibility into system performance. For example, the 95th percentile calculation is successfully used for network utilization measurements.

But applying the same calculation to website response times, which many web analytics systems do, simply hides the worst five percent of the response times. If your 95th percentile webpage response time is one second, then up to one customer out of twenty waits longer than one second, possibly much longer. Yet the website dashboard shows green as long as the metric is below one second. This is where you have to be aware of watermelon metrics: green on the outside, but with a lot of red just below the surface. One poor experience out of twenty isn’t a great result, and this information is hidden from view.
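
A small sketch of a watermelon metric, using made-up response times and a hypothetical one-second dashboard threshold. The percentile helper uses the nearest-rank method, which is one of several common definitions; your analytics tool may use another.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in seconds: 95 fast page loads and 5 slow ones.
response_times = [0.4] * 95 + [3.0, 4.5, 6.0, 8.0, 12.0]

THRESHOLD_S = 1.0  # assumed dashboard threshold
p95 = percentile_nearest_rank(response_times, 95)
worst = max(response_times)
over_threshold = sum(1 for t in response_times if t > THRESHOLD_S)
dashboard = "green" if p95 < THRESHOLD_S else "red"

print(f"p95 = {p95} s, dashboard is {dashboard}")
print(f"worst = {worst} s, {over_threshold} of {len(response_times)} over threshold")
```

With this data the dashboard stays green even though five customers in a hundred waited between three and twelve seconds.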

Then, there is the temptation to average percentiles to arrive at a satisfying single metric across multiple systems. As this article explains, you shouldn’t be tempted. Investigate any of your systems that average percentiles to produce a single metric; that metric is almost certainly hiding critical performance data.
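
The following sketch shows one way averaging percentiles can mislead. The two servers and their response times are invented for illustration: a busy server with a slow tail and a lightly loaded server with none. The percentile helper again uses the nearest-rank method as an assumption.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: the busy server serves 1000 requests, 10% of them slow.
busy_server = [0.2] * 900 + [4.0] * 100
# The idle server serves only 100 requests, all fast.
idle_server = [0.2] * 100

p95_busy = percentile_nearest_rank(busy_server, 95)
p95_idle = percentile_nearest_rank(idle_server, 95)

# The tempting shortcut: average the two per-server percentiles.
averaged_p95 = (p95_busy + p95_idle) / 2

# The percentile over all requests, which is what customers experienced.
pooled_p95 = percentile_nearest_rank(busy_server + idle_server, 95)

print(f"Averaged p95: {averaged_p95:.2f} s")
print(f"Pooled p95:   {pooled_p95:.2f} s")
```

The averaged figure treats both servers as equals even though one handled ten times the traffic, so it reports roughly half the latency that the pooled data shows. With different data the error can go the other way, which is the point: the average of percentiles is not a percentile of anything.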

By examining that hidden top five percent, you can see how bad it really is. Don’t be surprised to find that the top samples are much higher than the 95th percentile metric.

Hiding the Bad News

Our analysis systems hide bad news in several ways. The first, and perhaps most obvious now, is by averaging data. When our systems collect data every N seconds, they are effectively averaging that data over the interval. Network monitoring systems tend to have pretty long sample intervals for interface performance data, frequently five or ten minutes, which hides the peak values that occur during those periods. All we know is that X bytes were transmitted or received on a network interface in those N seconds. The sampled data is then used in the percentile calculations that we use for capacity planning purposes. This is a valid use of percentiles to inform our capacity planning objectives.
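
As a rough illustration of how a long polling interval flattens a burst, the sketch below builds a hypothetical five-minute window of per-second traffic on an assumed 1 Gbps link, then compares what a five-minute poll would report with the true peak. The traffic pattern and the `summarize_interval` helper are assumptions for the example.

```python
LINK_BPS = 1_000_000_000  # assumed 1 Gbps interface
INTERVAL_S = 300          # five-minute polling interval


def summarize_interval(per_second_bits, link_bps):
    """Return (average utilization, peak one-second utilization) as fractions."""
    seconds = len(per_second_bits)
    average = sum(per_second_bits) / (seconds * link_bps)
    peak = max(per_second_bits) / link_bps
    return average, peak


# Hypothetical traffic: 5% background load with a 10-second burst at line rate.
per_second_bits = [50_000_000] * INTERVAL_S
for second in range(120, 130):
    per_second_bits[second] = LINK_BPS

average, peak = summarize_interval(per_second_bits, LINK_BPS)
print(f"Five-minute poll reports: {average:.1%} utilization")
print(f"Actual one-second peak:   {peak:.1%} utilization")
```

The poll reports a link that looks nearly idle, while for ten seconds it was saturated and likely dropping or queuing packets.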

But we often see averaging and percentile calculations applied to other data, such as web server response times, where they hide meaningful information. Instead, look for the maximum values. These are real values because they define the worst case. I found it helpful to read about how our monitoring tools often hide data (for example, “Everything You Know About Latency Is Wrong,” with its associated video, “How NOT to Measure Latency”).

Collecting enough data to perform analysis much beyond the 99.9th percentile is challenging. An alternative is to capture the Top-N values. For webpage latency, capture the top 100 or 1000 slowest transactions in enough detail to inform your analysis, which may find a common problem across multiple transactions. You should consider feeding high-latency transaction data into an unsupervised machine learning engine for several months to see if it identifies something that your analysis missed. Gartner has an interesting research paper on the subject (also available for free from several vendors with website registration).
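
One way to capture Top-N values without storing every sample is a fixed-size min-heap, sketched below with Python’s standard `heapq` module. The `TopNSlowest` class, its method names, and the sample transactions are hypothetical; a real collector would record whatever detail your analysis needs.

```python
import heapq


class TopNSlowest:
    """Keep only the N slowest transactions seen so far."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap of (latency, sequence, detail)
        self._seq = 0    # tie-breaker so details are never compared

    def record(self, latency_s, detail):
        entry = (latency_s, self._seq, detail)
        self._seq += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif latency_s > self._heap[0][0]:
            # Evict the fastest of the retained transactions.
            heapq.heapreplace(self._heap, entry)

    def slowest(self):
        """Return retained (latency, detail) pairs, slowest first."""
        return [(lat, detail) for lat, _, detail in sorted(self._heap, reverse=True)]


# Hypothetical transactions: (latency in seconds, URL path).
tracker = TopNSlowest(3)
for latency, path in [(0.2, "/home"), (5.0, "/cart"), (0.3, "/home"),
                      (9.0, "/checkout"), (1.0, "/search"), (7.5, "/checkout")]:
    tracker.record(latency, {"path": path})

for latency, detail in tracker.slowest():
    print(f"{latency:5.1f} s  {detail['path']}")
```

Memory use stays proportional to N regardless of traffic volume, and in this made-up data the retained entries already hint at a common problem on the checkout path.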

Our minds also fool us. We frequently have difficulty conceptualizing how problems scale up as our IT systems grow. Events that are supposed to be extremely rare occur more frequently than our intuition suggests. James Hamilton, VP and distinguished engineer at Amazon, has seen his share of rare events and wrote about them in “At Scale, Rare Events Aren’t Rare.”
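
The scaling effect is easy to check with basic probability. This sketch assumes each request independently has the same small chance of hitting a rare failure; the one-in-a-million rate and the ten million requests per day are made-up numbers chosen only to show the shape of the result.

```python
def prob_at_least_one(p, n):
    """Probability that an event with per-trial probability p occurs at least once in n trials."""
    return 1.0 - (1.0 - p) ** n


def expected_events(p, n):
    """Expected number of occurrences in n independent trials."""
    return p * n


P_FAILURE = 1e-6             # assumed: one request in a million fails
REQUESTS_PER_DAY = 10_000_000  # assumed daily request volume

print(f"Chance on a single request: {prob_at_least_one(P_FAILURE, 1):.6%}")
print(f"Chance of at least one per day: {prob_at_least_one(P_FAILURE, REQUESTS_PER_DAY):.4%}")
print(f"Expected failures per day: {expected_events(P_FAILURE, REQUESTS_PER_DAY):.1f}")
```

Under these assumptions, a one-in-a-million event is a near certainty every day, with around ten occurrences expected.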

Uncovering the News

You’ll find that many network monitoring systems don’t provide enough resolution in their default data to drive the detailed analysis described above. Instead, use the percentile and maximum metrics to identify parts of the network and IT systems that need more detailed investigation. You may need to customize data collection on those systems to obtain the higher resolution data you’ll need for better analysis.

By taking the time to think about the data presented and how your system collects it, you can find the missing information that data averaging hides.
