No Jitter is part of the Informa Tech Division of Informa PLC

Average Network Statistics Can Hide Big IT Problems


We like to use averages when analyzing network statistics, but those averages often hide a lot of bad news, such as the worst response times. Let’s take a closer look.
Network Statistics Averaging
It’s easy to apply what we think we know about averaging to network statistics to identify when our IT networks are not performing well. Unfortunately, averages hide a lot of news, both good and bad. The good news is great, but it’s not something that compels us to act. However, the bad news tells us about the parts of our IT systems where something is wrong, which is most likely impacting an employee’s or customer’s intended actions.
Take network availability, for example. What does it mean to say that the average network uptime is 99.999%? That’s the magic five-nines that many organizations strive for, which translates into about 5.26 minutes of downtime a year. But is it calculated by averaging the uptime of network devices across the entire network? And how should you account for redundant devices? Is it better to use average network connectivity availability across the network? These are very different measurements.
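To make the five-nines arithmetic concrete, here is a minimal Python sketch. The function names are hypothetical, and the series and parallel formulas assume that devices fail independently, which real networks often violate. It is only meant to show why "average device uptime" and "path availability" are different numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series_availability(availabilities):
    """Availability of a path that needs every device up (assumes independence)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(availabilities):
    """Availability of redundant devices where any one is enough (assumes independence)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


five_nines = downtime_minutes_per_year(0.99999)      # about 5.26 minutes
chain = series_availability([0.999, 0.999, 0.999])   # lower than any single device
pair = parallel_availability([0.99, 0.99])           # higher than either device alone
```

Three devices in a row that each average 99.9% uptime deliver a path that is available less than 99.9% of the time, while a redundant pair of 99% devices does better than either one. The device average says nothing about which of those situations you are in.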
Averages also don’t account for the relative importance of different systems. Parts of the network that support critical business functions like manufacturing, customer order taking, billing, and fulfillment are more important on a day-to-day basis than the parts of the network that support less critical functions.
But I Use Percentiles
For some network statistics, percentiles provide better visibility into system performance. For example, the 95th percentile calculation is successfully used for network utilization measurements.
But applying the same calculation to website response times, which many web analytics systems do, simply hides the worst five percent of the response times. If your 95th percentile webpage response time is one second, then one request out of twenty takes one second or longer, and the metric doesn’t tell you how much longer. Yet the website dashboard shows green as long as the metric is below one second. This is where you have to be aware of watermelon metrics: green on the outside, but with a lot of red just below the surface. One poor experience out of twenty isn’t a great result, and this information is hidden from view.
Then, there is the temptation to average percentiles to arrive at a satisfying single metric across multiple systems. As this article explains, you shouldn’t be tempted. Investigate any of your systems that average percentiles to produce a single metric. It is almost certainly hiding critical performance data.
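A small Python sketch illustrates why averaging percentiles misleads. The two servers, their request counts, and their response times are made up for illustration, and the nearest-rank percentile used here is just one of several common percentile definitions.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds.
busy_server = [0.5] * 900 + [5.0] * 100   # 1,000 requests, 10% of them slow
quiet_server = [0.1] * 100                # 100 requests, all fast

p95_busy = percentile_nearest_rank(busy_server, 95)
p95_quiet = percentile_nearest_rank(quiet_server, 95)

# The tempting shortcut: average the two per-server percentiles.
averaged_p95 = (p95_busy + p95_quiet) / 2

# The honest number: the percentile of all requests pooled together.
pooled_p95 = percentile_nearest_rank(busy_server + quiet_server, 95)

print(f"averaged p95: {averaged_p95:.2f}s, pooled p95: {pooled_p95:.2f}s")
```

In this made-up case the averaged figure comes out at roughly half of the pooled 95th percentile, because the quiet server counts as much as the busy one even though it served a tenth of the traffic.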
By examining that hidden top five percent, you can see how bad it really is. Don’t be surprised to find that the slowest samples are much higher than the 95th percentile metric.
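As a rough illustration, the Python sketch below pulls out the samples that sit above the 95th percentile. The response times are invented, and the helper names are hypothetical rather than part of any monitoring product.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tail_above(samples, pct):
    """Samples strictly above the given percentile, slowest last."""
    threshold = percentile_nearest_rank(samples, pct)
    return sorted(s for s in samples if s > threshold)


# Made-up response times in seconds: 95 acceptable requests and 5 slow ones.
response_times = [0.2] * 60 + [0.6] * 35 + [4.0, 7.5, 12.0, 20.0, 31.0]

p95 = percentile_nearest_rank(response_times, 95)
hidden = tail_above(response_times, 95)

print(f"p95 = {p95}s, hidden tail = {hidden}, worst = {max(response_times)}s")
```

Here the dashboard would report a comfortable 0.6 seconds while the five hidden samples range from 4 to 31 seconds.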
Hiding the Bad News
Our analysis systems hide bad news in several ways. The first, and perhaps most obvious now, is by averaging data. When our systems collect data every N seconds, they are effectively averaging that data over the interval. Network monitoring systems tend to have pretty long sample intervals for interface performance data, frequently five or ten minutes, which hides the peak values that occur during those periods. All we know is that X bytes were transmitted or received on a network interface in those N seconds. The sampled data is then used in the percentile calculations that we use for capacity planning purposes. This is a valid use of percentiles to inform our capacity planning objectives.
But we often see averaging and percentile calculations applied to other data where they hide meaningful information, such as web server response times. Instead, look for the maximum values. These are real measured values, and they define the worst case. For more on how our monitoring tools often hide data, I found the article “Everything You Know About Latency Is Wrong” and its associated video, “How NOT to Measure Latency,” helpful.
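The effect of a long polling interval is easy to sketch in Python. The traffic pattern below is hypothetical: five minutes of light load with one ten-second burst. The function name is an assumption for illustration, not a real monitoring API.

```python
def interval_averages(per_second_bytes, interval_s):
    """Average bytes per second over each polling interval, as a poller would report it."""
    averages = []
    for start in range(0, len(per_second_bytes), interval_s):
        window = per_second_bytes[start:start + interval_s]
        averages.append(sum(window) / len(window))
    return averages


# Hypothetical traffic: 290 seconds at 1 MB/s, then a 10-second burst at 100 MB/s.
traffic = [1_000_000] * 290 + [100_000_000] * 10

five_minute_view = interval_averages(traffic, 300)  # one sample for the whole period
ten_second_view = interval_averages(traffic, 10)    # thirty samples
true_peak = max(traffic)

print(f"5-minute poll reports {five_minute_view[0] / 1e6:.1f} MB/s")
print(f"10-second polls peak at {max(ten_second_view) / 1e6:.1f} MB/s")
```

The five-minute poll reports about 4.3 MB/s for a link that spent ten seconds at 100 MB/s. Only the shorter interval reveals the burst.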
Collecting enough data to perform analysis much beyond the 99.9th percentile is challenging. An alternative is to capture the Top-N values. For webpage latency, capture the top 100 or 1000 slowest transactions in enough detail to inform your analysis, which may find a common problem across multiple transactions. You should consider feeding high-latency transaction data into an unsupervised machine learning engine for several months to see if it identifies something that your analysis missed. Gartner has an interesting research paper on the subject, which you can find here (also available for free from several vendors with website registration).
Our minds also fool us. We frequently have difficulty conceptualizing how problems scale up as our IT systems grow. Events that are supposed to be extremely rare occur more frequently than our intuition suggests. James Hamilton, VP and distinguished engineer at Amazon, has seen his share of rare events and wrote about them in “At Scale, Rare Events Aren’t Rare.”
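One way to capture Top-N values without storing every sample is a bounded min-heap. The Python sketch below is a minimal illustration under that assumption. The class name, the `details` payload, and the sample URLs are hypothetical.

```python
import heapq


class TopNSlowest:
    """Keeps only the N slowest transactions seen so far, using a bounded min-heap."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # entries are (latency_s, sequence, details)
        self._seq = 0    # tie-breaker so details are never compared

    def record(self, latency_s, details):
        entry = (latency_s, self._seq, details)
        self._seq += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif latency_s > self._heap[0][0]:
            # Slower than the fastest entry we kept: replace that entry.
            heapq.heapreplace(self._heap, entry)

    def slowest(self):
        """Return (latency_s, details) pairs, slowest first."""
        return [(lat, det) for lat, _, det in sorted(self._heap, reverse=True)]


tracker = TopNSlowest(3)
for latency, url in [(0.2, "/home"), (5.0, "/checkout"), (0.3, "/home"),
                     (9.1, "/checkout"), (0.25, "/search"), (7.4, "/checkout"),
                     (0.1, "/home")]:
    tracker.record(latency, {"url": url})

for latency, details in tracker.slowest():
    print(latency, details["url"])
```

Memory use stays fixed at N entries no matter how many transactions pass through. Keeping the details alongside each latency is what lets you spot a common factor, such as the same URL appearing in every slow transaction.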
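The arithmetic behind that intuition gap is short enough to sketch in Python. The one-in-a-million probability and the daily request volume are made-up numbers, and the formula assumes that each operation fails independently of the others.

```python
import math


def expected_events(p_per_operation, operations):
    """Expected number of rare events across many operations."""
    return p_per_operation * operations


def prob_at_least_one(p_per_operation, operations):
    """Probability of seeing the event at least once, assuming independent operations."""
    # Computes 1 - (1 - p)^n in a numerically stable way for very small p.
    return -math.expm1(operations * math.log1p(-p_per_operation))


# Hypothetical: a one-in-a-million failure, at two different daily volumes.
small_site = prob_at_least_one(1e-6, 1_000)        # very unlikely on any given day
large_site = prob_at_least_one(1e-6, 10_000_000)   # all but certain every day
per_day = expected_events(1e-6, 10_000_000)        # about ten occurrences a day
```

At a thousand requests a day, a one-in-a-million event can reasonably be ignored. At ten million requests a day, the same event is expected about ten times daily.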
Uncovering the News
You’ll find that many network monitoring systems don’t provide enough resolution in their default data to drive the detailed analysis described above. Instead, use the percentile and maximum metrics to identify parts of the network and IT systems that need more detailed investigation. You may need to customize data collection on those systems to obtain the higher resolution data you’ll need for better analysis.
By taking the time to think about the data presented and how your system collects it, you can find the missing information that hides in data averaging.