How to Manage Interface Packet Loss Thresholds

Interface packet loss is a sign of link problems that shouldn’t be ignored. The hard part is choosing an alerting threshold that flags real problems without creating too many false alerts. So, what should you do? Allow me to explain.

Causes of Packet Loss

Packet loss results in packet retransmissions that consume multiple round-trip times, leading to significantly lower application throughput; in other words, application slowness. Real-time protocols are generally more tolerant of small amounts of random packet loss. However, they don’t work well with bursts of packet loss, and certainly not when the loss rate gets too high.

Link and Interface Errors

Link and interface errors can be due to many sources. Fiber-based networks are subject to anything that reduces the optical signal, such as dirty, high-loss connections and fibers that are pinched or stretched. Copper cabling, most often twisted pair, has its own set of failure modes, including poorly crimped connectors, cable runs close to high voltage sources, or pinched cables. Wireless networks are known for a variety of limitations that create packet loss, such as overloaded access points, radio frequency (RF) interference from non-Wi-Fi sources like microwave ovens, and poor RF signal strength. You should treat interface errors as a soft infrastructure failure—they affect applications in subtle ways.

Network Congestion

Network congestion occurs when network devices (including host interfaces) run out of buffer space and must drop excess packets. The intuitive fix is to increase buffering, but oversized buffers interfere with congestion control algorithms so badly that the problem has a name: bufferbloat.

Interface drops (sometimes called discards) aren’t necessarily a bad thing. Congestion can occur at aggregation points or where link speeds change. It becomes a problem when it occurs too frequently and the resulting packet loss makes applications slow. Quality of service (QoS) is used in these cases to prioritize crucial, time-sensitive traffic flows and to force drops onto less important packets. We have successfully used QoS to prioritize business applications over less important entertainment traffic (such as streaming audio).

A Surprisingly Low Threshold

So, you want to configure your network management platform to alert you to potential sources of packet loss that impact application performance. What’s a reasonable figure to use for an alerting and reporting threshold? You would think that one percent would suffice, based on intuition developed in other disciplines, such as finance. However, that intuition is flawed when applied to networking.

The Transmission Control Protocol (TCP) is very sensitive to packet loss. Researchers studied how TCP throughput varies with packet loss, and the result is known as the Mathis Equation. The short summary is that packet loss of more than 0.001% of all packets causes a significant decrease in throughput. That’s a packet loss rate of one packet out of 100,000 (1 in 10^5), which translates into a bit error rate (BER) on the order of 10^-10. (The figures are approximate because of differences in packet sizes.)
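To make the sensitivity concrete, here is a minimal Python sketch of the Mathis Equation in its commonly quoted form, throughput <= (MSS / RTT) x (C / sqrt(p)). The function name, the 1,460-byte segment size, the 50 ms round-trip time, and the constant C = sqrt(3/2) are assumptions chosen for illustration; they are not figures from this article.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_seconds: float, loss_rate: float,
                          c: float = math.sqrt(3 / 2)) -> float:
    """Approximate upper bound on steady-state TCP throughput, in bits per second.

    mss_bytes   -- maximum segment size (assumed 1,460 bytes in the demo below)
    rtt_seconds -- round-trip time
    loss_rate   -- packet loss probability, e.g. 1e-5 for 0.001%
    c           -- model constant; sqrt(3/2) is one commonly used value (assumption)
    """
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Illustrative path: 1,460-byte segments over a 50 ms round trip.
    for loss in (1e-2, 1e-3, 1e-4, 1e-5):
        mbps = mathis_throughput_bps(1460, 0.050, loss) / 1e6
        print(f"loss {loss:.3%}: at most about {mbps:,.1f} Mbps per flow")
```

Under these assumptions, a single flow at one percent loss is bounded at roughly 3 Mbps, and even at 0.001% loss the bound is only about 90 Mbps. Because throughput scales with 1/sqrt(p), cutting loss by a factor of 100 raises the bound by only a factor of 10.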

Before you say that this error threshold is too small, let’s look at it differently. How long do you think a link should run before it experiences a packet loss? Using a BER of 10^-10, a fully loaded one gigabit per second (1 Gbps) link would run about 10 seconds between errors, while a 10 Gbps link would experience an error every second. You can use this information to determine your network management system packet loss thresholds.
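The arithmetic behind those figures can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that the link is fully utilized and that each lost packet corresponds to a single bit error.

```python
def packet_loss_to_ber(packet_loss_rate: float, packet_bytes: int = 1500) -> float:
    """Rough BER implied by a packet loss rate, assuming one bad bit per lost packet."""
    return packet_loss_rate / (packet_bytes * 8)


def seconds_between_errors(ber: float, link_bps: float) -> float:
    """Average time between bit errors on a fully loaded link."""
    return 1.0 / (ber * link_bps)


if __name__ == "__main__":
    # One lost packet in 100,000, for two assumed packet sizes.
    for size in (1500, 9000):
        print(f"{size}-byte packets: BER about {packet_loss_to_ber(1e-5, size):.1e}")

    # Time between errors at a BER of 1e-10.
    for label, bps in (("1 Gbps", 1e9), ("10 Gbps", 10e9)):
        print(f"{label}: about {seconds_between_errors(1e-10, bps):.0f} s between errors")
```

With 1,500-byte packets, a loss rate of 1 in 10^5 works out closer to 8 x 10^-10, and with 9,000-byte jumbo frames it is about 1.4 x 10^-10, which is why the BER figure is best read as an order of magnitude.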

Network Management Thresholds

Network management systems (NMS) should be collecting interface performance data from all network interfaces within the organization, including errors and drops/discards. Your selection of an alerting threshold for errors/drops/discards will depend on what error rates you are willing to tolerate for your network and what threshold setting the network management tools will support. I was recently surprised to find an NMS in which packet drop thresholds couldn’t be set smaller than one percent. In these cases, it may be better to use absolute count values as thresholds. Also, note that management systems typically count errors separately from drops/discards.
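If the NMS can’t express a small enough percentage, the same check can be done on counter deltas between two polls. The following is a minimal sketch, not any particular product’s method: the InterfaceSample structure, the function names, and the default thresholds are all hypothetical, and it assumes 64-bit packet counters and 32-bit error and discard counters that wrap at most once between polls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InterfaceSample:
    """One poll of an interface's cumulative counters (hypothetical structure)."""
    packets: int   # packets delivered, assumed to be a 64-bit counter
    errors: int    # errored packets, assumed to be a 32-bit counter
    discards: int  # dropped/discarded packets, assumed to be a 32-bit counter


def counter_delta(old: int, new: int, modulus: int) -> int:
    """Difference between two cumulative readings, tolerating a single wrap."""
    return (new - old) % modulus


def evaluate_interval(prev: InterfaceSample, curr: InterfaceSample,
                      ratio_threshold: float = 1e-5,
                      count_threshold: Optional[int] = None) -> dict:
    """Compare errors and discards over one polling interval against a threshold.

    When count_threshold is given, absolute counts are used instead of ratios,
    for tools that cannot express a ratio as small as 0.001%.
    """
    delivered = counter_delta(prev.packets, curr.packets, 2**64)
    errors = counter_delta(prev.errors, curr.errors, 2**32)
    discards = counter_delta(prev.discards, curr.discards, 2**32)
    total = delivered + errors + discards

    error_ratio = errors / total if total else 0.0
    discard_ratio = discards / total if total else 0.0

    if count_threshold is not None:
        error_alert = errors > count_threshold
        discard_alert = discards > count_threshold
    else:
        error_alert = error_ratio > ratio_threshold
        discard_alert = discard_ratio > ratio_threshold

    # Errors and discards are reported separately, as most tools count them.
    return {
        "error_ratio": error_ratio,
        "discard_ratio": discard_ratio,
        "error_alert": error_alert,
        "discard_alert": discard_alert,
    }


if __name__ == "__main__":
    before = InterfaceSample(packets=1_000_000, errors=10, discards=0)
    after = InterfaceSample(packets=3_000_000, errors=60, discards=5)
    print(evaluate_interval(before, after))
```

Keeping the error and discard alerts separate preserves the distinction between link faults and congestion, which usually call for different fixes.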

Regardless of the exact threshold, you should configure the NMS to produce Top-N reports (e.g., Top-10) of the interfaces with the highest number of errors and drops. You can then focus on diagnosing the interfaces that have the most impact on applications. Note that some interfaces will have errors or drops but aren’t handling much traffic. I’ve seen cases where packet loss on a link was nearly 100%, but it was for a minimal number of packets. Beware: some of these paths are likely to be backup links that will carry high loads if the primary fails, so it’s risky to ignore these problems. You should create synthetic loads between network devices to verify the integrity of these links.
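A Top-N report and a check for lightly used links with a high loss ratio can be sketched as follows. The InterfaceStats structure, the function names, the sample interface names, and the cutoff values are hypothetical and only illustrate the idea.

```python
from typing import List, NamedTuple


class InterfaceStats(NamedTuple):
    """Per-interface totals for one reporting period (hypothetical structure)."""
    name: str
    packets: int  # all packets seen, including those lost
    errors: int
    drops: int


def top_n_by_loss(stats: List[InterfaceStats], n: int = 10) -> List[InterfaceStats]:
    """Interfaces with the highest combined error and drop counts."""
    return sorted(stats, key=lambda s: s.errors + s.drops, reverse=True)[:n]


def low_traffic_high_loss(stats: List[InterfaceStats], min_ratio: float = 0.01,
                          max_packets: int = 10_000) -> List[InterfaceStats]:
    """Lightly used interfaces (for example, backup links) with a high loss ratio.

    These rarely rank in a count-based Top-N report, so they are listed separately.
    """
    flagged = []
    for s in stats:
        if s.packets == 0 or s.packets > max_packets:
            continue
        if (s.errors + s.drops) / s.packets >= min_ratio:
            flagged.append(s)
    return flagged


if __name__ == "__main__":
    sample = [
        InterfaceStats("core-uplink", 50_000_000, 900, 4_000),
        InterfaceStats("access-12", 2_000_000, 30, 0),
        InterfaceStats("backup-wan", 120, 110, 0),
    ]
    print([s.name for s in top_n_by_loss(sample, n=2)])
    print([s.name for s in low_traffic_high_loss(sample)])
```

The second list is a reasonable starting point for choosing which links to exercise with synthetic load.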

Let’s examine an actual link error situation that a network engineer at a major financial services firm described to me. The network engineering team couldn’t make network changes; that was reserved for the network operations team. Some key applications were slow, and the engineer had determined that the cause was a duplex mismatch on a router-to-router link. But because packet loss was only one percent, the operations team ignored it and looked for some other cause. It took the engineer several weeks to convince the operations team to fix the problem, whereupon the applications immediately returned to the desired performance.

Digital Experience and Application Performance Monitoring

Packet loss monitoring and analysis get tricky with cloud-based applications because you don’t have network management visibility into the server-side network statistics. There are two potential alternatives:

digital experience (DX) monitoring products
application performance monitoring (APM) systems

DX products can include a client-based monitoring system that collects important client-side data like Wi-Fi signal strengths and packet retransmissions.

Application performance monitoring products monitor application performance, frequently by performing packet captures at points between the application servers and the client endpoints. A bit of setup to identify applications and client endpoints makes it easy for these systems to detect a variety of problems, including client-side slowness, network retransmissions (due to packet loss), and slow application servers.

Summary

You have a wide variety of tools to monitor for packet loss, even extending to cloud-based applications. Setting appropriate thresholds on network error and drop counters provides you with visibility into how well your infrastructure is running.
