Correcting Network Packet Loss

In my last post, I described how to detect network packet loss and promised to talk about several causes of packet loss and how to fix them. The examples I have all came from real networks where I, or someone at Netcraftsmen, was working with a customer. I've written blog posts about some of the problems and the resolution of each. However, I've not put all the information in one place, which I intend to address with this post.

Enterprise Connect in March 2012
Related to the content of this post are my presentations at Enterprise Connect 2012, March 26-29, 2012. I am hosting "How To Keep Video From Blowing Up Your Network", in which I’ll talk about running a real network, how to keep video traffic from impacting the other business applications, and how to identify when business applications are being impacted. It is related to the congestion topic below. The second session is "Network Test Tools for Voice and Video", where I will lead a panel discussion on how to use tools to diagnose problems with voice and video applications. Finally, I will be participating in John Bartlett's session on "QoS & Net Design for Converged Networks". This session is always well attended because it covers a wide range of topics. I talk about resilient network design during my part of the session. All of these presentations are related to the topics in this post.

The Impact of Packet Loss on TCP
I've written about the impact of packet loss on TCP in several prior blog posts:

* http://www.netcraftsmen.net/resources/blogs/rethinking-interface-error-reports.html
* http://www.netcraftsmen.net/resources/blogs/application-performance-troubleshooting.html

The Mathis equation is an excellent way to estimate the maximum TCP goodput of a given path. You have to know the round trip latency between the source and destination as well as the packet loss on the path to determine the maximum goodput, which is the volume of delivered user data. A graph of the resulting goodput for three different round trip times (RTT) is very informative. At 1Gbps, there is not much impact at 0.00001% packet loss. But at any higher packet loss, there is a significant reduction in goodput. Path RTT also has a big impact on goodput, as you can see in the graphs below.
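
For readers who want to reproduce numbers like these, here is a minimal Python sketch of the simplified Mathis bound, goodput <= (MSS / RTT) * (C / sqrt(p)), where p is the loss probability expressed as a fraction. The function name, the 1460-byte MSS default, and C = 1.0 are my assumptions for illustration (the original derivation arrives at a constant near 1.22 under its loss model), and the result is an upper-bound estimate, not a measurement.

```python
import math


def mathis_goodput_bps(rtt_s, loss_fraction, mss_bytes=1460, c=1.0, link_bps=None):
    """Estimate the TCP goodput ceiling in bits/s using the simplified Mathis bound.

    rtt_s         -- round trip time in seconds
    loss_fraction -- packet loss as a fraction (0.1% loss is 0.001)
    mss_bytes     -- assumed maximum segment size (1460 is typical for Ethernet)
    c             -- model constant; 1.0 is an assumption for this sketch
    link_bps      -- optional link speed; the estimate is capped at this value
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_fraction <= 1:
        raise ValueError("loss_fraction must be in (0, 1]")
    bound = (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_fraction))
    return min(bound, link_bps) if link_bps is not None else bound


if __name__ == "__main__":
    # Hypothetical RTTs and loss rates; substitute measurements from your own path.
    for rtt_ms in (2, 20, 80):
        for loss_pct in (0.00001, 0.001, 0.1):
            bps = mathis_goodput_bps(rtt_ms / 1000, loss_pct / 100, link_bps=1e9)
            print(f"RTT {rtt_ms:>2} ms, loss {loss_pct}%: {bps / 1e6:.1f} Mbps")
```

Because loss sits under a square root, quadrupling the loss rate halves the ceiling, while doubling the RTT halves it directly.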

Video and voice do not use TCP, so the graphs don't apply to them, but when you have significant amounts of voice and video, the other business applications that are using TCP may be impacted, depending on how you've set up QoS and allocated bandwidth. (That's the topic of my first Enterprise Connect session.)


Packet Loss from Network Congestion
My first example of network congestion came from an over-subscribed 1Gbps core network link between two data centers. The RTT latency was about 2ms and it was showing a 95th percentile utilization (see http://www.netcraftsmen.net/resources/blogs/95th-percentile-calculation.html) of 40%-50%. That didn't look too bad at first. But the number of drops during business hours was greater than 0.01% and regularly showed 0.1% packet loss. Looking at the above charts, we see that 0.1% packet loss provides about 23Mbps of goodput for any TCP applications on this path. Further examination verified the congestion, with high numbers of TCP retransmissions, higher than would normally be expected. Because there weren't any applications that we could de-prioritize, the only good solution was to add more bandwidth.
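
To illustrate why a moderate 95th percentile can coexist with drops, here is a small Python sketch using the nearest-rank method. The method choice, the function name, and the sample data are assumptions for illustration; the linked post and your monitoring tool may calculate the percentile differently.

```python
def percentile_95(samples):
    """Return the 95th percentile of utilization samples (nearest-rank method)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, which avoids floating point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical polling data: mostly 45% utilization with a few saturated intervals.
    samples = [45.0] * 96 + [100.0] * 4
    print("95th percentile:", percentile_95(samples))
    print("peak:", max(samples))
```

With data shaped like this, the busiest few percent of intervals are discarded by definition, so short saturated bursts that drop packets never show up in the 95th percentile figure.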

The second example comes from a T3 link (45Mbps) that connected a remote site to a corporate headquarters. The applications worked well early in the morning, late in the day, and at night. But during business hours, there was a substantial drop in application performance. Link utilization went from about 10% to over 80% at about 9am and stayed high until about 5pm. It didn't increase on holidays. Breaking out some tools, Opnet's Application Response Xpert in this case, allowed us to identify the endpoints of the traffic. Three endpoints for Internet-based traffic stood out, because they collectively represented 50% of the total traffic. The three sites were Pandora.com, Akamai, and LimeLight Networks. All three are content provider sources for streaming audio and video as well as video downloads. The network interfaces were showing significant packet drops. It was a clear case of an oversubscribed link. You can read about some of the details in the blog post "Diagnosing a QoS Deployment". Marking some traffic as low priority improved the performance of the business applications.
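
If your tool can export flow records, the same kind of top-talker breakdown can be sketched in a few lines of Python. This is not how Application Response Xpert works internally; the record layout (endpoint, byte count) and the function name are assumptions for illustration.

```python
from collections import Counter


def traffic_share_by_endpoint(flows):
    """Sum bytes per endpoint and return (endpoint, bytes, share_pct), largest first.

    flows -- iterable of (endpoint, byte_count) pairs from exported flow records
    """
    totals = Counter()
    for endpoint, byte_count in flows:
        totals[endpoint] += byte_count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(ep, b, 100.0 * b / grand_total) for ep, b in totals.most_common()]


if __name__ == "__main__":
    # Hypothetical records; real input would come from your flow collector.
    flows = [
        ("streaming-site-a", 300),
        ("erp-server", 100),
        ("streaming-site-a", 200),
        ("cdn-b", 400),
    ]
    for endpoint, byte_count, share in traffic_share_by_endpoint(flows):
        print(f"{endpoint:<18} {byte_count:>6} bytes  {share:5.1f}%")
```

Sorting by share makes it easy to see when a handful of endpoints account for a large part of the link, which is the pattern that stood out here.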

Link Errors
I commonly see two types of link errors as sources of packet loss. The first is the switch interface duplex mismatch, which I've written about before in "Auto-negotiate Duplex or Not?" as well as a number of other blogs. I mention it frequently because it is such a prevalent problem that is easy to identify and correct. It typically happens on server connections, but network infrastructure links are not immune to it. The error counts increase significantly as the link starts to handle more traffic. The types of errors indicate the duplex setting. Look for FCS errors and Runts on a full duplex interface to indicate that the connected device is running half duplex. If the local interface is running half duplex, then any late collisions imply that the connected device is running full duplex.

The advantage of learning about these signatures is that you don't need to have access to the connected device to detect the duplex mismatch. At its worst, a duplex mismatch on a very busy interface will only achieve a few Mbps of goodput. I'm a fan of using auto duplex except where a piece of equipment is known to fail with auto negotiation. The list of devices that fail should be well documented, so that most of the network interfaces can be run with a standard auto-negotiate setting.
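
The signatures described above can be written down as a simple check. This is a minimal Python sketch; the function name and the idea of feeding it counter increases over a polling interval are my assumptions, and a match is only a hint that the far end deserves a look, not proof of a mismatch.

```python
def diagnose_duplex(local_duplex, fcs_errors=0, runts=0, late_collisions=0):
    """Suggest a duplex-mismatch diagnosis from local interface counters only.

    local_duplex -- "full" or "half", the setting of the local interface
    fcs_errors, runts, late_collisions -- counter increases over the interval
    """
    if local_duplex not in ("full", "half"):
        raise ValueError("local_duplex must be 'full' or 'half'")
    if local_duplex == "full" and fcs_errors > 0 and runts > 0:
        return "possible mismatch: connected device may be running half duplex"
    if local_duplex == "half" and late_collisions > 0:
        return "possible mismatch: connected device may be running full duplex"
    return "no duplex-mismatch signature"


if __name__ == "__main__":
    # Hypothetical counter deltas for two interfaces.
    print(diagnose_duplex("full", fcs_errors=120, runts=45))
    print(diagnose_duplex("half", late_collisions=3))
```

The check needs nothing from the connected device, which is the point of learning the signatures.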

The second source of link errors is bad cabling or a faulty WAN connection. I've seen a bad fiber patch cable cause a few errors per day on a 10G connection. It was interesting because the other fiber connections throughout the network were all running with no errors. Over the course of two months, I watched the errors increase from about 5 per day to over 40 per day. It didn't really impact the data flow, but the trend was clear. The customer replaced the patch cable and the link now runs with no errors. It was a case of being proactive and correcting a problem before the errors increased to the point that they would have affected the data traffic.
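
A trend like this can be spotted automatically by fitting a line to the daily error counts. The Python sketch below uses an ordinary least-squares slope; the function name and the sample counts are assumptions for illustration, not data from the customer's network.

```python
def daily_error_slope(daily_counts):
    """Return the least-squares slope of daily error counts (errors per day, per day)."""
    n = len(daily_counts)
    if n < 2:
        raise ValueError("need at least two days of data")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical daily totals for one interface.
    counts = [5, 6, 8, 11, 15, 22, 30, 41]
    slope = daily_error_slope(counts)
    print(f"errors are changing by about {slope:.1f} per day, each day")
```

A slope that stays positive over several weeks is worth a ticket even when the absolute error count is still too small to affect traffic.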

In another case of link errors, one WAN connection that was created from two bonded links was showing errors. The network staff reconfigured the link to not run bonded and was able to detect that one WAN link was the source of the errors. The carrier was then contacted to troubleshoot the defective link.

Overruns
I recently ran into a new source of packet loss: ingress overruns, where the interface card is unable to handle an inbound Layer 2 frame before the next frame arrives. The diagnosis found that the customer had four high-powered servers connected with 1Gbps links to an older switch interface card. The interface card could only handle a total of 1Gbps of traffic, and the load when all four servers sent traffic at the same time could reach 4Gbps, far exceeding what the card could handle. The overruns were more than 0.005% of all ingress traffic on the interfaces to which these servers were connected (refer to the graphs above for the effective throughput at different RTTs). The solution was to move the servers to a modern interface card that could handle traffic at full wire-speed. In this case, detecting the problem was the key factor. Once the problem was understood, it was relatively easy to move the servers to switch interfaces that could handle the traffic load.
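
The arithmetic behind this case is a simple oversubscription ratio. Here is a minimal Python sketch; the function name is an assumption, and the card capacity has to come from the vendor's documentation for the specific interface card.

```python
def oversubscription_ratio(port_speeds_bps, card_capacity_bps):
    """Return the worst-case offered load divided by what the card can forward."""
    if card_capacity_bps <= 0:
        raise ValueError("card_capacity_bps must be positive")
    return sum(port_speeds_bps) / card_capacity_bps


if __name__ == "__main__":
    # Four 1Gbps server ports on a card that forwards 1Gbps in total.
    ratio = oversubscription_ratio([1e9, 1e9, 1e9, 1e9], 1e9)
    print(f"worst-case oversubscription: {ratio:.0f}:1")
```

A ratio above 1 does not guarantee overruns, but it does mean they become possible whenever the attached devices burst at the same time.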

Summary
Detecting packet loss is critical. Monitoring systems should collect a variety of interface performance data, including the different types of errors. Once you know why packets are being lost, you can take the appropriate action, which varies for each type of packet loss.

My rule of thumb is to investigate any link that shows more than 0.00001% drops or errors, to make sure that network congestion doesn't cause slow applications.
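
As a sketch of how this rule of thumb could be applied to collected counters, the Python below flags links whose drops plus errors exceed 0.00001% of packets, which is one packet in ten million. The input layout, the names, and the choice of total packets as the denominator are assumptions; adapt them to what your monitoring system actually reports.

```python
THRESHOLD_PCT = 0.00001  # rule-of-thumb threshold, in percent


def loss_percent(bad_packets, total_packets):
    """Return drops plus errors as a percentage of total packets."""
    if total_packets <= 0:
        return 0.0
    return 100.0 * bad_packets / total_packets


def links_to_investigate(counters, threshold_pct=THRESHOLD_PCT):
    """Return (link, loss_pct) pairs above the threshold, worst first.

    counters -- dict mapping link name to (drops_plus_errors, total_packets)
    """
    flagged = []
    for name, (bad, total) in counters.items():
        pct = loss_percent(bad, total)
        if pct > threshold_pct:
            flagged.append((name, pct))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counters for three links over one polling interval.
    counters = {
        "core-1": (1000, 1_000_000),
        "edge-7": (0, 5_000_000),
        "wan-2": (1, 100_000_000),
    }
    for name, pct in links_to_investigate(counters):
        print(f"{name}: {pct:.5f}% loss")
```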

How does the above tie into voice and video? When more and more video joins the network, and it is prioritized above data traffic, the data will begin to suffer from more congestion loss. The resulting packet loss can have a detrimental effect on the productivity of those business applications and the people who use them. And ultimately, the productivity of the business applications affects the productivity of the business itself.
