QoS - It Really Is Important
Tying together QoS, network monitoring, and the impact of packet loss
QoS Is Still Misunderstood
I received an email this week from a gentleman named Steve, who asked about QoS. It seems that some of his co-workers don't believe in QoS in their network. They evidently believe that the network has sufficient bandwidth that QoS is not needed. Steve, however, had read a couple of my blog posts about QoS, interface drops, and network performance. He was concerned that they needed QoS and wanted to learn more about the factors that indicate a need for it.
Over the years, I've written several articles about QoS, network monitoring, and the impact of packet loss on network performance. Since Steve asked about all of these, I thought that it would be useful to write a summary article that ties all the parts together.
The Impact of Packet Loss
It doesn't take much packet loss to negatively impact applications. Despite all the emphasis on network performance for real-time applications like voice and video, those applications are surprisingly resilient in the face of random packet loss. The codecs in popular use can interpolate between received data samples to synthesize samples close to what was lost. Our own visual and auditory systems are also quite tolerant of noise and dropouts, allowing us to make sense of an imperfect signal. So real-time applications tend to perform reasonably well in the presence of packet loss.
TCP, on the other hand, is seriously impacted by packet loss. What's surprising is just how little packet loss it takes. We've been trained to think that TCP handles packet loss all on its own, and from my perspective, our intuition about the volume of loss that matters has been very wrong. I discovered the gap between reality and my intuition when I came across what's been called The Mathis Equation, named after the principal author of the initial research paper on the impact of packet loss on TCP. I first wrote about it at Chesapeake Netcraftsmen.
The result is that packet loss greater than 0.0001% (that is, a loss fraction of 0.000001, or one packet in a million) should be investigated. That's the point on the curve where packet loss begins to impact TCP performance. Could you use a larger packet loss figure? Sure. But I wouldn't go any higher than 0.01% packet loss, because of the loss of throughput for TCP shown in the Mathis Equation throughput chart. Since most business applications run on TCP, higher packet loss can have a significant impact on productivity.
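The equation itself is simple enough to explore with a few lines of code. This is a minimal sketch of the Mathis et al. model, which bounds steady-state TCP throughput at roughly (MSS / RTT) * (C / sqrt(p)); the MSS, RTT, and constant C used here are illustrative values, not measurements from any particular network.

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss_rate, c=1.0):
    """Approximate upper bound on TCP throughput (bytes/sec) from the
    Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p))."""
    return (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))

# Example: 1460-byte MSS, 50 ms RTT, at three loss rates
for p in (1e-6, 1e-4, 1e-2):  # 0.0001%, 0.01%, 1% loss
    mbps = mathis_throughput(1460, 0.050, p) * 8 / 1e6
    print(f"loss fraction {p}: ~{mbps:.1f} Mbit/s ceiling")
```

Because throughput scales with 1/sqrt(p), moving from 0.0001% loss to 0.01% loss (a factor of 100) cuts the achievable throughput ceiling by a factor of 10, which is why I treat 0.01% as the most loss worth tolerating.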
How do you know that you have a problem? After all, if no congestion is occurring, QoS isn't necessary. This is where your Network Management System (NMS) becomes useful.
I should point out here that simply monitoring link utilization isn't sufficient for detecting whether a link needs QoS. The averaging that an NMS performs when collecting interface performance data seriously understates the actual bandwidth used at any point in time. Data traffic is very bursty, and the NMS reporting is averaged over much longer periods than the bursts themselves. We've seen customers with links running at 40% long-term utilization that were nonetheless experiencing significant congestion-induced packet loss. In my experience, most network engineers and managers wouldn't be concerned by 40% utilization. They are missing the peaks and how often those peaks occur.
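A toy example makes the averaging problem concrete. The traffic profile below is synthetic (a 100 Mbit/s link, mostly idle, with brief line-rate bursts), but it shows how a polling-interval average can look harmless while individual seconds saturate the link:

```python
# 300 one-second utilization samples on a 100 Mbit/s link:
# a quiet baseline with a 3-second line-rate burst every 30 seconds.
samples = [10] * 300            # baseline ~10 Mbit/s each second
for t in range(0, 300, 30):     # bursts at t = 0, 30, 60, ...
    samples[t:t + 3] = [100, 100, 100]

avg = sum(samples) / len(samples)   # what a 5-minute NMS poll reports
peak = max(samples)                 # what the link actually experienced
print(f"5-min average: {avg:.0f} Mbit/s, worst second: {peak} Mbit/s")
```

The NMS graph shows a comfortable 19% average; the seconds at 100% line rate, where the drops occur, never appear in it.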
It is better to look for other indicators of packet loss, such as interface drops. Some systems may report drops as "discards," so look carefully at both the network equipment and NMS to find the right variable. Two previous posts that deal with detecting packet loss are "Detecting Network Packet Loss" and "Detecting Link Congestion."
Another way to detect packet loss is to ask the endpoints, such as VoIP phones and video conferencing systems. The voice/video controller can report statistics from the endpoints and let you sort by loss, or export the data to a tool where you can do the sorting.
Finally, you can monitor the business servers for packet loss, using 'netstat -s -p tcp'. The output includes the following:
~ tcs$ netstat -s -p tcp
697719 packets sent
313408 data packets (81383562 bytes)
247 data packets (72106 bytes) retransmitted
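The interesting figure in that output is the ratio of retransmitted data packets to data packets sent. A quick calculation, using the counters above:

```python
# Counters taken from the netstat output above
data_packets_sent = 313408
retransmitted = 247

loss_pct = retransmitted / data_packets_sent * 100
print(f"retransmission rate: {loss_pct:.4f}%")
```

At roughly 0.08%, this is well above the 0.01% threshold I suggested earlier; that's unsurprising for a laptop on wireless, but it's a number I'd investigate on a wired business server.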
TCP ramps up its throughput during the "slow start" phase of a data transfer. It relies on a dropped packet to signal that it has reached the throughput limit of the path, so some packet loss is to be expected.
My sample netstat output above is from my laptop, which runs over wireless most of the time, so I expect higher retransmissions than normal. For server data, I prefer a double sort to show me the systems that need the most attention: the first sort is by retransmissions, the second by packets sent. I then look for the systems with the largest volume of traffic and the highest retransmissions.
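In Python, that double sort is a one-liner over whatever per-server stats you've collected. The hostnames and counters below are made up purely for illustration:

```python
# Hypothetical per-server counters scraped from 'netstat -s -p tcp'
servers = [
    {"host": "web01", "sent": 4_200_000, "retrans": 300},
    {"host": "db01",  "sent": 9_500_000, "retrans": 12_000},
    {"host": "app01", "sent": 7_800_000, "retrans": 11_500},
]

# Primary key: retransmissions; secondary key: packets sent.
# Descending order floats the busiest, lossiest systems to the top.
ranked = sorted(servers, key=lambda s: (s["retrans"], s["sent"]), reverse=True)
for s in ranked:
    print(f'{s["host"]:8} retrans={s["retrans"]:>8} sent={s["sent"]:>10}')
```

A tuple sort key gives both sorts in one pass: servers tie-broken by traffic volume only when their retransmission counts match, which matches the triage order I described.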
Once the high-loss systems are identified, you then need to determine the network path that's being used. You may need to identify the typical set of TCP connections with 'netstat -an' or by talking with the server team about the application architecture. Once you know the server's other endpoints, you can use trace/traceroute to identify the path the packets are taking. All this investigation can be a bit time consuming, but it is well worth it for critical business servers that are experiencing significant packet loss. While you're at it, you may find a simple duplex mismatch that's causing a significant problem.
Finally, there may be other sources of packet loss, as described in this blog post:
An old switch interface card combined with unplanned server connections resulted in significant congestion on that card. We wouldn't have found this without doing some CLI data collection and analysis.
Now let's talk about QoS itself. As I mentioned above, the network can experience congestion due to micro-bursts (also called instantaneous buffer congestion), as described in this blog post.
But you shouldn't stop there. You need to verify that the QoS implementation is doing what you want it to do. We discovered that a QoS configuration wasn't doing what we wanted on a highly congested T3 link in a customer's network. We had to modify the QoS buffering to force the drops into the low-priority traffic queue and buffer the high-priority data traffic.
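As a sketch of the kind of change involved, here is what such a policy can look like in Cisco IOS-style MQC syntax. The class names, match criteria, percentages, and queue depths are illustrative only, not the customer's actual configuration:

```
class-map match-any HIGH-PRI-DATA
 match dscp af31
!
policy-map T3-EDGE
 class HIGH-PRI-DATA
  bandwidth percent 40
  ! deeper queue: buffer high-priority data rather than dropping it
  queue-limit 256 packets
 class class-default
  ! shallow queue: steer the unavoidable drops into the low-priority class
  queue-limit 64 packets
!
interface Serial0/1
 service-policy output T3-EDGE
```

The key idea is the asymmetric queue-limit values: when the link is oversubscribed, something must be dropped, and the shallow default-class queue ensures it's the low-priority traffic.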
I find it interesting that old ideas stay with us so long. It wasn't until I learned about micro-bursts that I gained a real appreciation for the value of QoS, even in the LAN. Then I gained experience in several customer networks where we could see congestion and its impact, which deepened my understanding of congestion and its sources. Actually implementing QoS and seeing that the default configuration didn't work for a significantly oversubscribed link was very interesting. We learned what to tweak and were able to achieve the desired result.