Before I get into the technical topic, I want to report on Enterprise Connect. It was a great conference and trade show. The vendor show was full, with big booths by all the big companies as well as all sorts of smaller companies. I received good feedback on the sessions in which I participated. One fellow had done his homework, checking out the slides before the conference to see which presentations he wanted to attend. Of course, my presentations tended to be a bit more technical, but that was what he was seeking. He wasn't alone. My session on How to Keep Video From Blowing Up Your Network was well attended, with over 100 people present. John Bartlett's QoS session, in which I talked about making networks more resilient, a requirement for today's voice and video networks, also went well. Finally, I led a panel on Network Test Tools in which we shared information about how different tools provide different views of the voice/video infrastructure. That session was held at 8 a.m. on the last morning and still drew over 150 attendees. Eric Krapf graciously gave me permission to use the same material in a presentation to the Cisco Mid-Atlantic User Group (CMUG) this week, which was also well attended. It is obvious that good technical content is in demand.
This post's technical topic comes from a question that was asked at both Enterprise Connect and the CMUG meeting:
You say that congestion is a prime source of packet loss in the network. How can I easily detect congestion in my network?
An interface drops packets when the egress queues fill. The queues fill only when packets are arriving faster than they can be transmitted. When an interface egress queue is full, there is no place to store an outgoing packet and the device must drop it.
Some network engineers don't like to see drops, but a small amount of loss is actually desirable. Packet loss due to congestion is what TCP uses to perform flow control; if TCP doesn't detect packet loss, it continues to increase its data rate, which only adds to the congestion problem. Too much buffering can make it impossible for TCP to determine the un-congested path capacity. UDP doesn't get this same feedback, so any application using UDP continues to send at its defined rate unless some other feedback mechanism is employed.
By monitoring interface statistics for drops and discards, it is possible to detect excessive congestion. Note that there will always be a small number of drops at congestion points within the network, because TCP will ramp up its data rate until it is running at the maximum that the endpoints can handle, or until it experiences packet loss. What we want to detect is excessive discards. Since TCP throughput starts to suffer once packet loss exceeds 0.0001% (see the Mathis Equation blogs), we have a figure to use.
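The raw counters are easy to collect. Here is a minimal sketch of one way to pull them, assuming Net-SNMP's snmpget utility is available and the device permits SNMPv2c reads; the hostname, community string, and interface index are placeholders, and your NMS may already expose the same values.

```python
# Minimal sketch: fetch the output-packet and output-discard counters for one
# interface with Net-SNMP's snmpget. Hostname, community string, and ifIndex
# are placeholders; substitute whatever your environment uses.
import subprocess

def get_counter(host: str, community: str, oid: str) -> int:
    """Return a single counter value via snmpget (-Oqv prints the value only)."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

if_index = 1  # placeholder interface index
discards = get_counter("router1", "public", f"IF-MIB::ifOutDiscards.{if_index}")
# ifOutUcastPkts counts only unicast; add non-unicast or HC counters if needed.
out_pkts = get_counter("router1", "public", f"IF-MIB::ifOutUcastPkts.{if_index}")
print(f"ifOutDiscards={discards} ifOutUcastPkts={out_pkts}")
```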
Using the 0.0001% loss figure, we can set a trigger on interface statistics that generates an alert when the drop rate exceeds that figure. The equation is
drop rate = drops / (output packets + drops)
We're looking for the drop rate to stay below 0.0001%, which is 0.000001 when expressed as a fraction. If the drop rate exceeds this figure, generate an alert, because the interface was congested over the monitored period.
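As a rough illustration of the trigger (not any particular product's alerting syntax), here is a minimal sketch that applies the formula to two successive counter samples for one interface and raises an alert when the drop rate over that interval exceeds the threshold; the counter values are made up.

```python
# Minimal sketch: compute the drop rate over one polling interval from two
# successive counter samples and flag the interface if it exceeds 0.0001%.
DROP_RATE_THRESHOLD = 0.000001  # 0.0001% expressed as a fraction

def drop_rate(prev: dict, curr: dict) -> float:
    """Drop rate over the interval between two samples of the same interface."""
    drops = curr["out_discards"] - prev["out_discards"]
    out_pkts = curr["out_packets"] - prev["out_packets"]
    total = out_pkts + drops
    return drops / total if total else 0.0

# Hypothetical counter samples taken 30 minutes apart on one interface.
sample_t0 = {"out_packets": 10_000_000, "out_discards": 1_200}
sample_t1 = {"out_packets": 18_500_000, "out_discards": 1_230}

rate = drop_rate(sample_t0, sample_t1)
if rate > DROP_RATE_THRESHOLD:
    print(f"ALERT: congested interface, drop rate {rate * 100:.6f}%")
else:
    print(f"OK: drop rate {rate * 100:.6f}% is within the threshold")
```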
Be careful about the length of the monitoring period. It is possible for a short monitoring period to generate frequent alerts due to the bursty nature of most data transfers. Use longer monitoring periods like 30 minutes or an hour to detect interfaces that have long-term congestion. If your monitoring system only allows analysis on the regular polling period, which is often five minutes, use a low severity level for the alerts and look for the number of alerts as a metric for identifying the most congested interfaces.
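If the poller is locked to five-minute samples, the alert-counting approach might look like the following sketch; the interface names and per-interval drop rates are hypothetical stand-ins for what your monitoring system would supply.

```python
# Sketch: tally how often each interface exceeds the drop-rate threshold in
# its five-minute samples, then rank interfaces by the number of congested
# intervals. The data below is hypothetical.
from collections import Counter

DROP_RATE_THRESHOLD = 0.000001

# Per-interval drop rates keyed by interface, as produced by the formula above.
five_minute_rates = {
    "core1 Gi0/1": [0.0, 2e-6, 5e-6, 0.0, 3e-6],
    "core1 Gi0/2": [0.0, 0.0, 0.0, 1.5e-6, 0.0],
    "edge2 Gi0/3": [4e-6, 6e-6, 9e-6, 7e-6, 5e-6],
}

alert_counts = Counter(
    {intf: sum(1 for r in rates if r > DROP_RATE_THRESHOLD)
     for intf, rates in five_minute_rates.items()}
)

# The most congested interfaces bubble to the top of the report.
for intf, count in alert_counts.most_common():
    print(f"{intf}: {count} congested intervals")
```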