Recognizing the Need for More Bandwidth
Do you know when a network link needs more bandwidth?
When do you have to add more bandwidth to a congested link? When it runs out of bandwidth is the obvious answer. But how do you know that a link is congested? If you look at most network management link utilization displays, they rarely show anywhere close to 100% link utilization.
There are two reasons for this. First, polling every few minutes effectively averages the link utilization over the time between polls. Second, most products average several samples to arrive at the aggregate data used in longer-interval displays. Figure 1 shows one-minute samples while Figure 2 shows the same data plotted with 10-minute resolution. The peaks are significantly lower in the 10-minute plot because of averaging.
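The flattening effect is easy to reproduce. The sketch below uses made-up one-minute utilization samples (they are not the data behind Figures 1 and 2) and a hypothetical helper, downsample, that averages them into 10-minute buckets the way a longer-interval display would.

```python
from statistics import mean


def downsample(samples, factor):
    """Average consecutive groups of `factor` samples into one value each."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Hypothetical one-minute utilization samples (percent) over 20 minutes.
one_minute = [10, 12, 95, 98, 15, 11, 9, 14, 10, 12,
              8, 9, 11, 10, 97, 12, 10, 9, 11, 10]

ten_minute = downsample(one_minute, 10)

print("peak at 1-minute resolution: ", max(one_minute))
print("peak at 10-minute resolution:", max(ten_minute))
```

With these invented numbers, bursts near 98% show up as a peak of under 30% once they are averaged over 10 minutes.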
The 95th Percentile level, shown below in Figure 3 as horizontal blue (Receive) and red (Transmit) lines, displays more useful data. When calculated over 24 hours, the 95th Percentile shows the minimum utilization of the busiest 72 minutes of the day (5% of 1,440 minutes). Since 72 minutes is just a bit over an hour, I like to think of it as the minimum utilization during the "busy hour." In Figure 3, the Transmit 95th Percentile utilization is 25%; because the reporting interval there is 12 hours, it shows the minimum utilization for the busiest 36 minutes of that period. Regardless of the type of measurement, utilization is not a good indication of congestion.
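To make the "busiest 72 minutes" reading concrete, here is a minimal sketch that assumes one utilization sample per minute. The function name busy_period_floor and the sample values are invented for illustration, and NMS products may compute their percentiles with slightly different rules.

```python
def busy_period_floor(samples, top_percent=5):
    """Return (count, floor): how many samples fall in the busiest
    `top_percent` of the period, and the lowest value among them."""
    count = max(1, len(samples) * top_percent // 100)
    busiest = sorted(samples, reverse=True)[:count]
    return count, busiest[-1]


# Hypothetical day of one-minute samples (percent utilization):
# mostly idle, a 100-minute busy stretch at 25%, and 40 minutes of peaks at 80%.
day = [10] * 1300 + [25] * 100 + [80] * 40
half_day = [10] * 650 + [25] * 50 + [80] * 20

print(busy_period_floor(day))       # busiest 5% of 1,440 minutes
print(busy_period_floor(half_day))  # busiest 5% of 720 minutes
```

In this invented data set the busiest 72 minutes never drop below 25%, even though the link peaks at 80% for part of that time.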
Fortunately, there is a simple way to identify link congestion. Look for interface output drops (sometimes called "discards"). These occur when the network device has no free buffers in the output interface queue. Because packets awaiting transmission occupy all the buffers, the device must drop the new packet. Since a few drops are part of normal network operation, use the network management system (NMS) "Top-N" reports to show the interfaces that have the highest drop counts.
A second check is that these drop counts are greater than 0.0001% of link bandwidth, expressed in packets. This is the point at which packet loss causes TCP performance to degrade. The combination of the Top-N drops and drop counts greater than 0.0001% identifies interfaces that are candidates for bandwidth upgrades.
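The two checks can be combined in a few lines. This is only a sketch under stated assumptions: the interface records, field names, polling interval, and the 500-byte average packet size are all hypothetical, and a real NMS would supply the drop counters (for example from interface discard statistics).

```python
def capacity_pps(link_bps, avg_packet_bytes):
    """Packets per second the link could carry at the assumed packet size."""
    return link_bps / (avg_packet_bytes * 8)


def upgrade_candidates(interfaces, interval_s, avg_packet_bytes=500,
                       top_n=10, threshold=1e-6):
    """Rank interfaces by output drops, keep the Top-N, then keep only those
    whose drops exceed 0.0001% (1e-6) of the link's packet capacity."""
    ranked = sorted(interfaces, key=lambda i: i["out_drops"], reverse=True)[:top_n]
    candidates = []
    for intf in ranked:
        max_packets = capacity_pps(intf["speed_bps"], avg_packet_bytes) * interval_s
        if intf["out_drops"] > threshold * max_packets:
            candidates.append(intf["name"])
    return candidates


# Hypothetical counters collected over a 300-second polling interval.
interfaces = [
    {"name": "core1", "speed_bps": 1_000_000_000, "out_drops": 40},
    {"name": "wan0", "speed_bps": 40_000_000, "out_drops": 1200},
    {"name": "lan3", "speed_bps": 100_000_000, "out_drops": 0},
]

print(upgrade_candidates(interfaces, interval_s=300, top_n=2))
```

Here core1 makes the Top-2 list but stays under its threshold of 75 drops for the interval, so only wan0 is reported as a candidate.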
Next, we need to determine whether other tools, such as quality of service (QoS), can handle the congestion. Is there any important or time-sensitive traffic to prioritize? Conversely, can any packets be dropped (i.e., much less important or less time-sensitive data)?
You need to analyze the traffic using a tool that can show which applications are running so that you can prioritize the important applications and de-prioritize the least important applications. At one end of the tool spectrum is Wireshark, a basic packet capture tool that can perform application analysis (refer to this Sharkfest '12 presentation, "Application Performance Analysis"). Other tools include Riverbed's SteelCentral AppResponse for doing application performance analysis, and any of several types of flow data analysis tools.
At one customer, we were tasked with investigating slow application performance at a remote site. Our traffic analysis showed that entertainment traffic from Pandora.com, Akamai Networks, and Limelight Networks accounted for more than 50% of the 40-Mbps capacity available. (Pandora is a streaming music site, while Akamai and Limelight are content delivery networks that deliver things like movies.) We moved the entertainment traffic into a low-priority queue and gave higher priority to voice calls and business applications. This worked well at this site.
What About Buffering?
But couldn't the excess packets simply be buffered and transmitted when the congestion subsides? Well, that depends. If the congestion is of very short duration, buffering may work. However, do not use large amounts of buffering to try to control and reduce drops on heavily utilized links, especially if the links have low latency.
TCP will retransmit any unacknowledged packets after roughly twice the round-trip time (2 * RTT) because it assumes the packets have been dropped. My rule of thumb is to not use more buffers than the bandwidth-delay product of the link, in bytes, divided by the typical packet size (BW * RTT / typical-packet-size). It is important to identify a good packet size for use in this calculation. Voice packets are about 220 bytes long, while file transfers use the maximum packet size (normally 1,500 bytes, but up to 9,000 bytes if jumbo frames are allowed). A value of 500 bytes is normally a reasonable starting point.
An example may help. Let's assume a 1Gbps link (125MBps) between two nearby data centers, with a normal RTT of 2 milliseconds (0.002 seconds). Most of the applications involve file transfers, so the typical packet size is close to 1,500 bytes. Plugging into our equation:
1Gbps * 2ms / 1,500B = 125MBps * 0.002s / 1,500B = 166 packets
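The same rule of thumb as a small helper, in case you want to try other link speeds or packet sizes. The function name max_buffers is made up for this sketch, and it takes the RTT in whole milliseconds so that the arithmetic stays in integers.

```python
def max_buffers(link_bps, rtt_ms, packet_bytes):
    """Bandwidth-delay product in bytes divided by the typical packet size,
    rounded down to a whole number of buffers."""
    bdp_bytes = link_bps * rtt_ms // (8 * 1000)
    return bdp_bytes // packet_bytes


# The example above: 1 Gbps, 2 ms RTT, 1,500-byte packets.
print(max_buffers(1_000_000_000, 2, 1500))

# The same link with the 500-byte starting-point packet size.
print(max_buffers(1_000_000_000, 2, 500))
```

Smaller typical packets raise the buffer count because the bandwidth-delay product (250,000 bytes here) is spread over more packets.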
If you configure an oversubscribed interface with too many buffers, it confuses the TCP retransmit algorithm. When queues build, the TCP retransmit timer expires and another copy of transmitted data is sent. This effectively reduces the link's "goodput" because it sends two copies of the same packet. So too much buffering actually hurts more than it helps.
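One way to see why, as a rough sketch: compute how long a packet waits behind a full queue and compare it with the 2 * RTT retransmit rule of thumb used above. The function names are hypothetical, and real TCP stacks compute their retransmit timers in a more involved way than this simple comparison.

```python
def queue_delay_ms(buffers, packet_bytes, link_bps):
    """Time to drain a full output queue at line rate, in milliseconds."""
    return buffers * packet_bytes * 8 * 1000 / link_bps


def risks_spurious_retransmit(buffers, packet_bytes, link_bps, rtt_ms):
    """True when the added queueing delay alone exceeds 2 * RTT."""
    return queue_delay_ms(buffers, packet_bytes, link_bps) > 2 * rtt_ms


LINK_BPS = 1_000_000_000  # the 1 Gbps link from the example
RTT_MS = 2

for buffers in (166, 2000):
    delay = queue_delay_ms(buffers, 1500, LINK_BPS)
    flagged = risks_spurious_retransmit(buffers, 1500, LINK_BPS, RTT_MS)
    print(buffers, "buffers:", delay, "ms of queueing, over 2*RTT:", flagged)
```

With 166 buffers a full queue adds about one RTT of delay. With 2,000 buffers (an arbitrary oversized value) it adds 24 ms, far beyond the 4 ms at which the sender in this simplified model would already be retransmitting.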
When to Add More Bandwidth
If some of the traffic in our example could be put into a low-priority queue, then QoS might be a suitable solution. However, what if there isn't any low-priority traffic? That's when it is necessary to add more bandwidth. But how much? Doubling the available bandwidth is typically a good starting point. I've not seen a formula that tells us how much bandwidth to add, although it may be possible to derive an estimate using the Mathis Equation.
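For reference, the Mathis Equation bounds the throughput of a single TCP flow at roughly (MSS / RTT) * (C / sqrt(p)), where p is the packet loss rate and C is a constant close to 1.22. The sketch below only evaluates that bound. It is not a link-sizing formula, and the MSS, RTT, and loss values are assumptions chosen for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on one TCP flow's throughput, in bits/second."""
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


# Assumed values: 1,460-byte MSS and a 2 ms RTT, at two loss rates.
for loss in (1e-6, 1e-4):  # 0.0001% and 0.01% packet loss
    bound = mathis_throughput_bps(1460, 0.002, loss)
    print(f"loss {loss:.4%}: about {bound / 1e6:,.0f} Mbps per flow")
```

Because the bound scales with 1 / sqrt(p), a hundredfold increase in loss cuts the achievable per-flow throughput by a factor of ten, which is why even small drop rates matter.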
We encountered this situation at one customer. The link was dropping a lot of packets and was running at high utilization levels during working hours (this was an example where the NMS average utilization plots did show useful data). None of the applications were low priority, so we couldn't use QoS to prioritize the traffic. Adding bandwidth was the only recourse. Fortunately, the customer had already started the process of getting a link upgrade. Based on drop counts, we were able to determine that, at a minimum, the company needed to double link capacity.
What's That Again?
Interface utilization plots from network management systems are often not very useful for identifying congested interfaces. Instead, use interface drop counts, sorted from highest value to lowest value. Begin working on the Top-10 or Top-20 interfaces, focusing on those interfaces that show drop rates greater than 0.0001% of the link capacity (remember to convert interface speed from bits/second to packets/second, so you'll need a rough idea of packet size).
Analyze the traffic on the links with the most drops to determine the application mix. Identify any applications that must have high priority and those applications that can be given a low priority (i.e., that can be dropped when congestion occurs). If low-priority applications are consuming a significant amount of bandwidth, use QoS to force those packets to be dropped when congestion occurs. Check the QoS queue drop counts to determine if additional buffering is needed, but never add so much buffering that it confuses TCP.
If all applications are important and you can't select some traffic to drop using QoS, then you need more bandwidth. In this case, doubling the existing link capacity is a good start.