QoS in the LAN? You're Kidding!
There are a bunch of excuses for not doing QoS and none of them are valid when examined in detail.
I occasionally hear from different sources that QoS is not needed in the LAN. A variety of rationales are used, but I think the main factor is the complexity of configuring QoS. I thought that it would be interesting to take a close look at a number of excuses for not using QoS.
We Use High Speed Links
Some people think that bandwidth is a good replacement for QoS. Their statement is often something like this:
QoS configuration is pretty complex, so we just made sure that we have at least X-bandwidth links everywhere. [Replace X with the speed du jour.] With high-speed links, we don't need QoS.
The presumption is that with sufficient bandwidth, congestion will never occur. In practice, this is not the case. Applications are consuming increasingly more network bandwidth. The addition of video content, data-intensive applications, and graphics is increasing the volume of network traffic.
Possibly the worst example that I've seen of this excuse is where the network staff decided that all edge ports should be configured for 10 Mbps Half Duplex! Their intent was to limit the data volume at the edge–somewhat like admission control. It definitely limited the data volume--however, everyone had poor application performance, which didn't help the organization compete in the business world.
Instantaneous Buffer Congestion The high speed link argument doesn't work because of something called instantaneous buffer congestion (also called instantaneous interface congestion). Interface buffers quickly fill due to the bursty nature of IP network traffic. Modern workstations and servers can easily fill a 1-Gbps link, and multiple flows on a common network infrastructure can quickly congest 10-Gbps links. It is only when TCP detects packet loss that it reduces its data rate. A link that is transporting multiple flows can quickly become congested, causing packet loss. QoS is needed to handle the prioritization of packets at the congested interface.
The NMS Doesn't Show Congestion
The next excuse is about measuring congestion:
Our NMS shows low link utilization, so there can't be any congestion.
The problem is that NMS polling happens at such a slow rate that the instantaneous peaks aren't visible in the performance data. Reducing the polling period from 10 minutes to 1 minute will show more detail, but still not enough to show instantaneous buffer congestion. We've seen serious congestion on a 1-Gbps link that the NMS reported as being 30%-40% utilized (at a 10-minute polling interval).
To detect congestion, you should be using the network monitoring system to report interface drops. A drop will occur when the interface buffers are full.
Cisco products typically have a small number of buffers allocated, often 64 buffers. When 64 packets are queued on the interface, the buffer queue is full and any successive packets will be dropped until at least one packet is transmitted out of the buffer.
Looking for excessive interface drops will allow you to identify interface congestion even though utilization figures show low utilization. Be careful though; TCP uses drops as part of its feedback to reduce speed (see the next section for more on this).
Look for interfaces that have thousands of drops per hour. A list of interfaces, sorted in descending order of drop count, is the best way to find congested interfaces. These are the interfaces that should be configured with QoS.
Increase Interface Buffers
At another customer, we heard this story:
We had drops on a key link, so we added interface buffers to reduce the number of drops.
The number of drops was indeed bad. However, increasing the number of buffers on the interface was not the right answer. The problem with this solution is that they added thousands of buffers. The latency due to buffering during data bursts became a major problem. Real-time applications were dropping late packets and TCP was retransmitting packets that it thought had been lost, but were in fact simply waiting in the buffers for transmission.
This was on a 1-Gbps link. The duplicate packets consume network bandwidth with no benefit. We knew that TCP was retransmitting packets because we found large numbers of duplicate TCP ACKs in our packet traces.
It is actually better to drop excess traffic than to buffer very many packets. The right solution is to increase the link bandwidth. However, the short-term solution was to reduce the number of buffers and use QoS to prioritize traffic.
Next page: The Data Center doesn't need QOS?