Why's My Network So Slow?
We have recently handled several customer cases at NetCraftsmen that involved slow applications. The first step in determining the cause is to identify and isolate the factors that contribute to the slowness. In each case, we started by trying to determine whether the slowness was caused by something in the network or by something in the application.
Is It the Network?
Network causes include obvious things like interface errors and less obvious things like network congestion, which also results in packet loss. Packet loss has a significant detrimental effect on applications that rely on TCP. A small amount of packet loss can reduce a 10ms, 1Gbps path to a path with only 200Mbps of goodput. Goodput is the volume of delivered application data, excluding packet retransmissions. How small is small? In this case, the threshold is a loss rate of 0.0001%. Learn more about the impact of packet loss on TCP by reading about the Mathis Equation.
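As a rough illustration of the Mathis Equation, the sketch below computes the approximate ceiling on TCP throughput, MSS / (RTT * sqrt(p)), for a loss rate p. The 1,460-byte MSS and the constant C = 1 are assumptions made for this example, not figures from our customer cases, and the function names are hypothetical.

```python
import math


def mathis_ceiling_bps(mss_bytes: float, rtt_s: float, loss_rate: float, c: float = 1.0) -> float:
    """Approximate upper bound on TCP throughput in bits per second.

    Simplified Mathis form: C * MSS / (RTT * sqrt(p)).
    loss_rate is a fraction (0.0001% is 1e-6), not a percentage.
    """
    if loss_rate <= 0 or rtt_s <= 0:
        raise ValueError("loss_rate and rtt_s must be positive")
    return c * (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))


def loss_rate_for_ceiling(mss_bytes: float, rtt_s: float, target_bps: float, c: float = 1.0) -> float:
    """Invert the same formula: the loss rate at which the ceiling equals target_bps."""
    return (c * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


if __name__ == "__main__":
    link_bps = 1e9  # 1Gbps path from the text
    rtt_s = 0.010   # 10ms round-trip time from the text
    mss = 1460      # assumed MSS for a 1,500-byte Ethernet MTU
    for loss_pct in (0.0001, 0.001, 0.01, 0.1, 1.0):
        ceiling = mathis_ceiling_bps(mss, rtt_s, loss_pct / 100)
        # The link speed still caps what the path can deliver.
        print(f"loss {loss_pct}%: about {min(ceiling, link_bps) / 1e6:.0f} Mbps")
```

Under these assumptions, a loss rate of 0.0001% puts the ceiling at roughly 1.17Gbps on a 10ms path, which is about where loss begins to limit a 1Gbps link. Higher loss rates lower the ceiling quickly.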
Real-time voice and video (UC) applications use UDP for transport and are able to handle up to 1% packet loss as long as the lost packets are random. The codecs in use are able to interpolate between adjacent samples, allowing the audio or video systems to cover up for an occasional lost packet. However, they do not work well with burst loss. In this case, the codecs do not have the necessary samples from which to perform the interpolation to recover lost packets.
Interface congestion occurs at two places in networks. The first is at speed mismatch points, such as data from a LAN that needs to transit a lower-speed WAN link to a remote site. The router that connects the LAN segment to the WAN link contains a small number of buffers in which received packets can be stored while the WAN link transmits a previous packet. But this buffering is limited. Using too many buffers causes problems with transport protocols like TCP, so it is better to drop packets when the router buffers fill and let TCP handle the retransmission. The packet loss tells TCP that the path bandwidth has been filled and that it should slow down. This is normal. Only high volumes of packet loss indicate problematic network congestion. We've found that more than about 100,000 drops per day is an indication of significant network congestion warranting investigation.
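One way to apply the 100,000-drops-per-day rule of thumb is to sample an interface discard counter twice (for example, an SNMP ifOutDiscards value) and normalize the difference to a daily rate. The sketch below assumes a 32-bit counter that wraps at most once between samples; the function names are hypothetical.

```python
DROPS_PER_DAY_THRESHOLD = 100_000  # rule of thumb from the text
SECONDS_PER_DAY = 86_400


def drops_per_day(count_start: int, count_end: int, elapsed_s: float, counter_bits: int = 32) -> float:
    """Normalize the change between two discard-counter samples to drops per day."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = count_end - count_start
    if delta < 0:
        # Assumption: the counter wrapped exactly once between the samples.
        delta += 2 ** counter_bits
    return delta * SECONDS_PER_DAY / elapsed_s


def warrants_investigation(count_start: int, count_end: int, elapsed_s: float) -> bool:
    """True when the normalized drop rate exceeds the rule-of-thumb threshold."""
    return drops_per_day(count_start, count_end, elapsed_s) > DROPS_PER_DAY_THRESHOLD


if __name__ == "__main__":
    # 5,000 drops in one hour extrapolates to 120,000 per day.
    print(drops_per_day(1_000, 6_000, 3_600))
    print(warrants_investigation(1_000, 6_000, 3_600))
```

Short sampling windows extrapolate poorly if the congestion is bursty, so longer intervals that cover the busy hours give a more representative number.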
Another source of network-induced problems is high-latency paths, sometimes known as long fat pipes when the path also has high bandwidth. An application that uses many small packets in a back-and-forth interaction between the client and the server will seem slow on such a path, simply because of the time it takes for all the packets to transit the high-latency link.
Let's examine the worst-case scenario: a client that needs to exchange 1,000 packets with an application server to display a complex graphical interface. The typical round-trip latency across the continental U.S. is 60 milliseconds. If the application waits for the client to acknowledge each packet before sending the next one, the display takes 1,000 * 60ms = 60,000ms, or 60 seconds, to refresh. An application like this would likely run well in a local LAN environment where the latency is 2ms (2 seconds to refresh the display).
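The arithmetic above generalizes easily. The sketch below estimates the time spent waiting on round trips for a stop-and-wait exchange, and shows how much is saved when several packets are sent per round trip. The function name and the packets_per_round_trip parameter are hypothetical, and the model ignores transmission and processing time.

```python
import math


def exchange_time_s(packets: int, rtt_ms: float, packets_per_round_trip: int = 1) -> float:
    """Time spent waiting on round trips, in seconds.

    With packets_per_round_trip=1 the sender waits for an acknowledgment
    after every packet, which is the worst case described in the text.
    """
    if packets_per_round_trip < 1:
        raise ValueError("packets_per_round_trip must be at least 1")
    round_trips = math.ceil(packets / packets_per_round_trip)
    return round_trips * rtt_ms / 1000.0


if __name__ == "__main__":
    print(exchange_time_s(1000, 60))      # cross-country path: 60.0 seconds
    print(exchange_time_s(1000, 2))       # local LAN: 2.0 seconds
    print(exchange_time_s(1000, 60, 10))  # 10 packets per round trip: 6.0 seconds
```

The last case shows why the application design matters as much as the path: batching ten packets per round trip cuts the wait by a factor of ten without changing the network at all.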
Is this a network problem or is it an application problem? Well, it is some of both. Little can be done about latency itself: it is set by the propagation speed of electrical or optical pulses over a path that's 6,000 miles long. However, we sometimes find that strange routing path selection creates a long path when a much shorter and faster path is available. The solution in these cases is to route the traffic over the shorter path.
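To see how much of the latency is simply distance, the sketch below estimates the propagation delay over a path of a given length. It assumes signals travel at about two-thirds of the speed of light in a vacuum, a common approximation for optical fiber, and it treats the 6,000 miles as the total distance traveled. The function name is hypothetical.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in a vacuum
KM_PER_MILE = 1.609344


def propagation_delay_ms(path_miles: float, velocity_factor: float = 0.67) -> float:
    """Delay in milliseconds for a signal to cover path_miles.

    velocity_factor is the assumed fraction of the vacuum speed of light;
    about 0.67 is a typical approximation for optical fiber.
    """
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    path_km = path_miles * KM_PER_MILE
    return path_km / (SPEED_OF_LIGHT_KM_S * velocity_factor) * 1000.0


if __name__ == "__main__":
    print(f"{propagation_delay_ms(6000):.1f} ms over 6,000 miles")
    print(f"{propagation_delay_ms(3000):.1f} ms over 3,000 miles")
```

Under these assumptions, propagation alone accounts for most of a 60ms cross-country round trip, which is why a shorter path is the only real fix for latency.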
A packet capture of an application can tell us whether the application is sending a lot of small packets and whether it is waiting for each packet to be acknowledged. We also use packet captures to identify packet loss, which appears as a significant volume of retransmissions and duplicate ACKs. In one customer case, we found that there was insufficient bandwidth between two local facilities for the types of applications that were running over them. The packet captures showed hundreds of thousands of retransmitted packets per day. The routers that connected each site were showing high packet discard rates on their metro Ethernet connections.
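To make the idea of spotting retransmissions concrete, here is a simplified sketch that counts them in already-parsed capture records. The Segment type and its flow label are hypothetical, and the logic is far cruder than what a real analyzer does: it ignores sequence number wraparound, reordering, and keepalives, and simply flags any data segment that ends at or before the highest byte already seen for that flow direction.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Segment(NamedTuple):
    flow: str         # hypothetical label for one direction of a connection
    seq: int          # TCP sequence number of the first payload byte
    payload_len: int  # payload bytes carried by the segment


def count_retransmissions(segments: Iterable[Segment]) -> Dict[str, int]:
    """Count data segments that resend bytes already seen on the same flow."""
    highest_end: Dict[str, int] = {}
    retransmissions: Dict[str, int] = defaultdict(int)
    for seg in segments:
        if seg.payload_len == 0:
            continue  # pure ACKs carry no data to retransmit
        end = seg.seq + seg.payload_len
        seen = highest_end.get(seg.flow)
        if seen is not None and end <= seen:
            retransmissions[seg.flow] += 1
        else:
            highest_end[seg.flow] = end
    return dict(retransmissions)


if __name__ == "__main__":
    capture = [
        Segment("client>server", 1, 100),
        Segment("client>server", 101, 100),
        Segment("client>server", 1, 100),  # same bytes sent again
        Segment("client>server", 201, 100),
        Segment("server>client", 1, 50),
    ]
    print(count_retransmissions(capture))
```

In practice, a capture tool's built-in TCP analysis is the better choice; the sketch only shows what a retransmission looks like at the sequence-number level.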
Is It the Application?
An Application Performance Management (APM) system makes it easy to differentiate between network causes and application-specific causes. I think of these systems as super-smart packet capture and analysis systems. A good system can identify when server responses are slow, indicating an application problem instead of a network problem.
On the network side, they can identify packet retransmissions that indicate packet loss within the network or high latency in network transactions, both of which indicate network problems. Since an APM also sees all the packets, it can identify a poorly designed application that uses many small packets instead of fewer, larger packets. However, not many customers have an APM installed, so we often have to resort to other approaches.
Modern, multi-tiered applications often have internal problems that cause an application to be sluggish. At another customer, we found that a poorly written SQL query between two tiers of an application caused slowness that was initially attributed to the network. In this case the customer had an APM and was able to diagnose the problem within an hour. Similarly, an SQL query that works well in software development may not work well in production as the database grows, so look for those queries as well. A packet capture will show whether the server is slow to send updates to the endpoints (an application problem) or whether it is encountering packet loss that requires retransmission (a network problem).
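When only a packet capture is available, one rough way to separate server delay from network delay is to subtract the round-trip time from the gap between a request and the first byte of its response. The sketch below assumes timestamps taken at the client and uses the TCP handshake round-trip time as the network estimate; the function name is hypothetical.

```python
from typing import Tuple


def split_response_time(request_ts: float, first_response_ts: float, handshake_rtt_s: float) -> Tuple[float, float]:
    """Split request-to-first-response time into (network_s, server_s).

    request_ts and first_response_ts are capture timestamps in seconds,
    taken at the client. handshake_rtt_s is the SYN to SYN-ACK round trip,
    used here as an estimate of the network's share.
    """
    total = first_response_ts - request_ts
    if total < 0:
        raise ValueError("response timestamp precedes request timestamp")
    network = min(handshake_rtt_s, total)
    return network, total - network


if __name__ == "__main__":
    network_s, server_s = split_response_time(10.000, 10.850, 0.060)
    print(f"network about {network_s * 1000:.0f} ms, server about {server_s * 1000:.0f} ms")
```

A large server share points at the application or its back-end tiers, while a response time close to the round-trip time points back at the path.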
We've also seen misconfigured applications perform poorly. An interesting case involving video conferencing systems took several months to diagnose. There were constant reports of garbled video, unclear audio, and long call establishment times. Of course, the network was accused of being the problem ... and at first, it sure looked that way. Packet loss was high, as reported by the UC video systems.
But separate tests between the video conference systems showed no problems. We finally looked over the system configurations in great detail and found that some of them were configured to use a Multipoint Control Unit (MCU) on the Internet instead of the MCU within the organization for internal calls. The volume of video traffic, combined with data traffic, overwhelmed the Internet links, causing packet loss. Our testing had run directly between the internal subnets because we had not realized that the call traffic was being routed out to the Internet.
The application server staff can also be of great assistance by reporting if the servers are taxing their memory, storage system, or CPU during the reported times of slow applications. They can also report on TCP stats that indicate significant packet loss, helping everyone understand which components are likely candidates for further investigation.
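As one example of the TCP stats a server team might report, Linux exposes cumulative TCP counters in /proc/net/snmp, including OutSegs and RetransSegs. The sketch below parses that text and returns the fraction of sent segments that were retransmissions. It is Linux-specific, the function name is hypothetical, and the counters cover all connections since boot rather than one application.

```python
def tcp_retransmit_ratio(snmp_text: str) -> float:
    """Return RetransSegs / OutSegs from the contents of /proc/net/snmp.

    The file holds pairs of lines per protocol: a header line of field
    names followed by a line of values, both prefixed with "Tcp:".
    """
    header = values = None
    for line in snmp_text.splitlines():
        if line.startswith("Tcp:"):
            fields = line.split()[1:]
            if header is None:
                header = fields
            else:
                values = fields
                break
    if header is None or values is None:
        raise ValueError("no Tcp: header/value pair found")
    stats = dict(zip(header, (int(v) for v in values)))
    sent = stats["OutSegs"]
    return stats["RetransSegs"] / sent if sent else 0.0


if __name__ == "__main__":
    with open("/proc/net/snmp") as f:  # Linux only
        print(f"{tcp_retransmit_ratio(f.read()):.4%} of segments retransmitted")
```

Because the counters are cumulative, sampling them before and after a reported slow period and comparing the differences gives a more useful picture than a single reading.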
It can sometimes seem impossible to determine why an application runs slowly. Checking for packet loss is an easy way to tell whether it might be a network problem or an application problem. And as we saw in the above examples, it might be an incorrect configuration that sends traffic over unexpected paths that either increase latency or encounter packet loss.
Make sure you are testing over the path that the traffic is actually using. Finally, get the server and application teams involved to provide additional data. Everyone has to work together to resolve the more challenging cases.