Detecting Network Packet Loss
Key steps include investigating user complaints, examining endpoints, and monitoring device interfaces and traffic flows.
I've been involved in network analysis with two customers recently who had problems with excessive packet loss. Packet loss affects all applications, and voice and video are no exception. I thought it would be useful to review some of the things I found in the cases I investigated.
Investigate Customer Complaints
I start by looking for evidence of packet loss. In the example that I described last month (see Know the Path Your Media Sessions Take), a video conferencing system showed evidence of packet loss. I could also have checked the video conferencing system's operational statistics to see if it was experiencing packet loss, because these systems keep track of delay, jitter, and packet loss statistics for each call. The reports we received from customers and the technical staff indicated a consistent problem with packet loss.
We used traceroute to determine the path between the systems. The links in the path didn't show any evidence of packet loss. Fortunately, someone else checked the configuration of the video conferencing systems and found that they were using a media gateway that was on the Internet instead of the one within the enterprise. The relatively large bandwidth that was required for the video link was not able to transit the path via the Internet without a significant amount of loss. This scenario was a rather strange cause of packet loss, but highlights the point that when you've eliminated common sources of a problem, you have to start looking at less common sources.
Examine the Endpoints
Another source of packet loss information is the communicating endpoints. Voice and video endpoints keep information about delay, jitter, and packet loss for at least the prior call. Some systems report the statistics at the end of each call to the call controller, where the statistics are recorded. I like to examine end device packet loss figures. In addition to the packet loss figures, I want to know the IP address of the other endpoint that was participating in the call. By looking at a lot of calls between different systems, I can determine if there is packet loss on subsets of calls. Then, looking at the geography of the endpoints, I can narrow the search for the origin of the packet loss.
I use the endpoint statistics to determine which endpoints have the highest packet loss. I then examine the set of destinations that were involved in calls with the highest packet loss. If a wide set of destinations is showing packet loss, the source of the problem is probably close to the endpoint that I'm examining. I will collect traceroute data for the calls and look for common network elements. Is a certain link, router, or switch always in the path when high loss is observed? I can then begin examining the network for the source of the problem.
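The "common network elements" question can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (a `loss_pct` value and a `path` list of hop addresses per call), the function name `common_hops`, and the 1% threshold are hypothetical and not taken from any particular call controller or traceroute tool.

```python
from collections import Counter


def common_hops(call_records, loss_threshold=1.0):
    """Rank hops by how often they appear in the paths of high-loss calls.

    call_records: iterable of dicts with the assumed keys
        'loss_pct' (float, percent loss reported for the call) and
        'path'     (list of hop IP strings, e.g. from traceroute).
    Returns a list of (hop, lossy_call_count, clean_call_count) tuples,
    most suspicious first.
    """
    lossy = [c for c in call_records if c["loss_pct"] >= loss_threshold]
    clean = [c for c in call_records if c["loss_pct"] < loss_threshold]

    # Count each hop at most once per call, even if it repeats in a path.
    lossy_counts = Counter(hop for c in lossy for hop in set(c["path"]))
    clean_counts = Counter(hop for c in clean for hop in set(c["path"]))

    ranked = [(hop, n, clean_counts.get(hop, 0)) for hop, n in lossy_counts.items()]
    # A hop seen in many lossy calls and few clean calls sorts to the top.
    ranked.sort(key=lambda r: (-r[1], r[2], r[0]))
    return ranked


if __name__ == "__main__":
    calls = [
        {"loss_pct": 3.2, "path": ["10.0.0.1", "10.1.1.1", "10.2.0.1"]},
        {"loss_pct": 2.7, "path": ["10.0.0.1", "10.1.1.1", "10.3.0.1"]},
        {"loss_pct": 0.0, "path": ["10.0.0.1", "10.4.4.1", "10.2.0.1"]},
    ]
    for hop, n_lossy, n_clean in common_hops(calls):
        print(f"{hop}: in {n_lossy} lossy call(s), {n_clean} clean call(s)")
```

A hop that shows up in most lossy calls but rarely in clean ones is a reasonable place to start looking. It is not proof of a fault, since the real problem may sit on a link between two hops.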
Monitor Network Device Interfaces
Another approach is to look at interface statistics collected by the network management system (NMS). As I discovered recently (see Rethinking Interface Error Reports), simply looking at interfaces with high percentages of errors is not sufficient for finding all the sources of errors that need to be examined.
I am now also looking at the total number of errors on an interface. The total error count finds interfaces that have high volume and high errors but, because of the high volume, a low percentage of errors. The high data volume tells me that it is an important interface and that the errors are impacting all applications using that path, not just voice and video. At one site, some interfaces were recording more than 1 million packet errors per day. Since there are 86,400 seconds in a day, that's more than 11 errors per second when averaged over the whole day. In reality, more data is probably being sent during business hours than overnight, so the peak error rate is likely much higher. Finding these problems and fixing them is key to improving overall network performance and, ultimately, the productivity of the organization.
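To make the arithmetic and the two rankings concrete, here is a minimal Python sketch. The tuple layout `(name, packets, errors)`, the function names, and the sample numbers are assumptions for illustration. A real NMS export will have its own format.

```python
SECONDS_PER_DAY = 86_400


def errors_per_second(errors_per_day):
    """Average error rate over a full day."""
    return errors_per_day / SECONDS_PER_DAY


def rank_interfaces(stats, top_n=10):
    """Rank interfaces two ways: by total errors and by error percentage.

    stats: list of (name, packets, errors) tuples (assumed layout).
    Returns (by_total, by_pct), each a list of the top_n tuples.
    """
    def error_pct(entry):
        _, packets, errors = entry
        return 100.0 * errors / packets if packets else 0.0

    by_total = sorted(stats, key=lambda s: s[2], reverse=True)[:top_n]
    by_pct = sorted(stats, key=error_pct, reverse=True)[:top_n]
    return by_total, by_pct


if __name__ == "__main__":
    print(f"{errors_per_second(1_000_000):.2f} errors/s averaged over a day")

    sample = [
        ("core-uplink", 5_000_000_000, 1_000_000),  # busy link, 0.02% errors
        ("access-port", 10_000, 500),               # quiet port, 5% errors
    ]
    by_total, by_pct = rank_interfaces(sample)
    print("Highest total errors:", by_total[0][0])
    print("Highest error percentage:", by_pct[0][0])
```

In the sample data, the busy uplink tops the total-error list while the quiet access port tops the percentage list, which is why both views are worth checking.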
What if you can't get good data from your network management systems? We were recently working with a customer who didn't have an NMS configured to monitor many network device interfaces. So we collected the output from running a series of "show interfaces" commands. We were then able to use some simple scripts to find interfaces that had high error counts. Several data center switch interfaces appeared in our list, so we immediately focused on them and identified problems that affected several key servers. While these servers were not voice or video servers, they impacted several very important business applications. We also identified a core link that had a bad cable, which was affecting voice, video, and data connectivity to a large number of WAN sites.
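A script of that kind might look something like the following Python sketch. It is not the script we used. It assumes output that resembles Cisco IOS-style "show interfaces" text, where each interface begins with a "... line protocol is ..." header line and reports "N input errors" and "N output errors" counters. The function names and the threshold of 1,000 errors are hypothetical, and other vendors or software versions will need different patterns.

```python
import re

# Header such as: "GigabitEthernet0/1 is up, line protocol is up"
HEADER_RE = re.compile(r"^(\S+) is .*line protocol is")
# Counter such as: "1200 input errors, ..." or "5 output errors, ..."
ERROR_RE = re.compile(r"(\d+) (input|output) errors")


def parse_interface_errors(text):
    """Return {interface: {'input': n, 'output': n}} from captured output."""
    totals = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            totals[current] = {"input": 0, "output": 0}
            continue
        counter = ERROR_RE.search(line)
        if counter and current is not None:
            totals[current][counter.group(2)] = int(counter.group(1))
    return totals


def high_error_interfaces(text, threshold=1000):
    """List (interface, total_errors) at or above threshold, worst first."""
    flagged = []
    for name, counts in parse_interface_errors(text).items():
        total = counts["input"] + counts["output"]
        if total >= threshold:
            flagged.append((name, total))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = """\
GigabitEthernet0/1 is up, line protocol is up
     1200 input errors, 1100 CRC, 0 frame, 0 overrun, 0 ignored
     5 output errors, 0 collisions, 1 interface resets
GigabitEthernet0/2 is up, line protocol is up
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 0 interface resets
"""
    for name, total in high_error_interfaces(sample):
        print(f"{name}: {total} errors")
```

These counters are cumulative since they were last cleared, so a single snapshot shows totals rather than rates. Collecting the output twice and comparing the counts gives a better idea of which errors are still occurring.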
Monitor Traffic Flows
One of the best ways to identify network packet loss is to install packet capture and analysis systems at key locations within the network. Several products, such as those from NetScout and Opnet, as well as Wireshark, can collect voice and video data, analyze the resulting packet streams, and report whether packet loss is occurring. Many of these systems can also report on high jitter or high delay (delay is harder to identify, due to the need to measure transit time across the network). If packet loss is found, the monitoring system may need to be moved to identify the specific source of the packet loss.
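For a sense of how such an analysis can infer loss from a captured stream, here is a minimal Python sketch. It assumes the media is carried in RTP, whose 16-bit sequence number increases by one with each packet sent, and that the sequence numbers have already been extracted from the capture in arrival order. The function names are hypothetical, and this is not how any of the products above are implemented.

```python
SEQ_MOD = 65536  # RTP sequence numbers are 16 bits and wrap around


def count_lost_packets(seq_numbers):
    """Estimate lost packets from sequence numbers in arrival order.

    Gaps in the sequence are counted as loss. Duplicates and packets that
    arrive late (out of order) are skipped, so a stream with heavy
    reordering will be overcounted by this simple approach.
    """
    lost = 0
    prev = None
    for seq in seq_numbers:
        if prev is None:
            prev = seq
            continue
        gap = (seq - prev) % SEQ_MOD
        if gap == 0 or gap >= SEQ_MOD // 2:
            # Duplicate or late packet: ignore it and keep the high-water mark.
            continue
        lost += gap - 1
        prev = seq
    return lost


def loss_percent(seq_numbers):
    """Lost packets as a percentage of packets expected (received + lost)."""
    lost = count_lost_packets(seq_numbers)
    expected = len(seq_numbers) + lost
    return 100.0 * lost / expected if expected else 0.0


if __name__ == "__main__":
    stream = [65533, 65534, 0, 1, 2, 5, 6]  # 65535, 3, and 4 never arrived
    print(count_lost_packets(stream), "packets lost")
    print(f"{loss_percent(stream):.1f}% loss")
```

Note that this only reports loss upstream of the capture point, which is why the monitoring system may have to be moved to find where the loss actually occurs.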
Detecting packet loss is the first step. Remediation is the next step. I'll describe several causes of packet loss and how to fix them in the next post.