Troubleshooting VoIP Packet Loss
A straightforward approach to troubleshooting VoIP that can be applied to video as well.
VoIP is particularly sensitive to packet loss, and determining the cause of the packet loss can be difficult. Each vendor has different tools that may be used to help with the diagnosis. I'm going to cover some of the causes of packet loss and describe some of the tools that exist. Of course, contacting the vendor will provide you with more information about any new tools or troubleshooting methodologies that are specific to that vendor.
Packet loss in VoIP will typically have a slowly degrading impact on speech communications. The human ear is very good at handling the short gaps that are typical of packet loss. So it may take a significant amount of packet loss for the user community to be annoyed enough to report it.
It is best to use automated management systems to collect the packet loss data from the VoIP system, allowing you to generate reports that you can use to determine the scope of any loss (see below). However, note that fax communication isn't tolerant of packet loss, so don't try to run fax machines over VoIP. Finally, the principles below can be applied to video troubleshooting, since many of the same mechanisms are in use for interactive video.
Determine the scope of the problem
First, it is important to gain some understanding of the scope of the problem. Is the packet loss restricted to a subset of all VoIP endpoints, or is it occurring across all endpoints?
It is useful to think of groups of endpoints. A simple group is the endpoints at a remote site. A more complex group is the set of endpoints that communicate with external peers, such as calls out to the PSTN. There might even be problems with a particular gateway to the PSTN, indicated by problems with only those calls that traverse that gateway. If the problems seem to appear across all endpoints, perhaps they have a systemic origin, such as a missing or incorrect QoS configuration.
Don't forget about media gateways and MCUs. If the enterprise's conference calls are more prone to voice dropouts, then there may be a problem related to one of the MCUs or the network around it. In fact, there may be multiple small problems and configuration errors that interact to create significant packet loss. For example, a firewall may be configured to drop out-of-order fragments, while at the same time a part of the network is configured to use per-packet load balancing, which may prove to be a source of out-of-order packets.
Second, determine if the problem is time-dependent. Can the problem be correlated with other network or VoIP system events? If it is happening all the time, then you can eliminate many causes that would only occur at a specific time of day, such as packet loss due to congestion during busy times of the day.
The combination of scope and time may provide clues to the origin of the packet loss. At a minimum, the data should allow you to eliminate a lot of potential problems, making it easier to eventually identify the core problem. Keep in mind that multiple minor problems may be interacting to cause the packet loss.
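As a rough illustration of combining scope and time, the sketch below totals loss per site and per hour of day from per-call records. The record layout (site, hour, expected, lost) and the sample values are hypothetical; a real reporting system will export its own fields.

```python
from collections import defaultdict

def loss_by_group(calls, key):
    """Return {group: percent of expected packets lost}, grouped by `key`."""
    expected = defaultdict(int)
    lost = defaultdict(int)
    for call in calls:
        group = call[key]
        expected[group] += call["expected"]
        lost[group] += call["lost"]
    # Skip groups with no expected packets to avoid dividing by zero.
    return {g: 100.0 * lost[g] / expected[g] for g in expected if expected[g]}

# Hypothetical per-call records; field names are assumptions for this sketch.
calls = [
    {"site": "HQ", "hour": 9, "expected": 10000, "lost": 5},
    {"site": "HQ", "hour": 14, "expected": 10000, "lost": 5},
    {"site": "Branch", "hour": 9, "expected": 10000, "lost": 300},
    {"site": "Branch", "hour": 14, "expected": 10000, "lost": 10},
]

print(loss_by_group(calls, "site"))  # loss concentrated at one site?
print(loss_by_group(calls, "hour"))  # loss concentrated at one time of day?
```

In this made-up data the loss clusters at one site and one hour, which would point toward congestion on that site's link during a busy period rather than a systemic problem.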
Sources and types of packet loss
There are two types of packet loss in a VoIP system: Receive Packet Loss and Receive Packet Discard. Receive Packet Loss is where a packet is never delivered to the receiving system, while Receive Packet Discard is where a packet is received at a time when it is not usable for generating audio playback.
VoIP-receiving endpoints typically record both Packet Loss and Packet Discard counters, allowing you to determine which type of problem you are troubleshooting. Cisco phones expose these stats either through the user interface on the phone set or through the phone's web interface. Avaya phones report these stats via RTCP, which at one time required a separate RTCP receiver to record the stats. As VoIP control software is revised, you may find these stats available within the VoIP controller.
Receive Packet Loss means that the packet is dropped somewhere in the network. In this case, you'll be looking for problems that cause packets to be discarded within the network.
1. It could be a bad link that is causing packet errors, which would affect any connections via that network link. This problem may not vary by time of day or load. A WAN connection may exhibit continuous errors while a duplex mismatch could cause errors that are load-dependent.
2. Network congestion without QoS could create packet loss in extreme cases where router or switch buffers overflow.
3. A transient network problem like a flapping link can cause convergence problems in routing protocols or in the Spanning Tree Protocol used in switch-based networks. Packets will get dropped if a valid alternate path is not immediately available.
4. In rare cases involving MCUs, packet loss could be due to an overloaded software-based MCU or a problem with a Digital Signal Processor (DSP) in a hardware-based MCU.
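From the receiver's point of view, a packet dropped in the network shows up as a gap in the RTP sequence numbers. The sketch below estimates loss that way, assuming you have the 16-bit sequence numbers in arrival order (for example, from a packet capture). The function name is illustrative, and the logic is a simplification of what a real endpoint does.

```python
def count_lost(seqs):
    """Estimate lost packets from 16-bit RTP sequence numbers in arrival order."""
    if not seqs:
        return 0
    highest = seqs[0]   # highest unwrapped sequence number seen so far
    received = set()    # unwrapped sequence numbers; a set ignores duplicates
    for seq in seqs:
        # Signed 16-bit distance from the highest number seen, so that a
        # wrap from 65535 to 0 counts as +1 rather than -65535.
        delta = (seq - highest) & 0xFFFF
        if delta >= 0x8000:
            delta -= 0x10000
        ext = highest + delta
        received.add(ext)
        highest = max(highest, ext)
    expected = max(received) - min(received) + 1
    return expected - len(received)

print(count_lost([1, 2, 4, 5]))            # one gap (3 is missing)
print(count_lost([65534, 65535, 0, 2]))    # gap after the wraparound
```

Note that this only sees gaps between the first and last packets observed. Packets lost after the final received packet are invisible to it, and a packet that arrives out of order is counted as received, not lost.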
Receive Packet Discard occurs when the packet arrives at the receiver at a time when the packet cannot be used for playback. It is more typical for a packet to arrive late than it is for a packet to arrive too early to store into the playback buffer. When a packet arrives, but it can't be played back, it is discarded and the Discard counter is incremented.
1. A network without QoS or with an incorrect QoS configuration may cause high jitter during periods of congestion, when the routers and switches need to buffer packets. High jitter typically causes packets to arrive too late to be played back.
2. Out-of-order packets are also typically discarded by VoIP endpoints. Packet ordering can be affected by per-packet load balancing over parallel paths. It can also happen when routing changes cause an alternate path to be used, though this is typically a transient event rather than an ongoing problem. Flapping links, however, can cause out-of-order problems on a more continuous basis.
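The two discard causes above can be sketched with a simplified receiver model. It assumes a fixed playout buffer (real endpoints usually adapt the buffer depth), a constant 20 ms packet interval, no sequence number wraparound, and an endpoint that does not reorder packets. All names and default values are illustrative.

```python
def classify_arrivals(packets, interval_ms=20, buffer_ms=40):
    """Split packets into (played, discarded) lists of sequence numbers.

    packets: list of (seq, arrival_ms) tuples in arrival order.
    """
    played, discarded = [], []
    if not packets:
        return played, discarded
    first_seq, first_arrival = packets[0]
    highest = first_seq
    for seq, arrival in packets:
        # Playout deadline: the first packet is held for buffer_ms, and each
        # later packet is due one packet interval after its predecessor.
        deadline = first_arrival + buffer_ms + (seq - first_seq) * interval_ms
        late = arrival > deadline
        out_of_order = seq < highest
        if late or out_of_order:
            discarded.append(seq)   # the Discard counter would increment here
        else:
            played.append(seq)
        highest = max(highest, seq)
    return played, discarded

# Hypothetical trace: packet 3 is delayed by jitter, packet 4 arrives after 5.
trace = [(1, 0), (2, 22), (3, 95), (5, 96), (4, 97), (6, 105)]
played, discarded = classify_arrivals(trace)
print(played, discarded)
```

In this trace, packet 3 misses its playout deadline because of jitter, and packet 4 arrives in time but behind packet 5, so both end up in the discarded list even though neither was lost in the network.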
Because some of the above problems are transient, look for Loss or Discard statistics that are regularly increasing and have relatively high values. For example, a Discard counter that's increasing by thousands per day would be reason for action, while an increase of less than 1,000 per day would only put that endpoint on my watch list.
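That rule of thumb can be applied mechanically to daily counter readings. The sketch below is one possible reading of it: the 1,000-per-day boundary comes from the text, while treating a zero increase as "ok" and a decreasing counter as a reset (for example, after a phone reboot) are assumptions added here.

```python
def triage(daily_totals, action_threshold=1000):
    """Label each day-to-day change in a cumulative Loss or Discard counter.

    daily_totals: counter readings taken once per day, oldest first.
    """
    labels = []
    for prev, curr in zip(daily_totals, daily_totals[1:]):
        delta = curr - prev
        if delta < 0:
            labels.append("reset")    # assumption: counter restarted from zero
        elif delta == 0:
            labels.append("ok")       # assumption: no increase needs no follow-up
        elif delta < action_threshold:
            labels.append("watch")    # under 1,000 per day: watch list
        else:
            labels.append("action")   # 1,000 or more per day: investigate
    return labels

print(triage([0, 0, 500, 3500, 100]))
```

A single "action" day may still be a transient event; a run of consecutive "action" or "watch" labels is the pattern that suggests an ongoing problem.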
Detecting packet loss
Each vendor implements a different mechanism for reporting packet loss. Cisco phones report their statistics back to the call controller, where the values are used to calculate voice quality. The phone keypad can be used to examine the packet loss counters, or a remote web interface can be used to examine the same counters. The Cisco Unified Call Manager (the call controller) uses packet loss statistics to calculate the overall voice MOS score, but doesn't make the specific values available. So the best way to find problems in a Cisco VoIP implementation is to look at reports for calls or endpoints with consistently low MOS scores.
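If per-call MOS values can be exported from the call controller's reports, finding consistently low endpoints is a small grouping exercise. The sketch below assumes a list of (endpoint, MOS) pairs; the 3.5 threshold and the three-call minimum are arbitrary illustrative values, not vendor recommendations.

```python
from collections import defaultdict
from statistics import median

def consistently_low(call_mos, threshold=3.5, min_calls=3):
    """Return endpoints whose median MOS over at least min_calls is below threshold."""
    by_endpoint = defaultdict(list)
    for endpoint, mos in call_mos:
        by_endpoint[endpoint].append(mos)
    return sorted(
        endpoint
        for endpoint, scores in by_endpoint.items()
        if len(scores) >= min_calls and median(scores) < threshold
    )

# Hypothetical exported records: (endpoint name, MOS for one call).
records = [
    ("phone-a", 4.2), ("phone-a", 4.1), ("phone-a", 2.9),
    ("phone-b", 3.1), ("phone-b", 3.0), ("phone-b", 4.0),
    ("phone-c", 2.5),
]
print(consistently_low(records))
```

Using the median rather than the mean keeps one bad call from flagging an otherwise healthy endpoint, and the minimum call count avoids judging an endpoint on a single sample.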
Avaya uses its VoIP Monitoring Manager to receive the RTCP stream from VoIP endpoints. The RTCP data includes packet loss statistics, so it is easy to generate reports on endpoints with the highest packet loss.
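To show what an RTCP receiver has to work with, the sketch below decodes the loss fields of a standard RTCP Receiver Report using the report block layout from RFC 3550. It is a generic parser, not a description of how Avaya's product works internally, and the sample packet is fabricated for illustration. It handles a single, non-compound Receiver Report only.

```python
import struct

def parse_receiver_report(data):
    """Decode the report blocks of one RTCP Receiver Report (packet type 201)."""
    if len(data) < 8:
        raise ValueError("packet too short for an RTCP header")
    first, packet_type, _length, _reporter = struct.unpack("!BBHI", data[:8])
    version = first >> 6
    block_count = first & 0x1F
    if version != 2 or packet_type != 201:
        raise ValueError("not an RTCP receiver report")
    blocks = []
    for i in range(block_count):
        offset = 8 + 24 * i   # each report block is 24 bytes
        ssrc, lost_word, highest_seq, jitter, _lsr, _dlsr = struct.unpack(
            "!IIIIII", data[offset:offset + 24]
        )
        cumulative = lost_word & 0xFFFFFF
        if cumulative & 0x800000:       # 24-bit signed value
            cumulative -= 0x1000000
        blocks.append({
            "source_ssrc": ssrc,
            "fraction_lost": (lost_word >> 24) / 256.0,  # since the last report
            "cumulative_lost": cumulative,               # since the call began
            "highest_seq": highest_seq,
            "jitter": jitter,                            # RTP timestamp units
        })
    return blocks

# Fabricated sample: one report block, 64/256 lost, 150 packets lost in total.
sample = struct.pack("!BBHI", 0x81, 201, 7, 0x1111) + struct.pack(
    "!IIIIII", 0x2222, (64 << 24) | 150, 5000, 80, 0, 0
)
print(parse_receiver_report(sample))
```

The fraction lost field covers only the interval since the previous report, so it is the better signal for spotting when loss occurs, while the cumulative count is the better value for ranking endpoints in a report.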
Other vendors have similar packet loss reporting mechanisms, either within their own products, or provided by third-party VoIP analysis and reporting systems. Of course, end user reports of dropouts in calls are an important factor, especially since a significant amount of packet loss will have to exist before it really begins to affect voice quality.
I think it is imperative that anyone running a VoIP system should have the tools installed to monitor all call quality parameters, particularly packet loss/discard statistics. Examining the call quality reports regularly, and proactively identifying packet loss and rectifying the causes, will keep your VoIP network running smoothly.