One Beer Too Many

One challenge to the availability of a centralized IP Telephony (IPT) system is recovering from network outages. The demand that recovery places on system resources can be greater than what the system can provide. In college, when one drank more beer than the body could handle, bad things would happen. The mantra was to avoid the "One Beer Too Many" problem.

When voice travels over an IP network and the network stops passing traffic for more than 5 seconds, users will hang up and try to redial. If the outage is long enough, the IPT phones will try to reregister. This instantaneous demand for phone registration and/or call set-up can overwhelm an IPT system. Once overwhelmed, the IPT system will exhaust its resources trying to manage all the incoming requests, and the end systems will keep sending new requests, especially when they get intermittent responses. The result is that a network outage of a few seconds can leave an IPT system unavailable for hours.
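The feedback loop described above can be sketched with a toy model. Everything in it is an assumption made for illustration: the function name, the one-retry-per-second behavior, the idea that each arriving request costs the server a fixed fraction of a full registration even when it is not completed, and all of the numbers. It is not a measurement of any real IPT system.

```python
def seconds_to_recover(phones, capacity, reject_cost,
                       admit_limit=None, max_seconds=3600):
    """Toy model of a registration storm (all parameters are hypothetical).

    phones      -- number of phones that need to reregister
    capacity    -- work units the server can spend per second
                   (one completed registration costs one unit)
    reject_cost -- units spent on every arriving request, completed or not
    admit_limit -- optional cap on requests per second let through upstream

    Returns the number of seconds until every phone is registered,
    or None if the server has not caught up within max_seconds.
    """
    pending = phones
    for second in range(1, max_seconds + 1):
        # Every unregistered phone retries each second unless throttled.
        arriving = pending if admit_limit is None else min(pending, admit_limit)
        # Handling the arrivals eats into the budget before any useful work.
        overhead = arriving * reject_cost
        budget = max(0.0, capacity - overhead)
        completed = min(arriving, int(budget))
        pending -= completed
        if pending == 0:
            return second
    return None


if __name__ == "__main__":
    print(seconds_to_recover(10000, 500, 0.1))                   # unthrottled
    print(seconds_to_recover(10000, 500, 0.1, admit_limit=400))  # throttled
```

With these made-up numbers the unthrottled run never finishes, because handling the flood of retries consumes the whole budget, while the run limited to 400 requests per second finishes in 25 seconds. The point is only the shape of the problem: past a certain load, more requests mean less completed work.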

Traditional phone systems were distributed, with a PBX at every business site. With IPT, all the call control functions are being centralized to lower costs (economies of scale and support) and to enable integration of voice with other services. That centralization also means a single call control system can face a surge from every site at once.

Skype experienced a similar problem in August of 2007 when a Windows update triggered a massive number of Skype clients to re-register in a short period of time. I have seen large IPT deployments suffer the same problem. For example, in a virtual call center model, if the entire WAN stops passing traffic for a few seconds, all the callers and agents will hang up and immediately try to redial or reconnect.

To avoid the "One Beer Too Many" problem in centralized IPT systems, one should:

* Throttle--Limit the number of call set-up and registration requests per second. For example, Session Border Controllers (SBCs) can limit the number of new inbound calls from a carrier.

* Head-room--Design server clusters with enough capacity to handle peak demand, defined as the number of calls or registrations handled per second.

* Sub-Second Network Rerouting--Design the core IP network to reroute in under a second, and a dual-carrier MPLS WAN site to reroute in under 5 seconds.

* Test--Have a capacity testing plan and run it after major changes. Empirix and other vendors provide on-site and hosted solutions to do voice capacity testing.

* Monitor--Track the 5-7 Key Performance Indicators (KPIs) on all network and IPT equipment. Set green, yellow, and red thresholds, alarm on them, and create monthly reports from this data. Core routing topology changes should also be monitored.

* Definition of an Outage--IP network managers should define a network incident as one where IP traffic flow was impeded for more than 5 seconds. Service Level Agreements (SLAs) should be defined not only in terms of latency, jitter, dropped packets, and overall availability, but also in terms of the number of incidents.
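As an illustration of the Throttle item above, here is a minimal token-bucket sketch that caps how many new requests per second are admitted. It is a generic rate-limiting technique, not the mechanism of any particular SBC or IPT product; the class name, the `rate_per_sec` and `burst` parameters, and the numbers in the usage lines are all hypothetical.

```python
import time


class TokenBucket:
    """Admit at most rate_per_sec requests per second, with a small burst.

    Hypothetical helper for illustration; real SBCs expose their own
    rate-limit settings.
    """

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens added per second
        self.capacity = float(burst)      # maximum tokens held at once
        self.tokens = float(burst)        # start with a full bucket
        self.clock = clock                # injectable so it can be tested
        self.last = clock()

    def allow(self):
        """Return True if a new call set-up or registration may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=50, burst=10)
    admitted = sum(bucket.allow() for _ in range(1000))
    print(f"admitted {admitted} of 1000 requests in one burst")
```

Requests that are refused here are dropped or deferred cheaply at the edge, so the call control servers behind the throttle only ever see a rate they were sized to handle.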

IP networks will have outages. The goal is not only to prevent outages from occurring, but also to recover quickly after they occur. Thanks to redundancy, most network outages are caused not by a device failing outright, but by a device that is sick and cannot keep up with demand. The same is true with IPT. As they say in the corporate world: One "aw sh*t" negates a hundred "attaboys."
