Peeling Back the SIP Resiliency Layers

Like most people, I like it when my machines work. If I get into my car and turn the key, I expect the engine to start and the tires to roll. Of course, I play an important role in keeping my Prius on the road. I follow the manufacturer's recommended maintenance schedule. I keep the tires inflated and regularly check the tread for excessive wear. In the case of an unforeseen breakdown, I travel with a few essential tools and know who to call if a problem surpasses my ability to fix it.

As with my car, SIP resiliency calls for a layered approach. While it's impossible to build a system that is completely unbreakable, it's not that difficult to eliminate all single points of failure and design something that can handle a wide range of software, hardware, and signaling failures.

There are a number of different ways to break down a SIP infrastructure into its components, but for the sake of simplicity, I will focus on three major systems: the SIP carrier and the interfaces it delivers to an enterprise; the border elements that sit between the carrier and the communications system; and the communications system itself.

In nearly all cases, carriers deliver SIP trunks to an enterprise by way of an MPLS network. That MPLS network is commonly managed by the same SIP carrier, but this isn't a requirement. For example, it is possible to use Verizon for your MPLS network and Level3 for your SIP trunks.

The MPLS network terminates at an enterprise's demarcation point in the form of a label edge router (LER). You can think of the router as the on and off ramp for all data traffic on the MPLS circuit. It has been my experience that the router is owned and maintained by the carrier, but it's certainly possible for an enterprise to take on that responsibility.

The first level of resiliency is to use LERs with redundant components such as hot swappable power supplies and fans. This allows the router to continue to function when one of its components fails.

The second technique is to deploy a highly available (HA) LER. This configuration uses an active router paired with a standby router. A failure of the active LER causes the standby LER to seamlessly take control of all data traffic.

The third resiliency technique is to use multiple data circuits. Traffic can be shared on these links or one circuit can be the backup for the other.

Lastly, I want to see geo-redundant data circuits in two or more data centers. As with the duplicated links, these circuits can be active/active or one can be designated as the failover link.
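
To make the difference between shared links and a failover link concrete, here is a minimal Python sketch. The Circuit class, its fields, and the circuits_carrying_traffic function are hypothetical names I made up for illustration. They are not part of any carrier's or router vendor's tooling, and real routers make this decision with routing protocols rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Circuit:
    """A hypothetical data circuit; the fields are illustrative only."""
    name: str
    data_center: str
    healthy: bool = True


def circuits_carrying_traffic(circuits, mode):
    """Return the circuits that should carry traffic right now.

    circuits: listed in order of preference (primary first).
    mode: "active/active" shares traffic across every healthy circuit;
          "active/standby" uses only the first healthy circuit.
    """
    healthy = [c for c in circuits if c.healthy]
    if mode == "active/active":
        return healthy
    if mode == "active/standby":
        return healthy[:1]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    links = [Circuit("circuit-1", "primary"), Circuit("circuit-2", "dr")]
    links[0].healthy = False  # simulate losing the primary circuit
    print([c.name for c in circuits_carrying_traffic(links, "active/standby")])
```

In either mode, losing one circuit leaves the surviving circuit carrying the traffic. The difference is whether the second circuit was doing useful work before the failure.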

If you have been following my No Jitter articles, you likely know my thoughts on session border controllers (SBCs) by now. Not only are they necessary for security, but they are also used for remote endpoints, call admission control, call recording, routing, and SIP adaptation. I would never open up a network to external SIP traffic without first having it pass through an SBC.

Therefore, resiliency needs to be a central part of every SBC configuration. This critical component of SIP communications must be as rock solid as possible, lest you risk a break in your SIP chain.

As with the LER, I am a big fan of SBCs with redundant components. At a minimum, power supplies and fans should be duplicated and hot swappable.

I am nearly always insistent that an enterprise deploy SBCs as an HA pair. Like an LER, an HA configuration consists of an active SBC paired with a standby SBC. On a sunny day, the active SBC handles all SIP traffic, and the standby only kicks in when the active fails. A link exists between the two that lets the standby be fully aware of all active calls. This facilitates a seamless failover with no lost calls.
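
The reason the link between the two SBCs matters is easier to see in code. The following is a toy Python model, not any vendor's implementation: SbcNode, HaPair, and their methods are hypothetical names, and the "replication" is just a dictionary copy standing in for whatever state synchronization a real SBC performs.

```python
class SbcNode:
    """A toy SBC that remembers the calls it knows about."""

    def __init__(self, name):
        self.name = name
        self.calls = {}  # call_id -> call state


class HaPair:
    """A toy active/standby pair with call state mirrored to the standby."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def start_call(self, call_id, state):
        self.active.calls[call_id] = state
        # Mirror the call to the standby (stands in for the HA link).
        self.standby.calls[call_id] = state

    def end_call(self, call_id):
        self.active.calls.pop(call_id, None)
        self.standby.calls.pop(call_id, None)

    def fail_over(self):
        """Promote the standby; the failed node loses its in-memory state."""
        failed = self.active
        self.active, self.standby = self.standby, failed
        failed.calls.clear()
        return self.active


if __name__ == "__main__":
    pair = HaPair(SbcNode("sbc-a"), SbcNode("sbc-b"))
    pair.start_call("call-1", "connected")
    pair.start_call("call-2", "ringing")
    new_active = pair.fail_over()
    print(new_active.name, sorted(new_active.calls))
```

Because every call was mirrored before the failure, the promoted node already knows about both calls. Without that mirroring, the standby would come up with an empty call table and the active calls would be lost.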

It's important to know that the two SBCs in an HA pair must be connected by a Layer 2 network. This means that they must be on the same subnet. Since most geo-separated data centers are Layer 3 connected, you cannot split an HA SBC pair across data centers. In this case, I commonly recommend an HA pair in the primary site and another HA pair in the disaster recovery data center.
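
A quick planning check for the same-subnet requirement can be written with Python's standard ipaddress module. This is only a sanity check on an addressing plan, under the assumption that both SBC interfaces use the same prefix length. The same_subnet function and the addresses in the example are hypothetical, and passing the check does not prove that true Layer 2 adjacency exists between the two devices.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_len):
    """Return True if both addresses fall in the same subnet.

    Assumes both interfaces are configured with the same prefix length.
    """
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


if __name__ == "__main__":
    # Two SBCs in the same data center (example addresses only).
    print(same_subnet("10.1.1.10", "10.1.1.11", 24))
    # One SBC in each data center, on different subnets.
    print(same_subnet("10.1.1.10", "10.2.1.10", 24))
```

The first pair is a candidate for an HA deployment. The second pair is the geo-separated case described above, where each site would get its own HA pair instead.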

This brings me to the IP SIP PBX. Here, too, I avoid as many single points of failure as possible. This means duplicated call processors, enterprise survivable servers, and when appropriate, survivable branch servers. While some failures may cause calls to drop and call processing to become temporarily suspended, the goal is to minimize the disruption and return service as quickly as possible.

In addition to call processing servers, there will most likely be some form of session management between the SBCs and the call servers. This, too, needs to be made resilient. In my mind, that means N + 1. Determine how many session manager servers you need and add one. Do you need one server? Deploy two. Do you need three? Deploy four.
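
The N + 1 arithmetic is simple enough to sketch in a few lines of Python. The function name and the capacity figures below are hypothetical. Actual per-server session capacity has to come from your vendor's sizing guidance.

```python
import math


def n_plus_one(peak_sessions, sessions_per_server):
    """Return how many servers to deploy under an N + 1 rule.

    N is the number of servers needed to carry the peak load;
    one more is added so that any single server can fail.
    """
    if sessions_per_server <= 0:
        raise ValueError("sessions_per_server must be positive")
    needed = max(1, math.ceil(peak_sessions / sessions_per_server))
    return needed + 1


if __name__ == "__main__":
    # Example figures only: one server's worth of load, then three.
    print(n_plus_one(1000, 1000))
    print(n_plus_one(2500, 1000))
```

With these example figures, a load that fits on one server yields two, and a load that needs three servers yields four.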

I do a great deal of work with Avaya Aura, and its session managers support HA as active/active and not active/standby. This means that all session managers process calls at all times. Additionally, a failover from one session manager to another session manager is seamless with no lost calls.

While there are certainly more points of possible failure (endpoints, networks, power sources, etc.), hardening these three systems goes a long way toward keeping the bulk of an enterprise's communications system up and running even when a disaster strikes. Believe me, servers crash, links die, cooling systems fail, and electricity suddenly disappears. The loss of communications can result in angry customers, lost revenue, and, in verticals such as healthcare, even death.

Resiliency and redundancy do not come for free, but careful planning coupled with comprehensive risk management will help determine what needs reinforcement and what does not. Failure to plan is planning to fail, and you don't want to be the one responsible for dead air and dropped calls.
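
One way to start that planning is with back-of-the-envelope availability math. The Python sketch below uses the standard textbook formulas for redundant (parallel) and chained (series) components. It assumes failures are independent, which is optimistic in practice since shared power, shared software bugs, and shared circuits all violate it. The function names and the 99% figure are hypothetical examples, not measurements.

```python
import math


def parallel_availability(unit_availability, units):
    """Availability of redundant units where any one can carry the load.

    Assumes independent failures: the group is down only if every unit is.
    """
    return 1.0 - (1.0 - unit_availability) ** units


def series_availability(layer_availabilities):
    """Availability of a chain in which every layer must be up."""
    return math.prod(layer_availabilities)


if __name__ == "__main__":
    single = 0.99  # example figure for one unit, not a measured value
    pair = parallel_availability(single, 2)
    # Three layers in a chain: carrier edge, SBCs, communications system.
    print(series_availability([single, single, single]))
    print(series_availability([pair, pair, pair]))
```

Comparing the two printed values shows how duplicating each layer changes the availability of the whole chain, and running the numbers layer by layer can help show which layer benefits most from reinforcement.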

Andrew Prokop writes about all things unified communications on his popular blog, SIP Adventures.
