Terry Slattery | December 22, 2016


Verifying Resilience

How do you know that your network is resilient to failures? If you don’t test, you really don’t know.

There have been several excellent posts recently on No Jitter about network resilience and network testing. Gary Audin describes "How to Approach Resilience Planning," Darc Rasmussen talks about using testing to "Make This a Happy Holiday Season," and Mike Burke tells us "How Not to Repeat History of Failed Testing."

"But it can't happen to us!" you say.

Really? It happened to Macy's... and over Black Friday, too. As Fortune senior writer Phil Wahba wrote, Macy's website went down on the second biggest shopping day of the year due to overflow shopping traffic.

Each of the above-mentioned articles offers a slightly different perspective on resilience and testing. Underlying the different stories is a common theme: Good planning needs good testing to validate the implementation and the assumptions that went into the design and configuration.

That brings me to the question: Do you conduct failure testing and analysis of your network and UC infrastructure? Or is your organization afraid of touching the network for fear that it will break? Organizations that don't do regular testing are working from a position of hope, as in, "We hope that nothing breaks because it might not fail over to our backup systems." That's a precarious position to be in.

Many organizations already have redundant infrastructure -- dual WAN carriers, redundant core routers and switches, uninterruptible power supplies, backup data paths, and redundant IT services systems. However, I keep encountering organizations that have never run a planned test of their redundant infrastructure. Why wait for an emergency to learn that something doesn't work? It is much better to use planned downtime in which you can perform controlled tests.

It is a good idea to evaluate the failover process. Does the failover work the way you think it should? Is it fast enough for the applications? Does it self-heal when the failed device comes back online?
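One way to put numbers on "fast enough" is to run a steady stream of probes during a planned failover and measure the gap in service. The sketch below is only an illustration under stated assumptions: the Probe record, the function names, and the sample log are all hypothetical, and the measured outage is only as precise as the probe interval.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Probe:
    t: float   # seconds since the test started
    ok: bool   # True if the probe request succeeded


def measure_outage(probes: Sequence[Probe]) -> Tuple[float, bool]:
    """Return (outage_seconds, recovered) for the first outage in the log."""
    start = None
    for p in probes:
        if not p.ok and start is None:
            start = p.t                  # first failed probe
        elif p.ok and start is not None:
            return p.t - start, True     # first good probe after the failure
    if start is None:
        return 0.0, True                 # no failure observed at all
    return probes[-1].t - start, False   # still down when the log ends


def failover_ok(probes: Sequence[Probe], budget_s: float) -> bool:
    """True if service recovered and the outage fit the application's budget."""
    outage_s, recovered = measure_outage(probes)
    return recovered and outage_s <= budget_s


# Hypothetical log: one probe per second, link pulled just before t=2.
log = [Probe(0.0, True), Probe(1.0, True), Probe(2.0, False),
       Probe(3.0, False), Probe(4.0, True), Probe(5.0, True)]
print(measure_outage(log))     # (2.0, True)
print(failover_ok(log, 3.0))   # True
```

Running the same measurement a second time, after the failed device is restored, gives a rough answer to the self-healing question as well.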

Disaster recovery may force a backup site to become the primary site for an extended period of time. Will the infrastructure and staff be able to handle the movement of the IT services that would be forced by a disaster at the former primary site? Think about all the companies that were affected by Hurricane Sandy, flooding in the Midwest, fires in the South, or earthquakes in the West. Many inadequately prepared companies simply cease to exist when their IT operations can't quickly return to functional health.

External Factors

You may find that there is something unexpected that is outside the IT infrastructure that creates a problem. A good example of external factors was a facility that had two emergency generators, one large and one small. A major power failure caused the generators to start, but the smaller one soon failed. Unfortunately, the ingress cooling air vents were controlled by power from the smaller generator. When the smaller generator failed, the vents closed, causing the main generator to overheat and shut down. No one had thought to test the generator redundancy.

Dynamic Networks

The server environment in most organizations has already become very dynamic, with VMs, containers, and application mobility. Dynamic networks are next. The network will change as workloads increase or decrease and as they move between hardware platforms within the data center. Expect to see applications migrate between data centers or burst into additional capacity at a cloud provider.

Network dynamics will make static testing plans less useful. Sure, there will still be parts of the network that are static, such as ISP connections and perhaps some of the major interconnect links within an organization. But the applications will become more mobile and change size as the customer loads change. Subnets will move around. If a whole rack loses power, can the IT infrastructure move the workloads to another set of servers and reconfigure the network within an acceptable timeframe? Does the application gracefully handle and recover from the loss of some of the infrastructure?

Dynamic Testing

Dynamic testing is needed in IT infrastructures in which applications can move around. Some simple tests need to be run to validate that the new application instance is configured and running properly before moving workloads onto it. This may result in building something that I call an "application-level ping." It is a request that is processed like a real client request but only results in validation that the application is functioning correctly. A simple example is sending a test email to an email server. The test verifies that the email is received by a test account within a specified time. Similar tests are available for credit card processing.
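A minimal sketch of an application-level ping might look like the following. It assumes the caller supplies two functions, one that submits the test request and one that checks whether it arrived; those callables, the FakeMailbox stand-in, and the token are all hypothetical. A real email test would replace them with calls to the mail system being validated.

```python
import time
from typing import Callable


def application_ping(send_probe: Callable[[str], None],
                     probe_arrived: Callable[[str], bool],
                     token: str,
                     timeout_s: float = 30.0,
                     poll_s: float = 1.0) -> bool:
    """Send a test request, then poll until it arrives or the timeout expires."""
    send_probe(token)
    deadline = time.monotonic() + timeout_s
    while True:
        if probe_arrived(token):
            return True          # the application processed the request
        if time.monotonic() >= deadline:
            return False         # not seen within the specified time
        time.sleep(poll_s)


class FakeMailbox:
    """In-memory stand-in for a test account (illustration only)."""

    def __init__(self) -> None:
        self.messages = []

    def deliver(self, token: str) -> None:
        self.messages.append(token)

    def contains(self, token: str) -> bool:
        return token in self.messages


mailbox = FakeMailbox()
healthy = application_ping(mailbox.deliver, mailbox.contains, "probe-001",
                           timeout_s=1.0, poll_s=0.01)
print("application-level ping passed:", healthy)
```

Passing the send and check steps in as functions keeps the timing logic the same whether the probe is an email, a test transaction, or a test call.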

Getting Started

Developing good test plans is challenging, and requires that you understand the IT systems and their interdependent components. To develop good test plans, you often need someone with a different personality who looks at systems differently, so you may need to find a consultant to lead the development process.

Another approach is to start with small parts of the IT infrastructure and expand as testing experience is gained. Static parts of the infrastructure will be easy to test, such as ISP links or failover to a backup UC controller. Don't forget to test the small services that the infrastructure may need to run smoothly, such as the internal DNS servers. I'm always surprised and disappointed to discover both primary and secondary DNS, NTP, and DHCP servers on the same subnet and the same power feeds. Kill the power on the switch to which these servers connect and see how well the IT systems continue to function.
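Before pulling any power cords, a quick inventory check can flag redundant servers that share a subnet. The sketch below is an assumption-laden illustration: the inventory, the addresses, and the /24 prefix length are made up, and it says nothing about shared power feeds, which would need data from the facilities side.

```python
import ipaddress
from collections import defaultdict


def single_subnet_services(servers, prefix_len=24):
    """Return {service: subnet} for services whose redundant servers
    all sit in the same subnet.

    servers is a list of (service_name, ip_address) pairs.
    """
    by_service = defaultdict(list)
    for service, ip in servers:
        # strict=False lets us derive the subnet from a host address
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        by_service[service].append(net)
    return {
        service: str(nets[0])
        for service, nets in by_service.items()
        if len(nets) > 1 and len(set(nets)) == 1
    }


# Hypothetical inventory of infrastructure services.
inventory = [
    ("dns", "10.1.1.10"), ("dns", "10.1.1.11"),    # both on one subnet
    ("ntp", "10.1.1.20"), ("ntp", "10.2.5.20"),    # split across subnets
    ("dhcp", "10.1.1.30"),                         # no redundancy at all
]
print(single_subnet_services(inventory))   # {'dns': '10.1.1.0/24'}
```

A service with only one server is not flagged here; that is a separate gap worth its own check.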

When creating tests, look for things that have a high probability of occurrence, such as an ISP link failure or power failure. Don't overlook device problems like power supply failures or fans that stop running and cause overheating and shutdown. This latter set of problems affects a single device, which is easy to test.

There is another advantage to having regular testing schedules. It allows you to do upgrades on your infrastructure. If the network is configured with A and B redundant halves, can one half of the redundant infrastructure be taken down (offline) for service and upgrades? How easy is it to move traffic onto the upgraded half so that the second half can be upgraded?


Of course, you should include the UC infrastructure in the test plans. Fortunately, it is one of the easier components to test. Do phones properly fail over to the secondary when the primary is turned off or disconnected from the network? Does the dial plan still work? Are there any functions that are dependent on the primary UC controller and must be migrated to the secondary controller if the primary is destroyed (think fire or flood)?

Automation makes the testing easier and faster. Eliminate manual testing from the process wherever you can, or it won't get done as often as it needs to be. However, there will be some tests where there is simply no substitute for a pair of hands, like pulling the power plug on a core router. Just make sure that the automation system verifies that the redundant router is good before pulling the plug.
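The "verify before pulling the plug" step can be a simple gate that runs a set of health checks against the redundant device and refuses to proceed if any fail. The following is a rough sketch only: the check names and their lambda bodies are placeholders, and real checks would query the standby router through whatever management interface is in use.

```python
from typing import Callable, List, Mapping


def failing_checks(checks: Mapping[str, Callable[[], bool]]) -> List[str]:
    """Run every named check and return the names of the ones that fail."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False   # a check that cannot run counts as a failure
        if not ok:
            failed.append(name)
    return failed


def safe_to_pull_plug(checks: Mapping[str, Callable[[], bool]]) -> bool:
    """True only if every pre-test check on the redundant device passes."""
    return not failing_checks(checks)


# Placeholder checks (hypothetical names, hard-coded results).
standby_checks = {
    "standby reachable": lambda: True,
    "routing adjacencies up": lambda: True,
    "config in sync": lambda: False,
}
print(failing_checks(standby_checks))      # ['config in sync']
print(safe_to_pull_plug(standby_checks))   # False
```

Reporting which checks failed, and not just a yes or no, tells the person standing at the rack what to fix before the test can go ahead.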


