ABOUT THE AUTHOR


Terry Slattery
Terry Slattery is a senior network engineer with decades of experience in the internetworking industry. Prior to joining Chesapeake NetCraftsmen as...
Terry Slattery | March 03, 2015 |

 
   

The Case for Periodic Infrastructure Reviews

Organizations continue to discover that their infrastructure contains vulnerabilities that can take it down for hours or days. When was your network last reviewed? Why wait for a failure?

All Systems Down
Is your network about to crash? How do you know it isn't? When was the last time it was reviewed? Think of an infrastructure review as the equivalent of the 120-point automobile inspection.

Read about a major outage in "All Systems Down," which appeared in CIO magazine in 2003. In the article, John Halamka, CIO of Beth Israel Deaconess Medical Center, describes dealing with a four-day network outage. The article is worth reading because it goes into great detail about what happened and the steps his team took to get the network running again. In summary, the outage started with a massive spanning tree forwarding loop that consumed network bandwidth and eventually caused network devices to crash.

What does this 12-year-old article have to do with today's networks? Events like it continue to happen. Paul Whimpenny, Senior Officer for IT Architecture in the IT Division of the Food and Agriculture Organization of the United Nations, describes a similar network outage in "Our bullet-proof LAN failed. Here's what we learned." Fortunately, Whimpenny's outage lasted only four hours.

Common to both outages was a spanning tree problem. Spanning tree network design is one of the key network functions that we include in our network assessment. (I use the term "our" in reference to NetCraftsmen, the consulting company that employs me. I created the first version of our network assessment process and draft report template a good number of years ago. Automated tools help streamline the network data collection and analysis process.)

Think about it. When were your network and UC infrastructure last reviewed? A good review is actually a detailed audit of the network and UC infrastructure. It should examine the design, operational data, and operations. The result should identify the things that are working correctly, as well as the areas that need review and remediation.

Why Failures Happen
One of the things we look for in an assessment is whether the spanning tree design is actually making redundant data centers into a larger, single, distributed data center. Problems in one data center can be propagated by the protocol to the other data center. Visually, this looks like a barbell design. Each data center is a weight on the ends of the link that connects them to each other. That's probably not what was intended. In fact, it is often the result of the network growing and changing over time.

Another common source of outage is failed redundancy. A network will be designed and built with redundant elements and links. But then a redundant component will fail, and because the system is very resilient, the failure doesn't cause an outage. If network and UC monitoring systems are not in place, not properly configured, or not used on a regular basis, the failure isn't noticed. It is only when the second failure occurs that the first failure is found. It is common to find that the first failure occurred months or weeks before the second failure. There was plenty of time to correct the first failure and avoid an outage, if it had only been discovered in time.
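The silent first failure described above is the kind of condition a routine check can surface. The following is a minimal sketch, assuming a hypothetical inventory in which each redundancy group lists its members and whether each is currently up. The `RedundancyGroup` class, the group names, and the status values are all illustrative; in practice the status would come from whatever monitoring system is in place.

```python
from dataclasses import dataclass


@dataclass
class RedundancyGroup:
    """A set of components meant to back each other up (hypothetical model)."""
    name: str
    members: dict  # component name -> True if currently up


def degraded_groups(groups):
    """Return (group name, failed members) for groups that still work
    but have already lost part of their redundancy."""
    findings = []
    for group in groups:
        failed = sorted(m for m, up in group.members.items() if not up)
        working = len(group.members) - len(failed)
        # Service is still up (at least one member works), yet a failure is hiding.
        if failed and working >= 1:
            findings.append((group.name, failed))
    return findings


# Illustrative data: one uplink has failed, but traffic still flows.
groups = [
    RedundancyGroup("core-uplinks", {"uplink-a": True, "uplink-b": False}),
    RedundancyGroup("dist-power", {"psu-1": True, "psu-2": True}),
]
print(degraded_groups(groups))
```

Running a report like this on a schedule, and acting on it, is what turns the weeks or months between the first and second failure into time to repair.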

On occasion, one of our infrastructure reviews will find a network that is like an old farmhouse. It started as a one- or two-room building. Then, as the family grew, rooms and wings were added onto the existing structure. To reach one bedroom, you have to walk through another bedroom. The "old farmhouse" networks are similar. They often include single points of failure, where one part of the network connects to the core of the network via a single path. In many cases, this was the expedient way to provide network connectivity that had not been planned. When asked about the lack of redundancy, many of these network administrators say that they intended to go back and correct it, but have not had time or have simply forgotten about it.
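Single-path connections of this kind can be found mechanically from a list of links. The sketch below is one simple way to do it, not a description of any particular assessment tool: it removes each link in turn and checks whether every device can still reach the core. The device names and topology are made up, and the model assumes at most one link between any pair of devices.

```python
from collections import deque


def reachable(adjacency, start, skip_link=None):
    """Breadth-first search from start, optionally ignoring one link."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if skip_link is not None and frozenset((node, neighbor)) == skip_link:
                continue
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_path_links(links, core):
    """Return the links whose loss cuts some device off from the core."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    baseline = reachable(adjacency, core)
    critical = []
    for a, b in links:
        remaining = reachable(adjacency, core, skip_link=frozenset((a, b)))
        if remaining != baseline:
            critical.append((a, b))
    return critical


# Illustrative topology: the "annex" was added later with a single uplink.
links = [
    ("core", "dist-1"), ("core", "dist-2"), ("dist-1", "dist-2"),
    ("dist-2", "annex"),
]
print(single_path_links(links, "core"))
```

The brute-force approach is slow on very large topologies, but it is easy to reason about and adequate for a review of a few hundred devices.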

I've also seen network problems created because the network staff misunderstood some operational data and installed a configuration that exacerbated a problem. A good example of this is configuring too many buffers on an interface that's dropping packets.

Operations
Network operations figure into almost every network failure. Occasionally, a fundamental design flaw causes a problem, but most often it is a lapse in running the network that allows a failure to become an outage. Policies, processes, and procedures are key to good operations. If you think that these three things are the same or at least very similar, take a look at this link for a description of them.

For example, a good design policy is to not extend Layer 2 networking between data centers. Violations of this policy contributed to the failure that Whimpenny experienced and were probably also a factor in the Beth Israel Deaconess Medical Center outage. Policies should cover many design principles as well as when and how to enact processes and procedures. They are the rules for designing and running the network. Processes describe what to do when something needs to be done, while procedures are the specific steps that implement a process and who should perform them. Knowing the process for breaking spanning tree loops and the procedure to follow, with specific staff assigned to perform those steps, would have helped with both of the above problems.
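A policy such as "do not extend Layer 2 between data centers" is easier to enforce when it can be checked against collected data. As a rough sketch, assume a hypothetical mapping from VLAN ID to the data centers where that VLAN is configured; the site names and VLAN numbers below are invented for illustration.

```python
def vlans_spanning_sites(vlan_sites):
    """Return the VLAN IDs that are present in more than one data center."""
    return sorted(
        vlan for vlan, sites in vlan_sites.items() if len(set(sites)) > 1
    )


# Illustrative inventory: VLAN 20 is stretched across both data centers.
vlan_sites = {
    10: ["dc-east"],
    20: ["dc-east", "dc-west"],
    30: ["dc-west"],
}
print(vlans_spanning_sites(vlan_sites))
```

Any VLAN the check reports is a candidate for the barbell problem: a spanning tree event in one data center can reach the other through it.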

One operations idea that I've rarely seen in networking is failure testing. When was the last time that redundancy failover was tested in your network? This means taking down a major device or link and verifying that the redundant infrastructure works as designed. In a well-designed network, there will be no outage. Routing will automatically switch to the backup path with little or no packet loss.
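A failover test is more useful when "little or no packet loss" is measured and not just observed. The sketch below assumes probes (for example, pings) were sent at a steady rate through the path under test while a device or link was taken down, and that each result was recorded as a timestamp plus whether a reply arrived. The function name and sample data are illustrative only.

```python
def longest_outage(probes):
    """Return the longest gap, in seconds, between good replies.

    probes: list of (timestamp_seconds, reply_received) in time order.
    """
    if not probes:
        return 0.0
    longest = 0.0
    last_good = probes[0][0]  # treat the start of the test as known-good
    in_loss = False
    for timestamp, ok in probes:
        if ok:
            if in_loss:
                # Loss just ended: measure from the last reply before it.
                longest = max(longest, timestamp - last_good)
            last_good = timestamp
            in_loss = False
        else:
            in_loss = True
    if in_loss:
        # Loss was still ongoing when the test ended.
        longest = max(longest, probes[-1][0] - last_good)
    return longest


# Illustrative run: probes every 0.5 s, two lost while routing converged.
probes = [
    (0.0, True), (0.5, True), (1.0, False),
    (1.5, False), (2.0, True), (2.5, True),
]
print(longest_outage(probes))
```

The result is an upper bound on the interruption, limited by the probe interval, so a shorter interval gives a tighter measurement. Comparing it against the convergence time the design was supposed to deliver shows whether the redundancy works as intended.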

For more information about UC infrastructure, attend the Enterprise Connect session "Preparing Your Infrastructure for UC" with Terry Slattery and John Bartlett on Monday, March 16, 2015 at 2pm. Register with code NJSPEAKER to get $300 off Entire Event or Tues-Thurs pass.




