The Case for Periodic Infrastructure Reviews
Organizations continue to discover that their infrastructure contains vulnerabilities that can take it down for hours or days. When was your network last reviewed? Why wait for a failure?
All Systems Down
Is your network about to crash? How do you know it isn't? When was the last time that it was reviewed? Think of an infrastructure review as the equivalent of a 120-point automobile inspection.
Read about a major outage in "All Systems Down," which appeared in CIO magazine in 2003. In the article, John Halamka, CIO of Beth Israel Deaconess Medical Center, describes dealing with a four-day network outage. It is quite an interesting article because it goes into great detail about what happened and the steps his team took to get the network running again. The summary is that it started with a massive spanning tree forwarding loop that consumed network bandwidth and eventually caused network devices to crash.
What does this 12-year-old article have to do with today's networks? Well, events like it continue to happen. Paul Whimpenny, Senior Officer for IT Architecture in the IT Division of the Food and Agriculture Organization of the United Nations, describes a similar network outage in "Our bullet-proof LAN failed. Here's what we learned." Fortunately, Whimpenny's outage lasted only four hours.
Common to both outages was a spanning tree problem. Spanning tree network design is one of the key network functions that we include in our network assessment. (I use the term "our" in reference to NetCraftsmen, the consulting company that employs me. I created the first version of our network assessment process and draft report template a good number of years ago. Automated tools help streamline the network data collection and analysis process.)
Think about it. When was the last time your network and UC infrastructure were reviewed? A good review is actually a detailed audit of the network and UC infrastructure. It should examine the design, operational data, and operations. The result should identify the things that are working correctly, as well as the areas that need review and remediation.
Why Failures Happen
One of the things we look for in an assessment is whether the spanning tree design is actually turning redundant data centers into one larger, distributed data center. When that happens, problems in one data center can be propagated by the protocol to the other data center. Visually, this looks like a barbell design. Each data center is a weight at one end of the link that connects them to each other. That's probably not what was intended. In fact, it is often the result of the network growing and changing over time.
Another common source of outage is failed redundancy. A network will be designed and built with redundant elements and links. But then a redundant component will fail, and because the system is very resilient, the failure doesn't cause an outage. If network and UC monitoring systems are not in place, not properly configured, or not used on a regular basis, the failure isn't noticed. It is only when the second failure occurs that the first failure is found. It is common to find that the first failure occurred months or weeks before the second failure. There was plenty of time to correct the first failure and avoid an outage, if it had only been discovered in time.
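The "first failure goes unnoticed" pattern can be made concrete with a small check. The sketch below is illustrative only: the `RedundantGroup` model, the component names, and the `status` dictionary are hypothetical, and in practice the up/down data would come from whatever monitoring system is in place. The point is to report a group that is still carrying traffic but has lost a member, rather than waiting for the group to go down entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RedundantGroup:
    """Components that back each other up (hypothetical model)."""
    name: str
    members: Tuple[str, ...]


def classify(group: RedundantGroup, status: Dict[str, bool]) -> str:
    """Return 'healthy', 'degraded', or 'down' for one redundancy group.

    status maps a component name to True (up) or False (down).
    A component missing from status counts as down, so an unmonitored
    member is reported instead of being assumed healthy.
    """
    up = sum(1 for member in group.members if status.get(member, False))
    if up == len(group.members):
        return "healthy"
    if up > 0:
        return "degraded"  # service still works, but redundancy is reduced
    return "down"


def degraded_groups(groups: List[RedundantGroup],
                    status: Dict[str, bool]) -> List[str]:
    """Names of groups that still work but have lost at least one member."""
    return [g.name for g in groups if classify(g, status) == "degraded"]


# Hypothetical inventory and monitoring snapshot.
groups = [
    RedundantGroup("core-uplinks", ("uplink-a", "uplink-b")),
    RedundantGroup("dc-firewalls", ("fw-1", "fw-2")),
]
status = {"uplink-a": True, "uplink-b": False, "fw-1": True, "fw-2": True}

print(degraded_groups(groups, status))  # ['core-uplinks']
```

A "degraded" result is the signal that matters here: users see no outage, so only an explicit check like this one surfaces the problem before the second failure.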
On occasion, an infrastructure review of ours will find a network that is like an old farmhouse. It started as a one- or two-room building. Then, as the family grew, rooms and wings were added onto the existing structure. To reach one bedroom, you have to walk through another bedroom. The "old farmhouse" networks are similar. They often include single points of failure, where one part of the network connects to the core of the network via a single path. In many cases, this was the expedient way to provide network connectivity that had not been planned. When asked about the lack of redundancy, many of these network administrators say that they intended to go back and correct it, but have not had time or have simply forgotten about it.
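Single-path connections of this kind can be found mechanically once the topology is written down as a list of links. The following is a minimal sketch, assuming a hypothetical topology and device names; it removes one link at a time and reports which devices lose their path to the core. It is a brute-force approach chosen for readability, not the most efficient way to find such links.

```python
from collections import deque


def reachable(links, start):
    """Return the set of nodes reachable from start over undirected links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_path_links(links, core):
    """Map each link whose loss isolates nodes from core to those nodes."""
    baseline = reachable(links, core)
    result = {}
    for i, link in enumerate(links):
        remaining = links[:i] + links[i + 1:]  # topology with this link failed
        cut_off = baseline - reachable(remaining, core)
        if cut_off:
            result[link] = sorted(cut_off)
    return result


# Hypothetical topology: a redundant triangle plus one late addition.
links = [
    ("core", "dist-1"), ("core", "dist-2"), ("dist-1", "dist-2"),
    ("dist-1", "access-1"), ("dist-2", "access-1"),
    ("dist-2", "annex"),  # added in a hurry, never made redundant
]

print(single_path_links(links, "core"))  # {('dist-2', 'annex'): ['annex']}
```

The same loop could remove devices instead of links to find single points of failure at the node level.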
I've also seen network problems created because the network staff misunderstood some operational data and installed a configuration that exacerbated a problem. A good example of this is configuring too many buffers on an interface that's dropping packets.
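The text above does not spell out why extra buffers make things worse. One common reason, offered here as an assumption about what is meant, is queuing delay: every packet held in a full queue has to wait for the packets ahead of it to be transmitted. The arithmetic is simple enough to sketch, using an illustrative 1,500-byte packet size and a 10 Mbps link that are not taken from any particular incident.

```python
def queue_drain_seconds(packets, packet_bytes, link_bps):
    """Time to transmit a full queue: the worst-case wait for the last packet."""
    return packets * packet_bytes * 8 / link_bps


# Illustrative values only: 1,500-byte packets on a 10 Mbps link.
for depth in (40, 400, 4000):
    delay_ms = queue_drain_seconds(depth, 1500, 10_000_000) * 1000
    print(f"{depth} packets queued -> up to {delay_ms:.0f} ms of added delay")
```

With these numbers, a 40-packet queue adds at most about 48 ms, while a 4,000-packet queue can add several seconds, which is long enough for applications to time out and retransmit into an already congested interface.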
Network operations figures into almost every network failure. Occasionally, a fundamental design flaw causes a problem, but most often, it is a lapse in running the network that allows a failure to create an outage. Policies, processes, and procedures are key to good operations. If you think that these three things are the same or at least very similar, take a look at this link for a description of them.
For example, a good design policy is to not extend Layer 2 networking between data centers. Violations of this policy contributed to the failure that Whimpenny experienced and were probably also a factor in the Beth Israel Deaconess Medical Center outage. Policies should cover many design principles as well as when and how to enact processes and procedures. They are the rules for designing and running the network. Processes describe what to do when something needs to be done, while procedures are the specific steps to follow to implement a process and who should perform them. Knowing the process for breaking spanning tree loops and the procedure to follow, with specific staff assigned to perform those steps, would have helped with both of the above problems.
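A policy such as "no Layer 2 between data centers" is easier to enforce when it can be checked against collected data. The sketch below assumes a hypothetical inventory format of (site, switch, VLAN list) tuples, which an audit tool would normally build from device configurations. It flags any VLAN that appears in more than one data center.

```python
from collections import defaultdict


def vlans_spanning_sites(inventory):
    """Return {vlan: [sites]} for VLANs present in more than one site.

    inventory is an iterable of (site, switch, vlans) tuples.
    """
    sites_by_vlan = defaultdict(set)
    for site, _switch, vlans in inventory:
        for vlan in vlans:
            sites_by_vlan[vlan].add(site)
    return {
        vlan: sorted(sites)
        for vlan, sites in sorted(sites_by_vlan.items())
        if len(sites) > 1
    }


# Hypothetical inventory; site and switch names are made up.
inventory = [
    ("dc-east", "sw-e1", [10, 20]),
    ("dc-east", "sw-e2", [10]),
    ("dc-west", "sw-w1", [20, 30]),
]

print(vlans_spanning_sites(inventory))  # {20: ['dc-east', 'dc-west']}
```

A VLAN ID appearing at both sites is only a hint, since the same number may be reused for unrelated segments. Each hit still needs a look at the inter-site links to confirm that the VLAN is actually trunked across them.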
One operations idea that I've rarely seen in networking is failure testing. When was the last time that redundancy failover was tested in your network? This means taking down a major device or link and verifying that the redundant infrastructure works as designed. In a well-designed network, there will be no outage. Routing will automatically switch to the backup path with little or no packet loss.
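"Little or no packet loss" is easier to verify if the failover test produces numbers. As a rough sketch, assume a probe (a ping, for example) is sent through the network at a fixed interval while the device or link is taken down, and that each result is recorded as a hypothetical (timestamp, replied) pair. The function below summarizes the loss and the longest interruption.

```python
def summarize_probes(probes):
    """Summarize probe results recorded during a failover test.

    probes is a time-ordered list of (timestamp_seconds, replied) pairs.
    The longest gap is measured from the first lost probe to the next
    probe that gets a reply.
    """
    sent = len(probes)
    lost = sum(1 for _, replied in probes if not replied)
    longest_gap = 0.0
    gap_start = None
    for timestamp, replied in probes:
        if not replied and gap_start is None:
            gap_start = timestamp  # an interruption begins
        elif replied and gap_start is not None:
            longest_gap = max(longest_gap, timestamp - gap_start)
            gap_start = None  # the interruption has ended
    if gap_start is not None:  # traffic never recovered during the test
        longest_gap = max(longest_gap, probes[-1][0] - gap_start)
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "longest_gap_s": longest_gap,
    }


# Hypothetical test run: one probe every 0.5 s, three probes lost
# while traffic moves to the backup path.
probes = [(i * 0.5, i not in (4, 5, 6)) for i in range(10)]

print(summarize_probes(probes))
```

Comparing the longest gap against a target agreed in advance turns the test into a pass/fail check that can be repeated after every significant network change.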
For more information about UC infrastructure, attend the Enterprise Connect session "Preparing Your Infrastructure for UC" with Terry Slattery and John Bartlett on Monday, March 16, 2015 at 2pm. Register with code NJSPEAKER to get $300 off Entire Event or Tues-Thurs pass.