Want a Five-Nines Network?
Here are some tips to build the foundation.
Converged networks today rarely achieve the availability that was common for the historical voice network: five-nines, or 99.999% availability. That's about 5.26 minutes of downtime per year. How can you get close to this? Most networks blow that figure in the first month.
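To make the downtime budget concrete, here is a minimal sketch of the arithmetic in Python. The function name and the choice of a 365-day year are assumptions made for illustration; using a 365.25-day year shifts the result only slightly.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime allowed, in minutes, for a given availability target."""
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare three-, four-, and five-nines targets over one year.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows {budget:.2f} minutes of downtime per year")
```

Each additional nine cuts the budget by a factor of ten, which is why a single outage of a few minutes is enough to use up the whole year's allowance at five-nines.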
In the next series of blog posts, I'm going to describe steps you can take to make your network much more stable and get you much closer to that magic five-nines metric.
The One True Version
Run one OS version. Don't get into the trap of not upgrading some devices simply because they are not currently having a problem. When you start mixing OS versions, you start encountering problems where network protocol operation is slightly different between the versions, due to different bug fix levels and enhancements to the protocol. You will eventually have a situation arise where you can't implement some new functionality because of older OS versions in some part of the network. Or you'll be unable to migrate to a new OS with new features in your core because it no longer interoperates correctly with the older OS versions.
A good example of OS version interactions is with the Spanning Tree Protocol (STP). Older switches implement the standard spanning tree protocol, 802.1D. Newer switches can run Rapid Spanning Tree, 802.1w, which offers significantly improved convergence times as well as features that help prevent spanning tree loops. The new switches will fall back to 802.1D on the ports that face the older switches in order to interoperate with them. The result is that those parts of your network can't take advantage of the new functionality and remain more susceptible to spanning tree loops. (Have you ever debugged an STP loop before? Read up on it if you haven't so you know what's involved.)
The other advantage of running one OS version is that you only have one set of bugs to track. You know which configuration options work for that OS version and which ones to avoid.
When I recommend one OS version, that's one per hardware platform. With a few exceptions, most vendors have a slightly different OS version for each hardware platform. Identify the OS that you will run for each hardware platform and make sure that it is installed on all the devices of that model. You may have an exception from time to time, but make sure that it really is an exception and that the reasons for it are clearly documented. When the reason for the exception goes away, bring the device into compliance with the rest of the network. If an edge device depends on the functionality of an obsolete piece of network hardware, isolate that hardware too. Treat it as part of the edge device's configuration so that you can move forward with the rest of your network.
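The one-version-per-platform rule, with documented exceptions, lends itself to an automated check. The following Python sketch shows one way such a check might look. The platform names, version strings, device names, and the inventory format are all hypothetical; in practice the inventory would come from whatever discovery or management tool you already use.

```python
# Hypothetical approved OS version for each hardware platform.
APPROVED = {
    "access-switch-a": "15.2.7",
    "core-router-b": "17.3.4",
}

# Hypothetical documented exceptions: device name -> reason for the exception.
EXCEPTIONS = {
    "edge-sw-09": "legacy feature required until the old gateway is retired",
}


def find_noncompliant(inventory, approved, exceptions):
    """Split devices that are off the approved OS into two lists.

    inventory is an iterable of (device_name, platform, os_version) tuples.
    Returns (violations, documented): undocumented mismatches and
    mismatches that are covered by a documented exception.
    """
    violations, documented = [], []
    for name, platform, version in inventory:
        target = approved.get(platform)
        if target is None:
            violations.append((name, platform, version, "no approved version defined"))
        elif version != target:
            if name in exceptions:
                documented.append((name, platform, version, exceptions[name]))
            else:
                violations.append((name, platform, version, f"expected {target}"))
    return violations, documented


if __name__ == "__main__":
    inventory = [
        ("core-rt-01", "core-router-b", "17.3.4"),
        ("acc-sw-14", "access-switch-a", "15.0.2"),
        ("edge-sw-09", "access-switch-a", "12.2.55"),
    ]
    violations, documented = find_noncompliant(inventory, APPROVED, EXCEPTIONS)
    for entry in violations:
        print("VIOLATION:", entry)
    for entry in documented:
        print("DOCUMENTED EXCEPTION:", entry)
```

Keeping the exception list in the same place as the approved versions makes it easy to review periodically and to remove entries once the reason for the exception no longer applies.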
Current Hardware, Clean Rooms, Neat Racks
You can't run a current OS version on old equipment. You don't have to be on the bleeding edge of hardware deployments, but you should have a regular hardware refresh program. One site that I've worked with had a policy of upgrading one-third of their hardware each year. That's the most aggressive program I've seen. I recommend the same process, perhaps with a longer timeframe of four years, updating one-fourth of the hardware each year.
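A rolling refresh program like this can be planned from a simple inventory. The Python sketch below splits devices into yearly batches with the oldest hardware replaced first. The function name, the tuple format, and the oldest-first ordering are assumptions made for illustration; a real plan would also weigh factors such as vendor end-of-support dates and budget.

```python
import math


def refresh_plan(devices, cycle_years=4):
    """Split devices into yearly refresh batches, oldest hardware first.

    devices is a list of (device_name, install_year) tuples.
    Returns a list of cycle_years batches; batch 0 is replaced in the first year.
    """
    ordered = sorted(devices, key=lambda device: device[1])
    batch_size = math.ceil(len(ordered) / cycle_years) if ordered else 0
    return [
        ordered[year * batch_size:(year + 1) * batch_size]
        for year in range(cycle_years)
    ]


if __name__ == "__main__":
    # Hypothetical inventory: device name and the year it was installed.
    inventory = [
        ("core-rt-01", 2019), ("core-rt-02", 2019),
        ("dist-sw-01", 2016), ("dist-sw-02", 2017),
        ("acc-sw-01", 2014), ("acc-sw-02", 2015),
        ("acc-sw-03", 2018), ("acc-sw-04", 2020),
    ]
    for year, batch in enumerate(refresh_plan(inventory, cycle_years=4), start=1):
        print(f"Year {year}: {[name for name, _ in batch]}")
```

Changing cycle_years to 3 models the more aggressive one-third-per-year policy.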
Some organizations can't afford to perform hardware refreshes that often. In that case, consider dividing the network into zones and upgrading a zone at a time. A zone could be based on function, such as core and distribution, or it could be based on geography, such as Eastern US or Europe. The hardware within a zone is of the same vintage, allowing you to use standard configurations and operational policies and procedures across the zone. This reduces your management load and makes it easier to troubleshoot.
In some cases, you can push the old hardware out to the edge to replace even older gear there. This is just another way of segmenting the network and upgrading parts of it. With the re-allocation approach, you are keeping the old hardware, but moving it into areas of the network where the hardware and software deficiencies have less impact on the network and on the business.
While you're doing your upgrades, make sure that the equipment rooms are kept clean. There's nothing like dust and dirt getting into the gear to cause fan failures, which cascade into overheated equipment, which then doesn't last as long. Dust in equipment can also cause intermittent failures, which are extremely difficult to troubleshoot, consuming massive amounts of staff time to diagnose.
Make sure that your computer room staff has easy access to cabling of the proper types and lengths. A lot of time is typically wasted in the search for the right cable. And have a process for handling suspected bad cables. It is silly for a bad cable to repeatedly get installed simply because it was available nearby.
Finally, keep the racks neat so that it is easy to track connections between equipment. Remove old cables so that you aren't looking at the back of a piece of equipment and wondering if that disconnected cable was supposed to be connected to something. A clean, well-designed cabling system and rack layout will reduce both troubleshooting time and the time required to replace equipment as your hardware is refreshed.
Standardizing on OS versions and hardware versions is a fundamental practice of organizations that operate highly reliable networks. An underlying theme in their operations is the application of good, basic procedures that are standardized across multiple facets of their operation. With Standard Operating Procedures (SOPs), they have repeatable processes that allow them to achieve greater operational efficiency with fewer errors than organizations that don't adopt standard processes.