Want a Five-Nines Network? (Part 2)
Building and running a good test lab is a key ingredient in your plan for maintaining high availability.
In my prior blog post, I started describing some simple steps that can help make your network a five-nines network. Converged networks today rarely achieve the availability that was common for the historical voice network: five-nines, or 99.999% availability, which translates into about 5.26 minutes of downtime per year. I'll describe another set of steps in this post that will help you get closer to that magic five-nines figure.
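The downtime budget is simple arithmetic, and it is worth seeing how quickly it shrinks with each added nine. The sketch below assumes a 365-day year and uses a helper name of my own choosing; for five nines it gives roughly five and a quarter minutes per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` nines.

    Three nines means 99.9% availability, five nines means 99.999%.
    """
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes per year")
```

Each additional nine cuts the budget by a factor of ten, which is why the last step from four nines to five is so demanding.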
My prior post included several fundamental steps: use a standard OS version per hardware platform, keep the hardware platforms up to date, keep the equipment rooms clean, and use good, neat rack layouts that make it easy to perform changes. Here are some additional steps:
Build and Use a Test Lab
You need a good test lab in which to validate new OS versions and experiment with new device configurations. Assemble as much of a test lab as you can afford. In general, even with a minimum of equipment, you should be able to prototype pretty much any configuration prior to deploying it in the production network. Start with the basic equipment, then include firewalls, load balancers, and test equipment. The variety of equipment makes the test lab more complex, but also more realistic.
You will need network management and automation tools to make it easy to change configurations and to return the lab to a known state. If you rely on manual configuration practices, the lab will quickly fall into disuse because it is simply too difficult to change configurations. You'll also find that different uses of the lab conflict with one another. An edge configuration test and a core OS update are unlikely to coexist easily in most lab environments. However, if you have a standard Layer 1 topology and good automation tools for loading OS versions and configurations, you should be able to switch easily between different modeling and testing tasks.
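As a rough illustration of what "return the lab to a known state" can mean in practice, here is a minimal sketch that compares each device's running configuration against a stored baseline and lists the devices that need to be reloaded. The function names and the idea of holding configurations as plain strings keyed by device name are my assumptions; how you collect and push the configurations depends entirely on your own tools.

```python
import difflib


def config_drift(baseline: str, running: str) -> list[str]:
    """Return unified-diff lines between a baseline and a running config.

    An empty list means the device is already in its known state.
    """
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            running.splitlines(),
            fromfile="baseline",
            tofile="running",
            lineterm="",
        )
    )


def devices_to_restore(baselines: dict[str, str], running: dict[str, str]) -> list[str]:
    """Names of devices whose running config differs from (or lacks) its baseline."""
    return sorted(
        name for name, base in baselines.items() if running.get(name) != base
    )


# Hypothetical lab devices and configs, for illustration only.
baselines = {"core1": "hostname core1\n", "edge1": "hostname edge1\n"}
running = {
    "core1": "hostname core1\n",
    "edge1": "hostname edge1\nip route 0.0.0.0 0.0.0.0 10.0.0.1\n",
}

for name in devices_to_restore(baselines, running):
    print(f"{name} has drifted from baseline:")
    print("\n".join(config_drift(baselines[name], running[name])))
```

Running a check like this before and after each lab session makes it obvious when someone has left the lab in a non-standard state.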
The same automation tools and procedures that you use in the lab will also be valid in the production network. In addition, you can test new versions of the network management and automation systems as they become available, without risking a problem in the production network. Verify that the NMS can identify critical errors that you create in the lab. Use it to develop and evaluate new automation tasks.
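One way to make the NMS verification concrete is to keep a list of the faults you deliberately inject and compare it with the alerts the NMS actually raised. The sketch below assumes each fault and alert can be reduced to a (device, fault type) pair; that representation and the sample names are hypothetical, and exporting alerts from your NMS is left to whatever interface it provides.

```python
# Hypothetical representation: each fault or alert is a (device, fault_type) pair.
Fault = tuple[str, str]


def missed_faults(injected: set[Fault], alerts: set[Fault]) -> set[Fault]:
    """Faults that were injected in the lab but never reported by the NMS."""
    return injected - alerts


def detection_rate(injected: set[Fault], alerts: set[Fault]) -> float:
    """Fraction of injected faults the NMS reported (1.0 if none were injected)."""
    if not injected:
        return 1.0
    return len(injected & alerts) / len(injected)


# Illustrative data only.
injected = {("core1", "link-down"), ("edge1", "bgp-peer-lost"), ("fw1", "fan-failure")}
alerts = {("core1", "link-down"), ("fw1", "fan-failure"), ("edge2", "high-cpu")}

print("Missed:", missed_faults(injected, alerts))
print(f"Detection rate: {detection_rate(injected, alerts):.0%}")
```

Any fault in the missed set is a gap in monitoring that you found in the lab rather than during a production outage.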
I've seen test labs that become a source of spares and of equipment for building out the production network. Don't fool yourself--that's not a test lab. For a test lab to be useful, you must not pillage it for spares or to do an emergency installation. If the company management advocates using test equipment in the production network, it is clear that they do not support running a five-nines network.
Get good traffic generators and learn how to use them. You may need to include some copies of production servers and clients with transaction control software so that you can automate the load placed on those copies. Make copies of the production applications so that your test lab can emulate the data flows that exist in the production network.
Staffing the test lab can be challenging. Look for the person who likes to learn new things and enjoys digging into a problem and eventually solving it. Foster a healthy working relationship between the test team and the production team. An adversarial approach isn't ideal. Good communication between the two teams is essential for optimum results. It may be useful to rotate the production team members through the test lab.
Document the test lab and use a lab logbook or wiki page to track who is using the lab, what they are doing, and how to contact them. Use the logbook to record any deficiencies in the lab environment. Make sure that you implement a program for reviewing and resolving the deficiencies.
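If the logbook lives in a tool rather than on paper, it can also warn you when two uses of the lab would collide. The following is a minimal sketch under my own assumptions: the LabBooking fields simply mirror the who, what, and contact details mentioned above, and two bookings are treated as conflicting when they overlap in time and share at least one device.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LabBooking:
    """One logbook entry; the field names are illustrative, not a standard."""
    user: str
    contact: str
    purpose: str
    devices: frozenset[str]
    start: datetime
    end: datetime


def find_conflicts(new: LabBooking, existing: list[LabBooking]) -> list[LabBooking]:
    """Existing bookings that overlap `new` in time and share a device with it."""
    return [
        b
        for b in existing
        if new.start < b.end and b.start < new.end  # time ranges overlap
        and new.devices & b.devices                 # at least one shared device
    ]


core_update = LabBooking(
    "alice", "x1234", "core OS update",
    frozenset({"core1", "core2"}),
    datetime(2024, 1, 8, 9), datetime(2024, 1, 8, 12),
)
edge_test = LabBooking(
    "bob", "x5678", "edge configuration change",
    frozenset({"edge1", "core1"}),
    datetime(2024, 1, 8, 11), datetime(2024, 1, 8, 13),
)

for clash in find_conflicts(edge_test, [core_update]):
    print(f"Conflict with {clash.user} ({clash.contact}): {clash.purpose}")
```

Because each entry carries a contact, the person who hits a conflict knows immediately whom to call.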
Build automatic processes that can perform basic tests on any new OS or hardware that you are evaluating. With automatic processes, the test lab should be accessible from any location within the company. After all, in a production environment, the network engineer is seldom anywhere near the network equipment.
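A basic test harness does not need to be elaborate. The sketch below, with names I made up for illustration, runs a set of named checks and records pass, fail, or error for each one. The checks here are stand-ins; in a real lab each would query a device, for example to confirm that the expected OS version loaded or that routing neighbors came up.

```python
from typing import Callable

Check = Callable[[], bool]


def run_smoke_tests(checks: dict[str, Check]) -> dict[str, str]:
    """Run each named check and report 'pass', 'fail', or 'error: <reason>'.

    A check that raises is recorded as an error so that one broken check
    does not stop the rest of the run.
    """
    results: dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "pass" if check() else "fail"
        except Exception as exc:
            results[name] = f"error: {exc}"
    return results


def unreachable_device() -> bool:
    # Stand-in for a check whose device did not answer.
    raise RuntimeError("timeout")


# Placeholder checks; replace the lambdas with real device queries.
checks = {
    "os version matches": lambda: True,
    "all uplinks up": lambda: False,
    "management reachable": unreachable_device,
}

for name, outcome in run_smoke_tests(checks).items():
    print(f"{name}: {outcome}")
```

Because the harness only needs a list of callables, the same run can be started remotely against any new OS or hardware under evaluation.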
Have a designated place for all the Layer 1 components, such as cables. Determine the process for handling broken or suspect cables. Is it worth the time to test a cable, or is it better to just replace it? Can your old cables be donated to a school whose students can test them for use in the school?