In my prior blog post, I started describing some simple steps that can help make your network a five-nines network. Converged networks today rarely have the availability metric that was common for the historical voice network: five-nines, or 99.999% availability, which translates into about 5.26 minutes of downtime per year. I'll describe another set of steps in this post that will help you get closer to that magic five-nines figure.
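To make the downtime arithmetic concrete, here is a minimal sketch that converts a number of nines into an annual downtime budget. It assumes an average year of 365.25 days; the function name is only illustrative.

```python
# Assumption: an average year of 365.25 days (leap days included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (5 -> 99.999%)."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for nines in (3, 4, 5):
    availability = 100 * (1 - 10 ** -nines)
    budget = downtime_minutes_per_year(nines)
    print(f"{availability:.3f}% -> {budget:.2f} minutes of downtime per year")
```

Each additional nine cuts the budget by a factor of ten, which is why five-nines leaves only a little over five minutes a year for every outage combined.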
My prior post included several fundamental steps: use a standard OS version per hardware platform, keep the hardware platforms up to date, keep the equipment rooms clean, and use good, neat rack layouts that make it easy to perform changes. Here are some additional steps:
Build and Use a Test Lab
You need a good test lab in which to validate new OS versions and experiment with new device configurations. Assemble as much of a test lab as you can afford. Even with a minimum of equipment, you should be able to prototype almost any configuration before deploying it in the production network. Start with the basic equipment, then add firewalls, load balancers, and test equipment. The variety of equipment makes the test lab more complex, but also more realistic.
You will need network management and automation tools to make it easy to change configurations and to return the lab to a known state. If you rely on manual configuration practices, the lab will quickly fall into disuse because it is simply too difficult to change configurations. You'll also find that different uses of the lab conflict with one another. A test of an edge configuration change and a test of a core OS update are unlikely to coexist easily in most lab environments. However, if you have a standard Layer 1 topology and good automation tools for loading OS versions and configurations, you should be able to switch easily between different modeling and testing tasks.
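As one small illustration of "returning the lab to a known state," the sketch below compares a device's running configuration against a stored baseline and reports any drift. It is only a sketch under assumptions: the device name and configuration text are made up, and how you actually fetch the running configuration depends on your own tools.

```python
import difflib


def config_drift(baseline: str, running: str, device: str) -> list[str]:
    """Return unified-diff lines between a baseline and a running config.

    An empty list means the device matches its known-good baseline.
    """
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            running.splitlines(),
            fromfile=f"{device}/baseline",
            tofile=f"{device}/running",
            lineterm="",
        )
    )


# Hypothetical configs; in practice these would come from your
# configuration archive and from the device itself.
baseline = "hostname lab-edge-1\ninterface Gi0/1\n description uplink\n"
running = "hostname lab-edge-1\ninterface Gi0/1\n description test-vlan-42\n"

drift = config_drift(baseline, running, "lab-edge-1")
if drift:
    print("\n".join(drift))  # what must be undone before the next test
else:
    print("lab-edge-1 matches its baseline")
```

Running a check like this before and after every lab session tells the next user whether the lab really is back at its starting point.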
The same automation tools and procedures that you use in the lab will also be valid in the production network. In addition, you can test new versions of the network management and automation systems as they become available, without risking a problem in the production network. Verify that the NMS can identify critical errors that you create in the lab. Use it to develop and evaluate new automation tasks.
I've seen test labs become a source of spares and of equipment for building out the production network. Don't fool yourself: that's not a test lab. For a test lab to be useful, you must not pillage it for spares or for an emergency installation. If company management advocates using test lab equipment in the production network, it is clear that they do not support running a five-nines network.
Get good traffic generators and learn how to use them. You may need to include some copies of production servers and clients, with transaction control software that lets you automate the load on those systems. Make copies of the production applications so that your test lab can emulate the data flows that exist in the production network.
Staffing the test lab can be challenging. Look for the person who likes to learn new things and enjoys digging into a problem until it is solved. Foster a healthy working relationship between the test team and the production team. An adversarial approach isn't ideal. Good communication between the two teams is essential for optimum results. It may be useful to rotate the production team members through the test lab.
Document the test lab and use a lab logbook or wiki page to track who is using the lab, what they are doing, and how to contact them. Use the logbook to record any deficiencies in the lab environment. Make sure that you implement a program for reviewing and resolving the deficiencies.
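A logbook can also catch conflicting uses of the lab before they happen. Here is a minimal sketch, assuming each entry records who is using the lab, how to reach them, what they are doing, and when. The field names, people, and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Booking:
    engineer: str
    contact: str
    purpose: str
    start: datetime
    end: datetime


def conflicts(new: Booking, logbook: list[Booking]) -> list[Booking]:
    """Return existing bookings whose time window overlaps the new one."""
    return [b for b in logbook if new.start < b.end and b.start < new.end]


# Hypothetical entries for illustration only.
logbook = [
    Booking("alice", "x1234", "core OS update test",
            datetime(2024, 1, 8, 9), datetime(2024, 1, 8, 13)),
]
request = Booking("bob", "x5678", "edge configuration change",
                  datetime(2024, 1, 8, 12), datetime(2024, 1, 8, 15))

clash = conflicts(request, logbook)
for b in clash:
    print(f"Conflict with {b.engineer} ({b.contact}): {b.purpose}")
if not clash:
    logbook.append(request)
```

A wiki table works just as well. The point is that the record holds enough detail to answer "who has the lab, and how do I reach them?"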
Build automatic processes that can perform basic tests on any new OS or hardware that you are evaluating. With automatic processes, the test lab should be accessible from any location within the company. After all, in a production environment, the network engineer is seldom anywhere near the network equipment.
Have a designated place for all Layer 1 components, such as cables. Determine the process for handling broken or suspect cables. Is it worth the time to test a cable, or is it better to just replace it? Can your old cables be donated to a school whose students can test them for use in the school?
Design a Good Network Architecture
Base your design on a good network architecture that reflects the current best practices in network design. These best practices change over time, so it is useful to periodically examine them and identify anything that should be improved, perhaps reviewing them once a year. If necessary, get an outside consultant who specializes in network architecture design to participate in a design review.
A word of caution about network redundancy is in order. You can design a network with too much redundancy. The problem in these networks is that the links are typically under-sized for handling the loads that are generated when major failures happen. It is also more difficult to implement consistent network routing policies and security policies when there are many redundant paths. I've frequently seen network failures where the post-mortem analysis found that the traffic had taken an unexpected path. Know where your traffic will go for the common failures. Predictability is critical.
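One way to "know where your traffic will go" is to model it before a failure happens. The sketch below is a deliberately simplified illustration: it assumes traffic follows a single lowest-cost path (no load sharing), and the topology, costs, capacities, and demand are all invented. It fails each link in turn and reports any surviving link that would be overloaded.

```python
import heapq

# Hypothetical topology: (node_a, node_b) -> (routing cost, capacity in Mbps)
LINKS = {
    ("edge1", "core1"): (10, 1000),
    ("edge1", "core2"): (20, 100),   # under-sized backup link
    ("core1", "dc"): (10, 1000),
    ("core2", "dc"): (10, 1000),
}
# Hypothetical traffic demands: (source, destination, Mbps)
DEMANDS = [("edge1", "dc", 400)]


def shortest_path(links, src, dst):
    """Dijkstra over undirected links; returns a node list, or None if unreachable."""
    graph = {}
    for (a, b), (cost, _capacity) in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (dist + cost, nxt, path + [nxt]))
    return None


def link_loads(links, demands):
    """Sum each demand onto every link of its lowest-cost path."""
    loads = {link: 0 for link in links}
    for src, dst, mbps in demands:
        path = shortest_path(links, src, dst)
        if path is None:
            continue  # demand is cut off entirely; not counted as load
        for a, b in zip(path, path[1:]):
            key = (a, b) if (a, b) in links else (b, a)
            loads[key] += mbps
    return loads


def overloaded_after_failure(links, demands, failed):
    """Return {link: load} for links whose load exceeds capacity once `failed` is down."""
    remaining = {l: v for l, v in links.items() if l != failed}
    loads = link_loads(remaining, demands)
    return {l: load for l, load in loads.items() if load > remaining[l][1]}


for failed in LINKS:
    problems = overloaded_after_failure(LINKS, DEMANDS, failed)
    for link, load in problems.items():
        print(f"If {failed} fails, {link} carries {load} Mbps "
              f"(capacity {LINKS[link][1]} Mbps)")
```

In this made-up topology, losing either link on the primary path pushes 400 Mbps onto a 100 Mbps backup link. That is exactly the kind of surprise you would rather find in a model than in a post-mortem.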
Your network architecture should also incorporate features that take advantage of the strengths of each technology and avoid the vulnerabilities that inevitably exist. A good example is the use of spanning tree variants that prevent Layer 2 loops while converging quickly, such as 802.1w (Rapid Spanning Tree).
Network design is where a lot of network engineers want to work. They like building new things and learning about new features. However, the design team needs to temper that desire with the need for a smoothly running network. To many people, a stable, smoothly operating network is a boring network. Put these people in the lab where they can spend their time working with new features and understanding if they are suitable for production use.
Go to your test lab and make sure that any new architectural changes work the way you think they should. Software bugs often force changes to design plans, and finding a critical bug during a deployment to the production network is not good. Making deployments run smoothly is the primary reason for a good test lab.
During design, segment the core network separately from the distribution and edge. If you have edge systems connecting into the core devices, you'll have more complexity in the core from QoS, security, and policy routing. That makes it certain that you'll have a failure due to a simple configuration change at some point during the year. Keep the core simple and fast. Put the complexity in the edge.
Summary
The main theme in this post is building and running a good test lab. Even in the section on design, I talked about how the features that are used should be validated in the lab. When you are doing network design, look for stable features that will allow the core to run with at least five-nines uptime.
I'll talk about network management and documentation in the next post and provide some tips on what functions a good network management system provides.