This is the final post in my series about running a five-nines network (you can find the earlier posts here). We've covered two items in each post, and this one finishes with the final three tips: documentation, fixing small problems, and failure testing. These are steps that many organizations skip. However, getting to a five-nines network means minimizing downtime, and these practices help get there.
Documentation, Documentation, Documentation
Did I mention the need for good documentation? The documentation starts with good design documents. These documents should describe the design goals and how those goals will be accomplished.
You will also need several auxiliary documents that describe basic processes, such as how VLANs are numbered and named and how IP addresses are allocated. You then need to record the assignments somewhere so that new allocations don't overlap with prior ones.
The mechanism you use to record the allocations shouldn't be Excel, which only one person can edit at a time. Ideally, the allocation database is updated automatically by the network management system (NMS) as it performs its periodic network discovery. Automatic updates matter because people tend to forget documentation updates when they're under tight time pressure (always a concern in networking).
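As a concrete illustration, here is a minimal Python sketch of the overlap check an allocation tool (or a script wrapped around an NMS export) might perform before recording a new subnet. The subnets, descriptions, and the check_new_allocation helper are made up for the example; a real IPAM system would do this against its own database.

```python
import ipaddress

# Existing allocations, as they might be exported from an IPAM tool or NMS
# (the subnets and descriptions here are invented for illustration).
existing = {
    "10.10.0.0/24": "Building A user VLAN",
    "10.10.1.0/24": "Building A voice VLAN",
    "10.20.0.0/22": "Data center servers",
}

def check_new_allocation(candidate):
    """Return the existing allocations that overlap the candidate subnet."""
    new_net = ipaddress.ip_network(candidate)
    return [
        f"{prefix} ({desc})"
        for prefix, desc in existing.items()
        if new_net.overlaps(ipaddress.ip_network(prefix))
    ]

conflicts = check_new_allocation("10.10.1.128/25")
if conflicts:
    print("Refusing to allocate; overlaps with:", ", ".join(conflicts))
else:
    print("No overlap; safe to record the new allocation.")
```

Even a simple check like this, run automatically, catches the overlaps that slip in when someone is allocating addresses in a hurry.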
The first thing network people need when troubleshooting a connectivity problem is a network topology diagram. Which devices are connected to each other? Seeing the path that the data is (or should be) traversing helps the troubleshooting process. There are several good automatic topology discovery and mapping products on the market, and using them to map each segment of the network is a good way to keep the diagrams current.
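To show the idea, here is a small Python sketch that turns neighbor data (the kind an NMS or an LLDP/CDP collection script would export) into an adjacency list that a drawing tool can render. The device names and interfaces are invented for the example.

```python
from collections import defaultdict

# Hypothetical neighbor data, e.g., exported from LLDP/CDP tables by an NMS.
# Each entry: (local device, local interface, remote device, remote interface)
neighbors = [
    ("core-sw1", "Eth1/1", "dist-sw1", "Eth1/49"),
    ("core-sw1", "Eth1/2", "dist-sw2", "Eth1/49"),
    ("dist-sw1", "Eth1/1", "access-sw10", "Gi0/48"),
]

# Build an adjacency list; a diagramming tool can render the edges from here.
topology = defaultdict(set)
for local, l_if, remote, r_if in neighbors:
    topology[local].add((remote, f"{l_if} <-> {r_if}"))
    topology[remote].add((local, f"{r_if} <-> {l_if}"))

for device, links in sorted(topology.items()):
    for peer, ports in sorted(links):
        print(f"{device} -- {peer}  [{ports}]")
```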
There should be some written documentation and network diagrams that describe any exceptions to the design guides. This documentation should be clear about why the exception exists, so that if the reason changes, the exception can be modified or removed.
Good documentation reduces initial errors in network design and significantly cuts troubleshooting time when problems occur. Writing it takes time, but it pays off in less network downtime.
Death By a Thousand Cuts
All the networks I've encountered have many small problems--things like duplex mismatches, native VLAN mismatches, high interface error counts, static routes in the wrong places, and flapping interfaces. The network administrators often see these as small problems that don't need their attention.
However, these problems sometimes interact with other problems to create outages. Or performance is simply poor over some paths, and the people running applications over those paths lose productivity. In some networks, these problems number in the hundreds or thousands, and at that volume most network staff choose to ignore them entirely.
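To make this concrete, here is a rough Python sketch of the kind of sweep a script could run over per-interface data to flag the common small problems. The data structure, field names, and thresholds are all assumptions for illustration; in practice the data would come from SNMP polling or an NMS export.

```python
# Hypothetical per-interface records; thresholds are illustrative only.
interfaces = [
    {"name": "sw1:Gi0/1", "duplex": "half", "peer_duplex": "full",
     "in_errors": 15234, "flaps_24h": 0},
    {"name": "sw2:Gi0/7", "duplex": "full", "peer_duplex": "full",
     "in_errors": 12, "flaps_24h": 9},
]

ERROR_THRESHOLD = 1000   # input errors since the last counter clear
FLAP_THRESHOLD = 5       # link transitions in the last 24 hours

findings = []
for intf in interfaces:
    if intf["duplex"] != intf["peer_duplex"]:
        findings.append(f'{intf["name"]}: duplex mismatch')
    if intf["in_errors"] > ERROR_THRESHOLD:
        findings.append(f'{intf["name"]}: {intf["in_errors"]} input errors')
    if intf["flaps_24h"] > FLAP_THRESHOLD:
        findings.append(f'{intf["name"]}: flapping ({intf["flaps_24h"]} transitions)')

for finding in findings:
    print(finding)
```

A sweep like this turns a vague sense of "lots of little issues" into a ranked work list that can be chipped away at.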
Fixing all the small problems cleans up the NMS because there are fewer events to handle, so it runs more efficiently. The operations staff who use the NMS can more easily spot important events because they no longer have to ignore or filter out the events caused by small problems.
But if the problems are repaired a few at a time, the volume goes down and the network begins to get better and more efficient. It is important for the organization's executives to support fixing the problems, or the technical staff will likely continue to ignore them.
Failure Testing
How do you know that your redundant network will react the way that you want when a failure occurs? I've never known of a failure that scheduled itself at a convenient time. You can wait for a failure, or you can force failures during maintenance windows.
Think of it like testing emergency generators. Periodically turn off part of the network infrastructure and verify that the backup paths and devices perform as they should. Have the network staff watching to confirm that the failover happens as desired and to gather data if it doesn't. Under controlled conditions, it is easy to restore service and verify that the network returns to its desired operating state.
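One simple way to watch the test from the user's point of view is to probe a few targets behind the failed component while the failover is in progress. The sketch below is a minimal Python example; the target addresses are hypothetical, and it assumes a Linux-style ping command.

```python
import subprocess

# Hypothetical test targets behind the link being failed over; the idea is to
# confirm they stay reachable while the primary path is down.
targets = ["10.10.0.1", "10.20.0.10", "192.0.2.50"]

def reachable(host):
    """Single ICMP probe; returns True if the host answers (Linux ping syntax)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

print("Reachability during the forced failure:")
for host in targets:
    status = "OK" if reachable(host) else "UNREACHABLE"
    print(f"  {host}: {status}")
```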
The failover and recovery should be deterministic. Use the NMS to measure loads on the backup systems and links during the failure. Do the backup links handle the load, or does the system operate in degraded (overloaded) mode? Does the NMS recognize that a failure has occurred and does the notification process work as designed?
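The load measurement itself is simple arithmetic once you have two counter samples from the backup link. Here is a small Python sketch using made-up numbers; in practice the samples would come from the NMS or from SNMP counters such as ifHCInOctets.

```python
# Two readings of a backup link's byte counter taken during the failover test;
# the values and interval are illustrative.
octets_t0 = 1_250_000_000
octets_t1 = 1_287_500_000
interval_seconds = 60
link_speed_bps = 1_000_000_000   # 1 Gb/s backup link

# Utilization = bits transferred in the interval divided by link capacity.
bits_per_second = (octets_t1 - octets_t0) * 8 / interval_seconds
utilization_pct = 100 * bits_per_second / link_speed_bps

print(f"Backup link load: {bits_per_second / 1e6:.0f} Mb/s "
      f"({utilization_pct:.1f}% of capacity)")
```

If the measured load approaches the backup link's capacity, you've learned during a maintenance window, rather than during a real failure, that the backup path will run in degraded mode.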
Make sure that the staff understands what is supposed to happen and that the alerts are properly received and handled. Getting the staff involved is like running a fire drill. Everyone runs through the steps, knowing that it is a drill, but in the process everyone builds a memory of how the process works.
It takes a lot of effort to prepare for one of these tests, but over time they get easier to perform. It gives everyone confidence to know that the tested failure was handled properly. And if it wasn't, the network team has the opportunity to learn what didn't work, make corrections, and repeat the process until it is.
Make sure that you have a good restoration plan in place for those times when the redundant systems do not work. Of course, this testing is done during maintenance windows where everyone knows that some piece of the infrastructure may go down.
The IT systems should continue to function--if they are properly designed and deployed and you have verified that no single point of failure can cause an outage.
Summary
The first year of implementing all of the recommended changes will be the most challenging. There will be outages due to fixing things that have not been optimal for a long time but that no one wanted to touch for fear of breaking something. Sometimes, you have to break the system and reassemble it differently in order to have a better network. A good change-implementation process, with good back-out plans, minimizes the potential for outages that exceed the defined outage window.
As the network's efficiency improves, so does the efficiency of the applications that run over it. Ultimately, that means increased productivity for the business, which was the objective all along.