Want a Five-Nines Network? (Part 4)
The network's efficiency improves, so the efficiency of the applications improves. Ultimately, you increase the productivity of the business, which was the objective.
This is the final post in my series about running a five-nines network (you can access the others by going here). We've covered two items in each post, and this one finishes the series with the final three tips: documentation, fixing small problems, and failure testing. These are the steps that many organizations skip. However, a five-nines network requires that downtime be minimized, and these functions help get there.
Documentation, Documentation, Documentation
Did I mention the need for good documentation? The documentation starts with good design documents. These documents should describe the design goals and how those goals will be accomplished.
There will need to be multiple auxiliary documents that describe basic processes, such as how VLANs will be numbered and named, and the IP address allocation process. You then need to record the assignments somewhere so that new allocations don't overlap with a prior allocation.
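As a rough illustration of the overlap check described above, here is a minimal Python sketch using the standard library's ipaddress module. The allocation list and the find_overlap helper are hypothetical names made up for this example; a real allocation database or IPAM tool would supply the records.

```python
import ipaddress

def find_overlap(existing, candidate):
    """Return the first recorded subnet that overlaps the candidate, or None."""
    new_net = ipaddress.ip_network(candidate)
    for cidr in existing:
        if ipaddress.ip_network(cidr).overlaps(new_net):
            return cidr
    return None

# Hypothetical allocation records (in practice, read from the allocation database).
allocations = ["10.10.0.0/24", "10.10.1.0/24", "10.20.0.0/16"]

print(find_overlap(allocations, "10.10.2.0/24"))  # no conflict: prints None
print(find_overlap(allocations, "10.20.5.0/24"))  # conflicts with 10.20.0.0/16
```

Running a check like this before recording a new assignment catches the conflict at allocation time rather than during an outage.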
Ideally, the mechanism that you use to record the allocations isn't Excel, which only one person can edit at a time. Better still, the allocation database is updated by the network management system (NMS) as it performs its periodic network discovery. The automatic update process is useful because people tend to forget documentation updates when they are under tight time pressure (always a concern in networking).
The first thing that network people need when troubleshooting a connectivity problem is a network topology diagram. What devices are connected to each other? It helps the troubleshooting process to see the path that the data is (or should be) traversing. There are several good automatic network topology discovery and mapping products on the market. Using them to create topology maps of each segment of the network is a good process for keeping network topology diagrams updated.
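To show why topology data helps with path questions, here is a small Python sketch that finds a hop-by-hop path between two devices from a list of discovered links. The device names and the find_path helper are invented for illustration, and a breadth-first search only gives the path with the fewest hops over the recorded links; the actual forwarding path depends on spanning tree and routing, which this sketch does not model.

```python
from collections import deque

def find_path(links, src, dst):
    """Breadth-first search for the fewest-hop path between two devices."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if src not in adjacency or dst not in adjacency:
        return None
    previous = {src: None}  # maps each visited device to the device it was reached from
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical links, as a discovery tool might export them.
links = [
    ("access-sw1", "dist-sw1"),
    ("dist-sw1", "core-rtr1"),
    ("core-rtr1", "dist-sw2"),
    ("dist-sw2", "access-sw2"),
]
print(find_path(links, "access-sw1", "access-sw2"))
```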
There should be some written documentation and network diagrams that describe any exceptions to the design guides. This documentation should be clear about why the exception exists, so that if the reason changes, the exception can be modified or removed.
Good documentation reduces initial errors in network design and significantly reduces troubleshooting time when problems occur. While it takes time to do the documentation, it helps reduce network downtime.
Death By a Thousand Cuts
All the networks I've encountered have many small problems: things like duplex mismatches, native VLAN mismatches, high interface error counts, static routes in the wrong places, and flapping interfaces. The network administrators often see these as small problems that don't need their attention.
However, these problems sometimes interact with other problems to create outages. Or the network performance is simply poor over some paths, and the people who run applications over those paths are less productive. In some networks, these problems number in the hundreds or thousands. Faced with a count that high, most network staff choose to ignore the problems.
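Some of these problems are easy to find mechanically once both ends of each link are recorded. The following Python sketch compares the two ends of every link and reports duplex and native VLAN mismatches. The record layout, port names, and the find_mismatches helper are assumptions made for this example; in practice the data would come from the NMS or from device configurations.

```python
def find_mismatches(links, fields=("duplex", "native_vlan")):
    """Report each link whose two ends disagree on one of the given settings."""
    problems = []
    for link in links:
        a, b = link["a"], link["b"]
        for field in fields:
            if a[field] != b[field]:
                problems.append((a["port"], b["port"], field, a[field], b[field]))
    return problems

# Hypothetical inventory: one record per link, with the settings seen at each end.
links = [
    {"a": {"port": "sw1:Gi0/1", "duplex": "full", "native_vlan": 1},
     "b": {"port": "sw2:Gi0/1", "duplex": "half", "native_vlan": 1}},
    {"a": {"port": "sw1:Gi0/2", "duplex": "full", "native_vlan": 10},
     "b": {"port": "sw3:Gi0/1", "duplex": "full", "native_vlan": 10}},
]

for problem in find_mismatches(links):
    print(problem)
```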
But if the problems are repaired a few at a time, the volume goes down and the network becomes steadily better and more efficient. It is important for the organization's executives to support fixing the problems, or the technical staff will likely continue to ignore them.
Fixing all the small problems also cleans up the NMS because there are fewer events to handle, so the NMS runs more efficiently. The operations staff who use the NMS can more easily spot important events because they no longer have to ignore or filter out the events generated by small problems.