Want a Five-Nines Network? (Part 3)

This is the third post in a series about steps you can take to have a five-nines network--that is, a network with 99.999% availability. Five-nines is generally considered the goal for converged networks; it is the availability metric that was common for the historical voice network.

This blog post describes how to use network and configuration management to increase network availability. Network management is one of my specialties and I've created a Network Management Architecture, which is described at http://www.netcraftsmen.net/resources/blogs/nms-architecture-fcaps-or-itil.html.

Manage IT!
How do you know when one component fails in a resilient network? A resilient network will continue to run, perhaps in degraded mode. Network management systems must be used to monitor all parts of a resilient network and must let you know that some part of it has failed so that you can fix it before another component fails, causing an outage.

Having spent time working in financial networks, which have similar requirements, I've seen quite a number of failures where the analysis showed that both parts of a redundant configuration had failed, often weeks or months apart. The first failure went unnoticed because there was no outage. It was only when the second failure occurred that both failures were found.

How do you prevent such failures? Network Management! You have to monitor the network to identify failures. The system should generate alerts when a key device or interface fails. You can also set thresholds to create alerts when the utilization of an interface changes substantially, either to near zero or to very high levels. Big changes may mean that the routing or spanning tree protocols changed paths due to a change in the network. If you are aware of the change, then the alert is validation that the network management system is working correctly. If you're not aware of a change that would create the alert, then there is something to investigate.
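
To make this concrete, here is a rough Python sketch of the kind of utilization-deviation check an NMS applies to interface polling data. The interface names, baseline figures, and thresholds are made-up values for illustration only.

```python
# Minimal sketch of a utilization-deviation check. The interface names,
# baseline values, and thresholds are illustrative, not from any real NMS.

BASELINE = {            # typical utilization per interface, as a fraction of capacity
    "core1:Gi0/1": 0.35,
    "core2:Gi0/1": 0.30,
}

LOW_WATER = 0.02        # "near zero" -- traffic may have moved off this link
HIGH_WATER = 0.70       # "very high" -- this link may now carry rerouted traffic

def check_utilization(samples):
    """Return alert strings for interfaces whose utilization changed substantially."""
    alerts = []
    for ifname, util in samples.items():
        baseline = BASELINE.get(ifname)
        if baseline is None:
            continue
        if util <= LOW_WATER and baseline > LOW_WATER:
            alerts.append(f"{ifname}: utilization dropped to {util:.0%} (baseline {baseline:.0%})")
        elif util >= HIGH_WATER:
            alerts.append(f"{ifname}: utilization climbed to {util:.0%} (baseline {baseline:.0%})")
    return alerts

if __name__ == "__main__":
    # Example poll result: traffic appears to have shifted from core1 to core2.
    for alert in check_utilization({"core1:Gi0/1": 0.01, "core2:Gi0/1": 0.82}):
        print("ALERT:", alert)
```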

Another way to monitor the network is to perform active monitoring by using synthetic tests. I sometimes call these tests "application level pings," because they run at the application layer. For example, if sending an email takes longer than usual or fails to complete, then there's either a network problem or an email server problem. Web page retrieval tests perform the same type of monitoring.
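
Here's a minimal sketch of a web-retrieval "application level ping" in Python. The URL and the two-second "slow" threshold are hypothetical; a real synthetic-test system adds scheduling, history, and alerting on top of a check like this.

```python
# Minimal "application level ping": time an HTTP retrieval and flag slow or
# failed responses. The URL and the 2-second threshold are illustrative.
import time
import urllib.request

def web_ping(url, slow_threshold_s=2.0, timeout_s=10.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            status = resp.status
    except Exception as exc:
        return f"FAIL {url}: {exc}"
    elapsed = time.monotonic() - start
    if status != 200:
        return f"FAIL {url}: HTTP {status}"
    if elapsed > slow_threshold_s:
        return f"SLOW {url}: {elapsed:.2f}s"
    return f"OK {url}: {elapsed:.2f}s"

if __name__ == "__main__":
    # Hypothetical intranet page used as the synthetic test target.
    print(web_ping("http://intranet.example.com/healthcheck"))
```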

In converged networks, there are two important monitoring steps to take. The first is to monitor the endpoints for connection quality. What are the typical stats for delay, jitter, and loss? Are calls terminated abnormally? The stats from real calls are a great way to keep an eye on how the network is performing and to highlight trouble areas. Increasing loss and jitter are early indications of congestion somewhere in the path. The path may have changed due to a failure, leaving the secondary path oversubscribed, or the primary path itself may have become oversubscribed and congested.
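
As a rough illustration, the sketch below flags calls whose endpoint-reported statistics exceed common VoIP rules of thumb (roughly 150 ms one-way delay, 30 ms jitter, 1% loss). The thresholds are general guidelines rather than values from any particular product, and the collection of the statistics themselves (RTCP reports, call detail records, and so on) is not shown.

```python
# Minimal sketch: flag calls whose endpoint-reported quality statistics exceed
# common VoIP rules of thumb (~150 ms one-way delay, ~30 ms jitter, ~1% loss).
# How the stats are collected (RTCP reports, CDRs, etc.) is not shown here.

THRESHOLDS = {"delay_ms": 150.0, "jitter_ms": 30.0, "loss_pct": 1.0}

def flag_poor_calls(call_stats):
    """call_stats: list of dicts with 'call_id', 'delay_ms', 'jitter_ms', 'loss_pct'."""
    flagged = []
    for call in call_stats:
        reasons = [
            f"{metric}={call[metric]:.1f} (limit {limit:.1f})"
            for metric, limit in THRESHOLDS.items()
            if call.get(metric, 0.0) > limit
        ]
        if reasons:
            flagged.append((call["call_id"], reasons))
    return flagged

if __name__ == "__main__":
    sample = [
        {"call_id": "1001", "delay_ms": 80, "jitter_ms": 12, "loss_pct": 0.1},
        {"call_id": "1002", "delay_ms": 95, "jitter_ms": 45, "loss_pct": 2.3},  # likely congestion
    ]
    for call_id, reasons in flag_poor_calls(sample):
        print(f"Call {call_id} degraded: {', '.join(reasons)}")
```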

The second step for monitoring converged networks is to generate synthetic voice and video traffic. I refer to this as active testing. It is similar to the "application level pings" that I described above. There are at least two methods for generating voice/video synthetic traffic. One is to add probes to key points in the network, such as at each major site, and run tests between the probes. Another is to create synthetic calls to the endpoints, but this requires that the voice and video endpoints support test calls without someone manually initiating them.
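
Here is a bare-bones sketch of such a probe pair in Python: one probe echoes UDP packets, the other sends a timestamped burst and summarizes round-trip delay, a simple jitter proxy, and loss. The hostname, port, and packet counts are placeholders, and real probes (or router-based test features) add one-way measurements, proper interarrival jitter, scheduling, and reporting.

```python
# Minimal sketch of a probe pair. run_responder() echoes UDP packets on the
# remote probe; run_sender() measures round-trip delay, a jitter proxy
# (standard deviation of RTT), and loss. Hosts, port, and counts are placeholders.
import socket
import statistics
import time

PORT = 50007            # arbitrary test port

def run_responder(bind_addr="0.0.0.0"):
    """Run on the remote probe: echo every packet back to its sender."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, PORT))
        while True:
            data, peer = sock.recvfrom(2048)
            sock.sendto(data, peer)

def run_sender(remote_host, count=50, interval_s=0.02):
    """Run on the local probe: send packets and summarize RTT, jitter, and loss."""
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(1.0)
        for seq in range(count):
            sent = time.monotonic()
            sock.sendto(str(seq).encode(), (remote_host, PORT))
            try:
                sock.recvfrom(2048)
                rtts.append((time.monotonic() - sent) * 1000.0)  # milliseconds
            except socket.timeout:
                pass  # no echo received: counted as loss
            time.sleep(interval_s)
    loss_pct = 100.0 * (count - len(rtts)) / count
    jitter = statistics.pstdev(rtts) if len(rtts) > 1 else 0.0
    avg = statistics.mean(rtts) if rtts else float("nan")
    print(f"avg RTT {avg:.1f} ms, jitter {jitter:.1f} ms, loss {loss_pct:.1f}%")

if __name__ == "__main__":
    run_sender("probe-remote.example.com")  # hypothetical remote probe
```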

When a problem is identified, it should be entered into a trouble ticket system to aid in tracking failures. You can then analyze the tickets to determine which types of failures occur most often.
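
Even something as simple as counting ticket categories, as in the sketch below with made-up categories, shows where the effort should go.

```python
# Minimal sketch: count failure categories exported from a trouble ticket
# system to see which problems occur most often. The categories are illustrative.
from collections import Counter

tickets = [
    "duplex mismatch", "interface flap", "duplex mismatch",
    "routing instability", "interface flap", "duplex mismatch",
]

for failure_type, count in Counter(tickets).most_common():
    print(f"{count:3d}  {failure_type}")
```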

Finally, spend time to identify and fix common well-known problems. Duplex mismatch comes to mind as a great example. A lot of people think that duplex mismatch isn't a big problem and that they can let it go. As long as the link is very lightly loaded, they are correct. But high-volume links will have very poor throughput. Other examples are flapping interfaces, unstable routing and spanning tree protocol instances, high-utilization links (more than 50% average utilization or 70% 95th percentile utilization), and interfaces reporting errors or discards.
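
The sketch below applies the thresholds mentioned above (50% average utilization, 70% 95th-percentile utilization, and any errors or discards) to per-interface data. The interface name, utilization samples, and counter values are made up for the example.

```python
# Minimal sketch applying the thresholds from the text to per-interface data:
# flag >50% average utilization, >70% 95th-percentile utilization, and any
# errors or discards. The sample data is illustrative.
import statistics

def percentile_95(samples):
    ordered = sorted(samples)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

def check_interface(name, util_samples, errors, discards):
    problems = []
    if statistics.mean(util_samples) > 0.50:
        problems.append("average utilization over 50%")
    if percentile_95(util_samples) > 0.70:
        problems.append("95th percentile utilization over 70%")
    if errors:
        problems.append(f"{errors} input/output errors")  # often a duplex-mismatch symptom
    if discards:
        problems.append(f"{discards} discards")
    return [f"{name}: {p}" for p in problems]

if __name__ == "__main__":
    # A link that looks fine on average but is congested at the 95th percentile
    # and is also reporting errors.
    samples = [0.2, 0.3, 0.4, 0.3, 0.9, 0.85, 0.2, 0.25, 0.3, 0.95]
    for problem in check_interface("access1:Gi1/0/24", samples, errors=142, discards=0):
        print("PROBLEM:", problem)
```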

Taking care of all the small problems makes the network more stable and efficient. You can then focus on bigger problems and you know that two small problems aren't interacting to produce a larger symptom.

Configuration Management
Network industry statistics indicate that failed configuration changes are likely the most common source of network outages. A configuration management system can help avoid some of these failures, particularly if you employ a configuration change review board and a change approval process. The peer review process helps avoid silly mistakes. Also make sure that the change validation steps and the back-out procedures are valid. I know of one back-out process that failed because its commands had to be executed in a specific order, which was not documented; even though the commands were correct, they did not work when run out of sequence.

The configurations that you use should be as standard as possible. By minimizing the number of configuration variations, you minimize the number of OS bugs you are likely to encounter. You also make it easier to understand and remember what the configurations are doing. That translates into fewer mistakes. If a one-off configuration change must be implemented, it needs to be clearly documented so that anyone reading the configuration knows why it is different.

Just to be clear, the standard configurations do not have to be the entire configuration for a device. It is quite reasonable to use configuration snippets. For example, the SNMP and syslog configuration may be the same across all devices from a single vendor and would be very similar between vendor products. When the snippet changes, it should be reapplied to all network devices.

Once standard configurations are developed, a configuration management system can help make sure that they are properly deployed to all network devices. A key function of the configuration management system is to check that the configurations of all devices match the standard configuration snippets and to generate an alert for any exceptions. When the snippets change, it is easy to identify all the devices that need to have configuration updates applied. Also, if someone makes an unapproved change, an alert is generated.
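
Here is a minimal sketch of that snippet check in Python: it verifies that every backed-up configuration contains each line of the standard snippets and reports the exceptions. The snippet contents, community string, and backup-directory layout are invented for the example.

```python
# Minimal sketch of snippet compliance checking: verify that every backed-up
# device configuration contains each line of the standard snippets and report
# exceptions. The snippet contents and the config directory layout are made up.
from pathlib import Path

STANDARD_SNIPPETS = {
    "syslog": [
        "logging host 10.0.0.50",
        "logging trap informational",
    ],
    "snmp": [
        "snmp-server community MyROCommunity RO",
        "snmp-server location DataCenter-1",
    ],
}

def check_compliance(config_dir="backups"):
    """Yield (device, snippet, missing line) for every deviation found."""
    for config_file in sorted(Path(config_dir).glob("*.cfg")):
        config_lines = {line.strip() for line in config_file.read_text().splitlines()}
        for snippet_name, required_lines in STANDARD_SNIPPETS.items():
            for required in required_lines:
                if required not in config_lines:
                    yield config_file.stem, snippet_name, required

if __name__ == "__main__":
    for device, snippet, missing in check_compliance():
        print(f"ALERT {device}: missing '{missing}' from standard {snippet} snippet")
```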

The scripting capabilities of some configuration management systems can automate periodic tasks such as changing passwords or SNMP community strings, as required by government regulations. I don't know of many network admins who like the task of updating passwords on more than a handful of systems. Automating these tasks eliminates the manual work and frees the staff to perform more valuable tasks.
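
As a rough sketch, assuming the third-party Netmiko library and Cisco IOS-style commands, rotating an SNMP community string across a device list might look like the following. The hostnames, credentials, and community strings are placeholders, and most configuration management systems provide this kind of scheduled job natively.

```python
# Minimal sketch of rotating an SNMP community string across many devices,
# assuming the third-party Netmiko library and Cisco IOS-style commands.
# Device list, credentials, and community strings are placeholders.
from netmiko import ConnectHandler

DEVICES = ["rtr1.example.com", "rtr2.example.com"]  # hypothetical inventory
OLD_COMMUNITY = "OldROCommunity"
NEW_COMMUNITY = "NewROCommunity"

def rotate_community(host, username, password):
    commands = [
        f"snmp-server community {NEW_COMMUNITY} RO",
        f"no snmp-server community {OLD_COMMUNITY}",
    ]
    conn = ConnectHandler(device_type="cisco_ios", host=host,
                          username=username, password=password)
    try:
        output = conn.send_config_set(commands)
        conn.save_config()
        return output
    finally:
        conn.disconnect()

if __name__ == "__main__":
    for device in DEVICES:
        print(rotate_community(device, username="netadmin", password="changeme"))
```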

I am a proponent of automatic network discovery in which the NMS checks for new devices and automatically adds them to the NMS and to the configuration validation system. If a new device is discovered and it does not have one of the system-wide SNMP community strings, an alert should be generated so that it can be brought into compliance. I've encountered some network administrators who want to manually add every device to the NMS. I inevitably find devices in their networks that should be managed, but slipped through that process, so automatically finding devices is important. Automatically adding devices to the configuration management system makes sure that all devices are compliant with network configuration policies, which is a good thing.
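
A bare-bones version of that discovery-plus-compliance idea might look like the sketch below: ping a subnet, compare responders against the known inventory, and alert on devices that lack the standard community string. The subnet, inventory, and ping syntax (Linux-style here) are assumptions, and the SNMP community check is left as a stub.

```python
# Minimal sketch of a discovery sweep: ping every address in a subnet and list
# responders that are not already in the NMS inventory so they can be added and
# checked for the standard SNMP community. The subnet and inventory are made up,
# and the SNMP community check is a placeholder.
import ipaddress
import subprocess

KNOWN_DEVICES = {"10.1.1.1", "10.1.1.2"}   # addresses already in the NMS

def ping(address, timeout_s=1):
    """Return True if the address answers a single ICMP echo (Linux 'ping' syntax)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def has_standard_community(address):
    """Placeholder: query the device with the system-wide SNMP community (not implemented)."""
    return False

def discover(subnet="10.1.1.0/28"):
    for host in ipaddress.ip_network(subnet).hosts():
        address = str(host)
        if not ping(address) or address in KNOWN_DEVICES:
            continue
        if has_standard_community(address):
            print(f"NEW DEVICE {address}: adding to NMS and configuration checks")
        else:
            print(f"ALERT {address}: discovered but missing the standard SNMP community")

if __name__ == "__main__":
    discover()
```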

Summary
Network monitoring is essential for detecting failures in a resilient network infrastructure as well as providing valuable insight into how well the network is running and where it could be improved. To achieve five-nines of network availability, all the small problems need to be corrected so that the problems that remain are easier to spot. You can think of this as improving the signal-to-noise ratio in the event stream. A large number of small, regularly occurring events, such as duplex mismatches or unstable interfaces, can hide more critical events. Fix the unstable interfaces and it is easy to spot the failure of a critical interface.

Create a system for performing active network path testing. The system will likely require deploying additional probes, and you will need to think about where to place them and which tests to create so that a minimal number of tests provides the desired level of coverage. A full mesh of tests is difficult to scale, so look for ways to cover the critical paths with as few tests as possible.
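
A quick back-of-the-envelope comparison shows why: with N probe sites, a full mesh requires N*(N-1)/2 tests, while a hub-and-spoke arrangement (one common way to cut the test count) needs only N-1.

```python
# Small illustration of why a full mesh of active tests scales poorly: with N
# probe sites, a full mesh needs N*(N-1)/2 tests, while testing each site only
# against a central hub needs N-1.
for sites in (5, 10, 25, 50):
    full_mesh = sites * (sites - 1) // 2
    hub_and_spoke = sites - 1
    print(f"{sites:3d} sites: full mesh {full_mesh:5d} tests, hub-and-spoke {hub_and_spoke:3d} tests")
```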

Examine the trouble ticket system to identify common failures. Create active tests and identify critical monitoring points that provide visibility into the common failures. Is there a way to re-design the network to avoid the common failures? Should a procedure be modified to reduce errors?

Implement a network discovery and configuration management system. Make sure that the management systems have identified all of the devices in the network. Back up the configurations of all the network equipment. Finally, create configuration policies and configuration snippets and make sure that the network configuration matches the desired policies and configurations.

I will finish this series next month when I discuss Death By a Thousand Cuts and Failure Testing.