I recently reviewed a customer's network in preparation for a network design and refresh. The review would provide us (both my company NetCraftsmen and the customer) with an idea of the current state of the network. We could then work together to determine the changes that would be required to design their next-generation network.
The customer had a network management system in place, but it was not configured quite right and didn't have a good view of the current network's performance. The customer was looking to add new applications, including more unified communications and collaboration (UC&C) solutions, which would increase their bandwidth requirements. However, we were unable to tell how much bandwidth they were using currently, because the tools weren't configured correctly. We were able to get enough data to fulfill the requirements of the network review, but we all decided that spending a little time on getting the tools updated and correctly configured would be worth the effort.
The experience started me thinking about network management system (NMS) best practices, so I looked for articles on the subject and picked out the ones that contained useful information. Several are quite old, but they make points that are still relevant today. Not surprisingly, only one covered virtual networking and SDN. (See my series on No Jitter about SDN management.)
The Cisco Performance Management: Best Practices White Paper contains a valuable recommendation that was missing from all the other papers I reviewed: Develop a network management concept of operations. The concept of operations provides a set of objectives that the NMS platform must accomplish in order to provide value to the organization. It formulates a basis from which you can establish requirements and create a well-defined plan for network management. This process of identifying requirements and creating a plan is fundamental to successful execution of any project and is critical to the success of a complex project like a network management system. Pay particular attention to the section about the concept of operations.
While the Cisco Baseline Process Best Practices White Paper is old (2005), it contains some valuable tips. For example, the "What If?" sections towards the end recommend setting thresholds at three different performance levels. Crossing the first threshold is an early heads-up that you need to start planning for action. Crossing the second threshold is a warning that you will soon need to act: your planning should be finished, and you should be acquiring any required resources (purchasing a new device, upgrading a link, etc.) so that you are prepared to implement your plan. The third threshold is the trigger point for taking action. Of course, you shouldn't delay planning until the third threshold is triggered.
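To make the three-threshold idea concrete, here is a minimal Python sketch. The threshold values (60, 75, and 90 percent utilization) and the level names are my own assumptions for illustration; the Cisco paper does not prescribe them, and you would tune them for each link and policy.

```python
from enum import IntEnum


class Level(IntEnum):
    OK = 0
    PLAN = 1      # first threshold: start planning
    PREPARE = 2   # second threshold: finish planning, acquire resources
    ACT = 3       # third threshold: implement the plan


# Hypothetical utilization thresholds in percent, lowest first.
THRESHOLDS = ((60.0, Level.PLAN), (75.0, Level.PREPARE), (90.0, Level.ACT))


def classify(utilization_pct: float) -> Level:
    """Return the highest threshold level that the utilization has reached."""
    level = Level.OK
    for limit, reached in THRESHOLDS:
        if utilization_pct >= limit:
            level = reached
    return level


if __name__ == "__main__":
    # Made-up sample readings for three links.
    for link, pct in {"wan-1": 42.0, "wan-2": 63.5, "core-1": 91.2}.items():
        print(f"{link}: {pct:.1f}% -> {classify(pct).name}")
```

Keeping the thresholds in one table makes it easy to review them as policy rather than burying them in alert rules.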
Another useful tip is to manage by exception: "Notice that in this process, attention is focused on the exceptions in the network and is not concerned with other devices. It is assumed that as long as devices are below thresholds, they are fine," the document states.
Networks generate so much data that it is critical to find ways to reduce the volume so that you can see the information (the forest) instead of the data (the trees). Management by exception allows you to focus on where the network needs attention and not dwell on parts of the network that are performing well. Because the document is old, you can skip the sections that talk about configuring old Cisco NMS software, which makes it a quick read.
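Management by exception can be sketched in a few lines of Python. The metric names and limits below are hypothetical, and a real NMS would pull the samples from its own data store; the point is only that everything below its limit is dropped from the report.

```python
# Hypothetical per-metric limits; anything at or below these is assumed fine.
LIMITS = {"cpu_pct": 80.0, "mem_pct": 85.0, "if_util_pct": 75.0}


def exceptions(samples: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Return only the devices and metrics that exceed their limits."""
    report = {}
    for device, metrics in samples.items():
        # Metrics without a configured limit are never reported.
        over = {m: v for m, v in metrics.items() if v > LIMITS.get(m, float("inf"))}
        if over:
            report[device] = over
    return report


if __name__ == "__main__":
    # Made-up samples: only sw2 should appear in the report.
    samples = {
        "rtr1": {"cpu_pct": 35.0, "mem_pct": 60.0, "if_util_pct": 40.0},
        "sw2": {"cpu_pct": 93.0, "mem_pct": 70.0, "if_util_pct": 20.0},
    }
    print(exceptions(samples))
```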
Another Cisco document, Network Management System: Best Practices White Paper (dated 2007), discusses the FCAPS (fault, configuration, accounting, performance, and security) model of network management. Its fault detection section discusses the use of a syslog system to collect network event data. What it lacks is a mechanism for reducing the large volume of event data to something that is more easily consumed. For this, I recommend using a syslog summary mechanism, which I've written about in the past.
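As a rough illustration of what a syslog summary does, the Python sketch below counts messages by their identifier and lists the most frequent first. It assumes Cisco-style identifiers of the form %FACILITY-SEVERITY-MNEMONIC; the regular expression and the sample log lines are my own and are not taken from the white paper. Messages from other vendors would need their own pattern.

```python
import re
from collections import Counter

# Assumed Cisco-style message identifier, e.g. %LINK-3-UPDOWN:
MSG_ID = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+):")


def summarize(lines):
    """Count messages per identifier, most frequent first."""
    counts = Counter()
    for line in lines:
        match = MSG_ID.search(line)
        # Lines without a recognizable identifier are grouped together.
        key = match.group(0).rstrip(":") if match else "UNPARSED"
        counts[key] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Made-up log lines for illustration.
    log = [
        "Jan  5 10:00:01 rtr1 45: %LINK-3-UPDOWN: Interface Gi0/1, changed state to down",
        "Jan  5 10:00:02 rtr1 46: %LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/1, changed state to down",
        "Jan  5 10:00:09 rtr1 47: %LINK-3-UPDOWN: Interface Gi0/1, changed state to up",
        "Jan  5 10:03:00 sw2 12: %SYS-5-CONFIG_I: Configured from console by admin on vty0",
    ]
    for msg_id, count in summarize(log):
        print(f"{count:6d}  {msg_id}")
```

A summary like this turns thousands of lines into a short list, so a flapping interface or a chatty device stands out immediately.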
A more recent paper, Seven Best Practices for Network Management, talks about monitoring SDNs, virtual networking, and multi-vendor support. I particularly like best practice No. 5, unified policy management, because policies define what you want to happen and how the network should function. Examples include how certain network events reported by syslog will be handled, how devices will be named, what levels of interface performance are unacceptable, and the details of the addressing plan. Each policy that defines a functional threshold, such as interface performance, should include a description of how it should be remediated. Defining policies and actions reduces the amount of work required to manage a network.
SolarWinds has two papers on the subject of network management: Network Monitoring Best Practices and 10 Best Practices to Streamline Network Monitoring. A key element in both papers is establishing a network baseline. Without a good baseline, it is impossible to know whether something you observe is normal or atypical behavior. With periodic baselines, you can identify the approximate time when a particular network anomaly began appearing. Sometimes you can correlate the timing of network anomalies with network changes, allowing you to reduce your troubleshooting time.
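One simple way to use a baseline, sketched below in Python, is to compare a new measurement against the mean and standard deviation of historical samples. This is my own illustration rather than a method from the SolarWinds papers, and the cutoff of three standard deviations is an assumed value that you would adjust to your environment.

```python
from statistics import mean, stdev


def is_atypical(history, current, k=3.0):
    """True if current is more than k standard deviations from the baseline mean.

    history needs at least two samples; k=3.0 is an assumed default.
    """
    baseline = mean(history)
    spread = stdev(history)
    return abs(current - baseline) > k * spread


if __name__ == "__main__":
    # Made-up link utilization samples (percent) taken at the same hour each day.
    history = [40, 42, 38, 41, 39, 40, 43, 37]
    for reading in (45, 47):
        print(reading, "atypical" if is_atypical(history, reading) else "normal")
```

Real traffic has daily and weekly cycles, so a practical baseline would compare like with like, for example the same hour on the same weekday.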
A Network World article by Denise Dubie, Guide to Network Management and Monitoring, includes a well-written set of considerations for buying into a network management platform. She also mentions that network management best practices require planning, planning, and more planning. (Got that?) I think that lack of good planning is the death of most NMS implementations.
Unified communications and collaboration require some additional management functions beyond those mentioned in the articles above. First, QoS is typically needed to prioritize UC&C traffic over bulk data traffic. The network management system should monitor QoS queues in order to detect excess traffic in the high priority queues, particularly in networks where multiple sources and types of high priority traffic exist.
Second, I like to configure the NMS to generate a Top-N report on interface discards, or drops. A drop occurs when a network link is too congested to handle an outgoing packet: all the interface buffers are full, so the packet is discarded. Drops are therefore an indication of network congestion. Small numbers of drops are not a big concern, because TCP ramps up its sending rate during bulk data transfers until congestion causes a drop, and then it backs off. However, large numbers of drops are an indicator of something amiss. Wireless interfaces, with their lower speeds and greater potential for congestion, are critical pieces of infrastructure to monitor.
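A Top-N discard report boils down to taking two snapshots of each interface's discard counter, computing the difference, and sorting. The Python sketch below assumes 32-bit counters such as ifOutDiscards from the IF-MIB and made-up interface names; how the snapshots are collected (SNMP polling, streaming telemetry, or an NMS export) is left out.

```python
import heapq

COUNTER32 = 2**32  # 32-bit counters wrap back to zero at this value


def top_n_discards(previous, current, n=5):
    """Return the n interfaces with the most discards between two snapshots."""
    deltas = {}
    for ifname, new in current.items():
        old = previous.get(ifname)
        if old is None:
            continue  # interface not in the earlier snapshot, so no delta yet
        # Modular subtraction handles a single counter wrap. A counter reset
        # (for example, after a reboot) would show up as a very large delta.
        deltas[ifname] = (new - old) % COUNTER32
    return heapq.nlargest(n, deltas.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up counter snapshots from two polling cycles.
    previous = {"Gi0/1": 100, "Gi0/2": 4294967290, "Gi0/3": 50}
    current = {"Gi0/1": 1600, "Gi0/2": 10, "Gi0/3": 50}
    for ifname, drops in top_n_discards(previous, current, n=2):
        print(f"{ifname}: {drops} discards")
```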
Third (and certainly not the last), the network may not be the best source of information about how UC&C systems are functioning. I always encourage customers to monitor the UC&C controllers and endpoints to collect information about performance, problems, and configurations. A challenge is that these systems often have their own data collection and reporting systems, which makes interfacing with an NMS difficult. Even if the UC&C system generates separate reports, it is useful to identify one or two reports that can be created on a daily or weekly basis to help identify when the system is experiencing problems that may go undetected by looking only at the network.
It is important to apply network management best practices and principles to prevent problems. These same systems can make it easier to perform troubleshooting when problems arise.