Requirements for Network Management Best Practices
The elements you need for network management, and why it matters.
Maybe you're wondering why we need network management best practices. To start, any complex system can be configured in many different ways, each with nearly the same functional characteristics. However, some of those configurations are easier to create, and others are easier to operate. Finding the optimum trade-off between creation complexity and operational complexity is often challenging, and system experts use their experience to make the right decisions in those tradeoffs. It should be obvious that network management is one of those complex systems.
Network management system (NMS) best practices give us guidelines, but they also help us know when we have finished the system configuration. Without guidelines, we could spend a lot of effort on continual optimization and improvement with insignificant results. I've seen NMS implementations in which the management team is in continuous "tweaking" mode. In some cases, the NMS doesn't provide the right type of automation to implement functionality across the entire network, so the GUI must be used for every change. This can be very tiring, even on a small network of 30 to 50 devices. The right automation makes the system a pleasure to use because changes to the NMS and changes to the network devices are easily implemented.
With a set of best practices, we have guidelines that drive what we implement and how we use it. Specifically, they tell us what data to collect, how the data is used, and the set of actions that should result from using the data. We can then configure the network management system to collect the data and perform the analysis that turns raw data into actionable information. Our network's operations policies then tell us what to do when actionable information is presented to us.

NMS Policies
An organization should develop network management policies that dictate what data is collected and how it is used. We start with policies as a top-down process. By defining what we want to happen, we can then identify the data and analysis that is required to implement the policy.
For example, we want to monitor interface utilization so that we can predict when traffic levels indicate that we should plan to upgrade the speed of a link. The NMS records interface performance data, and trend analysis can tell us that we will be approaching saturation on a link in two to three months. The operational policy can then dictate that we investigate the link usage and begin planning an upgrade of the link to handle the additional demand.
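As a sketch of how that trend analysis might work, the following fits a least-squares line to monthly utilization samples and extrapolates to a saturation threshold. The function name, the 80% threshold, and the monthly sampling interval are illustrative assumptions, not features of any particular NMS.

```python
# Hypothetical sketch: estimate months until a link saturates, given monthly
# average-utilization samples (percent of link capacity), oldest first.

def months_to_saturation(samples, threshold=80.0):
    """Return months from now until the fitted trend crosses `threshold`,
    or None if utilization is flat or declining."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    # Ordinary least-squares slope and intercept.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    if den == 0:
        return None  # a single sample has no trend
    slope = num / den
    if slope <= 0:
        return None  # traffic is not growing
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope  # sample index where line hits threshold
    return max(crossing - (n - 1), 0.0)        # months after the latest sample
```

For example, samples of 50, 55, 60, 65, and 70 percent grow five points per month, so the fitted trend reaches 80% two months after the last sample.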
I like to collect and monitor interface error and discard statistics because it focuses on what isn't working correctly. These statistics allow me to identify interfaces that are experiencing problems or that are becoming congested. The time of day of the congestion and the extent of the congestion are important for determining if further investigation is warranted. This is a case where the operational policy may simply be "investigate further."
The next question about interface performance monitoring is whether network performance data should be collected on all edge interfaces. Or should performance only be monitored on router and switch interconnections and server interfaces? The organizational policy may limit the budget for NMS tools in a way that prevents the installation of a system large enough to monitor all edge interfaces. In this case, it may be necessary to limit performance data collection to infrastructure interfaces. Then decide if server interfaces should also be monitored, or perhaps only interfaces to specific business-critical servers (UC servers, Media Control Units, business application servers, etc).
Another policy example is collecting and archiving device configurations. Operational policies can require that we identify and validate configuration changes each day. In practice, we can then identify configuration changes that happened right before a network failure. We can also use the saved configurations to restore a failed device's configuration after the replacement device is installed.
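A minimal sketch of the daily change-identification step could diff two archived snapshots using Python's standard difflib module; the function name and the sample configuration lines below are hypothetical.

```python
# Hypothetical sketch: compare today's archived device configuration against
# yesterday's snapshot and report the changed lines for daily validation.
import difflib

def config_changes(old_config, new_config):
    """Return (added, removed) line lists between two config snapshots."""
    diff = list(difflib.ndiff(old_config.splitlines(),
                              new_config.splitlines()))
    added = [line[2:] for line in diff if line.startswith('+ ')]
    removed = [line[2:] for line in diff if line.startswith('- ')]
    return added, removed
```

Run against the archive each day, the report gives operators a short list of lines to validate, and the archived snapshots remain available for restoring a replacement device.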
The list of what to collect, the analysis to be done, and the actions to take can be quite extensive. Start with a short list, and add to it over time. Include data and factors that affect the operation of the business. Don't collect data without a defined policy on how it will be used. This was very clear in Cisco's Performance Management: Best Practices White Paper.

What Do We Need to Monitor and Manage?
Where do we start? It is best to start with simple policies and work up to increasingly complex policies. This way we use what we learn in the initial steps to our advantage in the more complex policies.
Since we can't monitor what we don't know exists, we must start with network discovery. The preferred policy is to use regular network discovery scans to identify all devices on the network. (An alternative policy is to monitor only devices that the networking team tells us about. But this creates an opportunity for a device to be added and not monitored, giving rise to the potential for a failure to affect the organization.) If edge devices are included in the scans, the resulting inventory can also be used to locate specific devices. A recent customer found that several switch interfaces that connected to servers had high utilization. We were able to use the network inventory information to identify the servers and begin planning to increase the interface speeds to those servers.
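One way to enforce the preferred policy is to reconcile each discovery scan against the NMS inventory and flag anything that was added to the network but never enrolled in monitoring. This sketch assumes the scan and the inventory are available as lists of addresses; the names are illustrative.

```python
# Hypothetical sketch: compare a discovery scan against the list of devices
# the NMS is already monitoring, to catch unmonitored additions.

def unmonitored_devices(discovered, monitored):
    """Return discovered addresses that are missing from the monitored set."""
    return sorted(set(discovered) - set(monitored))
```

Any address the function returns represents a gap in monitoring coverage that the policy says must be closed.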
The next policy is to detect End-of-Life (EoL) equipment and track maintenance agreements. This policy can be a money saver because it reports what is actually installed in the network while most vendors track what was sold to the organization. It also allows the organization to make sure that only supported equipment is used in the critical network infrastructure. The data that must be collected to implement this policy includes installed chassis, network cards, and software/firmware versions. Vendors occasionally have to recall specific network cards, and this policy allows an organization to easily identify them. With the EoL information, planning can take place to upgrade hardware and software before risking the organization's operation on old systems.
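The EoL policy boils down to joining the collected inventory against a table of vendor EoL dates. The sketch below assumes that table has already been built from vendor notices; the model names and dates are invented for illustration.

```python
# Hypothetical sketch: flag installed hardware whose model has reached its
# vendor End-of-Life date. The EoL table is illustrative, not real data.
from datetime import date

def eol_report(inventory, eol_dates, today):
    """inventory: list of (device, model) pairs from the collected data.
    eol_dates: model -> date the vendor declared End-of-Life.
    Return the entries whose model is at or past EoL as of `today`."""
    return [(device, model) for device, model in inventory
            if model in eol_dates and eol_dates[model] <= today]
```

The same join works for recalled card part numbers: substitute the recall list for the EoL table and the report identifies every affected chassis.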
Once we know what is on the network, we can implement policies that detect basic hardware failures. Fans and power supplies tend to fail more frequently than other components, so they are an easy choice. CPU, memory, and temperature should be included, with a policy to examine and react to any exceptions over the selected thresholds.
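A hardware-health policy of this kind reduces to checking polled metrics against thresholds and reporting exceptions. The metric names and limits below are assumptions chosen for illustration, not vendor recommendations.

```python
# Hypothetical sketch: evaluate polled device health metrics against policy
# thresholds and return the exceptions that warrant a reaction.

THRESHOLDS = {"cpu_pct": 90, "memory_pct": 85, "temperature_c": 55}

def health_exceptions(device, metrics, thresholds=THRESHOLDS):
    """Return (device, metric, value, limit) for each metric over its limit."""
    return [(device, metric, value, thresholds[metric])
            for metric, value in metrics.items()
            if metric in thresholds and value > thresholds[metric]]
```

Fan and power-supply state is usually reported as discrete up/failed values rather than numbers, so those checks are simple equality tests layered on the same exception report.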
The policy with the most benefit is monitoring configuration changes, also known as Network Change and Configuration Management (NCCM). Configuration changes are the greatest source of network failures (40 to 80%, by most reports). An operational policy that tracks configuration changes and archives all updated configurations can provide a basis for reducing those failure figures. If a network outage occurs, what was the most recent set of configuration changes? Look for changes that are close by in terms of time and in terms of topology to reduce the time to repair.
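The "close in time and topology" triage step can be sketched as a sort over the change archive. This assumes the NCCM archive records when each device changed and that a hop count from the failed device is available; both field layouts are hypothetical.

```python
# Hypothetical sketch: after an outage, rank recent configuration changes by
# topology distance from the failed device, then by recency.

def suspect_changes(changes, hops, window_hours=24):
    """changes: list of (device, hours_before_outage) from the change archive.
    hops: device -> hop count from the failed device (unknown devices rank last).
    Return changes inside the window, most suspicious first."""
    recent = [(dev, age) for dev, age in changes if 0 <= age <= window_hours]
    return sorted(recent, key=lambda c: (hops.get(c[0], 99), c[1]))
```

The output is a short, ordered list of changes to review first, which is what actually reduces the time to repair.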
Creating a policy for interface statistics is pretty straightforward, but I've found that most organizations skip the policy definition phase and focus on interface performance. Performance is easy to understand, but it is more difficult to create a policy around. Instead, it is better to start with basics like up/down status. An up/down policy should recommend that any router interface or switch trunking interface that is configured in the admin-up state should also be operationally up (i.e., up/up). It is then easy to implement a status check and report on any interfaces in the up/down state. An enhancement to this policy is to tag important interfaces (see Device and Interface Tagging) and alert on any important interfaces that are down. This has the advantage of identifying important access interfaces.
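The up/down check is a simple filter over polled interface state, shown below as a sketch. The record shape (admin/oper fields and an "important" tag) is an assumption standing in for whatever the NMS exposes, such as the IF-MIB ifAdminStatus and ifOperStatus values.

```python
# Hypothetical sketch: report interfaces that violate the up/down policy --
# configured admin-up but operationally down -- optionally restricted to
# interfaces tagged as important.

def updown_violations(interfaces, important_only=False):
    """interfaces: list of dicts with name, admin, oper, and tags fields."""
    return [i["name"] for i in interfaces
            if i["admin"] == "up" and i["oper"] == "down"
            and (not important_only or "important" in i.get("tags", ()))]
```

The full report drives the basic status check; the tagged variant drives the alerting enhancement.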
The next interface statistics policy would be to determine how to handle interface errors and discards. This policy can state that any errors should be identified and fixed (i.e. the network should run with zero errors). An exception might be half-duplex interfaces where a collision is counted as an error. Interface discards happen naturally at a low level, so set some thresholds for them, perhaps alerting on more than 500 per hour. The policy would require investigating any exceptions to the defined thresholds. A Top-N report helps to sort the list of interfaces to be examined so that the worst offenders are examined and corrected first. With these reports, an interface utilization policy becomes more of a planning tool instead of a network problem-reporting tool. A utilization policy could be defined to examine a Top-N utilization report on a weekly or monthly basis, with the intent to identify interfaces that should be upgraded in the coming months.
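The threshold-plus-Top-N combination described above might be sketched as follows. The zero-error and 500-discards-per-hour limits come from the policy example in the text; the record layout and ranking rule are assumptions.

```python
# Hypothetical sketch: apply the error/discard thresholds and produce a Top-N
# report so the worst offenders are investigated and corrected first.

def top_offenders(stats, error_limit=0, discard_limit=500, n=10):
    """stats: list of (interface, errors_per_hour, discards_per_hour).
    Return up to n offending interfaces, worst first by total exceptions."""
    offenders = [(name, errs, discs) for name, errs, discs in stats
                 if errs > error_limit or discs > discard_limit]
    return sorted(offenders, key=lambda s: s[1] + s[2], reverse=True)[:n]
```

A weekly or monthly Top-N run over utilization instead of errors gives the planning-oriented report the policy calls for.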
Once the above policies are defined, we can begin to focus on more complex network analysis. These policies focus on subsystem operation, looking at things like router redundancy protocols (HSRP/VRRP), Spanning Tree Protocol changes, and broadcast storm detection. Policies around QoS functionality -- making sure that it is applied consistently across the organization and is not dropping packets -- become important.
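As one example of a subsystem check, a QoS consistency policy can be expressed as two tests per interface: the expected policy is applied, and no queue is dropping packets. The record fields and policy name below are invented for illustration.

```python
# Hypothetical sketch: verify that the same QoS policy is applied on every
# monitored interface and that no QoS queue is dropping packets.

def qos_exceptions(interfaces, expected_policy):
    """interfaces: list of dicts with name, policy, and queue_drops fields.
    Return (name, problem) pairs for each policy or drop violation."""
    issues = []
    for i in interfaces:
        if i.get("policy") != expected_policy:
            issues.append((i["name"], "policy-mismatch"))
        if i.get("queue_drops", 0) > 0:
            issues.append((i["name"], "queue-drops"))
    return issues
```

Each exception maps directly back to the policy's required action: investigate the mismatch or the drops.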
Finally, we begin to look at network virtualization policies. These policies make sure that the virtual topologies are configured and operating correctly. They will depend on what kind of virtual topology implementations are in use (MPLS, GRE tunnels, etc.). Make sure that each policy contains the actions that should be taken if and when the policy is violated.

What's Wrong With This Picture?
Using the above, we see that with a few tweaks, it should be possible for the NMS to automatically configure itself. The main factor is to make the NMS reflect the intent of the policies and to provide the information needed by someone who is implementing the policies. By automating the NMS setup, we can reduce the size of the gigantic NMS puzzle that exists today. But that's a topic for a future post.