The Quest for Network Visibility

Monitoring everything in a network is a big challenge, but I think it's a necessary undertaking.

Monitoring Everything
Unfortunately, many network management systems (NMS) aren't up to the challenge of polling every network interface, storing the data, and generating useful information out of the collected data. This is perhaps the biggest reason why network management is so difficult. Scaling up to handle a large network is challenging.

Why monitor everything? Can't we get by with a system that monitors only the important interfaces, as is often the case due to budget constraints? Licensing for most (all?) NMS products is by the number of monitored elements, so limiting that number can keep the NMS budget in check. But this is one case where what you don't know can, in fact, hurt you.

As we attempt to identify the "important" interfaces, we end up having to implement a process to validate whether each new interface is important enough to be monitored. We also need a process to remove old interfaces from the monitoring system as endpoints and network devices are removed from the network. These processes are likely to impede other network maintenance work, and will soon fall into disuse. We end up with either a cumbersome process or a failed one, and each carries its own costs.

So I like to monitor everything in the network. Network monitoring automation tools make it easy to identify interfaces and begin monitoring them. I let the NMS filter out the unimportant data. But this means that the NMS must handle all interfaces, which can be expensive.

It's Too Expensive
I was working with a customer a while back that had purchased an expensive NMS and was spending more money customizing it. I asked the NMS team to monitor all the interfaces in the data center, figuring that at least those interfaces were important. I wanted to be able to identify server interfaces that reported problems with the connected servers. I rely on late collisions and frame check sequence, or FCS, errors to indicate duplex mismatch problems. Counters for discards/drops and ingress overruns tell me about oversubscribed interfaces.
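The counter checks described above can be sketched in a few lines of Python. This is only an illustration: the `CounterDelta` type, the `diagnose` function, and the "any increase is suspicious" threshold are assumptions made for the example, not part of any particular NMS, and the inputs are assumed to be counter increases over one polling interval.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CounterDelta:
    """Counter increases for one interface over one polling interval (hypothetical type)."""
    late_collisions: int = 0
    fcs_errors: int = 0
    discards: int = 0
    ingress_overruns: int = 0


def diagnose(delta: CounterDelta) -> List[str]:
    """Map counter increases to the likely problems named in the text.

    Assumption: any non-zero increase is worth flagging; a real deployment
    would probably tune thresholds per interface type.
    """
    findings = []
    if delta.late_collisions > 0 or delta.fcs_errors > 0:
        findings.append("possible duplex mismatch")
    if delta.discards > 0 or delta.ingress_overruns > 0:
        findings.append("possible oversubscription")
    return findings
```

An interface whose late collision count rose during the interval would be flagged as a possible duplex mismatch, while one with rising ingress overruns would be flagged as possibly oversubscribed. An interface with no increases returns an empty list.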

In this same network, we found that several key servers connected to the same 48-port 6148 Ethernet blade in a Cisco Catalyst 6500 switch. In fact, three of the highest-volume servers connected to consecutive ports on one ASIC on this blade, as shown in the figure below. At the busy part of the day, these servers would send more traffic than the ASIC could handle, resulting in high counts of ingress overruns. Distributing the servers to other blades in the same switch solved the congestion problem. In addition, the analysis identified this switch as a single point of failure for most of the business functions, which were running on these three servers.
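A rough way to spot this kind of placement problem is to add up the traffic of ports that share an ASIC. The sketch below is hypothetical: the function name is invented, consecutive ports are assumed to map to the same ASIC, and the defaults of 8 ports and 1000 Mbps per ASIC are placeholders that should be checked against the line card's documentation.

```python
from collections import defaultdict


def oversubscribed_asics(port_rates_mbps, ports_per_asic=8, asic_capacity_mbps=1000):
    """Return {asic_index: total_mbps} for ASICs whose ports exceed the capacity.

    port_rates_mbps maps a 1-based port number to its peak traffic rate in Mbps.
    Assumption: ports 1..ports_per_asic share ASIC 0, the next group shares
    ASIC 1, and so on.
    """
    totals = defaultdict(float)
    for port, rate in port_rates_mbps.items():
        totals[(port - 1) // ports_per_asic] += rate
    return {asic: total for asic, total in totals.items() if total > asic_capacity_mbps}
```

With three busy servers on ports 1 to 3 of the same group, the function reports that group as over capacity, which points to the same remedy used here: spread the servers across other port groups or blades.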

When I asked the NMS team to monitor all server interfaces, I was told that it was too expensive to do. In addition, the system's default configuration did not monitor the SNMP objects needed to identify problems similar to those described above.

Is There a Solution?
My current favorite interface monitoring system is Statseeker, because it is affordable and fast. It needs only one server to monitor more than 500,000 interfaces at a fast polling rate. Any of the collected data is viewable within a few seconds, versus the many minutes required by some other systems I've used. Among the many NMS options available, this is the one that seems to offer good value for the money along with the capability of monitoring everything. An added benefit is that it is easy to set up and use. Many monitoring systems require a lot of fiddling and configuring, effectively making them expensive to install and use. I prefer something that works well out of the box.

What About Deployment?
I use tags to build a hierarchy of relative interface importance. Briefly, I use interface descriptions to add one or more tags that the NMS can use to classify and rank the importance of interfaces. I add device tags to the SNMP Location string, or some other device string variable that's accessible via SNMP. Any interface tagged with "Critical Server" would be grouped into the "Server" interface group. A problem on any interface in this group would generate a high-priority alert. In normal operation, no interface in this group would have a problem that hasn't been diagnosed and corrected. (For more details, see my NetCraftsmen blog, Device and Interface Tagging.)

Similarly, infrastructure interfaces would have tags like "Core-Core," "Core-Dist" or "Dist-Edge," allowing for easy grouping for alerting and reporting purposes. In order for this mechanism to work, the NMS must be able to create device and interface groups automatically based on the tags.
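To make the tagging idea concrete, here is a minimal Python sketch of tag-based grouping. The square-bracket tag syntax, the `group_interfaces` function, and the "Untagged Edge" group name are assumptions made for illustration. Only the "Critical Server" to "Server" mapping and the infrastructure tag names come from the text above; naming each infrastructure group after its tag is also an assumption.

```python
import re
from collections import defaultdict

# Hypothetical convention: tags appear in square brackets in the description.
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")

TAG_TO_GROUP = {
    "Critical Server": "Server",  # mapping described in the article
    "Core-Core": "Core-Core",     # assumed: group named after the tag
    "Core-Dist": "Core-Dist",
    "Dist-Edge": "Dist-Edge",
}
DEFAULT_GROUP = "Untagged Edge"   # assumed name for interfaces without a known tag


def group_interfaces(descriptions):
    """Build {group: [interface, ...]} from {interface: description}."""
    groups = defaultdict(list)
    for name, descr in descriptions.items():
        tags = [t.strip() for t in TAG_PATTERN.findall(descr)]
        matched = [TAG_TO_GROUP[t] for t in tags if t in TAG_TO_GROUP]
        # Interfaces with no recognized tag fall into the default group.
        for group in matched or [DEFAULT_GROUP]:
            groups[group].append(name)
    return dict(groups)
```

Given a description such as "[Critical Server] db01", the interface lands in the "Server" group, and a description with no tag lands in the default group. An NMS that supports automatic grouping would do the equivalent internally, so the operator only has to maintain the descriptions.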

I handle relatively unimportant edge interfaces differently. (All active interfaces are important to some degree, or nothing would be connected to them.) The default is not to tag an edge interface, which results in a very large group of relatively unimportant interfaces. The NMS is set up to produce a top-down sort of interfaces with errors, so the interfaces with the highest volume of errors appear first in the list. The network operations team then uses this report to identify and correct problems that have a major impact on an edge device.
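The top-down error report amounts to a filter and a descending sort. A minimal sketch, assuming the error totals for the reporting period have already been collected into a dictionary (the function name and the `limit` parameter are hypothetical):

```python
def top_error_interfaces(error_counts, limit=10):
    """Return up to `limit` (interface, errors) pairs, worst first.

    error_counts maps an interface name to its total error count for the
    reporting period. Interfaces with no errors are left out of the report.
    """
    with_errors = [(name, count) for name, count in error_counts.items() if count > 0]
    with_errors.sort(key=lambda item: item[1], reverse=True)
    return with_errors[:limit]
```

Keeping the report short is deliberate: the operations team works from the top of the list, and the interfaces further down surface in later reports once the worst offenders have been fixed.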

Applying Network Visibility to UC
Gaining visibility into UC system connections to the network is easy with the above tips. Use an NMS that allows monitoring of all interfaces. Tag key interfaces to UC infrastructure such as session border controllers, UC managers, multipoint control units, and important teleconferencing systems. Using the tags, these interfaces are then automatically grouped together for monitoring and reporting purposes, as described above.

When a trouble ticket gets opened for an edge interface, the operations team should first check the NMS reports on the affected edge interface as well as the server interface or interfaces to make sure the culprit isn't something like a simple duplex mismatch.

It is important to use the periodic interface error reports to correct simple network problems. I've seen numerous examples where the network staff refuses to correct problems because an end user didn't report a problem. That's not being proactive, and does not lead to good network operational practices. It typically takes some involvement by the IT management staff to encourage tracking network problems and correcting them.

Summary
It isn't unreasonable to monitor all network interfaces. The tools exist to do it without breaking the budget. Adding a few operational procedures like tagging makes the tools much more useful. Finally, create processes and procedures to follow for handling the problems that the NMS reports.
