No Jitter is part of the Informa Tech Division of Informa PLC

4 Ways to Improve Performance Monitoring

Legacy monitoring tools can no longer keep pace with the size and complexity of 21st-century IT infrastructures. Tool vendors -- both legacy players and startups -- are adapting in several ways: new processing architectures, support for more protocols and data formats, faster analytics, sophisticated baselining, and dashboard-style user interfaces, among other things.

As operations teams exploit these new capabilities, two things often happen. First, they find they can apply the tools in ways they had not anticipated. Second, they realize they want to do more, and demand the means to do so.

Here are four ways to improve performance monitoring for large, complex networks:

1. Track the impact of changes in near real time

Today's monitoring tools can track, in near real time, the impact of changes such as network upgrades, hardware swap-outs, firewall changes, code updates, and new application versions. For carriers and mobile operators especially, it's vital to know whether a service has improved, worsened, or stayed the same following a change.

Using a range of standard and proprietary interfaces, protocols and adapters, these monitoring tools can collect and analyze much more than SNMP data. The result is a bigger, clearer picture of what's actually happening. If solutions are designed for fast polling, processing, and reporting, then network operations teams can watch the impact as scheduled changes happen.

Some providers now routinely set up war rooms to monitor the introduction of such changes and gauge their impact. For example, they can now validate a vendor's promise about the results of a software update by comparing the promised results with actual performance. Problems can be seen quickly via dashboard displays and custom reports, and the change can be rolled back if the results aren't acceptable.
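
To make the before-and-after comparison concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular monitoring product works: the function name `change_impact`, the 5% noise tolerance, and the latency samples are all hypothetical, and the baseline mean is assumed to be nonzero.

```python
from statistics import mean

def change_impact(before, after, tolerance=0.05, lower_is_better=True):
    """Classify a change as 'improved', 'worsened' or 'unchanged'.

    before/after are samples of one metric (e.g. latency in ms) taken
    before and after the change. tolerance is the relative shift treated
    as noise; 5% is an arbitrary, illustrative default.
    """
    base = mean(before)  # assumed nonzero
    shift = (mean(after) - base) / base
    if abs(shift) <= tolerance:
        return "unchanged"
    got_lower = shift < 0
    return "improved" if got_lower == lower_is_better else "worsened"

# Hypothetical latency samples (ms) around a software update.
before = [42.0, 40.5, 41.2, 43.1]
after = [35.0, 34.2, 36.1, 35.5]
print(change_impact(before, after))
```

A real deployment would likely compare against a baseline for the same time of day rather than a handful of raw samples, but the three-way outcome (improved, worsened, unchanged) is the same question the war room is trying to answer.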

2. Poll more frequently

Polling SNMP devices every five minutes is no longer viable. During those five minutes, short-lived events and anomalies, such as sudden, big spikes in traffic, can pass undetected. These may be precursors of emerging problems that could overwhelm network capacity.
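
A small Python sketch shows why a short spike can vanish at a five-minute interval. A counter polled every five minutes yields the average rate over that interval, so the sketch approximates the five-minute view by averaging one-minute samples. The utilization figures and the 80% threshold are invented for illustration.

```python
from statistics import mean

# Hypothetical link utilization (%), one sample per minute for 10 minutes.
per_minute = [30, 32, 31, 95, 33, 30, 31, 29, 32, 30]
THRESHOLD = 80  # assumed alert threshold, in percent

def breaches(samples, threshold):
    """Return the indexes of samples that exceed the threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]

# Approximate a five-minute view by averaging each block of five samples.
five_minute = [mean(per_minute[i:i + 5]) for i in range(0, len(per_minute), 5)]

print(breaches(per_minute, THRESHOLD))   # one-minute view: spike at index 3
print(breaches(five_minute, THRESHOLD))  # five-minute view: no breach at all
```

The 95% reading is diluted to an average of about 44% in the five-minute view, well under the threshold.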

Over and over, we hear that if such events are detected sooner, the problems can be sidestepped or moderated. In fact, some service providers are shifting to one-minute intervals on almost every network element they can reach. Others are using even shorter collection intervals on selected components or those comprising a specific, critical service.

But it's not enough to just poll more frequently. If you go from a five-minute schedule to one minute, you're collecting five times as much data. Your monitoring infrastructure must be engineered to collect, store, process, analyze, and report on that data surge. If it can't, you won't get the visibility into the infrastructure you were hoping to achieve.
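
The arithmetic is simple but worth running for your own environment. The Python sketch below uses a hypothetical estate of 10,000 network elements with 50 metrics each; both numbers are assumptions chosen only to show the scale.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def samples_per_day(elements, metrics_per_element, interval_seconds):
    """Number of data points collected per day at a fixed polling interval."""
    polls_per_day = SECONDS_PER_DAY // interval_seconds
    return elements * metrics_per_element * polls_per_day

# Hypothetical estate: 10,000 network elements, 50 metrics each.
five_min = samples_per_day(10_000, 50, 300)
one_min = samples_per_day(10_000, 50, 60)

print(f"5-minute polling: {five_min:,} samples/day")
print(f"1-minute polling: {one_min:,} samples/day")
print(f"growth factor: {one_min / five_min:.0f}x")
```

Under these assumptions the daily volume grows from 144 million to 720 million samples, and every stage from collection to reporting has to absorb that increase.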

3. Create aggregate and synthetic indicators

Traditionally, SNMP polling collects device statistics (CPU, memory, power, temperature, etc.) or interface statistics, each focused on an individual device. By itself, this gives a narrow view of what's happening.

Operations groups now realize that seeing aggregate numbers -- such as total bandwidth across multiple links or out to their Internet service providers (ISPs) -- and comparing them to normal usage yields a clearer view of overall performance. Some solutions even let you create key performance indicators that don't exist on network devices.
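
As a rough Python illustration, an aggregate can be as simple as a sum over links, compared against what is normal for that hour. The link names, the rates, and the "mean plus or minus three standard deviations" test are all assumptions made for the sketch; commercial baselining is usually more sophisticated.

```python
from statistics import mean, stdev

def aggregate_bps(link_rates):
    """Sum per-link rates (bits per second) into one aggregate value."""
    return sum(link_rates.values())

def is_abnormal(current, history, sigmas=3.0):
    """True when current is more than `sigmas` standard deviations from the historical mean."""
    return abs(current - mean(history)) > sigmas * stdev(history)

# Hypothetical ISP-facing links and their current rates in bits per second.
isp_links = {"isp-a": 4.1e9, "isp-b": 3.8e9, "isp-c": 4.4e9}
# Hypothetical aggregate readings for the same hour on previous days.
history = [8.9e9, 9.2e9, 9.0e9, 9.1e9, 8.8e9]

total = aggregate_bps(isp_links)
print(total, is_abnormal(total, history))
```

No single link here looks alarming on its own; it is the combined figure, set against normal usage, that stands out.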

For example, SNMP polling of a target device may yield bytes in and out of a given interface, but not total bytes. A synthetic indicator can sum these and other metrics to provide a real-time view of the total bytes consumed by a given link, even though that value doesn't exist in the target device's MIB. Ideally, these new indicators, like other metrics, can be baselined, providing a foundation for alerts.
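
Here is a minimal Python sketch of such a synthetic indicator. It assumes the collector already holds two successive readings of the standard IF-MIB counters `ifHCInOctets` and `ifHCOutOctets`; the dictionary layout, the function names, and the readings themselves are hypothetical.

```python
COUNTER64_MAX = 2**64

def counter_delta(previous, current, modulus=COUNTER64_MAX):
    """Difference between two counter readings, allowing for one wraparound."""
    return (current - previous) % modulus

def total_octets(prev_poll, curr_poll):
    """Synthetic indicator: octets in plus octets out between two polls."""
    return (counter_delta(prev_poll["in"], curr_poll["in"])
            + counter_delta(prev_poll["out"], curr_poll["out"]))

# Hypothetical readings of ifHCInOctets / ifHCOutOctets, 60 seconds apart.
prev_poll = {"in": 1_000_000, "out": 2_000_000}
curr_poll = {"in": 1_600_000, "out": 2_900_000}

total = total_octets(prev_poll, curr_poll)
print(total, "octets,", total * 8 / 60, "bits per second")
```

The result is computed entirely on the monitoring side, so it can be stored, baselined, and alerted on like any polled metric.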

These new aggregate indicators also enable you to calculate:

  • Ratios, such as packets sent vs. errors received
  • Composite end-to-end performance metrics when measuring multiple redundant paths through a network
  • Overall resource consumption over multiple devices, such as available voice trunk ports across multiple voice gateways
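
Two of the calculations above, the ratio and the pooled resource count, can be sketched in a few lines of Python. The gateway names, port counts, and packet figures are invented, and the function names are illustrative rather than part of any product.

```python
def error_ratio(packets_sent, errors_received):
    """Errors per packet sent; returns 0.0 when nothing was sent."""
    return errors_received / packets_sent if packets_sent else 0.0

def pooled_availability(gateways):
    """Free and total trunk ports summed over every gateway."""
    total = sum(g["total_ports"] for g in gateways.values())
    in_use = sum(g["ports_in_use"] for g in gateways.values())
    return total - in_use, total

# Hypothetical voice gateways and port counts.
gateways = {
    "gw-east": {"total_ports": 96, "ports_in_use": 90},
    "gw-west": {"total_ports": 96, "ports_in_use": 40},
    "gw-south": {"total_ports": 48, "ports_in_use": 12},
}

free, total = pooled_availability(gateways)
print(f"{free} of {total} trunk ports free")
print(f"error ratio: {error_ratio(2_000_000, 500):.4%}")
```

Viewed alone, gw-east looks nearly exhausted; the pooled figure shows how much headroom the service as a whole still has.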

4. Tie together flow traffic, SNMP-based usage data, and log analytics

A growing number of monitoring solutions are leveraging the wealth of protocols and interfaces in today's infrastructures. Bringing together three of them -- SNMP-based statistics, flow traffic information, and log data pushed from network devices like routers, switches, load balancers, firewalls, and Wi-Fi controllers -- provides dramatically better visibility.

With SNMP, you can see how much traffic moves over a link to the network, or a spike in "octets in" to a voice gateway. But SNMP can't tell you which applications are using the link, or the traffic source and destination. That becomes visible by looking at flow. For example, a Top Talkers report may identify the source IP addresses and their associated traffic volumes. The key is the automatic marriage of the SNMP interface statistics to the flow records, enabling faster and more precise troubleshooting.
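
The join between the two data sources can be sketched in Python as follows. The sketch assumes that SNMP polling has already flagged a busy interface and that flow records carry the matching interface index; the field names (`if_index`, `src_ip`, `bytes`) are illustrative and do not correspond to a specific flow export schema.

```python
from collections import Counter

def top_talkers(flows, if_index, n=3):
    """Largest traffic sources (by bytes) among flows seen on one interface."""
    volume = Counter()
    for flow in flows:
        if flow["if_index"] == if_index:
            volume[flow["src_ip"]] += flow["bytes"]
    return volume.most_common(n)

# Hypothetical SNMP-side observation: interface 7 shows a spike in octets in.
busy_if_index = 7

# Hypothetical flow records for the same time window.
flows = [
    {"if_index": 7, "src_ip": "10.0.0.5", "bytes": 700_000},
    {"if_index": 7, "src_ip": "10.0.0.9", "bytes": 150_000},
    {"if_index": 3, "src_ip": "10.0.0.5", "bytes": 900_000},
    {"if_index": 7, "src_ip": "10.0.0.5", "bytes": 300_000},
]

for src_ip, byte_count in top_talkers(flows, busy_if_index):
    print(src_ip, byte_count)
```

Keying both data sets on the same interface is what lets an operator go from "this link is busy" to "these sources are responsible" in one step.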

In one case, a wireless operator noticed that its peering points in one part of the country were much busier than those in other regions. Linking SNMP data and flow records showed that subscribers' Netflix and iTunes traffic was being shunted across the operator's backbone, and across the country, to the congested service provider link.

The operator tweaked routing policies, and this non-critical data traffic was shifted to Internet access closer to the subscriber's location. This also freed up backbone capacity for critical, time-sensitive traffic like voice calls.

Then there's the added benefit of being able to pivot from SNMP and flow to related log data. Because the performance monitoring platform understands the context of an event, it should be able to present log records that reveal possible configuration or policy changes that instigated the performance event -- without having to manually search log data.
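
One simple way to picture that pivot is a filter on device and time window, sketched below in Python. The 15-minute look-back window, the record layout, and the sample log entries are assumptions for illustration; a real platform would draw on richer context than device name and timestamp.

```python
from datetime import datetime, timedelta

def related_logs(logs, device, event_time, window=timedelta(minutes=15)):
    """Log records from one device in the window leading up to an event."""
    start = event_time - window
    return [rec for rec in logs
            if rec["device"] == device and start <= rec["time"] <= event_time]

# Hypothetical performance event and log records.
event_time = datetime(2024, 1, 1, 12, 0)
logs = [
    {"device": "fw-1", "time": datetime(2024, 1, 1, 11, 52), "msg": "policy updated"},
    {"device": "fw-1", "time": datetime(2024, 1, 1, 9, 30), "msg": "login ok"},
    {"device": "sw-2", "time": datetime(2024, 1, 1, 11, 55), "msg": "port flap"},
]

for rec in related_logs(logs, "fw-1", event_time):
    print(rec["time"], rec["msg"])
```

Only the policy update on fw-1 eight minutes before the event is returned, which is the kind of record an operator would want surfaced first.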

Matt Goldberg is VP of Global Strategic Solutions at SevOne.