How do you know that the network is having problems? Do you rely on your customers to report problems? There's a better way: have the network tell you when it detects problems.
This is a reactive approach to keeping an eye on the network; when you receive an alert, the network has already detected a problem. However, using alerts is a great way to get started with monitoring a network. It is relatively easy to implement and it can alert you to problems that may go unnoticed. In a converged network that is providing transport to UC application flows, knowing that the network is having a problem is a big win.
There are several ways in which the network can provide alerts. The ideal is that components within the network report problems and all you need to do is to look at what the network is saying and correct the problems. It sounds easy, and while it is straightforward, the sheer volume of alerts can make it a daunting task.
What are Network Alerts?
Networks can provide alerts in more than one way. I always like to start with basic syslog alerting.
Many devices, both network and non-network, can generate syslog messages when they detect a problem. The messages are plain text, so they can be read by a person or parsed by machine. Free-form text can be awkward to parse, but newer syslog servers take over much of that work. I have had success with the open source version of syslog-ng; a commercially supported version is also available for organizations that need one. It can process messages locally as well as forward them to other network management systems.
Cisco has made the machine parsing job easier by uniquely identifying each syslog message. Below are two sample syslog messages that demonstrate the formatting. The error identifiers are "%CDP-4-DUPLEX_MISMATCH:" and "%PM-SP-4-ERR_DISABLE: ", respectively. All identifiers start with '%' and end with ':', making them easy to read as well as easy to parse by machine.
Note that the first error is a duplex mismatch with a Cisco phone, which creates more packet loss as the data rate on the phone increases. A computer attached to the local data port on the phone can create enough traffic to cause voice calls to have problems, in addition to impacting application performance on the computer.
05-01-2013 00:00:05 Local7.Warning 10.1.1.124 1410780: May 1 00:00:04.384 edt: %CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered on GigabitEthernet3/2 (not half duplex), with SEP00070F67A24F Port 1 (half duplex).
05-03-2013 07:15:38 Local7.Warning 10.1.1.130 22119: May 3 07:15:36.805 edt: %PM-SP-4-ERR_DISABLE: diagnostics error detected on Gi3/39, putting Gi3/39 in err-disable state
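To illustrate how the '%' and ':' delimiters help machine parsing, here is a minimal Python sketch that pulls the identifier out of the two sample messages. The function name `parse_cisco_syslog` and the exact regular expression are my own assumptions for illustration, not part of any Cisco or syslog-ng tooling, and the pattern is only checked against the two samples shown here.

```python
import re

# Assumed pattern: '%', a facility (optionally with sub-facilities such as
# PM-SP), a single-digit severity from 0 to 7, a mnemonic, then ':'.
CISCO_ID = re.compile(
    r"%(?P<facility>[A-Z0-9_]+(?:-[A-Z0-9_]+)*?)"
    r"-(?P<severity>[0-7])"
    r"-(?P<mnemonic>[A-Z0-9_]+):"
)


def parse_cisco_syslog(line):
    """Return the identifier fields of a Cisco-style message, or None."""
    match = CISCO_ID.search(line)
    if match is None:
        return None
    facility = match.group("facility")
    severity = int(match.group("severity"))
    mnemonic = match.group("mnemonic")
    return {
        "identifier": f"{facility}-{severity}-{mnemonic}",
        "facility": facility,
        "severity": severity,
        "mnemonic": mnemonic,
        # Everything after the identifier is the free-form description.
        "text": line[match.end():].strip(),
    }


# The two sample messages from this article.
SAMPLES = [
    "05-01-2013 00:00:05 Local7.Warning 10.1.1.124 1410780: May 1 "
    "00:00:04.384 edt: %CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered on "
    "GigabitEthernet3/2 (not half duplex), with SEP00070F67A24F Port 1 "
    "(half duplex).",
    "05-03-2013 07:15:38 Local7.Warning 10.1.1.130 22119: May 3 "
    "07:15:36.805 edt: %PM-SP-4-ERR_DISABLE: diagnostics error detected on "
    "Gi3/39, putting Gi3/39 in err-disable state",
]

for sample in SAMPLES:
    parsed = parse_cisco_syslog(sample)
    print(parsed["identifier"], "severity", parsed["severity"])
```

Because the identifier is matched as a unit, the free-form description that follows it never has to be parsed to classify the message.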
SNMP Traps are another source of network alerts. These alerts are in binary format, encoded by SNMP, and require a network management application that is loaded with the corresponding MIBs (Management Information Bases) in order to decode the traps. The disadvantage is that because they are binary, they can't be examined without the help of the MIB and a network management system to decode them.
Check your network management system for SNMP Trap processing functionality. Hopefully it will have a way to summarize the events so that you are working with groups of problems instead of individual problems (see the section below about handling large event volume).
A third source of alerts is the network management system itself. It can generate alerts when interfaces are experiencing too many errors or when interface utilization exceeds some threshold. Most network management systems provide these types of alerts--everyone seems to want to start monitoring performance data. In practice, monitoring errors and drops is more beneficial than monitoring link utilization.
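As a rough illustration of threshold alerting, the Python sketch below compares two polls of an interface's counters and reports errors and drops ahead of utilization. Everything in it is hypothetical: the `Counters` fields, the `check_interface` function, and the default thresholds are assumptions chosen for the example, not values from any particular network management system.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """One poll of an interface's cumulative counters (hypothetical fields)."""
    in_errors: int
    out_discards: int
    in_octets: int


def check_interface(name, prev, curr, interval_s, speed_bps,
                    max_errors=0, max_discards=0, max_util_pct=90.0):
    """Compare two polls and return a list of alert strings.

    The default thresholds are assumptions for illustration only.
    Counter wrap and resets are ignored to keep the sketch short.
    """
    alerts = []
    errors = curr.in_errors - prev.in_errors
    discards = curr.out_discards - prev.out_discards
    bits = (curr.in_octets - prev.in_octets) * 8
    util_pct = bits / (interval_s * speed_bps) * 100

    # Errors and drops are checked first: they indicate a real problem
    # even on a lightly loaded link.
    if errors > max_errors:
        alerts.append(f"{name}: {errors} input errors in {interval_s}s")
    if discards > max_discards:
        alerts.append(f"{name}: {discards} output discards in {interval_s}s")
    if util_pct > max_util_pct:
        alerts.append(f"{name}: utilization {util_pct:.1f}%")
    return alerts


# A lightly used gigabit port that is nonetheless taking errors.
previous = Counters(in_errors=0, out_discards=0, in_octets=0)
current = Counters(in_errors=12, out_discards=0, in_octets=37_500_000)
for alert in check_interface("Gi3/2", previous, current,
                             interval_s=300, speed_bps=1_000_000_000):
    print(alert)
```

In this example the port would never trip a utilization threshold, yet it still raises an alert because of the input errors, which is the point of watching errors and drops first.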
Finally, the UC system servers are a good source of alerts. Many of them can generate syslog and/or SNMP Traps. There's nothing better than the UC system to tell you that a particular end station is having problems.
Because these alerts typically arrive via syslog or SNMP Trap, they can be handled with the same processing methods used for those mechanisms. Some UC systems may require that you log in and examine an error report page. This makes the alert less timely, because you learn of a problem only when you log in to the UC system to check.
Handling Large Event Volume
A large network, or a network that needs cleanup, can generate a large number of events per day. If you look at the total number of events, you can easily get discouraged. How can you fix one problem at a time to reduce a list of 100,000 or more events?
Fortunately, some events are easily classified into groups, which greatly simplifies their handling. For example, the CDP duplex mismatch shown in the example above may be generated every minute for a single port, creating 1,440 syslog entries per day. If multiple ports on a switch have this problem, the syslog messages from that one switch may number in the tens or hundreds of thousands.
I prefer to use a syslog summary script to summarize hundreds of thousands of syslog messages into a few pages, sorted by frequency, so I know what to tackle first. It is easy to look through the list of message types to find those that are the most important, then identify the device that is experiencing the problem.
The CDP Duplex Mismatch problem identified above was due to several mis-configured ports on one switch. The summary showed that it reported a total of 129,130 Duplex Mismatch events in a single day (see below). Correcting the configuration across all ports on this switch corrected quite a number of problems that were impacting end station connections into the network.
129130 10.1.1.124 10.1.1.124 CDP-4-DUPLEX_MISMATCH
The summary scripts are available for free at:
http://www.netcraftsmen.net/resources/technical-articles/712-syslog-summary-scripts.html
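To show the idea behind such a summary, here is a minimal Python sketch that counts messages per device and message type and sorts them by frequency. It is not the script referenced above. The `summarize` function is hypothetical, and it assumes the collector writes the sender's address as the fourth whitespace-separated field, as in the sample messages earlier in this article.

```python
import re
from collections import Counter

# Cisco-style identifier: %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC:
ID_PATTERN = re.compile(r"%([A-Z0-9_]+(?:-[A-Z0-9_]+)*-[0-7]-[A-Z0-9_]+):")


def summarize(lines):
    """Count messages per (host, identifier), most frequent first."""
    counts = Counter()
    for line in lines:
        match = ID_PATTERN.search(line)
        if match is None:
            continue  # not a Cisco-style message; skipped in this sketch
        fields = line.split()
        # Assumption: date, time, priority, then the sender's address.
        host = fields[3] if len(fields) > 3 else "unknown"
        counts[(host, match.group(1))] += 1
    return counts.most_common()


# Tiny stand-in for a day's log: three mismatch reports and one err-disable.
LOG = [
    "05-01-2013 00:00:05 Local7.Warning 10.1.1.124 1410780: May 1 "
    "00:00:04.384 edt: %CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered",
    "05-01-2013 00:01:05 Local7.Warning 10.1.1.124 1410781: May 1 "
    "00:01:04.384 edt: %CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered",
    "05-03-2013 07:15:38 Local7.Warning 10.1.1.130 22119: May 3 "
    "07:15:36.805 edt: %PM-SP-4-ERR_DISABLE: diagnostics error detected",
    "05-01-2013 00:02:05 Local7.Warning 10.1.1.124 1410782: May 1 "
    "00:02:04.384 edt: %CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered",
]

for (host, identifier), count in summarize(LOG):
    print(f"{count:>8} {host} {identifier}")
```

Grouping on the identifier rather than the full message text is what collapses thousands of near-identical lines into one row per device and problem type.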
Since the summary output is only a few pages, even on a big, busy network, it is easily emailed to the network team or posted on a web page. The operations staff can then look for critical errors and for high volume errors. Over a few weeks, it is possible to make a significant improvement in the network.
Network Events and SDN
Since I've been writing about SDN recently, I thought it might be worthwhile to discuss how events might be used in an SDN environment. There is basically no change. The events will typically come from the underlay network--the physical infrastructure. There will also be a set of new alerts that are specific to SDN, but those are likely to be a small fraction of the volume generated by the physical network infrastructure.
Summary
The network can tell you when it is having problems. All that's required is to collect what it has to say and summarize it so that the volume does not overwhelm you. Finding and fixing the problems that it reports will result in a much more smoothly operating network.