Drinking from the Fire Hose: Handling Network Event Volume
Events are the network’s way of telling you that it has a potential problem, and IT and networking professionals should pay close attention to them. The sheer volume of potential network events is enough to keep any seasoned professional on their toes.
Below, I examine several ways to reduce the volume to a manageable level.
Network and IT system logs are of immense value for tracking how well the overall IT systems are performing from day to day. Event logs can come from syslog, server logs, SNMP traps, security systems (intrusion prevention/detection), and application logs, just to name a few. The problem is that large networks can produce vast quantities of events each day: megabytes to gigabytes of text, much of it repetitive log messages with minor differences. Those differences stem from many factors, including the source device ID, the message format, and the timestamp.
Log message formats vary according to the source and from vendor to vendor. Initial processing of events can be used to normalize the data into common categories, such as these examples:
- Device reboot
- Neighbor device unreachable or reachable
- Link down or link up
- High interface errors over some time period
- Routing neighbor protocol error
- Application restart
- Application response time above alerting threshold for 30 minutes
- Server CPU or memory utilization high
- Client authentication failed
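To make the normalization step concrete, here is a minimal sketch in Python. The category names and regular expressions are my own illustrative assumptions, not any vendor's actual message formats; real syslog and trap text varies by vendor and software version, so a production rule set would be much larger.

```python
import re

# Hypothetical patterns for a handful of the categories listed above.
# Order matters: the first pattern that matches wins.
CATEGORY_PATTERNS = [
    ("device_reboot", re.compile(r"cold ?start|system (re)?boot|restarted", re.I)),
    ("link_down", re.compile(r"\b(link|interface|line protocol)\b.*\bdown\b", re.I)),
    ("link_up", re.compile(r"\b(link|interface|line protocol)\b.*\bup\b", re.I)),
    ("routing_neighbor_error", re.compile(r"\b(bgp|ospf)\b.*\b(neighbor|peer)\b", re.I)),
    ("client_auth_failed", re.compile(r"authentication fail(ed|ure)|login failed", re.I)),
]


def normalize(message: str) -> str:
    """Map a raw log message to a common category name."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(message):
            return category
    # Anything unrecognized is kept, not dropped, so new message types stay visible.
    return "uncategorized"
```

Keeping an explicit "uncategorized" bucket is a deliberate choice in this sketch: it shows you which message formats the rules don't yet cover.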
Another source of events is the network and IT systems management platforms, which can be configured to send a copy of management alerts to the event management system and to the trouble ticket system. Adding these alerts to the event stream improves the contextual information needed to resolve the problem.
Automated processing is required because the volume of events is simply too large to handle manually. To help make sense of the various events, you should consolidate them into major categories. Additionally, each event should be given a severity level (critical, major, minor, or routine). Only a few severity levels are needed because the top one or two will consume all your time.
Once you have the events classified and assigned a severity, you can define parameters to identify events that need attention. These parameters should consider both severity and volume: a single message that reports the failure of a critical device or link may be just as important as a minor-severity message that reports a significant volume of errors.
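As a rough sketch of what such parameters might look like, the Python below flags an event either because its severity is high or because a minor event repeats often enough. The severity mapping, the category names, and the threshold of 100 are all assumptions chosen for illustration; you would tune them to your own network.

```python
from collections import Counter

# Assumed mapping from normalized category to severity; unlisted
# categories are treated as "routine".
SEVERITY = {
    "device_reboot": "critical",
    "link_down": "major",
    "interface_errors": "minor",
    "client_auth_failed": "minor",
}
ALWAYS_ALERT = {"critical", "major"}
MINOR_VOLUME_THRESHOLD = 100  # assumed count per reporting period


def needs_attention(events):
    """Return the set of (device, category) pairs that deserve a look.

    events: iterable of (device, category) tuples for one reporting period.
    """
    counts = Counter(events)
    flagged = set()
    for (device, category), count in counts.items():
        severity = SEVERITY.get(category, "routine")
        high_severity = severity in ALWAYS_ALERT
        noisy_minor = severity == "minor" and count >= MINOR_VOLUME_THRESHOLD
        if high_severity or noisy_minor:
            flagged.add((device, category))
    return flagged
```

With these assumed settings, one reboot of a core device is flagged immediately, while interface errors on an edge device are flagged only once they pile up.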
Event Summary Report
A low-tech and very effective way to reduce the volume is a summary report. We’ve found that two categories of summary information are sufficient:
- Count the number of events of each type to facilitate a review by volume. High-volume events indicate a systemic problem, perhaps due to a configuration error or an intermittent condition. A good example is a flapping BGP peer relationship. Low-volume events should also be reviewed since they are infrequent but could be critical. For example, an infrastructure interface that goes down could impact network resilience. In practice, you’d focus your attention on the top and bottom ten events by volume.
- Count the events by device and interface. This part of the report lists the count of events for each network element that is reporting problems, an important item that’s missing from raw event counts. This is a longer section of the report but is still surprisingly short, given the volume of the incoming data.
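The two counts described above can be sketched in a few lines of Python. The tuple layout (device, interface, event type) is an assumption about how the normalized events are stored, and the function name is hypothetical.

```python
from collections import Counter


def summarize(events, n=10):
    """Build the two summary views from normalized events.

    events: list of (device, interface, event_type) tuples.
    n: how many event types to show at the top and bottom of the ranking.
    """
    by_type = Counter(event_type for _, _, event_type in events)
    by_element = Counter((device, interface) for device, interface, _ in events)

    ranked = by_type.most_common()  # most frequent first
    return {
        "top_types": ranked[:n],
        # Least frequent first, so rare but possibly critical events lead.
        "bottom_types": ranked[-n:][::-1],
        "by_element": by_element.most_common(),
    }
```

When there are fewer than 2n distinct event types, the top and bottom lists overlap, which is harmless for a report meant to be read by a person.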
The resulting summary report is often only a few pages long, turning vast volumes of data into concise and useful information. The reason this report format works is that networks tend to have a limited number of unique event types, and the number of network elements experiencing problems tends to be small.
What do you do with the report? First, it’s good to have the network systems staff examine the report daily and identify any potential problems that should be investigated and corrected. Then, it’s a good idea to archive the daily report, so it’s available as a historical baseline. You can then look back at the past event history to see how long a particular event has been occurring or to see if other network elements have had the same problem in the past. An additional mechanism is to record the summary information in a database and create graphs of various elements to indicate trends.
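One way to keep that history queryable is a small database table of daily counts. The sketch below uses Python's built-in sqlite3 module; the table name, columns, and function names are assumptions for illustration, and any database would serve.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the archive; the default is an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_summary (
               day TEXT, device TEXT, event_type TEXT, count INTEGER,
               PRIMARY KEY (day, device, event_type))"""
    )
    return conn


def record_day(conn, day, counts):
    """Store one day's counts; counts maps (device, event_type) -> count."""
    conn.executemany(
        "INSERT OR REPLACE INTO daily_summary VALUES (?, ?, ?, ?)",
        [(day, device, etype, n) for (device, etype), n in counts.items()],
    )
    conn.commit()


def history(conn, device, event_type):
    """Return [(day, count), ...] in date order for one device and event type."""
    cursor = conn.execute(
        "SELECT day, count FROM daily_summary "
        "WHERE device = ? AND event_type = ? ORDER BY day",
        (device, event_type),
    )
    return cursor.fetchall()
```

Storing days as ISO-format strings (YYYY-MM-DD) keeps them in chronological order when sorted as text, and the rows returned by `history` can be fed directly to a graphing tool to show the trend.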
Apply Machine Learning
The high-tech approach involves the technology that’s being applied to many aspects of IT: machine learning. Event processing systems are ideal candidates due to the repetitive nature of the events and machine learning’s ability to identify patterns in vast volumes of data. Most event processing systems have adopted some form of machine learning. What’s the advantage? Machine learning replaces maintenance-intensive static rules with a dynamic system that identifies trends and outliers that are nearly impossible to find by other analysis methods. Some reports I’ve seen describe a significant reduction in event volume, down to the actionable events that identify a problem’s root cause. And faster identification results in faster problem resolution.
My list of top products, in no particular order, includes Splunk, Moogsoft, and Elastic. Splunk has an ML toolkit available. Moogsoft has its ML system built in. Elastic (perhaps better known as the source of the ELK stack: Elasticsearch, Logstash, and Kibana) has an ML library as well. If you’ve not worked with a machine learning system before, you should consider hiring a consultant to help with the implementation, tuning, and initial operation of these systems.
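To illustrate the difference between a static rule and a baseline-driven check, here is a small Python sketch. To be clear, this is not machine learning and not how any of the commercial products work; it is a simple statistical stand-in (a modified z-score based on the median) that I'm using only to show the idea of flagging an outlier relative to a device's own history. The function name and the 3.5 threshold are assumptions.

```python
from statistics import median


def is_outlier(past_counts, today, threshold=3.5):
    """Flag today's event count if it deviates strongly from past counts.

    Uses the median and the median absolute deviation (MAD), which are
    less sensitive to a few extreme days than the mean and standard deviation.
    """
    baseline = median(past_counts)
    mad = median(abs(count - baseline) for count in past_counts)
    if mad == 0:
        # Every past day looked the same, so any change at all stands out.
        return today != baseline
    modified_z = 0.6745 * (today - baseline) / mad
    return abs(modified_z) > threshold
```

Unlike a fixed "alert above 100 events" rule, the baseline here adjusts to each device, so a normally quiet link and a normally chatty one are judged against their own behavior.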
Machine learning technology is being adopted by these products to correlate events and identify core problems, replacing maintenance-intensive static rules.
Event Manager Reporting
The third option is to use the reporting mechanisms built into the event management product you select. You can configure reports on various items within the event stream, create and clear trouble tickets, or take other actions. You should verify that interfaces exist between the event system and the trouble ticketing system; otherwise, you’ll have to build that integration yourself using their APIs.
What’s the best approach? I prefer to use the summary report and ML reporting to provide a high-level “how’s the network doing” view daily. Then, I rely on the built-in reporting mechanisms for in-depth analysis. Each mechanism has its own strengths, and I focus on taking advantage of each of those strengths at the appropriate part of the network monitoring process.