Network Analytics: Checklist for Failure
Analytics is a tool, not an answer. What you collect with analytics can lead to valuable conclusions or erroneous actions. We have seen analytics applied across a wide spectrum of situations.
Network Operational Goals
There are a number of goals that network operations needs to deliver:
- Reduce downtime
- Limit problem scope
- Reduce time to problem resolution
- Extend the life of network resources
- Minimize labor efforts by the network staff so they can work on other projects
- Deliver happy customers and users
- Improve the reputation of network operations
More Data = More Knowledge
The more data collected, the more you know, and the better network operations will be. Sometimes the data collected may not seem important at the time of its collection. I have managed several large projects in my career, and each one has had one or more problems that took a long time to resolve. The tools I had could report the problem but could not tell me why it was happening.
For example, on one project in lower Manhattan I was losing 80% of my backup lines every day. The next day some were fixed, but we still had 80% loss. Finally, the local carrier's manager told me in private that their record-keeping of physical circuits was inadequate. He would not admit it outright, but he implied that if I put powered-up modems on every backup line, I would see a significant reduction in my losses. I did that, and within a week our failure rate was down to less than 5%. It appears that the local technicians would check pairs of wire and, if they measured no signal, assume the pair was unassigned and reassign it to someone else.
On another project, we ran into issues with power interruptions. The computer would run fine through the winter, but once summer arrived, power would be interrupted four or five times per day, causing significant problems with our software development. I studied the design of the power failsafe interrupt circuitry. It turned out that the design contained a decimal-point error that made the circuitry 10 times more sensitive to power outages than specified. Once it was fixed, we had no more power interruptions.
In a third example, on another project, the special communication processor was working fine, and we had the vendor in to perform regular maintenance. We assumed he had performed his work competently, but after he left, the computer still worked fine while the communications processor did not. We discovered that the maintenance technician had put the microprocessor boards back one slot over from their correct positions. When we moved the boards to their proper slots, the processor worked correctly.
In these three scenarios, network management tools would have told us that there were problems, but they would not have pointed to solutions. You need to look beyond the usual data for information that may seem ancillary. For example, dates are more important than you might realize. When was the product manufactured? When was it installed? When was the software installed? When was the software updated? Who changed the configuration, and when? You need all this information to get to the root cause of many problems.
It's also important to keep track of all the serial numbers so devices can be correlated to other failures in your network. The serial numbers may help you determine that there was a group of products from your vendor that were poorly manufactured and need to be replaced or upgraded.
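As a minimal sketch of that kind of correlation, assume each device in the inventory has a serial number whose first few characters identify a manufacturing batch. That serial format, the function names, and the sample data below are all hypothetical; real vendors encode batch information in their own ways.

```python
from collections import Counter

def batch_failure_rates(inventory, failed_serials, prefix_len=4):
    """Return {batch_prefix: (failed, total, rate)} for each batch in the inventory.

    Assumption (hypothetical): the first `prefix_len` characters of a
    serial number identify the manufacturing batch.
    """
    totals = Counter(serial[:prefix_len] for serial in inventory)
    # Use a set so a device that failed twice is counted once.
    failures = Counter(serial[:prefix_len] for serial in set(failed_serials))
    return {
        batch: (failures[batch], total, failures[batch] / total)
        for batch, total in totals.items()
    }

def suspect_batches(rates, min_devices=3, threshold=0.5):
    """List batches with enough devices and a failure rate at or above threshold."""
    return sorted(
        batch
        for batch, (failed, total, rate) in rates.items()
        if total >= min_devices and rate >= threshold
    )

# Illustrative data only.
inventory = ["AB12-001", "AB12-002", "AB12-003", "AB12-004",
             "CD34-001", "CD34-002", "CD34-003"]
failed = ["AB12-001", "AB12-003", "AB12-004", "CD34-002"]

rates = batch_failure_rates(inventory, failed)
print(suspect_batches(rates))
```

With this sample data, three of the four "AB12" devices have failed, so that batch is the one flagged for a conversation with the vendor. The thresholds are arbitrary starting points and would need tuning against a real inventory.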
It is assumed that privileged access management only allows authorized personnel to access your network resources. Do not assume that all those privileged accessors will do a perfect job. Some are lazy or negligent, and some simply make mistakes. You need to know the history of what happened during privileged access. You need to be able to analyze what was done right or wrong, and what you might do to prevent those problems in the future.
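One way to use that history, sketched here under stated assumptions, is to list every recorded change to a device in the hours before it failed. The `ChangeEvent` record, its field names, and the sample events are hypothetical; an actual privileged access management or change-tracking system will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    device: str      # device the change was made on
    when: datetime   # when the change was made
    who: str         # account or person that made it
    what: str        # short description of the change

def changes_before_failure(events, device, failure_time,
                           window=timedelta(hours=24)):
    """Return changes to `device` made within `window` before the failure, newest first."""
    start = failure_time - window
    hits = [
        e for e in events
        if e.device == device and start <= e.when <= failure_time
    ]
    return sorted(hits, key=lambda e: e.when, reverse=True)

# Illustrative data only.
events = [
    ChangeEvent("comm-proc-1", datetime(2024, 6, 3, 9, 30), "vendor-tech", "board maintenance"),
    ChangeEvent("comm-proc-1", datetime(2024, 5, 1, 14, 0), "ops-admin", "software update"),
    ChangeEvent("router-7", datetime(2024, 6, 3, 10, 0), "ops-admin", "config change"),
]

failure = datetime(2024, 6, 3, 11, 15)
for e in changes_before_failure(events, "comm-proc-1", failure):
    print(e.when.isoformat(), e.who, e.what)
```

In this sample, the only change to the failed device inside the 24-hour window is the vendor maintenance visit, which is the kind of lead the misplaced-boards story above would have benefited from.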
Don't Just Fix It, Prevent It
Most network management tools have some analytics functions. Most are good at identifying problems and failures and at focusing attention on the right solutions. Analytics should also be applied to prevent future problems. The goal is to use analytics to improve operations, not merely to keep them running as they always have.
Use analytics as a mechanism to keep improving the network resources, performance, and reliability. Analytics tools should be capable of pointing out recurring problems and solutions that can prevent the problems from persisting.
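A simple form of recurring-problem detection can be sketched as counting how often the same symptom appears on the same device. The incident fields (`device`, `symptom`) and the sample records are assumptions for illustration, not the output format of any particular tool.

```python
from collections import Counter

def recurring_problems(incidents, min_count=3):
    """Return ((device, symptom), count) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter((i["device"], i["symptom"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]

# Illustrative data only.
incidents = [
    {"device": "sw-12", "symptom": "link flap"},
    {"device": "rtr-2", "symptom": "high cpu"},
    {"device": "sw-12", "symptom": "link flap"},
    {"device": "sw-12", "symptom": "fan alarm"},
    {"device": "sw-12", "symptom": "link flap"},
]

for (device, symptom), n in recurring_problems(incidents):
    print(f"{device}: '{symptom}' occurred {n} times")
```

A repeated pair such as the link flap on "sw-12" is a candidate for a permanent fix rather than another one-off repair.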
What Not to Do in Analytics
Using analytics effectively will depend on several factors. There are good and poor approaches to analytics, and there are limitations to be aware of as well. Here are some thoughts on what not to do:
- Don't expect executives to have a clear vision of what you are trying to do without first educating yourself and the executives about analytics, its processes, and its anticipated results
- Stop trying to determine the value of the analytics in the first year and assuming that the analyses are valid, especially if you haven't done this before
- Avoid using a few use cases to create a strategy. The case studies you choose can have a major and sometimes negative impact on the strategy development
- Do not assume the present staff is competent to use new analytics tools
- Do not fill analytics translator roles with new hires. Use internal candidates who have a good deal of network knowledge as the translators
- Don't create an analytics team that is separate from the network operations team. This will result in an ineffective organization
- Don't assume that older collected data should be thrown away. Data cleaning efforts should be selective and not performed globally
- Don't buy generic analytics tools. Ensure that the platform you use is tailored to network operations
- Don't believe anyone who confidently claims to know what the impact of an analytics program will be. It will be very hard to quantify in advance