Putting Machine Learning to Use for Network Management
Machine learning (ML) is a popular buzzword, in networking as much as anywhere else. But is ML real technology we can use?
Understanding Machine Learning
ML is a small subset of the artificial intelligence (AI) field. A system is considered artificially intelligent if it can use the same observations as a human to arrive at similar decisions and take similar actions. An ML system, however, isn’t as intelligent. It’s simply a system that learns to identify specific patterns. The best-known ML system, built by Google, was taught to recognize cats in pictures. In healthcare, ML systems are being trained to identify tumors in radiological scans. A related technology, deep learning, uses neural networks built from many interconnected layers, hence the label “deep.”
On the humorous side, Janelle Shane, a research scientist, has been using ML systems to create mixed drink recipes, cookie recipes, and drawings. Her results are educational, showing what can happen if the learning corpus is relatively small or if the corpus is not a good match for the desired solution space. Compare a neural network that has to learn to spell words with one in which the smallest unit is already a correctly spelled word. Her examples give us a feel for how these systems work.
How can we apply this technology to networking, and specifically to network management? We’re seeing vendors starting to develop ML systems that can identify common networking problems. Moogsoft uses ML to correlate network events. Splunk has a similar system. Gartner has a name for the space: artificial intelligence for IT operations, or AIOps. The most important step for any of these systems is to train it with enough data of the type that we want it to analyze and recognize.
ML’s learning process depends on getting input on what it is observing. The cat pictures experiment was successful because data scientists taught the system by showing it pictures of cats -- millions of cats -- all identified by humans. It could then recognize cats in a previously unseen picture. A similar process must happen with network management data.
Training an ML System
We have several problems to overcome when we try to apply ML to network management.
Sufficient Volume of Data -- To be able to refine the weights a network uses internally, ML neural networks require very large volumes of data. The volume of data from one network may not be sufficient for the learning algorithm to generate the results we want. What if we could import the initial neural network weights from other organizations? Would another site’s network be representative of our own network? While networks tend to be designed around similar constructs (at least if you’re following industry best practices), they operate in vastly different ways. The data from a well-run network will be significantly different from the data from a poorly run network, so an ML system trained elsewhere may not function properly in your own network. If you only use data from your network, the ML system may not have enough data to properly identify problems until many months after the initial installation.
Feedback -- Let’s say you have enough data to drive the ML system. The next challenge is to provide feedback that identifies problems. It should be possible to automate some of the feedback mechanism.
For example, the ML system should be able to detect a duplex mismatch on an Ethernet link by several means. A simple set of rules should suffice for identifying the signature of the problem and providing feedback. This is basically the same functionality that drives rules-based network management systems. The feedback mechanism becomes more complex when we want to start combining data sources. The ML system needs accurate data and feedback as it learns to identify good and bad patterns. A good example is the ML system’s ability to detect configuration issues in a four-port Ethernet channel. The ML system will require a lot of feedback from a network engineer before it has learned enough to identify an improper configuration.
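As a rough illustration of the kind of rule that could label duplex mismatches automatically, here is a minimal sketch. It assumes the commonly described signature of the problem: late collisions on the half-duplex end of the link, and FCS errors or runts on the full-duplex end. The `InterfaceCounters` schema, the field names, and the "any nonzero count" threshold are hypothetical choices made for this example, not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class InterfaceCounters:
    """Counter deltas for one interface over one polling interval (hypothetical schema)."""
    name: str
    duplex: str           # "half" or "full", as reported by the device
    late_collisions: int
    fcs_errors: int
    runts: int


def duplex_mismatch_label(a: InterfaceCounters, b: InterfaceCounters) -> bool:
    """Return True if the two ends of one link show the mismatch signature.

    Assumption: any nonzero count is treated as significant; a real rule
    would likely use rates and thresholds tuned to the network.
    """
    # A mismatch requires one end at half duplex and the other at full.
    if {a.duplex, b.duplex} != {"half", "full"}:
        return False
    half, full = (a, b) if a.duplex == "half" else (b, a)
    # Half-duplex end sees late collisions; full-duplex end sees FCS errors or runts.
    return half.late_collisions > 0 and (full.fcs_errors > 0 or full.runts > 0)


# Example: label one link so the result can be fed to a learner as feedback.
end_a = InterfaceCounters("sw1:Gi0/1", "half", late_collisions=12, fcs_errors=0, runts=0)
end_b = InterfaceCounters("sw2:Gi0/1", "full", late_collisions=0, fcs_errors=40, runts=3)
label = duplex_mismatch_label(end_a, end_b)
```

A rule like this is cheap to run on every polling cycle, which is what makes it usable as an automated source of labels rather than something an engineer has to supply by hand.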
A network management system should be able to provide feedback to an ML system as the operations team identifies and corrects problems. Part of the change process could incorporate ML feedback. Of course, you would also need a way to reverse the effect of incorrect feedback, such as when a fix fails and the problem is not actually corrected until a second or third examination.
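One way to make feedback reversible is to keep it in an append-only log and let entries be retracted rather than overwritten. The sketch below shows the idea; the `FeedbackLog` class, its method names, and the sample and ticket identifiers are all hypothetical, invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackLog:
    """Append-only log of operator feedback; entries can be retracted later."""
    entries: list = field(default_factory=list)

    def record(self, sample_id: str, label: str, ticket: str) -> int:
        """Store a label tied to a change ticket and return its entry id."""
        self.entries.append(
            {"sample": sample_id, "label": label, "ticket": ticket, "active": True}
        )
        return len(self.entries) - 1

    def retract(self, entry_id: int) -> None:
        """Mark an entry as wrong without deleting the audit trail."""
        self.entries[entry_id]["active"] = False

    def training_labels(self) -> dict:
        """Return the most recent active label for each sample."""
        labels = {}
        for entry in self.entries:
            if entry["active"]:
                labels[entry["sample"]] = entry["label"]
        return labels


log = FeedbackLog()
# First diagnosis turns out to be wrong: the fix did not clear the problem.
first = log.record("link-42", "bad_cable", ticket="CHG-1001")
log.retract(first)
# Second examination finds the actual cause.
log.record("link-42", "duplex_mismatch", ticket="CHG-1002")
labels = log.training_labels()
```

Because retracted entries stay in the log, the team can still see which diagnoses were wrong, while the learner only ever receives labels that are still believed to be correct.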
I rarely see a well-run network. The volume of data from devices and interfaces is too great for most network teams to handle well. The result is that a lot of problems go unidentified and uncorrected until a customer is affected and complains. ML may be a technology that helps us proactively identify and correct problems -- and maybe even helps us get to the point where our network management systems can predict what a network is going to do.
For more on this topic, please join me at Enterprise Connect, March 18 to 21 in Orlando, Fla., for the session, “How Machine Learning Is Changing Network Management Tools.” That’s only two months from now, but still enough time to get discounted admission. In fact, the deadline for the Advance Rate is this Friday, Jan. 18, so register now for your best deal. You can even save an additional $200 by using the code NJPOSTS.