Monitoring a Software Defined Network, Part 2
Monitoring of an SDN controller needs to be useful to both network operators and software developers.
Monitoring the SDN Control System
The controller contains the smarts of a software-defined network and is next on our list of things to monitor. (See the prior post about monitoring SDN switches.) That post focused primarily on things that affect the network and are therefore of great interest to network operators. The controller is different because of its significant software component. We need to make the monitoring of an SDN useful to both software developers and network operations staff, which will make the specification and development of a monitoring system rather interesting.
Monitoring for Software Developers
Software developers tend to be more interested in failures logged within the software system. A good example is a controller that consumes all of main memory because a table grew too large. Perhaps there was a memory leak in one of the API calls or maybe the developer's code didn't release a data structure after using it. Each call or use would cause a little more memory to be used, and eventually, with enough use, the system would run out of available memory and ultimately crash. Before that happens, though, the system would likely become sluggish as it tries to handle the large table or data structure. Providing tools that allow software developers to clearly see the full sizes of tables and collections of data structures will be important to aid in troubleshooting these types of problems.
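As an illustration, a controller could periodically report the sizes of its internal tables and flag any that exceed a budget, catching a leak before memory is exhausted. This is a minimal Python sketch; the table names and entry-count thresholds are hypothetical, not taken from any real controller:

```python
import sys

# Hypothetical controller tables -- the names are illustrative.
flow_table = {}
arp_cache = {}

# Entry-count budgets, chosen for illustration only.
TABLE_LIMITS = {"flow_table": 100_000, "arp_cache": 50_000}

def report_table_sizes(tables):
    """Return (name, entries, approx_bytes, over_limit) for each table."""
    report = []
    for name, table in tables.items():
        entries = len(table)
        # getsizeof covers only the container, not the objects it holds,
        # so this is a lower bound on actual memory use.
        approx_bytes = sys.getsizeof(table)
        over = entries > TABLE_LIMITS.get(name, float("inf"))
        report.append((name, entries, approx_bytes, over))
    return report

for name, entries, size, over in report_table_sizes(
        {"flow_table": flow_table, "arp_cache": arp_cache}):
    flag = "OVER LIMIT" if over else "ok"
    print(f"{name}: {entries} entries, ~{size} bytes [{flag}]")
```

A real controller would export these numbers to the monitoring system rather than print them, but the idea is the same: make collection sizes visible before they become a crash.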
There will undoubtedly be some time-critical processes within an SDN controller. An ARP request or a new flow request could require special processing in order to make the network run fast--or at least to keep the network from running too slowly. In this sense, an SDN controller is more like an embedded control system than a typical server application. There will need to be methods for monitoring and reporting on the execution times of time-critical processes. It would also be useful if the monitoring of these processes could generate an asynchronous message (perhaps via syslog) when the processing time exceeds some boundary or when a process didn't start at its specified time.
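One way to implement this is a timing wrapper around each time-critical handler that logs a warning whenever its deadline is missed. The sketch below uses Python's standard logging module, which could forward to syslog via a SysLogHandler; the handler name and the 10 ms budget are illustrative assumptions, not values from any SDN specification:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("sdn.timing")

def deadline(max_seconds):
    """Decorator: log a warning (which a syslog handler could ship to the
    monitoring system) whenever the wrapped function exceeds its budget."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                if elapsed > max_seconds:
                    log.warning("%s exceeded deadline: %.3fs > %.3fs",
                                fn.__name__, elapsed, max_seconds)
        return wrapper
    return decorator

# Hypothetical time-critical handler; the name and budget are illustrative.
@deadline(0.010)  # 10 ms budget for new-flow setup
def handle_flow_request(packet):
    time.sleep(0.02)  # simulate slow processing that blows the budget
    return "flow-installed"
```

Because the warning is emitted asynchronously through the logging pipeline, the handler itself is not slowed down by the reporting.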
Speaking of syslog, I recommend a useful addition to these messages: the first part of the message text should contain a unique ID string. Cisco uses this technique to identify the source of each message.
Message logs will often report common network problems, such as a CDP duplex mismatch, but other messages will report significant internal problems. The message ID is divided into three parts that identify the subsystem, the severity, and a descriptive word. The ID facilitates sorting and grouping messages by severity and error type. Some examples appear in a syslog summary script that shows the message IDs, the number of occurrences, and the systems that sent them. Functionality like this is very useful to both developers and network operators.
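A summary script along those lines is easy to sketch. The Python example below parses Cisco-style message IDs (subsystem, severity digit, mnemonic, e.g. %CDP-4-DUPLEX_MISMATCH) and groups them by occurrence count and sending system; the sample log lines and hostnames are invented for illustration:

```python
import re
from collections import defaultdict

# Invented sample log lines in a Cisco-style format.
LOG_LINES = [
    "switch1 %ETHPORT-3-IF_DOWN: Interface e1/1 is down",
    "switch2 %ETHPORT-3-IF_DOWN: Interface e2/4 is down",
    "switch1 %CDP-4-DUPLEX_MISMATCH: duplex mismatch discovered on e1/2",
]

# Message ID: %SUBSYSTEM-SEVERITY-MNEMONIC
MSG_ID = re.compile(r"%(\w+)-(\d)-(\w+)")

def summarize(lines):
    """Group log lines by message ID, counting hits and sending hosts."""
    summary = defaultdict(lambda: {"count": 0, "hosts": set()})
    for line in lines:
        host, _, rest = line.partition(" ")
        m = MSG_ID.search(rest)
        if m:
            summary[m.group(0)]["count"] += 1
            summary[m.group(0)]["hosts"].add(host)
    return summary

for msg_id, info in sorted(summarize(LOG_LINES).items()):
    print(f"{msg_id}: {info['count']} occurrence(s) from {sorted(info['hosts'])}")
```

Because the severity digit is embedded in the ID, the same regex groups can also drive filtering (e.g. report only severity 3 and below to the operations staff).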
Software developers who are working on the internals of an SDN controller will have many more diagnostic tools. Those who are developing applications that communicate with an SDN controller via the so-called Northbound interface will also have debugging tools, but those tools will probably not provide visibility into the SDN internals. Both of these development communities will have plenty of good development and diagnostic tools. After all, if they identify a tool that could be useful, they will create it.
Monitoring for Network Operators
Network operators also require different types of tools--those that are focused more on network operations and the ability to diagnose or troubleshoot common problems. Network operators typically don't have the software skills or resources to create their own monitoring tools, so the SDN will need to provide either an API for an external monitoring system or direct access to the necessary data.
SDN doesn't mean that we should forget all the lessons that we have learned in the past about network management. Nor does it mean that good network design practices are somehow incorrect. We will need access to data for physical and logical connectivity. We will need to understand the symptoms indicating that an SDN domain needs to be subdivided. And we will need to know when the SDN domain is having internal and external communication problems. Finally, there will need to be a suite of diagnostic tools that show us flow paths and how those paths were created (i.e., which controller(s) created the entries that resulted in the path).
Communication failures between switches and controllers will be fairly common, so that's at the top of the list to monitor and report. If possible, the system should identify the potential cause of the failure. Switches and controllers both need to report problems. Network topologies can fail in many different ways, some of which may allow one component to communicate with a monitoring system while the other component is isolated. Or latency may increase due to a topology change, causing control system timeouts. Finally, as new revisions of the SDN communication and control protocols are rolled out, we should expect to occasionally find a protocol failure (e.g., an OpenFlow protocol incompatibility), which needs to be reported to the network operations staff.
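A basic form of this monitoring is keepalive tracking: record when each switch was last heard from and flag switches that have been silent for too long. The sketch below is illustrative; the 5-second interval and three-miss threshold are assumptions, not values from any SDN control protocol:

```python
import time

KEEPALIVE_INTERVAL = 5.0  # seconds between expected keepalives (assumed)
MISS_THRESHOLD = 3        # missed keepalives before declaring a failure

class ChannelMonitor:
    """Track last-heard times per switch and flag suspected channel failures.

    The clock is injectable so the logic can be tested deterministically.
    """
    def __init__(self, now=time.monotonic):
        self.now = now
        self.last_seen = {}

    def heard_from(self, switch_id):
        """Call whenever any message arrives from the switch."""
        self.last_seen[switch_id] = self.now()

    def failed_channels(self):
        """Switches silent for longer than MISS_THRESHOLD intervals."""
        cutoff = self.now() - KEEPALIVE_INTERVAL * MISS_THRESHOLD
        return [s for s, t in self.last_seen.items() if t < cutoff]
```

A real deployment would pair this with reporting from the switch side as well, since, as noted above, either end of the channel may be the one that can still reach the monitoring system.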
We should learn something from the history of failures in redundant systems. Remember the stories about the failure of one component of a redundant system? The first failure was frequently not detected and corrected before a second failure in the second path caused a network outage. A communication failure notification from one or more switches should alert the network operations staff to the communication problem so that they can take remedial action before a second failure causes part of the network to become segmented, creating an outage.
Not all network problems create hard failures. The monitoring system will need to report packet loss statistics and retransmissions. These statistics will provide an indication of the reliability of network connections between elements.
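Computing those statistics from cumulative interface counters is straightforward: take the delta between two snapshots and report the fraction lost. A minimal sketch, with generic "tx"/"rx" field names standing in for whatever counters the switches actually expose:

```python
def interval_loss(prev, curr):
    """Loss percentage over one polling interval.

    `prev` and `curr` are snapshots of cumulative counters; the
    'tx'/'rx' field names are illustrative placeholders.
    """
    tx = curr["tx"] - prev["tx"]
    rx = curr["rx"] - prev["rx"]
    if tx <= 0:
        return 0.0  # no traffic (or a counter reset) -- nothing to report
    return 100.0 * max(0, tx - rx) / tx

# Example: 1000 packets sent this interval, 950 received -> 5% loss.
print(interval_loss({"tx": 1000, "rx": 1000}, {"tx": 2000, "rx": 1950}))
```

Tracking the loss rate per interval, rather than over the life of the counters, makes a newly degraded link stand out instead of being averaged away.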
I don't expect SDN to eliminate the major source of network outages: configuration errors. So there will need to be mechanisms in place to monitor and troubleshoot all the configuration parameters necessary for the SDN to function. For example, encryption keys will need to match, and there must be mechanisms that allow easy migration from one set of keys to another without requiring a flag-day conversion.
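The key-migration requirement suggests that each side should hold a set of keys rather than a single key, so that old and new keys can overlap during the transition. A tiny sketch of the compatibility check a monitoring tool could run (the key IDs are illustrative):

```python
def keys_compatible(controller_keys, switch_keys):
    """True if the controller and switch share at least one key ID.

    During a migration, both sides temporarily hold old and new keys;
    the channel keeps working as long as the sets overlap, so the old
    key can be retired one device at a time instead of on a flag day.
    """
    return bool(set(controller_keys) & set(switch_keys))

# Mid-migration: controller has both keys, switch still has only the old one.
print(keys_compatible({"key-2023", "key-2024"}, {"key-2023"}))  # True
```

A configuration monitor could run this check across every switch-controller pair and report any pair whose key sets no longer overlap before the mismatch becomes an outage.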
We will also need to track the configuration that creates the switch-controller relationships. And don't forget that each SDN domain will need to communicate with the rest of the network or with external networks. There will certainly be configurations for routing protocols so that the SDN domain can forward packets to the correct next hop for external destinations.
Network operators will also need access to data from protocols that run within SDN switches, such as LLDP (Link-Layer Discovery Protocol) and BFD (Bi-directional Forwarding Detection). Note: Some of these protocols will need to run within the switches instead of in the controller, simply because that's a much more logical place for them to be located. The data that these protocols collect will be sent to the controller for use in determining forwarding tables and for diagnostic purposes.
Network diagrams that provide visibility into the SDN control system topology (both logical and physical) will be extremely useful, especially in the early days when we're all learning how these new networks operate (when they are functioning correctly) and fail (when they are not).
The combination of SDN switch monitoring and SDN control system monitoring can give us better visibility into the operation of the SDN system. I just hope that the necessary hooks are created to allow the visibility that is needed.