Monitoring a Software Defined Network, Part 1
In addition to many traditional problems, there will be a set of new problems that must be understood.
The Need for Monitoring
Just because a network is Software-Defined does not mean that it doesn't need to be monitored and managed. There are many network problems that SDNs do not eliminate, so there is still a requirement to identify the sources of problems. In fact, in addition to many traditional problems, there will be a set of new problems that must be understood, and a means must be developed for identifying and correcting them.
A lot has been written about the potential for SDN to change how networks are designed and how they operate. However, very little has been written about monitoring an SDN. Most of the development effort that I've seen is about OpenFlow, which one could argue is not by itself SDN, but is simply a mechanism by which an SDN could be implemented. Other mechanisms can be used to create an SDN, so let's not restrict ourselves to OpenFlow for this analysis.
Monitor the Network
The network will still need to be monitored to detect traditional problems. At the physical layer, we will still need to detect Frame Check Sequence (FCS) errors, Runts, Giants, Late Collisions, and link errors (some of these are specific to Ethernet; others are generic to any interface). Several of these errors, late collisions and FCS errors in particular, are indications of duplex mismatch on Ethernet links. Perhaps the SDN initiative will allow us to create a mechanism that we can use to switch both ends of a link to the same duplex setting. It would be really nice to get this problem solved.
At the interface queuing level, we need to detect packet discards, which are due to interface congestion. A discard is a packet that can't be transmitted on an egress interface because there were no free interface buffers (i.e., the egress queue was full). This typically happens when one or more high-speed interfaces are feeding one lower-speed egress interface.
An ingress overrun can occur when the switching hardware is unable to handle an inbound packet before the next inbound packet arrives on that interface. These should be rare, but can happen on low-cost devices that have less than line-rate ingress processing capability.
An incredibly useful addition to the basic error counters would be a cache of the failed packet's header. When an error is detected, the packet header would be copied into this cache. By saving the header, it is possible to retrieve the information regarding the most recent error, providing valuable troubleshooting data. Otherwise, it would be necessary to use a packet capture device to try to determine which systems were having the problem.
If only a single storage location exists for each counter, any new error would overwrite the previous header. So ideally, there would be at least one storage location for each error counter. If two or four storage locations existed per counter, they would operate as a ring buffer, storing the headers of the last two or four errors. I have written about this suggestion and a workaround in the past at Netcraftsmen, in "How To Improve SNMP MIBS" and "Diagnosing the ipOutNoRoute Counter." An alternative for SDN is for the switch to forward the header to the SDN controller for storage, possibly only when we're actively troubleshooting a problem, much like we do with the "debug" command in today's equipment.
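To make the ring buffer idea concrete, here is a minimal sketch in Python. The class name ErrorCounter, its methods, and the placeholder header bytes are all hypothetical; no switch exposes this interface today. It only illustrates how a counter with a small, fixed number of header slots would behave.

```python
from collections import deque


class ErrorCounter:
    """Hypothetical error counter that also keeps the last few failed headers."""

    def __init__(self, name, slots=4):
        self.name = name
        self.count = 0
        # A deque with maxlen discards the oldest entry when full,
        # which is exactly the ring-buffer behavior described above.
        self.recent_headers = deque(maxlen=slots)

    def record(self, header):
        """Count one error and cache the offending packet's header."""
        self.count += 1
        self.recent_headers.append(header)

    def last_headers(self):
        """Return cached headers, oldest first."""
        return list(self.recent_headers)


# Two slots, three errors: the first header is overwritten.
fcs = ErrorCounter("fcs_errors", slots=2)
for hdr in (b"hdr-A", b"hdr-B", b"hdr-C"):  # placeholder header bytes
    fcs.record(hdr)

print(fcs.count, fcs.last_headers())
```

With two slots and three errors, the counter still reads 3, but only the two most recent headers survive. That is the trade-off between storage cost and troubleshooting history.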
Monitor Forwarding Counters
SDN switches make forwarding decisions based on a large set of bits, including MAC address, QoS bits, and, potentially, application header bits. It is important for the SDN switches to record forwarding successes and failures as well as tracking the number of packets and octets that are processed by each forwarding criterion. This is particularly useful for tracking bandwidth utilization in a QoS queue or being able to detect when an application or a Forwarding Equivalence Class is consuming more bandwidth than anticipated.
Keep in mind that SDN switches, like today's switches, often have aggregate packet forwarding limits. With both packet count and octet count metrics available, the SDN controller can make smarter decisions, as well as providing feedback to applications that are capable of understanding it.
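As a rough illustration of how a controller might use these two metrics, the sketch below turns a pair of counter readings into packets per second and bits per second, then flags any traffic class that exceeds its expected bandwidth. The Sample type, the class names "voice" and "bulk", and the budget figures are assumptions made up for this example, and counter wrap is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float      # time of the reading, in seconds
    packets: int  # cumulative packet counter
    octets: int   # cumulative octet counter


def rates(prev, cur):
    """Return (packets/sec, bits/sec) between two readings of one counter pair."""
    dt = cur.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    pps = (cur.packets - prev.packets) / dt
    bps = (cur.octets - prev.octets) * 8 / dt
    return pps, bps


def over_budget(readings, budgets_bps):
    """Return the names of classes whose bit rate exceeds their budget."""
    flagged = []
    for name, (prev, cur) in readings.items():
        _, bps = rates(prev, cur)
        if bps > budgets_bps[name]:
            flagged.append(name)
    return flagged


# Hypothetical readings taken 10 seconds apart for two traffic classes.
readings = {
    "voice": (Sample(0.0, 1000, 200_000), Sample(10.0, 6000, 1_200_000)),
    "bulk": (Sample(0.0, 1000, 1_000_000), Sample(10.0, 6000, 8_500_000)),
}
budgets_bps = {"voice": 1_000_000, "bulk": 5_000_000}

print(over_budget(readings, budgets_bps))
```

Both classes move the same number of packets here, so a packet count alone would not distinguish them. The octet count is what shows the bulk class running over its budget.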
While we're wishing for monitoring functions, let's ask for functionality to make the measurements more accurate. It would be useful for the switch hardware to support atomic snapshots of multiple counters. For example, getting the packet count and octet count in one atomic operation would make the resulting calculation accurate. This is something that SNMP does not guarantee today: even when both counters are requested together, nothing ensures they were sampled at the same instant.
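A small simulation shows why the atomic snapshot matters. The FlowCounters class below is a hypothetical software stand-in for a pair of hardware counters, and the packet sizes are invented. The point is only to compare two separate reads against a single locked read.

```python
import threading


class FlowCounters:
    """Hypothetical packet/octet counter pair with an atomic snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self.packets = 0
        self.octets = 0

    def count(self, size):
        """Account for one forwarded packet of the given size in octets."""
        with self._lock:
            self.packets += 1
            self.octets += size

    def snapshot(self):
        """Read both counters under one lock so they describe the same instant."""
        with self._lock:
            return self.packets, self.octets


c = FlowCounters()
for _ in range(10):
    c.count(100)  # ten 100-octet packets

# Non-atomic read: a 1500-octet packet arrives between the two reads.
pkts = c.packets
c.count(1500)
octs = c.octets
skewed_avg = octs / pkts  # octets include a packet the packet count missed

# Atomic read: both values come from the same instant.
pkts2, octs2 = c.snapshot()
true_avg = octs2 / pkts2

print(skewed_avg, round(true_avg, 1))
```

The two separate reads produce an average packet size that never existed on the wire, because the octet counter includes a packet that the packet counter missed.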
As with physical counters, it is important to cache the header of any packets that are not forwarded. Let's say that we are troubleshooting a connectivity problem. If we have the headers of dropped packets, we can check them against the forwarding entries that exist in the switch to determine why a specific set of packets is being dropped.
In addition, it would be nice if the switch hardware supported tracing of packet processing, recording the table entries that matched for a specified packet header. We could then ask the switch to tell us what internal processing happened and know whether a packet with a certain header was forwarded or dropped and why. A version of this capability has been developed at Stanford, in a network debugger called "ndb".
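The kind of answer we would want from such a trace can be sketched in a few lines. This is not how ndb works, and it is not any switch's real API. The Rule type, the field names, the sample table, and the choice to treat a table miss as a drop are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    priority: int
    match: dict  # field -> required value; absent fields are wildcards
    action: str  # "forward:<port>" or "drop"


def matches(rule, header):
    """True if every field the rule specifies equals the header's value."""
    return all(header.get(k) == v for k, v in rule.match.items())


def trace(table, header):
    """Report every entry that matched and which one decided the outcome."""
    hits = sorted((r for r in table if matches(r, header)),
                  key=lambda r: -r.priority)
    if not hits:
        # Assumption: this sketch treats a table miss as a drop.
        return {"matched": [], "action": "drop", "reason": "table miss"}
    winner = hits[0]  # highest priority wins
    return {"matched": [r.name for r in hits],
            "action": winner.action,
            "reason": "rule " + winner.name}


table = [
    Rule("allow-web", 100, {"ip_dst": "10.0.0.5", "tcp_dst": 80}, "forward:2"),
    Rule("block-host", 200, {"ip_src": "10.0.0.9"}, "drop"),
    Rule("default-l2", 10, {"eth_dst": "aa:bb:cc:dd:ee:ff"}, "forward:1"),
]

# Header taken from a (hypothetical) dropped-packet cache.
dropped = {"eth_dst": "aa:bb:cc:dd:ee:ff", "ip_src": "10.0.0.9",
           "ip_dst": "10.0.0.5", "tcp_dst": 80}
result = trace(table, dropped)
print(result)
```

Feeding a cached dropped-packet header into a trace like this tells us that the packet matched a web-forwarding entry but was dropped by a higher-priority host block. That is exactly the kind of answer that is hard to get from counters alone.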
To Be Continued...
It would be nice if the monitoring of traditional errors were improved, making it easier to troubleshoot common problems. With SDN, an opportunity exists to gain more visibility into the forwarding engine's operation. We should spend time investigating whether the proposed mechanisms are sufficient for doing the level of debugging and troubleshooting that will be required.
Next time, I'll discuss monitoring the SDN controller.