Monitoring a Software Defined Network, Part 6
The dynamics of an SDN make it particularly challenging to monitor and manage.
Note: My discussion of SDN monitoring covers several topics. Here are the prior posts:
The Agile SDN
One of the big advantages of Software Defined Networking is its agility; it can quickly adapt to changes in the compute and storage environments. But how does one monitor a rapidly changing network environment? Most network management and monitoring systems periodically scan for new devices in the network, perhaps as infrequently as once a day, or even less often. Those long polling intervals were fine when network infrastructure changed on a weekly or monthly basis. But when infrastructure can be deployed and reclaimed within a few hours in response to a business demand, the network monitoring system must be much more responsive.
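One way to get that responsiveness is to react to controller notifications instead of waiting for the next scheduled scan. The sketch below is a minimal, hypothetical illustration: the event names, payload shape, and `InventoryMonitor` class are all assumptions, standing in for whatever notification API a real SDN controller exposes.

```python
# Hypothetical sketch: event-driven inventory tracking instead of periodic scans.
# The event types and payload fields are illustrative, not a real controller API.

class InventoryMonitor:
    """Keeps the device inventory current as controller events arrive."""

    def __init__(self):
        self.devices = {}

    def on_event(self, event):
        # React the moment the controller reports a change, rather than
        # discovering it at the next daily scan.
        if event["type"] == "device_added":
            self.devices[event["id"]] = event.get("attributes", {})
        elif event["type"] == "device_removed":
            self.devices.pop(event["id"], None)


monitor = InventoryMonitor()
monitor.on_event({"type": "device_added", "id": "vsw-1",
                  "attributes": {"role": "virtual-switch"}})
monitor.on_event({"type": "device_removed", "id": "vsw-1"})
```

The inventory is accurate within moments of each change, which is what a dynamic SDN requires.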
Let's look at a simple case of DevOps with respect to developing a new application. Developers need the ability to quickly test the application as development proceeds. These tests are typically done whenever the developer is ready to integrate new code into the application. A virtual test system needs to be quickly created and regression tests run. When the tests are complete, the test system resources need to be returned to the compute, storage, and networking pools. The VM controller, storage controller, and SDN controller all feature heavily in the process of creating and destroying the test system.
[Note: The word "destroy," when used with virtual networking infrastructure, really means that the resources used to implement it are reclaimed. The terms "reclaimed," "decommissioned," and "redeployed" could also be used. Reclaimed resources are returned to the pool from which future implementation requests are satisfied.]
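The create/test/reclaim cycle described above can be sketched as follows. The controller calls here are hypothetical stand-ins for real VM, storage, and SDN controller APIs; the point is that reclamation happens even when a test run fails.

```python
# A hypothetical sketch of the DevOps test lifecycle. The three functions are
# placeholders for real VM/storage/SDN controller operations.

def create_test_environment():
    # In practice: request resources from the VM, storage, and SDN controllers.
    return {"vms": ["test-vm-1"], "vnet": "test-net-1"}

def run_regression_tests(env):
    # Placeholder: a real suite would exercise the application here.
    return {"passed": True, "vnet": env["vnet"]}

def reclaim(env):
    # Return resources to the compute, storage, and networking pools.
    env.clear()


env = create_test_environment()
try:
    results = run_regression_tests(env)
finally:
    reclaim(env)  # resources are reclaimed even if the tests raise an error
```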
Mapping Physical Problems to Virtual Infrastructure
In the above scenario, what if there were a physical network problem that caused some of the tests to fail? The developer could mistakenly be hunting for code problems when the real problem was the test infrastructure. The SDN monitoring system needs to know when new infrastructure is brought into service and removed from service. It needs to keep data about the virtual infrastructure so that the DevOps team can look for problems when things don't run as expected.
An SDN based on an overlay model could include a monitor of the physical infrastructure that reports whenever it exhibits symptoms of a problem. A smart SDN controller should even be able to detect some types of physical problems and configure the virtual network to work around them - for example, detecting a bad uplink and shifting traffic to a redundant path where one exists.
This adaptability is similar to how the brain uses redundant neurons and pathways to work around brain damage. I wouldn't go so far as to call it a "self-healing network," though, because we will still need someone to physically replace the bad cable, defective switch, or misbehaving router. Nor can it handle failures where redundant infrastructure doesn't exist. I think that "an adaptable network" is the most accurate description.
Recording Virtual Infrastructure Configurations
We'll need to record the virtual infrastructure configuration, how it was implemented, what physical hardware was used, and who or what caused the infrastructure to be created and destroyed. This information will be critical to effectively troubleshoot problems. In a sense, tracking the network infrastructure will be much like tracking the locations of laptops in today's physical networks.
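The bookkeeping described above can be captured in a simple record: what was built, on which physical hardware, and who or what requested it. The field names below are illustrative assumptions, not a standard schema.

```python
# A sketch of the record an SDN monitoring system might keep for each piece of
# virtual infrastructure. All field names are illustrative.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class VirtualInfraRecord:
    name: str
    requested_by: str                # person or automation that triggered creation
    physical_elements: List[str]     # physical switches/links/hosts in use
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    destroyed_at: Optional[datetime] = None


record = VirtualInfraRecord(
    name="regression-net-42",
    requested_by="ci-pipeline",
    physical_elements=["leaf-3", "spine-1", "host-17"],
)
```

With records like this, a performance complaint about `regression-net-42` can be traced directly to `leaf-3`, `spine-1`, and `host-17`.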
Example: A virtual network and virtual servers are implemented for a business function. However, a physical link is experiencing problems - perhaps a duplex mismatch or a bad connector. The business function that relies on this infrastructure will experience problems. The IT team (server staff, application staff, and network staff) may think that the app didn't work correctly, perhaps pointing fingers at each other. Deployment of the same application later may work better, because the new virtual implementation is using different physical infrastructure.
Troubleshooting applications that have significant changes in performance is going to be the realm of the cross-functional experts. These will be the staff members who understand compute, storage, networking, and virtualization of each. I forecast that many IT teams won't understand enough about their applications and the virtual infrastructure to identify the root cause of many problems.
What they will understand is that redeploying the application may change how well it works. I predict that some organizations will develop the ultimate "three finger salute" (so-called because of the three fingers needed to press CTRL-ALT-DEL on a Microsoft Windows system to force a reboot). They will encounter a problem with an application, and instead of trying to understand the cause, they will destroy the virtual implementation and restart it, hoping that it works better the next time. Since it will sometimes work, their action will reinforce the behavior.
My point is that it is going to be important to know what physical infrastructure is used for each virtual infrastructure element so that problems that are exhibited in the virtual space can be mapped to the physical infrastructure and vice versa.
Avoid SDN Thrashing
I foresee a problem occurring that I'll call SDN Thrashing. It occurs when one part of the IT system causes a virtual infrastructure to be created, then something happens that causes it to be destroyed and the resources reclaimed. This cycle could continue until someone notices and stops it, or until an automation system identifies the thrashing and stops it.
A better solution is to incorporate checks that break the cycle after it repeats some small number of times. Consider a virtual infrastructure that, once created, generates enough traffic to congest a link. The high drop rate could cause the SDN control system to destroy the instance and try to recreate it on different physical infrastructure. But because the virtual infrastructure itself is causing the congestion, the system starts thrashing.
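The break-the-cycle check above can be sketched as a simple attempt cap: redeploy until the instance is healthy, and escalate to a human once the cap is hit. The `deploy` and `healthy` hooks are hypothetical stand-ins for controller operations and health checks.

```python
# A minimal sketch of capping redeploy attempts to stop SDN thrashing.
# deploy() and healthy() are hypothetical hooks into the control system.

def redeploy_with_limit(deploy, healthy, max_attempts=3):
    """Redeploy until healthy or the attempt cap is reached."""
    for attempt in range(1, max_attempts + 1):
        deploy()
        if healthy():
            return attempt  # number of attempts it took
    # Stop looping; a human (or smarter automation) must investigate.
    raise RuntimeError("redeploy limit reached - escalate to operators")
```

If the infrastructure itself is the cause of the failures, the loop terminates after `max_attempts` tries instead of thrashing indefinitely.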
The SDN monitoring system should allow the IT organization to see and understand what is happening. Some threshold may need to be modified or the application may need to be changed to reflect the volume of data that moves in a live implementation. In any case, the SDN monitoring system is the eyes into the problem, helping identify the factors that caused the virtual infrastructure to be created and destroyed.
Create a Baseline
The SDN monitoring system should capture a baseline of the virtual infrastructure immediately after its creation. This step validates that the infrastructure is capable of supporting the application. Server-to-server latency, database request latency, network bandwidth, and interface errors are statistics that should be recorded and verified against the application's requirements.
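A baseline check like the one described can be sketched as comparing measured values against per-metric requirements. The metric names and limits below are illustrative assumptions, chosen to match the statistics listed above.

```python
# A sketch of validating a freshly created virtual infrastructure against the
# application's requirements. Metric names and thresholds are illustrative.

def verify_baseline(measured, limits):
    """Return the list of metrics that fail their requirement."""
    failures = []
    for metric, (kind, threshold) in limits.items():
        value = measured[metric]
        # Latency and errors have maximums; bandwidth has a minimum.
        ok = value <= threshold if kind == "max" else value >= threshold
        if not ok:
            failures.append(metric)
    return failures


limits = {
    "server_latency_ms": ("max", 5.0),
    "db_request_ms":     ("max", 20.0),
    "bandwidth_mbps":    ("min", 1000.0),
    "interface_errors":  ("max", 0),
}
baseline = {"server_latency_ms": 2.1, "db_request_ms": 31.0,
            "bandwidth_mbps": 2000.0, "interface_errors": 0}
print(verify_baseline(baseline, limits))  # ['db_request_ms']
```

A non-empty failure list immediately after creation says the infrastructure, not the application, needs attention.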
As part of the baseline, the SDN monitoring system will also need to record the topology of the virtual infrastructure. The record should be the interconnection data needed to produce a drawing, not an image copy of the drawing itself. With the interconnection data, it is possible to recreate the drawing and move things around on it to provide better views. A drawing alone will have limited usefulness.
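Storing the topology as interconnection data might look like the sketch below; the node names are illustrative. The point is that any drawing, or any alternate view, can be regenerated from the link list, which a static image cannot support.

```python
# A sketch of topology stored as interconnection data rather than as a drawing.
# Node names are illustrative.

topology = {
    "links": [
        ("vm-web-1", "vswitch-1"),
        ("vswitch-1", "leaf-3"),   # physical leaf carrying the overlay
        ("leaf-3", "spine-1"),
    ]
}

def neighbors(topology, node):
    """Derive adjacency on demand - one of many views of the same data."""
    result = set()
    for a, b in topology["links"]:
        if a == node:
            result.add(b)
        elif b == node:
            result.add(a)
    return sorted(result)
```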
What are some of the problems that we anticipate a smart SDN monitoring system would identify?
- High latency paths that cause an application's goodput to be intolerably low. This may happen when part of the virtual infrastructure is constructed at a service provider's facilities (e.g., Amazon or Google virtual services).
- Network links that are exhibiting high errors. Because the errors appear in interface statistics, the cause could be a bad interface or a bad link; additional testing would be necessary to discern which.
- Congested links, causing high discard rates. Long-term discard rates might be marginally acceptable while burst discard rates could be unacceptable.
- Redundancy failures. When half of a redundant path fails, the application continues to run, using the redundant path. But if the failure isn't identified, reported, and corrected, a hard failure will occur when the redundant path fails. And it eventually will. We've seen cases where months passed between the first failure and the second failure of a redundant configuration.
- Link any problems that are detected in the physical infrastructure to the virtual infrastructure and the applications running therein.
- Who or what initiated a virtual infrastructure change?
- Does the infrastructure support the application? What did/does the topology look like, to facilitate troubleshooting?
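Several of the items above depend on the virtual-to-physical mapping discussed earlier. Given that mapping, linking a physical failure to the affected virtual infrastructures is a simple lookup; the mapping data below is illustrative.

```python
# A sketch of mapping a failed physical element to the virtual infrastructures
# (and thus applications) that depend on it. The mapping data is illustrative.

physical_usage = {
    "regression-net-42": {"leaf-3", "spine-1"},
    "prod-net-7":        {"leaf-5", "spine-1"},
}

def affected_virtual(usage, failed_element):
    """Virtual infrastructures that depend on the failed physical element."""
    return sorted(name for name, phys in usage.items()
                  if failed_element in phys)

print(affected_virtual(physical_usage, "spine-1"))
# ['prod-net-7', 'regression-net-42']
```

The same data answers the reverse question - which physical elements to inspect when a particular virtual infrastructure misbehaves.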
Hear more from Terry Slattery at Enterprise Connect Orlando 2015!