Terry Slattery
Terry Slattery is a senior network engineer with decades of experience in the internetworking industry. Prior to joining Chesapeake NetCraftsmen as...

Terry Slattery | January 06, 2015


Monitoring a Software Defined Network, Part 6

The dynamics of an SDN make it particularly challenging to monitor and manage.

Note: My discussion of SDN monitoring covers several topics. Here are the prior posts:

  1. Monitoring the SDN data plane
  2. Monitoring the SDN control system
  3. Should the SDN monitoring system be integrated with the controller?
  4. SDN offers an opportunity to re-design network monitoring
  5. Is a separate monitoring and management network needed for SDN?

The Agile SDN
One of the big advantages of Software Defined Networking is its agility: it can quickly adapt to changes in the compute and storage environment. But how does one go about monitoring a rapidly changing network environment? Most network management and monitoring systems periodically scan the network for new devices, perhaps as infrequently as once a day, or even less often. Such long scan intervals were fine when network infrastructure changed on a weekly or monthly basis. But when infrastructure can be deployed and reclaimed within a few hours in response to some business demand, the network monitoring system must be much more responsive.

Let's look at a simple case of DevOps with respect to developing a new application. Developers need the ability to quickly test the application as development proceeds. These tests are typically done whenever the developer is ready to integrate new code into the application. A virtual test system needs to be quickly created and regression tests run. When the tests are complete, the test system resources need to be returned to the compute, storage, and networking pools. The VM controller, storage controller, and SDN controller all feature heavily in the process of creating and destroying the test system.

[Note: The word "destroy," when used with the virtual networking infrastructure, really means that the resources that were used to implement the virtual network infrastructure are reclaimed for use in future infrastructure requirements. The terms "reclaimed," "decommissioned," and "redeployed" could also be used. When virtual resources are reclaimed, they are returned to the pool of available resources that are used for future implementation requests.]

Mapping Physical Problems to Virtual Infrastructure
In the above scenario, what if a physical network problem caused some of the tests to fail? The developer could mistakenly chase code problems when the real problem was the test infrastructure. The SDN monitoring system needs to know when new infrastructure is brought into service and when it is removed from service. It also needs to retain data about the virtual infrastructure so that the DevOps team can look for problems when things don't run as expected.

An SDN based on an overlay model could include a monitor of the physical infrastructure that reports whenever that infrastructure exhibits symptoms of a problem. A smart SDN controller should even be able to detect some types of physical problems and configure the virtual network to work around them. For example, it could detect a bad uplink and, where a redundant path exists, move the virtual network's traffic onto that path.

This adaptability is similar to how the brain uses redundant neurons and pathways to work around brain damage. I wouldn't go as far as saying that it is a "self-healing network" though, because we will still need someone to physically replace the bad cable, defective switch, or misbehaving router. It also wouldn't be able to handle failures where redundant infrastructure doesn't exist. I think that calling it "an adaptable network" is the most accurate description.
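The uplink work-around described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any controller's actual logic: the `Uplink` type, the per-link error rate, and the `MAX_ERROR_RATE` threshold are all hypothetical. Note that when no healthy redundant path exists, the sketch gives up rather than pretending to heal the network.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical health threshold: fraction of frames received with errors.
MAX_ERROR_RATE = 0.001

@dataclass(frozen=True)
class Uplink:
    name: str
    error_rate: float  # hypothetical metric derived from interface error counters

def choose_uplink(uplinks: list[Uplink]) -> Uplink | None:
    """Return the healthiest uplink, or None if no healthy path remains."""
    healthy = [u for u in uplinks if u.error_rate <= MAX_ERROR_RATE]
    if not healthy:
        # No redundant path to fall back on: the controller cannot work
        # around this failure, so a person must repair the hardware.
        return None
    return min(healthy, key=lambda u: u.error_rate)
```

For example, `choose_uplink([Uplink("eth1", 0.02), Uplink("eth2", 0.0)])` selects `eth2`, while a list containing only the failing `eth1` returns `None`. Either way, someone still has to replace the bad cable.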

Recording Virtual Infrastructure Configurations
We'll need to record the virtual infrastructure configuration, how it was implemented, what physical hardware was used, and who or what caused the infrastructure to be created and destroyed. This information will be critical to effectively troubleshoot problems. In a sense, tracking the network infrastructure will be much like tracking the locations of laptops in today's physical networks.
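A record of this kind might look like the following Python sketch. The `VirtualInfraRecord` class and its field names are hypothetical; the point is that one record captures the configuration, the physical hardware used, and who or what created and destroyed the instance, and that the record outlives the reclaimed resources.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class VirtualInfraRecord:
    """One virtual infrastructure instance and its history (hypothetical schema)."""
    instance_id: str
    config: dict                   # the requested virtual configuration
    physical_elements: list[str]   # switches, links, and hosts used to implement it
    created_by: str                # person or system that requested creation
    created_at: datetime
    destroyed_by: str | None = None
    destroyed_at: datetime | None = None

    def mark_destroyed(self, by: str, at: datetime | None = None) -> None:
        # Keep the record after the resources are reclaimed so that
        # later troubleshooting can still see what was used.
        self.destroyed_by = by
        self.destroyed_at = at or datetime.now(timezone.utc)

    @property
    def active(self) -> bool:
        return self.destroyed_at is None
```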

Example: A virtual network and virtual servers are implemented for a business function. However, a physical link is experiencing problems, perhaps a duplex mismatch or a bad connector. The business function that relies on this infrastructure will experience problems. The IT team (server staff, application staff, and network staff) may conclude that the application itself is at fault, perhaps pointing fingers at each other. A later deployment of the same application may work better, because the new virtual implementation uses different physical infrastructure.

Troubleshooting applications that have significant changes in performance is going to be the realm of the cross-functional experts. These will be the staff members who understand compute, storage, networking, and virtualization of each. I forecast that many IT teams won't understand enough about their applications and the virtual infrastructure to identify the root cause of many problems.

What they will understand is that redeploying the application may change how well it works. I predict that some organizations will develop the ultimate "three finger salute" (so called because of the three fingers needed to press CTRL-ALT-DEL on a Microsoft Windows system to force a reboot). They will encounter a problem with an application, and instead of trying to understand the cause, they will destroy the virtual implementation and restart it, hoping that it works better the next time. Since it will sometimes work, their action will reinforce the behavior.

My point is that it is going to be important to know what physical infrastructure is used for each virtual infrastructure element so that problems that are exhibited in the virtual space can be mapped to the physical infrastructure and vice versa.
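A minimal sketch of such a two-way mapping, in Python, is shown below. The `VirtualPhysicalIndex` class and the element names are hypothetical; a real system would populate the index from controller events as virtual elements are created and reclaimed.

```python
from collections import defaultdict

class VirtualPhysicalIndex:
    """Maps virtual elements to physical elements and back (illustrative only)."""

    def __init__(self) -> None:
        self._v2p = defaultdict(set)  # virtual element -> physical elements
        self._p2v = defaultdict(set)  # physical element -> virtual elements

    def bind(self, virtual: str, physical: str) -> None:
        self._v2p[virtual].add(physical)
        self._p2v[physical].add(virtual)

    def unbind_virtual(self, virtual: str) -> None:
        # Called when a virtual element is destroyed and its resources reclaimed.
        for physical in self._v2p.pop(virtual, set()):
            self._p2v[physical].discard(virtual)
            if not self._p2v[physical]:
                del self._p2v[physical]

    def physical_for(self, virtual: str) -> set:
        # Virtual symptom -> physical suspects.
        return set(self._v2p.get(virtual, ()))

    def virtual_on(self, physical: str) -> set:
        # Physical fault -> affected virtual infrastructure.
        return set(self._p2v.get(physical, ()))
```

With this index, a failing physical link can be mapped to every virtual network that rides on it, and a misbehaving virtual network can be mapped to the hardware worth testing first.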

Avoid SDN Thrashing
I foresee a problem that I'll call SDN Thrashing. It occurs when one part of the IT system causes a virtual infrastructure to be created, and then something else causes it to be destroyed and its resources reclaimed. This cycle could repeat until someone notices and stops it, or until an automation system identifies the thrashing and stops it.

A better solution is to incorporate checks that break the cycle after it repeats some small number of times. Consider an example: a virtual infrastructure is created, and its traffic congests a link. The resulting high drop rate could cause the SDN control system to destroy the instance and recreate it using different physical infrastructure. But because the virtual infrastructure itself is causing the congestion, the new instance overloads its new links as well, and the system starts thrashing.
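One way to implement a check that breaks the cycle after a small number of repeats is a sliding-window counter keyed by the virtual infrastructure request. The Python sketch below assumes that; the `ThrashGuard` name and the default of three cycles per hour are arbitrary placeholders, not recommended values.

```python
from __future__ import annotations
from collections import deque

class ThrashGuard:
    """Blocks automatic re-creation after too many destroy/recreate cycles
    within a time window (hypothetical thresholds)."""

    def __init__(self, max_cycles: int = 3, window_s: float = 3600.0) -> None:
        self.max_cycles = max_cycles
        self.window_s = window_s
        self._cycles: dict[str, deque] = {}

    def record_cycle(self, instance_key: str, now: float) -> bool:
        """Record one destroy-and-recreate cycle at time `now` (seconds).

        Returns True if automation may recreate the instance again, or
        False if it should stop and escalate to a person.
        """
        times = self._cycles.setdefault(instance_key, deque())
        times.append(now)
        # Forget cycles that fall outside the sliding window.
        while times and now - times[0] > self.window_s:
            times.popleft()
        return len(times) <= self.max_cycles
```

With the defaults, cycles at 0, 600, 1200, and 1800 seconds return True, True, True, and False: the fourth cycle within an hour halts the automation. When it does, the monitoring system should present the history so that someone can decide whether a threshold or the application needs to change.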

The SDN monitoring system should allow the IT organization to see and understand what is happening. Some threshold may need to be modified or the application may need to be changed to reflect the volume of data that moves in a live implementation. In any case, the SDN monitoring system is the eyes into the problem, helping identify the factors that caused the virtual infrastructure to be created and destroyed.

Create a Baseline
The SDN monitoring system should capture a baseline of the virtual infrastructure immediately after its creation. This step validates that the infrastructure is capable of supporting the application. Server-to-server latency, database request latency, network bandwidth, and interface errors are statistics that should be recorded and verified against the application's requirements.
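Verifying a baseline against the application's requirements amounts to a set of comparisons. The following Python sketch assumes that the requirements are expressed as simple limits on the four statistics named above; the class names, field names, and units are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirements:
    max_server_latency_ms: float
    max_db_latency_ms: float
    min_bandwidth_mbps: float
    max_interface_errors: int

@dataclass(frozen=True)
class Baseline:
    server_latency_ms: float
    db_latency_ms: float
    bandwidth_mbps: float
    interface_errors: int

def check_baseline(b: Baseline, r: Requirements) -> list:
    """Return a list of violations; an empty list means the baseline passes."""
    problems = []
    if b.server_latency_ms > r.max_server_latency_ms:
        problems.append(f"server-to-server latency {b.server_latency_ms} ms exceeds {r.max_server_latency_ms} ms")
    if b.db_latency_ms > r.max_db_latency_ms:
        problems.append(f"database request latency {b.db_latency_ms} ms exceeds {r.max_db_latency_ms} ms")
    if b.bandwidth_mbps < r.min_bandwidth_mbps:
        problems.append(f"bandwidth {b.bandwidth_mbps} Mbps below {r.min_bandwidth_mbps} Mbps")
    if b.interface_errors > r.max_interface_errors:
        problems.append(f"{b.interface_errors} interface errors exceed limit of {r.max_interface_errors}")
    return problems
```

Storing the baseline alongside the result of the check gives the troubleshooting team a known-good reference point for later comparison.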

As part of the baseline, the SDN monitoring system will also need to record the topology of the virtual infrastructure. The record should be the interconnection data needed to produce a drawing, not an image copy of the drawing itself. With the interconnection data, it is possible to recreate the drawing and move things around on it to provide better views. A drawing alone will have limited usefulness.
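To illustrate the difference between interconnection data and a drawing, the Python sketch below stores the topology as a list of links and regenerates a drawing description on demand. Rendering to the Graphviz DOT language is an assumption chosen for illustration; the `Link` type and the node names are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    """One connection in a virtual topology (hypothetical schema)."""
    a: str
    b: str

def to_dot(name: str, links: list[Link]) -> str:
    """Render the interconnection data as Graphviz DOT text."""
    lines = [f'graph "{name}" {{']
    for link in links:
        lines.append(f'  "{link.a}" -- "{link.b}";')
    lines.append("}")
    return "\n".join(lines)

def neighbors(node: str, links: list[Link]) -> set[str]:
    """Query the stored data directly, which an image of a drawing cannot support."""
    return {link.b for link in links if link.a == node} | {
        link.a for link in links if link.b == node
    }
```

Because the record is data rather than a picture, the same links can be laid out differently, filtered to one segment, or queried, as `neighbors` does.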

What are some of the problems that we anticipate a smart SDN monitoring system would identify?
  • High-latency paths that cause an application's goodput to be intolerably low. This may happen when part of the virtual infrastructure is constructed at a cloud provider's facilities (e.g., Amazon or Google virtual services).
  • Network links that are exhibiting high error rates. Because the errors show up in interface statistics, the cause could be either a bad interface or a bad link. Additional testing would be necessary to discern which.
  • Congested links, causing high discard rates. Long-term discard rates might be marginally acceptable while burst discard rates could be unacceptable.
  • Redundancy failures. When half of a redundant path fails, the application continues to run, using the redundant path. But if the failure isn't identified, reported, and corrected, a hard failure will occur when the redundant path fails. And it eventually will. We've seen cases where months passed between the first failure and the second failure of a redundant configuration.
  • Problems detected in the physical infrastructure, linked to the virtual infrastructure and the applications running on it.
  • Who or what initiated a virtual infrastructure change?
  • Does the infrastructure support the application? What did/does the topology look like, to facilitate troubleshooting?
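The gap between long-term and burst discard rates mentioned in the list above is easy to see with a little arithmetic. The Python sketch below assumes per-interval counter deltas for discarded and total packets, polled at a fixed interval; the function name and the burst window size are illustrative.

```python
def discard_rates(discards: list, packets: list, burst_window: int) -> tuple:
    """Return (long-term rate, worst burst rate) from per-interval counts.

    discards[i] and packets[i] are the counts for sampling interval i;
    burst_window is the number of consecutive intervals treated as one burst.
    """
    if len(discards) != len(packets) or not packets:
        raise ValueError("need equal-length, non-empty sample lists")
    if burst_window < 1:
        raise ValueError("burst_window must be at least 1")
    total = sum(packets)
    long_term = sum(discards) / total if total else 0.0
    worst = 0.0
    # Slide the burst window across the samples and keep the worst ratio.
    for i in range(len(packets) - burst_window + 1):
        p = sum(packets[i:i + burst_window])
        if p:
            worst = max(worst, sum(discards[i:i + burst_window]) / p)
    return long_term, worst
```

For example, ten intervals of 1,000 packets each, with 50 discards in a single interval, give a long-term rate of 0.5% but a worst one-interval rate of 5%. An average alone would hide the burst.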
There are a lot of new things that SDN monitoring will need to do that are not done in today's network management systems. It will be very interesting to see how the industry handles the transition.

Hear more from Terry Slattery at Enterprise Connect Orlando 2015!