I had an interesting conversation with a vendor a few months ago at Enterprise Connect Orlando. I am naturally interested in network problems that affect applications, including Voice and Video, so I was talking with one of the vendors of a Voice/Video management tool. The product manager was showing me how their product worked for detecting and diagnosing a problem on a phone.
What’s Wrong With This Picture?
Learning how a product works for diagnosing a single endpoint problem is OK. There will always be cases where you need a tool that collects and analyzes the data necessary to diagnose individual problems. However, I tend to work with networks that are pretty large. Diagnosing individual problems is like missing the forest because you’re looking at individual trees. In the larger networks, I could get totally consumed with individual problems and miss the fact that there are general, systemic problems that are affecting many endpoints. If I can identify the systemic problems, I can improve the service for many endpoints by addressing those problems.
Let's take the case where an infrastructure link to a remote site has incorrect QoS configured. I may have several trouble tickets about poor voice quality and dropped calls at the location serviced by that link. I can work on each ticket individually and may eventually determine that all the tickets are related to a QoS problem. I may even figure that out while working on the first ticket. But then I'll still need to check each of the other trouble tickets to make sure that it isn't caused by a different problem from the one I've already solved.
Finally, when I'm working from trouble tickets, the problems are already affecting the customers. It would be much better for them and for me if I could proactively determine that a problem exists that is affecting multiple endpoints, and address it before the customers call the help desk and file trouble tickets that we then have to process. It is all about efficiency: reducing the cost of running the help desk and increasing the productivity of the customers who are using the voice/video systems.
There are other factors that come into play. I may be working on problems that are less important than other problems. Let's say that I have a trouble ticket from a manager and another ticket from a call center employee, both complaining about poor call quality of their phones. Which ticket is most important to the business? It might be the manager in some cases. But in other cases, an entire call center may be experiencing a problem and only one person called to report it. Customers calling into this call center may be dropped or may terminate their calls due to poor call quality. If this is an order placement call center, it is the revenue generation part of the business and should probably have priority over fixing the manager's phone.
Please Give Me System Views
In our conversation at the show, I asked the product manager to show me a more global view of the endpoint performance. He initially didn't understand what I wanted. I explained that I wanted to generate a report that showed me all the endpoints that had poor call quality, grouped by common criteria, such as subnet address or regional location. It took a while to explain why I wanted this grouping. The product manager hadn't any experience running a very large network and hadn't thought about how the tools should operate in large-scale networks where there are thousands or tens of thousands of endpoints.
It is a problem that I see with nearly all network management tools. The product managers do not have experience with network management in a big network. They specify functionality that works for individual tasks or for small scales. When the functions are deployed in a large network, they don't work well. Possibly the worst case I've seen was a product that took over 30 minutes to produce an interface error report. Well, maybe it isn't the worst case; it did produce a report. Some functions don't work at all at large scale.
As I explained to the product manager, I wanted to identify groups of endpoints that had common problems. For example, I wanted to identify the set of phones that reported high jitter and packet loss, grouped according to subnet or by CIDR block. (Note: Most networks use some sort of geographic addressing by CIDR block, which allows for route summarization that provides network route stability. Being able to group network devices by CIDR blocks for the purposes of reporting leverages the summarization in ways that the routing architects probably didn't envision, but that provide real benefits for network management.) An example regional report is shown below.
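To make the grouping idea concrete, here is a minimal sketch in Python of how per-endpoint quality records could be rolled up by CIDR block. It is not taken from any product. The record layout, the thresholds, and the Washington, DC and Boca Raton address blocks are assumptions made up for illustration; only the Chicago block (10.1.16.0/22) comes from this article.

```python
import ipaddress
from collections import defaultdict

# Site-to-CIDR mapping. Only the Chicago block appears in the article;
# the other two blocks are hypothetical placeholders.
SITE_BLOCKS = {
    "Chicago": ipaddress.ip_network("10.1.16.0/22"),
    "Washington, DC": ipaddress.ip_network("10.1.32.0/22"),  # hypothetical
    "Boca Raton": ipaddress.ip_network("10.1.48.0/22"),      # hypothetical
}

# Assumed thresholds for "high" loss and jitter; tune these for a real network.
LOSS_PCT_LIMIT = 1.0
JITTER_MS_LIMIT = 30.0


def site_for(address):
    """Return the site whose CIDR block contains the address, or 'Unknown'."""
    ip = ipaddress.ip_address(address)
    for site, block in SITE_BLOCKS.items():
        if ip in block:
            return site
    return "Unknown"


def system_view(records):
    """Count affected endpoints per site.

    records: iterable of (ip_address, loss_pct, jitter_ms, dropped_calls),
    one tuple per endpoint (an assumed layout).
    Returns a list of (site, counts) sorted by high-loss endpoints, descending.
    """
    report = defaultdict(
        lambda: {"high_loss": 0, "high_jitter": 0, "dropped_calls": 0}
    )
    for address, loss_pct, jitter_ms, dropped_calls in records:
        row = report[site_for(address)]
        if loss_pct > LOSS_PCT_LIMIT:
            row["high_loss"] += 1
        if jitter_ms > JITTER_MS_LIMIT:
            row["high_jitter"] += 1
        if dropped_calls > 0:
            row["dropped_calls"] += 1
    return sorted(
        report.items(), key=lambda item: item[1]["high_loss"], reverse=True
    )
```

Feeding `system_view` the per-endpoint statistics that a call manager or monitoring tool already collects would yield one row per site, with the worst packet-loss offender first. Matching addresses against summarized blocks is what lets the report reuse the addressing plan instead of a separately maintained endpoint-to-site inventory.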
Using the System View Report
The report is currently sorted in descending order on the Packet Loss column. Chicago is the greatest offending site with 216 endpoints reporting high packet loss. We can also see that high Jitter is being reported by many of the endpoints, most of which are probably the same endpoints that are reporting high Loss. The Loss seems to be driving Dropped Calls too, because that figure is also high.
I would want to drill into the Chicago site data to determine if the entire site is affected, which would indicate to me that the network connection to the site is a likely place to start my investigation. If the Chicago data tells me that it is a particular subnet within the 10.1.16.0/22 address space, I can limit my investigation even further.
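The drill-down could work along these lines. This is a sketch only: it assumes the site block is carved into /24 subnets (the article doesn't say how 10.1.16.0/22 is subdivided) and that the tool can hand us the list of addresses reporting high packet loss. The function name and the sample addresses are hypothetical.

```python
import ipaddress

# The Chicago address space named in the article.
CHICAGO = ipaddress.ip_network("10.1.16.0/22")


def loss_by_subnet(high_loss_addresses, block=CHICAGO, new_prefix=24):
    """Count high-loss endpoints in each subnet of a site block.

    The /24 subnet size is an assumption; use whatever prefix length the
    site's addressing plan actually uses. Addresses outside the block are
    ignored. Returns {subnet_as_string: endpoint_count}, including
    subnets with a count of zero.
    """
    subnets = list(block.subnets(new_prefix=new_prefix))
    counts = {str(subnet): 0 for subnet in subnets}
    for address in high_loss_addresses:
        ip = ipaddress.ip_address(address)
        for subnet in subnets:
            if ip in subnet:
                counts[str(subnet)] += 1
                break
    return counts


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    affected = ["10.1.17.5", "10.1.17.9", "10.1.18.1"]
    for subnet, count in loss_by_subnet(affected).items():
        print(subnet, count)
```

If the counts are spread evenly across every subnet, the site's network connection is the likely suspect. If they cluster in one subnet, the investigation can start at the access switch or VLAN serving it.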
As a Tier 2 or Tier 3 support manager, I would use the report to allocate my staff. I might assign a more experienced engineer to investigate the Chicago problem, even if there are no trouble tickets from Chicago. The Washington, DC and Boca Raton facilities might get assigned to a more junior engineer to investigate, with mentoring and direction from the more experienced engineers.
After using the System Views reports for a few months, I would expect that the number of problems would be significantly lower because the technical team is proactively identifying and correcting system-level problems. The help desk then has a reduced workload and is able to focus on problems that affect individual endpoints.
Systems Management, Not Element Management
I encourage vendors to look at how their systems work and to make sure that there are ways to use the collected data to provide a Systems Level View of the network and endpoints. Virtualization is increasing the need for this type of management and reporting, because the number of endpoints is increasing. Systems Level Views allow us to scale up without increasing our support staff or increasing their load.
I'm sorry to say that the example report that I show above is something that I made up. I've not seen any tools that provide this type of reporting. There may be some--if you know of any, please leave a comment to let us know where you found it.