What UC Monitoring and Trees Falling in the Forest Have in Common
You've probably heard the old saying, "If a tree falls in the forest, and no one is around, does it make a sound?" Let's apply it to network applications.
The problem with monitoring real, live application traffic is that it can't identify problems when there is no traffic to observe. Adapting the saying to IT:
If an application faults in an enterprise network, but no one is using it, is it a problem?
I maintain that it is a problem, and that it should make a sound, both figuratively and literally.
Wireshark on Steroids
One way to monitor applications is to use an application performance management (APM) system. This is a class of network management and monitoring systems that collects and analyzes network traffic, identifies each application, and identifies problems with those applications. Identifying each application typically requires some initial assistance from the network administrator, who configures the protocol (TCP or UDP), the port numbers, and the IP addresses of the application servers.
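As a sketch of what that administrator-supplied configuration drives, the flow-to-application matching might look like the following in Python. The application names, addresses, and port ranges here are hypothetical examples, not any particular APM product's schema:

```python
# Hypothetical administrator-supplied application definitions:
# (application name, protocol, server IPs, server ports)
APP_RULES = [
    ("voice-signaling", "tcp", {"10.0.5.10", "10.0.5.11"}, {5061}),
    ("voice-media",     "udp", {"10.0.5.10", "10.0.5.11"}, set(range(16384, 32768))),
    ("crm-web",         "tcp", {"10.0.8.20"},              {443}),
]

def classify_flow(proto: str, server_ip: str, server_port: int) -> str:
    """Match a flow's server-side tuple against the configured rules.

    Returns the application name, or 'unknown' if no rule matches.
    """
    for name, rule_proto, ips, ports in APP_RULES:
        if proto == rule_proto and server_ip in ips and server_port in ports:
            return name
    return "unknown"
```

A real APM system layers deeper packet inspection on top of this kind of tuple matching, but the initial configuration burden is essentially what this sketch shows.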
The APM system collects traffic directly from the network, typically through packet brokers. Packet brokers collect packets, filter out unimportant traffic, and feed the remaining "interesting" traffic to network management systems and security systems. A convenient way to think of the combination of APM and packet brokers is as a giant packet capture and analysis system. I call such systems "Wireshark on steroids." They can perform analysis that would take many hours in a basic packet capture application like Wireshark.
However, APM systems have a downside: they rely on real application traffic to determine when a problem exists. An application can fail because a front-end or back-end server crashes, a load balancer is misconfigured, a network element fails, or any number of other causes. If no one is using the application when the failure occurs, the IT organization is delayed in recognizing, and therefore in correcting, the problem. The ideal monitoring solution will include mechanisms for early problem detection and reporting.
Active Path Testing and Application-Level Pings
I prefer to augment APM systems with active path testing tools. These tools create synthetic traffic that IT can use to identify problems at times when applications are not in use. Some of these tools simply generate traffic between probes added to the network. An IP pinger that only checks reachability is a good example. It tests network connectivity only at the IP level and cannot detect any problem with the application itself. One advantage of a reachability pinger is that all network devices must implement a ping (ICMP echo) reply mechanism, so no additional software or hardware is required.
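A reachability pinger of this kind can be sketched in a few lines of Python by shelling out to the system `ping` utility. The `-c`/`-W` flags assume a Linux-style iputils ping; other platforms use different options:

```python
import subprocess

def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send a single ICMP echo request via the system `ping` utility.

    Returns True if a reply was received. Note this proves only
    IP-level reachability; it says nothing about whether the
    application running on the host is healthy.
    """
    try:
        result = subprocess.run(
            # -c 1: one echo request; -W: reply timeout in seconds (Linux)
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False
```

A probe would call this on a schedule and raise an alert on consecutive failures; the scheduling and alerting plumbing is omitted here.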
The better active path testing tools will have a means for creating true synthetic transactions. I call these transactions "application-level pings" because they frequently only elicit a response from the application server and seldom do any real work. Of course, it helps if the transaction exercises all the back-end functions required to evaluate the application's ability to function. Some of these systems can even make real phone calls to test the UC infrastructure. Other systems simply transport UDP packets that look like voice traffic.
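As an illustration, here is a minimal application-level ping against a hypothetical HTTP health endpoint, timing the full request/response exchange. A real UC deployment would substitute a transaction suited to its protocol (a SIP OPTIONS exchange is a common choice for voice infrastructure); the URL below is an assumption for the sketch:

```python
import time
import urllib.request
import urllib.error

def app_level_ping(url: str, timeout_s: float = 5.0):
    """Issue a lightweight synthetic transaction and time it.

    Returns (ok, elapsed_seconds). Unlike an ICMP ping, a failure
    here means the application itself did not answer correctly,
    even if the server is reachable at the IP level.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        ok = False
    return ok, time.monotonic() - start
```

The elapsed time is as valuable as the pass/fail result: it feeds the response-time baselines discussed later in this article.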
The combination of active path testing, true synthetic transactions, and application performance management can provide the best of both worlds. The synthetic transactions provide traffic that performs just like the clients, although for a subset of the typical workload. The APM system can then provide alerts if any part of the application is not performing as required.
Note that active path testing with synthetic transactions comes with a cost. It creates an additional load on the application infrastructure. During normal business hours, it may be desirable to just rely on APM, if it is available, since there should be plenty of normal application traffic. But after normal business hours, when there is not likely to be much application traffic, it is important to generate synthetic transactions that continue to verify that the application is working correctly.
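The scheduling trade-off described above reduces to a small policy decision. A sketch, where the business hours are an assumed example and would be tuned per site:

```python
from datetime import time, datetime

# Hypothetical business hours; adjust for the site's schedule.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)

def should_run_synthetic(now: datetime, apm_available: bool) -> bool:
    """Decide whether to generate synthetic transactions right now.

    Run them whenever APM cannot cover for us: either APM is not
    deployed at all, or it is outside business hours and there is
    too little real traffic for APM to observe.
    """
    in_business_hours = (
        BUSINESS_START <= now.time() < BUSINESS_END and now.weekday() < 5
    )
    return (not apm_available) or (not in_business_hours)
```

Some sites prefer to run synthetic tests around the clock at a reduced rate instead; the point is that the load they add should be a deliberate choice, not an accident.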
One of the most important functions of continuous active path testing is to create baselines of application performance. When a problem occurs, it is very useful to understand what network paths the application used when it was working and how long it took for transactions to execute. Even if APM is not deployed, active path testing with synthetic transactions can report on total application response time and generate an alert when the response time differs from the baseline by a significant amount.
Keeping a historical record of baseline performance of applications is good practice. Let's say that something is increasing the response time of an application a little bit at a time, over the course of a year. Without a record of the baseline from over a year ago, quantifying what the application response used to be is impossible. I like to be able to show trends over time, so keeping a baseline from each month for at least the last 13 months is extremely useful. Most network management systems will have at least this level of storage. Disk space is so inexpensive these days that I see no reason to keep less data.
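A minimal sketch of this baselining idea follows: it keeps the last 13 monthly summaries and flags a measurement that deviates significantly from them. The three-sigma threshold is an assumption for the example, not a rule from this article:

```python
from collections import deque
from statistics import mean, stdev

class ResponseBaseline:
    """Rolling monthly baseline of application response times.

    Keeps the last 13 monthly summaries so year-over-year trends
    remain visible, per the retention suggestion in the text.
    """

    def __init__(self, months_kept: int = 13):
        # deque with maxlen silently drops the oldest month
        self.monthly = deque(maxlen=months_kept)

    def record_month(self, avg_response_s: float) -> None:
        """Store one month's average response time."""
        self.monthly.append(avg_response_s)

    def is_anomalous(self, response_s: float, n_sigmas: float = 3.0) -> bool:
        """Flag a measurement more than n_sigmas from the baseline mean."""
        if len(self.monthly) < 2:
            return False  # not enough history to judge
        mu, sigma = mean(self.monthly), stdev(self.monthly)
        return abs(response_s - mu) > n_sigmas * max(sigma, 1e-9)
```

A production system would also track response-time distributions and network paths per transaction type, but even this simple mean-and-deviation record answers the "what did it used to look like?" question.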
Benefits of Continuous Monitoring
Being on the receiving end of a 3 a.m. call about an application that's having problems is never pleasant. However, having more time to diagnose a problem is less stressful than first learning about it, and having to fix it, during the business day. In most cases, having fewer clients using the application is beneficial. For example, if a load balancer needs to be rebooted, doing that during a period of little or no client activity is best.
What about expected failures when part of the network or application infrastructure is down for maintenance? These are great opportunities to verify that the monitoring system is functioning correctly and generating alerts when the maintenance function starts. Part of the maintenance workflow planning should be to identify the alarms that should happen when a part of an IT system is taken down. If the monitoring system supports automated alert clear functions, then make sure that happens when the IT systems are brought back up. Some monitoring systems incorporate alarm suppression mechanisms for maintenance windows, but I prefer to see the alerts fire so I know they are being generated and are working correctly.
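The expected-alarm check in that maintenance workflow amounts to simple set arithmetic. A sketch, with hypothetical alarm identifiers:

```python
def check_maintenance_alarms(expected: set, observed: set):
    """Compare planned alarms against what the monitor actually raised.

    Returns (missing, unexpected):
      - missing: alarms that should have fired but didn't, meaning
        the monitoring system failed to detect a known outage
      - unexpected: alarms outside the plan, meaning something else
        broke during the maintenance window
    """
    missing = expected - observed
    unexpected = observed - expected
    return missing, unexpected
```

Run this when the window opens (expecting the planned alarms) and again after restoration (expecting them to clear); either direction failing is itself a finding about the monitoring system.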
I've only mentioned UC once. Applications that formerly were standalone are being integrated with other applications, some of which use voice, video, and collaboration. The UC-enabled application is now commonplace and application programming interfaces are everywhere. In this case, the easiest approach may be to configure monitoring of each major component of an integrated application. Regardless of the actual implementation, the key components of application monitoring are: establishing a baseline, continuous monitoring, and automatic alert generation.