Converged Networks: Monitoring
Dependence upon providers and SLAs without accountability won't resolve issues, and network performance won't improve. For those without the necessary tools, time to resolve won't improve, costs will go up, and user satisfaction will go down.
The benefits of network monitoring outweigh the costs, as a recent experience shows. A colleague of mine set up network monitoring for a customer plagued by daily issues, while the managed service and hosting providers did little to resolve either the core problems or the day-to-day concerns.
One of the remote sites experienced what they described as being "disconnected from the applications," and the log revealed the VPN to the data center had failed with a "policy mismatch"--but was it really a policy mismatch?
Both the data center and the onsite router configurations were confirmed. The log showed nearly two minutes of the VPN failing and eventually reestablishing itself. The security association lifetime was set to 8 hours, and the Cisco ASA expert I consulted confirmed that "If your phase 1 and phase 2 timers are identical and both go down at the same time, that could cause a longer re-key time [like] what you are finding in the logs"--nearly two minutes. The real clue came from the customer, who stated that the failures seemed to occur at almost the same time each day.
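That time-of-day clue is something you can check directly in the logs. Here is a minimal sketch that groups VPN-failure entries by hour of day; the log format, timestamps, and "VPN-DOWN" marker are invented examples, not any particular vendor's syslog layout. If one hour dominates the counts, the failures are periodic (pointing at a lifetime-based re-key, such as an 8-hour SA expiring) rather than random policy errors.

```python
# Sketch: do VPN failures cluster at the same time of day?
# Log format and field positions below are assumptions for illustration.
from collections import Counter
from datetime import datetime

log_lines = [
    "2023-04-03 14:02:11 VPN-DOWN policy mismatch",
    "2023-04-04 14:01:58 VPN-DOWN policy mismatch",
    "2023-04-05 14:02:30 VPN-DOWN policy mismatch",
]

def failure_hours(lines):
    """Count VPN-DOWN events per hour of day."""
    hours = Counter()
    for line in lines:
        if "VPN-DOWN" in line:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            hours[ts.hour] += 1
    return hours

counts = failure_hours(log_lines)
# One dominant hour suggests a periodic re-key, not a true policy mismatch.
print(counts.most_common(1))  # -> [(14, 3)]
```

A "policy mismatch" message that always lands in the same hour is worth checking against the configured SA lifetimes before anyone rewrites the crypto policy.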
Another issue, at a different remote location, was identified as a "high talker." After drilling down into the monitoring service's reports, the workstation was identified and removed from the network; as soon as it was removed, user services were restored. We also found a history of after-hours high-traffic alerts for that workstation. Was this an exploit?
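The "drill down" here amounts to ranking hosts by traffic volume and noting which ones are busiest outside business hours. A minimal sketch, assuming a simplified flow record of source address, byte count, and hour (not any real NetFlow schema):

```python
# Sketch: rank hosts by total traffic and flag after-hours volume.
# The flow-record layout here is an invented, simplified example.
from collections import defaultdict

flows = [
    {"src": "10.1.4.22", "bytes": 9_500_000, "hour": 2},
    {"src": "10.1.4.22", "bytes": 8_200_000, "hour": 3},
    {"src": "10.1.4.7",  "bytes": 1_100_000, "hour": 10},
]

def top_talkers(records, after_hours=range(0, 7)):
    """Return (host, total_bytes, after_hours_bytes), busiest first."""
    totals = defaultdict(int)
    off_hours = defaultdict(int)
    for r in records:
        totals[r["src"]] += r["bytes"]
        if r["hour"] in after_hours:
            off_hours[r["src"]] += r["bytes"]
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [(host, totals[host], off_hours[host]) for host in ranked]

for host, total, off in top_talkers(flows):
    print(host, total, off)
```

A host whose volume is both dominant and mostly after-hours is exactly the kind of machine worth pulling off the network and inspecting for compromise.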
Then, users at all locations complained at the same time that they were "disconnected from the network"--so did the network fail at all locations simultaneously? How would you know? There were no network outages, and all users could still access the public Internet; what did occur was a massive "lock up" of resources on the hosted VMware in the data center. Why? We had been asking that same question for the past couple of years under the previous network configuration.
The real issue wasn't a network failure at all, but the old, repeated, and ignored message displayed to all users at the times of failure: "You have exceeded your concurrent sessions limit." Once the server licenses were upgraded in the hosted data center, that issue went away--and users then complained of a new problem: that their computers or tablets seemed to lock up. Some received a new pop-up stating "Your maximum number of logins exceed the limit," and we replicated the problem by logging in to the application as a user.
Later, when the customer reported that voice quality between offices was totally unacceptable, I logged into the router to view the voice quality monitoring reports. Voice traffic was routing over a backup route, not MPLS. Why? The data center had removed the BGP routes for the "old voice" VLANs, assuming the new voice VLANs would be active. Some calls were also found on the wrong routes, crossing GRE tunnels instead of the MPLS network.
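This failure mode is easy to catch if you compare the voice prefixes that should ride the MPLS network against what the routing table actually reports. A minimal sketch, with invented prefixes and path labels standing in for parsed routing-table output:

```python
# Sketch: flag voice-VLAN prefixes that are not using the MPLS path.
# Prefixes and path names are hypothetical examples, not real config.
EXPECTED_MPLS_PREFIXES = {"10.20.30.0/24", "10.20.31.0/24"}

# Parsed routing-table entries: prefix -> path currently in use.
active_routes = {
    "10.20.30.0/24": "MPLS",
    "10.20.31.0/24": "GRE-backup",  # fell back to the tunnel
}

def misrouted(expected, routes):
    """Return expected MPLS prefixes currently on some other path."""
    return sorted(p for p in expected if routes.get(p) != "MPLS")

print(misrouted(EXPECTED_MPLS_PREFIXES, active_routes))  # -> ['10.20.31.0/24']
```

Run as a scheduled check, a comparison like this would have flagged the removed BGP routes the moment the "old voice" VLANs stopped being advertised, instead of waiting for users to report bad call quality.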
You can't count on luck, and you definitely cannot rely on any service provider--including the carriers that manage it all for you. The longer issues remain, the costlier they become. Service disruption is expensive, and customers are left with negative feelings and opinions. Productivity drops, and billable time for companies that are stuck-in-stupid climbs far beyond what I would consider the norm.
What I can't imagine is operating without monitoring and reporting services. For my company, longer disruptions translate into invoices with more billable time.