I use the term “active path testing” to refer to systems that send packets into the network to measure network characteristics. A simple ping test is the most fundamental active path test, but it doesn’t scale up to large networks very well. Active path testing is valuable because it directly measures network performance instead of inferring a level of performance from network monitoring data like SNMP. The important characteristics are packet loss, latency, and jitter.
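To make the three characteristics concrete, here is a minimal sketch of how one round of probe results could be reduced to loss, latency, and jitter. The function name, the input format (one round-trip time per probe, or None for a lost probe), and the choice of jitter as the mean difference between consecutive replies are all assumptions for illustration; real tools differ in how they define jitter.

```python
from statistics import mean

def summarize_probes(rtts_ms):
    """Summarize one round of probes.

    rtts_ms holds one entry per probe sent: the round-trip time in
    milliseconds, or None if no reply arrived before the timeout.
    """
    received = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    latency_ms = mean(received) if received else None
    # Jitter here is the mean absolute difference between consecutive replies.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else None
    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}

# Hypothetical sample: five probes sent, the third one lost.
print(summarize_probes([10.0, 12.0, None, 11.0, 15.0]))
```

A monitoring system would compute a summary like this per path and per interval, then alarm when any of the three values crosses a threshold.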
Network testing between endpoints offers several advantages that other monitoring methods can’t match. A benefit of full-path testing is the inclusion of the last hop on both ends of a communication path. We want the ability to detect common problems that cause packet errors, like duplex mismatch, old cabling, and bad connectors. Another benefit of testing to the endpoints is the ability to validate the TCP/IP stack and interface operation. Ideally, the active testing system generates packets that are indistinguishable from real application traffic, enabling evaluation of access-control lists and load balancers in the same way as a real application.
Active path testing involves a tradeoff: it adds traffic to the network. If the testing generates too much traffic, it can negatively impact the end systems and potentially the applications on those end systems, either client or server. Modern networks and servers have enough capacity to handle several test probes per second, which is sufficient to detect common problems.
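A quick back-of-the-envelope calculation shows how small the added load can be. The probe rate, packet size, and path count below are hypothetical values chosen only for illustration, not measurements from any particular tool.

```python
def probe_load_bps(probes_per_second, packet_bytes, paths=1):
    """Return the test traffic in bits per second, in one direction."""
    return probes_per_second * packet_bytes * 8 * paths

# Hypothetical figures: 5 probes per second, 100-byte packets.
print(probe_load_bps(5, 100))             # load on a single path
print(probe_load_bps(5, 100, paths=200))  # aggregate load from one prober testing 200 paths
```

The per-path load is trivial; what needs watching is the aggregate when a single system sends or answers probes for many paths.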
Open Source Path Testing Tools
Some organizations consider open source tools because they don’t have the budget to purchase commercial tools, while other organizations need to scale up to sizes that commercial tools typically can’t handle. Regardless of the reason, starting with an existing tool is more useful than creating something from scratch. This list of tools is representative of the types of tools available.
NetNORAD -- this is Facebook’s network monitoring system. Because Facebook needed a tool that would identify network latency and packet loss problems within seconds, it wasn’t able to use SNMP-based polling systems, especially at its scale. All Facebook servers run the ping responder program while a subset of servers run the pinger application, both of which are described on Facebook’s NetNORAD webpage. The network hierarchy consists of racks, clusters, data centers, regions, and backbone. To identify network problems, it runs tests between different parts of the hierarchy. The tests use User Datagram Protocol (UDP) packets with different Differentiated Services Code Point (DSCP) priority values, providing coverage of quality-of-service (QoS) priority levels over the test paths. It condenses and records pinger response times in a real-time database from which monitoring programs perform analysis and generate alarms.
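The idea of a UDP pinger and responder with DSCP-marked probes can be sketched in a few lines. This is not NetNORAD's code (its pinger is written in C++); it is a simplified illustration that assumes a Unix-like system where the IP_TOS socket option is available, and the function names, payload, and DSCP values (0 for best effort, 26 for AF31, 46 for EF) are my own choices. It runs both ends on the loopback interface so it can be tried on one machine.

```python
import socket
import threading
import time

def run_responder(sock):
    """Echo each datagram back to its sender; stop on an empty datagram."""
    while True:
        data, addr = sock.recvfrom(2048)
        if not data:
            break
        sock.sendto(data, addr)

def probe(target, dscp, timeout=1.0):
    """Send one UDP probe marked with the given DSCP; return RTT in ms or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # DSCP occupies the upper six bits of the IP TOS byte.
        s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
        s.settimeout(timeout)
        start = time.perf_counter()
        s.sendto(b"probe", target)
        try:
            s.recvfrom(2048)
        except socket.timeout:
            return None  # counted as a lost probe
        return (time.perf_counter() - start) * 1000.0

def loopback_demo(dscp_values=(0, 26, 46)):
    """Run a responder on 127.0.0.1 and probe it once per DSCP value."""
    responder = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    responder.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    target = responder.getsockname()
    worker = threading.Thread(target=run_responder, args=(responder,), daemon=True)
    worker.start()
    results = {dscp: probe(target, dscp) for dscp in dscp_values}
    # An empty datagram tells the responder loop to exit.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as stopper:
        stopper.sendto(b"", target)
    worker.join(timeout=1.0)
    responder.close()
    return results

for dscp, rtt_ms in loopback_demo().items():
    print(f"DSCP {dscp}: {rtt_ms} ms")
```

On loopback the DSCP marking has no visible effect; across a real network, probing each QoS class separately shows whether a problem affects one class or all of them.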
When NetNORAD detects a problem, Facebook can perform further analysis with another program, fbtracert, its version of traceroute. It has the ability to diagnose problems over equal-cost multipath topologies.
NetNORAD would be a good candidate for larger organizations that have adequate development resources for fielding and supporting the tool. The pinger application is written in C++ while fbtracert is written in Go.
PerfSONAR -- this is a network testing and measurement toolkit consisting of several traditional network tools. The tool suite includes ping, traceroute, tracepath, iperf, nuttcp, and owamp (one-way active measurement protocol), with a control, archiving, and display system that allows use of these programs for continuous path monitoring. The toolkit is excellent for continuously performing multidomain network diagnostics over Internet paths. Tests can be run from more than 2,000 PerfSONAR installations worldwide.
PerfSONAR is great for testing multiple Internet paths that traverse multiple ISPs. A matrix providing a summary of results, as shown below, allows easy identification of systems and paths that are experiencing problems. The minimum-loss-rate threshold is at the rate that I recommend to customers: 0.0001%, because that’s the rate at which TCP begins to experience significant performance degradation.
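One common way to see why very small loss rates matter to TCP is the Mathis approximation, which bounds throughput at roughly (MSS / RTT) * (C / sqrt(p)), where p is the loss rate and C is about 1.22. The sketch below applies that formula. It is a simplified model that I am assuming for illustration, not necessarily the basis for the threshold above, and the MSS and round-trip time are hypothetical values.

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=sqrt(1.5)):
    """Approximate upper bound on TCP throughput under the Mathis model."""
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate))

# Hypothetical path: 1460-byte MSS, 50 ms round-trip time.
for loss_pct in (0.0001, 0.01, 1.0):
    bps = mathis_throughput_bps(1460, 0.050, loss_pct / 100.0)
    print(f"{loss_pct}% loss: about {bps / 1e6:.1f} Mbps")
```

Because throughput scales with one over the square root of the loss rate, every hundredfold increase in loss cuts the achievable rate by a factor of ten.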
SmokePing -- this is a nifty little tool for tracking latency, jitter, and packet loss with an integrated alerting system. The display contains all the data for a selected site/path and timeframe. As seen below, the colored line shows the median round-trip time (latency), while the color indicates packet loss. The shadow around the median round-trip time shows the variance (jitter) in ping response times.
SmokePing’s advantage is that you can monitor paths and important endpoints without installing an agent -- it relies on ping for its measurements. SmokePing is a Perl program that runs on Unix-compatible machines. Installation is easy, which makes it a good fit for a monitoring server in each data center and in cloud implementations. This is a great tool for organizations with smaller networks in which the installation and monitoring won’t overwhelm the network team.
Summary
Open source has become a viable source for network management systems, although there’s a tradeoff between using a commercial product, with purchase and annual maintenance costs, and open source, which doesn’t cost anything to acquire but requires staff time to maintain. The network staff will need someone on the team to perform customization and software upgrades, which typically take longer with open source tools than with commercial tools.
I recommend doing a proof of concept if you’re serious about using open source network monitoring tools. You should understand what it takes to install, customize, and operate the tool you choose. Automating the installation and customization may be necessary. Compare that experience with that of installing and using commercial software.