This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.
4 Tips to Reduce Network MTTR
We all wish we could reduce the mean time to repair (MTTR) of network outages. Here are four tips for doing so; each optimizes part of the diagnostic process, and combined they deliver big benefits.
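To make the metric concrete, here is a minimal sketch of how MTTR might be computed from incident records. It assumes repair time is measured from detection to service restoration (definitions vary between organizations), and the incident timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

def mean_time_to_repair(incidents):
    """Return the mean of (restored - detected) across incidents as a timedelta."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - detected for detected, restored in incidents), timedelta())
    return total / len(incidents)

# Hypothetical outage records: (time detected, time service restored).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 30)),     # 90 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 45)),  # 45 minutes
    (datetime(2024, 3, 2, 22, 15), datetime(2024, 3, 3, 0, 0)),     # 105 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:20:00
```

Tracking this number before and after adopting each tip is one way to check whether the change actually shortened repairs.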
The People, Process, and Technology Framework
IT operations revolves around a triad of people, process, and technology. Sharp people, working with known and well-rehearsed processes and using current technology, can produce remarkable results. We’ll see how the tips relate to the triad.
Tip 1: Employ Trusted Network Experts
Your network experts, whether they are employees or contractors, are the most important element. If you use contractors, it is best if they are working on your network on a regular basis so that they learn and understand your business and how the network supports the business.
In addition to learning how the network functions, they will learn its idiosyncrasies, which are where failures and slowdowns are most likely to occur. It is this knowledge that enables your sharp people to make intuitive leaps regarding potential causes of problems.
Tip 2: Create Good Network Documentation
You’ll need good system documentation and network baseline data to validate what the network should look like and how it should function. If the necessary documentation doesn’t exist, take the time to create it. This is a critical process. Good documentation comprises:
- Network diagrams that show both the physical and the logical connectivity, which is essential to troubleshooting: Creating a single diagram that shows both can be challenging, so you may need multiple diagrams. You should be able to follow a network path between any two points and identify places to gather data or test hypotheses.
- Written policies that describe the network’s design, operation, and future growth: Policies should describe things like the network segmentation paradigms, addressing plan, site interconnectivity mechanisms, network management goals, and routing/switching policies.
- Documentation for network equipment refresh planning, upgrades to new technologies, and growth plans: Make sure to include diagnostic tools that are specific to any new technologies.
- Run-books that describe typical problems and the mechanisms that worked in the past for diagnosing them: A well-written run-book for a single scenario should allow a more junior network engineer to diagnose and remediate common problems successfully.
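If the logical diagram is also kept in machine-readable form, following a path between two points can be scripted. The sketch below is one possible approach, not a prescribed method: the device names and adjacency map are hypothetical, and a breadth-first search returns a fewest-hop path, which may differ from the path that routing actually selects.

```python
from collections import deque

# Hypothetical logical topology: each device maps to its directly connected neighbors.
TOPOLOGY = {
    "branch-sw1": ["branch-rtr1"],
    "branch-rtr1": ["branch-sw1", "wan-edge1"],
    "wan-edge1": ["branch-rtr1", "dc-core1"],
    "dc-core1": ["wan-edge1", "dc-sw1", "dc-sw2"],
    "dc-sw1": ["dc-core1", "app-server"],
    "dc-sw2": ["dc-core1"],
    "app-server": ["dc-sw1"],
}

def find_path(topology, source, destination):
    """Breadth-first search: return a fewest-hop path as a list, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for neighbor in topology.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(find_path(TOPOLOGY, "branch-sw1", "app-server"))
# ['branch-sw1', 'branch-rtr1', 'wan-edge1', 'dc-core1', 'dc-sw1', 'app-server']
```

Each device on the returned path is a candidate place to gather data or test a hypothesis.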
Tip 3: Develop Consistent Network Building Block Designs
Another process element is the use of consistent network building block designs to yield significant gains in simplification, documentation, monitoring, and troubleshooting. You should tie the building block designs to equipment refresh cycles. Each cycle may, but need not, result in a slightly different design and new equipment with new configurations. Occasionally, you’ll have a significant change that drives an entirely new design paradigm, such as the switch from MPLS to SD-WAN (and the just-starting change to secure access service edge, or SASE). This may be the opportunity to implement a more widespread change if the savings offset any residual value or cost of the old implementations. Note that you’ll need new design and troubleshooting documentation to go with changes in the building block designs you adopt.
Don’t fall for enticements to use shiny new products and features or to switch vendors. Rather, only implement changes with sound reasoning. Standardization means that you sometimes give up on some of these things to make the network easier to monitor, manage, and troubleshoot. The place for new technology is in the lab, during the process of creating new building block designs.
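One practical payoff of standard building blocks is that deviations become easy to detect. The following is a rough sketch under assumed data: the setting names, values, and device inventory are all hypothetical, and a real check would read them from your own configuration source.

```python
# Hypothetical "branch" building block: settings every branch router should share.
BRANCH_BLOCK = {
    "os_version": "2.4.1",
    "ntp_server": "10.0.0.10",
    "syslog_server": "10.0.0.20",
    "snmp_enabled": True,
}

def find_deviations(standard, device_settings):
    """Return {setting: (expected, actual)} for every value that differs from the standard."""
    return {
        key: (expected, device_settings.get(key))
        for key, expected in standard.items()
        if device_settings.get(key) != expected
    }

# Hypothetical inventory data for two branch routers.
devices = {
    "branch-rtr1": dict(BRANCH_BLOCK),
    "branch-rtr2": {**BRANCH_BLOCK, "ntp_server": "10.9.9.9"},
}

for name, settings in devices.items():
    print(name, find_deviations(BRANCH_BLOCK, settings))
# branch-rtr1 {}
# branch-rtr2 {'ntp_server': ('10.0.0.10', '10.9.9.9')}
```

A device that drifts from its building block is a natural first suspect when something in that block misbehaves.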
Tip 4: Accelerate Diagnosis with Automation
Gone are the days of manually logging in to network equipment and collecting troubleshooting information from the command line interface. Network automation (the technology component of the triad described above) is not just for deploying new configurations. Using automation to collect and correlate the same data you would gather manually simply accelerates diagnosis. Because collecting diagnostic data is a read-only operation, automating it poses no risk to the network, which answers an objection that some people raise against automation.
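As a sketch of what read-only collection could look like, the example below gathers the same commands from several devices in parallel. It makes several assumptions: the device names and command list are illustrative, `fake_run_command` is a stand-in for whatever transport you actually use (SSH, an API, or a vendor tool), and treating any command that starts with "show " as read-only is a simplification that you should verify for your platforms.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative command list; this sketch only allows commands that start with "show ".
DIAGNOSTIC_COMMANDS = ["show interfaces", "show ip route", "show logging"]

def collect_from_device(device, commands, run_command):
    """Validate every command first, then run them all and return {command: output}."""
    for command in commands:
        if not command.startswith("show "):
            raise ValueError(f"refusing non-read-only command: {command!r}")
    return {command: run_command(device, command) for command in commands}

def collect_diagnostics(devices, commands, run_command, max_workers=8):
    """Collect from all devices concurrently and return {device: {command: output}}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda device: collect_from_device(device, commands, run_command), devices
        )
        return dict(zip(devices, results))

def fake_run_command(device, command):
    """Hypothetical transport stub; replace with a real connection to the device."""
    return f"<output of '{command}' on {device}>"

report = collect_diagnostics(["core1", "edge1"], DIAGNOSTIC_COMMANDS, fake_run_command)
print(report["edge1"]["show ip route"])  # <output of 'show ip route' on edge1>
```

Passing the transport in as a function keeps the read-only check in one place, regardless of how the devices are reached.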
Coupling automation with trouble-ticketing systems and unified communications (UC) collaboration tools yields a powerful system that can collect diagnostic data quickly and push the results into a chat space where the network team, regardless of location, can view them and collaborate on troubleshooting. This method of operation has a name: ChatOps. This is a true paradigm shift in network troubleshooting that promises to reduce MTTR.
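The push into a chat space can be sketched as follows. This is an illustration under assumptions, not any particular product's API: the ticket ID and device output are invented, and the `{"text": ...}` JSON body is a shape that many incoming webhooks accept but that you would need to confirm for your own chat tool.

```python
import json
import urllib.request

def build_chat_message(ticket_id, diagnostics):
    """Format {device: {command: output}} as a plain-text chat message payload."""
    lines = [f"Diagnostics for ticket {ticket_id}"]
    for device, outputs in diagnostics.items():
        lines.append(f"[{device}]")
        for command, output in outputs.items():
            lines.append(f"  {command}: {output}")
    # Assumed payload shape; check what your chat tool's webhook expects.
    return {"text": "\n".join(lines)}

def post_to_chat(webhook_url, message, timeout=10):
    """POST the message as JSON to a webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Hypothetical ticket and collected output.
message = build_chat_message("INC-1042", {"edge1": {"show interfaces": "Gi0/1 is down"}})
print(message["text"])
```

Calling `post_to_chat` with your own webhook URL and this `message` would deliver it to the team; the call is left out here so the sketch runs without network access.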
Excellence in operations begins with the basic framework: people, process, and technology. The integration of these elements and the depth of their use are what produce gains in troubleshooting. The first three tips have been around for as long as we’ve been doing networking and should be part of basic network operations. The last tip, regarding the use of automation, saw only sporadic use until recently, when the scale of networks mandated the switch to automation. Combining automated collection of troubleshooting data with network team collaboration tools has opened a whole new world for reducing the time to diagnose network problems.