Use Checklists to Strengthen Networks

Our IT systems, like much of the modern world, have become increasingly complex, and that complexity brings varying levels of risk. Overlooking key factors or components in these systems can contribute to their failure. The problem isn’t a lack of knowledge: we know enough to understand how to construct our IT systems. So how do we make sure that we don’t miss anything? The answer: checklists.

The origin of checklists goes back to 1935 with an Army Air Corps fly-off that included the Boeing B-17 Flying Fortress. The test aircraft’s flight control surfaces were locked to prevent them from flapping in the wind while it was parked on the ground. The pilots forgot to disengage the locks before takeoff, and the aircraft crashed, killing two crew members. The post-crash analysis found that the pilots couldn’t recall all the steps required to fly the aircraft safely. This tragedy led to the creation of the pre-flight checklist, and the B-17 eventually had four different checklists for specific flight phases.

Applying Checklists to Networking

It’s not difficult to identify areas where checklists apply to the network. The challenge comes from taking the time to create and maintain these lists, changing processes to incorporate them, and regularly using them to validate the conditions that each item addresses.

Automation makes it practical to apply checklists. The volume of detail in a comprehensive network checklist makes it impossible to work through manually on a network of more than a few devices. With automation, we get consistent validation of every checklist item across all devices because human error (and, frankly, boredom from doing the checks) is taken out of the process.

Let’s look at some examples, ranging from simple to complex. Then we’ll learn how to organize them in a way that supports their use with network automation. The exact mechanism for performing the checks depends on the automation system we’ve chosen.

Simple Checks

Simple checks validate that basic network device configurations are correct and that the configured functions are working as intended. They are simple because a single checklist item applies to many devices.

For example, verify the Cisco router network time protocol (NTP) configuration and that the router has been able to contact its servers. An additional check would confirm that the router has synchronized with one of the servers. Below are examples of the data that is obtained from the network for the checks that should be performed.

Check that the NTP servers are correctly configured and that the router is set to update its hardware clock from the software clock:

ntp server 10.50.36.42
ntp server 10.50.38.42
ntp update-calendar

Verify that the NTP peer relationships are working:
router#show ntp association
address       ref clock     st  when  poll reach delay offset   disp
+~10.50.38.42   86.79.127.250    4     7   256  377  0.8   -0.29     0.2
*~10.50.36.42   86.79.127.250    4   188   256  377  0.7   -0.17     0.3

* master (synced), # master (unsynced), + selected, - candidate, ~ configured

There are three checks to perform:

The two NTP servers are configured (and no others)
The update-calendar command is configured
The output of show ntp association shows that both servers are active and that one has been selected as the NTP master. The two addresses should be the same as those in the configuration.

The first two checks should be done at least whenever the configuration changes. The last check makes sure that the server association is still active, so that we can detect when the NTP master server is unavailable, either due to a network problem or a server failure. It should be checked on a periodic basis to proactively detect problems, perhaps as often as once every few days. And because the addresses in the configuration are the same as those in the show ntp association command output, we only need to specify them in one place and let the automation system perform both checks.
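As a rough illustration, the three NTP checks could be sketched in Python along the following lines. This is a minimal sketch, not a definitive implementation: it assumes the configuration and the show ntp association output have already been collected as text by whatever automation system is in use, and the names EXPECTED_NTP_SERVERS, check_ntp_config, and check_ntp_associations are hypothetical. The expected server addresses are defined once and reused by both the configuration check and the operational check.

```python
import re

# Hypothetical single definition of the expected servers, shared by both checks.
EXPECTED_NTP_SERVERS = {"10.50.36.42", "10.50.38.42"}


def check_ntp_config(config_text, expected=EXPECTED_NTP_SERVERS):
    """Checks 1 and 2: exactly the expected servers, and update-calendar present."""
    configured = set(re.findall(r"^ntp server (\S+)", config_text, re.MULTILINE))
    has_calendar = re.search(r"^ntp update-calendar\s*$", config_text, re.MULTILINE)
    return {
        "servers_match": configured == expected,
        "update_calendar": has_calendar is not None,
    }


def check_ntp_associations(show_output, expected=EXPECTED_NTP_SERVERS):
    """Check 3: addresses match the config, all are reachable, one is master."""
    # Row layout assumed: flag, ~address, ref clock, st, when, poll, reach, ...
    row = re.compile(
        r"^\s*([*#+-]?)~(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\d+\s+\S+\s+\d+\s+([0-7]+)"
    )
    seen, reachable, master = set(), set(), None
    for line in show_output.splitlines():
        match = row.match(line)
        if not match:
            continue  # header, legend, and blank lines are skipped
        flag, address, reach = match.groups()
        seen.add(address)
        # reach is an octal bitmask of the last eight polls; 0 means none answered.
        if int(reach, 8) != 0:
            reachable.add(address)
        if flag == "*":
            master = address
    return {
        "addresses_match": seen == expected,
        "all_reachable": expected <= reachable,
        "master_selected": master in expected,
    }
```

A device passes when every value in both returned dictionaries is True. Because both functions default to the same set of addresses, adding or replacing a server means editing one definition.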

Moderately Complex Checks

Complexity rises when the checklist items are unique to small groups of devices, or to each network device. It is simply the volume of items that drives the complexity. We can use automation tasks to populate the checklist database from the network, but this assumes that the function is working correctly when the data is captured. It’s a good idea to verify any data that is obtained from the network.

An example in this category is EtherChannel connectivity. Both configuration and operational data should be verified.

Configuration:

interface range gigabitethernet1/0/1 -2
 switchport mode access
 switchport access vlan 10
 channel-group 1 mode active

Operational data shows both interfaces in the port-channel, which is statically configured for Layer2 and is in use (the SU flags following the name Po1 in the command output).

Switch> show etherchannel 1 summary
Flags:  D - down        P - in port-channel
        I - stand-alone s - suspended
        H - Hot-standby (LACP only)
        R - Layer3      S - Layer2
        u - unsuitable for bundling
        U - in use      f - failed to allocate aggregator
        d - default port

Number of channel-groups in use: 1
Number of aggregators: 1

Group  Port-channel  Protocol    Ports
------+-------------+-----------+----------------------------------------
1      Po1(SU)       LACP        Gi1/0/1(P)  Gi1/0/2(P)
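One way to turn the operational data above into a checklist item is sketched below in Python. It assumes the summary output has already been captured as text, and that each group row follows the layout shown (group number, port-channel with flags, protocol, member ports with flags). The function names parse_etherchannel_summary and check_etherchannel are hypothetical.

```python
import re


def parse_etherchannel_summary(show_output):
    """Return {group_number: details} for each group row in the summary output."""
    row = re.compile(r"^(\d+)\s+(Po\d+)\((\w+)\)\s+(\S+)\s+(.*)$")
    groups = {}
    for line in show_output.splitlines():
        match = row.match(line.strip())
        if not match:
            continue  # flag legend, counters, and separator lines are skipped
        group, name, flags, protocol, ports = match.groups()
        groups[int(group)] = {
            "port_channel": name,
            "flags": flags,
            "protocol": protocol,
            # Map each member interface to its flag, e.g. {"Gi1/0/1": "P"}
            "ports": dict(re.findall(r"(\S+)\((\w+)\)", ports)),
        }
    return groups


def check_etherchannel(groups, group, expected_ports, expected_flags="SU"):
    """Pass only if the port-channel flags match and every expected port is bundled."""
    entry = groups.get(group)
    if entry is None:
        return False
    same_members = set(entry["ports"]) == set(expected_ports)
    all_bundled = all(flag == "P" for flag in entry["ports"].values())
    return entry["flags"] == expected_flags and same_members and all_bundled
```

For the output shown, check_etherchannel(groups, 1, ["Gi1/0/1", "Gi1/0/2"]) would pass, while a member port in the suspended (s) or stand-alone (I) state would fail the check even though the port-channel itself is still up.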

Other examples include verifying routing neighbors, the next-hop router for important routes (e.g., the default route), and connectivity to critical application servers. These items can detect anomalies and failures within the network that are typically hidden by redundant designs.

Complex Checks

Complex checks involve detailed configuration and operational data, frequently correlated across multiple devices. For example, we could extend the EtherChannel validation by using link-layer discovery protocols to make sure that the proper devices and ports are connected. In the link-layer case, we could verify that a Cisco router and switch connect over the same link by collecting and correlating CDP data from the two devices.
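The CDP correlation could look something like the Python sketch below. It assumes the CDP neighbor data from each device has already been parsed into simple records; the record fields (local_port, neighbor, neighbor_port), the function names, and the device and port names in the usage note are all hypothetical. A link passes only when both ends report each other on the expected ports.

```python
def links_from_cdp(device, neighbors):
    """Normalize one device's CDP neighbor records into undirected link keys."""
    links = set()
    for record in neighbors:
        local_end = (device, record["local_port"])
        remote_end = (record["neighbor"], record["neighbor_port"])
        # A frozenset makes (A, B) and (B, A) compare as the same link.
        links.add(frozenset((local_end, remote_end)))
    return links


def verify_link(expected_link, cdp_by_device):
    """Return True only if both devices on the expected link report it via CDP."""
    key = frozenset(expected_link)
    return all(
        key in links_from_cdp(device, cdp_by_device.get(device, []))
        for device, _port in expected_link
    )
```

For instance, an expected link of (("router1", "Gi0/0"), ("switch1", "Gi1/0/24")) passes only if router1 lists switch1 port Gi1/0/24 on its Gi0/0 and switch1 lists router1 port Gi0/0 on its Gi1/0/24. A cable moved to the wrong switch port fails the check from both sides.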

Checklist Database: The Network Source of Truth (NSoT)

Where does the checklist live? In a repository known as a Network Source of Truth (NSoT), which is essentially a database of the network checklist. The NSoT is the definition of the network’s connectivity and operations. We can’t rely on the network itself for that definition, because a failure (device, link, or human) invalidates data that we collect from it.

Even though the term database is used, it’s typically not a relational database management system (RDBMS). Instead, it’s a set of files that define the data that must be checked. In the Ansible platform, for example, we could have the NTP server addresses in the all.yml file, which applies to all devices, while the OS version data (IOS_version) for a specific device (test_sw) is in a separate file for that device.
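To show how the two layers of data combine, here is a small Python sketch. The dictionaries stand in for the contents of all.yml and the per-device file for test_sw; in practice they would be loaded from those files. The key name ntp_servers, the version string, and the function expected_state are placeholders chosen for illustration, not values from a real network.

```python
from collections import ChainMap

# Stand-in for all.yml: values that apply to every device.
all_vars = {"ntp_servers": ["10.50.36.42", "10.50.38.42"]}

# Stand-in for the per-device files, keyed by device name.
# "EXAMPLE-VERSION" is a placeholder, not a real IOS release.
host_vars = {"test_sw": {"IOS_version": "EXAMPLE-VERSION"}}


def expected_state(hostname):
    """Merge the two layers; device-specific values override network-wide ones."""
    return dict(ChainMap(host_vars.get(hostname, {}), all_vars))
```

Calling expected_state("test_sw") yields both the shared NTP servers and the device's own IOS_version, which is the complete set of expectations the automation would check for that device. A device with no file of its own simply inherits the network-wide values.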

The great thing about this whole process is that we don’t need a 100% complete NSoT to get started. We can begin with a handful of simple checks that are easy to create, then add more elaborate checks over time. The important thing is to start building the list.

Test-Driven Network Automation

We can incorporate automated testing into operational processes once we have a network source of truth started. The next step is to update the network change control process to include pre-change and post-change testing. As the NSoT grows, more parts of the network are validated as configured and working correctly before a change is made as well as after it has been implemented. This makes sure that a change didn’t break the network.

Let’s look at how this works, using the NTP example above to demonstrate adding another NTP server. The pre-change check would validate that all the current network devices can connect to the two existing servers. We would then run automation that updates the configurations of all network devices to include a third server. The post-change validation check would verify that all devices have connected to the third server.

Investigate any devices that can’t connect to the third server to determine why; the cause might be a firewall rule or a missing route. We would know immediately that the change, even though properly implemented, was not operational on some devices, and we could take steps to correct it. Using these same checks on a periodic basis allows us to identify when similar problems occur due to other changes in the network, like adding a firewall that blocks NTP or an interface whose virtual routing and forwarding (VRF) definition was fat-fingered.
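The pre-change and post-change flow for the NTP example might be sketched in Python as follows. This is an illustration under stated assumptions: get_reachable and apply_change are hypothetical callables supplied by the automation system (one returns the set of NTP servers a device currently reaches, the other pushes the configuration change), and stopping when the pre-change check fails is a design choice made here, not something the process requires.

```python
def run_change(devices, current_servers, new_server, get_reachable, apply_change):
    """Pre-check, apply the change, post-check; report devices needing follow-up."""
    required = set(current_servers)

    # Pre-change: every device must reach all of the existing servers.
    pre_failures = [d for d in devices if not required <= get_reachable(d)]
    if pre_failures:
        # Assumption: do not change a network that is already failing its checks.
        return {"applied": False, "pre_failures": pre_failures, "post_failures": []}

    for device in devices:
        apply_change(device, new_server)

    # Post-change: every device must reach the old servers and the new one.
    required_after = required | {new_server}
    post_failures = [d for d in devices if not required_after <= get_reachable(d)]
    return {"applied": True, "pre_failures": [], "post_failures": post_failures}
```

The post_failures list is the set of devices to investigate. Running the same post-change check on a schedule, without the change step, gives the periodic validation described above.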

The combination of checklists with automated network testing helps us improve our networks and make changes with lower risk. Join me at Enterprise Connect Digital Conference & Expo 2020 to learn how to get started with network automation.

Are you looking to learn more about automation technologies? Terry will be talking about NSoT and network automation at Enterprise Connect Digital Conference & Expo 2020. Please join him online for the session, “A Step by Step Guide to Automating Your Network,” which airs on Wednesday, August 5, at 3:00 p.m. to 3:45 p.m. (on demand afterward). Register today!