There is an IT systems administration meme where servers are treated like a herd of cattle instead of pets. And while the cattle/pet analogy might be something to chuckle about, it demonstrates a key difference in how servers in a network are set up and supported. Below we explore the analogy further and see what factors drive the selection of a herd or pet model.
The Pets vs. Cattle Analogy
IT server operations staff have been transitioning from a model in which servers are treated like pets, named and carefully cared for, to one in which they are treated like a herd of cattle, each identified only by a number. If a member of the herd gets sick and dies, it’s easily replaced with another. You can read more about it in the history of the pets vs cattle analogy and DevOps Concepts: Pets vs Cattle. A better model than cattle might be a fleet of various vehicles (cars, vans, large trucks), which matches server sizes and functions with applications.
In the pet model, resilient services are typically implemented with redundant infrastructure (i.e., multiple servers) running complex redundancy protocols. These systems are often based on an active-standby architecture. One server is the master, and another server is a hot standby that can take over when it detects that the master has died. A problem with this architecture is that most redundancy protocols contain a failure mode known as a split-brain failure in which the master and standby lose contact with one another due to a network failure and both take the role of master. Any changes to the user database during the failure window must be resolved between the two instances before the active-standby peering can safely resume.
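The split-brain failure can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the `Node` class, the one-second heartbeat, and the three-second timeout are hypothetical and do not correspond to any particular vendor's redundancy protocol. Both nodes stay alive throughout; only the link between them fails.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str  # "master" or "standby"
    last_heartbeat_seen: float = 0.0

    def check_peer(self, now: float, timeout: float) -> None:
        """Promote a standby to master if the peer has been silent too long."""
        if self.role == "standby" and now - self.last_heartbeat_seen > timeout:
            self.role = "master"


def simulate(link_up_until: float, end: float, timeout: float = 3.0) -> list[str]:
    """Return the roles of both nodes after `end` seconds.

    Neither node crashes. The link between them fails after
    `link_up_until`, so heartbeats stop arriving on both sides.
    """
    a = Node("a", "master")
    b = Node("b", "standby")
    t = 0.0
    while t <= end:
        if t <= link_up_until:
            # Heartbeats flow in both directions while the link is healthy.
            a.last_heartbeat_seen = t
            b.last_heartbeat_seen = t
        a.check_peer(t, timeout)
        b.check_peer(t, timeout)
        t += 1.0
    return [a.role, b.role]


roles_healthy = simulate(link_up_until=10.0, end=10.0)
roles_partitioned = simulate(link_up_until=5.0, end=10.0)
print(roles_healthy)      # one master, one standby
print(roles_partitioned)  # both nodes believe they are master
```

The standby cannot distinguish a dead master from a dead link, which is why the partitioned run ends with two masters.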
In the cattle/fleet model, multiple servers are active (often called an active-active architecture), and any server can handle any client. These systems typically depend on a highly resilient back-end distributed database that guarantees synchronization of client data across the entire infrastructure. It’s easy to replace servers in this architecture. Clients are distributed to the available servers by a variety of global server load balancing mechanisms.
What are the characteristics of the cattle analogy for IT systems? They are:
- The ability to simultaneously run in multiple data centers (active-active), both on-prem and cloud-based.
- If a server or its supporting infrastructure (network and environmental) fails, an alternative server can be brought up elsewhere within minutes to replace it.
- Upgrades are handled by installing the new version on separate servers and migrating clients to the new servers.
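The replace-and-upgrade behavior in the list above can be sketched as a pool of interchangeable, numbered servers. The `ServerPool` class and its methods are hypothetical names chosen for illustration. A real deployment would drive an orchestration or cloud API rather than a dictionary.

```python
import itertools


class ServerPool:
    """A pool of interchangeable, numbered servers (illustrative model only)."""

    def __init__(self, size: int, version: str) -> None:
        self._ids = itertools.count(1)
        self.servers = {}  # server id -> {"version": ..., "healthy": ...}
        for _ in range(size):
            self.launch(version)

    def launch(self, version: str) -> int:
        """Bring up a new server and return its number."""
        server_id = next(self._ids)
        self.servers[server_id] = {"version": version, "healthy": True}
        return server_id

    def replace_failed(self) -> list[int]:
        """Discard unhealthy servers and launch same-version replacements."""
        failed = [sid for sid, s in self.servers.items() if not s["healthy"]]
        new_ids = []
        for sid in failed:
            version = self.servers.pop(sid)["version"]
            new_ids.append(self.launch(version))
        return new_ids

    def upgrade(self, new_version: str) -> None:
        """Launch new-version servers first, then retire the old ones."""
        old = [sid for sid, s in self.servers.items() if s["version"] != new_version]
        for _ in old:
            self.launch(new_version)
        for sid in old:
            del self.servers[sid]


pool = ServerPool(size=3, version="1.0")
pool.servers[2]["healthy"] = False       # server 2 "gets sick"
replaced = pool.replace_failed()         # a new numbered server takes its place
ids_after_replace = sorted(pool.servers)
pool.upgrade("2.0")                      # new servers come up, old ones retire
ids_after_upgrade = sorted(pool.servers)
versions = {s["version"] for s in pool.servers.values()}
```

No server is repaired or upgraded in place. A failed or outdated server is simply discarded and a fresh one takes over its share of the work.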
UC Infrastructure: Pets or Cattle?
Should the pets vs. cattle analogy be applied to critical UC infrastructure, such as call controllers, session border controllers (SBCs), and multi-point control units (MCUs)? What would be needed to make this architecture work? The short answer is yes; the cattle model can be applied, and it has certain advantages.
Pets
The traditional UC server started out as an appliance, designed as software that ran on a dedicated hardware platform. Redundancy protocols were used between pairs of systems, running in active-standby mode, so that the combined unit achieved high availability. These systems were often sold as the “High Availability” (HA) model at premium prices. The user database was incorporated within the system, generally tightly coupled for optimum performance. Software upgrades were onerous because of the tight coupling to the database. Devices that didn’t use an internal database, such as SBCs and MCUs, were often dependent on stored session state information. Some redundancy protocols incorporated shared session state, which allowed a session to migrate between any of the devices in the HA cluster. Changes in the session state information could introduce incompatibilities that made it challenging to upgrade one half of an HA pair at a time.
I asked NetCraftsmen’s UC expert Bill Bell about the cloud implementation of typical UC systems. I learned that simply converting such systems to VM images and installing them in a cloud data center doesn’t change their operational paradigm. Functionally, they are still based on a pet model. Software upgrades require the use of tools to migrate client data from the old systems to the new systems and a flag day when clients are migrated to the new systems. Careful reading of the release notes and careful planning are critical to a successful upgrade. If you are planning such a move, it makes sense to adopt UC automation tools from the vendor or from companies like Unimax to aid in the migration of user data.
Cattle
To learn more about native cloud-based UC infrastructure, I talked with Kevin Butler at Vertical Communications, who is a technology partner of cloud-based UC vendor 8x8. The 8x8 UC system design is born in the cloud, according to Butler. It’s designed differently than the legacy UC systems described above. A multi-tier architecture is used in which the software components are based on VM or container technology. The tiers are loosely coupled and are designed to gracefully handle upgrades. The back-end systems are typically the database systems that store client data (name, telephone extension, Session Initiation Protocol (SIP) identifier, physical location, presence information, etc.). Front-end DNS and load balancers are used to distribute client control connections across multiple UC application servers. These are familiar design principles from the cattle model used by web-scale companies.
The customer sees only a “virtual PBX” and can connect to any server within the call control systems. Clusters of SBCs and MCUs share connection information via shared database servers, decoupling the sessions from specific hardware.
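A minimal sketch of how a shared store decouples session state from a specific server follows. The `SessionStore` and `Controller` classes are hypothetical, and a plain dictionary stands in for the shared database tier. Nothing here reflects 8x8’s actual implementation.

```python
class SessionStore:
    """Stand-in for the shared database tier (a plain dict in this sketch)."""

    def __init__(self) -> None:
        self._sessions = {}

    def put(self, session_id: str, state: dict) -> None:
        self._sessions[session_id] = dict(state)

    def get(self, session_id: str):
        state = self._sessions.get(session_id)
        return dict(state) if state is not None else None


class Controller:
    """A call controller that keeps no session state of its own."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def register(self, session_id: str, extension: str) -> None:
        self.store.put(session_id, {"extension": extension, "served_by": self.name})

    def resume(self, session_id: str) -> dict:
        """Pick up a session that another controller registered."""
        state = self.store.get(session_id)
        if state is None:
            raise KeyError(session_id)
        state["served_by"] = self.name
        self.store.put(session_id, state)
        return state


store = SessionStore()
first = Controller("controller-1", store)
second = Controller("controller-2", store)

first.register("session-a", extension="1001")
# controller-1 is lost; the client reconnects to controller-2.
state = second.resume("session-a")
```

Because the controllers hold nothing locally, the second one can serve the client with the same data the first one stored.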
In general, everything in the data center is clustered and redundant, so most failures are transparent to the end user. If a call controller dies, the connected clients automatically reconnect to another controller. Media is the exception: if an MCU fails, any calls it’s handling also terminate. If an entire data center fails, active calls fail, the DNS front end is updated, and clients are given the IP addresses of infrastructure in an alternate data center so they can automatically reconnect.
Overall, a born-in-the-cloud UC system design has fewer hard failure points than cloud-hosted pets.
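There are, however, unique failure modes. Some cloud vendors cache addresses from DNS servers. A cached address can outlive the DNS update that is meant to steer clients away from a failure, so clients keep trying the failed infrastructure until the cache is refreshed.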
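The effect of caching on DNS-based failover can be sketched with a toy resolver. Everything here is an assumption made for illustration: the `CachingResolver` class, the `min_ttl` clamp (one plausible way a cache might hold records longer than the published TTL), the host name, and the documentation-range IP addresses.

```python
class CachingResolver:
    """Toy caching resolver; time is passed in explicitly to keep it deterministic."""

    def __init__(self, authoritative: dict, min_ttl: float = 0.0) -> None:
        self.authoritative = authoritative  # name -> (address, ttl_seconds)
        self.min_ttl = min_ttl              # assumed clamp: cache at least this long
        self._cache = {}                    # name -> (address, expires_at)

    def resolve(self, name: str, now: float) -> str:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]                # answer from cache, possibly stale
        address, ttl = self.authoritative[name]
        self._cache[name] = (address, now + max(ttl, self.min_ttl))
        return address


name = "pbx.example.com"
records = {name: ("192.0.2.10", 30.0)}      # published TTL of 30 seconds

clamping = CachingResolver(records, min_ttl=300.0)
honest = CachingResolver(records)           # respects the published TTL
clamping.resolve(name, now=0.0)
honest.resolve(name, now=0.0)

# The primary data center fails at t=60 and DNS is repointed.
records[name] = ("198.51.100.10", 30.0)

stale = clamping.resolve(name, now=61.0)          # still the failed address
honest_answer = honest.resolve(name, now=61.0)    # already the new address
fresh = clamping.resolve(name, now=301.0)         # new address only after 300 s
```

In this sketch the clients behind the clamping cache keep receiving the failed address for roughly four more minutes, even though the record was published with a 30-second TTL.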
Summary
OK, there are some significant differences between the two models. Would that drive me to switch products? Well, that depends on several factors, as it always does with networking. The primary criteria are unchanged:
- Required end-customer features
- Training an existing UC support organization and the end customers
- Deployment/upgrade of physical phone hardware
- Soft costs in switching UC vendors
- The classes of single points of failure that are important to avoid
Assuming feature parity, the cattle approach is certainly inviting, particularly for an enterprise that is geographically distributed or that wants to move out of corporate data centers. Regardless, existing vendors are working to migrate from pet-like to cattle-like implementations, and I would thoroughly investigate them.