Know Your SLAs
What you need to know about your service level agreement and how service availability is measured and guaranteed.
The Service Level Agreement (SLA) is a contractual agreement between a service provider and customer that defines the expected level of service delivered by a service provider. The purpose of an SLA is to specify and define what the customer will receive as part of the service. SLAs do not define how the service itself is provided or delivered. The service implementations may change during the term of the SLA.
I attended the 25th annual "Negotiating Network and Infrastructure Deals" conference, organized by CCMI. Mark Lindsay of LB3 and David Lee of TechCaliber presented a session on service level agreements. The graphics in this blog are from their presentation.
What's in an SLA?
The SLA should clearly define and delineate what services are being provided under the agreement as well as the metrics used to determine whether it's being satisfied. The levels of service should cover:
- Reliability and availability can be defined quantitatively, for example as 99.XXX% uptime. The definition should state the percentage of uptime and should also limit the downtime and the time to restore service.
- Responsiveness by the provider covers not only the time it takes to respond to an outage, but also the time it takes to accept requests, schedule work, and meet service dates. Service dates include service installation and termination.
- There should be well-defined procedures for reporting problems -- i.e., who is to be contacted, how the problem will be reported (phone call, email, instant message, or certified letter), and what other steps can be taken so that the problem is resolved promptly and efficiently.
- The customer should be monitoring delivery of the SLA. The provider must have the ability to monitor the SLA. Confirm the provider can produce the SLA metrics. If the customer believes the SLA is not being met, what data needs to be collected? Can the customer access the monitoring systems and performance data of the provider?
- Assuming an SLA is not satisfied, what does a customer have to do to report their dissatisfaction? How fast does the provider respond to those reports? Even though the provider may have measurements, does a customer have to show independent measurements to qualify for credits? Are the remedies purely credits, or can the customer receive a check? Are credits given immediately, within 30 days, or spread over a longer time period? If the SLAs are not met, can customers terminate the service without penalties?
- You need to determine if there are any escape clauses or constraints to the SLA. Are there circumstances under which the SLA promises do not apply? Are there exemptions for things like a flood, fire, terrorism, or other hazardous situations?
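To make a figure like 99.XXX% uptime concrete, it helps to translate it into a downtime budget. The following is a minimal sketch, not a contractual calculation: the function name, the 365-day year, and the 30-day month are assumptions chosen for illustration, and a real SLA will define its own measurement period and exclusions.

```python
HOURS_PER_YEAR = 365 * 24          # 8,760 hours, ignoring leap years (assumption)
HOURS_PER_30_DAY_MONTH = 30 * 24   # 720 hours (assumption)


def allowed_downtime_minutes(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the downtime budget, in minutes, implied by an uptime percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * period_hours * 60


if __name__ == "__main__":
    # Compare a few common availability targets.
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_minutes(pct)
        per_month = allowed_downtime_minutes(pct, HOURS_PER_30_DAY_MONTH)
        print(f"{pct}% uptime: {per_year:.1f} min/year, {per_month:.1f} min/month")
```

Under these assumptions, 99.9% uptime still allows roughly 8.8 hours of downtime per year, while 99.999% allows a little over 5 minutes. Whether that budget is measured per month or per year matters, because a single long outage can fit inside an annual budget and still break a monthly one.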
What to Measure
As shown in the above graphic, there are four main areas to measure in an SLA. The one most people think about first is the availability of a service: how often is the service down, and how quickly is it restored? But part of that availability is also how you define and quantify the degradation of service, such as a scenario where the service is failing but has not yet failed.
Additionally, consider how well the SLA covers provisioning and installation. The fourth area is important to voice and video: the quality of transmission.
This is the availability formula:

Availability = MTBF / (MTBF + MTTR)
In this equation, MTBF is the mean time between failures and MTTR is the mean time to repair, both measured in hours. What you want is the time to restore service, not the time to repair. Don't forget that you have to test the restored service and initialize or reboot your devices, so the actual restoration time will be longer than what the SLA specifies. Also remember that a mean is an average, not a ceiling: individual outages and restorations can run well longer or shorter than the stated figure.
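The standard steady-state relationship, availability = MTBF / (MTBF + MTTR), can be sketched in a few lines to show how much the extra time for testing and rebooting affects the result. The function name and the sample figures (2,000 hours between failures, 4 hours to repair, 2 more hours to test and restart) are hypothetical values for illustration only.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1.

    mtbf_hours: mean time between failures, in hours
    mttr_hours: mean time to repair (or to restore), in hours
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 2000.0        # hypothetical hours between failures
    repair = 4.0         # hypothetical repair time the provider reports
    test_and_reboot = 2.0  # hypothetical extra time before you are back in service

    on_paper = availability(mtbf, repair)
    experienced = availability(mtbf, repair + test_and_reboot)
    print(f"Availability using repair time:      {on_paper:.4%}")
    print(f"Availability using restoration time: {experienced:.4%}")
```

With these sample numbers, the figure based on repair time is about 99.80%, while the figure based on full restoration time drops to about 99.70%. The gap is the reason to ask whether the SLA clock stops at repair or at restored service.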
It is easy to report an absolute outage. But what if the problem is degradation rather than a complete outage? Is there a standard that can cover this? Yes -- ANSI T1.231. (For more on this standard, see "Keeping Pace with the T1.231 Evaluation.")
T1.231 provides the foundation for telephony and data interfaces. It has three objectives. The first is to ensure that the operational state of the connection is known at both ends. The second is to ensure that data is transmitted only when appropriate. The third is to provide tools to verify and resolve problems, whether they are hard failures, soft failures, or gradual or intermittent degradation. The standard also defines when interface data is needed from multiple technologies (T1/DS1, Metro Ethernet Broadband, MPLS, etc.) and provides information for cases where multiple providers are involved. Does your provider reference ANSI T1.231?
Where does the availability SLA start and end?
The demarcation points may not include the local access connection. If multiple providers are included, does the SLA only cover one provider? It should not matter whether the information being transmitted is on-net or off-net. Availability should be measured from the customer point of view.
What is not included in the SLA?
Some providers exclude fiber cuts, but do not define what a fiber cut is. Be aware of vague descriptions of exceptions to the SLA. Last year a major cloud provider suffered an outage that lasted days. When the customers read their SLAs, they learned that under that particular set of outages, they were neither covered nor given credits. There was no recourse. In another case, the provider wanted information from the customer to demonstrate the effect of the outages. The workload was so great that the cost of the detailed reporting exceeded the value of the credits.
These kinds of exceptions point out that when providers discuss availability, you should be asking these questions:
- What is the definition of usable service, and who determines it?
- There may be definitions of the terms complete loss of use vs. usability. What's in your SLA?
- Who measures service availability, and how is it measured?
- What are the SLA conditions (start of service, after 30 days operation, etc.)? What are the credits and are they worth pursuing?
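The last question, whether credits are worth pursuing, comes down to simple arithmetic once you estimate the effort of claiming them. The sketch below is only an illustration: the function name, the idea that a credit is a percentage of the monthly fee, and all of the sample figures are assumptions, not terms from any particular SLA.

```python
def net_credit_value(monthly_fee, credit_pct, reporting_hours, hourly_cost):
    """Return the credit amount minus the staff cost of documenting the claim.

    monthly_fee:     what you pay the provider per month
    credit_pct:      credit offered, as a percentage of the monthly fee
    reporting_hours: staff hours needed to gather and submit evidence
    hourly_cost:     loaded cost per staff hour
    """
    credit = monthly_fee * credit_pct / 100
    claim_cost = reporting_hours * hourly_cost
    return credit - claim_cost


if __name__ == "__main__":
    # Hypothetical example: a 10% credit on a 5,000 monthly fee,
    # with 8 staff hours at 75 per hour to document the outage.
    net = net_credit_value(5000, 10, 8, 75)
    verdict = "worth pursuing" if net > 0 else "costs more than it returns"
    print(f"Net value of the claim: {net:.2f} ({verdict})")
```

In this hypothetical case the claim loses money, which mirrors the situation where detailed reporting cost more than the credits. If the numbers usually come out negative, the credits are not a meaningful remedy, and termination rights or simpler claim procedures may be the better things to negotiate.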