Measuring resilience:
How service level agreements can improve your company’s IT resilience

In the third of a series of articles on how businesses can make their IT more resilient, we discuss practical methods to identify and develop service-level metrics connecting business to technology resilience.



The digital revolution is not really a revolution anymore. The constant change we experience because of technological development is now an essential and unavoidable part of modern life.


Among the multiple impacts and consequences of these changes, one of the most remarkable is the sheer amount of data that is now collected and analyzed daily – and the proliferation of tools that make it almost effortless to store and analyze this information.




Site Reliability Engineering is only as good as your data



Organizations are competing to make astute data-driven decisions to transform their businesses. They must be agile enough to adapt to new digital technologies such as artificial intelligence (AI) and “Big Data,” plus external factors such as the Covid-19 pandemic.


Setting the right goals — while capturing and analyzing the right data from an IT service’s performance — can provide the foundation for improving technological resiliency. These metrics and attention to detail are part of BCG’s method to help organizations get the best possible returns from Site Reliability Engineering (SRE).


But what do we mean by the right goals and right data? And how do we define them?

Welcome to the acronym-strewn world of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).

Three related measures describe service levels. First, service level indicators (SLIs) are the raw measurements, such as error rates and the time taken to respond to a request from an IT user. Second, service level objectives (SLOs) are specific targets for those indicators: for example, 70% of users’ requests must receive a response within two minutes. Last, service level agreements (SLAs) are agreements between the provider of an IT service, often the IT department, and its users; they usually carry a penalty, such as a fine, if the service falls below agreed standards.
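To make the distinction concrete, here is a minimal Python sketch of an SLI and the SLO that judges it, using the two-minute, 70% example from the text. The names (`Slo`, `latency_sli`, `meets_slo`) and the sample response times are made up for illustration; a real team would pull these numbers from its monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for one indicator: `target` share of requests within `threshold_s` seconds."""
    threshold_s: float
    target: float


def latency_sli(response_times_s, threshold_s):
    """SLI: the share of requests answered within the threshold (0.0 to 1.0)."""
    if not response_times_s:
        raise ValueError("no requests observed")
    fast = sum(1 for t in response_times_s if t <= threshold_s)
    return fast / len(response_times_s)


def meets_slo(response_times_s, slo):
    """True when the measured SLI is at or above the SLO target."""
    return latency_sli(response_times_s, slo.threshold_s) >= slo.target


# Hypothetical response times in seconds (illustrative data, not from the article)
sample = [30, 45, 200, 90, 110, 400, 60, 75, 119, 100]
slo = Slo(threshold_s=120, target=0.70)  # 70% of requests within two minutes

sli = latency_sli(sample, slo.threshold_s)
print(f"SLI = {sli:.0%}, SLO met: {meets_slo(sample, slo)}")
```

The SLI is only a measurement; the SLO is the line it is judged against. An SLA would add the consequence that applies if the SLO is missed.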


The gap between an SLO and 100% is known as the error budget: the amount of unreliability the service is allowed. If SLIs fall far enough to exhaust that budget, it signals a potential problem. When this happens, SRE teams need to stop, remediate the root cause, and implement automation to prevent the same problems from happening again.
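One common way to express an error budget is as the number of requests an SLO allows to fail in a given period. The sketch below assumes that definition; the function names and the monthly figures are hypothetical.

```python
from fractions import Fraction


def allowed_failures(slo_target: str, total_requests: int) -> int:
    """Error budget for the period, as a count of requests allowed to fail.

    slo_target is passed as a string such as "0.99" so the arithmetic is exact.
    """
    budget_share = 1 - Fraction(slo_target)
    return int(budget_share * total_requests)


def budget_remaining(slo_target: str, total_requests: int, failed_requests: int) -> int:
    """Failures still permitted before the SLO is breached (negative if overspent)."""
    return allowed_failures(slo_target, total_requests) - failed_requests


# Hypothetical month: 100,000 requests against a 99% success SLO, 400 failures so far
allowed = allowed_failures("0.99", 100_000)
remaining = budget_remaining("0.99", 100_000, 400)
budget_exhausted = remaining <= 0

print(f"allowed={allowed}, remaining={remaining}, exhausted={budget_exhausted}")
```

In this made-up month the team has spent 400 of its 1,000 permitted failures. Once `remaining` reaches zero, the stop-and-remediate response described above would apply.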




Defining service levels and how they can boost IT resiliency


Before defining metrics, it is worth clarifying what a service is and isn’t. For example, within a customer-facing website, a customer’s profile, login, and registration are services. Product, architecture, and development teams should define these service boundaries and how they affect users.


Next, focus on what IT users want from an IT service. It may sound obvious but SLIs and SLOs should aim to measure things that users care about, rather than a technical objective that only matters to an IT department. (You might be surprised how many IT service level agreements seem written for the needs of IT departments rather than users.)


Measuring service levels can improve communication between the wider business and the IT department, both during product development and in discussions about cost and budgets.


Measuring IT service levels helps spot one-off and persistent faults. It also helps IT departments maintain high-quality services and explain them simply, using dashboard-type displays, to executives from other business units.





Defining SLIs and SLOs


The health of an IT service can be assessed based on criteria developed by Google, which created SRE.


The health check criteria include latency (the time it takes to service a transaction), error rate (the proportion of transactions that fail), and saturation (how close a service is to its full capacity; this is important to monitor because performance usually starts to deteriorate before utilization reaches 100%).
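As a rough illustration, the three signals above can be computed from a window of raw measurements. This is a sketch under stated assumptions: the nearest-rank percentile method, the 80% saturation warning level, and all sample figures are choices made for the example, not values from the SRE criteria themselves.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def health_snapshot(latencies_ms, failed, total, used, capacity, saturation_warn=0.80):
    """Summarize latency, error rate, and saturation for one measurement window.

    saturation_warn is an assumed early-warning level, not a standard value.
    """
    saturation = used / capacity
    return {
        "p90_latency_ms": percentile(latencies_ms, 90),
        "error_rate": failed / total if total else 0.0,
        "saturation": saturation,
        "saturation_warning": saturation >= saturation_warn,
    }


# Made-up window: ten latency samples, 12 failures in 4,000 transactions,
# and 42 of 50 worker slots in use
snapshot = health_snapshot(
    latencies_ms=[120, 95, 180, 240, 110, 130, 900, 105, 150, 125],
    failed=12, total=4000, used=42, capacity=50,
)
print(snapshot)
```

A percentile is used for latency rather than an average because a single slow outlier (the 900 ms sample here) can hide behind a healthy-looking mean.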


Here, things can get a little confusing. SLOs are based on SLIs. When defining SLOs, check that user expectations for an IT service are realistic and your IT department has enough resources to meet the targets. Is 100% availability essential? What are the resources required to achieve it? Would 99% uptime be acceptable, leaving more IT resources for other urgent matters?
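The trade-off between availability targets becomes easier to discuss once each target is translated into permitted downtime. The following sketch does that arithmetic; the 30-day window and the function name are assumptions made for the example.

```python
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: str, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window.

    The target is passed as a string (for example "99.9") so the arithmetic is exact.
    """
    target = Fraction(availability_percent) / 100
    if not 0 <= target <= 1:
        raise ValueError("availability must be between 0 and 100 percent")
    return float((1 - target) * window_days * MINUTES_PER_DAY)


for pct in ("99", "99.9", "100"):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Over a 30-day window, a 99% target leaves 432 minutes (a little over seven hours) for outages and maintenance, while a 100% target leaves none, which is why it is so expensive to promise.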



Defining SLAs


Agreeing on fair SLAs that benefit all parties means balancing customer expectations against the team’s delivery capabilities and technical limitations.


SLAs that are based on realistic SLOs, and that both business and IT can easily understand, encourage good conversations between stakeholders and lead to better decision making.





Better metrics, better decisions


Clarity about IT service levels is in everyone’s interest when trying to improve IT resilience.


For engineers, SLIs help their teams make better decisions about the implementation of an IT service. If engineers building the system know which SLIs will be measured and the targets set for them, they can adjust the design of the system to help it meet those targets.


For business executives outside the IT department, service levels can also support better data-driven decisions, ranging from change management timelines to balancing budgets.



In short, it’s a win-win.


Daniel Martines

Managing Director
Boston, United States

Discover related Articles

This article belongs to a five-part series on tech resilience written by Dan Martines. View the related articles below:


Stronger by Design: How ‘technology resilience’ can cut costs and boost your company’s profits

In the first of a series of articles on IT resilience, we tackle how ‘technology resilience’ can cut costs and boost your company’s profits.


How to get there: Maturity models for SRE and stages of development

In the second of a series of articles on IT resilience, we articulate five stages of development in our maturity model.
