Make our IT more reliable, faster and a spur for innovation. Boardrooms’ expectations for IT are rising. In the past year, many companies have accelerated plans for digital technologies in response to the Covid-19 pandemic and the sudden switch to remote working.
One increasingly common way to improve the resilience of an organisation’s IT (fewer outages and security breaches, greater scalability and flexibility) is a method called “site reliability engineering” (SRE).
To minimize disruption when switching an IT department to SRE, CIOs should first decide to what extent they want to use it and whether in a basic or advanced form.
Answering a handful of questions can help your organization work out its SRE “maturity”. Is there a forum where business requirements and IT architectures are discussed so there is a common understanding? Have we considered scenarios in which an IT system fails? Do we have an accurate measurement for fitness functions?
Answering these questions helps managers decide how quickly they can develop their SRE and how advanced it can be.
To help organizations assess what level of SRE is right for them, and how long it may take to get there, BCG has developed an SRE “maturity model”. It has five levels, each with its own approximate timeframe.
To make SRE projects easier to manage, our maturity model helps prioritize the SRE interventions of highest value, taking into account the organization’s current capability level.
For example, agreeing service level indicators (errors, response times, saturation and throughput) to measure technology resilience, and training staff in SRE and tech resilience, are good starting points for organizations at maturity levels one and two. “Microservices architecture” and “environment provisioning automation” are probably best left until an organization has moved beyond level three of SRE maturity.
Finally, at level five, organizations demonstrate high levels of SRE discipline, automation of controls and functions, and productive alignment between business and IT teams.
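To make the service level indicators mentioned above more concrete, here is a minimal sketch in Python of how the four of them (errors, response times, saturation and throughput) might be calculated for a single measurement window. It is an illustration only, not part of the maturity model: the names Request and compute_slis, the choice of the 95th percentile for response times, and the sample figures are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One served request (hypothetical record format)."""
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request ended in an error


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(requests, window_seconds, in_use, capacity):
    """Summarize one measurement window as four indicators."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests observed in this window")
    errors = sum(1 for r in requests if not r.ok)
    return {
        "error_rate": errors / total,                  # errors
        "p95_latency_ms": percentile(
            [r.latency_ms for r in requests], 95),     # response times
        "throughput_rps": total / window_seconds,      # throughput
        "saturation": in_use / capacity,               # share of capacity used
    }


# Made-up sample: 20 requests in a 10-second window, one of which failed,
# on a service using 6 of its 8 available workers.
sample = [Request(latency_ms=100 + 10 * i, ok=(i != 7)) for i in range(20)]
slis = compute_slis(sample, window_seconds=10, in_use=6, capacity=8)
print(slis)
```

In practice these figures would usually come from a monitoring system rather than hand-written code. The point of the sketch is that each indicator is a simple, agreed calculation that business and IT teams can both read.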
This maturity model has helped our clients make substantial improvements to their IT resilience.
Like most big IT projects, SRE is about more than technology. It also requires a change in an IT department’s culture. A good start is encouraging IT staff to learn from mistakes without different teams blaming each other when things go wrong.
This “no blame” culture can take between one and three months to implement. It should include having detailed, clear procedures for IT “postmortems” into, for example, IT outages or cybersecurity breaches. The most important action after a postmortem is making changes, preferably automating them, so the same outage never occurs again. Following through in this way helps an organisation make incremental improvements after each incident, builds organisational knowledge and improves processes.
Over time, SRE culture adoption translates into more freedom for product teams to innovate faster, while remaining accountable for building resilient products.
Site Reliability Engineering can help organizations improve the resilience and flexibility of their IT, increase its efficiency and support digital innovation. However, it can be tricky to do because it requires major changes to how IT staff work. Working out the “maturity” of your organization’s SRE can help it manage expectations and get returns.
Discover related articles
This article belongs to a five-part series on tech resilience written by Dan Martines. View the related articles below:
Stronger by Design: How ‘technology resilience’ can cut costs and boost your company’s profits
In the first of a series of articles on IT resilience, we tackle how ‘technology resilience’ can cut costs and boost your company’s profits.
Measuring resilience: How service level agreements can improve your company’s IT resilience
In the third of a series of articles on how businesses can make their IT more resilient, we discuss practical methods to identify and develop service-level metrics connecting business to technology resilience.