A growing number of IT departments have one thing on their mind: digital transformation. Technologies including AI, Big Data and the Internet of Things (IoT) have the potential to help businesses make big improvements in their performance.
Despite these big technological advances, IT operations often seem stuck in a time warp. Each year, even the largest companies regularly experience costly IT outages, network glitches and information security breaches.
There are too many examples to mention, but they include Royal Bank of Scotland being fined £42m for a four-week payments outage, British Airways refunding £60m to customers stranded by a data centre outage, and Jeep losing between $100m and $700m after it was hacked.
IT departments are often hindered by poorly designed architecture, unrealistic performance targets, and IT staff who waste time on manual tasks that could be automated and on reacting to problems that could be prevented.
How can IT infrastructure become more resilient?
One way is an IT method known as “site reliability engineering” (SRE), first developed by Google in 2004. It uses software engineering techniques, including automation and flexible “modular” software, or “microservices” (functions within software, such as checking a customer’s bank, which can operate independently from others within the same application). Microservices can make daily IT operations less prone to failure, faster, more scalable and easier to use.
As Benjamin Treynor Sloss, Vice President of Engineering at Google, puts it:
“SRE is what you get when you treat [IT] operations as a software problem and you staff it with software engineers.”
What type of IT worker is attracted to and suited to SRE? One of the software engineers in Google’s SRE team says it attracts people who want the flexibility to choose projects and work on low-level problems relating to scale and efficiency. “I’m always telling my friends that we solve cooler and more complex problems.”
SRE is now used by companies including Walmart, Morgan Stanley, Bloomberg and Oracle.
SRE staff focus on two inter-connected IT functions: product development and the daily management of IT services. After all, developing more stable and less buggy new software applications will cause less disruption to IT operations and require fewer modifications to existing IT services and IT systems.
Other benefits of SRE include the potential to reduce IT department headcount, and therefore costs, and to increase the efficiency of IT operations through automation. It can also remove organisational silos between infrastructure teams and software development teams. Silos can cause misunderstanding and contribute to IT failures.
SRE teams typically aim to improve the reliability of an organisation’s IT in the long term. They can also help organisations scale their IT more reliably and without incurring excessive costs.
SREs do work previously done by IT production teams, but the work is carried out by engineers with software skills. These skills are crucial because they are used to increase efficiency, extend automation and reduce costs. SRE teams also set targets for service levels and resilience, such as availability and response times. These metrics are expressed as Service Level Indicators (SLIs). For each SLI, a target is defined, known as a Service Level Objective (SLO). SRE teams constantly measure progress toward these targets. If an SLI degrades beyond the margin its SLO allows, known as the error budget, SRE teams pause new development to focus on fixing the issue.
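To make the relationship between an SLI, an SLO and an error budget more concrete, here is a minimal sketch in Python. It assumes a simple request-based availability SLI (successful requests divided by total requests) and an error budget counted in failed requests. The function names, the 99% target and the request counts are illustrative assumptions, not part of any particular SRE tool or of the method described in this article.

```python
def sli(good_requests: int, total_requests: int) -> float:
    """Service Level Indicator: the share of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic in the window, so nothing has failed
    return good_requests / total_requests


def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in this window."""
    return (1.0 - slo_target) * total_requests


def should_pause_development(
    slo_target: float, good_requests: int, total_requests: int
) -> bool:
    """True once failures exceed the error budget for the window."""
    failed_requests = total_requests - good_requests
    return failed_requests > error_budget(slo_target, total_requests)


# Illustrative numbers only: a 99% availability SLO over 10,000 requests
# tolerates roughly 100 failed requests.
print(sli(9_950, 10_000))                             # 0.995
print(should_pause_development(0.99, 9_950, 10_000))  # False: 50 failures
print(should_pause_development(0.99, 9_850, 10_000))  # True: 150 failures
```

In practice, teams often track the budget over a rolling window, such as 30 days, and agree in advance what happens when it runs out. The hard stop modelled here is only one possible policy.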
Return on investment
Creating an SRE team within an IT department and changing the way it works is a big undertaking that can take months or even years. Google’s website has plenty of free advice and information about SRE.
BCG has developed its own SRE method, which has helped dozens of businesses. Our five-step process for implementing SRE starts with changing the “culture” of your IT operations team. It finishes with interventions in IT architecture and in how business and IT collaborate.
The benefits of SRE can be significant. In our experience, SRE can reduce organisations’ IT downtime by between 10% and 30%, increase efficiency by between 10% and 15%, and quicken digital transformation by developing software two to five times faster than previously.
One bank we worked with estimates that SRE improved its bottom line by between $200 million and $400 million, due to improvements including a 25% increase in productivity and the launch of new products and services 40% faster than previously.
Special thanks to Ian Lottering, Vasudha Joshi, Victor Fonesca, and Yashiren Nair for contributing their expertise.
Discover related articles
This article belongs to a five-part series on tech resilience written by Dan Martines. View the related articles below:
How to get there: Maturity models for SRE and stages of development
In the second of a series of articles on IT resilience, we articulate five stages of development in our maturity model.
Measuring resilience: How service level agreements can improve your company’s IT resilience
In the third of a series of articles on how businesses can make their IT more resilient, we discuss practical methods to identify and develop service-level metrics connecting business to technology resilience.