Engineering for resilience: Engineering excellence and how to achieve it

In the fifth and last of a series of articles on how businesses can make their IT more resilient, we examine how a strong engineering culture and skill set are essential to scaling modern architectures with resilience.

Should we separate IT development and operations?

Traditional organizations enforce a classic boundary between two functions. First, there is a discrete function that develops IT applications and implements changes; second, there is a separate function that deploys those changes and manages them. The latter also manages common services such as infrastructure, network, and end-user computing.


The separation of these functions serves two purposes. First, it ensures that engineers are not making changes in the same environment used by end customers, which could lead to security breaches and inadvertent issues. Second, it helps organizations meet compliance and regulatory needs, such as maintaining data privacy (for example, engineers cannot see customers’ data).






However, this structural design is now being challenged, with evidence that it leads to tension between the two sides, misaligned incentives, and failure to meet business objectives. One side wants to implement changes fast; the other wants to operate reliably. Ultimately, this leads to blame and a lack of collaboration, and the business suffers.

Many technology firms have approached this problem differently and are now testing a new model in which engineers enjoy the freedom (and responsibility) to build, deploy, and operate their own services.

For example, Netflix designed a culture of “Freedom and Responsibility” in which engineers are empowered to work across the full technology stack, in disciplines such as build tools, deployment pipelines, metrics and alerts, and insights tools. Netflix calls these engineers Full Cycle Developers and requires them to go through rigorous boot camp training.



What can we do? Learnings from technology companies


Technology companies believe that engineering gives them an edge. In fact, engineering gives them a widening edge, one that is harder and harder for traditional firms to match, let alone surpass.


First, technology companies value engineering skills beyond the engineering function. At Google, most business and tech functions require coding skills. One example is product managers, who need a minimum of coding skills to understand the engineering implications of product decisions and to work effectively with engineers. By contrast, traditional firms have restricted coding skills to developers. In many cases, even enterprise architects no longer possess engineering skills.


Second, knowledge building goes beyond coding skills. Engineers are encouraged to work in pairs in order to build knowledge, align on coding best practices, and help mentor juniors. Engineers perform reviews of each other’s code in order to converge on a common style and level of quality.


Third, every engineer is trained in more than core engineering skills. At Algolia, engineers go through a 10-week boot camp to gain skills in Agile, test automation, and how to write high-quality production code. They also cover resilience topics such as “code smells,” which are patterns that signal problematic code. One example is a program that accesses another program’s data more than its own.
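The smell described above is commonly known as “feature envy.” A minimal sketch of what it can look like, and one way to remove it, is shown below. The `Customer` and `InvoicePrinter` classes and their fields are hypothetical names invented for illustration; they are not taken from any company’s training material.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical data owner used only for this illustration.
    name: str
    street: str
    city: str
    postcode: str

    def mailing_label(self) -> str:
        # After the fix: the logic lives next to the data it reads.
        return f"{self.name}\n{self.street}\n{self.postcode} {self.city}"


class InvoicePrinter:
    def label_with_smell(self, customer: Customer) -> str:
        # Smell: this method reads four fields of Customer and none of its own,
        # so any change to Customer's fields can break InvoicePrinter too.
        return (
            f"{customer.name}\n{customer.street}\n"
            f"{customer.postcode} {customer.city}"
        )

    def label(self, customer: Customer) -> str:
        # Refactored: ask the owner of the data to do the work.
        return customer.mailing_label()
```

Both methods produce the same label; the difference is that the refactored version keeps knowledge of the customer’s fields in one place, which tends to limit the blast radius of later changes.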


Fourth, engineers learn SRE skills to build more resilient systems. They learn how to work with distributed services and data, and are encouraged to bring new ideas about how to operate systems more efficiently.


Fifth, engineers focus on automating everything, from code review to deploying changes to monitoring services. They understand that scalability and resilience can only be achieved via automation, and recent methods such as DevOps and SRE include a strong focus on it. Many cloud providers offer automation out of the box, and there are solutions (packages, open-source) to address every automation need.
One of the biggest benefits is the automation of quality checks using test automation: technology firms run on average 10,000 to 50,000 automated tests before any change is deployed. Once deployed, changes are automatically rolled back if they cause an issue in production.
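As a rough illustration of the test-then-deploy-then-roll-back flow described above, the sketch below gates a release on an automated test run and reverts it if a post-deployment health check fails. It is a simplified sketch, not a description of any particular firm’s pipeline: the function `release` and the callables `run_tests`, `activate`, and `is_healthy` are hypothetical stand-ins for whatever test runner, deployment tool, and monitoring system an organization actually uses.

```python
from typing import Callable


def release(
    new_version: str,
    current_version: str,
    run_tests: Callable[[], bool],
    activate: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> str:
    """Deploy new_version only if tests pass; roll back on a failed health check.

    Returns the version left running when the function finishes.
    """
    # Gate 1: nothing is deployed unless the automated test suite passes.
    if not run_tests():
        return current_version

    activate(new_version)

    # Gate 2: probe production a fixed number of times after deployment.
    for _ in range(checks):
        if not is_healthy():
            # Automatic rollback: restore the previously running version.
            activate(current_version)
            return current_version

    return new_version
```

In practice the health check would typically read service-level metrics over a time window rather than poll a fixed number of times; the fixed loop here only keeps the example short.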






Traditional organizations face multiple challenges ahead. How will they attract and retain top engineering talent? How will they upskill their own internal engineering talent? Which should they do first: implement tooling, or upskill the IT workforce?


The engineering talent gap between traditional and tech firms is widening. Fortunately, in our experience, closing the gap is possible, but it requires diligent effort.

About the Author

Daniel Martines

Managing Director
Boston, United States

Discover related Articles

This article belongs to a five-part series on tech resilience written by Dan Martines. View the related articles below:

Digital & Tech | Article

Stronger by Design: How ‘technology resilience’ can cut costs and boost your company’s profits

In the first of a series of articles on IT resilience, we tackle how ‘technology resilience’ can cut costs and boost your company’s profits.

Digital & Tech | Article

How to get there: Maturity models for SRE and stages of development

In the second of a series of articles on IT resilience, we articulate five stages of development in our maturity model.

Digital & Tech | Article

Measuring resilience: How service level agreements can improve your company’s IT resilience

In the third of a series of articles on how businesses can make their IT more resilient, we discuss practical methods to identify and develop service-level metrics connecting business to technology resilience.

Digital & Tech | Article

Designing for resilience: How next generation tech architecture can make your IT more scalable and robust

In the fourth of a series of articles on how businesses can make their IT more resilient, we discuss how architecture designs must evolve to integrate new technology changes and keep pace with faster change.
