The Intersection of Chaos Engineering and GenAI

In the right hands, chaos engineering combined with GenAI can be a powerful tool for digital transformation.

Chaos engineering has long been an outlier in the Software Development Life Cycle (SDLC). Most organizations consider it too risky to embed into their development, test, or production environments, while a select few consider it an indispensable tool for building reliable infrastructure. The irony is that the organizations that most need stability and reliability are the least likely to consider using chaos engineering to improve their current state.

Research suggests that 74% of organizations consider digital transformation activities a top budget priority. As a result, the infusion of cloud capabilities, Artificial Intelligence (AI), the Internet of Things (IoT), and Robotic Process Automation (RPA) has become a mainstream approach to reducing cost, enhancing workflow efficiency, improving agility, and fostering innovation. Nonetheless, the practical execution of such digitization comes at a price that warrants attention.

Outages are an inevitable part of using technology. The Annual Outage Report 2023 states that ~41% of organizations experience at least one significant outage per month, leading to millions of dollars in losses. Close to 70% of outages in 2022 cost companies over $100,000, a sharp rise compared with 2019, when only 40% of outages cost over $100,000. The loss of customers and brand equity due to such outages isn’t generally measured but can reasonably be considered a large multiplier of the direct losses. These costs are expected to rise with continued reliance on digital services. In addition to economic loss and operational disruption, outages have other negative consequences, including user dissatisfaction, data loss, and reduced productivity.

These growing consequences make it imperative for governments, businesses, and non-profits to ensure that their technical systems are robust and highly resilient. Organizations require sophisticated contingency plans to enable graceful degradation of systems, a key principle stated in most distributed system designs but rarely applied in practice. The key lies in proactively identifying failures before they occur and developing actionable solutions to strengthen resilience. This is where Chaos Engineering can provide much-needed structure and guidance.

The concept of Chaos Engineering dates back to the early 2010s, when Netflix developed Chaos Monkey to protect its services from Amazon Web Services outages. The concept has evolved significantly since then. With the recent spike in the adoption of Generative Artificial Intelligence (GenAI), several exciting use cases are emerging at the intersection of GenAI and Chaos Engineering. Let us delve further into this concept, its potential to ensure better digital resilience, and the ethical dilemmas that it brings about.

What is chaos engineering?

 

Chaos engineering is the practice of deliberately introducing chaos, failures, and unexpected events into a system to test its ability to withstand disruptions. Its primary goal is to proactively identify weaknesses and vulnerabilities in a system’s architecture, infrastructure, or software components before they can lead to major outages or failures in a production environment. The resulting insights help teams avert outages. Organizations adopt chaos engineering to enhance system resilience and reduce the cost of operational and reputational damage. Teams that run frequent chaos engineering experiments report high effectiveness, with availability rates above 99.9%.
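To make the idea concrete, here is a minimal sketch of a chaos experiment in Python. It is an illustration under stated assumptions, not a production tool: `FlakyDependency`, `get_price`, and `run_experiment` are hypothetical names, the "fault" is a simulated connection error, and the steady-state hypothesis is simply that at least 99.9% of requests still succeed while faults are being injected.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service whose failure rate we control."""

    def __init__(self, failure_rate=0.0, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def fetch_price(self, item):
        # Fault injection point: fail a fraction of the calls on purpose.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return 100


def get_price(dep, item, cached_price=95):
    """System under test: degrade gracefully to a cached price on failure."""
    try:
        return dep.fetch_price(item)
    except ConnectionError:
        return cached_price


def run_experiment(failure_rate, handler=get_price, requests=1000, min_success=0.999):
    """Inject faults at the given rate and check the steady-state hypothesis."""
    dep = FlakyDependency(failure_rate=failure_rate)
    succeeded = 0
    for _ in range(requests):
        try:
            handler(dep, "sku-1")
            succeeded += 1
        except ConnectionError:
            pass  # an unhandled fault counts as a failed request
    success_ratio = succeeded / requests
    return {
        "success_ratio": success_ratio,
        "hypothesis_holds": success_ratio >= min_success,
    }


if __name__ == "__main__":
    # With the fallback in place the hypothesis holds despite 30% injected faults.
    print(run_experiment(failure_rate=0.3))
    # Without a fallback, the same faults surface as failed requests.
    print(run_experiment(failure_rate=0.3, handler=lambda dep, item: dep.fetch_price(item)))
```

Running the same experiment with and without the fallback shows the point of the exercise: the injected fault is identical, and only the system's ability to degrade gracefully decides whether the hypothesis survives.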

 

The stages of chaos engineering

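A typical chaos engineering model deliberately attempts to break an existing system. Doing so reveals how resilient the system is and surfaces the issues that need fixing. Chaos engineering follows the Rumsfeld matrix in its implementation, classifying what is understood about a system's failure modes into known knowns, known unknowns, unknown knowns, and unknown unknowns.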

Decomposing Chaos Engineering: An example

Let’s take the example of a cloud-based organization. As with all cloud applications, this organization’s systems are prone to hacking, data theft, and other malicious attempts. Ensuring resilience here would mean identifying and averting potential threats.

Chaos Engineering and the intersection with GenAI

When used with Generative AI (GenAI), chaos engineering applications can speed up processes and improve results. GenAI adds crucial insights to chaos engineering experiments, especially in cases where human language is involved. Chaos engineering teams can use large language models (LLMs) and their predictive capabilities to simulate human thinking and solve problems. This can help with automated moderation, which can benefit organizations regardless of domain.

Tactically, this can be implemented by leveraging the following architectural components:

 
 
 

  • Application 1: Traffic management

 

Without accurate historical and near real-time traffic data, it remains difficult for commuters to make informed decisions about their routes and commuting times. Similarly, city planners and transportation authorities may have limited data to guide road maintenance, infrastructure upgrades, and other decisions related to transportation and the economy.

By introducing deliberately skewed data (e.g., high traffic load, road closures, unexpected accidents) and combining it with real-time data from sources such as sensors, traffic cameras, or even demographic details of residents, one can leverage Large Language Models (LLMs) to generate scenarios such as traffic congestion (known knowns) ahead of their potential occurrence, identify commuter road blockages (known unknowns), and ultimately associate potential solutions with them.
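The data-skewing step described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Reading` record, the `inject_skew` and `build_traffic_prompt` helpers, and the prompt wording are hypothetical, and the LLM is passed in as a plain callable so that no particular vendor API is implied.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Reading:
    """Hypothetical sensor reading for one road segment."""
    road: str
    vehicles_per_min: int
    closed: bool = False


def inject_skew(readings, road, load_factor=1.0, close=False):
    """Return a copy of the readings with one road deliberately distorted."""
    skewed = []
    for reading in readings:
        if reading.road == road:
            reading = replace(
                reading,
                vehicles_per_min=int(reading.vehicles_per_min * load_factor),
                closed=close or reading.closed,
            )
        skewed.append(reading)
    return skewed


def build_traffic_prompt(readings):
    """Turn the (skewed) readings into a plain-text prompt for an LLM."""
    lines = [
        f"- {r.road}: {r.vehicles_per_min} vehicles/min, {'CLOSED' if r.closed else 'open'}"
        for r in readings
    ]
    return (
        "Given these road conditions, describe the likely congestion "
        "and suggest mitigations:\n" + "\n".join(lines)
    )


def generate_scenario(readings, llm):
    """`llm` is any callable that maps a prompt string to a response string."""
    return llm(build_traffic_prompt(readings))


if __name__ == "__main__":
    baseline = [Reading("A1", 40), Reading("B2", 25)]
    skewed = inject_skew(baseline, "A1", load_factor=3, close=True)
    # A stand-in for a real model call: it just echoes the prompt it received.
    print(generate_scenario(skewed, llm=lambda prompt: prompt))
```

Keeping the baseline readings immutable and returning a skewed copy means the distorted data never overwrites the real feed, which matters if the same pipeline also serves live decisions.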

  • Application 2: Automated Fault Injection Scenario

 

Imagine a cloud-based e-commerce platform that relies on multiple microservices. Instead of manually injecting faults like network latency or service unavailability, GenAI analyses the platform’s architecture and past performance data. It then generates specific fault scenarios, such as simulating a sudden spike in customer traffic or a database server failure. These generated scenarios can be automatically injected into the system, allowing engineers to test its resilience under realistic conditions without needing to devise these scenarios manually.
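A rough Python sketch of this flow follows. It assumes, purely for illustration, that the model is asked to return a JSON list of `{"service", "fault"}` objects; the fault vocabulary in `ALLOWED_FAULTS`, the helper names, and the prompt text are all hypothetical. The model is again passed in as a callable, and its output is validated before anything is handed to an injection tool.

```python
import json

# Hypothetical vocabulary of faults the injection tooling is allowed to run.
ALLOWED_FAULTS = {"latency", "unavailable", "traffic_spike"}


def build_fault_prompt(services, incidents):
    """Describe the architecture and past incidents to the model."""
    return (
        "Services: " + ", ".join(services) + "\n"
        "Past incidents: " + "; ".join(incidents) + "\n"
        "Propose fault scenarios as a JSON list of objects with the keys "
        "'service' and 'fault'. Allowed faults: " + ", ".join(sorted(ALLOWED_FAULTS))
    )


def parse_scenarios(raw, services):
    """Keep only well-formed scenarios that target known services."""
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(candidates, list):
        return []
    valid = []
    for candidate in candidates:
        if not isinstance(candidate, dict):
            continue
        service, fault = candidate.get("service"), candidate.get("fault")
        if not isinstance(service, str) or not isinstance(fault, str):
            continue
        if service in services and fault in ALLOWED_FAULTS:
            valid.append({"service": service, "fault": fault})
    return valid


def plan_experiments(services, incidents, llm):
    """Ask the model for scenarios, then filter them before any injection."""
    return parse_scenarios(llm(build_fault_prompt(services, incidents)), services)


if __name__ == "__main__":
    def fake_llm(prompt):
        # Stand-in for a real model: one usable scenario and two that must be rejected.
        return json.dumps([
            {"service": "checkout", "fault": "latency"},
            {"service": "billing-legacy", "fault": "unavailable"},
            {"service": "catalog", "fault": "delete_database"},
        ])

    print(plan_experiments(["checkout", "catalog"], ["DB failover in peak hour"], fake_llm))
```

The validation step is the important design choice: generated scenarios are treated as untrusted suggestions, so only faults from an approved list, aimed at services that actually exist, can reach the system.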

 
 
 

Are you looking for optimal results? They are possible with the careful use of GenAI in chaos engineering.

In this era of digital transformation, chaos engineering has become necessary to identify loopholes and avert technical failures. GenAI is the latest technological buzz, and applying it to chaos engineering experiments can amplify outcomes and lead to quicker solutions that result in safer living and working environments. However, this powerful combination is riddled with ethical concerns, such as bias and unfairness, misleading information, privacy violations, and hampered user experiences. The onus lies on decision-makers to be mindful of such consequences. That said, the benefits of this duo are manifold; as long as chaos engineers can ensure end-users’ safety by controlling threats, the future holds powerful potential.

About the Authors

Andreas Rindler

Managing Director
Head of PIPE
London, UK

Andreas is a Managing Director. He specializes in IT strategy, product & tech transformation and data platforms for global clients in software & technology, media, consumer and financial services industries. He has deep experience working with private equity owned businesses and corporate clients pre-deal or during value creation.

Syed Husain

Principal IT Architect
London, UK

Syed Husain is a Principal Architect with BCG Platinion with more than 15 years of experience in IT consulting. He focuses on Solution Architecture, AI and Data Strategy with Financial Services, Public Sector, and TMT clients in Europe and the Middle East. When not solving critical technology and business problems for clients, Syed can be found perfecting his DOTA 2 and StarCraft 2 skills.
