Designing for resilience: How next-generation tech architecture can make your IT more scalable and robust

For the fourth in a series of articles on how businesses can make their IT more resilient, we discuss how architecture designs must evolve to integrate new technologies and keep pace with faster change.

Technologies change fast but some things are timeless. We will always need accessible, secure, reliable and available systems.

 

 

Consumers need access to the following in order to survive:

  1. Funds (banking)

  2. Goods (e-commerce)

  3. Medical care (digital records)

  4. Employment (digital workplace)

    Organizations that produce these products and services must provide them reliably and at scale.

     

     

    Technology is advancing faster than IT workforces can keep up.

    Newer tech companies have the advantage of using modern technology such as artificial intelligence to solve customer problems.

     

    Traditional organizations must deal with legacy infrastructure that is expensive to maintain, difficult to change and relies on an ageing IT workforce. When workers retire, will their knowledge of old IT systems be lost? Organizations are forced to maintain the old while adapting to the new, but few companies have navigated this transition successfully. The successful ones can re-design or modify their IT architectures to become more adaptable and resilient.

     

     

    How to get started?

     

    When we work with organizations to bootstrap their IT architecture, we identify pilot projects to get started. Organizations tend to start by establishing basic disciplines in resilience, which will allow them to scale later.

     

    The next step is identifying common services in the legacy IT infrastructure, such as “view shopping cart” or “get account balance,” which can be “extracted” and implemented as standalone services. This effort helps the organization align on the right size and scope of a service. It also helps an organization think in terms of services rather than applications and learn how to apply the technologies necessary for successful service construction and deployment.
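As a rough illustration of what "extracting" a service can look like, the sketch below wraps a single legacy lookup behind a narrow "get account balance" interface. Everything named here (`LegacyCoreBanking`, `fetch_customer_record`, the record fields) is hypothetical and stands in for whatever the real legacy system exposes.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Protocol


class LegacyCoreBanking(Protocol):
    """Hypothetical view of the legacy system: only the one call we need."""

    def fetch_customer_record(self, account_id: str) -> dict: ...


@dataclass(frozen=True)
class AccountBalance:
    account_id: str
    balance: Decimal
    currency: str


class AccountBalanceService:
    """Standalone service with one narrow job: return an account balance."""

    def __init__(self, legacy: LegacyCoreBanking) -> None:
        self._legacy = legacy

    def get_account_balance(self, account_id: str) -> AccountBalance:
        # The legacy record may carry many fields; the service exposes only
        # the ones that belong to its scope.
        record = self._legacy.fetch_customer_record(account_id)
        return AccountBalance(
            account_id=account_id,
            balance=Decimal(record["balance"]),
            currency=record["currency"],
        )


class FakeLegacy:
    """In-memory stand-in for the legacy system, for illustration only."""

    def fetch_customer_record(self, account_id: str) -> dict:
        return {"balance": "125.50", "currency": "GBP", "branch": "0042"}


service = AccountBalanceService(FakeLegacy())
result = service.get_account_balance("ACC-1")
```

Keeping the interface this small is one way to make the "size and scope" discussion concrete: callers depend on the service contract, not on the legacy record layout.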

     

    Next, organizations should shift towards proactive resilience: detecting and fixing problems before the customer is aware of them.

    One way to do this is to run routine health checks of IT systems (for example, every five minutes or hourly) with appropriate alerting and reporting, so that each issue reaches the operator best placed to solve it quickly. For a search service, a routine health check can test whether the service is returning no results and, if so, notify the appropriate development team.
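A minimal sketch of such a health check follows, assuming a probe that runs a known-good search query and a `notify` callback that reaches the owning team. The names `HealthCheck`, `run_check`, and `search-team` are illustrative, not taken from any particular monitoring product, and a real deployment would call `run_check` from a scheduler at the chosen interval.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HealthCheck:
    service: str
    owner: str                   # team that should be alerted (assumed name)
    probe: Callable[[], list]    # runs a known-good query, returns its results


def run_check(check: HealthCheck, notify: Callable[[str, str], None]) -> bool:
    """Run one probe; alert the owning team if it fails or returns nothing."""
    try:
        results = check.probe()
    except Exception as exc:
        notify(check.owner, f"{check.service}: probe raised {exc!r}")
        return False
    if not results:
        notify(check.owner, f"{check.service}: known-good query returned no results")
        return False
    return True


# Stand-in for a real alerting channel: collect alerts in a list.
alerts: list = []


def record_alert(owner: str, message: str) -> None:
    alerts.append((owner, message))


healthy = HealthCheck("search", "search-team", lambda: ["result-1"])
broken = HealthCheck("search", "search-team", lambda: [])

ok = run_check(healthy, record_alert)      # no alert expected
failed = run_check(broken, record_alert)   # one alert to search-team
```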

     

     

    After improving troubleshooting, an organization can rethink its monitoring infrastructure. The focus can then expand from infrastructure monitoring (CPU, memory) to service level monitoring.

     

    Service level monitoring includes service availability, uptime, and response time. A search service should respond in milliseconds; if it starts to respond in seconds, monitoring can trigger a notification that something is wrong. The importance of service level objectives and their link to IT resilience is described in the second article in this series.
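One way to express the response-time example is a simple check over recent latency samples. The sketch below assumes a 500 ms threshold and a tolerance of 10% slow requests; both numbers are placeholders chosen for illustration, not recommendations from this article.

```python
from typing import Sequence


def breaches_latency_objective(
    samples_ms: Sequence[float],
    threshold_ms: float,
    max_slow_fraction: float,
) -> bool:
    """Return True when too large a share of requests exceeds the threshold."""
    if not samples_ms:
        return False  # no traffic observed, nothing to judge
    slow = sum(1 for sample in samples_ms if sample > threshold_ms)
    return slow / len(samples_ms) > max_slow_fraction


# Response times in milliseconds for a hypothetical search service.
normal_window = [40, 55, 60, 48, 52]
degraded_window = [40, 55, 60, 1200, 2500]   # two requests took seconds

normal_alert = breaches_latency_objective(normal_window, 500, 0.10)
degraded_alert = breaches_latency_objective(degraded_window, 500, 0.10)
```

Unlike a CPU or memory alarm, this check fires on what the user experiences, which is the shift from infrastructure monitoring to service level monitoring described above.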

     

     

     

     

    Next stop: scale reliably

    Once organizations begin to improve their IT resilience, it’s time to scale. Organizations need to plan how they can migrate core applications into a service level design, a process also known as “hollowing the core.”

     

    Scaling the service footprint with “core application logic” requires several architecture strategies, such as rethinking how code running on a central server can run on a distributed infrastructure, with data split across regions. These strategies are expertly employed by many technology firms at scale, and can result in major improvements in the firm’s performance and outcomes for customers.

     

    One place to start is data: there are opportunities to rapidly build new digital experiences on top of data platforms. We help organizations identify, develop, and launch new digital experiences by building services on top of core legacy data; for example, building a “get account balance” service on top of “customer data.” This first requires companies to stream core data in real time to a data platform. They can then build new digital services on top of that data.
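The pattern can be sketched as a read model that consumes a stream of change events from the core system and answers balance queries without calling the core. The event shape (`account_id`, `amount`) and the class name `BalanceReadModel` are assumptions made for this example; a real data platform would deliver events through its own streaming interface.

```python
from decimal import Decimal
from typing import Dict, Iterable


class BalanceReadModel:
    """Keeps balances up to date from streamed core events (illustrative)."""

    def __init__(self) -> None:
        self._balances: Dict[str, Decimal] = {}

    def apply(self, event: dict) -> None:
        # Each event carries a signed amount: credits positive, debits negative.
        account = event["account_id"]
        current = self._balances.get(account, Decimal("0"))
        self._balances[account] = current + Decimal(event["amount"])

    def consume(self, events: Iterable[dict]) -> None:
        for event in events:
            self.apply(event)

    def get_account_balance(self, account_id: str) -> Decimal:
        # Served from the data platform, not from the legacy core.
        return self._balances.get(account_id, Decimal("0"))


# A list stands in for the real-time stream.
stream = [
    {"account_id": "ACC-1", "amount": "100.00"},
    {"account_id": "ACC-1", "amount": "-30.25"},
    {"account_id": "ACC-2", "amount": "10.00"},
]

read_model = BalanceReadModel()
read_model.consume(stream)
balance = read_model.get_account_balance("ACC-1")
```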

     

    Once there is a strong data foundation, the work on “hollowing the core” can start. A data platform allows organizations to migrate from old legacy technology to new services incrementally. It is important for companies to set targets over time: for example, 10% of the core workload migrated to services in six months, 20% in 12 months, 60% in two years, and so on.
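For tracking purposes, the example targets above can be written down as data and checked against actual progress. This is only a sketch: the milestone values come from the example in the text, while the function names and the "most recent milestone" rule are assumptions.

```python
from bisect import bisect_right

# (months elapsed, target % of core workload migrated), from the example above.
MILESTONES = [(6, 10), (12, 20), (24, 60)]


def target_at(month: int) -> int:
    """Target that should already have been met by the given month."""
    months = [m for m, _ in MILESTONES]
    index = bisect_right(months, month)
    return MILESTONES[index - 1][1] if index else 0


def on_track(month: int, migrated_percent: float) -> bool:
    """True when migrated workload meets the most recent milestone."""
    return migrated_percent >= target_at(month)


status_month_12 = on_track(12, 15)   # 15% migrated, 20% expected
status_month_8 = on_track(8, 12)     # 12% migrated, 10% expected
```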

     

    As services scale, they are distributed across data centers and the cloud. At this point, it is important to focus on resilience before scaling further.

     

     

    Resilient architecture strategies include:

    1. Distribute “read” data, such as “check account balance” or “get product price,” to minimize dependency on a central database.

    2. Route user traffic to infrastructure located near the user. This reduces user impact when a service is down, which is known in Site Reliability Engineering (SRE) as reducing the “blast radius.”

    3. Build in logic to maintain uptime even when dependencies start to fail. This technique allows a service to stay responsive to users and re-route to other available services with minimal user disruption. In SRE, this is known as “graceful degradation.”
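The three strategies can be combined in a small sketch: a "get product price" read is served from regional replicas, the nearest region is tried first, and the service degrades to a reduced response if every replica is unavailable. The region names, the preference table, and the response shape are all hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical preference order: nearest region first, then a fallback.
REGIONS_BY_PREFERENCE = {
    "eu": ["eu-west", "us-east"],
    "us": ["us-east", "eu-west"],
}

Replica = Callable[[str], float]   # product_id -> price, may raise ConnectionError


def get_product_price(
    product_id: str, user_region: str, replicas: Dict[str, Replica]
) -> dict:
    """Read from the nearest healthy replica; degrade instead of failing."""
    for region in REGIONS_BY_PREFERENCE[user_region]:
        try:
            price: Optional[float] = replicas[region](product_id)
            return {"price": price, "source": region, "degraded": False}
        except ConnectionError:
            continue   # re-route to the next region
    # Graceful degradation: the page can still render without a price.
    return {"price": None, "source": None, "degraded": True}


def down(product_id: str) -> float:
    raise ConnectionError("replica unavailable")


def up(product_id: str) -> float:
    return 9.99


nearest = get_product_price("sku-1", "eu", {"eu-west": up, "us-east": up})
rerouted = get_product_price("sku-1", "eu", {"eu-west": down, "us-east": up})
degraded = get_product_price("sku-1", "eu", {"eu-west": down, "us-east": down})
```

In this sketch an outage in one region only affects the users routed there, and even a full outage of the price replicas leaves the rest of the experience usable.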

       

       

      What’s next?

       

      Scalability and resiliency require ongoing improvement, and it typically takes between six and 12 weeks to identify and implement SRE interventions in each service. This is a process that organizations should implement continuously until services achieve stable metrics, such as uptime, scalability, and the ability to recover automatically.

       

      Improvement starts with training employees in modern ways to design and operate technology, such as test automation and continuous deployment. It also requires a culture of empowerment and appreciation of IT engineering skills. Technology firms are by nature highly technology-literate, which helps them co-exist more harmoniously with constant tech change. Traditional organizations can learn from this.

       

       

      As services scale, there are many disciplines that can be continuously improved, including:

       

      • Minimize configuration changes between writing code and launching it live to users. This allows code to remain unchanged as it moves across environments.

      • Reduce outage time by making sure that the engineer responsible for a service is notified as soon as there is a problem.

      • Implement redundancy in services and infrastructure across geographies where users are located, including smart redirection of user traffic (load balancing) when there are problems. This can also be accomplished via “service mesh” technologies (which control how different parts of an application share data), such as Istio.

      • Implement caching mechanisms to support user actions even when underlying services are down.
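Two of these disciplines lend themselves to a short sketch: reading configuration from the environment so the same code runs unchanged everywhere, and serving the last good value from a cache when an underlying service is down. The variable names (`SEARCH_URL`, `SEARCH_TIMEOUT_S`), the defaults, and the `StaleOnErrorCache` class are assumptions for illustration.

```python
import os
from typing import Any, Callable, Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read settings from the environment; the code itself never changes."""
    return {
        "search_url": env.get("SEARCH_URL", "http://localhost:8080"),
        "timeout_s": float(env.get("SEARCH_TIMEOUT_S", "2.0")),
    }


class StaleOnErrorCache:
    """Return the last good value when the underlying call fails."""

    def __init__(self, fetch: Callable[[str], Any]) -> None:
        self._fetch = fetch
        self._store: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._store:
                return self._store[key]   # possibly stale, but usable
            raise                         # nothing cached: surface the error
        self._store[key] = value
        return value


# Simulate a cart service that works once and then goes down.
state = {"fail": False}


def fetch_cart(user_id: str) -> list:
    if state["fail"]:
        raise ConnectionError("cart service down")
    return ["book"]


cache = StaleOnErrorCache(fetch_cart)
first = cache.get("u1")     # fetched live and cached
state["fail"] = True
second = cache.get("u1")    # served from cache during the outage
```

Whether stale data is acceptable depends on the action: a shopping cart view may tolerate it, while an account balance may need a visible "last updated" marker.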

      About the Author

      Dan Martines

      Managing Director & SRE Practice Lead
      London

      Discover related Articles

      This article belongs to a five-part series on tech resilience written by Dan Martines. View the related articles below:

      Digital & Tech | Article

      Stronger by Design: How ‘technology resilience’ can cut costs and boost your company’s profits

      In the first of a series of articles on IT resilience, we tackle how ‘technology resilience’ can cut costs and boost your company’s profits.

      How to get there: Maturity models for SRE and stages of development

      In the second of a series of articles on IT resilience, we articulate five stages of development in our maturity model.

      Measuring resilience: How service level agreements can improve your company’s IT resilience

      In the third of a series of articles on how businesses can make their IT more resilient, we discuss practical methods to identify and develop service-level metrics connecting business to technology resilience.