With talented people and access to resources, running successful AI pilots with compelling results is straightforward; scaling those results is not. At BCG Platinion, we have experienced the challenges of scaling highly bespoke AI products, and we have learnt that teams tasked with this mission should include three key disciplines: data science, data engineering, and the focus of this article, ML engineering.
Understanding ML Engineering in the context of AI solutions
To understand why ML engineering is so key to delivering AI solutions at scale, we need to understand common scaling challenges within projects. The first challenge is planning.
If the approaches to the project do not match the scale of the problem, such as when overly complex bespoke algorithms are designed when stakeholders expect a fast delivery, pilots can fail before they have started or run their course.
Secondly, if the project is not properly scoped, the effort involved can be underestimated, leading to overruns that erode stakeholder confidence and budgets.
With the adoption of agile software engineering principles as part of ML, it becomes possible to communicate value during planning, scoping, experimentation, and development, with clearly defined metrics generated along the way.
This approach ensures that communication and collaboration are features of delivery, and that the simplest solutions to problems are rapidly identified. This also promotes reusability and frequent stakeholder feedback.
One of the most difficult aspects of product ownership is having the data available to articulate value to the stakeholders.
Establishing automated methods and processes to deliver this data early will save considerable time during the project lifecycle.
Tools and processes that achieve this aim include prediction quality evaluation, dashboards, model health evaluation and A/B tests.
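As one illustration of how an A/B test can turn into a number that stakeholders can read, the sketch below compares the success rates of two model variants with a two-proportion z-test. The function name and the counts are hypothetical and chosen only for illustration; a real evaluation would also need to consider sample size planning and how "success" is defined for the product.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Compare two success rates; return (z, two-sided p-value)."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both variants need at least one observation")
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # All outcomes identical in both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A (current model) vs variant B (candidate).
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A positive z here means variant B has the higher observed success rate; the p-value indicates how surprising that gap would be if the two variants were in fact equivalent.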
Inadequate technology can also lead to failure: overly complicated architectures bloat costs, while unsuitable technology stacks and deployment processes make it difficult to deliver reliable results.
These challenges result in failure points and, ultimately, increased fragility. Fragility erodes stakeholders’ confidence in both the solution and the teams running it. Fragility can be addressed by integrating MLOps into a data science workflow. MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently.
Leveraging MLOps as an ML Engineer
MLOps is an evolution of DevOps, catering for the significant differences in technology components within AI applications compared to more standard technology platforms.
In addition to established DevOps practice, MLOps addresses challenges relating to data and model versioning and algorithm training. It adapts the existing concepts of continuous integration and deployment, and introduces a solution for continuous training. Implementing MLOps becomes the responsibility of the ML Engineer and is at the core of ML Engineering. But how do you implement this?
Given that MLOps extends DevOps, we still apply the same techniques used in traditional software development. This ensures that all code is modular, that there is a branching strategy with team code reviews, and that it is tested, scanned and deployed in an automated manner. But these techniques also need to apply to new application elements that are the responsibility of data scientists or data engineers, such as data processing pipelines and machine learning code.
Ensuring all code within the system is modular and tested will generate significant short and long-term benefits in terms of stability, security and observability, not to mention the time and cost saved by reducing manual tasks. It also allows the whole team to articulate and show value quickly.
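To make "modular and tested" concrete for data processing code, here is a minimal sketch of two small pipeline steps written as pure functions, each with its own unit test. The function names, field names and test data are hypothetical; the point is only that pipeline logic can be tested in isolation in the same way as any other application code.

```python
def drop_incomplete(records, required_fields):
    """Keep only records with a non-None value for every required field."""
    return [
        record for record in records
        if all(record.get(field) is not None for field in required_fields)
    ]


def scale_min_max(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to zeros."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def test_drop_incomplete():
    records = [
        {"age": 30, "income": 100},
        {"age": None, "income": 50},  # explicit missing value
        {"income": 70},               # field absent altogether
    ]
    assert drop_incomplete(records, ["age", "income"]) == [{"age": 30, "income": 100}]


def test_scale_min_max():
    assert scale_min_max([10, 20, 30]) == [0.0, 0.5, 1.0]
    assert scale_min_max([5, 5]) == [0.0, 0.0]
    assert scale_min_max([]) == []


if __name__ == "__main__":
    test_drop_incomplete()
    test_scale_min_max()
    print("all pipeline step tests passed")
```

Because each step takes plain inputs and returns plain outputs, the same tests can run in an automated pipeline on every code review, which is where the stability and time savings described above come from.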
Once we have a more robust process in place to support the workflow, we can borrow deployment and monitoring techniques from the MLOps ecosystem.
Some critical DevOps features still apply, including platform, deployment and monitoring aspects. DevOps software engineering standards are also utilised, as well as environment setup to ensure QA. Production workstreams have a vital role to play in minimising errors, and infrastructure maintenance is also key to success. Finally, continuous integration and deployment (CI/CD) of software elements remains essential.
Enhancing the approach
This initial approach can be extended to cover MLOps by including automated model deployment, metadata storage, and A/B testing with batch scoring. Application code created in development can aid experiment, parameter and metrics tracking, while monitoring can be used to enhance qualitative and quantitative feedback, and to mitigate model drift.
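Experiment, parameter and metrics tracking can be sketched very simply. The example below is a minimal in-memory illustration, not a real library: the `RunLog` class, its methods and the parameter names are all hypothetical. In practice a team would more likely adopt a dedicated tracking tool, but the underlying idea is the same: every run records what was tried and what it scored.

```python
import json


class RunLog:
    """In-memory record of experiment runs (hypothetical helper for illustration)."""

    def __init__(self):
        self.runs = []

    def log(self, params, metrics):
        """Store one run's parameters and metrics; return its run id."""
        run = {
            "run_id": len(self.runs) + 1,
            "params": dict(params),
            "metrics": dict(metrics),
        }
        self.runs.append(run)
        return run["run_id"]

    def best(self, metric, higher_is_better=True):
        """Return the run with the best value for a metric, or None if unseen."""
        candidates = [run for run in self.runs if metric in run["metrics"]]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda run: run["metrics"][metric])

    def to_jsonl(self):
        """Serialise all runs as JSON lines, e.g. for storage alongside a model."""
        return "\n".join(json.dumps(run, sort_keys=True) for run in self.runs)


# Hypothetical usage: two runs with different settings.
log = RunLog()
log.log({"learning_rate": 0.1, "max_depth": 3}, {"auc": 0.81})
log.log({"learning_rate": 0.05, "max_depth": 5}, {"auc": 0.84})
print(log.best("auc")["params"])
```

Keeping parameters and metrics together per run is what makes it possible to explain later why a particular model version was promoted.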
Managing model drift and building confidence in automated model deployment are key to resolving the fragility problem. There are many reasons why drift can be a significant problem for an AI project, not least the risk of compromising the safety of individuals using the AI-assisted software if it is used in dangerous environments, or if the model is subject to regulatory compliance. Model retraining can be triggered by the following scenarios:
Firstly, retraining can be triggered if significant changes in the data distribution of input or output variables are observed. It can also be carried out if the model performance drops significantly, which requires continuous monitoring, or on an ad hoc basis if new data is not systematically available. If bigger blocks of new data become available at regular time intervals, retraining may also be necessary, following a set training frequency.
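The automated triggers described above can be sketched as a single check that returns the reasons, if any, for retraining. This is a minimal sketch under stated assumptions: drift is measured with a two-sample Kolmogorov-Smirnov statistic on one numeric feature, and the function names and all three thresholds are hypothetical values that would need tuning for a real system. The ad hoc case, where new data arrives unsystematically, is left as a manual decision.

```python
import bisect


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two numeric samples."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    largest_gap = 0.0
    # The maximum gap between two step functions occurs at a sample point.
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect.bisect_right(cur_sorted, x) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def retraining_reasons(reference, current, baseline_score, current_score,
                       days_since_training, drift_threshold=0.2,
                       max_score_drop=0.05, max_age_days=30):
    """Return the list of triggers that fired (thresholds are assumptions)."""
    reasons = []
    if ks_statistic(reference, current) > drift_threshold:
        reasons.append("data_drift")
    if baseline_score - current_score > max_score_drop:
        reasons.append("performance_drop")
    if days_since_training >= max_age_days:
        reasons.append("scheduled")
    return reasons


# Hypothetical data: the live feature values have shifted upwards.
training_values = list(range(100))
live_values = [value + 50 for value in training_values]
print(retraining_reasons(training_values, live_values, 0.90, 0.80, 45))
```

Returning the reasons rather than a bare yes or no keeps the decision auditable, which matters where the model is subject to regulatory scrutiny.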
All of these failure points can lead to the project costing more than the value it ultimately delivers, at which point it is more cost-effective to solve the problem without AI than with it.
Applying best practices and ML thinking early
It is vital to begin applying engineering best practices, and thinking about ML engineering, early in the process.
Successful AI delivery at scale requires a combination of data science, data engineering, and ML engineering expertise, with the latter offering a route to a robust, scalable platform with clearly quantifiable metrics.
About the Authors
Jamie is an Engineering Director from London with extensive experience managing software development teams in both startup and enterprise companies. With industry experience across consumer, energy and insurance sectors he brings a human centered approach to product delivery, innovation and platform design with a passion for sustainable engineering practices and AI assisted solutions.
Tom is an Engineering Lead based in the London office. He has vast experience delivering products and services with a human centered approach – including artificial intelligence solutions at scale and new startup ventures across a wide range of industries. Covering topics ranging from privacy enhancing AI to digital consortium design, he brings expertise in open innovation’s evolving role in solving high impact issues such as climate action, healthcare and online privacy.
A Managing Director based in the London office, Phil is a key member of BCG Platinion’s DPS and design team leadership in EMESA. He has extensive experience in innovation, having spent almost 20 years running his own product design and innovation practice, which BCG acquired in 2019. During that time he worked for many of the world’s leading tech companies, including Google, Apple and Microsoft, alongside a wide array of start-up ventures.