Applying a Software Engineering Mindset to AI Product Development
With talented people and access to resources, running successful AI pilots with compelling results is straightforward; scaling those results, however, is not. At BCG Platinion, we have experienced the challenges of scaling highly bespoke AI products, and we have learnt that teams tasked with this mission should combine three key disciplines: data science, data engineering, and the focus of this article, ML engineering.
1. Data science is the core of the team, producing models, algorithms and statistical analysis via experimentation and development.
2. Data engineering provides data availability and governance via a centralised store for engineered features (a feature store).
3. ML engineering brings software engineering rigour to the data science process, built on learnings from agile software development and an established development and deployment process called MLOps.
Understanding ML Engineering in the context of AI solutions
To understand why ML engineering is so key to delivering AI solutions at scale, we need to understand common scaling challenges within projects. The first challenge is planning. If the approach does not match the scale of the problem, for example when overly complex bespoke algorithms are designed while stakeholders expect fast delivery, pilots can fail before they start or before they have run their course.
Secondly, if the project is not properly scoped, underestimation and overruns can erode stakeholder confidence and budgets. By adopting agile software engineering principles as part of ML delivery, it becomes possible to communicate value during planning, scoping, experimentation, and development, with clearly defined metrics generated along the way.
This approach ensures that communication and collaboration are features of delivery, and that the simplest solutions to problems are rapidly identified. It also promotes reusability and frequent stakeholder feedback.
One of the most difficult aspects of product ownership is having the data available to articulate value to stakeholders. Establishing automated methods and processes to deliver this data early will save considerable time over the project lifecycle. Tools and processes that achieve this aim include prediction quality evaluation, dashboards, model health evaluation and A/B tests. Inadequate technology can also lead to failure, whether through overly complicated architectures that bloat costs, or technology stacks and deployment processes that make it difficult to deliver reliable results.
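As an illustration of what an automated prediction-quality evaluation step might look like, the minimal sketch below uses scikit-learn metrics; the toy data and the 0.80 baseline are assumptions for demonstration only, not a prescribed implementation:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy data standing in for a labelled hold-out set and model outputs.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_scores = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]

def evaluate_predictions(y_true, y_pred, y_scores):
    """Compute prediction-quality metrics that can be pushed to a
    dashboard or compared against agreed baselines automatically."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_scores),
    }

metrics = evaluate_predictions(y_true, y_pred, y_scores)
if metrics["roc_auc"] < 0.80:  # illustrative baseline threshold
    raise RuntimeError(f"Prediction quality below baseline: {metrics}")
```

Wiring a check like this into the delivery pipeline means value can be demonstrated to stakeholders on every run, rather than assembled manually at review time.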
These challenges create failure points and, ultimately, fragility. Fragility erodes trust, as stakeholders lose confidence in both the solution and the teams running it. It can be addressed by integrating MLOps into the data science workflow: MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently.
Leveraging MLOps as an ML Engineer
MLOps is an evolution of DevOps that caters for the significant differences between the technology components of AI applications and those of more standard technology platforms. In addition to established DevOps practice, MLOps addresses challenges relating to data and model versioning and algorithm training, adapts existing concepts of continuous integration and deployment, and introduces a solution for continuous training. Implementing MLOps is the responsibility of the ML engineer and sits at the core of ML engineering. But how do you implement it? Given that MLOps extends DevOps, we still apply the same techniques used in traditional software development: all code is modular, a branching strategy with team code reviews is in place, and everything is tested, scanned and deployed in an automated manner. These techniques also need to apply to the application elements that are the responsibility of data scientists and data engineers, such as data processing pipelines and machine learning code.
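For instance, the unit-testing discipline used for application code applies equally to a data processing pipeline step. A minimal pytest-style sketch, where clean_features and its expectations are hypothetical:

```python
import pandas as pd

def clean_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline step: drop rows with a missing target
    and clip an outlier-prone numeric feature to a sane range."""
    out = df.dropna(subset=["target"]).copy()
    out["amount"] = out["amount"].clip(lower=0, upper=10_000)
    return out

def test_clean_features_drops_missing_targets():
    df = pd.DataFrame({"target": [1, None], "amount": [50, 99]})
    assert len(clean_features(df)) == 1

def test_clean_features_clips_outliers():
    df = pd.DataFrame({"target": [1], "amount": [1_000_000]})
    assert clean_features(df)["amount"].iloc[0] == 10_000
```

Tests like these run in the same automated CI pipeline as the rest of the codebase, so a broken pipeline step is caught at review time rather than in production.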
Ensuring all code within the system is modular and tested will generate significant short- and long-term benefits in terms of stability, security and observability, not to mention the time and cost savings from reducing manual tasks. It also allows the whole team to articulate and show value quickly. Once we have a more robust process in place to support the workflow, we can borrow deployment and monitoring techniques from the MLOps ecosystem.
Some critical DevOps features still apply, including the platform, deployment and monitoring aspects. DevOps software engineering standards are also utilised, along with environment setup to ensure QA. Production workstreams play a vital role in minimising errors, infrastructure maintenance remains key to success, and continuous integration and deployment of software elements (CI/CD) is essential.
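On the monitoring side, a basic model health check might look like the sketch below; the DummyModel, the [0, 1] prediction range and the latency budget are illustrative assumptions:

```python
import time
import numpy as np

class DummyModel:
    """Stand-in for a deployed classifier returning probabilities."""
    def predict(self, X):
        return np.clip(X.mean(axis=1), 0.0, 1.0)

def check_model_health(model, sample_inputs, max_latency_s=0.5):
    """Illustrative health check: the model must respond within a latency
    budget and return finite predictions in the expected [0, 1] range."""
    start = time.perf_counter()
    preds = model.predict(sample_inputs)
    latency = time.perf_counter() - start
    healthy = (
        latency <= max_latency_s
        and np.all(np.isfinite(preds))
        and np.all((preds >= 0.0) & (preds <= 1.0))
    )
    return {"latency_s": latency, "healthy": bool(healthy)}

print(check_model_health(DummyModel(), np.random.rand(100, 5)))
```

Run on a schedule against the live endpoint, a check like this feeds the model health dashboards mentioned earlier and alerts the team before stakeholders notice a degradation.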
Enhancing the approach
This initial approach can be extended to cover MLOps by including automated model deployment, metadata storage, and A/B testing with batch scoring. Application code created during development can aid experiment, parameter and metrics tracking, while monitoring can be used to enhance qualitative and quantitative feedback and to mitigate model drift. Managing model drift, and building confidence in automated model deployment, is key to resolving the fragility problem. There are many reasons why drift can be a significant problem for an AI project, not least that it can compromise the safety of individuals using AI-assisted software in dangerous environments, or put a model in breach of regulatory compliance. Model retraining can be triggered by the following scenarios (a sketch of the first follows the list):
1. Significant changes are observed in the data distribution of input or output variables.
2. Model performance drops significantly, which requires continuous monitoring.
3. New data is not continuously available but arrives in larger blocks at regular intervals, in which case retraining follows a set training frequency.
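As a minimal sketch of the first trigger, drift in a single input feature can be detected with a two-sample Kolmogorov-Smirnov test; the feature values, significance level and retraining hook below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, live, alpha=0.01):
    """Return True if the live feature distribution differs significantly
    from the training-time reference distribution (KS two-sample test)."""
    _statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Illustrative usage: a production feature drifts away from its
# training-time distribution, which should trigger retraining.
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training data
live = rng.normal(loc=0.5, scale=1.0, size=5_000)       # shifted live data

if detect_drift(reference, live):
    print("Drift detected: trigger model retraining")
```

In practice the same test would run on a schedule against every monitored feature, with the retraining pipeline invoked in place of the print statement.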
All of these failure points can drive the cost of the project above its ultimate value, at which point it becomes more cost-effective to solve the problem without AI than with it.
Applying best practices and ML thinking early
It is vitally important to apply engineering best practices, and to think about ML engineering, early in the process. Successful AI delivery at scale requires a combination of data science, data engineering, and ML engineering expertise, with the latter offering a route to a robust, scalable platform with clearly quantifiable metrics.