Model Accuracy in AI: How Accurate is Accurate Enough?

The successful use of artificial intelligence is largely dependent on the quality and performance of the model used. But when is an AI model a sound one? How can the quality be checked and optimized if necessary? This article presents core criteria and methods for the quality testing of AI models.

From chemical production to autonomous driving to medical diagnostics, artificial intelligence is everywhere. To develop their full potential, AI systems must run with as few errors as possible. In areas such as mobility and medicine, the quality of the predictions is of vital importance.


To ensure high quality, we have developed criteria during our project work that can be used to measure the quality of AI models.


We first distinguish between the user perspective and the technical perspective. The user perspective focuses primarily on the user-friendliness, transparency, and ethics of the AI model. How long does it take for a result to be available to the end user, and are the results comprehensible, balanced and fair? Bias, such as unintentional discrimination against certain users, must be avoided. This topic will be addressed separately in the course of this series.

The following criteria are essential for the assessment of the technical perspective:


  • Performance measures the correctness and accuracy of the model results.
  • Expediency tests effectiveness and goal achievement. In traffic control, for example, congestion times could serve as a reference.
  • Stability means that the models work robustly and consistently. In traffic control, for example, the prediction of congestion is expected to keep working even when traffic flow varies slightly or a traffic light is out.


These criteria are determined in different ways depending on the learning method used. In the following, we provide an overview of test methods and improvement options for individual models.

1. Supervised Learning

Performance Testing

In supervised learning, BCG Platinion recommends a range of performance metrics, depending on the model objective. Accuracy measures the proportion of correctly classified data points across the entire data set. Take an algorithm that uses transaction data to detect fraudulent credit card transactions: accuracy here is the proportion of all transactions that were classified correctly, whether as fraudulent or as legitimate.

Metrics such as precision or sensitivity (recall) are used when the data is imbalanced. In the fraud example, the data set will typically contain far fewer fraudulent transactions than legitimate ones, so a model that never flags fraud would still achieve high accuracy. Precision indicates how many of the transactions classified as potentially fraudulent are actually due to illegal activity. Sensitivity, on the other hand, indicates how many of the actual fraud cases the model correctly identified.
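
As a minimal sketch of how accuracy, precision, and sensitivity can diverge on imbalanced data, the following computes all three from scratch. The labels are hypothetical (10 frauds among 1,000 transactions), and the function name classification_metrics is an illustrative choice, not part of any library.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is flagged or nothing is positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 1 = fraudulent, 0 = legitimate (10 frauds in 1,000).
y_true = [1] * 10 + [0] * 990

# A model that never flags fraud.
never_flags = [0] * 1000

# A model that finds 8 of the 10 frauds and raises 4 false alarms.
y_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 986

baseline = classification_metrics(y_true, never_flags)
model = classification_metrics(y_true, y_pred)
print(baseline)
print(model)
```

In this made-up example, the model that never flags fraud reaches 99% accuracy but zero recall, while the second model reaches 99.4% accuracy, a precision of about 67%, and a recall of 80%.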


Further evaluation methods allow a comprehensive evaluation of the model quality on the basis of technical performance criteria. Examples include the F1 score, the ROC curve and its AUC, precision-recall AUC, the Gini coefficient, the confusion matrix, and the Kolmogorov-Smirnov statistic for classification problems, and R² and MSE for regression problems.
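
Three of these metrics are simple enough to sketch directly. The functions below use the standard textbook definitions of F1, MSE, and R²; the function names and the sample values are illustrative assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical regression targets and predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
f1 = f1_score(0.5, 1.0)
print(mse, r2, f1)
```

For these sample values the MSE is 0.125 and R² is 0.975; a precision of 0.5 combined with a recall of 1.0 gives an F1 score of about 0.67.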

Quality Optimization

There are several ways to eliminate deficiencies identified during performance testing and to optimize the model. One approach is to analyze and select the drivers of the model's predictions using SHAP values or the LIME method. This makes it possible to correct the set of input variables, for example if the algorithm overweights individual variables or uses unsuitable ones.
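
SHAP and LIME require dedicated libraries. As a simpler stand-in for driver analysis (explicitly not the methods named above), the sketch below uses permutation importance: shuffle one input variable at a time and measure how much accuracy drops. The toy model, the data, and all names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model predicts the correct label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row (a tuple) with the shuffled value in position j.
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(base - accuracy(model, shuffled, labels))
    return importances


# Hypothetical transactions: (amount, hour of day); label 1 = fraudulent.
rows = [(200, 3), (1500, 14), (50, 22), (3000, 9), (1200, 1), (700, 17)]
labels = [0, 1, 0, 1, 1, 0]


def toy_model(row):
    """Toy classifier that looks only at the amount."""
    return int(row[0] > 1000)


importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

Shuffling the hour column never changes this toy model's predictions, so its importance is exactly zero; the importance of the amount column depends on the shuffle. A variable with near-zero importance is a candidate for removal, and one that dominates unexpectedly deserves a closer look.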


Splitting the data into suitable subpopulations can also produce improvements, because imbalanced data sets can then be balanced with respect to the important drivers. If, for example, one age group is overrepresented in a data set and there is thus a risk of skewed results, grouping the data by age group can help to obtain statements that are nevertheless representative of the population as a whole.
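
One way to make this concrete is to weight each group so that its share in the sample matches its share in the population. The sketch below assumes the population shares are known; the group names, the shares, and the values are made up for illustration.

```python
def group_weights(sample_groups, population_shares):
    """Weight per group = population share / share observed in the sample."""
    n = len(sample_groups)
    counts = {}
    for g in sample_groups:
        counts[g] = counts.get(g, 0) + 1
    return {g: population_shares[g] / (counts[g] / n) for g in counts}


def weighted_mean(values, groups, weights):
    """Mean of values where each record counts with its group's weight."""
    total_weight = sum(weights[g] for g in groups)
    return sum(v * weights[g] for v, g in zip(values, groups)) / total_weight


# Hypothetical sample: the "young" group is overrepresented (80% vs. 50%).
groups = ["young"] * 8 + ["old"] * 2
values = [1.0] * 8 + [3.0] * 2
population_shares = {"young": 0.5, "old": 0.5}

weights = group_weights(groups, population_shares)
raw_mean = sum(values) / len(values)
balanced_mean = weighted_mean(values, groups, weights)
print(raw_mean, balanced_mean)
```

The unweighted mean of 1.4 is pulled toward the overrepresented group, while the weighted mean of 2.0 reflects the assumed 50/50 population.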

2. Unsupervised Learning

Performance Testing

Since no target values are known in advance in unsupervised learning, other evaluation techniques are necessary. These can be divided into two types:


  • Evaluation of geometric properties (silhouette values): This analysis determines how coherently objects are segmented, based on the distances between them in terms of their drivers. A silhouette value compares how close an object is to the other members of its own segment with how close it is to the nearest other segment. Example: An algorithm is to identify suspected cases of money laundering, and one driver is “frequent cash transactions”. If the segmentation is good, the suspected money laundering cases resemble one another closely on this driver and differ clearly from law-abiding bank customers.
  • Sample validation of model results: Here, experts assess the results manually. This method should not be underestimated, because it can also reveal biases in the model.
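
The silhouette value can be sketched in a few lines using its standard definition, s = (b - a) / max(a, b), where a is the mean distance to the other members of the same cluster and b is the mean distance to the nearest other cluster. The data (cash transactions per month for four customers) and the cluster labels are hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value per point; 0.0 where the value is not defined."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, per cluster.
        dists = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        others = [sum(d) / len(d) for lab, d in dists.items() if lab != own]
        if own not in dists or not others:
            # Singleton cluster, or no second cluster to compare against.
            values.append(0.0)
            continue
        a = sum(dists[own]) / len(dists[own])
        b = min(others)
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values


# Hypothetical driver: cash transactions per month.
points = [(1.0,), (2.0,), (10.0,), (11.0,)]

good = silhouette_values(points, [0, 0, 1, 1])  # similar customers together
bad = silhouette_values(points, [0, 1, 0, 1])   # similar customers split up

good_mean = sum(good) / len(good)
bad_mean = sum(bad) / len(bad)
print(good_mean, bad_mean)
```

Values close to 1 indicate compact, well-separated segments; the deliberately poor assignment yields negative values because each customer is closer to the other segment than to its own.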

Quality Optimization

Inaccuracies can be reduced by tuning the model's hyperparameters. For example, increasing the number of clusters, and thus segmenting more finely, can lead to improved results. This is done, for example, in election forecasts: predictions become more accurate when the population is segmented into smaller but more homogeneous clusters.

Feature engineering, the process of refining raw data into more informative input variables, is another option. For example, information about changes in income over the past few months provides a fraud detection algorithm with better information than just the “monthly income” metric.
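
A minimal sketch of the income example: two customers with the same latest monthly income look identical on the raw metric but very different once change-based features are derived. The feature names and the figures are hypothetical.

```python
def income_change_features(monthly_income):
    """Derive change-based features from a chronological list of incomes."""
    changes = [b - a for a, b in zip(monthly_income, monthly_income[1:])]
    first, last = monthly_income[0], monthly_income[-1]
    return {
        "latest_income": last,
        # Largest month-over-month move, keeping its sign.
        "max_monthly_change": max(changes, key=abs) if changes else 0,
        # Growth over the whole window relative to the first month.
        "relative_change": (last - first) / first if first else 0.0,
    }


# Hypothetical customers with the same latest income.
steady = income_change_features([9000, 9000, 9000, 9000])
jumpy = income_change_features([3000, 3000, 3100, 9000])
print(steady)
print(jumpy)
```

Both customers report 9,000 in the latest month, but only the second shows a jump of 5,900 within one month and a relative change of 2.0, which is the kind of signal the raw metric hides.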

3. Reinforcement Learning

Performance Testing

With reinforcement learning, the quality of the model is tested in practice, i.e., the predictions are compared with the events that actually occurred. In addition, A/B testing can be useful, in which different model versions are tried out on randomized subgroups and the results are then compared. Overall, model evaluation centers on the reward function, in particular on the discount rate applied to future rewards and the average returns achieved.
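
The two quantities mentioned last can be sketched as follows, using the common definition of the discounted return G = r_0 + γ·r_1 + γ²·r_2 + … with a discount factor γ between 0 and 1. The reward sequences are made up.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per time step."""
    g = 0.0
    # Work backwards so that each step is G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def average_return(episodes, gamma):
    """Mean discounted return across several episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)


# Hypothetical reward sequences from two episodes.
early_rewards = [1, 1, 1]
late_reward = [0, 0, 4]

g_early = discounted_return(early_rewards, gamma=0.5)
g_late = discounted_return(late_reward, gamma=0.5)
avg = average_return([early_rewards, late_reward], gamma=0.5)
print(g_early, g_late, avg)
```

With γ = 0.5 the early rewards are worth 1.75 and the late reward only 1.0, although the undiscounted totals are 3 and 4. The choice of discount factor therefore determines which behavior the evaluation favors.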

Quality Optimization

The reward function can be modified (reward shaping) so that the algorithm develops a better strategy for selecting the next action. In the traffic control example, the algorithm could be rewarded for short idle times and high average speeds. Adjusting the weighting of these factors can steadily improve traffic flow.
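
A minimal sketch of such a weighted reward, assuming a simple linear form. The function name, the units, the weights, and the two outcomes are all hypothetical; a real system would need to calibrate them.

```python
def traffic_reward(idle_seconds, avg_speed_kmh, w_idle=1.0, w_speed=1.0):
    """Reward high average speed and penalize idle time (linear form assumed)."""
    return w_speed * avg_speed_kmh - w_idle * idle_seconds


# Two hypothetical outcomes of a signal-timing decision.
fast_but_idle = {"idle_seconds": 30, "avg_speed_kmh": 40}
slow_but_moving = {"idle_seconds": 10, "avg_speed_kmh": 30}

# Equal weights: the outcome with less idle time earns the higher reward.
equal_a = traffic_reward(**fast_but_idle)
equal_b = traffic_reward(**slow_but_moving)

# Idle time weighted less: the faster outcome now earns the higher reward.
shaped_a = traffic_reward(**fast_but_idle, w_idle=0.2)
shaped_b = traffic_reward(**slow_but_moving, w_idle=0.2)
print(equal_a, equal_b, shaped_a, shaped_b)
```

Changing a single weight flips which outcome the algorithm is steered toward, which is why the weighting needs to be revisited as part of quality optimization.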


The second optimization option is to increase the latitude available to the model for its decisions. In traffic control, for example, maximum idle times can be increased, or additional modes of transport and their control mechanisms, such as tram signals, can be included in the control system.


Artificial intelligence is not static; its great potential lies in the continuous optimization of the system. Accordingly, the evaluation of model quality is not a one-time event, but a continuous process. The purpose of evaluation is to determine whether the model has learned correctly during independent learning and whether and where the system starts to trip up. Regular quality testing also reveals whether the data the AI model encounters in practice has changed so much over time that the quality of the predictions degrades. Regular evaluations not only ensure high prediction quality, they can even improve it.
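
One possible way to check for such changes in the data, offered here as an assumption rather than the authors' method, is to compare the distribution of an input variable at training time with its current distribution using the two-sample Kolmogorov-Smirnov statistic. The sample values and the alert threshold are hypothetical.

```python
import bisect


def ks_statistic(reference, current):
    """Largest gap between the empirical distribution functions of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The gap between two step functions is largest at one of the data points.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


# Hypothetical values of one input variable.
training_data = [1, 2, 3, 4, 5]
slightly_shifted = [3, 4, 5, 6, 7]
fully_shifted = [6, 7, 8, 9, 10]

DRIFT_THRESHOLD = 0.3  # hypothetical alert level, to be tuned per use case

unchanged = ks_statistic(training_data, training_data)
partial = ks_statistic(training_data, slightly_shifted)
complete = ks_statistic(training_data, fully_shifted)
print(unchanged, partial, complete, partial > DRIFT_THRESHOLD)
```

A statistic of 0 means the two samples are distributed identically, and 1 means they do not overlap at all. Run at regular intervals, such a check can flag when re-evaluation or retraining is due.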


    About the Authors

    Jakob Gliwa

    Associate Director
    Berlin, Germany

    Jakob is an experienced IT, Artificial Intelligence and insurance expert. He has led several data-driven transformations focused on IT modernization, organization, and process automation. Jakob leads BCG Platinion’s Smart Automation chapter and is a member of the insure practice leadership group.

    Dr. Kevin Ortbach

    BCG Project Leader
    Cologne, Germany

    Kevin is an expert in large-scale digital transformations. He has a strong track record in successfully managing complex IT programs, including next-generation IT strategy & architecture definitions, global ERP transformations, IT carve-outs and PMIs, as well as AI at scale initiatives. During his time with BCG Platinion, he was an integral part of the Consumer Goods leadership group and led the Advanced Analytics working group within the Architecture Chapter.

    Björn Burchert

    Principal IT Architect
    Hamburg, Germany

    Björn is an expert in data analytics and modern IT architecture. With a background as a data scientist, he supports clients across industries in building data platforms and starting their ML journey.

    Oliver Schwager

    Managing Director
    Munich, Germany

    Oliver is a Managing Director at BCG Platinion. He supports clients all around the globe when senior advice on ramping up and managing complex digital transformation initiatives is the key to success. With his extensive experience, he supports clients in their critical IT initiatives, ranging from designing and migrating to next-generation architectures to transforming IT organizations so that they are ready to manage IT programs at scale in agile ways of working. He is part of the Industrial Goods leadership group with a passion for automotive & aviation.