AI Series (2): Model Accuracy in AI: How Accurate is Accurate Enough?
From chemical production through autonomous driving to medical diagnostics, artificial intelligence is everywhere. To develop their full potential, these systems must run as error-free as possible, and in areas such as mobility and medicine the quality of predictions is of vital importance. To ensure this quality, we have developed criteria in our project work that can be used to measure the quality of AI models.
We first distinguish between the user perspective and the technical perspective. The user perspective focuses primarily on user-friendliness, transparency, and ethics: how long does it take for a result to reach the end user, and are the results comprehensible, balanced, and fair? Bias, such as unintentional discrimination against certain users, must be avoided. This topic will be addressed separately later in this series.
The following criteria are essential for the assessment of the technical perspective:
- Performance measures the correctness and accuracy of the model results.
- Expediency tests effectiveness and goal achievement. In traffic control, for example, congestion times could serve as a reference.
- Stability means that the models work robustly and consistently. In traffic, for example, the prediction of congestion is expected to keep working even when traffic flow varies slightly or a traffic light fails.
These criteria are determined in different ways depending on the learning method used. In the following, we provide an overview of test methods and improvement options for individual models.
1. Supervised Learning
Performance Testing
In supervised learning, BCG Platinion recommends a range of performance metrics, depending on the model objective. Accuracy measures the proportion of correctly classified data points across the entire data set – for example, in an algorithm that uses transaction data to detect fraudulent credit card transactions, it would be the proportion of transactions correctly classified as fraudulent or legitimate.
Metrics such as precision or sensitivity (recall) are used when the data is imbalanced. In this example, the data set would typically contain far more legitimate transactions than fraudulent ones. Precision then indicates how many of the transactions classified as potentially fraudulent are actually due to illegal activity. Sensitivity, on the other hand, indicates how many of the actual fraud cases the model correctly identified.
Further methods allow a comprehensive evaluation of model quality on the basis of technical performance criteria. Examples include the F1 score, ROC AUC, precision-recall AUC, the Gini coefficient, the confusion matrix, and the Kolmogorov-Smirnov statistic for classification problems, and R² and MSE for regression problems.
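The following sketch shows how several of these metrics can be computed with scikit-learn; the labels and scores are hypothetical stand-ins for the output of a fraud detector (1 = fraudulent, 0 = legitimate).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and model scores.
y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
y_pred  = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]
y_score = [0.1, 0.2, 0.15, 0.7, 0.05, 0.3, 0.9, 0.4, 0.25, 0.8]

print("accuracy :", accuracy_score(y_true, y_pred))    # share of all correct calls
print("precision:", precision_score(y_true, y_pred))   # flagged cases that are real fraud
print("recall   :", recall_score(y_true, y_pred))      # real fraud cases actually caught
print("F1       :", f1_score(y_true, y_pred))          # harmonic mean of the two
print("ROC AUC  :", roc_auc_score(y_true, y_score))    # ranking quality across thresholds
print(confusion_matrix(y_true, y_pred))                # 2x2 table of hits and misses
```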
Quality Optimization
There are several ways to eliminate deficiencies identified during performance testing and to optimize the model. One approach is to analyze and select the drivers of the model's predictions using so-called SHAP values or the LIME method. This allows the set of variables to be corrected, for example when the algorithm overweights individual variables or relies on unsuitable ones.
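As a minimal sketch of such a driver analysis, the snippet below uses the shap library on synthetic stand-in data (the real input would be the transaction features); the gradient-boosted classifier is assumed purely for illustration.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in data; in practice these would be the transaction features.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)    # explainer for tree-based models
shap_values = explainer.shap_values(X)   # one contribution per driver and prediction

# Mean absolute contribution per driver: a variable with outsized or
# implausible weight is a candidate for correction or removal.
print(np.abs(shap_values).mean(axis=0))
```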
Filtering the data into suitable subpopulations can also produce improvements, since imbalanced data sets can then be balanced with respect to the important drivers. If, for example, one age group is particularly strongly represented in a data set and therefore risks distorting the results, grouping the data by age can help to obtain statements that are nevertheless representative of the population as a whole.
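A minimal sketch of this kind of per-group check, assuming a hypothetical DataFrame with an age_group column alongside true labels and predictions:

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical labels and predictions, with one over-represented age group.
df = pd.DataFrame({
    "age_group": ["18-30"] * 6 + ["31-50"] * 3 + ["51+"] * 3,
    "y_true":    [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
    "y_pred":    [1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Recall computed per age group reveals whether a dominant group
# masks poor performance on the others.
per_group = df.groupby("age_group")[["y_true", "y_pred"]].apply(
    lambda g: recall_score(g["y_true"], g["y_pred"])
)
print(per_group)
```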
2. Unsupervised Learning
Performance Testing
Since no target labels are known in advance in unsupervised learning, other evaluation techniques are necessary. These fall into two types:
- Evaluation of geometric properties (silhouette values): Such an analysis determines how coherently objects are segmented based on the distances between their drivers; the silhouette value measures how similar each object is to its own segment compared with neighboring segments (a sketch follows after this list). Example: an algorithm is meant to identify suspected cases of money laundering. For money laundering cases, the driver “frequent cash transactions” takes very similar values; for lawfully acting bank customers, it does not.
- Sample validation of model results: Here, experts assess the results manually. This method should not be underestimated, because it can simultaneously uncover existing model biases.
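A minimal sketch of the silhouette evaluation mentioned above, using scikit-learn on synthetic data that stands in for the real transaction features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data with three natural segments.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

# Ranges from -1 to 1; higher means objects sit firmly inside their segment.
print("silhouette:", silhouette_score(X, labels))
```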
Quality Optimization
Inaccuracies can be reduced by tuning so-called hyperparameters. For instance, increasing the number of clusters, and thus working with smaller segments, can lead to better results. Election forecasts use this effect: predictions become more accurate when the population is segmented into smaller but more consistent clusters.
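The sketch below illustrates such a hyperparameter sweep: the number of clusters is varied and the segmentation with the best silhouette score is kept (synthetic data again stands in for real features).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data with five underlying segments.
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

# Sweep the hyperparameter and keep the cluster count that scores best.
scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X))
    for k in range(2, 9)
}
best_k = max(scores, key=scores.get)
print("best number of clusters:", best_k)
```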
Feature engineering is the process of refining raw data into more informative input variables. For example, information about changes in income over the past few months gives a fraud detection algorithm a better signal than the raw “monthly income” figure alone.
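A minimal sketch of that income example with pandas, on hypothetical monthly figures:

```python
import pandas as pd

# Hypothetical raw data: monthly income per customer.
df = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b", "b"],
    "month":    [1, 2, 3, 1, 2, 3],
    "income":   [3000, 3000, 9000, 2800, 2850, 2900],
})

# Month-over-month change per customer: a sudden jump is a far more
# telling signal for fraud detection than the raw income level.
df["income_change"] = df.groupby("customer")["income"].diff()
print(df)
```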
3. Reinforcement Learning
Performance Testing
With reinforcement learning, the quality of the model is tested in practice, i.e., the predictions are compared with the events that actually occurred. In addition, A/B testing can be useful: different model versions are tried out on randomized subgroups and the results are then compared. Overall, model evaluation focuses strongly on the definition of the reward function, that is, on the so-called discount rates and the average returns achieved.
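As a minimal sketch of the quantities involved, the discounted return of a single episode for a given discount rate (all reward values here are hypothetical):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over one episode: the basic evaluation quantity."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical per-step rewards, e.g. from a traffic-flow controller.
episode_rewards = [1.0, 0.5, 0.0, 2.0]
print(discounted_return(episode_rewards))  # average this over episodes to compare model versions
```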
Quality Optimization
The reward function can be modified (reward shaping) so that the algorithm develops a better strategy for selecting its next action. In the traffic control example, short idle times and high average speeds could reward the algorithm; adjusting the weighting of these factors can steadily improve traffic flow.
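A minimal sketch of such a shaped reward for the traffic example; the weights are hypothetical tuning knobs, not values from a real controller:

```python
def traffic_reward(idle_time_s, avg_speed_kmh, w_idle=-0.5, w_speed=0.1):
    """Penalize idle time, reward average speed; shaping = adjusting the weights."""
    return w_idle * idle_time_s + w_speed * avg_speed_kmh

print(traffic_reward(idle_time_s=30, avg_speed_kmh=40))   # -11.0 with the initial weights
print(traffic_reward(idle_time_s=30, avg_speed_kmh=40,
                     w_idle=-0.2, w_speed=0.2))           # 2.0 after re-weighting
```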
The second optimization option is to increase the latitude available to the model for its decisions. In traffic control, for example, maximum idle times can be raised, or additional modes of transport and their control mechanisms – tram signals, for example – can be included in the control system.
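One way to picture this widened latitude is an extended action space; the action names below are purely illustrative:

```python
from enum import Enum

class Action(Enum):
    CAR_GREEN = 0
    CAR_RED = 1
    TRAM_GREEN = 2   # newly added: tram signals join the control system
    TRAM_RED = 3     # the agent can now coordinate cars and trams jointly

print([a.name for a in Action])
```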
Conclusion
Artificial intelligence is not static; its great potential lies in the continuous optimization of the system. Accordingly, evaluating model quality is not a one-time event but a continuous process. The purpose of evaluation is to determine whether the model has learned correctly on its own, and whether and where the system starts to trip up. Regular quality testing also ensures that the data the AI model encounters in reality does not drift so far over time that the precision of its predictions degrades. Regular evaluations not only safeguard high prediction quality, they can even improve it.