This blog post demonstrates how to enhance the machine learning model monitoring process for data drift using the Data Embassy SDK’s synthetic data generation capabilities. Our aim is to produce synthetic data that is representative of the original training data and contains enough variety to trigger drift on demand. This use case comes in handy for testing model monitoring alarms in advance and simulating data drift scenarios that have yet to occur in production.
Once deployed into a live production environment, machine learning models are subject to various forms of degradation, most frequently manifesting as declining model quality, data drift, target drift, and other data quality issues. These model and data changes usually occur because the assumptions made during training become outdated: the real world is complex, and the patterns identified when the data was collected may no longer hold. Our customers' behaviors change, and machine learning models should therefore be retrained periodically to learn from these emerging trends.
The decision to retrain a deployed model is largely a subjective one and depends on a few conditions:
1. Firstly, model retraining should be triggered when a performance metric (such as accuracy, the proportion of correct predictions in a classification context) deteriorates significantly relative to a benchmark or threshold. This is referred to as the model quality trigger and typically involves, for example, error analysis of the model.
2. The second scenario that would force a retraining event is when the distribution of the predictions made by the model differs significantly from the predictions observed during the model training phase. This phenomenon is known as target drift, and monitoring it aims to understand how model predictions and target behavior change over time. Target drift detection uses common statistical tests such as the Z-test, Chi-squared, Kolmogorov-Smirnov, and Jensen-Shannon to compare the target distributions at different points in time.
3. The third and final common trigger for model retraining is data drift, which entails a detected change in the distribution of the feature space. More specifically, data drift detection seeks to understand whether significant differences exist between the data used to train the model and the data served to the model in production. This approach typically involves running statistical tests to determine where the drift occurred and producing visualizations to explore the nature of the drift in detail. Methods for detecting the drift include KL divergence, Jensen-Shannon, and Wasserstein distance.
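To make these detection methods concrete, here is a minimal sketch (separate from the tooling used later in this post) that compares a reference sample against a non-drifted and a drifted sample using SciPy's K-S test and Wasserstein distance. The data is simulated purely for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(42)

# Simulated feature: reference (training) vs two "production" samples
reference = rng.normal(loc=0.0, scale=1.0, size=2_000)
no_drift = rng.normal(loc=0.0, scale=1.0, size=2_000)
drifted = rng.normal(loc=1.5, scale=1.0, size=2_000)  # mean has shifted

# Kolmogorov-Smirnov: a small p-value suggests the distributions differ
_, p_same = ks_2samp(reference, no_drift)
_, p_drift = ks_2samp(reference, drifted)
print(f"K-S p-value, no drift: {p_same:.3f}; drift: {p_drift:.3g}")

# Wasserstein distance: larger values indicate a bigger distribution shift
print(f"Wasserstein, no drift: {wasserstein_distance(reference, no_drift):.3f}")
print(f"Wasserstein, drift:    {wasserstein_distance(reference, drifted):.3f}")
```

In practice, a monitoring tool runs such tests per feature and aggregates the results into a dataset-level drift decision.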
Evidently AI is a popular open-source tool for monitoring machine learning models in production. Their monitoring capabilities are extensive and cover all the previously mentioned retraining triggers. To showcase the data drift functionality of Evidently AI, we ran a data drift report on a credit card fraud detection data set taken from Kaggle. This data set contains over 284k records, of which 0.172% are fraudulent.
In a real-world context, organizations are interested in understanding the evolving nature of each class within a classification model. This fraud data is separated into two distinct classes, fraudulent and non-fraudulent. As is normally the case with fraud detection problems, the minority fraudulent class is severely outweighed by the majority non-fraudulent class. One approach for correcting this imbalance is to generate synthetic data, which we wrote about previously. More specifically, Data Embassy’s synthetic data generation capabilities were used to up-sample the minority fraudulent class.
For the model monitoring use case, we then took the 45 thousand up-sampled fraudulent records and used Evidently AI to assess whether data drift is present when comparing the original minority-class data (only 443 records) to the generated synthetic data (45k records). No drift was expected, as the synthetic data generation process produces data that is representative of the original data. This is confirmed by the dataset drift report depicted in Figure 1. Evidently applies a drift score threshold of 0.5 by default, although a different threshold can be set depending on the test.
The data set drift report shows three features, where the reference (original) and current (synthetic) data distributions are analyzed for the presence of drift through the Kolmogorov-Smirnov (K-S) test. Only three features are included, to reduce the complexity of the experiment for presentation purposes.
Visual inspection of each pair of distributions shows that the distributions are very similar, which is why the test indicated that no drift was detected. However, the number of records in each distribution does differ significantly. Therefore, we will use the more populous synthetic data set to simulate different drift scenarios. This is because the synthetic data set contains a fuller distribution, with more samples and, consequently, more records that deviate from the norm, contributing to richer and more robust simulated scenarios.
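As a rough illustration of this no-drift outcome, the sketch below compares a small sample (sized like the 443 original fraudulent records) against a large sample (sized like the 45k synthetic records) drawn from the same hypothetical distribution. The K-S test finds no evidence of drift despite the size imbalance:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(11)

# Stand-ins: a small original minority class and a large synthetic
# sample drawn from the same hypothetical distribution
original = rng.normal(0.0, 1.0, 443)
synthetic = rng.normal(0.0, 1.0, 45_000)

stat, p_value = ks_2samp(original, synthetic)
# A small statistic (large p-value) means no evidence of drift
print(f"K-S statistic: {stat:.3f}, p-value: {p_value:.3f}")
```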
To simulate various model drift scenarios, Gaussian (white) noise is injected into the synthetically generated data (Figure 2). This is done for the minority (fraudulent) class only, to specifically simulate evolving fraudulent behaviors. Adding white noise is a simple example of transforming the synthetic data with the intention of producing a distribution that differs significantly from a reference distribution, in this case the original synthetic data. Many other transformations beyond white noise could be applied to alter the original synthetic data.
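A minimal sketch of this noise-injection step, assuming the synthetic fraudulent records sit in a pandas DataFrame. The column names, sizes, and noise scale are illustrative, not the exact values used in our experiment:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for the up-sampled synthetic fraudulent records
synthetic = pd.DataFrame({
    "V1": rng.normal(0.0, 1.0, 45_000),
    "V2": rng.normal(0.0, 2.0, 45_000),
})

# Inject Gaussian (white) noise, scaled to each feature's standard
# deviation so the perturbation is proportionate per feature
noise_scale = 0.5
noise = rng.normal(0.0, noise_scale, synthetic.shape) * synthetic.std().values
noisy = synthetic + noise

print(noisy["V1"].std())  # larger than the original std once noise is added
```

Scaling the noise by each feature's standard deviation keeps the perturbation comparable across features with very different ranges.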
For cases where outliers are specifically desired, as would be the case in testing outlier detection systems, a distribution with a much longer tail can be simulated.
Figure 3 illustrates a lognormal distribution generated as an alternative noise element to inject into the original and synthetic data. The lognormal distribution is heavy-tailed, meaning it has a sizable right tail, as indicated by Figure 4. This heavy right tail is particularly interesting in instances where outliers are desired for testing purposes, as it simulates data points that deviate significantly from the norm.
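The lognormal noise element can be generated directly with NumPy; the parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Lognormal noise: the exponential of a normal variable, so it is
# strictly positive and heavy-tailed to the right
lognormal_noise = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)

# The mean sits well above the median, the hallmark of a heavy right tail
print(f"median: {np.median(lognormal_noise):.2f}")
print(f"mean:   {lognormal_noise.mean():.2f}")
```

Increasing `sigma` stretches the right tail further, producing more extreme outliers in the injected noise.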
Once applied to the synthetic data, a second pass of model monitoring through Evidently AI was made. The results of the drift detection are depicted in Figure 5, where the synthetic data is compared with the synthetic data that includes the added noise.
In this case, drift is detected for all three features, where the normed Wasserstein distance metric is used to calculate the drift score. Evidently uses this metric for numerical features with more than one thousand observations and more than five unique values. In contrast, the previous run produced a K-S score because the original data contained fewer than one thousand records.
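The normed Wasserstein score can be approximated by dividing the Wasserstein distance by the reference standard deviation, which makes the score comparable across features on different scales. The 0.1 threshold below is, to our understanding, the library's default for this statistic, and the data is simulated for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 5_000)            # synthetic data
current = reference + rng.normal(0.0, 1.0, 5_000)  # synthetic data plus noise

# Normed Wasserstein: divide by the reference standard deviation so the
# score is scale-free
score = wasserstein_distance(reference, current) / reference.std()
print(f"drift score: {score:.3f}, drift detected: {score > 0.1}")
```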
Analyzing the reference (synthetic) and current (synthetic with noise) distributions in Figure 5 shows that the distributions are less similar than those in Figure 1. This is confirmed by Figure 6, where the synthetic reference features have a more irregular distribution that matches the original fraudulent samples. In contrast, the current synthetic data that includes noise resembles a normal distribution more closely. This is to be expected, as we intentionally added a Gaussian noise element to the synthetic data.
The purpose of the drift score is to detect whether there is a significant deviation between the two distributions. Many possible drift simulation scenarios can be created, including shifting the mean of the synthetic data or adding more variation in the form of outlier generation.
The results are promising as they indicate that synthetic data can be used to effectively simulate different data drift scenarios. In this case, by synthesizing the minority class, we were able to simulate examples where the fraudulent class exhibits a shift towards a distribution with a shape that resembles that of a normal distribution. The long tail of the noisy distribution still retains its structure, and therefore the result is not a true normal distribution.
The benefits of synthesizing the original data prior to the noise injection process are perhaps best illustrated in Figures 7 and 8. Figure 7 shows the distribution properties for the original data in green and plots the data points for the original data after the addition of the noise element in red. The red points clearly illustrate how sparse the distribution of the original noisy data is, with a few isolated outliers occurring at either the top or bottom of the distribution.
Figure 8 provides us with a plot of the synthetic distribution in green, with the noisy synthetic data overlaid as red points. The noisy synthetic data provides a much fuller distribution, with a better representation of data towards the extreme ends of the tails. This allows for an improvement in the quality of the simulations, providing a smoother and more consistent testing experience.
To further test the benefit of using synthetic data to simulate drift scenarios, the lognormal noise generated in Figure 3 is then injected into both the original and synthetic data sets. This was done to simulate cases where outliers occur in the right tail of the feature’s distribution. Figure 9 shows the distribution of the original reference data (dark gray) in comparison to the original “current” data that includes the lognormal noise (red). The red distribution clearly contains more data points in the extreme of its right tail when compared to the original data.
The same lognormal noise was also injected into the synthetic data, with the results of this process expressed in Figure 10. Here the current, noisy (red) distribution contains far more extreme values in the right tail than the distribution in Figure 9. This is because the synthetic data provides a fuller distribution upon which the injected noise can be anchored. By comparison, the original data in Figure 9 is too sparse to sustain the desired variance injected via the lognormal distribution. This is an undesirable consequence for the original data, because the intention of the injected noise is to simulate drift in scenarios with extreme outliers and data variability.
The results are further examined by means of the data drift detection reports in Figures 11 and 12. Drift was detected in both the original data plus noise case (Figure 11) as well as after injecting noise into the synthetic data (Figure 12). The statistical Jensen-Shannon test, which is a popular metric for measuring the similarity between distributions, detected a larger distance between the distributions when using the synthetic data (Figure 12).
This is an important finding, as it illustrates that using synthetic data to simulate model drift scenarios is indeed a more robust approach than injecting noise into the original data only.
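For reference, a Jensen-Shannon distance between two continuous samples can be sketched by binning both on a shared grid and comparing the histograms. The data below is simulated, and the exact binning a monitoring tool uses may differ:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, 20_000)
current = reference + rng.lognormal(0.0, 1.0, 20_000)  # heavy right-tail noise

# Bin both samples on a shared grid; scipy normalizes the histogram
# counts to probability vectors internally
bins = np.histogram_bin_edges(np.concatenate([reference, current]), bins=50)
p, _ = np.histogram(reference, bins=bins)
q, _ = np.histogram(current, bins=bins)

# base=2 bounds the distance between 0 (identical) and 1 (disjoint)
js = jensenshannon(p, q, base=2)
print(f"Jensen-Shannon distance: {js:.3f}")
```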
During this experiment, we injected a significant amount of noise that had an outsized impact on the right tail of the distribution. This noisy data can be considered outliers at the extreme end of the right tail, simulating an example where a feature contains more outliers than the model was trained on. The test detected drift in both scenarios, although it signaled drift with greater confidence for the synthetic data, which is the purpose of the exercise.
The key insight stems from the sparse, original data being unsuited to act as a reference point for the intelligent inclusion of noise into the distribution. This is no doubt more prevalent in underpopulated samples/classes, which naturally benefit from synthetic data generation for upsampling in any case.
It is especially true for high-dimensional data, where the relationships between features are taken into consideration from a synthetic data generation perspective. An outlier cannot simply be added to a single feature without considering what corresponding values need to also be added to the rest of the features in the data set. These relationships are already preserved via the synthetic data generation process. Therefore, if noise is applied to the synthetic data consistently, then these pairwise relationships will remain intact.
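A toy illustration of this point: if two features are correlated, applying a consistent (shared) noise component to both keeps their pairwise relationship largely intact, whereas independent per-feature noise dilutes it. The coefficients below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two correlated features, as produced by a synthetic data generator
n = 10_000
x = rng.normal(0.0, 1.0, n)
y = 0.8 * x + rng.normal(0.0, 0.6, n)  # corr(x, y) is roughly 0.8

# Apply the SAME noise component to both features, scaled by the
# features' relationship, so the correlation survives
shift = rng.normal(0.0, 0.3, n)
x_noisy, y_noisy = x + shift, y + 0.8 * shift

# For contrast, independent noise per feature weakens the correlation
x_ind = x + rng.normal(0.0, 1.0, n)
y_ind = y + rng.normal(0.0, 1.0, n)

corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(x_noisy, y_noisy)[0, 1]
corr_ind = np.corrcoef(x_ind, y_ind)[0, 1]
print(f"before: {corr_before:.2f}, shared noise: {corr_after:.2f}, "
      f"independent noise: {corr_ind:.2f}")
```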
In this experiment, we investigated the use of synthetic data for modeling scenarios where the features of a deployed machine learning model significantly differ from the baseline features used during model training.
This is significant, as a drift event can be used to trigger a model retraining process or alert to changes in behavior. In a real-world context, the transformations should be tailored to each feature’s possible shift scenarios. This use case could ideally be used as part of a model monitoring system and can be executed as unit tests within a model deployment pipeline. In addition to providing high-quality synthetic data, our SDK also provides features to enhance the privacy and utility of your data use cases.
Contact us and get feedback instantly.