Fraud detection is an increasingly difficult and important problem that most financial institutions and insurance companies face. One of the key challenges of detecting fraud through the use of machine learning techniques is the lack of available training data.
This is especially true for confirmed fraudulent records, as they are rare in comparison to non-fraudulent records. The large difference between the number of non-fraudulent and fraudulent records in the data set leads to a severe class imbalance. This negatively impacts the ability to train an effective machine learning model, as the training process is heavily biased towards the majority, non-fraudulent class.
Several well-known approaches to dealing with class imbalance have been proposed. One of them is to up-sample the fraudulent minority class using a technique known as SMOTE (Synthetic Minority Oversampling Technique).
SMOTE generates new minority-class samples by interpolation: for a randomly chosen fraudulent record, one of its nearest minority-class neighbors is selected at random, and a synthetic data point is placed on the line segment in feature space between the two. While extremely popular because of its simplicity, SMOTE does not capture the original data distribution or the relationships between features, which may result in poor model generalization.
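To make the interpolation step concrete, here is a minimal sketch of SMOTE's core idea on a toy NumPy matrix of minority-class samples. It is not the reference implementation, and the variable names and toy data are ours.

```python
# Minimal sketch of SMOTE's interpolation step (toy data, not the reference implementation).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_minority = rng.normal(size=(50, 4))  # toy stand-in for the fraudulent records

# Find the k nearest minority-class neighbors of each minority sample.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
_, neighbor_idx = nn.kneighbors(X_minority)  # column 0 is the sample itself

def smote_sample(i):
    """Create one synthetic point on the segment between sample i and a random neighbor."""
    j = rng.choice(neighbor_idx[i, 1:])  # pick one of the nearest minority neighbors
    lam = rng.uniform(0.0, 1.0)          # random position along the segment
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])

synthetic = np.array([smote_sample(i) for i in rng.integers(0, len(X_minority), size=100)])
print(synthetic.shape)  # (100, 4)
```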
Synthetic data generation, by contrast, up-samples the minority class while preserving both the underlying distribution and the relationships between features, with the aim of improving model generalization beyond what the SMOTE approach achieves.
In this article, we train an initial machine learning model on an open-source credit card fraud detection data set. A second model is then trained by augmenting the original training set with over-sampled records generated by SMOTE. Finally, a third model is trained using synthetic copies of the original training set generated by the Data Embassy SDK. The performance of the three models is then evaluated to determine the most effective approach for detecting fraud.
The data set that we used in this exercise is a credit card fraud detection data set taken from Kaggle. This data set contains over 284k records, of which 0.172% are fraudulent. This severe class imbalance between the non-fraudulent and fraudulent classes makes this data set a perfect candidate to test our hypothesis on using synthetic data to augment the training of machine learning models for fraud detection.
Generating synthetic data using our Data Embassy SDK is straightforward, and the entire process can be conducted within a Jupyter notebook or your preferred IDE. The credit card fraud data was first loaded into a Pandas DataFrame in order to inspect and analyze the data. The fraudulent class was then explicitly split from the non-fraudulent class, as the intention was to over-sample the fraudulent data only. There was, therefore, no need to synthetically generate additional samples of the majority, non-fraudulent class.
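A short sketch of this preparation step is shown below. It assumes the Kaggle credit card fraud CSV (creditcard.csv) with its usual "Class" label column (1 = fraud, 0 = non-fraud); the DataFrame names are ours.

```python
# Load the credit card fraud data and split off the minority class for over-sampling.
import pandas as pd

df = pd.read_csv("creditcard.csv")

# Only the fraudulent class is over-sampled, so separate it from the majority class.
fraud_df = df[df["Class"] == 1]
non_fraud_df = df[df["Class"] == 0]

print(len(df), len(fraud_df), len(non_fraud_df))
```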
Following this, the synthesization process was conducted. We generated 45k samples with the Data Embassy SDK, which aligns with the number of additional records generated by the SMOTE approach. The number of samples can be changed depending on the use case. No additional feature engineering was required, as the data was already pre-processed and scaled accordingly.
Once the synthesization process was completed, we saved the resulting data into its own Pandas DataFrame, where a direct comparison between the original and synthetic data can be made.
Figure 1 below shows the distribution of the original and synthetic data for V4, one of the most influential features in detecting fraud. Feature importance was determined using the feature importance scores of the trained XGBoost model, where importance is measured by the average gain across all splits in which each feature is used. The illustration shows that the synthetic data distribution of the minority class is very representative of the original data, although the count of the synthetic data samples (y-axis) is much larger.
The distribution of the original and synthetic data for feature V14 is also included in Figure 2. V14 is another significant contributor when inspecting the feature contribution using XGBoost. The same shape is exhibited in the two distributions, where the count of samples is again much larger for the synthetic distribution, as expected.
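A sketch of how the gain-based importances and the comparison plots in Figures 1 and 2 could be produced is shown below. It assumes `model` is a fitted XGBClassifier (such as the baseline model trained later in this post) and that `fraud_df` and `synthetic_df` are the original and synthetic minority-class DataFrames; these names are ours, not from the original workflow.

```python
# Gain-based feature importance and an original-vs-synthetic distribution overlay.
import matplotlib.pyplot as plt

# Average gain across all splits in which each feature is used.
gain = model.get_booster().get_score(importance_type="gain")
top_features = sorted(gain, key=gain.get, reverse=True)[:5]
print(top_features)  # e.g. V4 and V14 among the most influential

# Overlay the original and synthetic minority-class distributions for one feature.
feature = "V4"
plt.hist(fraud_df[feature], bins=50, alpha=0.6, label="original minority")
plt.hist(synthetic_df[feature], bins=50, alpha=0.6, label="synthetic minority")
plt.xlabel(feature)
plt.ylabel("count")
plt.legend()
plt.show()
```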
The SDK also offers reports to assess the privacy and utility of synthetically generated data. You can read more about evaluating the utility of synthetic data and the performance of ML models in this blog post, and learn more about our privacy evaluations here.
The first model trained was a baseline XGBoost implementation using the features derived from the original data set only. No specific hyperparameters were configured for this model or for any of the subsequent models in order to make fair comparisons between the approaches.
The training and test data sets were created with a 90%/10% train-test split on the original features, using stratified sampling to ensure that the class imbalance was maintained in both subsets. Of the roughly 255k records in the training set, only 443 were fraudulent; the test set contained roughly 28k records, of which 49 belonged to the minority fraudulent class.
The baseline solution performs well out of the box, achieving an AUC (Area Under the ROC Curve) score of 96.4% (Figure 3). A ROC (receiver operating characteristic) curve is a graph showing the performance of a classification model at all classification thresholds. AUC measures the area beneath the ROC curve and is a popular classification metric that summarizes the trade-off between the true positive rate and the false positive rate across thresholds. It gives a good sense of the overall performance of the model; in practice, the threshold is likely to be moved from 50% to a level that more effectively discriminates between the classes.
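A minimal end-to-end sketch of the baseline setup is shown below, assuming `df` is the DataFrame loaded earlier with its "Class" label column. Default XGBoost hyperparameters are kept so the three approaches remain comparable; the random seed is arbitrary.

```python
# Stratified 90/10 split, baseline XGBoost model, and ROC/AUC evaluation.
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier

X = df.drop(columns=["Class"])
y = df["Class"]

# Stratify so the ~0.17% fraud rate is preserved in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

model = XGBClassifier()  # defaults only, no tuning
model.fit(X_train, y_train)

# AUC is computed from predicted probabilities across all thresholds.
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
fpr, tpr, thresholds = roc_curve(y_test, probs)  # points for the ROC curve (Figure 3)
```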
The second model was trained on an augmented training set that included a number of up-sampled fraudulent records from the training set only. These records were generated via SMOTE.
We were able to increase the number of fraudulent records in the minority class to over 48k after applying SMOTE, thereby reducing the class imbalance. In addition, the majority class was reduced from 255k down to 100k via random downsampling.
Including SMOTE in the modeling process improved the AUC to 97.5% (Figure 4), where the model was able to detect additional fraudulent transactions not detected by the baseline model.
The number of false positives also increased slightly, which is expected when applying SMOTE or other over-sampling techniques. According to the authors of the original SMOTE paper (Chawla et al., 2002), optimal results were achieved when combining SMOTE up-sampling of the minority class with random down-sampling of the majority class. Therefore, in order to extract the best possible performance from the SMOTE approach, random down-sampling of the majority class was also conducted, as sketched below.
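A hedged sketch of this combined resampling step using the imbalanced-learn library is shown below, applied to the training split only. It assumes `X_train` and `y_train` from the earlier split; the target counts follow the numbers quoted above and would be tuned per use case.

```python
# SMOTE up-sampling of the minority class combined with random down-sampling of the majority class.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

resampler = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy={1: 48_000}, random_state=42)),   # ~48k fraud records
    ("under", RandomUnderSampler(sampling_strategy={0: 100_000}, random_state=42)),  # ~100k non-fraud
])

X_train_res, y_train_res = resampler.fit_resample(X_train, y_train)
print(y_train_res.value_counts())
```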
The distribution of the minority-class data after upsampling via SMOTE is illustrated in Figure 5. The purple distribution on the left contains data for the minority class from feature V4, while the gold distribution on the right represents the upsampled minority class generated by SMOTE.
By making a direct comparison with the synthetic data generated in Figure 1, we can see that the SMOTE approach does not approximate the original distribution as accurately as the SDK does.
Figure 6 provides a comparison of the original and SMOTE distributions for feature V14, generated for the fraudulent class only. Again, by making a direct comparison between Figure 2 and Figure 6, we see that the SMOTE approach yields a much wider distribution that is not fully representative of the original data from which it was generated.
The final model in the experiment used the same XGBoost implementation but incorporated the SDK for synthetic data generation. The SDK was used to up-sample the fraudulent minority class only, increasing the number of fraudulent records in the training set by 45k. Model 3 was able to correctly detect even more fraudulent transactions than the SMOTE model, raising the AUC to 98.1%. This comes at the expense of an increase in false positives, which, as with SMOTE, is expected behavior.
In practice, customers are primarily interested in increasing recall while keeping the drop in precision small. An additional 10% in recall over the baseline model is worth the drop in precision, as false negatives are significantly more expensive from a cost perspective (cost of fraud, loss of goodwill with payment providers) when compared with false positives (operational overhead, poor customer experience).
The complete results from the experiment are summarized below:
An illustration of model performance at different classification thresholds is presented in Figure 7. The model trained using the synthetic data generated by the Data Embassy SDK (gold line) achieves a higher AUC score than both the baseline model (purple line) and the model trained using SMOTE (red line).
These results are compelling: a 2% increase in AUC can have a considerable operational impact within an organization. In follow-up experiments, the potential rise in false positives should be examined more closely. A cost-based approach should be followed, where the increase in true positives is weighed against the increase in false positives from the perspective of the total cost of fraud. This can be achieved by attaching a monetary cost to each type of classification outcome, based on empirical evidence from historical instances of the different model classifications.
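An illustrative sketch of such a cost-based comparison is shown below. The per-case costs are hypothetical placeholders that would be replaced with figures estimated from historical fraud losses and operational overheads; `y_test` and `probs` are assumed to be the test labels and predicted fraud probabilities from one of the models above.

```python
# Compare classification thresholds by total expected cost rather than AUC alone.
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FN = 500.0  # assumed average loss per missed fraud (hypothetical)
COST_FP = 10.0   # assumed handling cost per false alarm (hypothetical)

def total_cost(y_true, probs, threshold):
    """Total expected cost at a given classification threshold."""
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
    return fn * COST_FN + fp * COST_FP

for t in np.linspace(0.1, 0.9, 9):
    print(f"threshold {t:.1f}: cost {total_cost(y_test, probs, t):,.0f}")
```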
Imbalanced data is challenging, particularly in fraud detection contexts, where positively labeled data is scarce. This experiment shows that synthetic data can be used to augment model training in highly imbalanced problems, with results that exceed those of traditional methods such as SMOTE. In addition to providing high-quality synthetic data, our SDK also provides features to enhance the privacy and utility of your data use cases.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.