Credit card fraud detection is a significant concern for financial institutions. Detecting fraud is challenging because fraudsters continuously devise new methods, making it difficult to identify consistent patterns. It is like scanning a grid of nearly identical icons for the single one that differs slightly: the anomaly is easy to overlook among the overwhelming majority.
Let’s outline what you’ll learn today about credit card fraud detection:
- What is data imbalance?
- Possible causes of data imbalance
- Why class imbalance is a problem in machine learning
- Quick refresher on the Random Forest Algorithm
- Different sampling methods to address data imbalance
- Comparison of methods in our context using Python
- Business insights on model selection
Due to the typically low number of fraudulent transactions, datasets often have many more non-fraud cases. Such datasets are known as ‘imbalanced.’ Detecting fraud is crucial, as a single fraudulent transaction can lead to massive losses for banks.
We’ll use the credit card fraud dataset from Kaggle. In binary classification, we have two classes:
- Majority class: Non-fraudulent transactions
- Minority class: Fraudulent transactions
In our dataset, only 0.17% of observations are fraudulent, indicating a highly imbalanced dataset.
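As a quick sketch, assuming the Kaggle CSV is saved locally as creditcard.csv and that the label lives in the dataset's Class column (1 = fraud, 0 = non-fraud), we can confirm the imbalance:

```python
import pandas as pd

# Load the Kaggle credit card fraud dataset; the label column 'Class'
# is 1 for fraudulent transactions and 0 for non-fraudulent ones.
df = pd.read_csv("creditcard.csv")

# Show the absolute counts and the percentage of each class:
# fraud should come out to roughly 0.17% of all observations.
print(df["Class"].value_counts())
print(df["Class"].value_counts(normalize=True) * 100)
```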
Causes of Data Imbalance
- Biased Sampling/Measurement Errors: Samples are collected disproportionately from one class or region, or observations are mislabeled. This can be corrected by improving the sampling or labeling process.
- Domain Characteristics: Imbalance may arise from predicting rare events, skewing the data toward the majority class.
Most machine learning algorithms optimize for overall accuracy and therefore gravitate toward the frequently occurring majority class, which is problematic for imbalanced datasets. Tree-based algorithms and anomaly detection methods tend to cope better. We'll use Random Forest, an ensemble method, here.
Random Forest Overview
Random Forest builds multiple decision trees, and the most common class prediction among them becomes the final outcome. For example, if two trees predict fraud while one predicts non-fraud, the final prediction is fraud.
More formally, a Random Forest is a collection of tree-structured classifiers in which each tree is grown from an independently sampled random vector (in practice, a bootstrap sample of the training data plus a random subset of features considered at each split), and each tree votes for the most popular class. Breiman showed that the forest's generalization error converges as the number of trees increases, so adding more trees does not cause overfitting.
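A minimal baseline sketch with scikit-learn, continuing from the loading snippet above (the split size and tree count are illustrative, not necessarily what the original analysis used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Features are every column except the label; stratifying keeps the
# 0.17% fraud rate identical in the train and test splits.
X = df.drop(columns=["Class"])
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 100 trees each cast a vote; the majority class wins.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
```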
Sampling Methods to Address Imbalance
- Random Under-sampling: Discards majority-class examples until the classes are balanced. It is fast, but potentially throws away valuable data.
- Random Over-sampling: Duplicates minority-class examples until the classes are balanced. No information is lost, but the exact duplicates can lead to overfitting.
- SMOTE (Synthetic Minority Over-sampling Technique): Instead of duplicating, it generates new synthetic minority examples by interpolating between each minority point and its K-nearest minority neighbors. (All three are sketched in code just below.)
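A sketch of all three techniques using the imbalanced-learn library, which the article does not name but is a common choice; it continues the X_train/y_train split from the earlier snippet, and resampling is applied only to the training data so the test set keeps its real-world distribution:

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Resample only the training data; the test set must keep the
# real-world class distribution for an honest evaluation.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_train, y_train)
```

Note how under-sampling shrinks the training set dramatically here: with only about 0.17% fraud, the balanced result keeps just a few hundred examples per class, which is exactly the information-loss trade-off described above.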
Metrics like precision, recall, accuracy, and the F1-score help evaluate a model's performance. Precision is the fraction of transactions flagged as fraud that truly are fraudulent; recall is the fraction of actual fraud cases the model catches; accuracy is the share of all transactions classified correctly; and the F1-score is the harmonic mean of precision and recall. On data this imbalanced, accuracy alone is misleading: a model that labels every transaction as non-fraud would be about 99.8% accurate yet catch zero fraud. The snippet below shows how to compute these metrics.
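A sketch of computing these metrics with scikit-learn, reusing y_test and the baseline predictions y_pred from above:

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Precision: of the transactions we flagged as fraud, how many were fraud?
# Recall: of the actual fraud cases, how many did we flag?
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```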
Model Training and Evaluation
We'll first train the Random Forest on the original, unsampled training data, then repeat the process after applying under-sampling, over-sampling, and SMOTE. The results are compared using confusion matrices and the metrics above; one way to organize that comparison is sketched below.
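A sketch of the comparison loop, reusing the resampled training sets from the sampling snippet; every model is evaluated on the same untouched test set:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Each strategy trains its own model on its resampled training data,
# but all models are scored against the original test distribution.
strategies = {
    "no sampling": (X_train, y_train),
    "under-sampling": (X_under, y_under),
    "over-sampling": (X_over, y_over),
    "SMOTE": (X_smote, y_smote),
}

for name, (X_tr, y_tr) in strategies.items():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_tr, y_tr)
    preds = model.predict(X_test)
    print(f"--- {name} ---")
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds, digits=3))
```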
No Sampling Interpretation
Without any resampling, the model identifies 76 fraud cases, with an overall accuracy of 97% and a recall of 75%.
Under-sampling Interpretation
Under-sampling captures 90 fraud cases, improving recall, but accuracy and precision decrease due to increased false positives.
Over-sampling Interpretation
Over-sampling achieves high precision and accuracy, with a recall of 81%, capturing more fraud cases with fewer false positives.
SMOTE Interpretation
SMOTE improves recall to 84%, catching more fraud cases despite a slight increase in false positives.
In fraud detection, recall is crucial as financial institutions prioritize identifying fraud cases due to the potential for significant losses. Depending on the institution’s risk tolerance, over-sampling or SMOTE can be used. Further model parameter tuning can enhance results.
For further details, refer to the code on GitHub.
References
- Mythili Krishnan, Madhan K. Srinivasan, "Credit Card Fraud Detection: An Exploration of Different Sampling Methods to Solve the Class Imbalance Problem" (2022), ResearchGate
- Bartosz Krawczyk, "Learning from imbalanced data: open challenges and future directions" (2016), Springer
- Nitesh V. Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique" (2002), Journal of Artificial Intelligence Research
- Leo Breiman, "Random Forests" (2001), stat.berkeley.edu
- Jeremy Jordan, "Learning from imbalanced data" (2018)
- Fraud Detection in Python