Abstract

Credit cards account for a significant share of payments worldwide. When credit card credentials are stolen, the cardholder is exposed to fraudulent charges. Machine learning algorithms can be used to detect whether a transaction is genuine or fraudulent. The data used to build the models is taken from Kaggle [1]. Of the models tried, the one based on the XGBoost algorithm performed best, with an area under the ROC curve of 0.9973 and a mean sensitivity of 0.9964.


Introduction

With the invention of electronic payment systems, credit cards have become very popular. As a result, the number of credit card transactions, both online and offline, has risen significantly. With this increase, the risk of fraud to credit card owners is also growing [4]. The goal of this analysis is to detect credit card fraud so that credit card owners are not charged for items they did not buy.


Methods

Of the 284807 transactions, only 492 are identified as fraud, so there is a significant label imbalance. In this setting, accuracy computed from the confusion matrix is not the right metric to look at: even if a model predicted all 492 fraudulent observations as genuine, it would still achieve an accuracy of about 99.83%. We therefore use the area under the Receiver Operating Characteristic (ROC) curve as the metric for evaluating model performance. The quantities this metric is built from are defined below. [1]
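The arithmetic behind the imbalance argument can be checked directly. The sketch below uses only the two counts quoted in the text; the "predict everything as genuine" baseline is a hypothetical model introduced for illustration, not one of the models built in this analysis.

```python
# Counts reported in the text.
total_transactions = 284807
fraud_transactions = 492
genuine_transactions = total_transactions - fraud_transactions

# Hypothetical baseline: label every transaction as genuine.
# Every genuine transaction is right, every fraud is missed.
baseline_accuracy = genuine_transactions / total_transactions
baseline_sensitivity = 0 / fraud_transactions  # no fraud is ever caught

print(f"fraud share:          {fraud_transactions / total_transactions:.4%}")
print(f"baseline accuracy:    {baseline_accuracy:.4%}")
print(f"baseline sensitivity: {baseline_sensitivity:.2%}")
```

The baseline scores very high on accuracy while catching no fraud at all, which is why a threshold-independent metric such as the area under the ROC curve is preferred here.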

True Positive Rate (TPR), also called sensitivity or recall, is the fraction of all actual positives that are predicted as positive. The formula is given below.

\[ \text{TPR} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]

False Positive Rate (FPR), also called fall-out, is the fraction of all actual negatives that are predicted as positive. The formula is given below.

\[ \text{FPR} = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}} \]

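The positive and negative classes are defined from the classes of the target variable. Here, fraud (Class = 1) is taken as the positive class and genuine (Class = 0) as the negative class.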
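As a minimal sketch of these two definitions, the snippet below counts the four confusion-matrix cells and derives TPR and FPR from them. The function names and the toy labels are hypothetical, chosen only for illustration, and the code assumes fraud (1) is the positive class and that both classes are present.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, TN, FP) for the given positive class."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def tpr_fpr(y_true, y_pred, positive=1):
    """True positive rate and false positive rate.

    Assumes at least one actual positive and one actual negative,
    otherwise a denominator would be zero.
    """
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn), fp / (fp + tn)


# Toy example (made-up labels): 3 frauds, 5 genuine transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
tpr, fpr = tpr_fpr(y_true, y_pred)
print("TP, FN, TN, FP =", counts)
print(f"TPR = {tpr:.4f}, FPR = {fpr:.4f}")
```

Two of the three frauds are caught and one of the five genuine transactions is flagged, so TPR is 2/3 and FPR is 1/5.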

To calculate the area under the ROC curve, we follow the steps below. [4]

  1. Calculate TPR and FPR at different probability thresholds.
  2. Plot a graph of TPR vs FPR based on the values we get from step 1.
  3. The area under the curve of TPR vs FPR is our metric. Note that FPR is on the x-axis and TPR is on the y-axis.
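The three steps above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the analysis: the thresholds are taken to be the distinct predicted scores, the area is computed with the trapezoidal rule, and the labels and scores are made up.

```python
def roc_points(y_true, scores, positive=1):
    """Step 1: (FPR, TPR) at each threshold, from strictest to loosest."""
    n_pos = sum(1 for y in y_true if y == positive)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores)
                 if s >= threshold and y == positive)
        fp = sum(1 for y, s in zip(y_true, scores)
                 if s >= threshold and y != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Steps 2 and 3: trapezoidal area with FPR on x and TPR on y."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example (made-up scores): higher score means "more likely fraud".
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(f"AUC = {auc:.4f}")
```

In this toy example, 8 of the 9 (fraud, genuine) pairs are ranked correctly, and the computed area is 8/9, which matches the usual reading of the area under the ROC curve as the probability that a random positive is scored above a random negative.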

Data

The data for the analysis is taken from https://www.kaggle.com/mlg-ulb/creditcardfraud. The dataset consists of credit card transactions made in 2013 by European cardholders. The predictor variables V1 to V28 are the result of a PCA transformation, so we cannot infer what they actually represent. The transformation was applied primarily to protect the confidentiality of the owners of these credit cards. The variables Time and Amount are not transformed with PCA. The variable ‘Class’ is the response or target variable; 1: Fraud, 0: No Fraud. [1] There is no missing data in the dataset.

Below are the boxplots of V1 to V28 for both classes, fraud and genuine. The boxplots for V1 to V19 differ visibly between the fraud and genuine classes.

The dataset gives the transaction amount for each fraudulent and genuine transaction. In the boxplot below, there is no big difference between the median amounts of the two transaction classes. However, the fraudulent transactions seem to have a larger variance.

We are also given a Time variable. It is not an exact timestamp; instead, the transactions are given in sequence, and Time is the number of seconds elapsed between each transaction and the first transaction. In the plot below, the genuine transactions form an almost flat line because there are so many of them. For the fraudulent transactions there are a few gaps in time, but these do not look like a significant anomaly.

Modeling

After checking the data, we can start building models. We split the data into training and testing datasets. The training dataset is further split into estimation and validation sets, and 5-fold cross-validation is used. We build models using the algorithms below.

  1. Logistic Regression
  2. Decision Trees
  3. K-Nearest Neighbors
  4. Extreme Gradient Boosting (XGBoost)
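The splitting scheme described above (a train/test split, then 5 folds within the training data) can be sketched as follows. Several details are assumptions, since the text does not state them: the 20% test fraction, the random seed, and the choice to stratify by class so that the rare fraud cases appear in every fold. The function names and toy labels are hypothetical.

```python
import random


def group_by_class(labels):
    """Map each class label to the list of indices carrying it."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    return by_class


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train and test, sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in group_by_class(labels).values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, n_folds=5, seed=0):
    """Deal the indices of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in group_by_class(labels).values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy labels (made up): 20 frauds (1) among 1000 transactions.
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels)
train_labels = [labels[i] for i in train_idx]

# Each fold serves once as the validation set; the rest is the estimation set.
folds = stratified_folds(train_labels)
for k, validation in enumerate(folds):
    held_out = set(validation)
    estimation = [i for i in range(len(train_labels)) if i not in held_out]
    n_fraud = sum(train_labels[i] for i in validation)
    print(f"fold {k}: estimation={len(estimation)}, "
          f"validation={len(validation)}, fraud in validation={n_fraud}")
```

Stratifying matters with labels this imbalanced: a purely random fold could contain very few frauds or none, which would make the validation sensitivity for that fold unreliable or undefined.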

It was found that the variables ‘V20’ to ‘V28’ and ‘Time’ did not improve model performance; instead, they increased the test error. Hence, these variables were removed when training the models.

Because of the intense computational requirements of the K-Nearest Neighbors algorithm, hyper-parameter tuning was not done for this algorithm. The model with k=10 was chosen.
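A brute-force sketch of K-Nearest Neighbors scoring with k=10 shows where the cost comes from: each prediction computes the distance to every training row. The Euclidean distance, the function name, and the toy two-feature data are assumptions for illustration; the text does not say which distance or implementation was used.

```python
import math


def knn_fraud_score(train_X, train_y, query, k=10):
    """Fraction of fraud labels (1) among the k nearest training rows.

    Brute force: one distance per training row for every query.
    """
    neighbours = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    )
    nearest = neighbours[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Toy data (made up): 20 genuine points near y=0, 12 fraud points near y=5.
genuine = [(i * 0.1, 0.0) for i in range(20)]
fraud = [(5.0 + i * 0.1, 5.0) for i in range(12)]
train_X = genuine + fraud
train_y = [0] * len(genuine) + [1] * len(fraud)

score_near_fraud = knn_fraud_score(train_X, train_y, (5.5, 5.0))
score_near_genuine = knn_fraud_score(train_X, train_y, (1.0, 0.0))
print(score_near_fraud, score_near_genuine)
```

With a few hundred thousand training rows, repeating this for every validation row in every fold and for every candidate k is what makes tuning expensive.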


Results

Model Performance Results

Model Name      Area Under ROC   Mean Sensitivity   False Positive Rate   Mean Specificity
Logistic        0.5853           0.5842             0.5000                0.5000
KNN             0.9742           0.6180             0.0717                0.9283
Decision Tree   0.9080           0.7562             0.1667                0.8333
XGBoost         0.9973           0.9964             0.4972                0.5028
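
The last two columns of the table carry the same information, because specificity is the complement of the false positive rate. Starting from the FPR definition in the Methods section:

\[ \text{Specificity} = \frac{\text{True Negative}}{\text{False Positive} + \text{True Negative}} = 1 - \text{FPR} \]

For example, the KNN row gives 1 - 0.0717 = 0.9283, which matches the reported mean specificity.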


Discussion

The Extreme Gradient Boosting (XGBoost) algorithm achieved the highest area under the ROC curve, 0.9973. The model also achieved the highest mean sensitivity, 0.9964. Hence, this algorithm is used to make predictions on the test set.


References

  1. Credit Card Fraud Detection. Kaggle. https://www.kaggle.com/mlg-ulb/creditcardfraud
  2. Deepanshu B. (2019, July). Precision Recall Curve Simplified. Listendata. https://www.listendata.com/2019/07/precision-recall-curve-simplified.html
  3. Dalpiaz D. (2020, October 28). R for Statistical Learning. https://daviddalpiaz.github.io/r4sl/
  4. Toshniwal R. (2020, January 15). Demystifying ROC Curves. https://towardsdatascience.com/demystifying-roc-curves-df809474529a
  5. Lee N. (2021, January 27). Credit card fraud will increase due to the covid pandemic, experts warn. https://www.cnbc.com/2021/01/27/credit-card-fraud-is-on-the-rise-due-to-covid-pandemic.html

Appendix

Data Dictionary

  • V1 to V28: The 28 predictor variables that result from the principal component transformation of the original predictor variables.
  • Time: Time elapsed since the first transaction, in seconds.
  • Amount: The transaction amount.
  • Class: The response variable. It takes the value 0 for a genuine transaction and 1 for a fraudulent transaction.