I was very interested in the discussion on fraud detection we had last week, and I would like to explore how machine learning is used to tackle this problem. I plan to collect data from publicly available credit card datasets that contain both fraudulent and legitimate transactions, and then build several models that can classify never-before-seen transactions as fraud or not. I want to focus on logistic regression, but also dip into gradient boosting models and neural nets in R if possible.
Credit card fraud detection is important to the banking industry, and it has only grown more important as more banking happens online. The upside is that increased transaction volume means more data can be collected: we can record every transaction along with relevant features of that transaction. Given that we can also label each transaction as fraudulent or not, this kind of data set lends itself naturally to machine learning.
knitr::opts_chunk$set(echo = TRUE, class.source = "codechunk")
library(dplyr) # for data manipulation
library(stringr) # for data manipulation
library(caret) # for sampling
library(caTools) # for train/test split
library(ggplot2) # for data visualization
library(corrplot) # for correlations
library(Rtsne) # for tsne plotting
library(DMwR2) # for smote implementation
library(ROSE) # for ROSE sampling
library(rpart) # for decision tree model
library(Rborist) # for random forest model
library(RCurl) # for reading the csv from a URL
The particular data set I will be using is found here: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud. It contains 284,807 transactions made by European cardholders in September 2013. Normally the columns would be labeled with the names of the transaction features they represent, but they have been obscured for confidentiality. The only features explicitly named are the amount of the transaction, the time at which it took place, and the fraud-or-not label. Having the feature names obscured will not hinder the machine learning, since the algorithms do not care what the columns are called. However, if we wanted to identify specific features as strong indicators of fraud, for example, we would eventually need to obtain the feature names.
This data set is understandably huge, so I had to figure out a way to upload it to GitHub: a normal “git push” is not allowed because the file exceeds 100 MB. I decided to use “git lfs”. The instructions (for Mac) are as follows:
Install the package using brew, then initialize it in the directory where you intend to use git: brew install git-lfs, then git lfs install
Track the large file, and also (very important!) stage .gitattributes: git lfs track creditcard.csv, then git add .gitattributes
Add, commit, and push the large file normally: git add creditcard.csv, git commit, git push
The file will be uploaded, git-lfs will manage the transfer, and the data will appear in your repo.
my_git_url <- getURL("https://media.githubusercontent.com/media/aelsaeyed/Data607/main/creditcard.csv")
credit_transactions <- read.csv(text = my_git_url)
Below we can see a summary of the many features included in the dataset:
summary(credit_transactions, 10)
## Time V1 V2 V3
## Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
## Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
## Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:139320 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
## Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
## V4 V5 V6 V7
## Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
## 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
## Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
## Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
## V8 V9 V10 V11
## Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14 V15
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
## Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
## V16 V17 V18
## Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
## 1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
## Median : 0.06641 Median : -0.06568 Median :-0.003636
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
## Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
## V19 V20 V21
## Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
## 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
## Median : 0.003735 Median : -0.06248 Median : -0.02945
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
## Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
## V22 V23 V24
## Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
## 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
## Median : 0.006782 Median : -0.01119 Median : 0.04098
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
## Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
## V25 V26 V27
## Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
## 1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
## Median : 0.01659 Median :-0.05214 Median : 0.001342
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
## Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
## V28 Amount Class
## Min. :-15.43008 Min. : 0.00 Min. :0.000000
## 1st Qu.: -0.05296 1st Qu.: 5.60 1st Qu.:0.000000
## Median : 0.01124 Median : 22.00 Median :0.000000
## Mean : 0.00000 Mean : 88.35 Mean :0.001728
## 3rd Qu.: 0.07828 3rd Qu.: 77.17 3rd Qu.:0.000000
## Max. : 33.84781 Max. :25691.16 Max. :1.000000
A property of this data set that will impact the model negatively is that there are very few frauds: only 492 of the 284,807 transactions. This level of imbalance will lead to a model that is heavily biased towards classifying transactions as legitimate. Fraud happens far less often than legitimate activity, but there is still a need to catch those transactions.
credit_transactions %>%
filter(Class == 1)
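To make the imbalance concrete, here is a quick check I am adding (not part of the original analysis) that counts the two classes and their proportions; the fraud share is roughly 0.17%, matching the mean of the Class column in the summary above.
table(credit_transactions$Class) # raw counts of legit (0) vs fraud (1) transactions
prop.table(table(credit_transactions$Class)) # proportion of each class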
The goal of this project is to investigate methods of working around the problem of having an imbalanced class. There are several different ways to do this, and we can test them all out against test data to see which yields the best results.
First I will remove the “Time” column, because it is not the actual time of the transaction but rather the time elapsed since the first transaction. I will also convert the “Class” variable to a factor. Treating it as a truly categorical variable improves how the modeling functions handle it, and it works because we know each element can only take one of two discrete values: 0 for non-fraud, 1 for fraud.
credit_transactions_cleaned <- credit_transactions[,-1] # drop the Time column
credit_transactions_cleaned$Class <- as.factor(credit_transactions_cleaned$Class) # treat Class as categorical
levels(credit_transactions_cleaned$Class) <- c("Legit", "Fraud") # relabel 0/1 as Legit/Fraud
credit_transactions_cleaned[,-30] <- scale(credit_transactions_cleaned[,-30]) # scale all predictors, leaving Class untouched
head(credit_transactions_cleaned, 10)
Here I split the data into training and testing subsets: 70% will be used to train and 30% to test. The split is stratified on the Class column so that both subsets keep the same fraud proportion.
set.seed(123)
split <- sample.split(credit_transactions_cleaned$Class, SplitRatio = 0.7)
train <- subset(credit_transactions_cleaned, split == TRUE)
test <- subset(credit_transactions_cleaned, split == FALSE)
table(test$Class)
##
## Legit Fraud
## 85295 148
table(train$Class)
##
## Legit Fraud
## 199020 344
Here I will fit a decision tree model on the data set as-is to establish a baseline. rpart lets us specify which variable is our “label” via a formula and takes our training data from above. We can then use predict to run the sequestered test data through the model.
To evaluate the model, I use a ROC curve, the Receiver Operating Characteristic curve. As a quick summary, the ROC curve plots the true positive rate against the false positive rate as the classification threshold varies, essentially showing how well the model separates the two classes. The AUC, the area under that curve, is a quick way to summarize the ROC curve, with values closer to 1 indicating that the model is better at distinguishing between “Legit” and “Fraud”.
Here the area is 0.912, which is now our baseline.
set.seed(5627)
orig_fit <- rpart(Class ~ ., data = train)
pred_orig <- predict(orig_fit, newdata = test, type = "prob") # class probabilities for the test set
roc.curve(test$Class, pred_orig[,2], plotit = TRUE)
## Area under the curve (AUC): 0.912
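To tie the AUC back to the ROC definition above, here is a small sketch of my own (assuming an illustrative 0.5 probability cutoff that is not part of the original analysis) computing a single true-positive-rate / false-positive-rate point from the baseline predictions:
pred_label <- ifelse(pred_orig[, 2] > 0.5, "Fraud", "Legit") # hypothetical hard cutoff at 0.5
tpr <- mean(pred_label[test$Class == "Fraud"] == "Fraud") # share of frauds correctly flagged
fpr <- mean(pred_label[test$Class == "Legit"] == "Fraud") # share of legit transactions wrongly flagged
c(TPR = tpr, FPR = fpr)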
Now I will modify the original dataset by downsampling. Downsampling is a process where records from the over-represented majority class, in this case “Legit” transactions, are removed to even out the distribution of legitimate and fraudulent transactions. The function downSample takes care of this for us once we pass in the predictors and the “Class” variable from our training data set.
We use rpart once more to fit a decision tree on the newly downsampled data, then predict on the test data. We find that our AUC has gone up to 0.942, a good improvement.
set.seed(9560)
down_train <- downSample(x = train[, -ncol(train)],y = train$Class)
table(down_train$Class)
##
## Legit Fraud
## 344 344
down_fit <- rpart(Class ~ ., data = down_train)
pred_down <- predict(down_fit, newdata = test)
roc.curve(test$Class, pred_down[,2], plotit = TRUE)
## Area under the curve (AUC): 0.942
Now I will modify the original dataset by upsampling instead. Upsampling is a process where records from the small positive class, in this case “Fraud” transactions, are randomly duplicated to even out the distribution of legitimate and fraudulent transactions. The function upSample takes care of this for us.
We use rpart once more to fit a decision tree on the newly upsampled data, then predict on the test data. We find that our AUC has gone up to 0.943, a slightly larger improvement.
We can say for now that upsampling seems to be the slightly more favorable way to balance out the data.
set.seed(9560)
up_train <- upSample(x = train[, -ncol(train)],y = train$Class)
table(up_train$Class)
##
## Legit Fraud
## 199020 199020
up_fit <- rpart(Class ~ ., data = up_train)
pred_up <- predict(up_fit, newdata = test)
roc.curve(test$Class, pred_up[,2], plotit = TRUE)
## Area under the curve (AUC): 0.943
I will repeat this experiment with a logistic regression model instead of a decision tree. First, the baseline: our AUC is 0.967, already higher right out of the gate.
glm_fit <- glm(Class ~ ., data = train, family = 'binomial')
pred_glm <- predict(glm_fit, newdata = test, type = 'response')
roc.curve(test$Class, pred_glm, plotit = TRUE)
## Area under the curve (AUC): 0.967
The AUC after downsampling edges up to 0.968.
glm_fit <- glm(Class ~ ., data = down_train, family = 'binomial')
pred_glm <- predict(glm_fit, newdata = test, type = 'response')
roc.curve(test$Class, pred_glm, plotit = TRUE)
## Area under the curve (AUC): 0.968
The AUC after upsampling rises further to 0.971.
glm_fit <- glm(Class ~ ., data = up_train, family = 'binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred_glm <- predict(glm_fit, newdata = test, type = 'response')
roc.curve(test$Class, pred_glm, plotit = TRUE)
## Area under the curve (AUC): 0.971
Upsampling has produced a better AUC, and hence a better model, for both model types. This is intuitive because downsampling discards information: it removes records from the majority negative class. However, since I am using random upsampling, these models may be slightly overfitted, because random upsampling simply duplicates “Fraud” records.
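As a quick sanity check on that duplication claim (my own addition, not part of the original analysis), we can count duplicate fraud rows before and after upsampling:
sum(duplicated(train[train$Class == "Fraud", ])) # duplicate fraud rows in the original training data
sum(duplicated(up_train[up_train$Class == "Fraud", ])) # duplicate fraud rows after random upsampling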
As mentioned before, the features of the transactions have been masked, but that does not stop us from slightly improving this model by using only the most influential features. Here are all the features and their coefficients, which we can display by printing the object returned by glm:
glm_fit
##
## Call: glm(formula = Class ~ ., family = "binomial", data = up_train)
##
## Coefficients:
## (Intercept) V1 V2 V3 V4 V5
## -3.75669 1.32012 0.40632 0.39799 1.32604 0.87120
## V6 V7 V8 V9 V10 V11
## -0.56008 -0.63636 -0.55251 -0.63347 -1.17539 0.76955
## V12 V13 V14 V15 V16 V17
## -1.15290 -0.56901 -1.31554 -0.30114 -0.53163 -0.72793
## V18 V19 V20 V21 V22 V23
## -0.32257 0.39341 -0.53963 0.14766 0.51143 0.16001
## V24 V25 V26 V27 V28 Amount
## 0.11053 -0.06214 -0.09526 -0.22995 0.15298 1.59449
##
## Degrees of Freedom: 398039 Total (i.e. Null); 398010 Residual
## Null Deviance: 551800
## Residual Deviance: 101900 AIC: 102000
Here I simply rank them in order from most influential to least. This shows us that the feature V4 is the most important feature in classifying a transaction as fraudulent or not, followed by V14 and so on.
importance <- as.data.frame(varImp(glm_fit)) %>%
arrange(desc(Overall))
importance
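The next chunk hard-codes the top features; as an alternative sketch of my own (in base R so that the feature names in the row names are kept), the top 10 names could also be pulled programmatically:
imp <- varImp(glm_fit) # caret's importance scores, rows named by feature
top10 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10] # the 10 highest-ranked feature names
top10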
I will redo the decision tree experiment above, this time with only the top 10 most important features.
credit_transactions_if <- credit_transactions[,-1]
credit_transactions_if$Class <- as.factor(credit_transactions_if$Class)
levels(credit_transactions_if$Class) <- c("Legit", "Fraud")
credit_transactions_if[,-30] <- scale(credit_transactions_if[,-30])
credit_transactions_if <- credit_transactions_if %>%
select(V4, V14, V13, V12, V11, V22, V10, V8, V19, V15, Class)
head(credit_transactions_if, 10)
Recreating the training/testing data:
set.seed(123)
split2 <- sample.split(credit_transactions_if$Class, SplitRatio = 0.7)
train_if <- subset(credit_transactions_if, split2 == TRUE)
test_if <- subset(credit_transactions_if, split2 == FALSE)
table(test_if$Class)
##
## Legit Fraud
## 85295 148
table(train_if$Class)
##
## Legit Fraud
## 199020 344
I will once again get a baseline using a decision tree model, but of course with only the 10 chosen features. Our baseline AUC is now 0.892.
set.seed(5627)
orig_fit_if <- rpart(Class ~ ., data = train_if)
pred_orig_if <- predict(orig_fit_if, newdata = test_if, type = "prob")
roc.curve(test_if$Class, pred_orig_if[,2], plotit = TRUE)
## Area under the curve (AUC): 0.892
Now I will upsample the important-factors-only dataset and see how much the model improves. The AUC jumps up to 0.962, which beats the upsampled decision tree model from before that used all of the features (0.943).
set.seed(9560)
up_train_if <- upSample(x = train_if[, -ncol(train_if)], y = train_if$Class) # upsample the training subset
table(up_train_if$Class)
##
## Legit Fraud
## 199020 199020
up_fit_if <- rpart(Class ~ ., data = up_train_if)
pred_up_if <- predict(up_fit_if, newdata = test_if)
roc.curve(test_if$Class, pred_up_if[,2], plotit = TRUE)
## Area under the curve (AUC): 0.962
This project demonstrates that upsampling and downsampling are viable and useful ways of dealing with imbalanced data. This was done with simple up/down sampling, i.e. randomly duplicating or removing records. More advanced resampling methods exist that can potentially yield even better models.
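As one example of a more advanced option (a sketch of my own using the ROSE package loaded at the top, not something evaluated in this project), ROSE generates a synthetically balanced training set rather than duplicating rows, and its output could be fed to the same rpart and roc.curve pipeline used above:
set.seed(9560)
rose_train <- ROSE(Class ~ ., data = train, seed = 9560)$data # synthetic, roughly balanced training sample
table(rose_train$Class)
rose_fit <- rpart(Class ~ ., data = rose_train)
pred_rose <- predict(rose_fit, newdata = test)
roc.curve(test$Class, pred_rose[, 2], plotit = TRUE)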