Project Proposal

I was very interested in the discussion on fraud detection we had last week, and I would like to explore how machine learning is used to tackle this problem. I plan on collecting data from publicly available credit card datasets that contain both fraudulent and authentic transactions, and then building a few different models that can classify never-before-seen transactions as fraudulent or legitimate. I want to focus on logistic regression, but will also explore gradient boosting models and neural nets in R if possible.

Introduction

Credit card fraud detection is important to the banking industry, and it has only grown in importance as more banking happens online. The upside of increased transaction volumes is that more data can be collected: we can record not just the transactions themselves but also relevant features of every transaction. Given that we can identify whether or not a transaction is fraudulent, this kind of data set at first glance lends itself to machine learning.

knitr::opts_chunk$set(echo = TRUE, class.source = "codechunk")

library(dplyr) # for data manipulation
library(stringr) # for data manipulation
library(caret) # for sampling
library(caTools) # for train/test split
library(ggplot2) # for data visualization
library(corrplot) # for correlations
library(Rtsne) # for tsne plotting
library(DMwR2) # for smote implementation
library(ROSE) # for ROSE sampling
library(rpart) # for decision tree model
library(Rborist) # for random forest model
library(RCurl) # for fetching the data over HTTPS

The Data

The particular data set I will be using can be found here: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud. It represents 284,807 transactions of European cardholders collected in September 2013. Normally the columns would be labeled with the names of the features they represent; however, they have been obscured for confidentiality. The only features that are explicitly named are the amount of the transaction, the time at which it took place, and the fraud-or-not label. Note that having the feature names obscured will not hinder the machine learning, as the algorithms do not care what the columns are called. However, if we wanted to identify specific features as strong indicators of fraud, for example, we would eventually need to obtain the feature names.

New Technique to Upload Large Files to Git

This data set is understandably huge, so I had to figure out a way to upload it to GitHub, since a normal "git push" is not allowed for files exceeding 100 MB. I decided to use Git LFS ("git lfs"). The instructions for using it are as follows (for Mac):

Install the package using brew, and then initialize it in the repository where you intend to use git:

brew install git-lfs
git lfs install

Track the large file like so, and also (very important!) stage .gitattributes:

git lfs track creditcard.csv
git add .gitattributes

Add, commit, and push the large file normally:

git add creditcard.csv
git commit
git push

Git LFS will take care of managing the upload, and the file will appear in your repo.
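For reference, the rule that "git lfs track" writes into .gitattributes looks like this:

creditcard.csv filter=lfs diff=lfs merge=lfs -text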

# Fetch the raw CSV (served through Git LFS) from GitHub and parse it
my_git_url <- getURL("https://media.githubusercontent.com/media/aelsaeyed/Data607/main/creditcard.csv")
credit_transactions <- read.csv(text = my_git_url)

Below we can see a summary of the many features included in the dataset:

summary(credit_transactions, 10)
##       Time              V1                  V2                  V3          
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
##  Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
##  3rd Qu.:139320   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
##        V4                 V5                   V6                 V7          
##  Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
##  1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
##  Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
##  Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
##  3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
##  Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
##        V8                  V9                 V10                 V11          
##  Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14                V15          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
##       V16                 V17                 V18           
##  Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746  
##  1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850  
##  Median :  0.06641   Median : -0.06568   Median :-0.003636  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000  
##  3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807  
##  Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069  
##       V19                 V20                 V21           
##  Min.   :-7.213527   Min.   :-54.49772   Min.   :-34.83038  
##  1st Qu.:-0.456299   1st Qu.: -0.21172   1st Qu.: -0.22839  
##  Median : 0.003735   Median : -0.06248   Median : -0.02945  
##  Mean   : 0.000000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.458949   3rd Qu.:  0.13304   3rd Qu.:  0.18638  
##  Max.   : 5.591971   Max.   : 39.42090   Max.   : 27.20284  
##       V22                  V23                 V24          
##  Min.   :-10.933144   Min.   :-44.80774   Min.   :-2.83663  
##  1st Qu.: -0.542350   1st Qu.: -0.16185   1st Qu.:-0.35459  
##  Median :  0.006782   Median : -0.01119   Median : 0.04098  
##  Mean   :  0.000000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.528554   3rd Qu.:  0.14764   3rd Qu.: 0.43953  
##  Max.   : 10.503090   Max.   : 22.52841   Max.   : 4.58455  
##       V25                 V26                V27            
##  Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
##  1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
##  Median :  0.01659   Median :-0.05214   Median :  0.001342  
##  Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
##  3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
##  Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
##       V28                Amount             Class         
##  Min.   :-15.43008   Min.   :    0.00   Min.   :0.000000  
##  1st Qu.: -0.05296   1st Qu.:    5.60   1st Qu.:0.000000  
##  Median :  0.01124   Median :   22.00   Median :0.000000  
##  Mean   :  0.00000   Mean   :   88.35   Mean   :0.001728  
##  3rd Qu.:  0.07828   3rd Qu.:   77.17   3rd Qu.:0.000000  
##  Max.   : 33.84781   Max.   :25691.16   Max.   :1.000000

A feature of this data set that will negatively impact a model is the class imbalance: only 492 of the 284,807 transactions are frauds. This level of imbalance will lead to a model that is heavily biased towards classifying transactions as legitimate. Fraud happens far less often than legitimate activity, but there is still a need to catch these transactions.

credit_transactions %>% 
  filter(Class == 1)
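To put a number on the imbalance, the fraud rate can be computed directly; a quick check in base R (not part of the modeling below):

# Proportion of transactions labeled as fraud
mean(credit_transactions$Class == 1) # ~0.00173, i.e. about 0.17%

This matches the mean of the Class column shown in the summary above (0.001728).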

Goals

The goal of this project is to investigate methods of working around the problem of having an imbalanced class. There are several different ways to do this, and we can test them all out against test data to see which yields the best results.

Data Cleanup

First I will remove the "Time" column, because it is not the actual time of each transaction but rather the seconds elapsed since the first transaction. I will also change the "Class" variable to a factor. I learned that this improves performance when using this variable for machine learning, as it will be treated as a truly categorical variable. This works because we know the elements can only take one of two discrete values: 0 for non-fraud and 1 for fraud.

# Drop the Time column (column 1)
credit_transactions_cleaned <- credit_transactions[,-1]

# Convert Class to a factor with readable labels
credit_transactions_cleaned$Class <- as.factor(credit_transactions_cleaned$Class)
levels(credit_transactions_cleaned$Class) <- c("Legit", "Fraud")

# Scale every predictor (all columns except Class, column 30)
credit_transactions_cleaned[,-30] <- scale(credit_transactions_cleaned[,-30])

head(credit_transactions_cleaned, 10)

Split Training Data

Here I split the data into training and testing subsets such that 70% of it will be used to train and 30% to test. The split is stratified on the Class column, so both subsets preserve the original fraud rate.

set.seed(123)
split <- sample.split(credit_transactions_cleaned$Class, SplitRatio = 0.7)
train <-  subset(credit_transactions_cleaned, split == TRUE)
test <- subset(credit_transactions_cleaned, split == FALSE)
table(test$Class)
## 
## Legit Fraud 
## 85295   148
table(train$Class)
## 
##  Legit  Fraud 
## 199020    344

Baseline with Decision Tree

Here I will create a decision tree model on the data set as-is to establish a baseline. rpart lets us specify which variable is our "label" and takes our training data from above. We can then use predict to run the sequestered test data through the model.

In order to evaluate the model, I use a ROC curve (Receiver Operating Characteristic curve). As a quick summary, the ROC curve plots the true positive rate against the false positive rate across classification thresholds, showing us how well the model separates correct detections from false alarms. The AUC, the area under the curve, is a quick way to summarize the ROC curve, with areas closer to 1 indicating that the model is better at distinguishing between "Legit" and "Fraud".
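As a side note, the AUC has a handy interpretation: it is the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction. Below is a minimal sketch of computing it directly from ranks via the Mann-Whitney identity; this is only an illustration, as the roc.curve function from ROSE handles the calculation for us:

# AUC via the Mann-Whitney rank identity (illustrative sketch)
# labels: 1 = Fraud, 0 = Legit; scores: predicted fraud probabilities
auc_manual <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  r <- rank(c(pos, neg)) # average ranks handle tied scores
  u <- sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2
  u / (length(pos) * length(neg))
}
# e.g., after the chunk below has run:
# auc_manual(as.numeric(test$Class == "Fraud"), pred_orig[, 2])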

Here the area is 0.912, which is now our baseline.

set.seed(5627)

orig_fit <- rpart(Class ~ ., data = train)
pred_orig <- predict(orig_fit, newdata = test, type = "prob")

roc.curve(test$Class, pred_orig[,2], plotit = TRUE)

## Area under the curve (AUC): 0.912

Downsampling with Decision Tree

Now I will modify the original dataset by downsampling. Downsampling is a process where data from the overwhelming negative class, in this case "Legit" transactions, is removed to even out the distribution of legitimate and fraudulent transactions. The downSample function from caret takes care of this for us once we supply the predictors and the "Class" variable from our training data set.
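To illustrate the idea (the chunk below uses downSample itself), a hand-rolled downsample would randomly keep only as many "Legit" rows as there are "Fraud" rows; a sketch assuming the train data frame from above:

# Hand-rolled downsampling sketch (downSample automates this)
set.seed(1)
fraud_idx <- which(train$Class == "Fraud")
legit_idx <- sample(which(train$Class == "Legit"), length(fraud_idx))
down_manual <- train[c(fraud_idx, legit_idx), ]
table(down_manual$Class) # 344 of each class, matching downSample below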

We make use of rpart once more to create a decision tree model using the newly downsampled data, and then use it to predict on the test data. We find that our AUC has gone up to 0.942, a good improvement.

set.seed(9560)
down_train <- downSample(x = train[, -ncol(train)], y = train$Class)
table(down_train$Class)
## 
## Legit Fraud 
##   344   344
down_fit <- rpart(Class ~ ., data = down_train)

pred_down <- predict(down_fit, newdata = test)
roc.curve(test$Class, pred_down[,2], plotit = TRUE)

## Area under the curve (AUC): 0.942

Upsampling with Decision Tree

Now I will modify the original dataset by upsampling. Upsampling is a process where data from the small positive class, in this case "Fraud" transactions, is randomly duplicated to even out the distribution of legitimate and fraudulent transactions. The upSample function, also from caret, takes care of this for us.
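The manual equivalent mirrors the downsampling sketch, except the minority class is drawn with replacement until it matches the majority count:

# Hand-rolled upsampling sketch (upSample automates this)
set.seed(1)
fraud_idx <- which(train$Class == "Fraud")
up_idx <- sample(fraud_idx, sum(train$Class == "Legit"), replace = TRUE)
up_manual <- rbind(train[train$Class == "Legit", ], train[up_idx, ])
table(up_manual$Class) # 199020 of each class, matching upSample below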

We make use of rpart once more to create a decision tree model using the newly upsampled data, and then use it to predict on the test data. We find that our AUC has gone up to 0.943, edging out the downsampled model.

We can say for now that upsampling seems to be the slightly more favorable method to balance out the data.

set.seed(9560)
up_train <- upSample(x = train[, -ncol(train)], y = train$Class)
table(up_train$Class)
## 
##  Legit  Fraud 
## 199020 199020
up_fit <- rpart(Class ~ ., data = up_train)

pred_up <- predict(up_fit, newdata = test)
roc.curve(test$Class, pred_up[,2], plotit = TRUE)

## Area under the curve (AUC): 0.943

Baseline with Logistic Regression

I will try this experiment again, but with a logistic regression model instead of a decision tree. First, the baseline: our AUC is 0.967, already higher than any of the decision tree results.

glm_fit <- glm(Class ~ ., data = train, family = 'binomial')

pred_glm <- predict(glm_fit, newdata = test, type = 'response')

roc.curve(test$Class, pred_glm, plotit = TRUE)

## Area under the curve (AUC): 0.967

Downsampling with Logistic Regression

The AUC after downsampling rises slightly to 0.968.

glm_fit <- glm(Class ~ ., data = down_train, family = 'binomial')

pred_glm <- predict(glm_fit, newdata = test, type = 'response')

roc.curve(test$Class, pred_glm, plotit = TRUE)

## Area under the curve (AUC): 0.968

Upsampling with Logistic Regression

The AUC after upsampling rises further to 0.971.

glm_fit <- glm(Class ~ ., data = up_train, family = 'binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred_glm <- predict(glm_fit, newdata = test, type = 'response')

roc.curve(test$Class, pred_glm, plotit = TRUE)

## Area under the curve (AUC): 0.971

Upsampling has provided the better AUC score, and hence the better model, for both model types. This is intuitive, because downsampling represents a loss of information: it removes data associated with the majority negative class. However, since I am using random upsampling, which simply duplicates "Fraud" records, these models might be slightly overfitted.
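One quick way to probe for that kind of overfitting (an extra diagnostic, not part of the analysis above) is to score the balanced training data with the same model and compare against the test AUC:

# Compare the training AUC to the test AUC of 0.971; a much higher
# training value would suggest the duplicated Fraud rows are memorized
pred_train <- predict(glm_fit, newdata = up_train, type = 'response')
roc.curve(up_train$Class, pred_train, plotit = FALSE)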

As mentioned before, the features of the transactions have been masked, but that does not stop us from slightly improving this model by using only the most influential features. Here are all the features and their coefficients, which we can display by printing the object returned by glm:

glm_fit
## 
## Call:  glm(formula = Class ~ ., family = "binomial", data = up_train)
## 
## Coefficients:
## (Intercept)           V1           V2           V3           V4           V5  
##    -3.75669      1.32012      0.40632      0.39799      1.32604      0.87120  
##          V6           V7           V8           V9          V10          V11  
##    -0.56008     -0.63636     -0.55251     -0.63347     -1.17539      0.76955  
##         V12          V13          V14          V15          V16          V17  
##    -1.15290     -0.56901     -1.31554     -0.30114     -0.53163     -0.72793  
##         V18          V19          V20          V21          V22          V23  
##    -0.32257      0.39341     -0.53963      0.14766      0.51143      0.16001  
##         V24          V25          V26          V27          V28       Amount  
##     0.11053     -0.06214     -0.09526     -0.22995      0.15298      1.59449  
## 
## Degrees of Freedom: 398039 Total (i.e. Null);  398010 Residual
## Null Deviance:       551800 
## Residual Deviance: 101900    AIC: 102000

Here I rank the features from most influential to least (caret's varImp scores a glm by the absolute value of each coefficient's t-statistic). This shows that the feature V4 is the most important in classifying a transaction as fraudulent or not, followed by V14, and so on.

importance <- as.data.frame(varImp(glm_fit)) %>% 
  arrange(desc(Overall))
importance

I will redo the above experiment with decision trees, this time however with only the top 10 most important features.

# Repeat the cleanup: drop Time, factor Class, scale the predictors
credit_transactions_if <- credit_transactions[,-1]

credit_transactions_if$Class <- as.factor(credit_transactions_if$Class)
levels(credit_transactions_if$Class) <- c("Legit", "Fraud")

credit_transactions_if[,-30] <- scale(credit_transactions_if[,-30])

# Keep only the ten most important features plus the label
credit_transactions_if <- credit_transactions_if %>% 
  select(V4, V14, V13, V12, V11, V22, V10, V8, V19, V15, Class)

head(credit_transactions_if, 10)

Recreating the training/testing data:

set.seed(123)
split2 <- sample.split(credit_transactions_if$Class, SplitRatio = 0.7)
train_if <-  subset(credit_transactions_if, split2 == TRUE)
test_if <- subset(credit_transactions_if, split2 == FALSE)
table(test_if$Class)
## 
## Legit Fraud 
## 85295   148
table(train_if$Class)
## 
##  Legit  Fraud 
## 199020    344

I will once again get a baseline using a decision tree model, but of course with only the 10 chosen features. Our baseline AUC is now 0.892.

set.seed(5627)

orig_fit_if <- rpart(Class ~ ., data = train_if)
pred_orig_if <- predict(orig_fit_if, newdata = test_if, type = "prob")

roc.curve(test_if$Class, pred_orig_if[,2], plotit = TRUE)

## Area under the curve (AUC): 0.892

Now I will upsample the important-factors-only dataset and see how much the model improves. The AUC jumps up to 0.962, which beats the upsampled decision tree model from before that used all of the features (0.943).

set.seed(9560)
up_train_if <- upSample(x = train_if[, -ncol(train_if)], y = train_if$Class)
table(up_train_if$Class)
## 
##  Legit  Fraud 
## 199020 199020
up_fit_if <- rpart(Class ~ ., data = up_train_if)

pred_up_if <- predict(up_fit_if, newdata = test_if)
roc.curve(test_if$Class, pred_up_if[,2], plotit = TRUE)

## Area under the curve (AUC): 0.962

This project demonstrates that upsampling and downsampling are viable and useful ways of dealing with imbalanced data. This was done with simple up/down sampling, i.e., randomly removing or duplicating records. More advanced resampling methods exist that can potentially produce even better models.
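For example, the ROSE package loaded at the top generates synthetic, smoothed examples rather than duplicating rows. A minimal, untested sketch against the same training data:

# ROSE draws a synthetic balanced sample instead of copying minority rows
set.seed(9560)
rose_train <- ROSE(Class ~ ., data = train)$data
table(rose_train$Class)

rose_fit <- rpart(Class ~ ., data = rose_train)
pred_rose <- predict(rose_fit, newdata = test)
roc.curve(test$Class, pred_rose[, 2], plotit = TRUE)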