Credit Card Fraud Detection in R

Sambhav Shrestha

1. Introduction

In this project, I analyze credit card transaction data, available at https://www.kaggle.com/mlg-ulb/creditcardfraud. The goal is to build a model that can detect fraudulent transactions. I will use four machine learning models (Logistic Regression, Decision Trees, Random Forest, and XGBoost) and compare their performance using the sensitivity vs. specificity curve, also called the Receiver Operating Characteristic (ROC) curve. We begin by importing the necessary libraries and loading the dataset.

2. Importing the libraries

# importing the required libraries
library(dplyr)      # for data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ranger)     # for faster implementation of random forests
library(caret)      # for classification and regression training
## Loading required package: lattice
## Loading required package: ggplot2
library(caTools)    # for splitting data into training and test set
library(data.table) # for converting data frame to table for faster execution
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(ggplot2)    # for basic plot
library(corrplot)   # for plotting correlations between variables
## corrplot 0.84 loaded
library(Rtsne)      # for t-SNE visualization
library(ROSE)       # for rose sampling
## Loaded ROSE 0.0-3
library(pROC)       # for plotting ROC curve
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(rpart)      # for regression trees
library(rpart.plot) # for plotting decision tree
library(Rborist)    # for random forest model
## Rborist 0.2-3
## Type RboristNews() to see new features/changes/bug fixes.
library(xgboost)    # for xgboost model
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice

3. Importing the credit card dataset

We will convert the data frame to a data.table, which performs much faster when analyzing big data.

# importing the dataset
dataset <- setDT(read.csv("data/creditcard.csv"))

4. Data Exploration

Let’s explore the dataset, see if anything stands out, and preprocess the data for building our machine learning models.

# exploring the credit card data
head(dataset)
##    Time         V1          V2        V3         V4          V5          V6
## 1:    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2:    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3:    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4:    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5:    2 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146
## 6:    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
##             V7          V8         V9         V10        V11         V12
## 1:  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2: -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3:  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4:  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5:  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
## 6:  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
##           V13        V14        V15        V16         V17         V18
## 1: -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
## 2:  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
## 3:  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
## 4:  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
## 5:  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6: -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
##            V19         V20          V21          V22         V23         V24
## 1:  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692807
## 2: -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
## 3: -2.26185710  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
## 4: -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
## 5:  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
## 6: -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
##           V25        V26          V27         V28 Amount Class
## 1:  0.1285394 -0.1891148  0.133558377 -0.02105305 149.62     0
## 2:  0.1671704  0.1258945 -0.008983099  0.01472417   2.69     0
## 3: -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66     0
## 4:  0.6473760 -0.2219288  0.062722849  0.06145763 123.50     0
## 5: -0.2060096  0.5022922  0.219422230  0.21515315  69.99     0
## 6: -0.2327938  0.1059148  0.253844225  0.08108026   3.67     0
tail(dataset)
##      Time          V1          V2         V3         V4          V5         V6
## 1: 172785   0.1203164  0.93100513 -0.5460121 -0.7450968  1.13031398 -0.2359732
## 2: 172786 -11.8811179 10.07178497 -9.8347835 -2.0666557 -5.36447278 -2.6068373
## 3: 172787  -0.7327887 -0.05508049  2.0350297 -0.7385886  0.86822940  1.0584153
## 4: 172788   1.9195650 -0.30125385 -3.2496398 -0.5578281  2.63051512  3.0312601
## 5: 172788  -0.2404400  0.53048251  0.7025102  0.6897992 -0.37796113  0.6237077
## 6: 172792  -0.5334125 -0.18973334  0.7033374 -0.5062712 -0.01254568 -0.6496167
##            V7         V8         V9        V10        V11         V12
## 1:  0.8127221  0.1150929 -0.2040635 -0.6574221  0.6448373  0.19091623
## 2: -4.9182154  7.3053340  1.9144283  4.3561704 -1.5931053  2.71194079
## 3:  0.0243297  0.2948687  0.5848000 -0.9759261 -0.1501888  0.91580191
## 4: -0.2968265  0.7084172  0.4324540 -0.4847818  0.4116137  0.06311886
## 5: -0.6861800  0.6791455  0.3920867 -0.3991257 -1.9338488 -0.96288614
## 6:  1.5770063 -0.4146504  0.4861795 -0.9154266 -1.0404583 -0.03151305
##           V13         V14         V15        V16         V17        V18
## 1: -0.5463289 -0.73170658 -0.80803553  0.5996281  0.07044075  0.3731103
## 2: -0.6892556  4.62694203 -0.92445871  1.1076406  1.99169111  0.5106323
## 3:  1.2147558 -0.67514296  1.16493091 -0.7117573 -0.02569286 -1.2211789
## 4: -0.1836987 -0.51060184  1.32928351  0.1407160  0.31350179  0.3956525
## 5: -1.0420817  0.44962444  1.96256312 -0.6085771  0.50992846  1.1139806
## 6: -0.1880929 -0.08431647  0.04133346 -0.3026201 -0.66037665  0.1674299
##           V19          V20        V21        V22         V23          V24
## 1:  0.1289038 0.0006758329 -0.3142046 -0.8085204  0.05034266  0.102799590
## 2: -0.6829197 1.4758291347  0.2134541  0.1118637  1.01447990 -0.509348453
## 3: -1.5455561 0.0596158999  0.2142053  0.9243836  0.01246304 -1.016225669
## 4: -0.5772518 0.0013959703  0.2320450  0.5782290 -0.03750086  0.640133881
## 5:  2.8978488 0.1274335158  0.2652449  0.8000487 -0.16329794  0.123205244
## 6: -0.2561169 0.3829481049  0.2610573  0.6430784  0.37677701  0.008797379
##           V25        V26          V27         V28 Amount Class
## 1: -0.4358701  0.1240789  0.217939865  0.06880333   2.69     0
## 2:  1.4368069  0.2500343  0.943651172  0.82373096   0.77     0
## 3: -0.6066240 -0.3952551  0.068472470 -0.05352739  24.79     0
## 4:  0.2657455 -0.0873706  0.004454772 -0.02656083  67.88     0
## 5: -0.5691589  0.5466685  0.108820735  0.10453282  10.00     0
## 6: -0.4736487 -0.8182671 -0.002415309  0.01364891 217.00     0
# tabulate the Class column (0 for legitimate transactions, 1 for fraud)
table(dataset$Class)
## 
##      0      1 
## 284315    492
# view the column names of the dataset
names(dataset)
##  [1] "Time"   "V1"     "V2"     "V3"     "V4"     "V5"     "V6"     "V7"    
##  [9] "V8"     "V9"     "V10"    "V11"    "V12"    "V13"    "V14"    "V15"   
## [17] "V16"    "V17"    "V18"    "V19"    "V20"    "V21"    "V22"    "V23"   
## [25] "V24"    "V25"    "V26"    "V27"    "V28"    "Amount" "Class"

Looking at the data, we can see that there are 28 anonymized variables V1–V28, one Time column, one Amount column, and one Class label column (0 for not fraud, 1 for fraud). We will visualize the data with histograms and bar plots to look for connections or relations between variables.

# view summary and histograms of the Amount column
summary(dataset$Amount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     5.60    22.00    88.35    77.17 25691.16
hist(dataset$Amount)

hist(dataset$Amount[dataset$Amount < 100])

# view variance and standard deviation of amount column
var(dataset$Amount)
## [1] 62560.07
sd(dataset$Amount)
## [1] 250.1201
# check whether there are any missing values in the columns
colSums(is.na(dataset))
##   Time     V1     V2     V3     V4     V5     V6     V7     V8     V9    V10 
##      0      0      0      0      0      0      0      0      0      0      0 
##    V11    V12    V13    V14    V15    V16    V17    V18    V19    V20    V21 
##      0      0      0      0      0      0      0      0      0      0      0 
##    V22    V23    V24    V25    V26    V27    V28 Amount  Class 
##      0      0      0      0      0      0      0      0      0

5. Data visualization

Let’s first visualize the transactions over time and see whether time is an important factor to consider for this classification.

# visualizing the distribution of transactions across time
dataset %>%
  ggplot(aes(x = Time, fill = factor(Class))) + 
  geom_histogram(bins = 100) + 
  labs(x = "Time elapsed since first transaction (seconds)", y = "no. of transactions", title = "Distribution of transactions across time") +
  facet_grid(Class ~ ., scales = 'free_y') + theme()

The distribution of transactions over time looks quite similar for both classes. Since time does not contribute much to fraud detection, we can remove the Time column from the data.

Next, we check the correlations among all the variables, Amount, and Class, and see whether any variables correlate with each other.

# correlation of anonymous variables with amount and class
correlation <- cor(dataset[, -1], method = "pearson")
corrplot(correlation, number.cex = 1, method = "color", type = "full", tl.cex=0.7, tl.col="black")

From the above plot, we can see that most of the features are uncorrelated. In fact, the anonymized variables are nearly independent of one another.
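
As a quick numeric check (a small sketch, not part of the original analysis), we can look at the largest absolute pairwise correlation among the anonymized variables:

# largest absolute off-diagonal correlation among V1-V28
v_cor <- correlation[1:28, 1:28]
max(abs(v_cor[upper.tri(v_cor)]))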

The last visualization is of the transactions using t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE reduces the dimensionality of the data and helps reveal any discoverable patterns. If no patterns were present, it would be difficult to train a model.

# use only 10% of the data to compute t-SNE, with perplexity set to 20
tsne_data <- 1:as.integer(0.1*nrow(dataset))
tsne <- Rtsne(dataset[tsne_data,-c(1, 31)], perplexity = 20, theta = 0.5, pca = F, verbose = F, max_iter = 500, check_duplicates = F)
classes <- as.factor(dataset$Class[tsne_data])
tsne_matrix <- as.data.frame(tsne$Y)
ggplot(tsne_matrix, aes(x = V1, y = V2)) +
  geom_point(aes(color = classes)) +
  theme_minimal() +
  ggtitle("t-SNE visualisation of transactions") +
  scale_color_manual(values = c("#E69F00", "#56B4E9"))

Since most of the fraud transactions lie near the edge of the main blob of data, models should be able to separate fraudulent transactions from legitimate ones.

6. Data Preprocessing

Since all the anonymized variables are already standardized, we also standardize Amount (mean 0, unit variance).

# scale the Amount column using standardization and remove the first column (Time) from the dataset
dataset$Amount <- scale(dataset$Amount)
new_data <- dataset[, -c(1)]
head(new_data)
##            V1          V2        V3         V4          V5          V6
## 1: -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2:  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3: -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4: -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5: -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146
## 6: -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
##             V7          V8         V9         V10        V11         V12
## 1:  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2: -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3:  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4:  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5:  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
## 6:  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
##           V13        V14        V15        V16         V17         V18
## 1: -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
## 2:  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
## 3:  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
## 4:  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
## 5:  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6: -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
##            V19         V20          V21          V22         V23         V24
## 1:  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692807
## 2: -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
## 3: -2.26185710  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
## 4: -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
## 5:  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
## 6: -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
##           V25        V26          V27         V28      Amount Class
## 1:  0.1285394 -0.1891148  0.133558377 -0.02105305  0.24496383     0
## 2:  0.1671704  0.1258945 -0.008983099  0.01472417 -0.34247394     0
## 3: -0.3276418 -0.1390966 -0.055352794 -0.05975184  1.16068389     0
## 4:  0.6473760 -0.2219288  0.062722849  0.06145763  0.14053401     0
## 5: -0.2060096  0.5022922  0.219422230  0.21515315 -0.07340321     0
## 6: -0.2327938  0.1059148  0.253844225  0.08108026 -0.33855582     0
# change 'Class' variable to factor
new_data$Class <- as.factor(new_data$Class)
levels(new_data$Class) <- c("Not Fraud", "Fraud")

7. Data modeling

# split the data into training set and test set
set.seed(101)
split <- sample.split(new_data$Class, SplitRatio = 0.8)
train_data <- subset(new_data, split == TRUE)
test_data <- subset(new_data, split == FALSE)
dim(train_data)
## [1] 227846     30
dim(test_data)
## [1] 56961    30
# visualize the training data
train_data %>% ggplot(aes(x = factor(Class), y = prop.table(stat(count)), fill = factor(Class))) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = 'Class', y = 'Percentage', title = 'Training Class distributions') +
  theme_grey()

Since the data is heavily imbalanced, with over 99% non-fraudulent transactions, a model trained on it directly may perform poorly on the minority class and be heavily biased towards predicting non-fraud. So we resample the data using ROSE (Random Over-Sampling Examples), up-sampling, and down-sampling, and examine the area under the ROC curve for each sampling method.

8. Sampling Techniques

ROSE Sampling

set.seed(9560)
rose_train_data <- ROSE(Class ~ ., data  = train_data)$data 

table(rose_train_data$Class) 
## 
## Not Fraud     Fraud 
##    114081    113765

Up Sampling

set.seed(90)
up_train_data <- upSample(x = train_data[, -30],
                         y = train_data$Class)
table(up_train_data$Class)  
## 
## Not Fraud     Fraud 
##    227452    227452

Down Sampling

set.seed(90)
down_train_data <- downSample(x = train_data[, -30],
                         y = train_data$Class)
table(down_train_data$Class)  
## 
## Not Fraud     Fraud 
##       394       394

In our experiments (not shown here), up-sampling performed slightly better than ROSE and down-sampling. However, we will use down-sampling to keep model training and execution time low. Now we will fit each model and see which one classifies the data best, using the ROC curve and its AUC.
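
A minimal sketch of how one could run that comparison, fitting a quick logistic model on each resampled training set and comparing test-set AUC (exact numbers will vary with the seeds):

# sketch: compare sampling schemes via test-set AUC of a logistic model
sampling_auc <- sapply(
  list(rose = rose_train_data, up = up_train_data, down = down_train_data),
  function(d) {
    fit <- glm(Class ~ ., data = d, family = "binomial")
    pred <- predict(fit, test_data, type = "response")
    roc.curve(test_data$Class, pred, plotit = FALSE)$auc
  }
)
sampling_auc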

9. Logistic Regression

# fitting the logistic model
logistic_model <- glm(Class ~ ., down_train_data, family='binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = down_train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9751  -0.1495   0.0000   0.0000   2.8872  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -1.7853     1.9887  -0.898  0.36932   
## V1            -1.4050     4.9949  -0.281  0.77849   
## V2            38.1657    16.5913   2.300  0.02143 * 
## V3           -23.0958    12.8639  -1.795  0.07259 . 
## V4            16.5600     8.4753   1.954  0.05071 . 
## V5            -6.1526     7.5379  -0.816  0.41437   
## V6           -17.5137     7.6765  -2.281  0.02252 * 
## V7           -64.9911    29.1111  -2.233  0.02558 * 
## V8            11.9842     5.4448   2.201  0.02773 * 
## V9           -21.6678    10.6605  -2.033  0.04210 * 
## V10          -50.0094    24.5796  -2.035  0.04189 * 
## V11           39.0282    19.0878   2.045  0.04089 * 
## V12          -70.7328    34.2875  -2.063  0.03912 * 
## V13           -1.0988     0.6986  -1.573  0.11575   
## V14          -76.2379    36.6618  -2.079  0.03757 * 
## V15           -3.2488     1.1611  -2.798  0.00514 **
## V16          -67.5105    32.6616  -2.067  0.03874 * 
## V17         -119.1490    58.1574  -2.049  0.04049 * 
## V18          -45.1731    22.0164  -2.052  0.04019 * 
## V19           18.3937     8.4246   2.183  0.02901 * 
## V20           -9.4421     5.3421  -1.767  0.07715 . 
## V21            4.2473     3.2870   1.292  0.19630   
## V22            7.4754     3.1069   2.406  0.01612 * 
## V23           19.6558     9.3029   2.113  0.03461 * 
## V24           -2.2600     1.0612  -2.130  0.03320 * 
## V25            9.2787     4.3697   2.123  0.03372 * 
## V26            1.0838     1.2294   0.882  0.37798   
## V27            5.4597     4.9142   1.111  0.26657   
## V28           22.1023    11.4750   1.926  0.05409 . 
## Amount        57.7882    26.8372   2.153  0.03130 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1092.40  on 787  degrees of freedom
## Residual deviance:  147.83  on 758  degrees of freedom
## AIC: 207.83
## 
## Number of Fisher Scoring iterations: 23

Let’s plot the logistic model’s diagnostic plots.

plot(logistic_model)

Plotting the ROC Curve

logistic_predictions <- predict(logistic_model, test_data, type='response')
roc.curve(test_data$Class, logistic_predictions, plotit = TRUE, col = "blue")

## Area under the curve (AUC): 0.964

From the logistic regression, we get an area under the ROC curve of 0.964.
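
AUC summarizes performance across all thresholds; to inspect a single operating point, we could also build a confusion matrix with caret at an arbitrary, untuned 0.5 cutoff. A minimal sketch:

# sketch: confusion matrix for the logistic model at a 0.5 threshold
logistic_class <- factor(ifelse(logistic_predictions > 0.5, "Fraud", "Not Fraud"),
                         levels = levels(test_data$Class))
confusionMatrix(logistic_class, test_data$Class, positive = "Fraud")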

10. Decision Tree Model

decisionTree_model <- rpart(Class ~ . , down_train_data, method = 'class')
predicted_val <- predict(decisionTree_model, down_train_data, type = 'class')
probability <- predict(decisionTree_model, down_train_data, type = 'prob')
rpart.plot(decisionTree_model)

From the decision tree model, we can see that V14 is the most important variable for separating fraud from non-fraud transactions.
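
Note that the predictions above are made on the training data itself. As a sketch, we could also score the held-out test set and read rpart’s built-in importance scores:

# sketch: test-set ROC for the decision tree
tree_prob <- predict(decisionTree_model, test_data, type = 'prob')
roc.curve(test_data$Class, tree_prob[, "Fraud"], plotit = TRUE)

# rpart stores variable importance directly on the fitted object
decisionTree_model$variable.importance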

11. Random Forest Model

x <- down_train_data[, -30]
y <- down_train_data[, 30]

rf_fit <- Rborist(x, y, ntree = 1000, minNode = 20, maxLeaf = 13)


rf_pred <- predict(rf_fit, test_data[,-30], ctgCensus = "prob")
prob <- rf_pred$prob

roc.curve(test_data$Class, prob[,2], plotit = TRUE, col = 'blue')

## Area under the curve (AUC): 0.962

From the random forest model, we get an area under the ROC curve of 0.962.
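
The ranger package was loaded earlier for a faster random forest implementation but was not used above. For reference, an equivalent probability forest with ranger might look like this (a sketch; the hyperparameters are illustrative, not tuned):

# sketch: probability forest with ranger as an alternative to Rborist
ranger_fit <- ranger(Class ~ ., data = down_train_data,
                     num.trees = 1000, probability = TRUE, seed = 101)
ranger_prob <- predict(ranger_fit, test_data)$predictions
roc.curve(test_data$Class, ranger_prob[, "Fraud"], plotit = TRUE)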

12. XGBoost Model

set.seed(40)

# convert class labels from factor to numeric (0 = Not Fraud, 1 = Fraud)
labels <- down_train_data$Class
y <- recode(labels, 'Not Fraud' = 0, "Fraud" = 1)

# xgb fit
xgb_fit <- xgboost(data = data.matrix(down_train_data[,-30]), 
 label = y,
 eta = 0.1,
 gamma = 0.1,
 max_depth = 10, 
 nrounds = 300, 
 objective = "binary:logistic",
 colsample_bytree = 0.6,
 verbose = 0,
 nthread = 7
)
## [12:20:38] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
# XGBoost predictions
xgb_pred <- predict(xgb_fit, data.matrix(test_data[,-30]))
roc.curve(test_data$Class, xgb_pred, plotit = TRUE)

## Area under the curve (AUC): 0.968

From the XGBoost model, we get an area under the ROC curve of 0.968.

13. Significant Variables

We can also check which variables play a significant role in fraud detection. V14 stood out in the decision tree model; let’s compare with the XGBoost model.

names <- dimnames(data.matrix(down_train_data[,-30]))[[2]]

# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = xgb_fit)
# plot the top 10 most important features
xgb.plot.importance(importance_matrix[1:10,])

As we can see, V14 also plays a significant role in distinguishing fraud from non-fraud transactions for XGBoost.

14. Conclusion

From the above plots and models, we can see that XGBoost (AUC 0.968) performed better than logistic regression (0.964) and the random forest (0.962), although the margin was small. We could also fine-tune the XGBoost model to make it perform even better. It is remarkable how well these models find the features that distinguish fraudulent from legitimate transactions in such a large dataset.
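
As a starting point for that fine-tuning, cross-validation with xgb.cv can choose the number of boosting rounds. A minimal sketch, reusing the label vector y from the XGBoost section above (a fuller grid search over eta, max_depth, etc. would still be needed):

# sketch: 5-fold CV with early stopping to pick nrounds
dtrain <- xgb.DMatrix(data.matrix(down_train_data[, -30]), label = y)
cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",
                           eta = 0.1, max_depth = 10,
                           colsample_bytree = 0.6),
             data = dtrain, nrounds = 500, nfold = 5,
             early_stopping_rounds = 20, verbose = 0)
cv$best_iteration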