Sambhav Shrestha
In this project, I analyze credit card transaction data, available at https://www.kaggle.com/mlg-ulb/creditcardfraud. The goal is to build a model that can detect fraudulent transactions. I will use four machine learning models (Logistic Regression, Decision Tree, Random Forest, and XGBoost) and compare their performance using the sensitivity vs. specificity curve, also called the Receiver Operating Characteristic (ROC) curve. We begin by importing the necessary libraries and loading the data.
# importing the required libraries
library(dplyr) # for data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ranger) # for faster implementation of random forests
library(caret) # for classification and regression training
## Loading required package: lattice
## Loading required package: ggplot2
library(caTools) # for splitting data into training and test set
library(data.table) # for converting data frame to table for faster execution
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(ggplot2) # for basic plot
library(corrplot) # for plotting the correlation matrix of the variables
## corrplot 0.84 loaded
library(Rtsne) # for plotting tsne model
library(ROSE) # for rose sampling
## Loaded ROSE 0.0-3
library(pROC) # for plotting ROC curve
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart) # for classification and regression trees
library(rpart.plot) # for plotting decision tree
library(Rborist) # for random forest model
## Rborist 0.2-3
## Type RboristNews() to see new features/changes/bug fixes.
library(xgboost) # for xgboost model
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
We will convert the data frame to a data.table, which is much faster for analyzing large data sets.
# importing the dataset
dataset <- setDT(read.csv("data/creditcard.csv"))
Let’s explore the dataset to see whether anything stands out, and then preprocess it for building our machine learning models.
# exploring the credit card data
head(dataset)
## Time V1 V2 V3 V4 V5 V6
## 1: 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2: 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3: 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4: 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5: 2 -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146
## 6: 2 -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 1: 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2: -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3: 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4: 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5: 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 6: 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 1: -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 2: 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 3: 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 4: 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 5: 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6: -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23 V24
## 1: 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391 0.06692807
## 2: -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802 -0.33984648
## 3: -2.26185710 0.52497973 0.247998153 0.771679402 0.90941226 -0.68928096
## 4: -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052 -1.17557533
## 5: 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808 0.14126698
## 6: -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
## V25 V26 V27 V28 Amount Class
## 1: 0.1285394 -0.1891148 0.133558377 -0.02105305 149.62 0
## 2: 0.1671704 0.1258945 -0.008983099 0.01472417 2.69 0
## 3: -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 4: 0.6473760 -0.2219288 0.062722849 0.06145763 123.50 0
## 5: -0.2060096 0.5022922 0.219422230 0.21515315 69.99 0
## 6: -0.2327938 0.1059148 0.253844225 0.08108026 3.67 0
tail(dataset)
## Time V1 V2 V3 V4 V5 V6
## 1: 172785 0.1203164 0.93100513 -0.5460121 -0.7450968 1.13031398 -0.2359732
## 2: 172786 -11.8811179 10.07178497 -9.8347835 -2.0666557 -5.36447278 -2.6068373
## 3: 172787 -0.7327887 -0.05508049 2.0350297 -0.7385886 0.86822940 1.0584153
## 4: 172788 1.9195650 -0.30125385 -3.2496398 -0.5578281 2.63051512 3.0312601
## 5: 172788 -0.2404400 0.53048251 0.7025102 0.6897992 -0.37796113 0.6237077
## 6: 172792 -0.5334125 -0.18973334 0.7033374 -0.5062712 -0.01254568 -0.6496167
## V7 V8 V9 V10 V11 V12
## 1: 0.8127221 0.1150929 -0.2040635 -0.6574221 0.6448373 0.19091623
## 2: -4.9182154 7.3053340 1.9144283 4.3561704 -1.5931053 2.71194079
## 3: 0.0243297 0.2948687 0.5848000 -0.9759261 -0.1501888 0.91580191
## 4: -0.2968265 0.7084172 0.4324540 -0.4847818 0.4116137 0.06311886
## 5: -0.6861800 0.6791455 0.3920867 -0.3991257 -1.9338488 -0.96288614
## 6: 1.5770063 -0.4146504 0.4861795 -0.9154266 -1.0404583 -0.03151305
## V13 V14 V15 V16 V17 V18
## 1: -0.5463289 -0.73170658 -0.80803553 0.5996281 0.07044075 0.3731103
## 2: -0.6892556 4.62694203 -0.92445871 1.1076406 1.99169111 0.5106323
## 3: 1.2147558 -0.67514296 1.16493091 -0.7117573 -0.02569286 -1.2211789
## 4: -0.1836987 -0.51060184 1.32928351 0.1407160 0.31350179 0.3956525
## 5: -1.0420817 0.44962444 1.96256312 -0.6085771 0.50992846 1.1139806
## 6: -0.1880929 -0.08431647 0.04133346 -0.3026201 -0.66037665 0.1674299
## V19 V20 V21 V22 V23 V24
## 1: 0.1289038 0.0006758329 -0.3142046 -0.8085204 0.05034266 0.102799590
## 2: -0.6829197 1.4758291347 0.2134541 0.1118637 1.01447990 -0.509348453
## 3: -1.5455561 0.0596158999 0.2142053 0.9243836 0.01246304 -1.016225669
## 4: -0.5772518 0.0013959703 0.2320450 0.5782290 -0.03750086 0.640133881
## 5: 2.8978488 0.1274335158 0.2652449 0.8000487 -0.16329794 0.123205244
## 6: -0.2561169 0.3829481049 0.2610573 0.6430784 0.37677701 0.008797379
## V25 V26 V27 V28 Amount Class
## 1: -0.4358701 0.1240789 0.217939865 0.06880333 2.69 0
## 2: 1.4368069 0.2500343 0.943651172 0.82373096 0.77 0
## 3: -0.6066240 -0.3952551 0.068472470 -0.05352739 24.79 0
## 4: 0.2657455 -0.0873706 0.004454772 -0.02656083 67.88 0
## 5: -0.5691589 0.5466685 0.108820735 0.10453282 10.00 0
## 6: -0.4736487 -0.8182671 -0.002415309 0.01364891 217.00 0
# view the table from class column (0 for legit transactions and 1 for fraud)
table(dataset$Class)
##
## 0 1
## 284315 492
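To put the imbalance in numbers, the two counts above can be turned into proportions. This is a minimal sketch in base R; the counts are typed in by hand from the table output, and the variable names are illustrative only.

```r
# Class counts copied from the table(dataset$Class) output above
class_counts <- c(not_fraud = 284315, fraud = 492)

# Share of each class among all transactions, in percent
class_share <- 100 * class_counts / sum(class_counts)
round(class_share, 3)

# Number of legitimate transactions per fraudulent one
imbalance_ratio <- unname(class_counts["not_fraud"] / class_counts["fraud"])
round(imbalance_ratio)
```

Fraud makes up roughly 0.17% of the rows, or about one fraudulent transaction for every 578 legitimate ones, which is why plain accuracy would be a misleading metric here.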
# view names of columns of dataset
names(dataset)
## [1] "Time" "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [9] "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15"
## [17] "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "Amount" "Class"
Looking at the data, we can see that there are 28 anonymous variables (V1 to V28), one Time column, one Amount column, and one label column, Class (0 for not fraud and 1 for fraud). We will visualize this data with histograms and bar plots to look for relationships between the variables.
# view summary of amount and histogram
summary(dataset$Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.60 22.00 88.35 77.17 25691.16
hist(dataset$Amount)
hist(dataset$Amount[dataset$Amount < 100])
# view variance and standard deviation of amount column
var(dataset$Amount)
## [1] 62560.07
sd(dataset$Amount)
## [1] 250.1201
# check whether there are any missing values in columns
colSums(is.na(dataset))
## Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 0 0 0 0 0 0 0 0 0 0 0
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 0 0 0 0 0 0 0 0 0 0 0
## V22 V23 V24 V25 V26 V27 V28 Amount Class
## 0 0 0 0 0 0 0 0 0
Let’s first visualize the transactions over time and see if time is an important factor to be considered for this classification.
# visualizing the distribution of transactions across time
dataset %>%
ggplot(aes(x = Time, fill = factor(Class))) +
geom_histogram(bins = 100) +
labs(x = "Time elapsed since first transaction (seconds)", y = "no. of transactions", title = "Distribution of transactions across time") +
facet_grid(Class ~ ., scales = 'free_y') + theme()
The histograms of transaction counts over time look fairly similar for both classes. Since time does not appear to contribute much to fraud detection, we can remove the Time column from the data.
Next, we check the correlation between all the variables, including Amount and Class, to see whether any of them correlate with each other.
# correlation of anonymous variables with amount and class
correlation <- cor(dataset[, -1], method = "pearson")
corrplot(correlation, number.cex = 1, method = "color", type = "full", tl.cex=0.7, tl.col="black")
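From the above graph, we can see that most of the features are not correlated. In fact, the anonymous variables V1 to V28 are essentially uncorrelated with each other, which is expected because the dataset description states they are the result of a PCA transformation. Note that being uncorrelated is a weaker property than being independent.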
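The near-zero correlations are what one would expect if V1 to V28 are principal components, as the Kaggle description of this dataset states. The sketch below checks that property on synthetic data (not the credit card data): the raw columns are deliberately correlated, while their principal component scores are not. All names here are illustrative.

```r
set.seed(1)

# Synthetic, deliberately correlated columns (not the credit card data)
n <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- -0.5 * x1 + 0.3 * x2 + rnorm(n, sd = 0.7)
raw <- cbind(x1, x2, x3)

# Principal component scores of the same data
scores <- prcomp(raw, center = TRUE, scale. = FALSE)$x

# Largest absolute off-diagonal entry of a correlation matrix
max_offdiag <- function(m) {
  cm <- cor(m)
  max(abs(cm[upper.tri(cm)]))
}

max_offdiag(raw)     # clearly above zero
max_offdiag(scores)  # zero up to floating point error
```

Zero correlation only rules out linear relationships, so it does not by itself prove that the variables are independent.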
The last visualization is a plot of the transactions using t-SNE (t-Distributed Stochastic Neighbor Embedding). This reduces the dimensionality of the data and helps reveal patterns if any are present. If no pattern is present, it would be difficult to train a model.
# use only the first 10% of rows to compute t-SNE, with perplexity set to 20
tsne_data <- 1:as.integer(0.1*nrow(dataset))
tsne <- Rtsne(dataset[tsne_data,-c(1, 31)], perplexity = 20, theta = 0.5, pca = FALSE, verbose = FALSE, max_iter = 500, check_duplicates = FALSE)
classes <- as.factor(dataset$Class[tsne_data])
tsne_matrix <- as.data.frame(tsne$Y)
ggplot(tsne_matrix, aes(x = V1, y = V2)) + geom_point(aes(color = classes)) + theme_minimal() + ggtitle("t-SNE visualisation of transactions") + scale_color_manual(values = c("#E69F00", "#56B4E9"))
Since most of the fraudulent transactions lie near the edge of the blob of data, it should be possible for different models to separate them from legitimate transactions.
Since the anonymous variables are already centered around zero, we also standardize Amount to mean 0 and standard deviation 1.
# standardize Amount (mean 0, sd 1) and remove the first column (Time) from the data set
# as.numeric() drops the matrix attributes that scale() returns
dataset$Amount <- as.numeric(scale(dataset$Amount))
new_data <- dataset[, -c(1)]
head(new_data)
## V1 V2 V3 V4 V5 V6
## 1: -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2: 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3: -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4: -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5: -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146
## 6: -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 1: 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2: -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3: 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4: 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5: 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 6: 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 1: -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 2: 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 3: 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 4: 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 5: 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6: -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23 V24
## 1: 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391 0.06692807
## 2: -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802 -0.33984648
## 3: -2.26185710 0.52497973 0.247998153 0.771679402 0.90941226 -0.68928096
## 4: -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052 -1.17557533
## 5: 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808 0.14126698
## 6: -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
## V25 V26 V27 V28 Amount Class
## 1: 0.1285394 -0.1891148 0.133558377 -0.02105305 0.24496383 0
## 2: 0.1671704 0.1258945 -0.008983099 0.01472417 -0.34247394 0
## 3: -0.3276418 -0.1390966 -0.055352794 -0.05975184 1.16068389 0
## 4: 0.6473760 -0.2219288 0.062722849 0.06145763 0.14053401 0
## 5: -0.2060096 0.5022922 0.219422230 0.21515315 -0.07340321 0
## 6: -0.2327938 0.1059148 0.253844225 0.08108026 -0.33855582 0
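The scaled Amount values above can be reproduced by hand from the mean and standard deviation printed earlier. This is a minimal sketch of the z-score formula, with those two statistics and the first three amounts copied from the outputs (the mean is rounded to two decimals there, so the match is approximate); the vector `v` is a made-up example.

```r
# Statistics and raw amounts copied from the earlier outputs
amount_mean <- 88.35
amount_sd <- 250.1201
amounts <- c(149.62, 2.69, 378.66)

# z-score: subtract the mean, then divide by the standard deviation
z <- (amounts - amount_mean) / amount_sd
round(z, 3)

# scale() applies the same formula using the sample mean and sd of its input
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
manual <- (v - mean(v)) / sd(v)
all.equal(as.numeric(scale(v)), manual)
```

The three z-scores agree with the first three rows of the Amount column in the output above to about four decimal places.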
# change 'Class' variable to factor
new_data$Class <- as.factor(new_data$Class)
levels(new_data$Class) <- c("Not Fraud", "Fraud")
# split the data into training set and test set
set.seed(101)
split <- sample.split(new_data$Class, SplitRatio = 0.8)
train_data <- subset(new_data, split == TRUE)
test_data <- subset(new_data, split == FALSE)
dim(train_data)
## [1] 227846 30
dim(test_data)
## [1] 56961 30
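`sample.split` splits on the label so that both sets keep roughly the same class ratio. The idea can be sketched in base R as follows; this is an illustration of stratified sampling on synthetic labels, not the caTools implementation, and all names are hypothetical.

```r
set.seed(101)

# Synthetic labels with a rare positive class (not the real data)
labels <- factor(c(rep("Not Fraud", 990), rep("Fraud", 10)),
                 levels = c("Not Fraud", "Fraud"))

# Draw 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

# Logical vector marking the training rows
in_train <- seq_along(labels) %in% train_idx

table(labels[in_train])   # class counts in the training set
table(labels[!in_train])  # class counts in the test set
```

Sampling within each class guarantees that the rare class appears in both sets, which a purely random split does not.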
# visualize the training data
train_data %>% ggplot(aes(x = factor(Class), y = prop.table(after_stat(count)), fill = factor(Class))) +
geom_bar(position = "dodge") +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Class', y = 'Percentage', title = 'Training Class distributions') +
theme_grey()
Since the data is heavily imbalanced, with about 99.8% non-fraudulent transactions, our model may perform poorly and be heavily biased towards non-fraudulent transactions. So, we resample the training data using ROSE (Random Over-Sampling Examples), up-sampling, or down-sampling, and examine the area under the ROC curve for each sampling method.
set.seed(9560)
rose_train_data <- ROSE(Class ~ ., data = train_data)$data
table(rose_train_data$Class)
##
## Not Fraud Fraud
## 114081 113765
set.seed(90)
up_train_data <- upSample(x = train_data[, -30],
y = train_data$Class)
table(up_train_data$Class)
##
## Not Fraud Fraud
## 227452 227452
set.seed(90)
down_train_data <- downSample(x = train_data[, -30],
y = train_data$Class)
table(down_train_data$Class)
##
## Not Fraud Fraud
## 394 394
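Down-sampling keeps every row of the rare class and a random subset of the common class of the same size. The sketch below shows that idea in base R on a synthetic data frame; it mimics the concept behind `downSample`, not its exact implementation, and the names are illustrative.

```r
set.seed(90)

# Synthetic imbalanced data frame (illustrative only)
toy <- data.frame(
  x = rnorm(1000),
  Class = factor(c(rep("Not Fraud", 980), rep("Fraud", 20)),
                 levels = c("Not Fraud", "Fraud"))
)

# Size of the smallest class
n_min <- min(table(toy$Class))

# Keep n_min randomly chosen rows from every class
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$Class), function(idx) {
  idx[sample.int(length(idx), n_min)]
}))
down_toy <- toy[sort(keep), ]

table(down_toy$Class)
```

The trade-off is visible in the sizes: the balanced set is tiny compared with the original, so training is fast but most of the majority-class information is discarded.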
In this experiment, up-sampling performed slightly better than ROSE and down-sampling. However, we will use down-sampling to reduce the time for model training and execution. Now we will test each model and see which one classifies the data better using the area under the ROC curve (AUC).
# fitting the logistic model
logistic_model <- glm(Class ~ ., down_train_data, family='binomial')
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = down_train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9751 -0.1495 0.0000 0.0000 2.8872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7853 1.9887 -0.898 0.36932
## V1 -1.4050 4.9949 -0.281 0.77849
## V2 38.1657 16.5913 2.300 0.02143 *
## V3 -23.0958 12.8639 -1.795 0.07259 .
## V4 16.5600 8.4753 1.954 0.05071 .
## V5 -6.1526 7.5379 -0.816 0.41437
## V6 -17.5137 7.6765 -2.281 0.02252 *
## V7 -64.9911 29.1111 -2.233 0.02558 *
## V8 11.9842 5.4448 2.201 0.02773 *
## V9 -21.6678 10.6605 -2.033 0.04210 *
## V10 -50.0094 24.5796 -2.035 0.04189 *
## V11 39.0282 19.0878 2.045 0.04089 *
## V12 -70.7328 34.2875 -2.063 0.03912 *
## V13 -1.0988 0.6986 -1.573 0.11575
## V14 -76.2379 36.6618 -2.079 0.03757 *
## V15 -3.2488 1.1611 -2.798 0.00514 **
## V16 -67.5105 32.6616 -2.067 0.03874 *
## V17 -119.1490 58.1574 -2.049 0.04049 *
## V18 -45.1731 22.0164 -2.052 0.04019 *
## V19 18.3937 8.4246 2.183 0.02901 *
## V20 -9.4421 5.3421 -1.767 0.07715 .
## V21 4.2473 3.2870 1.292 0.19630
## V22 7.4754 3.1069 2.406 0.01612 *
## V23 19.6558 9.3029 2.113 0.03461 *
## V24 -2.2600 1.0612 -2.130 0.03320 *
## V25 9.2787 4.3697 2.123 0.03372 *
## V26 1.0838 1.2294 0.882 0.37798
## V27 5.4597 4.9142 1.111 0.26657
## V28 22.1023 11.4750 1.926 0.05409 .
## Amount 57.7882 26.8372 2.153 0.03130 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1092.40 on 787 degrees of freedom
## Residual deviance: 147.83 on 758 degrees of freedom
## AIC: 207.83
##
## Number of Fisher Scoring iterations: 23
Let’s look at the diagnostic plots of the logistic model.
plot(logistic_model)
logistic_predictions <- predict(logistic_model, test_data, type='response')
roc.curve(test_data$Class, logistic_predictions, plotit = TRUE, col = "blue")
## Area under the curve (AUC): 0.964
The logistic regression model gives an area under the ROC curve of 0.964.
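The AUC has a direct interpretation: it is the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below computes it from ranks (the Mann-Whitney statistic) on made-up labels and scores. It is an illustration of the definition, not necessarily how `roc.curve` computes it, and `auc_rank` is a hypothetical helper.

```r
# AUC as the probability that a random positive outranks a random negative
auc_rank <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # Mann-Whitney U from the ranks of the positive scores (ties get average ranks)
  r <- rank(c(pos, neg))
  u <- sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2
  u / (length(pos) * length(neg))
}

# Toy example (hypothetical labels and scores)
toy_labels <- c(0, 0, 1, 0, 1, 1)
toy_scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)
auc_rank(toy_labels, toy_scores)
```

In the toy example, 6 of the 9 positive-negative pairs are ordered correctly, so the AUC is 6/9. A value of 0.5 corresponds to random ranking and 1 to perfect separation.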
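# fitting the decision tree model
decisionTree_model <- rpart(Class ~ . , down_train_data, method = 'class')
# predicted classes and class probabilities on the training data itself (not the test set)
predicted_val <- predict(decisionTree_model, down_train_data, type = 'class')
probability <- predict(decisionTree_model, down_train_data, type = 'prob')
# plot the fitted tree
rpart.plot(decisionTree_model)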
From the decision tree model, we can see that V14 is the most important variable for separating fraudulent and non-fraudulent transactions.
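A classification tree chooses its splits by how much they reduce class impurity; to my knowledge the Gini index is the default criterion in `rpart` for classification. The sketch below works through that calculation for one candidate split on made-up data. The helper names and the toy feature are assumptions for illustration only.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease achieved by splitting on x < threshold
gini_gain <- function(x, y, threshold) {
  left <- y[x < threshold]
  right <- y[x >= threshold]
  weighted <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  gini(y) - weighted
}

# Toy data (hypothetical): low feature values are mostly fraud
toy_x <- c(-6, -5, -4, -3, -1, 0, 1, 2)
toy_y <- c("Fraud", "Fraud", "Fraud", "Not Fraud",
           "Fraud", "Not Fraud", "Not Fraud", "Not Fraud")

gini(toy_y)                               # impurity before the split
gini_gain(toy_x, toy_y, threshold = -2)   # impurity decrease from the split
```

Here the parent node has impurity 0.5 and both children have impurity 0.375, so the split gains 0.125. The variable at the root of a tree is the one whose best split gives the largest such gain.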
# fitting the random forest model: predictors in x, class labels in y
x <- down_train_data[, -30]
y <- down_train_data[, 30]
rf_fit <- Rborist(x, y, ntree = 1000, minNode = 20, maxLeaf = 13)
rf_pred <- predict(rf_fit, test_data[,-30], ctgCensus = "prob")
prob <- rf_pred$prob
roc.curve(test_data$Class, prob[,2], plotit = TRUE, col = 'blue')
## Area under the curve (AUC): 0.962
The random forest model gives an area under the ROC curve of 0.962.
set.seed(40)
#Convert class labels from factor to numeric
labels <- down_train_data$Class
y <- recode(labels, 'Not Fraud' = 0, "Fraud" = 1)
# xgb fit
xgb_fit <- xgboost(data = data.matrix(down_train_data[,-30]),
label = y,
eta = 0.1,
gamma = 0.1,
max_depth = 10,
nrounds = 300,
objective = "binary:logistic",
colsample_bytree = 0.6,
verbose = 0,
nthread = 7
)
## [12:20:38] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
# XGBoost predictions
xgb_pred <- predict(xgb_fit, data.matrix(test_data[,-30]))
roc.curve(test_data$Class, xgb_pred, plotit = TRUE)
## Area under the curve (AUC): 0.968
The XGBoost model gives an area under the ROC curve of 0.968.
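The ROC curve is traced out by sweeping a decision threshold over the predicted probabilities and recording sensitivity and specificity at each step. The sketch below computes both quantities at two thresholds on made-up labels and probabilities; `confusion_at` is a hypothetical helper, not part of any package used above.

```r
# Hypothetical true labels (1 = fraud) and predicted fraud probabilities
toy_truth <- c(0, 0, 0, 0, 1, 1, 0, 1)
toy_prob  <- c(0.05, 0.20, 0.60, 0.10, 0.90, 0.40, 0.30, 0.75)

# Sensitivity and specificity when predicting fraud for prob >= threshold
confusion_at <- function(truth, prob, threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & truth == 1)  # fraud correctly flagged
  fn <- sum(pred == 0 & truth == 1)  # fraud missed
  tn <- sum(pred == 0 & truth == 0)  # legitimate correctly passed
  fp <- sum(pred == 1 & truth == 0)  # legitimate wrongly flagged
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

confusion_at(toy_truth, toy_prob, 0.5)
confusion_at(toy_truth, toy_prob, 0.25)
```

Lowering the threshold from 0.5 to 0.25 raises sensitivity from 2/3 to 1 but lowers specificity from 0.8 to 0.6. The AUC summarizes this trade-off over all thresholds, while a deployed model still has to commit to one of them.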
We can also check which variables play a significant role in fraud detection. V14 stood out in the decision tree model. Let’s compare that with the XGBoost model.
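# feature names (renamed from 'names' to avoid masking base::names)
feature_names <- dimnames(data.matrix(down_train_data[,-30]))[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(feature_names, model = xgb_fit)
# plot the ten most important features
xgb.plot.importance(importance_matrix[1:10,])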
As we can see, V14 plays a significant role in distinguishing fraudulent from non-fraudulent transactions.
From the above plots and models, we can conclude that XGBoost (AUC 0.968) performed slightly better than logistic regression (0.964) and random forest (0.962), although the margin is small. The decision tree was not scored on the test set, so it is not part of this comparison. We could also fine-tune the XGBoost model to make it perform even better. It is remarkable how well these models find the features that distinguish fraudulent from non-fraudulent transactions in such a large data set.