The objective of this project is to predict which credit card transactions in the dataset are fraudulent, using three classification algorithms and three synthetic balancing techniques. The three classifiers we will train are a decision tree, naive Bayes, and linear discriminant analysis; the three balancing techniques are SMOTE, ADASYN, and DB-SMOTE.
Given that the objective is to evaluate the model performance of the three classifier algorithms and synthetic balancing techniques, we will not be thoroughly reviewing the model output, but rather will be focusing on the classification performance results.
Let's start by loading the R packages that will be used in this project: caret, corrplot, and smotefamily.
#Load the packages used in the project
suppressPackageStartupMessages(c(library(caret),library(corrplot),library(smotefamily)))
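If any of these packages are missing from your machine, a one-time setup along these lines (run once, before the analysis) should install them:

#One-time setup in case the packages are not yet installed
install.packages(c("caret", "corrplot", "smotefamily"))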
Next, using the “read.csv” function, we will import the credit card fraud dataset and set the class to a factor. This dataset is a subset of the dataset sourced from https://www.kaggle.com/mlg-ulb/creditcardfraud, which includes anonymized credit card transactions.
#A. Load the dataset (a proportion of the original data)
creditcardFraud <- read.csv("creditcardFraud.csv")

#B. Change class to factor; the as.factor function encodes the vector as a factor or category
creditcardFraud$class <- as.factor(creditcardFraud$class)
Now that we have loaded the data, we could move straight to training the models, but it is important to first understand and explore the data: doing so helps us identify potential data quality issues and provides the context needed to develop an appropriate model. In this project, we will perform a brief, high-level exploratory data analysis (EDA) of the dataset.
str(creditcardFraud)
'data.frame': 49692 obs. of 31 variables:
$ Time : int 406 472 4462 6986 7519 7526 7535 7543 7551 7610 ...
$ V1 : num -2.31 -3.04 -2.3 -4.4 1.23 ...
$ V2 : num 1.95 -3.16 1.76 1.36 3.02 ...
$ V3 : num -1.61 1.09 -0.36 -2.59 -4.3 ...
$ V4 : num 4 2.29 2.33 2.68 4.73 ...
$ V5 : num -0.522 1.36 -0.822 -1.128 3.624 ...
$ V6 : num -1.4265 -1.0648 -0.0758 -1.7065 -1.3577 ...
$ V7 : num -2.537 0.326 0.562 -3.496 1.713 ...
$ V8 : num 1.3917 -0.0678 -0.3991 -0.2488 -0.4964 ...
$ V9 : num -2.77 -0.271 -0.238 -0.248 -1.283 ...
$ V10 : num -2.772 -0.839 -1.525 -4.802 -2.447 ...
$ V11 : num 3.202 -0.415 2.033 4.896 2.101 ...
$ V12 : num -2.9 -0.503 -6.56 -10.913 -4.61 ...
$ V13 : num -0.5952 0.6765 0.0229 0.1844 1.4644 ...
$ V14 : num -4.29 -1.69 -1.47 -6.77 -6.08 ...
$ V15 : num 0.38972 2.00063 -0.69883 -0.00733 -0.33924 ...
$ V16 : num -1.141 0.667 -2.282 -7.358 2.582 ...
$ V17 : num -2.83 0.6 -4.78 -12.6 6.74 ...
$ V18 : num -0.0168 1.7253 -2.6157 -5.1315 3.0425 ...
$ V19 : num 0.417 0.283 -1.334 0.308 -2.722 ...
$ V20 : num 0.12691 2.10234 -0.43002 -0.17161 0.00906 ...
$ V21 : num 0.517 0.662 -0.294 0.574 -0.379 ...
$ V22 : num -0.035 0.435 -0.932 0.177 -0.704 ...
$ V23 : num -0.465 1.376 0.173 -0.436 -0.657 ...
$ V24 : num 0.3202 -0.2938 -0.0873 -0.0535 -1.6327 ...
$ V25 : num 0.0445 0.2798 -0.1561 0.2524 1.4889 ...
$ V26 : num 0.178 -0.145 -0.543 -0.657 0.567 ...
$ V27 : num 0.2611 -0.2528 0.0396 -0.8271 -0.01 ...
$ V28 : num -0.1433 0.0358 -0.153 0.8496 0.1468 ...
$ Amount: num 0 529 240 59 1 ...
$ class : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
#Check for missing values
sum(is.na(creditcardFraud))
[1] 0
summary(creditcardFraud$class)
no yes
49200 492
#Class proportions: the dataset is highly imbalanced (~1% fraud)
prop.table(table(creditcardFraud$class))
no yes
0.99009901 0.00990099
#Plot histograms of the 30 numeric variables
par(mfrow = c(3,5))
i <- 1
for (i in 1:30) {
  hist((creditcardFraud[,i]), main = paste("Distribution of ",
                                           colnames(creditcardFraud[i])),
       xlab = colnames(creditcardFraud[i]),
       col = "light blue")
}
r <- cor(creditcardFraud[,1:30])
corrplot(r, type = "lower", tl.col = 'black', tl.srt = 15)
It is important that when we evaluate the performance of a model, we do so on a dataset that the model has not previously seen. Therefore, we will split our dataset into a training dataset and a test dataset, and, to maintain the same level of imbalance as in the original dataset, we will use stratified sampling by “class.”
Training Dataset: This is the random subset of your data used to initially fit (or train) your model.
Test Dataset: This is the dataset used to provide an unbiased evaluation of the model fit on the training dataset.
set.seed(1337)
train <- createDataPartition(creditcardFraud$class,
                             p = 0.7, # % of data going to training
                             times = 1,
                             list = F)
train.orig <- creditcardFraud[train,]
test <- creditcardFraud[-train,]
dim(train.orig) / dim(creditcardFraud) ## row ratio: share of rows in the training set; column ratio is 1
[1] 0.7000121 1.0000000
prop.table(table(train.orig$class))
no yes
0.990081932 0.009918068
prop.table(table(test$class))
no yes
0.990138861 0.009861139
Now that we have split our dataset into a training and a test dataset, let's create three new synthetically balanced datasets from the one imbalanced training dataset. To do this, we will use the “smotefamily” R package and try out three different techniques: SMOTE, ADASYN, and DB-SMOTE. Below is a brief description of each:
SMOTE (Synthetic Minority Oversampling Technique): A subset of data is taken from the minority class as an example, and new synthetic but similar examples are generated in the “feature space” rather than the “data space” (a toy sketch of this interpolation follows the list below).
ADASYN (Adaptive Synthetic Sampling): A weighted distribution is used over the minority class instances according to their degree of learning difficulty, so more synthetic observations are generated for the minority instances that are harder to learn.
DB-SMOTE (Density Based SMOTE): This over-samples the minority class along the decision boundary, since instances near the boundary are more likely to be misclassified than those far from it, while maintaining the majority-class detection rate.
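To build intuition for what generating examples in the “feature space” means, here is a minimal, hypothetical sketch of the core SMOTE interpolation step; the smotefamily functions used below handle the neighbour search and sampling for us, so this helper is purely illustrative:

#Toy sketch of SMOTE's interpolation idea (hypothetical helper, not part of smotefamily):
#a synthetic observation is drawn on the line segment between a minority example
#and one of its k nearest minority-class neighbours
synth_point <- function(x, neighbor) {
  gap <- runif(1)            # random position along the segment, in [0, 1]
  x + gap * (neighbor - x)   # interpolated synthetic feature vector
}

#Example with two made-up minority-class feature vectors
synth_point(c(V1 = -2.31, V2 = 1.95), c(V1 = -3.04, V2 = -3.16))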
#SMOTE Balanced
train.smote <- SMOTE(train.orig[,-31], train.orig$class, K = 5)
train.smote <- train.smote$data # extract only the balanced dataset
train.smote$class <- as.factor(train.smote$class)

#ADASYN Balanced
train.adas <- ADAS(train.orig[,-31], train.orig$class, K = 5)
train.adas <- train.adas$data # extract only the balanced dataset
train.adas$class <- as.factor(train.adas$class)

#Density based SMOTE
train.dbsmote <- DBSMOTE(train.orig[,-31], train.orig$class)
train.dbsmote <- train.dbsmote$data # extract only the balanced dataset
train.dbsmote$class <- as.factor(train.dbsmote$class)
#Check the class balance of each synthetically balanced dataset
prop.table(table(train.smote$class))
no yes
0.5020774 0.4979226
prop.table(table(train.adas$class))
no yes
0.4993041 0.5006959
prop.table(table(train.dbsmote$class))
no yes
0.5184483 0.4815517
Now that we have our four training datasets:
the original imbalanced training dataset,
the SMOTE balanced training dataset,
the ADASYN balanced training dataset, and
the DB-SMOTE balanced training dataset,
we will use the ‘caret’ package to train three classifier models (decision tree, naive Bayes, linear discriminant analysis). Let's start by fitting the three classifiers using the original imbalanced training dataset. We will use 10-fold cross-validation across all of our trained models.
#A. Global options that we will use across all of our trained models
ctrl <- trainControl(method = 'cv',
                     number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
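We use plain 10-fold cross-validation here; if repeated cross-validation were preferred, caret's trainControl supports that directly. A sketch, using the hypothetical name ctrl_repeated (not used elsewhere in this project):

#Optional alternative: 10-fold cross-validation repeated 3 times
ctrl_repeated <- trainControl(method = 'repeatedcv',
                              number = 10,
                              repeats = 3,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)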
#B. Decision Tree: original data
dt_orig <- train(class ~ .,
                 data = train.orig,
                 method = "rpart",
                 trControl = ctrl,
                 metric = "ROC")

#C. Naive Bayes: original data
nb_orig <- train(class ~ .,
                 data = train.orig,
                 method = "naive_bayes",
                 trControl = ctrl,
                 metric = "ROC")

#D. Linear Discriminant Analysis: original data
lda_orig <- train(class ~ .,
                  data = train.orig,
                  method = "lda",
                  trControl = ctrl,
                  metric = "ROC")
Next, we will use the models we have trained using the original imbalanced training dataset to generate predictions on the test dataset.
###################################################
#Decision Tree Model - Trained on original dataset#
###################################################

#A. Decision Tree Model predictions
dt_orig_pred <- predict(dt_orig, test, type = "prob")

#B. Decision Tree - Assign class to probabilities
dt_orig_test <- factor(ifelse(dt_orig_pred$yes > 0.5, "yes", "no"))

#C. Decision Tree Save Precision/Recall/F
precision_dtOrig <- posPredValue(dt_orig_test, test$class, positive = "yes")
recall_dtOrig <- sensitivity(dt_orig_test, test$class, positive = "yes")
F1_dtOrig <- (2 * precision_dtOrig * recall_dtOrig) / (recall_dtOrig + precision_dtOrig)
#################################################
#Naive Bayes Model - Trained on original dataset#
#################################################

#A. NB Model predictions
nb_orig_pred <- predict(nb_orig, test, type = "prob")

#B. NB - Assign class to probabilities
nb_orig_test <- factor(ifelse(nb_orig_pred$yes > 0.5, "yes", "no"))

#C. NB Save Precision/Recall/F
precision_nbOrig <- posPredValue(nb_orig_test, test$class, positive = "yes")
recall_nbOrig <- sensitivity(nb_orig_test, test$class, positive = "yes")
F1_nbOrig <- (2 * precision_nbOrig * recall_nbOrig) / (recall_nbOrig + precision_nbOrig)
#########################################
#LDA Model - Trained on original dataset#
#########################################

#A. LDA Model predictions
lda_orig_pred <- predict(lda_orig, test, type = "prob")

#B. LDA - Assign class to probabilities
lda_orig_test <- factor(ifelse(lda_orig_pred$yes > 0.5, "yes", "no"))

#C. LDA Save Precision/Recall/F
precision_ldaOrig <- posPredValue(lda_orig_test, test$class, positive = "yes")
recall_ldaOrig <- sensitivity(lda_orig_test, test$class, positive = "yes")
F1_ldaOrig <- (2 * precision_ldaOrig * recall_ldaOrig) / (recall_ldaOrig + precision_ldaOrig)
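The predict/threshold/score pattern above repeats for all twelve model-dataset combinations, so one could wrap it in a small helper instead of copying the block each time. A sketch, assuming the hypothetical name score_model and the same 0.5 cutoff:

#Hypothetical helper: compute precision, recall, and F1 for any caret model
score_model <- function(model, test_data, threshold = 0.5) {
  probs <- predict(model, test_data, type = "prob")
  pred <- factor(ifelse(probs$yes > threshold, "yes", "no"),
                 levels = levels(test_data$class))
  precision <- posPredValue(pred, test_data$class, positive = "yes")
  recall <- sensitivity(pred, test_data$class, positive = "yes")
  c(Precision = precision,
    Recall = recall,
    F1 = (2 * precision * recall) / (precision + recall))
}

#Example usage: score_model(dt_orig, test)

Below we keep the explicit copy-and-paste version so each step stays visible.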
Next, we will train the three classifier models using the SMOTE balanced training dataset. To do so, we can simply copy and paste the code we used to train the models in task 5, create new names for the models, and change the training data from ‘train.orig’ to ‘train.smote’.
#A. Decision Tree: SMOTE data
dt_smote <- train(class ~ .,
                  data = train.smote,
                  method = "rpart",
                  trControl = ctrl,
                  metric = "ROC")

#B. Naive Bayes: SMOTE data
nb_smote <- train(class ~ .,
                  data = train.smote,
                  method = "naive_bayes",
                  trControl = ctrl,
                  metric = "ROC")

#C. Linear Discriminant Analysis: SMOTE data
lda_smote <- train(class ~ .,
                   data = train.smote,
                   method = "lda",
                   trControl = ctrl,
                   metric = "ROC")
Next, we will use the models we have trained using the SMOTE balanced training dataset to generate predictions on the test dataset, and we will compute our three performance measures. To complete this, we can copy the code from the earlier task and change the names of the output and models to reference the models trained using the SMOTE balanced training dataset.
################################################
#Decision Tree Model - Trained on SMOTE dataset#
################################################

#A. Decision Tree Model predictions
dt_smote_pred <- predict(dt_smote, test, type = "prob")

#B. Decision Tree - Assign class to probabilities
dt_smote_test <- factor(ifelse(dt_smote_pred$yes > 0.5, "yes", "no"))

#C. Decision Tree Save Precision/Recall/F
precision_dtsmote <- posPredValue(dt_smote_test, test$class, positive = "yes")
recall_dtsmote <- sensitivity(dt_smote_test, test$class, positive = "yes")
F1_dtsmote <- (2 * precision_dtsmote * recall_dtsmote) / (precision_dtsmote + recall_dtsmote)
##############################################
#Naive Bayes Model - Trained on SMOTE dataset#
##############################################

#A. NB Model predictions
nb_smote_pred <- predict(nb_smote, test, type = "prob")

#B. NB - Assign class to probabilities
nb_smote_test <- factor(ifelse(nb_smote_pred$yes > 0.5, "yes", "no"))

#C. NB Save Precision/Recall/F
precision_nbsmote <- posPredValue(nb_smote_test, test$class, positive = "yes")
recall_nbsmote <- sensitivity(nb_smote_test, test$class, positive = "yes")
F1_nbsmote <- (2 * precision_nbsmote * recall_nbsmote) / (precision_nbsmote + recall_nbsmote)
######################################
#LDA Model - Trained on SMOTE dataset#
######################################

#A. LDA Model predictions
lda_smote_pred <- predict(lda_smote, test, type = "prob")

#B. LDA - Assign class to probabilities
lda_smote_test <- factor(ifelse(lda_smote_pred$yes > 0.5, "yes", "no"))

#C. LDA Save Precision/Recall/F
precision_ldasmote <- posPredValue(lda_smote_test, test$class, positive = "yes")
recall_ldasmote <- sensitivity(lda_smote_test, test$class, positive = "yes")
F1_ldasmote <- (2 * precision_ldasmote * recall_ldasmote) / (precision_ldasmote + recall_ldasmote)
In task 7, we will train the three classifier models using the ADASYN balanced training dataset. Again, we can simply copy and paste the code we used to train the models in task 6, create new names for the models, and change the training data to ‘train.adas’.
#A. Decision Tree: ADASYN data
dt_adas <- train(class ~ .,
                 data = train.adas,
                 method = 'rpart',
                 metric = "ROC",
                 trControl = ctrl)

#B. Naive Bayes: ADASYN data
nb_adas <- train(class ~ .,
                 data = train.adas,
                 method = "naive_bayes",
                 metric = "ROC",
                 trControl = ctrl)

#C. Linear Discriminant Analysis: ADASYN data
lda_adas <- train(class ~ .,
                  data = train.adas,
                  method = 'lda',
                  metric = "ROC",
                  trControl = ctrl)
Next, we will use the models we have trained using the ADASYN balanced training dataset to generate predictions on the test dataset, and we will compute our three performance measures. To complete this, we can copy the code from the earlier task and change the names of the output and models to reference the models trained using the ADASYN balanced training dataset.
#################################################
#Decision Tree Model - Trained on ADASYN dataset#
#################################################

#A. Decision Tree Model predictions
dt_adas_pred <- predict(dt_adas, test, type = "prob")

#B. Decision Tree - Assign class to probabilities
dt_adas_test <- factor(ifelse(dt_adas_pred$yes > 0.50, "yes", "no"))

#C. Decision Tree Save Precision/Recall/F
precision_dtadas <- posPredValue(dt_adas_test, test$class, positive = "yes")
recall_dtadas <- sensitivity(dt_adas_test, test$class, positive = "yes")
F1_dtadas <- (2 * precision_dtadas * recall_dtadas) / (precision_dtadas + recall_dtadas)
###############################################
#Naive Bayes Model - Trained on ADASYN dataset#
###############################################

#A. NB Model predictions
nb_adas_pred <- predict(nb_adas, test, type = "prob")

#B. NB - Assign class to probabilities
nb_adas_test <- factor(ifelse(nb_adas_pred$yes > 0.50, "yes", "no"))

#C. NB Save Precision/Recall/F
precision_nbadas <- posPredValue(nb_adas_test, test$class, positive = "yes")
recall_nbadas <- sensitivity(nb_adas_test, test$class, positive = "yes")
F1_nbadas <- (2 * precision_nbadas * recall_nbadas) / (precision_nbadas + recall_nbadas)
#######################################
#LDA Model - Trained on ADASYN dataset#
#######################################

#A. LDA Model predictions
lda_adas_pred <- predict(lda_adas, test, type = "prob")

#B. LDA - Assign class to probabilities
lda_adas_test <- factor(ifelse(lda_adas_pred$yes > 0.50, "yes", "no"))

#C. LDA Save Precision/Recall/F
precision_ldaadas <- posPredValue(lda_adas_test, test$class, positive = "yes")
recall_ldaadas <- sensitivity(lda_adas_test, test$class, positive = "yes")
F1_ldaadas <- (2 * precision_ldaadas * recall_ldaadas) / (precision_ldaadas + recall_ldaadas)
In task 8, we will train the three classifier models using the DB-SMOTE balanced training dataset. To train the models, we can simply copy and paste the code we used to train the models in task 7, create new names for the models, and change the training data to ‘train.dbsmote’.
#A. Decision Tree: dbsmote data
dt_dbsmote <- train(class ~ .,
                    data = train.dbsmote,
                    method = "rpart",
                    trControl = ctrl,
                    metric = "ROC")

#B. Naive Bayes: dbsmote data
nb_dbsmote <- train(class ~ .,
                    data = train.dbsmote,
                    method = "naive_bayes",
                    trControl = ctrl,
                    metric = "ROC")

#C. Linear Discriminant Analysis: dbsmote data
lda_dbsmote <- train(class ~ .,
                     data = train.dbsmote,
                     method = "lda",
                     trControl = ctrl,
                     metric = "ROC")
Next, we will use the models we have trained using the DB-SMOTE balanced training dataset to generate predictions on the test dataset, and we will compute our three performance measures. To complete this, we can copy the code from the earlier task and change the names of the output and models to reference the models trained using the DB-SMOTE balanced training dataset.
###################################################
#Decision Tree Model - Trained on DB SMOTE dataset#
###################################################

#A. Decision Tree Model predictions
dt_dbsmote_pred <- predict(dt_dbsmote, test, type = "prob")

#B. Decision Tree - Assign class to probabilities
dt_dbsmote_test <- factor(ifelse(dt_dbsmote_pred$yes > 0.50, "yes", "no"))

#C. Decision Tree Save Precision/Recall/F
precision_dtdbsmote <- posPredValue(dt_dbsmote_test, test$class, positive = "yes")
recall_dtdbsmote <- sensitivity(dt_dbsmote_test, test$class, positive = "yes")
F1_dtdbsmote <- (2 * precision_dtdbsmote * recall_dtdbsmote) / (precision_dtdbsmote + recall_dtdbsmote)
#################################################
#Naive Bayes Model - Trained on DB SMOTE dataset#
#################################################

#A. NB Model predictions
nb_dbsmote_pred <- predict(nb_dbsmote, test, type = "prob")

#B. NB - Assign class to probabilities
nb_dbsmote_test <- factor(ifelse(nb_dbsmote_pred$yes > 0.50, "yes", "no"))

#C. NB Save Precision/Recall/F
precision_nbdbsmote <- posPredValue(nb_dbsmote_test, test$class, positive = "yes")
recall_nbdbsmote <- sensitivity(nb_dbsmote_test, test$class, positive = "yes")
F1_nbdbsmote <- (2 * precision_nbdbsmote * recall_nbdbsmote) / (precision_nbdbsmote + recall_nbdbsmote)
#########################################
#LDA Model - Trained on DB SMOTE dataset#
#########################################

#A. LDA Model predictions
lda_dbsmote_pred <- predict(lda_dbsmote, test, type = "prob")

#B. LDA - Assign class to probabilities
lda_dbsmote_test <- factor(ifelse(lda_dbsmote_pred$yes > 0.50, "yes", "no"))

#C. LDA Save Precision/Recall/F
precision_ldadbsmote <- posPredValue(lda_dbsmote_test, test$class, positive = "yes")
recall_ldadbsmote <- sensitivity(lda_dbsmote_test, test$class, positive = "yes")
F1_ldadbsmote <- (2 * precision_ldadbsmote * recall_ldadbsmote) / (precision_ldadbsmote + recall_ldadbsmote)
Finally, we will compare the recall, precision, and F1 performance measures for each of the three models trained on each of the four training datasets.
Recall is the most important performance measure for the fraud problem: it measures how complete our results are, and a higher recall indicates that the model captures more of the fraudulent transactions. A quick worked example with toy numbers follows.
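As a worked example with made-up counts (not results from this project): suppose a model flags 120 test transactions as fraud, of which 90 are truly fraudulent (TP = 90, FP = 30), and it misses 58 frauds (FN = 58).

#Toy numbers only, to illustrate the three measures
TP <- 90; FP <- 30; FN <- 58
precision <- TP / (TP + FP)                             # 0.75
recall <- TP / (TP + FN)                                # ~0.61
F1 <- (2 * precision * recall) / (precision + recall)   # ~0.67
c(precision = precision, recall = recall, F1 = F1)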
#Let's reset the chart settings so we see one chart at a time
par(mfrow = c(1,1))

#Compare the Recall of the models: TP / (TP + FN). To do that, we'll need to combine our results into a dataframe
model_compare_recall <- data.frame(Model = c('DT-Orig',
                                             'NB-Orig',
                                             'LDA-Orig',
                                             'DT-SMOTE',
                                             'NB-SMOTE',
                                             'LDA-SMOTE',
                                             'DT-ADASYN',
                                             'NB-ADASYN',
                                             'LDA-ADASYN',
                                             'DT-DBSMOTE',
                                             'NB-DBSMOTE',
                                             'LDA-DBSMOTE'),
                                   Recall = c(recall_dtOrig,
                                              recall_nbOrig,
                                              recall_ldaOrig,
                                              recall_dtsmote,
                                              recall_nbsmote,
                                              recall_ldasmote,
                                              recall_dtadas,
                                              recall_nbadas,
                                              recall_ldaadas,
                                              recall_dtdbsmote,
                                              recall_nbdbsmote,
                                              recall_ldadbsmote))

ggplot(aes(x = reorder(Model, -Recall), y = Recall), data = model_compare_recall) +
  geom_bar(stat = 'identity', fill = 'light blue') +
  ggtitle('Comparative Recall of Models on Test Data') +
  xlab('Models') +
  ylab('Recall Measure') +
  geom_text(aes(label = round(Recall, 2))) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 40))
#Compare the Precision of the models: TP / (TP + FP)
model_compare_precision <- data.frame(Model = c('DT-Orig',
                                                'NB-Orig',
                                                'LDA-Orig',
                                                'DT-SMOTE',
                                                'NB-SMOTE',
                                                'LDA-SMOTE',
                                                'DT-ADASYN',
                                                'NB-ADASYN',
                                                'LDA-ADASYN',
                                                'DT-DBSMOTE',
                                                'NB-DBSMOTE',
                                                'LDA-DBSMOTE'),
                                      Precision = c(precision_dtOrig,
                                                    precision_nbOrig,
                                                    precision_ldaOrig,
                                                    precision_dtsmote,
                                                    precision_nbsmote,
                                                    precision_ldasmote,
                                                    precision_dtadas,
                                                    precision_nbadas,
                                                    precision_ldaadas,
                                                    precision_dtdbsmote,
                                                    precision_nbdbsmote,
                                                    precision_ldadbsmote))

ggplot(aes(x = reorder(Model, -Precision), y = Precision), data = model_compare_precision) +
  geom_bar(stat = 'identity', fill = 'light green') +
  ggtitle('Comparative Precision of Models on Test Data') +
  xlab('Models') +
  ylab('Precision Measure') +
  geom_text(aes(label = round(Precision, 2))) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 40))
#Compare the F1 of the models: 2*((Precision*Recall) / (Precision + Recall))
model_compare_f1 <- data.frame(Model = c('DT-Orig',
                                         'NB-Orig',
                                         'LDA-Orig',
                                         'DT-SMOTE',
                                         'NB-SMOTE',
                                         'LDA-SMOTE',
                                         'DT-ADASYN',
                                         'NB-ADASYN',
                                         'LDA-ADASYN',
                                         'DT-DBSMOTE',
                                         'NB-DBSMOTE',
                                         'LDA-DBSMOTE'),
                               F1 = c(F1_dtOrig,
                                      F1_nbOrig,
                                      F1_ldaOrig,
                                      F1_dtsmote,
                                      F1_nbsmote,
                                      F1_ldasmote,
                                      F1_dtadas,
                                      F1_nbadas,
                                      F1_ldaadas,
                                      F1_dtdbsmote,
                                      F1_nbdbsmote,
                                      F1_ldadbsmote))

ggplot(aes(x = reorder(Model, -F1), y = F1), data = model_compare_f1) +
  geom_bar(stat = 'identity', fill = 'light grey') +
  ggtitle('Comparative F1 of Models on Test Data') +
  xlab('Models') +
  ylab('F1 Measure') +
  geom_text(aes(label = round(F1, 2))) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 40))