The dataset contains transactions made by credit cards.
Features V1, V2, …, V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’.
Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset.
The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.
rm(list=ls())
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
setwd("E:/Data")
df <- read_csv("creditcard.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Time = col_integer(),
## Class = col_integer()
## )
## See spec(...) for full column specifications.
## Warning: 1 parsing failure.
## row col expected actual
## 153759 Time no trailing characters e+05
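This single parsing failure happens because readr guessed Time as an integer column, while row 153759 stores its value in scientific notation (e+05), leaving one NA in Time. A minimal fix, if a clean parse is wanted, is to declare the column types explicitly:
df <- read_csv("creditcard.csv",
               col_types = cols(.default = col_double(),
                                Class = col_integer()))
Since Time is dropped later anyway, the single NA is harmless, so I keep the original read below.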
sapply(df, function(x) sum(is.na(x)))  # count missing values in each column
## Time V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 0 0 0 0 0 0 0 0 0
## V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 0 0 0 0 0 0 0 0 0 0
## V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
## 0 0 0 0 0 0 0 0 0 0
## Class
## 0
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 284807 obs. of 31 variables:
## $ Time : int 0 0 1 1 2 2 4 7 7 9 ...
## $ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
## $ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
## $ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
## $ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
## $ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
## $ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
## $ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
## $ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
## $ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
## $ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
## $ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
## $ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
## $ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
## $ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
## $ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
## $ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
## $ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
## $ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
## $ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
## $ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
## $ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
## $ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
## $ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
## $ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
## $ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
## $ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
## $ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
## $ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
## $ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
## $ Class : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 4 variables:
## ..$ row : int 153759
## ..$ col : chr "Time"
## ..$ expected: chr "no trailing characters"
## ..$ actual : chr "e+05"
## - attr(*, "spec")=List of 2
## ..$ cols :List of 31
## .. ..$ Time : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ V1 : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
##   .. ..$ V2 … V28 : list()  [identical collector_double entries, trimmed for brevity]
## .. ..$ Amount: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Class : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
As you can see, the data set contains 284807 observations of 31 variables, where Class is the response variable. Since Class was read in as an integer, I now convert it into a factor. Also, since the Time variable is not necessary for this analysis, I will delete it (which also discards the one NA left by the parsing failure).
df$Class <- factor(df$Class)   # response as a factor for classification
database <- select(df, -Time)  # drop the Time column
Now, I will check the severity of imbalance in this data set:
table(database$Class)
##
## 0 1
## 284315 492
From the above results, it can be seen that this dataset is severely imbalanced: only 492 of 284807 transactions are fraudulent.
In this part, I will ignore the imbalance problem and use logistic regression and a decision tree as-is.
First, I split the data into a training set (70%) and a test set (30%):
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
library(ROSE)
## Warning: package 'ROSE' was built under R version 3.3.3
## Loaded ROSE 0.0-3
set.seed(1234)
indxTrain <- createDataPartition(y = database$Class, p = 0.7, list = FALSE)
training <- database[indxTrain, ]
testing <- database[-indxTrain, ]
Then I calculate the baseline accuracy:
prop.table(table(database$Class))*100 # 99.827% baseline accuracy
##
## 0 1
## 99.8272514 0.1727486
So, the baseline accuracy is 99.83%, which means that if we simply predict every transaction as non-fraud, we are right 99.83% of the time.
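As a sanity check (a small sketch, not part of the original output), the same baseline can be reproduced on the test set by predicting every transaction as non-fraud:
baseline <- factor(rep("0", nrow(testing)), levels = levels(testing$Class))
mean(baseline == testing$Class)  # ~0.9983, matching the class proportions above
Next, I fit a logistic regression on the training set: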
logistic <- glm(Class ~ ., data = training, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
problo <- predict(logistic, newdata = testing, type = "response")
threshold <- 0.5
pred <- factor(ifelse(problo > threshold, "1", "0"))
confusionMatrix(pred, testing$Class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85281 51
## 1 13 96
##
## Accuracy : 0.9993
## 95% CI : (0.999, 0.9994)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : 9.779e-15
##
## Kappa : 0.7496
## Mcnemar's Test P-Value : 3.746e-06
##
## Sensitivity : 0.653061
## Specificity : 0.999848
## Pos Pred Value : 0.880734
## Neg Pred Value : 0.999402
## Prevalence : 0.001720
## Detection Rate : 0.001124
## Detection Prevalence : 0.001276
## Balanced Accuracy : 0.826454
##
## 'Positive' Class : 1
##
At first glance, the logistic regression has high accuracy, 99.93%. However, it is worth mentioning that the baseline accuracy is already 99.83% if we simply predict every transaction as non-fraud. In detail, the model has high specificity (99.98%), which means that almost all non-fraud transactions are correctly predicted as non-fraud. However, with a sensitivity of 65.31%, the model detects only 65.31% of the fraudulent transactions.
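Since problo contains probabilities, the 0.5 cutoff used above is itself a tuning knob. As a sketch (0.1 is an arbitrary illustrative value, not a tuned choice), lowering the threshold flags more transactions as fraud, raising sensitivity at the cost of more false positives:
pred_low <- factor(ifelse(problo > 0.1, "1", "0"), levels = c("0", "1"))
confusionMatrix(pred_low, testing$Class, positive = "1")
Next, I fit a decision tree with rpart: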
set.seed(1234)
model_rf <- rpart(Class ~ ., data = training, method = "class")
pred1 <- predict(model_rf, testing, type = "class")
cm_original <- confusionMatrix(pred1, testing$Class, positive = "1")
cm_original
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85275 29
## 1 19 118
##
## Accuracy : 0.9994
## 95% CI : (0.9993, 0.9996)
## No Information Rate : 0.9983
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8307
## Mcnemar's Test P-Value : 0.1939
##
## Sensitivity : 0.802721
## Specificity : 0.999777
## Pos Pred Value : 0.861314
## Neg Pred Value : 0.999660
## Prevalence : 0.001720
## Detection Rate : 0.001381
## Detection Prevalence : 0.001603
## Balanced Accuracy : 0.901249
##
## 'Positive' Class : 1
##
The decision tree has higher accuracy and sensitivity than the logistic model (its specificity is marginally lower), but it still has a problem detecting fraud: only 80.27% of the fraudulent transactions are caught.
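As an aside, rpart.plot is loaded above but never called; a one-liner such as the following (the display options are illustrative) would visualize the fitted tree:
rpart.plot(model_rf, type = 2, extra = 104)  # splits plus class probabilities and node percentages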
From this part on, we will balance the data set with over-sampling, under-sampling, and the ROSE method.
Over-sampling replicates observations from the minority class to balance the data. It leads to no information loss, but it may end up overfitting, since it simply adds replicated observations to the dataset.
data_balanced_over <- ovun.sample(Class ~ ., data = training, method = "over", N = 398042)$data
table(data_balanced_over$Class)
##
## 0 1
## 199021 199021
In this method, N represents the number of observations in the balanced dataset. Originally, we have 199021 negative observations (non-fraud transactions) in the training set. Thus, I over-sample the minority class until it reaches 199021, so the total data set includes 398042 observations.
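Rather than hard-coding N, it can also be derived from the training data; this sketch is equivalent to the call above:
n_major <- sum(training$Class == "0")   # 199021 non-fraud observations
data_balanced_over <- ovun.sample(Class ~ ., data = training,
                                  method = "over", N = 2 * n_major)$data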
Next, I will use the decision tree method and then calculate the accuracy:
model_rf_over <- rpart(Class ~ ., data = data_balanced_over)
pred.tree.over <- predict(model_rf_over, newdata = testing, type = "class")
cm_over <- confusionMatrix(pred.tree.over, testing$Class, positive = "1")
Under-sampling reduces the number of observations from the majority class to balance the data set. It is better used when the data set is huge and reducing the number of training samples cuts the run time. However, we may lose important information from the majority class.
data_balanced_under <- ovun.sample(Class ~ ., data = training, method = "under", N = 690, seed = 1)$data
table(data_balanced_under$Class)
##
## 0 1
## 345 345
Here, N again represents the number of observations in the balanced dataset. Originally, we have 345 positive observations (fraud transactions) in the training set. Thus, I under-sample the majority class down to 345 observations, so the total data set includes 690 observations.
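As with over-sampling, N can be derived instead of hard-coded; this sketch is equivalent to the call above:
n_minor <- sum(training$Class == "1")   # 345 fraud observations
data_balanced_under <- ovun.sample(Class ~ ., data = training,
                                   method = "under", N = 2 * n_minor,
                                   seed = 1)$data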
model_rf_under <- rpart(Class ~ ., data = data_balanced_under)
pred.tree.under <- predict(model_rf_under, newdata = testing, type = "class")
roc.curve(testing$Class, pred.tree.under, plotit = FALSE)
## Area under the curve (AUC): 0.937
cm_under <- confusionMatrix(pred.tree.under, testing$Class, positive = "1")
In simple words, ROSE is a hybrid method: it combines under-sampling of the majority class with over-sampling of the minority class, and it overcomes the imbalance by generating artificial examples around the existing observations rather than merely replicating or dropping them.
This can be achieved with the ROSE() function in the ROSE package:
data_rose <- ROSE(Class ~ ., data = training, seed = 1)$data
table(data_rose$Class)
##
## 0 1
## 99527 99839
It creates a data set with the same number of observations as the training set (199366 observations), split roughly evenly between the two classes. Next, I will create its confusion matrix.
model_rf_rose <- rpart(Class ~ ., data = data_rose)
pred.tree.rose <- predict(model_rf_rose, newdata = testing, type = "class")
cm_rose <- confusionMatrix(pred.tree.rose, testing$Class, positive = "1")
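Before the tabular comparison, the AUC reported for the under-sampled tree above can be computed for all four models on the same test set, reusing the roc.curve call shown earlier:
roc.curve(testing$Class, pred1, plotit = FALSE)            # original tree
roc.curve(testing$Class, pred.tree.over, plotit = FALSE)   # over-sampling
roc.curve(testing$Class, pred.tree.under, plotit = FALSE)  # under-sampling (AUC 0.937 above)
roc.curve(testing$Class, pred.tree.rose, plotit = FALSE)   # ROSE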
In this part, I compare only the decision tree models: trained on the original data set, and on the under-sampled, over-sampled, and ROSE-balanced versions.
models <- list(original = model_rf,
under = model_rf_under,
over = model_rf_over,
rose = model_rf_rose)
comparison <- data.frame(model = names(models),
                         Sensitivity = rep(NA, length(models)),
                         Specificity = rep(NA, length(models)),
                         Accuracy = rep(NA, length(models)),
                         Pospred = rep(NA, length(models)),
                         Negpred = rep(NA, length(models)))
for (name in names(models)) {
  # fetch the corresponding confusion matrix (cm_original, cm_under, ...);
  # base subsetting is used so the loop variable is not masked by the
  # 'model' column, which broke the original dplyr filter/mutate version
  cm <- get(paste0("cm_", name))
  comparison[comparison$model == name,
             c("Sensitivity", "Specificity", "Accuracy", "Pospred", "Negpred")] <-
    c(cm$byClass["Sensitivity"],
      cm$byClass["Specificity"],
      cm$overall["Accuracy"],
      cm$byClass["Pos Pred Value"],
      cm$byClass["Neg Pred Value"])
}
library(tidyr)
comparison %>%
gather(x, y, Sensitivity:Negpred) %>%
ggplot(aes(x = x, y = y, color = model)) +
geom_jitter(width = 0.2, alpha = 0.5, size = 3)
As expected, the balanced models achieve higher sensitivity than the original decision tree: we can now detect around 90% of the fraudulent transactions. However, this comes at a price: accuracy and specificity decrease (the original accuracy and specificity were both above 99.90%). The main disadvantage is that the positive predicted value falls below 5%, much lower than the 86% achieved with the original data. It means that, given that a transaction is flagged as fraud, the probability that it really is fraud is below 5%.
Given that the choice of model should match a specific goal, if the goal is to correctly identify fraudulent transactions even at the price of a low positive predicted value, then the over-sampling, under-sampling, and ROSE models can be preferred over the original one.
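One way to summarize the sensitivity versus positive-predicted-value trade-off in a single number (an addition to the analysis above, not part of it) is the F1 score, which can be computed directly from the comparison table:
comparison$F1 <- with(comparison,
                      2 * Pospred * Sensitivity / (Pospred + Sensitivity))
comparison[, c("model", "F1")]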