Introduction

The datasets contains transactions made by credit cards.

Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’.

Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset.

The feature ‘Amount’ is the transaction Amount, this feature can be used for example-dependant cost-senstive learning.

Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Read data set

rm(list=ls())
library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

setwd("E:/Data")
df <- read_csv("creditcard.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Time = col_integer(),
##   Class = col_integer()
## )

## See spec(...) for full column specifications.

## Warning: 1 parsing failure.
##    row  col               expected actual
## 153759 Time no trailing characters   e+05

Check NA observations in the dataset

sapply(df,function(x) sum(is.na(x)))

##   Time     V1     V2     V3     V4     V5     V6     V7     V8     V9 
##      1      0      0      0      0      0      0      0      0      0 
##    V10    V11    V12    V13    V14    V15    V16    V17    V18    V19 
##      0      0      0      0      0      0      0      0      0      0 
##    V20    V21    V22    V23    V24    V25    V26    V27    V28 Amount 
##      0      0      0      0      0      0      0      0      0      0 
##  Class 
##      0

More information about the data set

str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    284807 obs. of  31 variables:
##  $ Time  : int  0 0 1 1 2 2 4 7 7 9 ...
##  $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
##  $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
##  $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
##  $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
##  $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
##  $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
##  $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
##  $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
##  $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
##  $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
##  $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
##  $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
##  $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
##  $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
##  $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
##  $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
##  $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
##  $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
##  $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
##  $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
##  $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
##  $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
##  $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
##  $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
##  $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
##  $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
##  $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
##  $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
##  $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
##  $ Class : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of  4 variables:
##   ..$ row     : int 153759
##   ..$ col     : chr "Time"
##   ..$ expected: chr "no trailing characters"
##   ..$ actual  : chr "e+05"
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 31
##   .. ..$ Time  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ V1    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V2    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V3    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V4    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V5    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V6    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V7    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V8    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V9    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V10   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V11   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V12   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V13   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V14   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V15   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V16   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V17   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V18   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V19   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V20   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V21   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V22   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V23   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V24   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V25   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V26   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V27   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ V28   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Amount: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Class : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

As you can see, the data set contains of 31 variables of 284807 observations, whereas Class is the response variale.Since Class has the “list” type, I now convert it into factor.Also, since the Time varablie is not neccesary, it will delete it.

Convert the variable named “Class” into factor

df$Class <-factor(df$Class)
database <- select(df,-Time)

Check severity of imbalance in the dataset

Now, I will check the severity of imbalance in this data set:

table(database$Class)

## 
##      0      1 
## 284315    492

From above results, it can be seen that this dataset is a serverly imbalanced data.

IGNORING PROBLEM:

I will ignore the problem of imbalanced data set this in part and use logistic and tree decision method.

First, I split data in to training set (70%) and test set (30%)

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.3.3

library(ROSE)

## Warning: package 'ROSE' was built under R version 3.3.3

## Loaded ROSE 0.0-3

set.seed(1234)
indxTrain <-createDataPartition(y=database$Class,p=0.7,list=FALSE)
training <-database [indxTrain,]
testing <-database[-indxTrain,]

Then I calculate the baseline accuracy:

prop.table(table(database$Class))*100 # 99.827% baseline accuracy

## 
##          0          1 
## 99.8272514  0.1727486

So, the baseline accuracy is 99.83% which means that if we simply predict all the transaction as not-fraud-transaction, the accuracy is 99.83%.

Logistic regression:

logistic <- glm(Class ~., data = training,family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

problo <- predict(logistic, newdata=testing, type = "response")
threshold <- 0.5
pred  <- factor( ifelse(problo> threshold, "1", "0") )
confusionMatrix(pred, testing$Class,positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 85281    51
##          1    13    96
##                                          
##                Accuracy : 0.9993         
##                  95% CI : (0.999, 0.9994)
##     No Information Rate : 0.9983         
##     P-Value [Acc > NIR] : 9.779e-15      
##                                          
##                   Kappa : 0.7496         
##  Mcnemar's Test P-Value : 3.746e-06      
##                                          
##             Sensitivity : 0.653061       
##             Specificity : 0.999848       
##          Pos Pred Value : 0.880734       
##          Neg Pred Value : 0.999402       
##              Prevalence : 0.001720       
##          Detection Rate : 0.001124       
##    Detection Prevalence : 0.001276       
##       Balanced Accuracy : 0.826454       
##                                          
##        'Positive' Class : 1              
##

On the first glance, the logistic regression has high accuracy with 99.93%. However, it is worth to mention that the baseline accuracy is 99.83% if we simply predict all the transaction as not-fraud-transaction. In details, the model has high specificity (99.98%) which means that high number of non-fraud transactions is predicted to be non-fraud transactions. However, with sensitivity of 65.30%, it means that the model only detects 65.30% of total fraud transaction.

Tree algorithm

set.seed(1234)
model_rf<-rpart(Class~.,data=training,method="class")
pred1 <-predict(model_rf,testing,type="class")
cm_original <- confusionMatrix(pred1,testing$Class,positive="1")
cm_original

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 85275    29
##          1    19   118
##                                           
##                Accuracy : 0.9994          
##                  95% CI : (0.9993, 0.9996)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8307          
##  Mcnemar's Test P-Value : 0.1939          
##                                           
##             Sensitivity : 0.802721        
##             Specificity : 0.999777        
##          Pos Pred Value : 0.861314        
##          Neg Pred Value : 0.999660        
##              Prevalence : 0.001720        
##          Detection Rate : 0.001381        
##    Detection Prevalence : 0.001603        
##       Balanced Accuracy : 0.901249        
##                                           
##        'Positive' Class : 1               
##

The tree algorith is has higher accuracy and specificity than logistic model, but it stil has the problem with detection of fraud transaction ( 80.27%) only.

PROCESS IMBALANCED DATA.

From this part, we will make the data set balance with over sampling, under sampling, rose method.

Over sampling method:

This method replicates the observation from minority class to balance the data. It can lead to no information loss but it may end up with overfitting by simply adding replicated observation to the dataset.

data_balanced_over <- ovun.sample(Class~ ., data = training, method = "over",N = 398042)$data
table(data_balanced_over$Class)

## 
##      0      1 
## 199021 199021

In this method, N represents the number of observations in balanced dataset. Originally, we have 199021 negative observations (no-fraud transactions.Thus, I over sample the minority class until it reachs 199021 and the total data set will include 398042 obs.

Next, I will use the tree decision method and then calculate the accuracy:

model_rf_over <- rpart(Class ~ ., data = data_balanced_over)
pred.tree.over <- predict(model_rf_over, newdata = testing,type="class")
cm_over <-confusionMatrix(pred.tree.over,testing$Class, positive="1")

Under sampling method:

This method reduces the number of observation from majority class to make the data set balance. It is better use when the data set is huge and reducing the number of training sample reduce the run time. However, we may loose important information in the majority class.

data_balanced_under <- ovun.sample(Class ~ ., data = training, method = "under", N = 690, seed = 1)$data
table(data_balanced_under$Class)

## 
##   0   1 
## 345 345

In this method, N represents the number of observations in balanced dataset. Originally, we have 345 positive observations (fraud transactions).Thus, I under sample the majority class to 345 observaton and the total data set will include 690 obs.

model_rf_under  <- rpart(Class~., data=data_balanced_under)
pred.tree.under<-predict(model_rf_under,newdata=testing,type="class")
roc.curve(testing$Class,pred.tree.under,plotit=F)

## Area under the curve (AUC): 0.937

cm_under <- confusionMatrix(pred.tree.under,testing$Class,positive="1")

ROSE method:

In simple words, it is hybrid methods that combine under-sampling and up-sampling. It overcome imbalances by generates artificial data.

This can be achieved using rose function in the ROSE package

 data_rose <- ROSE(Class ~ ., data = training, seed = 1)$data
table(data_rose$Class)

## 
##     0     1 
## 99527 99839

It creates the data set with the same amount of observation as the training set ( 199366 observations). Next, I will create its cunfusion matrix.

model_rf_rose<- rpart(Class~., data=data_rose)
pred.tree.rose<-predict(model_rf_rose,newdata=testing,type="class")
cm_rose <- confusionMatrix(pred.tree.rose,testing$Class,positive="1")

COMPARISION:

In this part, I only compare the tree decision models with original data set, under sampling, over sampling and rose method.

models <- list(original = model_rf,
               under = model_rf_under,
               over = model_rf_over,
               rose = model_rf_rose)

library(dplyr)
comparison <- data.frame(model = names(models),
                         Sensitivity = rep(NA, length(models)),
                         Specificity = rep(NA, length(models)),
                         Accuracy = rep(NA, length(models)),
                         Pospred = rep(NA, length(models)),
                         Negpred = rep(NA, length(models)))

for (name in names(models)) {
  model <- get(paste0("cm_", name))
  
  comparison[comparison$model == name, ] <- filter(comparison, model == name) %>%
    mutate(Sensitivity = model$byClass["Sensitivity"],
           Specificity = model$byClass["Specificity"],
           Accuracy = model$overall["Accuracy"],
           Pospred = model$byClass["Pos Pred Value"],
           Negpred = model$byClass["Neg Pred Value"])
}

library(tidyr)
comparison %>%
  gather(x, y, Sensitivity:Negpred) %>%
  ggplot(aes(x = x, y = y, color = model)) +
    geom_jitter(width = 0.2, alpha = 0.5, size = 3)

As expected, we achieve higher sensitivity than the original tree decision. It means that we can detect around 90% of fraud transations. However, it comes with the price when accuracy and specificity decrease (the original accuracy and speficity are all above 99.90%).The main disadvange is that the positive predicted value are less than 5 %, much lower than over 86% in case of original data. It means that given the result of the test is positive, the probability of the reference result is also positive is lower than 5%.

Given the fact that selection of a model should match a certain goal, if the goal is to correctly identify fraudulent transactions even in price of low positive predicted value, then the over sampling, under sampling, rose model can be used over the original one.

CFD

Vu Huong

10 July 2017