library(JOUSBoost)     # boosting algorithms
library(fastAdaboost)  # fast AdaBoost.M1 implementation
library(tidyverse)     # data manipulation and plotting
library(caret)         # model training, cross-validation, and tuning

In this short example we will analyze a data set of credit card transactions in order to determine which may have been fraudulent. The original data set included only 492 cases of fraud out of 284,807 transactions. To combat this imbalance, 500 non-fraud cases were randomly selected and merged with the 492 positive (fraud) cases to create the balanced data set used in this analysis.

The predictors in this data set include 28 numerical variables that are principal components obtained from PCA. There are two additional predictors: Time, the number of seconds elapsed between the current transaction and the first transaction in the data set, and Amount, the monetary value of the transaction. The feature Class is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
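For reference, the balanced file could be rebuilt from the full Kaggle download with something like the sketch below. The input path (Data/creditcard_full.csv) and the seed are illustrative assumptions, not the exact values used originally.

full <- read.csv("Data/creditcard_full.csv")             # hypothetical path to the full 284,807-row file

set.seed(1)                                               # illustrative seed
fraud    <- full[full$Class == 1, ]                       # all 492 fraud cases
nonfraud <- full[sample(which(full$Class == 0), 500), ]   # 500 randomly chosen non-fraud cases

balanced <- rbind(fraud, nonfraud)
write.csv(balanced, "Data/creditcard.csv", row.names = FALSE)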

The first step is to read in the balanced data set we created and split it into training and testing sets (90% and 10%, respectively).

credit <- read.csv("Data/creditcard.csv")   # the balanced data set described above

set.seed(100)
trainindex <- sample(nrow(credit), nrow(credit) * 0.9)   # indices for the 90% training split
training <- credit[trainindex, ]
test <- credit[-trainindex, ]

Here we are changing the Class variable to a factor and standardizing all predictor variables in the training data set.

training$Class <- as.factor(training$Class)
training[,1:30] <- scale(training[,1:30])

Next we do the same with the test data set.

test$Class <- as.factor(test$Class)
test[,1:30] <- scale(test[,1:30])

We can now focus on building an AdaBoost (adaptive boosting) classifier using the fastAdaboost package. Adaptive boosting works by combining many weak classifiers into a single strong classifier. The weak classifiers are small decision trees; in the simplest case a tree with only one split, also known as a decision stump. All observations in the training set begin with the same weight, but observations that are misclassified are given more weight in later rounds, so each new tree concentrates on the harder cases. The only user-defined parameter we need to consider is nIter, the number of weak classifiers. The algorithm keeps adding weak classifiers to correct the remaining misclassifications until it has created nIter of them or it can perfectly predict the response variable. A single round of the reweighting step is sketched below before we fit a full model.
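This is a minimal, illustrative sketch of one AdaBoost round built around an rpart stump. It mirrors the textbook weight update and is not the internal implementation used by fastAdaboost; the rpart package is assumed here purely for illustration.

library(rpart)   # used only for this illustrative stump

# One round of the classic AdaBoost update, written out by hand.
n <- nrow(training)
w <- rep(1 / n, n)                                     # every observation starts with equal weight

stump <- rpart(Class ~ ., data = training, weights = w,
               control = rpart.control(maxdepth = 1))  # a single-split tree (decision stump)

pred  <- predict(stump, training, type = "class")
miss  <- pred != training$Class
err   <- sum(w * miss)                                 # weighted error of the stump
alpha <- 0.5 * log((1 - err) / err)                    # how much say this stump gets in the final vote

w <- ifelse(miss, w * exp(alpha), w * exp(-alpha))     # upweight the misclassified observations
w <- w / sum(w)                                        # renormalize before the next round

With the mechanics in mind, we can fit a full model with fastAdaboost, starting with nIter = 10.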

set.seed(100)
creditboost <- fastAdaboost::adaboost(Class ~ ., data = training, nIter = 10)
creditboost.predictions <- predict(creditboost, test)
creditboost.Accuracy <- 1 - creditboost.predictions$error   # predict() reports the test error directly

The accuracy of the AdaBoost model from fastAdaboost with nIter = 10 was 90% and is printed below.

## [1] 0.9
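Accuracy alone hides how the errors split between missed fraud and false alarms. The prediction object returned by fastAdaboost also contains the predicted classes in its class element, so caret's confusionMatrix() can break the result down, treating 1 (fraud) as the positive class:

confusionMatrix(creditboost.predictions$class, test$Class, positive = "1")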

We can now use caret's train() function to find the optimal value of nIter for the AdaBoost algorithm. I tested nIter values of 10, 20, 30, …, 100, and each value was evaluated with 10-fold cross-validation repeated 5 times.

set.seed(100)
boostTune <- train(
  y = training$Class,
  x = training[, -31],                 # column 31 is the response, Class
  method = "adaboost",                 # fastAdaboost's AdaBoost.M1 through caret
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(nIter = seq(10, 100, by = 10), method = "M1"),
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5)
)

boostTune$results
# Keep the row of the results table with the highest cross-validated accuracy.
boostTuneoptimal <- boostTune$results[boostTune$results$Accuracy == max(boostTune$results$Accuracy), ]
nIter.optimal <- boostTuneoptimal$nIter
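caret also stores the winning parameter combination directly in the fitted object, so the same answer can be read off without filtering the results table:

boostTune$bestTune   # parameter row with the best cross-validated accuracy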

The output below shows the results for each value of nIter tested, and the row with the optimal value (highest accuracy) is shown at the bottom. For this model the best value of nIter was 50.

##    nIter method  Accuracy     Kappa AccuracySD    KappaSD
## 1     10     M1 0.9264625 0.8530864 0.02581835 0.05150861
## 2     20     M1 0.9316210 0.8634325 0.02478714 0.04946260
## 3     30     M1 0.9354365 0.8710850 0.02533007 0.05050003
## 4     40     M1 0.9338634 0.8679666 0.02550229 0.05083176
## 5     50     M1 0.9354518 0.8711340 0.02579324 0.05142246
## 6     60     M1 0.9347726 0.8697805 0.02631219 0.05245852
## 7     70     M1 0.9343306 0.8689012 0.02670326 0.05323053
## 8     80     M1 0.9352143 0.8706694 0.02558786 0.05099360
## 9     90     M1 0.9347624 0.8697676 0.02573491 0.05129208
## 10   100     M1 0.9347649 0.8697762 0.02533005 0.05048125
##   nIter method  Accuracy    Kappa AccuracySD    KappaSD
## 5    50     M1 0.9354518 0.871134 0.02579324 0.05142246
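The cross-validated accuracies are easier to compare visually; caret provides plot and ggplot methods for train objects that chart accuracy against nIter:

plot(boostTune)      # base-graphics profile of accuracy vs. nIter
ggplot(boostTune)    # the same profile drawn with ggplot2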

We can now train the model with the optimal nIter parameter and fit it to the test data. This should give the best accuracy we can expect from an AdaBoost model tuned over this grid on this data set.

set.seed(100)
creditboost.optimal <- fastAdaboost::adaboost(Class ~ .,data=training, nIter = nIter.optimal)
creditboost.optimal.predictions <- predict(creditboost.optimal,test)
creditboost.optimal.Accuracy <- 1-creditboost.optimal.predictions$error

The accuracy of this model was found to be 94% and is printed below.

## [1] 0.94
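As before, a confusion matrix gives a fuller picture than the single accuracy number, showing how many fraud cases the tuned model misses versus how many legitimate transactions it flags:

# Sensitivity here is the share of actual fraud cases that the model catches.
confusionMatrix(creditboost.optimal.predictions$class, test$Class, positive = "1")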