Credit card fraud costs banks millions of dollars a year in reimbursement, legal fees, and mitigation. Denying fraudulent transactions before they occur would greatly reduce those expenses.
In this paper, I use a publicly available dataset of 6.3 million credit card transactions, each categorized as Fraud or Not Fraud.
The result variable is categorical and binary.
Based on the result variable type, I fit four different machine learning models and determine which one fits the test set best based on the ROC curve and accuracy. False-positives are a major concern, since any transaction flagged as fraudulent will be denied, forcing the customer to call customer service to complete it. I test each model against 100,000 random transactions categorized as Not Fraud to estimate the false-positive rate. I also capture the time to process the 100,000 transactions to estimate the computational intensity of each model.
The results are impressive: nearly every model correctly identified 97%+ of the Fraud transactions.
Let’s see how this works!!
Data Source: https://www.kaggle.com/ntnu-testimon/paysim1
library(plyr)
library(tidyverse)
library(caret)
library(GGally)
library(stringr)
library(rattle)
library(pROC)
library(ROCR)
set.seed(317)
fraud_raw <- read_csv("PS_20174392719_1491204439457_log.csv")
glimpse(fraud_raw)
## Observations: 6,362,620
## Variables: 11
## $ step <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ type <chr> "PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "...
## $ amount <dbl> 9839.64, 1864.28, 181.00, 181.00, 11668.14, 781...
## $ nameOrig <chr> "C1231006815", "C1666544295", "C1305486145", "C...
## $ oldbalanceOrg <dbl> 170136.0, 21249.0, 181.0, 181.0, 41554.0, 53860...
## $ newbalanceOrig <dbl> 160296.36, 19384.72, 0.00, 0.00, 29885.86, 4604...
## $ nameDest <chr> "M1979787155", "M2044282225", "C553264065", "C3...
## $ oldbalanceDest <dbl> 0, 0, 0, 21182, 0, 0, 0, 0, 0, 41898, 10845, 0,...
## $ newbalanceDest <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,...
## $ isFraud <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ isFlaggedFraud <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
The nameOrig and nameDest variables are technically categorical, but there are far too many distinct values for them to be useful in modeling. I noticed that the names have a letter prefix, which may be important.
fraud_df <- fraud_raw %>%
mutate(name_orig_first = str_sub(nameOrig,1,1)) %>%
mutate(name_dest_first = str_sub(nameDest, 1, 1)) %>%
select(-nameOrig, -nameDest)
unique(fraud_df$name_dest_first)
## [1] "M" "C"
There are two prefixes in nameDest so this will be converted to a factor.
fraud_df$name_dest_first <- as.factor(fraud_df$name_dest_first)
table(fraud_df$name_dest_first)
##
## C M
## 4211125 2151495
unique(fraud_df$name_orig_first)
## [1] "C"
There is only a single prefix in nameOrig, so it is not useful and will be removed along with isFlaggedFraud, which looks like someone else’s prediction.
Drop them and rearrange the columns into a more logical order.
fraud_df2 <- fraud_df %>%
select(-name_orig_first, -isFlaggedFraud) %>%
select(isFraud, type, step, everything())
Take another glimpse
glimpse(fraud_df2)
## Observations: 6,362,620
## Variables: 9
## $ isFraud <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ type <chr> "PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", ...
## $ step <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ amount <dbl> 9839.64, 1864.28, 181.00, 181.00, 11668.14, 78...
## $ oldbalanceOrg <dbl> 170136.0, 21249.0, 181.0, 181.0, 41554.0, 5386...
## $ newbalanceOrig <dbl> 160296.36, 19384.72, 0.00, 0.00, 29885.86, 460...
## $ oldbalanceDest <dbl> 0, 0, 0, 21182, 0, 0, 0, 0, 0, 41898, 10845, 0...
## $ newbalanceDest <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ name_dest_first <fctr> M, M, C, C, M, M, M, M, M, C, C, M, M, M, M, ...
The type & isFraud columns are categorical and will be changed to factors
fraud_df2$type <- as.factor(fraud_df2$type)
fraud_df2$isFraud <- as.factor(fraud_df2$isFraud)
Some models we’ll use do not like 1’s and 0’s in the result, so we will recode them.
fraud_df2$isFraud <- recode_factor(fraud_df2$isFraud, `0` = "No", `1` = "Yes")
summary(fraud_df2)
## isFraud type step amount
## No :6354407 CASH_IN :1399284 Min. : 1.0 Min. : 0
## Yes: 8213 CASH_OUT:2237500 1st Qu.:156.0 1st Qu.: 13390
## DEBIT : 41432 Median :239.0 Median : 74872
## PAYMENT :2151495 Mean :243.4 Mean : 179862
## TRANSFER: 532909 3rd Qu.:335.0 3rd Qu.: 208721
## Max. :743.0 Max. :92445517
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 14208 Median : 0 Median : 132706
## Mean : 833883 Mean : 855114 Mean : 1100702
## 3rd Qu.: 107315 3rd Qu.: 144258 3rd Qu.: 943037
## Max. :59585040 Max. :49585040 Max. :356015889
## newbalanceDest name_dest_first
## Min. : 0 C:4211125
## 1st Qu.: 0 M:2151495
## Median : 214661
## Mean : 1224996
## 3rd Qu.: 1111909
## Max. :356179279
There are only 8213 records where isFraud is true.
I’ll take that into account when creating the training and test sets.
My plan is to create training and test datasets that are 50% Fraud / 50% Not Fraud.
fraud_trans <- fraud_df2 %>%
filter(isFraud == "Yes")
summary(fraud_trans)
## isFraud type step amount
## No : 0 CASH_IN : 0 Min. : 1.0 Min. : 0
## Yes:8213 CASH_OUT:4116 1st Qu.:181.0 1st Qu.: 127091
## DEBIT : 0 Median :367.0 Median : 441423
## PAYMENT : 0 Mean :368.4 Mean : 1467967
## TRANSFER:4097 3rd Qu.:558.0 3rd Qu.: 1517771
## Max. :743.0 Max. :10000000
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 125822 1st Qu.: 0 1st Qu.: 0
## Median : 438983 Median : 0 Median : 0
## Mean : 1649668 Mean : 192393 Mean : 544250
## 3rd Qu.: 1517771 3rd Qu.: 0 3rd Qu.: 147829
## Max. :59585040 Max. :49585040 Max. :236230517
## newbalanceDest name_dest_first
## Min. : 0 C:8213
## 1st Qu.: 0 M: 0
## Median : 4676
## Mean : 1279708
## 3rd Qu.: 1058725
## Max. :236726495
Interesting. When the type is CASH_IN, DEBIT, or PAYMENT, there are no fraud cases. This should be taken into consideration when preparing the training and test sets.
In every case of fraud, name_dest_first had a code of “C”, so we can filter the main dataset to eliminate all the “M” destinations.
The fraud amount maxes out at 10,000,000, so we will also filter out any transactions above that amount.
Remove the insignificant variables and keep only CASH_OUT and TRANSFER transactions at or below 10,000,000.
fraud_df3 <- fraud_df2 %>%
filter(type %in% c("CASH_OUT", "TRANSFER")) %>%
filter(name_dest_first == "C") %>%
filter(amount <= 10000000) %>%
select(-name_dest_first)
summary(fraud_df3)
## isFraud type step amount
## No :2759753 CASH_IN : 0 Min. : 1 Min. : 0
## Yes: 8213 CASH_OUT:2237500 1st Qu.:155 1st Qu.: 82900
## DEBIT : 0 Median :236 Median : 171092
## PAYMENT : 0 Mean :242 Mean : 300426
## TRANSFER: 530466 3rd Qu.:332 3rd Qu.: 306252
## Max. :743 Max. :10000000
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 127785
## Median : 311 Median : 0 Median : 554620
## Mean : 47669 Mean : 16106 Mean : 1670726
## 3rd Qu.: 31013 3rd Qu.: 0 3rd Qu.: 1730370
## Max. :59585040 Max. :49585040 Max. :356015889
## newbalanceDest
## Min. : 0
## 1st Qu.: 326982
## Median : 826727
## Mean : 1998711
## 3rd Qu.: 2116020
## Max. :356179279
This brings the full dataset down to about 2.8M records, a reduction of roughly 56%, which should eliminate a lot of noise.
not_fraud <- fraud_df3 %>%
filter(isFraud == "No") %>%
sample_n(8213)
is_fraud <- fraud_df3 %>%
filter(isFraud == "Yes")
full_sample <- rbind(not_fraud, is_fraud) %>%
arrange(step)
Note that step indicates the hour within the month that this data was captured, so these plots should be read as time series.
ggplot(full_sample, aes(x = step, col = isFraud)) +
geom_histogram(bins = 743)
There is a pattern in the Not Fraud cases but no discernible pattern in the Fraud cases from this plot.
ggplot(is_fraud, aes(x = step)) +
geom_histogram(bins = 743)
Nothing.
ggpairs(full_sample)
There is a high positive correlation between oldbalanceOrg and newbalanceOrig, and also between oldbalanceDest and newbalanceDest.
We’ll deal with them systematically further down.
ggplot(full_sample, aes(type, amount, color = isFraud)) +
geom_point(alpha = 0.01) +
geom_jitter()
It’s all over the place. No pattern.
summary(full_sample)
## isFraud type step amount
## No :8213 CASH_IN : 0 Min. : 1.0 Min. : 0
## Yes:8213 CASH_OUT:10736 1st Qu.:161.0 1st Qu.: 99165
## DEBIT : 0 Median :282.0 Median : 232565
## PAYMENT : 0 Mean :304.4 Mean : 881246
## TRANSFER: 5690 3rd Qu.:405.0 3rd Qu.: 634911
## Max. :743.0 Max. :10000000
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 150 1st Qu.: 0 1st Qu.: 0
## Median : 63365 Median : 0 Median : 140649
## Mean : 845927 Mean : 103855 Mean : 1099183
## 3rd Qu.: 465097 3rd Qu.: 0 3rd Qu.: 938262
## Max. :59585040 Max. :49585040 Max. :236230517
## newbalanceDest
## Min. : 0
## 1st Qu.: 0
## Median : 482809
## Mean : 1627235
## 3rd Qu.: 1694654
## Max. :236726495
preproc_model <- preProcess(fraud_df3[, -1],
method = c("center", "scale", "nzv"))
fraud_preproc <- predict(preproc_model, newdata = fraud_df3[, -1])
fraud_pp_w_result <- cbind(isFraud = fraud_df3$isFraud, fraud_preproc)
summary(fraud_pp_w_result)
## isFraud type step amount
## No :2759753 CASH_IN : 0 Min. :-1.7008 Min. :-0.49551
## Yes: 8213 CASH_OUT:2237500 1st Qu.:-0.6138 1st Qu.:-0.35878
## DEBIT : 0 Median :-0.0420 Median :-0.21332
## PAYMENT : 0 Mean : 0.0000 Mean : 0.00000
## TRANSFER: 530466 3rd Qu.: 0.6357 3rd Qu.: 0.00961
## Max. : 3.5369 Max. :15.99817
## oldbalanceOrg oldbalanceDest newbalanceDest
## Min. : -0.18960 Min. :-0.42399 Min. :-0.47994
## 1st Qu.: -0.18960 1st Qu.:-0.39156 1st Qu.:-0.40142
## Median : -0.18836 Median :-0.28324 Median :-0.28142
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: -0.06625 3rd Qu.: 0.01514 3rd Qu.: 0.02817
## Max. :236.80038 Max. :89.92368 Max. :85.04701
Notice that the mean of every numeric field is now zero and the standard deviation is 1. This is the result of centering and scaling, which puts all numeric variables on the same scale.
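The object high_cor_removed used in the next step is not created anywhere above; presumably the highly correlated balance columns noted earlier were dropped here. Below is a minimal sketch of that missing step, assuming caret’s findCorrelation with a 0.75 cutoff (the author’s actual cutoff and approach are unknown).
# Assumed reconstruction of the step that creates `high_cor_removed`:
# flag one member of each highly correlated pair of numeric columns and drop it.
# The 0.75 cutoff is an assumption, not a value documented in the original.
num_cols <- sapply(fraud_pp_w_result, is.numeric)
cor_matrix <- cor(fraud_pp_w_result[, num_cols])
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)
high_cor_removed <- fraud_pp_w_result %>%
select(-one_of(high_cor))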
model_df <- high_cor_removed
is_fraud <- model_df %>%
filter(isFraud == "Yes")
not_fraud <- model_df %>%
filter(isFraud == "No") %>%
sample_n(8213)
# To mix up the sample set I'll arrange by `step`
model_full_sample <- rbind(is_fraud, not_fraud) %>%
arrange(step)
in_train <- createDataPartition(y = model_full_sample$isFraud, p = .75,
list = FALSE)
train <- model_full_sample[in_train, ]
test <- model_full_sample[-in_train, ]
Play Jeopardy theme here
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 11146618 595.3 17371378 927.8 11372528 607.4
## Vcells 254681038 1943.1 399061229 3044.6 394683154 3011.2
We will use 10-fold cross-validation repeated three times for every model so that we can compare apples to apples.
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
In all of the models we will follow a consistent pattern:
1. fit the model to the training data
2. compare the model against the data it was trained on
3. compare the model against the test dataset that was withheld from model building
4. compare the model against a 100k sample of Not Fraud cases to estimate the expected false-positives
For each model we will also track the time to build it.
Note: the time to build a model and the time to apply it to new data are very different.
Model tuning, especially for black-box models like Support Vector Machines, can take a long time, but the time to apply the fitted model to new data is usually insignificant.
Do not be afraid to spend a lot of time training a model if it gives better results.
Recursive partitioning
https://cran.r-project.org/web/packages/rpart/rpart.pdf
https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
start_time <- Sys.time()
rpart_model = train(isFraud ~ .,
data = train,
method = "rpart",
tuneLength = 10,
metric = "ROC",
trControl = control,
parms=list(split='information'))
end_time <- Sys.time()
end_time - start_time
## Time difference of 15.95665 secs
rpart_train_pred <- predict(rpart_model, train)
confusionMatrix(train$isFraud, rpart_train_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 5626 534
## Yes 96 6064
##
## Accuracy : 0.9489
## 95% CI : (0.9448, 0.9527)
## No Information Rate : 0.5356
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8977
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9832
## Specificity : 0.9191
## Pos Pred Value : 0.9133
## Neg Pred Value : 0.9844
## Prevalence : 0.4644
## Detection Rate : 0.4567
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9511
##
## 'Positive' Class : No
##
Accuracy = 0.9489; a lot of false-positives
Sensitivity : 0.9832
Specificity : 0.9191
rpart_test_pred <- predict(rpart_model, test)
confusionMatrix(test$isFraud, rpart_test_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1884 169
## Yes 33 2020
##
## Accuracy : 0.9508
## 95% CI : (0.9437, 0.9572)
## No Information Rate : 0.5331
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9016
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9828
## Specificity : 0.9228
## Pos Pred Value : 0.9177
## Neg Pred Value : 0.9839
## Prevalence : 0.4669
## Detection Rate : 0.4588
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9528
##
## 'Positive' Class : No
##
Accuracy : 0.9508
Sensitivity : 0.9828
Specificity : 0.9228
big_no_sample <- model_df %>%
filter(isFraud == "No") %>%
sample_n(100000)
Also, capture the time to apply the model to 100,000 records.
start_time <- Sys.time()
rpart_big_no_pred <- predict(rpart_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.2722321 secs
confusionMatrix(big_no_sample$isFraud, rpart_big_no_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 91512 8488
## Yes 0 0
##
## Accuracy : 0.9151
## 95% CI : (0.9134, 0.9168)
## No Information Rate : 0.9151
## P-Value [Acc > NIR] : 0.5029
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9151
## Neg Pred Value : NaN
## Prevalence : 0.9151
## Detection Rate : 0.9151
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : No
##
Accuracy : 0.9151
Sensitivity : 1.0000
Specificity : 0.0000
False-positives: 8488
Run-time: 0.2806261 secs
rpart_probs <- predict(rpart_model, test, type = "prob")
rpart_ROC <- roc(response = test$isFraud,
predictor = rpart_probs$Yes,
levels = levels(test$isFraud))
plot(rpart_ROC, col = "blue")
auc(rpart_ROC)
## Area under the curve: 0.9875
print(rpart_model)
## CART
##
## 12320 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 11088, 11088, 11088, 11088, 11088, 11088, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.001623377 0.9859910 0.9093074 0.9821429
## 0.002922078 0.9787281 0.8964286 0.9725108
## 0.003138528 0.9773460 0.8943182 0.9711580
## 0.005519481 0.9681319 0.8939394 0.9239719
## 0.005681818 0.9669220 0.8954545 0.9163420
## 0.006331169 0.9658423 0.8971861 0.9090368
## 0.018073593 0.8945235 0.7704004 0.9938853
## 0.028896104 0.8945235 0.7704004 0.9938853
## 0.040665584 0.8441558 0.8235390 0.8647727
## 0.688311688 0.6340639 0.9234307 0.3446970
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001623377.
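The rattle package loaded at the top is never used below; presumably it was intended for visualizing the tree. A minimal sketch of that step (my assumption, not shown in the original):
# Assumed: plot the final rpart tree with rattle's fancyRpartPlot().
fancyRpartPlot(rpart_model$finalModel)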
C5.0
Decision trees and rule-based models for pattern recognition.
https://cran.r-project.org/web/packages/C50/C50.pdf
grid <- expand.grid( .winnow = c(FALSE),
.trials=c(50, 100, 150, 200),
.model="tree" )
start_time <- Sys.time()
c5_model <- train(isFraud ~ .,
data = train,
method = "C5.0",
trControl = control,
metric = "ROC",
tuneGrid = grid,
verbose = FALSE)
end_time <- Sys.time()
end_time - start_time
## Time difference of 15.54367 secs
print(c5_model)
## C5.0
##
## 12320 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 11088, 11088, 11088, 11088, 11088, 11088, ...
## Resampling results across tuning parameters:
##
## trials ROC Sens Spec
## 50 0.3369646 NaN NaN
## 100 0.3369646 NaN NaN
## 150 0.3369646 NaN NaN
## 200 0.3369646 NaN NaN
##
## Tuning parameter 'model' was held constant at a value of tree
##
## Tuning parameter 'winnow' was held constant at a value of FALSE
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were trials = 50, model = tree
## and winnow = FALSE.
c5_pred_train <- predict(c5_model, train)
confusionMatrix(train$isFraud, c5_pred_train, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6122 38
## Yes 28 6132
##
## Accuracy : 0.9946
## 95% CI : (0.9932, 0.9959)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9893
## Mcnemar's Test P-Value : 0.2679
##
## Sensitivity : 0.9938
## Specificity : 0.9954
## Pos Pred Value : 0.9955
## Neg Pred Value : 0.9938
## Prevalence : 0.5008
## Detection Rate : 0.4977
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9946
##
## 'Positive' Class : Yes
##
Accuracy : 0.9946; WOW!!
Sensitivity : 0.9938
Specificity : 0.9954
c5_pred_test <- predict(c5_model, test)
confusionMatrix(test$isFraud, c5_pred_test, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2012 41
## Yes 29 2024
##
## Accuracy : 0.983
## 95% CI : (0.9785, 0.9867)
## No Information Rate : 0.5029
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9659
## Mcnemar's Test P-Value : 0.1886
##
## Sensitivity : 0.9801
## Specificity : 0.9858
## Pos Pred Value : 0.9859
## Neg Pred Value : 0.9800
## Prevalence : 0.5029
## Detection Rate : 0.4929
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9830
##
## 'Positive' Class : Yes
##
Accuracy : 0.983; WOW!! WOW!!
Sensitivity : 0.9801
Specificity : 0.9858
start_time <- Sys.time()
c5_pred_big_no <- predict(c5_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 4.985033 secs
confusionMatrix(big_no_sample$isFraud, c5_pred_big_no, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 97795 2205
## Yes 0 0
##
## Accuracy : 0.978
## 95% CI : (0.977, 0.9789)
## No Information Rate : 0.978
## P-Value [Acc > NIR] : 0.5057
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.97795
## Prevalence : 0.02205
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : Yes
##
Accuracy : 0.978
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 2205
Run-time: 5.715282 secs
c5_probs <- predict(c5_model, test, type = "prob")
c5_ROC <- roc(response = test$isFraud,
predictor = c5_probs$Yes,
levels = levels(test$isFraud))
plot(c5_ROC, col = "red")
auc(c5_ROC)
## Area under the curve: 0.9982
Random Forest
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
https://www.r-bloggers.com/random-forests-in-r/
# mtry is the only tuning parameter caret exposes for method = "rf";
# ntree cannot be tuned through the grid, so it is passed straight to randomForest().
grid <- expand.grid(.mtry = 5)
start_time <- Sys.time()
rf_model <- train(isFraud ~ .,
data = train,
method = "rf",
metric = "ROC",
tuneGrid = grid,
ntree = 500,
trControl = control)
end_time <- Sys.time()
end_time - start_time
## Time difference of 7.477808 mins
library(randomForest)
print(rf_model$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, TuneGrid = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 2.39%
## Confusion matrix:
## No Yes class.error
## No 5941 219 0.03555195
## Yes 76 6084 0.01233766
plot(rf_model$finalModel)
This is an important plot. Notice that the error levels out at around 100 trees, which shows that the Random Forest model is an effective learner on this dataset. If the data contained many more variables it would take more trees to reach the plateau, but it usually happens before 500 trees. The forest could safely be reduced to 100 trees without any significant performance impact.
This plot should always be checked for Random Forests to determine the best cutoff point for the number of trees.
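As an illustration of that point, the forest could be refit directly with randomForest() and 100 trees; the ntree value below is an assumption read off the error plot, not a setting used elsewhere in this analysis.
# Sketch only: refit with 100 trees to confirm the error plateau.
# mtry = 5 matches the caret-selected value; ntree = 100 is an assumed cutoff.
rf_small <- randomForest(isFraud ~ ., data = train, mtry = 5, ntree = 100)
print(rf_small)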
varImpPlot(rf_model$finalModel)
I’m a little surprised that oldbalanceOrg is the most important variable. The amount and oldbalanceDest variables are also influential.
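For a numeric view of the same ranking (not shown in the original), randomForest’s importance() can be printed directly:
# Numeric variable importance (mean decrease in Gini) behind the plot above.
importance(rf_model$finalModel)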
rf_train_pred <- predict(rf_model, train)
confusionMatrix(train$isFraud, rf_train_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6160 0
## Yes 0 6160
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : Yes
##
Accuracy : 1; Obviously overfit
Sensitivity : 1.0
Specificity : 1.0
rf_test_pred <- predict(rf_model, test)
confusionMatrix(test$isFraud, rf_test_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1991 62
## Yes 20 2033
##
## Accuracy : 0.98
## 95% CI : (0.9753, 0.9841)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9601
## Mcnemar's Test P-Value : 5.963e-06
##
## Sensitivity : 0.9704
## Specificity : 0.9901
## Pos Pred Value : 0.9903
## Neg Pred Value : 0.9698
## Prevalence : 0.5102
## Detection Rate : 0.4951
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9802
##
## 'Positive' Class : Yes
##
Accuracy : 0.98; not bad.
Sensitivity : 0.9704
Specificity : 0.9901
start_time <- Sys.time()
rf_big_no_pred <- predict(rf_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 2.964628 secs
confusionMatrix(big_no_sample$isFraud, rf_big_no_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 96800 3200
## Yes 0 0
##
## Accuracy : 0.968
## 95% CI : (0.9669, 0.9691)
## No Information Rate : 0.968
## P-Value [Acc > NIR] : 0.5047
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.968
## Prevalence : 0.032
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : Yes
##
Accuracy : 0.968
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 3200
Run-time: 2.538139 secs
rf_probs <- predict(rf_model, test, type = "prob")
rf_ROC <- roc(response = test$isFraud,
predictor = rf_probs$Yes,
levels = levels(test$isFraud))
plot(rf_ROC, col = "green")
auc(rf_ROC)
## Area under the curve: 0.9977
Support Vector Machines
https://cran.r-project.org/web/packages/e1071/e1071.pdf
https://www.r-bloggers.com/machine-learning-using-support-vector-machines/
start_time <- Sys.time()
svm_model <- train(isFraud ~ .,
data = train,
method = "svmRadial", # Radial kernel
tuneLength = 3, # 3 values of the cost function
metric="ROC",
trControl=control)
end_time <- Sys.time()
end_time - start_time
## Time difference of 30.57211 mins
Time difference of 30.57211 mins; Ouch!! That’s a long time.
print(svm_model$finalModel)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 1
##
## Number of Support Vectors : 3712
##
## Objective Function Value : -2174.618
## Training error : 0.067532
## Probability model included.
svm_train_pred <- predict(svm_model, train)
confusionMatrix(train$isFraud, svm_train_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 0 0
## Yes 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
Accuracy : 0.9305
Sensitivity : 0.9399
Specificity : 0.9215
svm_test_pred <- predict(svm_model, test)
confusionMatrix(test$isFraud, svm_test_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 0 0
## Yes 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
Accuracy : 0.9323
Sensitivity : 0.9361
Specificity : 0.9285
start_time <- Sys.time()
svm_big_no_pred <- predict(svm_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.2067931 secs
confusionMatrix(big_no_sample$isFraud, svm_big_no_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 0 0
## Yes 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
Accuracy : 0.9341
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 6588
Run-time: 49.61945 secs
Computational intensity and poorer performance rule out using SVM.
svm_probs <- predict(svm_model, test, type = "prob")
svm_ROC <- roc(response = test$isFraud,
predictor = svm_probs$Yes,
levels = levels(test$isFraud))
plot(svm_ROC, col = "black")
auc(svm_ROC)
## Area under the curve: 0.9786
plot(rpart_ROC, col = "blue")
plot(c5_ROC, col = "red", add = TRUE)
plot(rf_ROC, col = "green", add = TRUE)
plot(svm_ROC, col = "black", add = TRUE)
From the ROC curves it looks like the choice is between C5.0 and Random Forest.
sort(c(rpart = auc(rpart_ROC), rf = auc(rf_ROC),
c5 = auc(c5_ROC), svm = auc(svm_ROC)))
## svm rpart rf c5
## 0.9785829 0.9875228 0.9977197 0.9981556
Random Forest:
Accuracy : 0.968
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 3200
Run-time: 2.538139 secs
C5.0:
Accuracy : 0.978
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 2205
Run-time: 5.715282 secs
As usual, the decision of which model to use is not straightforward.
The Random Forest predicts 100,000 transactions in less than half the time that C5.0 does.
C5.0 beats Random Forest on every other metric. The key metric I would focus on is the number of false-positives, since those customers’ transactions would be denied even though they were not fraudulent.
I would choose the C5.0 model.
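As a rough illustration of that customer impact, here is a small sketch using the false-positive counts from the 100,000-transaction runs above; it simply converts them into estimated decline rates for legitimate transactions.
# False-positive counts taken from the 100k Not Fraud confusion matrices above.
fp_counts <- c(rpart = 8488, c5 = 2205, rf = 3200, svm = 6588)
fp_counts / 100000   # estimated share of legitimate transactions that would be declined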