There is an obvious benefit to being able to predict instances of credit card fraud from a user’s transaction history.
However, the difficulty in developing an accurate fraud detection model comes from the fact that actual fraud is rare: for every fraudulent transaction there are a great many legitimate ones. By definition, then, any dataset that contains instances of actual fraud will contain far more instances of non-fraud transactions.
This is known as imbalanced or rare events data: data in which the proportion of positive responses is far lower than the proportion of negative responses.
The credit card fraud data used here is a good example of this type of data, but it is by no means the only one. In fact, it is rare for any real-world data to be perfectly balanced.
This is why a robust model for predicting imbalanced or rare events data is extremely valuable in today’s data-centric business world.
Here, I develop a high-accuracy model for predicting credit card fraud (a rare events problem) using an un-tuned, vanilla Random Forest algorithm.
The Credit Card Fraud data that I will be using in this article is from Kaggle’s Credit Card Fraud Detection dataset, which can be found here.
The data itself is made up of 284,807 observations and 31 variables:
dim(cc_data)
## [1] 284807 31
names(cc_data)
## [1] "Time" "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [9] "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15"
## [17] "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "Amount" "Class"
The only pre-processing that I perform on the data is to factorize the response variable, “Class”:
cc_data$Class <- factor(cc_data$Class)
This data set is considered imbalanced or rare events data because it has a very low proportion of positive response observations:
table(cc_data$Class)
##
## 0 1
## 284315 492
In fact, relative to the non-fraud observations, the positive (i.e. fraud) cases amount to only ~ 0.17%:
492 / 284315 * 100
## [1] 0.1730475
The key to developing a robust predictive model for rare events data is using a validation set to assign an accurate threshold cutoff. This cutoff is used to convert the prediction probabilities that the model outputs into a binary 0 or 1 class.
A common mistake that I see many data scientists make is using 0.50 for the cutoff. This is a very short-sighted method of assigning a threshold, and especially so for rare events data.
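To see why, consider a small simulated illustration (toy numbers, not the credit card data): when positives are rare, the model’s predicted probabilities for the positive class often sit well below 0.50, so a 0.50 cutoff labels nearly everything as the negative class.
# Toy illustration only (simulated data, not the credit card data set):
# the rare positive class tends to score higher than the negative class,
# but its predicted probabilities still sit mostly below 0.50.
set.seed(1)
actual    <- c(rep(0, 990), rep(1, 10))
pred_prob <- c(rbeta(990, 1, 20), rbeta(10, 4, 10))
# A 0.50 cutoff misses most of the rare positives...
table(actual, predict = as.integer(pred_prob >= 0.50))
# ...while a lower, data-driven cutoff recovers most of them
# (at the cost of a handful of false positives).
table(actual, predict = as.integer(pred_prob >= 0.15))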
Before modeling, I split my data into Train, Validation, and Test data sets. I accomplish this process using a function I wrote called train_valid_test().
The function itself is nothing fancy, and it is mainly a driver for a few dplyr functions:
library(dplyr)

train_valid_test <- function(df, tvt_prop = 0.8){

  # inner helper: randomly split df into train / test at the given proportion
  .train_test <- function(df, prop){
    df <- df %>%
      mutate(idx = 1:nrow(df))
    train <- df %>%
      sample_frac(size = prop)
    test <- df %>%
      setdiff(train)
    train <- train %>%
      arrange(idx) %>%
      select(-idx)
    test <- test %>%
      arrange(idx) %>%
      select(-idx)
    return(list(
      "train" = train,
      "test" = test
    ))
  }

  # first split: (train + valid) vs test
  tt_lst <- df %>%
    .train_test(prop = tvt_prop)
  test <- tt_lst$'test'

  # second split: train vs valid, applying the same proportion again
  tmp_ <- tt_lst$train %>%
    .train_test(prop = tvt_prop)
  train <- tmp_$'train'
  valid <- tmp_$'test'

  return(list(
    'train' = train,
    'valid' = valid,
    'test' = test
  ))
}
However, the key to this function is that it takes in a data set and outputs Train, Validation, and Test data sets that (approximately) preserve the relative response proportions. This is a vitally important concept when working with rare events data.
ret_tvt <- cc_data %>%
train_valid_test()
data_train <- ret_tvt$'train'
data_train_table <- table(data_train$Class)
data_valid <- ret_tvt$'valid'
data_valid_table <- table(data_valid$Class)
data_test <- ret_tvt$'test'
data_test_table <- table(data_test$Class)
data_train_table
##
## 0 1
## 181967 310
data_valid_table
##
## 0 1
## 45476 93
data_test_table
##
## 0 1
## 56872 89
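As the tables above show, each split carries roughly the same ~ 0.17% fraud rate. If an exactly stratified split were preferred, one alternative (a sketch only, not the function used above) is to sample within each level of Class:
# Sketch of a stratified 64% / 16% / 20% split (illustrative, uses dplyr):
# sampling within each level of Class keeps the fraud rate essentially
# identical across the three splits.
set.seed(42)
cc_idx <- cc_data %>% mutate(.row = row_number())
strat_train <- cc_idx %>%
  group_by(Class) %>%
  sample_frac(0.64) %>%
  ungroup()
strat_valid <- cc_idx %>%
  anti_join(strat_train, by = ".row") %>%
  group_by(Class) %>%
  sample_frac(0.16 / 0.36) %>%   # 16% of the total = 16/36 of what remains
  ungroup()
strat_test <- cc_idx %>%
  anti_join(bind_rows(strat_train, strat_valid), by = ".row")
# fraud rate per split (the .row helper column can be dropped afterwards)
sapply(list(train = strat_train, valid = strat_valid, test = strat_test),
       function(d) mean(d$Class == "1"))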
As mentioned, the model itself will be a plain vanilla un-tuned Random Forest model, run in an H2O cluster environment.
NOTE: See the Discussion section below for more detail on my decision to use this algorithm. There I also show that my results are in line with (and slightly outperform) the results of a much more convoluted Deep Learning Neural Net approach to the same data set.
First, I initiate an H2O cluster and create H2O objects:
library(h2o)
h2o.init()
data_train_h2o <- as.h2o(data_train)
data_valid_h2o <- as.h2o(data_valid)
data_test_h2o <- as.h2o(data_test)
Then I set up the model call. As stated above, no model parameters have been tuned:
response_name <- "Class"
h2o_mod <- h2o.randomForest(
x = setdiff(names(data_train_h2o), response_name),
y = response_name,
training_frame = data_train_h2o,
validation_frame = data_valid_h2o,
ntrees = 200
)
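Although it is not required for the approach that follows, it can be worth sanity-checking the fit on the validation frame before assigning a cutoff. The calls below use standard h2o accessors:
# Optional sanity check of the validation-frame fit:
perf_valid <- h2o.performance(h2o_mod, valid = TRUE)
h2o.auc(perf_valid)              # area under the ROC curve on the validation set
h2o.confusionMatrix(perf_valid)  # confusion matrix at H2O's default threshold choice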
As stated above, the “secret sauce” of this model is the ability to assign a correct threshold cutoff, to be used to convert the prediction probability distribution (from the model) to a binary 0 or 1 response, where 0 is non-fraud and 1 is fraud.
To accomplish this, I wrote a function called assign_threshold(). It takes as input the prediction probabilities and the actual responses from the Validation data, and returns a threshold cutoff chosen to maximize a given metric.
In the following code block, I maximize for Accuracy (ACC) and for Youden’s J statistic (JSTAT, i.e. TPR + TNR - 1).
The actual code for the function is beyond the scope of this article, so I will just show its use-case here:
y_valid <- data_valid[[response_name]]
pred_prob_valid <- h2o.predict(h2o_mod, newdata = data_valid_h2o)
pred_prob_valid <- as.data.frame(pred_prob_valid)$"p1"
thresh_cut_df <- pred_prob_valid %>%
assign_threshold(y_valid, conf_skip_round = TRUE) %>%
round(3)
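For readers curious about what assign_threshold() does under the hood, the following sketch captures the basic idea (illustrative code only, not the actual function): scan a grid of candidate cutoffs over the validation predictions and keep the cutoff that maximizes the chosen metric.
# Illustrative sketch of a cutoff search on the validation set:
scan_thresholds <- function(pred_prob, actual, cutoffs = seq(0.01, 0.99, by = 0.01)){
  actual <- as.integer(as.character(actual))  # factor "0"/"1" -> integer 0/1
  metrics <- t(sapply(cutoffs, function(ct){
    pred <- as.integer(pred_prob >= ct)
    tpr  <- sum(pred == 1 & actual == 1) / sum(actual == 1)
    tnr  <- sum(pred == 0 & actual == 0) / sum(actual == 0)
    c(cutoff = ct, ACC = mean(pred == actual), JSTAT = tpr + tnr - 1)
  }))
  data.frame(metrics)
}
scan_df <- scan_thresholds(pred_prob_valid, y_valid)
scan_df[which.max(scan_df$ACC), ]    # cutoff that maximizes Accuracy
scan_df[which.max(scan_df$JSTAT), ]  # cutoff that maximizes Youden's J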
Once the assign_threshold() function has been run, the Test Data is used to create a test prediction probability distribution:
pred_prob_test <- h2o.predict(h2o_mod, newdata = data_test_h2o)
pred_prob_test <- as.data.frame(pred_prob_test)$"p1"
Next, I create binary classes from that test prediction probability distribution by applying the threshold cutoffs from above. Those cutoffs are 0.16 (maximizing ACC) and 0.06 (maximizing JSTAT):
# by ACC:
pred_binary_test_acc <- ifelse(pred_prob_test >= 0.16, 1, 0)
prd_act_df_acc <- data.frame(
actual = data_test[[response_name]],
predict = pred_binary_test_acc
)
# by JSTAT:
pred_binary_test_jstat <- ifelse(pred_prob_test >= 0.06, 1, 0)
prd_act_df_jstat <- data.frame(
actual = data_test[[response_name]],
predict = pred_binary_test_jstat
)
Once we have the binary prediction results from above, we can compare them to the actuals and assess how accurate the model is:
prd_act_df_acc %>%
group_by(actual, predict) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n)) %>%
data.frame()
## actual predict n freq
## 1 0 0 56862 0.9998241665
## 2 0 1 10 0.0001758335
## 3 1 0 13 0.1460674157
## 4 1 1 76 0.8539325843
prd_act_df_acc %>%
confusion(act_name = "actual", pred_name = "predict")
## TPR TNR PPV NPV FNR FPR FDR FOR ACC TS JSTAT
## 0.8539 0.9998 0.8837 0.9998 0.1461 0.0002 0.1163 0.0002 0.9996 0.7677 0.8538
prd_act_df_jstat %>%
group_by(actual, predict) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n)) %>%
data.frame()
## actual predict n freq
## 1 0 0 56824 0.9991559994
## 2 0 1 48 0.0008440006
## 3 1 0 13 0.1460674157
## 4 1 1 76 0.8539325843
prd_act_df_jstat %>%
confusion(act_name = "actual", pred_name = "predict")
## TPR TNR PPV NPV FNR FPR FDR FOR ACC TS JSTAT
## 0.8539 0.9992 0.6129 0.9998 0.1461 0.0008 0.3871 0.0002 0.9989 0.5547 0.8531
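The confusion() function used above is another small convenience helper of mine. For completeness, here is a minimal sketch of how a few of those metrics could be computed directly from the prediction/actual data frame (illustrative only, not the exact helper):
# Illustrative sketch of a confusion-metrics summary:
confusion_sketch <- function(df, act_name = "actual", pred_name = "predict"){
  act  <- as.integer(as.character(df[[act_name]]))
  pred <- as.integer(df[[pred_name]])
  tp <- sum(pred == 1 & act == 1); tn <- sum(pred == 0 & act == 0)
  fp <- sum(pred == 1 & act == 0); fn <- sum(pred == 0 & act == 1)
  round(c(
    TPR   = tp / (tp + fn),                  # sensitivity / recall
    TNR   = tn / (tn + fp),                  # specificity
    PPV   = tp / (tp + fp),                  # precision
    ACC   = (tp + tn) / (tp + tn + fp + fn),
    JSTAT = tp / (tp + fn) + tn / (tn + fp) - 1
  ), 4)
}
confusion_sketch(prd_act_df_acc)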
Overall Accuracy is 99.96% using the ACC cutoff, and 99.89% using the JSTAT cutoff.
I would like to start off by stating that there is absolutely a time and place for using advanced modeling techniques such as a Deep Learning Neural Net. In fact, I have used this very approach in a previous article I wrote called A Deep Learning Model that Predicts SMS Text Spam with Over 97% Accuracy, using R and H2O. In other words, I by no means want to discourage its use.
Rather, my aim here is to show that it is simply not necessary to use such a convoluted approach in every situation, and that often a simple decision-tree-based algorithm (such as Random Forest) will do the job quite admirably.
Consider this blog post by the very-capable Shirin Elsinghorst entitled Autoencoders and anomaly detection with machine learning in fraud analytics.
In this article, Shirin uses the same Credit Card Fraud data and applies a Deep Learning approach to predicting fraud. In her own words: “Our final model now correctly identified 83% of fraud cases and almost 100% of non-fraud cases.” This is the output of her work:
## actual predict n freq
## <chr> <dbl> <int> <dbl>
## 1 0 0 56558 0.998693318
## 2 0 1 74 0.001306682
## 3 1 0 16 0.173913043
## 4 1 1 76 0.826086957
Let’s compare Shirin’s result to the result of my Random Forest approach from above, which (using the ACC cutoff) correctly identified 85.4% of fraud cases and 99.98% of non-fraud cases.
Clearly, my Random Forest approach replicated the accuracy of Shirin’s non-fraud predictions and slightly improved on the accuracy of her fraud predictions.
Now, am I criticising any aspect of Shirin’s work? Absolutely not! In fact, her article was the inspiration for me to write mine.
I am instead saying that her results could have been replicated using a much simpler-to-use Random Forest approach, the benefit of such an approach being its simplicity: no special network architecture to design and, in this case, no hyperparameter tuning at all.
Anyway, I hope you enjoyed this article. Thanks as always for reading!