Purpose
This script shows how to classify fraudulent orders by preprocessing (checking for correlation and zero/near-zero variance), selecting features (logistic regression), and using R’s CARET package to train several machine learning models (recursive partitioning, logistic regression, random forest, neural network, and support vector machine).
Resources and Inspiration
“A Short Introduction to the caret Package” https://cran.r-project.org/web/packages/caret/vignettes/caret.html
“The caret Package” https://topepo.github.io/caret/index.html
“train” Docs https://www.rdocumentation.org/packages/caret/versions/4.47/topics/train
“Introduction to Machine Learning: The Wikipedia Guide” http://www.datascienceassn.org/sites/default/files/Introduction%20to%20Machine%20Learning.pdf
Load libraries
options(max.print = 500)
repo = 'http://cran.rstudio.com/'
library(tidyverse)
library(randomForest)
library(corrplot)
library(ROSE)
library(caret)
library(caretEnsemble)
library(readxl)
library(stringr)
library(pROC)
# some homemade functions, such as a function to format dataframe columns for petty viewing
source('/Users/jarad/Fake Folder/R Libraries/jb_functions.R')Load in data
Create a vector of columns to exclude as we explore our data and build models
Check it out
## Classes 'tbl_df', 'tbl' and 'data.frame': 288384 obs. of 30 variables:
## $ delivery_address_type : chr "residential" "residential" "commercial" "commercial" ...
## $ ip_and_shipping_match : chr "no" "no" "no" "no" ...
## $ account_type : chr "account" "guest" "account" "account" ...
## $ payment_method : chr "credit card" "paypal" "credit card" "credit card" ...
## $ fraud : chr "yes" "no" "no" "no" ...
## $ subtotal : num 168 19.9 143.2 59.9 321.8 ...
## $ order_count : num 1 7 10 3 226 1 21 3 1 1 ...
## $ past_amazon_payments : num 0 0 0 0 0 1 0 0 0 0 ...
## $ past_bitpay : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_check/money_order : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_credit_card : num 1 2 10 10 0 0 32 0 1 0 ...
## $ past_gift_certificate/coupon : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_google_wallet_instant_buy : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_googlecheckout : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_paypal : num 0 5 0 0 0 0 0 3 0 1 ...
## $ past_paypal_ipn : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_prepaid_subscription : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_purchase_order : num 0 0 0 0 393 0 0 0 0 0 ...
## $ past_purchase_order_-_check : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_purchase_order_-_wire_transfer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_replacement_order : num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_saved_credit_cards<span_class="alert">_(beta_only)</span>: num 0 0 0 0 0 0 0 0 0 0 ...
## $ past_subscription : num 0 0 0 0 0 0 0 0 0 0 ...
## $ account_age : num 0 1687 551 0 1965 ...
## $ avg_part_price : num 28 9.95 7.16 14.96 35.75 ...
## $ fraud_part_proportion : num 1 0 0.5 0.5 0.5 ...
## $ countries_fraud_rating : num 0.00146 0.00146 0.00146 0.00146 0.00146 ...
## $ billing_and_shipping_match : num 0.719 0.604 1 0.743 0.543 ...
## $ email_domain_fraud_rating : num 1 0 0 0.002 0 ...
## $ email_username_non_letter_rating : num 0 0 0 0 0 ...
Check out some fraud stats
Data set is highly imbalanced, with fraud making up such a small proportion of total orders. I will undersample to correct for this during the model building phase.
count <- main %>%
group_by(fraud) %>%
summarise(count = n()) %>%
column_to_rownames('fraud') %>%
t()
main %>%
group_by(fraud) %>%
select_if(., is.numeric) %>%
select(., -c(orders_id, customers_id)) %>%
summarize_all(mean) %>%
column_to_rownames('fraud') %>%
t() %>%
as.data.frame() %>%
rbind(.,count) %>%
jb_format(., c('number2','number2')) %>%
jb_pretty_df(.)| no | yes | |
|---|---|---|
| subtotal | 144.92 | 268.85 |
| order_count | 17.57 | 1.59 |
| past_amazon_payments | 0.58 | 0.09 |
| past_bitpay | 0.01 | 0.01 |
| past_check/money_order | 0.00 | 0.00 |
| past_credit_card | 11.23 | 0.99 |
| past_gift_certificate/coupon | 0.02 | 0.01 |
| past_google_wallet_instant_buy | 0.02 | 0.00 |
| past_googlecheckout | 0.01 | 0.00 |
| past_paypal | 4.01 | 0.71 |
| past_paypal_ipn | 0.01 | 0.00 |
| past_prepaid_subscription | 0.00 | 0.00 |
| past_purchase_order | 1.61 | 0.00 |
| past_purchase_order_-_check | 0.08 | 0.01 |
| past_purchase_order_-_wire_transfer | 0.10 | 0.00 |
| past_replacement_order | 0.23 | 0.06 |
| past_saved_credit_cards<span_class=“alert”>_(beta_only)</span> | 0.00 | 0.00 |
| past_subscription | 0.04 | 0.00 |
| account_age | 553.32 | 74.45 |
| avg_part_price | 17.45 | 84.69 |
| fraud_part_proportion | 0.63 | 1.00 |
| countries_fraud_rating | 0.00 | 0.03 |
| billing_and_shipping_match | 0.93 | 0.79 |
| email_domain_fraud_rating | 0.00 | 0.23 |
| email_username_non_letter_rating | 0.06 | 0.11 |
| count | 287,910.00 | 474.00 |
Make a copy for preprocessing and model building
Check correlation
Correlation can weaken the model. Here I identify and remove correlated features.
for_cor <- for_model[, !names(for_model) %in% exclude]
# correlation only works with numeric data, so make it all numeric
for_cor[] <- lapply(for_cor, function(x){as.numeric(as.factor(x))})
# get correlation
c <- cor(for_cor, for_cor$fraud)
c <- as.matrix(c[order(c, decreasing = TRUE),])
# check it
x <- 0.50
is_correlated <- c %>%
subset(rownames(.) != 'fraud') %>%
as.data.frame(.) %>%
tibble::rownames_to_column(., 'variable') %>%
filter(V1 > x)
if (is_correlated %>% count(.) == 0) {
print('no correlation')
} else {
print(paste(is_correlated$variable, 'is correlated with fraud and will be removed'))
for_model <- for_model %>%
select(-c(is_correlated$variable))
}## [1] "no correlation"
Find and remove features with near-zero and zero variance
Features which have near-zero or zero variance are either constant features or features which contain a few unique values that occur with low frequencies, as explained here: https://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors. These features will weaken the model and should be removed.
# find zero/near-zero variance
nzv_main <- nearZeroVar(for_model, saveMetrics = TRUE)
# remove "fraud"
nzv_main <- nzv_main[!(rownames(nzv_main) %in% 'fraud'), ]
# create empty vector
nzv_cols <- c()
# get zeros
zero <- nzv_main %>% subset(zeroVar ==TRUE)
if (nrow(zero) > 0) {
print(paste('zero variance:', zero))
nzv_cols <- append(nzv_cols, rownames(zero))
} else {
print('all features have non-zero variance')
}## [1] "all features have non-zero variance"
# get near-zeros
near_zero <- nzv_main %>% subset(nzv ==TRUE)
if (nrow(near_zero) > 0) {
print(paste('near-zero variance:', rownames(near_zero)))
nzv_cols <- append(nzv_cols, rownames(near_zero))
} else {
print('all features have variance not near zero')
}## [1] "near-zero variance: past_bitpay"
## [2] "near-zero variance: past_check/money_order"
## [3] "near-zero variance: past_gift_certificate/coupon"
## [4] "near-zero variance: past_google_wallet_instant_buy"
## [5] "near-zero variance: past_googlecheckout"
## [6] "near-zero variance: past_paypal_ipn"
## [7] "near-zero variance: past_prepaid_subscription"
## [8] "near-zero variance: past_purchase_order"
## [9] "near-zero variance: past_purchase_order_-_check"
## [10] "near-zero variance: past_purchase_order_-_wire_transfer"
## [11] "near-zero variance: past_replacement_order"
## [12] "near-zero variance: past_saved_credit_cards<span_class=\"alert\">_(beta_only)</span>"
## [13] "near-zero variance: past_subscription"
## [14] "near-zero variance: account_age"
## [15] "near-zero variance: countries_fraud_rating"
## [16] "near-zero variance: billing_and_shipping_match"
## [17] "near-zero variance: email_username_non_letter_rating"
# remove them
for_model <- for_model[, !(names(for_model) %in% nzv_cols)]
if (length(nzv_cols) > 0) {
print('these features have been removed')
}## [1] "these features have been removed"
Feature selection using logistic regression
I use the coefficients of a logistic regression model to identify and choose the most important features. I take a top down approach, and one-by-one remove each insignificant feature (i.e., a feature with a p-value greater than or equal to 0.05) until all features have low p-values.
From wiki: https://en.wikipedia.org/wiki/Logistic_regression: Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.
# make a copy
for_fs <- for_model
# scale numbers
for (col in colnames(for_fs)) {
if (class(for_fs[[col]]) == 'numeric') {
for_fs[[col]] <- scale(for_fs[[col]])
}
}
# change fraud to numeric
for_fs$fraud <- ifelse(for_fs$fraud == 'yes',1,0)
# change characters to numeric
for_fs[] <- lapply(for_fs, function(x){if(is.character(x)) as.numeric(as.factor(x)) else x})
# create a function that chooses the most significant features
# this function takes a top-down approach and removes features with p-values above 0.05, one by one, starting with the highest
get_glm <- function(df){
# fit model to data
fit <- glm(fraud ~ .,
family = 'binomial',
data = df)
# get highest p-value, excluding the intercept
s <- summary(fit)
s <- as.data.frame(s$coefficients)
s <- s[!(rownames(s) %in% c('(Intercept)')),]
s <- s[order(s$`Pr(>|z|)`, decreasing = TRUE),]
pvalue <- s[1,ncol(s)]
var <- rownames(s)[1]
# if highest p-value <= P, return the fitted model
if(pvalue <= 0.05){
return(fit)
# if not, remove the highest and re-run function
} else {
print(paste(var,' will be removed with pvalue = ',round(pvalue,4), sep = ''))
df <- df[!colnames(df) %in% var]
get_glm(df)
}
}
# run it
glm_fit <- get_glm(for_fs)## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "fraud_part_proportion will be removed with pvalue = 0.9386"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "account_type will be removed with pvalue = 0.4486"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "delivery_address_type will be removed with pvalue = 0.2657"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# get summary
s <- summary(glm_fit)
# extract columns to keep
v <- rownames(s$coefficients)
v <- v[v != '(Intercept)'] # remove this
v <- c(v,'fraud') # append this
# keep only significant features
for_model <- for_model[, names(for_model) %in% v]
# view summary
s##
## Call:
## glm(formula = fraud ~ ., family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5069 -0.0612 -0.0450 -0.0137 4.7861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.454986 0.730522 -15.681 < 2e-16 ***
## ip_and_shipping_match -0.277665 0.119560 -2.322 0.020212 *
## payment_method -0.674834 0.074733 -9.030 < 2e-16 ***
## subtotal 0.095082 0.024694 3.850 0.000118 ***
## order_count -18.235486 4.529008 -4.026 5.66e-05 ***
## past_amazon_payments -4.076796 0.598023 -6.817 9.29e-12 ***
## past_credit_card -22.542501 4.036381 -5.585 2.34e-08 ***
## past_paypal 2.857173 0.757194 3.773 0.000161 ***
## avg_part_price 0.079093 0.007178 11.019 < 2e-16 ***
## email_domain_fraud_rating 0.282926 0.020302 13.936 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7024.7 on 288383 degrees of freedom
## Residual deviance: 5071.1 on 288374 degrees of freedom
## AIC: 5091.1
##
## Number of Fisher Scoring iterations: 16
We will train the models using the features above, which do not have near-zero or zero variance, and which have been deemed significant by the logistic regression model.
Create train/test sets
The train set is 80% of the data; the test set is 20%.
set.seed(1)
# set sampling method to address class imbalance
sampling_method <- 'under'
# change characters to factors
for_model[] <- lapply(for_model, function(x){if(is.character(x)) as.factor(x) else x})
print('these are the features which the model will use')## [1] "these are the features which the model will use"
## Classes 'tbl_df', 'tbl' and 'data.frame': 288384 obs. of 10 variables:
## $ ip_and_shipping_match : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 2 1 1 2 ...
## $ payment_method : Factor w/ 5 levels "amazon payments",..: 2 4 2 2 5 1 2 4 2 4 ...
## $ fraud : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ subtotal : num 168 19.9 143.2 59.9 321.8 ...
## $ order_count : num 1 7 10 3 226 1 21 3 1 1 ...
## $ past_amazon_payments : num 0 0 0 0 0 1 0 0 0 0 ...
## $ past_credit_card : num 1 2 10 10 0 0 32 0 1 0 ...
## $ past_paypal : num 0 5 0 0 0 0 0 3 0 1 ...
## $ avg_part_price : num 28 9.95 7.16 14.96 35.75 ...
## $ email_domain_fraud_rating: num 1 0 0 0.002 0 ...
# create train set
size <- 0.80
ix <- createDataPartition(for_model$fraud, p = size, list = FALSE)
train_set <- for_model[ix,]
# over sample train_set to correct for data imbalance
train_set <- ovun.sample(fraud ~ .,
data = train_set,
p = 0.5,
method = sampling_method)$data
cat('\n')## [1] "fraud distribution after using sample method: under"
##
## no yes
## 383 380
Fit ML models and predict over test set
Recursive Partitioning (rpart):
From wiki: https://en.wikipedia.org/wiki/Recursive_partitioning
Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached.
Logistic Regression (LogitBoost):
From wiki: https://en.wikipedia.org/wiki/Logistic_regression
A statistical model that in its basic form uses a logistic function (a logistic function or logistic curve is a common “S” shape, or sigmoid curve) to model a binary dependent variable. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, and these are represented by one or more indicator variables.
Random Forest (rf):
From wiki: https://en.wikipedia.org/wiki/Random_forest
An ensemble learning method (multiple learning algorithms which obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone) for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode (the mode of a set of data values is the value that appears most often) of the classes (classification) or mean prediction (regression) of the individual trees.
Neural Network (nnet):
From wiki: https://en.wikipedia.org/wiki/Machine_learning#Artificial_neural_networks
The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.
From here: https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#An%20engineering%20approach
3.1 A simple neuron
An artificial neuron is a device with many inputs and one output. The neuron has two modes of operation: the training mode and the using mode. In the training mode, the neuron can be trained to fire (or not), for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output. If the input pattern does not belong in the taught list of input patterns, the firing rule is used to determine whether to fire or not.
3.2 Firing rules
The firing rule is an important concept in neural networks and accounts for their high flexibility. A firing rule determines how one calculates whether a neuron should fire for any input pattern. It relates to all the input patterns, not only the ones on which the node was trained.
A simple firing rule can be implemented by using Hamming distance technique. The rule goes as follows: Take a collection of training patterns for a node, some of which cause it to fire (the 1-taught set of patterns) and others which prevent it from doing so (the 0-taught set). Then the patterns not in the collection cause the node to fire if, on comparison, they have more input elements in common with the ‘nearest’ pattern in the 1-taught set than with the ‘nearest’ pattern in the 0-taught set. If there is a tie, then the pattern remains in the undefined state.
Support Vector Machine (svmLinear)
From wiki: https://en.wikipedia.org/wiki/Support-vector_machine
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
set.seed(1)
# create the control method for model-fit testing
# set some control options based on sampling method
if (sampling_method == 'under') {
# if we undersample, choose crossfold validation with undersampling
tr_method <- 'repeatedcv'
tr_sampling <- 'down'
} else {
# if we oversample, choose no control method and just put in a placeholder for the sampling method
# when I choose crossfold validation with oversampling the training takes hours
tr_method <- 'none'
tr_sampling <- 'up'
}
control <- trainControl(method = tr_method,
sampling = tr_sampling,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = TRUE)
# list of models
model_vec <- c('rpart', # recursive partitioning
'LogitBoost', # logistic regression
'rf', # random forest
'nnet', # neural network
'svmLinear') # support vector machine
# empty dataframe for results
results <- data.frame()
# empty list for models
model_list <- list()
# model loop
for (m in model_vec) {
s <- Sys.time()
# train each model
# suppress garbage output from nnet model
garbage <- capture.output(fit <- train(fraud ~ .,
data = train_set,
method = m,
trControl = control,
preProcess = c('center','scale'),
metric = 'ROC',
tuneLength = 5))
# store the model so I can choose which one I want to use later
model_list[[m]] <- fit
# predict over test set
pred <- predict(fit, test_set)
# get confusion matrix
for_cm <- test_set$fraud %>% as.factor()
cm <- confusionMatrix(pred, for_cm, positive = 'yes')
# get AUC
roc <- roc(test_set$fraud, as.numeric(pred))
area <- auc(roc)
e <- Sys.time()
# store the model results
df <- data.frame(row.names = m,
time_to_train_in_sec = as.numeric(e-s, units = 'secs'),
accuracy = cm$overall['Accuracy'],
sensitivity = cm$byClass['Sensitivity'],
specificity = cm$byClass['Specificity'],
auc = area)
# append to "results" df
results <- rbind(results, df)
}
# add in the model results
model_list[['results']] <- resultsCheck out results
results <- model_list$results %>%
tibble::rownames_to_column(.) %>%
rename(model = rowname) %>%
arrange(desc(.$auc))
(results %>%
jb_format(., c('nothing','number1','percent0','percent0','percent0','number2')) %>%
jb_pretty_df(.))| model | time_to_train_in_sec | accuracy | sensitivity | specificity | auc |
|---|---|---|---|---|---|
| rf | 73.4 | 86% | 85% | 86% | 0.86 |
| rpart | 4.0 | 75% | 93% | 75% | 0.84 |
| LogitBoost | 9.1 | 88% | 77% | 88% | 0.82 |
| nnet | 67.3 | 85% | 77% | 85% | 0.81 |
| svmLinear | 10.3 | 83% | 71% | 83% | 0.77 |
Use the model with the highest AUC
Receiver Operating Characteristic (ROC) From wiki: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The AUC (area under the ROC curve) is a measure of how well a model can distinguish between diagnostic groups (fraud/not-fraud). In other words, it shows the relationship between the sensitivity (true positive rate) and the specificity (true negative rate). This is one of the leading metrics of model selection, and we want it to be as close to 1 as possible, which indicates a perfect fit.
s <- Sys.time()
# select the model with the highest AUC
model <- head(results$model,1)
# fit this model
fit <- model_list[[model]]
# select the data you want to use
# make sure you use the same columns as in the model building process
data <- main[, names(main) %in% names(for_model)]
# predict
pred <- predict(fit, data, type = 'prob') %>%
mutate('class' = names(.)[apply(., 1, which.max)])
# get confusion matrix
cm <- confusionMatrix(pred$class %>% as.factor(),
data$fraud %>% as.factor(),
positive = 'yes')
# get AUC
roc <- roc(data$fraud,
as.numeric(pred$class %>% as.factor()))
area <- auc(roc)
e <- Sys.time()
print(e-s)## Time difference of 15.79534 secs
## Area under the curve: 0.9141
## [1] "model used: rf"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 247574 15
## yes 40336 459
##
## Accuracy : 0.8601
## 95% CI : (0.8588, 0.8613)
## No Information Rate : 0.9984
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0191
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.968354
## Specificity : 0.859901
## Pos Pred Value : 0.011251
## Neg Pred Value : 0.999939
## Prevalence : 0.001644
## Detection Rate : 0.001592
## Detection Prevalence : 0.141461
## Balanced Accuracy : 0.914128
##
## 'Positive' Class : yes
##
# merge the class and class probabilities with the original data
main2 <- merge(main,
pred,
by.x = 0, # merge on rownames
by.y = 0) # merge on rownames
# rename these columns
main2 <- main2 %>%
rename(predicted_class = class, actual_fraud_result = fraud)
# drop this column
main2 <- main2[, names(main2) != 'Row.names']Check out feature importance
Interpretation of confusion matrix output
The accuracy, sensitivity, and specificity are ok, but they could be higher. For the model to be a valid model, the Accuracy should be higher than the No Information Rate, also called the Null Accuracy, which is the accuracy that could be achieved if we always assume that any order is the majority class (in this case, that’s “not-fraud”).
The Pos Pred Value, the proportion of correctly classified fraud orders out of all the orders that are classified as fraud is very low, which means the model produces a ton of false postitves. The model misclassifies tens of thousands of orders as fraud, shown in the bottom left of the “prediction/reference” table towards the top of the output. This is problematic because employees will be checking for fraud in orders that are definitely not-fraud.
Next Steps
This severe misclassification is most likely because the model does not have enough or good-enough features to accurately classify an order as fraud. The high class imabalance of the data almost certainly contributes to this. Two solutions are to: 1.) Add in additional features; 2.) Wait until we have more data so that the class imbalance is not so high.
Although not shown here, to address the class imbalance as best as I could, I have trained the models with both over- and undersampled data, and the results are roughly same.