Machine Learning

Jarad T. Bushnell

Apr 24, 2019

Purpose

This script shows how to classify fraudulent orders by preprocessing (checking for correlation and zero/near-zero variance), selecting features (logistic regression), and using R’s CARET package to train several machine learning models (recursive partitioning, logistic regression, random forest, neural network, and support vector machine).

Check it out

## Classes 'tbl_df', 'tbl' and 'data.frame':    288384 obs. of  30 variables:
##  $ delivery_address_type                                         : chr  "residential" "residential" "commercial" "commercial" ...
##  $ ip_and_shipping_match                                         : chr  "no" "no" "no" "no" ...
##  $ account_type                                                  : chr  "account" "guest" "account" "account" ...
##  $ payment_method                                                : chr  "credit card" "paypal" "credit card" "credit card" ...
##  $ fraud                                                         : chr  "yes" "no" "no" "no" ...
##  $ subtotal                                                      : num  168 19.9 143.2 59.9 321.8 ...
##  $ order_count                                                   : num  1 7 10 3 226 1 21 3 1 1 ...
##  $ past_amazon_payments                                          : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ past_bitpay                                                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_check/money_order                                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_credit_card                                              : num  1 2 10 10 0 0 32 0 1 0 ...
##  $ past_gift_certificate/coupon                                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_google_wallet_instant_buy                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_googlecheckout                                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_paypal                                                   : num  0 5 0 0 0 0 0 3 0 1 ...
##  $ past_paypal_ipn                                               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_prepaid_subscription                                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_purchase_order                                           : num  0 0 0 0 393 0 0 0 0 0 ...
##  $ past_purchase_order_-_check                                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_purchase_order_-_wire_transfer                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_replacement_order                                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_saved_credit_cards<span_class="alert">_(beta_only)</span>: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ past_subscription                                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ account_age                                                   : num  0 1687 551 0 1965 ...
##  $ avg_part_price                                                : num  28 9.95 7.16 14.96 35.75 ...
##  $ fraud_part_proportion                                         : num  1 0 0.5 0.5 0.5 ...
##  $ countries_fraud_rating                                        : num  0.00146 0.00146 0.00146 0.00146 0.00146 ...
##  $ billing_and_shipping_match                                    : num  0.719 0.604 1 0.743 0.543 ...
##  $ email_domain_fraud_rating                                     : num  1 0 0 0.002 0 ...
##  $ email_username_non_letter_rating                              : num  0 0 0 0 0 ...

Check out some fraud stats

Data set is highly imbalanced, with fraud making up such a small proportion of total orders. I will undersample to correct for this during the model building phase.

no yes
subtotal 144.92 268.85
order_count 17.57 1.59
past_amazon_payments 0.58 0.09
past_bitpay 0.01 0.01
past_check/money_order 0.00 0.00
past_credit_card 11.23 0.99
past_gift_certificate/coupon 0.02 0.01
past_google_wallet_instant_buy 0.02 0.00
past_googlecheckout 0.01 0.00
past_paypal 4.01 0.71
past_paypal_ipn 0.01 0.00
past_prepaid_subscription 0.00 0.00
past_purchase_order 1.61 0.00
past_purchase_order_-_check 0.08 0.01
past_purchase_order_-_wire_transfer 0.10 0.00
past_replacement_order 0.23 0.06
past_saved_credit_cards<span_class=“alert”>_(beta_only)</span> 0.00 0.00
past_subscription 0.04 0.00
account_age 553.32 74.45
avg_part_price 17.45 84.69
fraud_part_proportion 0.63 1.00
countries_fraud_rating 0.00 0.03
billing_and_shipping_match 0.93 0.79
email_domain_fraud_rating 0.00 0.23
email_username_non_letter_rating 0.06 0.11
count 287,910.00 474.00

Make a copy for preprocessing and model building

Find and remove features with near-zero and zero variance

Features which have near-zero or zero variance are either constant features or features which contain a few unique values that occur with low frequencies, as explained here: https://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors. These features will weaken the model and should be removed.

## [1] "all features have non-zero variance"
##  [1] "near-zero variance: past_bitpay"                                                     
##  [2] "near-zero variance: past_check/money_order"                                          
##  [3] "near-zero variance: past_gift_certificate/coupon"                                    
##  [4] "near-zero variance: past_google_wallet_instant_buy"                                  
##  [5] "near-zero variance: past_googlecheckout"                                             
##  [6] "near-zero variance: past_paypal_ipn"                                                 
##  [7] "near-zero variance: past_prepaid_subscription"                                       
##  [8] "near-zero variance: past_purchase_order"                                             
##  [9] "near-zero variance: past_purchase_order_-_check"                                     
## [10] "near-zero variance: past_purchase_order_-_wire_transfer"                             
## [11] "near-zero variance: past_replacement_order"                                          
## [12] "near-zero variance: past_saved_credit_cards<span_class=\"alert\">_(beta_only)</span>"
## [13] "near-zero variance: past_subscription"                                               
## [14] "near-zero variance: account_age"                                                     
## [15] "near-zero variance: countries_fraud_rating"                                          
## [16] "near-zero variance: billing_and_shipping_match"                                      
## [17] "near-zero variance: email_username_non_letter_rating"
## [1] "these features have been removed"

Feature selection using logistic regression

I use the coefficients of a logistic regression model to identify and choose the most important features. I take a top down approach, and one-by-one remove each insignificant feature (i.e., a feature with a p-value greater than or equal to 0.05) until all features have low p-values.

From wiki: https://en.wikipedia.org/wiki/Logistic_regression: Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.

## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "fraud_part_proportion will be removed with pvalue = 0.9386"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "account_type will be removed with pvalue = 0.4486"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "delivery_address_type will be removed with pvalue = 0.2657"
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = fraud ~ ., family = "binomial", data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5069  -0.0612  -0.0450  -0.0137   4.7861  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -11.454986   0.730522 -15.681  < 2e-16 ***
## ip_and_shipping_match      -0.277665   0.119560  -2.322 0.020212 *  
## payment_method             -0.674834   0.074733  -9.030  < 2e-16 ***
## subtotal                    0.095082   0.024694   3.850 0.000118 ***
## order_count               -18.235486   4.529008  -4.026 5.66e-05 ***
## past_amazon_payments       -4.076796   0.598023  -6.817 9.29e-12 ***
## past_credit_card          -22.542501   4.036381  -5.585 2.34e-08 ***
## past_paypal                 2.857173   0.757194   3.773 0.000161 ***
## avg_part_price              0.079093   0.007178  11.019  < 2e-16 ***
## email_domain_fraud_rating   0.282926   0.020302  13.936  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7024.7  on 288383  degrees of freedom
## Residual deviance: 5071.1  on 288374  degrees of freedom
## AIC: 5091.1
## 
## Number of Fisher Scoring iterations: 16

We will train the models using the features above, which do not have near-zero or zero variance, and which have been deemed significant by the logistic regression model.

Create train/test sets

The train set is 80% of the data; the test set is 20%.

## [1] "these are the features which the model will use"
## Classes 'tbl_df', 'tbl' and 'data.frame':    288384 obs. of  10 variables:
##  $ ip_and_shipping_match    : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 2 1 1 2 ...
##  $ payment_method           : Factor w/ 5 levels "amazon payments",..: 2 4 2 2 5 1 2 4 2 4 ...
##  $ fraud                    : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ subtotal                 : num  168 19.9 143.2 59.9 321.8 ...
##  $ order_count              : num  1 7 10 3 226 1 21 3 1 1 ...
##  $ past_amazon_payments     : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ past_credit_card         : num  1 2 10 10 0 0 32 0 1 0 ...
##  $ past_paypal              : num  0 5 0 0 0 0 0 3 0 1 ...
##  $ avg_part_price           : num  28 9.95 7.16 14.96 35.75 ...
##  $ email_domain_fraud_rating: num  1 0 0 0.002 0 ...
## [1] "fraud distribution after using sample method: under"
## 
##  no yes 
## 383 380

Fit ML models and predict over test set

Recursive Partitioning (rpart):
From wiki: https://en.wikipedia.org/wiki/Recursive_partitioning
Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached.

Logistic Regression (LogitBoost):
From wiki: https://en.wikipedia.org/wiki/Logistic_regression
A statistical model that in its basic form uses a logistic function (a logistic function or logistic curve is a common “S” shape, or sigmoid curve) to model a binary dependent variable. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, and these are represented by one or more indicator variables.

Random Forest (rf):
From wiki: https://en.wikipedia.org/wiki/Random_forest
An ensemble learning method (multiple learning algorithms which obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone) for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode (the mode of a set of data values is the value that appears most often) of the classes (classification) or mean prediction (regression) of the individual trees.

Neural Network (nnet):
From wiki: https://en.wikipedia.org/wiki/Machine_learning#Artificial_neural_networks
The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.

From here: https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#An%20engineering%20approach
3.1 A simple neuron
An artificial neuron is a device with many inputs and one output. The neuron has two modes of operation: the training mode and the using mode. In the training mode, the neuron can be trained to fire (or not), for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output. If the input pattern does not belong in the taught list of input patterns, the firing rule is used to determine whether to fire or not.

3.2 Firing rules
The firing rule is an important concept in neural networks and accounts for their high flexibility. A firing rule determines how one calculates whether a neuron should fire for any input pattern. It relates to all the input patterns, not only the ones on which the node was trained.

A simple firing rule can be implemented by using Hamming distance technique. The rule goes as follows: Take a collection of training patterns for a node, some of which cause it to fire (the 1-taught set of patterns) and others which prevent it from doing so (the 0-taught set). Then the patterns not in the collection cause the node to fire if, on comparison, they have more input elements in common with the ‘nearest’ pattern in the 1-taught set than with the ‘nearest’ pattern in the 0-taught set. If there is a tie, then the pattern remains in the undefined state.

Support Vector Machine (svmLinear)
From wiki: https://en.wikipedia.org/wiki/Support-vector_machine
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

set.seed(1)

# create the control method for model-fit testing
# set some control options based on sampling method
if (sampling_method == 'under') {
  # if we undersample, choose crossfold validation with undersampling
  tr_method <- 'repeatedcv'
  tr_sampling <- 'down'
} else {
  # if we oversample, choose no control method and just put in a placeholder for the sampling method
  # when I choose crossfold validation with oversampling the training takes hours
  tr_method <- 'none'
  tr_sampling <- 'up'
}

control <- trainControl(method = tr_method,
                        sampling = tr_sampling,
                        repeats = 3,
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary,
                        savePredictions = TRUE)

# list of models
model_vec <- c('rpart', # recursive partitioning
               'LogitBoost', # logistic regression
               'rf', # random forest
               'nnet', # neural network
               'svmLinear') # support vector machine

# empty dataframe for results
results <- data.frame()

# empty list for models
model_list <- list()

# model loop
for (m in model_vec) {

  s <- Sys.time()
  
  # train each model
  # suppress garbage output from nnet model
  garbage <- capture.output(fit <- train(fraud ~ ., 
                                         data = train_set,
                                         method = m,
                                         trControl = control,
                                         preProcess = c('center','scale'),
                                         metric = 'ROC',
                                         tuneLength = 5))
  
  # store the model so I can choose which one I want to use later
  model_list[[m]] <- fit

  # predict over test set
  pred <- predict(fit, test_set)
  
  # get confusion matrix
  for_cm <- test_set$fraud %>% as.factor()
  cm <- confusionMatrix(pred, for_cm, positive = 'yes') 
  
  # get AUC
  roc <- roc(test_set$fraud, as.numeric(pred))
  area <- auc(roc)
   
  e <- Sys.time()
   
  # store the model results
  df <- data.frame(row.names = m,
                  time_to_train_in_sec = as.numeric(e-s, units = 'secs'),
                  accuracy = cm$overall['Accuracy'],
                  sensitivity = cm$byClass['Sensitivity'],
                  specificity = cm$byClass['Specificity'],
                  auc = area)
   
  # append to "results" df
  results <- rbind(results, df)
}

# add in the model results
model_list[['results']] <- results

Check out results

model time_to_train_in_sec accuracy sensitivity specificity auc
rf 73.4 86% 85% 86% 0.86
rpart 4.0 75% 93% 75% 0.84
LogitBoost 9.1 88% 77% 88% 0.82
nnet 67.3 85% 77% 85% 0.81
svmLinear 10.3 83% 71% 83% 0.77

Use the model with the highest AUC

Receiver Operating Characteristic (ROC) From wiki: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

The AUC (area under the ROC curve) is a measure of how well a model can distinguish between diagnostic groups (fraud/not-fraud). In other words, it shows the relationship between the sensitivity (true positive rate) and the specificity (true negative rate). This is one of the leading metrics of model selection, and we want it to be as close to 1 as possible, which indicates a perfect fit.

## Time difference of 15.79534 secs
## Area under the curve: 0.9141
## [1] "model used: rf"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     no    yes
##        no  247574     15
##        yes  40336    459
##                                           
##                Accuracy : 0.8601          
##                  95% CI : (0.8588, 0.8613)
##     No Information Rate : 0.9984          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0191          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.968354        
##             Specificity : 0.859901        
##          Pos Pred Value : 0.011251        
##          Neg Pred Value : 0.999939        
##              Prevalence : 0.001644        
##          Detection Rate : 0.001592        
##    Detection Prevalence : 0.141461        
##       Balanced Accuracy : 0.914128        
##                                           
##        'Positive' Class : yes             
## 

Interpretation of confusion matrix output

The accuracy, sensitivity, and specificity are ok, but they could be higher. For the model to be a valid model, the Accuracy should be higher than the No Information Rate, also called the Null Accuracy, which is the accuracy that could be achieved if we always assume that any order is the majority class (in this case, that’s “not-fraud”).

The Pos Pred Value, the proportion of correctly classified fraud orders out of all the orders that are classified as fraud is very low, which means the model produces a ton of false postitves. The model misclassifies tens of thousands of orders as fraud, shown in the bottom left of the “prediction/reference” table towards the top of the output. This is problematic because employees will be checking for fraud in orders that are definitely not-fraud.

Next Steps

This severe misclassification is most likely because the model does not have enough or good-enough features to accurately classify an order as fraud. The high class imabalance of the data almost certainly contributes to this. Two solutions are to: 1.) Add in additional features; 2.) Wait until we have more data so that the class imbalance is not so high.

Although not shown here, to address the class imbalance as best as I could, I have trained the models with both over- and undersampled data, and the results are roughly same.