Credit card fraud costs banks millions of dollars a year in reimbursement, legal fees, and mitigation. Denying fraudulent transactions before they occur would greatly reduce those expenses.
In this paper, I use a publicly available dataset of 6.3 million credit card transactions, each categorized as Fraud or Not Fraud.
The result variable is categorical and binary.
Based on the result variable type, I fit four different machine learning models and determine which one fits the test set best based on the ROC curve and accuracy. False-positives are a major concern, since any transaction flagged as fraudulent will be denied, forcing the customer to call customer service to complete it. I test each model against 100,000 random transactions categorized as Not Fraud to estimate the false-positive rate. I also capture the time to process the 100,000 transactions to estimate the computational intensity of each model.
The results are impressive: nearly every model correctly identified 97%+ of the Fraud transactions.
Let’s see how this works!!
Data Source: https://www.kaggle.com/ntnu-testimon/paysim1
library(plyr)
library(tidyverse)
library(caret)
library(GGally)
library(stringr)
library(rattle)
library(pROC)
library(ROCR)
set.seed(317)
fraud_raw <- read_csv("PS_20174392719_1491204439457_log.csv")
glimpse(fraud_raw)
## Observations: 6,362,620
## Variables: 11
## $ step <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ type <chr> "PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "...
## $ amount <dbl> 9839.64, 1864.28, 181.00, 181.00, 11668.14, 781...
## $ nameOrig <chr> "C1231006815", "C1666544295", "C1305486145", "C...
## $ oldbalanceOrg <dbl> 170136.0, 21249.0, 181.0, 181.0, 41554.0, 53860...
## $ newbalanceOrig <dbl> 160296.36, 19384.72, 0.00, 0.00, 29885.86, 4604...
## $ nameDest <chr> "M1979787155", "M2044282225", "C553264065", "C3...
## $ oldbalanceDest <dbl> 0, 0, 0, 21182, 0, 0, 0, 0, 0, 41898, 10845, 0,...
## $ newbalanceDest <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,...
## $ isFraud <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ isFlaggedFraud <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
The nameOrig and nameDest variables are technically categorical, but there are far too many distinct values for them to be useful in modeling. I noticed that the names have a letter prefix, which may be important.
fraud_df <- fraud_raw %>%
mutate(name_orig_first = str_sub(nameOrig,1,1)) %>%
mutate(name_dest_first = str_sub(nameDest, 1, 1)) %>%
select(-nameOrig, -nameDest)
unique(fraud_df$name_dest_first)
## [1] "M" "C"
There are two prefixes in nameDest so this will be converted to a factor.
fraud_df$name_dest_first <- as.factor(fraud_df$name_dest_first)
table(fraud_df$name_dest_first)
##
## C M
## 4211125 2151495
unique(fraud_df$name_orig_first)
## [1] "C"
There is only a single prefix in nameOrig, so it is not useful and will be removed along with isFlaggedFraud, which looks like someone else’s prediction.
Drop them and rearrange the columns into a more logical order.
fraud_df2 <- fraud_df %>%
select(-name_orig_first, -isFlaggedFraud) %>%
select(isFraud, type, step, everything())
Take another glimpse
glimpse(fraud_df2)
## Observations: 6,362,620
## Variables: 9
## $ isFraud <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ type <chr> "PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", ...
## $ step <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ amount <dbl> 9839.64, 1864.28, 181.00, 181.00, 11668.14, 78...
## $ oldbalanceOrg <dbl> 170136.0, 21249.0, 181.0, 181.0, 41554.0, 5386...
## $ newbalanceOrig <dbl> 160296.36, 19384.72, 0.00, 0.00, 29885.86, 460...
## $ oldbalanceDest <dbl> 0, 0, 0, 21182, 0, 0, 0, 0, 0, 41898, 10845, 0...
## $ newbalanceDest <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ name_dest_first <fctr> M, M, C, C, M, M, M, M, M, C, C, M, M, M, M, ...
The type & isFraud columns are categorical and will be changed to factors
fraud_df2$type <- as.factor(fraud_df2$type)
fraud_df2$isFraud <- as.factor(fraud_df2$isFraud)
Some models we’ll use do not like 1’s and 0’s in the result, so we will recode them.
fraud_df2$isFraud <- recode_factor(fraud_df2$isFraud, `0` = "No", `1` = "Yes")
summary(fraud_df2)
## isFraud type step amount
## No :6354407 CASH_IN :1399284 Min. : 1.0 Min. : 0
## Yes: 8213 CASH_OUT:2237500 1st Qu.:156.0 1st Qu.: 13390
## DEBIT : 41432 Median :239.0 Median : 74872
## PAYMENT :2151495 Mean :243.4 Mean : 179862
## TRANSFER: 532909 3rd Qu.:335.0 3rd Qu.: 208721
## Max. :743.0 Max. :92445517
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 14208 Median : 0 Median : 132706
## Mean : 833883 Mean : 855114 Mean : 1100702
## 3rd Qu.: 107315 3rd Qu.: 144258 3rd Qu.: 943037
## Max. :59585040 Max. :49585040 Max. :356015889
## newbalanceDest name_dest_first
## Min. : 0 C:4211125
## 1st Qu.: 0 M:2151495
## Median : 214661
## Mean : 1224996
## 3rd Qu.: 1111909
## Max. :356179279
There are only 8213 records where isFraud is true.
I’ll take that into account when creating the training and test sets.
My plan is to create training and test datasets that are 50% Fraud / 50% Not Fraud.
fraud_trans <- fraud_df2 %>%
filter(isFraud == "Yes")
summary(fraud_trans)
## isFraud type step amount
## No : 0 CASH_IN : 0 Min. : 1.0 Min. : 0
## Yes:8213 CASH_OUT:4116 1st Qu.:181.0 1st Qu.: 127091
## DEBIT : 0 Median :367.0 Median : 441423
## PAYMENT : 0 Mean :368.4 Mean : 1467967
## TRANSFER:4097 3rd Qu.:558.0 3rd Qu.: 1517771
## Max. :743.0 Max. :10000000
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 125822 1st Qu.: 0 1st Qu.: 0
## Median : 438983 Median : 0 Median : 0
## Mean : 1649668 Mean : 192393 Mean : 544250
## 3rd Qu.: 1517771 3rd Qu.: 0 3rd Qu.: 147829
## Max. :59585040 Max. :49585040 Max. :236230517
## newbalanceDest name_dest_first
## Min. : 0 C:8213
## 1st Qu.: 0 M: 0
## Median : 4676
## Mean : 1279708
## 3rd Qu.: 1058725
## Max. :236726495
Interesting. When the type is CASH_IN, DEBIT, or PAYMENT, there are no fraud cases. This should be taken into consideration when preparing the training and test sets.
In every case of fraud, name_dest_first had a code of “C”, so we can filter the main dataset to eliminate all the “M” destinations.
The fraud amount maxes out at 10,000,000, so we will also filter out any transactions above that amount.
Remove the insignificant variables and keep only CASH_OUT and TRANSFER transactions at or below 10,000,000.
fraud_df3 <- fraud_df2 %>%
filter(type %in% c("CASH_OUT", "TRANSFER")) %>%
filter(name_dest_first == "C") %>%
filter(amount <= 10000000) %>%
select(-name_dest_first)
summary(fraud_df3)
## isFraud type step amount
## No :2759753 CASH_IN : 0 Min. : 1 Min. : 0
## Yes: 8213 CASH_OUT:2237500 1st Qu.:155 1st Qu.: 82900
## DEBIT : 0 Median :236 Median : 171092
## PAYMENT : 0 Mean :242 Mean : 300426
## TRANSFER: 530466 3rd Qu.:332 3rd Qu.: 306252
## Max. :743 Max. :10000000
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 127785
## Median : 311 Median : 0 Median : 554620
## Mean : 47669 Mean : 16106 Mean : 1670726
## 3rd Qu.: 31013 3rd Qu.: 0 3rd Qu.: 1730370
## Max. :59585040 Max. :49585040 Max. :356015889
## newbalanceDest
## Min. : 0
## 1st Qu.: 326982
## Median : 826727
## Mean : 1998711
## 3rd Qu.: 2116020
## Max. :356179279
This brings the full dataset down to about 2.8M records, a reduction of roughly 56%, which should eliminate a lot of noise.
not_fraud <- fraud_df3 %>%
filter(isFraud == "No") %>%
sample_n(8213)
is_fraud <- fraud_df3 %>%
filter(isFraud == "Yes")
full_sample <- rbind(not_fraud, is_fraud) %>%
arrange(step)
Note that step indicates the hour within the month that this data was captured, so these plots should be read as time series.
ggplot(full_sample, aes(x = step, col = isFraud)) +
geom_histogram(bins = 743)
There is a pattern in the Not Fraud cases but no discernible pattern in the Fraud cases from this plot.
ggplot(is_fraud, aes(x = step)) +
geom_histogram(bins = 743)
Nothing.
ggpairs(full_sample)
There is a high positive correlation between oldbalanceOrg and newbalanceOrig, and also between oldbalanceDest and newbalanceDest.
We’ll deal with them systematically further down.
ggplot(full_sample, aes(type, amount, color = isFraud)) +
geom_point(alpha = 0.01) +
geom_jitter()
It’s all over the place. No pattern.
summary(full_sample)
## isFraud type step amount
## No :8213 CASH_IN : 0 Min. : 1.0 Min. : 0
## Yes:8213 CASH_OUT:10736 1st Qu.:161.0 1st Qu.: 99165
## DEBIT : 0 Median :282.0 Median : 232565
## PAYMENT : 0 Mean :304.4 Mean : 881246
## TRANSFER: 5690 3rd Qu.:405.0 3rd Qu.: 634911
## Max. :743.0 Max. :10000000
## oldbalanceOrg newbalanceOrig oldbalanceDest
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 150 1st Qu.: 0 1st Qu.: 0
## Median : 63365 Median : 0 Median : 140649
## Mean : 845927 Mean : 103855 Mean : 1099183
## 3rd Qu.: 465097 3rd Qu.: 0 3rd Qu.: 938262
## Max. :59585040 Max. :49585040 Max. :236230517
## newbalanceDest
## Min. : 0
## 1st Qu.: 0
## Median : 482809
## Mean : 1627235
## 3rd Qu.: 1694654
## Max. :236726495
preproc_model <- preProcess(fraud_df3[, -1],
method = c("center", "scale", "nzv"))
fraud_preproc <- predict(preproc_model, newdata = fraud_df3[, -1])
fraud_pp_w_result <- cbind(isFraud = fraud_df3$isFraud, fraud_preproc)
summary(fraud_pp_w_result)
## isFraud type step amount
## No :2759753 CASH_IN : 0 Min. :-1.7008 Min. :-0.49551
## Yes: 8213 CASH_OUT:2237500 1st Qu.:-0.6138 1st Qu.:-0.35878
## DEBIT : 0 Median :-0.0420 Median :-0.21332
## PAYMENT : 0 Mean : 0.0000 Mean : 0.00000
## TRANSFER: 530466 3rd Qu.: 0.6357 3rd Qu.: 0.00961
## Max. : 3.5369 Max. :15.99817
## oldbalanceOrg oldbalanceDest newbalanceDest
## Min. : -0.18960 Min. :-0.42399 Min. :-0.47994
## 1st Qu.: -0.18960 1st Qu.:-0.39156 1st Qu.:-0.40142
## Median : -0.18836 Median :-0.28324 Median :-0.28142
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: -0.06625 3rd Qu.: 0.01514 3rd Qu.: 0.02817
## Max. :236.80038 Max. :89.92368 Max. :85.04701
Notice that the mean of every numeric field is now zero and the standard deviation is 1. This is the result of centering and scaling, which puts all numeric variables on the same scale.
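The object high_cor_removed used in the next step is not created anywhere above; presumably the highly correlated balance columns noted earlier were dropped here. Below is a minimal sketch of that missing step, assuming caret’s findCorrelation with a 0.75 cutoff (the author’s actual cutoff and approach are unknown).
# Assumed reconstruction of the step that creates `high_cor_removed`:
# flag one member of each highly correlated pair of numeric columns and drop it.
# The 0.75 cutoff is an assumption, not a value documented in the original.
num_cols <- sapply(fraud_pp_w_result, is.numeric)
cor_matrix <- cor(fraud_pp_w_result[, num_cols])
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)
high_cor_removed <- fraud_pp_w_result %>%
select(-one_of(high_cor))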
model_df <- high_cor_removed
is_fraud <- model_df %>%
filter(isFraud == "Yes")
not_fraud <- model_df %>%
filter(isFraud == "No") %>%
sample_n(8213)
# To mix up the sample set I'll arrange by `step`
model_full_sample <- rbind(is_fraud, not_fraud) %>%
arrange(step)
in_train <- createDataPartition(y = model_full_sample$isFraud, p = .75,
list = FALSE)
train <- model_full_sample[in_train, ]
test <- model_full_sample[-in_train, ]
Play Jeopardy theme here
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 11146618 595.3 17371378 927.8 11372528 607.4
## Vcells 254681038 1943.1 399061229 3044.6 394683154 3011.2
We will use 10-fold cross-validation repeated three times for every model so that we can compare apples to apples.
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
In all of the models we will follow a consistent pattern:
1. fit the model to the training data
2. compare the model against the data it was trained on
3. compare the model against the test dataset that was withheld from model building
4. compare the model against a 100k sample of Not Fraud cases to estimate the expected false-positives
For each model we will also track the time to build it.
Note: the time to build a model and the time to apply it to new data are very different.
Model tuning, especially for black-box models like Support Vector Machines, can take a long time, but the time to apply the fitted model to new data is usually insignificant.
Do not be afraid to spend a lot of time training a model if it gives better results.
Recursive partitioning
https://cran.r-project.org/web/packages/rpart/rpart.pdf
https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
start_time <- Sys.time()
rpart_model = train(isFraud ~ .,
data = train,
method = "rpart",
tuneLength = 10,
metric = "ROC",
trControl = control,
parms=list(split='information'))
end_time <- Sys.time()
end_time - start_time
## Time difference of 15.95665 secs
rpart_train_pred <- predict(rpart_model, train)
confusionMatrix(train$isFraud, rpart_train_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 5626 534
## Yes 96 6064
##
## Accuracy : 0.9489
## 95% CI : (0.9448, 0.9527)
## No Information Rate : 0.5356
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8977
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9832
## Specificity : 0.9191
## Pos Pred Value : 0.9133
## Neg Pred Value : 0.9844
## Prevalence : 0.4644
## Detection Rate : 0.4567
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9511
##
## 'Positive' Class : No
##
Accuracy = 0.9489; a lot of false-positives
Sensitivity : 0.9832
Specificity : 0.9191
rpart_test_pred <- predict(rpart_model, test)
confusionMatrix(test$isFraud, rpart_test_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1884 169
## Yes 33 2020
##
## Accuracy : 0.9508
## 95% CI : (0.9437, 0.9572)
## No Information Rate : 0.5331
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9016
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9828
## Specificity : 0.9228
## Pos Pred Value : 0.9177
## Neg Pred Value : 0.9839
## Prevalence : 0.4669
## Detection Rate : 0.4588
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9528
##
## 'Positive' Class : No
##
Accuracy : 0.9508
Sensitivity : 0.9828
Specificity : 0.9228
big_no_sample <- model_df %>%
filter(isFraud == "No") %>%
sample_n(100000)
Also, capture the time to apply the model to 100,000 records.
start_time <- Sys.time()
rpart_big_no_pred <- predict(rpart_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.2722321 secs
confusionMatrix(big_no_sample$isFraud, rpart_big_no_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 91512 8488
## Yes 0 0
##
## Accuracy : 0.9151
## 95% CI : (0.9134, 0.9168)
## No Information Rate : 0.9151
## P-Value [Acc > NIR] : 0.5029
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9151
## Neg Pred Value : NaN
## Prevalence : 0.9151
## Detection Rate : 0.9151
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : No
##
Accuracy : 0.9151
Sensitivity : 1.0000
Specificity : 0.0000
False-positives: 8488
Run-time: 0.2806261 secs
rpart_probs <- predict(rpart_model, test, type = "prob")
rpart_ROC <- roc(response = test$isFraud,
predictor = rpart_probs$Yes,
levels = levels(test$isFraud))
plot(rpart_ROC, col = "blue")
auc(rpart_ROC)
## Area under the curve: 0.9875
print(rpart_model)
## CART
##
## 12320 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 11088, 11088, 11088, 11088, 11088, 11088, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.001623377 0.9859910 0.9093074 0.9821429
## 0.002922078 0.9787281 0.8964286 0.9725108
## 0.003138528 0.9773460 0.8943182 0.9711580
## 0.005519481 0.9681319 0.8939394 0.9239719
## 0.005681818 0.9669220 0.8954545 0.9163420
## 0.006331169 0.9658423 0.8971861 0.9090368
## 0.018073593 0.8945235 0.7704004 0.9938853
## 0.028896104 0.8945235 0.7704004 0.9938853
## 0.040665584 0.8441558 0.8235390 0.8647727
## 0.688311688 0.6340639 0.9234307 0.3446970
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001623377.
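The rattle package loaded at the top is never used below; presumably it was intended for visualizing the tree. A minimal sketch of that step (my assumption, not shown in the original):
# Assumed: plot the final rpart tree with rattle's fancyRpartPlot().
fancyRpartPlot(rpart_model$finalModel)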
C5.0
Decision trees and rule-based models for pattern recognition.
https://cran.r-project.org/web/packages/C50/C50.pdf
grid <- expand.grid( .winnow = c(FALSE),
.trials=c(50, 100, 150, 200),
.model="tree" )
start_time <- Sys.time()
c5_model <- train(isFraud ~ .,
data = train,
method = "C5.0",
trControl = control,
metric = "ROC",
tuneGrid = grid,
verbose = FALSE)
end_time <- Sys.time()
end_time - start_time
## Time difference of 15.54367 secs
print(c5_model)
## C5.0
##
## 12320 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 11088, 11088, 11088, 11088, 11088, 11088, ...
## Resampling results across tuning parameters:
##
## trials ROC Sens Spec
## 50 0.3369646 NaN NaN
## 100 0.3369646 NaN NaN
## 150 0.3369646 NaN NaN
## 200 0.3369646 NaN NaN
##
## Tuning parameter 'model' was held constant at a value of tree
##
## Tuning parameter 'winnow' was held constant at a value of FALSE
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were trials = 50, model = tree
## and winnow = FALSE.
c5_pred_train <- predict(c5_model, train)
confusionMatrix(train$isFraud, c5_pred_train, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6122 38
## Yes 28 6132
##
## Accuracy : 0.9946
## 95% CI : (0.9932, 0.9959)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9893
## Mcnemar's Test P-Value : 0.2679
##
## Sensitivity : 0.9938
## Specificity : 0.9954
## Pos Pred Value : 0.9955
## Neg Pred Value : 0.9938
## Prevalence : 0.5008
## Detection Rate : 0.4977
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9946
##
## 'Positive' Class : Yes
##
Accuracy : 0.9946; WOW!!
Sensitivity : 0.9938
Specificity : 0.9954
c5_pred_test <- predict(c5_model, test)
confusionMatrix(test$isFraud, c5_pred_test, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2012 41
## Yes 29 2024
##
## Accuracy : 0.983
## 95% CI : (0.9785, 0.9867)
## No Information Rate : 0.5029
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9659
## Mcnemar's Test P-Value : 0.1886
##
## Sensitivity : 0.9801
## Specificity : 0.9858
## Pos Pred Value : 0.9859
## Neg Pred Value : 0.9800
## Prevalence : 0.5029
## Detection Rate : 0.4929
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9830
##
## 'Positive' Class : Yes
##
Accuracy : 0.983; WOW!! WOW!!
Sensitivity : 0.9801
Specificity : 0.9858
start_time <- Sys.time()
c5_pred_big_no <- predict(c5_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 4.985033 secs
confusionMatrix(big_no_sample$isFraud, c5_pred_big_no, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 97795 2205
## Yes 0 0
##
## Accuracy : 0.978
## 95% CI : (0.977, 0.9789)
## No Information Rate : 0.978
## P-Value [Acc > NIR] : 0.5057
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.97795
## Prevalence : 0.02205
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : Yes
##
Accuracy : 0.978
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 2205
Run-time: 5.715282 secs
c5_probs <- predict(c5_model, test, type = "prob")
c5_ROC <- roc(response = test$isFraud,
predictor = c5_probs$Yes,
levels = levels(test$isFraud))
plot(c5_ROC, col = "red")
auc(c5_ROC)
## Area under the curve: 0.9982
Random Forest
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
https://www.r-bloggers.com/random-forests-in-r/
# mtry is the only tuning parameter caret exposes for method = "rf";
# ntree cannot be tuned through the grid, so it is passed straight to randomForest().
grid <- expand.grid(.mtry = 5)
start_time <- Sys.time()
rf_model <- train(isFraud ~ .,
data = train,
method = "rf",
metric = "ROC",
tuneGrid = grid,
ntree = 500,
trControl = control)
end_time <- Sys.time()
end_time - start_time
## Time difference of 7.477808 mins
library(randomForest)
print(rf_model$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, TuneGrid = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 2.39%
## Confusion matrix:
## No Yes class.error
## No 5941 219 0.03555195
## Yes 76 6084 0.01233766
plot(rf_model$finalModel)
This is an important plot. Notice that the error levels out at around 100 trees, which shows that the Random Forest model is an effective learner on this dataset. If the data contained many more variables it would take more trees to reach the plateau, but it usually happens before 500 trees. The forest could safely be reduced to 100 trees without any significant performance impact.
This plot should always be checked for Random Forests to determine the best cutoff point for the number of trees.
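As an illustration of that point, the forest could be refit directly with randomForest() and 100 trees; the ntree value below is an assumption read off the error plot, not a setting used elsewhere in this analysis.
# Sketch only: refit with 100 trees to confirm the error plateau.
# mtry = 5 matches the caret-selected value; ntree = 100 is an assumed cutoff.
rf_small <- randomForest(isFraud ~ ., data = train, mtry = 5, ntree = 100)
print(rf_small)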
varImpPlot(rf_model$finalModel)
I’m a little surprised that oldbalanceOrg is the most important variable. The amount and oldbalanceDest variables are also influential.
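For a numeric view of the same ranking (not shown in the original), randomForest’s importance() can be printed directly:
# Numeric variable importance (mean decrease in Gini) behind the plot above.
importance(rf_model$finalModel)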
rf_train_pred <- predict(rf_model, train)
confusionMatrix(train$isFraud, rf_train_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6160 0
## Yes 0 6160
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : Yes
##
Accuracy : 1; Obviously overfit
Sensitivity : 1.0
Specificity : 1.0
rf_test_pred <- predict(rf_model, test)
confusionMatrix(test$isFraud, rf_test_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1991 62
## Yes 20 2033
##
## Accuracy : 0.98
## 95% CI : (0.9753, 0.9841)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9601
## Mcnemar's Test P-Value : 5.963e-06
##
## Sensitivity : 0.9704
## Specificity : 0.9901
## Pos Pred Value : 0.9903
## Neg Pred Value : 0.9698
## Prevalence : 0.5102
## Detection Rate : 0.4951
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9802
##
## 'Positive' Class : Yes
##
Accuracy : 0.98; not bad.
Sensitivity : 0.9704
Specificity : 0.9901
start_time <- Sys.time()
rf_big_no_pred <- predict(rf_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 2.964628 secs
confusionMatrix(big_no_sample$isFraud, rf_big_no_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 96800 3200
## Yes 0 0
##
## Accuracy : 0.968
## 95% CI : (0.9669, 0.9691)
## No Information Rate : 0.968
## P-Value [Acc > NIR] : 0.5047
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.968
## Prevalence : 0.032
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : Yes
##
Accuracy : 0.968
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 3200
Run-time: 2.538139 secs
rf_probs <- predict(rf_model, test, type = "prob")
rf_ROC <- roc(response = test$isFraud,
predictor = rf_probs$Yes,
levels = levels(test$isFraud))
plot(rf_ROC, col = "green")
auc(rf_ROC)
## Area under the curve: 0.9977
Support Vector Machines
https://cran.r-project.org/web/packages/e1071/e1071.pdf
https://www.r-bloggers.com/machine-learning-using-support-vector-machines/
start_time <- Sys.time()
svm_model <- train(isFraud ~ .,
data = train,
method = "svmRadial", # Radial kernel
tuneLength = 3, # 3 values of the cost function
metric="ROC",
trControl=control)
end_time <- Sys.time()
end_time - start_time
## Time difference of 30.57211 mins
Time difference of 30.57211 mins; Ouch!! That’s a long time.
print(svm_model$finalModel)
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 1
##
## Number of Support Vectors : 3712
##
## Objective Function Value : -2174.618
## Training error : 0.067532
## Probability model included.
svm_train_pred <- predict(svm_model, train)
confusionMatrix(train$isFraud, svm_train_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 0 0
## Yes 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
Accuracy : 0.9305
Sensitivity : 0.9399
Specificity : 0.9215
svm_test_pred <- predict(svm_model, test)
confusionMatrix(test$isFraud, svm_test_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 0 0
## Yes 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
Accuracy : 0.9323
Sensitivity : 0.9361
Specificity : 0.9285
start_time <- Sys.time()
svm_big_no_pred <- predict(svm_model, big_no_sample)
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.2067931 secs
confusionMatrix(big_no_sample$isFraud, svm_big_no_pred, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 0 0
## Yes 0 0
##
## Accuracy : NaN
## 95% CI : (NA, NA)
## No Information Rate : NA
## P-Value [Acc > NIR] : NA
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : NaN
## Detection Rate : NaN
## Detection Prevalence : NaN
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
Accuracy : 0.9341
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 6588
Run-time: 49.61945 secs
Computational intensity and poorer performance rule out using SVM.
svm_probs <- predict(svm_model, test, type = "prob")
svm_ROC <- roc(response = test$isFraud,
predictor = svm_probs$Yes,
levels = levels(test$isFraud))
plot(svm_ROC, col = "black")
auc(svm_ROC)
## Area under the curve: 0.9786
plot(rpart_ROC, col = "blue")
plot(c5_ROC, col = "red", add = TRUE)
plot(rf_ROC, col = "green", add = TRUE)
plot(svm_ROC, col = "black", add = TRUE)
From the ROC curves it looks like the choice is between C5.0 and Random Forest.
sort(c(rpart = auc(rpart_ROC), rf = auc(rf_ROC),
c5 = auc(c5_ROC), svm = auc(svm_ROC)))
## svm rpart rf c5
## 0.9785829 0.9875228 0.9977197 0.9981556
Random Forest:
Accuracy : 0.968
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 3200
Run-time: 2.538139 secs
C5.0:
Accuracy : 0.978
Sensitivity : 0.00000
Specificity : 1.00000
False-positives: 2205
Run-time: 5.715282 secs
As usual, the decision of which model to use is not straightforward.
The Random Forest predicts 100,000 transactions in less than half the time that C5.0 does.
C5.0 beats Random Forest on every other metric. The key metric I would focus on is the number of false-positives, since those customers’ transactions would be denied even though they were not fraudulent.
I would choose the C5.0 model.
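As a rough illustration of that customer impact, here is a small sketch using the false-positive counts from the 100,000-transaction runs above; it simply converts them into estimated decline rates for legitimate transactions.
# False-positive counts taken from the 100k Not Fraud confusion matrices above.
fp_counts <- c(rpart = 8488, c5 = 2205, rf = 3200, svm = 6588)
fp_counts / 100000   # estimated share of legitimate transactions that would be declined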