Classification of Credit Card Fraud

Intro

Objective

According to the Data Breach Index, more than 5 million records are stolen every day, a concerning statistic which shows that fraud is still very common for both Card-Present and Card-Not-Present payments.

Therefore, as a Data Scientist, I will build models that predict which transactions are fraudulent and which are not.

Setup

First, we load the required packages.

library(dplyr)     # data manipulation verbs
library(tidyverse) # data wrangling and visualisation (also loads dplyr)
library(caret)     # resampling helpers and confusion matrix utilities

Get the Data

credit <- read.csv("card_transdata.csv")

colnames(credit)
#> [1] "distance_from_home"             "distance_from_last_transaction"
#> [3] "ratio_to_median_purchase_price" "repeat_retailer"               
#> [5] "used_chip"                      "used_pin_number"               
#> [7] "online_order"                   "fraud"

Column Explanation :

  • distance_from_home : the distance from home to where the transaction happened.
  • distance_from_last_transaction : the distance from where the last transaction happened.
  • ratio_to_median_purchase_price : the ratio of the transaction's purchase price to the median purchase price.
  • repeat_retailer : whether the transaction happened at the same retailer.
  • used_chip : whether the transaction went through a chip (credit card).
  • used_pin_number : whether the transaction used a PIN number.
  • online_order : whether the transaction was an online order.
  • fraud : whether the transaction is fraudulent.

Data Cleansing

Check the Data Type

glimpse(credit)
#> Rows: 1,000,000
#> Columns: 8
#> $ distance_from_home             <dbl> 57.8778566, 10.8299427, 5.0910795, 2.24…
#> $ distance_from_last_transaction <dbl> 0.31114001, 0.17559150, 0.80515259, 5.6…
#> $ ratio_to_median_purchase_price <dbl> 1.94593998, 1.29421881, 0.42771456, 0.3…
#> $ repeat_retailer                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
#> $ used_chip                      <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, …
#> $ used_pin_number                <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ online_order                   <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, …
#> $ fraud                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

From the result above, some columns are not stored in the correct data type. The binary indicator columns (columns 4 through 8) need to be converted into factors.

# convert the binary indicator columns (4-8) into factors
credit <- credit %>% 
  mutate_at(c(4,5,6,7,8), as.factor)
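
The same conversion can be written by column name instead of position, which is easier to read; a sketch using dplyr's across() (equivalent to the chunk above, so it is not run again here):

# equivalent: convert the indicator columns to factors by name
credit <- credit %>% 
  mutate(across(c(repeat_retailer, used_chip, used_pin_number,
                  online_order, fraud),
                as.factor))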

glimpse(credit)
#> Rows: 1,000,000
#> Columns: 8
#> $ distance_from_home             <dbl> 57.8778566, 10.8299427, 5.0910795, 2.24…
#> $ distance_from_last_transaction <dbl> 0.31114001, 0.17559150, 0.80515259, 5.6…
#> $ ratio_to_median_purchase_price <dbl> 1.94593998, 1.29421881, 0.42771456, 0.3…
#> $ repeat_retailer                <fct> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
#> $ used_chip                      <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, …
#> $ used_pin_number                <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ online_order                   <fct> 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, …
#> $ fraud                          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Each column has now been converted to the desired data type.

Check Missing Value

Now, we check whether there are any missing values in the data.

colSums(is.na(credit))
#>             distance_from_home distance_from_last_transaction 
#>                              0                              0 
#> ratio_to_median_purchase_price                repeat_retailer 
#>                              0                              0 
#>                      used_chip                used_pin_number 
#>                              0                              0 
#>                   online_order                          fraud 
#>                              0                              0

There are no missing values in our data, so we can continue to the next process.

Cross Validation

Splitting Train-Test (Logistic Regression)

We will divide the data into two parts: train and test. The train data is used to train the model, while the test data is used to test the model we have built. The model will predict the test data, and the predictions will be compared with the actual labels to validate model performance.

set.seed(404)

index <- sample(nrow(credit), nrow(credit)*0.8)

data_training_log <- credit[index, ]
data_testing_log <- credit[-index, ]

dim(data_training_log)
#> [1] 800000      8
dim(data_testing_log)
#> [1] 200000      8
prop.table(table(data_training_log$fraud))
#> 
#>          0          1 
#> 0.91278625 0.08721375

The two class proportions are clearly imbalanced (roughly 91% not fraud versus 9% fraud), so we need additional pre-processing to balance the target classes.
Because we have plenty of observations, we can use the downsampling method.

RNGkind(sample.kind = "Rounding") # use the pre-R-3.6 sampling algorithm for reproducibility
set.seed(100)

down_train <- downSample(x = data_training_log %>% select(-fraud),
                         y = data_training_log$fraud,
                         yname = "fraud")

dim(down_train)
#> [1] 139542      8
prop.table(table(down_train$fraud))
#> 
#>   0   1 
#> 0.5 0.5

The proportions of the two classes are now balanced.

Splitting Train-Test (K-NN)

As with the logistic regression model, we divide the data into train and test parts, train the model on the former, and validate its predictions against the actual labels of the latter.

set.seed(404)

index <- sample(nrow(credit), nrow(credit)*0.8)

data_training_knn <- credit[index, ]
data_testing_knn <- credit[-index, ]

dim(data_training_knn)
#> [1] 800000      8
dim(data_testing_knn)
#> [1] 200000      8
prop.table(table(data_training_knn$fraud))
#> 
#>         0         1 
#> 0.9129825 0.0870175

Again, the class proportions are imbalanced, so we need the same balancing pre-processing.
Because we have plenty of observations, we can use the downsampling method.

RNGkind(sample.kind = "Rounding")
set.seed(100)

# note: this overwrites the down_train object created in the logistic regression section
down_train <- downSample(x = data_training_knn %>% select(-fraud),
                         y = data_training_knn$fraud,
                         yname = "fraud")

dim(down_train)
#> [1] 139228      8
prop.table(table(down_train$fraud))
#> 
#>   0   1 
#> 0.5 0.5

The proportions of the two classes are now balanced.

Modelling

Logistic Regression

Modeling is done using logistic regression via the glm() function, with fraud as the response variable and all remaining variables as predictors.

model_logistic <- glm(fraud ~.,
                      data = down_train, 
                      family = "binomial")

summary(model_logistic)
#> 
#> Call:
#> glm(formula = fraud ~ ., family = "binomial", data = down_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -8.4904  -0.2389   0.0000   0.2479   3.4698  
#> 
#> Coefficients:
#>                                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                    -7.6764903  0.0604383 -127.01   <2e-16 ***
#> distance_from_home              0.0290385  0.0002494  116.44   <2e-16 ***
#> distance_from_last_transaction  0.0509616  0.0006472   78.74   <2e-16 ***
#> ratio_to_median_purchase_price  1.2103582  0.0074886  161.63   <2e-16 ***
#> repeat_retailer1               -1.4717650  0.0352623  -41.74   <2e-16 ***
#> used_chip1                     -1.1797316  0.0252959  -46.64   <2e-16 ***
#> used_pin_number1               -9.7644890  0.1757030  -55.57   <2e-16 ***
#> online_order1                   5.0839400  0.0485698  104.67   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 193011  on 139227  degrees of freedom
#> Residual deviance:  57993  on 139220  degrees of freedom
#> AIC: 58009
#> 
#> Number of Fisher Scoring iterations: 9

Prediction

data_testing_log$pred.Risk <- predict(object = model_logistic,
                             newdata = data_testing_log,
                             type = "response")
data_testing_log$pred.Label <- ifelse(data_testing_log$pred.Risk > 0.5, yes = 1, no = 0)

data_testing_log$pred.Label <- as.factor(data_testing_log$pred.Label)

head(data_testing_log[,c("pred.Risk", "pred.Label")])
#>       pred.Risk pred.Label
#> 1  0.0018770819          0
#> 6  0.0002659766          0
#> 7  0.0273749329          0
#> 13 0.9999965628          1
#> 15 0.4359674898          0
#> 17 0.0057032740          0

When the predicted probability for a test observation exceeds 0.5, the transaction is classified as fraud.

Model Evaluation

caret::confusionMatrix(data = data_testing_log$pred.Label,
                       reference = data_testing_log$fraud,
                       positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction      0      1
#>          0 170058    900
#>          1  12310  16732
#>                                          
#>                Accuracy : 0.9339         
#>                  95% CI : (0.9329, 0.935)
#>     No Information Rate : 0.9118         
#>     P-Value [Acc > NIR] : < 2.2e-16      
#>                                          
#>                   Kappa : 0.6821         
#>                                          
#>  Mcnemar's Test P-Value : < 2.2e-16      
#>                                          
#>             Sensitivity : 0.94896        
#>             Specificity : 0.93250        
#>          Pos Pred Value : 0.57613        
#>          Neg Pred Value : 0.99474        
#>              Prevalence : 0.08816        
#>          Detection Rate : 0.08366        
#>    Detection Prevalence : 0.14521        
#>       Balanced Accuracy : 0.94073        
#>                                          
#>        'Positive' Class : 1              
#> 
  • Accuracy : how many predictions are correct out of all the data (positive or negative).
  • Sensitivity/Recall : how many of the actually positive cases are predicted positive.
  • Specificity : how many of the actually negative cases are predicted negative.
  • Pos Pred Value/Precision : how many of the cases predicted positive are actually positive.

Using the cell counts from the confusion matrix above:

Accuracy <- (170058+16732)/nrow(data_testing_log)
Recall <- (16732)/(16732+900)
Specificity <- (170058)/(170058+12310)
Precision <- (16732)/(16732+12310)

performance <- cbind.data.frame(Accuracy, Recall, Specificity, Precision)
performance
#>   Accuracy    Recall Specificity Precision
#> 1  0.93395 0.9489564   0.9324991 0.5761311
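
The same metrics can also be computed programmatically from the prediction table, which avoids transcribing cell counts by hand; a minimal sketch using the objects created above:

# build the confusion table and derive the four metrics from its cells
cm <- table(pred = data_testing_log$pred.Label,
            actual = data_testing_log$fraud)

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

data.frame(Accuracy    = (TP + TN) / sum(cm),
           Recall      = TP / (TP + FN),
           Specificity = TN / (TN + FP),
           Precision   = TP / (TP + FP))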

Based on the results of the confusionMatrix above, the model classifies 93.40% of all transactions correctly (accuracy). Of all the actual Not Fraud cases, the model correctly predicts 93.25% (specificity). Of all the actual Fraud cases, the model correctly predicts 94.90% (sensitivity/recall). Of all the transactions the model predicts as fraud, 57.61% are actually fraud (precision).

Interpretation Model

exp(model_logistic$coefficients) %>% 
  data.frame()
#>                                           .
#> (Intercept)                    4.635991e-04
#> distance_from_home             1.029464e+00
#> distance_from_last_transaction 1.052282e+00
#> ratio_to_median_purchase_price 3.354686e+00
#> repeat_retailer1               2.295200e-01
#> used_chip1                     3.073612e-01
#> used_pin_number1               5.745612e-05
#> online_order1                  1.614088e+02

Model interpretation: the odds ratio for online_order1 is 1.614088e+02, i.e. roughly 161.

This means that the odds of fraud for an online purchase are roughly 161 times the odds for an offline transaction, holding the other predictors constant.
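
This value can be pulled straight from the fitted model; a quick sketch (the printed value matches the table above):

# odds ratio for online orders, taken directly from the fitted coefficients
exp(coef(model_logistic)["online_order1"])
#> online_order1 
#>      161.4088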

K-Nearest Neighbour

Splitting Train-Test

In the k-NN model, the predictors and the label (target variable) are stored in separate objects.

# predictors
knn_train_x <- down_train[,-8]
knn_test_x <- data_testing_knn[,-8]

#target
knn_train_y <- down_train$fraud
knn_test_y <- data_testing_knn$fraud
glimpse(knn_train_x)
#> Rows: 139,228
#> Columns: 7
#> $ distance_from_home             <dbl> 17.3637361, 3.6355158, 34.8613327, 2.43…
#> $ distance_from_last_transaction <dbl> 0.78756528, 16.12893989, 0.01833721, 23…
#> $ ratio_to_median_purchase_price <dbl> 1.2323597, 0.3165404, 2.0135199, 0.1755…
#> $ repeat_retailer                <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ used_chip                      <fct> 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
#> $ used_pin_number                <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
#> $ online_order                   <fct> 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, …

Scaling the Predictor Data

The predictor data will be scaled using z-score standardization. The test data must be scaled with the center and scale parameters of the train data, because the test data is treated as unseen data.

# factor columns are converted to numeric codes before z-score scaling
knn_train_xs <- scale(knn_train_x %>% mutate_if(is.factor, as.numeric))
knn_test_xs <- scale(knn_test_x %>% mutate_if(is.factor, as.numeric),
                     center = attr(knn_train_xs, "scaled:center"),
                     scale = attr(knn_train_xs, "scaled:scale"))
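
As a quick sanity check of the scaling (a minimal sketch), the scaled training predictors should have a mean of approximately 0 and a standard deviation of approximately 1:

# means ~0 and standard deviations ~1 confirm the z-score standardization
round(colMeans(knn_train_xs), 3)
round(apply(knn_train_xs, 2, sd), 3)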

Prediction

We will find the optimum value of k. To avoid a tie during majority voting:

  • k must be odd if the number of target classes is even
  • k must be even if the number of target classes is odd
  • k cannot be a multiple of the number of target classes

round(sqrt(nrow(knn_train_xs)))
#> [1] 373
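
Had the rounded square root come out even, a small hypothetical guard could bump it to the next odd number (a sketch only; it changes nothing here because the result is already odd):

# force k to be odd so binary majority voting cannot tie
k <- round(sqrt(nrow(knn_train_xs)))
if (k %% 2 == 0) k <- k + 1
k
#> [1] 373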

Since 373 is an odd number, we can use k = 373 directly.
Using this value of k, we will predict knn_test_xs from the knn_train_xs data and the knn_train_y labels.

library(class)

knn_pred <- knn(train = knn_train_xs,
                test = knn_test_xs,
                cl = knn_train_y,
                k = 373)

Model Evaluation

caret::confusionMatrix(data = knn_pred, reference = knn_test_y, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction      0      1
#>          0 178260    171
#>          1   3951  17618
#>                                         
#>                Accuracy : 0.9794        
#>                  95% CI : (0.9788, 0.98)
#>     No Information Rate : 0.9111        
#>     P-Value [Acc > NIR] : < 2.2e-16     
#>                                         
#>                   Kappa : 0.884         
#>                                         
#>  Mcnemar's Test P-Value : < 2.2e-16     
#>                                         
#>             Sensitivity : 0.99039       
#>             Specificity : 0.97832       
#>          Pos Pred Value : 0.81682       
#>          Neg Pred Value : 0.99904       
#>              Prevalence : 0.08894       
#>          Detection Rate : 0.08809       
#>    Detection Prevalence : 0.10784       
#>       Balanced Accuracy : 0.98435       
#>                                         
#>        'Positive' Class : 1             
#> 

Based on the results of the confusionMatrix above, the model classifies 97.94% of all transactions correctly (accuracy). Of all the actual Not Fraud cases, the model correctly predicts 97.83% (specificity). Of all the actual Fraud cases, the model correctly predicts 99.04% (sensitivity/recall). Of all the transactions the model predicts as fraud, 81.68% are actually fraud (precision).

Conclusion

Comparing the two models, the K-NN model is better at correctly identifying transactions that are actually Fraud: its recall/sensitivity of 99.04% is higher than the 94.90% achieved by the logistic regression model.

If I were a fraud analyst, I would pay close attention to the recall metric, because I care most about one particular class, namely fraud. I want transactions with the potential for fraud to be predicted as fraud, so that they can later be reviewed to confirm whether they are truly fraudulent, because I want to minimize possible losses.
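
As a compact recap, the recall values reported by the two confusionMatrix outputs above can be tabulated side by side (values copied from those outputs, not recomputed):

# recall comparison, values taken from the confusionMatrix outputs above
data.frame(model  = c("Logistic Regression", "K-NN"),
           recall = c(0.94896, 0.99039))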