Classification of Credit Card Fraud
Intro
Objective
According to the Data Breach Index, more than 5 million records are stolen every day, a concerning statistic that shows fraud is still very common for both Card-Present and Card-Not-Present payments.
Therefore, as a Data Scientist, I will build models that predict which transactions are fraudulent and which are not.
Setup
First, we load the required packages.
library(dplyr)
library(tidyverse)
library(caret)
Get the Data
credit <- read.csv("card_transdata.csv")
colnames(credit)
#> [1] "distance_from_home" "distance_from_last_transaction"
#> [3] "ratio_to_median_purchase_price" "repeat_retailer"
#> [5] "used_chip" "used_pin_number"
#> [7] "online_order" "fraud"
Column Explanation :
- distance_from_home : the distance from home to where the transaction happened.
- distance_from_last_transaction : the distance from where the last transaction happened.
- ratio_to_median_purchase_price : ratio of the transaction's purchase price to the median purchase price.
- repeat_retailer : whether the transaction happened at the same retailer.
- used_chip : whether the transaction was made through a chip (credit card).
- used_pin_number : whether the transaction was made using a PIN number.
- online_order : whether the transaction is an online order.
- fraud : whether the transaction is fraudulent.
Data Cleansing
Check the Data Type
glimpse(credit)
#> Rows: 1,000,000
#> Columns: 8
#> $ distance_from_home <dbl> 57.8778566, 10.8299427, 5.0910795, 2.24…
#> $ distance_from_last_transaction <dbl> 0.31114001, 0.17559150, 0.80515259, 5.6…
#> $ ratio_to_median_purchase_price <dbl> 1.94593998, 1.29421881, 0.42771456, 0.3…
#> $ repeat_retailer <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
#> $ used_chip <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, …
#> $ used_pin_number <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ online_order <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, …
#> $ fraud <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
From the result above, we find that some columns do not have the correct data type. We need to convert them to the correct types.
credit <- credit %>%
  mutate_at(c(4, 5, 6, 7, 8), as.factor)
glimpse(credit)
#> Rows: 1,000,000
#> Columns: 8
#> $ distance_from_home <dbl> 57.8778566, 10.8299427, 5.0910795, 2.24…
#> $ distance_from_last_transaction <dbl> 0.31114001, 0.17559150, 0.80515259, 5.6…
#> $ ratio_to_median_purchase_price <dbl> 1.94593998, 1.29421881, 0.42771456, 0.3…
#> $ repeat_retailer <fct> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
#> $ used_chip <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, …
#> $ used_pin_number <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ online_order <fct> 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, …
#> $ fraud <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Each column has now been converted to the desired data type.
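As a side note, mutate_at() is superseded in newer dplyr (>= 1.0); the same conversion can be written with across(). An equivalent sketch:
# Equivalent conversion using the newer across() syntax (dplyr >= 1.0)
credit <- credit %>%
  mutate(across(c(repeat_retailer, used_chip, used_pin_number,
                  online_order, fraud), as.factor))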
Check Missing Value
Now, we check whether there are any missing values in the data.
colSums(is.na(credit))
#> distance_from_home distance_from_last_transaction
#> 0 0
#> ratio_to_median_purchase_price repeat_retailer
#> 0 0
#> used_chip used_pin_number
#> 0 0
#> online_order fraud
#> 0 0
There are no missing values in our data, so we can continue to the next process.
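Had there been missing values, one simple option would be to drop the incomplete rows; a hypothetical sketch (not needed here, since the data is complete):
# Hypothetical: keep only complete rows (no effect on this data)
credit_complete <- credit %>% tidyr::drop_na()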
Cross Validation
Splitting Train-Test (Logistic Regression)
We will divide the data into two parts: train and test. The train data is used to fit the model, while the test data is used to evaluate it. Predictions on the test data will be compared with the actual labels to validate the model's performance.
set.seed(404)
index <- sample(nrow(credit), nrow(credit)*0.8)

data_training_log <- credit[index, ]
data_testing_log  <- credit[-index, ]
dim(data_training_log)
#> [1] 800000 8
dim(data_testing_log)
#> [1] 200000 8
prop.table(table(data_training_log$fraud))
#>
#> 0 1
#> 0.91278625 0.08721375
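As a side note, sample() does not stratify by class, so the class proportions in train and test can drift slightly. A stratified split can be obtained with createDataPartition() from caret (already loaded above); a minimal sketch, not used in the rest of this analysis:
# Stratified 80/20 split that preserves the fraud / non-fraud ratio
set.seed(404)
idx_strat   <- createDataPartition(credit$fraud, p = 0.8, list = FALSE)
train_strat <- credit[idx_strat, ]
test_strat  <- credit[-idx_strat, ]
prop.table(table(train_strat$fraud))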
Viewed from the proportions of the two classes, the data is imbalanced, so we need additional pre-processing to balance the two target classes. Because we have quite a lot of observations, we can use the downsampling method.
RNGkind(sample.kind = "Rounding")
set.seed(100)
down_train <- downSample(x = data_training_log %>% select(-fraud),
                         y = data_training_log$fraud,
                         yname = "fraud")
dim(down_train)
#> [1] 139542 8
prop.table(table(down_train$fraud))
#>
#> 0 1
#> 0.5 0.5
The proportions of the two classes are now balanced.
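Downsampling discards rows from the majority class. When you would rather keep all observations, upSample() from caret resamples the minority class instead; a minimal sketch, not used further here:
# Alternative: upsample the minority (fraud) class instead of discarding rows
set.seed(100)
up_train <- upSample(x = data_training_log %>% select(-fraud),
                     y = data_training_log$fraud,
                     yname = "fraud")
prop.table(table(up_train$fraud))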
Splitting Train-Test (K-NN)
As with logistic regression, we divide the data 80/20 into train and test sets, this time creating separate objects for the k-NN workflow.
set.seed(404)
index <- sample(nrow(credit), nrow(credit)*0.8)

data_training_knn <- credit[index, ]
data_testing_knn  <- credit[-index, ]
dim(data_training_knn)
#> [1] 800000 8
dim(data_testing_knn)
#> [1] 200000 8
prop.table(table(data_training_knn$fraud))
#>
#> 0 1
#> 0.9129825 0.0870175
As before, the class proportions are imbalanced, so we apply the same downsampling pre-processing.
RNGkind(sample.kind = "Rounding")
set.seed(100)
down_train <- downSample(x = data_training_knn %>% select(-fraud),
                         y = data_training_knn$fraud,
                         yname = "fraud")
dim(down_train)
#> [1] 139228 8
prop.table(table(down_train$fraud))
#>
#> 0 1
#> 0.5 0.5
The proportions of the two classes are now balanced.
Modelling
Logistic Regression
We fit a logistic regression model with the glm() function, using all predictors and fraud as the response variable.
model_logistic <- glm(fraud ~ .,
                      data = down_train,
                      family = "binomial")
summary(model_logistic)
#>
#> Call:
#> glm(formula = fraud ~ ., family = "binomial", data = down_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -8.4904 -0.2389 0.0000 0.2479 3.4698
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -7.6764903 0.0604383 -127.01 <2e-16 ***
#> distance_from_home 0.0290385 0.0002494 116.44 <2e-16 ***
#> distance_from_last_transaction 0.0509616 0.0006472 78.74 <2e-16 ***
#> ratio_to_median_purchase_price 1.2103582 0.0074886 161.63 <2e-16 ***
#> repeat_retailer1 -1.4717650 0.0352623 -41.74 <2e-16 ***
#> used_chip1 -1.1797316 0.0252959 -46.64 <2e-16 ***
#> used_pin_number1 -9.7644890 0.1757030 -55.57 <2e-16 ***
#> online_order1 5.0839400 0.0485698 104.67 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 193011 on 139227 degrees of freedom
#> Residual deviance: 57993 on 139220 degrees of freedom
#> AIC: 58009
#>
#> Number of Fisher Scoring iterations: 9
Prediction
data_testing_log$pred.Risk <- predict(object = model_logistic,
                                      newdata = data_testing_log,
                                      type = "response")
data_testing_log$pred.Label <- ifelse(data_testing_log$pred.Risk > 0.5, yes = 1, no = 0)
data_testing_log$pred.Label <- as.factor(data_testing_log$pred.Label)
head(data_testing_log[,c("pred.Risk", "pred.Label")])
#> pred.Risk pred.Label
#> 1 0.0018770819 0
#> 6 0.0002659766 0
#> 7 0.0273749329 0
#> 13 0.9999965628 1
#> 15 0.4359674898 0
#> 17 0.0057032740 0
When the predicted probability for a test observation exceeds 0.5, the transaction is classified as fraud.
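The 0.5 cutoff is a modeling choice rather than a requirement. A quick sketch of how the number of flagged transactions changes with the cutoff (the cutoff values are illustrative):
# Count how many test transactions would be flagged as fraud at each cutoff
sapply(c(0.3, 0.5, 0.7),
       function(cutoff) sum(data_testing_log$pred.Risk > cutoff))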
Model Evaluation
caret::confusionMatrix(data = data_testing_log$pred.Label,
                       reference = data_testing_log$fraud,
                       positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 170058 900
#> 1 12310 16732
#>
#> Accuracy : 0.9339
#> 95% CI : (0.9329, 0.935)
#> No Information Rate : 0.9118
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6821
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.94896
#> Specificity : 0.93250
#> Pos Pred Value : 0.57613
#> Neg Pred Value : 0.99474
#> Prevalence : 0.08816
#> Detection Rate : 0.08366
#> Detection Prevalence : 0.14521
#> Balanced Accuracy : 0.94073
#>
#> 'Positive' Class : 1
#>
- Accuracy : how much of the overall data (positive or negative) is predicted correctly.
- Sensitivity/Recall : how many actual positives are correctly predicted positive.
- Specificity : how many actual negatives are correctly predicted negative.
- Pos Pred Value/Precision : how many predicted positives are actually positive.
Accuracy    <- (170058 + 16732) / nrow(data_testing_log)
Recall      <- 16732 / (16732 + 900)
Specificity <- 170058 / (170058 + 12310)
Precision   <- 16732 / (16732 + 12310)

performance <- cbind.data.frame(Accuracy, Recall, Specificity, Precision)
performance
#>   Accuracy    Recall Specificity Precision
#> 1  0.93395 0.9489564   0.9324992 0.5761311
Based on the results of the confusionMatrix above, the model correctly classifies 93.40% of all transactions (Fraud and Not Fraud). Of all actual Not Fraud cases, the model correctly predicts 93.25%; of all actual Fraud cases, it correctly predicts 94.90%. Of all transactions the model predicts as fraud, only 57.61% are actually fraudulent.
Model Interpretation
exp(model_logistic$coefficients) %>%
data.frame()
#> .
#> (Intercept) 4.635991e-04
#> distance_from_home 1.029464e+00
#> distance_from_last_transaction 1.052282e+00
#> ratio_to_median_purchase_price 3.354686e+00
#> repeat_retailer1 2.295200e-01
#> used_chip1 3.073612e-01
#> used_pin_number1 5.745612e-05
#> online_order1 1.614088e+02
Model interpretation: the odds for online_order1 = 1.614088e+02 ≈ 161.4.
This means that online transactions have roughly 161 times higher odds of being fraudulent than offline transactions, holding the other variables constant.
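To gauge the uncertainty around these odds ratios, the Wald confidence intervals can be exponentiated as well; a minimal sketch:
# Odds ratios with 95% Wald confidence intervals
exp(cbind(OR = coef(model_logistic),
          confint.default(model_logistic)))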
K-Nearest Neighbour
Splitting Train-Test
In the k-NN model, the predictors and the label (target variable) are separated.
# predictors
knn_train_x <- down_train[, -8]
knn_test_x  <- data_testing_knn[, -8]

# target
knn_train_y <- down_train$fraud
knn_test_y  <- data_testing_knn$fraud
glimpse(knn_train_x)
#> Rows: 139,228
#> Columns: 7
#> $ distance_from_home <dbl> 17.3637361, 3.6355158, 34.8613327, 2.43…
#> $ distance_from_last_transaction <dbl> 0.78756528, 16.12893989, 0.01833721, 23…
#> $ ratio_to_median_purchase_price <dbl> 1.2323597, 0.3165404, 2.0135199, 0.1755…
#> $ repeat_retailer <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ used_chip <fct> 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
#> $ used_pin_number <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
#> $ online_order <fct> 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, …
Scaling the Predictor Data
The predictor data will be scaled using z-score standardization. The test data must also be scaled using the parameters (mean and standard deviation) from the train data, because the test data is treated as unseen data.
knn_train_xs <- scale(knn_train_x %>% mutate_if(is.factor, as.numeric))
knn_test_xs  <- scale(knn_test_x %>% mutate_if(is.factor, as.numeric),
                      center = attr(knn_train_xs, "scaled:center"),
                      scale  = attr(knn_train_xs, "scaled:scale"))
Prediction
We will find the optimum value of k. To avoid a tie during majority voting:
- k should be odd if the number of target classes is even
- k should be even if the number of target classes is odd
- k should not be a multiple of the number of target classes
A common starting point is k ≈ √n, where n is the number of training rows:
round(sqrt(nrow(knn_train_xs)))
#> [1] 373
Since 373 is an odd number (and our target has two classes), we can use k = 373 directly.
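The √n rule is only a heuristic; k can also be tuned on a held-out validation slice of the training data. A minimal sketch, assuming an illustrative grid of k values and a subsample for speed:
# Tune k on a validation slice of the training data
# (grid and subsample sizes are illustrative)
set.seed(100)
sub <- sample(nrow(knn_train_xs), 20000)
val <- sub[1:5000]
tr  <- sub[-(1:5000)]
ks  <- c(101, 201, 373)
acc <- sapply(ks, function(k) {
  pred <- class::knn(train = knn_train_xs[tr, ],
                     test  = knn_train_xs[val, ],
                     cl    = knn_train_y[tr],
                     k     = k)
  mean(pred == knn_train_y[val])
})
data.frame(k = ks, accuracy = acc)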
Using the value of k that we have obtained, we will try to predict
knn_test_xs using knn_train_xs data and knn_train_y data.
library(class)
knn_pred <- knn(train = knn_train_xs,
                test = knn_test_xs,
                cl = knn_train_y,
                k = 373)
Model Evaluation
caret::confusionMatrix(data = knn_pred, reference = knn_test_y, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 178260 171
#> 1 3951 17618
#>
#> Accuracy : 0.9794
#> 95% CI : (0.9788, 0.98)
#> No Information Rate : 0.9111
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.884
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.99039
#> Specificity : 0.97832
#> Pos Pred Value : 0.81682
#> Neg Pred Value : 0.99904
#> Prevalence : 0.08894
#> Detection Rate : 0.08809
#> Detection Prevalence : 0.10784
#> Balanced Accuracy : 0.98435
#>
#> 'Positive' Class : 1
#>
Based on the results of the confusionMatrix above, the model correctly classifies 97.94% of all transactions (Fraud and Not Fraud). Of all actual Not Fraud cases, the model correctly predicts 97.83%; of all actual Fraud cases, it correctly predicts 99.04%. Of all transactions the model predicts as fraud, 81.68% are actually fraudulent.
Conclusion
Comparing the two models, the K-NN model is better at correctly identifying actual fraud transactions: its recall/sensitivity of 99.04% is higher than the 94.90% achieved by the logistic regression model.
If I were a fraud analyst, I would pay close attention to the recall metric, because I care most about one particular class, namely fraud. I want transactions with fraud potential to be predicted as fraud, so they can later be reviewed to confirm whether they are truly fraudulent, because I want to minimize possible losses.
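Following that reasoning, the logistic regression cutoff can also be lowered so that more borderline transactions are flagged as fraud, trading precision for recall. A minimal sketch, with 0.3 as an illustrative cutoff:
# Re-label test predictions with a lower, recall-oriented cutoff (0.3 is illustrative)
pred_low <- as.factor(ifelse(data_testing_log$pred.Risk > 0.3, 1, 0))
caret::confusionMatrix(data = pred_low,
                       reference = data_testing_log$fraud,
                       positive = "1")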