Credit card fraud is a major concern in the financial industry. Financial Fraud Action UK estimated losses of almost £770M a year, roughly £2M a day, to fraudulent transactions in its 2016 Fraud the Facts report (https://www.financialfraudaction.org.uk/fraudfacts16/assets/fraud_the_facts.pdf).
Analysing transactions manually to spot fraud is infeasible given the sheer volume and complexity of the data. However, given sufficiently informative features, one could expect the task to be tractable with Machine Learning. This hypothesis is explored in this project.
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and takes value 1 in case of fraud and 0 otherwise.
Given the class imbalance ratio, it is recommended to measure performance using the Area Under the Precision-Recall Curve (AUPRC); accuracy derived from the confusion matrix is not meaningful for unbalanced classification.
The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML
Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015
# Load libraries
library(data.table)
library(ggplot2)
library(plyr)
library(dplyr)
library(corrplot)
library(pROC)
library(glmnet)
library(caret)
library(Rtsne)
library(xgboost)
library(doMC)
# Load data
data <- fread("data/creditcard.csv")
Read 284807 rows and 31 (of 31) columns from 0.140 GB file in 00:00:04
head(data)
All the features apart from “Time” and “Amount” are anonymised. Let’s see whether there is any missing data.
apply(data, 2, function(x) sum(is.na(x)))
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V26 V27 V28 Amount Class
0 0 0 0 0
Good news! There are no NA values in the data.
common_theme <- theme(plot.title = element_text(hjust = 0.5, face = "bold"))
p <- ggplot(data, aes(x = Class)) + geom_bar() + ggtitle("Number of class labels") + common_theme
print(p)
Clearly, the dataset is extremely unbalanced. Even a “null” classifier which always predicts class = 0 would obtain over 99% accuracy on this task. This shows that plain mean accuracy should not be used here: it is almost completely insensitive to the false negatives we actually care about.
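As a quick check of that claim, the class proportions directly give the accuracy such a null classifier would obtain (a one-line sketch):
# Accuracy of a "null" classifier that always predicts the majority class
max(prop.table(table(data$Class)))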
The most appropriate measures for this task are therefore ones that focus on the rare positive class: the Area Under the Precision-Recall Curve (AUPRC, as recommended above), the area under the ROC curve, and sensitivity/specificity (or precision/recall) at a chosen classification threshold; a sketch for computing AUPRC is shown below.
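For instance, AUPRC could be computed with a small helper along these lines (a sketch only: it assumes the PRROC package is installed, and the helper name auprc is ours rather than part of any package):
# Sketch of an AUPRC helper using PRROC: 'scores' are predicted fraud probabilities,
# 'labels' are the true 0/1 classes
library(PRROC)
auprc <- function(scores, labels) {
  pr.curve(scores.class0 = scores[labels == 1],   # scores of fraudulent transactions
           scores.class1 = scores[labels == 0])$auc.integral
}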
Additionally, we can transform the data itself in numerous ways, for example by normalising the non-PCA features such as “Amount” and by resampling the classes to reduce the imbalance (as we do later with SMOTE).
summary(data)
Time V1 V2 V3 V4 V5 V6 V7 V8
Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256 Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572 Min. :-73.21672
1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541 1st Qu.: -0.20863
Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799 Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401 Median : 0.02236
Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.:139320 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704 3rd Qu.: 0.32735
Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826 Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895 Max. : 20.00721
V9 V10 V11 V12 V13 V14 V15 V16 V17
Min. :-13.43407 Min. :-24.58826 Min. :-4.79747 Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894 Min. :-14.12985 Min. :-25.16280
1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288 1st Qu.: -0.46804 1st Qu.: -0.48375
Median : -0.05143 Median : -0.09292 Median :-0.03276 Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807 Median : 0.06641 Median : -0.06568
Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882 3rd Qu.: 0.52330 3rd Qu.: 0.39968
Max. : 15.59500 Max. : 23.74514 Max. :12.01891 Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774 Max. : 17.31511 Max. : 9.25353
V18 V19 V20 V21 V22 V23 V24 V25 V26
Min. :-9.498746 Min. :-7.213527 Min. :-54.49772 Min. :-34.83038 Min. :-10.933144 Min. :-44.80774 Min. :-2.83663 Min. :-10.29540 Min. :-2.60455
1st Qu.:-0.498850 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459 1st Qu.: -0.31715 1st Qu.:-0.32698
Median :-0.003636 Median : 0.003735 Median : -0.06248 Median : -0.02945 Median : 0.006782 Median : -0.01119 Median : 0.04098 Median : 0.01659 Median :-0.05214
Mean : 0.000000 Mean : 0.000000 Mean : 0.00000 Mean : 0.00000 Mean : 0.000000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.500807 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953 3rd Qu.: 0.35072 3rd Qu.: 0.24095
Max. : 5.041069 Max. : 5.591971 Max. : 39.42090 Max. : 27.20284 Max. : 10.503090 Max. : 22.52841 Max. : 4.58455 Max. : 7.51959 Max. : 3.51735
V27 V28 Amount Class
Min. :-22.565679 Min. :-15.43008 Min. : 0.00 Length:284807
1st Qu.: -0.070840 1st Qu.: -0.05296 1st Qu.: 5.60 Class :character
Median : 0.001342 Median : 0.01124 Median : 22.00 Mode :character
Mean : 0.000000 Mean : 0.00000 Mean : 88.35
3rd Qu.: 0.091045 3rd Qu.: 0.07828 3rd Qu.: 77.17
Max. : 31.612198 Max. : 33.84781 Max. :25691.16
All the anonymised features appear to have been normalised to mean 0. We will apply the same transformation to the “Amount” column later on to facilitate training ML models.
Before normalising the “Amount” column, it is worth checking how informative this feature is for predicting whether a transaction is fraudulent. Let’s plot the amount against the class of the transaction.
p <- ggplot(data, aes(x = Class, y = Amount)) + geom_boxplot() + ggtitle("Distribution of transaction amount by class") + common_theme
print(p)
There is clearly a lot more variability in the transaction values for non-fraudulent transactions. To get a fuller picture, let’s compute the mean and median values for each class.
data %>% group_by(Class) %>% summarise(mean(Amount), median(Amount))
Fraudulent transactions have a higher mean value than non-fraudulent ones, suggesting this feature is likely to be useful in a predictive model. However, the median is higher for legitimate transactions, which indicates that the distribution of fraudulent amounts is strongly right-skewed: a small number of large fraudulent transactions pulls the mean up (this is also visible in the boxplot above).
Since almost all the features are anonymised, let’s see whether there are any correlations with the “Class” feature.
data$Class <- as.numeric(data$Class)
corr_plot <- corrplot(cor(data[,-c("Time")]), method = "circle", type = "upper")
There are a couple of interesting correlations between some of the anonymised variables and the “Amount” and “Class” features. We will come back to these variables later on during feature selection for the model.
Let’s apply that transformation to the “Amount” column too.
# Standardise a numeric vector to zero mean and unit variance
normalize <- function(x){
return((x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE))
}
data$Amount <- normalize(data$Amount)
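A quick check that the transformation behaved as expected (the values should be approximately 0 and 1):
# Mean and standard deviation of the normalised "Amount" column
round(c(mean = mean(data$Amount), sd = sd(data$Amount)), 3)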
To try to understand the data better, we will try visualizing the data using t-Distributed Stochastic Neighbour Embedding, a technique to reduce dimensionality using Barnes-Hut approximations.
For the t-SNE embedding, perplexity was set to 20. This was based on experimentation; there is no single “best” value to use, although the author of the algorithm suggests values between 5 and 50.
The visualisation should give us a hint as to whether there exist any “discoverable” patterns in the data which the model could learn. If there is no obvious structure in the data, it is more likely that the model will perform poorly.
# Use 10% of data to compute t-SNE
tsne_subset <- 1:as.integer(0.1*nrow(data))
tsne <- Rtsne(data[tsne_subset,-c("Class", "Time")], perplexity = 20, theta = 0.5, pca = F, verbose = T, max_iter = 500, check_duplicates = F)
Read the 28480 x 29 data matrix successfully!
Using no_dims = 2, perplexity = 20.000000, and theta = 0.500000
Computing input similarities...
Normalizing input...
Building tree...
- point 0 of 28480
- point 10000 of 28480
- point 20000 of 28480
Done in 30.89 seconds (sparsity = 0.003025)!
Learning embedding...
Iteration 50: error is 114.699123 (50 iterations in 35.33 seconds)
Iteration 100: error is 114.636330 (50 iterations in 34.70 seconds)
Iteration 150: error is 97.368485 (50 iterations in 32.11 seconds)
Iteration 200: error is 91.431454 (50 iterations in 32.27 seconds)
Iteration 250: error is 88.790495 (50 iterations in 31.04 seconds)
Iteration 300: error is 4.005111 (50 iterations in 31.49 seconds)
Iteration 350: error is 3.559493 (50 iterations in 31.29 seconds)
Iteration 400: error is 3.262880 (50 iterations in 30.85 seconds)
Iteration 450: error is 3.042638 (50 iterations in 29.51 seconds)
Iteration 500: error is 2.867917 (50 iterations in 29.89 seconds)
Fitting performed in 318.48 seconds.
classes <- as.factor(data$Class[tsne_subset])
tsne_mat <- as.data.frame(tsne$Y)
ggplot(tsne_mat, aes(x = V1, y = V2)) + geom_point(aes(color = classes)) + theme_minimal() + common_theme + ggtitle("t-SNE visualisation of transactions") + scale_color_manual(values = c("#E69F00", "#56B4E9"))
Luckily, there is a fairly clear separation: the fraudulent transactions tend to lie at the edge of the “blob” of data. This is encouraging; let’s see whether we can make our models detect fraudulent transactions!
To avoid developing a “naive” model, we should make sure the classes are roughly balanced. Therefore, we will use a resampling (more precisely, oversampling) scheme called SMOTE. It works roughly as follows: for each minority-class (fraud) example, it finds the example’s k nearest minority-class neighbours, picks one of them at random, and creates a synthetic example at a random point on the line segment between the two in feature space. This is repeated until the desired degree of balance is reached.
SMOTE has been shown to yield better classification performance in ROC space than either plain over- or undersampling (Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, 2002, Vol. 16, pp. 321–357). Since ROC AUC is the measure we are going to optimise, we will use SMOTE to resample the data.
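To make the interpolation step concrete, here is a minimal sketch of how a single synthetic minority example could be generated (illustrative only; the caret sampling = "smote" option used below relies on a library implementation rather than this code):
# Generate one synthetic minority-class example, SMOTE-style: pick a random
# minority sample, one of its k nearest minority neighbours, and a random point
# on the segment between them in feature space
smote_point <- function(minority, k = 5) {
  i <- sample(nrow(minority), 1)
  d <- as.matrix(dist(rbind(minority[i, , drop = FALSE], minority)))[1, -1]
  nn <- order(d)[2:(k + 1)]   # skip the point itself (distance 0)
  j <- sample(nn, 1)
  gap <- runif(1)
  minority[i, ] + gap * (minority[j, ] - minority[i, ])
}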
# Set random seed for reproducibility
set.seed(42)
# Transform "Class" to factor to perform classification and rename levels to predict class probabilities (need to be valid R variable names)
data$Class <- as.numeric(data$Class)
#data$Class <- revalue(data$Class, c("0"="false", "1"="true"))
#data$Class <- factor(data$Class, levels(data$Class)[c(2, 1)])
# Create training and testing set with stratification (i.e. preserving the proportions of false/true values from the "Class" column)
train_index <- createDataPartition(data$Class, times = 1, p = 0.8, list = F)
X_train <- data[train_index]
X_test <- data[-train_index]
y_train <- data$Class[train_index]
y_test <- data$Class[-train_index]
# Parallel processing for faster training
registerDoMC(cores = 8)
# Use 10-fold cross-validation
ctrl <- trainControl(method = "cv",
number = 10,
verboseIter = T,
classProbs = T,
sampling = "smote",
summaryFunction = twoClassSummary,
savePredictions = T)
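As an aside, if we wanted the cross-validation summary to report the area under the precision-recall curve instead, in line with the AUPRC recommendation earlier, caret’s prSummary could be swapped in. A sketch (it requires the MLmetrics package, and train() would then be called with metric = "AUC"):
# Alternative control object reporting precision-recall AUC ("AUC"), Precision,
# Recall and F via caret::prSummary (requires the MLmetrics package)
ctrl_pr <- trainControl(method = "cv",
                        number = 10,
                        verboseIter = T,
                        classProbs = T,
                        sampling = "smote",
                        summaryFunction = prSummary,
                        savePredictions = T)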
It is typically a good idea to start out with a simple model and move on to more complex ones, in order to get a rough idea of what “good” performance means on our data. Moreover, it is important to consider the trade-off between model accuracy and model complexity (which is inherently tied to computational cost). A simple model with short inference times achieving 85% accuracy may be sufficient for a given task, as opposed to, say, a 10-layer neural network that trains for 2 days on a GPU cluster and is 90% accurate.
Therefore, we will start out with logistic regression.
Logistic regression is a simple linear model whose output can be interpreted as a probability between 0 and 1. This is achieved by applying the logistic function to a linear combination of the features:
\[g(z) = \frac{1}{1 + \exp(-z)}, \quad \text{where } z = \beta^T x\]
The model is fitted by maximising the likelihood over the parameter vector \(\beta\), for example with gradient descent (R’s glm() uses iteratively reweighted least squares, i.e. Fisher scoring). Equipped with this basic information, let’s fit the model and see how it performs!
log_mod <- glm(Class ~ ., family = "binomial", data = X_train)
summary(log_mod)
Call:
glm(formula = Class ~ ., family = "binomial", data = X_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.8020 -0.0298 -0.0194 -0.0123 4.6025
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.274e+00 2.702e-01 -30.618 < 2e-16 ***
Time -4.086e-06 2.485e-06 -1.645 0.100068
V1 9.452e-02 4.686e-02 2.017 0.043683 *
V2 1.804e-02 6.488e-02 0.278 0.780999
V3 -2.502e-02 5.889e-02 -0.425 0.671000
V4 6.845e-01 8.277e-02 8.271 < 2e-16 ***
V5 1.380e-01 7.473e-02 1.846 0.064821 .
V6 -1.070e-01 8.162e-02 -1.310 0.190035
V7 -9.144e-02 7.410e-02 -1.234 0.217190
V8 -1.699e-01 3.336e-02 -5.095 3.50e-07 ***
V9 -2.817e-01 1.249e-01 -2.255 0.024104 *
V10 -8.052e-01 1.054e-01 -7.640 2.16e-14 ***
V11 -1.338e-01 9.018e-02 -1.483 0.137993
V12 1.414e-01 1.013e-01 1.395 0.162906
V13 -3.470e-01 9.289e-02 -3.735 0.000188 ***
V14 -5.972e-01 7.105e-02 -8.406 < 2e-16 ***
V15 -6.793e-02 9.646e-02 -0.704 0.481283
V16 -2.057e-01 1.413e-01 -1.456 0.145330
V17 -4.874e-02 7.854e-02 -0.621 0.534847
V18 2.853e-02 1.449e-01 0.197 0.843865
V19 1.833e-01 1.087e-01 1.687 0.091588 .
V20 -4.249e-01 9.021e-02 -4.710 2.47e-06 ***
V21 3.871e-01 6.715e-02 5.765 8.18e-09 ***
V22 6.555e-01 1.503e-01 4.361 1.29e-05 ***
V23 -1.374e-01 6.509e-02 -2.110 0.034830 *
V24 9.196e-02 1.620e-01 0.568 0.570166
V25 2.004e-02 1.447e-01 0.138 0.889860
V26 -1.439e-03 2.124e-01 -0.007 0.994596
V27 -8.038e-01 1.336e-01 -6.018 1.77e-09 ***
V28 -2.324e-01 9.447e-02 -2.460 0.013883 *
Amount 2.378e-01 1.050e-01 2.264 0.023597 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5875.3 on 227845 degrees of freedom
Residual deviance: 1816.7 on 227815 degrees of freedom
AIC: 1878.7
Number of Fisher Scoring iterations: 12
# Use a threshold of 0.5 to transform predictions to binary.
# Note: confusionMatrix() treats its first argument as the prediction and the second
# as the reference, so with y_test passed first the rows of the printed table
# correspond to the true class and the columns to the model's predictions.
conf_mat <- confusionMatrix(y_test, as.numeric(predict(log_mod, X_test, type = "response") > 0.5))
print(conf_mat)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 56861 8
1 34 58
Accuracy : 0.9993
95% CI : (0.999, 0.9995)
No Information Rate : 0.9988
P-Value [Acc > NIR] : 0.0010494
Kappa : 0.7338
Mcnemar's Test P-Value : 0.0001145
Sensitivity : 0.9994
Specificity : 0.8788
Pos Pred Value : 0.9999
Neg Pred Value : 0.6304
Prevalence : 0.9988
Detection Rate : 0.9982
Detection Prevalence : 0.9984
Balanced Accuracy : 0.9391
'Positive' Class : 0
A simple logistic regression model achieved over 99.9% accuracy. Bearing in mind the argument order noted above (rows = true class, columns = predictions), the model misses 34 of the 92 fraudulent transactions in the test set (false negatives) while flagging only 8 legitimate transactions as fraudulent. This is a solid baseline for the more complex models to beat.
fourfoldplot(conf_mat$table)
The classification threshold controls the trade-off between the two error types: raising it reduces the number of legitimate transactions flagged as fraudulent, at the cost of missing more frauds, while lowering it does the opposite. For a bank, missed frauds (false negatives) are typically the more costly error, so in practice the threshold would be lowered rather than raised. For illustration, let’s see what happens at a very high threshold.
conf_mat2 <- confusionMatrix(y_test, as.numeric(predict(log_mod, X_test, type = "response") > 0.999))
print(conf_mat2)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 56866 3
1 67 25
Accuracy : 0.9988
95% CI : (0.9984, 0.999)
No Information Rate : 0.9995
P-Value [Acc > NIR] : 1
Kappa : 0.4162
Mcnemar's Test P-Value : 5.076e-14
Sensitivity : 0.9988
Specificity : 0.8929
Pos Pred Value : 0.9999
Neg Pred Value : 0.2717
Prevalence : 0.9995
Detection Rate : 0.9983
Detection Prevalence : 0.9984
Balanced Accuracy : 0.9458
'Positive' Class : 0
With the 0.999 threshold only 3 legitimate transactions are flagged as fraudulent, but the number of missed frauds roughly doubles (67 versus 34 at the 0.5 threshold). When choosing a classification threshold, the ROC curve can guide us.
roc_logmod <- roc(y_test, as.numeric(predict(log_mod, X_test, type = "response")))
plot(roc_logmod, main = paste0("AUC: ", round(pROC::auc(roc_logmod), 3)))
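One common heuristic is to pick the threshold that maximises Youden’s J statistic (sensitivity + specificity - 1), which pROC’s coords() can do directly. A sketch:
# Threshold on the ROC curve maximising Youden's J statistic
coords(roc_logmod, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))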
Let’s now move on to Random Forest and see whether we can improve any further.
# Train a Random Forest classifier, selecting tuning parameters by ROC AUC
X_train_rf <- X_train
X_train_rf$Class <- as.factor(X_train_rf$Class)
levels(X_train_rf$Class) <- make.names(c(0, 1))
model_rf_smote <- train(Class ~ ., data = X_train_rf, method = "rf", trControl = ctrl, verbose = T, metric = "ROC")
Aggregating results
Selecting tuning parameters
Fitting mtry = 16 on full training set
The code above applies SMOTE resampling within each fold of 10-fold cross-validation and trains a Random Forest classifier, using ROC AUC as the metric to maximise. Let’s look at its performance!
model_rf_smote
Random Forest
227846 samples
30 predictors
2 classes: 'X0', 'X1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 205062, 205062, 205061, 205062, 205062, 205061, ...
Addtional sampling using SMOTE
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.9796401 0.9954671 0.8775
16 0.9825342 0.9901339 0.8850
30 0.9814775 0.9850074 0.8850
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 16.
It is important to note that SMOTE resampling was applied only to the training data (within each cross-validation fold). If we resampled the whole dataset and then made the split, SMOTE would leak information between the training and testing sets, thereby biasing the results in an optimistic way.
The results on the training set look very promising. Let’s see how the model performs on the unseen test set.
preds <- predict(model_rf_smote, X_test, type = "prob")
conf_mat_rf <- confusionMatrix(as.numeric(preds$X1 > 0.5), y_test)
print(conf_mat_rf)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 56443 8
1 426 84
Accuracy : 0.9924
95% CI : (0.9916, 0.9931)
No Information Rate : 0.9984
P-Value [Acc > NIR] : 1
Kappa : 0.2771
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9925
Specificity : 0.9130
Pos Pred Value : 0.9999
Neg Pred Value : 0.1647
Prevalence : 0.9984
Detection Rate : 0.9909
Detection Prevalence : 0.9910
Balanced Accuracy : 0.9528
'Positive' Class : 0
roc_data <- roc(y_test, predict(model_rf_smote, X_test, type = "prob")$X1)
plot(roc_data, main = paste0("AUC: ", round(pROC::auc(roc_data), 3)))
The RF model achieved ~100% precision and ~99% recall for the majority class, which is, surprisingly, slightly lower than for logistic regression: it misses only 8 of the 92 fraudulent transactions in the test set, but flags 426 legitimate transactions as fraudulent. This might be due to the SMOTE resampling shifting the decision boundary towards the minority class, or to the Random Forest’s higher capacity leading to overfitting. As can be seen above, it achieves a marginally higher AUC than logistic regression (which is what the training objective optimised for).
plot(varImp(model_rf_smote))
It is interesting to compare the variable importances of the RF model with the variables identified earlier as correlated with the “Class” variable. The top 3 most important variables in the RF model were also the ones most correlated with “Class”. Especially for large datasets, this means we could save disk space and computation time by training the model on only the most correlated/important variables, sacrificing a little accuracy; a possible sketch is shown below.
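For example, one could keep the top few variables from the importance ranking and retrain on that subset. A hedged sketch (the exact column returned by varImp() can vary with the model type, so we simply take the first column here):
# Keep the ten most important predictors (plus the response) for a smaller model
imp <- varImp(model_rf_smote)$importance
top_vars <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:10]
X_train_small <- X_train_rf[, c(top_vars, "Class"), with = FALSE]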
Lastly, we can also try XGBoost, which is based on gradient-boosted trees and is typically a more powerful model than both Logistic Regression and Random Forest.
dtrain_X <- xgb.DMatrix(data = as.matrix(X_train[,-c("Class")]), label = as.numeric(X_train$Class))
dtest_X <- xgb.DMatrix(data = as.matrix(X_test[,-c("Class")]), label = as.numeric(X_test$Class))
xgb <- xgboost(data = dtrain_X, nrounds = 100, gamma = 0.1, max_depth = 10, objective = "binary:logistic", nthread = 7)
[1] train-error:0.000373
[2] train-error:0.000338
[3] train-error:0.000329
[4] train-error:0.000325
[5] train-error:0.000316
[6] train-error:0.000303
[7] train-error:0.000298
[8] train-error:0.000281
[9] train-error:0.000281
[10] train-error:0.000277
[11] train-error:0.000272
[12] train-error:0.000268
...
[25] train-error:0.000176
...
[50] train-error:0.000004
...
[53] train-error:0.000000
...
[100] train-error:0.000000
preds_xgb <- predict(xgb, dtest_X)
confusionMatrix(as.numeric(preds_xgb > 0.5), y_test)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 56863 17
1 6 75
Accuracy : 0.9996
95% CI : (0.9994, 0.9997)
No Information Rate : 0.9984
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8669
Mcnemar's Test P-Value : 0.03706
Sensitivity : 0.9999
Specificity : 0.8152
Pos Pred Value : 0.9997
Neg Pred Value : 0.9259
Prevalence : 0.9984
Detection Rate : 0.9983
Detection Prevalence : 0.9986
Balanced Accuracy : 0.9076
'Positive' Class : 0
We can see the model performs much better than the previous ones, especially in terms of Negative Predictive Value: it flags only 6 legitimate transactions as fraudulent while missing 17 frauds, and still achieves nearly 100% precision and recall for the majority class on the test set. Once again, we can tune the classification threshold using the ROC curve.
roc_xgb <- roc(y_test, preds_xgb)
plot(roc_xgb, main = paste0("AUC: ", round(pROC::auc(roc_xgb), 3)))
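Finally, in line with the AUPRC recommendation from the dataset description, we could also report the area under the precision-recall curve for these predictions, for example using the auprc() helper sketched earlier (again assuming the PRROC package is available):
# Area under the precision-recall curve on the test set, using the hypothetical auprc() helper from above
auprc(preds_xgb, y_test)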
This project has explored the task of identifying fraudulent transactions in a dataset of anonymised features. It has been shown that even a very simple logistic regression model can achieve good performance, while a much more complex Random Forest model improves upon it in terms of AUC. The XGBoost model, in turn, improves upon both.