The following document presents a machine learning exercise for predicting the probability that a client will default on loan payments. We will use the credit card database from Brett Lantz’s book, “Machine Learning with R”. The dataset has been made publicly available by Zach Stednick on his GitHub page. By using this dataset, we aim to build a model that can accurately predict the likelihood of a client failing to make loan payments.
The ML models used in this exercise were a Generalized Linear Model (glm, here a logistic regression), kNN, Random Forest and XGBoost. At the end, we will compare the performance of each model based on their Area Under the Curve (AUC).
library(caret) # for machine learning
library(tidyverse) # for data wrangling and plots
library(xgboost) # for xgboost
library(pROC) # for ROC curve analysis and model selection
url <- "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/credit.csv"
data <- read.csv(url, sep = ",")
The variable we are interested in is whether or not a loan payment is defaulted. We will first rename the variable of interest to “target”, since it is the primary variable we aim to predict. We then transform the target variable into a factor, a categorical variable representing the two possible outcomes of our prediction: non-default (1) or default (2). By doing so, we can effectively train and evaluate our machine learning models to predict the target variable’s values.
data_clean <- data |>
rename(target = default) |>
mutate(target = factor(target))
Let’s check our target distribution:
prop.table(summary(data_clean$target))
## 1 2
## 0.7 0.3
We use 'set.seed()' to ensure the reproducibility of our randomly generated selection of indices. Then the data set is split into a training set (80%) and a test set (20%).
set.seed(42)
index <- createDataPartition(data_clean$target, times = 1, p = 0.2, list = FALSE)
train_set <- data_clean[-index,]
test_set <- data_clean[index,]
It is important to establish a baseline prediction. In this case, the simplest approach is to randomly guess the target variable. As we can see, the accuracy of this baseline is comparable to flipping a coin.
set.seed(3)
y_hat <- sample(c(1, 2), length(index), replace = TRUE) |>
factor(levels = levels(test_set$target))
mean(y_hat == test_set$target)
## [1] 0.485
Let’s see if the accuracy improves when we use a specific variable as a predictor. A good candidate is the credit history variable.
data_clean |>
group_by(credit_history) |>
summarise(mean(target == 2))
## # A tibble: 5 × 2
## credit_history `mean(target == 2)`
## <chr> <dbl>
## 1 critical 0.171
## 2 delayed 0.318
## 3 fully repaid 0.625
## 4 fully repaid this bank 0.571
## 5 repaid 0.319
y_hat_1 <- if_else(test_set$credit_history %in% c("fully repaid", "fully repaid this bank") , 2, 1)
mean(y_hat_1 == test_set$target)
## [1] 0.72
Using a single variable has already resulted in a significant improvement. Let’s now try using all the available variables.
First, we will fit a Logistic Regression model using all the variables available in the dataset as predictors. Again, it is important to set a seed before running the model to ensure reproducibility.
After building the model, we will compute the classification metrics using the confusionMatrix() function. By selecting its $byClass element we can retrieve all the relevant metrics to evaluate our model.
set.seed(45)
fit_1 <- train(target ~ ., train_set,
method = "glm")
mean(test_set$target == predict(fit_1, test_set)) # to calculate Accuracy
## [1] 0.76
matrix_1 <- confusionMatrix(predict(fit_1, test_set), as.factor(test_set$target))
matrix_1$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8928571 0.4500000 0.7911392
## Neg Pred Value Precision Recall
## 0.6428571 0.7911392 0.8928571
## F1 Prevalence Detection Rate
## 0.8389262 0.7000000 0.6250000
## Detection Prevalence Balanced Accuracy
## 0.7900000 0.6714286
For this exercise we will focus on Accuracy, Precision and Recall. Accuracy is the percentage of correct predictions made by the model, whether the outcome is positive or negative. Precision is the proportion of true positives among all predicted positives (true positives plus false positives), and Recall is the proportion of true positives among all actual positive cases.
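As a quick sanity check, these three quantities can also be computed directly from the raw counts of the confusion matrix built above (rows are predictions, columns are the reference classes, and caret treats the first level, 1, as the positive class):
cm <- matrix_1$table # raw counts from the glm confusion matrix
TP <- cm[1, 1]; FP <- cm[1, 2] # predicted positive: correct / incorrect
FN <- cm[2, 1]; TN <- cm[2, 2] # predicted negative: incorrect / correct
(TP + TN) / sum(cm) # Accuracy: correct predictions over all cases
TP / (TP + FP) # Precision: true positives over predicted positives
TP / (TP + FN) # Recall: true positives over actual positives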
The glm model achieved an overall Accuracy of 76%, meaning it correctly classified 76% of all cases, regardless of whether they were positive or negative. The model also had a Precision score of 79%, meaning that of all the cases predicted as positive, 79% were correctly classified. Finally, the model reached a Recall of 89%, which means that out of all the actual positive cases, the model correctly identified 89% of them.
kNN (k-nearest neighbours) is a non-parametric supervised learning algorithm that classifies a data point based on the classes of its nearest neighbours in the feature space. For our exercise we will use 20-fold cross-validation, tuning k from 3 to 71 in steps of 3.
trControl <- trainControl(method = "cv",
number = 20)
set.seed(8)
fit_2 <- train(target ~ ., train_set,
method = "knn",
trControl = trControl,
metric = "Accuracy",
tuneGrid = data.frame(k = seq(3, 71, 3)))
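Cross-validation picks the value of k with the highest average accuracy across the 20 folds. With a caret train object this choice can be inspected directly; a quick optional check, output not shown here:
fit_2$bestTune # the value of k selected by cross-validation
ggplot(fit_2) # accuracy across the whole tuning grid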
mean(test_set$target == predict(fit_2, test_set))
## [1] 0.7
matrix_2 <- confusionMatrix(predict(fit_2, test_set), as.factor(test_set$target))
matrix_2$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9928571 0.0000000 0.6984925
## Neg Pred Value Precision Recall
## 0.0000000 0.6984925 0.9928571
## F1 Prevalence Detection Rate
## 0.8200590 0.7000000 0.6950000
## Detection Prevalence Balanced Accuracy
## 0.9950000 0.4964286
The knn model achieved an overall Accuracy of 70%, a Precision score of also 70% and a Recall of 99%. Although the model correctly classified almost all of the positive cases, its Specificity is 0: it predicts the positive class for nearly every observation, so the probability of a wrong classification is higher than in the glm model.
Now let’s use a simple Random Forest algorithm. We will set the number of trees to 100 and tune the mtry hyperparameter over a grid from 1 to 7.
set.seed(23)
fit_3 <- train(target ~ ., train_set,
method = "rf",
tuneGrid = data.frame(mtry = 1:7),
ntree = 100)
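Random forests also provide a natural measure of variable importance; caret’s varImp() can be applied to the fitted object to see which predictors drive the splits. An optional check, output not shown here:
varImp(fit_3) # variable importance of the random forest
plot(varImp(fit_3), top = 10) # plot the ten most important predictors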
mean(test_set$target == predict(fit_3, test_set))
## [1] 0.74
matrix_3 <- confusionMatrix(predict(fit_3, test_set), as.factor(test_set$target))
matrix_3$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9285714 0.3000000 0.7558140
## Neg Pred Value Precision Recall
## 0.6428571 0.7558140 0.9285714
## F1 Prevalence Detection Rate
## 0.8333333 0.7000000 0.6500000
## Detection Prevalence Balanced Accuracy
## 0.8600000 0.6142857
The rf model achieved an Accuracy of 74%, a Precision of 75.6% and a Recall of 93%. These results are better than the knn model and very similar to the glm, mainly in accuracy.
XGBoost is a very powerful classification algorithm that uses optimized gradient boosting, making it a popular choice for classification tasks. Chen and Guestrin’s 2016 paper provides a detailed review of the algorithm. Before training the model, we first need to arrange our data in a matrix format and recode the target variable as 0 (non-default) and 1 (default).
# Pre-processing train set
train_xgb_set <- train_set |>
mutate_if(is.character, as.factor) |>
mutate_if(is.factor, as.numeric) |>
mutate(target = if_else(target == 1, 0, 1))
# Pre-processing test set
test_xgb_set <- test_set |>
mutate_if(is.character, as.factor) |>
mutate_if(is.factor, as.numeric) |>
mutate(target = if_else(target == 1, 0, 1))
We have trained this xgboost model with a learning rate of 0.071, a subsample ratio of 0.99, a max depth of 11 and a colsample_bytree of 0.85. This particular setting gave us the best results for our task; however, you’re welcome to try different hyperparameters and see how they affect the results.
# Training the model
xgb_train_matrix <- xgb.DMatrix(
data = as.matrix(train_xgb_set[,!names(train_xgb_set) %in% c("target")]),
label = train_xgb_set$target)
xgb_params <- list(objective = "binary:logistic", eval_metric = "auc",
max_depth = 11, eta = 0.071, subsample = 0.99,
colsample_bytree = 0.85)
set.seed(17)
xgb_fit <- xgb.train(params = xgb_params,
data = xgb_train_matrix, verbose = 1,
nrounds = 155)
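If you do want to experiment with the hyperparameters, one convenient option is xgboost’s built-in cross-validation, which re-uses the same parameter list and training matrix. The sketch below is only illustrative: the choice of 5 folds, 300 candidate rounds and a 20-round early-stopping window is arbitrary and not part of the workflow above.
# Optional tuning sketch (illustrative settings, not those used above)
set.seed(17)
xgb_cv <- xgb.cv(params = xgb_params,
data = xgb_train_matrix,
nrounds = 300, nfold = 5,
early_stopping_rounds = 20, verbose = 0)
xgb_cv$best_iteration # suggested number of boosting rounds under these settings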
# Evaluate xgboost Model on Test Set
xgb_test <- xgb.DMatrix(
data = as.matrix(test_xgb_set[,!names(test_xgb_set) %in% c("target")]))
xgb_predictions <- ifelse(predict(xgb_fit, newdata = xgb_test) > 0.5,1,0)
mean(test_xgb_set$target == xgb_predictions)
## [1] 0.785
matrix_4 <- confusionMatrix(as.factor(xgb_predictions),
as.factor(test_xgb_set$target))
matrix_4$byClass
## Sensitivity Specificity Pos Pred Value
## 0.9142857 0.4833333 0.8050314
## Neg Pred Value Precision Recall
## 0.7073171 0.8050314 0.9142857
## F1 Prevalence Detection Rate
## 0.8561873 0.7000000 0.6400000
## Detection Prevalence Balanced Accuracy
## 0.7950000 0.6988095
The xgboost model achieved an Accuracy of 78.5%, the highest so far. The model also achieved a Precision score of 80.5% and a Recall of 91.4%, higher than the glm and similar to the rf.
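Before comparing the models with ROC curves, it can also be informative to look at which features the boosted trees rely on most. The xgboost package ships importance utilities that work directly on the fitted booster; an optional check, output not shown here:
importance <- xgb.importance(model = xgb_fit) # gain, cover and frequency per feature
head(importance)
xgb.plot.importance(importance, top_n = 10) # bar chart of the ten most important features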
Finally, we will perform a ROC analysis to choose the best model for our credit card default classification task.
# Using the ROC Curve to evaluate the models
glm_roc <- roc(as.numeric(test_set$target), as.numeric(predict(fit_1, test_set)))
knn_roc <- roc(as.numeric(test_set$target), as.numeric(predict(fit_2, test_set)))
rf_roc <- roc(as.numeric(test_set$target), as.numeric(predict(fit_3, test_set)))
xgboost_roc <- roc(as.numeric(test_xgb_set$target), xgb_predictions)
# Combining ROC data into a single data frame
roc_data <- rbind(
data.frame(Model = "GLM", glm_roc |> coords()),
data.frame(Model = "kNN", knn_roc |> coords()),
data.frame(Model = "Random Forest", rf_roc |> coords()),
data.frame(Model = "XGBoost", xgboost_roc |> coords())
)
# Plot ROC curves using ggplot2
ggplot(roc_data, aes(x = 1 - specificity, y = sensitivity)) +
geom_line(aes(color = Model)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(title = "ROC Curves", x = "False Positive Rate", y = "True Positive Rate") +
facet_wrap(~ Model, scales = "free") +
theme_bw()
The ROC curve is an easy and fast way to measure the performance of a binary classification model: the closer the curve sits to the top-left corner, the better. The Area Under the ROC Curve (AUC), in turn, is the probability that our ML classification model “will rank a randomly chosen positive instance higher than a randomly chosen negative instance” (Fawcett, 2006, p. 868).
In this exercise, the glm and xgboost models showed the best curves. Now, let’s look at the AUC values to see which model had the better performance.
auc_table <- data.frame(model = c("glm", "XGBoost", "kNN", "Random Forest"),
auc = c(glm_roc$auc, xgboost_roc$auc,
knn_roc$auc, rf_roc$auc))
knitr::kable(digits = 2, arrange(auc_table, -auc), align = "lc")
| model | auc |
|---|---|
| XGBoost | 0.70 |
| glm | 0.67 |
| Random Forest | 0.62 |
| kNN | 0.50 |
After checking all the indicators, the best model for our classification task turned out to be XGBoost, with an AUC of 0.70, which falls within the common rule of thumb for an acceptable model. This is further evidence in favour of choosing this model for classification tasks.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). ACM. Retrieved from: https://arxiv.org/abs/1603.02754
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874. doi: 10.1016/j.patrec.2005.10.010. Retrieved from: https://people.inf.elte.hu/kiss/11dwhdm/roc.pdf
Lantz, B. (2013). Machine Learning with R. Packt Publishing. Retrieved from: https://supermariogiacomazzo.github.io/STOR538_WEBSITE/Textbooks%20in%20R/Machine%20Learning%20with%20R.pdf
Stednick, Z. (n.d.). Machine-Learning-with-R-datasets [GitHub repository]. Retrieved from: https://github.com/stedy/Machine-Learning-with-R-datasets
I acknowledge that the paragraphs of this article were written with the help of Chat-GPT 3.5.