In this study, we compare three classification methods, Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA), on how well they predict loan default.
We fit each model to the Default.csv dataset and compare them using ROC curves and the AUC (Area Under the Curve).
2.1 Data Preparation
First, the dataset is loaded and the target variable (default) is converted to a factor. The data is then split into two parts, with default passed as the stratification variable: 70% for training the models and 30% for testing them. Evaluating on held-out data shows how the models perform on observations they were not fitted to.
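2.2 Model Specification
We build three models, each using income and balance as predictors. Logistic Regression models the log-odds of default as a linear function of the predictors and returns an estimated probability of default. LDA assumes the predictors are normally distributed within each class and that both classes share the same covariance matrix, which yields a linear decision boundary. QDA makes the same normality assumption but estimates a separate covariance matrix for each class, which yields a quadratic decision boundary.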
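For reference, the three models can be written compactly as below. This is a sketch in standard textbook notation, which is an assumption of this summary and is not defined elsewhere in the report: $x$ is the predictor vector (income, balance), $\pi_k$ the prior probability of class $k$, $\mu_k$ the class mean, and $\Sigma$ or $\Sigma_k$ the shared or class-specific covariance matrix.

```latex
\begin{align*}
\text{Logistic regression:}\quad
  & \log\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1\,\mathrm{income} + \beta_2\,\mathrm{balance},
    \qquad p(x) = \Pr(\mathrm{default} = \mathrm{Yes} \mid x) \\
\text{LDA:}\quad
  & \delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\,\mu_k^\top \Sigma^{-1}\mu_k + \log \pi_k \\
\text{QDA:}\quad
  & \delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
    - \tfrac{1}{2}\,(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) + \log \pi_k
\end{align*}
```

In both discriminant methods an observation is assigned to the class with the largest $\delta_k(x)$. The LDA score is linear in $x$, while the QDA score contains the quadratic term $x^\top \Sigma_k^{-1} x$, which is why its decision boundary can curve.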
2.3 Evaluation Metrics
We evaluate the models on the test set using ROC curves, which plot the true positive rate against the false positive rate across all classification thresholds, and AUC, which summarizes the curve in a single value: 0.5 corresponds to random guessing and 1 to perfect separation of the classes.
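To make these two metrics concrete, the following minimal sketch computes one ROC point and the AUC by hand in base R. The labels and scores are made up for illustration and are not taken from Default.csv; the variable names are likewise hypothetical.

```r
# Toy data, made up for illustration (not taken from Default.csv)
truth <- c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
score <- c(0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90)  # predicted P(Yes)

# One point on the ROC curve: classify as "Yes" when score >= threshold
threshold <- 0.5
pred_yes <- score >= threshold
tpr <- sum(pred_yes & truth == "Yes") / sum(truth == "Yes")  # true positive rate
fpr <- sum(pred_yes & truth == "No") / sum(truth == "No")    # false positive rate

# AUC as the share of (Yes, No) pairs in which the "Yes" case gets the
# higher score; ties count as half
pos <- score[truth == "Yes"]
neg <- score[truth == "No"]
auc_manual <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(tpr = tpr, fpr = fpr, auc = auc_manual))
```

For this toy data the threshold 0.5 gives a true positive rate of 0.75 and a false positive rate of 0, and 15 of the 16 (Yes, No) pairs are ranked correctly, so the AUC is 0.9375. Sweeping the threshold over all score values traces out the full ROC curve.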
library(tidymodels)
library(pROC)
library(ggplot2)
library(readr)
library(discrim)
set.seed(123)
data <- read_csv("C:/Users/jayde/Downloads/Default.csv")
## New names:
## • `` -> `...1`
## Rows: 10000 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): default, student
## dbl (3): ...1, balance, income
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data$default <- as.factor(data$default)
# Train-Test Split
split <- initial_split(data, prop = 0.7, strata = default)
train_data <- training(split)
test_data <- testing(split)
# Logistic Regression
log_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(default ~ income + balance, data = train_data)
log_pred <- predict(log_fit, test_data, type = "prob") %>%
  bind_cols(test_data)
## New names:
## • `...1` -> `...3`
# LDA
lda_fit <- discrim_linear() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)
lda_pred <- predict(lda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)
## New names:
## • `...1` -> `...3`
# QDA
qda_fit <- discrim_quad() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)
qda_pred <- predict(qda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)
## New names:
## • `...1` -> `...3`
# ROC & AUC
# State the class levels (control = "No", case = "Yes") and the direction
# explicitly instead of relying on pROC's auto-detection
roc_log <- roc(test_data$default, log_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")
roc_lda <- roc(test_data$default, lda_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")
roc_qda <- roc(test_data$default, qda_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")
auc_log <- auc(roc_log)
auc_lda <- auc(roc_lda)
auc_qda <- auc(roc_qda)
# Combine for Plot
roc_df <- rbind(
  data.frame(tpr = roc_log$sensitivities,
             fpr = 1 - roc_log$specificities,
             model = "Logistic Regression"),
  data.frame(tpr = roc_lda$sensitivities,
             fpr = 1 - roc_lda$specificities,
             model = "LDA"),
  data.frame(tpr = roc_qda$sensitivities,
             fpr = 1 - roc_qda$specificities,
             model = "QDA")
)
# Plot
ggplot(roc_df, aes(x = fpr, y = tpr, color = model)) +
  # geom_path() draws the points in threshold order, so vertical steps of the
  # ROC curve are not re-sorted; linewidth replaces the deprecated size argument
  geom_path(linewidth = 1.2) +
  geom_abline(linetype = "dashed") +
  labs(
    title = "ROC Curve Comparison",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  theme_minimal()
# Print AUC
auc_log
## Area under the curve: 0.9438
auc_lda
## Area under the curve: 0.9438
auc_qda
## Area under the curve: 0.9434
Logistic Regression and LDA have the same test AUC (0.9438) and very similar ROC curves, which suggests that a linear decision boundary works well for this dataset. The two models agree closely on the test set, so either is a reasonable basis for predicting default.
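QDA, on the other hand, is more flexible because it allows a quadratic decision boundary. Here that flexibility brings no gain: its AUC (0.9434) is marginally lower than that of the two linear models. More generally, the extra covariance parameters can cause overfitting, especially when the dataset is small or when the data does not clearly support different covariance structures in the two classes.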
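The overfitting risk can be seen in isolation with a small simulation. The sketch below is illustrative only: the data are simulated, not drawn from Default.csv, and the helper `simulate_classes()`, the class means, and the sample sizes are all assumptions chosen for the example. Both classes share one covariance matrix, so LDA's assumption holds and QDA's extra parameters have nothing real to capture.

```r
library(MASS)  # lda(), qda(), and mvrnorm(); the same engine used for the models above

set.seed(1)  # arbitrary seed for this illustration

# Hypothetical helper: two Gaussian classes with a shared (identity) covariance
simulate_classes <- function(n_per_class) {
  no_x  <- mvrnorm(n_per_class, mu = c(0, 0), Sigma = diag(2))
  yes_x <- mvrnorm(n_per_class, mu = c(1.5, 1.5), Sigma = diag(2))
  data.frame(
    x1 = c(no_x[, 1], yes_x[, 1]),
    x2 = c(no_x[, 2], yes_x[, 2]),
    y  = factor(rep(c("No", "Yes"), each = n_per_class))
  )
}

train <- simulate_classes(15)    # deliberately small training set
test  <- simulate_classes(2000)  # large test set for a stable accuracy estimate

lda_fit <- lda(y ~ x1 + x2, data = train)
qda_fit <- qda(y ~ x1 + x2, data = train)

# Test-set accuracy of each fitted model
acc_lda <- mean(predict(lda_fit, test)$class == test$y)
acc_qda <- mean(predict(qda_fit, test)$class == test$y)
print(c(LDA = acc_lda, QDA = acc_qda))
```

The result of a single run depends on the seed, so repeating the simulation over many seeds gives a fairer comparison. With roughly 7,000 training rows, as in the analysis above, the additional parameters QDA estimates are much less of a concern, which is consistent with the near-identical AUC values.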
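From a financial modeling point of view, a simple model is preferable when it performs as well as more complex ones. Since all three models reach almost the same AUC, Logistic Regression, which makes no distributional assumption about the predictors and has directly interpretable coefficients, is the most suitable choice here for real-world credit risk prediction.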