Credit default prediction is a critical task in financial risk management, enabling institutions to assess the likelihood of borrowers failing to meet their obligations. This study applies three classification techniques—Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA)—to evaluate their effectiveness in predicting default outcomes.
Using the Default.csv dataset, this report focuses on comparing model performance through Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) metrics.
The dataset is first loaded and preprocessed. The response variable (default) is converted into a factor. A 70/30 train-test split, stratified by default status, is then applied so that model performance is measured on observations that were not used for fitting.
2.2 Model Specification
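Three models are estimated, each with income and balance as predictors:
Logistic Regression: A baseline probabilistic classifier that models the log-odds of default as a linear function of the predictors.
LDA: Assumes the predictors are normally distributed within each class, with a covariance matrix shared across classes.
QDA: Extends LDA by allowing each class its own covariance matrix.
2.3 Evaluation Metrics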
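Model performance is evaluated using:
ROC Curve: Plots the true positive rate against the false positive rate at every classification threshold.
AUC (Area Under the Curve): Summarizes the ROC curve in a single number, interpretable as the probability that a randomly chosen defaulter receives a higher predicted default probability than a randomly chosen non-defaulter. A value of 0.5 corresponds to random guessing and 1 to perfect separation.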
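The rank interpretation of AUC can be checked by hand. The following is a minimal sketch in base R, assuming made-up scores and labels rather than the Default data; the helper name `auc_rank` is hypothetical and is not part of pROC or tidymodels.

```r
# AUC via the rank-sum (Mann-Whitney) identity, base R only.
# `scores` and `is_case` are illustrative values, not taken from Default.csv.
auc_rank <- function(scores, is_case) {
  stopifnot(length(scores) == length(is_case), is.logical(is_case))
  n_case <- sum(is_case)
  n_ctrl <- sum(!is_case)
  r <- rank(scores)  # midranks, so tied scores count as half a win
  (sum(r[is_case]) - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)
}

scores  <- c(0.10, 0.35, 0.40, 0.80, 0.65, 0.90)
is_case <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

auc_value <- auc_rank(scores, is_case)
print(auc_value)
```

In this toy example, 7 of the 9 case-control pairs are ordered correctly, so the function returns 7/9 (about 0.778).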
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.10 ✔ recipes 1.3.1
## ✔ dials 1.4.2 ✔ rsample 1.3.1
## ✔ dplyr 1.1.4 ✔ tailor 0.1.0
## ✔ ggplot2 4.0.2 ✔ tidyr 1.3.1
## ✔ infer 1.1.0 ✔ tune 2.0.1
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.4.1 ✔ workflowsets 1.1.1
## ✔ purrr 1.1.0 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ggplot2)
library(readr)
##
## Attaching package: 'readr'
## The following object is masked from 'package:yardstick':
##
## spec
## The following object is masked from 'package:scales':
##
## col_factor
library(discrim)
##
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
##
## smoothness
set.seed(123)
data <- read_csv("C:/Users/Gilang/Downloads/Default.csv")
## New names:
## • `` -> `...1`
## Rows: 10000 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): default, student
## dbl (3): ...1, balance, income
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data$default <- as.factor(data$default)
# Train-Test Split
split <- initial_split(data, prop = 0.7, strata = default)
train_data <- training(split)
test_data <- testing(split)
# Logistic Regression
log_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(default ~ income + balance, data = train_data)
log_pred <- predict(log_fit, test_data, type = "prob") %>%
  bind_cols(test_data)
# LDA
lda_fit <- discrim_linear() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)
lda_pred <- predict(lda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)
# QDA
qda_fit <- discrim_quad() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)
qda_pred <- predict(qda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)
# ROC & AUC
# Levels and direction are stated explicitly ("No" = control, "Yes" = case)
# so that pROC does not have to infer them.
roc_log <- roc(test_data$default, log_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")
roc_lda <- roc(test_data$default, lda_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")
roc_qda <- roc(test_data$default, qda_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")
auc_log <- auc(roc_log)
auc_lda <- auc(roc_lda)
auc_qda <- auc(roc_qda)
# Combine for Plot
roc_df <- rbind(
  data.frame(tpr = roc_log$sensitivities,
             fpr = 1 - roc_log$specificities,
             model = "Logistic Regression"),
  data.frame(tpr = roc_lda$sensitivities,
             fpr = 1 - roc_lda$specificities,
             model = "LDA"),
  data.frame(tpr = roc_qda$sensitivities,
             fpr = 1 - roc_qda$specificities,
             model = "QDA")
)
# Plot
# `linewidth` replaces the `size` aesthetic for lines, deprecated in ggplot2 3.4.0
ggplot(roc_df, aes(x = fpr, y = tpr, color = model)) +
  geom_line(linewidth = 1.2) +
  geom_abline(linetype = "dashed") +
  labs(
    title = "ROC Curve Comparison",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  theme_minimal()
# Print AUC
auc_log
## Area under the curve: 0.9438
auc_lda
## Area under the curve: 0.9438
auc_qda
## Area under the curve: 0.9434
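The three AUC values differ only in the fourth decimal place, so it is worth asking whether such a gap exceeds sampling noise. One way to gauge this is a paired bootstrap over the test rows. The following is a rough sketch in base R under stated assumptions: the scores are simulated placeholders, not the fitted models above, and the helpers `auc_rank` and `boot_auc_diff` are hypothetical names introduced for illustration. pROC also provides `roc.test()` for comparing two ROC curves, which may be the more convenient route in practice.

```r
# Paired bootstrap interval for the AUC difference between two models that
# were scored on the same test rows. Base R only; all data are simulated.
auc_rank <- function(scores, is_case) {
  n_case <- sum(is_case)
  n_ctrl <- sum(!is_case)
  r <- rank(scores)  # midranks handle ties
  (sum(r[is_case]) - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)
}

boot_auc_diff <- function(score_a, score_b, is_case, n_boot = 500) {
  n <- length(is_case)
  diffs <- replicate(n_boot, {
    idx <- sample.int(n, n, replace = TRUE)  # same rows for both models
    y <- is_case[idx]
    if (all(y) || !any(y)) {
      NA_real_  # resample lost one class, so AUC is undefined
    } else {
      auc_rank(score_a[idx], y) - auc_rank(score_b[idx], y)
    }
  })
  quantile(diffs, c(0.025, 0.975), na.rm = TRUE)
}

set.seed(123)
n <- 1000
is_case <- runif(n) < 0.1                      # simulated class labels
signal  <- rnorm(n, mean = ifelse(is_case, 2, 0))
score_a <- signal + rnorm(n, sd = 0.1)         # two nearly identical scorers
score_b <- signal + rnorm(n, sd = 0.1)

ci <- boot_auc_diff(score_a, score_b, is_case)
print(ci)
```

If the resulting interval contains zero, the data give no grounds for ranking one model above the other on AUC.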
The ROC curves of all three models lie well above the diagonal reference line, which represents random classification (AUC = 0.5). The test-set AUC values, 0.9438 for Logistic Regression and LDA and 0.9434 for QDA, confirm strong discrimination for every model.
Logistic Regression and LDA produce nearly identical ROC curves, and their AUC values agree to four decimal places. This suggests that a linear decision boundary is adequate for this dataset and that either model is a reliable choice for default prediction.
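The close agreement is expected, because both models imply log-odds that are linear in the predictors and differ only in how the coefficients are estimated. The notation below is standard textbook notation and is an assumption of this sketch, not something defined elsewhere in the report: $x$ is the predictor vector (income, balance), $\pi_k$ the prior probability of class $k$, $\mu_k$ the class mean, and $\Sigma$ the shared covariance matrix, with class 1 denoting default.

```latex
% Logistic regression: coefficients estimated by maximum likelihood
\log\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = \beta_0 + \beta^{\top} x

% LDA: the same linear form, with coefficients implied by the Gaussian model
\log\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}
  = \log\frac{\pi_1}{\pi_0}
  - \tfrac{1}{2}(\mu_1 + \mu_0)^{\top}\Sigma^{-1}(\mu_1 - \mu_0)
  + x^{\top}\Sigma^{-1}(\mu_1 - \mu_0)
```

Since AUC depends only on how observations are ranked, two models whose linear scores are nearly proportional will produce nearly the same curve.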
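In contrast, QDA allows a non-linear (quadratic) boundary by estimating a separate covariance matrix for each class. Here the extra flexibility does not pay off: its test AUC (0.9434) is marginally below that of the linear models (0.9438), a gap too small to be meaningful. The additional parameters increase variance and can lead to overfitting, particularly when the sample size is limited or when the class covariance structures do not really differ.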
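The source of the extra flexibility can be made explicit with the QDA discriminant function. The symbols are again standard notation assumed for this sketch: $\mu_k$, $\Sigma_k$, and $\pi_k$ are the mean, covariance matrix, and prior probability of class $k$.

```latex
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
            - \tfrac{1}{2}(x - \mu_k)^{\top}\Sigma_k^{-1}(x - \mu_k)
            + \log\pi_k
```

The boundary $\delta_1(x) = \delta_0(x)$ is quadratic in $x$ because the terms $x^{\top}\Sigma_k^{-1}x$ no longer cancel. When $\Sigma_1 = \Sigma_0$ they do cancel, and the rule reduces to LDA. With $p = 2$ predictors, each covariance matrix has $p(p+1)/2 = 3$ free parameters, so QDA estimates 6 covariance parameters where LDA estimates 3.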
This study finds that Logistic Regression and LDA match QDA, and marginally exceed it, in predicting credit default on this dataset. All three models discriminate well, with a test AUC of about 0.94. Logistic Regression is preferred for its simplicity, its interpretability, and its robustness: unlike LDA and QDA, it does not assume normally distributed predictors.
From a financial modeling perspective, selecting a parsimonious model with strong predictive power is crucial. Therefore, Logistic Regression emerges as the most suitable approach for practical implementation in credit risk assessment.