HW 6_ROC Curve Analysis for Credit Default Prediction Using Classification Models

Introduction

Credit default prediction is a critical task in financial risk management, enabling institutions to assess the likelihood of borrowers failing to meet their obligations. This study applies three classification techniques—Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA)—to evaluate their effectiveness in predicting default outcomes.

Using the Default.csv dataset, this report focuses on comparing model performance through Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) metrics.

Methodology

2.1 Data Preparation

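The dataset is loaded and the response variable (default) is converted to a factor. A 70/30 train-test split, stratified on default so that both partitions contain roughly the same share of defaulters, reserves a held-out test set for model evaluation. The random seed is fixed at 123 so that the split is reproducible.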
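To make the idea of a stratified split concrete, the following base-R sketch samples 70% of the rows within each class separately. It is only an illustration on hypothetical toy data (the `toy` data frame and its `id` column are assumptions, not part of Default.csv), and it approximates rather than reproduces what `initial_split()` does with `strata = default`.

```r
# Minimal base-R sketch of a stratified 70/30 split (hypothetical toy data).
set.seed(123)

# Toy response with a rare "Yes" class, loosely mimicking a default indicator.
toy <- data.frame(
  id      = 1:1000,
  default = factor(rep(c("No", "Yes"), times = c(970, 30)))
)

# Sample 70% of the row indices within each class separately.
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$default),
  function(idx) idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should be the same in both partitions.
prop.table(table(toy_train$default))
prop.table(table(toy_test$default))
```

Sampling within each class matters most when the positive class is rare, because a purely random split could leave very few defaulters in the test set.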

2.2 Model Specification

Three models are estimated, each using income and balance as predictors:

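Logistic Regression: A baseline probabilistic classification model that models the log-odds of default as a linear function of the predictors.

LDA: Assumes the predictors are normally distributed within each class, with a covariance matrix shared across classes.

QDA: Extends LDA by allowing each class its own covariance matrix.

2.3 Evaluation Metrics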

Model performance is evaluated using:

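ROC Curve: Plots the true positive rate against the false positive rate across all classification thresholds.

AUC (Area Under Curve): Summarizes the ROC curve in a single number, where 0.5 corresponds to random ranking and 1 to perfect separation.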
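As an illustration of these two metrics, the sketch below builds ROC points and the AUC from scratch in base R. The `labels` and `scores` vectors are hypothetical values chosen for the example, not output from the models in this report.

```r
# Minimal sketch: ROC points and AUC from scratch (hypothetical scores).
labels <- c(0, 0, 1, 0, 1, 1)              # 1 = default, 0 = no default
scores <- c(0.1, 0.3, 0.35, 0.6, 0.7, 0.9) # predicted probability of default

# Sweep thresholds from high to low; predict "default" when score >= threshold.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# AUC by the trapezoidal rule over the (fpr, tpr) points.
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Equivalent ranking view: the share of (positive, negative) pairs in which
# the positive case receives the higher score (ties count as one half).
score_diff <- outer(scores[labels == 1], scores[labels == 0], "-")
auc_rank   <- mean((score_diff > 0) + 0.5 * (score_diff == 0))

auc_trapezoid
auc_rank
```

Both calculations give the same value, which is why the AUC can be read as the probability that a randomly chosen defaulter is ranked above a randomly chosen non-defaulter.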

Empirical Results

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──

## ✔ broom        1.0.10     ✔ recipes      1.3.1 
## ✔ dials        1.4.2      ✔ rsample      1.3.1 
## ✔ dplyr        1.1.4      ✔ tailor       0.1.0 
## ✔ ggplot2      4.0.2      ✔ tidyr        1.3.1 
## ✔ infer        1.1.0      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.4.1      ✔ workflowsets 1.1.1 
## ✔ purrr        1.1.0      ✔ yardstick    1.3.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(ggplot2)
library(readr)

## 
## Attaching package: 'readr'

## The following object is masked from 'package:yardstick':
## 
##     spec

## The following object is masked from 'package:scales':
## 
##     col_factor

library(discrim)

## 
## Attaching package: 'discrim'

## The following object is masked from 'package:dials':
## 
##     smoothness

set.seed(123)

data <- read_csv("C:/Users/Gilang/Downloads/Default.csv")

## New names:
## • `` -> `...1`
## Rows: 10000 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): default, student
## dbl (3): ...1, balance, income
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data$default <- as.factor(data$default)

# Train-Test Split

split <- initial_split(data, prop = 0.7, strata = default)
train_data <- training(split)
test_data  <- testing(split)

# Logistic Regression

log_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(default ~ income + balance, data = train_data)

log_pred <- predict(log_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

## New names:
## • `...1` -> `...3`

# LDA

lda_fit <- discrim_linear() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)

lda_pred <- predict(lda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

## New names:
## • `...1` -> `...3`

# QDA

qda_fit <- discrim_quad() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)

qda_pred <- predict(qda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

## New names:
## • `...1` -> `...3`

# ROC & AUC

roc_log <- roc(test_data$default, log_pred$.pred_Yes)

## Setting levels: control = No, case = Yes

## Setting direction: controls < cases

roc_lda <- roc(test_data$default, lda_pred$.pred_Yes)

## Setting levels: control = No, case = Yes
## Setting direction: controls < cases

roc_qda <- roc(test_data$default, qda_pred$.pred_Yes)

## Setting levels: control = No, case = Yes
## Setting direction: controls < cases

auc_log <- auc(roc_log)
auc_lda <- auc(roc_lda)
auc_qda <- auc(roc_qda)

# Combine for Plot

roc_df <- rbind(
  data.frame(tpr = roc_log$sensitivities,
             fpr = 1 - roc_log$specificities,
             model = "Logistic Regression"),
  data.frame(tpr = roc_lda$sensitivities,
             fpr = 1 - roc_lda$specificities,
             model = "LDA"),
  data.frame(tpr = roc_qda$sensitivities,
             fpr = 1 - roc_qda$specificities,
             model = "QDA")
)

# Plot

ggplot(roc_df, aes(x = fpr, y = tpr, color = model)) +
  geom_line(linewidth = 1.2) +  # `linewidth` replaces `size` for lines since ggplot2 3.4.0
  geom_abline(linetype = "dashed") +
  labs(
    title = "ROC Curve Comparison",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  theme_minimal()

# Print AUC

auc_log

## Area under the curve: 0.9438

auc_lda

## Area under the curve: 0.9438

auc_qda

## Area under the curve: 0.9434

Discussion

The ROC curves of all three models lie well above the diagonal benchmark line, which represents random classification. The test-set AUC values confirm strong discriminative performance: 0.9438 for Logistic Regression, 0.9438 for LDA, and 0.9434 for QDA.

Logistic Regression and LDA produce nearly identical ROC curves and the same AUC to four decimal places (0.9438). This is expected, since both models yield a decision boundary that is linear in income and balance and differ only in how the coefficients are estimated. The close agreement suggests that a linear boundary is adequate for this dataset.
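The shared linear structure can be written out explicitly. The notation below is an assumption introduced for illustration and does not appear in the report: $x$ is the predictor vector (income, balance), $\pi_k$ the prior probability of class $k$ (1 = default, 0 = no default), $\mu_k$ the class mean, and $\Sigma$ (or $\Sigma_k$) the covariance matrix.

```latex
% Logistic regression: log-odds are linear in x by construction.
\log\frac{P(Y=1\mid x)}{P(Y=0\mid x)} = \beta_0 + \beta^{\top}x

% LDA (shared covariance \Sigma): the log-odds are also linear in x.
\log\frac{P(Y=1\mid x)}{P(Y=0\mid x)}
  = \log\frac{\pi_1}{\pi_0}
  - \tfrac{1}{2}(\mu_1+\mu_0)^{\top}\Sigma^{-1}(\mu_1-\mu_0)
  + x^{\top}\Sigma^{-1}(\mu_1-\mu_0)

% QDA (class-specific \Sigma_k): the quadratic terms in x no longer cancel.
\log\frac{P(Y=1\mid x)}{P(Y=0\mid x)}
  = \log\frac{\pi_1}{\pi_0}
  - \tfrac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_0|}
  - \tfrac{1}{2}(x-\mu_1)^{\top}\Sigma_1^{-1}(x-\mu_1)
  + \tfrac{1}{2}(x-\mu_0)^{\top}\Sigma_0^{-1}(x-\mu_0)
```

Setting $\Sigma_1 = \Sigma_0 = \Sigma$ in the QDA expression cancels the $x^{\top}\Sigma^{-1}x$ terms and recovers the LDA form, so LDA is a special case of QDA.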

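In contrast, QDA allows a quadratic (non-linear) boundary by estimating a separate covariance matrix for each class. Here the added flexibility brings no gain: QDA's AUC (0.9434) is 0.0004 below that of the linear models, a gap too small to interpret on a single test split. Extra flexibility of this kind can lead to overfitting, particularly when the sample is small or when the class covariance structures do not differ much.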
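One way to judge whether a small AUC gap between two models is meaningful is a paired bootstrap over the test set. The sketch below is a minimal base-R illustration on simulated data. The helper `auc_rank()` and the vectors `score_a` and `score_b` are hypothetical stand-ins for two models' predicted default probabilities, and nothing here was run on the report's data.

```r
# Minimal sketch: paired bootstrap for the AUC difference between two models.
# All data are simulated; score_a and score_b are hypothetical model scores.
set.seed(123)

auc_rank <- function(y, s) {
  # Mann-Whitney form of the AUC; y is 0/1, s is the score.
  r  <- rank(s)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

n <- 2000
y <- rbinom(n, 1, 0.05)                 # rare positive class
signal  <- rnorm(n, mean = 2 * y)       # shared information about the outcome
score_a <- signal + rnorm(n, sd = 0.3)  # two noisy versions of the same signal
score_b <- signal + rnorm(n, sd = 0.3)

observed_diff <- auc_rank(y, score_a) - auc_rank(y, score_b)

# Resample test rows with replacement, keeping each row's two scores together.
boot_diff <- replicate(1000, {
  idx <- sample.int(n, replace = TRUE)
  auc_rank(y[idx], score_a[idx]) - auc_rank(y[idx], score_b[idx])
})

# Percentile interval for the AUC difference.
ci <- quantile(boot_diff, c(0.025, 0.975))
observed_diff
ci
```

If the interval covers zero, the data do not distinguish the two models on AUC. The pROC package used in this report also provides `roc.test()` for a formal comparison of two ROC curves, which may be a more convenient alternative.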

Conclusion

In this dataset, Logistic Regression and LDA match QDA on test-set AUC (0.9438 versus 0.9434). All three models discriminate well between defaulters and non-defaulters, but Logistic Regression is preferred for its simplicity and interpretability, and because it does not rely on normality assumptions for the predictors.

From a financial modeling perspective, selecting a parsimonious model with strong predictive power is crucial. Therefore, Logistic Regression emerges as the most suitable approach for practical implementation in credit risk assessment.

HW 6_ROC Curve Analysis for Credit Default Prediction Using Classification Models_114035130

Muhammad Gilang Putra Ariyanto

2026-03-30