1. Introduction

Predicting whether a borrower will default on a loan is a central task in financial risk management. It helps banks and other lenders assess how risky a borrower is.

In this study, we compare three classification methods, Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA), on how well they predict loan default.

We use the Default.csv dataset and compare the models using ROC curves and AUC (Area Under the Curve) to measure their performance.

2. Methodology

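2.1 Data Preparation

First, the dataset (10,000 observations) is loaded and the target variable (default) is converted to a factor. The data is then split into two parts, 70% for training the models and 30% for testing them, stratified on default so that both parts have a similar default rate. A fixed random seed (123) makes the split reproducible. Evaluating on held-out test data gives a fairer estimate of how the models perform on unseen borrowers.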

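2.2 Model Specification

We build three models, each using income and balance as predictors. Logistic Regression models the log-odds of default as a linear function of the predictors and returns an estimated probability of default. LDA assumes that the predictors are normally distributed within each class and that all classes share the same covariance matrix. QDA makes the same normality assumption but allows each class to have its own covariance matrix.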
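
For reference, the three models can be written in their standard textbook form. The notation here is an assumption introduced for illustration and does not appear in the code: x is the vector of predictors (income, balance), k indexes the class (No, Yes), \pi_k is the prior probability of class k, \mu_k is the class mean, and \Sigma (or \Sigma_k) is the covariance matrix.

```latex
% Logistic Regression: log-odds of default are linear in the predictors
\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1\,\text{income} + \beta_2\,\text{balance},
\qquad p(x) = P(\text{default} = \text{Yes} \mid x)

% LDA: shared covariance matrix \Sigma, so the discriminant is linear in x
\delta_k(x) = x^\top \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k
            + \log \pi_k

% QDA: class-specific covariance \Sigma_k, so the discriminant is quadratic in x
\delta_k(x) = -\tfrac{1}{2} \log\lvert \Sigma_k \rvert
            - \tfrac{1}{2}\, (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)
            + \log \pi_k
```

In both discriminant models an observation is assigned to the class with the largest \delta_k(x).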

2.3 Evaluation Metrics

We evaluate the models using ROC curves, which plot the true positive rate against the false positive rate across all classification thresholds, and AUC, which summarizes the ROC curve in a single number. An AUC of 0.5 corresponds to random guessing and an AUC of 1 to perfect separation of the two classes.
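
As a short sketch of the quantities behind these metrics, the standard definitions are given here. The symbols are assumptions introduced for illustration: t is the classification threshold applied to the predicted probability of default, s(\cdot) is the model's score, and X^{+} and X^{-} are a randomly chosen defaulter and non-defaulter.

```latex
% Rates at threshold t, from the confusion-matrix counts at that threshold
\mathrm{TPR}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FN}(t)},
\qquad
\mathrm{FPR}(t) = \frac{\mathrm{FP}(t)}{\mathrm{FP}(t) + \mathrm{TN}(t)}

% AUC: area under the curve traced by (FPR(t), TPR(t)) as t varies
\mathrm{AUC} = \int_0^1 \mathrm{TPR}\; d(\mathrm{FPR})

% Equivalent ranking interpretation (ignoring ties in the scores)
\mathrm{AUC} = P\bigl( s(X^{+}) > s(X^{-}) \bigr)
```

The ranking form explains why AUC does not depend on any single threshold: it measures only how well the model orders defaulters above non-defaulters.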

3. Empirical Results

This section presents and compares the results from the three models.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.10     ✔ recipes      1.3.1 
## ✔ dials        1.4.2      ✔ rsample      1.3.1 
## ✔ dplyr        1.1.4      ✔ tailor       0.1.0 
## ✔ ggplot2      4.0.2      ✔ tidyr        1.3.1 
## ✔ infer        1.1.0      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.4.1      ✔ workflowsets 1.1.1 
## ✔ purrr        1.1.0      ✔ yardstick    1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(ggplot2)
library(readr)
## 
## Attaching package: 'readr'
## The following object is masked from 'package:yardstick':
## 
##     spec
## The following object is masked from 'package:scales':
## 
##     col_factor
library(discrim)
## 
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
## 
##     smoothness
set.seed(123)
data <- read_csv("C:/Users/jayde/Downloads/Default.csv")
## New names:
## • `` -> `...1`
## Rows: 10000 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): default, student
## dbl (3): ...1, balance, income
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data$default <- as.factor(data$default) 
# Train-Test Split

split <- initial_split(data, prop = 0.7, strata = default)
train_data <- training(split)
test_data  <- testing(split)  
# Logistic Regression

log_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(default ~ income + balance, data = train_data)

log_pred <- predict(log_fit, test_data, type = "prob") %>%
  bind_cols(test_data) 
## New names:
## • `...1` -> `...3`
# LDA

lda_fit <- discrim_linear() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)

lda_pred <- predict(lda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)
## New names:
## • `...1` -> `...3`
# QDA

qda_fit <- discrim_quad() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)

qda_pred <- predict(qda_fit, test_data, type = "prob") %>%
  bind_cols(test_data) 
## New names:
## • `...1` -> `...3`
# ROC & AUC

roc_log <- roc(test_data$default, log_pred$.pred_Yes)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
roc_lda <- roc(test_data$default, lda_pred$.pred_Yes)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
roc_qda <- roc(test_data$default, qda_pred$.pred_Yes)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
auc_log <- auc(roc_log)
auc_lda <- auc(roc_lda)
auc_qda <- auc(roc_qda)
# Combine for Plot

roc_df <- rbind(
  data.frame(tpr = roc_log$sensitivities,
             fpr = 1 - roc_log$specificities,
             model = "Logistic Regression"),
  data.frame(tpr = roc_lda$sensitivities,
             fpr = 1 - roc_lda$specificities,
             model = "LDA"),
  data.frame(tpr = roc_qda$sensitivities,
             fpr = 1 - roc_qda$specificities,
             model = "QDA")
)
# Plot

ggplot(roc_df, aes(x = fpr, y = tpr, color = model)) +
  geom_line(linewidth = 1.2) +        # `linewidth` replaces the deprecated `size` for lines
  geom_abline(linetype = "dashed") +  # diagonal = random guessing
  labs(
    title = "ROC Curve Comparison",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  theme_minimal()

# Print AUC

auc_log
## Area under the curve: 0.9438
auc_lda
## Area under the curve: 0.9438
auc_qda 
## Area under the curve: 0.9434
4. Discussion

The ROC curves show that all three models perform much better than random guessing, since their curves lie well above the diagonal line. The test-set AUC values confirm this: 0.9438 for Logistic Regression, 0.9438 for LDA, and 0.9434 for QDA.

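Logistic Regression and LDA have very similar ROC curves and the same AUC to four decimal places. This is expected, because both models produce a linear decision boundary, and it suggests that a linear boundary works well for this dataset. Note that these results come from a single train-test split, so they do not by themselves measure how stable the models are across different samples.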

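QDA, on the other hand, is more flexible because it allows quadratic (non-linear) decision boundaries. However, this flexibility can cause overfitting, especially when the dataset is small or when the data does not clearly support different covariance structures for the two classes. Here, QDA's AUC (0.9434) is marginally lower than that of the two linear models (0.9438), so the extra flexibility brings no measurable gain.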
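
The contrast between linear and quadratic boundaries can be made explicit with the standard two-class derivation, sketched here under the usual Gaussian assumptions. The notation is an assumption introduced for illustration: classes 0 (No) and 1 (Yes) have means \mu_0, \mu_1, covariance matrices \Sigma_0, \Sigma_1, and priors \pi_0, \pi_1.

```latex
% QDA boundary: the set of x where both classes are equally likely
-\tfrac{1}{2}\, x^\top \bigl( \Sigma_1^{-1} - \Sigma_0^{-1} \bigr) x
+ x^\top \bigl( \Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0 \bigr)
- \tfrac{1}{2} \bigl( \mu_1^\top \Sigma_1^{-1} \mu_1 - \mu_0^\top \Sigma_0^{-1} \mu_0 \bigr)
- \tfrac{1}{2} \log\frac{\lvert \Sigma_1 \rvert}{\lvert \Sigma_0 \rvert}
+ \log\frac{\pi_1}{\pi_0} = 0

% LDA boundary: with \Sigma_0 = \Sigma_1 = \Sigma the quadratic and
% log-determinant terms vanish, leaving an equation linear in x
x^\top \Sigma^{-1} (\mu_1 - \mu_0)
- \tfrac{1}{2} \bigl( \mu_1^\top \Sigma^{-1} \mu_1 - \mu_0^\top \Sigma^{-1} \mu_0 \bigr)
+ \log\frac{\pi_1}{\pi_0} = 0
```

QDA therefore estimates a separate covariance matrix per class, which adds parameters and is the source of the overfitting risk when the class covariances are in fact similar.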

5. Conclusion

This study shows that Logistic Regression and LDA perform at least as well as QDA in predicting credit default for this dataset (AUC of 0.9438 versus 0.9434). Since all three models achieve a high AUC and the differences are negligible, Logistic Regression is preferred because it is simple, easy to interpret, and makes fewer distributional assumptions.

From a financial modeling point of view, it is important to choose a simple model that still performs well. On the evidence of this dataset, Logistic Regression is therefore a suitable choice for credit risk prediction.