hw6

Credit default prediction is a core task in financial risk management: it helps institutions estimate the likelihood that a borrower will fail to repay a loan. This study compares three classification methods, Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA), on their ability to predict default using the Default.csv dataset. The response variable (default) is first converted to a factor, and the data are split into 70% training and 30% testing sets using default as the stratification variable. Each model is fit on the training set with income and balance as predictors. Logistic Regression serves as the baseline. LDA assumes that the predictors are normally distributed within each class with a covariance matrix shared across classes, which gives a linear decision boundary; QDA lets each class have its own covariance matrix, which gives a more flexible quadratic boundary. Performance is assessed on the test set with Receiver Operating Characteristic (ROC) curves, which show the trade-off between true and false positive rates across classification thresholds, and with the Area Under the Curve (AUC), which summarizes that discrimination ability in a single number.
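For reference, the three models can be written in their standard textbook forms. The notation here is an assumption introduced for illustration and does not appear in the original analysis: x is the predictor vector (income, balance), p(x) is the probability of default given x, k indexes the two classes, and the class means, class priors, and covariance matrices are denoted by mu_k, pi_k, and Sigma. An observation is assigned to the class with the largest discriminant score delta_k(x).

```latex
\begin{align*}
\text{Logistic regression:}\quad & \log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta^\top x \\
\text{LDA:}\quad & \delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k \\
\text{QDA:}\quad & \delta_k(x) = -\tfrac{1}{2} \log \lvert \Sigma_k \rvert - \tfrac{1}{2}\, (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
\end{align*}
```

The LDA score is linear in x because the shared covariance matrix makes the quadratic term identical across classes, so it cancels when scores are compared. In QDA each class keeps its own covariance matrix, so the quadratic term remains and the boundary can curve.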

library(tidymodels)

library(pROC)

library(ggplot2)
library(readr)

library(discrim)

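set.seed(123)  # fix the random seed so the train-test split is reproducible

# Read the data; this path is specific to the author's machine
data <- read.csv("/Users/faizhaikal/Downloads/Default.csv")

# Store the response as a factor so the models are fit as classifiers
data$default <- as.factor(data$default)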

# Train-Test Split

split <- initial_split(data, prop = 0.7, strata = default)
train_data <- training(split)
test_data  <- testing(split)
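A quick way to confirm that a split preserved the class balance is to compare the share of defaults in each partition. The sketch below is illustrative only: it uses a simulated data frame (`toy`) with a hypothetical 20% default rate rather than Default.csv, and the helper `share_yes()` is not part of the original analysis. Note that `initial_split()` has a `pool` argument (default 0.1) that can merge very small strata into larger ones, so this check is most useful when the positive class is rare.

```r
library(rsample)  # initial_split(), training(), testing()

set.seed(123)

# Simulated stand-in for the real data (hypothetical values, not Default.csv)
toy <- data.frame(
  default = factor(
    sample(c("No", "Yes"), size = 1000, replace = TRUE, prob = c(0.8, 0.2)),
    levels = c("No", "Yes")
  ),
  balance = runif(1000, min = 0, max = 2500)
)

toy_split <- initial_split(toy, prop = 0.7, strata = default)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of "Yes" cases in a data frame
share_yes <- function(df) mean(df$default == "Yes")

toy_shares <- c(
  full  = share_yes(toy),
  train = share_yes(toy_train),
  test  = share_yes(toy_test)
)
print(round(toy_shares, 3))
```

If the three shares are close, the partitions have similar class balance. The same three-line check can be applied to `data`, `train_data`, and `test_data`.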

# Logistic Regression

log_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(default ~ income + balance, data = train_data)

log_pred <- predict(log_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

# LDA

lda_fit <- discrim_linear() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)

lda_pred <- predict(lda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

# QDA

qda_fit <- discrim_quad() %>%
  set_engine("MASS") %>%
  fit(default ~ income + balance, data = train_data)

qda_pred <- predict(qda_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

# ROC & AUC

# State the class levels and direction explicitly so pROC does not have to infer them:
# "No" is the control class, "Yes" is the case, and higher scores indicate "Yes"
roc_log <- roc(test_data$default, log_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")

roc_lda <- roc(test_data$default, lda_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")

roc_qda <- roc(test_data$default, qda_pred$.pred_Yes,
               levels = c("No", "Yes"), direction = "<")

auc_log <- auc(roc_log)
auc_lda <- auc(roc_lda)
auc_qda <- auc(roc_qda)
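The AUC has a direct interpretation: it is the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below computes it from that definition in base R. The labels and scores are made-up toy values, and `pairwise_auc()` is a hypothetical helper written for illustration, not a pROC function.

```r
# Toy example (hypothetical values, not taken from Default.csv)
toy_labels <- factor(
  c("No", "No", "No", "No", "Yes", "Yes", "No", "Yes"),
  levels = c("No", "Yes")
)
toy_scores <- c(0.05, 0.10, 0.20, 0.40, 0.35, 0.70, 0.15, 0.90)

# AUC as the share of (positive, negative) pairs in which the positive case
# gets the higher score; tied scores count as half a correct ranking
pairwise_auc <- function(labels, scores, positive = "Yes") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

toy_auc <- pairwise_auc(toy_labels, toy_scores)

# 14 of the 15 (Yes, No) pairs are ranked correctly, so this prints 0.9333
print(round(toy_auc, 4))
```

The only misranked pair is the defaulter scored 0.35 against the non-defaulter scored 0.40. A value of 0.5 would mean the scores rank pairs no better than chance, and 1 would mean every defaulter outranks every non-defaulter.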

# Combine for Plot

roc_df <- rbind(
  data.frame(tpr = roc_log$sensitivities,
             fpr = 1 - roc_log$specificities,
             model = "Logistic Regression"),
  data.frame(tpr = roc_lda$sensitivities,
             fpr = 1 - roc_lda$specificities,
             model = "LDA"),
  data.frame(tpr = roc_qda$sensitivities,
             fpr = 1 - roc_qda$specificities,
             model = "QDA")
)

# Plot

ggplot(roc_df, aes(x = fpr, y = tpr, color = model)) +
  geom_line(linewidth = 1.2) +        # `linewidth` replaces the deprecated `size` for lines
  geom_abline(linetype = "dashed") +  # chance line, corresponding to AUC = 0.5
  labs(
    title = "ROC Curve Comparison",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  theme_minimal()

# Print AUC

auc_log

## Area under the curve: 0.9438

auc_lda

## Area under the curve: 0.9438

auc_qda

## Area under the curve: 0.9434

The ROC curves show that all three models perform far better than random guessing: each curve lies well above the dashed diagonal reference line, which corresponds to an AUC of 0.5. The test-set AUC values confirm this, at 0.9438 for Logistic Regression, 0.9438 for LDA, and 0.9434 for QDA. Logistic Regression and LDA produce nearly identical ROC curves, which is expected because both fit a linear decision boundary, and their high AUC suggests that a linear boundary works well for this dataset. QDA is more flexible because it allows a quadratic decision boundary, but here that flexibility brings no improvement: its AUC is marginally lower than that of the two linear models. Added flexibility also raises the risk of overfitting, especially when the dataset is small or when the class covariance matrices do not differ enough to justify the extra parameters. For this problem, the simpler linear models are therefore the more dependable choice.
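The point about covariance structure can be illustrated with a small simulation. The sketch below is not part of the original analysis and rests on stated assumptions: two classes are drawn from normal distributions with the same covariance matrix (the LDA assumption), the means and covariance values are arbitrary, and `simulate_classes()` and `pairwise_auc()` are hypothetical helpers. Under these conditions LDA and QDA would be expected to reach very similar test AUC, since the extra covariance parameters in QDA have nothing to capture.

```r
set.seed(42)

# Shared covariance matrix for both classes (arbitrary illustrative values)
sim_sigma <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)

# Draw n_per_class observations from each class; only the means differ
simulate_classes <- function(n_per_class) {
  x_no  <- MASS::mvrnorm(n_per_class, mu = c(0, 0),     Sigma = sim_sigma)
  x_yes <- MASS::mvrnorm(n_per_class, mu = c(1.5, 1.5), Sigma = sim_sigma)
  data.frame(
    x1 = c(x_no[, 1], x_yes[, 1]),
    x2 = c(x_no[, 2], x_yes[, 2]),
    y  = factor(rep(c("No", "Yes"), each = n_per_class), levels = c("No", "Yes"))
  )
}

sim_train <- simulate_classes(500)
sim_test  <- simulate_classes(500)

sim_lda <- MASS::lda(y ~ x1 + x2, data = sim_train)
sim_qda <- MASS::qda(y ~ x1 + x2, data = sim_train)

# Posterior probability of the "Yes" class on the held-out data
p_lda <- predict(sim_lda, sim_test)$posterior[, "Yes"]
p_qda <- predict(sim_qda, sim_test)$posterior[, "Yes"]

# AUC as the share of (Yes, No) pairs ranked correctly, ties counted as half
pairwise_auc <- function(p, y) {
  pos <- p[y == "Yes"]
  neg <- p[y == "No"]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

sim_auc <- c(
  LDA = pairwise_auc(p_lda, sim_test$y),
  QDA = pairwise_auc(p_qda, sim_test$y)
)
print(round(sim_auc, 4))
```

Giving the two classes different covariance matrices in `simulate_classes()` is a natural follow-up experiment for seeing when the quadratic boundary starts to pay off.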

Faiz Haikal_114035108

2026-04-04