HW 06 - ROC Curve for Default Dataset

Load Required Libraries

library(tidymodels)
library(discrim)
library(readr)

Load and Prepare Data

We read in the Default dataset and convert default and student to factor variables for classification.

Default <- read_csv("Default.csv") %>%
  select(-1) %>%
  mutate(
    default = factor(default, levels = c("No", "Yes")),
    student = factor(student, levels = c("No", "Yes"))
  )

glimpse(Default)

## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…

Split Data into Training and Testing Sets

We use a 70/30 split, stratified by the response variable default, to ensure both sets have a similar proportion of defaulters.

set.seed(123)
default_split <- initial_split(Default, prop = 0.7, strata = default)
default_train <- training(default_split)
default_test  <- testing(default_split)

Model 1: Logistic Regression (Table 4.1)

Logistic regression models the probability of default using balance, income, and student as predictors.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_fit <- lr_spec %>%
  fit(default ~ balance + income + student, data = default_train)

tidy(lr_fit)

## # A tibble: 4 × 5
##   term             estimate  std.error statistic  p.value
##   <chr>               <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept) -10.7         0.584       -18.4    1.38e-75
## 2 balance       0.00571     0.000275     20.8    8.62e-96
## 3 income        0.000000173 0.00000979    0.0177 9.86e- 1
## 4 studentYes   -0.700       0.281        -2.49   1.28e- 2

Model 2: Linear Discriminant Analysis (Table 4.2)

LDA assumes that the predictors are normally distributed within each class and that the classes share a common covariance matrix.

lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS")

lda_fit <- lda_spec %>%
  fit(default ~ balance + income + student, data = default_train)

lda_fit

## parsnip model object
## 
## Call:
## lda(default ~ balance + income + student, data = data)
## 
## Prior probabilities of groups:
##         No        Yes 
## 0.96714286 0.03285714 
## 
## Group means:
##       balance   income studentYes
## No   800.9643 33610.00  0.2889217
## Yes 1756.7220 32190.23  0.3739130
## 
## Coefficients of linear discriminants:
##                      LD1
## balance     2.244335e-03
## income      2.028998e-06
## studentYes -2.010832e-01

Model 3: Quadratic Discriminant Analysis (Table 4.3)

QDA is similar to LDA but allows each class to have its own covariance matrix, which can capture more complex decision boundaries.

qda_spec <- discrim_quad() %>%
  set_mode("classification") %>%
  set_engine("MASS")

qda_fit <- qda_spec %>%
  fit(default ~ balance + income + student, data = default_train)

qda_fit

## parsnip model object
## 
## Call:
## qda(default ~ balance + income + student, data = data)
## 
## Prior probabilities of groups:
##         No        Yes 
## 0.96714286 0.03285714 
## 
## Group means:
##       balance   income studentYes
## No   800.9643 33610.00  0.2889217
## Yes 1756.7220 32190.23  0.3739130

Model Evaluation

We gather predictions from all three models on the test set and compute confusion matrices along with accuracy, sensitivity, and specificity.

models <- list(
  "Logistic Regression" = lr_fit,
  "LDA" = lda_fit,
  "QDA" = qda_fit
)

preds <- imap_dfr(models, augment,
                  new_data = default_test, .id = "model")

# Confusion matrix for each model
preds %>%
  group_by(model) %>%
  conf_mat(truth = default, estimate = .pred_class) %>%
  print()

## # A tibble: 3 × 2
##   model               conf_mat  
##   <chr>               <list>    
## 1 LDA                 <conf_mat>
## 2 Logistic Regression <conf_mat>
## 3 QDA                 <conf_mat>

# Multiple performance metrics
multi_metric <- metric_set(accuracy, sensitivity, specificity)

preds %>%
  group_by(model) %>%
  multi_metric(truth = default, estimate = .pred_class)

## # A tibble: 9 × 4
##   model               .metric     .estimator .estimate
##   <chr>               <chr>       <chr>          <dbl>
## 1 LDA                 accuracy    binary         0.971
## 2 Logistic Regression accuracy    binary         0.973
## 3 QDA                 accuracy    binary         0.973
## 4 LDA                 sensitivity binary         0.998
## 5 Logistic Regression sensitivity binary         0.997
## 6 QDA                 sensitivity binary         0.998
## 7 LDA                 specificity binary         0.214
## 8 Logistic Regression specificity binary         0.282
## 9 QDA                 specificity binary         0.262

ROC Curves

The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) at various threshold values. A curve closer to the top-left corner indicates better classification performance.

preds %>%
  group_by(model) %>%
  roc_curve(default, .pred_No) %>%
  autoplot() +
  labs(
    title = "ROC Curves for Default Dataset",
    subtitle = "Comparing Logistic Regression, LDA, and QDA"
  ) +
  theme_minimal()

ROC AUC Values

The Area Under the ROC Curve (AUC) summarizes the overall performance of each model. An AUC of 1 represents a perfect classifier, while an AUC of 0.5 represents random guessing.

preds %>%
  group_by(model) %>%
  roc_auc(default, .pred_No)

## # A tibble: 3 × 4
##   model               .metric .estimator .estimate
##   <chr>               <chr>   <chr>          <dbl>
## 1 LDA                 roc_auc binary         0.945
## 2 Logistic Regression roc_auc binary         0.945
## 3 QDA                 roc_auc binary         0.944

Conclusion

The ROC curves and AUC values show that all three models perform similarly well on the Default dataset. This is expected because logistic regression and LDA often produce nearly identical results when the normality assumption approximately holds. QDA allows for more flexible decision boundaries, but in this case the added flexibility does not lead to a significant improvement, suggesting that a linear decision boundary is sufficient for separating defaulters from non-defaulters.