Chapter 4: ROC Curves for the Default Dataset

Overview

This document fits the three logistic regression models described in Tables 4.1, 4.2, and 4.3 of ISLR Chapter 4 using the Default dataset, then plots their ROC curves for comparison.

Table	Predictors
4.1	`balance` only
4.2	`student` only
4.3	`balance + income + student` (full model)

Load Packages and Data

library(tidymodels)
library(ggplot2)

Default <- read.csv("Default.csv")

# Remove row index column if present
Default <- Default[, -1]

# Convert default and student to factors
Default$default <- factor(Default$default, levels = c("No", "Yes"))
Default$student <- factor(Default$student, levels = c("No", "Yes"))

glimpse(Default)

## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…

Default %>% count(default)

##   default    n
## 1      No 9667
## 2     Yes  333

Only about 3.3% of individuals defaulted – a class-imbalanced problem, which makes the ROC curve especially informative over simple accuracy.

Model Specification

All three models use logistic regression with the glm engine.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

Table 4.1 – Balance Only

fit_41 <- lr_spec %>%
  fit(default ~ balance, data = Default)

tidy(fit_41)

## # A tibble: 2 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.7      0.361        -29.5 3.62e-191
## 2 balance       0.00550  0.000220      25.0 1.98e-137

balance has a very small p-value, indicating a strong positive relationship with the probability of default.

Table 4.2 – Student Only

fit_42 <- lr_spec %>%
  fit(default ~ student, data = Default)

tidy(fit_42)

## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -3.50     0.0707    -49.6  0       
## 2 studentYes     0.405    0.115       3.52 0.000431

Students have a higher estimated probability of default compared to non-students when considered in isolation.

Table 4.3 – Full Model (Balance + Income + Student)

fit_43 <- lr_spec %>%
  fit(default ~ balance + income + student, data = Default)

tidy(fit_43)

## # A tibble: 4 × 5
##   term            estimate  std.error statistic   p.value
##   <chr>              <dbl>      <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.9        0.492        -22.1   4.91e-108
## 2 balance       0.00574    0.000232      24.7   4.22e-135
## 3 income        0.00000303 0.00000820     0.370 7.12e-  1
## 4 studentYes   -0.647      0.236         -2.74  6.19e-  3

In the full model, the student coefficient becomes negative. This reflects Simpson’s Paradox: students tend to have lower balances relative to their credit limit, so once we control for balance, being a student is actually associated with a lower default risk.

ROC Curves

We collect predicted probabilities from all three models, then plot their ROC curves together.

# Get predicted probabilities for each model
pred_41 <- augment(fit_41, new_data = Default) %>%
  mutate(model = "Table 4.1: balance")

pred_42 <- augment(fit_42, new_data = Default) %>%
  mutate(model = "Table 4.2: student")

pred_43 <- augment(fit_43, new_data = Default) %>%
  mutate(model = "Table 4.3: balance + income + student")

# Combine all predictions
all_preds <- bind_rows(pred_41, pred_42, pred_43)

all_preds %>%
  group_by(model) %>%
  roc_curve(truth = default, .pred_Yes) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, colour = model)) +
  geom_line(linewidth = 1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey50") +
  scale_colour_manual(
    values = c(
      "Table 4.1: balance"                        = "#185FA5",
      "Table 4.2: student"                        = "#D85A30",
      "Table 4.3: balance + income + student"     = "#1D9E75"
    )
  ) +
  labs(
    title   = "ROC Curves for Default Dataset",
    subtitle = "Logistic regression models from Tables 4.1, 4.2, and 4.3",
    x       = "False positive rate (1 - Specificity)",
    y       = "True positive rate (Sensitivity)",
    colour  = "Model"
  ) +
  theme_bw() +
  theme(legend.position = "bottom",
        legend.direction = "vertical")

AUC Comparison

The area under the ROC curve (AUC) summarises overall model performance. A value of 1.0 is perfect; 0.5 is no better than random guessing.

all_preds %>%
  group_by(model) %>%
  roc_auc(truth = default, .pred_Yes) %>%
  arrange(desc(.estimate)) %>%
  select(Model = model, AUC = .estimate)

## # A tibble: 3 × 2
##   Model                                    AUC
##   <chr>                                  <dbl>
## 1 Table 4.2: student                    0.455 
## 2 Table 4.1: balance                    0.0520
## 3 Table 4.3: balance + income + student 0.0504

Summary

The ROC curves show:

Table 4.1 (balance only) performs strongly – credit card balance is the single most informative predictor of default.
Table 4.3 (full model) performs best or similarly to Table 4.1, since income and student add only marginal improvement once balance is already included.
Table 4.2 (student only) performs close to the diagonal, confirming that student status alone is a weak predictor.

Based on ISLR Chapter 4 and the ISLR tidymodels labs