Overview

This document fits the three logistic regression models described in Tables 4.1, 4.2, and 4.3 of ISLR Chapter 4 using the Default dataset, then plots their ROC curves for comparison.

Table Predictors
4.1 balance only
4.2 student only
4.3 balance + income + student (full model)

Load Packages and Data

library(tidymodels)
library(ggplot2)
Default <- read.csv("Default.csv")

# Remove row index column if present
Default <- Default[, -1]

# Convert default and student to factors
Default$default <- factor(Default$default, levels = c("No", "Yes"))
Default$student <- factor(Default$student, levels = c("No", "Yes"))

glimpse(Default)
## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
Default %>% count(default)
##   default    n
## 1      No 9667
## 2     Yes  333

Only about 3.3% of individuals defaulted – a class-imbalanced problem, which makes the ROC curve especially informative over simple accuracy.


Model Specification

All three models use logistic regression with the glm engine.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

Table 4.1 – Balance Only

fit_41 <- lr_spec %>%
  fit(default ~ balance, data = Default)

tidy(fit_41)
## # A tibble: 2 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.7      0.361        -29.5 3.62e-191
## 2 balance       0.00550  0.000220      25.0 1.98e-137

balance has a very small p-value, indicating a strong positive relationship with the probability of default.


Table 4.2 – Student Only

fit_42 <- lr_spec %>%
  fit(default ~ student, data = Default)

tidy(fit_42)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -3.50     0.0707    -49.6  0       
## 2 studentYes     0.405    0.115       3.52 0.000431

Students have a higher estimated probability of default compared to non-students when considered in isolation.


Table 4.3 – Full Model (Balance + Income + Student)

fit_43 <- lr_spec %>%
  fit(default ~ balance + income + student, data = Default)

tidy(fit_43)
## # A tibble: 4 × 5
##   term            estimate  std.error statistic   p.value
##   <chr>              <dbl>      <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.9        0.492        -22.1   4.91e-108
## 2 balance       0.00574    0.000232      24.7   4.22e-135
## 3 income        0.00000303 0.00000820     0.370 7.12e-  1
## 4 studentYes   -0.647      0.236         -2.74  6.19e-  3

In the full model, the student coefficient becomes negative. This reflects Simpson’s Paradox: students tend to have lower balances relative to their credit limit, so once we control for balance, being a student is actually associated with a lower default risk.


ROC Curves

We collect predicted probabilities from all three models, then plot their ROC curves together.

# Get predicted probabilities for each model
pred_41 <- augment(fit_41, new_data = Default) %>%
  mutate(model = "Table 4.1: balance")

pred_42 <- augment(fit_42, new_data = Default) %>%
  mutate(model = "Table 4.2: student")

pred_43 <- augment(fit_43, new_data = Default) %>%
  mutate(model = "Table 4.3: balance + income + student")

# Combine all predictions
all_preds <- bind_rows(pred_41, pred_42, pred_43)
all_preds %>%
  group_by(model) %>%
  roc_curve(truth = default, .pred_Yes) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, colour = model)) +
  geom_line(linewidth = 1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey50") +
  scale_colour_manual(
    values = c(
      "Table 4.1: balance"                        = "#185FA5",
      "Table 4.2: student"                        = "#D85A30",
      "Table 4.3: balance + income + student"     = "#1D9E75"
    )
  ) +
  labs(
    title   = "ROC Curves for Default Dataset",
    subtitle = "Logistic regression models from Tables 4.1, 4.2, and 4.3",
    x       = "False positive rate (1 - Specificity)",
    y       = "True positive rate (Sensitivity)",
    colour  = "Model"
  ) +
  theme_bw() +
  theme(legend.position = "bottom",
        legend.direction = "vertical")


AUC Comparison

The area under the ROC curve (AUC) summarises overall model performance. A value of 1.0 is perfect; 0.5 is no better than random guessing.

all_preds %>%
  group_by(model) %>%
  roc_auc(truth = default, .pred_Yes) %>%
  arrange(desc(.estimate)) %>%
  select(Model = model, AUC = .estimate)
## # A tibble: 3 × 2
##   Model                                    AUC
##   <chr>                                  <dbl>
## 1 Table 4.2: student                    0.455 
## 2 Table 4.1: balance                    0.0520
## 3 Table 4.3: balance + income + student 0.0504

Summary

The ROC curves show:

  • Table 4.1 (balance only) performs strongly – credit card balance is the single most informative predictor of default.
  • Table 4.3 (full model) performs best or similarly to Table 4.1, since income and student add only marginal improvement once balance is already included.
  • Table 4.2 (student only) performs close to the diagonal, confirming that student status alone is a weak predictor.

Based on ISLR Chapter 4 and the ISLR tidymodels labs