This document fits the three logistic regression models described in
Tables 4.1, 4.2, and 4.3 of ISLR Chapter 4 using the
Default dataset, then plots their ROC curves for
comparison.
| Table | Predictors |
|---|---|
| 4.1 | balance only |
| 4.2 | student only |
| 4.3 | balance + income + student (full model) |
Default <- read.csv("Default.csv")
# Remove row index column if present
Default <- Default[, -1]
# Convert default and student to factors
Default$default <- factor(Default$default, levels = c("No", "Yes"))
Default$student <- factor(Default$student, levels = c("No", "Yes"))
glimpse(Default)## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
## default n
## 1 No 9667
## 2 Yes 333
Only about 3.3% of individuals defaulted – a class-imbalanced problem, which makes the ROC curve especially informative over simple accuracy.
All three models use logistic regression with the glm
engine.
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -10.7 0.361 -29.5 3.62e-191
## 2 balance 0.00550 0.000220 25.0 1.98e-137
balance has a very small p-value, indicating a strong
positive relationship with the probability of default.
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.50 0.0707 -49.6 0
## 2 studentYes 0.405 0.115 3.52 0.000431
Students have a higher estimated probability of default compared to non-students when considered in isolation.
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -10.9 0.492 -22.1 4.91e-108
## 2 balance 0.00574 0.000232 24.7 4.22e-135
## 3 income 0.00000303 0.00000820 0.370 7.12e- 1
## 4 studentYes -0.647 0.236 -2.74 6.19e- 3
In the full model, the student coefficient becomes negative. This reflects Simpson’s Paradox: students tend to have lower balances relative to their credit limit, so once we control for balance, being a student is actually associated with a lower default risk.
We collect predicted probabilities from all three models, then plot their ROC curves together.
# Get predicted probabilities for each model
pred_41 <- augment(fit_41, new_data = Default) %>%
mutate(model = "Table 4.1: balance")
pred_42 <- augment(fit_42, new_data = Default) %>%
mutate(model = "Table 4.2: student")
pred_43 <- augment(fit_43, new_data = Default) %>%
mutate(model = "Table 4.3: balance + income + student")
# Combine all predictions
all_preds <- bind_rows(pred_41, pred_42, pred_43)all_preds %>%
group_by(model) %>%
roc_curve(truth = default, .pred_Yes) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, colour = model)) +
geom_line(linewidth = 1) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", colour = "grey50") +
scale_colour_manual(
values = c(
"Table 4.1: balance" = "#185FA5",
"Table 4.2: student" = "#D85A30",
"Table 4.3: balance + income + student" = "#1D9E75"
)
) +
labs(
title = "ROC Curves for Default Dataset",
subtitle = "Logistic regression models from Tables 4.1, 4.2, and 4.3",
x = "False positive rate (1 - Specificity)",
y = "True positive rate (Sensitivity)",
colour = "Model"
) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "vertical")The area under the ROC curve (AUC) summarises overall model performance. A value of 1.0 is perfect; 0.5 is no better than random guessing.
all_preds %>%
group_by(model) %>%
roc_auc(truth = default, .pred_Yes) %>%
arrange(desc(.estimate)) %>%
select(Model = model, AUC = .estimate)## # A tibble: 3 × 2
## Model AUC
## <chr> <dbl>
## 1 Table 4.2: student 0.455
## 2 Table 4.1: balance 0.0520
## 3 Table 4.3: balance + income + student 0.0504
The ROC curves show:
income and student add
only marginal improvement once balance is already
included.student status alone is a weak
predictor.Based on ISLR Chapter 4 and the ISLR tidymodels labs