We fit the three logistic-regression models for the
Default data that correspond to ISLR Tables 4.1,
4.2 and 4.3, and draw their ROC curves on one plot. This is
done in base R only — no extra packages required.
default ~ balancedefault ~ studentdefault ~ balance + income + studentPut Default.csv in the same folder as this
.Rmd before knitting. (If you have the ISLR2
package installed, the code falls back to its built-in
Default.)
if (file.exists("Default.csv")) {
Default <- read.csv("Default.csv")
} else if (requireNamespace("ISLR2", quietly = TRUE)) {
data(Default, package = "ISLR2")
} else {
stop("Please place Default.csv next to this .Rmd, or install.packages('ISLR2').")
}
# make sure default and student are factors
Default$default <- factor(Default$default, levels = c("No", "Yes"))
Default$student <- factor(Default$student, levels = c("No", "Yes"))
# numeric response: 1 = defaulted ("Yes"), 0 = did not
y <- as.integer(Default$default == "Yes")
str(Default[, c("default", "student", "balance", "income")])
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
# Table 4.1: balance only
fit_balance <- glm(default ~ balance, data = Default, family = binomial)
# Table 4.2: student only
fit_student <- glm(default ~ student, data = Default, family = binomial)
# Table 4.3: balance + income + student
fit_full <- glm(default ~ balance + income + student,
data = Default, family = binomial)
The estimated coefficients reproduce the textbook tables:
round(coef(fit_balance), 4) # Table 4.1
## (Intercept) balance
## -10.6513 0.0055
round(coef(fit_student), 4) # Table 4.2
## (Intercept) studentYes
## -3.5041 0.4049
round(coef(fit_full), 4) # Table 4.3
## (Intercept) balance income studentYes
## -10.8690 0.0057 0.0000 -0.6468
For each model we compute the predicted probability of defaulting.
p_balance <- predict(fit_balance, type = "response")
p_student <- predict(fit_student, type = "response")
p_full <- predict(fit_full, type = "response")
An ROC curve plots the true-positive rate (sensitivity) against the false-positive rate (1 − specificity) as the classification threshold is varied. The two small functions below compute the curve and its area (AUC) directly.
# Build an ROC curve: sort scores high -> low, accumulate TP and FP rates.
roc_curve <- function(score, y) {
ord <- order(score, decreasing = TRUE)
yy <- y[ord]
P <- sum(yy == 1)
N <- sum(yy == 0)
tpr <- c(0, cumsum(yy == 1) / P) # true-positive rate
fpr <- c(0, cumsum(yy == 0) / N) # false-positive rate
list(fpr = fpr, tpr = tpr)
}
# Area under the curve via the trapezoidal rule.
auc_value <- function(r) {
sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)
}
roc_balance <- roc_curve(p_balance, y)
roc_student <- roc_curve(p_student, y)
roc_full <- roc_curve(p_full, y)
auc_balance <- auc_value(roc_balance)
auc_student <- auc_value(roc_student)
auc_full <- auc_value(roc_full)
round(c(balance = auc_balance,
student = auc_student,
full = auc_full), 4)
## balance student full
## 0.9480 0.5473 0.9496
plot(roc_full$fpr, roc_full$tpr, type = "l", lwd = 2, col = "firebrick",
xlab = "False positive rate (1 - specificity)",
ylab = "True positive rate (sensitivity)",
main = "ROC curves for the Default models (Tables 4.1-4.3)")
lines(roc_balance$fpr, roc_balance$tpr, lwd = 2, col = "steelblue")
lines(roc_student$fpr, roc_student$tpr, lwd = 2, col = "darkorange")
abline(0, 1, lty = 2, col = "grey50") # chance line
legend("bottomright", lwd = 2, bty = "n",
col = c("firebrick", "steelblue", "darkorange"),
legend = c(
sprintf("Table 4.3 balance+income+student (AUC = %.3f)", auc_full),
sprintf("Table 4.1 balance (AUC = %.3f)", auc_balance),
sprintf("Table 4.2 student (AUC = %.3f)", auc_student)
))
balance (Table 4.1) is a very strong
predictor of default: its ROC curve bows sharply toward the top-left
corner with an AUC of about 0.95.student (Table 4.2) alone is almost
useless. Because the model produces only two distinct probabilities
(student vs. non-student), its ROC is a coarse curve lying close to the
diagonal “chance” line, with an AUC of about 0.55 —
barely better than random guessing.balance-only model. Almost all of the predictive power
comes from balance; adding income and
student contributes very little.A larger area under the curve (AUC closer to 1) indicates a better classifier, so the ranking is: full ≈ balance ≫ student.