k-fold cross-validation works by randomly dividing the full training data set into \(k\) roughly equal-sized groups, or folds. Then, for each fold \(i = 1, \ldots, k\):
After repeating this for every fold, you have \(k\) error estimates. The k-fold CV estimate is the average of these \(k\) numbers:
\[\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i\]
The most common choices are \(k = 5\) or \(k = 10\). When \(k = n\) (one observation left out each time), this becomes LOOCV.
i. Relative to the validation set approach:
The main advantage over the validation set approach is that you’re not stuck depending on one lucky (or unlucky) 50/50 split. Averaging over \(k\) different hold-outs lowers the variance of the error estimate, and since each training set is larger, the bias goes down too. The cost is computation: the model has to be fit \(k\) times rather than once. That’s usually fine, but for slow models on big data it adds up. The validation set approach wins on simplicity when a rough estimate is all you need.
ii. Relative to LOOCV:
LOOCV uses \(n - 1\) observations for training each time, so it barely underestimates the true error. The catch is that all \(n\) models are trained on almost the same data, making their individual errors highly correlated — the average ends up noisier than you’d expect. k-fold CV with \(k = 5\) or \(10\) trades a small amount of bias for a big drop in variance, and the estimate tends to be more stable overall. It’s also much faster since you only fit the model \(k\) times instead of \(n\).
data(Default, package = "ISLR2")
cat("Dimensions:", nrow(Default), "rows x", ncol(Default), "columns\n")## Dimensions: 10000 rows x 4 columns
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
set.seed(1)
fit_full <- glm(default ~ income + balance,
data = Default,
family = binomial)
summary(fit_full)##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(1)
n <- nrow(Default)
train_idx <- sample(n, n / 2)
fit_val <- glm(default ~ income + balance,
data = Default,
family = binomial,
subset = train_idx)
prob_val <- predict(fit_val,
newdata = Default[-train_idx, ],
type = "response")
pred_val <- ifelse(prob_val > 0.5, "Yes", "No")
actual <- Default[-train_idx, "default"]
val_err_b <- mean(pred_val != actual)
cat("Validation set error (seed 1):", round(val_err_b, 4), "\n")## Validation set error (seed 1): 0.0254
The estimated test error is about 2.54%.
seeds <- c(42, 7, 123)
errs <- numeric(3)
for (i in seq_along(seeds)) {
set.seed(seeds[i])
tr <- sample(n, n / 2)
fit <- glm(default ~ income + balance,
data = Default,
family = binomial,
subset = tr)
prb <- predict(fit, newdata = Default[-tr, ], type = "response")
prd <- ifelse(prb > 0.5, "Yes", "No")
errs[i] <- mean(prd != Default[-tr, "default"])
cat("Seed", seeds[i], "validation error:", round(errs[i], 4), "\n")
}## Seed 42 validation error: 0.026
## Seed 7 validation error: 0.0242
## Seed 123 validation error: 0.0276
The error rate shifts around a bit across the three splits, which illustrates the main issue with this approach: the result depends on which observations you happen to assign to training vs. validation. There’s no way to tell which estimate is closest to the truth. Running a few different splits and looking at the range at least gives a sense of how much the estimate is bouncing around.
set.seed(1)
tr_d <- sample(n, n / 2)
fit_d <- glm(default ~ income + balance + student,
data = Default,
family = binomial,
subset = tr_d)
prb_d <- predict(fit_d, newdata = Default[-tr_d, ], type = "response")
prd_d <- ifelse(prb_d > 0.5, "Yes", "No")
err_d <- mean(prd_d != Default[-tr_d, "default"])
cat("Validation error with student:", round(err_d, 4), "\n")## Validation error with student: 0.026
## Validation error without student (b): 0.0254
Adding student barely moves the error rate. The
two-predictor model does about the same thing, which lines up with what
Chapter 4 showed — once balance is in the model it already
accounts for most of what student status is correlated with.
fit_6 <- glm(default ~ income + balance,
data = Default,
family = binomial)
se_glm <- summary(fit_6)$coefficients[, "Std. Error"]
cat("GLM standard errors:\n")## GLM standard errors:
## (Intercept) income balance
## 0.43475636 0.00000499 0.00022737
The standard error for the income coefficient is
approximately 4.985e-06 and for balance it
is approximately 2.274e-04.
boot_fn <- function(data, index) {
fit <- glm(default ~ income + balance,
data = data,
family = binomial,
subset = index)
coef(fit)[c("income", "balance")]
}
# Quick test on the full data set
boot_fn(Default, 1:nrow(Default))## income balance
## 2.080898e-05 5.647103e-03
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
## Bootstrap standard errors:
## income : 4.87e-06
## balance: 0.00022989
The bootstrap SEs and the glm() SEs come out very close
for both coefficients, so either way you get roughly the same picture.
The glm() formula relies on the model being correctly
specified; the bootstrap just resamples the data directly without
needing that. Since the two agree here, it suggests the model
assumptions are holding reasonably well. If the model were off, the
bootstrap would be the more trustworthy number.
data(Boston, package = "ISLR2")
cat("Dimensions:", nrow(Boston), "rows x", ncol(Boston), "columns\n")## Dimensions: 506 rows x 13 columns
## Estimated population mean (mu_hat): 22.5328
The estimate for the population mean of medv is \(\hat{\mu} =\) 22.5328.
se_formula <- sd(Boston$medv) / sqrt(nrow(Boston))
cat("SE of mu_hat (formula):", round(se_formula, 4), "\n")## SE of mu_hat (formula): 0.4089
Plugging in: \(\text{SE}(\hat{\mu}) = s / \sqrt{n}\) gives 0.4089. So the sample mean is off from the true population mean by about 0.41 thousand dollars on average.
mean_fn <- function(data, index) {
mean(data$medv[index])
}
set.seed(1)
boot_mean <- boot(Boston, mean_fn, R = 1000)
se_boot <- sd(boot_mean$t)
cat("Bootstrap SE of mu_hat:", round(se_boot, 4), "\n")## Bootstrap SE of mu_hat: 0.4107
## Formula SE of mu_hat : 0.4089
The bootstrap SE is 0.4107, essentially the same as the formula estimate of 0.4089. They match because the sample mean is well-behaved and the standard formula works fine for it. The bootstrap is more useful for statistics where no closed-form SE exists.
ci_boot <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_t <- t.test(Boston$medv)$conf.int
cat("Bootstrap 95% CI (mu_hat ± 2 * SE_boot): [",
round(ci_boot[1], 4), ",", round(ci_boot[2], 4), "]\n")## Bootstrap 95% CI (mu_hat ± 2 * SE_boot): [ 21.7115 , 23.3541 ]
## t.test 95% CI : [ 21.7295 , 23.3361 ]
The bootstrap-based 95% CI is [21.711, 23.354] and
the t.test() interval is [21.73, 23.336]. The two are
nearly the same, which makes sense since the SEs were nearly identical.
Both put the true mean of medv somewhere around 21 to 23
thousand dollars.
## Estimated population median (mu_med): 21.2
The estimate for the population median is \(\hat{\mu}_{med} =\) 21.2.
median_fn <- function(data, index) {
median(data$medv[index])
}
set.seed(1)
boot_med <- boot(Boston, median_fn, R = 1000)
se_med <- sd(boot_med$t)
cat("Bootstrap SE of mu_med:", round(se_med, 4), "\n")## Bootstrap SE of mu_med: 0.3778
The bootstrap SE for the median is 0.3778, a bit smaller than the SE for the mean. The median is less pulled around by extreme values, so that’s not surprising. There’s no closed-form formula for this, so the bootstrap is the practical way to go.
## Estimated 10th percentile (mu_0.1): 12.75
The estimate for the tenth percentile of medv in the
population is \(\hat{\mu}_{0.1} =\)
12.75.
p10_fn <- function(data, index) {
quantile(data$medv[index], 0.1)
}
set.seed(1)
boot_p10 <- boot(Boston, p10_fn, R = 1000)
se_p10 <- sd(boot_p10$t)
cat("Bootstrap SE of mu_0.1:", round(se_p10, 4), "\n")## Bootstrap SE of mu_0.1: 0.4768
The SE for the 10th percentile is 0.4768, larger than what we got for the mean or median. A tail quantile is pinned down by fewer observations, so there’s more variability in the estimate. There’s no formula to use here either, which is why the bootstrap is handy.