The set of \(n\) observations is randomly divided into \(k\) groups (folds) of approximately equal size. The procedure is then repeated \(k\) times: on the \(j\)-th iteration, fold \(j\) is held out as the validation set and the model is fit on the remaining \(k-1\) folds; the fitted model is used to predict the held-out fold, giving an error measure \(\text{MSE}_j\) (or misclassification rate for classification). The \(k\)-fold CV estimate is the average of these \(k\) errors, \[ \text{CV}_{(k)} = \frac{1}{k}\sum_{j=1}^{k} \text{MSE}_j . \] Every observation is used for validation exactly once and for training \(k-1\) times. Typical choices are \(k = 5\) or \(k = 10\).
i. Relative to the validation set approach. The validation set approach splits the data once into a single training/validation pair.
ii. Relative to LOOCV (which is the special case \(k = n\)).
Default)default on income and balanceglm.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
val_error <- function(seed, formula = default ~ income + balance) {
set.seed(seed)
train <- sample(nrow(Default), nrow(Default) / 2)
fit <- glm(formula, data = Default[train, ], family = binomial)
probs <- predict(fit, Default[-train, ], type = "response")
pred <- ifelse(probs > 0.5, "Yes", "No")
mean(pred != Default$default[-train]) # validation-set error rate
}
val_error(1)
## [1] 0.0254
The estimated test error from a single 50/50 split is about \(2.5\%\).
sapply(c(1, 2, 3), val_error)
## [1] 0.0254 0.0238 0.0264
The three estimates are all close to \(2.5\%\) (roughly \(2.4\)–\(2.8\%\)), but they differ from one another because they depend on exactly which observations land in the training vs. validation set. This variability is the main weakness of the single-split validation approach.
studentsapply(c(1, 2, 3), function(s) val_error(s, default ~ income + balance + student))
## [1] 0.0260 0.0246 0.0272
Adding the student dummy gives test-error estimates around \(2.5\)–\(2.8\%\) — essentially the same as the model without it. Including student does not lead to a meaningful reduction in the test error rate.
Default)glm.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)$coefficients[, c("Estimate", "Std. Error")]
## Estimate Std. Error
## (Intercept) -1.154047e+01 4.347564e-01
## income 2.080898e-05 4.985167e-06
## balance 5.647103e-03 2.273731e-04
The asymptotic (formula-based) standard errors are about \(4.99\times10^{-6}\) for income and \(2.27\times10^{-4}\) for balance.
boot_fn() returning the coefficient estimatesboot_fn <- function(data, index) {
coef(glm(default ~ income + balance, data = data, subset = index, family = binomial))
}
boot_fn(Default, 1:nrow(Default))
## (Intercept) income balance
## -1.154047e+01 2.080898e-05 5.647103e-03
set.seed(1)
boot(Default, boot_fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The bootstrap standard errors (the std. error column above, roughly \(4.6\times10^{-6}\) for income and \(2.3\times10^{-4}\) for balance) are very close to the formula-based standard errors from part (a). This agreement indicates that the standard assumptions underlying the GLM formula (correct model, large sample) hold well here, so the asymptotic and bootstrap estimates essentially coincide.
medv in Bostonmedvmu.hat <- mean(Boston$medv)
mu.hat
## [1] 22.53281
se.formula <- sd(Boston$medv) / sqrt(nrow(Boston))
se.formula
## [1] 0.4088611
The standard error of the sample mean is about \(0.409\). It quantifies how much the sample mean \(\hat\mu\) would vary across repeated samples of \(506\) tracts from the population.
mean_fn <- function(data, index) mean(data[index])
set.seed(1)
boot.mean <- boot(Boston$medv, mean_fn, R = 1000)
boot.mean
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = mean_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
The bootstrap standard error (\(\approx 0.41\)) is almost identical to the formula-based value in (b), as expected.
medvse.boot <- sd(boot.mean$t)
c(lower = mu.hat - 2 * se.boot, upper = mu.hat + 2 * se.boot) # two-SE rule
## lower upper
## 21.71148 23.35413
t.test(Boston$medv)$conf.int # for comparison
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95
The two-standard-error bootstrap interval (about \([21.7,\ 23.4]\)) is very close to the \(t\)-distribution confidence interval returned by t.test().
medvmed.hat <- median(Boston$medv)
med.hat
## [1] 21.2
median_fn <- function(data, index) median(data[index])
set.seed(1)
boot(Boston$medv, median_fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = median_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
The median of medv is \(21.2\), with a bootstrap standard error of about \(0.38\) — small relative to the median, so the median is estimated quite precisely (and there is no simple closed-form formula for this SE, which is why the bootstrap is useful here).
medvmu.01 <- quantile(Boston$medv, 0.10)
mu.01
## 10%
## 12.75
p10_fn <- function(data, index) quantile(data[index], 0.10)
set.seed(1)
boot(Boston$medv, p10_fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = p10_fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
The tenth percentile of medv is \(12.75\), with a bootstrap standard error of about \(0.50\). Again the bootstrap supplies a standard error for a quantity (a quantile) that has no convenient analytic formula; the SE is modest, so the tenth percentile is reasonably well determined.