1 Exercise 3 — \(k\)-fold cross-validation (conceptual)

1.1 (a) How is \(k\)-fold CV implemented?

The set of \(n\) observations is randomly divided into \(k\) groups (folds) of approximately equal size. The procedure is then repeated \(k\) times: on the \(j\)-th iteration, fold \(j\) is held out as the validation set and the model is fit on the remaining \(k-1\) folds; the fitted model is used to predict the held-out fold, giving an error measure \(\text{MSE}_j\) (or misclassification rate for classification). The \(k\)-fold CV estimate is the average of these \(k\) errors, \[ \text{CV}_{(k)} = \frac{1}{k}\sum_{j=1}^{k} \text{MSE}_j . \] Every observation is used for validation exactly once and for training \(k-1\) times. Typical choices are \(k = 5\) or \(k = 10\).

1.2 (b) Advantages and disadvantages

i. Relative to the validation set approach. The validation set approach splits the data once into a single training/validation pair.

  • Advantages of \(k\)-fold CV: (1) Lower variance — the validation-set estimate depends heavily on the single random split and can change a lot from split to split, whereas \(k\)-fold averages over \(k\) splits. (2) Less bias / better data use — the validation approach trains on only ~half the data and therefore tends to overestimate the test error, while \(k\)-fold trains on a fraction \((k-1)/k\) of the data, so its bias is much smaller.
  • Disadvantage: \(k\)-fold requires fitting the model \(k\) times instead of once, so it is more computationally expensive.

ii. Relative to LOOCV (which is the special case \(k = n\)).

  • Advantages of \(k\)-fold CV: (1) Far cheaper — it fits the model \(k\) times rather than \(n\) times (LOOCV is very costly unless a shortcut formula exists). (2) Often more accurate test-error estimates because of the bias–variance trade-off: LOOCV’s \(n\) training sets are nearly identical and highly positively correlated, so averaging them gives a high-variance estimate; \(k\)-fold (e.g. \(k=5\) or \(10\)) averages over less-correlated fits and usually has lower variance.
  • Disadvantage: \(k\)-fold has slightly higher bias than LOOCV, since each training set is smaller than the \(n-1\) observations LOOCV uses. Choosing \(k = 5\) or \(10\) is the usual compromise that keeps both bias and variance low.

2 Exercise 5 — Validation-set estimate of test error (Default)

2.1 (a) Logistic regression of default on income and balance

glm.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)$coefficients
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

2.2 (b) Validation-set approach

val_error <- function(seed, formula = default ~ income + balance) {
  set.seed(seed)
  train <- sample(nrow(Default), nrow(Default) / 2)
  fit   <- glm(formula, data = Default[train, ], family = binomial)
  probs <- predict(fit, Default[-train, ], type = "response")
  pred  <- ifelse(probs > 0.5, "Yes", "No")
  mean(pred != Default$default[-train])     # validation-set error rate
}
val_error(1)
## [1] 0.0254

The estimated test error from a single 50/50 split is about \(2.5\%\).

2.3 (c) Repeat with three different splits

sapply(c(1, 2, 3), val_error)
## [1] 0.0254 0.0238 0.0264

The three estimates are all close to \(2.5\%\) (roughly \(2.4\)\(2.8\%\)), but they differ from one another because they depend on exactly which observations land in the training vs. validation set. This variability is the main weakness of the single-split validation approach.

2.4 (d) Add a dummy variable for student

sapply(c(1, 2, 3), function(s) val_error(s, default ~ income + balance + student))
## [1] 0.0260 0.0246 0.0272

Adding the student dummy gives test-error estimates around \(2.5\)\(2.8\%\) — essentially the same as the model without it. Including student does not lead to a meaningful reduction in the test error rate.

3 Exercise 6 — Bootstrap standard errors of logistic coefficients (Default)

3.1 (a) Standard errors from the GLM formula

glm.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)$coefficients[, c("Estimate", "Std. Error")]
##                  Estimate   Std. Error
## (Intercept) -1.154047e+01 4.347564e-01
## income       2.080898e-05 4.985167e-06
## balance      5.647103e-03 2.273731e-04

The asymptotic (formula-based) standard errors are about \(4.99\times10^{-6}\) for income and \(2.27\times10^{-4}\) for balance.

3.2 (b) A boot_fn() returning the coefficient estimates

boot_fn <- function(data, index) {
  coef(glm(default ~ income + balance, data = data, subset = index, family = binomial))
}
boot_fn(Default, 1:nrow(Default))
##   (Intercept)        income       balance 
## -1.154047e+01  2.080898e-05  5.647103e-03

3.3 (c) Bootstrap standard errors

set.seed(1)
boot(Default, boot_fn, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot_fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04

3.4 (d) Comparison

The bootstrap standard errors (the std. error column above, roughly \(4.6\times10^{-6}\) for income and \(2.3\times10^{-4}\) for balance) are very close to the formula-based standard errors from part (a). This agreement indicates that the standard assumptions underlying the GLM formula (correct model, large sample) hold well here, so the asymptotic and bootstrap estimates essentially coincide.

4 Exercise 9 — Bootstrap on medv in Boston

4.1 (a) Estimate of the population mean of medv

mu.hat <- mean(Boston$medv)
mu.hat
## [1] 22.53281

4.2 (b) Standard error of \(\hat\mu\) (formula)

se.formula <- sd(Boston$medv) / sqrt(nrow(Boston))
se.formula
## [1] 0.4088611

The standard error of the sample mean is about \(0.409\). It quantifies how much the sample mean \(\hat\mu\) would vary across repeated samples of \(506\) tracts from the population.

4.3 (c) Standard error of \(\hat\mu\) via the bootstrap

mean_fn <- function(data, index) mean(data[index])
set.seed(1)
boot.mean <- boot(Boston$medv, mean_fn, R = 1000)
boot.mean
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = mean_fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

The bootstrap standard error (\(\approx 0.41\)) is almost identical to the formula-based value in (b), as expected.

4.4 (d) 95% confidence interval for the mean of medv

se.boot <- sd(boot.mean$t)
c(lower = mu.hat - 2 * se.boot, upper = mu.hat + 2 * se.boot)   # two-SE rule
##    lower    upper 
## 21.71148 23.35413
t.test(Boston$medv)$conf.int                                    # for comparison
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95

The two-standard-error bootstrap interval (about \([21.7,\ 23.4]\)) is very close to the \(t\)-distribution confidence interval returned by t.test().

4.5 (e) Estimate of the population median of medv

med.hat <- median(Boston$medv)
med.hat
## [1] 21.2

4.6 (f) Bootstrap standard error of the median

median_fn <- function(data, index) median(data[index])
set.seed(1)
boot(Boston$medv, median_fn, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = median_fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 0.02295   0.3778075

The median of medv is \(21.2\), with a bootstrap standard error of about \(0.38\) — small relative to the median, so the median is estimated quite precisely (and there is no simple closed-form formula for this SE, which is why the bootstrap is useful here).

4.7 (g) Estimate of the tenth percentile of medv

mu.01 <- quantile(Boston$medv, 0.10)
mu.01
##   10% 
## 12.75

4.8 (h) Bootstrap standard error of the tenth percentile

p10_fn <- function(data, index) quantile(data[index], 0.10)
set.seed(1)
boot(Boston$medv, p10_fn, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = p10_fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0339   0.4767526

The tenth percentile of medv is \(12.75\), with a bootstrap standard error of about \(0.50\). Again the bootstrap supplies a standard error for a quantity (a quantile) that has no convenient analytic formula; the SE is modest, so the tenth percentile is reasonably well determined.