Repeatedly drawing samples from the training set and refitting a model on each sample to gain additional information about the fitted model.

5.1 Cross-Validation

Holding out a subset of training observations to be used for testing models.

The Validation Set Approach

Randomly divide the observations into a training set and a validation set; the validation set MSE provides an estimate of the test error rate. Each random split yields a different MSE, so the estimate can be highly variable, and since the model is trained on only a subset of the observations, it tends to overestimate the test error as well.
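In code, the split is one call to sample(); a minimal sketch assuming the Auto data, which the lab below uses for exactly this:

set.seed(1)
train <- sample(nrow(Auto), nrow(Auto) / 2)  # indices of the training half
validation <- Auto[-train, ]                 # the held-out validation set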

Leave-One-Out Cross Validation

Given a set of \(n\) observations, hold out a single observation and train on the remaining \(n - 1\). Since \((x_1, y_1)\) is not used in the fitting process, \(\text{MSE}_1 = (y_1 - \hat y_1)^2\) provides an approximately unbiased estimate of the test error. Repeating the procedure for all \(n\) observations, we compute the cross-validation MSE as \(\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{MSE}_i\). Fitting \(n\) models can be expensive, especially if each model is slow to fit. With least squares linear or polynomial regression, math comes to the rescue: the cost of LOOCV is the same as a single model fit, \[ \text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat y_i}{1 - h_i} \right)^2 \] where \(h_i\) is the leverage statistic. This is like the ordinary MSE, except that each residual is divided by \(1 - h_i\). The leverage lies between \(1/n\) and 1 and reflects the amount that an observation influences its own fit. LOOCV is a very general method, but the magic formula only applies to least squares; in other cases the model has to be refit \(n\) times.
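A minimal sketch of the shortcut, assuming the Auto data from the lab below; hatvalues() extracts the leverages \(h_i\) from a least squares fit:

fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit)                          # leverage statistic h_i
loocv <- mean((residuals(fit) / (1 - h))^2)  # LOOCV at the cost of one fit
loocv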

k-Fold Cross-Validation

Like LOOCV, but the observations are divided into \(k\) groups, or folds, of approximately equal size, and each fold in turn serves as the validation set. This process provides \(k\) estimates of the test error, and the \(k\)-fold CV estimate is the average of these values: \(\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k \text{MSE}_i\). Typical values of \(k\) are 5 and 10.
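A hand-rolled sketch of the procedure, assuming the Auto data and \(k = 5\); cv.glm() in the lab below automates this:

set.seed(1)
k <- 5
# randomly assign each observation to one of k roughly equal folds
folds <- sample(rep(1:k, length.out = nrow(Auto)))
mse <- rep(0, k)
for (j in 1:k) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != j, ])  # train on k - 1 folds
  pred <- predict(fit, Auto[folds == j, ])                # predict the held-out fold
  mse[j] <- mean((Auto$mpg[folds == j] - pred)^2)
}
mean(mse)  # the k-fold CV estimate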

Bias-Variance Trade-Off for k-Fold Cross-Validation

Aside from computational advantages, \(k\)-fold CV often gives more accurate estimates of the test error than LOOCV because of the bias-variance trade-off. The validation set approach overestimates the test error because each fit uses only half the observations (high bias); \(k\)-fold CV reduces the bias, and LOOCV has the lowest bias of all. But LOOCV averages the outputs of \(n\) fitted models trained on nearly identical observations, so those outputs are highly positively correlated, and the mean of highly correlated quantities has high variance. Since the training sets in \(k\)-fold CV overlap less, its estimate has lower variance than LOOCV's.

CV on Classification Problems

Same idea, but instead of the MSE we use the misclassification rate: \(\text{Err}_i = I(y_i \neq \hat y_i)\), and \(\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{Err}_i\).
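One way to sketch this with cv.glm(), which accepts a cost function in place of its default squared error; the 0.5 threshold and the use of the Default data are assumptions here:

library(boot)
# misclassification rate: r is the observed 0/1 response, pi the fitted probability
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
fit <- glm(default ~ balance, data = Default, family = "binomial")
cv.glm(Default, fit, cost, K = 10)$delta[1]  # 10-fold CV error rate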

5.2 Bootstrapping

Quantifying uncertainty in a given estimator or statistical learning method. Randomly sample observations from the dataset with replacement, recompute the estimate on each bootstrap sample, and use the mean and standard deviation of these bootstrap estimates as the estimate and its standard error.
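A bare-bones sketch of the idea before reaching for boot() in the lab, assuming the Auto data and the sample mean of mpg as the statistic:

set.seed(1)
# resample the data with replacement and recompute the statistic each time
est <- replicate(1000, mean(sample(Auto$mpg, replace = TRUE)))
c(estimate = mean(est), std.error = sd(est))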

Lab 5: CV and Bootstrapping

set.seed(1)
train <- sample(392, 196)

# fit the model, using the training data
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
# compute MSE on the held-out data
mse.val <- mean((Auto$mpg - predict(lm.fit, Auto))[-train] ^ 2)
# use a polynomial fit
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mse.val2 <- mean((Auto$mpg - predict(lm.fit2, Auto))[-train] ^ 2)
mse.val; mse.val2
## [1] 26.14142
## [1] 19.82259
library(boot)
# use a generalized linear model
# the default "family" is gaussian, i.e. linear least squares
glm.fit <- glm(mpg ~ horsepower, data = Auto)
# lm.fit was fit on the training subset, glm.fit on the full data set
sapply(list(lm.fit, glm.fit), coef)
##                   [,1]       [,2]
## (Intercept) 40.3403772 39.9358610
## horsepower  -0.1617013 -0.1578447
# compute LOOCV and display MSE
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
## [1] 24.23151 24.23114
# compute LOOCV for higher order models
cv.error <- rep(0, 5)
for (ii in 1:5) {
  cv.error[ii] <- cv.glm(Auto, 
                         glm(mpg ~ poly(horsepower, ii), data = Auto))$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
set.seed(17)
cv.error.10 <- rep(0, 10)
for (ii in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, ii), data = Auto)
  cv.error.10[ii] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
##  [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
##  [8] 19.71201 18.95140 19.50196
plot(cv.error.10, type = "l")

# 1. Create the statistic of interest
# 2. Use boot() to repeatedly sample with replacement
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}
alpha.fn(Portfolio, 1:100)
## [1] 0.5758321
# automating random sampling with replacement
set.seed(1)
boot(Portfolio, alpha.fn, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.5758321 6.936399e-05  0.08868935
boot.fn <- function(data, index)
  return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
boot(Auto, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.9358610  0.0118427139 0.871057282
## t2* -0.1578447 -0.0002632441 0.007537443
summary(lm(mpg ~ horsepower, data = Auto))$coef
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81
# the usual SE formulas rely on assumptions: constant variance and fixed x_i
# the bootstrap makes no such assumptions, so its standard errors are likely more accurate

boot.fn2 <- function(data, index)
  return(coef(lm(mpg ~ horsepower + I(horsepower^2), data = data, subset = index)))
boot(Auto, boot.fn2, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn2, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 56.900099702 -3.349465e-02 2.1206433892
## t2* -0.466189630  5.896138e-04 0.0340232317
## t3*  0.001230536 -2.016504e-06 0.0001234614
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))$coef
##                     Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)     56.900099702 1.8004268063  31.60367 1.740911e-109
## horsepower      -0.466189630 0.0311246171 -14.97816  2.289429e-40
## I(horsepower^2)  0.001230536 0.0001220759  10.08009  2.196340e-21

Ch5 Exercises

pg. 197. 1. math proof, skipped. 2. bootstrap sampling probabilities:

  1. the probability of not choosing the \(j\)th observation is \(1 - 1/n\)
  2. the draws are independent, so multiply the probabilities: \((1 - 1/n)^2\)
  3. not choosing an observation at all: each of the \(n\) picks misses it with probability \(1 - 1/n\), so the probability is \((1 - 1/n)^n\)
  4. with \(n = 5\), the probability that \(j\) is in the sample is \(1 - (1 - 1/5)^5 = 0.67232\)
  5. \(\Pr(j \text{ in sample} \mid n = 100) = 1 - (1 - 1/100)^{100} = 0.6339677\)
  6. \(\Pr(j \text{ in sample} \mid n = 1000) = 1 - (1 - 1/1000)^{1000} = 0.6323046\)
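These values approach a known limit, since \((1 - 1/n)^n \to e^{-1}\); the plot below shows the convergence:

1 - exp(-1)
## [1] 0.6321206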
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
ggplot(tibble(
    x = 1:100000,
    y = 1 - (1 - 1/x)^x
  ), 
  aes(x, y)) +
  geom_line() +
  scale_x_log10() +
  scale_y_continuous(limits = c(0, 1), labels = percent) +
  labs(x = "n",
       y = "Probability the jth observation\nis in the bootstrap sample")

store <- rep(NA, 10000)
# empirical check: how often does observation 4 appear in a bootstrap
# sample of size 100?
for (ii in 1:10000) {
  store[ii] <- sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.6447
  1. k-fold cross-validation randomly splits the observations into \(k\) folds; each fold in turn serves as the validation set while the model is fit on the remaining \(k - 1\) folds.
  2. better than the validation set approach because each fit uses more of the observations, and better than LOOCV because it is cheaper and its estimate has lower variance.
  1. choose a random sample of observations with replacement, fit the model, and record the coefficients; repeat the process many times, then compute the average and variance of the recorded coefficients (the bootstrap).

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(1)
output <- vector("list", 3)
output2 <- vector("list", 3)

for (ii in 1:3) {
  train <- sample(1:nrow(Default), 0.8 * nrow(Default))
  default.ref <- Default[-train, "default"]
  
  # type = "response" returns probabilities rather than log-odds
  glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
  glm.probs <- predict(glm.fit, Default[-train, ], type = "response")
  glm.pred <- factor(glm.probs > 0.5)
  levels(glm.pred) <- c("No", "Yes")
  output[[ii]] <- table(glm.pred, default.ref)
  
  glm.fit2 <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
  glm.probs2 <- predict(glm.fit2, Default[-train, ], type = "response")
  glm.pred2 <- factor(glm.probs2 > 0.5)
  levels(glm.pred2) <- c("No", "Yes")
  output2[[ii]] <- table(glm.pred2, default.ref)
}
output
## [[1]]
##         default.ref
## glm.pred   No  Yes
##      No  1933   49
##      Yes    3   15
## 
## [[2]]
##         default.ref
## glm.pred   No  Yes
##      No  1910   74
##      Yes    5   11
## 
## [[3]]
##         default.ref
## glm.pred   No  Yes
##      No  1916   56
##      Yes    6   22
output2
## [[1]]
##          default.ref
## glm.pred2   No  Yes
##       No  1932   47
##       Yes    4   17
## 
## [[2]]
##          default.ref
## glm.pred2   No  Yes
##       No  1910   73
##       Yes    5   12
## 
## [[3]]
##          default.ref
## glm.pred2   No  Yes
##       No  1916   54
##       Yes    6   24
boot.fn <- function(data, index)
  return(coef(glm(default ~ income + balance, data = data, family = "binomial", subset = index)))
boot(Default, boot.fn, 100)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01  8.754825e-02 4.172793e-01
## t2*  2.080898e-05  1.465886e-07 4.253794e-06
## t3*  5.647103e-03 -5.334405e-05 2.133563e-04
summary(glm(default ~ income + balance, data = Default, family = "binomial"))$coef
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136
set.seed(1)
outPred <- vector("logical", nrow(Weekly))
for (ii in 1:nrow(Weekly)) {
  train <- -ii   # leave out observation ii
  glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial", subset = train)
  # type = "response" returns probabilities rather than log-odds
  glm.probs <- predict(glm.fit, Weekly[-train, ], type = "response")
  glm.pred <- glm.probs > 0.5
  outPred[[ii]] <- glm.pred
}
outPred <- factor(outPred)
levels(outPred) <- c("Down", "Up")
mean(outPred == Weekly[ , "Direction"])
## [1] 0.4545455
glm.fit2 <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial")
cv.err <- cv.glm(Weekly, glm.fit2)
cv.err$delta
## [1] 0.2464536 0.2464530
set.seed(1)
df <- tibble(
  x = rnorm(100),
  y = x - 2 * x^2 + rnorm(100)
)
output <- vector("double", 4)
for (ii in 1:4) {
  lm.fit <- glm(y ~ poly(x, ii), data = df)
  cv.err <- cv.glm(df, lm.fit)
  output[[ii]] <- cv.err$delta[1]
}
output
## [1] 7.2881616 0.9374236 0.9566218 0.9539049
summary(glm(y ~ poly(x, 4), data = df))$coef
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)  -1.5500226 0.09590514 -16.1620379 5.169227e-29
## poly(x, 4)1   6.1888256 0.95905143   6.4530695 4.590732e-09
## poly(x, 4)2 -23.9483049 0.95905143 -24.9708243 1.593826e-43
## poly(x, 4)3   0.2641057 0.95905143   0.2753822 7.836207e-01
## poly(x, 4)4   1.2570950 0.95905143   1.3107691 1.930956e-01
ggplot(df, aes(x, y)) + 
  geom_point() +
  geom_smooth(formula = y ~ poly(x, 2), method = "lm", se = F)

library(MASS)
# sample mean of medv and its standard error
tibble(
  m = mean(Boston$medv),
  se = sd(Boston$medv) / sqrt(length(Boston$medv))
)
## # A tibble: 1 x 2
##       m    se
##   <dbl> <dbl>
## 1  22.5 0.409
# bootstrap estimate of the mean of medv
boot.fn <- function(data, index) return(mean(data[index]))
boot(Boston$medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.00975415   0.4130483
c(22.53281 - 2*0.4130483, 22.53281 + 2*0.4130483)
## [1] 21.70671 23.35891
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281
# bootstrap estimate of the median of medv
boot.fn <- function(data, index) return(median(data[index]))
boot(Boston$medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0127   0.3869347
# bootstrap estimate of the tenth percentile of medv
boot.fn <- function(data, index) return(quantile(data[index], seq(0, 1, 0.1))[2])
boot(Boston$medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0038   0.5127948