Repeatedly drawing samples from the training set and refitting a model on each sample, in order to obtain additional information about the fitted model (for example, the variability of its estimates).
Holding out a subset of the training observations and using it to estimate the test error of a model fit on the remaining observations.
Randomly divide the observations into a training set and a validation set. The model is fit on the training set, and the validation-set MSE provides an estimate of the test error. Each random split yields a different MSE, so the estimate is highly variable, and because the model is trained on only a subset of the observations it tends to overestimate the test error as well.
Given a set of \(n\) observations, hold out a single observation and fit the model on the remaining \(n - 1\). Since \((x_1,y_1)\) is not used in the fitting process, \(\text{MSE}_1 = (y_1 - \hat y_1)^2\) provides an approximately unbiased estimate of the test error. Repeating the procedure for all \(n\) observations, the cross-validation estimate is \(\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \text{MSE}_i\). Fitting \(n\) models can be expensive, especially if each model is slow to fit. With least squares linear or polynomial regression, math comes to the rescue: the cost of LOOCV is the same as a single model fit, \[ \text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat y_i}{1 - h_i} \right)^2 \] where \(h_i\) is the leverage statistic. This is like the ordinary MSE, except that each residual is divided by \(1 - h_i\). The leverage lies between \(1/n\) and 1 and reflects the amount that an observation influences its own fit. LOOCV is a very general method, but the magic formula only applies to least squares; in all other cases the model has to be refit \(n\) times.
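A minimal sketch of the shortcut, using hatvalues() to pull the leverages from a single least-squares fit (assuming the Auto data used in the labs below); it should reproduce the LOOCV estimate that cv.glm() gives later (about 24.23):
# LOOCV for least squares without n refits: one fit plus the leverage formula
fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit)                    # leverage h_i of each observation
mean((residuals(fit) / (1 - h))^2)     # CV_(n) via the shortcut formula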
Like LOOCV, but the observations are divided into \(k\) groups, or folds, of approximately equal size. Each fold is held out in turn as a validation set while the model is fit on the remaining \(k - 1\) folds. This provides \(k\) estimates of the test error, and the \(k\)-fold CV estimate is their average, \(\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^k \text{MSE}_i\). Typical values of \(k\) are 5 and 10.
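A by-hand sketch of the fold mechanics on the Auto data used below (cv.glm() with K = 10 automates this, as in the lab later in this section):
# manual 10-fold CV for mpg ~ horsepower
set.seed(2)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Auto)))   # random fold assignment
mse.k <- rep(0, k)
for (j in 1:k) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != j, ])
  pred <- predict(fit, Auto[folds == j, ])
  mse.k[j] <- mean((Auto$mpg[folds == j] - pred)^2)
}
mean(mse.k)                                          # the k-fold CV estimate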
Aside from its computational advantage, \(k\)-fold CV often provides more accurate estimates of the test error than LOOCV because of the bias-variance trade-off. The validation set approach has high bias (the model is trained on only part of the data) and therefore overestimates the test error; \(k\)-fold CV reduces this bias, and LOOCV has the lowest bias of all. But LOOCV averages the outputs of \(n\) fitted models trained on nearly identical sets of observations, so those outputs are highly positively correlated with each other. Because the training sets in \(k\)-fold CV overlap less, its estimate has lower variance than LOOCV's.
Same idea, but for classification the MSE is replaced by the error rate, i.e. the average number of misclassifications \(\text{Err}_i = I(y_i \neq \hat y_i)\).
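For example, cv.glm() accepts a cost argument, so the CV criterion for a logistic regression can be the misclassification rate rather than a squared error. A sketch on the Default data used in the exercises below, assuming the boot package and the ISLR data are loaded:
# 10-fold CV misclassification rate for logistic regression on Default
cost <- function(r, pi) mean(abs(r - pi) > 0.5)   # r: observed 0/1, pi: fitted probability
glm.def <- glm(default ~ income + balance, data = Default, family = "binomial")
cv.glm(Default, glm.def, cost = cost, K = 10)$delta[1]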
Quantifying the uncertainty of a given estimator or statistical learning method. Repeatedly sample observations from the dataset with replacement, recompute the estimate on each bootstrap sample, and use the resulting distribution to obtain the average value and standard error of the estimate.
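A by-hand sketch of the idea on a placeholder numeric vector x (the boot() function used below automates the resampling and bookkeeping):
# bootstrap the mean of x "by hand"
x <- rnorm(100, mean = 5)                                        # placeholder data
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))   # resample and re-estimate
mean(boot.means)   # bootstrap estimate of the mean
sd(boot.means)     # bootstrap estimate of its standard error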
set.seed(1)
train <- sample(392, 196)
# fit the model, using the training data
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
# compute MSE on the hold-out data
mse.val <- mean((Auto$mpg - predict(lm.fit, Auto))[-train] ^ 2)
# use a polynomial fit
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mse.val2 <- mean((Auto$mpg - predict(lm.fit2, Auto))[-train] ^ 2)
mse.val; mse.val2
## [1] 26.14142
## [1] 19.82259
library(boot)
# use a generalized linear model;
# the default "family" (gaussian) corresponds to linear least squares
glm.fit <- glm(mpg ~ horsepower, data = Auto)
sapply(list(lm.fit, glm.fit), coef)
## [,1] [,2]
## (Intercept) 40.3403772 39.9358610
## horsepower -0.1617013 -0.1578447
# compute LOOCV and display MSE
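# delta[1] is the raw CV estimate of prediction error; delta[2] is a bias-adjusted version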
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
## [1] 24.23151 24.23114
# compute LOOCV for higher order models
cv.error <- rep(0, 5)
for (ii in 1:5) {
  cv.error[ii] <- cv.glm(Auto, glm(mpg ~ poly(horsepower, ii), data = Auto))$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
set.seed(17)
cv.error.10 <- rep(0, 10)
for (ii in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, ii), data = Auto)
  cv.error.10[ii] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
## [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
## [8] 19.71201 18.95140 19.50196
plot(cv.error.10, type = "l")
# 1. create a function that computes the statistic of interest
# 2. use boot() to repeatedly sample observations with replacement and recompute it
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}
alpha.fn(Portfolio, 1:100)
## [1] 0.5758321
# automate the random sampling with replacement
set.seed(1)
boot(Portfolio, alpha.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.5758321 6.936399e-05 0.08868935
boot.fn <- function(data, index) {
  return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
}
boot(Auto, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.9358610 0.0118427139 0.871057282
## t2* -0.1578447 -0.0002632441 0.007537443
summary(lm(mpg~horsepower, data=Auto))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187
## horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81
# the formula-based SEs assume a correctly specified model, constant error variance, and fixed x_i
# the bootstrap does not rely on these assumptions, so its estimates are likely more reliable here
boot.fn2 <- function(data, index) {
  return(coef(lm(mpg ~ horsepower + I(horsepower^2), data = data, subset = index)))
}
boot(Auto, boot.fn2, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fn2, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 56.900099702 -3.349465e-02 2.1206433892
## t2* -0.466189630 5.896138e-04 0.0340232317
## t3* 0.001230536 -2.016504e-06 0.0001234614
summary(lm(mpg~horsepower + I(horsepower^2), data=Auto))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.900099702 1.8004268063 31.60367 1.740911e-109
## horsepower -0.466189630 0.0311246171 -14.97816 2.289429e-40
## I(horsepower^2) 0.001230536 0.0001220759 10.08009 2.196340e-21
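For the quadratic model the bootstrap standard errors agree much more closely with the formula-based ones, since the quadratic fit describes the data far better and the formula's assumptions are closer to being met.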
pg. 197, Exercise 1: math proof (a sketch follows below).
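A sketch of the derivation, assuming the exercise asks for the minimum-variance \(\alpha\) implemented in alpha.fn() above: minimize
\[
\mathrm{Var}(\alpha X + (1 - \alpha) Y) = \alpha^2 \sigma_X^2 + (1 - \alpha)^2 \sigma_Y^2 + 2 \alpha (1 - \alpha) \sigma_{XY}.
\]
Setting the derivative with respect to \(\alpha\) to zero gives
\[
2 \alpha \sigma_X^2 - 2 (1 - \alpha) \sigma_Y^2 + (2 - 4\alpha) \sigma_{XY} = 0
\quad\Longrightarrow\quad
\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2 \sigma_{XY}},
\]
and the second derivative, \(2(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}) = 2\,\mathrm{Var}(X - Y) \ge 0\), confirms this is a minimum.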
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
ggplot(tibble(x = 1:100000,
              y = 1 - (1 - 1/x)^x),
       aes(x, y)) +
  geom_line() +
  scale_x_log10() +
  scale_y_continuous(limits = c(0, 1), labels = percent) +
  labs(x = "n",
       y = "Probability the jth observation\nis in the bootstrap sample")
store <- rep(NA, 10000)
for (ii in 1:10000) {
  store[ii] <- sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.6447
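The simulated proportion is close to \(1 - (1 - 1/100)^{100} \approx 0.634\), which approaches \(1 - 1/e \approx 0.632\) as \(n \to \infty\).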
Choose a random sample of observations, fit the model on it, and record the coefficients. Repeat the process many times, then compute the average and standard error of the coefficients (the bootstrap).
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(1)
output <- vector("list", 3)
output2 <- vector("list", 3)
for (ii in 1:3) {
  train <- sample(1:nrow(Default), 0.8 * nrow(Default))
  default.ref <- Default[-train, "default"]
  glm.fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
  glm.probs <- predict(glm.fit, Default[-train, ], type = "response")
  glm.pred <- factor(ifelse(glm.probs > 0.5, "Yes", "No"), levels = c("No", "Yes"))
  output[[ii]] <- table(glm.pred, default.ref)
  glm.fit2 <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
  glm.probs2 <- predict(glm.fit2, Default[-train, ], type = "response")
  glm.pred2 <- factor(ifelse(glm.probs2 > 0.5, "Yes", "No"), levels = c("No", "Yes"))
  output2[[ii]] <- table(glm.pred2, default.ref)
}
output
## [[1]]
## default.ref
## glm.pred No Yes
## No 1933 49
## Yes 3 15
##
## [[2]]
## default.ref
## glm.pred No Yes
## No 1910 74
## Yes 5 11
##
## [[3]]
## default.ref
## glm.pred No Yes
## No 1916 56
## Yes 6 22
output2
## [[1]]
## default.ref
## glm.pred2 No Yes
## No 1932 47
## Yes 4 17
##
## [[2]]
## default.ref
## glm.pred2 No Yes
## No 1910 73
## Yes 5 12
##
## [[3]]
## default.ref
## glm.pred2 No Yes
## No 1916 54
## Yes 6 24
boot.fn <- function(data, index) {
  return(coef(glm(default ~ income + balance, data = data, family = "binomial", subset = index)))
}
boot(Default, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 8.754825e-02 4.172793e-01
## t2* 2.080898e-05 1.465886e-07 4.253794e-06
## t3* 5.647103e-03 -5.334405e-05 2.133563e-04
summary(glm(default ~ income + balance, data = Default, family = "binomial"))$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
set.seed(1)
outPred <- vector("logical", nrow(Weekly))
for (ii in 1:nrow(Weekly)) {
  train <- -ii
  glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial", subset = train)
  glm.probs <- predict(glm.fit, Weekly[-train, ], type = "response")
  glm.pred <- glm.probs > 0.5
  outPred[[ii]] <- glm.pred
}
outPred <- factor(outPred)
levels(outPred) <- c("Down", "Up")
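# fraction of correct predictions; the LOOCV error rate is one minus this value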
mean(outPred == Weekly[ , "Direction"])
## [1] 0.4545455
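# note: with the default cost, cv.glm() on a binomial fit returns the average squared
# difference between the 0/1 response and the fitted probabilities, not the error rate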
glm.fit2 <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial")
cv.err <- cv.glm(Weekly, glm.fit2)
cv.err$delta
## [1] 0.2464536 0.2464530
set.seed(1)
df <- tibble(
  x = rnorm(100),
  y = x - 2 * x^2 + rnorm(100)
)
output <- vector("double", 4)
for (ii in 1:4) {
  lm.fit <- glm(y ~ poly(x, ii), data = df)
  cv.err <- cv.glm(df, lm.fit)
  output[[ii]] <- cv.err$delta[1]
}
output
## [1] 7.2881616 0.9374236 0.9566218 0.9539049
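The LOOCV error drops sharply from the linear to the quadratic fit, and the quadratic model has the smallest error, as expected since the data were generated from a quadratic model; the cubic and quartic terms give no further improvement.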
summary(glm(y ~ poly(x, 4), data = df))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5500226 0.09590514 -16.1620379 5.169227e-29
## poly(x, 4)1 6.1888256 0.95905143 6.4530695 4.590732e-09
## poly(x, 4)2 -23.9483049 0.95905143 -24.9708243 1.593826e-43
## poly(x, 4)3 0.2641057 0.95905143 0.2753822 7.836207e-01
## poly(x, 4)4 1.2570950 0.95905143 1.3107691 1.930956e-01
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(formula = y ~ poly(x, 2), method = "lm", se = FALSE)
library(MASS)
tibble(
  m = mean(Boston$medv),
  sd = sd(Boston$medv) / sqrt(length(Boston$medv))
)
## # A tibble: 1 x 2
## m sd
## <dbl> <dbl>
## 1 22.5 0.409
boot.fn <- function(data, index) return(mean(data[index]))
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.00975415 0.4130483
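# rough 95% confidence interval: estimate ± 2 bootstrap standard errors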
c(22.53281 - 2 * 0.4130483, 22.53281 + 2 * 0.4130483)
## [1] 21.70671 23.35891
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
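The bootstrap interval is very close to the 95% confidence interval produced by t.test().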
boot.fn <- function(data, index) return(median(data[index]))
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0127 0.3869347
boot.fn <- function(data, index) return(quantile(data[index], 0.1))
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0038 0.5127948