Statistical Learning
James Scott (UT-Austin)
Reference: Introduction to Statistical Learning Chapter 5
Cross-validation (leave-one-out and K-fold)
Nonparametric bootstrap
The residual resampling bootstrap
The parametric bootstrap
Goal: estimate the prediction error of a fitted model.
Solution so far: randomly split the data into a training set and a testing set, fit the model on the training set, and estimate the error rate on the testing set.
A perfectly sensible approach! But still has some drawbacks…
Drawback 1: the estimate of the error rate can be highly variable.
Drawback 2: only some of the data points are used to fit the model.
Key idea: average over all possible testing sets of size \( n_{\mathrm{test}} = 1 \).
For i = 1 to n: leave out observation \( i \), fit the model to the remaining \( n - 1 \) points, and compute the held-out error \( \mathrm{Err}(y_i, \hat{y}_i) \).
Estimate the error rate as: \[ \mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \mathrm{Err}(y_i, \hat{y}_i) \]
The data in “ethanol.csv” is from an experiment where an ethanol-based fuel was burned in a one-cylinder combustion engine.
library(tidyverse)
ethanol = read.csv('ethanol.csv')
ggplot(ethanol) + geom_point(aes(x=E, y=NOx)) +
theme_bw(base_size=18)
KNN with \( K=5 \):
library(FNN)
N = nrow(ethanol)
err_save = rep(0, N)
for(i in 1:N) {
X_train = ethanol$E[-i]
y_train = ethanol$NOx[-i]
X_test = ethanol$E[i]
y_test = ethanol$NOx[i]
knn_model = knn.reg(X_train, X_test, y_train, k = 5)
yhat_test = knn_model$pred
err_save[i] = (y_test - yhat_test)
}
# RMSE
sqrt(mean(err_save^2))
[1] 0.3365111
Comparison with a 5th degree polynomial:
N = nrow(ethanol)
err_save2 = rep(0, N)
for(i in 1:N) {
y_test = ethanol$NOx[i]
poly5 = lm(NOx ~ poly(E, 5), data=ethanol[-i,])
yhat_test = predict(poly5, newdata=ethanol[i,])
err_save2[i] = (y_test - yhat_test)
}
# RMSE
sqrt(mean(err_save2^2))
[1] 0.2319733
Less bias than the train/test split approach: each training set has \( n-1 \) points, nearly the full data set.
No randomness: LOOCV gives the same answer every time on the same data, since there is no random splitting.
Downside: must re-fit \( n \) times!
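(Aside: for linear models fit by least squares, ISLR eq. 5.2 gives a well-known shortcut that computes the LOOCV error from a single fit via the leverage values, so the \( n \) re-fits aren't actually needed. A minimal sketch for the 5th-degree polynomial above:)
# Shortcut for least-squares fits: LOOCV from one fit, using leverage values
poly5_full = lm(NOx ~ poly(E, 5), data=ethanol)
h = hatvalues(poly5_full)                    # leverage of each observation
loo_resid = residuals(poly5_full)/(1 - h)    # exact leave-one-out residuals
sqrt(mean(loo_resid^2))                      # LOOCV RMSE without re-fitting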
Randomly divide the data set into \( K \) nonoverlapping groups, or folds, of roughly equal size.
For fold k = 1 to K: fit the model to all the data outside fold \( k \), then compute \( \mathrm{Err}_k \), the average error on the points in fold \( k \).
Estimate the error rate as: \[ \mathrm{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^K \mathrm{Err}_k \]
Back to the ethanol data:
N = nrow(ethanol)
# Create a vector of fold indicators
K = 10
fold_id = rep_len(1:K, N) # repeats 1:K over and over again
fold_id = sample(fold_id, replace=FALSE) # permute the order randomly
err_save = rep(0, K)
for(i in 1:K) {
train_set = which(fold_id != i)
y_test = ethanol$NOx[-train_set]
poly5 = lm(NOx ~ poly(E, 5), data=ethanol[train_set,])
yhat_test = predict(poly5, newdata=ethanol[-train_set,])
err_save[i] = mean((y_test - yhat_test)^2)
}
# RMSE
sqrt(mean(err_save))
[1] 0.3686505
Key insight: there is a bias-variance tradeoff in estimating test error.
Bias comes from estimating out-of-sample error using a smaller training set than the full data set.
Key insight: there is a bias-variance tradeoff in estimating test error.
Variance comes from overlap in the training sets: the \( K \) fitted models share most of their training data, so their errors are highly correlated.
Typical values: K = 5 to K = 10 (no theory; a purely empirical observation).
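As a rough illustration of how the choice of K matters (a sketch, reusing the fold-assignment trick above on the ethanol data):
# Sketch: K-fold CV RMSE for the 5th-degree polynomial at a few values of K
cv_rmse = function(K) {
  fold_id = sample(rep_len(1:K, nrow(ethanol)))
  fold_err = rep(0, K)
  for(k in 1:K) {
    train_set = which(fold_id != k)
    fit = lm(NOx ~ poly(E, 5), data=ethanol[train_set,])
    yhat = predict(fit, newdata=ethanol[-train_set,])
    fold_err[k] = mean((ethanol$NOx[-train_set] - yhat)^2)
  }
  sqrt(mean(fold_err))
}
sapply(c(5, 10, 20), cv_rmse)  # results vary run to run because the folds are random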
Consider the loadhou.csv data set, where x = temperature and y = peak demand. So we believe that \( y_i = f(x_i) + \epsilon_i \), and we have our estimate \( \hat{f}(x) \)… How can we quantify our uncertainty for this estimate?
Fundamental frequentist thought experiment: “How might \( \hat{f}(x) \) have been different if I'd gotten different data just by chance?”
But what does “different data” actually mean?
There's a version of the bootstrap for all three situations:
Let's look at these three versions of the bootstrap one by one.
If you've seen the bootstrap before, it was probably this one!
This leads to the following algorithm.
For b = 1 to B: draw a bootstrap sample by resampling the \( n \) original data points with replacement, and refit the model to this sample to get \( \hat{f}^{(b)}(x) \).
This gives us \( B \) draws from the bootstrapped sampling distribution of \( \hat{f}(x) \).
Use these draws to form (approximate) confidence intervals and standard errors for \( f(x) \).
library(tidyverse)
loadhou = read.csv('../data/loadhou.csv')
ggplot(loadhou) + geom_point(aes(x=KHOU, y=COAST), alpha=0.1) +
theme_bw(base_size=18)
Suppose we want to know \( f(5) \) and \( f(25) \), i.e. the expected values of COAST when KHOU = 5 and KHOU = 25, respectively. Let's bootstrap a KNN model, with \( K=40 \):
library(mosaic)
library(FNN)
X_test = data.frame(KHOU=c(5,25))
boot20 = do(500)*{
loadhou_boot = resample(loadhou) # construct a bootstrap sample
X_boot = select(loadhou_boot, KHOU)
y_boot = select(loadhou_boot, COAST)
knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
knn20_boot$pred
}
head(boot20, 3) # first column is f(5), second is f(25)
V1 V2
1 10483.51 12064.16
2 10389.39 11782.52
3 10732.84 11793.46
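(For intuition: mosaic's resample() on a data frame just draws rows with replacement. A roughly equivalent base-R sketch of a single bootstrap sample:)
# roughly what resample(loadhou) does: sample row indices with replacement
idx = sample(nrow(loadhou), replace=TRUE)
loadhou_boot_manual = loadhou[idx,]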
Now we can calculate standard errors and/or confidence intervals.
se_hat = apply(boot20, 2, sd)
se_hat
V1 V2
159.8669 155.0356
apply(boot20, 2, quantile, probs=c(0.025, 0.975))
V1 V2
2.5% 10346.91 11534.12
97.5% 10953.81 12124.75
confint(boot20)
name lower upper level method estimate
1 V1 10346.91 10953.81 0.95 percentile 10638.78
2 V2 11534.12 12124.75 0.95 percentile 11906.23
X_test = data.frame(KHOU=seq(0, 35, by=0.1))
plot(COAST ~ KHOU, data=loadhou)
for(i in 1:500) {
loadhou_boot = resample(loadhou) # construct a bootstrap sample
X_boot = select(loadhou_boot, KHOU)
y_boot = select(loadhou_boot, COAST)
knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
lines(X_test$KHOU, knn20_boot$pred, col=rgb(1, 0, 0, 0.1))
}
This leads to the following algorithm. First fit the model, yielding \[ y_i = \hat{f}(x_i) + e_i \]
Then, for b = 1 to B: resample the residuals \( e_1, \ldots, e_n \) with replacement, form synthetic outcomes \( y_i^{(b)} = \hat{f}(x_i) + e_i^{(b)} \), refit the model to the synthetic data, and save \( \hat{f}^{(b)}(x) \).
ethanol = read.csv('ethanol.csv')
ggplot(ethanol) + geom_point(aes(x=E, y=NOx)) +
theme_bw(base_size=18)
Key fact: the \( (x_i, y_i) \) are not random samples here! The \( x_i \) points are fixed as part of the experimental design.
Let's quantify uncertainty for \( f(0.7) \) and \( f(0.95) \), under a 5th-degree polynomial model, via the residual resampling bootstrap.
First, let's look at the empirical distribution of the residuals:
poly5 = lm(NOx ~ poly(E, 5), data=ethanol)
yhat = fitted(poly5)
evector = ethanol$NOx - yhat # empirical distribution of residuals
hist(evector, 20)
This is our estimate \( \hat{P}(e) \) for the probability distribution of the residuals.
Now we bootstrap:
X_test = data.frame(E=c(0.7, 0.95))
boot5 = do(500)*{
e_boot = resample(evector)
y_boot = yhat + e_boot # construct synthetic outcomes
ethanol_boot = ethanol
ethanol_boot$NOx = y_boot # substitute real outcomes with synthetic ones
poly5_boot = lm(NOx ~ poly(E, 5), data=ethanol_boot)
fhat_boot = predict(poly5_boot, X_test)
fhat_boot
}
head(boot5, 3) # first column is f(0.7), second is f(0.95)
X1 X2
1 1.520835 3.592387
2 1.589200 3.472739
3 1.541575 3.556114
As before, we can get standard errors and/or confidence intervals.
se_hat = apply(boot5, 2, sd)
se_hat
X1 X2
0.07976190 0.07544626
apply(boot5, 2, quantile, probs=c(0.025, 0.975))
X1 X2
2.5% 1.355337 3.378689
97.5% 1.670854 3.694560
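As with the load data, mosaic's confint() also gives percentile intervals directly from the do() output:
confint(boot5)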
Example: see predimed_bootstrap.R
In class:
brca.csv data (the recall variable). Fit this using a logit model on the full data set.
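A minimal sketch of the parametric bootstrap for this kind of exercise, assuming a binary recall outcome; the predictor x below is a placeholder, since only the recall variable is named in the exercise and the other column names in brca.csv may differ:
library(mosaic)
brca = read.csv('brca.csv')

# Step 1: fit the logit model to the full data set
logit_fit = glm(recall ~ x, data=brca, family=binomial)  # 'x' is a placeholder predictor
p_hat = fitted(logit_fit)  # fitted recall probabilities

# Step 2: parametric bootstrap -- simulate synthetic outcomes from the fitted
# model, refit, and save the coefficients
boot_logit = do(500)*{
  brca_boot = brca
  brca_boot$recall = rbinom(nrow(brca), size=1, prob=p_hat)  # synthetic outcomes
  coef(glm(recall ~ x, data=brca_boot, family=binomial))
}

# Bootstrapped standard errors and percentile intervals for the coefficients
apply(boot_logit, 2, sd)
confint(boot_logit)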