Resampling

Statistical Learning
James Scott (UT-Austin)

Reference: Introduction to Statistical Learning Chapter 5

Resampling methods

  1. Cross-validation (leave-one-out and K-fold)

  2. Nonparametric bootstrap

  3. The residual resampling bootstrap

  4. The parametric bootstrap

Recall train/test splits

Goal: estimate the prediction error of a fitted model.

Solution so far:

  • Randomly split the data into a training set and testing set.
  • Fit the model(s) on the training set.
  • Make predictions on the test set using the fitted model(s).
  • Check how well you did (RMSE, classification error, deviance…)
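
As a refresher, here is what that recipe looks like in R. This is a minimal sketch with made-up data (the data frame dat and the linear model are purely illustrative):

dat = data.frame(x = runif(100))          # made-up data, standing in for yours
dat$y = 2*dat$x + rnorm(100, sd=0.5)

n = nrow(dat)
train_ind = sample(1:n, size = floor(0.8*n))  # random 80% of rows for training
lm1 = lm(y ~ x, data = dat[train_ind,])       # fit on the training set
yhat_test = predict(lm1, newdata = dat[-train_ind,])  # predict on the test set
sqrt(mean((dat$y[-train_ind] - yhat_test)^2))  # test-set RMSE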

A perfectly sensible approach! But it still has some drawbacks…

Train/test splits: drawbacks

Drawback 1: the estimate of the error rate can be highly variable.

  • It depends strongly on which observations end up in the training and testing sets.
  • Can be mitigated by averaging over multiple train/test splits, but this can get computationally expensive.

Drawback 2: only some of the data points are used to fit the model.

  • Statistical methods tend to perform worse with less data.
  • So the test-set error rate is biased: it tends to overestimate the error rate we'd see if we trained on the full data set.
  • This effect is especially strong for complex models.

Leave-one-out cross validation (LOOCV)

Key idea: average over all possible testing sets of size \( n_{\mathrm{test}} = 1 \).

For i = 1 to n:

  • Fit the model using all data points except \( i \).
  • Predict \( y_i \) with \( \hat{y_i} \), using the fitted model.
  • Calculate the error, \( \mathrm{Err}(y_i, \hat{y_i}) \).

Estimate the error rate as: \[ \mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \mathrm{Err}(y_i, \hat{y_i}) \]

LOOCV: quick example

The data in “ethanol.csv” is from an experiment where an ethanol-based fuel was burned in a one-cylinder combustion engine.

  • The experimenter varied the equivalence ratio (E), which measures the richness of the fuel-air mixture in the combustion chamber.
  • For each setting of E, emissions of nitrogen oxides (NOx) were recorded.

LOOCV: quick example

library(tidyverse)
ethanol = read.csv('ethanol.csv')
ggplot(ethanol) + geom_point(aes(x=E, y=NOx))  + 
  theme_bw(base_size=18)

[Plot: scatterplot of NOx versus E for the ethanol data]

LOOCV: quick example

KNN with \( K=5 \):

library(FNN)

N = nrow(ethanol)
err_save = rep(0, N)
for(i in 1:N) {
  X_train = ethanol$E[-i]
  y_train = ethanol$NOx[-i]
  X_test = ethanol$E[i]
  y_test = ethanol$NOx[i]
  knn_model = knn.reg(X_train, X_test, y_train, k = 5)
  yhat_test = knn_model$pred
  err_save[i] = (y_test - yhat_test)
}
# RMSE
sqrt(mean(err_save^2))
[1] 0.3365111

LOOCV: quick example

Comparison with a 5th degree polynomial:

N = nrow(ethanol)
err_save2 = rep(0, N)
for(i in 1:N) {
  y_test = ethanol$NOx[i]
  poly5 = lm(NOx ~ poly(E, 5), data=ethanol[-i,])
  yhat_test = predict(poly5, newdata=ethanol[i,])
  err_save2[i] = (y_test - yhat_test)  # raw prediction error, as in the KNN example
}
# RMSE
sqrt(mean(err_save2^2))

LOOCV: pros/cons

Less bias than the train/test split approach:

  • fitting with \( n-1 \) points is almost the same as fitting with \( n \) points.
  • Thus LOOCV is less prone to overestimating the error rate we'd see when training on the full data set.

No randomness:

  • Estimating the error rate using train/test splits will yield different results when applied repeatedly (Monte Carlo variability)
  • LOOCV will give the same result every time.

Downside: must re-fit \( n \) times!

K-fold cross validation

Randomly divide the data set into \( K \) nonoverlapping groups, or folds, of roughly equal size.

For fold k = 1 to K:

  • Fit the model using all data points not in fold \( k \).
  • For all points \( (y_i, x_i) \) in fold \( k \), predict \( \hat{y_i} \) using the fitted model.
  • Calculate \( \mathrm{Err}_k \), the average error on fold \( k \).

Estimate the error rate as: \[ \mathrm{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^K \mathrm{Err}_k \]

10-fold CV: quick example

Back to the ethanol data:

N = nrow(ethanol)

# Create a vector of fold indicators
K = 10
fold_id = rep_len(1:K, N)  # repeats 1:K over and over again
fold_id = sample(fold_id, replace=FALSE) # permute the order randomly

err_save = rep(0, K)
for(i in 1:K) {
  train_set = which(fold_id != i)
  y_test = ethanol$NOx[-train_set]
  poly5 = lm(NOx ~ poly(E, 5), data=ethanol[train_set,])
  yhat_test = predict(poly5, newdata=ethanol[-train_set,])
  err_save[i] = mean((y_test - yhat_test)^2)
}
# RMSE
sqrt(mean(err_save))
[1] 0.3686505

K-fold cross validation

  • Generally requires less computation than LOOCV (K refits, versus N). If N is extremely large, LOOCV is almost certainly infeasible.
  • More stable (lower variance) than running \( K \) random train/test splits.
  • LOOCV is a special case of K-fold CV (with K = N).

K-fold CV: bias-variance tradeoff

Key insight: there is a bias-variance tradeoff in estimating test error.

Bias comes from estimating out-of-sample error using a smaller training set than the full data set.

  • LOOCV: minimal bias, since each fit uses \( N-1 \) points.
  • K-fold: some bias, e.g. each fit uses 80% of the data when \( K=5 \).

K-fold CV: bias-variance tradeoff

Key insight: there is a bias-variance tradeoff in estimating test error.

Variance comes from overlap in the training sets:

  • In LOOCV, we average the outputs of N fitted models, each trained on nearly identical observations.
  • In K-fold, we average the outputs of \( K \) fitted models that are less correlated, since they have less overlap in the training data. Generally less variance than LOOCV.

Typical values: K = 5 to K = 10 (no theory; a purely empirical observation).

K-fold CV: in class

  • Revisit the loadhou.csv data set, where x = temperature and y = peak demand.
    1. Set up a script that uses K-fold CV to estimate RMSE for a polynomial regression model of order \( M \) (a starter sketch follows this list).
    2. Run your script for all values of \( M \) from 1 to 10 to get the optimal value for the highest power of x to include in the model. Make a plot of RMSE versus \( M \).
    3. Re-run your script several times to get different allocations of data points into folds. Each time plot RMSE versus \( M \). Does the plot change drastically from one run to the next?
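
One possible starting point for parts 1 and 2, sketched under the assumption that loadhou.csv has columns KHOU (temperature) and COAST (peak demand), as in the example later in these slides. The file path, fold count, and bookkeeping are all choices you can change:

loadhou = read.csv('loadhou.csv')  # adjust the path if needed

N = nrow(loadhou)
K = 10
fold_id = sample(rep_len(1:K, N))  # random fold assignments

rmse_M = rep(0, 10)
for(M in 1:10) {
  err_save = rep(0, K)
  for(k in 1:K) {
    train_set = which(fold_id != k)
    polyM = lm(COAST ~ poly(KHOU, M), data=loadhou[train_set,])
    yhat_test = predict(polyM, newdata=loadhou[-train_set,])
    err_save[k] = mean((loadhou$COAST[-train_set] - yhat_test)^2)
  }
  rmse_M[M] = sqrt(mean(err_save))  # CV estimate of RMSE for order M
}
plot(1:10, rmse_M, type='b', xlab='M', ylab='RMSE')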

The bootstrap

So we believe that \( y_i = f(x_i) + \epsilon_i \), and we have our estimate \( \hat{f}(x) \)… How can we quantify our uncertainty for this estimate?

Fundamental frequentist thought experiment: “How might \( \hat{f}(x) \) have been different if I'd gotten different data just by chance?”

But what does “different data” actually mean?

  • a different sample (of size \( N \)) of \( (x_i, y_i) \) pairs from the same population?
  • a different set of residuals?
  • a different realization of the same underlying random process/phenomenon?

The bootstrap

There's a version of the bootstrap for each of these three situations:

  • a different sample (of size \( N \)) of \( (x_i, y_i) \) pairs from the same population? The nonparametric bootstrap
  • a different set of residuals? The residual-resampling bootstrap
  • a different realization of the same underlying random process/phenomenon? The parametric bootstrap

Let's look at these three versions of the bootstrap one by one.

The nonparametric bootstrap

If you've seen the bootstrap before, it was probably this one!

  • Question: “how might my estimate \( \hat{f}(x) \) have been different if I'd seen a different sample of \( (x_i, y_i) \) pairs from the same population?”
  • Assumption: each \( (x_i, y_i) \) is a random sample from a joint distribution \( P(x, y) \) describing the population from which your sample was drawn.
  • Problem: We don't know \( P(x, y) \).
  • Solution: Approximate \( P(x, y) \) by \( \hat{P}(x, y) \), the empirical joint distribution of the data in your sample.
  • Key fact for implementation: sampling from \( \hat{P}(x, y) \) is equivalent to sampling with replacement from the original sample.
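
In R, that key fact is a one-liner. A tiny illustration with a made-up data frame (mosaic's resample(), used in the examples below, does exactly this):

dat = data.frame(x = runif(10), y = rnorm(10))     # toy stand-in for your sample
dat_boot = dat[sample(nrow(dat), replace=TRUE), ]  # N rows drawn with replacement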

The nonparametric bootstrap

This leads to the following algorithm.

For b = 1 to B:

  • Construct a sample from \( \hat{P}(x, y) \) (called a bootstrapped sample) by sampling \( N \) pairs \( (x_i, y_i) \) with replacement from the original sample.
  • Refit the model to the bootstrapped sample, giving you \( \hat{f}^{(b)} \).

This gives us \( B \) draws from the bootstrapped sampling distribution of \( \hat{f}(x) \).

Use these draws to form (approximate) confidence intervals and standard errors for \( f(x) \).

An example

library(tidyverse)
loadhou = read.csv('../data/loadhou.csv')

ggplot(loadhou) + geom_point(aes(x=KHOU, y=COAST), alpha=0.1) + 
  theme_bw(base_size=18)

[Plot: scatterplot of COAST versus KHOU for the loadhou data]

An example

Suppose we want to know \( f(5) \) and \( f(25) \), i.e. the expected values of COAST when KHOU = 5 and KHOU = 25, respectively. Let's bootstrap a KNN model, with \( K=40 \):

library(mosaic)
library(FNN)

X_test = data.frame(KHOU=c(5,25))
boot20 = do(500)*{
  loadhou_boot = resample(loadhou)  # construct a bootstrap sample
  X_boot = select(loadhou_boot, KHOU)
  y_boot = select(loadhou_boot, COAST)
  knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
  knn20_boot$pred
}
head(boot20, 3)  # first column is f(5), second is f(25)  
        V1       V2
1 10483.51 12064.16
2 10389.39 11782.52
3 10732.84 11793.46

An example

Now we can calculate standard errors and/or confidence intervals.

  • Standard errors: take the standard deviation of each column.
se_hat = apply(boot20, 2, sd)
se_hat
      V1       V2 
159.8669 155.0356 
  • Confidence intervals: calculate quantiles for each column
apply(boot20, 2, quantile, probs=c(0.025, 0.975))  
            V1       V2
2.5%  10346.91 11534.12
97.5% 10953.81 12124.75

An example

  • Shortcut:
confint(boot20)
  name    lower    upper level     method estimate
1   V1 10346.91 10953.81  0.95 percentile 10638.78
2   V2 11534.12 12124.75  0.95 percentile 11906.23

Spaghetti plot: using base R graphics

X_test = data.frame(KHOU=seq(0, 35, by=0.1))
plot(COAST ~ KHOU, data=loadhou)
for(i in 1:500) {
  loadhou_boot = resample(loadhou)  # construct a bootstrap sample
  X_boot = select(loadhou_boot, KHOU)
  y_boot = select(loadhou_boot, COAST)
  knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
  lines(X_test$KHOU,  knn20_boot$pred, col=rgb(1, 0, 0, 0.1))
}

[Plot: 500 bootstrapped KNN fits (red curves) overlaid on the COAST versus KHOU scatterplot]

The residual-resampling bootstrap

  • Question: “how might my estimate \( \hat{f}(x) \) have been different if the error terms/residuals had been different?”
  • Assumption: each residual \( e_i \) is a random sample from a probability distribution \( P(e) \) describing the noise in your data set.
  • Problem: We don't know \( P(e) \).
  • Solution: Approximate \( P(e) \) by \( \hat{P}(e) \), the empirical distribution of the residuals from your fitted model.
  • Key fact for implementation: sampling from \( \hat{P}(e) \) is equivalent to sampling with replacement from residuals of your fitted model.

The residual-resampling bootstrap

This leads to the following algorithm. First fit the model, yielding \[ y_i = \hat{f}(x_i) + e_i \]

Then, for b = 1 to B:

  • Construct a sample from \( \hat{P}(e) \) by sampling \( N \) residuals \( e^{(b)}_i \) with replacement from the original residuals \( e_1, \ldots, e_N \).
  • Construct synthetic outcomes \( y_i^{(b)} \) by setting \[ y_i^{(b)} = \hat{f}(x_i) + e^{(b)}_i \]
  • Refit the model to the \( (x_i, y_i^{(b)}) \) pairs, yielding \( \hat{f}^{(b)} \).

An example

ethanol = read.csv('ethanol.csv')
ggplot(ethanol) + geom_point(aes(x=E, y=NOx)) + 
  theme_bw(base_size=18)

[Plot: scatterplot of NOx versus E for the ethanol data]

Key fact: the \( (x_i, y_i) \) are not random samples here! The \( x_i \) points are fixed as part of the experimental design.

An example

Let's quantify uncertainty for \( f(0.7) \) and \( f(0.95) \), under a 5th-order polynomial model, via the residual-resampling bootstrap.

First, let's look at the empirical distribution of the residuals:

poly5 = lm(NOx ~ poly(E, 5), data=ethanol)
yhat = fitted(poly5)
evector = ethanol$NOx - yhat  # empirical distribution of residuals

An example

hist(evector, 20)

[Plot: histogram of the residuals]

This is our estimate \( \hat{P}(e) \) for the probability distribution of the residuals.

An example

Now we bootstrap:

X_test = data.frame(E=c(0.7, 0.95))
boot5 = do(500)*{
  e_boot = resample(evector)
  y_boot = yhat + e_boot  # construct synthetic outcomes
  ethanol_boot = ethanol
  ethanol_boot$NOx = y_boot  # substitute real outcomes with synthetic ones
  poly5_boot = lm(NOx ~ poly(E, 5), data=ethanol_boot)
  fhat_boot = predict(poly5_boot, X_test)
  fhat_boot
}
head(boot5, 3)  # first column is f(0.7), second is f(0.95)  
        X1       X2
1 1.520835 3.592387
2 1.589200 3.472739
3 1.541575 3.556114

An example

As before, we can get standard errors and/or confidence intervals.

  • Standard errors:
se_hat = apply(boot5, 2, sd)
se_hat
        X1         X2 
0.07976190 0.07544626 
  • Confidence intervals: calculate quantiles for each column
apply(boot5, 2, quantile, probs=c(0.025, 0.975))  
            X1       X2
2.5%  1.355337 3.378689
97.5% 1.670854 3.694560

The parametric bootstrap

  • Question: “how might my estimate \( \hat{f}(x) \) have been different if my outcomes were a different realization from the same underlying conditional distribution \( P(y_i \mid x_i) \)?”
  • Assumption: each outcome \( y_i \) is a random sample from a conditional probability distribution \( P(y \mid x) \).
  • Problem: We don't know \( P(y \mid x) \).
  • Solution: Approximate \( P(y \mid x) \) by \( \hat{P}(y \mid x) \), the family of conditional distributions estimated from your sample.
  • Key fact for implementation: \( \hat{P}(y \mid x) \) is a parametric probability model parametrized by \( \hat{f}(x) \), the fitted function.
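
For concreteness, here is a minimal sketch of the parametric bootstrap for a logistic regression, using simulated data (the data frame df, the predictor x, the test point, and B = 500 are all hypothetical choices for illustration):

n = 200
df = data.frame(x = rnorm(n))             # hypothetical predictor
df$y = rbinom(n, 1, plogis(-1 + 2*df$x))  # simulated binary outcomes

logit_hat = glm(y ~ x, data=df, family=binomial)
p_hat = fitted(logit_hat)   # estimated P(y=1 | x_i): our parametric P-hat(y | x)

x_test = data.frame(x=1)
boot_pred = rep(0, 500)
for(b in 1:500) {
  df_b = df
  df_b$y = rbinom(n, 1, p_hat)  # simulate new outcomes from the fitted model
  logit_b = glm(y ~ x, data=df_b, family=binomial)
  boot_pred[b] = predict(logit_b, newdata=x_test, type='response')
}
quantile(boot_pred, probs=c(0.025, 0.975))  # approx. 95% CI for P(y=1 | x=1)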

The parametric bootstrap

Example: see predimed_bootstrap.R

In class:

  • Return to brca.csv.
  • Consider the conditional probability P(cancer | risk factors) before screening (i.e. not using the recall variable). Fit this using a logit model on the full data set.
  • Then address the question: are there some patients for whom P(cancer | risk factors) is much more uncertain than others?