Resampling

Statistical Learning
James Scott (UT-Austin)

Reference: Introduction to Statistical Learning Chapter 5

Resampling methods

  1. Cross-validation (leave-one-out and K-fold)

  2. Nonparametric bootstrap

  3. The residual resampling bootstrap

  4. The parametric bootstrap

Recall train/test splits

Goal: estimate the prediction error of a fitted model.

Solution so far:

  • Randomly split the data into a training set and testing set.
  • Fit the model(s) on the training set.
  • Make predictions on the test set using the fitted model(s).
  • Check how well you did (RMSE, classification error, deviance…)
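
As a refresher, here is what that recipe looks like in R. This is a minimal sketch with made-up data (the data frame dat and the linear model are purely illustrative):

dat = data.frame(x = runif(100))          # made-up data, standing in for yours
dat$y = 2*dat$x + rnorm(100, sd=0.5)

n = nrow(dat)
train_ind = sample(1:n, size = floor(0.8*n))  # random 80% of rows for training
lm1 = lm(y ~ x, data = dat[train_ind,])       # fit on the training set
yhat_test = predict(lm1, newdata = dat[-train_ind,])  # predict on the test set
sqrt(mean((dat$y[-train_ind] - yhat_test)^2))  # test-set RMSE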

A perfectly sensible approach! But it still has some drawbacks…

Train/test splits: drawbacks

Drawback 1: the estimate of the error rate can be highly variable.

  • It depends strongly on which observations end up in the training and testing sets.
  • Can be mitigated by averaging over multiple train/test splits, but this can get computationally expensive.

Drawback 2: only some of the data points are used to fit the model.

  • Statistical methods tend to perform worse with less data.
  • So the test-set error rate is biased: it tends to overestimate the error rate we'd see if we trained on the full data set.
  • This effect is especially strong for complex models.

Leave-one-out cross validation (LOOCV)

Key idea: average over all possible testing sets of size \( n_{\mathrm{test}} = 1 \).

For i = 1 to n:

  • Fit the model using all data points except \( i \).
  • Predict \( y_i \) with \( \hat{y_i} \), using the fitted model.
  • Calculate the error, \( \mathrm{Err}(y_i, \hat{y_i}) \).

Estimate the error rate as: \[ \mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \mathrm{Err}(y_i, \hat{y_i}) \]

LOOCV: quick example

The data in “ethanol.csv” is from an experiment where an ethanol-based fuel was burned in a one-cylinder combustion engine.

  • The experimenter varied the equivalence ratio (E), which measures the richness of the fuel-air mixture in the combustion chamber.
  • For each setting of E, emissions of nitrogen oxides (NOx) were recorded.

LOOCV: quick example

library(tidyverse)
ethanol = read.csv('ethanol.csv')
ggplot(ethanol) + geom_point(aes(x=E, y=NOx))  + 
  theme_bw(base_size=18)

[Plot: scatterplot of NOx versus E for the ethanol data]

LOOCV: quick example

KNN with \( K=5 \):

library(FNN)

N = nrow(ethanol)
err_save = rep(0, N)
for(i in 1:N) {
  X_train = ethanol$E[-i]
  y_train = ethanol$NOx[-i]
  X_test = ethanol$E[i]
  y_test = ethanol$NOx[i]
  knn_model = knn.reg(X_train, X_test, y_train, k = 5)
  yhat_test = knn_model$pred
  err_save[i] = (y_test - yhat_test)
}
# RMSE
sqrt(mean(err_save^2))
[1] 0.3365111

LOOCV: quick example

Comparison with a 5th degree polynomial:

N = nrow(ethanol)
err_save2 = rep(0, N)
for(i in 1:N) {
  y_test = ethanol$NOx[i]
  poly5 = lm(NOx ~ poly(E, 5), data=ethanol[-i,])
  yhat_test = predict(poly5, newdata=ethanol[i,])
  err_save2[i] = (y_test - yhat_test)  # raw prediction error, as in the KNN example
}
# RMSE
sqrt(mean(err_save2^2))

LOOCV: pros/cons

Less bias than the train/test split approach:

  • fitting with \( n-1 \) points is almost the same as fitting with \( n \) points.
  • Thus LOOCV is less prone to overestimating the error rate we'd see when training on the full data set.

No randomness:

  • Estimating the error rate using train/test splits will yield different results when applied repeatedly (Monte Carlo variability)
  • LOOCV will give the same result every time.

Downside: must re-fit \( n \) times!

K-fold cross validation

Randomly divide the data set into \( K \) nonoverlapping groups, or folds, of roughly equal size.

For fold k = 1 to K:

  • Fit the model using all data points not in fold \( k \).
  • For all points \( (y_i, x_i) \) in fold \( k \), predict \( \hat{y_i} \) using the fitted model.
  • Calculate \( \mathrm{Err}_k \), the average error on fold \( k \).

Estimate the error rate as: \[ \mathrm{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^K \mathrm{Err}_k \]

10-fold CV: quick example

Back to the ethanol data:

N = nrow(ethanol)

# Create a vector of fold indicators
K = 10
fold_id = rep_len(1:K, N)  # repeats 1:K over and over again
fold_id = sample(fold_id, replace=FALSE) # permute the order randomly

err_save = rep(0, K)
for(i in 1:K) {
  train_set = which(fold_id != i)
  y_test = ethanol$NOx[-train_set]
  poly5 = lm(NOx ~ poly(E, 5), data=ethanol[train_set,])
  yhat_test = predict(poly5, newdata=ethanol[-train_set,])
  err_save[i] = mean((y_test - yhat_test)^2)
}
# RMSE
sqrt(mean(err_save))
[1] 0.3686505

K-fold cross validation

  • Generally requires less computation than LOOCV (K refits, versus N). If N is extremely large, LOOCV is almost certainly infeasible.
  • More stable (lower variance) than running \( K \) random train/test splits.
  • LOOCV is a special case of K-fold CV (with K = N).

K-fold CV: bias-variance tradeoff

Key insight: there is a bias-variance tradeoff in estimating test error.

Bias comes from estimating out-of-sample error using a smaller training set than the full data set.

  • LOOCV: minimal bias, since each fit uses \( N-1 \) points.
  • K-fold: some bias, e.g. each fit uses 80% of the data when \( K=5 \).

K-fold CV: bias-variance tradeoff

Key insight: there is a bias-variance tradeoff in estimating test error.

Variance comes from overlap in the training sets:

  • In LOOCV, we average the outputs of N fitted models, each trained on nearly identical observations.
  • In K-fold, we average the outputs of \( K \) fitted models that are less correlated, since they have less overlap in the training data. Generally less variance than LOOCV.

Typical values: K = 5 to K = 10 (no theory; a purely empirical observation).

K-fold CV: in class

  • Revisit the loadhou.csv data set, where x = temperature and y = peak demand.
    1. Set up a script that uses K-fold CV to estimate RMSE for a polynomial regression model of order \( M \) (a starter sketch follows this list).
    2. Run your script for all values of \( M \) from 1 to 10 to get the optimal value for the highest power of x to include in the model. Make a plot of RMSE versus \( M \).
    3. Re-run your script several times to get different allocations of data points into folds. Each time plot RMSE versus \( M \). Does the plot change drastically from one run to the next?
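
One possible starting point for parts 1 and 2, sketched under the assumption that loadhou.csv has columns KHOU (temperature) and COAST (peak demand), as in the example later in these slides. The file path, fold count, and bookkeeping are all choices you can change:

loadhou = read.csv('loadhou.csv')  # adjust the path if needed

N = nrow(loadhou)
K = 10
fold_id = sample(rep_len(1:K, N))  # random fold assignments

rmse_M = rep(0, 10)
for(M in 1:10) {
  err_save = rep(0, K)
  for(k in 1:K) {
    train_set = which(fold_id != k)
    polyM = lm(COAST ~ poly(KHOU, M), data=loadhou[train_set,])
    yhat_test = predict(polyM, newdata=loadhou[-train_set,])
    err_save[k] = mean((loadhou$COAST[-train_set] - yhat_test)^2)
  }
  rmse_M[M] = sqrt(mean(err_save))  # CV estimate of RMSE for order M
}
plot(1:10, rmse_M, type='b', xlab='M', ylab='RMSE')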

The bootstrap

So we believe that \( y_i = f(x_i) + \epsilon_i \), and we have our estimate \( \hat{f}(x) \)… How can we quantify our uncertainty for this estimate?

Fundamental frequentist thought experiment: “How might \( \hat{f}(x) \) have been different if I'd gotten different data just by chance?”

But what does “different data” actually mean?

  • a different sample (of size \( N \)) of \( (x_i, y_i) \) pairs from the same population?
  • a different set of residuals?
  • a different realization of the same underlying random process/phenomenon?

The bootstrap

There's a version of the bootstrap for each of these three situations:

  • a different sample (of size \( N \)) of \( (x_i, y_i) \) pairs from the same population? The nonparametric bootstrap
  • a different set of residuals? The residual-resampling bootstrap
  • a different realization of the same underlying random process/phenomenon? The parametric bootstrap

Let's look at these three versions of the bootstrap one by one.

The nonparametric bootstrap

If you've seen the bootstrap before, it was probably this one!

  • Question: “how might my estimate \( \hat{f}(x) \) have been different if I'd seen a different sample of \( (x_i, y_i) \) pairs from the same population?”
  • Assumption: each \( (x_i, y_i) \) is a random sample from a joint distribution \( P(x, y) \) describing the population from which your sample was drawn.
  • Problem: We don't know \( P(x, y) \).
  • Solution: Approximate \( P(x, y) \) by \( \hat{P}(x, y) \), the empirical joint distribution of the data in your sample.
  • Key fact for implementation: sampling from \( \hat{P}(x, y) \) is equivalent to sampling with replacement from the original sample.
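
In R, that key fact is a one-liner. A tiny illustration with a made-up data frame (mosaic's resample(), used in the examples below, does exactly this):

dat = data.frame(x = runif(10), y = rnorm(10))     # toy stand-in for your sample
dat_boot = dat[sample(nrow(dat), replace=TRUE), ]  # N rows drawn with replacement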

The nonparametric bootstrap

This leads to the following algorithm.

For b = 1 to B:

  • Construct a sample from \( \hat{P}(x, y) \) (called a bootstrapped sample) by sampling \( N \) pairs \( (x_i, y_i) \) with replacement from the original sample.
  • Refit the model to the bootstrapped sample, giving you \( \hat{f}^{(b)} \).

This gives us \( B \) draws from the bootstrapped sampling distribution of \( \hat{f}(x) \).

Use these draws to form (approximate) confidence intervals and standard errors for \( f(x) \).

An example

library(tidyverse)
loadhou = read.csv('../data/loadhou.csv')

ggplot(loadhou) + geom_point(aes(x=KHOU, y=COAST), alpha=0.1) + 
  theme_bw(base_size=18)

[Plot: scatterplot of COAST versus KHOU for the loadhou data]

An example

Suppose we want to know \( f(5) \) and \( f(25) \), i.e. the expected values of COAST when KHOU = 5 and KHOU = 25, respectively. Let's bootstrap a KNN model, with \( K=40 \):

library(mosaic)
library(FNN)

X_test = data.frame(KHOU=c(5,25))
boot20 = do(500)*{
  loadhou_boot = resample(loadhou)  # construct a bootstrap sample
  X_boot = select(loadhou_boot, KHOU)
  y_boot = select(loadhou_boot, COAST)
  knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
  knn20_boot$pred
}
head(boot20, 3)  # first column is f(5), second is f(25)  
        V1       V2
1 10483.51 12064.16
2 10389.39 11782.52
3 10732.84 11793.46

An example

Now we can calculate standard errors and/or confidence intervals.

  • Standard errors: take the standard deviation of each column.
se_hat = apply(boot20, 2, sd)
se_hat
      V1       V2 
159.8669 155.0356 
  • Confidence intervals: calculate quantiles for each column
apply(boot20, 2, quantile, probs=c(0.025, 0.975))  
            V1       V2
2.5%  10346.91 11534.12
97.5% 10953.81 12124.75

An example

  • Shortcut:
confint(boot20)
  name    lower    upper level     method estimate
1   V1 10346.91 10953.81  0.95 percentile 10638.78
2   V2 11534.12 12124.75  0.95 percentile 11906.23

Spaghetti plot: using base R graphics

X_test = data.frame(KHOU=seq(0, 35, by=0.1))
plot(COAST ~ KHOU, data=loadhou)
for(i in 1:500) {
  loadhou_boot = resample(loadhou)  # construct a bootstrap sample
  X_boot = select(loadhou_boot, KHOU)
  y_boot = select(loadhou_boot, COAST)
  knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
  lines(X_test$KHOU,  knn20_boot$pred, col=rgb(1, 0, 0, 0.1))
}

[Plot: 500 bootstrapped KNN fits (red curves) overlaid on the COAST versus KHOU scatterplot]

The residual-resampling bootstrap

  • Question: “how might my estimate \( \hat{f}(x) \) have been different if the error terms/residuals had been different?”
  • Assumption: each residual \( e_i \) is a random sample from a probability distribution \( P(e) \) describing the noise in your data set.
  • Problem: We don't know \( P(e) \).
  • Solution: Approximate \( P(e) \) by \( \hat{P}(e) \), the empirical distribution of the residuals from your fitted model.
  • Key fact for implementation: sampling from \( \hat{P}(e) \) is equivalent to sampling with replacement from residuals of your fitted model.

The residual-resampling bootstrap

This leads to the following algorithm. First fit the model, yielding \[ y_i = \hat{f}(x_i) + e_i \]

Then, for b = 1 to B:

  • Construct a sample from \( \hat{P}(e) \) by sampling \( N \) residuals \( e^{(b)}_i \) with replacement from the original residuals \( e_1, \ldots, e_N \).
  • Construct synthetic outcomes \( y_i^{(b)} \) by setting \[ y_i^{(b)} = \hat{f}(x_i) + e^{(b)}_i \]
  • Refit the model to the \( (x_i, y_i^{(b)}) \) pairs, yielding \( \hat{f}^{(b)} \).

An example

ethanol = read.csv('ethanol.csv')
ggplot(ethanol) + geom_point(aes(x=E, y=NOx)) + 
  theme_bw(base_size=18)

[Plot: scatterplot of NOx versus E for the ethanol data]

Key fact: the \( (x_i, y_i) \) are not random samples here! The \( x_i \) points are fixed as part of the experimental design.

An example

Let's quantify uncertainty for \( f(0.7) \) and \( f(0.95) \), under a 5th-order polynomial model, via the residual-resampling bootstrap.

First, let's look at the empirical distribution of the residuals:

poly5 = lm(NOx ~ poly(E, 5), data=ethanol)
yhat = fitted(poly5)
evector = ethanol$NOx - yhat  # empirical distribution of residuals

An example

hist(evector, 20)

[Plot: histogram of the residuals]

This is our estimate \( \hat{P}(e) \) for the probability distribution of the residuals.

An example

Now we bootstrap:

X_test = data.frame(E=c(0.7, 0.95))
boot5 = do(500)*{
  e_boot = resample(evector)
  y_boot = yhat + e_boot  # construct synthetic outcomes
  ethanol_boot = ethanol
  ethanol_boot$NOx = y_boot  # substitute real outcomes with synthetic ones
  poly5_boot = lm(NOx ~ poly(E, 5), data=ethanol_boot)
  fhat_boot = predict(poly5_boot, X_test)
  fhat_boot
}
head(boot5, 3)  # first column is f(0.7), second is f(0.95)  
        X1       X2
1 1.520835 3.592387
2 1.589200 3.472739
3 1.541575 3.556114

An example

As before, we can get standard errors and/or confidence intervals.

  • Standard errors:
se_hat = apply(boot5, 2, sd)
se_hat
        X1         X2 
0.07976190 0.07544626 
  • Confidence intervals: calculate quantiles for each column
apply(boot5, 2, quantile, probs=c(0.025, 0.975))  
            X1       X2
2.5%  1.355337 3.378689
97.5% 1.670854 3.694560

The parametric bootstrap

  • Question: “how might my estimate \( \hat{f}(x) \) have been different if my outcomes were a different realization from the same underlying conditional distribution \( P(y_i \mid x_i) \)?”
  • Assumption: each outcome \( y_i \) is a random sample from a conditional probability distribution \( P(y \mid x) \).
  • Problem: We don't know \( P(y \mid x) \).
  • Solution: Approximate \( P(y \mid x) \) by \( \hat{P}(y \mid x) \), the family of conditional distributions estimated from your sample.
  • Key fact for implementation: \( \hat{P}(y \mid x) \) is a parametric probability model parametrized by \( \hat{f}(x) \), the fitted function.
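
For concreteness, here is a minimal sketch of the parametric bootstrap for a logistic regression, using simulated data (the data frame df, the predictor x, the test point, and B = 500 are all hypothetical choices for illustration):

n = 200
df = data.frame(x = rnorm(n))             # hypothetical predictor
df$y = rbinom(n, 1, plogis(-1 + 2*df$x))  # simulated binary outcomes

logit_hat = glm(y ~ x, data=df, family=binomial)
p_hat = fitted(logit_hat)   # estimated P(y=1 | x_i): our parametric P-hat(y | x)

x_test = data.frame(x=1)
boot_pred = rep(0, 500)
for(b in 1:500) {
  df_b = df
  df_b$y = rbinom(n, 1, p_hat)  # simulate new outcomes from the fitted model
  logit_b = glm(y ~ x, data=df_b, family=binomial)
  boot_pred[b] = predict(logit_b, newdata=x_test, type='response')
}
quantile(boot_pred, probs=c(0.025, 0.975))  # approx. 95% CI for P(y=1 | x=1)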

The parametric bootstrap

Example: see predimed_bootstrap.R

In class:

  • Return to brca.csv.
  • Consider the conditional probability P(cancer | risk factors) before screening (i.e. not using the recall variable). Fit this using a logit model on the full data set.
  • Then address the question: are there some patients for whom P(cancer | risk factors) is much more uncertain than others?