Introduction to resampling methods
STA 380
Reference: Introduction to Statistical Learning Chapter 5.2
From the New England Journal of Medicine in 2006:
We randomly assigned patients with resectable adenocarcinoma of the stomach, esophagogastric junction, or lower esophagus to either perioperative chemotherapy and surgery (250 patients) or surgery alone (253 patients)…. With a median follow-up of four years, 149 patients in the perioperative-chemotherapy group and 170 in the surgery group had died. As compared with the surgery group, the perioperative-chemotherapy group had a higher likelihood of overall survival (five-year survival rate, 36 percent vs. 23 percent).
Conclusion:
Conclusion:
Not so fast! In statistics, we ask “what if?” a lot:
Conclusion:
Always remember two basic facts about samples:
Conclusion:
By “quantifying uncertainty,” we mean filling in the blanks.
In data science, we equate trustworthiness with stability:
\[ \begin{array}{r} \mbox{Confidence in} \\ \mbox{your estimates} \\ \end{array} \iff \begin{array}{l} \mbox{Stability of those estimates} \\ \mbox{under the influence of chance} \\ \end{array} \]
For example:
Suppose we are trying to estimate some population-level feature of interest, \( \theta \). This might be something very complicated!
So we take a sample from the population: \( X_1, X_2, \ldots, X_N \). We use the data to form an estimate \( \hat{\theta}_N \) of the parameter. Key insight: \( \hat{\theta}_N \) is a random variable.
Suppose we are trying to estimate some population-level feature of interest, \( \theta \). This might be something very complicated!
So we take a sample from the population: \( X_1, X_2, \ldots, X_N \). We use the data to form an estimate \( \hat{\theta}_N \) of the parameter. Key insight: \( \hat{\theta}_N \) is a random variable.
Now imagine repeating this process thousands of times! Since \( \hat{\theta}_N \) is a random variable, it has a probability distribution: the sampling distribution.
Standard error: the standard deviation of an estimator's sampling distribution:
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ \mbox{var}(\hat{\theta}_N) } \\ &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from its average} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how much does the answer vary, on average?”
Think about ordering a ceramic bowl off Etsy, made by an artist who uses one of those cool pottery wheels:
Don't count on using the bowl for anything that requires greater precision!
Now think about forming an estimate of \( \theta \) from a noisy sample:
Don't reach any scientific conclusions that require greater precision!
But there's a problem here…
Two roads diverged in a yellow wood
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth…–Robert Frost, The Road Not Taken, 1916
Quantifying our uncertainty would seem to require knowing all the roads not taken—an impossible task.
Problem: we can't take repeated samples of size \( N \) from the population, to see how our estimate changes across samples.
Seemingly hacky solution: take repeated samples of size \( N \), with replacement, from the sample itself, and see how our estimate changes across samples. This is something we can easily simulate on a computer.
Basically, we pretend that our sample is the whole population and we charge ahead! This is called bootstrap resampling, or just bootstrapping.
Bootstrapped sample 1
Bootstrapped sample 2
Bootstrapped sample 3
Start with your original sample \( S = \{X_1, \ldots, X_N\} \) and original estimate \( \hat{\theta}_N \).
For \( b=1,...,B \):
Result: a set of \( B \) different estimates \( \hat{\theta}_N^{(1)}, \hat{\theta}_N^{(b)}, \ldots, \hat{\theta}_N^{(B)} \) that approximate the sampling distribution of \( \hat{\theta}_N \).
Calculate the bootstrapped standard error as the standard deviation of the bootstrapped estimates:
\[ \hat{se}(\hat{\theta}_N) = \mbox{std dev}\left( \hat{\theta}_N^{(1)}, \hat{\theta}_N^{(b)}, \ldots, \hat{\theta}_N^{(B)} \right) \]
This isn't the true standard error, but it's often a good approximation!
Or form an interval estimate: a range of plausible values for the parameter of interest.
The simplest way is to use the quantiles (e.g. the 2.5 and 97.5 percentiles):
\[ \theta \in (q_{2.5}, q_{97.5}) \]
There's some very hairy mathematics showing that these intervals will contain the true value approximately 95% of the time (or whatever your coverage level is).
Let's dig in to some R code and data: creatinine_bootstrap.R and creatinine.csv (both on class website).
We'll bootstrap two estimators:
Let's see this example in nonparametric regression, where
\[ y_i = f(x_i) + e_i \]
Suppose we use a nonparametric method to form an estimate \( \hat{f}(x) \) (e.g. using K-nearest neighbors), and we want to quantify our uncertainty about how well we've estimated the true \( f \).
This leads to the following algorithm. For b = 1 to B:
This gives us \( B \) draws from the bootstrapped sampling distribution of \( \hat{f}(x) \).
Use these draws to form (approximate) confidence intervals and standard errors for \( f(x) \).
ERCOT operates the electricity grid for 75% of Texas.
The 8 ERCOT regions are shown at right.
We'll focus on a basic prediction task:
ggplot(data = loadhou) +
geom_point(mapping = aes(x = KHOU, y = COAST), color='darkgrey') +
ylim(7000, 20000)
Suppose we want to know \( f(5) \) and \( f(25) \), i.e. the expected values of COAST when KHOU = 5 and KHOU = 25, respectively. Let's bootstrap a KNN model, with \( K=40 \):
library(mosaic)
library(FNN)
X_test = data.frame(KHOU=c(5,25))
boot20 = do(500)*{
loadhou_boot = resample(loadhou) # construct a boostrap sample
X_boot = select(loadhou_boot, KHOU)
y_boot = select(loadhou_boot, COAST)
knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
knn20_boot$pred
}
head(boot20, 3) # first column is f(5), second is f(25)
V1 V2
1 10708.95 11882.67
2 10802.91 11997.82
3 10513.39 11812.59
Now we can calculate standard errors and/or confidence intervals.
se_hat = apply(boot20, 2, sd)
se_hat
V1 V2
172.5259 156.6715
apply(boot20, 2, quantile, probs=c(0.025, 0.975))
V1 V2
2.5% 10283.88 11532.70
97.5% 10974.92 12125.45
confint(boot20)
name lower upper level method estimate
1 V1 10283.88 10974.92 0.95 percentile 10638.78
2 V2 11532.70 12125.45 0.95 percentile 11906.23
X_test = data.frame(KHOU=seq(0, 35, by=0.1))
plot(COAST ~ KHOU, data=loadhou)
for(i in 1:500) {
loadhou_boot = resample(loadhou) # construct a boostrap sample
X_boot = select(loadhou_boot, KHOU)
y_boot = select(loadhou_boot, COAST)
knn20_boot = knn.reg(X_boot, X_test, y_boot, k=40)
knn20_boot$pred
lines(X_test$KHOU, knn20_boot$pred, col=rgb(1, 0, 0, 0.1))
}
Suppose that you're trying to construct a portfolio: that is, to decide how to allocate your wealth among \( D \) financial assets. Things you want might to track include:
Key idea: use the bootstrap to simulate portfolio performance.
Notation:
Let \( T \) be our investing horizon (e.g. T = 20 days, T = 40 years, etc), and let \( t \) index discrete time steps along the way.
Let \( X_{t,j} \) be the value of asset \( j = 1, \ldots, D \) at time period \( t \).
Let \( R_{t,j} \) be the return of asset \( j \) in period \( t \), so that we have the following recursive update:
\[ X_{t,j} = X_{t-1, j} \cdot (1 + R_{t,j}) \]
Notation:
A portfolio is a set of investment weights over assets: \( (w_{t1}, w_{t2}, \ldots, w_{tD}) \). Note: these weights might be fixed, or they might change over time.
The value of your portfolio is the weighted sum of the value of your assets:
\[ W_{t} = \sum_{j=1}^D w_{t,j} X_{t,j} \]
We care about \( W_T \): the random variable describing your terminal wealth after \( T \) investment periods.
Problem: this random variable is a super-complicated, nonlinear function of \( T \times D \) individual asset returns:
\[ W_T = f(R) \quad \mbox{where} \quad R = \{R_{t,j} : t = 1, \ldots, T; j = 1, \ldots, D\} \]
If we knew the asset returns, we could evaluate this function recursively, starting with initial wealth \( W_0 \) at time \( t=0 \) and sweeping through time \( t=T \):
Starting with initial wealth \( X^{(i)}_{1, j} \) in each asset, we sweep through from \( t=1 \) to \( t=T \):
\[ \begin{aligned} X^{(f)}_{t, j} &= X^{(i)}_{t, j} \cdot (1 + R_{t,j}) &\quad \mbox{(Update each asset)} \\ W_{t} &= \sum_{j=1}^D w_{t,j} X^{(f)}_{t,j} &\quad \mbox{(Sum over assets)} \\ X^{(i)}_{t+1,j} &= w_{t+1,j} \cdot W_{t} &\quad \mbox{(Rebalance)} \\ \end{aligned} \]
But of course, we don't know the asset returns! This suggests that we should use a Monte Carlo simulation, where we repeat the following for loop many times.
For \( t = 1, \ldots, T \):
The precise math of the update and rebalance steps are on the previous slide.
The difficult step here is (1): simulate a high-dimensional vector of asset returns from its joint probability distribution.
In general, using simple parametric probability models (e.g. multivariate Gaussian) to describe high-dimensional joint distributions is a very dicey proposition.
Suppose we have \( M \) past samples of the asset returns, stacked in a matrix:
\[ R = \left( \begin{array}{r r r r} R_{11} & R_{12} & \cdots & R_{1D} \\ R_{21} & R_{22} & \cdots & R_{2D} \\ \vdots & && \\ R_{M1} & R_{M2} & \cdots & R_{MD} \end{array} \right) \]
where \( R_{tj} \) is the return of asset \( j \) in period \( t \).
The key idea of bootstrap resampling is the following:
Thus our Monte Carlo simulation looks like the following at each draw.
For \( t = 1, \ldots, T \):
Let's go to the R code! See portfolio.R on the website.
Why do we draw an entire row of \( R \) at a time?