Reference: Data Science Chapter 5
From the New England Journal of Medicine in 2006:
We randomly assigned patients with resectable adenocarcinoma of the stomach, esophagogastric junction, or lower esophagus to either perioperative chemotherapy and surgery (250 patients) or surgery alone (253 patients)…. With a median follow-up of four years, 149 patients in the perioperative-chemotherapy group and 170 in the surgery group had died. As compared with the surgery group, the perioperative-chemotherapy group had a higher likelihood of overall survival (five-year survival rate, 36 percent vs. 23 percent).
Conclusion:
Not so fast! In statistics, we ask “what if?” a lot:
Always remember two basic facts about samples:
By “quantifying uncertainty,” we mean filling in the blanks.
In stats, we equate trustworthiness with stability:
\[ \begin{array}{r} \mbox{Confidence in} \\ \mbox{your estimates} \\ \end{array} \iff \begin{array}{l} \mbox{Stability of those estimates} \\ \mbox{under the influence of chance} \\ \end{array} \]
For example:
Let's work through a thought experiment…
Imagine Andrey Kolmogorov on a four-day fishing trip.
At right we see the sampling distribution for both \( \beta_0 \) and \( \beta_1 \).
Suppose we are trying to estimate some population-level quantity \( \theta \): the parameter of interest.
So we take a sample from the population: \( X_1, X_2, \ldots, X_N \).
We use the data to form an estimate \( \hat{\theta}_N \) of the parameter. Key insight: \( \hat{\theta}_N \) is a random variable.
Now imagine repeating this process thousands of times! Since \( \hat{\theta}_N \) is a random variable, it has a probability distribution.
Estimator: any method for estimating the value of a parameter (e.g. sample mean, sample proportion, slope of OLS line, etc.).
Sampling distribution: the probability distribution of an estimator \( \hat{\theta}_N \) under repeated samples of size \( N \).
Bias: Let \( \bar{\theta}_N = E(\hat{\theta}_N) \) be the mean of the sampling distribution. The bias of \( \hat{\theta}_N \) is \( (\bar{\theta}_N - \theta) \): the difference between the average answer and the truth.
Unbiased estimator: \( (\bar{\theta}_N - \theta) = 0 \).
Standard error: the standard deviation of an estimator's sampling distribution:
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ \mbox{var}(\hat{\theta}_N) } \\ &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from its average} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how much does the answer vary, on average?”
If an estimator is unbiased, then \( \bar{\theta}_N = \theta \), so
\[ \begin{aligned} \mbox{se}(\hat{\theta}_N) &= \sqrt{ E[ (\hat{\theta}_N - \bar{\theta}_N )^2] } \\ &= \sqrt{ E[ (\hat{\theta}_N - \theta )^2] } \\ &= \mbox{Typical deviation of $\hat{\theta}_N$ from the truth} \end{aligned} \]
“If I were to take repeated samples from the population and use this estimator for every sample, how big of an error do I make, on average?”
Don't make any lifestyle choices that require greater precision!
Don't reach any scientific conclusions that require greater precision!
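To make “repeated samples” concrete: in a Monte Carlo simulation, where we play the role of nature and know the whole population, we really can draw one sample after another and watch the estimator bounce around. Here is a small sketch in R, with made-up numbers (the sample proportion as the estimator, a true proportion of 0.3, and samples of size \( N = 100 \)):

```r
# Monte Carlo sketch: the sampling distribution of the sample proportion.
# All numbers here are made up for illustration.
set.seed(1)
theta <- 0.3       # true population proportion (known only because we're simulating)
N <- 100           # sample size
n_sims <- 10000    # number of repeated samples

# For each simulated sample of size N, compute the sample proportion
theta_hat <- replicate(n_sims, mean(rbinom(N, size = 1, prob = theta)))

hist(theta_hat, main = "Sampling distribution of the sample proportion")
mean(theta_hat) - theta   # bias: near zero, since the sample proportion is unbiased
sd(theta_hat)             # standard error: typical deviation of theta_hat from its average
```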
But there's a problem here…
Two roads diverged in a yellow wood
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth…
–Robert Frost, The Road Not Taken, 1916
Quantifying our uncertainty would seem to require knowing all the roads not taken—an impossible task.
Problem: we can't take repeated samples of size \( N \) from the population, to see how our estimate changes across samples.
Seemingly hacky solution: take repeated samples of size \( N \), with replacement, from the sample itself, and see how our estimate changes across samples. This is something we can easily simulate on a computer.
Basically, we pretend that our sample is the whole population and we charge ahead! This is called bootstrap resampling, or just bootstrapping.
(Figure: bootstrapped samples 1, 2, and 3, each drawn with replacement from the original sample.)
Start with your original sample \( S = \{X_1, \ldots, X_N\} \) and original estimate \( \hat{\theta}_N \).
For \( b = 1, \ldots, B \): draw a bootstrapped sample \( S^{(b)} \) of size \( N \) by sampling with replacement from \( S \), then compute the estimate \( \hat{\theta}_N^{(b)} \) from \( S^{(b)} \).
Result: a set of \( B \) different estimates \( \hat{\theta}_N^{(1)}, \hat{\theta}_N^{(2)}, \ldots, \hat{\theta}_N^{(B)} \) that approximate the sampling distribution of \( \hat{\theta}_N \).
Calculate the bootstrapped standard error as the standard deviation of the bootstrapped estimates:
\[ \hat{se}(\hat{\theta}_N) = \mbox{std dev}\left( \hat{\theta}_N^{(1)}, \hat{\theta}_N^{(2)}, \ldots, \hat{\theta}_N^{(B)} \right) \]
This isn't the true standard error, but it's often a good approximation!
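Here is a minimal sketch of that recipe in R, using a made-up data vector and the sample mean as the estimator (the class script creatinine_bootstrap.R applies the same idea to real data):

```r
# Nonparametric bootstrap sketch: standard error of the sample mean of a toy vector.
set.seed(1)
x <- rnorm(75, mean = 50, sd = 10)   # stand-in for a real sample
N <- length(x)
B <- 10000                           # number of bootstrapped samples

theta_hat <- mean(x)                 # original estimate

# Resample with replacement from the sample itself, B times, re-estimating each time
boot_est <- replicate(B, mean(sample(x, size = N, replace = TRUE)))

# Bootstrapped standard error: the std dev of the bootstrapped estimates
se_boot <- sd(boot_est)
se_boot
```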
Let's dig into some R code and data: creatinine_bootstrap.R and creatinine.csv (both on the class website).
We'll bootstrap two estimators:
Informally, an interval estimate is a range of plausible values for the parameter of interest. For example:
A symmetric interval centered on the point estimate, scaled by the (estimated) standard error:
\[ \theta \in \hat{\theta}_N \pm k \cdot \hat{se} (\hat{\theta}_N) \]
A quantile-based interval, e.g. the 2.5th and 97.5th percentiles of the bootstrapped estimates:
\[ \theta \in (q_{2.5}, q_{97.5}) \]
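Continuing the toy sketch from earlier (theta_hat, se_boot, and boot_est are the made-up quantities computed there), each of these interval estimates takes one line of R:

```r
# 1) Symmetric interval: estimate +/- k times the bootstrapped standard error
k <- 2   # a common choice; roughly 95% if the sampling distribution is near normal
c(theta_hat - k * se_boot, theta_hat + k * se_boot)

# 2) Quantile-based interval: 2.5th and 97.5th percentiles of the bootstrapped estimates
quantile(boot_est, probs = c(0.025, 0.975))
```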
We'd like to be able to associate a confidence level with an interval estimate like this. How?
If an interval estimate satisfies the frequentist coverage principle, we call it a confidence interval:
Frequentist coverage principle: If you were to analyze one data set after another for the rest of your life, and you were to quote X% confidence intervals for every estimate you made, those intervals should cover their corresponding true values at least X% of the time. Here X can be any number between 0 and 100.
An interval estimate takes the form \( \hat{I}_N = [\hat{L}_N, \hat{U}_N] \). Just like a point estimate \( \hat{\theta}_N \), the interval estimate is a random variable, because its endpoints are functions of a random sample.
We say that \( [\hat{L}_N, \hat{U}_N] \) is a confidence interval at coverage level \( 1-\alpha \) if, for every \( \theta \),
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \geq 1- \alpha \, , \]
where \( P_{\theta} \) is the probability distribution of the data, assuming that the true parameter is equal to \( \theta \).
The key statement here can be one of the most confusing in all of statistics:
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \geq 1- \alpha \, , \]
Three questions to ask yourself:
So recall our two methods of generating an interval estimate using the bootstrap:
The obvious question is: do these interval estimates satisfy the frequentist coverage principle?
The answer is: not always, but often!
In lots of common situations, both forms of bootstrapped interval estimate approximately satisfy the coverage requirement:
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \approx 1- \alpha \, , \]
And the approximation gets better with larger sample sizes. That is, as \( N \) gets large,
\[ P_{\theta} \left( \theta \in [\hat{L}_N, \hat{U}_N] \right) \to 1-\alpha \]
The math here is super hairy; we won't go into it. (Google “empirical process theory” if you want to learn more and you've got a year or two to spare…)
But we can run a sanity check through Monte Carlo simulation!
Let's revisit our thought experiment about fishing…
Let's go on 100 fishing trips. On each trip:
Because we know the slope of the true line (\( \beta_1 = 4.25 \)), we can check whether each bootstrapped confidence interval contains the true value. About 80% of them should!
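Here is a rough R sketch of that sanity check. The true slope of 4.25 comes from the thought experiment; everything else (the intercept, the spread of the fish measurements, the noise level, 50 fish per trip) is made up purely for illustration:

```r
# Sanity check via Monte Carlo: do bootstrapped 80% intervals cover the true slope?
# Assumed (made-up) data-generating process: y = 10 + 4.25*x + Normal noise.
set.seed(1)
beta1_true <- 4.25
n_trips <- 100   # number of fishing trips, i.e. fresh samples from the population
n_fish  <- 50    # fish measured per trip
B <- 1000        # bootstrapped samples per trip

covered <- logical(n_trips)
for (trip in 1:n_trips) {
  # One "trip": a fresh sample from the (made-up) population
  x <- runif(n_fish, 1, 10)
  y <- 10 + beta1_true * x + rnorm(n_fish, sd = 5)
  dat <- data.frame(x, y)

  # Bootstrap the OLS slope by resampling rows of the data with replacement
  boot_slopes <- replicate(B, {
    idx <- sample(1:n_fish, size = n_fish, replace = TRUE)
    coef(lm(y ~ x, data = dat[idx, ]))[2]
  })

  # 80% quantile-based interval for the slope
  ci <- quantile(boot_slopes, probs = c(0.10, 0.90))
  covered[trip] <- (ci[1] <= beta1_true) & (beta1_true <= ci[2])
}

mean(covered)   # fraction of intervals covering the true slope: should be roughly 0.80
```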
Sometimes we can use probability theory to calculate a “plug-in” estimate of an estimator's standard error. Some simple cases include:
Let's see an example and compare the result with a bootstrap estimate of the standard error.
Suppose that \( X_1, X_2, \ldots, X_N \) are a sample of independent, identically distributed (IID) random variables with unknown mean \( \mu \) and variance \( \sigma^2 \). Let \( \bar{X}_N \) be the sample mean:
\[ \bar{X}_N = \frac{1}{N} \sum_{i=1}^N X_i \]
Clearly \( \bar{X}_N \) is a sensible estimate of \( \mu \), since it is unbiased: \( E(\bar{X}_N) = \mu \) (show this!)
We can also calculate the theoretical variance of \( \bar{X}_N \) as:
\[ \begin{aligned} \mbox{var}(\bar{X}_N) = \mbox{var} \left( \frac{1}{N} \sum_{i=1}^N X_i \right) & = \frac{1}{N^2} \mbox{var} \left( \sum_{i=1}^N X_i \right) \\ &= \frac{1}{N^2} N \sigma^2 \\ &= \frac{\sigma^2}{N} \end{aligned} \]
This tells us that the true standard error of the sample mean is:
\[ \mbox{se}(\bar{X}_N) = \frac{\sigma}{\sqrt{N}} \]
Or in words:
\[ \small \mbox{Average error of the sample mean} = \frac {\mbox{Average error of a single measurement}} { \mbox{Square root of sample size} } \]
This is sometimes called de Moivre's equation, after Abraham de Moivre.
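Since de Moivre's equation is a theoretical claim, we can check it by simulation, where we get to pick the true \( \sigma \). A quick sketch with made-up values \( \sigma = 10 \) and \( N = 25 \):

```r
# Check de Moivre's equation by simulation: sd of many sample means vs. sigma/sqrt(N)
set.seed(1)
sigma <- 10
N <- 25
sample_means <- replicate(10000, mean(rnorm(N, mean = 0, sd = sigma)))
sd(sample_means)    # Monte Carlo estimate of se(X-bar)
sigma / sqrt(N)     # de Moivre's equation: the two should closely agree
```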
There's only one problem with de Moivre's equation: we don't know the true \( \sigma \)!
\[ \mbox{se}(\bar{X}_N) = \frac{\sigma}{\sqrt{N}} \]
The obvious solution is to estimate \( \sigma \) from the data. This results in the so-called “plug-in” estimate of the standard error:
\[ \hat{se}(\bar{X}_N) = \frac{\hat{\sigma}}{\sqrt{N}} \]
where \( \hat{\sigma} \) is an estimate of the population standard deviation (e.g. the sample standard deviation).
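For the sample mean, the plug-in standard error is a one-liner in R. Here is a sketch with a made-up data vector, compared against the bootstrapped standard error of the same estimator:

```r
# Plug-in vs. bootstrapped standard error of the sample mean, on a toy data vector
set.seed(1)
x <- rnorm(75, mean = 50, sd = 10)

se_plugin <- sd(x) / sqrt(length(x))   # sigma-hat / sqrt(N)
se_plugin

# Bootstrapped standard error of the same estimator, for comparison
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
sd(boot_means)   # typically lands very close to the plug-in answer
```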
Suppose we have an estimator \( \hat{\theta}_N \) and we want to know its standard error. The general plug-in procedure involves three steps:
Let's see an example in predimed_plugin.R.
Your turn! For the same predimed data set:
The essential idea of the bootstrap is to simulate synthetic data sets by resampling from the original sample. This is often called the nonparametric bootstrap.
A common variation is called the parametric bootstrap: simulate synthetic data sets by simulating from a fitted parametric model.
See predimed_bootstrap.R!
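As a rough sketch of the contrast between the two flavors (this is not the contents of predimed_bootstrap.R; the Normal model and the toy data vector are assumptions chosen purely for illustration):

```r
# Parametric vs. nonparametric bootstrap for the sample mean of a toy vector
set.seed(1)
x <- rnorm(75, mean = 50, sd = 10)
N <- length(x)
B <- 10000

# Nonparametric bootstrap: resample the data itself, with replacement
np_boot <- replicate(B, mean(sample(x, size = N, replace = TRUE)))

# Parametric bootstrap: fit a parametric model (here, a Normal distribution),
# then simulate synthetic data sets of size N from the fitted model
mu_hat <- mean(x)
sigma_hat <- sd(x)
p_boot <- replicate(B, mean(rnorm(N, mean = mu_hat, sd = sigma_hat)))

sd(np_boot)   # nonparametric bootstrapped standard error
sd(p_boot)    # parametric bootstrapped standard error
```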