Bootstrapping in R
Bootstrapping provides a pragmatic way forward when data is extremely limited. It is far from ideal, but it is used heavily in the field: it is flexible, sometimes more successful than alternative techniques, and it is well accepted in the data science and machine learning community. At its core, it is less about improving an estimate and more about approximately capturing the variability of the data.
Non-Parametric (Empirical) Bootstrap
Overview
Suppose we have a dataset \(x_1, x_2, \dots, x_n\) drawn from an unknown distribution \(F\). These \(n\) observations define an empirical distribution \(F^*\), which assigns an equal probability of \(1/n\) to each data point \(x_i\). According to the Law of Large Numbers, \(F^*\) is a good approximation of \(F\) when \(n\) is large.
Despite not knowing the underlying distribution \(F\), the bootstrap allows us to estimate: the statistic itself (\(\nu\)), its variability (\(\operatorname{Var}_F(\nu)\)), higher-order moments (skewness, kurtosis, etc.), and confidence intervals and bias corrections.
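As a concrete aside, the empirical distribution \(F^*\) described above is exactly what base R's ecdf() constructs; the toy values below are made up purely for illustration:

# A hypothetical toy sample, just to illustrate F*
x <- c(2.1, 3.5, 3.5, 4.0, 5.2)
Fstar <- ecdf(x)   # the empirical CDF: a step function with jumps of 1/n
Fstar(3.5)         # proportion of observations <= 3.5, here 3/5
plot(Fstar, main = "Empirical distribution F*")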
The Bootstrap Algorithm
1. Resample the data: Draw \(n\) observations \(x_1^*,x_2^*,\dots,x_n^*\) with replacement from the empirical distribution \(F^*\). This creates an empirical bootstrap sample of the same size as the original dataset.
2. Calculate the statistic: Apply the desired statistic or estimator to the bootstrap sample, yielding a bootstrap estimate \(\nu^*\). Note: maintaining the same sample size \(n\) is crucial, as the variation of the statistic depends on \(n\).
3. Repeat: Perform steps 1–2 many times (\(1000\) or more bootstrap replicates) to obtain an approximate bootstrap distribution of \(\nu^*\).
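A minimal sketch of these three steps in base R, using the sample median as an illustrative statistic (the data, seed, and replicate count are arbitrary choices):

# Illustrative data: 50 draws standing in for a sample from an unknown F
set.seed(1)
x <- rexp(50)
n_boot <- 2000

nu_star <- replicate(n_boot, {
  x_star <- sample(x, size = length(x), replace = TRUE)  # step 1: resample from F*
  median(x_star)                                         # step 2: compute the statistic
})                                                       # step 3: repeated n_boot times

hist(nu_star, main = "Bootstrap distribution of the median")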
The Bootstrap Principle
If \(F^*\) is a good approximation of \(F\), then the behavior of \(\nu^*\) estimated from \(F^*\) will approximate the behavior of \(\nu\) estimated from \(F\). In other words:
\[ \mathbb{E}_F[\nu] \approx \mathbb{E}_{F^*}[\nu^*], \qquad \operatorname{Var}_F[\nu] \approx \operatorname{Var}_{F^*}[\nu^*] \]
More generally, any moment of \(\nu\) can be estimated using the empirical moments of \(\nu^*\).
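Continuing the sketch above, the empirical moments of nu_star stand in for the corresponding moments of \(\nu\):

mean(nu_star)   # approximates E_F[nu]
var(nu_star)    # approximates Var_F[nu]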
Approximating the Standard Error from the Empirical Sampling Distribution
Standard error (SE): Measures the variability of the sample statistic across repeated samples:
\[ \begin{align} \operatorname{Var}(\hat{\Theta})=\frac{1}{m-1} \sum_{i=1}^{m} (\theta_i-\bar{\theta})^2\qquad&\text{spread of the estimator over } m \text{ repeated samples}\\ \mathit{SE}=\sqrt{\operatorname{Var}(\hat{\Theta})}\qquad\qquad&\text{the SE is the SD of this sampling distribution, in units of }\theta \end{align} \]
Note: the familiar formula \(\mathit{SE}=\sigma/\sqrt{n}\) is a special case for the sample mean, converting the standard deviation \(\sigma\) of the raw data into the standard deviation of \(\bar{x}\). The SE is defined differently in the bootstrapping context: each bootstrap replicate is already a statistic computed from a size-\(n\) sample, so the standard deviation of the replicates estimates the SE directly, with no further division by \(\sqrt{n}\).
Distinction: All of the above refer to the sampling distribution of the estimator \(\hat{\Theta}\), not the variability of the raw data \(X_i\) from which the estimate is built. A “good” estimator has low variance even if the data themselves vary wildly.
In bootstrapping, the goal is to estimate the standard error—that is, the standard deviation—of a statistic by using the empirical distribution of that statistic obtained from many bootstrap resamples.
In fact, what bootstrap gives you is the empirical sampling distribution of your estimator—whatever shape that happens to be.
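In code this is a one-liner; reusing nu_star from the earlier sketch:

se_boot <- sd(nu_star)   # bootstrap estimate of the standard error of the median
se_boot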
Basic Example: Mean and Variance Estimation of \(N(1,1^2)\)
Here I’ll take \(80\) samples from an \(N(1,1^2)\) distribution and estimate the mean and the variance…
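A minimal sketch of this experiment (the seed and replicate count are arbitrary choices):

set.seed(42)
x <- rnorm(80, mean = 1, sd = 1)   # 80 samples from N(1, 1^2)
n_boot <- 5000

boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
boot_vars  <- replicate(n_boot, var(sample(x, replace = TRUE)))

hist(boot_means)   # approximately normal, per the CLT
hist(boot_vars)    # also approximately normal, not chi-squared
c(mean_hat = mean(boot_means), var_hat = mean(boot_vars))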
When we repeatedly calculate mean and variance values from the bootstrap datasets, the sampling distribution of each estimator is approximately normal, in accordance with the CLT. This is quite different from the classical theory for sampling from a normal population, where the studentised mean and the scaled sample variance follow Student-t and chi-squared sampling distributions; only the information in the original sample is available to the empirical distribution, and any population information beyond that sample is unavailable.
The mean estimate was pretty good; however, the variance estimate was extremely poor. The empirical distribution wasn't a great representation of the population: the bootstrap tends to underestimate variance when the sample size is small or the data distribution is highly skewed, as the empirical distribution is limited to the original sample.
Bootstrapping doesn't improve the estimate itself at all, but it does give us some idea of the variance/uncertainty of the estimate, which is useful for unknown (or non-parametric) distributions.
The empirical bootstrap confidence interval
The percentile bootstrap (and back-of-the-envelope) confidence interval
We calculate the statistic for each of the bootstrap datasets: \(\nu_1^*\dots\nu_m^*\) for \(m\) bootstrap replicates:
[1] 41.4 42.4 39.9 40.9 41.9 40.1 40.7 41.2 39.2 41.0 40.7 41.2 42.3 41.0 39.4
[16] 40.2 39.6 40.8 42.6 42.5
We use the empirical distribution of these estimates as an approximation for the sampling distribution of the actual estimator, \(\nu^*\approx\nu\). The quick-and-dirty method involves sorting the estimates and simply taking the values at the \(2.5\%\) and \(97.5\%\) points in the list as the confidence interval:
ordered_estimates <- sort(bootstrap_estimates)

# IMPORTANT: Ensure positions aren't 0
lower_pos <- max(1, floor(0.025 * length(ordered_estimates)))
upper_pos <- min(length(ordered_estimates), floor(0.975 * length(ordered_estimates)))

ci_lower <- ordered_estimates[lower_pos]
ci_upper <- ordered_estimates[upper_pos]

cat("2.5% estimate:", ci_lower, "| 97.5% estimate:", ci_upper, "\n")
2.5% estimate: 39.2 | 97.5% estimate: 42.5
More precisely, we can let quantile() interpolate between the ordered values and calculate the confidence interval at the exact percentile points.
# Get the 2.5% and 97.5% empirical percentiles...
ci_lower <- quantile(bootstrap_estimates, 0.025)
ci_upper <- quantile(bootstrap_estimates, 0.975)
cat("95% Percentile CI:", ci_lower, "to", ci_upper, "\n")
95% Percentile CI: 39.295 to 42.5525
The basic (reverse percentile) bootstrap confidence interval
This method uses an algebraic pivot similar to the one used in finding z or t confidence intervals. We are attempting to quantify how much the estimate varies around the real population parameter; e.g. for a mean estimate the error is \(\delta=\bar{x}-\mu\), and we use the approximation \(\delta\approx\delta^*=\bar{x}^*-\bar{x}\).
Here \(\delta\) plays the role of a pivotal quantity. The computational power of bootstrapping gives the \(\delta^*\) distribution with high precision: we compute \(\delta^* = \bar{x}^* - \bar{x}\) for each bootstrap sample, then take the quantiles of these values for the appropriate confidence interval.
original_estimate <- 40  # e.g., the sample mean
deltas <- bootstrap_estimates - original_estimate
lower_delta <- quantile(deltas, 0.975)  # upper tail for reversal
upper_delta <- quantile(deltas, 0.025)  # lower tail for reversal
basic_ci <- c(original_estimate - lower_delta, original_estimate - upper_delta)
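Printed directly, basic_ci carries names inherited from quantile() ("97.5%" on the lower bound and "2.5%" on the upper, an artifact of the reversal), so it is worth stripping them; this snippet assumes the objects defined above:

basic_ci <- unname(basic_ci)  # drop the misleading quantile names
cat("95% basic CI:", basic_ci[1], "to", basic_ci[2], "\n")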
Parametric Bootstrap
In parametric bootstrapping, a parametric distribution is assumed for the data. Sophisticated methods exist for calculating parametric bootstrap confidence intervals, but here I'll calculate them using the basic quantile (pivot) method.
Example: Basic Parametric Bootstrap Confidence Intervals
Data is of the form \(x_1,\dots,x_{300}\) and is drawn from an exponential distribution \(\mathit{exp}(\lambda)\) with unknown rate parameter \(\lambda\). Our goal is to estimate \(\lambda\) and give a \(95\%\) parametric bootstrap confidence interval for it. I'm lifting this straight out of [@Orloff2022].
# Parametric bootstrap
# Given 300 data points with mean 2.
# Assume the data is exp(lambda)
# PROBLEM: Compute a 95% parametric bootstrap confidence interval for lambda

# We are given the number of data points and the mean
n = 300
xbar = 2

# The MLE for lambda is 1/xbar
lambda_hat = 1.0/xbar

# Generate the bootstrap samples
# Each column is one bootstrap sample (of 300 resampled values)
n_boot = 1000

# Here's the key difference with the empirical bootstrap:
# we draw the bootstrap samples from Exponential(lambda_hat).
x = rexp(n*n_boot, lambda_hat)
bootstrap_sample = matrix(x, nrow=n, ncol=n_boot)

# Compute the bootstrap lambda_star
lambda_star = 1.0/colMeans(bootstrap_sample)

# Compute the differences
delta_star = lambda_star - lambda_hat

# Find the 0.05 and 0.95 quantiles for delta_star
d = quantile(delta_star, c(0.05, 0.95))

# Calculate the 95% confidence interval for lambda.
ci = lambda_hat - c(d[2], d[1])

# This is just one way to format the output text. sprintf is an old
# C function for doing this; R has many other ways to do the same thing.
sprintf("MLE lambda: %.3f", lambda_hat)
[1] "MLE lambda: 0.500"
sprintf("Confidence interval for lambda: [%.3f, %.3f]", ci[1], ci[2])
[1] "Confidence interval for lambda: [0.448, 0.544]"