1. Overview

This document demonstrates multiple bootstrap methods in R with fully commented code and visuals.
We cover:

  • Nonparametric bootstrap (percentile, basic, BCa intervals) for mean & median
  • Studentized (t) bootstrap (with nested bootstraps)
  • Parametric bootstrap (Normal & Poisson examples)
  • Block bootstrap for time series (moving blocks)
  • Stratified bootstrap for subgroup-balanced resampling
  • Wild bootstrap for heteroskedastic regression
  • Smoothed bootstrap for discrete data
  • Convergence diagnostics (SE/CI vs. number of resamples)

Throughout, we provide practical interpretation notes and plots of the bootstrap sampling distribution.

Reproducibility: each section sets a seed for consistency.


2. Setup

pkgs <- c("boot", "ggplot2", "dplyr", "tidyr", "purrr", "broom", "gridExtra")
to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install, quiet = TRUE)
invisible(lapply(pkgs, library, character.only = TRUE))  # attach packages quietly
theme_set(theme_minimal(base_size = 13))

3. Data: Skewed Income-like Sample (for mean/median)

We simulate a right-skewed (log-normal) sample to illustrate how the different interval types behave for the mean versus the median under skewness.

set.seed(123)
n <- 200
x <- rlnorm(n, meanlog = 2, sdlog = 0.6)  # skewed positive data
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.849   5.076   7.133   8.717  10.392  51.657

Why skewness matters: the mean is sensitive to the heavy right tail, while the median is robust to it, so their bootstrap intervals may differ noticeably.

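To make this concrete, a quick check (a minimal sketch using the simulated x above) drops the top 5% of observations and compares how far the mean and the median move:

# Sensitivity check: trim the largest 5% of values and compare the shifts
x_trim <- x[x <= quantile(x, 0.95)]
c(mean_full = mean(x), mean_trimmed = mean(x_trim),
  median_full = median(x), median_trimmed = median(x_trim))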

4. Nonparametric Bootstrap: Mean & Median

We use boot::boot() with resampling indices to obtain bootstrap replicates for the mean and median, then compute percentile, basic, and BCa intervals via boot::boot.ci().

set.seed(123)

# Statistic functions for boot()
mean_stat <- function(data, idx) mean(data[idx])
median_stat <- function(data, idx) median(data[idx])

B <- 3000  # Number of resamples

# Bootstrap objects
boot_mean   <- boot(x, statistic = mean_stat,   R = B)
boot_median <- boot(x, statistic = median_stat, R = B)

# Confidence intervals
ci_mean_perc  <- boot.ci(boot_mean,   type = c("perc", "basic", "bca"))
ci_median_perc <- boot.ci(boot_median, type = c("perc", "basic", "bca"))

ci_mean_perc
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 3000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_mean, type = c("perc", "basic", "bca"))
## 
## Intervals : 
## Level      Basic              Percentile            BCa          
## 95%   ( 7.834,  9.522 )   ( 7.913,  9.600 )   ( 7.949,  9.669 )  
## Calculations and Intervals on Original Scale
ci_median_perc
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 3000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_median, type = c("perc", "basic", "bca"))
## 
## Intervals : 
## Level      Basic              Percentile            BCa          
## 95%   ( 6.420,  7.853 )   ( 6.414,  7.846 )   ( 6.412,  7.821 )  
## Calculations and Intervals on Original Scale

Interpretation tips:

  • Percentile CI: uses the quantiles of the bootstrap distribution directly.
  • Basic CI: mirrors those quantiles around the observed statistic.
  • BCa CI: adds bias correction and acceleration; it often has the best coverage under skewness.
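
As a sanity check, the percentile and basic limits can be reproduced (approximately) from the bootstrap replicates themselves; the sketch below is illustrative, and boot.ci uses interpolated quantiles, so the last digits may differ slightly.

# Percentile limits: quantiles of the bootstrap replicates
q <- quantile(boot_mean$t[, 1], c(0.025, 0.975))

# Basic limits: mirror those quantiles around the observed statistic t0
t0 <- boot_mean$t0
c(perc_low = q[[1]], perc_high = q[[2]],
  basic_low = 2*t0 - q[[2]], basic_high = 2*t0 - q[[1]])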

4.1 Visualize Bootstrap Distributions

df_plot <- bind_rows(
  tibble(stat = boot_mean$t[,1], type = "Mean"),
  tibble(stat = boot_median$t[,1], type = "Median")
)

ggplot(df_plot, aes(stat)) +
  geom_histogram(bins = 40) +
  facet_wrap(~type, scales = "free") +
  labs(title = "Bootstrap Sampling Distributions",
       x = "Bootstrap replicate", y = "Count")


5. Studentized (t) Bootstrap for the Mean

The Studentized bootstrap standardizes the statistic by its estimated SE within each bootstrap resample.
We implement a nested bootstrap (small inner resamples) to estimate SE per outer resample.

Note: the Studentized bootstrap is computationally heavier; here we use a small inner R_inner to keep runtime reasonable.

set.seed(123)

# Inner bootstrap to estimate SE on a given resample
se_on_sample <- function(sample_vals, R_inner = 200){
  inner <- boot(sample_vals, statistic = mean_stat, R = R_inner)
  sd(inner$t[,1], na.rm = TRUE)
}

# Outer loop: for each resample, compute t* = (mean* - mean0) / se*
R_outer <- 1000
t_star <- numeric(R_outer)
x0 <- mean(x)  # observed statistic

for (b in seq_len(R_outer)){
  idx <- sample(seq_along(x), replace = TRUE)
  xb  <- x[idx]
  mean_b <- mean(xb)
  se_b   <- se_on_sample(xb, R_inner = 150)  # inner SE
  t_star[b] <- (mean_b - x0) / se_b
}

# Studentized CI (approximate): invert the quantiles of t*
alpha <- 0.05
t_upper <- quantile(t_star, probs = 1 - alpha/2, na.rm = TRUE)  # upper t* quantile -> lower CI limit
t_lower <- quantile(t_star, probs = alpha/2,     na.rm = TRUE)  # lower t* quantile -> upper CI limit

# SE of the mean on the observed sample:
se0 <- se_on_sample(x, R_inner = 1000)
ci_studentized <- c(x0 - t_upper*se0, x0 - t_lower*se0)
ci_studentized
##    97.5%     2.5% 
## 7.995367 9.696148

Interpretation: Studentization corrects for varying variability across resamples, often improving coverage.
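
For a side-by-side comparison, the limits from Section 4 can be lined up with the studentized interval (a small sketch; columns 4–5 of each boot.ci component hold the lower and upper limits, as used again in Section 11):

rbind(
  percentile  = ci_mean_perc$percent[4:5],
  basic       = ci_mean_perc$basic[4:5],
  BCa         = ci_mean_perc$bca[4:5],
  studentized = unname(ci_studentized)
)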


6. Parametric Bootstrap

We assume a parametric model for the data, simulate from the fitted model, and recompute the statistic.

6.1 Normal Model for the Mean

set.seed(321)
mu_hat <- mean(x)
sd_hat <- sd(x)

R <- 5000
means_sim <- replicate(R, mean(rnorm(n, mu_hat, sd_hat)))

ci_perc_norm <- quantile(means_sim, c(0.025, 0.975))
c(est_mean = mu_hat, CI_low = ci_perc_norm[1], CI_high = ci_perc_norm[2])
##      est_mean   CI_low.2.5% CI_high.97.5% 
##      8.717257      7.859860      9.563329

Note: the parametric bootstrap is efficient when the Normal assumption is reasonable, but it can be biased if the model is misspecified.
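
Because x was drawn from a log-normal distribution, the Normal model is questionable here; a quick visual check and a comparison with the nonparametric percentile interval from Section 4 (a minimal sketch) illustrate the point.

# Normal QQ plot: right-skewed data bend away from the line in the upper tail
qqnorm(x); qqline(x)

# Parametric-Normal vs. nonparametric percentile limits for the mean
rbind(parametric_normal = ci_perc_norm,
      nonparametric     = ci_mean_perc$percent[4:5])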

6.2 Poisson Model for a Count Mean

set.seed(456)
counts <- rpois(200, lambda = 3.5)  # toy counts
lambda_hat <- mean(counts)

R <- 5000
means_sim_pois <- replicate(R, mean(rpois(length(counts), lambda_hat)))
quantile(means_sim_pois, c(0.025, 0.975))
##  2.5% 97.5% 
## 3.525 4.055
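
For the Poisson mean there is also a closed-form large-sample interval, lambda_hat ± 1.96 * sqrt(lambda_hat / n), which the parametric bootstrap should roughly reproduce (a quick sketch for comparison):

# Large-sample (Wald) interval for the Poisson mean
lambda_hat + c(-1, 1) * 1.96 * sqrt(lambda_hat / length(counts))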

7. Block Bootstrap for Time Series (Dependence)

We simulate an AR(1) time series and apply a moving-block bootstrap (boot::tsboot) so that resampling respects the autocorrelation.

set.seed(777)

# Simulate AR(1): y_t = 0.6 y_{t-1} + e_t
Tn <- 400
e <- rnorm(Tn, 0, 1)
y <- numeric(Tn)
for (t in 2:Tn) y[t] <- 0.6*y[t-1] + e[t]

# Statistic: mean of the series (tsboot passes each resampled series directly, not indices)
ts_mean <- function(tsb) mean(tsb)

# Block length choice (rule-of-thumb); try 5–10
L <- 8

boot_ts <- tsboot(tseries = y, statistic = ts_mean, R = 2000,
                  l = L, sim = "fixed")  # moving blocks, fixed length

ci_ts <- boot.ci(boot_ts, type = c("perc", "basic"))  # BCa is not defined for tsboot objects
ci_ts
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_ts, type = c("perc", "basic"))
## 
## Intervals : 
## Level      Basic              Percentile     
## 95%   (-0.1455,  0.2687 )   (-0.1428,  0.2715 )  
## Calculations and Intervals on Original Scale

Interpretation: blocks preserve local dependence; block length trades off bias (blocks too short break the dependence) against variance (blocks too long leave few distinct blocks to resample).
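
To see the trade-off in practice, one option is to rerun tsboot over a few block lengths and compare the resulting percentile CI widths (a minimal sketch; runtime scales with R and the number of lengths tried):

# Percentile CI width as a function of block length
sapply(c(2, 5, 10, 20), function(L_try) {
  bt <- tsboot(tseries = y, statistic = ts_mean, R = 1000, l = L_try, sim = "fixed")
  diff(boot.ci(bt, type = "perc")$percent[4:5])
})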


8. Stratified Bootstrap (Balanced Subgroups)

We create two strata with different distributions, then resample within strata to maintain representation.

set.seed(2024)

group <- sample(c("A","B"), n, replace = TRUE, prob = c(0.4, 0.6))
x_strat <- ifelse(group == "A", rnorm(n, 10, 2), rnorm(n, 12, 3))

dat <- tibble(group, x = x_strat)

# Statistic: group-weighted mean (weights = observed group proportions)
wA <- mean(group == "A"); wB <- 1 - wA

stat_weighted_mean <- function(d) wA*mean(d$x[d$group=="A"]) + wB*mean(d$x[d$group=="B"])

R <- 3000
boot_vals <- replicate(R, {
  dA <- dat %>% filter(group=="A") %>% slice_sample(prop=1, replace=TRUE)
  dB <- dat %>% filter(group=="B") %>% slice_sample(prop=1, replace=TRUE)
  d  <- bind_rows(dA, dB)
  stat_weighted_mean(d)
})

quantile(boot_vals, c(0.025, 0.975))
##     2.5%    97.5% 
## 11.02822 11.69343
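
The same scheme can also be expressed with boot() via its strata argument, which resamples within each stratum (a minimal sketch reusing dat, wA, and wB from above):

# Stratified bootstrap via boot::boot(strata = ...)
strat_stat <- function(d, idx) {
  dd <- d[idx, ]
  wA * mean(dd$x[dd$group == "A"]) + wB * mean(dd$x[dd$group == "B"])
}
boot_strat <- boot(dat, statistic = strat_stat, R = 3000, strata = factor(dat$group))
boot.ci(boot_strat, type = "perc")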

9. Wild Bootstrap for Heteroskedastic Regression

We simulate heteroskedastic errors, fit OLS, and use a wild bootstrap (Rademacher weights) to obtain CIs for a slope.

set.seed(999)

m <- 300
xr <- runif(m, -2, 2)
# Heteroskedastic variance grows with |x|
eps <- rnorm(m, 0, sd = 0.5 + 0.7*abs(xr))
yr <- 1.5 + 2.0*xr + eps
ols <- lm(yr ~ xr)

# Wild bootstrap: y* = yhat + w * residuals, w ∈ {-1, +1}
R <- 3000
b1 <- numeric(R)

yhat <- fitted(ols); res <- resid(ols)

for (b in 1:R){
  w <- sample(c(-1,1), size = m, replace = TRUE)  # Rademacher
  y_star <- yhat + w * res
  b1[b]  <- coef(lm(y_star ~ xr))[2]
}

quantile(b1, c(0.025, 0.975))
##     2.5%    97.5% 
## 1.964798 2.266528

Why wild? It preserves the mean structure while mimicking heteroskedastic variability via random sign flips of residuals.
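
For reference, the wild-bootstrap interval can be set against the classical OLS interval, which assumes constant error variance (a quick sketch):

# Classical (homoskedasticity-assuming) 95% CI for the slope
confint(ols, "xr", level = 0.95)

# Wild-bootstrap percentile CI from above
quantile(b1, c(0.025, 0.975))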


10. Smoothed Bootstrap for Discrete Data

For discrete or rounded data, the smoothed bootstrap adds a small amount of random noise to each resampled value; this is equivalent to resampling from a kernel density estimate and better approximates a continuous underlying distribution.

set.seed(2025)

disc <- rpois(300, lambda = 4)  # discrete counts
B <- 4000

# Smoothed: sample with replacement + add small N(0, h^2) noise
h <- 0.25  # smoothing bandwidth
means_sm <- replicate(B, mean(sample(disc, replace=TRUE) + rnorm(length(disc), 0, h)))

quantile(means_sm, c(0.025, 0.975))
##     2.5%    97.5% 
## 3.859674 4.316268

Note: Choose h small enough to avoid distorting the signal; compare with non-smoothed intervals.
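
Following that advice, the non-smoothed comparison takes one extra line (a minimal sketch; with n = 300 counts and h = 0.25 the two intervals should be nearly identical):

# Ordinary (non-smoothed) bootstrap CI for the same statistic
means_raw <- replicate(B, mean(sample(disc, replace = TRUE)))
quantile(means_raw, c(0.025, 0.975))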


11. Convergence Diagnostics: CI Width vs. Resamples

We assess how the percentile CI width for the mean stabilizes as B increases under the nonparametric bootstrap.

set.seed(4242)
B_grid <- c(200, 400, 800, 1200, 2000, 4000)
widths <- numeric(length(B_grid))

for (i in seq_along(B_grid)){
  b <- boot(x, statistic = mean_stat, R = B_grid[i])
  ci <- boot.ci(b, type = "perc")$percent[4:5]  # 95% CI
  widths[i] <- diff(ci)
}

diag_df <- tibble(B = B_grid, CI_width = widths)

ggplot(diag_df, aes(B, CI_width)) +
  geom_line() + geom_point() +
  scale_x_continuous(breaks = B_grid) +
  labs(title = "Bootstrap CI Width vs Number of Resamples (Percentile CI)",
       x = "Number of resamples (B)", y = "CI width")

Interpretation: When CI width stabilizes, additional resamples yield diminishing returns.
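
A complementary diagnostic is the Monte Carlo variability of the interval itself: repeating the whole bootstrap several times at a fixed B shows how much the endpoints jitter from run to run (a small sketch; reduce the number of repeats if runtime is a concern):

# Repeat a B = 1000 bootstrap ten times and measure endpoint jitter
reps <- t(replicate(10, {
  b <- boot(x, statistic = mean_stat, R = 1000)
  boot.ci(b, type = "perc")$percent[4:5]
}))
apply(reps, 2, sd)  # Monte Carlo SD of the lower and upper limits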


12. Summary & Practical Guidance

  • Match the bootstrap variant to your data structure:
    • i.i.d. → Nonparametric or parametric
    • Time series → Block bootstrap
    • Stratified populations → Stratified bootstrap
    • Heteroskedastic regressions → Wild bootstrap
    • Discrete data → Smoothed bootstrap
  • Prefer BCa intervals under skewness; compare to percentile/basic for sensitivity.
  • Use adequate B (≥ 1000; 5000–10000 for final estimates) and monitor convergence.
  • Document choices: interval type, B, block length, smoothing bandwidth, and any assumptions.

13. Appendix: Session Info

sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gridExtra_2.3 broom_1.0.5   purrr_1.0.2   tidyr_1.3.1   dplyr_1.1.4  
## [6] ggplot2_3.5.1 boot_1.3-28.1
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4      jsonlite_1.8.7    highr_0.10        compiler_4.3.2   
##  [5] tidyselect_1.2.0  jquerylib_0.1.4   scales_1.3.0      yaml_2.3.7       
##  [9] fastmap_1.1.1     R6_2.5.1          labeling_0.4.3    generics_0.1.3   
## [13] knitr_1.45        backports_1.4.1   tibble_3.2.1      munsell_0.5.0    
## [17] bslib_0.6.1       pillar_1.9.0      rlang_1.1.2       utf8_1.2.4       
## [21] cachem_1.0.8      xfun_0.41         sass_0.4.7        cli_3.6.1        
## [25] withr_2.5.2       magrittr_2.0.3    digest_0.6.33     grid_4.3.2       
## [29] rstudioapi_0.17.1 lifecycle_1.0.4   vctrs_0.6.5       evaluate_0.23    
## [33] glue_1.6.2        farver_2.1.1      fansi_1.0.5       colorspace_2.1-0 
## [37] rmarkdown_2.25    tools_4.3.2       pkgconfig_2.0.3   htmltools_0.5.7