1. Bootstrap: Nonparametric vs Parametric (Intuition)

Suppose we observe i.i.d. data: \[ X_1, X_2, \ldots, X_n \sim F \] from some unknown distribution \(F\).

We are interested in a parameter: \[ \theta = T(F) \] and its estimator: \[ \hat{\theta} = T(\hat{F}_n) \] such as a mean, variance, or regression parameter.

The Problem

We want to know the sampling distribution of \(\hat{\theta}\), in order to compute:

  • Standard error (SE)
  • Bias
  • Confidence intervals (CIs)

Often, this is difficult to derive analytically.


1.1 Nonparametric Bootstrap (Very Briefly)

  • Replace the unknown distribution \(F\) with the empirical distribution \(\hat{F}_n\), which puts mass \(1/n\) at each observed value.
  • Resample with replacement from the data, compute \(\hat{\theta}^*\) many times.
  • Use the variability of \(\hat{\theta}^*\) to estimate SE, bias, and CIs.

Key Point:
This makes almost no assumptions about the shape of \(F\), but it “freezes” the data support at the observed values.
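
The three steps above can be sketched by hand in base R. This is a minimal illustration, not a definitive implementation: the data (x_demo), the seed, the choice of the median as the estimator, and the number of resamples B are all assumptions made for the example.

```r
# Hypothetical example data; the seed and distribution are assumptions
set.seed(1)
x_demo <- rexp(30, rate = 1)
n_demo <- length(x_demo)

B         <- 1000              # number of bootstrap resamples (assumed)
theta_hat <- median(x_demo)    # estimator computed on the observed data

# Resample with replacement from the data and recompute the estimator
theta_star <- replicate(
  B,
  median(sample(x_demo, size = n_demo, replace = TRUE))
)

se_boot   <- sd(theta_star)                          # bootstrap standard error
bias_boot <- mean(theta_star) - theta_hat            # bootstrap bias estimate
ci_perc   <- quantile(theta_star, c(0.025, 0.975))   # 95% percentile interval

# Every resampled value is one of the observed values ("frozen" support)
one_resample   <- sample(x_demo, size = n_demo, replace = TRUE)
all_in_support <- all(one_resample %in% x_demo)
```

The last two lines illustrate the key point: a resample can never contain a value that was not in the original data.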


1.2 Parametric Bootstrap (Our Focus)

Now suppose we are willing to assume a parametric model: \[ X_1, X_2, \ldots, X_n \sim f(x; \vartheta) \] (e.g., Normal, Poisson, etc.), where \(\vartheta\) is a parameter (or vector of parameters) to be estimated.

Idea of Parametric Bootstrap

  1. Estimate the model parameter(s) \(\hat{\vartheta}\) from the data (e.g., via MLE).
  2. Generate new synthetic datasets from the assumed model: \[ X_1^*, X_2^*, \ldots, X_n^* \sim f(x; \hat{\vartheta}) \]
  3. Recompute: \[ \hat{\theta}^* = T(X_1^*, X_2^*, \ldots, X_n^*) \] for each synthetic dataset.
  4. Repeat many times to approximate the sampling distribution of \(\hat{\theta}\).

Key Point:
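This relies more heavily on the model assumption. When the model is appropriate, it can give more accurate results than the nonparametric bootstrap, especially for small \(n\); when the model is misspecified, the bootstrap distribution inherits that error.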
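
Steps 1 to 4 can also be written out by hand before turning to a package. The sketch below assumes a Normal model and uses the coefficient of variation as an illustrative parameter of interest. The data, the seed, the helper cv, and all *_demo names are hypothetical.

```r
# Hypothetical example data; the seed and true parameters are assumptions
set.seed(2)
x_demo <- rnorm(40, mean = 10, sd = 3)
n_demo <- length(x_demo)

# Step 1: estimate the model parameters (Normal MLEs; sd with divisor n)
mu_hat_demo    <- mean(x_demo)
sigma_hat_demo <- sqrt(mean((x_demo - mu_hat_demo)^2))

# Illustrative parameter of interest: coefficient of variation
cv        <- function(x) sd(x) / mean(x)
theta_hat <- cv(x_demo)

# Steps 2-4: simulate from the fitted model, recompute, repeat
B <- 1000
theta_star <- replicate(B, {
  x_star <- rnorm(n_demo, mean = mu_hat_demo, sd = sigma_hat_demo)
  cv(x_star)
})

se_boot   <- sd(theta_star)                 # bootstrap standard error
bias_boot <- mean(theta_star) - theta_hat   # bootstrap bias estimate
```

Unlike the nonparametric version, each x_star is drawn from the fitted density, so simulated values are not restricted to the observed ones.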


2. Parametric Bootstrap for Normal Mean and SD (Using boot)

We will use the boot package in R.
The basic function is:

boot(data, statistic, R, sim = c("ordinary", "parametric", ...), ran.gen, mle)

# True parameters

mu_true <- 5
sigma_true <- 2

# Sample size

n <- 50

# Simulate data

x <- rnorm(n, mean = mu_true, sd = sigma_true)

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.067   3.881   4.855   5.069   6.396   9.338

2.2 Define the Statistic Function for boot

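The statistic function takes the data and a vector of indices (for the ordinary bootstrap), and is evaluated on data[indices].
For the parametric bootstrap (sim = "parametric"), boot passes each simulated dataset as the first argument and does not supply an index vector. The functions below keep the indices argument so that they can also be used with the ordinary bootstrap; in the parametric case the statistic is simply computed on the whole simulated sample.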

Definitions

  • stat_mean: returns the sample mean
  • stat_mean_sd: returns both mean and standard deviation, so we can bootstrap two parameters at once

library(boot)

# Statistic for mean only

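stat_mean <- function(data, indices = seq_along(data)) {
  # The default uses every observation, so the function also works when
  # boot() passes no index vector (sim = "parametric")
  x <- data[indices]
  mean(x)
}

# Statistic returning mean and sd

stat_mean_sd <- function(data, indices = seq_along(data)) {
  x <- data[indices]
  c(mean = mean(x), sd = sd(x))
}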
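
To see the two calling conventions side by side, here is a small check on a toy vector. The names stat_demo and y are hypothetical, and the default value for indices is an assumption added so that the function can be called with the data alone.

```r
# Hypothetical statistic with the (data, indices) interface
stat_demo <- function(data, indices = seq_along(data)) {
  x <- data[indices]
  c(mean = mean(x), sd = sd(x))
}

y <- c(2, 4, 6, 8)

# Ordinary-bootstrap style: an index vector selects the resample 2, 2, 4, 8
res_indexed <- stat_demo(y, c(1, 1, 2, 4))

# Parametric style: only the data are passed, so all of y is used
res_full <- stat_demo(y)
```

Both calls return a named vector of length two, which is what lets boot track the mean and the sd in one run.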

2.3 Estimate Model Parameters (MLEs)

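For a Normal model, the MLE of the mean is the sample mean. The MLE of the standard deviation uses divisor n, whereas sd() uses n - 1, so sd(x) is slightly larger than the MLE; with n = 50 the difference is small, and sd(x) is used below as the plug-in estimate.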

mu_hat    <- mean(x)
sigma_hat <- sd(x)

mu_hat
## [1] 5.068807
sigma_hat
## [1] 1.85174
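
As a quick check of how the divisor-n and divisor-(n - 1) estimates of the standard deviation relate, the sketch below computes both on simulated data. The data, the seed, and the *_demo names are assumptions for illustration only.

```r
# Hypothetical example data; the seed is an assumption
set.seed(3)
x_demo <- rnorm(50, mean = 5, sd = 2)
n_demo <- length(x_demo)

sd_sample <- sd(x_demo)                              # divisor n - 1
sd_mle    <- sqrt(mean((x_demo - mean(x_demo))^2))   # divisor n (Normal MLE)

# The two differ by the fixed factor sqrt((n - 1) / n)
ratio <- sd_mle / sd_sample
```

For n = 50 the factor is sqrt(49/50), roughly 0.99, so either estimate gives nearly the same fitted model.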

We will pass these as the mle argument to boot.

2.4 Define the Parametric Generator ran.gen

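The parametric generator must be a function of two arguments, the observed data and the value supplied as mle, that returns a simulated dataset of the same form as the data: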

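rnorm_gen <- function(data, mle) {
  n     <- length(data)  # simulate a sample of the same size as the data
  mu    <- mle[1]        # first element of mle: estimated mean
  sigma <- mle[2]        # second element of mle: estimated sd
  rnorm(n, mean = mu, sd = sigma)
}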
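
Before handing a generator to boot, it can help to call it once directly and confirm that it returns a sample of the right size. The sketch below uses a hypothetical copy of the generator (rnorm_gen_demo) and hypothetical data so that it runs on its own.

```r
# Hypothetical generator with the (data, mle) interface
rnorm_gen_demo <- function(data, mle) {
  rnorm(length(data), mean = mle[1], sd = mle[2])
}

# Hypothetical example data; the seed is an assumption
set.seed(4)
x_demo   <- rnorm(50, mean = 5, sd = 2)
mle_demo <- c(mean(x_demo), sd(x_demo))

# One simulated dataset drawn from the fitted Normal model
x_star <- rnorm_gen_demo(x_demo, mle_demo)

same_length <- length(x_star) == length(x_demo)
```

The data argument is used here only for its length; the simulated values depend on the data solely through mle.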

2.5 Parametric Bootstrap for Mean and SD

Now we call boot with the following arguments:

  • sim = "parametric"
  • ran.gen = rnorm_gen
  • mle = c(mu_hat, sigma_hat)

set.seed(123)

boot_norm <- boot(
  data      = x,
  statistic = stat_mean_sd,
  R         = 2000,
  sim       = "parametric",
  ran.gen   = rnorm_gen,
  mle       = c(mu_hat, sigma_hat)
)

boot_norm
## 
## PARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = x, statistic = stat_mean_sd, R = 2000, sim = "parametric", 
##     ran.gen = rnorm_gen, mle = c(mu_hat, sigma_hat))
## 
## 
## Bootstrap Statistics :
##     original       bias    std. error
## t1* 5.068807  0.001808685   0.2499582
## t2* 1.851740 -0.008013623   0.1843551

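The printed output shows, for each statistic, the original value (stored in boot_norm$t0), the bootstrap estimate of bias, and the bootstrap standard error. The R = 2000 bootstrap replicates themselves are stored in boot_norm$t, a 2000 by 2 matrix whose columns correspond to t1* (mean) and t2* (sd).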
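
The bias and standard error columns can be reproduced from the stored replicates, and simple percentile intervals can be read off the same matrix. The sketch below is self-contained and does not reuse the objects above: the data, the seed, R = 500, and the names stat_demo, gen_demo, and b are assumptions for illustration.

```r
library(boot)

# Hypothetical example data; the seed is an assumption
set.seed(5)
x_demo <- rnorm(50, mean = 5, sd = 2)

# Hypothetical statistic and generator with the interfaces boot expects
stat_demo <- function(data, indices = seq_along(data)) {
  x <- data[indices]
  c(mean = mean(x), sd = sd(x))
}
gen_demo <- function(data, mle) {
  rnorm(length(data), mean = mle[1], sd = mle[2])
}

b <- boot(
  data      = x_demo,
  statistic = stat_demo,
  R         = 500,
  sim       = "parametric",
  ran.gen   = gen_demo,
  mle       = c(mean(x_demo), sd(x_demo))
)

# Recompute the printed summary columns from the replicates
bias_hat <- colMeans(b$t) - b$t0   # mean of replicates minus original value
se_hat   <- apply(b$t, 2, sd)      # sd of replicates, column by column

# 95% percentile intervals for the mean (column 1) and sd (column 2)
ci_mean <- quantile(b$t[, 1], c(0.025, 0.975))
ci_sd   <- quantile(b$t[, 2], c(0.025, 0.975))
```

For the Normal mean, se_hat[1] can be compared against the textbook value sd(x_demo) / sqrt(50) as a sanity check on the simulation.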