(Instructor: Nishant Panda)
Last lecture we introduced the bootstrap as a technique to evaluate statistical (frequentist) properties of an estimator. We looked at the plug-in estimator and laid down notation for understanding the bootstrap method. Today we will explore the bootstrap technique to approximate the standard error and bias of an estimator.
Let \(X \sim F\) be a random variable (could be multidimensional!) whose c.d.f. is given by \(F(x)\). Let us denote the expectation, variance and median of \(X\) by \(\mu_{F}\), \(\sigma^{2}_{F}\) and \(med_{F}\) to emphasize the distribution. These are all parameters of the distribution \(F\). In general, we denote a parameter \(\theta\) as a function of the distribution \(F\), written \(\theta = T(F)\) or \(\theta_{F}\) for short. Let \(X_1, X_2, \dots, X_n\) be a random sample of size \(n\) of \(X\). An estimator \(\widehat{\theta}\) is a function of the sample, i.e. \(\widehat{\theta} = h(X_1, X_2, \dots, X_n)\). For example, if \(\theta_{F} = \mu_F\), then \(\widehat{\theta} = \overline{X}\) is an estimator for the mean. Similarly, if \(\theta_{F} = \sigma^{2}_F\), then \(\widehat{\theta} = S^2\) is an estimator for the variance. Let \(\widehat{F}\) be an estimated c.d.f.; then \(\widehat{\theta}_{\widehat{F}} = T(\widehat{F})\) is the plug-in estimator for \(\theta\). As we saw last time, if \(\widehat{F}\) is the empirical c.d.f., then \[ \frac{1}{n}\sum\limits_{i = 1}^{n}(X_i - \overline{X})^2 \]
is the plug-in estimator for the variance. This is different from the usual estimator \(S^2\) (note the \(1/n\) as compared to the \(1/(n-1)\)!).
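As a quick numerical illustration of this difference (a minimal sketch with a made-up vector x, not part of the lecture data):
x <- c(2.1, 3.4, 1.8, 4.0, 2.7)       # any small sample, n = 5
plugin.var <- mean((x - mean(x))^2)   # plug-in estimator: divides by n
usual.var  <- var(x)                  # S^2: divides by n - 1
plugin.var / usual.var                # equals (n - 1)/n = 0.8 here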
Example 1: Let \(X \sim N(\mu, \sigma^2)\). Let \(X_1, X_2, \dots, X_n\) be a random sample of \(X\), and say we are interested in estimating \(\sigma^2\) using the sample variance as the estimator. Let \(F\) be the normal c.d.f. and let us write the mean and the variance as \(\mu_F\) and \(\sigma^2_{F}\). Estimate the standard error (SE) of \(S^2\).
Hence, \(\theta_{F} = \sigma^2\) and \(\widehat{\theta} = S^2\). What is the sampling distribution of \(S^2\)? To compute the SE we need \(\mathbb{E}\left[\widehat{\theta}\right]\) and \(var\left[\widehat{\theta}\right]\). If we know \(F\), we can get estimates (actually exact values, analytically!) by sampling from \(F\). Note that \(S^2\) is an unbiased estimator of \(\sigma^2\).
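In fact, since the data here are normal, the sampling distribution of \(S^2\) is known in closed form (a standard fact, stated without derivation, which we can use to sanity-check the simulation below): \[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad \mathbb{E}\left[S^2\right] = \sigma^2, \qquad var\left[S^2\right] = \frac{2\sigma^4}{n-1}, \qquad SE_{F}(S^2) = \sigma^2\sqrt{\frac{2}{n-1}}. \] For the simulation below (\(\sigma^2 = 4\), \(n = 40\)) this gives \(SE_{F}(S^2) = 4\sqrt{2/39} \approx 0.906\), which we can compare with the Monte Carlo estimate.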
# Let us fix some observed data.
# Say we have a data of size n = 40
set.seed(42)
n <- 40
# pretend we don't know what mean and sd are!
obs.data <- rnorm(n, mean = 5, sd = 2)  # drawn from the known F
# If we know F, we know mean and sd
mu.F <- 5
sigma.F <- 2
# number of simulation : number of times
# we draw from "population"
num.sim <- 1e3
sam.var <- rep(0, num.sim)
for(sim in 1:num.sim){
# take a sample of size $n$ from N(5,2)
data <- rnorm(n, mean = 5, sd = 2)
# compute the estimator
sam.var[sim] = var(data)
}
# Get the expected value of the estimator.
# this should be close to 4!
exp.thetah <- mean(sam.var)
# Get the SE of the estimator
# Do we even know the formula of the SE of S^2?
se.thetah <- sd(sam.var)
print(exp.thetah)
## [1] 4.064378
print(se.thetah)
## [1] 0.9040893
hist(sam.var, breaks = 21, col = "steelblue",
     xlab = "S^2: sample variance",
     main = "Sampling distribution of sample variance")
Recall that in bootstrap we use the plug-in estimator to find properties of the estimator. (Thus, replace \(F\) with \(\widehat{F}\)!)
Let \(X_1, X_2, \dots, X_n\) be a random sample. Let \(\hat{F}\) be an approximate c.d.f.
Note that \[ SE_{F}(\widehat{\theta}) \approx SE_{\widehat{F}}(\widehat{\theta}) = \sqrt{var_{\widehat{F}}\left[\widehat{\theta}\right]} \] and \[ var_{\widehat{F}}\left[\widehat{\theta}\right] = \mathbb{E}_{\widehat{F}}\left[\left(\widehat{\theta} - \mathbb{E}_{\widehat{F}}\left[\widehat{\theta}\right]\right)^2\right]. \] Let \[ g(\widehat{\theta}) = \left(\widehat{\theta} - \mathbb{E}_{\widehat{F}}\left[\widehat{\theta}\right]\right)^2. \]
Thus, \[ var_{\widehat{F}}\left[\widehat{\theta}\right] = \mathbb{E}_{\widehat{F}}\left[g(\widehat{\theta})\right] \]
How do we compute \(\mathbb{E}_{\widehat{F}}\left[\widehat{\theta}\right]\) and \(\mathbb{E}_{\widehat{F}}\left[g(\widehat{\theta})\right]\)?
We can approximate these using Monte Carlo integration! First note that \(\widehat{\theta}\) is a function of \((X_1, X_2, \dots, X_n)\). Hence, to use Monte Carlo integration:
Get \(R\) samples of size \(n\) from \(\widehat{F}\).
Each sample is of size \(n\); the \(j^{th}\) sample, say, is denoted by \(\mathbf{x^{\ast}_{j}} = \left(x_1^{\ast}, x_2^{\ast}, \dots, x_n^{\ast}\right)\). From this, compute \(\widehat{\theta}(\mathbf{x^{\ast}_{j}})\) and \(g\left(\widehat{\theta}(\mathbf{x^{\ast}_{j}})\right)\).
Then, by Monte Carlo, \[ \mathbb{E}_{\widehat{F}}\left[\widehat{\theta}\right] \approx \frac{1}{R}\sum\limits_{j = 1}^{R}\widehat{\theta}(\mathbf{x^{\ast}_{j}}), \] and
\[ \mathbb{E}_{\widehat{F}}\left[g(\widehat{\theta})\right] \approx \frac{1}{R}\sum\limits_{j = 1}^{R}g\left(\widehat{\theta}(\mathbf{x^{\ast}_{j}})\right). \]
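These steps can be collected into a small helper function (a sketch only; draw.sample and estimator are placeholder arguments introduced here, not names used elsewhere in these notes: the first draws one sample of size \(n\) from \(\widehat{F}\), the second computes \(\widehat{\theta}\) from a sample):
boot.moments <- function(draw.sample, estimator, R){
  # bootstrap replicates of the estimator
  theta.star <- replicate(R, estimator(draw.sample()))
  # Monte Carlo estimate of E_Fhat[theta hat]
  mu.star <- mean(theta.star)
  # Monte Carlo estimate of E_Fhat[g(theta hat)] = var_Fhat[theta hat]
  var.star <- mean((theta.star - mu.star)^2)
  c(mean = mu.star, se = sqrt(var.star))
}
For the non-parametric bootstrap done below, this would be called as boot.moments(function() sample(obs.data, n, replace = TRUE), var, R).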
In the case of the non-parametric bootstrap, \(\widehat{F} = F_e\) is the empirical c.d.f. In that case, any sample \(\mathbf{x^{\ast}} = \left(x_1^{\ast}, x_2^{\ast}, \dots, x_n^{\ast}\right)\) is just a sample drawn from the observed data WITH REPLACEMENT.
In the case of the parametric bootstrap, \(\widehat{F}\) is obtained from a statistical model, e.g. by using maximum likelihood to fit the parameters of a known distribution, and a sample \(\mathbf{x^{\ast}} = \left(x_1^{\ast}, x_2^{\ast}, \dots, x_n^{\ast}\right)\) is just a sample drawn from that fitted distribution.
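Concretely, the only difference between the two is how a bootstrap sample is drawn. Here is a sketch (the parametric version assumes, for illustration, a normal model fitted by maximum likelihood; the names mu.mle and sigma.mle are introduced only for this sketch):
# non-parametric bootstrap: resample the observed data with replacement
x.star <- sample(obs.data, n, replace = TRUE)
# parametric bootstrap: draw from the fitted normal distribution
mu.mle    <- mean(obs.data)
sigma.mle <- sqrt(mean((obs.data - mean(obs.data))^2))  # ML estimate uses 1/n, not 1/(n-1)
x.star    <- rnorm(n, mean = mu.mle, sd = sigma.mle)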
Note that our observed data was of size \(n = 40\). First, let us look at the observed data.
hist(obs.data, breaks = 11, xlab = "observed X",
     main = "histogram of the observed data")
# Check for normality
qqnorm(obs.data)
qqline(obs.data)
# First let us create our estimator
theta.hat <- function(sample){
return (var(sample))
}
# Set the number of bootstrap samples R
R <- 1e3
# create a vector for sampling distribution
theta.hat.dist <- rep(0, R)
# do the bootstrap
for (boot in 1:R){
# get sample of size n from F hat.
x.star <- sample(obs.data, n, replace = TRUE)
# get the estimated value from the estimator
theta.hat.dist[boot] <- theta.hat(x.star)
}
# get the standard error estimate of S^2
# First create function g
g <- function(theta_hat, mu_hat){
(theta_hat - mu_hat)^2
}
# Get the expected value of the estimator
mu.hat <- mean(theta.hat.dist)
# Get the variance of the estimator using the bootstrap Monte Carlo
var.boot <- sum(sapply(theta.hat.dist, g, mu_hat = mu.hat))/R
# Compute the standard error SE as the square root of the variance of the estimator
se.boot <- sqrt(var.boot)
se.boot
## [1] 1.184573
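As a side note (an equivalent way of writing the same computation, not a different method), the bootstrap variance above can be expressed more compactly:
# equivalent, more compact form of var.boot computed above
mean((theta.hat.dist - mu.hat)^2)
# and sd(theta.hat.dist) is nearly identical to se.boot for large R
# (it divides by R - 1 instead of R)
sd(theta.hat.dist)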
Let us check if we did the above correctly with the boot package in R.
library(boot)
# our test statistic is the sample variance
sample.var <- function (x, d){
return(var(x[d]))
}
# get the bootstrap object
b <- boot(obs.data, sample.var, R)
b
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = obs.data, statistic = sample.var, R = R)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 5.976929 -0.153534 1.152481
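For reference (not shown in the lecture, but standard contents of the returned boot object), the printed standard error can be recovered directly from the stored replicates:
# the replicates are stored in b$t and the original statistic in b$t0;
# the printed std. error is just the sample standard deviation of the replicates
sd(b$t[, 1])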
This checks out. But compared to the exact answer (known \(F\) in Example 1) it looks like we overshot! Let us look at the sampling distribution of the estimator as computed from the bootstrap.
hist(theta.hat.dist, breaks = 21, col = "steelblue",
xlab = "S^2: sample variance",
main = "(Bootstrap) Sampling distibution of sample variance") This does look a bit biased!
# create a data frame
df <- data.frame(sam.var, theta.hat.dist)
names(df) <- c("true.dist", "boot.dist")
library(ggplot2)
ggplot(data = df, aes(true.dist)) +
  geom_density(alpha = 0.2, fill = "red") +
  geom_histogram(aes(x = boot.dist, y = ..density..),
                 alpha = 0.4, color = "blue", bins = 20) +
  geom_density(aes(x = boot.dist), alpha = 0.2, fill = "blue") +
  labs(x = "Sample variance")
In the next lecture, we will look at how to estimate the bias of the estimator and use bias correction to account for bootstrap bias.
# Exercise: change this line (for example, to draw a parametric bootstrap sample as sketched earlier).
# get sample of size n from F hat.
x.star <- sample(obs.data, n, replace = TRUE)