A Simple Example of Bootstrap

Ashwin Malshe

6 October 2016

This note demonstrates the usefulness of bootstrap. The example is kept at a very basic level on purpose. The general principle of bootstrap can then be applied to any complex problem where we are unable to get standard errors of estimators.

Imagine that we have a sample with 100 observations. We want to use the sample to make an inference about the population mean. The sample mean, \(\bar{x}\) is an unbiased estimator of the population mean \(\mu\) (i.e., \(E(\bar{x}) = \mu\)). The standard error of the sample mean is given by \(\frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the population standard deviation and \(n\) is the sample size. As the sample size increases, the standard error decreases. When the sample size approchaes infinity, the standard error will approach 0.

“Standard error of the mean” is a confusing name. I am going to take a quick detour and summarize what exactly it is. If from a population you repeatedly pick samples of size \(n\) and calculate the sample mean for each sample, you get a sampling distribution of means. The standard deviation of the sampling distribution is called the standard error of the mean.

The formula for standard error requires us to know \(\sigma\) and \(n\). As an example, for \(\sigma = 2\) and \(n = 100\), the standard error of the mean is \(\frac{2}{\sqrt{100}} = 0.2\). Let’s run a small experiment to verify this formula.

I am going to start with a random normal distribution with \(\mu = 5\) and \(\sigma = 2\). I will draw a sample of 100 obsevations (\(n = 100\)) from this distribution and calculate its mean. I will repeatedly draw samples with 100 observations for another 999 times, each time calculating the mean of the sample. Thus, in all I will draw 1,000 samples each with 100 observations and obtain a vector of means of these 1,000 samples. The vector has a length of 1,000. This vectos is also the sampling distribution.

## Get a vector of NAs with 1,000 rows. This vector will hold the sample means

a <- matrix(NA, nrow=1000, ncol=1)

# Set random number seed for reproducible results.

set.seed(123456)

# Draw 1,000 samples from the random distribution with mean = 5 and sd = 2
# Get the mean of each sample and store in the vector a.

for(i in 1:1000){
  # Calculate the sample mean of a random sample of size 100
  a[i] <- mean(rnorm(100, mean = 5, sd = 2))
  }

Once we have the sampling distribution, we can plot it to see how it looks.

library(ggplot2)
qplot(x = a[,1], geom = "density")

The sampling distribution looks fairly symmetric. As an aside the Central Limit Theorem dictates that as the sample size approaches infinity, the sampling distribution of the mean will appraoch normal distribution. You can visually verify this by increasing the sample size in the above code from 100 to something large, say 2,000.

Next, let’s get the mean and the standard deviation of our sampling distribution. Note that as \(\sigma = 2\) and \(n = 100\), the standard deviation of the sampling distribution, or more commonly, the standard error will be 0.2. Let’s see whether in our simulation we get that right.

# Get the mean of the sampling distribution
mean(a)

## [1] 5.006553

# Get the standard deviation of the sampling distribution
# This is what's called the standard error. We expect this to be 0.2
sd(a)

## [1] 0.1989336

The standard error is 0.1989, very close to 0.2 as we expected from the formula.

Bootstrap

The above example is great to show that the empirical observation matches the formula. However, in real life we don’t have the luxury to draw repeated samples. All we have is one sample given to us. So let’s first generate a random sample that we will work with.

set.seed(6789)

a1 <- rnorm(100, mean = 5, sd = 2)

a1 is a vector that has 100 observations which are draw from a normal distribution with \(\mu = 5\) and \(\sigma = 2\) as before. Using this sample, we need to verify that the standard error of the mean is equal to 0.2 as calculated using the formula.

This is a complex problem. From one single sample, all we will have is one sample mean. However, to get a sampling distribution, we need multiple sample means. How can we do it from 1 sample? This task looks impossible.

Luckily we can use bootstrap to get multiple samples from our original sample a1. It may not lead to a perfect solution but we are likely to get a standard error that’s close to what we expect to see. In the following code, I am going to draw 1,000 bootstrap samples from our original sample a1. Each of these samples will have the same sample size of 100 as a1. However, the sampling has to be done with replacement because otherwise we will end up drawing the same sample 1,000 times. Sampling with replacement will result in potentially drawing one observation multiple times in a sample. This will lead to different sample compositions.

As before, I am going to draw bootstrap sample and get its mean in one shot. Next I will store the bootstrap means in a vector a2. This vector is the sampling distribution of the bootstrap means.

# Create a vector to hold bootstrap means
a2 <- matrix(NA, nrow = 1000, ncol = 1)

set.seed(387)
for(i in 1:1000){
  # Note the replace = T option for sampling with replacement
  a2[i] <- mean(a1[sample(length(a1),length(a1),replace = T)])  
}

Let’s verify whether the bootstrap gives us a standard error of mean that is close to 0.2

sd(a2)

## [1] 0.2073292

We get a standard error of 0.2073292, which is very close to 0.2 that we expected because of the formula. Thus, bootstrap gave us excellent results even when all we had was a single sample. Even with a single sample we can empirically create a sampling distribution and get the standard error.

Just for the sake of comparison, let’s get the density plot for the bootstarp sampling distribution.

qplot(x = a2[,1], geom = "density")

We get a nicely symmetric sampling distribution. However, note that this is shifted a bit to the right of 5, which is the true population mean. Thus, the sampling distribution mean is showing a little bias here. However, we get a reasonable estimate of the standard error, which in most cases what we are interested in.