Introduction

Bootstrapping is a computational way to understand the distribution of a statistic computed from a random sample (its center, variability, and overall shape) by repeatedly resampling the data we already have.

As always, we assume that our sample is representative of the population, for example a simple random sample of sufficient size.

First, recall what a confidence interval is. How is one constructed? What is it used for? How is it interpreted?

Example 1

Consider this thought experiment. We would like to know if a coin is fair (50/50). We each take the coin and flip it 100 times. Suppose 1000 people did this. We would get 1000 numbers, one proportion of heads per person. Most would be close to 50%, but some people would get noticeably higher than that and some noticeably lower. Each of these values is a plausible estimate of the actual bias of the coin.

We can create a 95% confidence interval estimate for the bias of the coin by putting all the values in order, then removing the top 2.5% and the bottom 2.5%. The max and min of the remaining values will be our confidence interval. We will demonstrate this in code shortly.
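As a preview, here is a minimal sketch of the thought experiment in R. It assumes the coin really is fair (`prob = 0.5`) and uses an arbitrary seed; the variable names are chosen only for illustration.

```r
set.seed(1)                        # arbitrary seed so the run is repeatable
people <- 1000                     # number of people flipping
flips  <- 100                      # flips per person

# rbinom() gives the number of heads each person observed (assumes a fair coin)
heads <- rbinom(people, size = flips, prob = 0.5)
p_hat <- heads / flips             # each person's observed proportion of heads

# put the values in order, then drop the bottom 2.5% and the top 2.5%
sorted <- sort(p_hat)
cut    <- round(people * 0.025)    # 25 values removed from each end
kept   <- sorted[(cut + 1):(people - cut)]

ci <- range(kept)                  # min and max of what remains
ci
```

The interval is simply the range of the middle 95% of the observed proportions.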

Example 2: Sampling with Replacement

Suppose now you are a restaurant owner. You would like to know the average bill per table. You collect a sample of receipts. Assume this is a representative sample. Let’s call this list “receipts”. Let the length of receipts be n.

Question: Suppose you made a duplicate of each receipt. Then your sample would go from length n to 2n. How would that change things?

Question: More specifically, how would that change the mean? The standard deviation?

Question: Does doubling the sample make it any more or less “representative?”
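One way to explore these questions is to try it. The sketch below uses a small set of made-up bill amounts (not real data) and duplicates every value with `rep`.

```r
# made-up bill amounts, for illustration only
receipts <- c(23.50, 41.25, 18.00, 67.80, 35.10, 52.40)
doubled  <- rep(receipts, times = 2)   # every receipt now appears twice

length(receipts)   # n
length(doubled)    # 2n

mean(receipts)     # the mean ...
mean(doubled)      # ... is unchanged by duplication

sd(receipts)       # sd() divides by (length - 1), so ...
sd(doubled)        # ... the doubled sample's sd is slightly smaller
```

The duplicated sample contains no new information about the restaurant, even though it is twice as long.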

What if we made an infinite number of copies of the original sample to draw from? How would sampling from this new, enlarged sample change things? This is equivalent to drawing from the original sample with replacement.

Also note that our resamples will usually be of size n. Hence it would not make sense to draw a resample of size n without replacement from a sample of size n, since we would just get back the original sample in a different order.
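The difference between the two kinds of resampling can be seen in a short sketch. The receipt amounts are made up and the seed is arbitrary.

```r
set.seed(2)                                   # arbitrary seed
# made-up bill amounts, for illustration only
receipts <- c(23.50, 41.25, 18.00, 67.80, 35.10, 52.40)
n <- length(receipts)

# without replacement: a reshuffle of the original sample
without_repl <- sample(receipts, n, replace = FALSE)
identical(sort(without_repl), sort(receipts)) # always TRUE

# with replacement: some receipts usually repeat and others are left out
with_repl <- sample(receipts, n, replace = TRUE)
sort(with_repl)
mean(with_repl)                               # typically differs from mean(receipts)
```

Only the resample drawn with replacement can give a mean different from the original, which is what makes the bootstrap informative.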

Example 3: Baseball

Let’s do a quick demonstration in StatKey. Then we will do it in R. https://www.lock5stat.com/StatKey/

  1. Go to the Sampling Distributions row. Select “Mean”
  2. The default is the baseball player salary data from 2019. You can see a small histogram on the left. There are 877 data values (players) in total. You can click on “Show Data Table” near the top to see them all.
  3. Find the “Choose Sample Size” area near the top. It is set to n=10. Leave it there. Note that StatKey defaults to sampling without replacement. We will not worry about that.
  4. Click on “Generate 1000 samples” and a plot will appear. Each dot represents one sample of n=10 salaries drawn from the data, with its mean computed and plotted. The plot also shows the overall mean of these sample means, which should be about 4.4 (million).
  5. Click on the “Two-Tail” checkbox and you will see how a 95% confidence interval is generated.
  6. What do you think happens to the confidence interval if we switch from n=10 to n=100?
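The steps above can be imitated in R. This sketch does not use the StatKey salary data; it assumes a made-up right-skewed population of 877 values generated with `rlnorm`, and `sample_means` is a helper name invented for this example.

```r
set.seed(3)                                  # arbitrary seed
# stand-in population: 877 right-skewed values (NOT the StatKey salary data)
population <- rlnorm(877, meanlog = 0.8, sdlog = 1)

# draw `reps` samples of size n (without replacement) and return their means
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

means10  <- sample_means(10)
means100 <- sample_means(100)

# middle 95% of the sample means for each sample size
quantile(means10,  probs = c(0.025, 0.975))
quantile(means100, probs = c(0.025, 0.975))

sd(means10)    # spread of the means with n = 10
sd(means100)   # spread of the means with n = 100
```

Comparing the two pairs of quantiles shows how the width of the middle 95% responds to the sample size.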

Useful R Functions

rnorm
sample
summary
hist
rep
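A quick sketch of what each of these functions does. The values and the seed are arbitrary, and `z` is just an illustrative name.

```r
set.seed(4)                          # arbitrary seed
z <- rnorm(5, mean = 0, sd = 1)      # 5 random draws from a normal distribution
sample(1:6, 3)                       # 3 values from 1..6, without replacement
sample(1:6, 10, replace = TRUE)      # 10 die rolls, with replacement
summary(z)                           # min, quartiles, median, mean, max
hist(z)                              # histogram of the values
zeros <- rep(0, 4)                   # the value 0 repeated 4 times
zeros
```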

Bootstrapping in R

First, let’s create a sample. We will also take a look at its summary statistics and histogram.

x<-rnorm(30,50,10) #30 draws from a normal distribution with a mean of 50 and a standard deviation of 10.
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.83   41.96   46.53   47.62   52.31   71.39
hist(x)

We would like to estimate the true mean. (Of course we know it is 50 because we constructed it that way, but pretend we do not know the mean and standard deviation.)

Now let’s make 1000 bootstrap samples.

B<-1000 # number of bootstrap samples to obtain
xbar<-rep(0,B) # start with a vector of 1000 zeros to hold the bootstrap means

#next we draw a resample with replacement 1000 times and compute the mean of each one.
for (i in 1:B){
  xbs<-sample(x, length(x), replace=TRUE)
  xbar[i]<-mean(xbs)
}
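The same loop can be written more compactly with `replicate`, which evaluates an expression repeatedly and collects the results. This is an alternative sketch, not a change to the method; it regenerates `x` with an arbitrary seed so that it runs on its own, so its numbers will not match the output shown in these notes.

```r
set.seed(5)                      # arbitrary seed
x <- rnorm(30, 50, 10)           # a fresh sample, as in the notes
B <- 1000                        # number of bootstrap samples

# each replicate: resample x with replacement, then take the mean
xbar <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

length(xbar)                     # 1000 bootstrap means
```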

Now let’s take a look at what we have.

summary(xbar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.07   46.63   47.73   47.73   48.74   53.95
hist(xbar)

Before we find a 95% confidence interval via bootstrapping, let’s compute a confidence interval using a more traditional method, the t-distribution.

First a few quantities.

mean(x)
## [1] 47.61716
sd(x)
## [1] 8.634948
sd(x)/sqrt(length(x))
## [1] 1.576519
sd(xbar)
## [1] 1.609607

Now let’s compute the interval. We will do it two ways. If we know that the distribution is normal we might go this way.

tcv<-qt(0.975, length(x)-1) #remember this from DMC106?  You may have seen this called t-star. 

#This way uses a standard error derived from our bootstrap sample.
mean(x)+c(-1,1)*tcv*sd(xbar) #sd(xbar) is the bootstrap estimate of the standard error
## [1] 44.32514 50.90917
#This way uses the formula you may have seen in DMC106.
mean(x)+c(-1,1)*tcv*sd(x)/sqrt(length(x)) #sd(x)/sqrt(n) is the usual standard error
## [1] 44.39282 50.84150
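Written out, the two calculations above look like this, using the rounded values printed earlier and the critical value for 29 degrees of freedom, about 2.045. The symbol for the bootstrap standard error is notation chosen here, not something from the course.

```latex
\[
\bar{x} \pm t^{*}_{n-1}\,\widehat{\mathrm{SE}}_{\text{boot}}
  = 47.617 \pm 2.045 \times 1.610 \approx (44.33,\ 50.91)
\]
\[
\bar{x} \pm t^{*}_{n-1}\,\frac{s}{\sqrt{n}}
  = 47.617 \pm 2.045 \times 1.577 \approx (44.39,\ 50.84)
\]
```

The only difference between the two is which estimate of the standard error is plugged in.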

If we are not sure whether the data are normal, we will instead find the “middle 95%” of our bootstrap means.

quantile(xbar, probs=c(0.025, 0.975))
##     2.5%    97.5% 
## 44.74194 51.27860

Note how close these three intervals are.
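The percentile approach can be wrapped in a small reusable function. This is a sketch only: `boot_ci` and its arguments are names made up for this example, not part of base R, and the sample is regenerated with an arbitrary seed so the code runs on its own.

```r
# boot_ci: hypothetical helper that returns a percentile bootstrap interval
#   x     - numeric vector of data
#   stat  - function computing the statistic of interest (default: mean)
#   B     - number of bootstrap resamples
#   level - confidence level
boot_ci <- function(x, stat = mean, B = 1000, level = 0.95) {
  boot_stats <- replicate(B, stat(sample(x, length(x), replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(6)                             # arbitrary seed
x <- rnorm(30, 50, 10)                  # a fresh sample, as in the notes

ci_mean   <- boot_ci(x)                 # 95% interval for the mean
ci_median <- boot_ci(x, stat = median)  # same idea for the median
ci_mean
ci_median
```

Because the statistic is passed in as a function, the same code works for medians, standard deviations, or any other summary, which is one of the main attractions of the bootstrap.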

Next Time

How would we do hypothesis testing using these methods?