Bootstrap

When we compute a confidence interval, we first compute an estimate of a parameter with a statistic.

For example, we draw a random sample, then compute the sample mean to estimate the population mean.

With a single sample (or anything short of the whole population) we don’t know where the population mean lies, so we want to localize the population mean within an interval, computed from the data in the sample.

An interval is a range of values (e.g. all real numbers between 4801 and 6801: written [4801, 6801]). A confidence interval is a range of plausible values for the parameter. A 95% confidence interval for a parameter is an interval computed by a method that successfully covers the parameter 95% of the time (i.e. for 95% of the samples).

We call 95% the “confidence level” of the interval. Other confidence levels, anything strictly between 0% and 100%, are possible.

As the confidence level increases, what happens to the width of the confidence interval?

As the confidence level increases, the width of the confidence interval increases.

As a follow-up to the last question, will you be more confident that the population mean lies in a bigger interval, or less confident?

I would be more confident. A bigger interval contains more plausible values, so it is more likely to cover the population mean. The price is a larger margin of error.
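To put numbers on the relationship between confidence level and width, here is a minimal sketch using the Normal approximation. The standard error of 100 is a made-up value chosen only for illustration, and the three levels are the ones from the 68-95-99.7 Rule.

```r
# Assumed (hypothetical) standard error; not taken from any data set
se <- 100

# Three confidence levels to compare
conf_levels <- c(0.68, 0.95, 0.997)

# Normal multiplier for each level: how many standard errors
# we go out on each side of the estimate
z <- qnorm(1 - (1 - conf_levels) / 2)

# Full width of the interval = 2 * margin of error
widths <- 2 * z * se

data.frame(level = conf_levels, multiplier = round(z, 2), width = round(widths))
```

The multipliers come out close to 1, 2, and 3, and the width grows as the confidence level grows.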

As the sample size increases, what happens to the width of the confidence interval?

The width of the confidence interval decreases.

In other words, will you be able to localize the population mean better (in a smaller interval) with a larger sample size?

Yes. A larger sample gives a smaller interval, so the population mean is localized better.

If the variability in the population increases, what happens to the width of the confidence interval?

The width increases, because more variability in the population gives a larger standard error and therefore a larger margin of error.

In other words, will you be able to localize the population mean better (in a smaller interval) with more variability in the population?

No, you wouldn’t. More variability gives a wider interval.
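Both the sample-size answer and the variability answer follow from the usual formula for the standard error of a sample mean, sd / sqrt(n). The sketch below is only an illustration: `ci_width` is a hypothetical helper, the standard deviations and sample sizes are made up, and the margin of error is assumed to be 2 standard errors.

```r
# Hypothetical helper: width of an approximate 95% interval for a mean,
# using margin of error = 2 * standard error and SE = sd / sqrt(n)
ci_width <- function(sd, n) {
  2 * (2 * sd / sqrt(n))
}

# Larger samples give narrower intervals (sd fixed at a made-up 4000)
width_by_n <- ci_width(sd = 4000, n = c(16, 64, 256))

# More variable populations give wider intervals (n fixed at 64)
width_by_sd <- ci_width(sd = c(1000, 2000, 4000), n = 64)

width_by_n
width_by_sd
```

Quadrupling the sample size halves the width, while doubling the standard deviation doubles it.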

Drawing Confidence Intervals Assuming You Know Shape of Sampling Distribution

According to the Central Limit Theorem, the sampling distribution of the sample mean is approximately Normal (bell shaped) when the sample size is large enough. Using this approximation and the 68-95-99.7 Rule, we set the margin of error (half the width of the confidence interval, which is centered on the sample mean) to 2 times the standard error (a statistic that estimates the standard deviation of the sampling distribution). This gives an approximate 95% confidence interval. We did this in class the Friday before Thanksgiving.
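As a reminder of how that calculation goes, here is a minimal sketch. The sample is simulated from an exponential distribution as a stand-in for skewed price data; it is an assumption for illustration and is not the diamonds data.

```r
set.seed(10)

# Hypothetical right-skewed sample of 64 "prices" (simulated, not real data)
x <- rexp(64, rate = 1 / 4000)

xbar <- mean(x)                  # point estimate of the population mean
se <- sd(x) / sqrt(length(x))    # standard error of the sample mean
moe <- 2 * se                    # margin of error from the 68-95-99.7 Rule

# Approximate 95% confidence interval, centered on the sample mean
ci <- c(lower = xbar - moe, upper = xbar + moe)
ci
```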

Bootstrap

Now I want to talk about another method of estimating confidence intervals. It’s called the “bootstrap”. Why is it called the bootstrap? Well, you take a single sample, and estimate the variability in the population by “pulling yourself up by your bootstraps”. (I am not making this up.)

We have only one sample, but we want to use it to imitate drawing more samples of the same size. We do this by sampling with replacement. We are pulling things out of a hat and throwing them back before we draw the next number. Say we have a sample of prices of diamonds: 1, 2, 3, 4.

Now we draw new samples of size 4 from it, with replacement:

set.seed(1)
sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 2 2 3 4

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 1 4 4 3

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 3 1 1 1

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 3 2 4 2

sample(c(1,2,3,4),size=4,replace=TRUE)

## [1] 3 4 2 4

The idea is that we are drawing new samples such that a quarter of the time we choose 1, a quarter of the time we choose 2, and so on. This gives us an indication of the variability in our sampling distribution, using just the information in a single sample and without drawing more than one sample.
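The “quarter of the time” claim can be checked directly by drawing many values with replacement and tabulating them. This is a small sketch; the choice of 10000 draws is arbitrary.

```r
set.seed(1)

# Draw many values with replacement from the toy sample
draws <- sample(c(1, 2, 3, 4), size = 10000, replace = TRUE)

# Proportion of draws equal to each value; each should be near 0.25
props <- prop.table(table(draws))
props
```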

Now let’s draw a sample of the price of diamonds from the diamonds data set. I’ll use a sample size of 64. (We certainly can’t use a sample size of 1, and we probably don’t want it to be too small, although it would be an interesting experiment to see what happens with small sample sizes.)

library(ggplot2)
data(diamonds)

set.seed(2)
dia64 <- sample(diamonds$price, size=64)
dia64

##  [1]  4702  1006   745  4516  2318  2316  4150  1635 13853   706   709
## [12]  5368  1185  4661  9918  1728  2548  5183 12148  3528   911  9116
## [23]  1651  4350  7644   645   582  7983  2432  4175  2832  4484  1436
## [34]  1810 18736   842  1683  6164   476  4348  2589  6405  3998  4472
## [45]  2315  1356  2530  7701   646  1436  2811  2862   957  2207  5976
## [56]  1438  1299   552   814   383  1232  1898   837  5702

We are going to create a bootstrap distribution for the sample mean using this sample. We will take 1000 resamples of size 64, with replacement, from the sample:

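set.seed(3)
# Preallocate space for the 1000 bootstrap means
bs1000 <- numeric(1000)
for (i in 1:1000) {
  # Resample the original sample with replacement and record the mean
  bs1000[i] <- mean(sample(dia64, size=64, replace=TRUE))
}
head(bs1000)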

## [1] 4043.219 2859.844 3628.453 2976.828 2906.266 4646.203
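The same resampling loop can be written more compactly with `replicate`. This is a sketch, not the code that produced the output above: `boot_means` is a hypothetical helper name, and the input is simulated so the example runs on its own.

```r
# Hypothetical helper: B bootstrap means from a numeric sample x
boot_means <- function(x, B = 1000) {
  replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
}

set.seed(3)
x <- rexp(64, rate = 1 / 4000)   # simulated stand-in for a sample such as dia64
bm <- boot_means(x, B = 1000)

length(bm)   # one mean per resample
mean(bm)     # centered near mean(x)
sd(bm)       # bootstrap estimate of the standard error
```

The standard deviation of the bootstrap means plays the same role as the standard error in the Normal-approximation method.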

Now we are going to get 1000 samples from the whole population of diamonds (without a bootstrap). This is obviously better, but we need to draw 1000 samples from our population. If it were very expensive to do this (e.g. we have to run 1000 different random surveys), we might not be able to do this.

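set.seed(4)
# Preallocate space for the 1000 sample means
dp1000 <- numeric(1000)
for (i in 1:1000) {
  # Draw a fresh sample of 64 from the population and record the mean
  dp1000[i] <- mean(sample(diamonds$price, size=64))
}
head(dp1000)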

## [1] 3599.203 4380.703 4443.984 3937.844 3518.172 4301.484

Now let’s display the two side by side.

# Stack both sets of 1000 sample means in one data frame, labeled by source
df <- data.frame(
  stats = c(dp1000, bs1000),
  labels = rep(c('population', 'bootstrap'), each = 1000)
)

ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot()

Here is the claim: even though the centers of the two distributions are different, their spreads should be roughly the same. So, for about 95% of samples, the population mean will lie between the 2.5th percentile and the 97.5th percentile of the bootstrap distribution.

lo <- quantile(bs1000, probs=0.025)
hi <- quantile(bs1000, probs=0.975)
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot() +
  geom_hline(yintercept=lo,color="green") + geom_hline(yintercept=hi,color="green") +
  geom_hline(yintercept=mean(diamonds$price),color="black")

The black line is the mean of the entire population (not of the 1000 samples). The green lines are the percentiles that define the confidence interval for our original sample of 64; they are the 2.5th and 97.5th percentiles of the 1000 bootstrap sample means.

Note that our 95% confidence interval is one of the 95% that cover the true value of the parameter (the population mean).
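The phrase “one of the 95%” can be checked by simulation when the population is known. The sketch below is an illustration under stated assumptions: it uses a simulated skewed population rather than the diamonds data, a hypothetical helper `boot_ci`, and small counts (200 samples, 500 resamples each) to keep it quick.

```r
set.seed(5)

# Simulated right-skewed population with a known mean (not the diamonds data)
population <- rexp(50000, rate = 1 / 4000)
true_mean <- mean(population)

# Hypothetical helper: percentile bootstrap interval for the mean of x
boot_ci <- function(x, B = 500, level = 0.95) {
  means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
  alpha <- 1 - level
  quantile(means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Draw 200 independent samples of size 64 and record whether each
# sample's interval covers the population mean
covered <- replicate(200, {
  ci <- boot_ci(sample(population, size = 64))
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covered)   # fraction of intervals that cover the population mean
```

The fraction should land in the neighborhood of 0.95. For skewed data and modest sample sizes, the percentile method often covers somewhat less than the nominal level.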

Lizzy Mendes

November 21, 2017
