When we compute a confidence interval, we first compute an estimate of a parameter with a statistic.
For example, we draw a random sample then compute the sample mean to estimate the parameter.
With a single sample (or anything short of the whole population) we don’t know where the population mean lies, so we want to localize the population mean within an interval, computed from the data in the sample.
An interval is a range of values (e.g. all real numbers between 4801 and 6801, written [4801, 6801]). A confidence interval is a range of plausible values for the parameter. A 95% confidence interval for a parameter is an interval computed by a method guaranteed to cover the parameter 95% of the time (i.e. for 95% of possible samples).
We call 95% the “confidence level” of the interval. Other confidence levels, anywhere between 0% and 100%, are possible.
As the confidence level increases, what happens to the width of the confidence interval?
As a follow-up to the last question, will you be more confident that the population mean lies in a bigger interval, or less confident?
As the sample size increases, what happens to the width of the confidence interval?
In other words, will you be able to localize the population mean better (in a smaller interval) with a larger sample size?
If the variability in the population increases, what happens to the width of the confidence interval?
In other words, will you be able to localize the population mean better (in a smaller interval) with more variability in the population?
According to the Central Limit Theorem, the sampling distribution of the sample mean is approximately Normal (bell shaped). Using this approximation and the 68-95-99.7 Rule, we can take the margin of error (half the width of the confidence interval, which is centered on the sample mean) to be 2 times the standard error (a statistic that estimates the standard deviation of the sampling distribution). We did this in class the Friday before Thanksgiving.
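As a reminder, here is a minimal sketch of that formula-based interval (the vector x below is a made-up sample, just for illustration; it is not data from class):

x <- c(5, 7, 8, 4, 6, 9, 5, 7)     # any numeric sample (made up here)
se <- sd(x) / sqrt(length(x))      # standard error of the sample mean
mean(x) + c(-2, 2) * se            # approximate 95% confidence interval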
Now I want to talk about another method of estimating confidence intervals. It’s called the “bootstrap”. Why is it called the bootstrap? Well, you take a single sample, and estimate the variability in the population by “pulling yourself up by your bootstraps”. (I am not making this up.)
We have only one sample, but we want to use it to mimic drawing more samples of the same size from the sampling distribution. We do this by sampling with replacement: we pull numbers out of a hat and throw them back before drawing the next one. Say we have a sample of prices of diamonds: 1, 2, 3, 4.
Now we draw new samples with the same distribution:
set.seed(1)
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 2 2 3 4
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 1 4 4 3
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 1 1 1
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 2 4 2
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 4 2 4
The idea is that we are drawing new samples such that a quarter of the time we choose 1, a quarter of the time we choose 2, and so on. This gives us an indication of the variability in the sampling distribution, using only the information in a single sample.
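As a quick check (my addition, with an arbitrary seed), drawing many values this way should give each of 1, 2, 3, 4 about a quarter of the time:

set.seed(5)                                  # arbitrary seed, for reproducibility
big <- sample(c(1,2,3,4), size=10000, replace=TRUE)
table(big) / 10000                           # each proportion should be near 0.25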
Now let’s draw a sample of the price of diamonds from the diamonds data set. I’ll use a sample size of 64. (We certainly can’t use a sample size of 1, and we probably don’t want it to be too small, although it would be an interesting experiment to see what happens with small sample sizes.)
library(ggplot2)
data(diamonds)
sample.size <- 64
set.seed(2)
dia64 <- sample(diamonds$price, size=sample.size)
dia64
## [1] 4702 1006 745 4516 2318 2316 4150 1635 13853 706 709
## [12] 5368 1185 4661 9918 1728 2548 5183 12148 3528 911 9116
## [23] 1651 4350 7644 645 582 7983 2432 4175 2832 4484 1436
## [34] 1810 18736 842 1683 6164 476 4348 2589 6405 3998 4472
## [45] 2315 1356 2530 7701 646 1436 2811 2862 957 2207 5976
## [56] 1438 1299 552 814 383 1232 1898 837 5702
We are going to create a bootstrap distribution for the sample mean using this sample, by taking 1000 resamples (with replacement) from it:
set.seed(3)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(dia64, size=sample.size, replace=TRUE))
}
head(bs1000)
## [1] 4043.219 2859.844 3628.453 2976.828 2906.266 4646.203
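As a side check (my addition, not something we computed above), the spread of these bootstrap means should roughly agree with the formula standard error from the Central Limit Theorem method:

sd(bs1000)                        # bootstrap estimate of the standard error
sd(dia64) / sqrt(sample.size)     # formula standard error from the single sample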
Now we are going to get 1000 samples of size 64 drawn directly from the full diamonds data set (no bootstrap), which gives us 1000 means from the true sampling distribution. This is obviously better, but it requires drawing 1000 separate samples from the population. If that were very expensive (e.g. we had to run 1000 different random surveys), we might not be able to do it.
set.seed(4)
dp1000 <- c()
for (i in 1:1000) {
dp1000[i] = mean(sample(diamonds$price, size=sample.size))
}
head(dp1000)
## [1] 3599.203 4380.703 4443.984 3937.844 3518.172 4301.484
Now let’s display the two side by side.
df <- data.frame(stats = c(dp1000, bs1000),
                 labels = rep(c('population', 'bootstrap'), each = 1000))
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot()
Here is the claim: even though the means are different, the spread of the two distributions should be roughly the same, so that for 95% of samples the population mean will lie between the 2.5th percentile and the 97.5th percentile of the bootstrap distribution.
lo <- quantile(bs1000, probs=0.025)
hi <- quantile(bs1000, probs=0.975)
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=lo,color="green") + geom_hline(yintercept=hi,color="green")+ geom_hline(yintercept=mean(diamonds$price),color="black")
The black line is the mean of the entire population (not of the 1000 samples); the green lines are the percentiles that define the confidence interval for our original sample of 64: they are the 2.5th and 97.5th percentiles of the 1000 bootstrap sample means.
Note that our 95% confidence interval is one of the 95% of intervals that cover the true value of the parameter (the population mean).
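If you want to convince yourself of that “95% of samples” claim, here is a sketch (my addition; the seed and the 200 repeats are arbitrary choices, kept small so it runs quickly) that repeats the whole procedure and counts how often the bootstrap interval covers the true mean:

set.seed(6)                                   # arbitrary seed
true.mean <- mean(diamonds$price)
covers <- replicate(200, {                    # 200 repeats of the whole procedure
  s <- sample(diamonds$price, size=64)        # a fresh sample of 64 prices
  bs <- replicate(1000, mean(sample(s, size=64, replace=TRUE)))
  q <- quantile(bs, probs=c(0.025, 0.975))
  q[1] <= true.mean & true.mean <= q[2]       # did the interval cover the truth?
})
mean(covers)                                  # should come out near 0.95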
I’ll condense the steps into two: (1) obtain the sample you plan to use, then (2) compute the confidence interval.
For (1) we will use subpopulations of diamonds:
set.seed(10)
sample.size <- 64
subpop <- subset(diamonds, color=="J")$price
mysample <- sample(subpop, size=sample.size)
Now for (2), compute the confidence interval:
set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))  # resample from mysample
}
confidence.level <- 0.95
lowerlevel <- (1-confidence.level)/2
upperlevel <- confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci
## 2.5% 97.5%
## 2761.855 4389.525
As a third, optional step, we can plot the results. Unlike what we did above, we won’t plot the correct answer, because we won’t assume we know it.
df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")
Wikipedia says the following about this (percentile bootstrap) method: “This method can be applied to any statistic. It will work well in cases where the bootstrap distribution is symmetrical and centered on the observed statistic[27] and where the sample statistic is median-unbiased and has maximum concentration (or minimum risk with respect to an absolute value loss function). In other cases, the percentile bootstrap can be too narrow.[citation needed] When working with small sample sizes (i.e., less than 50), the percentile confidence intervals for (for example) the variance statistic will be too narrow. So that with a sample of 20 points, 90% confidence interval will include the true variance only 78% of the time[28].”
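Here is a sketch of how you could check that small-sample warning yourself (my addition, using a Normal population with variance 1 and an arbitrary seed):

set.seed(7)                                    # arbitrary seed
true.var <- 1                                  # variance of the Normal population
covers <- replicate(500, {
  s <- rnorm(20)                               # a sample of 20 points
  bs <- replicate(1000, var(sample(s, size=20, replace=TRUE)))
  q <- quantile(bs, probs=c(0.05, 0.95))       # 90% percentile interval
  q[1] <= true.var & true.var <= q[2]
})
mean(covers)                                   # typically comes out well below 0.90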
Try defining a sample using the “cut” characteristic of diamonds and repeating the confidence-interval calculation. Based on your samples, are “Fair” or “Premium” diamonds more expensive? Increase the sample size until the confidence intervals don’t overlap. About how large a sample size do you need (try 1, 4, 16, 64, 256, 1024)?
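To get you started, here is a sketch of one way to set this up (my code, with an arbitrary seed; boot.ci is a helper I’m defining here, and this is not a worked answer):

set.seed(12)                                     # arbitrary seed
sample.size <- 64                                # try 1, 4, 16, 64, 256, 1024
fair <- sample(subset(diamonds, cut=="Fair")$price, size=sample.size)
premium <- sample(subset(diamonds, cut=="Premium")$price, size=sample.size)
boot.ci <- function(x, level=0.95) {
  bs <- replicate(1000, mean(sample(x, size=length(x), replace=TRUE)))
  quantile(bs, probs=c((1-level)/2, level + (1-level)/2))
}
boot.ci(fair)
boot.ci(premium)                                 # do the two intervals overlap?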