When we compute a confidence interval, we first compute an estimate of a parameter with a statistic.
For example, we draw a random sample then compute the sample mean to estimate the parameter.
With a single sample (or anything short of the whole population) we don’t know exactly where the population mean lies, so we want to localize it within an interval computed from the data in the sample.
An interval is a range of values (e.g. all real numbers between 4801 and 6801, written [4801, 6801]). A confidence interval is a range of plausible values for the parameter. A 95% confidence interval for a parameter is an interval computed by a method guaranteed to cover the parameter 95% of the time (i.e. for 95% of the samples).
We call 95% the “confidence level” of the interval. Other confidence levels, anywhere between 0% and 100%, are possible.
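To see what that guarantee means, here is a small sketch: I make up a Normal population (mean 5801, standard deviation 1000; both numbers are invented), draw many samples of size 64, build the interval “sample mean plus or minus 2 standard errors” each time, and count how often the interval covers the known mean.
set.seed(10)                                  # the seed is arbitrary
true_mean <- 5801                             # made-up population mean
covers <- replicate(10000, {
  x <- rnorm(64, mean=true_mean, sd=1000)     # one sample from the made-up population
  se <- sd(x)/sqrt(64)                        # standard error of the sample mean
  abs(mean(x) - true_mean) <= 2*se            # TRUE if the interval covers the mean
})
mean(covers)                                  # should come out close to 0.95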
As the confidence level increases, what happens to the width of the confidence interval?
As the confidence level increases, the width of the confidence interval increases: to be more confident, we have to cast a wider net.
As a follow-up to the last question, will you be more confident that the population mean lies in a bigger interval, or less confident?
More confident. The bigger the interval, the larger the margin of error, and the more likely the interval is to cover the population mean.
As the sample size increases, what happens to the width of the confidence interval? The width of the confidence interval decreases.
In other words, will you be able to localize the population mean better (in a smaller interval) with a larger sample size?
Yes: a larger sample gives a smaller interval, so you can localize the population mean better.
If the variability in the population increases, what happens to the width of the confidence interval?
The width increases, because the margin of error grows with the variability in the population.
In other words, will you be able to localize the population mean better (in a smaller interval) with more variability in the population? No, you would not; more variability means a wider interval. The sketch below illustrates all three of these effects.
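These three answers all come out of the usual formula for the width of a Normal-approximation interval: 2 times a critical value (which grows with the confidence level) times the population standard deviation, divided by the square root of the sample size. Here is a small sketch; the function name and the values of sigma and n are made up for illustration.
ci_width <- function(level, sigma, n) {
  z <- qnorm(1 - (1 - level)/2)       # critical value for this confidence level
  2 * z * sigma / sqrt(n)             # width = 2 * z * sigma / sqrt(n)
}
ci_width(level=c(0.80, 0.95, 0.99), sigma=1000, n=64)   # wider as the confidence level grows
ci_width(level=0.95, sigma=1000, n=c(16, 64, 256))      # narrower as the sample size grows
ci_width(level=0.95, sigma=c(500, 1000, 2000), n=64)    # wider as the variability grows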
According to the Central Limit Theorem, the sampling distribution of the sample mean is approximately Normal (bell shaped). Using this approximation and the 68-95-99.7 Rule, we take the margin of error (half the width of the confidence interval, which is centered on the sample mean) to be 2 times the standard error (a statistic that estimates the standard deviation of the sampling distribution). We did this in class the Friday before Thanksgiving.
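Here is that recipe as a quick sketch (the numbers in x are made up, not diamond prices):
x <- c(12, 15, 9, 14, 11, 13, 10, 16)    # a made-up sample
se <- sd(x)/sqrt(length(x))              # standard error of the sample mean
c(mean(x) - 2*se, mean(x) + 2*se)        # margin of error = 2 times the standard error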
Now I want to talk about another method of estimating confidence intervals. It’s called the “bootstrap”. Why is it called the bootstrap? Well, you take a single sample, and estimate the variability in the population by “pulling yourself up by your bootstraps”. (I am not making this up.)
We have only one sample, but we want to use it to mimic drawing more samples of the same size from the sampling distribution. We do this by sampling with replacement: we pull numbers out of a hat and throw each one back before we draw the next. Say we have a sample of prices of diamonds: 1, 2, 3, 4.
Now we draw new samples with the same distribution:
set.seed(1)
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 2 2 3 4
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 1 4 4 3
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 1 1 1
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 2 4 2
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 4 2 4
The idea is that we are drawing new samples such that a quarter of the time we choose 1, a quarter of the time we choose 2, and so on. This gives us an indication of the variability in our sampling distribution, using just the information in a single sample and without drawing more than one sample.
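You can check that claim by drawing one very large resample and tabulating the proportions (the seed and the resample size here are arbitrary choices):
set.seed(5)
table(sample(c(1,2,3,4), size=100000, replace=TRUE)) / 100000   # each value should be near 0.25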
Now let’s draw a sample of the price of diamonds from the diamonds data set. I’ll use a sample size of 64. (We certainly can’t use a sample size of 1, and we probably don’t want it to be too small, although it would be an interesting experiment to see what happens with small sample sizes.)
library(ggplot2)
data(diamonds)
set.seed(2)
dia64 <- sample(diamonds$price, size=64)
dia64
## [1] 4702 1006 745 4516 2318 2316 4150 1635 13853 706 709
## [12] 5368 1185 4661 9918 1728 2548 5183 12148 3528 911 9116
## [23] 1651 4350 7644 645 582 7983 2432 4175 2832 4484 1436
## [34] 1810 18736 842 1683 6164 476 4348 2589 6405 3998 4472
## [45] 2315 1356 2530 7701 646 1436 2811 2862 957 2207 5976
## [56] 1438 1299 552 814 383 1232 1898 837 5702
We are going to create a bootstrap distribution for the sample mean using this sample. We will draw 1000 resamples (with replacement) from it and record the mean of each:
set.seed(3)
bs1000 <- numeric(1000)   # will hold 1000 bootstrap sample means
for (i in 1:1000) {
  bs1000[i] <- mean(sample(dia64, size=64, replace=TRUE))   # resample dia64 with replacement
}
head(bs1000)
## [1] 4043.219 2859.844 3628.453 2976.828 2906.266 4646.203
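The spread of these 1000 bootstrap means is the bootstrap’s estimate of the standard error of the sample mean, and it should land reasonably close to the formula estimate from class (a quick check; it isn’t needed for the percentile interval below):
sd(bs1000)            # bootstrap estimate of the standard error
sd(dia64)/sqrt(64)    # formula estimate of the standard error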
Now we are going to approximate the sampling distribution directly, by drawing 1000 samples of size 64 from the whole diamonds population (no bootstrap). This is obviously better, but it requires 1000 separate samples from the population. If each sample were expensive to collect (e.g. if it meant running a different random survey), we might not be able to afford that.
set.seed(4)
dp1000 <- numeric(1000)   # will hold the means of 1000 real samples of size 64
for (i in 1:1000) {
  dp1000[i] <- mean(sample(diamonds$price, size=64))   # a fresh sample from the population
}
}
head(dp1000)
## [1] 3599.203 4380.703 4443.984 3937.844 3518.172 4301.484
Now let’s display the two side by side.
# Stack the two sets of 1000 means into one data frame, labelled by their source.
df <- data.frame(stats = c(dp1000, bs1000),
                 labels = rep(c("population", "bootstrap"), each=1000))
ggplot(data=df, mapping=aes(x=labels, y=stats)) + geom_violin() + geom_boxplot()
Here is the claim: even though the two distributions are centered in different places (the bootstrap distribution is centered near our sample mean, the real sampling distribution near the population mean), their spreads should be roughly the same, so that 95% of the time the population mean will lie between the 2.5th and 97.5th percentiles of the bootstrap distribution.
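Before computing those percentiles, we can check the “roughly the same spread” part of the claim directly:
sd(dp1000)    # spread of the means of 1000 real samples
sd(bs1000)    # spread of the 1000 bootstrap means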
lo <- quantile(bs1000, probs=0.025)
hi <- quantile(bs1000, probs=0.975)
ggplot(data=df, mapping=aes(x=labels, y=stats)) +
  geom_violin() +
  geom_boxplot() +
  geom_hline(yintercept=lo, color="green") +
  geom_hline(yintercept=hi, color="green") +
  geom_hline(yintercept=mean(diamonds$price), color="black")
The black line is the mean of the entire population (not of 1000 samples); the green lines are the percentiles that define the confidence interval for our original sample of 64: they are the 2.5th and 97.5th percentiles of the 1000 bootstrap means.
Note that, since the black line falls between the green lines, our 95% confidence interval is one of the 95% of intervals that cover the true value of the parameter (the population mean).
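If you want to see the “95% of the samples” claim in action for the bootstrap method itself, here is a sketch that repeats the whole procedure (draw a fresh sample of 64, build the bootstrap percentile interval, check whether it covers mean(diamonds$price)) 200 times. The 200 repetitions and 1000 resamples per repetition are my choices; with data as skewed as prices and a sample of only 64, the coverage can come out a little below 95%.
set.seed(6)
covered <- replicate(200, {
  s <- sample(diamonds$price, size=64)                              # a fresh sample of 64
  bs <- replicate(1000, mean(sample(s, size=64, replace=TRUE)))     # its bootstrap means
  ci <- quantile(bs, probs=c(0.025, 0.975))                         # percentile interval
  ci[1] <= mean(diamonds$price) & mean(diamonds$price) <= ci[2]     # does it cover?
})
mean(covered)    # proportion of intervals that cover the population mean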