When we compute a confidence interval, we first compute an estimate of a parameter with a statistic.
For example, we draw a random sample then compute the sample mean to estimate the parameter.
With a single sample (or anything short of the whole population) we don’t know where the population mean lies, so we want to localize the population mean within an interval, computed from the data in the sample.
An interval is a range of values (e.g. all real numbers between 4801 and 6801: written [4801, 6801]). A confidence interval is a range of plausible intervals for the parameter. A 95% confidence interval for a parameter is an interval computed by a method guaranteed to successfully cover the parameter 95% of the time (i.e. for 95% of the samples).
We call 95% the “confidence level” for the parameter. Other confidence levels, anything between 0% and 100%, are possible.
As the confidence level increases, what happens to the width of the confidence interval?
As the confidence level increases the width of the confidence interval increases. With all things being eqaual. You are demanding more confidence so you have to cast a wider net.
As a follow up to the last question, will you be more confident that the population mean lies in a bigger interval, or less confident.
I will be more confident that it lies in a bigger interval because you have a high confidence level.
As the sample size increases, what happens to the width of the confidence interval?
Increasing the sample size decreases the width of the confidence interval because it decreases the standard error (you have more information).
In other words, will you be able to localize the population mean better (in a smaller interval) with a larger sample size?
Yes, you would be able to better localize the population mean with a larger sample size.
If the variablility in the population increases, what happens to the width of the confidence interval?
The confidence level (statement) is set in the beginning. As the variability increases for the same confidence level, the confidence interval increases because there is more uncertainty. You will have a wider width.
In other words, will you be able to localize the population mean better (in a smaller interval) with more variability in the population?
It would be more difficult to localize with a bigger interval.
According to the Central Limit Theorem, the sampling distribution for the sample mean has a Normal (bell shaped) distribution. Using this approximation, and the 68-95-99.7 Rule to assign a margin of error (half the width of the confidence interval, centered on sample mean) to be 2 times the standard error (a statistic that estimates the standard deviation of the sampling distribution). We did this in class the Friday before Thanksgiving.
Now I want to talk about another method of estimating confidence intervals. It’s called the “bootstrap”. Why is it called the bootstrap? Well, you take a single sample, and estimate the variability in the population by “pulling yourself up by your bootstraps”. (I am not making this up.)
We have only one sample, but we want to use it to get more samples from the sampling distribution of the same size. We do this by sampling from replacement. We are pulling things out of a hat and throwing them back before we draw the next number. Say we have a sample of prices of diamonds: 1, 2, 3, 4.
Now we draw new samples with the same distribution:
set.seed(1)
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 2 2 3 4
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 1 4 4 3
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 1 1 1
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 2 4 2
sample(c(1,2,3,4),size=4,replace=TRUE)
## [1] 3 4 2 4
The idea is that we are drawing new samples such that a quarter of the time we choose 1, a quarter of the time we choose 2, and so on. This gives us an indication of the variability in our sampling distribution, using just the information in a single sample and without drawing more than one sample.
Now let’s draw a sample of the price of diamonds from the diamonds data set. I’ll use a sample size of 64, (We certainly can’t use a sample size of 1, and we probably don’t want it to be too small, although it would be an interesting experiment to see what happens with small sample sizes.)
library(ggplot2)
data(diamonds)
sample.size <- 64
set.seed(2)
dia64 <- sample(diamonds$price, size=sample.size)
dia64
## [1] 4702 1006 745 4516 2318 2316 4150 1635 13853 706 709
## [12] 5368 1185 4661 9918 1728 2548 5183 12148 3528 911 9116
## [23] 1651 4350 7644 645 582 7983 2432 4175 2832 4484 1436
## [34] 1810 18736 842 1683 6164 476 4348 2589 6405 3998 4472
## [45] 2315 1356 2530 7701 646 1436 2811 2862 957 2207 5976
## [56] 1438 1299 552 814 383 1232 1898 837 5702
We are going to create a bootstrap distribution for the sample mean using this sample. We will take 1000 samples from this distribution:
set.seed(3)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(dia64, size=sample.size, replace=TRUE))
}
head(bs1000)
## [1] 4043.219 2859.844 3628.453 2976.828 2906.266 4646.203
Now we are going to get 1000 samples from the whole sampling distribution of diamonds (without a bootstrap). This is obviously better, but we need to draw 1000 samples from our population. If it were very expensive to do this (e.g. we have to run 1000 different random surveys), we might not be able to do this.
set.seed(4)
dp1000 <- c(0)
for (i in 1:1000) {
dp1000[i] = mean(sample(diamonds$price, size=sample.size))
}
head(dp1000)
## [1] 3599.203 4380.703 4443.984 3937.844 3518.172 4301.484
Now let’s display the two side by side.
labp <- cbind(rep('population',1000))
labb <- cbind(rep('bootstrap',1000))
labels <- rbind(labp, labb)
statp <- cbind(dp1000)
statb <- cbind(bs1000)
stats <- rbind(statp, statb)
df <- data.frame(stats=stats,labels=labels)
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot()
Here is the claim: even though the means are different, the spread in the two distributions should be roughly the same, so that 95% of the time, the population mean will lie between the 2.5th percentile and the 97.5th percentile.
lo <- quantile(bs1000, probs=0.025)
hi <- quantile(bs1000, probs=0.975)
ggplot(data=df,mapping=aes(x=labels,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=lo,color="green") + geom_hline(yintercept=hi,color="green")+ geom_hline(yintercept=mean(diamonds$price),color="black")
The black line is the mean of the entire population (not 1000 samples), the green lines are the percentiles that define the confidence interval for our original sample of 64; they percentiles of the 1000 bootstrap samples.
Note that our 95% confidence interval is one of the 95% that cover the true value of the parameter (the population mean).
I’ll condense the steps into two: (1) obtain the sample you plan to use, then (2) compute the confidence interval.
For (1) we will use subpopulations of diamonds:
set.seed(15)
sample.size <- 256
subpop <- subset(diamonds, cut=="Premium")$price
mysample <- sample(subpop, size=sample.size)
Now (2) compute the CI
set.seed(11)
bs1000 <- c()
for (i in 1:1000) {
bs1000[i] <- mean(sample(mysample, size=sample.size, replace=TRUE))
}
confidence.level = 0.95
lowerlevel = (1-confidence.level)/2
upperlevel = confidence.level + (1-confidence.level)/2
ci <- quantile(bs1000, probs = c(lowerlevel, upperlevel))
ci
## 2.5% 97.5%
## 3832.746 4854.722
As a third, extra step, we can plot results, but unlike what we did above, we won’t plot the correct answer, because we won’t assume we know the correct answer.
df <- data.frame(stats=bs1000)
ggplot(data=df,mapping=aes(x=0,y=stats)) + geom_violin() + geom_boxplot() + geom_hline(yintercept=ci[1],color="green") + geom_hline(yintercept=ci[2],color="green")
Found on Wikipedia concerning this method: “This method can be applied to any statistic. It will work well in cases where the bootstrap distribution is symmetrical and centered on the observed statistic[27] and where the sample statistic is median-unbiased and has maximum concentration (or minimum risk with respect to an absolute value loss function). In other cases, the percentile bootstrap can be too narrow.[citation needed] When working with small sample sizes (i.e., less than 50), the percentile confidence intervals for (for example) the variance statistic will be too narrow. So that with a sample of 20 points, 90% confidence interval will include the true variance only 78% of the time[28].”
Try defining a sample with the “cut” characteristic of diamonds and repeating the calculation of confidence interval. Based on samples are “fair” or “premium” diamonds more expensive? Change the sample size until confidence intervals don’t overlap. About how large a sample size do you need (try 1, 4, 16, 64, 256, 1024)?
For 1024 premium - (4238, 4782) The confidence intervals overlap. The confidence intervals of the fair diamonds (4159, 4593) also overlap. With that sample size we cannot determine which is more expensive.
For 1 - Fair (398.65, 15248.17) there is no overlap Premium is (18, 684.025)
For 4 Premium (743.50, 3809.25) There is no overlap Fair (1597.75, 12305.50) There is no overlap
For 16 Fair (2153.916, 5428.575) There is no overlap Premium (1665.195, 4918)
For 64 Premium (3387.142, 5442.492) There is a slight overlap Fair (3154.305, 4763.531) There is a slight overlap
For 256 Fair (3803.904, 4666.397) There is a slight overlap Premium (3832.746, 4854.722) There is a slight overlap
Based on the results, Fair diamonds are more expensive for sample size 1, 4 and 16 and Premium diamonds are slightly more expensive for sample size 64, 256, and 1024.