A confidence interval is a special type of interval estimator for a parameter. We used the sample mean and sample proportion as point estimators for the populaton mean and population proportion, respectively. Now, rather than using a single point to guess these parameters, we turn to using an interval.
Confidence intervals help convey our level of certainty. Say that we are interested in the proportion of cyclists who wear a helmet. A 95% CI of .88 to .92 shows we are much more certain about our estimate, as compared to a 95% CI of .81 to .99.
The level of confidence also affects the width of a confidence interval. The confidence level is the proportion that goes in the middle (when we draw our picture). A larger confidence level means smaller tails. As our confidence level increases, our confidence interval becomes wider. Intuitively, if we want to be sure we are hitting the parameter, we need to make our interval wide enough.
How do we interpret a confidence interval? Let’s use the example of the bike helmets. We could say we are 95% confident that the true proportion of cyclists who wear helmets is between .88 and .92.
Today we will use the cabbages data set, which is built into the R package MASS. Before diving in, you may need to first download the MASS package (Tools, Install Packages); you only need to do this the first time. Once it’s downloaded, you can call up the MASS package. You only need to call the package once per session.
library(MASS)
data(cabbages)
Run some of the usual commands. What does each command tell you?
dim(cabbages)
## [1] 60 4
names(cabbages)
## [1] "Cult" "Date" "HeadWt" "VitC"
str(cabbages)
## 'data.frame': 60 obs. of 4 variables:
## $ Cult : Factor w/ 2 levels "c39","c52": 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Factor w/ 3 levels "d16","d20","d21": 1 1 1 1 1 1 1 1 1 1 ...
## $ HeadWt: num 2.5 2.2 3.1 4.3 2.5 4.3 3.8 4.3 1.7 3.1 ...
## $ VitC : int 51 55 45 42 53 50 50 52 56 49 ...
summary(cabbages)
## Cult Date HeadWt VitC
## c39:30 d16:20 Min. :1.000 Min. :41.00
## c52:30 d20:20 1st Qu.:1.875 1st Qu.:50.75
## d21:20 Median :2.550 Median :56.00
## Mean :2.593 Mean :57.95
## 3rd Qu.:3.125 3rd Qu.:66.25
## Max. :4.300 Max. :84.00
Here is how you can take a sample of size 10 of the cabbage head weights, with replacement.
sample(cabbages$HeadWt, size=10, replace= TRUE)
## [1] 2.7 2.5 2.3 4.2 3.1 2.8 3.0 3.2 1.7 1.5
Re-run that a couple times to remind yourself there is randomness.
You know that you want to procure a sample that is of the same size as the original sample. Therefore, you need to find the sample size of the original data set. You can find the sample size by finding the length of the variable.
(sampsize <- length(cabbages$HeadWt))
## [1] 60
Equivalently, you can find the number of rows of the original data set. Whatever floats your boat.
nrow(cabbages)
## [1] 60
Next, lay the groundwork. Decide how many random samples you want to take: we’ll call that m, the Monte Carlo sample size. Then create a vector in which you’ll store your simulated means. I filled this vector with zeros so that I can easily recognize any potential issues (because the mean weight of cabbage heads should be strictly positive).
m <- 10^4
mymeans <- rep(0, m)
Now we are ready to repeatedly sample and calculate sample means from the simulated sample. The following loop performs two steps iteratively: it takes a sample of size 60 with replacement, stores it as mysamp, calculates the mean, and stores the mean in the ith entry of the vector mymeans.
for(i in 1:m){
mysamp <- sample(cabbages$HeadWt, size=sampsize, replace= TRUE)
mymeans[i] <- mean(mysamp)
}
You now have a simulated sampling distribution. You can now look at it and learn from it. Here is a histogram of your sampling distribution.
hist(mymeans)
Do a quick check: does this make sense with our original data? Is the sampling distribution centered in the “right” place?
To construct a 95 percent confidence interval for the mean headweight, you can use the quantile command. The first argument is the vector of means and the second argument says what proportion we want below a certain cutoff.
quantile(mymeans, .025)
## 2.5%
## 2.375
quantile(mymeans, .975)
## 97.5%
## 2.815
This says that 2.5 percent of means are below 2.375 and 97.5 percent of means are below 2.815. If you want to get two quantiles at once, you can use a vector as the second argument, as shown below.
quantile(mymeans, c(.025, .975))
## 2.5% 97.5%
## 2.375 2.815
Our simulated sampling distribution looked pretty normally distributed. Do you think that a confidence interval constructed using the t-distribution would be similar? Construct it and see! Why does this make sense?
If you re-run the loop above, you will get different answers. To get the same answers next time you run it, you need to set the random seed.
set.seed(1234)
for(i in 1:m){
mysamp <- sample(cabbages$HeadWt, size=sampsize, replace= TRUE)
mymeans[i] <- mean(mysamp)
}
quantile(mymeans, c(.025, .975))
## 2.5% 97.5%
## 2.370 2.815
Run this chunk a couple times to convince yourself the resulting bootstrap confidence intervals are the same every time.
You can subset the data-set to isolate certain observations. For example, here I’ve found just the cabbages planted on date d16.
newdata <- subset(cabbages, Date =="d16" )
dim(newdata)
## [1] 20 4
Use this mini data set to find the mean headweight of cabbages planted on d16. Create a 85 percent bootstrap confidence interval for this mean.
Hint: make sure you sample from
newdata$HeadWt
## [1] 2.5 2.2 3.1 4.3 2.5 4.3 3.8 4.3 1.7 3.1 2.0 2.4 1.9 2.8 1.7 3.2 2.0 2.2 2.2
## [20] 2.2
rather from than ALL of the cabbage data.
Another hint: make sure you use the correct sample size. That is, find the number of observations in newdata, not in the original cabbage data-set.
Find a 90 percent bootstrap confidence interval for the mean ascorbic acid content of cabbages planted on d21.
Use the crabs data from the MASS package to construct a 95 percent confidence interval for crabs’ mean carapace width.
data(crabs)
summary(crabs)
## sp sex index FL RW CL
## B:100 F:100 Min. : 1.0 Min. : 7.20 Min. : 6.50 Min. :14.70
## O:100 M:100 1st Qu.:13.0 1st Qu.:12.90 1st Qu.:11.00 1st Qu.:27.27
## Median :25.5 Median :15.55 Median :12.80 Median :32.10
## Mean :25.5 Mean :15.58 Mean :12.74 Mean :32.11
## 3rd Qu.:38.0 3rd Qu.:18.05 3rd Qu.:14.30 3rd Qu.:37.23
## Max. :50.0 Max. :23.10 Max. :20.20 Max. :47.60
## CW BD
## Min. :17.10 Min. : 6.10
## 1st Qu.:31.50 1st Qu.:11.40
## Median :36.80 Median :13.90
## Mean :36.41 Mean :14.03
## 3rd Qu.:42.00 3rd Qu.:16.60
## Max. :54.60 Max. :21.60
attach(crabs)
Use the crabs data from the MASS package to construct a 95 percent confidence interval for female crabs’ mean rear width.
Use the crabs data from the MASS package to construct a 95 percent confidence interval for male crabs’ mean rear width.
The code below calculate the mean rear width for females and for males, respectively.
mean(RW[sex == "F"])
## [1] 13.487
mean(RW[sex == "M"])
## [1] 11.99
It looks like females have wider rears than males. In other words, the females’ mean minus the males’ mean is positive:
mean(RW[sex == "F"]) - mean(RW[sex == "M"])
## [1] 1.497
Adapt your bootstrap loop to calculate both the mean for males and the mean for females. For each iteration of the loop, record the difference in the two simulated means (female mean minus male mean). Use this to construct a 95 percent confidence interval that gives us an idea of how much wider females’ rears are compared to males’ rears, on average.