In this lab we treat the following Pew Research finding as the truth about the population of US adults: 62% say climate change is affecting their local community.
For easier computation we work with a synthetic population of 100,000 adults where 62,000 answer “Yes” and 38,000 answer “No”.
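# Packages used throughout: tidyverse (tibble, dplyr, ggplot2) and infer (bootstrapping)
library(tidyverse)
library(infer)
us_adults <- tibble(
  climate_change_affects = c(
    rep("Yes", 62000),
    rep("No", 38000)
  )
)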
us_adults %>%
count(climate_change_affects) %>%
mutate(p = n / sum(n))
## # A tibble: 2 × 3
## climate_change_affects n p
## <chr> <int> <dbl>
## 1 No 38000 0.38
## 2 Yes 62000 0.62
ggplot(us_adults, aes(x = climate_change_affects)) +
geom_bar() +
labs(
x = "",
y = "",
title = "Do you think climate change is affecting your local community?"
) +
coord_flip()
The output above confirms that 62% of the population said “Yes” and 38% said “No”.
Now we pretend we do not know the population and only get to see a simple random sample of 60 adults.
n <- 60
samp <- us_adults %>%
sample_n(size = n)
samp
## # A tibble: 60 × 1
## climate_change_affects
## <chr>
## 1 Yes
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
## 7 No
## 8 Yes
## 9 No
## 10 Yes
## # ℹ 50 more rows
What percent of the adults in your sample think climate change affects their local community?
We can compute the sample proportion who say “Yes”.
samp_summary <- samp %>%
count(climate_change_affects) %>%
mutate(p_hat = n / sum(n))
samp_summary
## # A tibble: 2 × 3
## climate_change_affects n p_hat
## <chr> <int> <dbl>
## 1 No 23 0.383
## 2 Yes 37 0.617
Answer (Exercise 1):
The table above shows the number and proportion of “Yes” and “No”
responses in my sample of 60. The p_hat value in the row for “Yes” is
0.617, so 37 of the 60 adults in this sample (about 61.7%) think
climate change affects their local community. This value is my
point estimate of the unknown population proportion.
Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?
Answer (Exercise 2):
I would not expect another student’s sample proportion to be exactly the
same as mine. We are both taking random samples of 60 adults, so by
chance our samples will include different people and slightly different
mixes of “Yes” and “No” responses. However, because we are sampling from
the same population where the true proportion is 0.62, I would expect
our sample proportions to be similar overall and
usually fairly close to 0.62, not wildly different from each other.
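The spread described in this answer can be simulated directly. The sketch below is a base R illustration and not part of the original lab: it uses `rhyper()` to count the “Yes” answers in many simple random samples of 60 drawn without replacement from the 62,000/38,000 population. The seed and the number of simulated samples (5,000) are arbitrary choices made for this example.

```r
# Illustrative sketch: how much do sample proportions vary across samples of 60?
set.seed(1)                       # arbitrary seed, for reproducibility only
n_samples <- 5000                 # assumed number of simulated samples

# Number of "Yes" answers in each simple random sample of 60
# (62,000 "Yes" and 38,000 "No" in the population, drawn without replacement)
yes_counts <- rhyper(n_samples, m = 62000, n = 38000, k = 60)
p_hats <- yes_counts / 60

mean(p_hats)                      # centre of the sample proportions, near 0.62
sd(p_hats)                        # typical distance from 0.62, roughly 0.06
mean(abs(p_hats - 0.62) <= 0.10)  # share of samples within 10 points of 0.62
```

Two students' estimates will therefore rarely match exactly, yet most land within about ten percentage points of 0.62.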
We now build a 95% confidence interval for the
population proportion of US adults who think climate change affects
their local community using bootstrapping with the
infer package.
ci_95 <- samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
ci_95
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.5 0.733
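To make the pipeline less of a black box, here is a minimal base R sketch of a percentile bootstrap interval. It is an illustration under stated assumptions rather than infer's actual implementation: `boot_ci_prop()` is a hypothetical helper, and `samp_vec` rebuilds the sample from the counts shown earlier (37 “Yes”, 23 “No”).

```r
# Hypothetical helper: percentile bootstrap CI for a proportion
boot_ci_prop <- function(x, success = "Yes", reps = 1000, level = 0.95) {
  # Resample the data with replacement and record the proportion each time
  boot_props <- replicate(
    reps,
    mean(sample(x, size = length(x), replace = TRUE) == success)
  )
  # The interval is the middle `level` share of the bootstrap proportions
  alpha <- 1 - level
  quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(606)                                  # arbitrary seed
samp_vec <- c(rep("Yes", 37), rep("No", 23))   # counts from the sample above
ci_by_hand <- boot_ci_prop(samp_vec)
ci_by_hand
```

The bounds should land close to the infer result, though not identical to it, because the resamples are random.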
In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?
Answer (Exercise 3):
A 95% confidence level means that if we repeated this entire
process many times (taking a new random sample of 60 and building
a new bootstrap confidence interval from each sample), then
about 95% of those intervals would contain the true population
proportion. Any single interval either does or does not contain
the true value; the 95% describes the long-run success rate of the
method, not the probability for one specific interval after it has been
calculated.
Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
The true population proportion in our synthetic population is 0.62. We can check whether it falls between the lower and upper bounds of the interval we just computed.
true_p <- 0.62
ci_95 %>%
mutate(
captures_true_p = (lower_ci <= true_p & upper_ci >= true_p)
)
## # A tibble: 1 × 3
## lower_ci upper_ci captures_true_p
## <dbl> <dbl> <lgl>
## 1 0.5 0.733 TRUE
Answer (Exercise 4):
My 95% confidence interval runs from 0.5 to 0.733, which includes 0.62,
so captures_true_p is TRUE and my interval does capture the true
population proportion. Had it been FALSE, this particular interval
would have been one of the few (about 5% in the long run) that miss
the true value. Similarly, each classmate’s
interval may or may not contain 0.62, but overall we would expect
roughly 95% of everyone’s 95% intervals to capture the true
proportion.
Next we explore what happens when we construct many confidence intervals.
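We will draw 50 random samples of size 60 from the population, build a 95% bootstrap confidence interval from each sample, and record whether each interval captures the true proportion.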
set.seed(606)
B <- 50 # number of intervals
n_boot <- 1000 # number of bootstrap resamples for each interval
cis_95 <- tibble(
sim = 1:B,
lower_ci = NA_real_,
upper_ci = NA_real_
)
for (i in 1:B) {
  # Draw a fresh sample of size n from the population
  samp_i <- us_adults %>%
    sample_n(size = n)
  # Bootstrap a 95% interval from this sample
  ci_i <- samp_i %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = n_boot, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.95)
  cis_95$lower_ci[i] <- ci_i$lower_ci
  cis_95$upper_ci[i] <- ci_i$upper_ci
}
# Did each interval capture the true p?
cis_95 <- cis_95 %>%
mutate(captures_true_p = (lower_ci <= true_p & upper_ci >= true_p))
# Proportion of intervals that capture the true p
prop_capture_95 <- mean(cis_95$captures_true_p)
prop_capture_95
## [1] 1
We can also visualize all 50 intervals on one plot.
ggplot(cis_95, aes(x = sim, ymin = lower_ci, ymax = upper_ci)) +
geom_linerange() +
geom_hline(yintercept = true_p, linetype = "dashed") +
labs(
x = "Interval index",
y = "Proportion saying 'Yes'",
title = "50 bootstrap 95% confidence intervals"
)
Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population proportion? Why?
Answer (Exercise 5):
Because we are using a 95% confidence level, we expect
the method to produce intervals that include the true
population proportion about 95% of the time. So if many students each
compute a 95% confidence interval from independent random samples, we
would expect around 95% of those intervals to capture
the true population proportion. A few intervals (around 5%) will miss
the true value purely due to sampling variability.
Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed, what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.
Answer (Exercise 6):
The value printed above as prop_capture_95 is 1, so all 50 of my
intervals contain the true proportion 0.62, and every interval in the
plot above crosses the dashed line. With a 95% confidence level we
expect this number to be close to 0.95, but it cannot be exactly 0.95
here: with 50 intervals the capture proportion moves in steps of 0.02
(47 of 50 is 0.94, 48 of 50 is 0.96). More generally, random sampling
variation means that a few more or a few fewer intervals than
expected will capture the true value, even though the long-run capture
rate of the method is about 95%.
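A rough calculation shows how much the capture proportion can wander with only 50 intervals. The sketch assumes each interval independently captures the true proportion with probability exactly 0.95, which is an idealization, since bootstrap percentile intervals only approximately achieve their nominal level. The variable names are made up for this example.

```r
n_intervals <- 50    # intervals constructed
conf_level  <- 0.95  # assumed per-interval capture probability

# Chance that every one of the 50 intervals captures the true proportion
p_all_capture <- dbinom(n_intervals, size = n_intervals, prob = conf_level)

# Standard deviation of the observed capture proportion
sd_capture <- sqrt(conf_level * (1 - conf_level) / n_intervals)

round(c(p_all_capture = p_all_capture, sd_capture = sd_capture), 3)
```

Under that assumption, a run in which all 50 intervals capture 0.62 happens roughly 8% of the time, and the capture proportion typically lands within about 0.03 of 0.95, so a result of 1 is somewhat unusual but not alarming.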
Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.
Answer (Exercise 7):
Suppose I choose a 90% confidence level. A 90% interval
should be narrower than a 95% interval because it does
not need to capture the true parameter as often. To achieve a higher
confidence level we must stretch the interval farther in both
directions, which makes intervals wider. Reducing the confidence level
lets us shorten the interval but accept a lower long run capture
rate.
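The link between confidence level and width is easiest to see with the normal-approximation interval, p_hat ± z* × SE. This is a different technique from the bootstrap used in this lab; it is swapped in here only because its width has a closed form. The sample proportion 37/60 is taken from the sample shown earlier, and the variable names are illustrative.

```r
p_hat <- 37 / 60                              # sample proportion from the lab's sample
n_obs <- 60
se    <- sqrt(p_hat * (1 - p_hat) / n_obs)    # standard error of p_hat

conf_levels <- c(0.90, 0.95, 0.99)
z_star <- qnorm(1 - (1 - conf_levels) / 2)    # critical values: about 1.64, 1.96, 2.58
widths <- 2 * z_star * se                     # full width of each interval

data.frame(level = conf_levels, z_star = round(z_star, 3), width = round(widths, 3))
```

The multiplier z* grows with the confidence level while the standard error stays fixed, so the widths increase from about 0.21 at 90% to about 0.32 at 99%.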
Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.
We construct a 90% confidence interval from our original sample.
ci_90 <- samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.90)
ci_90
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.517 0.717
Answer (Exercise 8):
Based on my sample of 60 adults, the 90% bootstrap confidence interval
for the proportion who think climate change affects their local
community runs from 0.517 to 0.717. I am 90%
confident that the true proportion of all US adults who would answer
“Yes” falls between these two values. As expected, this interval
(width 0.200) is narrower than my 95% interval of 0.5 to 0.733
(width 0.233).
Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?
Instead of using the app, we can use code very similar to what we wrote earlier, but now using a 90% confidence level.
cis_90 <- tibble(
sim = 1:B,
lower_ci = NA_real_,
upper_ci = NA_real_
)
for (i in 1:B) {
  # Draw a fresh sample of size n from the population
  samp_i <- us_adults %>%
    sample_n(size = n)
  # Bootstrap a 90% interval from this sample
  ci_i <- samp_i %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = n_boot, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.90)
  cis_90$lower_ci[i] <- ci_i$lower_ci
  cis_90$upper_ci[i] <- ci_i$upper_ci
}
cis_90 <- cis_90 %>%
mutate(captures_true_p = (lower_ci <= true_p & upper_ci >= true_p))
prop_capture_90 <- mean(cis_90$captures_true_p)
prop_capture_90
## [1] 0.88
ggplot(cis_90, aes(x = sim, ymin = lower_ci, ymax = upper_ci)) +
geom_linerange() +
geom_hline(yintercept = true_p, linetype = "dashed") +
labs(
x = "Interval index",
y = "Proportion saying 'Yes'",
title = "50 bootstrap 90% confidence intervals"
)
Answer (Exercise 9):
The proportion of 90% intervals that contain the true proportion,
prop_capture_90, is 0.88 (44 of 50 intervals), which is close to 0.90.
As with the 95% intervals, this number will not usually be exactly
equal to the confidence level because we are only drawing a finite
number of samples and intervals. Here the expected count is 45 of 50,
so capturing 44 is well within ordinary sampling variation and is
consistent with the chosen confidence level.
Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.
Here we choose a 99% confidence level.
ci_99 <- samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.99)
ci_99
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.45 0.767
cis_99 <- tibble(
sim = 1:B,
lower_ci = NA_real_,
upper_ci = NA_real_
)
for (i in 1:B) {
  # Draw a fresh sample of size n from the population
  samp_i <- us_adults %>%
    sample_n(size = n)
  # Bootstrap a 99% interval from this sample
  ci_i <- samp_i %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = n_boot, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.99)
  cis_99$lower_ci[i] <- ci_i$lower_ci
  cis_99$upper_ci[i] <- ci_i$upper_ci
}
cis_99 <- cis_99 %>%
mutate(captures_true_p = (lower_ci <= true_p & upper_ci >= true_p))
prop_capture_99 <- mean(cis_99$captures_true_p)
prop_capture_99
## [1] 1
Answer (Exercise 10):
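A 99% confidence interval should be wider than both the
95% and the 90% intervals, because it must stretch farther to capture
the true proportion in 99% of repeated samples. The bounds printed in
ci_99, 0.45 to 0.767 (width 0.317), confirm this: the 95% interval had
width 0.233 and the 90% interval had width 0.200. I am 99% confident
that the true proportion of US adults who would answer “Yes” lies
between 0.45 and 0.767. The value prop_capture_99 is 1, so all 50 of
the 99% intervals include the true proportion; with many more
intervals this number should settle close to 0.99.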
Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).
Answer (Exercise 11):
As the sample size increases, the confidence intervals
become narrower. Larger samples give more information
about the population and reduce the standard error, so our estimate is
more precise. When the sample size decreases, the intervals become
wider, reflecting greater uncertainty when we have less
data.
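The effect of sample size can be made concrete with the standard error formula for a sample proportion, sqrt(p(1 - p) / n). The sketch below uses the population value p = 0.62 from this lab; the sample sizes and the normal-approximation width are assumptions chosen for illustration, not output from the app.

```r
p <- 0.62                                  # true proportion in the synthetic population
sample_sizes <- c(15, 60, 240, 960)        # assumed sizes, each 4x the previous

se <- sqrt(p * (1 - p) / sample_sizes)     # standard error of the sample proportion
approx_width_95 <- 2 * qnorm(0.975) * se   # normal-approximation 95% width

data.frame(n = sample_sizes, se = round(se, 4), width_95 = round(approx_width_95, 3))
```

Each fourfold increase in sample size halves the standard error, and with it the approximate interval width.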
Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Hint: Does changing the number of bootstrap samples affect the standard error?
Answer (Exercise 12):
For a fixed sample size, increasing the number of bootstrap resamples
(for example from 500 to 1000 or 2000) does not systematically
change the true width of the confidence interval. The standard
error is determined by the variability in the original sample, not by
how many bootstrap resamples we draw. Using more bootstrap samples
simply makes the bootstrap distribution smoother and the estimated
interval bounds a bit more stable from run to run, but the typical width
of the interval stays about the same.
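This claim can be checked with a small simulation. The sketch below is a base R illustration, not the infer pipeline: `boot_width()` is a hypothetical helper, `samp_vec` rebuilds the sample from its counts (37 “Yes”, 23 “No”), and the grid of resample counts and the 30 repetitions per setting are arbitrary choices.

```r
# Hypothetical helper: width of a percentile bootstrap CI for a given number of resamples
boot_width <- function(x, reps, level = 0.95) {
  boot_props <- replicate(reps, mean(sample(x, length(x), replace = TRUE) == "Yes"))
  alpha <- 1 - level
  unname(diff(quantile(boot_props, c(alpha / 2, 1 - alpha / 2))))
}

set.seed(606)                                  # arbitrary seed
samp_vec <- c(rep("Yes", 37), rep("No", 23))   # counts from the lab's sample
reps_grid <- c(100, 1000, 5000)

# Repeat each setting 30 times to see the typical width and its run-to-run spread
widths <- sapply(reps_grid, function(r) replicate(30, boot_width(samp_vec, r)))
colnames(widths) <- paste0("reps_", reps_grid)

round(rbind(mean_width = colMeans(widths), sd_width = apply(widths, 2, sd)), 3)
```

The mean width should be nearly the same in every column, while the run-to-run standard deviation of the width should shrink as the number of resamples grows.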