Foundations for statistical inference - Confidence intervals


Getting Started


Load packages

We are operating in the tidyverse.

# Load packages ----------------------------------------------------------------
library(tidyverse)
library(openintro)
library(infer)  # Needed for resampling
library(shiny)  # Needed for the shiny app

# Set image dimensions
imgheight <- 700
imgwidth <- 700



The data

Roughly six-in-ten U.S. adults (62%) say climate change is currently affecting their local community either a great deal or some, according to a new Pew Research Center survey.

Source: Most Americans say climate change impacts their community, but effects vary by region

To have data to work with, we create a population of 100,000 adults in which 62% believe “climate change affects their community”.

# Create data
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)

# Plot bar chart of data
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip() 



Exercises


Exercise 1

What percent of the adults in your sample think climate change affects their local community?

58.33% of the adults in my sample (35 out of 60) think climate change affects their local community.

n <- 60
set.seed(1493)
samp <- us_adults %>%
  sample_n(size = n)
samp %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                        25 0.417
## 2 Yes                       35 0.583



Exercise 2

Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?

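No, I would not expect it to be identical, but I would expect it to be similar. Each student draws a different random sample, so the sample proportions differ, but samples of the same size tend to fall in a band of nearby proportions centered on the true population proportion. The band is wider for smaller sample sizes.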
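A quick simulation can make this concrete. The sketch below assumes the same population as above (62% "Yes") and uses base R only; the number of repeated samples (1000) and the seed are arbitrary choices for illustration, not part of the lab.

```r
# Sketch: how much do sample proportions vary across samples of size 60?
# Assumes the same population as above: 62,000 "Yes" and 38,000 "No".
set.seed(1)  # arbitrary seed, only for reproducibility
population <- c(rep("Yes", 62000), rep("No", 38000))
n <- 60

# Draw 1000 independent samples and record the proportion of "Yes" in each
p_hats <- replicate(1000, mean(sample(population, size = n) == "Yes"))

mean(p_hats)                       # close to the population proportion 0.62
sd(p_hats)                         # spread of the sample proportions
quantile(p_hats, c(0.025, 0.975))  # range holding the middle 95% of samples
```

Two students' proportions are both draws from this distribution, so they will usually be near each other without being equal.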



Exercise 3

In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?

“95% confident” means that if we took many samples from the population and built a confidence interval from each one in the same way, about 95% of those intervals would contain the true population proportion.



Exercise 4

Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community?

Yes. The true population proportion of 0.62 falls within my 95% confidence interval of (0.45, 0.7), as does my sample proportion of 58.33%.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     0.45      0.7



Exercise 5

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

I would expect about 95% of the confidence intervals to capture the true population proportion. This follows from the definition of a 95% confidence level. With a 90% confidence level the intervals would be narrower and about 90% of them would capture the true proportion; with a 99% confidence level the intervals would be wider and about 99% of them would capture it.
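The expected coverage can be checked by simulation. The sketch below is a base R stand-in for the infer pipeline, not the lab's own code: it assumes a true proportion of 0.62, treats each student's sample as a binomial draw (a close approximation when sampling 60 from 100,000), and builds a percentile bootstrap interval for each of 500 hypothetical students. Percentile bootstrap intervals are only approximately 95%, so the result need not be exactly 0.95.

```r
# Sketch: estimate the coverage of 95% percentile bootstrap intervals.
# Assumptions: true proportion 0.62, sample size 60, 1000 bootstrap
# resamples per interval, 500 simulated students. All are illustrative.
set.seed(2)
p_true <- 0.62
n <- 60
reps <- 1000
students <- 500

covers <- replicate(students, {
  yes <- rbinom(1, n, p_true)  # one student's count of "Yes"
  # Resampling n answers with replacement gives a Binomial(n, yes / n) count
  boot_props <- rbinom(reps, n, yes / n) / n
  ci <- quantile(boot_props, c(0.025, 0.975))
  ci[[1]] <= p_true && p_true <= ci[[2]]
})

mean(covers)  # share of intervals that capture 0.62
```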



Exercise 6

Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

90% of my confidence intervals (45 out of 50) contain the true population proportion. This is not exactly equal to the confidence level. With 50 intervals, 95% would be 47.5 intervals, so the closest achievable results are 94% or 96%. More importantly, the 95% level describes the long-run chance that any one interval contains the true proportion; it does not guarantee that exactly one of every 20 intervals misses it, so the observed proportion varies from one set of 50 intervals to the next.

Plot from Shiny app with 95% confidence intervals
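To judge how unusual 45 out of 50 is, one can treat each interval as capturing the true proportion independently with probability 0.95. That is an idealization (the app's bootstrap intervals only approximately reach their nominal level), and the variable names below are my own.

```r
# Sketch: how many of 50 intervals capture the truth if each does so with
# probability 0.95, independently? (An idealization of the app's intervals.)
k <- 50
level <- 0.95

k * level                      # expected number of captures: 47.5
sqrt(k * level * (1 - level))  # standard deviation of that count

# Probability that 45 or fewer of the 50 intervals capture the true proportion
p_45_or_fewer <- pbinom(45, size = k, prob = level)
p_45_or_fewer
```

Under these assumptions a result of 45 or fewer captures turns up in roughly one set of 50 intervals out of ten, so 90% is not a surprising outcome.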



Exercise 7

Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.

I chose 50% and expected the intervals to be narrower than the 95% confidence intervals. For a given sample, the bootstrap distribution stays the same and the confidence level determines how much of it the interval spans: the wider the interval, the more likely it is to contain the true proportion, so a lower confidence level gives a narrower interval.



Exercise 8

Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

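At a confidence level of 50% I got (0.533, 0.617): we are 50% confident that the proportion of US adults who think climate change affects their local community is between 53.3% and 61.7%. Since samp is fixed and has a fixed sample proportion, changing the confidence level only makes the interval wider (more confident) or narrower (less confident) around roughly the same central point. At first I didn’t understand why, when I run it with confidence levels of 47% and 48%, the top end of the confidence interval jumps from 61.67% to 63.33%. The infer package does not round the confidence level. With n = 60, every bootstrap proportion is a multiple of 1/60 (about 1.67 percentage points), so the percentile endpoints can only move in steps of that size. In addition, no seed is set before generate(), so each run draws a different set of 1000 bootstrap samples and an endpoint near a step boundary can land on either side of it.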

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.50)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.533    0.617
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.48)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.533    0.633
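One way to look into the jump in the upper endpoint is to inspect the bootstrap proportions directly. The sketch below is a base R stand-in for the infer pipeline, assuming the sample above (35 "Yes" out of 60); resampling 60 answers with replacement is equivalent to drawing the "Yes" count from a Binomial(60, 35/60). The seed is arbitrary, so the printed endpoints will not necessarily match the output above.

```r
# Sketch: why percentile endpoints move in steps of 1/60.
# Assumes the sample above: 35 "Yes" out of n = 60.
set.seed(3)
n <- 60
yes <- 35

# 1000 bootstrap proportions; each is (count of "Yes" in a resample) / 60
boot_props <- rbinom(1000, n, yes / n) / n

# Every bootstrap proportion is a whole number of sixtieths
all(abs(boot_props * n - round(boot_props * n)) < 1e-9)

# Endpoints at two nearby confidence levels
quantile(boot_props, c(0.25, 0.75))  # 50% level
quantile(boot_props, c(0.26, 0.74))  # 48% level

37 / 60  # 0.6167, the upper endpoint reported at the 50% level
38 / 60  # 0.6333, the upper endpoint reported at the 48% level
```

Calling set.seed() before generate() in the infer code would make the two runs comparable, since they would then differ only in the confidence level.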



Exercise 9

Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

Run at 50% confidence, 12 intervals fell entirely below the true proportion and 17 fell entirely above it, so 21 out of 50 (42%) contain the true proportion. That is reasonably close to the 50% confidence level given that there are only 50 intervals.

Plot from Shiny app with 50% confidence intervals
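As a rough check on whether 21 out of 50 is plausible, treat each interval as capturing the true proportion independently with probability 0.5. This is an idealization of mine; bootstrap percentile intervals built from discrete data only approximately achieve their nominal level. Under that assumption the number of capturing intervals $X$ is binomial:

```latex
\[
X \sim \mathrm{Binomial}(50,\ 0.5), \qquad
\mathrm{E}[X] = 50 \times 0.5 = 25, \qquad
\mathrm{SD}(X) = \sqrt{50 \times 0.5 \times 0.5} \approx 3.54
\]
\[
\frac{25 - 21}{3.54} \approx 1.13
\]
```

So 21 captures is a little over one standard deviation below the expected 25.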



Exercise 10

Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.

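Ok, let’s try a 90% confidence level. I expect this interval to be narrower than the 95% interval and wider than the 50% interval.

The 90% confidence interval is (0.483, 0.683): we are 90% confident that the proportion of US adults who think climate change affects their local community is between 48.3% and 68.3%. As expected, it is narrower than my 95% confidence interval of (0.45, 0.7) and wider than my 50% interval. Both the 90% and 95% intervals contain my sample proportion of 58.33% and are centered close to it.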

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.90)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.483    0.683

I used the app to generate 50 intervals. Five did not capture the true proportion, so the proportion of intervals that do is 90% (45 out of 50).

Plot from Shiny app with 90% confidence intervals



Exercise 11

Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

The intervals get wider as the sample size goes down and narrower as the sample size goes up, because the standard error of the sample proportion shrinks as the sample size grows.
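The relationship between width and sample size can be sketched with the normal-approximation formula for a proportion. This is a substitute for the app's bootstrap intervals, used here only because it makes the dependence on n explicit; p = 0.62 comes from the lab and the sample sizes are arbitrary.

```r
# Sketch: approximate 95% interval width for a proportion at several sample sizes.
# Uses the normal-approximation formula, not the bootstrap; p = 0.62 as in the lab.
p <- 0.62
n <- c(15, 60, 240, 960)

se <- sqrt(p * (1 - p) / n)     # standard error of the sample proportion
width <- 2 * qnorm(0.975) * se  # full width of a 95% interval

data.frame(n = n, se = round(se, 3), width = round(width, 3))
```

Because the standard error is proportional to 1 / sqrt(n), each fourfold increase in sample size halves the width. At n = 60 the formula gives a width of about 0.25, in line with the 95% bootstrap interval of (0.45, 0.7) above.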



Exercise 12

Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples. Hint: Does changing the number of bootstrap samples affect the standard error?

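Changing the number of bootstrap samples does have an effect, but mostly on how stable the interval is rather than on its typical width. If you set the number of resamples for each bootstrap confidence interval to 1, all of the intervals become points. If you set it to between 2 and 4, the intervals can be very large or very small. From 10 to 1000 the widths become more consistent and the differences are harder to see. The standard error is determined by the sample size, not by the number of bootstrap samples; more bootstrap samples only estimate it more precisely.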
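The effect of the number of resamples can be sketched in base R. The code below assumes the sample above (35 "Yes" out of 60) and rebuilds a 95% percentile interval 200 times for each number of resamples; the helper summarise_reps and the chosen values of reps are hypothetical, introduced only for this illustration.

```r
# Sketch: effect of the number of bootstrap resamples on the interval.
# Assumes the sample above (35 "Yes" out of 60); reps values are illustrative.
set.seed(4)
n <- 60
yes <- 35

# Build 200 intervals for a given number of resamples and summarise their widths
summarise_reps <- function(reps, trials = 200) {
  widths <- replicate(trials, {
    boot_props <- rbinom(reps, n, yes / n) / n
    diff(quantile(boot_props, c(0.025, 0.975)))
  })
  c(reps = reps, mean_width = mean(widths), sd_width = sd(widths))
}

results <- t(sapply(c(10, 100, 1000, 10000), summarise_reps))
results
```

The sd_width column shows how much the width varies from one run to the next for a fixed sample; it should shrink as reps grows, while mean_width settles near a value set by the sample size.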