Foundations for statistical inference - Confidence intervals

Load packages

library(tidyverse)
library(openintro)
library(infer)

The data

us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)

Plotting the Data

ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()

Data Summary

us_adults |>
  count(climate_change_affects) |>
  mutate(p = n / sum(n))

# A tibble: 2 × 3
  climate_change_affects     n     p
  <chr>                  <int> <dbl>
1 No                     38000  0.38
2 Yes                    62000  0.62

Sample size

set.seed(1101)
n <- 60
samp <- us_adults |>
  sample_n(size = n)

1

samp |>
  count(climate_change_affects) |>
  mutate(p = n / sum(n))

# A tibble: 2 × 3
  climate_change_affects     n     p
  <chr>                  <int> <dbl>
1 No                        16 0.267
2 Yes                       44 0.733

In my sample, 73.3% of adults (44 of 60) think climate change is affecting their local community.

2

No, another sample's proportion would probably not be identical to mine, because each random sample contains a different set of people. It would most likely be similar, though, since sample proportions tend to fall close to the population proportion of 62%.
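To make the sample-to-sample variation concrete, here is a minimal sketch that draws a few more samples of 60 from the same population and prints each sample proportion. It uses base R only so it runs without extra packages; the seed and the choice of 5 samples are arbitrary assumptions, not part of the lab.

```r
# Sketch: how much do sample proportions vary from sample to sample?
set.seed(2024)  # arbitrary seed, chosen only for reproducibility

# Same population as in the lab: 62,000 "Yes" and 38,000 "No"
population <- c(rep("Yes", 62000), rep("No", 38000))

# Draw 5 independent samples of 60 and record the share of "Yes" in each
sample_props <- replicate(5, mean(sample(population, size = 60) == "Yes"))

print(round(sample_props, 3))
```

The printed proportions should differ from one another while staying in the general neighborhood of 0.62.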

Confidence Intervals

set.seed(1101)
samp |>
  specify(response = climate_change_affects, success = "Yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.95)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.617    0.833

Confidence Levels

1

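95% confidence means that the method used to build the interval captures the true population proportion in about 95% of repeated samples. For this sample, we are 95% confident that the true population proportion of US adults who think climate change is affecting their local community lies between 61.7% and 83.3%.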
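As a rough cross-check on the bootstrap interval, the same sample can be run through the textbook normal-approximation formula, p_hat ± z * sqrt(p_hat * (1 - p_hat) / n). This is a different technique from the bootstrap used in the lab, so the endpoints are not expected to match exactly; the variable names are illustrative.

```r
# Sketch: normal-approximation 95% interval for the sample of 60
n_yes <- 44                          # "Yes" answers in the sample
n     <- 60                          # sample size
p_hat <- n_yes / n                   # sample proportion, about 0.733

se <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
z  <- qnorm(0.975)                   # multiplier for 95% confidence, about 1.96

ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
print(round(ci, 3))
```

This gives an interval of roughly 0.62 to 0.85, close to the bootstrap result of 0.617 to 0.833.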

1

My confidence interval does capture the true proportion, since the population value of 62% lies between 61.7% and 83.3%.

2

About 95% of the intervals should capture the true proportion, because the confidence level was set to 95%.

1

47 of the 50 intervals include the true proportion. That is 94%, which is not exactly the 95% confidence level. However, with 50 intervals the capture rate can only be a multiple of 2%, so 94% and 96% are the two achievable values closest to 95%.
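The capture rate for a batch of 50 intervals can be simulated directly. The sketch below is not the app used in the lab: it assumes normal-approximation intervals instead of bootstrap intervals, and it uses rbinom() to stand in for sampling 60 people from the population of 100,000. It also computes how likely 47 or fewer captures would be if each interval captured the truth with probability exactly 0.95.

```r
# Sketch: how many of 50 simulated 95% intervals capture the true proportion?
set.seed(2024)     # arbitrary seed
true_p      <- 0.62
n           <- 60
n_intervals <- 50
z           <- qnorm(0.975)

# Number of "Yes" answers in each of the 50 simulated samples
yes_counts <- rbinom(n_intervals, size = n, prob = true_p)
p_hats     <- yes_counts / n

# Normal-approximation interval for each sample
se    <- sqrt(p_hats * (1 - p_hats) / n)
lower <- p_hats - z * se
upper <- p_hats + z * se

# Count the intervals that contain the true proportion
n_captured <- sum(lower <= true_p & true_p <= upper)
cat(n_captured, "of", n_intervals, "intervals capture the true proportion\n")

# Probability of 47 or fewer captures when each interval captures with prob 0.95
prob_47_or_fewer <- pbinom(47, size = n_intervals, prob = 0.95)
print(prob_47_or_fewer)
```

Rerunning with different seeds shows the count moving around from batch to batch, which is why a single batch of 50 rarely lands on the nominal level exactly.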

More Practice

1

A 99.7% confidence interval would be wider than the 95% interval. At that level, 99.7% of intervals built this way have to contain the true proportion instead of only 95%, so each interval must extend far enough to cover samples whose proportions fall further from the true proportion.
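The effect of the confidence level on width can be illustrated with the normal-approximation formula, where the width is 2 * z * SE and only the multiplier z depends on the level. This is a sketch under that approximation, not the bootstrap method used in the lab, and the three levels are simply the ones that appear in this report.

```r
# Sketch: interval width at several confidence levels for the same sample
p_hat <- 44 / 60
n     <- 60
se    <- sqrt(p_hat * (1 - p_hat) / n)

conf_levels <- c(0.68, 0.95, 0.997)

# Two-sided multiplier for each level, e.g. about 1.96 for 95%
z <- qnorm(1 - (1 - conf_levels) / 2)

widths <- 2 * z * se
print(data.frame(level = conf_levels,
                 multiplier = round(z, 2),
                 width = round(widths, 3)))
```

Because the standard error is fixed by the sample, a higher level means a larger multiplier and therefore a wider interval.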

2

set.seed(1101)
samp |>
  specify(response = climate_change_affects, success = "Yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.997)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.558      0.9

With 99.7% confidence we can say that the true population proportion of US adults who think climate change is affecting their community lies between 55.8% and 90%.

3

98% of the intervals captured the true proportion, which is close to but below the 99.7% confidence level.

4

Using a confidence level of 0.68:

set.seed(1101)
samp |>
  specify(response = climate_change_affects, success = "Yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.68)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.683    0.783

With 68% confidence we can say that the true population proportion of US adults who think climate change is affecting their community lies between 68.3% and 78.3%.

54% of the intervals capture the true proportion, which is below the 68% confidence level.

5

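As the sample size increases, the width of the intervals decreases. A larger sample gives a more precise estimate of the proportion: the standard error shrinks in proportion to 1/sqrt(n), so the interval around the estimate gets narrower.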
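The relationship between sample size and interval width can be checked with the normal-approximation width, 2 * z * sqrt(p * (1 - p) / n). The sketch below assumes p = 0.62 and a 95% level; the sample sizes are arbitrary, chosen so that each is four times the previous one.

```r
# Sketch: approximate 95% interval width at several sample sizes
p <- 0.62
z <- qnorm(0.975)

sample_sizes <- c(60, 240, 960)   # each is 4 times the previous size
widths <- 2 * z * sqrt(p * (1 - p) / sample_sizes)

print(data.frame(n = sample_sizes, width = round(widths, 3)))
```

Under this formula, quadrupling the sample size cuts the width in half.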

6

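As the number of bootstrap samples changes, the width of the interval stays about the same. Every bootstrap sample still resamples the same 60 observations, so more resamples make the interval endpoints more stable but add no new information about the population.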
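The effect of the number of bootstrap samples can be checked with a hand-rolled percentile bootstrap in base R. This is a sketch rather than the infer pipeline used above: it rebuilds the sample from its counts (44 "Yes", 16 "No"), and the helper boot_width() and the seed are hypothetical choices for illustration.

```r
# Sketch: width of a 95% percentile bootstrap interval vs. number of resamples
set.seed(2024)  # arbitrary seed

# The observed sample, coded 1 for "Yes" and 0 for "No"
samp_yes <- c(rep(1, 44), rep(0, 16))

# Hypothetical helper: resample the 60 observations `reps` times and
# return the width of the 2.5% to 97.5% percentile interval
boot_width <- function(reps) {
  boot_props <- replicate(reps, mean(sample(samp_yes, replace = TRUE)))
  ci <- quantile(boot_props, c(0.025, 0.975))
  unname(ci[2] - ci[1])
}

rep_counts <- c(100, 1000, 10000)
widths <- sapply(rep_counts, boot_width)

print(data.frame(reps = rep_counts, width = round(widths, 3)))
```

The widths should stay in the same general range across all three settings, with the smaller rep counts giving noisier endpoints.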