Foundations for statistical inference - Confidence intervals

Author

Ethan Nguyen

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata
library(infer)
library(knitr)
#Population data frame: 100,000 US adults, 62% of whom answer "Yes"
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
#bar plot of response distribution
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip() 

#summary statistics of data frame
us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
# A tibble: 2 × 3
  climate_change_affects     n     p
  <chr>                  <int> <dbl>
1 No                     38000  0.38
2 Yes                    62000  0.62
#Draw a simple random sample of 60 adults; no seed is set, so results vary between runs
n <- 60
samp <- us_adults %>%
  sample_n(size = n)

Data

samp %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
# A tibble: 2 × 3
  climate_change_affects     n     p
  <chr>                  <int> <dbl>
1 No                        24   0.4
2 Yes                       36   0.6
#Answer 1: In the sample shown above, 40% of the adults don't think that climate change affects their local community, while 60% do.
#Answer 2: No, I would expect another student's sample proportion to be different from mine. It should be fairly similar, since a sample size of 60 isn't very small, but it isn't large either. Because each student draws a different random sample of 60 from the population, it is unlikely that two sample proportions will match exactly.
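The run-to-run variation described in Answer 2 can be illustrated with a short simulation. This is a minimal base-R sketch, not part of the original lab: the seed, the number of simulated students (`n_students`), and the variable names are assumptions chosen for illustration.

```r
# Hypothetical illustration: how much does the sample proportion vary
# between students who each draw their own sample of 60?
set.seed(2024)  # arbitrary seed, used only so the sketch is reproducible

# Same population as above: 62,000 "Yes" and 38,000 "No"
population <- c(rep("Yes", 62000), rep("No", 38000))
n <- 60
n_students <- 1000  # assumed number of independent samples

# Each simulated student samples 60 adults without replacement
# and records the proportion of "Yes" answers
p_hats <- replicate(n_students, mean(sample(population, size = n) == "Yes"))

mean(p_hats)   # centre of the simulated sample proportions
sd(p_hats)     # spread; theory suggests roughly sqrt(0.62 * 0.38 / 60)
range(p_hats)  # smallest and largest simulated proportions
```

The spread reported by `sd(p_hats)` is an estimate of the standard error of the sample proportion at this sample size.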
#calculating 95% confidence interval for proportion of US adults who think climate change affects their local community
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.467    0.733
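To make the bootstrap step less of a black box, here is a minimal base-R sketch of a percentile bootstrap interval. It assumes that the pipeline above uses the percentile method (which, as far as I know, is the default for `get_ci()`), and it rebuilds a hypothetical sample vector `samp_vec` from the counts printed earlier (36 "Yes", 24 "No"). The seed and variable names are illustrative, and the endpoints will not match the infer output exactly.

```r
# Hypothetical sample matching the counts printed above (36 "Yes", 24 "No")
set.seed(1)  # arbitrary seed for reproducibility
samp_vec <- c(rep("Yes", 36), rep("No", 24))
reps <- 1000

# Resample the 60 responses with replacement and recompute
# the proportion of "Yes" for each bootstrap resample
boot_props <- replicate(reps, mean(sample(samp_vec, replace = TRUE) == "Yes"))

# Percentile method: take the middle 95% of the bootstrap proportions
ci_95 <- quantile(boot_props, probs = c(0.025, 0.975))
ci_95
```

The interval endpoints are simply the 2.5th and 97.5th percentiles of the 1000 bootstrap proportions.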

Confidence Levels

#Question 1 (3): 95% confidence means that if we took many random samples and built an interval from each one using this method, about 95% of those intervals would contain the true population proportion.
#Question 1 (4): Yes, my confidence interval does capture the true population proportion of 0.62.
#Question 2 (5): I would expect about 95% of them to capture the true population proportion, because the confidence level describes the long-run share of intervals built this way that contain the true value. The 1000 bootstrap repetitions make each interval's endpoints more stable, but they do not determine how often the intervals capture the truth.
#Question 1 (6): 43/50, or 86%, of the confidence intervals included the true population proportion. This is lower than the confidence level of 95%. The 95% figure is a long-run rate, so with only 50 intervals, each built from a different random sample, the observed share can differ from it.
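The long-run meaning of the confidence level can be checked by simulation. The sketch below is an illustration under stated assumptions, not the app used in the lab: it treats the population as large enough that the number of "Yes" answers in a sample of 60 is binomial, and it uses the fact that resampling a yes/no sample with replacement gives a binomial count as well. The function name `covers_once`, the seed, and `n_intervals` are hypothetical choices.

```r
set.seed(42)        # arbitrary seed for reproducibility
p_true <- 0.62      # true population proportion from the data frame above
n <- 60             # sample size
reps <- 1000        # bootstrap resamples per interval
n_intervals <- 2000 # assumed number of simulated intervals

# Returns TRUE if one simulated percentile interval captures p_true
covers_once <- function(level) {
  # One random sample of size n, summarised by its proportion of "Yes"
  p_hat <- rbinom(1, size = n, prob = p_true) / n
  # Bootstrap: the "Yes" count in a resample of size n is Binomial(n, p_hat)
  boot_props <- rbinom(reps, size = n, prob = p_hat) / n
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))
  ci[[1]] <= p_true && p_true <= ci[[2]]
}

# Share of simulated 95% intervals that contain the true proportion
coverage_95 <- mean(replicate(n_intervals, covers_once(0.95)))
coverage_95
```

With many more than 50 intervals, the observed share settles near the long-run coverage of the method, which for a sample of 60 may sit slightly off the nominal 95%.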

More Practice

#Question 1 (7): Choosing a confidence level of 85%, I would expect the confidence interval to be narrower, since a narrower range comes with less confidence that it contains the true proportion.
#Question 2 (8): Using the infer output below, we are 85% confident that the proportion of US adults who think climate change is affecting their local community is between 51.7% and 68.3%.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.85)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.517    0.683
#Question 3 (9): 43/50, or 86%, of the intervals included the true population proportion. This is extremely close to the selected confidence level of 85%.

#Question 4 (10): I will be using a 50% confidence level, and I expect this interval to be much narrower. Using the infer package, the interval is 56.7% to 65.0%, so we are 50% confident that the proportion of US adults who think climate change is affecting their local community is between 56.7% and 65.0%. Using the app to generate 50 confidence intervals, 24/50, or 48%, contained the true population proportion, which is extremely close to the chosen confidence level of 50%.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.50)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.567     0.65
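The relationship between confidence level and interval width can be seen by cutting one bootstrap distribution at several levels. This is a minimal base-R sketch that assumes the percentile method and a hypothetical sample vector `samp_vec` rebuilt from the counts shown earlier (36 "Yes", 24 "No"). The seed and names are illustrative, so the widths will differ slightly from the infer output above.

```r
set.seed(7)  # arbitrary seed for reproducibility
samp_vec <- c(rep("Yes", 36), rep("No", 24))  # counts from the sample shown earlier

# One bootstrap distribution of the sample proportion
boot_props <- replicate(1000, mean(sample(samp_vec, replace = TRUE) == "Yes"))

conf_levels <- c(0.50, 0.85, 0.95)

# Width of the percentile interval at each confidence level
widths <- sapply(conf_levels, function(level) {
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))
  unname(ci[2] - ci[1])
})

data.frame(level = conf_levels, width = widths)
```

Because each higher level keeps a larger middle share of the same bootstrap distribution, the intervals are nested and the width grows with the confidence level.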

#Question 5 (11): Sample size and interval width have an inverse relationship. As sample size increases, the standard error shrinks (roughly in proportion to 1/sqrt(n)), so the interval width decreases.
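The effect of sample size on width can be sketched with the standard error formula for a proportion, sqrt(p(1 - p)/n). Note that this uses the normal approximation rather than the bootstrap from the lab, and the sample sizes below are hypothetical values chosen so that each is four times the previous one.

```r
p <- 0.62                          # population proportion from the data frame above
sample_sizes <- c(60, 240, 960)    # assumed sizes; each is 4x the previous one

# Standard error of a sample proportion at each sample size
se <- sqrt(p * (1 - p) / sample_sizes)

# Approximate 95% interval width under the normal approximation
# (a stand-in for the bootstrap width, not the same calculation)
approx_width_95 <- 2 * qnorm(0.975) * se

data.frame(
  n = sample_sizes,
  se = round(se, 4),
  approx_width_95 = round(approx_width_95, 3)
)
```

Quadrupling the sample size halves the standard error, and with it the approximate interval width.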
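#Question 6 (12): Increasing the number of bootstrap samples does not lower the standard error, which is set by the sample size (60 here), so the width of the interval stays about the same. More bootstrap samples only make the estimated interval endpoints more stable.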
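How the number of bootstrap resamples relates to interval width at a fixed sample size can be checked directly. The sketch below is a base-R illustration under assumptions: it uses the percentile method, a hypothetical sample vector `samp_vec` rebuilt from the counts shown earlier (36 "Yes", 24 "No"), and arbitrary choices for the seed and for `rep_counts`.

```r
set.seed(123)  # arbitrary seed for reproducibility
samp_vec <- c(rep("Yes", 36), rep("No", 24))  # fixed sample of size 60
rep_counts <- c(100, 1000, 10000)             # assumed numbers of bootstrap resamples

# Width of the 95% percentile interval for a given number of resamples
width_for_reps <- function(reps) {
  boot_props <- replicate(reps, mean(sample(samp_vec, replace = TRUE) == "Yes"))
  ci <- quantile(boot_props, probs = c(0.025, 0.975))
  unname(ci[2] - ci[1])
}

widths_by_reps <- sapply(rep_counts, width_for_reps)
data.frame(reps = rep_counts, width_95 = round(widths_by_reps, 3))
```

The widths should stay in the same neighbourhood across all three settings, since the sample size, and therefore the standard error, is unchanged. Only the run-to-run stability of the endpoints improves with more resamples.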