Foundations for statistical inference - Confidence intervals

Author

Lydia Baick

set.seed(04102004)

Getting Started

Load Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(openintro)

Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

library(infer)

Creating a reproducible lab report

The data

us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)

ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()

us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))

# A tibble: 2 × 3
  climate_change_affects     n     p
  <chr>                  <int> <dbl>
1 No                     38000  0.38
2 Yes                    62000  0.62

n <- 60
samp <- us_adults %>%
  sample_n(size = n)

Exercise 1

What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.

Answer= 62% of the adults in my sample think climate change affects their local community.
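
The sample proportion can be computed the same way as the population proportion. The sketch below is self-contained, so it rebuilds the population and draws its own sample with an arbitrary seed (an assumption made only for this example). Its numbers will therefore not match the sample used in this report. It uses slice_sample(), the current dplyr replacement for sample_n().

```r
library(dplyr)

# Rebuild the population from the lab: 62,000 "Yes" and 38,000 "No".
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)

set.seed(2024)  # arbitrary seed, chosen only for this sketch
samp_demo <- us_adults %>%
  slice_sample(n = 60)

# Count each response and convert the counts to proportions.
samp_props <- samp_demo %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))

samp_props
```

The p value in the "Yes" row is the sample proportion asked for in this exercise.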

Exercise 2

Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?

Answer= I wouldn’t expect another student’s sample proportion to be identical to mine, but I would expect it to be similar. Each sample is drawn at random from the same population, so the proportions vary from sample to sample while staying close to the population proportion.

Confidence intervals

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1      0.4    0.667
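
The infer pipeline above can be mirrored in base R to show what a percentile bootstrap does: resample the 60 responses with replacement, record the proportion of "Yes" in each resample, and take the middle 95% of those proportions. This is only a sketch, assuming get_ci() is using its percentile method. The vector samp_vec and the seed are stand-ins invented for this example, so the endpoints will differ from the output above.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch

# Stand-in for samp: 60 responses drawn from the lab's population.
population <- c(rep("Yes", 62000), rep("No", 38000))
samp_vec <- sample(population, size = 60)

# generate() + calculate(): 1000 resamples, each reduced to a proportion.
boot_props <- replicate(1000, {
  resample <- sample(samp_vec, size = length(samp_vec), replace = TRUE)
  mean(resample == "Yes")
})

# get_ci(level = 0.95): the 2.5th and 97.5th percentiles of the resampled proportions.
ci_95 <- quantile(boot_props, probs = c(0.025, 0.975), names = FALSE)
ci_95
```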

Confidence Levels

Exercise 3

In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean? In this case, you have the rare luxury of knowing the true population proportion (62%) since you have data on the entire population.

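Answer= 95% confidence describes the method rather than a single interval: if we took many random samples and built an interval from each one in the same way, about 95% of those intervals would contain the true population proportion (here, 62%).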

Exercise 4

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

Answer= I would expect about 95% of the intervals to capture the true population proportion, because each one was built at the 95% confidence level. A lower level such as 90% or 85% would give narrower intervals that miss the true value more often.

Exercise 5

Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

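Answer= The true population proportion was inside 56 of the 60 intervals, about 93% of the time. This is close to 95% but not exactly equal to it: the confidence level describes the long-run capture rate, and with a limited number of intervals the observed proportion varies from run to run.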
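
The experiment the app runs can be approximated in base R: draw many samples of 60, build a percentile bootstrap interval from each, and count how often the interval contains 0.62. This is a sketch under my own assumptions, not the app's actual code. The helper boot_ci(), the seed, and the choice of 200 intervals with 500 resamples each are all made up for this example.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch
population <- c(rep("Yes", 62000), rep("No", 38000))
true_p <- mean(population == "Yes")  # 0.62

# Hypothetical helper: percentile bootstrap interval for the proportion of "Yes".
boot_ci <- function(x, reps = 500, level = 0.95) {
  boot_props <- replicate(reps, mean(sample(x, length(x), replace = TRUE) == "Yes"))
  alpha <- 1 - level
  quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Build one interval per fresh sample and record whether it captures true_p.
n_intervals <- 200
covered <- replicate(n_intervals, {
  ci <- boot_ci(sample(population, size = 60))
  ci[1] <= true_p && true_p <= ci[2]
})

coverage <- mean(covered)
coverage
```

Rerunning with a different seed gives a slightly different coverage each time, which is why the observed proportion rarely equals the confidence level exactly.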

More Practice

Exercise 6

Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.

Answer= If I chose a confidence level higher than 95%, I would expect the CI to be wider, because a wider interval is more likely to capture the true population proportion. The opposite holds if I chose a confidence level lower than 95%: the CI would be narrower, because it is less likely to capture the true value.
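
One way to check this is to take a single set of bootstrap proportions and cut intervals from it at several confidence levels. The sketch below does that in base R with percentile intervals. The sample, the seed, and the three levels are assumptions chosen for this example.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch
population <- c(rep("Yes", 62000), rep("No", 38000))
samp_vec <- sample(population, size = 60)

# One bootstrap distribution, reused for every confidence level.
boot_props <- replicate(1000, mean(sample(samp_vec, 60, replace = TRUE) == "Yes"))

conf_levels <- c(0.90, 0.95, 0.98)
widths <- sapply(conf_levels, function(level) {
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  ci[2] - ci[1]  # interval width at this level
})

data.frame(level = conf_levels, width = widths)
```

Because a higher level keeps more of the same bootstrap distribution, the width can only stay the same or grow as the level increases.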

Exercise 7

Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

set.seed(12082010)
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.90)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.433    0.633

Answer= At a confidence level of 90%, the interval runs from 43.3% to 63.3%. We are 90% confident that the proportion of US adults who think climate change is affecting their local community lies in that range.

Exercise 8

Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

Answer= 54 of the 60 intervals included the true population proportion, which is 90%. This percentage matches the confidence level selected for the intervals.

Exercise 9

Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.

Answer= I chose a confidence level of 98% so I expect the width of this interval to be wider than the previous ones I’ve calculated.

set.seed(09012005)
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.98)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.383    0.683

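Answer= With a confidence level of 98% I got an interval from 38.3% to 68.3%, which is wider than the 90% and 95% intervals, as expected. When I used the app to generate the intervals, 60 of 60 (100%) captured the true population proportion.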

Exercise 10

Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

Answer= As the sample size increases, the intervals get narrower. As the sample size decreases, the intervals get wider.
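
The same pattern can be checked outside the app with a short base R simulation: draw one sample at each of several sizes and compare the widths of the resulting 95% percentile bootstrap intervals. This is a sketch; the helper ci_width(), the seed, and the particular sample sizes are assumptions made for this example.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch
population <- c(rep("Yes", 62000), rep("No", 38000))

# Hypothetical helper: width of a percentile bootstrap interval from one sample of size n.
ci_width <- function(n, reps = 1000, level = 0.95) {
  samp_vec <- sample(population, size = n)
  boot_props <- replicate(reps, mean(sample(samp_vec, n, replace = TRUE) == "Yes"))
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  ci[2] - ci[1]
}

sample_sizes <- c(30, 60, 240, 960)
widths <- sapply(sample_sizes, ci_width)

data.frame(n = sample_sizes, width = widths)
```

The standard error of a sample proportion shrinks roughly like one over the square root of n, so quadrupling the sample size should roughly halve the width.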

Exercise 11

Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples. Hint: Does changing the number of bootstrap samples affect the standard error?

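Answer= Increasing the number of bootstrap samples does not systematically change the width of the CI. The standard error depends on the sample size (60), not on the number of bootstrap samples. More bootstrap samples only make the estimated interval endpoints more stable from run to run.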
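
One way to test how the number of bootstrap samples relates to interval width is to hold a single sample of 60 fixed and vary only the number of resamples. The base R sketch below does this with percentile intervals. The helper width_for_reps(), the seed, and the resample counts are assumptions made for this example.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch
population <- c(rep("Yes", 62000), rep("No", 38000))
samp_vec <- sample(population, size = 60)  # one fixed sample of 60

# Hypothetical helper: width of the 95% percentile interval for a given number of resamples.
width_for_reps <- function(reps, level = 0.95) {
  boot_props <- replicate(reps, mean(sample(samp_vec, 60, replace = TRUE) == "Yes"))
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  ci[2] - ci[1]
}

rep_counts <- c(100, 1000, 10000)
widths <- sapply(rep_counts, width_for_reps)

data.frame(reps = rep_counts, width = widths)
```

The widths should hover around the same value instead of trending up or down, since every run resamples the same 60 observations.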