set.seed(04102004)
Foundations for statistical inference - Confidence intervals
Getting Started
Load Packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata
library(infer)
Creating a reproducible lab report
The data
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()
us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
# A tibble: 2 × 3
climate_change_affects n p
<chr> <int> <dbl>
1 No 38000 0.38
2 Yes 62000 0.62
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
Exercise 1
What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.
Answer= 62% of the adults in my sample think climate change affects their local community.
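The sample proportion can be computed the same way as the population proportion. The following is a minimal base R sketch, assuming a stand-in population built like us_adults. The names pop, samp_demo and p_hat and the seed are illustrative only, so the value it produces will not match the lab's samp.

```r
# Hypothetical stand-in for the lab's population (62% "Yes", 38% "No")
pop <- c(rep("Yes", 62000), rep("No", 38000))

set.seed(1)  # illustrative seed, not the one used in the lab
samp_demo <- sample(pop, size = 60)  # simple random sample without replacement

# Proportion of "Yes" responses in the sample
p_hat <- mean(samp_demo == "Yes")
p_hat
```

With the tidyverse loaded, the same count() and mutate() pipeline used on us_adults can be applied to samp instead.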
Exercise 2
Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?
Answer= I wouldn’t expect another student’s sample proportion to be identical to mine, but I would expect it to be similar. Both samples are drawn at random from the same population, so they should land near the same value, while sampling variability keeps them from matching exactly.
Confidence intervals
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 0.4 0.667
Confidence Levels
Exercise 3
In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean? In this case, you have the rare luxury of knowing the true population proportion (62%) since you have data on the entire population.
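Answer= 95% confidence means that if we took many samples and built an interval from each one using this method, about 95% of those intervals would contain the true population proportion. In this case the interval (0.4, 0.667) does contain the true proportion of 0.62.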
Exercise 4
Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?
Answer= I would expect about 95% of those intervals to capture the true population proportion. Each interval was built at the 95% confidence level, and that level describes the long-run share of intervals that contain the parameter.
Exercise 5
Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.
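Answer= 56/60 times, or about 93% of the time, the true population proportion was included. This is not exactly equal to the 95% confidence level. The confidence level describes the long-run share of intervals that capture the parameter, so with a limited number of intervals the observed proportion varies from run to run.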
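What the app does can be approximated in a few lines. This is a rough base R sketch, assuming percentile bootstrap intervals and a stand-in population like us_adults. The function one_interval and the seed are hypothetical, and the coverage it reports will differ from the app's.

```r
set.seed(1)  # illustrative seed
pop <- c(rep("Yes", 62000), rep("No", 38000))
true_p <- mean(pop == "Yes")  # known population proportion

# Draw one sample and return a percentile bootstrap interval for the proportion
one_interval <- function(n = 60, reps = 1000, level = 0.95) {
  s <- sample(pop, size = n) == "Yes"
  boot <- replicate(reps, mean(sample(s, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot, probs = c(alpha / 2, 1 - alpha / 2))
}

# 50 intervals, one per row: column 1 is the lower bound, column 2 the upper
intervals <- t(replicate(50, one_interval()))

# Share of intervals that contain the true proportion
covered <- intervals[, 1] <= true_p & true_p <= intervals[, 2]
mean(covered)
```

Rerunning with a different seed gives a different share, which is why a single batch of 50 intervals rarely hits the confidence level exactly.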
More Practice
Exercise 6
Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.
Answer= If I were to choose a confidence level higher than 95%, I would expect the CI to be wider, because a wider interval is more likely to capture the true population proportion. The opposite holds if I choose a confidence level lower than 95%: the CI would be narrower, because it is less likely to capture the true proportion.
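The relationship between confidence level and width can be checked directly. Below is a small base R sketch, assuming percentile bootstrap intervals and a made-up sample of 60 responses (32 "Yes", 28 "No"). The names samp_demo, boot_props, conf_levels and widths are illustrative and not part of the lab.

```r
set.seed(1)  # illustrative seed
samp_demo <- c(rep("Yes", 32), rep("No", 28))  # hypothetical sample of 60

# One set of 1000 bootstrap proportions, reused for every confidence level
boot_props <- replicate(1000, mean(sample(samp_demo, replace = TRUE) == "Yes"))

conf_levels <- c(0.90, 0.95, 0.98)
widths <- sapply(conf_levels, function(lv) {
  alpha <- 1 - lv
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))
  unname(ci[2] - ci[1])  # upper bound minus lower bound
})

data.frame(level = conf_levels, width = widths)
```

Because a higher level cuts off less of each tail of the same bootstrap distribution, the width cannot shrink as the level rises.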
Exercise 7
Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.
set.seed(12082010)
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.90)
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 0.433 0.633
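Answer= At a confidence level of 90%, the interval runs from 43.3% to 63.3%. We are 90% confident that the proportion of US adults who think climate change is affecting their local community lies within that range.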
Exercise 8
Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?
Answer= I got 54/60 which equals 90%. This percentage matches the confidence level selected for the intervals.
Exercise 9
Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.
Answer= I chose a confidence level of 98% so I expect the width of this interval to be wider than the previous ones I’ve calculated.
set.seed(09012005)
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.98)
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 0.383 0.683
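Answer= With a confidence level of 98%, I got an interval of 38.3% to 68.3%, so we are 98% confident that the proportion of US adults who think climate change is affecting their local community lies in that range. This is wider than the 90% and 95% intervals, as expected. When I put the settings into the app, 60/60 intervals, or 100%, captured the true population proportion.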
Exercise 10
Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).
Answer= As the sample size increases, the intervals become narrower. As the sample size decreases, the intervals become wider.
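The effect of sample size can be seen from the usual standard error of a sample proportion, sqrt(p(1 - p)/n). The sketch below assumes the population proportion of 0.62 and a handful of illustrative sample sizes. It uses the normal-approximation formula rather than the bootstrap, and the names p, n_values and se are hypothetical.

```r
p <- 0.62                       # population proportion from the lab
n_values <- c(15, 60, 240, 960) # illustrative sample sizes

# Standard error of the sample proportion at each sample size
se <- sqrt(p * (1 - p) / n_values)

data.frame(n = n_values, se = se)
```

Each fourfold increase in n halves the standard error, so interval width shrinks roughly in proportion to 1/sqrt(n).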
Exercise 11
Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples. Hint: Does changing the number of bootstrap samples affect the standard error?
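Answer= As the number of bootstrap samples increases, the width of the CI does not change in any systematic way. The standard error depends on the sample size, not on the number of bootstrap samples. Using more bootstrap samples only makes the estimated bounds more stable from run to run.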
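A quick simulation can illustrate the hint. This is a base R sketch, assuming a made-up sample of 60 responses (32 "Yes", 28 "No"). The names samp_demo, reps_values and se_by_reps and the seed are illustrative only.

```r
set.seed(1)  # illustrative seed
samp_demo <- c(rep(TRUE, 32), rep(FALSE, 28))  # hypothetical sample, TRUE = "Yes"
p_hat <- mean(samp_demo)

reps_values <- c(100, 1000, 10000)  # numbers of bootstrap samples to compare

# Bootstrap standard error (SD of the bootstrap proportions) for each setting
se_by_reps <- sapply(reps_values, function(r) {
  sd(replicate(r, mean(sample(samp_demo, replace = TRUE))))
})

# Formula-based standard error for comparison; it depends only on n = 60
se_formula <- sqrt(p_hat * (1 - p_hat) / length(samp_demo))

data.frame(reps = reps_values, boot_se = se_by_reps, formula_se = se_formula)
```

The bootstrap standard errors should stay close to the formula value at every setting, because adding resamples refines the estimate of the spread without reducing it.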