knitr::opts_chunk$set(eval = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(openintro)
library(infer)

Exercise 1

What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.

About 61.7% of the adults in the sample think climate change affects their local community.

us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)

set.seed(1234)

n <- 60
samp <- us_adults %>%
  sample_n(size = n)
new <- samp |> filter(climate_change_affects == "Yes")
nrow(new) / n * 100
## [1] 61.66667

Exercise 2

Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?

I would expect another student’s sample proportion to be similar but not identical to mine. They likely chose a different seed, so they would have a different sample and thus a different proportion. I would expect it to be similar because both samples were drawn from the same population.
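The sample-to-sample variation described here can be illustrated with a small base R sketch. It rebuilds the lab's 62% / 38% population and draws one sample of 60 under each of several seeds. The helper name `sample_prop` and the seeds 1 to 5 are assumptions made for illustration, standing in for five different students.

```r
# Population mirroring the lab: 62,000 "Yes" and 38,000 "No".
population <- c(rep("Yes", 62000), rep("No", 38000))

# Hypothetical helper: draw one sample of size n under a given seed
# and return the proportion of "Yes" responses in that sample.
sample_prop <- function(seed, n = 60) {
  set.seed(seed)
  mean(sample(population, size = n) == "Yes")
}

# Five arbitrary seeds standing in for five students.
seeds <- c(1, 2, 3, 4, 5)
props <- sapply(seeds, sample_prop)
print(round(props, 3))
```

Each printed value is one student's sample proportion. They should scatter around 0.62 without necessarily matching each other.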

Exercise 3

In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?

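95% confidence means that the method used to build the interval captures the true population proportion in about 95% of repeated samples. For the single interval we calculated, we therefore say we are 95% confident that the population proportion lies within it.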

Exercise 4

Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

Yes, the confidence interval (copied below) captures the true population proportion of adults who think climate change affects their local community. This is because the actual proportion is 62%, and that value falls within the range of 49.96% to 75.00%.

If I were to repeat this with different samples (for example, by drawing samp again under different seeds), I would expect the intervals to capture this value about 95% of the time.

set.seed(1234)
samp |>
  specify(response = climate_change_affects, success = "Yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.500     0.75

Exercise 5

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

I would expect about 95% of the intervals to capture the true population proportion. Each student draws a different sample, so each interval is different, but the method used to build a 95% confidence interval produces an interval containing the true value in about 95% of repeated samples. The true proportion is fixed; it is the intervals that vary from sample to sample.
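The long-run coverage idea can be checked with a simulation. The following is a minimal base R sketch, not the lab's infer pipeline or the app's code: it draws many samples of size 60 from the same 62% population, builds a bootstrap percentile interval from each, and reports the share that contain 0.62. The helper `covers_truth`, the seed, and the choice of 200 intervals are all assumptions made for illustration.

```r
set.seed(2023)  # arbitrary seed, used only for reproducibility

# Population as a logical vector: TRUE = "Yes", FALSE = "No".
population <- rep(c(TRUE, FALSE), times = c(62000, 38000))
p_true <- mean(population)  # 0.62

n      <- 60    # sample size
n_boot <- 1000  # bootstrap resamples per interval
n_ci   <- 200   # number of intervals (arbitrary choice for this sketch)

# Hypothetical helper: draw one sample, bootstrap it, and report whether
# the percentile interval at the given level contains p_true.
covers_truth <- function(level = 0.95) {
  samp <- sample(population, size = n)
  boot_props <- replicate(n_boot, mean(sample(samp, replace = TRUE)))
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))
  ci[[1]] <= p_true && p_true <= ci[[2]]
}

coverage <- mean(replicate(n_ci, covers_truth()))
print(coverage)
```

The printed share should land near the confidence level rather than exactly on it, because only a finite number of intervals is simulated.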

Exercise 6

Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

When I ran the simulation, 56 out of 60 confidence intervals included the true population proportion, which is a proportion of about 0.93. This is not exactly equal to the confidence level. The 95% confidence level describes how often the method captures the true proportion over many repeated samples. In any finite batch of intervals, the observed share varies by chance, so it may come out at 94% or 96% rather than exactly 95%.

I copied and pasted the plot at this URL since I’m not sure how to recreate it in R: https://github.com/juliaDataScience-22/cuny-fall-23/blob/stats-and-probability/image-lab-5b.png
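A plot in the same spirit can be approximated in base R. This is only a sketch under assumptions: it does not reproduce the app's exact output, the helper `one_ci`, the seed, and the colors are illustrative choices, and the intervals are bootstrap percentile intervals built from 50 fresh samples of size 60.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

population <- rep(c(TRUE, FALSE), times = c(62000, 38000))
p_true <- mean(population)  # 0.62
n <- 60; n_boot <- 1000; n_ci <- 50

# Hypothetical helper: lower and upper percentile bounds for one fresh sample.
one_ci <- function(level = 0.95) {
  samp <- sample(population, size = n)
  boot_props <- replicate(n_boot, mean(sample(samp, replace = TRUE)))
  alpha <- 1 - level
  unname(quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2)))
}

# One row per interval: column 1 = lower bound, column 2 = upper bound.
cis <- t(replicate(n_ci, one_ci()))
captured <- cis[, 1] <= p_true & p_true <= cis[, 2]

# Draw each interval as a horizontal segment; misses are shown in red.
plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(1, n_ci),
     xlab = "Proportion", ylab = "Interval")
segments(cis[, 1], seq_len(n_ci), cis[, 2], seq_len(n_ci),
         col = ifelse(captured, "steelblue", "firebrick"))
abline(v = p_true, lty = 2)  # dashed line at the true proportion

print(mean(captured))  # share of intervals that capture 0.62
```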

Exercise 7

Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.

I chose a confidence level of 90%. I would expect this interval to be narrower than the 95% interval. A 90% interval keeps only the middle 90% of the bootstrap distribution instead of the middle 95%, so it spans a smaller range of values. The tradeoff is that a narrower interval captures the true proportion less often.
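The relationship between confidence level and width can be seen by cutting one bootstrap distribution at several levels. The sketch below uses base R and a stand-in sample of 37 "Yes" out of 60 (about 61.7%, matching Exercise 1); the stand-in sample, the helper `percentile_ci`, and the set of levels are assumptions for illustration.

```r
set.seed(1234)  # arbitrary seed, used only for reproducibility

# Stand-in sample: 37 "Yes" (TRUE) and 23 "No" (FALSE) out of 60.
samp <- rep(c(TRUE, FALSE), times = c(37, 23))

# One bootstrap distribution of sample proportions, reused for every level.
boot_props <- replicate(1000, mean(sample(samp, replace = TRUE)))

# Hypothetical helper: percentile interval at a given confidence level.
percentile_ci <- function(level) {
  alpha <- 1 - level
  unname(quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2)))
}

conf_levels <- c(0.50, 0.90, 0.95, 0.99)
widths <- sapply(conf_levels, function(lv) diff(percentile_ci(lv)))
print(data.frame(level = conf_levels, width = round(widths, 3)))
```

Because every interval is cut from the same bootstrap distribution, a higher level can only push the endpoints outward, so the widths never shrink as the level rises.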

Exercise 8

Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

This is a 90% confidence interval. I am 90% confident that the true population proportion of US adults who think climate change is affecting their local community falls between 0.5167 and 0.7167. The true proportion, 0.62, does fall within this range. If I were to draw 100 new samples and build a 90% interval from each, about 90 of them would include 0.62 and about 10 would not.

set.seed(1234)
samp |>
  specify(response = climate_change_affects, success = "Yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.90)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.517    0.717

Exercise 9

Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

I chose a 90% confidence level. 7 of the intervals did not include the true population proportion, and the other 53 did. 53 / 60 = 0.8833, so 88.33% of the intervals captured it. That is close to 90%, as expected.

Exercise 10

Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.

I will try a confidence level of 50%. I expect this interval to be much narrower than the previous ones I calculated. The interval is shown below. The true population proportion of 0.62 did fall within the range, which was not guaranteed, since only about half of all 50% intervals are expected to capture it. It helped that my sample proportion (about 0.617) was very close to 0.62, so the true value landed near the middle of the interval.

When I used the app, the proportion of intervals that captured the true population proportion was 0.45.

set.seed(1234)
samp |>
  specify(response = climate_change_affects, success = "Yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.50)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.579    0.667

Exercise 11

Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

As the sample size increased, the widths of the intervals decreased. As the sample size decreased, the widths of the intervals increased. These two statements were true when the confidence level remained constant.

It was challenging to tell because the intervals within a single plot already varied in width from one sample to the next.
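Since the app's plots make the pattern hard to read, the effect of sample size can also be checked with the standard error formula for a proportion, sqrt(p(1 - p) / n). The sketch below uses the normal approximation for a 95% interval, which is an assumption made for illustration and not the bootstrap method used in the app; the grid of sample sizes is also arbitrary.

```r
p <- 0.62                                  # true population proportion from the lab
sample_sizes <- c(30, 60, 120, 240, 480)   # arbitrary grid of sample sizes

# Standard error of a sample proportion at each sample size.
se <- sqrt(p * (1 - p) / sample_sizes)

# Approximate width of a 95% interval: 2 * z * SE (normal approximation).
width_95 <- 2 * qnorm(0.975) * se

print(data.frame(n = sample_sizes,
                 se = round(se, 4),
                 width_95 = round(width_95, 3)))
```

Under this approximation the width shrinks with the square root of the sample size, so quadrupling n halves the width.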

Exercise 12

Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples. Hint: Does changing the number of bootstrap samples affect the standard error?

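The width of the intervals seemed to decrease with more bootstrap samples, but it was hard to know for sure because the intervals within a single plot already varied in width. In principle the width should not change systematically: the standard error depends on the sample size, not on the number of bootstrap samples. Drawing more bootstrap samples mainly makes the estimated endpoints more stable.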
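One way to examine this question outside the app is to hold a single sample fixed and vary only the number of bootstrap resamples. The base R sketch below does that with a stand-in sample of 37 "Yes" out of 60 (matching Exercise 1). The helper `ci_width`, the seed, the grid of resample counts, and the 20 repetitions per setting are assumptions for illustration.

```r
set.seed(99)  # arbitrary seed, used only for reproducibility

# Stand-in sample: 37 "Yes" (TRUE) and 23 "No" (FALSE) out of 60.
samp <- rep(c(TRUE, FALSE), times = c(37, 23))

# Hypothetical helper: width of a bootstrap percentile interval
# built from `reps` resamples of the fixed sample.
ci_width <- function(reps, level = 0.95) {
  boot_props <- replicate(reps, mean(sample(samp, replace = TRUE)))
  alpha <- 1 - level
  diff(unname(quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))))
}

reps_grid <- c(100, 1000, 5000)

# Repeat the whole bootstrap 20 times per setting to see how much
# the width moves around from run to run (rows = runs, columns = settings).
widths <- sapply(reps_grid, function(r) replicate(20, ci_width(r)))

summary_tbl <- data.frame(reps = reps_grid,
                          mean_width = colMeans(widths),
                          sd_width = apply(widths, 2, sd))
print(summary_tbl)
```

Comparing `mean_width` across rows shows whether the typical width shifts with the number of resamples, while `sd_width` shows how stable the width is from run to run.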

