The full assignment for this lab can be found here

In this lab, you will assume that \(\pi=62\%\) is the true population proportion. In reality, we cannot observe this value, but for the purpose of this lab we will create a hypothetical population with exactly this proportion. We will then draw samples from this hypothetical population and explore how they vary from one sample to another.

To keep our computation simple, we will assume a total population size of 100,000 (even though that’s smaller than the population size of all US adults).

# Load the tidyverse, mosaic, broom and infer packages
library(tidyverse)
library(mosaic)
library(broom)
library(infer)

# Use read_rds() to read the dataset
us_adults <- read_rds("data/climate_believers.rds")
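The data file itself is not reproduced here. As a rough sketch, a population with the properties described above (100,000 adults, 62% of whom are believers) could be built in base R as follows. The "Yes"/"No" coding is an assumption inferred from the `success = "Yes"` argument used in the bootstrap code of this lab, and `us_adults_sim` is a made-up name for the stand-in data frame.

```r
# Hypothetical stand-in for data/climate_believers.rds (assumption: the real
# file holds one "Yes"/"No" answer per adult in a column climate_believers).
pop_size <- 100000
pi_true  <- 0.62

n_yes <- round(pop_size * pi_true)   # 62,000 believers
n_no  <- pop_size - n_yes            # 38,000 non-believers

us_adults_sim <- data.frame(
  climate_believers = c(rep("Yes", n_yes), rep("No", n_no))
)

# Check the construction: counts and proportions per answer
table(us_adults_sim$climate_believers)
prop.table(table(us_adults_sim$climate_believers))
```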

Exercise 1

Question: We can visualize the hypothetical distribution of the responses in the population using a bar plot. Recreate the plot below using the ggplot(), geom_bar() and labs() layers. To flip the x and y coordinates, add the coord_flip() layer.

# Bar plot of the responses in the hypothetical population
us_adults |>
  ggplot(aes(x=climate_believers)) +
  geom_bar() +
  labs(title = "Believes in climate change affecting local community", x = " ") +
  coord_flip()

Exercise 2

Question: Print the summary statistics to confirm we constructed the data frame correctly. Use the count function to show the numeric quantities and use mutate(p = n /sum(n)) to calculate the proportions in the population. What is the proportion of climate-believers in our hypothetical population?

# Insert code for Exercise 2 here
us_adults |>
  count(climate_believers) |>
  mutate(p = n / sum(n))

Answer: The proportion of believers is 62%.

Exercise 3

Question: Calculate the proportions like we did in the previous question and answer the following: (1) What percent of your sample are climate-believers? (2) How does this compare to the proportion of climate-believers in the population? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.

# Insert code for Exercise 3 here
set.seed(143456)
n <- 60
samp_1 <- us_adults |>
  sample_n(size = n)

samp_1 |>
  count(climate_believers) |>
  mutate(p = n / sum(n))

Answer: (1) In this sample, the percentage of believers is 65%. (2) The proportion of believers is 62% in the population, but 65% in the sample. Conclusion: the sample has a slightly higher proportion of believers than the population.

Exercise 4

Question: Create code to generate a second sample (call it samp_2). Answer the same questions as before, but this time with respect to samp_2. How do the two samples compare? Explain any difference you found between the two samples.

# Insert code for the Exercise here
set.seed(8564565)
n <- 60
samp_2 <- us_adults |>
  sample_n(size = n)

samp_2 |>
  count(climate_believers) |>
  mutate(p = n / sum(n))

Answer: (1) In this sample (sample 2), the percentage of believers is 61.67%. (2) This is very close to the population proportion of 62%, whereas sample 1 had 65% believers. Conclusion: Sample 1 has a slightly higher proportion of believers than sample 2, and sample 2 happens to be closer to the proportion of the population. The difference between the two samples is due to random sampling variation: each sample contains a different random set of 60 adults.
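To see how much sample proportions can differ by chance alone, the sampling step can be repeated many times. The following is a minimal base R sketch, not part of the original lab code. It assumes the population described in the introduction (62,000 believers out of 100,000) and simulates the number of believers in each sample directly instead of reading the data file.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

pop_yes     <- 62000   # believers in the hypothetical population
pop_no      <- 38000   # non-believers
sample_size <- 60      # sample size used in this lab

# Number of believers in each of 5,000 samples drawn without replacement
# (this count follows a hypergeometric distribution)
yes_counts <- rhyper(5000, m = pop_yes, n = pop_no, k = sample_size)
p_hats     <- yes_counts / sample_size

mean(p_hats)                     # centre of the sample proportions
sd(p_hats)                       # typical sample-to-sample variation
sqrt(0.62 * 0.38 / sample_size)  # theoretical standard error, for comparison
```

Under these assumptions, a gap of a few percentage points between two samples of size 60 is well within the usual sample-to-sample variation.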

Exercise 5

Question: Run the proportion test (see code below) on the first sample samp_1, to estimate the proportion of climate-believers in the population. Now answer the following questions: (1) How does the estimation compare to the real proportion of climate-believers in the population? (2) What is the confidence interval associated with your estimation? (3) Is the proportion of climate-believers in the population contained within your confidence interval?

# Insert code for the Exercise here
prop_test(samp_1, climate_believers ~ NULL)
## No `p` argument was hypothesized, so the test will assume a null hypothesis `p
## = .5`.

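Answer: (1) The sample estimate is 0.65, which is close to the true population proportion of 0.62. Note that the p-value reported by prop_test() refers to the default null hypothesis p = 0.5 (see the message above), not to the true proportion, so p < 0.05 only tells us that the proportion of believers differs significantly from 50%. (2) The 95% confidence interval for the estimate is [0.515; 0.766]. (3) The proportion of climate-believers in the population (0.62) is indeed contained within the confidence interval.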
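To compare the estimate with the real proportion directly, the null value can be set to 0.62 instead of the default 0.5. The sketch below uses base R's prop.test(), which infer's prop_test() is documented as a tidier version of. It assumes that sample 1 contained 39 believers (65% of 60) rather than reading samp_1 itself.

```r
# Sample 1 had 65% believers out of 60 respondents, i.e. 39 "Yes" answers
x <- 39
n <- 60

# Test against the true population proportion instead of the default 0.5
res <- prop.test(x, n, p = 0.62, conf.level = 0.95)

res$estimate   # sample proportion
res$p.value    # p-value for H0: p = 0.62
res$conf.int   # 95% confidence interval (does not depend on the null value)
```

A large p-value here would mean there is no evidence that the sample estimate differs from the true proportion, which is what we expect for a random sample from this population.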

Exercise 6

Question: This code will create 1000 bootstrapping samples from samp_1, and use those samples to find the 95 percent confidence interval for proportion of climate-believers. Run the code and compare your results with the proportion test we’ve run in the previous question.

# Insert code for the Exercise here
samp_1 %>%
  specify(response = climate_believers, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

Answer: The confidence interval of this bootstrap is [0.517; 0.767]. This is very close to the confidence interval which was the result of question 5 [0.515; 0.766]. Both these confidence intervals contain the true population proportion, which is 0.62.
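For intuition, the same pipeline can be written out by hand. This is a base R sketch of a percentile bootstrap under the assumption that samp_1 holds 39 "Yes" and 21 "No" answers (65% of 60). The seed is arbitrary, so the endpoints will differ slightly from the ones reported above.

```r
set.seed(2023)  # arbitrary seed

# Stand-in for samp_1: 39 "Yes" and 21 "No" answers
answers <- c(rep("Yes", 39), rep("No", 21))

# generate(): draw 1000 resamples of the same size, with replacement
# calculate(): compute the proportion of "Yes" in each resample
boot_props <- replicate(1000, {
  resample <- sample(answers, size = length(answers), replace = TRUE)
  mean(resample == "Yes")
})

# get_ci(): percentile interval, i.e. the middle 95% of the bootstrap proportions
boot_ci <- quantile(boot_props, probs = c(0.025, 0.975))
boot_ci
```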

Exercise 7

Question: Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? Now run the bootstrapping method on samp_2. How do your results compare?

Each time you draw a new sample, you would get a different interval. What proportion of those intervals would you expect to contain the true population proportion?

# Insert code for the Exercise here
samp_2 %>%
  specify(response = climate_believers, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

Answer: The confidence interval of sample 2 is [0.483; 0.733]. This interval has the same width as the one for sample 1 ([0.517; 0.767]) but is shifted towards lower values; it also contains the true population proportion of 0.62. Since we use a 95% confidence level, we would expect about 95% of all intervals generated this way to contain the true population proportion.

Exercise 8

Question: Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Include an image of your plot with your answer (to learn how to include an image in your RMarkdown, see this).

Image of 50 confidence intervals, sample size = 60 & confidence level = 0.95

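Answer: 2 out of 50 confidence intervals do not contain the true population proportion, so 48 out of 50 (96%) do. This is not exactly equal to the 95% confidence level, but it is close. An exact match is not possible with 50 intervals (95% of 50 is 47.5), and the confidence level describes the long-run coverage over many repeated samples, so the share observed in any one batch of 50 intervals varies by chance.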
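The app's experiment can be approximated in a few lines of base R. This is only a sketch of the idea, not the app's actual code. It assumes the population from the introduction (62,000 believers, 38,000 non-believers) and uses the fact that resampling binary answers with replacement is equivalent to drawing the number of believers from a binomial distribution.

```r
set.seed(42)  # arbitrary seed

pi_true     <- 0.62
sample_size <- 60
reps        <- 1000   # bootstrap resamples per interval
n_intervals <- 50

covers <- replicate(n_intervals, {
  # One sample of size 60 from the hypothetical population
  p_hat <- rhyper(1, m = 62000, n = 38000, k = sample_size) / sample_size
  # Bootstrap proportions for that sample
  boot_props <- rbinom(reps, size = sample_size, prob = p_hat) / sample_size
  ci <- quantile(boot_props, probs = c(0.025, 0.975))
  # Does this interval contain the true proportion?
  ci[[1]] <= pi_true && pi_true <= ci[[2]]
})

sum(covers)    # intervals (out of 50) that contain the true proportion
mean(covers)   # observed coverage
```

Rerunning the block with different seeds shows the observed coverage moving around the nominal level rather than hitting it exactly.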

Exercise 9

Question: Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning and confirm your answer using the app. What is the proportion of intervals that include the true population proportion? How does this percentage compare to the confidence level selected for the intervals? Include an image of your plot with your answer.

Answer: A lower confidence level gives narrower confidence intervals. I chose a 50% confidence level: the interval only has to capture the true proportion about half of the time, so it can cover a smaller range of values (but it will miss about 50% of the time!). At a 95% confidence level, the interval has to be wider so that it contains the true population proportion 95% of the time.

The image below makes this clearer. The proportion of confidence intervals that contain the true population proportion is 23 out of 50 (46%), which is a bit less than the 50% confidence level.

Image of 50 confidence intervals, sample size = 60 & confidence level = 0.50
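The relationship between confidence level and width can also be checked numerically. The sketch below uses the normal approximation to the interval for a proportion, which is an assumption made for simplicity (the app uses the bootstrap), together with the 65% estimate from sample 1.

```r
p_hat <- 0.65   # proportion of believers in sample 1
n     <- 60

conf_levels <- c(0.50, 0.80, 0.95, 0.99)
se          <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value z* for each level, and the interval width 2 * z* * SE
z_star <- qnorm(1 - (1 - conf_levels) / 2)
widths <- 2 * z_star * se

data.frame(level = conf_levels, z_star = round(z_star, 3), width = round(widths, 3))
```

The standard error is the same in every row; only the critical value changes, which is why the width grows with the confidence level.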

Exercise 10

Question: Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases). Include an image of your plot with your answer.

Answer: Below are three sets of bootstrap intervals for three different sample sizes. As you can see, the bigger the sample size, the narrower the interval. This is because the standard error decreases as the sample size increases.

Image of 50 confidence intervals, sample size = 300 & confidence level = 0.95
Image of 50 confidence intervals, sample size = 600 & confidence level = 0.95
Image of 50 confidence intervals, sample size = 10 & confidence level = 0.95
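The effect of sample size follows from the standard error of a proportion, \(\sqrt{\pi(1-\pi)/n}\). As a rough check, the sketch below computes the standard error and an approximate 95% interval width for the sample sizes used in this lab; the normal approximation is an assumption and is crude for n = 10.

```r
pi_true <- 0.62
sizes   <- c(10, 60, 300, 600)

# Standard error of the sample proportion for each sample size
se <- sqrt(pi_true * (1 - pi_true) / sizes)

# Approximate width of a 95% interval: 2 * z* * SE
width <- 2 * qnorm(0.975) * se

data.frame(n = sizes, se = round(se, 3), approx_width = round(width, 3))
```

Because the standard error shrinks with the square root of n, doubling the sample size from 300 to 600 narrows the interval by a factor of about 1.4, not 2.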

Exercise 11

Question: Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Include an image of your plot with your answer.

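Answer: Below are three sets of bootstrap intervals built with different numbers of bootstrap resamples. With a small number of resamples, the intervals look more erratic (there are more outlying intervals), which suggests that the interval endpoints are estimated less precisely.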

Image of 50 confidence intervals, sample size = 60 & n of resamples = 5000
Image of 50 confidence intervals, sample size = 60 & n of resamples = 5000
Image of 50 confidence intervals, sample size = 60 & n of resamples = 10
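One way to probe this observation is to hold the sample fixed and rebuild the interval many times with different numbers of resamples. The base R sketch below is an illustration under stated assumptions, not the app's code. It assumes a sample of 60 with 65% believers, uses a binomial draw as a shortcut for resampling binary answers with replacement, and relies on a made-up helper `boot_width()`.

```r
set.seed(7)  # arbitrary seed

n     <- 60
p_hat <- 0.65   # sample 1: 65% believers

# Width of one 95% percentile interval built from `reps` bootstrap resamples
boot_width <- function(reps) {
  boot_props <- rbinom(reps, size = n, prob = p_hat) / n
  unname(diff(quantile(boot_props, probs = c(0.025, 0.975))))
}

# Build 200 intervals for each number of resamples and summarise their widths
summary_by_reps <- sapply(c(10, 1000, 5000), function(reps) {
  widths <- replicate(200, boot_width(reps))
  c(reps = reps, mean_width = mean(widths), sd_width = sd(widths))
})

round(t(summary_by_reps), 3)
```

The `sd_width` column shows how much the width fluctuates between repeated intervals for the same sample, which is the instability visible in the plot with 10 resamples.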
Bonus cat