The full assignment for this lab can be found here

In this lab, you will assume that \(\pi=62\%\) is the true population proportion. In reality, we cannot observe this value, but for the purpose of this lab we will create a hypothetical population in which it holds. We will then draw samples from this hypothetical population and explore how they vary from one sample to the next.

To keep our computation simple, we will assume a total population size of 100,000 (even though that’s smaller than the population size of all US adults).

# Load the tidyverse, infer, broom and mosaicData packages
library(tidyverse)
library(infer)
library(broom)
library(mosaicData)

# Use read_rds() to read the dataset
us_adults <- read_rds("data/climate_believers.rds")
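The file data/climate_believers.rds is not reproduced here. As a rough stand-in, a population with the size and proportion described above could be built directly. This is only a sketch: the name us_adults_sim is hypothetical, and the "Yes"/"No" coding is an assumption based on the success = "Yes" argument used in the later bootstrap code.

```r
# Hypothetical stand-in for the lab's population:
# 100,000 adults, of whom 62,000 (62%) answer "Yes" (assumed coding).
us_adults_sim <- data.frame(
  climate_believers = c(rep("Yes", 62000), rep("No", 38000)),
  stringsAsFactors = FALSE
)

# Check the size and the proportion of each answer
nrow(us_adults_sim)
prop.table(table(us_adults_sim$climate_believers))
```

Using whole counts (62000 and 38000) rather than multiplying 0.62 by the population size avoids floating-point surprises when the counts are passed to rep().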

Exercise 1

Question: We can visualize the hypothetical distribution of the responses in the population using a bar plot. Recreate the plot below using the ggplot(), geom_bar() and labs() layers. To flip the x and y coordinates, add the coord_flip() layer.

us_adults %>%
  ggplot(aes(climate_believers)) +
  geom_bar() +
  labs(title = "Do you think climate change is affecting your local community?") +
  coord_flip()  # flip the x and y coordinates, as the question asks

Exercise 2

Question: Print the summary statistics to confirm we constructed the data frame correctly. Use the count function to show the numeric quantities and use mutate(p = n /sum(n)) to calculate the proportions in the population. What is the proportion of climate-believers in our hypothetical population?

us_adults %>%
  count(climate_believers) %>%
  mutate(p = n / sum(n))

Answer: The proportion of believers is 62%

Exercise 3

Question: Calculate the proportions like we did in the previous question and answer the following: (1) What percent of your sample are climate-believers? (2) How does this compare to the proportion of climate-believers in the population? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.

# Insert code for Exercise 3 here

set.seed(35797)
n <- 60
samp_1 <- us_adults %>%
  sample_n(size = n)

samp_1 %>%
  count(climate_believers) %>%
  mutate(p = n / sum(n))

Answer: The proportion of believers in the sample is 61.7%. This is very close to the population proportion of 62% from the previous question; because the sample is drawn at random from the population, the sample proportion should not deviate too much from it. Had I not used set.seed(), a different sample would be drawn on each run, so the calculated proportion would vary slightly from run to run while staying around the population proportion.
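To get a feel for how much a sample proportion can move around, one could draw many samples of size 60 and look at the spread of the resulting proportions. The sketch below uses a simulated population with 62% "Yes" answers (an assumption standing in for us_adults) and base R's sample() instead of sample_n(); the seed and the number of repetitions are arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch

# Simulated population (assumption): 62,000 "Yes" and 38,000 "No"
population <- c(rep("Yes", 62000), rep("No", 38000))
n <- 60

# Draw 2000 samples of size n and record the proportion of "Yes" in each
p_hats <- replicate(2000, mean(sample(population, n) == "Yes"))

mean(p_hats)                # centred near the population proportion
sd(p_hats)                  # spread of the sample proportions
sqrt(0.62 * (1 - 0.62) / n) # theoretical standard error, about 0.063
```

A standard error of roughly 6 percentage points means that sample proportions a few points away from 62% are entirely ordinary for samples of this size.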

Exercise 4

Question: Create code to generate a second sample (call it samp_2). Answer the same questions as before, but this time with respect to samp_2. How do the two samples compare? Explain any difference you found between the two samples.

n <- 60
samp_2 <- us_adults %>%
  sample_n(size = n)

samp_2 %>% 
  count(climate_believers) %>% 
  mutate(p = n / sum(n))

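Answer: The proportion of believers in samp_2 is 60%, compared with 61.7% in samp_1. The two proportions differ because each sample is a different random draw of 60 people from the population, but both are close to the population proportion of 62%. Because set.seed() is not called immediately before this draw, re-running this chunk on its own gives a different sample, and therefore a slightly different proportion, each time. Calling set.seed() first would make R draw the same sample on every run.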
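The effect of set.seed() can be checked directly. The following sketch uses base R's sample() on a simulated population (an assumption standing in for us_adults); the same principle applies to sample_n(), which relies on the same random number generator.

```r
# Simulated population (assumption): 62,000 "Yes" and 38,000 "No"
population <- c(rep("Yes", 62000), rep("No", 38000))

set.seed(35797)
draw_a <- sample(population, 60)

set.seed(35797)                    # reset to the same seed
draw_b <- sample(population, 60)   # reproduces draw_a exactly

draw_c <- sample(population, 60)   # no reset: continues the random stream

identical(draw_a, draw_b)  # TRUE: same seed, same sample
mean(draw_a == "Yes")      # proportion in the seeded draw
mean(draw_c == "Yes")      # proportion in the follow-up draw, usually different
```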

Exercise 5

Question: Run the proportion test (see code below) on the first sample samp_1, to estimate the proportion of climate-believers in the population. Now answer the following questions: (1) How does the estimation compare to the real proportion of climate-believers in the population? (2) What is the confidence interval associated with your estimation? (3) Is the proportion of climate-believers in the population contained within your confidence interval?

prop_test(samp_1, climate_believers ~ NULL) 
## No `p` argument was hypothesized, so the test will assume a null hypothesis `p
## = .5`.

Answer: The point estimate from samp_1 is 61.7%, which is very close to the true population proportion of 62%. We are 95% confident that the true proportion of climate believers is between 48.2% and 73.6%. This interval contains the true population proportion of 62%.
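infer's prop_test() is documented as a tidy interface to base R's prop.test(), so the interval can be cross-checked without the data file. The sketch below assumes that 61.7% of 60 respondents corresponds to 37 "Yes" answers; that count is inferred from the reported percentage, not read from the data.

```r
# Assumption: 37 of the 60 people in samp_1 answered "Yes" (37 / 60 = 61.7%)
res <- prop.test(x = 37, n = 60, conf.level = 0.95)

res$estimate   # point estimate of the proportion
res$conf.int   # 95% interval; should be close to the 48.2% to 73.6% reported above
```

By default prop.test() applies a continuity correction and returns a score-based interval, which is why the interval is not perfectly symmetric around the point estimate.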

Exercise 6

Question: This code will create 1000 bootstrapping samples from samp_1, and use those samples to find the 95 percent confidence interval for proportion of climate-believers. Run the code and compare your results with the proportion test we’ve run in the previous question.

samp_1 %>%
  specify(response = climate_believers, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

Answer: We are 95% confident that the true proportion of climate believers is between 50% and 73.3%. This is close to the interval from the proportion test in the previous question (48.2% to 73.6%), although slightly narrower. So even without knowing what the full population looks like, the sample alone lets us be 95% confident that the true proportion of US adults who are climate believers lies between the two bounds returned by this pipeline.
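What the specify / generate / calculate / get_ci pipeline does can be approximated by hand in base R: resample the 60 observations with replacement, compute the proportion of "Yes" in each resample, and take the 2.5% and 97.5% quantiles (a percentile interval). This is a sketch under the assumption that the sample contains 37 "Yes" and 23 "No" answers; the seed is arbitrary, so the bounds will not match the pipeline's output exactly.

```r
set.seed(2024)  # arbitrary seed for this sketch

# Assumed sample: 37 "Yes" and 23 "No" (61.7% of 60)
samp <- c(rep("Yes", 37), rep("No", 23))

# 1000 bootstrap resamples: same size as the sample, drawn with replacement
boot_props <- replicate(1000, mean(sample(samp, replace = TRUE) == "Yes"))

# Percentile interval: middle 95% of the bootstrap proportions
boot_ci <- quantile(boot_props, probs = c(0.025, 0.975))
boot_ci
```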

Exercise 7

Question: Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? Now run the bootstrapping method on samp_2. How do your results compare?

Each time you draw a new sample, you would get a different interval. What proportion of those intervals would you expect to contain the true population proportion?

samp_2 %>%
  specify(response = climate_believers, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

Answer: We are 95% confident that the true proportion of climate believers is between 46.7% and 71.7%. This interval is similar to the one obtained from samp_1 in the previous question, and it also contains the true population proportion of 62%. Because the confidence level is 95%, if we repeatedly drew new samples and built an interval from each, we would expect about 95% of those intervals to contain the true population proportion.

Exercise 8

Question: Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Include an image of your plot with your answer (to learn how to include an image in your RMarkdown, see this).

Answer: This proportion is not exactly equal to the confidence level. Of the 50 confidence intervals, 2 (4%) fail to capture the true population proportion, so 96% capture it, whereas the confidence level is 95%. An exact match is not even possible here, since 5% of 50 intervals would be 2.5 intervals. The confidence level describes long-run behaviour: each run of the app draws new random samples, so the number of misses varies from run to run (another run might give 3 misses, for example), and only over many runs would we expect about 5% of intervals to miss the true population proportion.
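The run-to-run variation can be imitated without the app. The sketch below is not the app's code; it is a rough re-creation under stated assumptions: a simulated population with 62% "Yes", samples of 60, percentile bootstrap intervals from 1000 resamples, and batches of 50 intervals. For a yes/no variable, the proportion in a bootstrap resample follows a binomial distribution, so rbinom() is used as a fast equivalent of resampling with replacement.

```r
set.seed(123)  # arbitrary seed for this sketch

population <- c(rep("Yes", 62000), rep("No", 38000))  # assumed population
true_p <- 0.62

# Build one percentile bootstrap interval from a fresh sample of size n
one_interval <- function(n = 60, reps = 1000, level = 0.95) {
  p_hat <- mean(sample(population, n) == "Yes")
  boot_props <- rbinom(reps, size = n, prob = p_hat) / n
  alpha <- 1 - level
  unname(quantile(boot_props, c(alpha / 2, 1 - alpha / 2)))
}

# Does an interval contain the true proportion?
covers <- function(ci) ci[1] <= true_p && true_p <= ci[2]

# 20 "app runs", each with 50 intervals: share of intervals that capture true_p
coverage <- replicate(20, mean(replicate(50, covers(one_interval()))))

coverage        # varies from run to run, in steps of 1/50
mean(coverage)  # long-run share, expected to be in the neighbourhood of 0.95
```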

Exercise 9

Question: Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning and confirm it using the app. What is the proportion of intervals that include the true population proportion? How does this percentage compare to the confidence level selected for the intervals? Include an image of your plot with your answer.

Answer: In this case a confidence level of 0.99 is chosen. A 99% confidence interval is wider than a 95% interval built from the same sample: to be more confident of capturing the true population proportion, the interval has to cover a larger range of values. As a result, more intervals include the true population proportion. In this case, 1 of the 50 confidence intervals (2%) does not include it, so 98% do. Over many repetitions we would expect about 1% of intervals to miss (for a confidence level of 0.99), which is fewer than the roughly 5% expected at the 0.95 level in the previous question.
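How the confidence level relates to interval width can be checked with base R's prop.test() on a single sample. The count of 37 "Yes" answers out of 60 is an assumption carried over from the 61.7% reported for samp_1.

```r
# Same assumed sample (37 "Yes" out of 60), two confidence levels
ci_95 <- prop.test(x = 37, n = 60, conf.level = 0.95)$conf.int
ci_99 <- prop.test(x = 37, n = 60, conf.level = 0.99)$conf.int

width_95 <- diff(ci_95)  # upper bound minus lower bound
width_99 <- diff(ci_99)

c(width_95 = width_95, width_99 = width_99)
```

The 99% interval comes out wider than the 95% interval, which is the price paid for missing the true proportion less often.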

Exercise 10

Question: Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases). Include an image of your plot with your answer.

Answer: If we increase the sample size from 100 (first image) to 1000 (second image), keeping the other settings at their defaults (confidence level = 0.95, 1000 resamples per bootstrap interval, 50 confidence intervals), the confidence intervals become much narrower. Note that the scale of the x-axis differs between the two images. As the sample size increases, the estimate becomes more precise and the intervals shrink; conversely, a smaller sample size gives wider intervals.
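A back-of-the-envelope calculation gives a sense of the scale involved. The sketch below uses the normal-approximation width 2 * z * sqrt(p(1 - p) / n) with p = 0.62; this formula is a swapped-in approximation, not the bootstrap procedure the app uses, so the numbers are only indicative.

```r
p <- 0.62                # true population proportion from the lab
n <- c(60, 100, 1000)    # sample sizes to compare
z <- qnorm(0.975)        # critical value for a 95% interval

# Approximate interval width for each sample size
width <- 2 * z * sqrt(p * (1 - p) / n)

data.frame(n = n, approx_width = round(width, 3))
# approx_width: 0.246, 0.190, 0.060
```

Because the width shrinks with the square root of n, a tenfold increase in sample size (100 to 1000) narrows the interval by a factor of about 3.2, not 10.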

Exercise 11

Question: Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Include an image of your plot with your answer.

Answer: When the number of bootstrap samples is increased from 100 (first picture) to 500 (second picture), the number of confidence intervals that do not capture the true population proportion appears to decrease. However, this is most likely run-to-run variation rather than a systematic effect. The width of a bootstrap interval is driven mainly by the sample size and the confidence level, not by the number of bootstrap samples, so for a fixed sample size of 60 the intervals should have roughly the same width on average. What more bootstrap samples do provide is a more stable estimate of the bootstrap distribution, so the interval endpoints vary less from one run to the next. Resampling more often does not add information about the population, because every bootstrap sample is drawn from the same 60 observations.
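The role of the number of bootstrap samples can be explored with a small simulation. This is a sketch under stated assumptions: a fixed sample with 37 "Yes" out of 60, percentile intervals at the 95% level, and rbinom() as a fast equivalent of resampling a yes/no variable with replacement. The grid of bootstrap sizes and the 200 repetitions are arbitrary choices.

```r
set.seed(99)  # arbitrary seed for this sketch

n <- 60
p_hat <- 37 / 60  # assumed sample proportion (61.7%)

# Width of one 95% percentile bootstrap interval built from `reps` resamples
boot_width <- function(reps) {
  boot_props <- rbinom(reps, size = n, prob = p_hat) / n
  unname(diff(quantile(boot_props, c(0.025, 0.975))))
}

# Repeat the whole procedure 200 times for each number of bootstrap samples
reps_grid <- c(100, 500, 5000)
widths <- sapply(reps_grid, function(r) replicate(200, boot_width(r)))
colnames(widths) <- reps_grid

# Average width and run-to-run variability of the width
rbind(mean_width = colMeans(widths), sd_width = apply(widths, 2, sd))
```

In a simulation like this the average width should stay roughly constant across the columns, while the run-to-run variability of the width shrinks as the number of bootstrap samples grows.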