The full assignment for this lab can be found here
In this lab, you will assume that \(\pi=62\%\) is the true population proportion. In reality, we cannot observe this value, but for the purpose of this lab we will create this hypothetical population. We will then sample our data from our hypothetical population, exploring how samples vary from one to another.
To keep our computation simple, we will assume a total population size of 100,000 (even though that’s smaller than the population size of all US adults).
# Load the tidyverse, mosaic, broom and infer packages
library(tidyverse)
library(mosaic)
library(broom)
library(infer)
# Use read_rds() to read the dataset
us_adults <- read_rds("data/climate_believers.rds")
Question: We can visualize the hypothetical distribution of the responses in the population using a bar plot. Recreate the plot below using the ggplot(), geom_bar() and labs() layers. To flip the x and y coordinates, add the coord_flip() layer.
# Create a bar plot of the responses in the population
us_adults %>%
  ggplot(aes(x = climate_believers)) +
  geom_bar() +
  labs(title = "Do you think climate change is affecting your local community?", x = " ") +
  coord_flip()
Question: Print the summary statistics to confirm we constructed the data frame correctly. Use the count function to show the numeric quantities and use mutate(p = n / sum(n)) to calculate the proportions in the population. What is the proportion of climate-believers in our hypothetical population?
Answer: The proportion of our hypothetical population of US adults who think climate change affects their local community is 0.62.
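The code for this step is not shown in the write-up. As a rough, self-contained sketch of the same check in base R, the block below builds a stand-in population of 100,000 adults with 62% "Yes" responses and tabulates it. The `population` data frame and the "No" label are assumptions made for illustration; the lab itself reads the data from an .rds file and uses count() with mutate().

```r
# Stand-in for the lab's population: 100,000 adults, 62% answering "Yes".
# The "No" label is an assumption; only "Yes" appears in the lab code.
population <- data.frame(
  climate_believers = c(rep("Yes", 62000), rep("No", 38000))
)

# Count each response, then convert the counts to proportions.
counts <- table(population$climate_believers)
props <- prop.table(counts)

print(counts)
print(props)
```

With the real data, the equivalent tidyverse call would follow the pattern named in the question: count the responses, then divide each count by the total.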
Question: Calculate the proportions like we did in the previous question and answer the following: (1) What percent of your sample are climate-believers? (2) How does this compare to the proportion of climate-believers in the population? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.
# Insert code for Exercise 3 here
set.seed(220401)
n <- 60
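# Draw a random sample of 60 adults from the population
sampl_1 <- us_adults %>%
  sample_n(size = n)
# Count the responses and compute the sample proportions
sampl_1 %>%
  count(climate_believers) %>%
  mutate(p = n / sum(n))
Answer: In sample 1, the proportion of climate believers is 66.7%: exactly 40 people are climate believers and 20 are not. This is somewhat higher than the 62% in the population.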
Question: Create code to generate a second sample
(call it samp_2). Answer the same questions as before, but
this time with respect to samp_2. How do the two samples
compare? Explain any difference you found between the two samples.
set.seed(012523)
n <- 60
# Draw a second random sample of 60 adults
sampl_2 <- us_adults %>%
  sample_n(size = n)
# Count the responses and compute the sample proportions
sampl_2 %>%
  count(climate_believers) %>%
  mutate(p = n / sum(n))
Answer: Sample 2 shows a proportion of 0.567 climate believers vs 0.433 non-believers. The proportion of believers in sample 1 is 10 percentage points higher (0.667 vs 0.567). These differences are due to random variability in drawing a sample from the population. Both sample proportions are still reasonably close to the population proportion of 0.62.
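To get a feel for how much sample proportions vary by chance alone, here is a small simulation sketch. It assumes that drawing 60 people from the large population can be approximated by binomial draws with probability 0.62; the seed and the number of repeated samples (`n_samples`) are arbitrary choices, not values from the lab.

```r
set.seed(1)  # arbitrary seed, not one used in the lab

pi_true <- 0.62    # population proportion from the lab
n <- 60            # sample size from the lab
n_samples <- 5000  # hypothetical number of repeated samples

# Each draw is the number of "Yes" answers in one sample of size n.
# Binomial draws approximate sampling from the large population.
p_hat <- rbinom(n_samples, size = n, prob = pi_true) / n

# Spread of the sample proportions versus the theoretical standard error.
sd_simulated <- sd(p_hat)
se_theory <- sqrt(pi_true * (1 - pi_true) / n)

cat("simulated SD of sample proportions:", round(sd_simulated, 3), "\n")
cat("theoretical standard error:", round(se_theory, 3), "\n")
```

The theoretical standard error is about 0.063, so sample proportions of 0.667 and 0.567 each sit within one standard error of 0.62.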
Question: Run the proportion test (see code below)
on the first sample samp_1, to estimate the proportion of
climate-believers in the population. Now answer the following questions:
(1) How does the estimation compare to the real proportion of
climate-believers in the population? (2) What is the confidence interval
associated with your estimation? (3) Is the proportion of
climate-believers in the population contained within your confidence
interval?
## No `p` argument was hypothesized, so the test will assume a null hypothesis `p
## = .5`.
Answer: Our sample estimate is 0.667, slightly above the true population proportion of 0.62. The confidence interval is [0.532, 0.780], and it contains the true value of 0.62. Our H0 is that the true population proportion equals p = 0.5. The test gives a p-value of 0.0142, indicating that we should reject H0. This seems right, given that we know the true value is 0.62 and not 0.5.
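The test code itself does not appear in this write-up. As a stand-in, the sketch below runs base R's prop.test() on the counts reported for sample 1 (40 of 60). It is an assumption that this matches the lab's test; prop.test() applies a continuity correction by default, and the result should land close to the interval and p-value quoted in the answer.

```r
# Counts from sample 1 in the lab: 40 "Yes" out of 60.
x <- 40
n <- 60

# One-sample proportion test against the null p = 0.5.
# prop.test() uses Yates' continuity correction by default.
res <- prop.test(x = x, n = n, p = 0.5, conf.level = 0.95)

estimate <- unname(res$estimate)
ci <- as.numeric(res$conf.int)

cat("estimate:", round(estimate, 3), "\n")
cat("95% CI:", round(ci, 3), "\n")
cat("p-value:", round(res$p.value, 4), "\n")
cat("CI contains 0.62:", ci[1] <= 0.62 && 0.62 <= ci[2], "\n")
```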
Question: This code will create 1000 bootstrapping
samples from samp_1, and use those samples to find the 95
percent confidence interval for proportion of climate-believers. Run the
code and compare your results with the proportion test we’ve run in the
previous question.
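# Bootstrap 95% confidence interval for the proportion of "Yes" in sample 1
sampl_1 %>%
  specify(response = climate_believers, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
Answer: The confidence interval we obtained is [0.533, 0.783], which is nearly identical to the interval from the proportion test, [0.532, 0.780]. A 95% confidence level means that if we repeated the sampling many times and built an interval from each sample, about 95% of those intervals would contain the true value. Because of the error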
Question: Does your confidence interval capture the
true population proportion of US adults who think climate change affects
their local community? Now run the bootstrapping method on
samp_2. How do your results compare?
Each time you draw a new sample, you would get a different interval. What proportion of those intervals would you expect to contain the true population proportion?
# Bootstrap 95% confidence interval for the proportion of "Yes" in sample 2
sampl_2 %>%
  specify(response = climate_believers, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
Answer: The confidence interval from sample 1, [0.533, 0.783], does include the true population proportion of 0.62. Repeating the bootstrap with sample 2 gives [0.450, 0.700], which again includes the true population proportion. The two CIs have the same width of 0.25 (the sample size is the same), but they cover slightly different ranges due to sampling variability. If we repeated this process 100 times, we would expect about 95 of the intervals to include the true population proportion.
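To make the percentile bootstrap less of a black box, here is a minimal base R sketch of the same idea. The helper `boot_prop_ci()` is hypothetical (it is not part of infer), the seed is arbitrary, and the two samples are rebuilt from the reported proportions (40/60 and 34/60, the latter inferred from 0.567), so the endpoints will differ slightly from those above.

```r
set.seed(1)  # arbitrary seed, not one used in the lab

# Percentile bootstrap CI for a proportion, given a 0/1 sample.
# Hypothetical helper written for illustration only.
boot_prop_ci <- function(sample01, reps = 1000, level = 0.95) {
  alpha <- 1 - level
  # Resample with replacement and record the proportion each time.
  boot_props <- replicate(
    reps,
    mean(sample(sample01, size = length(sample01), replace = TRUE))
  )
  # The interval is the middle `level` share of the bootstrap proportions.
  as.numeric(quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2)))
}

# Samples reconstructed from the reported counts (1 = "Yes", 0 = "No").
samp_a <- c(rep(1, 40), rep(0, 20))
samp_b <- c(rep(1, 34), rep(0, 26))

ci_a <- boot_prop_ci(samp_a)
ci_b <- boot_prop_ci(samp_b)
print(round(ci_a, 3))
print(round(ci_b, 3))
```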
Question: Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Include an image of your plot with your answer (to learn how to include an image in your RMarkdown, see this).
Answer: We see that only 2 out of 50 CIs do not capture the true population proportion, so 4% miss it and 96% include it. This is not exactly equal to the 95% confidence level. With 50 intervals, a miss rate of exactly 5% would mean 2.5 intervals, which cannot happen, and the number of misses varies randomly from run to run. If we were to repeat this, our next draw of 50 samples might contain, e.g., 3 CIs that don’t capture the true value, which would be slightly above 5%. Combining the two runs, 5 out of 100 CIs would miss the true value, which matches the expected 5%.
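The run-to-run variation in the number of hits can be explored with a simulation sketch. It mimics the app's defaults (n = 60, 1000 resamples, 50 intervals) but is not the app's code; sampling from the population is approximated by binomial draws, and the seed and the 200 repeated runs are arbitrary choices.

```r
set.seed(1)  # arbitrary seed

pi_true <- 0.62  # population proportion from the lab
n <- 60          # sample size
reps <- 1000     # bootstrap resamples per interval

# One experiment: draw a sample, bootstrap it, and check whether the
# 95% percentile interval contains the true proportion.
covers_truth <- function() {
  p_hat <- rbinom(1, size = n, prob = pi_true) / n
  # Resampling a 0/1 sample with replacement is a binomial draw at p_hat.
  boot_props <- rbinom(reps, size = n, prob = p_hat) / n
  ci <- quantile(boot_props, probs = c(0.025, 0.975))
  ci[[1]] <= pi_true && pi_true <= ci[[2]]
}

# 50 intervals per run, repeated for 200 runs.
hits_per_run <- replicate(200, sum(replicate(50, covers_truth())))

print(table(hits_per_run))
long_run_coverage <- sum(hits_per_run) / (200 * 50)
cat("long-run coverage:", round(long_run_coverage, 3), "\n")
```

Individual runs of 50 intervals scatter around the nominal level, while the coverage pooled over all runs is much more stable.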
Question: Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning and confirm your reasoning using the app. What is the proportion of intervals that include the true population proportion? How does this percentage compare to the confidence level selected for the intervals? Include an image of your plot with your answer.
Answer: Let’s say we choose a confidence level of only 90%. In this case, we would expect the CIs to be narrower than at 95%: a lower confidence level uses a smaller margin of error around the same sample estimate, so each interval covers a smaller range. The price is that more intervals miss the true value, about 10% instead of the 5% we had when our confidence level was 95%. The width also depends on the sample characteristics (sample size and standard deviation). In the long run, the proportion of CIs that include the true population proportion is 90%, which is equal to the confidence level that we chose. In our case, 7 out of 50 CIs do not include the true value, which is equal to 14% and higher than the 10% we were aiming for. In another sample draw we could see fewer misses, so that the share would average out to about 10%.
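How the confidence level feeds into interval width can be sketched with the normal approximation, where the width is twice the critical value times the standard error. This is an illustration under that approximation, not the bootstrap procedure used by the app; the set of levels is an arbitrary choice.

```r
p_hat <- 40 / 60  # sample 1 proportion from the lab
n <- 60
se <- sqrt(p_hat * (1 - p_hat) / n)

# Width of a normal-approximation interval: 2 * z * SE,
# where z is the critical value for the chosen confidence level.
levels <- c(0.80, 0.90, 0.95, 0.99)
z <- qnorm(1 - (1 - levels) / 2)
widths <- 2 * z * se

print(data.frame(level = levels, z = round(z, 3), width = round(widths, 3)))
```

Because the standard error is fixed for a given sample, the width grows with the critical value alone, so a 90% interval is always narrower than a 95% interval built from the same sample.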
Question: Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases). Include an image of your plot with your answer.
Answer: We see that changing the sample size substantially changes the CI width: larger samples give narrower intervals and smaller samples give wider ones. While the intervals appear to have the same width in the images, note that the scale of the x-axis changes a lot (from 0.4 to 0.8 for n = 60, down to 0.575 to 0.650 for n = 1000). Narrower CIs mean that the true value can be estimated with more precision.
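The effect of sample size can be approximated with the same normal-approximation width, now evaluated at several sample sizes. The intermediate size of 250 is an arbitrary choice for illustration; 60 and 1000 are the sizes mentioned in the answer.

```r
pi_true <- 0.62
sizes <- c(60, 250, 1000)

# Approximate 95% interval width at each sample size: 2 * z * SE.
widths <- 2 * qnorm(0.975) * sqrt(pi_true * (1 - pi_true) / sizes)
print(data.frame(n = sizes, width = round(widths, 3)))

# The width scales with 1 / sqrt(n), so the ratio between n = 60 and
# n = 1000 equals sqrt(1000 / 60).
ratio <- widths[1] / widths[3]
cat("width ratio (n = 60 vs n = 1000):", round(ratio, 2), "\n")
```

Since width shrinks with the square root of n, halving the width requires roughly four times as many observations.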
Question: Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Include an image of your plot with your answer.
Answer: In the app, increasing the number of bootstrap samples seems to reduce the variability across the CIs. The width itself should not shrink systematically, though: it is determined mainly by the sample size (here 60), not by the number of resamples. What more resamples do is estimate the 2.5% and 97.5% cut-offs of the bootstrap distribution more precisely, so the interval endpoints fluctuate less from run to run. This is a reduction in simulation noise rather than an effect of the central limit theorem, which concerns the size of the sample itself.
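One way to check what the number of bootstrap samples does is to rebuild the interval many times from the same sample and compare the widths. The sketch below does this for sample 1 (40 of 60); the helper `boot_width()`, the seed, the resample counts, and the 200 repetitions are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed

n <- 60
p_hat <- 40 / 60  # sample 1 proportion from the lab

# Width of one 95% percentile bootstrap interval built from `reps` resamples.
# Resampling a 0/1 sample with replacement is a binomial draw at p_hat.
boot_width <- function(reps) {
  boot_props <- rbinom(reps, size = n, prob = p_hat) / n
  ci <- quantile(boot_props, probs = c(0.025, 0.975))
  ci[[2]] - ci[[1]]
}

# Rebuild the interval 200 times at each number of resamples.
rep_counts <- c(100, 1000, 10000)
widths <- sapply(rep_counts, function(r) replicate(200, boot_width(r)))

summary_tbl <- data.frame(
  reps = rep_counts,
  mean_width = colMeans(widths),
  sd_width = apply(widths, 2, sd)
)
print(summary_tbl)
```

The average width stays in the same neighbourhood across the three settings, while the spread of the widths drops as the number of resamples grows.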