The full assignment for this lab can be found here
In this lab, you will assume that \(\pi=62\%\) is the true population proportion. In reality, we cannot observe this value, but for the purpose of this lab we will create this hypothetical population. We will then sample our data from our hypothetical population, exploring how samples vary from one to another.
To keep our computation simple, we will assume a total population size of 100,000 (even though that’s smaller than the population size of all US adults).
# Load the tidyverse, broom, and infer packages
library(tidyverse)
library(broom)
library(infer)
# Use read_rds() to read the dataset
us_adults <- read_rds("data/climate_believers.rds")
Question: We can visualize the hypothetical
distribution of the responses in the population using a bar plot.
Recreate the plot below using the ggplot(),
geom_bar() and labs() layers. To flip the
x and y coordinates, add the
coord_flip() layer.
# Bar plot of the responses in the hypothetical population
# (us_adults was already loaded above)
us_adults |>
  ggplot(aes(x = climate_believers)) +
  geom_bar() +
  labs(title = "Do you think climate change is affecting your local community?") +
  coord_flip()
Question: Print the summary statistics to confirm we
constructed the data frame correctly. Use the count
function to show the numeric quantities and use
mutate(p = n /sum(n)) to calculate the proportions in the
population. What is the proportion of climate-believers in our
hypothetical population?
Answer: In the population, 62% of people believe that climate change affects their local community.
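The summary described in this question can be sketched without the data file. The block below is a minimal base R illustration, assuming the hypothetical population is simply 62,000 "Yes" and 38,000 "No" responses in a column named `climate_believers`; how the actual `.rds` file was built is not shown in this lab, so this construction is an assumption. `table()` and `prop.table()` stand in for `count()` and `mutate(p = n / sum(n))`.

```r
# Assumed construction of the hypothetical population (not taken from the lab's data file):
# 100,000 adults, 62% of whom answer "Yes".
population <- data.frame(
  climate_believers = c(rep("Yes", 62000), rep("No", 38000))
)

# Counts per response (base R analogue of count())
counts <- table(population$climate_believers)

# Proportions per response (base R analogue of mutate(p = n / sum(n)))
props <- prop.table(counts)

print(counts)
print(props)
```

The proportion for "Yes" is the population parameter that the later samples try to estimate.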
Question: Calculate the proportions like we did in the previous question and answer the following: (1) What percent of your sample are climate-believers? (2) How does this compare to the proportion of climate-believers in the population? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.
# Insert code for Exercise 3 here
set.seed(35797)
n <- 60
samp_1 <- us_adults %>%
sample_n(size = n)
samp_1 |> count(climate_believers) |> mutate(p = n / sum(n))
Answer: (1) In sample 1, about 61.7% of people believe that climate change affects their local community. (2) This sample proportion is close to the true population proportion of 62%.
Question: Create code to generate a second sample
(call it samp_2). Answer the same questions as before, but
this time with respect to samp_2. How do the two samples
compare? Explain any difference you found between the two samples.
# Insert code for the Exercise here
n <- 60
samp_2 <- us_adults %>%
sample_n(size = n)
samp_2 |> count(climate_believers) |> mutate(p = n / sum(n))
Answer: Unlike the first sample, the proportion of climate-believers in samp_2 changes every time we rerun the code, because we did not set a seed before drawing it. The two samples differ because each is a different random draw from the population (sampling variability).
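The role of the seed can be checked directly. This is a small base R sketch, assuming a population vector of 62,000 "Yes" and 38,000 "No" values; the helper `draw_prop()` is a hypothetical name introduced only for this illustration.

```r
# Assumed hypothetical population: 62% "Yes"
population <- c(rep("Yes", 62000), rep("No", 38000))

# Hypothetical helper: draw n people without replacement and
# return the sample proportion of "Yes"
draw_prop <- function(n = 60) {
  mean(sample(population, n) == "Yes")
}

set.seed(35797)
p_a <- draw_prop()   # first draw after setting the seed

p_b <- draw_prop()   # no new seed: the generator has advanced, so this is a different draw

set.seed(35797)
p_c <- draw_prop()   # same seed again: reproduces the first draw

print(c(first = p_a, second = p_b, reseeded = p_c))
```

Resetting the seed reproduces the first sample exactly, while a draw without a fresh seed is a new random sample whose proportion will usually differ somewhat.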
Question: Run the proportion test (see code below)
on the first sample samp_1, to estimate the proportion of
climate-believers in the population. Now answer the following questions:
(1) How does the estimation compare to the real proportion of
climate-believers in the population? (2) What is the confidence interval
associated with your estimation? (3) Is the proportion of
climate-believers in the population contained within your confidence
interval?
## No `p` argument was hypothesized, so the test will assume a null hypothesis `p
## = .5`.
Answer: (1) The estimated proportion of climate-believers from samp_1 is close to the true population proportion of 62%. (2) The test reports a confidence interval around this estimate, which gives a range of plausible values for the population proportion. (3) The true proportion of climate-believers in the population falls within this confidence interval.
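The code for the proportion test is not reproduced in this write-up. As a rough stand-in, the same kind of estimate and interval can be obtained with base R's `prop.test()`. The sketch below assumes that 37 of the 60 respondents in the sample answered "Yes", which matches the 61.7% sample proportion reported in this lab; the exact counts are an assumption, and the interval may differ slightly from the one the lab's own test produced.

```r
# Assumed sample counts: 37 "Yes" out of 60 (about 61.7%)
yes_count <- 37
n <- 60

# One-sample proportion test; with no p supplied, the null hypothesis is p = 0.5
res <- prop.test(x = yes_count, n = n, conf.level = 0.95)

estimate <- unname(res$estimate)   # sample proportion
ci <- res$conf.int                 # lower and upper bounds of the 95% CI

print(estimate)
print(ci)

# Does the interval contain the true population proportion of 0.62?
print(ci[1] <= 0.62 && 0.62 <= ci[2])
```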
Question: This code will create 1000 bootstrapping
samples from samp_1, and use those samples to find the 95
percent confidence interval for proportion of climate-believers. Run the
code and compare your results with the proportion test we’ve run in the
previous question.
# Insert code for the Exercise here
samp_1 %>%
specify(response = climate_believers, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
Answer: (1) We know that the population proportion is 62%, and the sample estimate is 61.7%, which is close to it. (2) The bootstrap 95% confidence interval for the proportion is [0.50, 0.73]. (3) This 95% CI contains the true population proportion.
Question: Does your confidence interval capture the
true population proportion of US adults who think climate change affects
their local community? Now run the bootstrapping method on
samp_2. How do your results compare?
Each time you draw a new sample, you would get a different interval. What proportion of those intervals would you expect to contain the true population proportion?
# Insert code for the Exercise here
samp_2 %>%
specify(response = climate_believers, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
Answer: (1) Most of the time, the 95% CI from samp_2 also includes the true population proportion. (2) Because the confidence level is 95%, I expect about 95% of such intervals to contain the true population proportion.
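The expectation that about 95% of intervals capture the true value can be explored with a small simulation. This is a base R sketch, not the app used in the lab: it assumes a 0/1 population with 62% ones and uses a percentile bootstrap interval, with `boot_ci()` as a hypothetical helper name.

```r
# Assumed hypothetical population coded as 1 = "Yes", 0 = "No"
population <- rep(c(1, 0), c(62000, 38000))
true_p <- mean(population)

# Hypothetical helper: percentile bootstrap CI for a proportion
boot_ci <- function(x, reps = 1000, level = 0.95) {
  stats <- replicate(reps, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(stats, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)
# Draw 50 samples of size 60, build one interval per sample,
# and record whether each interval contains the true proportion
covered <- replicate(50, {
  x <- sample(population, 60)
  ci <- boot_ci(x)
  ci[1] <= true_p && true_p <= ci[2]
})

# Share of the 50 intervals that contain the true proportion
print(mean(covered))
```

With only 50 intervals, the observed share moves in steps of 2%, so it will rarely match the confidence level exactly.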
Question: Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Include an image of your plot with your answer (to learn how to include an image in your RMarkdown, see this).
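Answer: 2 of the 50 confidence intervals do not include the true population proportion, so 96% (48 of 50) of my confidence intervals include it. This is approximately, but not exactly, equal to the 95% confidence level: with only 50 intervals the observed proportion can only change in steps of 2%, and it varies randomly from one set of samples to the next.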
Question: Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning and confirm your answer using the app. What is the proportion of intervals that include the true population proportion? How does this percentage compare to the confidence level selected for the intervals? Include an image of your plot with your answer.
Answer: (1) I chose a confidence level of 80%, so I expect the confidence interval to be narrower than at 95%. (2) At an 80% confidence level, we expect about 80% of the intervals to include the true population proportion. (3) In this case, 74% of the intervals include it (13 of the 50 intervals miss the true value). (4) This proportion is slightly lower than the confidence level, but not by much.
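The link between confidence level and interval width can be illustrated by cutting several percentile intervals out of one bootstrap distribution. This is a base R sketch under the assumption of a sample with 37 "Yes" responses out of 60; `ci_width()` is a hypothetical helper name.

```r
set.seed(2)

# Assumed sample: 37 "Yes" (1) and 23 "No" (0)
x <- rep(c(1, 0), c(37, 23))

# One bootstrap distribution of the sample proportion
stats <- replicate(1000, mean(sample(x, replace = TRUE)))

# Hypothetical helper: width of the percentile interval at a given level
ci_width <- function(level) {
  alpha <- 1 - level
  diff(quantile(stats, c(alpha / 2, 1 - alpha / 2), names = FALSE))
}

levels <- c(0.80, 0.90, 0.95, 0.99)
widths <- sapply(levels, ci_width)

print(data.frame(level = levels, width = widths))
```

A lower confidence level keeps a smaller central share of the bootstrap distribution, so its interval can never be wider than one at a higher level built from the same resamples.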
Question: Using the app, experiment with different
sample sizes and comment on how the widths of intervals change as sample
size changes (increases and decreases). Include an image of your plot
with your answer.
Image of 50 CIs, sample size = 5000
Answer: As the sample size increases, the intervals become narrower; as it decreases, they become wider.
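The size of this effect can be approximated with the normal-approximation formula for a proportion, where the 95% interval width is roughly 2 × 1.96 × sqrt(p(1 − p) / n). The base R sketch below uses that formula rather than the bootstrap, and the sample sizes other than 60 and 5000 are chosen only for illustration.

```r
p <- 0.62                      # true population proportion from the lab
n <- c(60, 240, 960, 5000)     # illustrative sample sizes

# Approximate 95% CI width under the normal approximation
width <- 2 * qnorm(0.975) * sqrt(p * (1 - p) / n)

print(data.frame(n = n, width = round(width, 3)))
```

Because the width scales with 1 / sqrt(n), quadrupling the sample size roughly halves the interval width.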
Question: Finally, given a sample size (say, 60),
how does the width of the interval change as you increase the number of
bootstrap samples? Include an image of your plot with your answer.
Image of 50 CIs, sample size = 60, number of
bootstraps = 10,000
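Answer: The width of a bootstrap confidence interval does not change systematically as the number of bootstrap samples increases, because the width is driven mainly by the sample size. Using more bootstrap samples makes the interval endpoints more stable from one run to the next.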
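This can be checked by holding one sample fixed and varying only the number of bootstrap resamples. The base R sketch below assumes a sample with 37 "Yes" responses out of 60; `boot_width()` is a hypothetical helper name, and the resample counts are illustrative.

```r
set.seed(3)

# Assumed sample: 37 "Yes" (1) and 23 "No" (0)
x <- rep(c(1, 0), c(37, 23))

# Hypothetical helper: width of the 95% percentile bootstrap interval
boot_width <- function(reps) {
  stats <- replicate(reps, mean(sample(x, replace = TRUE)))
  diff(quantile(stats, c(0.025, 0.975), names = FALSE))
}

reps <- c(100, 1000, 10000)
widths <- sapply(reps, boot_width)

print(data.frame(reps = reps, width = round(widths, 3)))
```

The widths stay in the same neighborhood for every resample count; what improves with more resamples is how little the endpoints move when the code is rerun.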