library(tidyverse) #loading all library needed for this assignment
library(openintro)
library(infer)
library(gplots)
#head(fastfood)
#library(readxl)
#library(data.table)
#library(readr)
#library(plyr)
#library(dplyr)
#library(dice)
# #library(VennDiagram)
# #library(help = "dice")
#library(DBI)
#library(dbplyr)
#library(rstudioapi)
#library(RJDBC)
#library(odbc)
#library(RSQLite)
#library(rvest)
#library(stringr)
#library(readtext)
#library(ggpubr)
#library(fitdistrplus)
#library(ggplot2)
#library(moments)
#library(qualityTools)
library(normalp)
#library(utils)
#library(MASS)
#library(qqplotr)
library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
getLabs()
## [1] "Lab1" "Lab2" "Lab3" "Lab4" "Lab5a" "Lab5b" "Lab6" "Lab7" "Lab8"
## [10] "Lab9"
#library(StMoSim)
The Wellcome Global Monitor finds that 20% of people globally do not believe that the work scientists do benefits people like them. In this lab, you will assume this 20% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population. We will first create our population assuming a population size of 100,000. This means 20,000 (20%) of the population think the work scientists do does not benefit them personally and the remaining 80,000 think it does.
#scientist_work = c(rep("Benefits", 80000), rep("Does not benefit", 20000))
#
# We will first create our population assuming a population size of 100,000. This means 20,000 (20%) of the population think the work scientists do does not benefit them personally and the remaining 80,000 think it does. The name of the data frame is global_monitor and the name of the variable that contains responses to the question “Do you believe that the work scientists do benefit people like you?” is scientist_work.
global_monitor <- tibble(scientist_work = factor(c(rep("Benefits", 80000), rep("Does not benefit", 20000))))
# I kept getting the error with this code , I have to trick it with factor...realy don't know why
# global_monitor <- tibble(
# scientist_work = c(rep('Benefits', 80000), rep('Does not benefit', 20000))
# )
#scientist_work = c(rep(c('Benefits','Does_not_benefit'), time = c(80, 20)))
#global_monitor <- as_tibble( scientist_work, rownames = NA )
view(global_monitor)
# count(global_monitor) this will generate frequency
global_monitor %>% count(scientist_work)
## # A tibble: 2 x 2
## scientist_work n
## <fct> <int>
## 1 Benefits 80000
## 2 Does not benefit 20000
# We can quickly visualize the distribution of these responses using a bar plot.
ggplot(global_monitor, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "Do you believe that the work scientists do benefit people like you?"
) +
coord_flip()
# We can also obtain summary statistics to confirm we constructed the data frame correctly.
# keep getting this error : Error in parse(text = x) : <text>:1:6: unexpected symbol1: Does not
# Was looking to use group by
# df %>% group_by(a, b) %>% summarise(n = n())
global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p
## <fct> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Does not benefit 20000 0.2
samp1 <- global_monitor %>%
sample_n(50)
Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population. Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. Code you presented earlier for visualizing and summarising the population data will still be useful for the sample, however be careful to not label your proportion p since you’re now calculating a sample statistic, not a population parameters. You can customize the label of the statistics to indicate that it comes from the sample
set.seed(300007)
samp1 <- global_monitor %>%
sample_n(50)
samp1 %>%
count(scientist_work) %>%
mutate(p1 = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p1
## <fct> <int> <dbl>
## 1 Benefits 41 0.82
## 2 Does not benefit 9 0.18
ggplot(samp1, aes(x = scientist_work)) +
geom_bar(fill = "blue") +
labs(
x = "", y = "",
title = "Do you believe that the work scientists do benefit people like you?"
) +
coord_flip()
ggplot(samp1, aes(x = scientist_work)) +
geom_bar(position = position_dodge(), width = 0.5, fill = "blue" ) +
coord_flip() +
ggtitle("Do you believe that the work scientists do benefit people like you?") +
xlab("Element of study") + ylab("sample1")
Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.
Answer: No, in theory I Would you expect the sample proportion to match the sample proportion of another student’s sample. The sample 50 from the 100000 population is already outputing different result and if I remove the seed, just like I would ask student team I am likely to have a different distribution response. This is due to the fact that it is random sample from a large population. However, I there may be a chance that a student get a same result with me, but if this match repeats across the team, then it will be strange.
# Insert code for Exercise 4 here
Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?
set.seed(3000507)
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(p2 = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p2
## <fct> <int> <dbl>
## 1 Benefits 39 0.78
## 2 Does not benefit 11 0.22
ggplot(samp2, aes(x = scientist_work)) +
geom_bar(position = position_dodge(), width = 0.5, fill = "blue" ) +
coord_flip() +
ggtitle("Do you believe that the work scientists do benefit people like you from sample 0f 50?") +
xlab("Element of study") + ylab("sample2")
# sample 100
set.seed(3005507)
samp3 <- global_monitor %>%
sample_n(100)
samp3 %>%
count(scientist_work) %>%
mutate(p3 = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p3
## <fct> <int> <dbl>
## 1 Benefits 78 0.78
## 2 Does not benefit 22 0.22
ggplot(samp3, aes(x = scientist_work)) +
geom_bar(position = position_dodge(), width = 0.5, fill = "blue" ) +
coord_flip() +
ggtitle("Do you believe that the work scientists do benefit people like you from sample 0f 100?") +
xlab("Element of study") + ylab("sample3")
# sample 1000
set.seed(3055507)
samp4 <- global_monitor %>%
sample_n(1000)
samp4 %>%
count(scientist_work) %>%
mutate(p4 = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p4
## <fct> <int> <dbl>
## 1 Benefits 816 0.816
## 2 Does not benefit 184 0.184
# sample 10000
set.seed(3055507)
samp5 <- global_monitor %>%
sample_n(10000)
samp5 %>%
count(scientist_work) %>%
mutate(p5 = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p5
## <fct> <int> <dbl>
## 1 Benefits 8064 0.806
## 2 Does not benefit 1936 0.194
ggplot(samp5, aes(x = scientist_work)) +
geom_bar(position = position_dodge(), width = 0.5, fill = "red" ) +
coord_flip() +
ggtitle("Do you believe that the work scientists do benefit people like you from sample 0f 10000?") +
xlab("Element of study") + ylab("sample5")
How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer. Answer: There are 04 elements, and the sampling distribution shows a distribution centered at 20% which is the true estimate of the oginal population. This is a normal , unimodal distribution with mean at 0.2
set.seed(5055507)
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Does not benefit")
# only one variance
# sample_props50 %>%
# count(scientist_work) %>%
# mutate(p50 = n /sum(n))
# ggplot(data = sample_props50, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02, fill = rainbow(21)) +
# labs(
# x = "p_hat (Does not benefit)",
# title = "Sampling distribution of Sampling distribution of people who believe scientist work does not Benefit their lives",
# subtitle = "Sample size = 50, Number of samples = 15000"
# )
To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?
There are 05 elements (0.1, 0.2, 0.3, 0.4, 0.5). each of the element is singled out. This is not aggregate because the sample size is small. the sampling distribution shows what could be a right skewed distribution. I observed that as the sample size gets smalled , the spread get wider and we are likely to be off the true estimate of the population. each of this element represent the percentage of population who are saying …“does not benefit”
# Interlude: Sampling distributions
# The idea behind the rep_sample_n function is repetition. Earlier, you took a single sample of size n (50) from the population of all people in the population. With this new function, you can repeat this sampling procedure rep times in order to build a distribution of a series of sample statistics, which is called the sampling distribution.
#
# Note that in practice one rarely gets to build true sampling distributions, because one rarely has access to data from the entire population.
#
# Without the rep_sample_n function, this would be painful. We would have to manually run the following code 15,000 times
set.seed(6055507)
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Does not benefit")
# ggplot(data = sample_props_small, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
ggplot(data = sample_props_small, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, fill = rainbow(21)) +
labs(
x = "p_hat (Does not benefit)",
title = "Sampling distribution of Sampling distribution of people who believe scientist work does not Benefit their lives",
subtitle = "Sample size = 10, Number of samples = 25"
)
summary(sample_props_small)
## replicate scientist_work n p_hat
## Min. : 1.00 Benefits : 0 Min. :1.000 Min. :0.1000
## 1st Qu.: 6.25 Does not benefit:22 1st Qu.:1.250 1st Qu.:0.1250
## Median :13.00 Median :2.000 Median :0.2000
## Mean :13.05 Mean :2.091 Mean :0.2091
## 3rd Qu.:19.75 3rd Qu.:2.000 3rd Qu.:0.2000
## Max. :25.00 Max. :5.000 Max. :0.5000
Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How does the mean, standar error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)
I predicted this observation from previous questions: small sample size , the spread in the distribution response will get wider. the sampling distribution shows what could be a right skewed distribution. As sample size increase, the spread in the distribution response will get narrowed , and the sampling distribution shows a distribution centered at 20% which is the true estimate of the oginal population
set.seed(7155507)
sample_app10 <- global_monitor %>%
rep_sample_n(size = 10, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Does not benefit")
# ggplot(data = sample_props_small, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
ggplot(data = sample_app10 , aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, fill = rainbow(36)) +
labs(
x = "p_hat (Does not benefit)",
title = "Sampling distribution of Sampling distribution of people who believe scientist work does not Benefit their lives",
subtitle = "Sample size = 10, Number of samples = 5000"
)
summary(sample_app10) # mean 22.35%
## replicate scientist_work n p_hat
## Min. : 2 Benefits : 0 Min. :1.000 Min. :0.1000
## 1st Qu.:1270 Does not benefit:4465 1st Qu.:1.000 1st Qu.:0.1000
## Median :2517 Median :2.000 Median :0.2000
## Mean :2510 Mean :2.235 Mean :0.2235
## 3rd Qu.:3747 3rd Qu.:3.000 3rd Qu.:0.3000
## Max. :5000 Max. :8.000 Max. :0.8000
set.seed(77055507)
sample_app50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Does not benefit")
# ggplot(data = sample_props_small, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
ggplot(data = sample_app50 , aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, fill = rainbow(21)) +
labs(
x = "p_hat (Does not benefit)",
title = "Sampling distribution of Sampling distribution of people who believe scientist work does not Benefit their lives",
subtitle = "Sample size = 50, Number of samples = 5000"
)
summary(sample_app50) # mean = 19.96%
## replicate scientist_work n p_hat
## Min. : 1 Benefits : 0 Min. : 1.00 Min. :0.0200
## 1st Qu.:1251 Does not benefit:5000 1st Qu.: 8.00 1st Qu.:0.1600
## Median :2500 Median :10.00 Median :0.2000
## Mean :2500 Mean : 9.98 Mean :0.1996
## 3rd Qu.:3750 3rd Qu.:12.00 3rd Qu.:0.2400
## Max. :5000 Max. :21.00 Max. :0.4200
set.seed(8055507)
sample_app100 <- global_monitor %>%
rep_sample_n(size = 100, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Does not benefit")
# ggplot(data = sample_props_small, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
ggplot(data = sample_app100 , aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, fill = rainbow(14)) +
labs(
x = "p_hat (Does not benefit)",
title = "Sampling distribution of Sampling distribution of people who believe scientist work does not Benefit their lives",
subtitle = "Sample size = 100, Number of samples = 5000"
)
summary(sample_app100) # mean = 19.85%
## replicate scientist_work n p_hat
## Min. : 1 Benefits : 0 Min. : 7.00 Min. :0.0700
## 1st Qu.:1251 Does not benefit:5000 1st Qu.:17.00 1st Qu.:0.1700
## Median :2500 Median :20.00 Median :0.2000
## Mean :2500 Mean :19.85 Mean :0.1985
## 3rd Qu.:3750 3rd Qu.:22.00 3rd Qu.:0.2200
## Max. :5000 Max. :34.00 Max. :0.3400
set.seed(79955507)
sample_app1000 <- global_monitor %>%
rep_sample_n(size = 1000, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Does not benefit")
# ggplot(data = sample_props_small, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
ggplot(data = sample_app1000 , aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, fill = rainbow(6)) +
labs(
x = "p_hat (Does not benefit)",
title = "Sampling distribution of Sampling distribution of people who believe scientist work does not Benefit their lives",
subtitle = "Sample size = 1000, Number of samples = 5000"
)
summary(sample_app1000) # mean = 19.98%
## replicate scientist_work n p_hat
## Min. : 1 Benefits : 0 Min. :161.0 Min. :0.1610
## 1st Qu.:1251 Does not benefit:5000 1st Qu.:191.0 1st Qu.:0.1910
## Median :2500 Median :200.0 Median :0.2000
## Mean :2500 Mean :199.8 Mean :0.1998
## 3rd Qu.:3750 3rd Qu.:208.0 3rd Qu.:0.2080
## Max. :5000 Max. :257.0 Max. :0.2570
Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enchances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enchances their lives? Answer: the best point estimate of the population proportion of people who think the work scientists do enchances their lives is about 80%. Looks like winning a lottery
set.seed(158460507)
samp15 <- global_monitor %>%
sample_n(15)
samp15 %>%
count(scientist_work) %>%
mutate(p15 = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p15
## <fct> <int> <dbl>
## 1 Benefits 12 0.8
## 2 Does not benefit 3 0.2
set.seed(15360507)
sample_app15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 1, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
summary(sample_app15)
## replicate scientist_work n p_hat
## Min. :1 Benefits :1 Min. : 4.00 Min. :0.2667
## 1st Qu.:1 Does not benefit:1 1st Qu.: 5.75 1st Qu.:0.3833
## Median :1 Median : 7.50 Median :0.5000
## Mean :1 Mean : 7.50 Mean :0.5000
## 3rd Qu.:1 3rd Qu.: 9.25 3rd Qu.:0.6167
## Max. :1 Max. :11.00 Max. :0.7333
ggplot(samp15, aes(x = scientist_work)) +
geom_bar(position = position_dodge(), width = 0.5, fill = "blue" ) +
coord_flip() +
ggtitle("Do you believe that the work scientists do benefit people like you from sample 0f 15?") +
xlab("Element of study") + ylab("sample15")
Since you have access to the population, simulate the sampling distribution of proportion of those who think the work scientists do enchances their lives for samples of size 15 by taking 2000 samples from the population of size 15 and computing 2000 sample proportions. Store these proportions in as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enchances their lives to be? Finally, calculate and report the population proportion.
set.seed(147955507)
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
# ggplot(data = sample_props_small, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
ggplot(data = sample_props15 , aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, fill = rainbow(31)) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of people who believe scientist work Benefits their lives",
subtitle = "Sample size = 15, Number of samples = 2000"
)
summary(sample_props15) # mean = 80.13%
## replicate scientist_work n p_hat
## Min. : 1.0 Benefits :2000 Min. : 6.00 Min. :0.4000
## 1st Qu.: 500.8 Does not benefit: 0 1st Qu.:11.00 1st Qu.:0.7333
## Median :1000.5 Median :12.00 Median :0.8000
## Mean :1000.5 Mean :12.04 Mean :0.8028
## 3rd Qu.:1500.2 3rd Qu.:13.00 3rd Qu.:0.8667
## Max. :2000.0 Max. :15.00 Max. :1.0000
Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work scientists do enchances their lives?
set.seed(1470955507)
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
# ggplot(data = sample_props_small, aes(x = p_hat)) +
# geom_histogram(binwidth = 0.02)
ggplot(data = sample_props150 , aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, fill = rainbow(12)) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of people who believe scientist work Benefits their lives",
subtitle = "Sample size = 150, Number of samples = 2000"
)
summary(sample_props150) # mean = 80.15%
## replicate scientist_work n p_hat
## Min. : 1.0 Benefits :2000 Min. :103.0 Min. :0.6867
## 1st Qu.: 500.8 Does not benefit: 0 1st Qu.:117.0 1st Qu.:0.7800
## Median :1000.5 Median :120.0 Median :0.8000
## Mean :1000.5 Mean :120.2 Mean :0.8015
## 3rd Qu.:1500.2 3rd Qu.:123.0 3rd Qu.:0.8200
## Max. :2000.0 Max. :135.0 Max. :0.9000
Of the sampling distributions from 2 and 3, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?
A 2019 Pew Research report states the following:
To keep our computation simple, we will assume a total population size of 100,000 (even though that’s smaller than the population size of all US adults).
Roughly six-in-ten U.S. adults (62%) say climate change is currently affecting their local community either a great deal or some, according to a new Pew Research Center survey.
Source: Most Americans say climate change impacts their community, but effects vary by region
In this lab, you will assume this 62% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population. We will first create our population assuming a population size of 100,000. This means 62,000 (62%) of the adult population think climate change impacts their community, and the remaining 38,000 does not think so.
What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.
aNSWER: 61.67%
# The name of the data frame is us_adults and the name of the variable that contains responses to the question “Do you think climate change is affecting your local community?” is climate_change_affect
us_adults <- tibble(
climate_change_affects = factor(c(rep("Yes", 62000), rep("No", 38000)))
)
# We can quickly visualize the distribution of these responses using a bar plot.
ggplot(us_adults, aes(x = climate_change_affects)) +
geom_bar(fill = "blue") +
labs(
x = "", y = "",
title = "What percent of the adults in your sample think climate change affects their local community?"
) +
coord_flip()
#We can also obtain summary statistics to confirm we constructed the data frame correctly.
us_adults %>%
count(climate_change_affects) %>%
mutate(p = n /sum(n))
## # A tibble: 2 x 3
## climate_change_affects n p
## <fct> <int> <dbl>
## 1 No 38000 0.38
## 2 Yes 62000 0.62
# In this lab, you’ll start with a simple random sample of size 60 from the population.
set.seed(346499007)
n <- 60
samp60 <- us_adults %>%
sample_n(size = n)
# samp60 <- global_monitor %>%
# sample_n(60)
samp60 %>%
count(climate_change_affects) %>%
mutate(p60 = n /sum(n))
## # A tibble: 2 x 3
## climate_change_affects n p60
## <fct> <int> <dbl>
## 1 No 23 0.383
## 2 Yes 37 0.617
ggplot(samp60, aes(x = climate_change_affects)) +
geom_bar(fill = "blue") +
labs(
x = "", y = "",
title = "What percent of the adults in your sample think climate change affects their local community?"
) +
coord_flip()
Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?
Answer: No, random sample with large size is likely to output differently. However, there is a chance one student might get the same result, but the same across the class will be strange.
Return for a moment to the question that first motivated this lab: based on this sample, what can you infer about the population? With just one sample, the best estimate of the proportion of US adults who think climate change affects their local community would be the sample proportion, usually denoted as p^
(here we are calling it p_hat). That serves as a good point estimate, but it would be useful to also communicate how uncertain you are of that estimate. This uncertainty can be quantified using a confidence interval.
One way of calculating a confidence interval for a population proportion is based on the Central Limit Theorem, as p±z⋆SEp is, or more precisely, as p±z⋆p(1−p^)n−−−−−−−−√
Another way is using simulation, or to be more specific, using bootstrapping. The term bootstrapping comes from the phrase “pulling oneself up by one’s bootstraps”, which is a metaphor for accomplishing an impossible task without any outside help. In this case the impossible task is estimating a population parameter (the unknown population proportion), and we’ll accomplish it using data from only the given sample. Note that this notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference, it is not limited to bootstrapping.
In essence, bootstrapping assumes that there are more of observations in the populations like the ones in the observed sample. So we “reconstruct” the population by resampling from our sample, with replacement. The bootstrapping scheme is as follows:
Step 1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.
Step 2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples.
Step 3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.
Step 4. Calculate the bounds of the XX% confidence interval as the middle XX% j knof the bootstrap distribution.
Instead of coding up each of these steps, we will construct confidence intervals using the infer package.
Below is an overview of the functions we will use to construct this confidence interval: Function Purpose specify Identify your variable of interest generate The number of samples you want to generate calculate The sample statistic you want to do inference with, or you can also think of this as the population parameter you want to do inference for get_ci Find the confidence interval
In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean? Answer: The 95% confidence interval means that there is 95% certainty our calculated range of values contains the population mean.
# This code will find the 95 percent confidence interval for proportion of US adults who think climate change affects their local community.
# samp60 %>%
# specify(response = climate_change_affects, success = "Yes") %>%
# generate(reps = 1000, type = "bootstrap") %>%
# calculate(stat = "prop") %>%
# get_ci(level = 0.95) %>%
Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value? Answer: Yes, it does.
Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? Based on the Figure 5.6 on OpenIntro Statistics, 4th Edition (page 182). I would expect about 95% proportion of those intervals to capture the true population mean. Why? Because there is only one mean.
Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer..