library(tidyverse)
library(openintro)
library(infer)
library(ggplot2)
A 2019 Gallup report states the following:
The Wellcome Global Monitor finds that 20% of people globally do not believe that the work scientists do benefits people like them. In this lab, you will assume this 20% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population. We will first create our population assuming a population size of 100,000. This means 20,000 (20%) of the population think the work scientists do does not benefit them personally and the remaining 80,000 think it does.
global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
ggplot(global_monitor, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "Do you believe that the work scientists do benefit people like you?"
) +
coord_flip()
Summary statistics
global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
samp1 <- global_monitor %>%
sample_n(50)
sample_n
function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. Code you presented earlier for visualizing and summarizing the population data will still be useful for the sample, however be careful to not label your proportion p
since you’re now calculating a sample statistic, not a population parameters. You can customize the label of the statistics to indicate that it comes from the sample.samp1 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 37 0.74
## 2 Doesn't benefit 13 0.26
It will be a coincidence if the sample proportion match.They ought not to match because the values are randomly selected
samp2
. How does the sample proportion of samp2
compare with that of samp1
? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?A sample size of 1000 is expected to provide a more accurate estimate of the population proportion according to central limit theorem
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
library(pastecs)
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
stat.desc(sample_props50)
## replicate scientist_work n p_hat
## nbr.val 1.500000e+04 NA 1.500000e+04 1.500000e+04
## nbr.null 0.000000e+00 NA 0.000000e+00 0.000000e+00
## nbr.na 0.000000e+00 NA 0.000000e+00 0.000000e+00
## min 1.000000e+00 NA 1.000000e+00 2.000000e-02
## max 1.500000e+04 NA 2.100000e+01 4.200000e-01
## range 1.499900e+04 NA 2.000000e+01 4.000000e-01
## sum 1.125075e+08 NA 1.493910e+05 2.987820e+03
## median 7.500500e+03 NA 1.000000e+01 2.000000e-01
## mean 7.500500e+03 NA 9.959400e+00 1.991880e-01
## SE.mean 3.535652e+01 NA 2.323947e-02 4.647893e-04
## CI.mean.0.95 6.930309e+01 NA 4.555219e-02 9.110438e-04
## var 1.875125e+07 NA 8.101092e+00 3.240437e-03
## std.dev 4.330271e+03 NA 2.846242e+00 5.692483e-02
## coef.var 5.773310e-01 NA 2.857845e-01 2.857845e-01
To visualize the distribution of these proportions with a histogram.
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)
sample_props50
? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.There are 15000 elements in sample_props50
N(0.199,0.000465)
#Plot of the distribution
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)
global_monitor %>%
sample_n(size = 50, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
## # A tibble: 1 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Doesn't benefit 6 0.12
rep_sample_n
function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small
. Print the output. How many observations are there in this object called sample_props_small
? What does each observation represent?library(pastecs)
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small
## # A tibble: 23 x 4
## # Groups: replicate [23]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 2 0.2
## 2 2 Doesn't benefit 2 0.2
## 3 3 Doesn't benefit 1 0.1
## 4 4 Doesn't benefit 3 0.3
## 5 5 Doesn't benefit 1 0.1
## 6 6 Doesn't benefit 1 0.1
## 7 8 Doesn't benefit 2 0.2
## 8 9 Doesn't benefit 2 0.2
## 9 10 Doesn't benefit 1 0.1
## 10 11 Doesn't benefit 1 0.1
## # ... with 13 more rows
dim(sample_props_small)
## [1] 23 4
round(stat.desc(sample_props_small),4)
## replicate scientist_work n p_hat
## nbr.val 23.0000 NA 23.0000 23.0000
## nbr.null 0.0000 NA 0.0000 0.0000
## nbr.na 0.0000 NA 0.0000 0.0000
## min 1.0000 NA 1.0000 0.1000
## max 24.0000 NA 5.0000 0.5000
## range 23.0000 NA 4.0000 0.4000
## sum 293.0000 NA 47.0000 4.7000
## median 13.0000 NA 2.0000 0.2000
## mean 12.7391 NA 2.0435 0.2043
## SE.mean 1.4867 NA 0.2221 0.0222
## CI.mean.0.95 3.0833 NA 0.4606 0.0461
## var 50.8379 NA 1.1344 0.0113
## std.dev 7.1301 NA 1.0651 0.1065
## coef.var 0.5597 NA 0.5212 0.5212
Mechanics aside, let’s return to the reason we used the rep_sample_n
function: to compute a sampling distribution, specifically, the sampling distribution of the proportions from samples of 50 people.
#fig.show='hide'
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02)
samp2 <- global_monitor %>%
sample_n(15)
samp2
## # A tibble: 15 x 1
## scientist_work
## <chr>
## 1 Benefits
## 2 Benefits
## 3 Benefits
## 4 Doesn't benefit
## 5 Doesn't benefit
## 6 Benefits
## 7 Benefits
## 8 Benefits
## 9 Benefits
## 10 Doesn't benefit
## 11 Benefits
## 12 Doesn't benefit
## 13 Benefits
## 14 Benefits
## 15 Doesn't benefit
samp2 %>%
rep_sample_n(15) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
## # A tibble: 1 x 4
## # Groups: replicate [1]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Benefits 10 0.667
dim(samp2)
## [1] 15 1
round(stat.desc(samp2),4)
## scientist_work
## nbr.val NA
## nbr.null NA
## nbr.na NA
## min NA
## max NA
## range NA
## sum NA
## median NA
## mean NA
## SE.mean NA
## CI.mean NA
## var NA
## std.dev NA
## coef.var NA
sample_props15
. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enchances their lives to be? Finally, calculate and report the population proportion.sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
dim(sample_props15)
## [1] 2000 4
round(stat.desc(sample_props15),4)
## replicate scientist_work n p_hat
## nbr.val 2000.0000 NA 2000.0000 2000.0000
## nbr.null 0.0000 NA 0.0000 0.0000
## nbr.na 0.0000 NA 0.0000 0.0000
## min 1.0000 NA 6.0000 0.4000
## max 2000.0000 NA 15.0000 1.0000
## range 1999.0000 NA 9.0000 0.6000
## sum 2001000.0000 NA 24123.0000 1608.2000
## median 1000.5000 NA 12.0000 0.8000
## mean 1000.5000 NA 12.0615 0.8041
## SE.mean 12.9132 NA 0.0341 0.0023
## CI.mean.0.95 25.3247 NA 0.0669 0.0045
## var 333500.0000 NA 2.3249 0.0103
## std.dev 577.4946 NA 1.5248 0.1017
## coef.var 0.5772 NA 0.1264 0.1264
sample_props15 %>% ggplot(aes(x=p_hat)) +
geom_histogram(binwidth = 0.02)
sample_props150
. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work of scientists do enchances their lives?sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
dim(sample_props150)
## [1] 2000 4
round(stat.desc(sample_props150),4)
## replicate scientist_work n p_hat
## nbr.val 2000.0000 NA 2000.0000 2000.0000
## nbr.null 0.0000 NA 0.0000 0.0000
## nbr.na 0.0000 NA 0.0000 0.0000
## min 1.0000 NA 101.0000 0.6733
## max 2000.0000 NA 135.0000 0.9000
## range 1999.0000 NA 34.0000 0.2267
## sum 2001000.0000 NA 240123.0000 1600.8200
## median 1000.5000 NA 120.0000 0.8000
## mean 1000.5000 NA 120.0615 0.8004
## SE.mean 12.9132 NA 0.1081 0.0007
## CI.mean.0.95 25.3247 NA 0.2119 0.0014
## var 333500.0000 NA 23.3534 0.0010
## std.dev 577.4946 NA 4.8325 0.0322
## coef.var 0.5772 NA 0.0403 0.0403
sample_props150 %>% ggplot(aes(x=p_hat)) +
geom_histogram(binwidth = 0.02)