library(tidyverse)
library(openintro)
library(infer)global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
ggplot(global_monitor, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "Do you believe that the work scientists do benefit people like you?"
) +
coord_flip() In our first sample of size 50 the distribution is different compared to the total population. The proportion in the sample is 86% “Benefits” and 14% “Doesn’t Benefit”, compared to 80/20 in the population. We expect the sample to not line up directly with the population, however it’s pretty close.
set.seed(100000)
samp1 <- global_monitor %>%
sample_n(50)
ggplot(samp1, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "Do you believe that the work scientists do benefit people like you?"
) +
coord_flip() samp1 %>%
count(scientist_work) %>%
mutate(sample_p = n /sum(n))## # A tibble: 2 x 3
## scientist_work n sample_p
## * <chr> <int> <dbl>
## 1 Benefits 43 0.86
## 2 Doesn't benefit 7 0.14
I would suspect another students sample would not match up with mine. Since I do not have another student to ask, I run my sample multiple times and never got the same result. So the proportion would be different, but only somewhat different. Since the proportions I was generating we all fairly close to the 80/20 split.
Sample 2 has different proportions compared to my first sample.
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(sample_p = n /sum(n))## # A tibble: 2 x 3
## scientist_work n sample_p
## * <chr> <int> <dbl>
## 1 Benefits 34 0.68
## 2 Doesn't benefit 16 0.32
The more I increase the number of samples the closer the sample proportion gets the the actual population. Sample 4, with a size of 1,000, is almost identical to the population.
samp3 <- global_monitor %>%
sample_n(100)
samp3 %>%
count(scientist_work) %>%
mutate(sample_p = n /sum(n))## # A tibble: 2 x 3
## scientist_work n sample_p
## * <chr> <int> <dbl>
## 1 Benefits 84 0.84
## 2 Doesn't benefit 16 0.16
samp4 <- global_monitor %>%
sample_n(1000)
samp4 %>%
count(scientist_work) %>%
mutate(sample_p = n /sum(n))## # A tibble: 2 x 3
## scientist_work n sample_p
## * <chr> <int> <dbl>
## 1 Benefits 799 0.799
## 2 Doesn't benefit 201 0.201
The code is taking 15,000 samples of size 50. Then calculating the proportions and filtering to the “Doesn’t benefit” results. Plotting the proportions in a histogram shows a normal distribution around 0.2, which we know is the proportion of the population.
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)In this case there are 22 observations when I filter the dataframe to “Doesn’t Benefit” answers only. Since the sample size is only 10 there are a few samples that have 0 “Doesn’t Benefit” answers. Each observation in the dataframe represents the proportion of the sample that answered “Doesn’t Benefit”.
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
print(sample_props_small)## # A tibble: 20 x 4
## # Groups: replicate [20]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 1 0.1
## 2 2 Doesn't benefit 4 0.4
## 3 4 Doesn't benefit 2 0.2
## 4 6 Doesn't benefit 2 0.2
## 5 7 Doesn't benefit 2 0.2
## 6 9 Doesn't benefit 3 0.3
## 7 10 Doesn't benefit 3 0.3
## 8 11 Doesn't benefit 2 0.2
## 9 12 Doesn't benefit 3 0.3
## 10 13 Doesn't benefit 3 0.3
## 11 14 Doesn't benefit 1 0.1
## 12 16 Doesn't benefit 1 0.1
## 13 17 Doesn't benefit 1 0.1
## 14 18 Doesn't benefit 4 0.4
## 15 20 Doesn't benefit 1 0.1
## 16 21 Doesn't benefit 2 0.2
## 17 22 Doesn't benefit 3 0.3
## 18 23 Doesn't benefit 2 0.2
## 19 24 Doesn't benefit 4 0.4
## 20 25 Doesn't benefit 3 0.3
The distribution of n=10 is pretty sparce. The peak of the distribution is at 0.2, but I don’t think the sample size is large enough to produce a normal distribution. Comparing the distributions between n=50 and n=100 are actually extremely similar. The mean, spread, and shape are close.
sample_reps10 <- global_monitor %>%
rep_sample_n(size = 10, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_reps50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_reps100 <- global_monitor %>%
rep_sample_n(size = 100, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")ggplot(data = sample_reps10, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02)ggplot(data = sample_reps50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02)ggplot(data = sample_reps100, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) ### Exercise 7
Using this sample my best estimate of the proportion would be 0.8. Which lines up with the actual population, but I just got lucky :)
samp_15 <- global_monitor %>%
sample_n(15)
samp_15 %>%
count(scientist_work) %>%
mutate(sample_p = n /sum(n))## # A tibble: 2 x 3
## scientist_work n sample_p
## * <chr> <int> <dbl>
## 1 Benefits 12 0.8
## 2 Doesn't benefit 3 0.2
The distribution looks kind of normal, but is pretty sparse due to the sample size being pretty small. Based on the distribution of “sample_props15” I would say the sample proportion’s mean is around 0.8.
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
ggplot(data = sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.05)Changing the sample size to 150 makes the distribution look much better, and much closer to normal. The peak is around 0.8 which matches with the population proportion.
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
ggplot(data = sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02)The collections of samples with n=150 has the smaller spread. I would choose the sample with a large size and smaller spread to represent the true population.