library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
set.seed(1231232)
global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
samp1 <- global_monitor %>%
sample_n(50)
samp1 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 38 0.76
## 2 Doesn't benefit 12 0.24
The distribution in the sample correlates to the population distribution in that I was expecting close to 80% and 20%.
I would assume it would be very similar with a slight deviation. It might be even considered strange if we got the same exact distribution. Comparing to another students, they had 80% benefits and 20% does not.
set.seed(32322)
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
The sample proportion of samp2 is similar but deviated slightly. Again, I was expecting an 80/20 split but got close. If we took samples of size 100 or 1000, I would expect the larger number to describe the population better.
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)
There are 15000 elements in sample_prop50. The distribution looks normal and it’s center is near or exactly .20.
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small1 <- data.frame(sample_props_small)
sample_props_small1
## replicate scientist_work n p_hat
## 1 1 Doesn't benefit 2 0.2
## 2 2 Doesn't benefit 2 0.2
## 3 3 Doesn't benefit 3 0.3
## 4 5 Doesn't benefit 2 0.2
## 5 7 Doesn't benefit 1 0.1
## 6 8 Doesn't benefit 3 0.3
## 7 9 Doesn't benefit 2 0.2
## 8 10 Doesn't benefit 2 0.2
## 9 11 Doesn't benefit 4 0.4
## 10 12 Doesn't benefit 6 0.6
## 11 13 Doesn't benefit 1 0.1
## 12 14 Doesn't benefit 3 0.3
## 13 15 Doesn't benefit 2 0.2
## 14 16 Doesn't benefit 1 0.1
## 15 17 Doesn't benefit 3 0.3
## 16 18 Doesn't benefit 3 0.3
## 17 19 Doesn't benefit 4 0.4
## 18 20 Doesn't benefit 1 0.1
## 19 21 Doesn't benefit 3 0.3
## 20 22 Doesn't benefit 3 0.3
## 21 23 Doesn't benefit 2 0.2
## 22 24 Doesn't benefit 2 0.2
## 23 25 Doesn't benefit 1 0.1
There are 25 observations in this object. Each observation represents that percent of that sample that said they don’t benefit.
Each observation represents taking sample sizes of 10, 50 etc many times. In this case the number of samples I used was 1000. As the sample size increases, the mean gets closer to .2, the standard error gets smaller, and the shape of the tends to get narrower.
samp15 <- global_monitor %>%
sample_n(15)
samp15 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 11 0.733
## 2 Doesn't benefit 4 0.267
Using this sample, I would expect 75% of the population thinks the work scientists do enhances their lives.
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
ggplot(data = sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.05) +
labs(
x = "p_hat (benefits)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 15, Number of samples = 2000"
)
mean(sample_props15$p_hat)
## [1] 0.8007667
Based on the sample, I would expect around 80% as the proportion of people who think that scientists benefit. The proportion is at 79%
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
ggplot(data = sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (benefit)",
title = "Sampling distribution of population proportion")
mean(sample_props150$p_hat)
## [1] 0.7997433
Based on the sample I would guess the same proportion of 80%.
The second sampling distribution is narrower. It’s better to have a narrower distribution as the chances that the true population is in a smaller interval gives us less values to choose from.