set.seed(199)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
samp1 <- global_monitor %>%
sample_n(50)
ggplot(samp1, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "A smaller pop sample of the Question:"
) +
coord_flip()
samp1 %>%
count(scientist_work) %>%
mutate(samp_le = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n samp_le
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
The distribution of responses in this sample is heavily weighted toward people who believe that the work scientists do benefits them. Compared with the full global_monitor population (80% Benefits, 20% Doesn't benefit), I would say the sample is fairly reflective of the entire population.
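For reference, a quick sketch computing those population proportions directly from global_monitor (nothing new here; the counts follow from how the tibble was built above):
## true population proportions: 80,000 "Benefits" vs 20,000 "Doesn't benefit"
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n / sum(n))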
samp1 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
I don't expect my sample proportion to exactly match another student's sample proportion, because each sample is drawn at random: one student could easily draw a few more "Benefits" responses than another.
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(p_hat_ = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat_
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
The sample proportions in samp2 are very similar to samp1, with far more "Benefits" than "Doesn't benefit" responses (in this run the two samples happen to produce the same counts). I think a sample size of 1000 would provide a more accurate estimate, since counting more people should bring the sample's distribution of responses closer to the population's.
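As a rough check of that claim, here is a sketch of a single larger draw; the size of 1000 is just the value suggested above, not something the lab requires:
## one larger sample (n = 1000) for comparison with samp1 and samp2
samp_large <- global_monitor %>%
  sample_n(1000)
samp_large %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))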
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)
sample_props50 contains one p_hat for each of the 15,000 simulated samples; in the histogram the tallest bar reaches a count of roughly 2,000. The sampling distribution is symmetric with no noticeable skew and has a prominent center at 0.2 (20%).
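To put numbers on that description, a short sketch summarising the simulated distribution; ungroup() is needed because the result of rep_sample_n() and count() is still grouped by replicate:
sample_props50 %>%
  ungroup() %>%
  summarise(
    mean_p_hat = mean(p_hat),  ## center of the sampling distribution
    se_p_hat = sd(p_hat)       ## spread (standard error)
  )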
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small
## # A tibble: 23 x 4
## # Groups: replicate [23]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 2 0.2
## 2 3 Doesn't benefit 1 0.1
## 3 4 Doesn't benefit 4 0.4
## 4 5 Doesn't benefit 6 0.6
## 5 6 Doesn't benefit 2 0.2
## 6 7 Doesn't benefit 1 0.1
## 7 8 Doesn't benefit 1 0.1
## 8 9 Doesn't benefit 1 0.1
## 9 11 Doesn't benefit 6 0.6
## 10 12 Doesn't benefit 1 0.1
## # ... with 13 more rows
There are 23 observations in sample_props_small: two of the 25 samples contained no "Doesn't benefit" responses, so they were dropped. Each observation represents the proportion of respondents in one sample who believe that the work scientists do does not benefit them.
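Note that count() silently drops any replicate with zero "Doesn't benefit" responses, which is why the tibble above has 23 rows rather than 25; a quick sketch to confirm:
## number of replicates that contained at least one "Doesn't benefit" response
nrow(sample_props_small)
n_distinct(sample_props_small$replicate)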
## samples of size 10 of 5000 simulations
set.seed(1)
sample_props_small1 <- global_monitor %>%
rep_sample_n(size = 10, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small1
## # A tibble: 4,486 x 4
## # Groups: replicate [4,486]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 2 Doesn't benefit 4 0.4
## 2 3 Doesn't benefit 5 0.5
## 3 4 Doesn't benefit 1 0.1
## 4 5 Doesn't benefit 2 0.2
## 5 6 Doesn't benefit 1 0.1
## 6 7 Doesn't benefit 2 0.2
## 7 8 Doesn't benefit 1 0.1
## 8 9 Doesn't benefit 3 0.3
## 9 10 Doesn't benefit 2 0.2
## 10 11 Doesn't benefit 1 0.1
## # ... with 4,476 more rows
set.seed(2)
## samples of size 50 of 5000 simulations
sample_props_small2 <- global_monitor %>%
rep_sample_n(size = 50, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small2
## # A tibble: 5,000 x 4
## # Groups: replicate [5,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 8 0.16
## 2 2 Doesn't benefit 9 0.18
## 3 3 Doesn't benefit 13 0.26
## 4 4 Doesn't benefit 5 0.1
## 5 5 Doesn't benefit 10 0.2
## 6 6 Doesn't benefit 16 0.32
## 7 7 Doesn't benefit 8 0.16
## 8 8 Doesn't benefit 9 0.18
## 9 9 Doesn't benefit 10 0.2
## 10 10 Doesn't benefit 6 0.12
## # ... with 4,990 more rows
## samples of size 100 of 5000 simulations
set.seed(3)
sample_props_small3 <- global_monitor %>%
rep_sample_n(size = 100, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small3
## # A tibble: 5,000 x 4
## # Groups: replicate [5,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 26 0.26
## 2 2 Doesn't benefit 17 0.17
## 3 3 Doesn't benefit 16 0.16
## 4 4 Doesn't benefit 20 0.2
## 5 5 Doesn't benefit 22 0.22
## 6 6 Doesn't benefit 17 0.17
## 7 7 Doesn't benefit 18 0.18
## 8 8 Doesn't benefit 22 0.22
## 9 9 Doesn't benefit 19 0.19
## 10 10 Doesn't benefit 24 0.24
## # ... with 4,990 more rows
I set a separate seed for each simulation because the values kept changing between runs. Each observation represents the proportion of people in one sample who believe that scientists' work does not benefit them personally. According to my results, as the sample size increases the sampling distributions concentrate more tightly around the true population proportion of 0.2, and the standard error shrinks compared with the sample size of 10. So the bigger the sample, the more accurate our estimate becomes.
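As a check on that observation (my own sketch, not part of the lab), the standard errors of the three simulated sampling distributions can be compared against the theoretical value sqrt(p(1 - p)/n) with p = 0.2; note the n = 10 simulation drops replicates with zero "Doesn't benefit" responses, so its simulated value is only approximate:
p <- 0.2
tibble(
  n = c(10, 50, 100),
  se_simulated = c(
    sd(sample_props_small1$p_hat),
    sd(sample_props_small2$p_hat),
    sd(sample_props_small3$p_hat)
  ),
  se_theoretical = sqrt(p * (1 - p) / n)  ## shrinks as n grows
)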
set.seed(4)
samp3 <- global_monitor %>%
sample_n(15)
samp3 %>%
count(scientist_work) %>%
mutate(sam = n/sum(n))
## # A tibble: 2 x 3
## scientist_work n sam
## <chr> <int> <dbl>
## 1 Benefits 14 0.933
## 2 Doesn't benefit 1 0.0667
According to my sample, over 93% of people think that the work scientists do benefits them in everyday life. With only 15 respondents, though, a single p_hat can land far from the truth, so my estimate of the true population proportion is somewhere around 80 to 85%.
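To get a sense of how uncertain that single-sample estimate is, a sketch using the standard-error formula for a proportion, sqrt(p_hat * (1 - p_hat) / n), with the counts from samp3 above:
p_hat3 <- 14 / 15                        ## sample proportion of "Benefits" in samp3
se3 <- sqrt(p_hat3 * (1 - p_hat3) / 15)  ## standard error for n = 15
c(p_hat = p_hat3, se = se3)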
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
sample_props15
## # A tibble: 2,000 x 4
## # Groups: replicate [2,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Benefits 11 0.733
## 2 2 Benefits 13 0.867
## 3 3 Benefits 12 0.8
## 4 4 Benefits 12 0.8
## 5 5 Benefits 13 0.867
## 6 6 Benefits 15 1
## 7 7 Benefits 14 0.933
## 8 8 Benefits 13 0.867
## 9 9 Benefits 12 0.8
## 10 10 Benefits 13 0.867
## # ... with 1,990 more rows
ggplot(data = sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of population proportion",
subtitle = "Sample size = 15, Number of samples = 2000"
)
mean(sample_props15$p_hat)
## [1] 0.8038667
Based on the mean of p_hat, I estimate that about 80% of the population believes that the work scientists do benefits their everyday lives.
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
sample_props150
## # A tibble: 2,000 x 4
## # Groups: replicate [2,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Benefits 121 0.807
## 2 2 Benefits 123 0.82
## 3 3 Benefits 113 0.753
## 4 4 Benefits 119 0.793
## 5 5 Benefits 120 0.8
## 6 6 Benefits 126 0.84
## 7 7 Benefits 118 0.787
## 8 8 Benefits 124 0.827
## 9 9 Benefits 127 0.847
## 10 10 Benefits 118 0.787
## # ... with 1,990 more rows
ggplot(data = sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of population proportion",
subtitle = "Sample size = 150, Number of samples = 2000"
)
mean(sample_props150$p_hat)
## [1] 0.80095
This distribution also looks symmetric with no skew, like the graph in Exercise 8, and taking the mean of p_hat to estimate the population proportion again gives about 80%.
Looking at the two charts, I would gauge that the second one (sample size = 150) has the smaller spread. If we want estimates that are more often close to the true population proportion, we prefer the sampling distribution with the smaller spread, which comes from the larger sample size.
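As a final check of that claim, a sketch comparing the spreads of the two sampling distributions directly:
## standard errors of the two sampling distributions
c(
  sd_n15 = sd(sample_props15$p_hat),
  sd_n150 = sd(sample_props150$p_hat)
)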