Foundations for statistical inference - Sampling distributions
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
set.seed(2)
The data
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n / sum(n))
## # A tibble: 2 x 3
## scientist_work n p
## * <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
The unknown sampling distribution
sampl <- global_monitor %>%
  sample_n(50)
Exercise 1:
Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population?
The sample distribution is very similar to the population distribution: each category's sample proportion differs from its population proportion by only 4 percentage points (0.84 vs. 0.80 for “Benefits”, 0.16 vs. 0.20 for “Doesn’t benefit”). The sample does not match the population exactly, which is expected because of random sampling variation.
ggplot(sampl, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

sampl %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## * <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
Exercise 2:
Would you expect your sample proportion to match the sample proportion of another student’s sample?
I would expect another student’s sample proportion to differ slightly from mine. It is unlikely that their sample contains exactly the same number of “Doesn’t benefit” responses as mine. However, their sample is drawn from the same population, so I would not expect it to be far off.
Exercise 3:
Take another sample of size 50 and call it sampl2. How does the sample proportion of sampl2 compare with that of sampl? Suppose we took two more samples, one of size 100 and one of size 1000. Which would provide a more accurate estimate of the population proportion?
The second sample proportion (0.18 for “Doesn’t benefit”) is close to my first (0.16), and it happens to be even closer to the population proportion of 0.2. In general, the larger the sample, the more precise the estimate of the population proportion, so I would expect the sample of size 1000 to give a more accurate estimate than the sample of size 100. However, when sampling without replacement, the sample should not exceed 10% of the population, so that the observations can still be treated as approximately independent.
sampl2 <- global_monitor %>%
  sample_n(50)
sampl2 %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## * <chr> <int> <dbl>
## 1 Benefits 41 0.82
## 2 Doesn't benefit 9 0.18
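The claim that larger samples give more accurate estimates can be checked by simulation. The following is a minimal sketch in base R, assuming the same 80/20 population as above. The helper name p_hat_once and the choice of 1000 repetitions per sample size are illustrative and not part of the lab.

```r
set.seed(2)
# Same population as global_monitor: 80% "Benefits", 20% "Doesn't benefit"
population <- c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))

# Hypothetical helper: proportion of "Doesn't benefit" in one sample of size n
# (drawn without replacement, like sample_n)
p_hat_once <- function(n) mean(sample(population, n) == "Doesn't benefit")

# For each sample size, repeat the sampling 1000 times and measure how much
# the sample proportions vary from one sample to the next
sizes <- c(n50 = 50, n100 = 100, n1000 = 1000)
spread <- sapply(sizes, function(n) sd(replicate(1000, p_hat_once(n))))
spread
```

A smaller standard deviation means that a single sample of that size tends to land closer to the population proportion of 0.2.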
sample_props50 <- global_monitor %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

Exercise 4:
How many elements are there in sample_props50? Describe the sampling distribution and be sure to specifically note its center.
sample_props50 contains 15000 elements: one proportion of “Doesn’t benefit” for each of the 15000 samples of size 50. The sampling distribution appears to be centered at about 0.2, which is the actual proportion in the population.
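One way to check the element count and the center numerically is sketched below, under an assumption: because the samples are drawn with replacement and the population proportion is 0.2, the number of “Doesn’t benefit” responses in each sample of 50 can be treated as a binomial draw. The vector name p_hats is hypothetical and stands in for the p_hat column of sample_props50.

```r
set.seed(2)
# Number of "Doesn't benefit" responses in each of 15000 samples of size 50,
# simulated as binomial draws with success probability 0.2
counts <- rbinom(n = 15000, size = 50, prob = 0.2)
p_hats <- counts / 50

length(p_hats)  # number of simulated sample proportions
mean(p_hats)    # center of the simulated sampling distribution
sd(p_hats)      # spread, an estimate of the standard error
```

The mean should land very near 0.2, though the exact value depends on the seed.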
Interlude: Sampling distributions
Exercise 5:
To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sampling proportions from samples of size 10, and put them in a data frame named sample_props_small. How many observations are there in this object and what does each observation represent?
There are 25 observations, and each one represents the proportion of “Doesn’t benefit” responses in one random sample of size 10. With only 25 samples, this gives a much coarser picture of the sampling distribution than the 15000 samples used above.
sample_props_small <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props_small, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.1) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 10, Number of samples = 25"
  )
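One caveat with the count-then-filter pattern is sketched below, under the same population and sample settings. A sample of size 10 that happens to contain no “Doesn’t benefit” responses has no row for that category, so filter() drops it and the result can have fewer than 25 rows. Computing the proportion with summarise() keeps one row per sample. The object names small_samples, filtered, and summarised are hypothetical.

```r
library(tidyverse)
library(infer)

set.seed(2)
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# 25 samples of size 10, grouped by the `replicate` column that infer adds
small_samples <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE)

# Count-then-filter: samples with zero "Doesn't benefit" responses vanish
filtered <- small_samples %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

# Alternative: one row per replicate, with p_hat = 0 when the category is absent
summarised <- small_samples %>%
  summarise(p_hat = mean(scientist_work == "Doesn't benefit"))

nrow(filtered)    # at most 25
nrow(summarised)  # always 25
```

If the two row counts differ, the difference is the number of samples in which nobody answered “Doesn’t benefit”.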

Sample size and the sampling distribution
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

Exercise 6:
Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5000 simulations. What does each observation in the sampling distribution represent? How do the mean, standard error, and shape of the sampling distribution change as the sample size increases?
Each observation represents the proportion of “Doesn’t benefit” responses in one random sample of size n from the population. As the sample size increases, the shape becomes more normal, the mean moves closer to the population proportion, and the SE decreases. However, the results for sample sizes 50 and 100 were not far apart, so the gain in accuracy appears to level off as the sample size grows.
At sample size 10, the shape is skewed right, the mean is 0.22, and the SE is 0.11. At sample size 50, the shape is nearly normal, the mean is 0.2 (the population proportion), and the SE is 0.06. At sample size 100, the shape is also normal, the mean remains at 0.2, and the SE is 0.04.
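For reference, the simulated standard errors can be compared with the usual formula for the standard error of a sample proportion, with p = 0.2 taken from the population defined earlier and n = 10, 50, and 100:

```latex
SE_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}, \qquad
\sqrt{\frac{0.2 \times 0.8}{10}} \approx 0.126, \quad
\sqrt{\frac{0.2 \times 0.8}{50}} \approx 0.057, \quad
\sqrt{\frac{0.2 \times 0.8}{100}} = 0.04
```

Simulated values vary from run to run, so they will not match these figures exactly.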
More practice
Exercise 7:
Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do benefits them. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do benefits them?
sample_benefit_1 <- global_monitor %>%
  sample_n(15)
sample_benefit_1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## * <chr> <int> <dbl>
## 1 Benefits 9 0.6
## 2 Doesn't benefit 6 0.4
From this sample, the best point estimate is 0.6: we would estimate that approximately 60% of the population believe that the work scientists do benefits them.
Exercise 8:
Since you have access to the population, simulate the sampling distribution of the proportion of those who think the work scientists do benefits them for samples of size 15 by taking 2000 samples from the population. Store these proportions in sample_props15. Plot the data and describe the shape of the distribution. Based on the sampling distribution, what do you think the true proportion of those who think scientists’ work benefits them would be?
sample_props15 <- global_monitor %>%
  rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Benefits")
ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.05) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

The distribution is skewed left. Based on the sampling distribution, the peak falls at around 0.8, so we would estimate the population proportion of “Benefits” to be around 80%.
Exercise 9:
Change your sample size from 15 to 150, then compute the sampling distribution and store proportions in sample_props150. Describe the shape of the distribution and compare it to that of sample size 15.
sample_props150 <- global_monitor %>%
  rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Benefits")
ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000"
  )

The distribution is slightly skewed left but looks much more normal than the sampling distribution for samples of size 15. I would still estimate the population proportion to be 80%, since the peak still falls at 0.8.
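The visual comparison of skew can be backed by a number. The sketch below makes two assumptions: sampling with replacement is simulated as binomial draws with p = 0.8, and 20000 repetitions are used instead of the 2000 above so that the skewness estimates are stable. The skewness helper is hypothetical (base R has no built-in skewness function) and uses a simple moment-based formula.

```r
set.seed(2)
p <- 0.8  # population proportion of "Benefits"

# Hypothetical helper: moment-based sample skewness
# (negative values indicate a longer left tail)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Sample proportions for samples of size 15 and 150
p_hat_15  <- rbinom(20000, size = 15,  prob = p) / 15
p_hat_150 <- rbinom(20000, size = 150, prob = p) / 150

skew_15  <- skewness(p_hat_15)
skew_150 <- skewness(p_hat_150)
c(n15 = skew_15, n150 = skew_150)
```

Both values should be negative (left skew), with the value for sample size 150 much closer to zero.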
Exercise 10:
Of the two sampling distributions, which has the smaller spread? If you want estimates that are closer to the true value, would you prefer a sampling distribution with a small or large spread?
The sampling distribution for sample size 15 ranges from approximately 0.4 to 1, while the sampling distribution for sample size 150 ranges from approximately 0.7 to 0.9. Therefore, the sampling distribution based on the larger sample size has the smaller spread. When estimating a population proportion, we prefer the distribution with the smaller spread, because individual sample proportions then tend to fall closer to the true value.