Foundations for statistical inference - Sampling distributions
In this lab, you will investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
The Wellcome Global Monitor finds that 20% of people globally do not believe that the work scientists do benefits people like them. In this lab, you will assume this 20% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population. We will first create our population assuming a population size of 100,000. This means 20,000 (20%) of the population think the work scientists do does not benefit them personally and the remaining 80,000 think it does.
set.seed(06151982)
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
The name of the data frame is global_monitor and the name of the variable that contains responses to the question “Do you believe that the work scientists do benefit people like you?” is scientist_work.
We can quickly visualize the distribution of these responses using a bar plot.
ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

We can also obtain summary statistics to confirm we constructed the data frame correctly.
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2
The unknown sampling distribution
In this lab, you have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
If you are interested in estimating the proportion of people who don’t think the work scientists do benefits them, you can use the sample_n command to survey the population.
set.seed(06151983)
samp1 <- global_monitor %>%
  sample_n(50)
Exercise 1:
Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population?
Based on the responses of “samp1,” the distribution of responses is slightly close to the population +/- 4% points.
ggplot(samp1, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample1: Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp1 %>%
  count(scientist_work) %>%
  mutate(Sample1 = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n Sample1
##   <chr>           <int>   <dbl>
## 1 Benefits           38    0.76
## 2 Doesn't benefit    12    0.24
Sample mean formula (i.e. sampling less than 1% of the population):
samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           38  0.76
## 2 Doesn't benefit    12  0.24
Exercise 2:
I would not expect the sample proportion of 50 taken in exercise 1 to match exactly to that of another classmate. However, I would expect the proportions to be slightly similar given that we are doing large enough samplings.
Exercise 3:
After taking a second sample “samp2” and comparing it to “samp1” it appears that both samples have big variances in the responses. Are both samples not big enough? After taking another 2 samples of size 500 - “samp3”, and size 1000 - “samp4” it appears that the answers are a more accurate estimate of the population proportion, especially with the largest sample of size 1000.
set.seed(06151984)
samp2 <- global_monitor %>%
  sample_n(50)
ggplot(samp2, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample 2: Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp2 %>%
  count(scientist_work) %>%
  mutate(Sample2 = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n Sample2
##   <chr>           <int>   <dbl>
## 1 Benefits           43    0.86
## 2 Doesn't benefit     7    0.14
set.seed(06151985)
samp3 <- global_monitor %>%
  sample_n(500)
set.seed(06151986)
samp4 <- global_monitor %>%
  sample_n(1000)
ggplot(samp3, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample 3: Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp3 %>%
  count(scientist_work) %>%
  mutate(Sample3 = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n Sample3
##   <chr>           <int>   <dbl>
## 1 Benefits          415    0.83
## 2 Doesn't benefit    85    0.17
ggplot(samp4, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample 4: Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp4 %>%
  count(scientist_work) %>%
  mutate(Sample4 = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n Sample4
##   <chr>           <int>   <dbl>
## 1 Benefits          801   0.801
## 2 Doesn't benefit   199   0.199
In R, we’re able to take multiple samples of the same proportion to build up the sampling distribution, and then visualize it with a histogram.
set.seed(06151987)
sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

The sampling distribution that you computed tells you much about estimating the true proportion of people who think that the work scientists do doesn’t benefit them. Because the sample proportion is an unbiased estimator, the sampling distribution is centered at the true population proportion, and the spread of the distribution indicates how much variability is incurred by sampling only 50 people at a time from the population.
Exercise 4:
sample_props50 contains 15,000 observations of 50 samples each with 4 variables. The distribution of the samples appear to be under a normal distribution. The samples mean is 0.1995.
summary(sample_props50)
##    replicate     scientist_work           n              p_hat       
##  Min.   :    1   Length:15000       Min.   : 1.000   Min.   :0.0200  
##  1st Qu.: 3751   Class :character   1st Qu.: 8.000   1st Qu.:0.1600  
##  Median : 7500   Mode  :character   Median :10.000   Median :0.2000  
##  Mean   : 7500                      Mean   : 9.974   Mean   :0.1995  
##  3rd Qu.:11250                      3rd Qu.:12.000   3rd Qu.:0.2400  
##  Max.   :15000                      Max.   :23.000   Max.   :0.4600
ggplot(data = sample_props50, aes(sample = p_hat)) + 
  geom_line(stat = "qq")

##### Exercise 5: ###### a smaller sampling distribution of 25 observations of sample size 10 follows. In it there are 23 observations and 4 variables. Each of the observations represent a sample of the population of individuals who think scientist work does not benefit their lives.

set.seed(061519888)
sample_props_small <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props_small, aes(x = p_hat)) +
  geom_histogram(binwidth = .04) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 10, Number of samples = 25"
  )

Exercise 6:
Each observation in the sampling represent the population of people who think that the work scientists do does not benefit them.
As the sample size is increased, the mean gets closer to that of the true population, while the standard error is reduced, and the shape of the sampling distribution becomes more normal. As the number of simulations is increased, so is the accuracy or representation of the sample to the population.
Exercise 7:
Proportion of 15 people who think the work scientists do enhances their lives 86%.
set.seed(06151989)
samp7 <- global_monitor %>%
  sample_n(15)
ggplot(samp7, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample7: Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp7 %>%
  count(scientist_work) %>%
  mutate(Sample7 = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n Sample7
##   <chr>           <int>   <dbl>
## 1 Benefits           13   0.867
## 2 Doesn't benefit     2   0.133
Exercise 8:
Sampling distribution of people who think the work scientists do enhances their lives. 2000 observations of 15 samples each.
- Sampling distribution appears to be normal.
- The true proportion of those who think the work scientists do enhances their lives is likely to be closer to 80%.
- Based on the calculation of the data, it appears that the sampling mean of the proportion is roughly 80% (or 79.98%)
set.seed(06151990)
sample_props15 <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")
ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.06) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

ggplot(data = sample_props15, aes(sample = p_hat)) + 
  geom_line(stat = "qq")

summary(sample_props15)
##    replicate      scientist_work           n          p_hat       
##  Min.   :   1.0   Length:2000        Min.   : 7   Min.   :0.4667  
##  1st Qu.: 500.8   Class :character   1st Qu.:11   1st Qu.:0.7333  
##  Median :1000.5   Mode  :character   Median :12   Median :0.8000  
##  Mean   :1000.5                      Mean   :12   Mean   :0.7998  
##  3rd Qu.:1500.2                      3rd Qu.:13   3rd Qu.:0.8667  
##  Max.   :2000.0                      Max.   :15   Max.   :1.0000
Exercise 9:
Sampling distribution of people who think the work scientists do enhances their lives. 2000 observations of 150 samples each.
- The shape of the sampling distribution appears to be more normal, more accurate (80%) - more representative of the population of individuals who think the work scientists do, does benefit their lives), and with less variability than those observations with 15 samples.
set.seed(06151991)
sample_props150 <- global_monitor %>%
                    rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")
ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000"
  )

ggplot(data = sample_props150, aes(sample = p_hat)) + 
  geom_line(stat = "qq")

summary(sample_props150)
##    replicate      scientist_work           n             p_hat      
##  Min.   :   1.0   Length:2000        Min.   :102.0   Min.   :0.680  
##  1st Qu.: 500.8   Class :character   1st Qu.:117.0   1st Qu.:0.780  
##  Median :1000.5   Mode  :character   Median :120.0   Median :0.800  
##  Mean   :1000.5                      Mean   :120.1   Mean   :0.801  
##  3rd Qu.:1500.2                      3rd Qu.:123.0   3rd Qu.:0.820  
##  Max.   :2000.0                      Max.   :135.0   Max.   :0.900
Exercise 10:
- Of the sampling distributions 2 and 3, no 3 has less variability of a smaller spread from the true population means ( answers of benefit and no benefit).
- A sampling distribution with a smaller spread would be the choice when making estimates that are more often close to the true values.