library(tidyverse)
library(openintro)
library(infer)

The Unknown Sampling Distribution

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2
# draw a sample of 50 without replacement (the default for sample_n)
samp1 <- global_monitor %>%
  sample_n(50)
samp1
## # A tibble: 50 × 1
##    scientist_work 
##    <chr>          
##  1 Doesn't benefit
##  2 Doesn't benefit
##  3 Benefits       
##  4 Doesn't benefit
##  5 Doesn't benefit
##  6 Benefits       
##  7 Doesn't benefit
##  8 Doesn't benefit
##  9 Benefits       
## 10 Benefits       
## # … with 40 more rows
table(samp1$scientist_work)
## 
##        Benefits Doesn't benefit 
##              37              13
# compute the sample proportions
samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           37  0.74
## 2 Doesn't benefit    13  0.26

Exercise 1

Describe the distribution of responses in the sample. How does it compare to the distribution of responses in the population?

# Counts are hard-coded (38 and 12); they differ from the 37/13 draw printed for samp1 above
samp_monitor <- tibble(samp1 = c(rep("Benefits", 38), rep("Doesn't benefit", 12)))
ggplot(samp_monitor, aes(x = samp1)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

samp_monitor %>%
  count(samp1) %>%
  mutate(p_hat = n / sum(n))
## # A tibble: 2 × 3
##   samp1               n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           38  0.76
## 2 Doesn't benefit    12  0.24

Although the sample is small, its proportions (0.76 for "Benefits" and 0.24 for "Doesn't benefit") are close to the population proportions (0.8 and 0.2).

Exercise 2

Would you expect the sample proportion to match the sample proportion of another student’s sample? Why or why not? If the answer is no, would you expect the proportions to be somewhat different or very different?

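Answer. No. Each random sample of 50 picks different individuals from a population of 100,000, so the sample proportions vary from one sample to the next. I would expect them to be somewhat different rather than very different, because every sample estimates the same population proportion.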
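
To illustrate how responses differ from one sample to another, the following minimal sketch draws five independent samples of size 50 and reports the proportion of "Doesn't benefit" responses in each. It uses base R rather than the tidyverse pipeline above; the seed, the number of samples (five), and the names population and p_hats are arbitrary choices made for illustration.

```r
# Rebuild the population used in this lab: 80,000 "Benefits", 20,000 "Doesn't benefit"
population <- c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))

set.seed(2024)  # arbitrary seed, only for reproducibility

# Draw five samples of size 50 (without replacement) and compute p_hat for each
p_hats <- replicate(5, {
  samp <- sample(population, size = 50)
  mean(samp == "Doesn't benefit")
})

p_hats
```

Each value in p_hats is a multiple of 1/50, and the five values will generally not all be equal, which is the variability the question asks about.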

Exercise 3

set.seed(10000)
samp2 <- global_monitor %>%
  sample_n(100)

table(samp2$scientist_work)
## 
##        Benefits Doesn't benefit 
##              82              18
samp2 %>%
  count(scientist_work) %>%
  mutate(p2 = n/sum(n))
## # A tibble: 2 × 3
##   scientist_work      n    p2
##   <chr>           <int> <dbl>
## 1 Benefits           82  0.82
## 2 Doesn't benefit    18  0.18
ggplot(samp2, aes(x = scientist_work)) +
    geom_bar(position = position_dodge(), width = 0.5, fill = "blue" ) +
    coord_flip() + 
    ggtitle("Do you believe that the work scientists do benefit people like you? (sample of 100)") +
    xlab("Element of study") + ylab("sample2")

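The proportions in the second sample (0.82 and 0.18) are close to those in the first sample and to the population proportions (0.8 and 0.2). I would expect the larger sample to give a more accurate estimate of the population proportion, because sample proportions vary less from sample to sample as the sample size grows.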

Exercise 4

How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.

There are 15,000 elements in sample_props50, one for each sample of size 50. Each element is the proportion of “Doesn’t benefit” responses in that sample. According to the histogram, the sampling distribution is centered at about 0.2, which matches the population proportion of 0.2.

sample_props50 <- global_monitor %>%
  infer::rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) + 
  geom_histogram(binwidth = 0.02) + 
  labs(
    x = "p_hat (Doesn't Benefit) ", 
    title = "Sampling distribution of p_hat", 
    subtitle = "Sample size = 50, Number of samples = 15000"
    )

Exercise 5

To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?

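There are 23 observations in the sample_props_small object. Each observation represents the proportion of “Doesn’t benefit” responses in one sample of size 10. There are 23 rather than 25 because some samples (replicate 8, for example) contained no “Doesn’t benefit” responses, so count() produced no row for them and filter() left them out.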

sample_props_small <- global_monitor %>%
  infer::rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  count(scientist_work)%>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
sample_props_small
## # A tibble: 23 × 4
## # Groups:   replicate [23]
##    replicate scientist_work      n p_hat
##        <int> <chr>           <int> <dbl>
##  1         1 Doesn't benefit     2   0.2
##  2         2 Doesn't benefit     1   0.1
##  3         3 Doesn't benefit     1   0.1
##  4         4 Doesn't benefit     3   0.3
##  5         5 Doesn't benefit     1   0.1
##  6         6 Doesn't benefit     1   0.1
##  7         7 Doesn't benefit     2   0.2
##  8         9 Doesn't benefit     2   0.2
##  9        10 Doesn't benefit     3   0.3
## 10        11 Doesn't benefit     2   0.2
## # … with 13 more rows
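
The output above has fewer than 25 rows because count() creates no row for a category that does not occur in a sample, so the filter() step drops any sample without a "Doesn't benefit" response. One possible way to keep all 25 replicates, sketched here as an alternative and not as the lab's prescribed method, is to compute the proportion directly with summarise(). The object name sample_props_small_all and the seed are hypothetical.

```r
library(dplyr)
library(infer)

# Same population as in this lab
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

set.seed(123)  # arbitrary seed, only for reproducibility

sample_props_small_all <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  # rep_sample_n() returns a tibble grouped by replicate, so summarise()
  # yields one row per sample, including samples where p_hat is 0
  summarise(p_hat = mean(scientist_work == "Doesn't benefit"))

nrow(sample_props_small_all)
```

With this approach every replicate contributes a row, so samples with a proportion of 0 stay in the sampling distribution instead of being silently removed.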

Exercise 6

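Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent?

Each observation in the sampling distribution represents the proportion of “Doesn’t benefit” responses in one simulated sample. According to the summary statistics, the spread (standard error) of the sampling distribution decreases as the sample size increases: the interquartile range of p_hat narrows from 0.10 to 0.30 for samples of size 10, to 0.16 to 0.24 for size 50, and to 0.17 to 0.23 for size 100. The median stays at 0.2 for every sample size, and the mean stays close to 0.2 regardless of the number of simulations. The mean for size 10 (0.2234) is somewhat inflated because the filter drops samples with no “Doesn’t benefit” responses, leaving 4,469 of the 5,000 replicates.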

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

sample_props10 <- global_monitor %>%
  infer::rep_sample_n(size = 10, reps = 5000, replace = TRUE) %>%
  count(scientist_work)%>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props10, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) + 
  labs(
    x = "p_hat (Doesn't Benefit) ", 
    title = "Sampling distribution of p_hat", 
    subtitle = "Sample size = 10, Number of samples = 5000")

summary(sample_props10)
##    replicate    scientist_work           n             p_hat       
##  Min.   :   1   Length:4469        Min.   :1.000   Min.   :0.1000  
##  1st Qu.:1250   Class :character   1st Qu.:1.000   1st Qu.:0.1000  
##  Median :2498   Mode  :character   Median :2.000   Median :0.2000  
##  Mean   :2497                      Mean   :2.234   Mean   :0.2234  
##  3rd Qu.:3750                      3rd Qu.:3.000   3rd Qu.:0.3000  
##  Max.   :5000                      Max.   :8.000   Max.   :0.8000
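# Samples of size 50: assign to sample_props50 so that sample_props10 is not overwritten.
# Note: the summary printed below (Length 15000) reflects the 15,000-replicate object
# from Exercise 4, not this 5,000-replicate run.
sample_props50 <- global_monitor %>%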
  infer::rep_sample_n(size = 50, reps = 5000, replace = TRUE) %>%
  count(scientist_work)%>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
summary(sample_props50)
##    replicate     scientist_work           n              p_hat       
##  Min.   :    1   Length:15000       Min.   : 2.000   Min.   :0.0400  
##  1st Qu.: 3751   Class :character   1st Qu.: 8.000   1st Qu.:0.1600  
##  Median : 7500   Mode  :character   Median :10.000   Median :0.2000  
##  Mean   : 7500                      Mean   : 9.954   Mean   :0.1991  
##  3rd Qu.:11250                      3rd Qu.:12.000   3rd Qu.:0.2400  
##  Max.   :15000                      Max.   :23.000   Max.   :0.4600
sample_props100 <- global_monitor %>%
  infer::rep_sample_n(size = 100, reps = 5000, replace = TRUE) %>%
  count(scientist_work)%>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
summary(sample_props100)
##    replicate    scientist_work           n             p_hat       
##  Min.   :   1   Length:5000        Min.   : 7.00   Min.   :0.0700  
##  1st Qu.:1251   Class :character   1st Qu.:17.00   1st Qu.:0.1700  
##  Median :2500   Mode  :character   Median :20.00   Median :0.2000  
##  Mean   :2500                      Mean   :19.98   Mean   :0.1998  
##  3rd Qu.:3750                      3rd Qu.:23.00   3rd Qu.:0.2300  
##  Max.   :5000                      Max.   :35.00   Max.   :0.3500
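
The shrinking spread across sample sizes can be checked against the standard formula for the standard error of a sample proportion, SE = sqrt(p(1 - p) / n). The sketch below evaluates it for the three sample sizes, assuming the population proportion p = 0.2 from this lab; the variable names are illustrative.

```r
p <- 0.2             # population proportion of "Doesn't benefit"
n <- c(10, 50, 100)  # sample sizes used in this exercise

# Theoretical standard error of p_hat for each sample size
se <- sqrt(p * (1 - p) / n)

data.frame(n = n, se = round(se, 4))
```

The standard error falls from about 0.126 at n = 10 to 0.04 at n = 100, in line with the narrowing quartiles in the summaries above.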

Exercise 7

Take a sample of size 15 from population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population of people who think the work scientists do enhances their lives?

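In this sample, 13 of 15 respondents (about 86.7%) think the work scientists do benefits them, so the best point estimate of the population proportion is 0.867. For comparison, the true population proportion is 0.80.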

samp3 <- global_monitor %>%
  sample_n(15)

samp3 %>%
  count(scientist_work) %>%
  mutate(p3 = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n    p3
##   <chr>           <int> <dbl>
## 1 Benefits           13 0.867
## 2 Doesn't benefit     2 0.133

Exercise 8

Since you have access to the population, simulate the sampling distribution of the proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2000 samples of size 15 from the population and computing 2000 sample proportions. Store these proportions as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be? Finally, calculate and report the population proportion.

sample_pros15 <- global_monitor %>%
  infer::rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p3 = n / sum(n)) %>%
  filter(scientist_work == "Benefits")
ggplot(data = sample_pros15, aes(x = p3)) +
  geom_histogram(binwidth = 0.02) + 
  labs(
    x = "p3 (Benefits)",
    title = "Sampling distribution of p3",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

summary(sample_pros15)
##    replicate      scientist_work           n               p3        
##  Min.   :   1.0   Length:2000        Min.   : 6.00   Min.   :0.4000  
##  1st Qu.: 500.8   Class :character   1st Qu.:11.00   1st Qu.:0.7333  
##  Median :1000.5   Mode  :character   Median :12.00   Median :0.8000  
##  Mean   :1000.5                      Mean   :12.01   Mean   :0.8008  
##  3rd Qu.:1500.2                      3rd Qu.:13.00   3rd Qu.:0.8667  
##  Max.   :2000.0                      Max.   :15.00   Max.   :1.0000
mean(sample_pros15$p3)
## [1] 0.8008

Exercise 9

Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sample distribution, what would you guess to be the true proportion of those who think the work scientists do enhances their lives?

sample_pros150 <- global_monitor %>%
  infer::rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p4 = n / sum(n)) %>%
  filter(scientist_work == "Benefits")
ggplot(data = sample_pros150, aes(x = p4)) +
  geom_histogram(binwidth = 0.02) + 
  labs(
    x = "p4 (Benefits)",
    title = "Sampling distribution of people who believe scientist work benefits",
    subtitle = "Sample size = 150, Number of samples = 2000"
  )

summary(sample_pros150)
##    replicate      scientist_work           n               p4        
##  Min.   :   1.0   Length:2000        Min.   :104.0   Min.   :0.6933  
##  1st Qu.: 500.8   Class :character   1st Qu.:117.0   1st Qu.:0.7800  
##  Median :1000.5   Mode  :character   Median :120.0   Median :0.8000  
##  Mean   :1000.5                      Mean   :120.1   Mean   :0.8010  
##  3rd Qu.:1500.2                      3rd Qu.:124.0   3rd Qu.:0.8267  
##  Max.   :2000.0                      Max.   :137.0   Max.   :0.9133
mean(sample_pros150$p4)
## [1] 0.8009767

Exercise 10

Of the sampling distributions from 2 and 3, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?

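The sampling distribution for the larger sample size (150) has the smaller spread: its sample proportions range from about 0.69 to 0.91, compared with 0.40 to 1.00 for samples of size 15. To make estimates that are more often close to the true value, I would prefer a sampling distribution with a small spread, because more of the sample proportions then fall near the population proportion of 0.8.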
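
As a rough numerical check, the spread of the two sampling distributions can be compared through their standard deviations. This sketch uses rbinom() as a stand-in for resampling the population with replacement (an assumption that holds because each draw is then "Benefits" with probability 0.8); the seed and the object names are illustrative.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
p <- 0.8       # population proportion of "Benefits"
reps <- 2000   # number of simulated samples, as in Exercises 8 and 9

# Simulated sample proportions for samples of size 15 and 150
p_hat_15  <- rbinom(reps, size = 15,  prob = p) / 15
p_hat_150 <- rbinom(reps, size = 150, prob = p) / 150

# Compare the spread of the two simulated sampling distributions
c(sd_15 = sd(p_hat_15), sd_150 = sd(p_hat_150))
```

The standard deviation for size 150 should come out well below the one for size 15, which is why estimates from the larger samples land closer to 0.8 more often.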

