Foundations for statistical inference - Sampling distributions

library(tidyverse)
library(openintro)
library(infer)
library(ggplot2)

The data

A 2019 Gallup report states the following:

The Wellcome Global Monitor finds that 20% of people globally do not believe that the work scientists do benefits people like them. In this lab, you will assume this 20% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population. We will first create our population assuming a population size of 100,000. This means 20,000 (20%) of the population think the work scientists do does not benefit them personally and the remaining 80,000 think it does.

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

Summary statistics

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

The unknown sampling distribution

samp1 <- global_monitor %>%
  sample_n(50)

Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population. Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. Code you presented earlier for visualizing and summarizing the population data will still be useful for the sample, however be careful to not label your proportion p since you’re now calculating a sample statistic, not a population parameters. You can customize the label of the statistics to indicate that it comes from the sample.

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           37  0.74
## 2 Doesn't benefit    13  0.26

Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.

It will be a coincidence if the sample proportion match.They ought not to match because the values are randomly selected

Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?

A sample size of 1000 is expected to provide a more accurate estimate of the population proportion according to central limit theorem

samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           42  0.84
## 2 Doesn't benefit     8  0.16

library(pastecs)
sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

stat.desc(sample_props50)

##                 replicate scientist_work            n        p_hat
## nbr.val      1.500000e+04             NA 1.500000e+04 1.500000e+04
## nbr.null     0.000000e+00             NA 0.000000e+00 0.000000e+00
## nbr.na       0.000000e+00             NA 0.000000e+00 0.000000e+00
## min          1.000000e+00             NA 1.000000e+00 2.000000e-02
## max          1.500000e+04             NA 2.100000e+01 4.200000e-01
## range        1.499900e+04             NA 2.000000e+01 4.000000e-01
## sum          1.125075e+08             NA 1.493910e+05 2.987820e+03
## median       7.500500e+03             NA 1.000000e+01 2.000000e-01
## mean         7.500500e+03             NA 9.959400e+00 1.991880e-01
## SE.mean      3.535652e+01             NA 2.323947e-02 4.647893e-04
## CI.mean.0.95 6.930309e+01             NA 4.555219e-02 9.110438e-04
## var          1.875125e+07             NA 8.101092e+00 3.240437e-03
## std.dev      4.330271e+03             NA 2.846242e+00 5.692483e-02
## coef.var     5.773310e-01             NA 2.857845e-01 2.857845e-01

To visualize the distribution of these proportions with a histogram.

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.

There are 15000 elements in sample_props50

N(0.199,0.000465)

#Plot of the distribution
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

global_monitor %>%
  sample_n(size = 50, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

## # A tibble: 1 x 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Doesn't benefit     6  0.12

To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?

library(pastecs)
sample_props_small <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

sample_props_small

## # A tibble: 23 x 4
## # Groups:   replicate [23]
##    replicate scientist_work      n p_hat
##        <int> <chr>           <int> <dbl>
##  1         1 Doesn't benefit     2   0.2
##  2         2 Doesn't benefit     2   0.2
##  3         3 Doesn't benefit     1   0.1
##  4         4 Doesn't benefit     3   0.3
##  5         5 Doesn't benefit     1   0.1
##  6         6 Doesn't benefit     1   0.1
##  7         8 Doesn't benefit     2   0.2
##  8         9 Doesn't benefit     2   0.2
##  9        10 Doesn't benefit     1   0.1
## 10        11 Doesn't benefit     1   0.1
## # ... with 13 more rows

dim(sample_props_small)

## [1] 23  4

round(stat.desc(sample_props_small),4)

##              replicate scientist_work       n   p_hat
## nbr.val        23.0000             NA 23.0000 23.0000
## nbr.null        0.0000             NA  0.0000  0.0000
## nbr.na          0.0000             NA  0.0000  0.0000
## min             1.0000             NA  1.0000  0.1000
## max            24.0000             NA  5.0000  0.5000
## range          23.0000             NA  4.0000  0.4000
## sum           293.0000             NA 47.0000  4.7000
## median         13.0000             NA  2.0000  0.2000
## mean           12.7391             NA  2.0435  0.2043
## SE.mean         1.4867             NA  0.2221  0.0222
## CI.mean.0.95    3.0833             NA  0.4606  0.0461
## var            50.8379             NA  1.1344  0.0113
## std.dev         7.1301             NA  1.0651  0.1065
## coef.var        0.5597             NA  0.5212  0.5212

sample_props_small’ represent a sample distribution of means of 25 samples
Each observation represent the mean of each sample

Sample size and the sampling distribution

Mechanics aside, let’s return to the reason we used the rep_sample_n function: to compute a sampling distribution, specifically, the sampling distribution of the proportions from samples of 50 people.

#fig.show='hide'
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How does the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)

Each observation represent the mean of each sample that was collected from the target population
The mean and standard error of the distribution changed from 0.22 and 0.11 to 0.2 and 0.06 respectively when the sample size increased from 10 to 50. Because according to the central limit theorem, the standard error decreases as the sample size increases.
This is evident as the mean decreased to 0.2 and the standard error decreased to 0.04. The number of simulations have no impact on the mean and standard error

Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enchances their lives?

samp2 <- global_monitor %>% 
  sample_n(15)
samp2

## # A tibble: 15 x 1
##    scientist_work 
##    <chr>          
##  1 Benefits       
##  2 Benefits       
##  3 Benefits       
##  4 Doesn't benefit
##  5 Doesn't benefit
##  6 Benefits       
##  7 Benefits       
##  8 Benefits       
##  9 Benefits       
## 10 Doesn't benefit
## 11 Benefits       
## 12 Doesn't benefit
## 13 Benefits       
## 14 Benefits       
## 15 Doesn't benefit

samp2 %>%
  rep_sample_n(15) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")

## # A tibble: 1 x 4
## # Groups:   replicate [1]
##   replicate scientist_work     n p_hat
##       <int> <chr>          <int> <dbl>
## 1         1 Benefits          10 0.667

dim(samp2)

## [1] 15  1

round(stat.desc(samp2),4)

##          scientist_work
## nbr.val              NA
## nbr.null             NA
## nbr.na               NA
## min                  NA
## max                  NA
## range                NA
## sum                  NA
## median               NA
## mean                 NA
## SE.mean              NA
## CI.mean              NA
## var                  NA
## std.dev              NA
## coef.var             NA

Using the sample size n = 15 the point estimate for the population proportion of people who think the work scientists do enchances their lives is 0.667

Since you have access to the population, simulate the sampling distribution of proportion of those who think the work scientists do enchances their lives for samples of size 15 by taking 2000 samples from the population of size 15 and computing 2000 sample proportions. Store these proportions in as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enchances their lives to be? Finally, calculate and report the population proportion.

sample_props15 <- global_monitor %>%
  rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")

dim(sample_props15)

## [1] 2000    4

round(stat.desc(sample_props15),4)

##                 replicate scientist_work          n     p_hat
## nbr.val         2000.0000             NA  2000.0000 2000.0000
## nbr.null           0.0000             NA     0.0000    0.0000
## nbr.na             0.0000             NA     0.0000    0.0000
## min                1.0000             NA     6.0000    0.4000
## max             2000.0000             NA    15.0000    1.0000
## range           1999.0000             NA     9.0000    0.6000
## sum          2001000.0000             NA 24123.0000 1608.2000
## median          1000.5000             NA    12.0000    0.8000
## mean            1000.5000             NA    12.0615    0.8041
## SE.mean           12.9132             NA     0.0341    0.0023
## CI.mean.0.95      25.3247             NA     0.0669    0.0045
## var           333500.0000             NA     2.3249    0.0103
## std.dev          577.4946             NA     1.5248    0.1017
## coef.var           0.5772             NA     0.1264    0.1264

sample_props15 %>% ggplot(aes(x=p_hat)) +
  geom_histogram(binwidth = 0.02)

The shape of the distribution is skewed to the left
The true proportion of those who think the work scientists do enchance their lives is 0.8041
The population proportion is approximately the mean of the sampling distribution which is 0.8041

Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work of scientists do enchances their lives?

sample_props150 <- global_monitor %>%
  rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")

dim(sample_props150)

## [1] 2000    4

round(stat.desc(sample_props150),4)

##                 replicate scientist_work           n     p_hat
## nbr.val         2000.0000             NA   2000.0000 2000.0000
## nbr.null           0.0000             NA      0.0000    0.0000
## nbr.na             0.0000             NA      0.0000    0.0000
## min                1.0000             NA    101.0000    0.6733
## max             2000.0000             NA    135.0000    0.9000
## range           1999.0000             NA     34.0000    0.2267
## sum          2001000.0000             NA 240123.0000 1600.8200
## median          1000.5000             NA    120.0000    0.8000
## mean            1000.5000             NA    120.0615    0.8004
## SE.mean           12.9132             NA      0.1081    0.0007
## CI.mean.0.95      25.3247             NA      0.2119    0.0014
## var           333500.0000             NA     23.3534    0.0010
## std.dev          577.4946             NA      4.8325    0.0322
## coef.var           0.5772             NA      0.0403    0.0403

sample_props150 %>% ggplot(aes(x=p_hat)) +
  geom_histogram(binwidth = 0.02)

The shape of this sampling distribution seem normal (bell-shaped) and it did not show any attribute of skewness unlike the sampling distribution with 15 sample size
The true population mean of those who think the work of scientists do enchance their lives is the mean of the sampling distribution, which is 0.8004

Of the sampling distributions from 2 and 3, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?

The sampling distribution with 150 sample size has lesser spread
The sampling distribution with small spread gives estimate with the true value

Foundations for statistical inference - Sampling distributions

Nnaemeka Newman Okereafor

10/13/2021

The data

The unknown sampling distribution

Sample size and the sampling distribution