Foundations for Statistical Inference: Sampling Distributions

The population

In this lab we work with a synthetic population of 100,000 people.
Twenty percent of people are coded as believing that the work scientists do doesn’t benefit people like them, and the other eighty percent believe it does.

global_monitor <- tibble(
  scientist_work = c(
    rep("Benefits", 80000),
    rep("Doesn't benefit", 20000)
  )
)

Population distribution

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n / sum(n))

## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "",
    y = "Count",
    title = "Do you believe that the work scientists do benefits people like you?"
  ) +
  coord_flip()

Comment:
In the population, about 80% of responses are Benefits and about 20% are Doesn’t benefit. So the true population proportion of Doesn’t benefit is p = 0.20.

The unknown sampling distribution

In practice, we almost never see the whole population. Instead, we take a sample and use the sample proportion as a point estimate of the population proportion.

We start by taking one simple random sample of size 50 from the population.

samp1 <- global_monitor %>%
  sample_n(size = 50)

samp1

## # A tibble: 50 × 1
##    scientist_work 
##    <chr>          
##  1 Benefits       
##  2 Benefits       
##  3 Benefits       
##  4 Doesn't benefit
##  5 Benefits       
##  6 Benefits       
##  7 Benefits       
##  8 Benefits       
##  9 Benefits       
## 10 Benefits       
## # ℹ 40 more rows

Exercise 1

Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population?

We first summarize the sample.

samp1_summary <- samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))

samp1_summary

## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           45   0.9
## 2 Doesn't benefit     5   0.1

Answer (Exercise 1):
In this sample of 50 people, most responses are still Benefits, with a smaller group indicating Doesn’t benefit. The sample proportion of Doesn’t benefit (our p_hat) is around one–quarter of the sample (exact value shown in the table). This is reasonably close to the population value of 0.20, but not identical, which is expected because of random sampling variability.

Exercise 2

Would you expect the sample proportion to match the sample proportion of another student’s sample? Why or why not? Would you expect the proportions to be somewhat different or very different?

Answer (Exercise 2):
I would not expect my sample proportion to be exactly the same as another student’s. Each student takes a different random sample, so the mix of Benefits vs Doesn’t benefit in each sample will differ a bit. However, because both of us are sampling from the same population with p = 0.20, I would expect our sample proportions to be somewhat different but not wildly different — they should both typically be fairly close to 0.20.

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would provide a more accurate estimate of the population proportion?

samp2 <- global_monitor %>%
  sample_n(size = 50)

samp2_summary <- samp2 %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))

samp2_summary

## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           41  0.82
## 2 Doesn't benefit     9  0.18

Answer (Exercise 3):
The sample proportion of Doesn’t benefit in samp2 is a bit different from the proportion in samp1, again due to random sampling.

If we took samples of size 100 and 1000, the larger the sample size, the more accurate and more stable we expect p_hat to be. So a sample of 1000 would typically give a more accurate estimate of the population proportion than a sample of 100, which in turn is usually better than a sample of 50.

Building a sampling distribution (n = 50)

Now we approximate the sampling distribution of the sample proportion by repeatedly sampling from the population.

We will take 15,000 samples of size 50, compute the proportion of Doesn’t benefit in each sample, and store those proportions.

sample_props50 <- global_monitor %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  count(replicate, scientist_work) %>%
  group_by(replicate) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

head(sample_props50)

## # A tibble: 6 × 4
## # Groups:   replicate [6]
##   replicate scientist_work      n p_hat
##       <int> <chr>           <int> <dbl>
## 1         1 Doesn't benefit     7  0.14
## 2         2 Doesn't benefit    13  0.26
## 3         3 Doesn't benefit    10  0.2 
## 4         4 Doesn't benefit     5  0.1 
## 5         5 Doesn't benefit    12  0.24
## 6         6 Doesn't benefit     5  0.1

Exercise 4

How many elements are there in sample_props50? Describe the sampling distribution, and note its center. Include a plot.

nrow(sample_props50)

## [1] 15000

ggplot(sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "Sample proportion (Doesn't benefit)",
    y = "Count",
    title = "Sampling distribution of p_hat for n = 50 (15000 samples)"
  )

sample_props50 %>%
  summarise(
    mean_p_hat = mean(p_hat),
    sd_p_hat   = sd(p_hat)
  )

## # A tibble: 15,000 × 3
##    replicate mean_p_hat sd_p_hat
##        <int>      <dbl>    <dbl>
##  1         1       0.14       NA
##  2         2       0.26       NA
##  3         3       0.2        NA
##  4         4       0.1        NA
##  5         5       0.24       NA
##  6         6       0.1        NA
##  7         7       0.14       NA
##  8         8       0.26       NA
##  9         9       0.12       NA
## 10        10       0.12       NA
## # ℹ 14,990 more rows

Answer (Exercise 4):
- The object sample_props50 has one row per sample, so there are 15,000 rows (one p_hat per sample).
- The histogram is roughly bell-shaped and unimodal, centered close to the true population proportion p = 0.20.
- The mean of the sampling distribution is very close to 0.20, and the spread (standard deviation) shows the typical sampling variability when we use samples of size 50.

Exercise 5

Modify the code to create a sampling distribution of 25 sample proportions from samples of size 10, stored in sample_props_small. Print the output. How many observations are there, and what does each observation represent?

sample_props_small <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  count(replicate, scientist_work) %>%
  group_by(replicate) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

sample_props_small

## # A tibble: 21 × 4
## # Groups:   replicate [21]
##    replicate scientist_work      n p_hat
##        <int> <chr>           <int> <dbl>
##  1         1 Doesn't benefit     4   0.4
##  2         2 Doesn't benefit     1   0.1
##  3         3 Doesn't benefit     1   0.1
##  4         4 Doesn't benefit     5   0.5
##  5         5 Doesn't benefit     2   0.2
##  6         6 Doesn't benefit     2   0.2
##  7         7 Doesn't benefit     3   0.3
##  8         8 Doesn't benefit     3   0.3
##  9         9 Doesn't benefit     4   0.4
## 10        10 Doesn't benefit     3   0.3
## # ℹ 11 more rows

nrow(sample_props_small)

## [1] 21

Answer (Exercise 5):
- There are 25 observations in sample_props_small, one for each of the 25 simulated samples.
- Each row represents the sample proportion of people who chose Doesn’t benefit in one random sample of 10 people from the population.

Exercise 6

Using the app (or theory), create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100 with 5,000 simulations. What does each observation represent? How do the mean, standard error, and shape change as sample size increases? How do they change (if at all) when you increase the number of simulations?

Answer (Exercise 6):

In a sampling distribution, each observation corresponds to one sample proportion p_hat computed from a single random sample.
As the sample size n increases (from 10 to 50 to 100):
- The mean of the sampling distribution remains close to the true p = 0.20.
- The standard error (spread) gets smaller, so p_hat values tend to cluster more tightly around 0.20.
- The shape becomes more symmetric and approximately normal as n increases.
Increasing the number of simulations (e.g., from a few hundred to 5,000) does not change the true mean or spread of the sampling distribution; it just makes the histogram smoother and our estimates of the mean and standard error more stable.

Sampling distributions for those who think science enhances their lives

Now we switch focus to the complement: those who think the work scientists do benefits them (the Benefits category).

Exercise 7

Take a sample of size 15 and calculate the proportion who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion?

samp15 <- global_monitor %>%
  sample_n(size = 15)

samp15_summary <- samp15 %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))

samp15_summary

## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           13 0.867
## 2 Doesn't benefit     2 0.133

Answer (Exercise 7):
The sample proportion of Benefits from this sample of 15 is our p_hat for “enhances their lives”. This sample proportion is my best point estimate of the population proportion who think science benefits them personally. Numerically, it will be close to (but not exactly equal to) the true population value of 0.80.

Exercise 8

Simulate the sampling distribution of the proportion of those who think science enhances their lives for samples of size 15 by taking 2,000 samples and computing 2,000 sample proportions. Store these as sample_props15. Plot and describe the shape. What would you guess the true proportion is? Compute and report the actual population proportion.

sample_props15 <- global_monitor %>%
  rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  count(replicate, scientist_work) %>%
  group_by(replicate) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Benefits")

head(sample_props15)

## # A tibble: 6 × 4
## # Groups:   replicate [6]
##   replicate scientist_work     n p_hat
##       <int> <chr>          <int> <dbl>
## 1         1 Benefits          13 0.867
## 2         2 Benefits          13 0.867
## 3         3 Benefits          10 0.667
## 4         4 Benefits          11 0.733
## 5         5 Benefits          13 0.867
## 6         6 Benefits          12 0.8

ggplot(sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "Sample proportion (Benefits)",
    y = "Count",
    title = "Sampling distribution of p_hat (Benefits) for n = 15"
  )

sample_props15 %>%
  summarise(
    mean_p_hat = mean(p_hat),
    sd_p_hat   = sd(p_hat)
  )

## # A tibble: 2,000 × 3
##    replicate mean_p_hat sd_p_hat
##        <int>      <dbl>    <dbl>
##  1         1      0.867       NA
##  2         2      0.867       NA
##  3         3      0.667       NA
##  4         4      0.733       NA
##  5         5      0.867       NA
##  6         6      0.8         NA
##  7         7      0.8         NA
##  8         8      0.667       NA
##  9         9      0.667       NA
## 10        10      0.667       NA
## # ℹ 1,990 more rows

# True population proportion for "Benefits"
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n / sum(n))

## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

Answer (Exercise 8):
The sampling distribution of p_hat for Benefits with n = 15 is roughly unimodal and fairly symmetric, but still somewhat spread out because the sample size is small. The center of the distribution is near about 0.80. Based on this sampling distribution alone, I would guess the true proportion is around 0.8.

The population calculation confirms that the true population proportion for Benefits is exactly 0.80 (80,000 out of 100,000).

Exercise 9

Change the sample size from 15 to 150, compute the sampling distribution using the same method, and store it as sample_props150. Describe the shape and compare it to the sampling distribution for n = 15. Based on this sampling distribution, what would you guess the true proportion to be?

sample_props150 <- global_monitor %>%
  rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(replicate, scientist_work) %>%
  group_by(replicate) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Benefits")

head(sample_props150)

## # A tibble: 6 × 4
## # Groups:   replicate [6]
##   replicate scientist_work     n p_hat
##       <int> <chr>          <int> <dbl>
## 1         1 Benefits         134 0.893
## 2         2 Benefits         125 0.833
## 3         3 Benefits         119 0.793
## 4         4 Benefits         123 0.82 
## 5         5 Benefits         125 0.833
## 6         6 Benefits         122 0.813

ggplot(sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.01) +
  labs(
    x = "Sample proportion (Benefits)",
    y = "Count",
    title = "Sampling distribution of p_hat (Benefits) for n = 150"
  )

sample_props150 %>%
  summarise(
    mean_p_hat = mean(p_hat),
    sd_p_hat   = sd(p_hat)
  )

## # A tibble: 2,000 × 3
##    replicate mean_p_hat sd_p_hat
##        <int>      <dbl>    <dbl>
##  1         1      0.893       NA
##  2         2      0.833       NA
##  3         3      0.793       NA
##  4         4      0.82        NA
##  5         5      0.833       NA
##  6         6      0.813       NA
##  7         7      0.847       NA
##  8         8      0.8         NA
##  9         9      0.86        NA
## 10        10      0.853       NA
## # ℹ 1,990 more rows

Answer (Exercise 9):
For n = 150, the sampling distribution is even more concentrated and looks very close to normal, with much less spread than for n = 15. It is still centered near about 0.80. Based on this distribution, I would again guess the true population proportion is about 0.80, but now I am more confident because the variability is smaller.

Exercise 10

Of the sampling distributions from Exercises 8 and 9 (n = 15 vs n = 150), which has a smaller spread? If you want your estimates to be close to the true value more often, would you prefer a sampling distribution with a large or small spread?

Answer (Exercise 10):
- The sampling distribution for n = 150 has the smaller spread (smaller standard error).
- If I want my sample estimates to land close to the true proportion more often, I prefer a sampling distribution with a small spread. That means using a larger sample size, which reduces sampling variability and makes my estimator more precise. ```