In this lab we work with a synthetic population of
100,000 people.
Twenty percent of people are coded as believing that the work scientists
do doesn’t benefit people like them, and the other
eighty percent believe it does.
global_monitor <- tibble(
scientist_work = c(
rep("Benefits", 80000),
rep("Doesn't benefit", 20000)
)
)
global_monitor %>%
count(scientist_work) %>%
mutate(p = n / sum(n))
## # A tibble: 2 × 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
ggplot(global_monitor, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "",
y = "Count",
title = "Do you believe that the work scientists do benefits people like you?"
) +
coord_flip()
Comment:
In the population, about 80% of responses are Benefits and
about 20% are Doesn’t benefit. So the true population
proportion of Doesn’t benefit is p = 0.20.
In practice, we almost never see the whole population. Instead, we take a sample and use the sample proportion as a point estimate of the population proportion.
We start by taking one simple random sample of size 50 from the population.
samp1 <- global_monitor %>%
sample_n(size = 50)
samp1
## # A tibble: 50 × 1
## scientist_work
## <chr>
## 1 Benefits
## 2 Benefits
## 3 Benefits
## 4 Doesn't benefit
## 5 Benefits
## 6 Benefits
## 7 Benefits
## 8 Benefits
## 9 Benefits
## 10 Benefits
## # ℹ 40 more rows
Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population?
We first summarize the sample.
samp1_summary <- samp1 %>%
count(scientist_work) %>%
mutate(p_hat = n / sum(n))
samp1_summary
## # A tibble: 2 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 45 0.9
## 2 Doesn't benefit 5 0.1
Answer (Exercise 1):
In this sample of 50 people, most responses are still Benefits,
with a smaller group indicating Doesn’t benefit. The sample
proportion of Doesn’t benefit (our p_hat) is around one–quarter
of the sample (exact value shown in the table). This is reasonably close
to the population value of 0.20, but not identical, which is expected
because of random sampling variability.
Would you expect the sample proportion to match the sample proportion of another student’s sample? Why or why not? Would you expect the proportions to be somewhat different or very different?
Answer (Exercise 2):
I would not expect my sample proportion to be exactly
the same as another student’s. Each student takes a different random
sample, so the mix of Benefits vs Doesn’t benefit in
each sample will differ a bit. However, because both of us are sampling
from the same population with p = 0.20, I would expect our sample
proportions to be somewhat different but not wildly
different — they should both typically be fairly close to
0.20.
Take a second sample, also of size 50, and call it
samp2. How does the sample proportion ofsamp2compare with that ofsamp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would provide a more accurate estimate of the population proportion?
samp2 <- global_monitor %>%
sample_n(size = 50)
samp2_summary <- samp2 %>%
count(scientist_work) %>%
mutate(p_hat = n / sum(n))
samp2_summary
## # A tibble: 2 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 41 0.82
## 2 Doesn't benefit 9 0.18
Answer (Exercise 3):
The sample proportion of Doesn’t benefit in samp2
is a bit different from the proportion in samp1, again due
to random sampling.
If we took samples of size 100 and 1000, the larger the sample size, the more accurate and more stable we expect p_hat to be. So a sample of 1000 would typically give a more accurate estimate of the population proportion than a sample of 100, which in turn is usually better than a sample of 50.
Now we approximate the sampling distribution of the sample proportion by repeatedly sampling from the population.
We will take 15,000 samples of size 50, compute the proportion of Doesn’t benefit in each sample, and store those proportions.
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(replicate, scientist_work) %>%
group_by(replicate) %>%
mutate(p_hat = n / sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
head(sample_props50)
## # A tibble: 6 × 4
## # Groups: replicate [6]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 7 0.14
## 2 2 Doesn't benefit 13 0.26
## 3 3 Doesn't benefit 10 0.2
## 4 4 Doesn't benefit 5 0.1
## 5 5 Doesn't benefit 12 0.24
## 6 6 Doesn't benefit 5 0.1
How many elements are there in
sample_props50? Describe the sampling distribution, and note its center. Include a plot.
nrow(sample_props50)
## [1] 15000
ggplot(sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "Sample proportion (Doesn't benefit)",
y = "Count",
title = "Sampling distribution of p_hat for n = 50 (15000 samples)"
)
sample_props50 %>%
summarise(
mean_p_hat = mean(p_hat),
sd_p_hat = sd(p_hat)
)
## # A tibble: 15,000 × 3
## replicate mean_p_hat sd_p_hat
## <int> <dbl> <dbl>
## 1 1 0.14 NA
## 2 2 0.26 NA
## 3 3 0.2 NA
## 4 4 0.1 NA
## 5 5 0.24 NA
## 6 6 0.1 NA
## 7 7 0.14 NA
## 8 8 0.26 NA
## 9 9 0.12 NA
## 10 10 0.12 NA
## # ℹ 14,990 more rows
Answer (Exercise 4):
- The object sample_props50 has one row per sample, so
there are 15,000 rows (one p_hat per sample).
- The histogram is roughly bell-shaped and unimodal,
centered close to the true population proportion p = 0.20.
- The mean of the sampling distribution is very close to 0.20, and the
spread (standard deviation) shows the typical sampling variability when
we use samples of size 50.
Modify the code to create a sampling distribution of 25 sample proportions from samples of size 10, stored in
sample_props_small. Print the output. How many observations are there, and what does each observation represent?
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(replicate, scientist_work) %>%
group_by(replicate) %>%
mutate(p_hat = n / sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small
## # A tibble: 21 × 4
## # Groups: replicate [21]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 4 0.4
## 2 2 Doesn't benefit 1 0.1
## 3 3 Doesn't benefit 1 0.1
## 4 4 Doesn't benefit 5 0.5
## 5 5 Doesn't benefit 2 0.2
## 6 6 Doesn't benefit 2 0.2
## 7 7 Doesn't benefit 3 0.3
## 8 8 Doesn't benefit 3 0.3
## 9 9 Doesn't benefit 4 0.4
## 10 10 Doesn't benefit 3 0.3
## # ℹ 11 more rows
nrow(sample_props_small)
## [1] 21
Answer (Exercise 5):
- There are 25 observations in
sample_props_small, one for each of the 25 simulated
samples.
- Each row represents the sample proportion of people
who chose Doesn’t benefit in one random sample of
10 people from the population.
Using the app (or theory), create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100 with 5,000 simulations. What does each observation represent? How do the mean, standard error, and shape change as sample size increases? How do they change (if at all) when you increase the number of simulations?
Answer (Exercise 6):
Now we switch focus to the complement: those who think the work scientists do benefits them (the Benefits category).
Take a sample of size 15 and calculate the proportion who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion?
samp15 <- global_monitor %>%
sample_n(size = 15)
samp15_summary <- samp15 %>%
count(scientist_work) %>%
mutate(p_hat = n / sum(n))
samp15_summary
## # A tibble: 2 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 13 0.867
## 2 Doesn't benefit 2 0.133
Answer (Exercise 7):
The sample proportion of Benefits from this sample of 15 is our
p_hat for “enhances their lives”. This sample proportion is my
best point estimate of the population proportion who
think science benefits them personally. Numerically, it will be close to
(but not exactly equal to) the true population value of 0.80.
Simulate the sampling distribution of the proportion of those who think science enhances their lives for samples of size 15 by taking 2,000 samples and computing 2,000 sample proportions. Store these as
sample_props15. Plot and describe the shape. What would you guess the true proportion is? Compute and report the actual population proportion.
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(replicate, scientist_work) %>%
group_by(replicate) %>%
mutate(p_hat = n / sum(n)) %>%
filter(scientist_work == "Benefits")
head(sample_props15)
## # A tibble: 6 × 4
## # Groups: replicate [6]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Benefits 13 0.867
## 2 2 Benefits 13 0.867
## 3 3 Benefits 10 0.667
## 4 4 Benefits 11 0.733
## 5 5 Benefits 13 0.867
## 6 6 Benefits 12 0.8
ggplot(sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "Sample proportion (Benefits)",
y = "Count",
title = "Sampling distribution of p_hat (Benefits) for n = 15"
)
sample_props15 %>%
summarise(
mean_p_hat = mean(p_hat),
sd_p_hat = sd(p_hat)
)
## # A tibble: 2,000 × 3
## replicate mean_p_hat sd_p_hat
## <int> <dbl> <dbl>
## 1 1 0.867 NA
## 2 2 0.867 NA
## 3 3 0.667 NA
## 4 4 0.733 NA
## 5 5 0.867 NA
## 6 6 0.8 NA
## 7 7 0.8 NA
## 8 8 0.667 NA
## 9 9 0.667 NA
## 10 10 0.667 NA
## # ℹ 1,990 more rows
# True population proportion for "Benefits"
global_monitor %>%
count(scientist_work) %>%
mutate(p = n / sum(n))
## # A tibble: 2 × 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
Answer (Exercise 8):
The sampling distribution of p_hat for Benefits with n = 15 is
roughly unimodal and fairly symmetric,
but still somewhat spread out because the sample size is small. The
center of the distribution is near about 0.80. Based on
this sampling distribution alone, I would guess the true proportion is
around 0.8.
The population calculation confirms that the true population proportion for Benefits is exactly 0.80 (80,000 out of 100,000).
Change the sample size from 15 to 150, compute the sampling distribution using the same method, and store it as
sample_props150. Describe the shape and compare it to the sampling distribution for n = 15. Based on this sampling distribution, what would you guess the true proportion to be?
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(replicate, scientist_work) %>%
group_by(replicate) %>%
mutate(p_hat = n / sum(n)) %>%
filter(scientist_work == "Benefits")
head(sample_props150)
## # A tibble: 6 × 4
## # Groups: replicate [6]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Benefits 134 0.893
## 2 2 Benefits 125 0.833
## 3 3 Benefits 119 0.793
## 4 4 Benefits 123 0.82
## 5 5 Benefits 125 0.833
## 6 6 Benefits 122 0.813
ggplot(sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.01) +
labs(
x = "Sample proportion (Benefits)",
y = "Count",
title = "Sampling distribution of p_hat (Benefits) for n = 150"
)
sample_props150 %>%
summarise(
mean_p_hat = mean(p_hat),
sd_p_hat = sd(p_hat)
)
## # A tibble: 2,000 × 3
## replicate mean_p_hat sd_p_hat
## <int> <dbl> <dbl>
## 1 1 0.893 NA
## 2 2 0.833 NA
## 3 3 0.793 NA
## 4 4 0.82 NA
## 5 5 0.833 NA
## 6 6 0.813 NA
## 7 7 0.847 NA
## 8 8 0.8 NA
## 9 9 0.86 NA
## 10 10 0.853 NA
## # ℹ 1,990 more rows
Answer (Exercise 9):
For n = 150, the sampling distribution is even more
concentrated and looks very close to
normal, with much less spread than for n = 15. It is still
centered near about 0.80. Based on this distribution, I would again
guess the true population proportion is about 0.80, but
now I am more confident because the variability is smaller.
Of the sampling distributions from Exercises 8 and 9 (n = 15 vs n = 150), which has a smaller spread? If you want your estimates to be close to the true value more often, would you prefer a sampling distribution with a large or small spread?
Answer (Exercise 10):
- The sampling distribution for n = 150 has the
smaller spread (smaller standard error).
- If I want my sample estimates to land close to the true
proportion more often, I prefer a sampling distribution with a
small spread. That means using a larger sample
size, which reduces sampling variability and makes my estimator
more precise. ```