knitr::opts_chunk$set(eval = TRUE, message = FALSE, warning = FALSE)
set.seed(1234)
library(tidyverse)
library(openintro)
library(infer)

Exercise 1

Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. Code you presented earlier for visualizing and summarizing the population data will still be useful for the sample; however, be careful not to label your proportion p, since you’re now calculating a sample statistic, not a population parameter. You can customize the label of the statistic to indicate that it comes from the sample.

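The distribution of the sample is similar to that of the population. In the sample, 74% answered “Benefits” and 26% answered “Doesn’t benefit”, compared with 80% and 20% in the population, so each group differs by 6 percentage points.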

set.seed(1234)
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2
set.seed(1234)

samp1 <- global_monitor %>%
  sample_n(50)

ggplot(samp1, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           37  0.74
## 2 Doesn't benefit    13  0.26

Exercise 2

Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.

I would expect the sample proportion to be similar to, but not exactly the same as, another student’s sample proportion. If they used the same seed, the results would be identical. If they used a different seed, the results would likely be different but close. Because the population is split 80% to 20%, a random sample is much more likely to land near 80% for the Benefits group and near 20% for the Doesn’t benefit group than far from those values. I confirmed my answer by running the sample again with a different seed. The results were 72% and 28%, which is similar to my sample.

set.seed(5678)

samp <- global_monitor %>%
  sample_n(50)

ggplot(samp, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp |>
  count(scientist_work) |>
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           36  0.72
## 2 Doesn't benefit    14  0.28
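
As a rough check on how much different students' samples might disagree, the sketch below repeats the size-50 draw 1,000 times in base R. The helper `one_p_hat`, the seed 42, and the choice of 1,000 repetitions are illustrative assumptions, not part of the lab.

```r
# Rebuild the population used in this lab:
# 80,000 "Benefits" and 20,000 "Doesn't benefit" responses.
population <- c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))

# Hypothetical helper: draw one sample of size n (without replacement)
# and return the proportion of "Benefits" responses in it.
one_p_hat <- function(n) mean(sample(population, n) == "Benefits")

set.seed(42)  # arbitrary seed, chosen only for reproducibility
p_hats <- replicate(1000, one_p_hat(50))

# Range and spread of what 1,000 different "students" would have seen
summary(p_hats)
sd(p_hats)
```

Most of the simulated proportions should sit within several percentage points of 0.8, which is consistent with seeing values such as 0.72 and 0.74 from individual samples.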

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?

The sample proportion of samp2 is similar to that of samp1: 0.82 for Benefits compared with 0.74, a difference of 8 percentage points. I would expect a sample of size 1000 to be more accurate than a sample of size 100. I tested this below: the sample of size 100 gave 0.85 (off by 0.05 from the population proportion of 0.8), while the sample of size 1000 gave 0.79 (off by 0.01). The exact results depend on the seed, but in this case the sample of size 1000 was more accurate.

set.seed(246)

samp2 <- global_monitor %>%
  sample_n(50)

ggplot(samp2, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp2 |>
  count(scientist_work) |>
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           41  0.82
## 2 Doesn't benefit     9  0.18
set.seed(500)

samp3 <- global_monitor %>%
  sample_n(100)

ggplot(samp3, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp3 |>
  count(scientist_work) |>
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           85  0.85
## 2 Doesn't benefit    15  0.15
set.seed(500)

samp4 <- global_monitor %>%
  sample_n(1000)

ggplot(samp4, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp4 |>
  count(scientist_work) |>
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits          790  0.79
## 2 Doesn't benefit   210  0.21
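
The reason a larger sample tends to be more accurate can be sketched with the standard error of a sample proportion, sqrt(p(1 - p) / n). The calculation below assumes the population proportion of 0.8 from this lab and ignores the finite-population correction, which is negligible when sampling at most 1,000 rows from 100,000.

```r
p <- 0.8                      # population proportion of "Benefits" in this lab
n <- c(50, 100, 1000)         # sample sizes discussed in Exercise 3
se <- sqrt(p * (1 - p) / n)   # standard error of a sample proportion

data.frame(n = n, se = round(se, 4))
```

The standard error falls from roughly 0.057 at n = 50 to 0.040 at n = 100 and about 0.013 at n = 1000, so a single sample of 1000 is typically much closer to 0.8, although any one sample can still be unlucky.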

Exercise 4

How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.

It has 15000 elements, one sample proportion for each of the 15000 samples. The sampling distribution is approximately normal: it is symmetric and bell-shaped. Its center is at about 0.2, which is the population proportion of “Doesn’t benefit” responses.

sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

length(sample_props50$scientist_work)
## [1] 15000
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )
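
The center and spread read off the histogram can also be checked numerically. The sketch below is one possible way to do that, not the lab's prescribed method: it rebuilds the population, computes each replicate's proportion with `summarise()` instead of `count()` plus `filter()`, and compares the simulated spread with the theoretical standard error. The object name `sample_props50_check` is an assumption made for this example.

```r
library(dplyr)
library(infer)

set.seed(1234)
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# rep_sample_n() returns data grouped by `replicate`, so summarise() yields
# one row per sample; mean() of a logical vector is the proportion of TRUE.
sample_props50_check <- global_monitor %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  summarise(p_hat = mean(scientist_work == "Doesn't benefit"))

sample_props50_check %>%
  summarise(
    n_samples = n(),                    # number of sample proportions
    center    = mean(p_hat),            # should be close to 0.2
    spread    = sd(p_hat),              # simulated standard error
    theory_se = sqrt(0.2 * 0.8 / 50)    # theoretical standard error
  )
```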

Exercise 5

To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?

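There are 24 observations in the object rather than 25, most likely because one of the 25 samples contained no “Doesn’t benefit” responses, so count() and filter() produced no row for it. Each observation represents one repetition of taking a sample of 10 people and computing the proportion of them who said “Doesn’t benefit”.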

sample_props_small <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props_small, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 10, Number of samples = 25"
  )
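
With samples this small, some samples may contain no “Doesn’t benefit” responses at all, and the `count()` plus `filter()` pipeline silently drops those samples instead of recording a proportion of 0. The sketch below shows one alternative that keeps every replicate; the name `sample_props_small_all` is an assumption made for this example.

```r
library(dplyr)
library(infer)

set.seed(1234)
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# summarise() runs once per replicate, so a sample with no "Doesn't benefit"
# responses still gets a row, with p_hat equal to 0.
sample_props_small_all <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  summarise(p_hat = mean(scientist_work == "Doesn't benefit"))

nrow(sample_props_small_all)             # always 25, one row per replicate
sum(sample_props_small_all$p_hat == 0)   # samples with no "Doesn't benefit" responses
```

Keeping the zero-proportion samples matters for the sampling distribution, because dropping them removes the lowest values and pulls the apparent center upward.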

Exercise 6

Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How does the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)

Each observation in the distribution represents the proportion in one sample of people who responded “Doesn’t Benefit.”

When the sample size increased from 10 to 50, the mean moved to essentially the population proportion, and the standard error decreased. The shape of the distribution also became much closer to a normal distribution. When the sample size increased from 50 to 100, the mean stayed approximately equal to the population proportion, the standard error decreased again, and the shape remained approximately normal.

Increasing the number of simulations barely changed the mean or standard error, but the histograms became smoother and looked closer to a normal curve with a higher number of simulations.
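
The app's behavior can be approximated outside the app with a short base R simulation. This is a sketch under the assumption that each sample is drawn with replacement from a population with 20% “Doesn’t benefit”, in which case the count in each sample is binomial; the helper `simulate_p_hat` is hypothetical and not part of the app.

```r
set.seed(1234)
p <- 0.2                  # population proportion of "Doesn't benefit"
sizes <- c(10, 50, 100)   # sample sizes from the exercise
n_sims <- 5000            # number of simulated samples per size

# Hypothetical helper: simulate `sims` sample proportions for samples of size n
simulate_p_hat <- function(n, sims) rbinom(sims, size = n, prob = p) / n

sims <- lapply(sizes, simulate_p_hat, sims = n_sims)

results <- data.frame(
  n       = sizes,
  mean    = sapply(sims, mean),  # center of each sampling distribution
  std_err = sapply(sims, sd)     # spread of each sampling distribution
)
results
```

Changing `n_sims` alters how smooth the resulting histograms look, while the standard error is driven by the sample size `n`.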

Exercise 7

Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enhances their lives?

The estimate I calculated from this sample was about 0.73.

set.seed(1234)

samp1 <- global_monitor %>%
  sample_n(15)

ggplot(samp1, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           11 0.733
## 2 Doesn't benefit     4 0.267

Exercise 8

Since you have access to the population, simulate the sampling distribution of the proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2000 samples of size 15 from the population and computing 2000 sample proportions. Store these proportions as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be? Finally, calculate and report the population proportion.

This sampling distribution is left-skewed.

I would guess that the true proportion is about 0.80, since the mean of the 2000 simulated sample proportions is 0.7996333. The population proportion is exactly 0.8.

sample_props15 <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

mean(sample_props15$p_hat)
## [1] 0.7996333
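
The exercise also asks for the population proportion itself, which the mean of the simulated proportions only approximates. A minimal sketch of that calculation, rebuilding the population so the block runs on its own and storing the result in an assumed object name `pop_p`:

```r
library(dplyr)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# Population parameter: proportion of all 100,000 responses that are "Benefits"
pop_p <- global_monitor %>%
  summarise(p = mean(scientist_work == "Benefits"))

pop_p
```

Because the population was built with 80,000 “Benefits” responses out of 100,000, this returns 0.8.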

Exercise 9

Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work scientists do enhances their lives?

This sampling distribution is approximately normal. It is much closer to normal than the left-skewed distribution for a sample size of 15.

I would guess the true proportion of those who think scientists enhance lives is about 0.80.

sample_props150 <- global_monitor %>%
                    rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000"
  )
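
One way to see why the shape changes between the two sample sizes is the success-failure rule of thumb: a sampling distribution of a proportion is usually treated as approximately normal when both n * p and n * (1 - p) are at least 10. The sketch below applies that rule to both sample sizes, assuming the population proportion of 0.8; the threshold of 10 is a convention, not a strict cutoff.

```r
p <- 0.8          # population proportion of "Benefits"
n <- c(15, 150)   # sample sizes from Exercises 8 and 9

conditions <- data.frame(
  n                  = n,
  expected_successes = n * p,        # expected "Benefits" responses per sample
  expected_failures  = n * (1 - p)   # expected "Doesn't benefit" responses per sample
)

# Rule of thumb: both expected counts should be at least 10
conditions$roughly_normal <- with(
  conditions,
  expected_successes >= 10 & expected_failures >= 10
)
conditions
```

With n = 15 only about 3 “Doesn’t benefit” responses are expected per sample, so the distribution of “Benefits” proportions is squeezed against its upper limit of 1 and skews left. With n = 150 both expected counts are well above 10.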

Exercise 10

Of the sampling distributions from Exercises 8 and 9, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?

The sampling distribution from Exercise 9 (sample size 150) has a smaller spread. I would prefer a sampling distribution with a small spread, because individual sample proportions then tend to fall closer to the true value and the typical error is smaller. In the examples I have done, both distributions were centered at the population proportion, so the sample size did not change the center, only the variability. A distribution with more spread covers a wider range of values, but any single estimate drawn from it is more likely to be far from the true proportion.
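
The comparison of spreads can be made concrete with a small base R simulation. This sketch assumes sampling with replacement from a population with 80% “Benefits”, so the counts are binomial; the 0.05 tolerance is an arbitrary choice for illustration.

```r
set.seed(1234)
p <- 0.8      # population proportion of "Benefits"
reps <- 2000  # number of simulated samples per sample size

p_hat_15  <- rbinom(reps, size = 15,  prob = p) / 15
p_hat_150 <- rbinom(reps, size = 150, prob = p) / 150

# Spread of each simulated sampling distribution
c(sd_15 = sd(p_hat_15), sd_150 = sd(p_hat_150))

# Share of estimates that land within 0.05 of the true proportion
c(
  within_15  = mean(abs(p_hat_15  - p) <= 0.05),
  within_150 = mean(abs(p_hat_150 - p) <= 0.05)
)
```

The distribution with the smaller spread puts a much larger share of its estimates close to 0.8, which is the sense in which a small spread is preferable for estimation.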

