library(tidyverse)
library(openintro)
library(infer)
set.seed(110)
# Create a population of 100,000 people: 20% think the work scientists do does not benefit them; 80% think it does.
# Data frame: global_monitor. Variable: scientist_work, which contains responses to the question "Do you believe that the work scientists do benefit you?"

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

Visualize the distribution of these responses by using a bar plot.

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

Obtain summary statistics to confirm the data frame was constructed correctly. …

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     p
## * <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

In this lab, you have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

If you are interested in estimating the proportion of people who don’t think the work scientists do benefits them, you can use the sample_n command to survey the population.

This command collects a simple random sample of size 50 from the global_monitor dataset, and assigns the result to samp1. This is similar to randomly drawing names from a hat that contains the names of all in the population. Working with these 50 names is considerably simpler than working with all 100,000 people in the population.

# take a simple random sample of 50 people from the global_monitor population
set.seed(1)
samp1 <- global_monitor %>%
  sample_n(50)
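
In current dplyr releases, sample_n() is documented as superseded by slice_sample(). The following is a minimal sketch of the same draw written with slice_sample(), assuming dplyr 1.0.0 or later. The names samp1_alt and samp1_summary are illustrative and not part of the lab, and the rows drawn need not match samp1 even with the same seed.

```r
library(dplyr)

# Rebuild the population so this sketch runs on its own
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# The seed only matters when it is set before the random draw itself
set.seed(1)
samp1_alt <- global_monitor %>%
  slice_sample(n = 50)   # simple random sample of 50 rows, without replacement

# Label the proportion p_hat because it is a sample statistic
samp1_summary <- samp1_alt %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))

samp1_summary
```

Setting the seed right before the draw is what makes the sample reproducible; a set.seed() call placed before plotting or summarising code has no effect on which rows were sampled.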

Exercise 1

Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. The code you used earlier for visualizing and summarising the population data will still be useful for the sample; however, be careful not to label your proportion p, since you’re now calculating a sample statistic, not a population parameter. You can customize the label of the statistic to indicate that it comes from the sample.

Answer: 1. Visualize the distribution of the sample by using a bar plot (see below). 2. Obtain summary statistics for the sample (see below).

In the population, 80% believe the work scientists do benefits them and 20% believe it does not. In the random sample of 50, 76% believe it benefits them and 24% believe it does not.

I moved the set.seed() call inside each R chunk and re-executed, and the numbers changed. The sample proportions are close to the population proportions but not identical.

#Visualize the distribution of the sample by using a bar plot.
set.seed(1)
ggplot(samp1, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

Obtain summary statistics for the sample.

set.seed(1)
samp1 %>%
  count(scientist_work) %>%
  mutate(s = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     s
## * <chr>           <int> <dbl>
## 1 Benefits           38  0.76
## 2 Doesn't benefit    12  0.24

Exercise 2

Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.

Answer: The first time I drew the sample, 6 respondents answered “Doesn’t benefit” and 44 answered “Benefits.” When I re-ran it, 12 answered “Doesn’t benefit” and 38 answered “Benefits.” The “Doesn’t benefit” proportion rose from 12% to 24%, and the “Benefits” proportion fell from 88% to 76%. Because each student draws a different random sample, I would not expect my sample proportion to match another student’s, and with only 50 respondents the two proportions can differ noticeably.

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?

Answer: With samp2 (n = 50), using a different seed in each R chunk, the sample matched the population (20% “Doesn’t benefit”; 80% “Benefits”).

With samp3 (n = 100), the “Doesn’t benefit” proportion was a little lower, at 18%, with 82% “Benefits.”

With samp4 (n = 1,000), the sample proportions were .791 and .209 for “Benefits” and “Doesn’t benefit”, respectively.

Larger samples tend to give more accurate estimates of the population proportion, so of the two additional samples I would expect the one of size 1,000 to be more accurate than the one of size 100. A single small sample can still land on the true value by chance, as samp2 did.

#second sample, also of size 50, and call it samp2
set.seed(11)
samp2 <- global_monitor %>%
  sample_n(50)
#Visualize the distribution of the sample by using a bar plot.
set.seed(1)
ggplot(samp2, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample 2:  Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

set.seed(4)
samp2 %>%
  count(scientist_work) %>%
  mutate(s = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     s
## * <chr>           <int> <dbl>
## 1 Benefits           40   0.8
## 2 Doesn't benefit    10   0.2
#third sample, 100, and call it samp3
set.seed(222)
samp3 <- global_monitor %>%
  sample_n(100)
#Visualize the distribution of the sample by using a bar plot.
set.seed(333)
ggplot(samp3, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample 3:  Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

set.seed(444)
samp3 %>%
  count(scientist_work) %>%
  mutate(s = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     s
## * <chr>           <int> <dbl>
## 1 Benefits           82  0.82
## 2 Doesn't benefit    18  0.18
#fourth sample, 1000, and call it samp4
set.seed(222)
samp4 <- global_monitor %>%
  sample_n(1000)
#Visualize the distribution of the sample by using a bar plot.
set.seed(444)
ggplot(samp4, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Sample 4:  Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

set.seed(555)
samp4 %>%
  count(scientist_work) %>%
  mutate(s = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     s
## * <chr>           <int> <dbl>
## 1 Benefits          791 0.791
## 2 Doesn't benefit   209 0.209
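
The comparison across sample sizes can also be made without drawing any samples. The sketch below applies the usual standard error formula for a sample proportion, sqrt(p(1 - p)/n), to the three sample sizes used in Exercise 3. It assumes the known population proportion p = 0.2 and independent draws; the object name se_table is illustrative.

```r
# Population proportion of "Doesn't benefit" responses (known in this lab)
p <- 0.2

# Sample sizes used for samp1/samp2, samp3, and samp4
sizes <- c(50, 100, 1000)

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
se_table <- data.frame(
  n  = sizes,
  se = sqrt(p * (1 - p) / sizes)
)

se_table
```

The standard error shrinks in proportion to 1/sqrt(n), so multiplying the sample size by 10 (from 100 to 1,000) cuts the typical sampling error by a factor of about 3.2.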

Every time you take another random sample, you might get a different sample proportion. It’s useful to get a sense of just how much variability you should expect when estimating the population proportion this way. The distribution of sample proportions, called the sampling distribution (of the proportion), can help you understand this variability.

# Take 15,000 different samples of size 50 from the population.
# rep_sample_n() repeats the sampling: rather than taking a single sample of size n (50) from the population, it repeats the procedure reps times to build a distribution of sample statistics, which is called the sampling distribution.
# Calculate the proportion of each response in each sample.
# Filter for only the "Doesn't benefit" responses.
# Store the results in a data frame called sample_props50.
# replace = TRUE draws each sample with replacement; with a population of 100,000 and samples of 50, this is nearly equivalent to sampling without replacement.

sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

Visualize the distribution of these proportions with a histogram.

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )
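
As a cross-check on the center and spread of this histogram, the same sampling distribution can be approximated in base R. This is a sketch of an alternative technique, not the lab's method: it assumes that, because each sample is drawn with replacement from a population with p = 0.2, the count of “Doesn’t benefit” responses in a sample of 50 follows a binomial distribution. The names p_hat_sim and check are illustrative.

```r
set.seed(110)

p    <- 0.2     # population proportion of "Doesn't benefit"
n    <- 50      # sample size
reps <- 15000   # number of simulated samples

# Each binomial draw is the count of "Doesn't benefit" in one sample of size n;
# dividing by n turns the count into a sample proportion
p_hat_sim <- rbinom(reps, size = n, prob = p) / n

check <- c(
  mean_p_hat = mean(p_hat_sim),           # center of the sampling distribution
  sd_p_hat   = sd(p_hat_sim),             # simulated standard error
  theory_se  = sqrt(p * (1 - p) / n)      # standard error from the formula
)

check
```

If the simulated mean sits near 0.2 and the simulated standard deviation sits near the formula value, that supports reading the histogram as centered at the population proportion.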

Exercise 4

How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.

Answer: There are 15,000 elements in sample_props50, one sample proportion for each of the 15,000 samples. Each element is the proportion of “Doesn’t benefit” responses in one sample of 50, not a count of people. The sampling distribution is centered at about 0.20, the same as the population proportion, and most of the sample proportions are concentrated close to that value.

set.seed(555)
ggplot(sample_props50, aes(x = p_hat)) +
  geom_bar() +
  labs(
    x = "p_hat   Doesn't Benefit", y = "Count",
    title = "Sample _props50:  Sampling Distribution of p_hat"
  ) +
  coord_flip()

Exercise 5

To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?

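Answer: The code below draws 25 samples, each with 10 observations, and stores the result in sample_propssmall.
Each observation is the proportion of “Doesn’t benefit” responses in one sample of size 10. Because count() produces no row for a response that does not occur, a sample with no “Doesn’t benefit” responses is dropped by the filter, so the printed object can have fewer than 25 rows.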

# Take 25 different samples of size 10 from the population.
# rep_sample_n() repeats the sampling: rather than taking a single sample of size n (10) from the population, it repeats the procedure reps times to build a distribution of sample statistics, which is called the sampling distribution.
# Calculate the proportion of each response in each sample.
# Filter for only the "Doesn't benefit" responses.
# Store the results in a data frame called sample_propssmall.
# replace = TRUE draws each sample with replacement; with a population of 100,000 and samples of 10, this is nearly equivalent to sampling without replacement.


sample_propssmall <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

Visualize the distribution of these proportions with a bar plot.

set.seed(555)
ggplot(sample_propssmall, aes(x = p_hat)) +
  geom_bar() +
  labs(
    x = "p_hat   Doesn't Benefit", y = "Count",
    title = "Sample _propssmall:  Sampling Distribution of p_hat",
    subtitle = "Sample size = 10, Number of samples = 25"
  ) +
  coord_flip()
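
With samples as small as 10, some samples contain no “Doesn’t benefit” responses at all, and a count-then-filter pipeline returns no row for them. The sketch below shows one way to keep every sample, assuming the infer and dplyr packages: it summarises each replicate directly, so a sample with zero “Doesn’t benefit” responses is recorded as p_hat = 0. This is an alternative to the pipeline above, not the lab's own code, and it uses the object name sample_props_small from the exercise prompt.

```r
library(dplyr)
library(infer)

# Rebuild the population so this sketch runs on its own
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

set.seed(110)
sample_props_small <- global_monitor %>%
  # rep_sample_n() returns a tibble grouped by the replicate column
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  # the mean of a logical vector is the proportion of TRUE values,
  # so this gives one p_hat per replicate, including p_hat = 0
  summarise(p_hat = mean(scientist_work == "Doesn't benefit"))

nrow(sample_props_small)
sample_props_small
```

Keeping the zero-proportion samples matters for the center of the distribution: dropping them removes the smallest values and pulls the average of the remaining p_hat values upward.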

The sampling distribution computed below tells a lot about estimating the true proportion of people who think that the work scientists do doesn’t benefit them. Because the sample proportion is an unbiased estimator, the sampling distribution is centered at the true population proportion, and the spread of the distribution indicates how much variability is incurred by sampling only 50 people at a time from the population.

#rep_sample_n function: to compute a sampling distribution, specifically, the sampling distribution of the proportions from samples of 50 people.

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

Exercise 6

Use the app to get a sense of the effect that sample size has on sampling distribution.

Use the app to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How does the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)

Answer:
n = 10; reps = 5,000; mean = .22; SE = .11
n = 50; reps = 5,000; mean = .2; SE = .06 - by increasing the sample size, there was a change in the mean and notable change in SE
n = 100; reps = 5,000; mean = .2; SE = .04 - by increasing the sample size, there was a change in the mean and SE
n = 100; reps = 10,000; mean = .2; SE = .04 - by increasing the reps, there was no change in the mean and SE
n = 100; reps = 50,000; mean = .2; SE = .04 - by increasing the reps, there was no change in the mean and SE

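As the sample size increases, the sampling distribution becomes more nearly normal and its standard error shrinks (.11, .06, and .04 for n = 10, 50, and 100), so the sample proportions cluster more tightly around the center near .2. Increasing the number of simulations did not change the mean or the SE.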
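
The pattern in these app results can be reproduced with a short simulation. The sketch below is an assumption-laden stand-in for the app, not its actual code: it treats the count of “Doesn’t benefit” responses in each sample as binomial with p = 0.2, and the object name grid and its column names are illustrative.

```r
set.seed(110)
p <- 0.2   # population proportion of "Doesn't benefit"

# One row per combination of sample size and number of simulations
grid <- expand.grid(n = c(10, 50, 100), reps = c(5000, 50000))
grid$mean_p_hat <- NA_real_
grid$se_p_hat   <- NA_real_

for (i in seq_len(nrow(grid))) {
  # simulate reps sample proportions from samples of size n
  p_hat <- rbinom(grid$reps[i], size = grid$n[i], prob = p) / grid$n[i]
  grid$mean_p_hat[i] <- mean(p_hat)   # center of the sampling distribution
  grid$se_p_hat[i]   <- sd(p_hat)     # simulated standard error
}

# Formula value for comparison: depends on n only, not on reps
grid$theory_se <- sqrt(p * (1 - p) / grid$n)

grid
```

Reading the table by rows with the same n shows how little the number of simulations matters; reading it by rows with the same reps shows the standard error falling as n grows.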

Exercise 7

Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enhances their lives?

Answer:
If the sampling distribution is centered at the true population proportion, it looks like around 80% of the population believes the work scientists do enhances their lives. My best point estimate is approximately 0.8.

FYI, I attempted using reps = 1, but that didn’t work. Instead, I used reps = 25.

# Take 25 different samples of size 15 from the population.
# rep_sample_n() repeats the sampling: rather than taking a single sample of size n (15) from the population, it repeats the procedure reps times to build a distribution of sample statistics, which is called the sampling distribution.
# Calculate the proportion of each response in each sample.
# Filter for only the "Benefits" responses.
# Store the results in a data frame called sample_props15a.
# replace = TRUE draws each sample with replacement; with a population of 100,000 and samples of 15, this is nearly equivalent to sampling without replacement.


sample_props15a <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

Visualize with a bar plot

set.seed(575)
ggplot(sample_props15a, aes(x = p_hat)) +
  geom_bar() +
  labs(
    x = "p_hat   Benefits", y = "Count",
    title = "Sample _props15a:  Sampling Distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 25"
  ) +
  coord_flip()
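
Exercise 7 asks for a point estimate from a single sample of size 15. A minimal sketch of that single-sample calculation follows, assuming dplyr 1.0.0 or later for slice_sample(); the names samp15 and p_hat_benefits are illustrative and do not appear in the lab.

```r
library(dplyr)

# Rebuild the population so this sketch runs on its own
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

set.seed(575)
samp15 <- global_monitor %>%
  slice_sample(n = 15)   # one simple random sample of 15 people

# The sample proportion of "Benefits" is the point estimate
# of the population proportion
p_hat_benefits <- mean(samp15$scientist_work == "Benefits")

p_hat_benefits
```

With only 15 respondents the estimate can take only the values 0/15, 1/15, ..., 15/15, so it will often miss 0.8 by several percentage points even though it is unbiased.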

Exercise 8

Since you have access to the population, simulate the sampling distribution of the proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2000 samples of size 15 from the population and computing 2000 sample proportions. Store these proportions as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be? Finally, calculate and report the population proportion.

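Answer: I would expect the sampling distribution to be centered at the true population proportion (the mean of the sample proportions should be close to the population proportion), so based on the 2,000 samples my guess is that about 80% of the population believes the work scientists do enhances their lives. The population proportion, computed at the start of the lab, is 0.8.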

# Take 2000 different samples of size 15 from the population.
# rep_sample_n() repeats the sampling: rather than taking a single sample of size n (15) from the population, it repeats the procedure reps times to build a distribution of sample statistics, which is called the sampling distribution.
# Calculate the proportion of each response in each sample.
# Filter for only the "Benefits" responses.
# Store the results in a data frame called sampl_props15.
# replace = TRUE draws each sample with replacement; with a population of 100,000 and samples of 15, this is nearly equivalent to sampling without replacement.


sampl_props15 <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(s_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

Visualize with a bar plot

set.seed(575)
ggplot(sampl_props15, aes(x = s_hat)) +
  geom_bar() +
  labs(
    x = "s_hat   Benefits", y = "Count",
    title = "Sampl _props15:  Sampling Distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  ) +
  coord_flip()

Exercise 9

Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work scientists do enhances their lives?

Answer:

The sampling distribution for samples of size 150 is more concentrated around its center than the one for samples of size 15. With the larger sample size, a sample proportion is a better estimate of the true proportion, which I would guess to be about 0.8.

# Take 2000 different samples of size 150 from the population.
# rep_sample_n() repeats the sampling: rather than taking a single sample of size n (150) from the population, it repeats the procedure reps times to build a distribution of sample statistics, which is called the sampling distribution.
# Calculate the proportion of each response in each sample.
# Filter for only the "Benefits" responses.
# Store the results in a data frame called sampl_props150.
# replace = TRUE draws each sample with replacement; with a population of 100,000 and samples of 150, this is nearly equivalent to sampling without replacement.


sampl_props150 <- global_monitor %>%
                    rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(s_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

Visualize with a bar plot

set.seed(575)
ggplot(sampl_props150, aes(x = s_hat)) +
  geom_bar() +
  labs(
    x = "s_hat   Benefits", y = "Count",
    title = "Sampl _props150:  Sampling Distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000"
  ) +
  coord_flip()

Exercise 10

Of the sampling distributions from Exercises 8 and 9, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?

Answer: The sampling distribution built from the larger samples (size 150, Exercise 9) has a smaller spread than the one built from samples of size 15 (Exercise 8). A sampling distribution with a small spread is preferable, because the estimates then fall close to the true value more often.
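
How much more often estimates land close to the true value can be computed exactly rather than read off a plot. The sketch below assumes that the number of “Benefits” responses in a sample of size n is binomial with p = 0.8 (sampling with replacement) and adds up the probabilities of every count whose sample proportion falls within 0.05 of 0.8. The helper within_tol() and the 0.05 tolerance are hypothetical choices made for illustration.

```r
p <- 0.8   # population proportion of "Benefits"

# Probability that a sample proportion from a sample of size n
# lands within tol of the true proportion p
within_tol <- function(n, tol = 0.05) {
  k <- 0:n                          # every possible count of "Benefits"
  close <- abs(k / n - p) <= tol    # counts whose proportion is near p
  sum(dbinom(k[close], size = n, prob = p))
}

prob_close <- c(n15 = within_tol(15), n150 = within_tol(150))
prob_close
```

For n = 15 only one count (12 out of 15) gives a proportion within 0.05 of 0.8, whereas for n = 150 a whole range of counts qualifies, which is why the larger sample size produces close estimates far more often.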

