Lab 5-Sampling Distributions

Foundations for statistical inference- Sampling Distributions

The first step is to install infer, but R gave me trouble when knitting, so I did not include the install command here.

library(tidyverse)

## -- Attaching packages --------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

library(infer)

The Data

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

The Unknown Sampling Distribution

set.seed(1234)
samp1 <- global_monitor %>%
  sample_n(50)

Exercise 1

Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. Code you presented earlier for visualizing and summarising the population data will still be useful for the sample; however, be careful not to label your proportion p, since you’re now calculating a sample statistic, not a population parameter. You can customize the label of the statistic to indicate that it comes from the sample.

samp1 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n p.hat
##   <chr>           <int> <dbl>
## 1 Benefits           37  0.74
## 2 Doesn't benefit    13  0.26

In the sample, 26% of people believe that the work of scientists is not beneficial, compared with 20% in the population. The sample proportion is 0.06 higher than the population proportion, but it is still close.

Exercise 2

Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.

I would not expect to get the same sample proportion as someone else, because each random sample picks different people from the population. Our proportions would be somewhat different rather than very different: close to each other, but not exactly the same, since the process of sampling can give a different result every time you do it.

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?

set.seed(4444)
samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n p.hat
##   <chr>           <int> <dbl>
## 1 Benefits           40   0.8
## 2 Doesn't benefit    10   0.2

The data is different in samp2, with fewer people in the "Doesn't benefit" column than in samp1 (10 versus 13), so p.hat is 0.20 instead of 0.26. The difference is only 3 people, but since the sample size is so small, this has a large impact on p.hat. (Note that I actually got the same data on the first try and ran the sample again, so there are instances where two samples will end up being the same.)

The sample of 1000 would give a better estimate of the population proportion, since the larger you make the sample, the closer the statistic tends to get to the true population parameter. A larger sample size means less variability.
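The link between sample size and variability can be checked against the usual standard error formula for a sample proportion, sqrt(p(1 - p)/n). The following is a minimal sketch and not part of the lab's code: it assumes the known population proportion p = 0.2 for "Doesn't benefit", and the names p, n, and se are chosen here for illustration.

```r
# Standard error of a sample proportion: sqrt(p * (1 - p) / n).
# p is the population proportion of "Doesn't benefit" (0.2 in this lab);
# n holds the three sample sizes mentioned in Exercise 3.
p <- 0.2
n <- c(50, 100, 1000)
se <- sqrt(p * (1 - p) / n)

# One row per sample size; se shrinks as n grows.
data.frame(n = n, se = round(se, 4))
```

Going from n = 50 to n = 1000 multiplies n by 20, so the standard error shrinks by a factor of sqrt(20), roughly 4.5.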

sample_props50 <- global_monitor %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

Exercise 4

How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.

There were 15000 samples taken, so there are 15000 data points used to make this plot, one sample proportion per sample. The plot looks mostly normally distributed, with a slight skew to the right. The center seems to be around 0.2, which is consistent with the population parameter.

Interlude: Sampling Distributions

global_monitor %>%
  sample_n(size = 50, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

## # A tibble: 1 x 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Doesn't benefit    10   0.2

Exercise 5

To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?

sample_props_small <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

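There is one observation per sample, so there are up to 25 observations in this object. A sample with no "Doesn't benefit" responses leaves no row after count() and filter(), so the object can have fewer than 25 rows. Each observation is a sample proportion from one sample of size 10, and together they make up the sampling distribution.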
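As a cross-check on how the sampling distribution is built, here is a base R sketch of the same idea. It is an alternative to the lab's rep_sample_n pipeline, not a replacement for it: it computes each proportion directly, so a sample with no "Doesn't benefit" responses contributes a 0 instead of disappearing. The names population and p_hats are hypothetical, and the seed is arbitrary.

```r
# Rebuild the population vector used in the lab.
population <- c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))

set.seed(2020)  # arbitrary seed, only for reproducibility

# Draw 25 samples of size 10 (with replacement) and record, for each,
# the proportion of "Doesn't benefit" responses. A sample with none
# contributes 0 instead of being dropped.
p_hats <- replicate(
  25,
  mean(sample(population, size = 10, replace = TRUE) == "Doesn't benefit")
)

length(p_hats)   # one proportion per sample
table(p_hats)    # how often each proportion occurred
```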

Sample size and the sampling distribution

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

Exercise 6

Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How does the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)

Each observation is the sample proportion that came out of one simulated sample.

n = 10: mean = 0.22, SE = 0.11

n = 50: mean = 0.2, SE = 0.06

n = 100: mean = 0.2, SE = 0.04

The mean moves closer to 0.2, the true proportion, as you increase the sample size. The SE decreases, which means that as you increase the sample size there is less variability. The distribution begins to appear normal when you make the sample larger; at n = 10 there is a right skew.
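The pattern described above can be reproduced without the app. The sketch below is an assumption-laden stand-in, not the app's code: it uses rbinom() to simulate the count of "Doesn't benefit" responses in each sample, which is equivalent to sampling with replacement from a population where that proportion is 0.2. The helper summarise_size and the seed are hypothetical choices. Because this version keeps samples with a count of zero, its numbers need not match the app's output exactly.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

p    <- 0.2    # population proportion of "Doesn't benefit"
sims <- 5000   # number of simulated samples per sample size

# For one sample size n, simulate `sims` sample proportions and return
# their mean and standard deviation (the simulated standard error).
summarise_size <- function(n) {
  p_hat <- rbinom(sims, size = n, prob = p) / n
  c(n = n, mean = mean(p_hat), se = sd(p_hat))
}

# One row per sample size, with columns n, mean, and se.
results <- t(sapply(c(10, 50, 100), summarise_size))
round(results, 3)
```

Plotting hist() of the simulated proportions for each n would show the right skew at n = 10 fading as n grows.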

More Practice

Exercise 7

Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enhances their lives?

set.seed(5555)
samp3 <- global_monitor %>%
  sample_n(15)

samp3 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n p.hat
##   <chr>           <int> <dbl>
## 1 Benefits           12   0.8
## 2 Doesn't benefit     3   0.2

Based on this sample, the best point estimate is that 80% of people believe that the work of scientists is beneficial.

Exercise 8

Since you have access to the population, simulate the sampling distribution of the proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2000 samples of size 15 from the population and computing 2000 sample proportions. Store these proportions as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be? Finally, calculate and report the population proportion.

sample_props15 <- global_monitor %>%
  rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Benefits")

ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000")

mean(sample_props15$p_hat)

## [1] 0.7984667

Based on the sampling distribution, my guess for the true proportion is 0.798, the mean of the 2000 sample proportions.

The true population proportion is 0.8, so this estimate is close.

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

Exercise 9

sample_props150 <- global_monitor %>%
  rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Benefits")

ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000")

The shape is more normal when n=150.

mean(sample_props150$p_hat)

## [1] 0.7994033

Based on this, my estimate of the true proportion is about 0.8, since the mean of the sampling distribution is 0.799, which is practically the true proportion.

Exercise 10

The spread of the n = 150 sampling distribution is smaller than that of the sampling distribution where n = 15. I would rather estimate the proportion from the distribution with the smaller spread because there is less variability in the sample proportions. This means that a sample proportion drawn from it is more likely to be close to the true proportion.
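To put a number on "closer to the true proportion", one option is to compute the chance that a sample proportion lands within 0.05 of the population value 0.8. The sketch below is an illustration under stated assumptions, not part of the lab: it treats the count of "Benefits" responses as Binomial(n, 0.8), which matches sampling with replacement, and the 0.05 margin and the names lower, upper, and within_05 are chosen here.

```r
p <- 0.8          # population proportion of "Benefits"
n <- c(15, 150)   # the two sample sizes compared in Exercise 10

# P(0.75 <= p_hat <= 0.85), where p_hat = X / n and X ~ Binomial(n, p).
lower <- ceiling(n * 0.75)   # smallest count with p_hat >= 0.75
upper <- floor(n * 0.85)     # largest count with p_hat <= 0.85
within_05 <- pbinom(upper, n, p) - pbinom(lower - 1, n, p)

data.frame(n = n, prob_within_0.05 = round(within_05, 3))
```

For n = 15 the only count in range is 12 out of 15, so the probability is about 0.25; for n = 150 many more counts fall in range and the probability is much higher.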