Exercise 1

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)

set.seed(1231232)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)


global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2
samp1 <- global_monitor %>%
  sample_n(50)

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           38  0.76
## 2 Doesn't benefit    12  0.24

The distribution in the sample correlates to the population distribution in that I was expecting close to 80% and 20%.

Exercise 2

I would assume it would be very similar with a slight deviation. It might be even considered strange if we got the same exact distribution. Comparing to another students, they had 80% benefits and 20% does not.

Exercise 3

set.seed(32322)
samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           42  0.84
## 2 Doesn't benefit     8  0.16

The sample proportion of samp2 is similar but deviated slightly. Again, I was expecting an 80/20 split but got close. If we took samples of size 100 or 1000, I would expect the larger number to describe the population better.

Exercise 4

sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")


ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

There are 15000 elements in sample_prop50. The distribution looks normal and it’s center is near or exactly .20.

Exercise 5

sample_props_small <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")


sample_props_small1 <- data.frame(sample_props_small)

sample_props_small1
##    replicate  scientist_work n p_hat
## 1          1 Doesn't benefit 2   0.2
## 2          2 Doesn't benefit 2   0.2
## 3          3 Doesn't benefit 3   0.3
## 4          5 Doesn't benefit 2   0.2
## 5          7 Doesn't benefit 1   0.1
## 6          8 Doesn't benefit 3   0.3
## 7          9 Doesn't benefit 2   0.2
## 8         10 Doesn't benefit 2   0.2
## 9         11 Doesn't benefit 4   0.4
## 10        12 Doesn't benefit 6   0.6
## 11        13 Doesn't benefit 1   0.1
## 12        14 Doesn't benefit 3   0.3
## 13        15 Doesn't benefit 2   0.2
## 14        16 Doesn't benefit 1   0.1
## 15        17 Doesn't benefit 3   0.3
## 16        18 Doesn't benefit 3   0.3
## 17        19 Doesn't benefit 4   0.4
## 18        20 Doesn't benefit 1   0.1
## 19        21 Doesn't benefit 3   0.3
## 20        22 Doesn't benefit 3   0.3
## 21        23 Doesn't benefit 2   0.2
## 22        24 Doesn't benefit 2   0.2
## 23        25 Doesn't benefit 1   0.1

There are 25 observations in this object. Each observation represents that percent of that sample that said they don’t benefit.

Exercise 6

Each observation represents taking sample sizes of 10, 50 etc many times. In this case the number of samples I used was 1000. As the sample size increases, the mean gets closer to .2, the standard error gets smaller, and the shape of the tends to get narrower.

Exercise 7

samp15 <- global_monitor %>%
  sample_n(15)

samp15 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           11 0.733
## 2 Doesn't benefit     4 0.267

Using this sample, I would expect 75% of the population thinks the work scientists do enhances their lives.

Exercise 8

sample_props15 <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")


ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.05) +
  labs(
    x = "p_hat (benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

mean(sample_props15$p_hat)
## [1] 0.8007667

Based on the sample, I would expect around 80% as the proportion of people who think that scientists benefit. The proportion is at 79%

Exercise 9

sample_props150 <- global_monitor %>% 
  rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")

ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (benefit)",
    title = "Sampling distribution of population proportion")

mean(sample_props150$p_hat)
## [1] 0.7997433

Based on the sample I would guess the same proportion of 80%.

Exercise 10

The second sampling distribution is narrower. It’s better to have a narrower distribution as the chances that the true population is in a smaller interval gives us less values to choose from.