library(tidyverse)
library(openintro)
library(infer)
## Warning: package 'infer' was built under R version 4.3.3
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2
samp1 <- global_monitor %>%
  sample_n(50)

Exercise 1

The data for both the sample and the population shows that 14% of people believe that the work of scientists is not beneficial. The sample data is 0.06 off from the population data. However, they are still close.

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Benefits           40   0.8
## 2 Doesn't benefit    10   0.2

Exercise 2

I would not expect the same sample proportion as another student’s. This is because sampling will most likely come out different each time. The proportions could be close; however, they will not be exactly the same due to the sampling process yielding different results every time it is done.

Exercise 3

In samp2, the data is different. More people are in the “does not benefit” column. Only 3 more people are in this column (10), compared to the previous sample (7). Although, since the sample size is so small, this has an impact on the p_hat.

The sample of 1000 would be a more appropriate estimate of the population proportion since the larger the sample, the closer the statistic will get to the actual population parameter. Note: larger sample size means less variability.

sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

set.seed(4444)
samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p.hat
##   <chr>           <int> <dbl>
## 1 Benefits           40   0.8
## 2 Doesn't benefit    10   0.2

Exercise 4

There were 15,000 samples taken, which means, there are 15,000 data points used to make this plot. The plot looks mostly normally distributed, but with a slight skew to the right. The center appears to be around 0.2, which is consistent with the population parameter.

global_monitor %>%
  sample_n(size = 50, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
## # A tibble: 1 × 3
##   scientist_work      n p_hat
##   <chr>           <int> <dbl>
## 1 Doesn't benefit     9  0.18

Exercise 5

There are 25 observations in this object. Each observation is a sample proportion that makes up the sampling distribution.

sample_props_small <- global_monitor %>% rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

Exercise 6

Each observation is the proportion that came out of the simulation:

n = 10; mean = 0.22; se = 0.11

n = 50; mean = 0.2; se = 0.06

n = 100; mean = 0.2; se = 0.04

The mean of these gets closer to 0.2; the true mean as we increase the number of samples. The se decreases, which means that as you increase the sample size, there is less variability. The distribution begins to appear normal when you make the sample larger, since at n = 10, there is a right skew.

Exercise 7 - More Practice

Based on this sampling, the best estimate is that 80% believe that the work of scientists is beneficial.

set.seed(5555)
samp3 <- global_monitor %>%
  sample_n(15)

samp3 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n p.hat
##   <chr>           <int> <dbl>
## 1 Benefits           12   0.8
## 2 Doesn't benefit     3   0.2

Exercise 8 - More Practice

Based on the sampling here, the estimate is 0.798. The true population proportion is 0.8. So, this estimate is very close.

sample_props15 <- global_monitor %>% rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000")

mean(sample_props15$p_hat)
## [1] 0.7984667
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 × 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2

Exercise 9 - More Practice

When n = 150, the shape is more normal.

Based on this, the proportion is very nearly 0.8, since the mean of the sampling distribution is 0.799. Which is 0.001 off from the true proportion (0.8). Closer than what we found from Exercise 8 above.

sample_props150 <- global_monitor %>% rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000")

mean(sample_props150$p_hat)
## [1] 0.7994033

Exercise 10 - More Practice

The spread of the n = 150 sampling distribution is smaller than the sampling distribution where n = 15. Based on this, it would be my preference to predict the proportion from this distribution with a smaller spread because there is less variability in the data. This would mean that the value from sampling is closer to the true proportion.

---
title: "Lab 5: Sampling Distributions"
author: "Colin S."
date: "April 21, 2024"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(infer)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

samp1 <- global_monitor %>%
  sample_n(50)

```

### Exercise 1

The data for both the sample and the population shows that 14% of people believe that the work of scientists is not beneficial. The sample data is 0.06 off from the population data. However, they are still close.

```{r view-girls-counts}

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

```


### Exercise 2

I would not expect the same sample proportion as another student's. This is because sampling will most likely come out different each time. The proportions could be close; however, they will not be exactly the same due to the sampling process yielding different results every time it is done.

```{r trend-girls}

```


### Exercise 3

In samp2, the data is different. More people are in the "does not benefit" column. Only 3 more people are in this column (10), compared to the previous sample (7). Although, since the sample size is so small, this has an impact on the p_hat.

The sample of 1000 would be a more appropriate estimate of the population proportion since the larger the sample, the closer the statistic will get to the actual population parameter. Note: larger sample size means less variability.

```{r plot-prop-boys-arbuthnot}
sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

set.seed(4444)
samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))

```


### Exercise 4

There were 15,000 samples taken, which means, there are 15,000 data points used to make this plot. The plot looks mostly normally distributed, but with a slight skew to the right. The center appears to be around 0.2, which is consistent with the population parameter.

```{r dim-present}
global_monitor %>%
  sample_n(size = 50, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

```


### Exercise 5

There are 25 observations in this object. Each observation is a sample proportion that makes up the sampling distribution.

```{r count-compare}
sample_props_small <- global_monitor %>% rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

```


### Exercise 6

Each observation is the proportion that came out of the simulation:

n = 10; mean = 0.22; se = 0.11

n = 50; mean = 0.2; se = 0.06

n = 100; mean = 0.2; se = 0.04

The mean of these gets closer to 0.2; the true mean as we increase the number of samples. The se decreases, which means that as you increase the sample size, there is less variability. The distribution begins to appear normal when you make the sample larger, since at n = 10, there is a right skew.

```{r plot-prop-boys-present}

```


### Exercise 7 - More Practice

Based on this sampling, the best estimate is that 80% believe that the work of scientists is beneficial.

```{r find-max-total}
set.seed(5555)
samp3 <- global_monitor %>%
  sample_n(15)

samp3 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))

```


### Exercise 8 - More Practice

Based on the sampling here, the estimate is 0.798. The true population proportion is 0.8. So, this estimate is very close.

```{r}
sample_props15 <- global_monitor %>% rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000")

mean(sample_props15$p_hat)

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

```


### Exercise 9 - More Practice

When n = 150, the shape is more normal.

Based on this, the proportion is very nearly 0.8, since the mean of the sampling distribution is 0.799. Which is 0.001 off from the true proportion (0.8). Closer than what we found from Exercise 8 above.

```{r}
sample_props150 <- global_monitor %>% rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000")

mean(sample_props150$p_hat)

```


### Exercise 10 - More Practice

The spread of the n = 150 sampling distribution is smaller than the sampling distribution where n = 15. Based on this, it would be my preference to predict the proportion from this distribution with a smaller spread because there is less variability in the data. This would mean that the value from sampling is closer to the true proportion.

```{r}

```