library(tidyverse)
library(openintro)
library(infer)
## Warning: package 'infer' was built under R version 4.3.3
global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
ggplot(global_monitor, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "Do you believe that the work scientists do benefit people like you?"
) +
coord_flip()

global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
samp1 <- global_monitor %>%
sample_n(50)
Exercise 1
The data for both the sample and the population shows that 14% of
people believe that the work of scientists is not beneficial. The sample
data is 0.06 off from the population data. However, they are still
close.
samp1 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 40 0.8
## 2 Doesn't benefit 10 0.2
Exercise 2
I would not expect the same sample proportion as another student’s.
This is because sampling will most likely come out different each time.
The proportions could be close; however, they will not be exactly the
same due to the sampling process yielding different results every time
it is done.
Exercise 3
In samp2, the data is different. More people are in the “does not
benefit” column. Only 3 more people are in this column (10), compared to
the previous sample (7). Although, since the sample size is so small,
this has an impact on the p_hat.
The sample of 1000 would be a more appropriate estimate of the
population proportion since the larger the sample, the closer the
statistic will get to the actual population parameter. Note: larger
sample size means less variability.
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)

set.seed(4444)
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(p.hat = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p.hat
## <chr> <int> <dbl>
## 1 Benefits 40 0.8
## 2 Doesn't benefit 10 0.2
Exercise 4
There were 15,000 samples taken, which means, there are 15,000 data
points used to make this plot. The plot looks mostly normally
distributed, but with a slight skew to the right. The center appears to
be around 0.2, which is consistent with the population parameter.
global_monitor %>%
sample_n(size = 50, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
## # A tibble: 1 × 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Doesn't benefit 9 0.18
Exercise 5
There are 25 observations in this object. Each observation is a
sample proportion that makes up the sampling distribution.
sample_props_small <- global_monitor %>% rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02)

Exercise 6
Each observation is the proportion that came out of the
simulation:
n = 10; mean = 0.22; se = 0.11
n = 50; mean = 0.2; se = 0.06
n = 100; mean = 0.2; se = 0.04
The mean of these gets closer to 0.2; the true mean as we increase
the number of samples. The se decreases, which means that as you
increase the sample size, there is less variability. The distribution
begins to appear normal when you make the sample larger, since at n =
10, there is a right skew.
Exercise 7 - More Practice
Based on this sampling, the best estimate is that 80% believe that
the work of scientists is beneficial.
set.seed(5555)
samp3 <- global_monitor %>%
sample_n(15)
samp3 %>%
count(scientist_work) %>%
mutate(p.hat = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p.hat
## <chr> <int> <dbl>
## 1 Benefits 12 0.8
## 2 Doesn't benefit 3 0.2
Exercise 8 - More Practice
Based on the sampling here, the estimate is 0.798. The true
population proportion is 0.8. So, this estimate is very close.
sample_props15 <- global_monitor %>% rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
ggplot(data = sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 15, Number of samples = 2000")

mean(sample_props15$p_hat)
## [1] 0.7984667
global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
## # A tibble: 2 × 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
Exercise 9 - More Practice
When n = 150, the shape is more normal.
Based on this, the proportion is very nearly 0.8, since the mean of
the sampling distribution is 0.799. Which is 0.001 off from the true
proportion (0.8). Closer than what we found from Exercise 8 above.
sample_props150 <- global_monitor %>% rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
ggplot(data = sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 150, Number of samples = 2000")

mean(sample_props150$p_hat)
## [1] 0.7994033
Exercise 10 - More Practice
The spread of the n = 150 sampling distribution is smaller than the
sampling distribution where n = 15. Based on this, it would be my
preference to predict the proportion from this distribution with a
smaller spread because there is less variability in the data. This would
mean that the value from sampling is closer to the true proportion.
---
title: "Lab 5: Sampling Distributions"
author: "Colin S."
date: "April 21, 2024"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(infer)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

samp1 <- global_monitor %>%
  sample_n(50)

```

### Exercise 1

The data for both the sample and the population shows that 14% of people believe that the work of scientists is not beneficial. The sample data is 0.06 off from the population data. However, they are still close.

```{r view-girls-counts}

samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))

```


### Exercise 2

I would not expect the same sample proportion as another student's. This is because sampling will most likely come out different each time. The proportions could be close; however, they will not be exactly the same due to the sampling process yielding different results every time it is done.

```{r trend-girls}

```


### Exercise 3

In samp2, the data is different. More people are in the "does not benefit" column. Only 3 more people are in this column (10), compared to the previous sample (7). Although, since the sample size is so small, this has an impact on the p_hat.

The sample of 1000 would be a more appropriate estimate of the population proportion since the larger the sample, the closer the statistic will get to the actual population parameter. Note: larger sample size means less variability.

```{r plot-prop-boys-arbuthnot}
sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

set.seed(4444)
samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))

```


### Exercise 4

There were 15,000 samples taken, which means, there are 15,000 data points used to make this plot. The plot looks mostly normally distributed, but with a slight skew to the right. The center appears to be around 0.2, which is consistent with the population parameter.

```{r dim-present}
global_monitor %>%
  sample_n(size = 50, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

```


### Exercise 5

There are 25 observations in this object. Each observation is a sample proportion that makes up the sampling distribution.

```{r count-compare}
sample_props_small <- global_monitor %>% rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02)

```


### Exercise 6

Each observation is the proportion that came out of the simulation:

n = 10; mean = 0.22; se = 0.11

n = 50; mean = 0.2; se = 0.06

n = 100; mean = 0.2; se = 0.04

The mean of these gets closer to 0.2; the true mean as we increase the number of samples. The se decreases, which means that as you increase the sample size, there is less variability. The distribution begins to appear normal when you make the sample larger, since at n = 10, there is a right skew.

```{r plot-prop-boys-present}

```


### Exercise 7 - More Practice

Based on this sampling, the best estimate is that 80% believe that the work of scientists is beneficial.

```{r find-max-total}
set.seed(5555)
samp3 <- global_monitor %>%
  sample_n(15)

samp3 %>%
  count(scientist_work) %>%
  mutate(p.hat = n /sum(n))

```


### Exercise 8 - More Practice

Based on the sampling here, the estimate is 0.798. The true population proportion is 0.8. So, this estimate is very close.

```{r}
sample_props15 <- global_monitor %>% rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000")

mean(sample_props15$p_hat)

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

```


### Exercise 9 - More Practice

When n = 150, the shape is more normal.

Based on this, the proportion is very nearly 0.8, since the mean of the sampling distribution is 0.799. Which is 0.001 off from the true proportion (0.8). Closer than what we found from Exercise 8 above.

```{r}
sample_props150 <- global_monitor %>% rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000")

mean(sample_props150$p_hat)

```


### Exercise 10 - More Practice

The spread of the n = 150 sampling distribution is smaller than the sampling distribution where n = 15. Based on this, it would be my preference to predict the proportion from this distribution with a smaller spread because there is less variability in the data. This would mean that the value from sampling is closer to the true proportion.

```{r}

```