library(tidyverse)
library(openintro)
library(infer)

Exercise 1

The sample distribution is similar to the poll given by WGM, as 74% believe in scientists and 26% does not. The sample differentiates by 6% from the original, as the sample population is more wary of scientists compared to the poll.

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits        80000   0.8
## 2 Doesn't benefit 20000   0.2
set.seed(50)

samp1 <- global_monitor %>%
  sample_n(50)

samp1 %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits           37  0.74
## 2 Doesn't benefit    13  0.26

Exercise 2

The sample corporation could match another student’s sample if both students choose the same sample size and set the seed before the sample. It is not guarantee that each sample is the same as it’s randomly picked. There should not be a large difference in the true proportion. F

Exercise 3

The new sample of 50 has 4% difference from sample 1. Four more people from the population agree with scientists compared to the previous sample. I think a large population size will produce a more accurate estimate, as there is more variety in the sample.

samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits           39  0.78
## 2 Doesn't benefit    11  0.22
samp3 <- global_monitor %>%
  sample_n(100)

samp4 <- global_monitor %>%
  sample_n(1000)
samp3 %>%
    count(scientist_work) %>%
    mutate(p = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits           85  0.85
## 2 Doesn't benefit    15  0.15
samp4 %>%
    count(scientist_work) %>%
    mutate(p = n /sum(n))
## # A tibble: 2 x 3
##   scientist_work      n     p
##   <chr>           <int> <dbl>
## 1 Benefits          802 0.802
## 2 Doesn't benefit   198 0.198

Exercise 4

The distribution of the results is symmetric and uni modal. The mean of the historiography is at 0.2. The estimate of the true population proportion is 20%.

sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

Exercise 5

There are 25 observation in the tibble sample_props_small. The observations are the result of a sample, the sample produced a proportion of people who choose “Don’t believe” from a pool of ten.

sample_props_small<-tibble(scientist_work="Doesn't benefit",n=0,p_hat=0.0)
x<-0
while(x <27){
  temp<-global_monitor %>%
  sample_n(size = 10, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
  
  sample_props_small<-rbind(sample_props_small,temp)
  x<-x+1
}
sample_props_small<-sample_props_small[-1,]
head(sample_props_small)
## # A tibble: 6 x 3
##   scientist_work      n p_hat
##   <chr>           <dbl> <dbl>
## 1 Doesn't benefit     3   0.3
## 2 Doesn't benefit     1   0.1
## 3 Doesn't benefit     1   0.1
## 4 Doesn't benefit     1   0.1
## 5 Doesn't benefit     3   0.3
## 6 Doesn't benefit     3   0.3

Exercrise 6

As the sample size increases, the shape of the distribution becomes symmetrical and the mean centers around .20. The SE leans toward .0056 as the sample size increased.

sample_props10 <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 5000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

sample_props100 <- global_monitor %>%
                    rep_sample_n(size = 100, reps = 5000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

SE<-function(x){
  
  x<-(x*(1-x))/5000
  x<- sqrt(x)
  return(x)
}
x<-mean(sample_props10$p_hat)
y<-mean(sample_props50$p_hat)
z<-mean(sample_props100$p_hat)
SE(x)
## [1] 0.005897241
SE(y)
## [1] 0.005663194
SE(z)
## [1] 0.005658105

Exercise 7

The proportion of one sample of size 15, it estimates ~67% of the population agrees on “benefits”.

samp_yes<-global_monitor %>%
  sample_n(size = 15, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")

mean(samp_yes$p_hat)
## [1] 0.8

Exercise 8

The sample distribution is unimodal and left skewed. I guess the true proportion is .8 from the histogram. The population proportion is ~80%.

sample_props15 <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

mean(sample_props15$p_hat)
## [1] 0.7978667
ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

Exercise 9

The distribution of the sample size is unimodal and symmetric. I guess the true proportion is .80.

sample_props150 <- global_monitor %>%
                    rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")
ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )

Exercise 10

When we compared the spreads of the sampling distributions, the collections with the higher sample size had a smaller spread than the smaller ones. For example, the sample distribution of size 15 had the largest spread compared to size 150. I would work with a smaller spread as the larger sample size provides a more accurate true proportion.

---
title: "Lab 5 Sampling distributions"
author: "Vyanna Hill"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(infer)
```

### Exercise 1

The sample distribution is similar to the poll given by WGM, as 74% believe in scientists and 26% does not. The sample differentiates by 6% from the original, as the sample population is more wary of scientists compared to the poll.

```{r code-chunk-label}
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

set.seed(50)

samp1 <- global_monitor %>%
  sample_n(50)

samp1 %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
```

### Exercise 2

The sample corporation could match another student's sample if both students choose the same sample size and set the seed before the sample. It is not guarantee that each sample is the same as it's randomly picked. There should not be a large difference in the true proportion. F

### Exercise 3

The new sample of 50 has 4% difference from sample 1. Four more people from the population agree with scientists compared to the previous sample. I think a large population size will produce a more accurate estimate, as there is more variety in the sample. 
```{r}
samp2 <- global_monitor %>%
  sample_n(50)

samp2 %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

samp3 <- global_monitor %>%
  sample_n(100)

samp4 <- global_monitor %>%
  sample_n(1000)
samp3 %>%
    count(scientist_work) %>%
    mutate(p = n /sum(n))
samp4 %>%
    count(scientist_work) %>%
    mutate(p = n /sum(n))
```

### Exercise 4

The distribution of the results is symmetric and uni modal. The mean of the historiography is at 0.2. The estimate of the true population proportion is 20%. 
```{r}
sample_props50 <- global_monitor %>%
                    rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

```

### Exercise 5

There are 25 observation in the tibble sample_props_small. The observations are the result of a sample, the sample produced a proportion of people who choose "Don't believe" from a pool of ten.
```{r}
sample_props_small<-tibble(scientist_work="Doesn't benefit",n=0,p_hat=0.0)
x<-0
while(x <27){
  temp<-global_monitor %>%
  sample_n(size = 10, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
  
  sample_props_small<-rbind(sample_props_small,temp)
  x<-x+1
}
sample_props_small<-sample_props_small[-1,]
head(sample_props_small)
```


### Exercrise 6

As the sample size increases, the shape of the distribution becomes symmetrical and the mean centers around .20. The SE leans toward .0056 as the sample size increased.
```{r}
sample_props10 <- global_monitor %>%
                    rep_sample_n(size = 10, reps = 5000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

sample_props100 <- global_monitor %>%
                    rep_sample_n(size = 100, reps = 5000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Doesn't benefit")

SE<-function(x){
  
  x<-(x*(1-x))/5000
  x<- sqrt(x)
  return(x)
}
x<-mean(sample_props10$p_hat)
y<-mean(sample_props50$p_hat)
z<-mean(sample_props100$p_hat)
SE(x)
SE(y)
SE(z)

```


### Exercise 7

The proportion of one sample of size 15, it estimates ~67% of the population agrees on "benefits".
```{r}
samp_yes<-global_monitor %>%
  sample_n(size = 15, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")

mean(samp_yes$p_hat)
```

### Exercise 8

The sample distribution is unimodal and left skewed. I guess the true proportion is .8 from the histogram. The population proportion is ~80%.
```{r}
sample_props15 <- global_monitor %>%
                    rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")

mean(sample_props15$p_hat)

ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )
```

### Exercise 9

The distribution of the sample size is unimodal and symmetric. I guess the true proportion is .80.
```{r}
sample_props150 <- global_monitor %>%
                    rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
                    count(scientist_work) %>%
                    mutate(p_hat = n /sum(n)) %>%
                    filter(scientist_work == "Benefits")
ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )
```

### Exercise 10

When we compared the spreads of the sampling distributions, the collections with the higher sample size had a smaller spread than the smaller ones. For example, the sample distribution of size 15 had the largest spread compared to size 150. I would work with a smaller spread as the larger sample size provides a more accurate true proportion.