set.seed(199)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
samp1 <- global_monitor %>%
sample_n(50)
ggplot(samp1, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "A smaller pop sample of the Question:"
) +
coord_flip()
samp1 %>%
count(scientist_work) %>%
mutate(samp_le = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n samp_le
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
The distribution of responses in this sample is heavily weighted toward people who believe that the work scientists do benefits them. Compared with the full global_monitor population (80% Benefits, 20% Doesn't benefit), I would say the sample is fairly reflective of the entire population.
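For reference, a quick sketch computing those population proportions directly from global_monitor (nothing new here; the counts follow from how the tibble was built above):
## true population proportions: 80,000 "Benefits" vs 20,000 "Doesn't benefit"
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n / sum(n))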
samp1 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
I don't expect my sample proportion to exactly match another student's sample proportion, because each sample is drawn at random: one student could easily draw a few more "Benefits" responses than another.
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(p_hat_ = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat_
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
The sample proportions in samp2 are very similar to samp1, with far more "Benefits" than "Doesn't benefit" responses (in this run the two samples happen to produce the same counts). I think a sample size of 1000 would provide a more accurate estimate, since counting more people should bring the sample's distribution of responses closer to the population's.
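As a rough check of that claim, here is a sketch of a single larger draw; the size of 1000 is just the value suggested above, not something the lab requires:
## one larger sample (n = 1000) for comparison with samp1 and samp2
samp_large <- global_monitor %>%
  sample_n(1000)
samp_large %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))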
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)
sample_props50 contains one p_hat for each of the 15,000 simulated samples; in the histogram the tallest bar reaches a count of roughly 2,000. The sampling distribution is symmetric with no noticeable skew and has a prominent center at 0.2 (20%).
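To put numbers on that description, a short sketch summarising the simulated distribution; ungroup() is needed because the result of rep_sample_n() and count() is still grouped by replicate:
sample_props50 %>%
  ungroup() %>%
  summarise(
    mean_p_hat = mean(p_hat),  ## center of the sampling distribution
    se_p_hat = sd(p_hat)       ## spread (standard error)
  )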
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small
## # A tibble: 23 x 4
## # Groups: replicate [23]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 2 0.2
## 2 3 Doesn't benefit 1 0.1
## 3 4 Doesn't benefit 4 0.4
## 4 5 Doesn't benefit 6 0.6
## 5 6 Doesn't benefit 2 0.2
## 6 7 Doesn't benefit 1 0.1
## 7 8 Doesn't benefit 1 0.1
## 8 9 Doesn't benefit 1 0.1
## 9 11 Doesn't benefit 6 0.6
## 10 12 Doesn't benefit 1 0.1
## # ... with 13 more rows
There are 23 observations in sample_props_small: two of the 25 samples contained no "Doesn't benefit" responses, so they were dropped. Each observation represents the proportion of respondents in one sample who believe that the work scientists do does not benefit them.
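Note that count() silently drops any replicate with zero "Doesn't benefit" responses, which is why the tibble above has 23 rows rather than 25; a quick sketch to confirm:
## number of replicates that contained at least one "Doesn't benefit" response
nrow(sample_props_small)
n_distinct(sample_props_small$replicate)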
## samples of size 10 of 5000 simulations
set.seed(1)
sample_props_small1 <- global_monitor %>%
rep_sample_n(size = 10, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small1
## # A tibble: 4,486 x 4
## # Groups: replicate [4,486]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 2 Doesn't benefit 4 0.4
## 2 3 Doesn't benefit 5 0.5
## 3 4 Doesn't benefit 1 0.1
## 4 5 Doesn't benefit 2 0.2
## 5 6 Doesn't benefit 1 0.1
## 6 7 Doesn't benefit 2 0.2
## 7 8 Doesn't benefit 1 0.1
## 8 9 Doesn't benefit 3 0.3
## 9 10 Doesn't benefit 2 0.2
## 10 11 Doesn't benefit 1 0.1
## # ... with 4,476 more rows
set.seed(2)
## samples of size 50 of 5000 simulations
sample_props_small2 <- global_monitor %>%
rep_sample_n(size = 50, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small2
## # A tibble: 5,000 x 4
## # Groups: replicate [5,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 8 0.16
## 2 2 Doesn't benefit 9 0.18
## 3 3 Doesn't benefit 13 0.26
## 4 4 Doesn't benefit 5 0.1
## 5 5 Doesn't benefit 10 0.2
## 6 6 Doesn't benefit 16 0.32
## 7 7 Doesn't benefit 8 0.16
## 8 8 Doesn't benefit 9 0.18
## 9 9 Doesn't benefit 10 0.2
## 10 10 Doesn't benefit 6 0.12
## # ... with 4,990 more rows
## samples of size 100 of 5000 simulations
set.seed(3)
sample_props_small3 <- global_monitor %>%
rep_sample_n(size = 100, reps = 5000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small3
## # A tibble: 5,000 x 4
## # Groups: replicate [5,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 26 0.26
## 2 2 Doesn't benefit 17 0.17
## 3 3 Doesn't benefit 16 0.16
## 4 4 Doesn't benefit 20 0.2
## 5 5 Doesn't benefit 22 0.22
## 6 6 Doesn't benefit 17 0.17
## 7 7 Doesn't benefit 18 0.18
## 8 8 Doesn't benefit 22 0.22
## 9 9 Doesn't benefit 19 0.19
## 10 10 Doesn't benefit 24 0.24
## # ... with 4,990 more rows
I set a separate seed for each simulation because the values kept changing between runs. Each observation represents the proportion of people in one sample who believe that scientists' work does not benefit them personally. According to my results, as the sample size increases the sampling distributions concentrate more tightly around the true population proportion of 0.2, and the standard error shrinks compared with the sample size of 10. So the bigger the sample, the more accurate our estimate becomes.
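As a check on that observation (my own sketch, not part of the lab), the standard errors of the three simulated sampling distributions can be compared against the theoretical value sqrt(p(1 - p)/n) with p = 0.2; note the n = 10 simulation drops replicates with zero "Doesn't benefit" responses, so its simulated value is only approximate:
p <- 0.2
tibble(
  n = c(10, 50, 100),
  se_simulated = c(
    sd(sample_props_small1$p_hat),
    sd(sample_props_small2$p_hat),
    sd(sample_props_small3$p_hat)
  ),
  se_theoretical = sqrt(p * (1 - p) / n)  ## shrinks as n grows
)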
set.seed(4)
samp3 <- global_monitor %>%
sample_n(15)
samp3 %>%
count(scientist_work) %>%
mutate(sam = n/sum(n))
## # A tibble: 2 x 3
## scientist_work n sam
## <chr> <int> <dbl>
## 1 Benefits 14 0.933
## 2 Doesn't benefit 1 0.0667
According to my sample, over 93% of people think that the work scientists do benefits them in everyday life. With only 15 respondents, though, a single p_hat can land far from the truth, so my estimate of the true population proportion is somewhere around 80 to 85%.
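To get a sense of how uncertain that single-sample estimate is, a sketch using the standard-error formula for a proportion, sqrt(p_hat * (1 - p_hat) / n), with the counts from samp3 above:
p_hat3 <- 14 / 15                        ## sample proportion of "Benefits" in samp3
se3 <- sqrt(p_hat3 * (1 - p_hat3) / 15)  ## standard error for n = 15
c(p_hat = p_hat3, se = se3)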
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
sample_props15
## # A tibble: 2,000 x 4
## # Groups: replicate [2,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Benefits 11 0.733
## 2 2 Benefits 13 0.867
## 3 3 Benefits 12 0.8
## 4 4 Benefits 12 0.8
## 5 5 Benefits 13 0.867
## 6 6 Benefits 15 1
## 7 7 Benefits 14 0.933
## 8 8 Benefits 13 0.867
## 9 9 Benefits 12 0.8
## 10 10 Benefits 13 0.867
## # ... with 1,990 more rows
ggplot(data = sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of population proportion",
subtitle = "Sample size = 15, Number of samples = 2000"
)
mean(sample_props15$p_hat)
## [1] 0.8038667
Based on the mean of p_hat, I estimate that about 80% of the population believes that the work scientists do benefits their everyday lives.
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
sample_props150
## # A tibble: 2,000 x 4
## # Groups: replicate [2,000]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Benefits 121 0.807
## 2 2 Benefits 123 0.82
## 3 3 Benefits 113 0.753
## 4 4 Benefits 119 0.793
## 5 5 Benefits 120 0.8
## 6 6 Benefits 126 0.84
## 7 7 Benefits 118 0.787
## 8 8 Benefits 124 0.827
## 9 9 Benefits 127 0.847
## 10 10 Benefits 118 0.787
## # ... with 1,990 more rows
ggplot(data = sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Benefits)",
title = "Sampling distribution of population proportion",
subtitle = "Sample size = 150, Number of samples = 2000"
)
mean(sample_props150$p_hat)
## [1] 0.80095
This distribution also looks symmetric with no skew, like the graph in Exercise 8, and taking the mean of p_hat to estimate the population proportion again gives about 80%.
Looking at the two charts, I would gauge that the second one (sample size = 150) has the smaller spread. If we want estimates that are more often close to the true population proportion, we prefer the sampling distribution with the smaller spread, which comes from the larger sample size.
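As a final check of that claim, a sketch comparing the spreads of the two sampling distributions directly:
## standard errors of the two sampling distributions
c(
  sd_n15 = sd(sample_props15$p_hat),
  sd_n150 = sd(sample_props150$p_hat)
)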