---
title: "Stats 13 Lab 3"
author: "Brooklyn Duran"
date: "2024-08-27"
output: html_document
---

# Lab 3

```{r load-packages, message=FALSE}

# Load the tidyverse, openintro and infer packages, or libraries.
library(tidyverse)
library(openintro)
library(infer)

# If any of these do not work because you do not have them installed yet, run
# install.packages('infer') in the console.
# install.packages('infer') -> installs the 'infer' package on your computer (only needed once)
# library(infer) -> tells this file you want to use functions from the 'infer' package
# (replace infer with another package name if needed)

# Setting a seed makes random functions reproducible: resetting the same seed
# before a call returns the same result (helpful for testing purposes).
set.seed(42)
sample.int(n = 100, size = 1) # 49
set.seed(42)
sample.int(n = 100, size = 1) # same result
```

Exercise 1

```{r exercise-1}

# Create a dataframe/tibble that reports 100,000
# responses to the question:
# "Do you believe that the work scientists do benefit people like you?"
globalmonitor <- tibble(
  scientistwork = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# Create a bar plot of globalmonitor
ggplot(globalmonitor, aes(x = scientistwork)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

# Calculate the proportion of each response
globalmonitor %>%
  count(scientistwork) %>%
  mutate(p = n / sum(n))

# Set the seed to 42
set.seed(42)

# Sample 50 responses
samp1 <- globalmonitor %>%
  sample_n(50)

# Create a bar plot of your sample
ggplot(samp1, aes(x = scientistwork)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip()

# Calculate the proportion of each response in your sample
samp1 %>%
  count(scientistwork) %>%
  mutate(p = n / sum(n))
```

Exercise 2

```{r exercise-2}

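# Would the bar plots match if you were to change the seed to a different
# number and take another sample?
# No, they should not match exactly: a different seed draws a different random sample.

# Would the proportions be similar if you were to change the seed to a
# different number and take another sample?
# Yes, the proportions should be similar (though rarely identical), because both
# are random samples of size 50 from the same population, which is large enough
# to keep the sample proportions near the population proportions.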

```
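
The two answers above can be checked with a short simulation. This is a minimal sketch in base R only, so it runs without the packages loaded earlier; `population` and `prop_for_seed()` are hypothetical names introduced for illustration, and the seeds 1, 2 and 3 are arbitrary choices. Base `sample()` is used in place of `sample_n()`, so the exact values need not match those from the chunks above.

```r
# Hypothetical stand-in for the population from Exercise 1:
# 80,000 "Benefits" and 20,000 "Doesn't benefit" responses.
population <- c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))

# Draw a sample of `size` responses under a given seed and return the
# proportion of "Doesn't benefit" responses in that sample.
prop_for_seed <- function(seed, size = 50) {
  set.seed(seed)
  samp <- sample(population, size = size)
  mean(samp == "Doesn't benefit")
}

# Same seed twice: the two proportions are identical.
same_seed <- c(prop_for_seed(42), prop_for_seed(42))

# Different seeds: the proportions usually differ, but stay in the same range.
other_seeds <- sapply(c(1, 2, 3), prop_for_seed)

print(same_seed)
print(other_seeds)
```

Repeating the last call with more seeds gives a feel for how much a proportion from 50 responses typically moves from sample to sample.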

Exercise 3

```{r exercise-3}

# Set the seed to 0
set.seed(0)

# Sample another 50 responses as samp2
samp2 <- globalmonitor %>%
  sample_n(50)

# Calculate the proportion of each response in samp2
samp2 %>%
  count(scientistwork) %>%
  mutate(p = n / sum(n))

# Sample another 100 responses as samp3
samp3 <- globalmonitor %>%
  sample_n(100)

# Calculate the proportion of each response in samp3
samp3 %>%
  count(scientistwork) %>%
  mutate(p = n / sum(n))

# Sample another 1000 responses as samp4
samp4 <- globalmonitor %>%
  sample_n(1000)

# Calculate the proportion of each response in samp4
samp4 %>%
  count(scientistwork) %>%
  mutate(p = n / sum(n))

# What do you notice about the proportions as the sample size increases?
# As the sample size increases, the sample proportions tend to get closer to
# the true proportions (0.8 and 0.2).

# Will this always be true if you take samples of size 50, 100, 1000?
# No. It tends to be true, but not always, because a smaller sample can still
# land closer to the true proportion by chance.

```
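
The tendency described above can be quantified with the usual approximation for the standard error of a sample proportion, sqrt(p * (1 - p) / n). The sketch below assumes the true proportion of "Doesn't benefit" is 0.2, as in the population built in Exercise 1; the names `p`, `n`, `se` and `se_table` are introduced here for illustration only.

```r
# True proportion of "Doesn't benefit" in the population from Exercise 1.
p <- 0.2

# The three sample sizes used in this exercise.
n <- c(50, 100, 1000)

# Approximate standard error of the sample proportion for each sample size.
se <- sqrt(p * (1 - p) / n)

se_table <- data.frame(n = n, se = round(se, 4))
print(se_table)
```

The standard error shrinks from roughly 0.057 at n = 50 to roughly 0.013 at n = 1000, which is why larger samples usually land closer to 0.2. It is a typical distance rather than a guarantee, so a single small sample can still beat a larger one.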

Exercise 4

```{r exercise-4}

# Obtain 15000 samples of size 50 and calculate the proportion of
# "Doesn't benefit" in each sample
sampleprops50 <- globalmonitor %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  count(scientistwork) %>%
  mutate(phat = n / sum(n)) %>%
  filter(scientistwork == "Doesn't benefit")

# Create a histogram of your 15000 p_hats
ggplot(data = sampleprops50, aes(x = phat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

# Obtain 15000 samples of size 100 and calculate the proportion of
# "Doesn't benefit" in each sample
sampleprops100 <- globalmonitor %>%
  rep_sample_n(size = 100, reps = 15000, replace = TRUE) %>%
  count(scientistwork) %>%
  mutate(phat = n / sum(n)) %>%
  filter(scientistwork == "Doesn't benefit")

# Create a histogram of your 15000 p_hats
ggplot(data = sampleprops100, aes(x = phat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 100, Number of samples = 15000"
  )

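# How are the two histograms different?
# Both histograms are centered near 0.2, but the p_hat values in the second
# histogram (sample size 100) are concentrated more tightly around 0.2 than
# those in the first (sample size 50).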

```
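
The difference in spread between the two histograms can also be put in numbers. The sketch below is not the lab's pipeline: it assumes that drawing with replacement from a population that is 20% "Doesn't benefit" can be modelled with `rbinom()`, and the names `p_hat_50`, `p_hat_100` and `spread` are introduced here for illustration.

```r
set.seed(42)

p <- 0.2       # true proportion of "Doesn't benefit"
reps <- 15000  # number of simulated samples per sample size

# rbinom() returns the number of "Doesn't benefit" responses in each sample;
# dividing by the sample size turns each count into a p_hat.
p_hat_50 <- rbinom(reps, size = 50, prob = p) / 50
p_hat_100 <- rbinom(reps, size = 100, prob = p) / 100

# Compare the simulated spread with the approximation sqrt(p * (1 - p) / n).
spread <- data.frame(
  n = c(50, 100),
  simulated_sd = c(sd(p_hat_50), sd(p_hat_100)),
  theoretical_se = sqrt(p * (1 - p) / c(50, 100))
)
print(spread)
```

Both sets of p_hats average close to 0.2, and the standard deviation for size 100 is smaller than for size 50, matching the narrower second histogram.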

Exercise 5

```{r exercise-5}

# Obtain 25 samples of size 10 and calculate the proportion of
# "Doesn't benefit" in each sample
samplepropssmall <- globalmonitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  count(scientistwork) %>%
  mutate(phat = n / sum(n)) %>%
  filter(scientistwork == "Doesn't benefit")

# Create a histogram of your 25 p_hats
ggplot(data = samplepropssmall, aes(x = phat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 10, Number of samples = 25"
  )

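# Why does this histogram look so different from the ones in Exercise 4?
# Each sample has only 10 responses, so the only possible p_hat values are
# 0, 0.1, 0.2, and so on, which leaves gaps between the bars. Because there are
# only 25 samples, the counts are also much smaller and the shape is rougher.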

```
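
The gaps in this histogram follow directly from the arithmetic of a sample of size 10. The sketch below lists every achievable p_hat and tabulates 25 simulated values against them. It assumes, as a simplification, that `rbinom()` can stand in for sampling with replacement; `possible_p_hat`, `p_hat_small` and `counts` are names introduced for illustration. Unlike a count-then-filter pipeline, which drops any sample containing no "Doesn't benefit" responses, this sketch keeps samples with a p_hat of 0.

```r
set.seed(42)

n <- 10     # sample size
reps <- 25  # number of samples

# Every achievable p_hat for a sample of size 10: 0, 0.1, ..., 1.
possible_p_hat <- (0:n) / n

# 25 simulated p_hats for the "Doesn't benefit" response.
p_hat_small <- rbinom(reps, size = n, prob = 0.2) / n

# Count how many of the 25 samples fall on each possible value.
counts <- table(factor(p_hat_small, levels = possible_p_hat))
print(counts)
```

With a bin width of 0.02 and possible values 0.1 apart, most bins can never receive a count, so the bars sit apart from each other regardless of how many samples are drawn.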

```{r submission-instructions}
# Knit (or generate) the R Markdown file and submit as your TA instructs.
```