library(tidyverse)
library(openintro)
library(infer)
Exercise 1
Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the data set, you can still refer to the variables in the data set with the same names. Code you presented earlier for visualizing and summarizing the population data will still be useful for the sample; however, be careful not to label your proportion p since you’re now calculating a sample statistic, not a population parameter. You can customize the label of the statistic to indicate that it comes from the sample.
# Insert code for Exercise 1 here
global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
set.seed(1)
samp1 <- global_monitor %>%
sample_n(50)
samp1 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 38 0.76
## 2 Doesn't benefit 12 0.24
The sample's distribution is similar to the population's. With set.seed(1), my sample of 50 gave 76 percent for "Benefits" and 24 percent for "Doesn't benefit", compared with 80 percent and 20 percent in the population.
Exercise 2
Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.
# Insert code for Exercise 2 here
global_monitor <- tibble(
scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
ggplot(global_monitor, aes(x = scientist_work)) +
geom_bar() +
labs(
x = "", y = "",
title = "Do you believe that the work scientists do benefit people like you?"
) +
coord_flip()

global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p
## <chr> <int> <dbl>
## 1 Benefits 80000 0.8
## 2 Doesn't benefit 20000 0.2
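# Draw a fresh, unseeded sample under a new name so samp1 from Exercise 1 is not overwritten
samp1_rerun <- global_monitor %>%
sample_n(50)
samp1_rerun %>%
count(scientist_work) %>%
mutate(p_hat = n / sum(n))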
## # A tibble: 2 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 41 0.82
## 2 Doesn't benefit 9 0.18
No, I would not expect the proportions to match exactly, because each student randomly selects 50 people and another sample will most likely contain a different 50 people who answered differently. The proportions should be somewhat different, but not very different.
Exercise 3
Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1?
# Insert code for Exercise 3 here
set.seed(2)
samp2 <- global_monitor %>%
sample_n(50)
samp2 %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n))
## # A tibble: 2 x 3
## scientist_work n p_hat
## <chr> <int> <dbl>
## 1 Benefits 42 0.84
## 2 Doesn't benefit 8 0.16
The sample proportion changed from 76 percent to 84 percent for "Benefits" and from 24 percent to 16 percent for "Doesn't benefit". The two samples differ, but their proportions are similar.
Exercise 3 (continued)
Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?
set.seed(1)
samp3 <- global_monitor %>%
sample_n(100)
samp3 %>%
count(scientist_work) %>%
mutate(p_hat = n / sum(n))
samp4 <- global_monitor %>%
sample_n(1000)
samp4 %>%
count(scientist_work) %>%
mutate(p_hat = n / sum(n))
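The larger the sample size, the more accurate the estimate of the population proportion tends to be, so I would expect the sample of size 1000 to do better. In my run, the proportions from the sample of size 1000 were closer to 80 percent for "Benefits" and 20 percent for "Doesn't benefit".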
Exercise 4
How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer
sample_props50 <- global_monitor %>%
rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(
x = "p_hat (Doesn't benefit)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 50, Number of samples = 15000"
)
There appear to be 4 columns and 15,000 rows in sample_props50. The columns are replicate, scientist_work, n, and p_hat, and each row holds the proportion from one sample of size 50. The sampling distribution is bell-shaped, and its center is around 0.2.
Exercise 5
To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?
sample_props_small <- global_monitor %>%
rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Doesn't benefit")
sample_props_small
## # A tibble: 21 x 4
## # Groups: replicate [21]
## replicate scientist_work n p_hat
## <int> <chr> <int> <dbl>
## 1 1 Doesn't benefit 1 0.1
## 2 2 Doesn't benefit 2 0.2
## 3 3 Doesn't benefit 2 0.2
## 4 4 Doesn't benefit 1 0.1
## 5 5 Doesn't benefit 1 0.1
## 6 6 Doesn't benefit 3 0.3
## 7 8 Doesn't benefit 3 0.3
## 8 9 Doesn't benefit 3 0.3
## 9 10 Doesn't benefit 3 0.3
## 10 11 Doesn't benefit 4 0.4
## # ... with 11 more rows
mean(sample_props_small$p_hat)
## [1] 0.2380952
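The code draws 25 samples, so there can be at most 25 observations. My printed output has only 21 rows, because replicates in which nobody answered "Doesn't benefit" (replicate 7, for example) have no row for that category and drop out after the filter. Each observation represents the proportion of "Doesn't benefit" responses in one random sample of size 10.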
Exercise 6
Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How does the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)
Each observation in the sampling distribution represents the proportion of "Doesn't benefit" responses in one random sample of a particular size (10, 50, or 100).
As I increased the sample size, the distribution became more bell-shaped.
The mean of the sampling distribution stays close to the population proportion of 0.2 at every sample size.
The standard error got smaller as the sample size increased.
According to the central limit theorem, if we have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.
If we increase the number of simulations, the mean and standard error of the sampling distribution stay about the same; the estimates become more stable from run to run and the histogram looks smoother.
Exercise 7
Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enhances their lives?
# Insert code for Exercise 7 here
set.seed(1)
samp5 <- global_monitor %>%
sample_n(15)
samp5 %>%
count(scientist_work) %>%
mutate(p_hat5 = n /sum(n))
My best point estimate of the population proportion is the sample proportion, 93.3% (14 of 15).
Exercise 8
Since you have access to the population, simulate the sampling distribution of proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2000 samples from the population of size 15 and computing 2000 sample proportions. Store these proportions as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be? Finally, calculate and report the population proportion.
# Insert code for Exercise 8 here
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
glimpse(sample_props15)
ggplot(data = sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.075) +
labs(x = "p_hat (Benefits)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 15, Number of samples = 2000"
)
The histogram is skewed to the left. Based on the plot, I would guess that about 70% of the population believes that the work scientists do benefits their lives. The actual population proportion is 80%.
Exercise 9
Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work scientists do enhances their lives?
# Insert code for Exercise 9 here
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
glimpse(sample_props150)
## Rows: 2,000
## Columns: 4
## Groups: replicate [2,000]
## $ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ scientist_work <chr> "Benefits", "Benefits", "Benefits", "Benefits", "Ben...
## $ n <int> 117, 126, 129, 113, 123, 109, 126, 122, 114, 110, 11...
## $ p_hat <dbl> 0.7800000, 0.8400000, 0.8600000, 0.7533333, 0.820000...
ggplot(data = sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(x = "p_hat (Benefits)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 150, Number of samples = 2000"
)
Based on this sampling distribution, I would guess that about 80% of people think the work scientists do enhances their lives. The shape is more bell-shaped than the sampling distribution for samples of size 15, which was skewed to the left.
Exercise 10
Of the sampling distributions from 2 and 3, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?
# Insert code for Exercise 10 here
sample_props15 <- global_monitor %>%
rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
glimpse(sample_props15)
global_monitor %>%
count(scientist_work) %>%
mutate(p = n /sum(n))
ggplot(data = sample_props15, aes(x = p_hat)) +
geom_histogram(binwidth = 0.075) +
labs(x = "p_hat (Benefits)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 15, Number of samples = 2000"
)
sample_props150 <- global_monitor %>%
rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
count(scientist_work) %>%
mutate(p_hat = n /sum(n)) %>%
filter(scientist_work == "Benefits")
glimpse(sample_props150)
ggplot(data = sample_props150, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02) +
labs(x = "p_hat (Benefits)",
title = "Sampling distribution of p_hat",
subtitle = "Sample size = 150, Number of samples = 2000"
)
The sampling distribution from Exercise 9 (sample size 150) appears to have a smaller spread than the one from Exercise 8 (sample size 15). A smaller spread means estimates are more often close to the true value, so I would prefer a sampling distribution with a small spread.
---
title: "Lab 5: Foundations for Inference"
author: "Teresa Doley"
date: "10/5/2020"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(infer)
```

### Exercise 1

Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? Hint: Although the sample_n function takes a random sample of observations (i.e. rows) from the data set, you can still refer to the variables in the data set with the same names. Code you presented earlier for visualizing and summarizing the population data will still be useful for the sample; however, be careful not to label your proportion p since you’re now calculating a sample statistic, not a population parameter. You can customize the label of the statistic to indicate that it comes from the sample.

```{r view-girls-counts}
# Insert code for Exercise 1 here

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))

set.seed(1)
samp1 <- global_monitor %>%
  sample_n(50)
samp1 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
```
The sample's distribution is similar to the population's. With `set.seed(1)`, my sample of 50 gave 76 percent for "Benefits" and 24 percent for "Doesn't benefit", compared with 80 percent and 20 percent in the population.
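
To put "similar" on a numeric footing, one rough check is to express the gap between the sample proportion and the population proportion in standard-error units. The sketch below assumes the usual standard error formula for a sample proportion, `sqrt(p * (1 - p) / n)`; the variable names are illustrative only.

```{r}
# Values taken from the answer above
p     <- 0.80   # population proportion of "Benefits"
p_hat <- 0.76   # sample proportion from samp1
n     <- 50     # sample size

# Standard error of a sample proportion (assumed formula: sqrt(p * (1 - p) / n))
se <- sqrt(p * (1 - p) / n)

# Distance between the sample and population proportions, in SE units
z <- (p_hat - p) / se

round(c(se = se, z = z), 3)
```

A gap of less than one standard error in either direction is well within ordinary sampling variability.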

### Exercise 2
Would you expect the sample proportion to match the sample proportion of another student’s sample? Why, or why not? If the answer is no, would you expect the proportions to be somewhat different or very different? Ask a student team to confirm your answer.

```{r trend-girls}
# Insert code for Exercise 2 here
global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)
ggplot(global_monitor, aes(x = scientist_work)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you believe that the work scientists do benefit people like you?"
  ) +
  coord_flip() 
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
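# Draw a fresh, unseeded sample under a new name so samp1 from Exercise 1 is not overwritten
samp1_rerun <- global_monitor %>%
  sample_n(50)
samp1_rerun %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))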

```
No, I would not expect the proportions to match exactly, because each student randomly selects 50 people and another sample will most likely contain a different 50 people who answered differently. The proportions should be somewhat different, but not very different.
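
The variability between students' samples can be illustrated by drawing several samples with different seeds. This is a minimal sketch in base R; the seeds 1 to 5 stand in for five hypothetical students and are not part of the lab.

```{r}
# Rebuild the population as a plain character vector
population <- c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))

# Each seed plays the role of one hypothetical student drawing 50 people
p_hats <- sapply(1:5, function(seed) {
  set.seed(seed)
  mean(sample(population, size = 50) == "Benefits")
})

p_hats          # five sample proportions of "Benefits"
range(p_hats)   # smallest and largest of the five
```

The five proportions should scatter around 0.8 without landing on exactly the same value every time.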

### Exercise 3 
Take a second sample, also of size 50, and call it samp2. How does the sample proportion of samp2 compare with that of samp1? 

```{r plot-prop-boys-arbuthnot}
# Insert code for Exercise 3 here

set.seed(2)

samp2 <- global_monitor %>%
  sample_n(50)
samp2 %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n))
```
The sample proportion changed from 76 percent to 84 percent for "Benefits" and from 24 percent to 16 percent for "Doesn't benefit". The two samples differ, but their proportions are similar.

### Exercise 3 (continued)
Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population proportion?

```{r more-samples}
set.seed(1)

samp3 <- global_monitor %>%
  sample_n(100)
samp3 %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))

samp4 <- global_monitor %>%
  sample_n(1000)
samp4 %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n))

```
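The larger the sample size, the more accurate the estimate of the population proportion tends to be, so I would expect the sample of size 1000 to do better. In my run, the proportions from the sample of size 1000 were closer to 80 percent for "Benefits" and 20 percent for "Doesn't benefit".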
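
The effect of sample size can be quantified with the standard error of a sample proportion. The sketch below assumes the formula `sqrt(p * (1 - p) / n)` and the population proportion p = 0.8; `se_by_size` is an illustrative name, not something defined by the lab.

```{r}
library(tibble)

p <- 0.80  # population proportion of "Benefits"

# Standard error of p_hat for each sample size (assumed formula: sqrt(p * (1 - p) / n))
se_by_size <- tibble(
  n  = c(50, 100, 1000),
  se = sqrt(p * (1 - p) / n)
)

se_by_size
```

Under this formula the standard error shrinks with the square root of the sample size, so moving from 100 to 1000 observations cuts the typical error from about 0.04 to about 0.013.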

### Exercise 4

How many elements are there in sample_props50? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.

```{r dim-present}
sample_props50 <- global_monitor %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )
```
There appear to be 4 columns and 15,000 rows in `sample_props50`. The columns are `replicate`, `scientist_work`, `n`, and `p_hat`, and each row holds the proportion from one sample of size 50. The sampling distribution is bell-shaped, and its center is around 0.2.

### Exercise 5

To make sure you understand how sampling distributions are built, and exactly what the rep_sample_n function does, try modifying the code to create a sampling distribution of 25 sample proportions from samples of size 10, and put them in a data frame named sample_props_small. Print the output. How many observations are there in this object called sample_props_small? What does each observation represent?



```{r count-compare}
sample_props_small <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")
sample_props_small
mean(sample_props_small$p_hat)
```

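The code draws 25 samples, so there can be at most 25 observations. My knitted output has only 21 rows, because replicates in which nobody answered "Doesn't benefit" (replicate 7, for example) have no row for that category and drop out after the filter. Each observation represents the proportion of "Doesn't benefit" responses in one random sample of size 10.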
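
Because `count()` only returns categories that actually occur in a replicate, a sample with no "Doesn't benefit" responses leaves no row behind after the filter. One way to keep every replicate, sketched here as an alternative rather than the lab's prescribed code, is to compute the proportion directly with `summarize()`. The name `sample_props_small_all` is hypothetical and the seed is arbitrary.

```{r}
library(dplyr)
library(infer)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# One row per replicate, including replicates where the proportion is 0
sample_props_small_all <- global_monitor %>%
  rep_sample_n(size = 10, reps = 25, replace = TRUE) %>%
  summarize(p_hat = mean(scientist_work == "Doesn't benefit"))

nrow(sample_props_small_all)        # one row for each of the 25 replicates
mean(sample_props_small_all$p_hat)  # mean now includes any zero proportions
```

Keeping the zero proportions matters for the mean: dropping them pushes the average of `p_hat` above what the full set of samples would give.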

### Exercise 6

Use the app below to create sampling distributions of proportions of Doesn’t benefit from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How does the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)

Each observation in the sampling distribution represents the proportion of "Doesn't benefit" responses in one random sample of a particular size (10, 50, or 100).

As I increased the sample size, the distribution became more bell-shaped.

The mean of the sampling distribution stays close to the population proportion of 0.2 at every sample size.

The standard error got smaller as the sample size increased.

According to the central limit theorem, if we have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.

If we increase the number of simulations, the mean and standard error of the sampling distribution stay about the same; the estimates become more stable from run to run and the histogram looks smoother.
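
These patterns can also be checked outside the app. The sketch below is an approximation that assumes each sample count follows a binomial distribution with p = 0.2, which corresponds to sampling with replacement; `simulate_p_hat` is a hypothetical helper, not part of the lab or the app.

```{r}
set.seed(1)  # arbitrary seed for reproducibility

# Simulate `reps` sample proportions of "Doesn't benefit" for samples of size n
simulate_p_hat <- function(n, reps = 5000, p = 0.2) {
  rbinom(reps, size = n, prob = p) / n
}

sizes <- c(10, 50, 100)

# Mean and standard error of each simulated sampling distribution
sim_summary <- t(sapply(sizes, function(n) {
  p_hats <- simulate_p_hat(n)
  c(n = n, mean = mean(p_hats), se = sd(p_hats))
}))

sim_summary
```

Under this assumption the means should all sit near 0.2 while the standard error falls as the sample size grows; rerunning with a larger `reps` mainly stabilizes the estimates rather than shifting them.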


### Exercise 7

Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives. Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enhances their lives?


```{r point-estimate-15}
# Insert code for Exercise 7 here

set.seed(1)

samp5 <- global_monitor %>%
  sample_n(15)

samp5 %>%
  count(scientist_work) %>%
  mutate(p_hat5 = n /sum(n))

```
My best point estimate of the population proportion is the sample proportion, 93.3% (14 of 15).

### Exercise 8

Since you have access to the population, simulate the sampling distribution of proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2000 samples from the population of size 15 and computing 2000 sample proportions. Store these proportions as sample_props15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be? Finally, calculate and report the population proportion.



```{r sample-props15}
# Insert code for Exercise 8 here
sample_props15 <- global_monitor %>%
  rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")
glimpse(sample_props15)
ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.075) +
  labs(x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )
```
The histogram is skewed to the left. Based on the plot, I would guess that about 70% of the population believes that the work scientists do benefits their lives. The actual population proportion is 80%.
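
A guess read off the histogram can be checked numerically by taking the mean of the simulated proportions. This sketch computes each proportion with `summarize()` instead of `count()` and `filter()`, which is a swapped-in variant of the chunk above; the name `sample_props15_check` is hypothetical and the seed is arbitrary.

```{r}
library(dplyr)
library(infer)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

set.seed(1)  # arbitrary seed for reproducibility

# One proportion of "Benefits" per replicate
sample_props15_check <- global_monitor %>%
  rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  summarize(p_hat = mean(scientist_work == "Benefits"))

# Center of the sampling distribution: an estimate of the true proportion
mean(sample_props15_check$p_hat)

# Population proportion for comparison
mean(global_monitor$scientist_work == "Benefits")
```

With a left-skewed distribution, the mean is usually a safer summary of the center than an eyeballed value from the plot.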

### Exercise 9

Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called sample_props150. Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the true proportion of those who think the work scientists do enhances their lives?


```{r find-max-total}
# Insert code for Exercise 9 here
sample_props150 <- global_monitor %>%
  rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")
glimpse(sample_props150)
ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000"
  )

```
Based on this sampling distribution, I would guess that about 80% of people think the work scientists do enhances their lives. The shape is more bell-shaped than the sampling distribution for samples of size 15, which was skewed to the left.

### Exercise 10

Of the sampling distributions from 2 and 3, which has a smaller spread? If you’re concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?

```{r spread-compare}
# Insert code for Exercise 10 here

sample_props15 <- global_monitor %>%
  rep_sample_n(size = 15, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")
glimpse(sample_props15)
global_monitor %>%
  count(scientist_work) %>%
  mutate(p = n /sum(n))
ggplot(data = sample_props15, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.075) +
  labs(x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 15, Number of samples = 2000"
  )
sample_props150 <- global_monitor %>%
  rep_sample_n(size = 150, reps = 2000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n /sum(n)) %>%
  filter(scientist_work == "Benefits")
glimpse(sample_props150)
ggplot(data = sample_props150, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(x = "p_hat (Benefits)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 150, Number of samples = 2000"
  )
```

The sampling distribution from Exercise 9 (sample size 150) appears to have a smaller spread than the one from Exercise 8 (sample size 15). A smaller spread means estimates are more often close to the true value, so I would prefer a sampling distribution with a small spread.
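
The spreads can be compared directly by computing the standard deviation of each simulated sampling distribution rather than judging from the plots. In this sketch `spread_of_p_hat` is a hypothetical helper and the seed is arbitrary; it uses `summarize()` in place of `count()` and `filter()` so that every replicate is kept.

```{r}
library(dplyr)
library(infer)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

# Hypothetical helper: standard deviation of simulated proportions of "Benefits"
spread_of_p_hat <- function(size, reps = 2000) {
  global_monitor %>%
    rep_sample_n(size = size, reps = reps, replace = TRUE) %>%
    summarize(p_hat = mean(scientist_work == "Benefits")) %>%
    summarize(sd_p_hat = sd(p_hat)) %>%
    pull(sd_p_hat)
}

set.seed(1)  # arbitrary seed for reproducibility
spread15  <- spread_of_p_hat(15)
spread150 <- spread_of_p_hat(150)

c(n15 = spread15, n150 = spread150)
```

For reference, the formula `sqrt(p * (1 - p) / n)` with p = 0.8 gives about 0.103 for samples of size 15 and about 0.033 for samples of size 150, so the simulated spreads should land near those values.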
