Getting Started

Load packages

library(tidyverse)
library(openintro)
library(infer)

The data

us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()

us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                     38000  0.38
## 2 Yes                    62000  0.62

In this lab, you’ll start with a simple random sample of size 60 from the population.

n <- 60
samp <- us_adults %>%
  sample_n(size = n)

Exercise 1

What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.

samp %>%
  count(climate_change_affects) %>%
  mutate(p = n /sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                        27  0.45
## 2 Yes                       33  0.55

Based on the results, 55% of the adults in my sample think that climate change affects their local community, while 45% do not. A majority of the sample therefore perceives climate change as having a local impact.

Exercise 2

Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?

I wouldn’t expect another student’s sample proportion to be identical, as random samples can differ. However, I would expect it to be similar, especially if the sample sizes are close. In my sample, 55% think climate change affects their local community, so a classmate’s random sample should yield a comparable proportion, though not exactly the same.

Confidence intervals

This code will find the 95 percent confidence interval for the proportion of US adults who think climate change affects their local community.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.417    0.683

Exercise 3

In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?

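“95% confidence” describes the method rather than any single interval: if we drew many random samples and built an interval from each one in the same way, about 95% of those intervals would contain the true population proportion.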

Exercise 4

Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

set.seed(9)
samp %>%
  count(climate_change_affects) %>%
  mutate(p=n/sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                        27  0.45
## 2 Yes                       33  0.55
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.433    0.667

My confidence interval ranges from 0.433 to 0.667, and it includes the true population proportion. The actual population proportion is 62% (0.62), which lies between the two bounds, so the true value is captured by the interval.

Exercise 5

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

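Each student has a slightly different confidence interval because each one started from a different random sample. I would expect about 95% of those intervals to capture the true population proportion, since a 95% confidence level means the method captures the true value in about 95% of repeated samples.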

Exercise 6

Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

set.seed(9999)
# Draw 60 entries without replacement. Because samp has only 60 rows,
# this reorders samp rather than drawing a new sample.
sampled_entries <- sample_n(samp, size = 60)

# Compute p-hat: count the number of "Yes" responses, then divide by the sample size.
p_hat <- sum(sampled_entries$climate_change_affects == "Yes") / nrow(sampled_entries)
p_hat
## [1] 0.55
ggplot(sampled_entries, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Observation over sample proportion"
  ) +
  coord_flip()

Exercise 7

Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.

If the confidence level is higher than 95%, the interval becomes wider because we need more certainty that the true population value is within the range.

In contrast, a lower confidence level (e.g., 90%) results in a narrower interval since we’re accepting less certainty and a higher chance of error, so the margin for capturing the true value is smaller.

Exercise 8

Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

# 85% confidence interval via the normal approximation (not the infer bootstrap)
prop <- mean(samp$climate_change_affects == "Yes")  # sample proportion
se <- sqrt(prop * (1 - prop) / nrow(samp))          # standard error of the proportion
z_score <- qnorm((1 + 0.85) / 2)                    # critical value for 85% confidence
margin_error <- z_score * se

lower <- prop - margin_error
upper <- prop + margin_error
# Print the confidence interval
cat("Confidence Interval: (", lower, ", ", upper, ")\n")
## Confidence Interval: ( 0.4575444 ,  0.6424556 )

Exercise 9

Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

At an 85% confidence level, I would expect about 85% of the 50 intervals to contain the true population proportion, a lower percentage than at the 95% level. As mentioned earlier, the interval becomes narrower as the confidence level decreases, so more intervals miss the true value.

Exercise 10

Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.

set.seed(1)
samp %>%
  count(climate_change_affects) %>%
  mutate(p=n/sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                        27  0.45
## 2 Yes                       33  0.55
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.85)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     0.45     0.65

Exercise 11

Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

Larger sample sizes result in narrower confidence intervals, reducing uncertainty and improving precision. In contrast, smaller sample sizes lead to wider intervals, increasing uncertainty and decreasing precision. This happens because the standard error of a sample proportion shrinks as the sample size grows, so sample size strongly influences the width of a confidence interval.

Exercise 12

Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Hint: Does changing the number of bootstrap samples affect the standard error?

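The width of the interval stays roughly the same as the number of bootstrap samples changes, because the standard error depends on the sample size (60), not on the number of resamples. Increasing the number of bootstrap samples only reduces simulation variability, resulting in more stable standard error estimates and interval bounds.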

---
title: "Lab 5 part II: Foundations for statistical inference - Confidence intervals"
author: "Laura B"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

### Getting Started

#### Load packages

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(infer)
```

#### The data

```{r}
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
```

```{r}
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()
```

```{r}
us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n /sum(n))
```

In this lab, you’ll start with a simple random sample of size 60 from the population.

```{r}
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
```
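
The sample above is drawn at random, so its proportions change each time the document is knit unless a seed is set first. The sketch below illustrates the idea in base R, assuming a plain character vector as a stand-in for `us_adults`; the seed value 42 and the names `population`, `samp_a`, and `samp_b` are hypothetical. The same `set.seed()` call placed before `sample_n()` should have a comparable effect, since dplyr draws from R's random number generator.

```r
# Stand-in for the us_adults population: 62,000 "Yes" and 38,000 "No".
population <- c(rep("Yes", 62000), rep("No", 38000))

# Setting the same seed before each draw reproduces the same sample.
set.seed(42)  # arbitrary seed, chosen only for illustration
samp_a <- sample(population, size = 60)

set.seed(42)
samp_b <- sample(population, size = 60)

# TRUE: both draws selected exactly the same 60 responses.
identical(samp_a, samp_b)

# Sample proportion of "Yes" responses in the reproducible sample.
mean(samp_a == "Yes")
```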



### Exercise 1

What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.

```{r}
samp %>%
  count(climate_change_affects) %>%
  mutate(p = n /sum(n))
```

Based on the results, 55% of the adults in my sample think that climate change affects their local community, while 45% do not. A majority of the sample therefore perceives climate change as having a local impact.

### Exercise 2

Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?

I wouldn't expect another student's sample proportion to be identical, as random samples can differ. However, I would expect it to be similar, especially if the sample sizes are close. In my sample, 55% think climate change affects their local community, so a classmate's random sample should yield a comparable proportion, though not exactly the same.


### Confidence intervals

This code will find the 95 percent confidence interval for the proportion of US adults who think climate change affects their local community.

```{r}
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
```


### Exercise 3

In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?

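"95% confidence" describes the method rather than any single interval: if we drew many random samples and built an interval from each one in the same way, about 95% of those intervals would contain the true population proportion.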

### Exercise 4

Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

```{r}
set.seed(9)
samp %>%
  count(climate_change_affects) %>%
  mutate(p=n/sum(n))
```

```{r}
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
```

My confidence interval ranges from 0.433 to 0.667, and it includes the true population proportion. The actual population proportion is 62% (0.62), which lies between the two bounds, so the true value is captured by the interval.

### Exercise 5

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

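Each student has a slightly different confidence interval because each one started from a different random sample. I would expect about 95% of those intervals to capture the true population proportion, since a 95% confidence level means the method captures the true value in about 95% of repeated samples.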

### Exercise 6

Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

```{r}
set.seed(9999)
# Draw 60 entries without replacement. Because samp has only 60 rows,
# this reorders samp rather than drawing a new sample.
sampled_entries <- sample_n(samp, size = 60)

# Compute p-hat: count the number of "Yes" responses, then divide by the sample size.
p_hat <- sum(sampled_entries$climate_change_affects == "Yes") / nrow(sampled_entries)
p_hat

```

```{r}
ggplot(sampled_entries, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Observation over sample proportion"
  ) +
  coord_flip()
```
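
The question asks what share of 50 intervals capture the true proportion. As a rough stand-in for the app, the following base R sketch simulates that experiment; it is not the app's code, and the names `one_interval`, `intervals`, and `coverage`, along with the seed, are hypothetical. It assumes the population proportion of 0.62 from `us_adults` and uses percentile bootstrap intervals.

```r
set.seed(2024)  # arbitrary seed, chosen only for reproducibility

true_p <- 0.62  # population proportion in us_adults
n      <- 60    # sample size
reps   <- 1000  # bootstrap resamples per interval
n_ci   <- 50    # number of intervals to build
level  <- 0.95  # confidence level

one_interval <- function() {
  # One random sample (TRUE = "Yes"). The population is large, so drawing
  # each response with probability true_p is a close approximation.
  one_samp <- runif(n) < true_p
  # Bootstrap: resample with replacement and record each resample's proportion.
  boot_props <- replicate(reps, mean(sample(one_samp, size = n, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# One row per interval, with lower and upper percentile bounds.
intervals <- t(replicate(n_ci, one_interval()))
colnames(intervals) <- c("lower", "upper")

# Share of the 50 intervals that contain the true proportion.
captured <- intervals[, "lower"] <= true_p & true_p <= intervals[, "upper"]
coverage <- mean(captured)
coverage
```

With only 50 intervals, the observed share can only take values in steps of 2%, so it will usually be close to 95% without matching it exactly.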


### Exercise 7

Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.

If the confidence level is higher than 95%, the interval becomes wider because we need more certainty that the true population value is within the range.

In contrast, a lower confidence level (e.g., 90%) results in a narrower interval since we're accepting less certainty and a higher chance of error, so the margin for capturing the true value is smaller.
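
The relationship between confidence level and width can be checked with a short calculation. This sketch uses the normal approximation rather than the bootstrap, and assumes a sample proportion of 0.55 with n = 60 (the values from the knitted output); the names `levels`, `z`, and `widths` are hypothetical.

```r
p_hat <- 0.55  # sample proportion, assumed from the knitted output
n     <- 60    # sample size

# Standard error of the sample proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical values and interval widths for several confidence levels.
levels <- c(0.85, 0.90, 0.95, 0.99)
z      <- qnorm((1 + levels) / 2)
widths <- 2 * z * se

data.frame(level = levels, z = round(z, 3), width = round(widths, 3))
```

The standard error is the same in every row; only the critical value grows with the confidence level, which is what widens the interval.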

### Exercise 8

Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

```{r}
# 85% confidence interval via the normal approximation (not the infer bootstrap)
prop <- mean(samp$climate_change_affects == "Yes")  # sample proportion
se <- sqrt(prop * (1 - prop) / nrow(samp))          # standard error of the proportion
z_score <- qnorm((1 + 0.85) / 2)                    # critical value for 85% confidence
margin_error <- z_score * se

lower <- prop - margin_error
upper <- prop + margin_error
# Print the confidence interval
cat("Confidence Interval: (", lower, ", ", upper, ")\n")
```


### Exercise 9

Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

At an 85% confidence level, I would expect about 85% of the 50 intervals to contain the true population proportion, a lower percentage than at the 95% level. As mentioned earlier, the interval becomes narrower as the confidence level decreases, so more intervals miss the true value.

### Exercise 10

Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.

```{r}
set.seed(1)
samp %>%
  count(climate_change_affects) %>%
  mutate(p=n/sum(n))
```

```{r}
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.85)
```


### Exercise 11

Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

Larger sample sizes result in narrower confidence intervals, reducing uncertainty and improving precision. In contrast, smaller sample sizes lead to wider intervals, increasing uncertainty and decreasing precision. This happens because the standard error of a sample proportion shrinks as the sample size grows, so sample size strongly influences the width of a confidence interval.
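
The effect of sample size can be made concrete with the standard error formula for a proportion, sqrt(p(1 - p) / n). The sketch below is an illustration under the normal approximation, assuming p = 0.62 from the population; the sample sizes and the names `sizes` and `approx_width` are hypothetical choices.

```r
p     <- 0.62                 # population proportion in us_adults
sizes <- c(15, 60, 240, 960)  # each sample size is 4 times the previous one

# Standard error of the sample proportion at each sample size.
se <- sqrt(p * (1 - p) / sizes)

# Approximate width of a 95% interval: 2 * critical value * standard error.
approx_width <- 2 * qnorm(0.975) * se

data.frame(n = sizes, se = round(se, 4), approx_width = round(approx_width, 3))
```

Quadrupling the sample size halves the standard error, so the interval width shrinks with the square root of n rather than with n itself.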

### Exercise 12

Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Hint: Does changing the number of bootstrap samples affect the standard error?

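The width of the interval stays roughly the same as the number of bootstrap samples changes, because the standard error depends on the sample size (60), not on the number of resamples. Increasing the number of bootstrap samples only reduces simulation variability, resulting in more stable standard error estimates and interval bounds.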
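
One way to check this is to bootstrap the same sample with different numbers of resamples and compare the interval widths. The sketch below uses base R rather than infer, and assumes a sample with 33 "Yes" and 27 "No" responses (the counts from the knitted output); `samp_yes`, `boot_width`, and `reps_grid` are hypothetical names.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

# Stand-in for samp: 33 "Yes" (TRUE) and 27 "No" (FALSE) responses.
samp_yes <- c(rep(TRUE, 33), rep(FALSE, 27))

# Width of a percentile bootstrap interval built from `reps` resamples.
boot_width <- function(reps, level = 0.95) {
  boot_props <- replicate(reps, mean(sample(samp_yes, replace = TRUE)))
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  ci[2] - ci[1]
}

reps_grid <- c(100, 1000, 10000)
widths    <- sapply(reps_grid, boot_width)

data.frame(reps = reps_grid, width = round(widths, 3))
```

The widths should stay near the same value for every number of resamples, with the smaller settings simply giving noisier estimates of it.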

