Getting Started
Loading packages
library(tidyverse)
library(openintro)
library(infer)
set.seed(99)
Creating the data
We will assume a total population size of 100,000 even though that’s
much smaller than the population of all US adults to keep our
computations simple. The proportion of interest is: Roughly six-in-ten
U.S. adults (62%) say climate change is currently affecting their local
community either a great deal or some, according to a new Pew Research
Center survey.
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
Visualization of distribution of responses:
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()

Obtaining summary statistics:
us_adults %>% count(climate_change_affects) %>% mutate(p = n/sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                     38000  0.38
## 2 Yes                    62000  0.62
Starting with a sample size of 60:
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
2. Would you expect another student’s sample proportion to be
identical to yours? Would you expect it to be similar? Why or why
not?
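I would expect it to be similar, but not identical. Both samples are
drawn at random from the same population, so each sample proportion
should land near the population proportion of 62%. Random sampling
variation means two samples of 60 will rarely contain exactly the same
number of "Yes" responses.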
Confidence Intervals
Finding the 95% confidence interval for proportion of US adults who
think climate change affects their local community
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.533    0.767
Confidence levels
1. In the interpretation above, we used the phrase “95% confident”.
What does “95% confidence” mean?
We are 95% confident that the true proportion of U.S. adults who
think that climate change is affecting their local community is between
0.533 and 0.767. "95% confidence" describes the method rather than this
one interval: if we repeatedly drew samples of 60 and built an interval
from each in the same way, about 95% of those intervals would contain
the true population proportion.
3. Each student should have gotten a slightly different confidence
interval. What proportion of those intervals would you expect to capture
the true population mean? Why?
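I would expect about 95% of these intervals to capture the true
population proportion. Each interval was built with a method that, over
many repeated samples, produces intervals containing the true proportion
about 95% of the time, which is what the 95% confidence level refers to.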
Given a sample size of 60, 1000 bootstrap samples for each interval,
and 50 confidence intervals constructed (the default values for the
above app), what proportion of your confidence intervals include the
true population proportion? Is this proportion exactly equal to the
confidence level? If not, explain why. Make sure to include your plot in
your answer.

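In this simulation, 48 out of 50 (96%) of the confidence intervals
constructed included the true population proportion. This is consistent
with our 95% confidence level but not exactly equal to it. With only 50
intervals the proportion can only move in steps of 2%, so exactly 95%
(47.5 intervals) is impossible, and random sampling variation means the
count changes from run to run.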
More Practice
1. If we chose a confidence level lower than 95%, the interval would
be narrower. A lower confidence level keeps a smaller middle portion of
the bootstrap distribution, so the bounds move closer together. The
trade-off is that intervals built this way capture the true proportion
less often than 95% of the time.
2. Using code from the infer package and data from the one sample
you have (samp), find a confidence interval for the proportion of US
Adults who think climate change is affecting their local community with
a confidence level of your choosing (other than 95%) and interpret
it.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.90)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.533     0.75
We are 90% confident that the true proportion of US adults who think
that climate change is affecting their local communities is between
0.533 and 0.750.
3. Using the app, calculate 50 confidence intervals at the
confidence level you chose in the previous question, and plot all
intervals on one plot, and calculate the proportion of intervals that
include the true population proportion. How does this percentage compare
to the confidence level selected for the intervals?
I utilized a 90% confidence level. The proportion of intervals that
included the true proportion was 44/50, which is 0.88, or 88%. This is
very close to our confidence level.
4. Lastly, try one more (different) confidence level. First, state
how you expect the width of this interval to compare to previous ones
you calculated. Then, calculate the bounds of the interval using the
infer package and data from samp and interpret it. Finally, use the app
to generate many intervals and calculate the proportion of intervals
that are capture the true population proportion.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.80)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.567    0.733
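I expected the 80% interval to be narrower than the 90% and 95%
intervals, and it is: we are 80% confident that the true proportion of
US adults who think climate change is affecting their local community is
between 0.567 and 0.733. The app is consistent with this bootstrapping
result: 41 out of 50 confidence intervals (82%) captured the true
population proportion, which is close to the 80% confidence level.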
6. Finally, given a sample size (say, 60), how does the width of the
interval change as you increase the number of bootstrap samples.
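Widths do not seem to change as we adjust the number of bootstrap
samples. The width is driven by the sample size and the confidence
level; more bootstrap samples only make the estimated bounds more
stable.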
---
title: "MATH217 Lab Hw"
author: "Ibrahim Pinzon Perez"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

## Getting Started

### Loading packages

```{r, message=FALSE}
library(tidyverse)
library(openintro)
library(infer)
set.seed(99)
```

### Creating the data

We will assume a total population size of 100,000 even though that's much smaller than the population of all US adults to keep our computations simple. The proportion of interest is: Roughly six-in-ten U.S. adults (62%) say climate change is currently affecting their local community either a great deal or some, according to a new Pew Research Center survey.

```{r}
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
```

Visualization of distribution of responses: 

```{r}
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y ="",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()
```

Obtaining summary statistics: 

```{r}
us_adults %>% count(climate_change_affects) %>% mutate(p = n/sum(n))
```

Starting with a sample size of 60:

```{r}
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
```

#### 1. What percent of the adults in your sample think climate change affects their local community?

```{r}
samp %>% count(climate_change_affects) %>% mutate(p = n/sum(n))
```

About 55% of adults in this sample think climate change affects their local community.

#### 2. Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?

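I would expect it to be similar, but not identical. Both samples are drawn at random from the same population, so each sample proportion should land near the population proportion of 62%. Random sampling variation means two samples of 60 will rarely contain exactly the same number of "Yes" responses.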
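
A quick simulation can illustrate this. The sketch below uses base R only and imagines a hypothetical class of 10 students, each drawing their own sample of 60 from the same population. The class size, the seed, and the names `population` and `p_hats` are assumptions made for illustration and are not part of the lab.

```r
# Rebuild the population as a plain character vector (62% "Yes").
population <- c(rep("Yes", 62000), rep("No", 38000))

set.seed(217)  # arbitrary seed, chosen only so the sketch is repeatable

# Each hypothetical student draws 60 adults without replacement
# and records the proportion answering "Yes".
n_students <- 10
p_hats <- replicate(n_students, mean(sample(population, size = 60) == "Yes"))

round(p_hats, 3)  # the individual sample proportions
range(p_hats)     # the lowest and highest proportion in the class
```

The proportions should cluster around 0.62 without all being equal, which is the "similar but not identical" behaviour described above.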

## Confidence Intervals

Finding the 95% confidence interval for proportion of US adults who think climate change affects their local community

```{r}
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
```
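
To see what this pipeline is doing, the same idea can be sketched by hand in base R. This assumes `get_ci()` is returning a percentile interval here (the middle 95% of the bootstrap proportions); the names `samp_vec`, `boot_props`, and `ci_95` and the seed are made up for this sketch, so the numbers will not match the `infer` output exactly.

```r
set.seed(217)  # arbitrary seed for repeatability

# Stand-in for `samp`: 60 draws from the lab's population.
population <- c(rep("Yes", 62000), rep("No", 38000))
samp_vec <- sample(population, size = 60)

# Resample the 60 observations with replacement 1000 times,
# recording the proportion of "Yes" in each bootstrap resample.
boot_props <- replicate(
  1000,
  mean(sample(samp_vec, size = length(samp_vec), replace = TRUE) == "Yes")
)

# Percentile interval: the 2.5th and 97.5th percentiles of the
# bootstrap proportions bound the middle 95%.
ci_95 <- quantile(boot_props, probs = c(0.025, 0.975))
ci_95
```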

### Confidence levels

#### 1. In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?

We are 95% confident that the true proportion of U.S. adults who think that climate change is affecting their local community is between 0.533 and 0.767. "95% confidence" describes the method rather than this one interval: if we repeatedly drew samples of 60 and built an interval from each in the same way, about 95% of those intervals would contain the true population proportion.

#### 2. Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

Yes. The interval runs from 0.533 to 0.767, and the true population proportion of 0.62 lies inside it. A neighbor's interval will most likely capture 0.62 as well, but this is not guaranteed: about 5% of intervals built this way are expected to miss.

#### 3. Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

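I would expect about 95% of these intervals to capture the true population proportion. Each interval was built with a method that, over many repeated samples, produces intervals containing the true proportion about 95% of the time, which is what the 95% confidence level refers to.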
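
This long-run claim can be checked with a rough simulation. The sketch below is base R and rests on two simplifying assumptions: drawing the number of "Yes" responses from a binomial distribution approximates sampling 60 adults from the 100,000, and for a Yes/No variable a bootstrap resample of a sample with proportion `p_hat` has a Binomial(n, `p_hat`) count of "Yes". The number of simulated students (1000) and the seed are arbitrary.

```r
set.seed(217)  # arbitrary seed for repeatability

p_true <- 0.62        # population proportion from the lab
n <- 60               # sample size
n_intervals <- 1000   # hypothetical number of students

covers <- replicate(n_intervals, {
  p_hat <- rbinom(1, n, p_true) / n          # one student's sample proportion
  boot_props <- rbinom(1000, n, p_hat) / n   # 1000 bootstrap proportions
  ci <- quantile(boot_props, c(0.025, 0.975))
  ci[[1]] <= p_true && p_true <= ci[[2]]     # did this interval capture 0.62?
})

coverage <- mean(covers)  # share of intervals that captured the truth
coverage
```

The share should come out near 0.95, though percentile bootstrap intervals from small samples can cover slightly less often than the nominal level.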

#### Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

```{r, echo = FALSE}
knitr::include_graphics("/Users/ibrahimpinzon/Desktop/CI_Math217.png")
```

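In this simulation, 48 out of 50 (96%) of the confidence intervals constructed included the true population proportion. This is consistent with our 95% confidence level but not exactly equal to it. With only 50 intervals the proportion can only move in steps of 2%, so exactly 95% (47.5 intervals) is impossible, and random sampling variation means the count changes from run to run.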
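
A short calculation shows how much the count can be expected to vary. It assumes, as an idealization, that each of the 50 intervals independently captures the true proportion with probability exactly 0.95, so the number of captures follows a Binomial(50, 0.95) distribution. The variable names are made up for this sketch.

```r
n_intervals <- 50
conf_level <- 0.95

# Expected number of intervals that capture the truth, and its spread.
expected_hits <- n_intervals * conf_level                     # 47.5
sd_hits <- sqrt(n_intervals * conf_level * (1 - conf_level))  # about 1.54

# Probability of exactly 48 captures, and of anywhere from 46 to 49.
p_exactly_48 <- dbinom(48, n_intervals, conf_level)
p_46_to_49 <- sum(dbinom(46:49, n_intervals, conf_level))

c(expected = expected_hits, sd = sd_hits,
  exactly_48 = p_exactly_48, from_46_to_49 = p_46_to_49)
```

Under this assumption, 48 captures is within one standard deviation of the expected 47.5, so it is an ordinary result rather than a surprising one.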

## More Practice

#### 1. If we chose a confidence level lower than 95%, the interval would be narrower. A lower confidence level keeps a smaller middle portion of the bootstrap distribution, so the bounds move closer together. The trade-off is that intervals built this way capture the true proportion less often than 95% of the time.
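
The effect of the confidence level on width can be sketched in base R by cutting one bootstrap distribution at different percentiles. The sample here is a stand-in drawn with an arbitrary seed, and `samp_vec`, `boot_props`, `conf_levels`, and `widths` are names invented for this illustration.

```r
set.seed(217)  # arbitrary seed for repeatability

# Stand-in sample of 60 and its bootstrap distribution of proportions.
population <- c(rep("Yes", 62000), rep("No", 38000))
samp_vec <- sample(population, size = 60)
boot_props <- replicate(1000, mean(sample(samp_vec, replace = TRUE) == "Yes"))

# Width of the percentile interval at each confidence level.
conf_levels <- c(0.80, 0.90, 0.95)
widths <- sapply(conf_levels, function(level) {
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))
  unname(ci[2] - ci[1])
})
names(widths) <- paste0(conf_levels * 100, "%")
widths
```

Because every interval is cut from the same bootstrap distribution, a higher level can only push the bounds outward, so the widths never decrease as the level rises.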

#### 2. Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

```{r}
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "prop") %>%
  get_ci(level = 0.90)
```

We are 90% confident that the true proportion of US adults who think that climate change is affecting their local communities is between 0.533 and 0.750.

#### 3. Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

I utilized a 90% confidence level. The proportion of intervals that included the true proportion was 44/50, which is 0.88, or 88%. This is very close to our confidence level.

#### 4. Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that are capture the true population proportion.

```{r}
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "prop") %>%
  get_ci(level = 0.80)
```

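I expected the 80% interval to be narrower than the 90% and 95% intervals, and it is: we are 80% confident that the true proportion of US adults who think climate change is affecting their local community is between 0.567 and 0.733. The app is consistent with this bootstrapping result: 41 out of 50 confidence intervals (82%) captured the true population proportion, which is close to the 80% confidence level.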

#### 5. Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

We notice that as we increase the sample size, the width of the intervals decreases. The contrary occurs when we decrease sample size. 
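
The same pattern can be reproduced outside the app. The base R sketch below treats each sample's count of "Yes" as binomial (an approximation to sampling from the 100,000) and uses `rbinom()` in place of explicit bootstrap resampling. The sample sizes, the seed, and the helper `ci_width()` are choices made for this illustration.

```r
set.seed(217)  # arbitrary seed for repeatability

p_true <- 0.62
sample_sizes <- c(30, 60, 240, 960)

# Width of a 95% percentile bootstrap interval for one sample of size n.
ci_width <- function(n, reps = 1000, level = 0.95) {
  p_hat <- rbinom(1, n, p_true) / n         # sample proportion
  boot_props <- rbinom(reps, n, p_hat) / n  # bootstrap proportions
  alpha <- 1 - level
  ci <- quantile(boot_props, c(alpha / 2, 1 - alpha / 2))
  unname(ci[2] - ci[1])
}

widths <- sapply(sample_sizes, ci_width)
data.frame(n = sample_sizes, width = round(widths, 3))
```

Since the standard error of a proportion scales with one over the square root of n, quadrupling the sample size should roughly halve the width.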

#### 6. Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples.

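Widths do not seem to change as we adjust the number of bootstrap samples. The width is driven by the sample size and the confidence level; more bootstrap samples only make the estimated bounds more stable.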
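
A rough check of this in base R, assuming a hypothetical sample with 37 "Yes" responses out of 60 and using `rbinom()` as a shortcut for resampling a Yes/No variable. The grid of bootstrap sizes, the 20 repetitions per setting, and the names `width_for_reps()` and `results` are all made up for this sketch.

```r
set.seed(217)  # arbitrary seed for repeatability

n <- 60
p_hat <- 37 / 60  # hypothetical sample proportion (assumption)

# Width of a 95% percentile interval built from `reps` bootstrap samples.
width_for_reps <- function(reps) {
  boot_props <- rbinom(reps, n, p_hat) / n
  ci <- quantile(boot_props, c(0.025, 0.975))
  unname(ci[2] - ci[1])
}

# Repeat each setting 20 times to see the typical width and its variability.
reps_grid <- c(100, 1000, 10000)
results <- sapply(reps_grid, function(reps) {
  w <- replicate(20, width_for_reps(reps))
  c(mean_width = mean(w), sd_width = sd(w))
})
colnames(results) <- reps_grid
round(results, 3)
```

The average width should stay about the same across the three settings, while the run-to-run variability in the width tends to shrink as the number of bootstrap samples grows.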