library(tidyverse)
library(openintro)
library(infer)
var_seed <- 3421
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.
us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                     38000  0.38
## 2 Yes                    62000  0.62
set.seed(var_seed)
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
samp %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                        23 0.383
## 2 Yes                       37 0.617
In our sample, 61.7% of people think that climate change affects their local community, compared to the population value of 62%.
Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?
I would not expect another student’s sample proportion to be identical to mine, but it should be similar, since each sample proportion tends to fall near the true population value.
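To illustrate this sample-to-sample variability, here is a minimal sketch that draws a second sample of 60 from us_adults and recomputes the proportion; the seed (9999) and the object name samp2 are arbitrary choices for illustration. The resulting proportion should be close to, but usually not exactly equal to, 0.617.
# Draw a second sample of the same size with a different seed to see how
# much the sample proportion varies from sample to sample.
set.seed(9999)  # arbitrary seed, chosen only for illustration
samp2 <- us_adults %>%
  sample_n(size = 60)

samp2 %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))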
In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?
set.seed(var_seed)
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1      0.5    0.734
A confidence interval is a range of plausible values, computed from the sample, that is constructed to capture an unknown population parameter.
Saying we are “95% confident” is a statement about the method rather than about any single interval: if we repeated the sampling and built an interval in the same way many times, about 95% of those intervals would contain the true population proportion.
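One way to see what the interval describes is to plot the bootstrap distribution with the interval shaded on it, using infer’s visualize() and shade_confidence_interval(). This is a sketch that reuses samp and var_seed from above and stores the bootstrap distribution and interval under the assumed names boot_dist and ci_95.
# Keep the bootstrap distribution so the same simulated proportions can be
# used for both the interval and the plot.
set.seed(var_seed)
boot_dist <- samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

ci_95 <- boot_dist %>%
  get_ci(level = 0.95)

# Shade the 95% interval on top of the bootstrap distribution.
visualize(boot_dist) +
  shade_confidence_interval(endpoints = ci_95)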
Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
For the sample above, the confidence interval is 0.500 to 0.734, and the true population proportion of US adults who think climate change affects their local community (0.62) lies within it.
If I were working on this lab in a classroom, my neighbor’s interval, built from their own sample at the 95% confidence level, would also have a 95% chance of capturing the true population proportion, so it would most likely, though not certainly, contain 0.62 as well.
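As a quick programmatic check, this small sketch reuses ci_95 from the plotting sketch above and stores the known population proportion in a hypothetical p_true.
# TRUE if the 95% interval contains the known population proportion.
p_true <- 0.62
ci_95 %>%
  mutate(captures_p = lower_ci <= p_true & p_true <= upper_ci)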
Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population proportion? Why?
Each interval was constructed at the 95% confidence level, so each one has a 95% chance of capturing the true population proportion. I would therefore expect about 95% of the students’ intervals to capture it.
Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.
At the 95% confidence level, 3 of the 50 simulated intervals did not include the true population proportion; in other words, 47/50 = 94% of the intervals captured it.
This proportion is close to the 95% confidence level but not exactly equal to it. Each simulated interval has a 95% chance of capturing the true population proportion, so the observed coverage in any particular set of 50 intervals varies around 95%; in my run, 3 intervals happened to miss.
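The app’s simulation can be approximated in code. The sketch below is an assumption about what the app does, not its actual implementation: it draws 50 independent samples of size 60, builds a 95% bootstrap interval from each (this takes a little while to run), and reports the fraction of intervals that capture the true proportion of 0.62. The object names fifty_cis and p_true are illustrative, and the resulting coverage should land near, though not necessarily exactly at, 95%.
set.seed(var_seed)
p_true <- 0.62

# Build a 95% bootstrap interval from each of 50 independent samples of size 60.
fifty_cis <- map_dfr(1:50, function(i) {
  us_adults %>%
    sample_n(size = 60) %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.95)
})

# Fraction of the 50 intervals that capture the true proportion.
fifty_cis %>%
  summarise(prop_capturing = mean(lower_ci <= p_true & p_true <= upper_ci))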
Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.
If I increased the confidence level, I would expect the confidence interval to be wider, because a higher confidence level needs a wider range of values to give the true population parameter a greater chance of falling inside it. Conversely, the lower 90% confidence level I chose below should produce a narrower interval than the 95% one.
Below is the image for sample size = 60, 1000 bootstrap samples for each interval, 50 confidence intervals constructed, 95% confidence level.
Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.
set.seed(var_seed)
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.90)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.517    0.717
I chose a 90% confidence level, and the bounds were somewhat narrower, at 0.517 to 0.717. Thus, we can be 90% confident that the true population proportion lies between 0.517 and 0.717.
Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?
43 of the 50 confidence intervals, or 86%, included the true population proportion, which is close to my chosen confidence level of 90%.
Below is the image for sample size = 60, 1000 bootstrap samples for each interval, 50 confidence intervals constructed, 90% confidence level.
Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.
set.seed(var_seed)
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.80)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.533      0.7
I expect this interval to be even narrower, since the confidence level is lower still at 80%. As expected, the interval narrows to 0.533 to 0.700.
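The narrowing can be made explicit by computing all three intervals from the same bootstrap distribution and comparing their widths. This sketch assumes boot_dist from the earlier plotting sketch is still in memory; if not, it can be regenerated the same way.
# Interval width at each confidence level, from the same bootstrap distribution.
tibble(level = c(0.95, 0.90, 0.80)) %>%
  mutate(ci = map(level, ~ get_ci(boot_dist, level = .x))) %>%
  unnest(ci) %>%
  mutate(width = upper_ci - lower_ci)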
Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).
The width of the confidence intervals decreased as the sample size increased from 60 to 600, and it would increase if the sample size decreased, because the standard error of the sample proportion shrinks as the sample size grows (see the sketch below the image).
Below is the image for sample size = 600, 1000 bootstrap samples for each interval, 50 confidence intervals constructed, 90% confidence level.
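The sample-size effect can also be checked in code rather than only in the app. This sketch uses a hypothetical helper ci_width_for_n() to build one 90% bootstrap interval at n = 60 and one at n = 600 and compare their widths; the larger sample should give the narrower interval.
set.seed(var_seed)

# One 90% bootstrap interval for a given sample size, with its width.
ci_width_for_n <- function(n_size) {
  us_adults %>%
    sample_n(size = n_size) %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.90) %>%
    mutate(n = n_size, width = upper_ci - lower_ci)
}

bind_rows(ci_width_for_n(60), ci_width_for_n(600))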
Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples. Hint: Does changing the number of bootstrap samples affect the standard error?
In my example with a sample size of 60, increasing the number of bootstrap samples from 10 to 100 produced no visible change in the confidence intervals; most intervals stay roughly in the range of 0.4 to 0.8. This is because the number of bootstrap samples does not affect the standard error, which is driven by the sample size (a sketch comparing bootstrap standard errors follows the images below).
Below is the image for sample size = 60, 10 bootstrap samples for each interval, 50 confidence intervals constructed, 90% confidence level.
Below is the image for sample size = 60, 100 bootstrap samples for each interval, 50 confidence intervals constructed, 90% confidence level.
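Following the hint, here is a sketch that estimates the bootstrap standard error (the standard deviation of the bootstrap proportions) from the same sample of 60 using different numbers of replicates. The helper name boot_se() and the replicate counts are illustrative; the estimates should be similar across replicate counts, because the standard error is driven by the sample size, not by the number of bootstrap samples.
set.seed(var_seed)

# Standard deviation of the bootstrap proportions for a given number of replicates.
boot_se <- function(n_reps) {
  boot <- samp %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = n_reps, type = "bootstrap") %>%
    calculate(stat = "prop")
  tibble(reps = n_reps, se = sd(boot$stat))
}

bind_rows(boot_se(100), boot_se(1000), boot_se(10000))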