Foundations for statistical inference - Confidence intervals
If you have access to data on an entire population, say the opinion of every adult in the United States on whether or not they think climate change is affecting their local community, it’s straightforward to answer questions like, “What percent of US adults think climate change is affecting their local community?”. Similarly, if you had demographic information on the population you could examine how, if at all, this opinion varies among young and old adults and adults with different leanings. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for this proportion if you only have data from a small sample of adults? This type of situation requires that you use your sample to make inference on what your population looks like.
Setting a seed: You will take random samples and build sampling distributions in this lab, which means you should set a seed on top of your lab. If this concept is new to you, review the lab on probability.
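For example, a single call near the top of the lab is enough to make the random samples and bootstrap resamples reproducible (the seed value below is arbitrary):
set.seed(1234)  # any fixed integer works; this one is just an example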
Getting Started
Load packages
In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer.
Let’s load the packages.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
library(dbplyr)
##
## Attaching package: 'dbplyr'
##
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(airports)
library(cherryblossom)
library(usdata)
The data
A 2019 Pew Research report states the following:
Roughly six-in-ten U.S. adults (62%) say climate change is currently affecting their local community either a great deal or some, according to a new Pew Research Center survey.
Source: Most Americans say climate change impacts their community, but effects vary by region
In this lab, you will assume this 62% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population. To keep our computation simple, we will assume a total population size of 100,000 (even though that’s smaller than the population size of all US adults). We will first create this population, which means 62,000 (62%) of the adult population think climate change impacts their community, and the remaining 38,000 do not.
us_adults <- tibble(
climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
The name of the data frame is us_adults and the name of the variable that contains responses to the question “Do you think climate change is affecting your local community?” is climate_change_affects.
We can quickly visualize the distribution of these responses using a bar plot.
ggplot(us_adults, aes(x = climate_change_affects)) +
geom_bar() +
labs(
x = "", y = "",
title = "Do you think climate change is affecting your local community?"
) +
coord_flip()
We can also obtain summary statistics to confirm we constructed the data frame correctly.
us_adults %>%
count(climate_change_affects) %>%
mutate(p = n /sum(n))
## # A tibble: 2 × 3
## climate_change_affects n p
## <chr> <int> <dbl>
## 1 No 38000 0.38
## 2 Yes 62000 0.62
Next, we take a simple random sample of size 60 from this population.
n <- 60
samp <- us_adults %>%
sample_n(size = n)
samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.517 0.75
1. What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community. In my sample, about 63% of adults think climate change affects their local community, while about 37% do not.
set.seed(9)
samp %>%
count(climate_change_affects) %>%
mutate(p=n/sum(n))
## # A tibble: 2 × 3
## climate_change_affects n p
## <chr> <int> <dbl>
## 1 No 22 0.367
## 2 Yes 38 0.633
2. Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?
I would not expect another student’s sample proportion to be identical to mine, since each of us drew a different random sample. I would expect it to be similar, though, because every sample comes from the same population, where the true proportion is 62%.
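One quick way to see this without another student’s data is to draw a second random sample of the same size and compare proportions; a minimal sketch (samp2 is just an illustrative name):
samp2 <- us_adults %>%
  sample_n(size = n)
samp2 %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
# The "Yes" proportion will usually be close to, but not identical to, the one in samp.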
Confidence intervals
Return for a moment to the question that first motivated this lab: based on this sample, what can you infer about the population? With just one sample, the best estimate of the proportion of US adults who think climate change affects their local community would be the sample proportion, usually denoted as p̂ (here we are calling it p_hat). That serves as a good point estimate, but it would be useful to also communicate how uncertain you are of that estimate. This uncertainty can be quantified using a confidence interval.
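A minimal way to compute this point estimate from samp (the name p_hat simply mirrors the text; any name works):
p_hat <- samp %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n)) %>%
  filter(climate_change_affects == "Yes") %>%
  pull(p)
p_hat  # the sample proportion of "Yes" responses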
This code will find the 95 percent confidence interval for the proportion of US adults who think climate change affects their local community.
samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.517 0.767
Confidence levels
3. In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?
Being “95% confident” means that if we repeatedly took random samples from this population and built a confidence interval from each sample using this same procedure, about 95% of those intervals would capture the true population proportion.
4. Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
My 95% confidence interval runs from 0.517 to 0.767. The true population proportion is 62%, and 0.62 falls between these two bounds, so my interval does capture the true population proportion.
set.seed(12)
samp %>%
count(climate_change_affects) %>%
mutate(p=n/sum(n))
## # A tibble: 2 × 3
## climate_change_affects n p
## <chr> <int> <dbl>
## 1 No 22 0.367
## 2 Yes 38 0.633
samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.517 0.75
5. Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?
Each student’s interval comes from a different random sample, so the intervals vary slightly. Since every interval was constructed at the 95% confidence level, I would expect about 95% of them to capture the true population proportion.
6. Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.
Based on my plot, about 95% of my 50 confidence intervals included the true population proportion. This proportion is close to, but need not be exactly equal to, the confidence level: with only 50 intervals, the observed coverage fluctuates around 95% from one set of samples to the next.
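The plot itself comes from the lab’s interactive app and is not reproduced here, but the same check can be sketched directly in R; the code below is only a rough simulation of what the app does, not its actual implementation:
# Draw 50 samples of size 60, build a 95% bootstrap interval from each,
# and record whether each interval contains the true proportion (0.62).
coverage <- replicate(50, {
  ci <- us_adults %>%
    sample_n(size = 60) %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.95)
  ci$lower_ci <= 0.62 & 0.62 <= ci$upper_ci
})
mean(coverage)  # usually close to, but not exactly, 0.95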
More Practice
7. Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.
I chose a lower confidence level (90%). I would expect that interval to be narrower than the 95% interval: by lowering the confidence level we accept a greater chance of failing to capture the true population proportion, so the interval does not need to reach as far in either direction.
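One way to check this expectation in code is to save a single bootstrap distribution and request intervals at both levels; a short sketch (boot_dist is just an illustrative name):
boot_dist <- samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")
get_ci(boot_dist, level = 0.95)
get_ci(boot_dist, level = 0.90)  # narrower than the 95% interval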
8. Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.
Using a 90% confidence level, my interval runs from 0.533 to 0.733. We can be 90% confident that the true proportion of US adults who think climate change is affecting their local community lies between these two values.
set.seed(9)
samp %>%
count(climate_change_affects) %>%
mutate(p=n/sum(n))
## # A tibble: 2 × 3
## climate_change_affects n p
## <chr> <int> <dbl>
## 1 No 22 0.367
## 2 Yes 38 0.633
samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.90)
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.533 0.733
9. Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?
I kept the same settings as before: a sample size of 60 and 1000 bootstrap samples per interval, now at a 90% confidence level. From the app’s plot, only 3 of the 50 intervals missed the true population proportion, so 47/50, or 94%, captured it. This is reasonably close to the 90% confidence level chosen.
10. Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.
I chose an 85% confidence level and expected the interval to be narrower than the 90% interval. Based on the output below, we can be 85% confident that the true proportion of US adults who think climate change is affecting their local community is between 0.55 and 0.733. Using the app at the 85% level, 38 of my 50 intervals (76%) captured the true population proportion.
set.seed(10)
samp %>%
count(climate_change_affects) %>%
mutate(p=n/sum(n))
## # A tibble: 2 × 3
## climate_change_affects n p
## <chr> <int> <dbl>
## 1 No 22 0.367
## 2 Yes 38 0.633
samp %>%
specify(response = climate_change_affects, success = "Yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.85)
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.55 0.733
11. Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).
I experimented with sample sizes of 70, 80, and 100. Across these, the proportion of intervals capturing the true value stayed roughly at the confidence level, but the widths changed: the confidence intervals get narrower as the sample size increases and wider as it decreases. A code sketch of the same pattern is given below.
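Without the app, the same pattern can be checked directly; the sample sizes below are arbitrary illustrations:
# Compare 95% bootstrap interval widths for a few sample sizes.
for (size in c(30, 60, 300)) {
  ci <- us_adults %>%
    sample_n(size = size) %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.95)
  print(paste("n =", size, "width =", round(ci$upper_ci - ci$lower_ci, 3)))
}
# Larger samples give narrower intervals; smaller samples give wider ones.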
12. Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Hint: Does changing the number of bootstrap samples affect the standard error?
Increasing the number of bootstrap samples does not make the interval meaningfully narrower. The standard error of the sample proportion is driven by the sample size (here, 60), not by the number of bootstrap replicates; taking more bootstrap samples only gives a more stable estimate of that standard error, so the interval endpoints settle down but the width stays about the same. A quick check of this is sketched below.
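A quick check (the replicate counts below are arbitrary choices):
# The interval width barely changes with the number of bootstrap replicates,
# because the spread of the bootstrap distribution reflects the sample size.
for (reps in c(500, 1000, 5000)) {
  ci <- samp %>%
    specify(response = climate_change_affects, success = "Yes") %>%
    generate(reps = reps, type = "bootstrap") %>%
    calculate(stat = "prop") %>%
    get_ci(level = 0.95)
  print(paste("reps =", reps, "width =", round(ci$upper_ci - ci$lower_ci, 3)))
}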