Please indicate who you collaborated with on this problem set:
First load the necessary packages:
library(dplyr)
library(ggplot2)
library(janitor)
library(moderndive)
library(infer)
library(tidyverse)
library(readr)For this exercise, we will again use data from the general social survey, an annual personal-interview survey conducted in the United States. The survey is designed to monitor changes in both social characteristics and attitudes. You will work with a small sample from one neighborhood to practice generating a bootstrap sampling distribution, and confidence intervals for numeric and categorical variables.
You have only a small sample of the data from the gss survey, specifically data on age, race race, and number of hours of TV watched a day tvhours for 100 respondents in 2014. The following will read in the data:
gss_sample <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSypSoDCMH2N76Vo2dZRPkw2q3t1mbvAXlOtgPDIsHg4NclAQFmER-BdvXH9_lrT40UQCVdPXOi_NMJ/pub?gid=257689625&single=true&output=csv")Use the View() function in the console to take a look at the data in the viewer. Type ?gss_cat into the console to see a description of the variables in the full version of this data set.
We will use bootstrapping to:
Note, if you get stuck as you are working through this, it will be helpful to go back and read Chapter 9 in ModernDive.
We will first run through an example of how to generate a bootstrap resample and a confidence interval using the numeric variable tvhours. For the purpose of understanding what a bootstrap is, we will start by taking only a single bootstrap resample and preview the first five rows:
bootstrap_sample_1 <- gss_sample %>%
rep_sample_n(size = 100, replace = TRUE, reps = 1)
bootstrap_sample_1 %>%
slice(1:5)| replicate | age | race | tvhours |
|---|---|---|---|
| 1 | 61 | White | 5 |
| 1 | 24 | White | 2 |
| 1 | 45 | White | 1 |
| 1 | 43 | White | 0 |
| 1 | 43 | White | 0 |
There are a a few important things to note about the above code, and bootstrap resampling in general:
bootstrap_sample_1 is the same size as our original sample (n = 100)replace = TRUE to indicate that we are resampling individual values from the gss_sample with replacementbootstrap_sample_1 will not contain any unique values that are not in gss_samplebootstrap_sample_1 can contain multiple responses from some people sampled in gss_sample, and will not include the responses of some of the individualsThe process of bootstrapping allows us to use a single sample to generate many different resamples that can help us approximate a sampling distribution with a bootstrap distribution. Thus, to generate a bootstrap distribution, we need to create multiple bootstrap samples.
The code below accomplishes the same kind of bootstrap sampling that we just ran 1000 times. The code looks a little different than above, because it uses the format from the infer package. Take a moment to re-visit the figure just before Section 9.3 in ModernDive.
thousand_bootstrap_samples <- gss_sample %>%
specify(response = tvhours) %>%
generate(reps = 1000, type = "bootstrap")The above code tells R to:
gss_sampletvhours to be the value we wish to resample using specify()thousand_bootstrap_samplesTake a look at thousand_bootstrap_samples in the data viewer.
Question:
tvhours are recorded for each of the 1000 replicates?Answer:
We can add one more line to the code to ask R to ALSO calculate the mean tvhours for each bootstrap sample replicate like so.
boot_distribution_tv <- gss_sample %>%
specify(response = tvhours) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")Take a look at boot_distribution_tv in the data viewer. Compare the number of rows and contents with thousand_bootstrap_samples.
Question:
boot_distribution_tv and not 100,000 like in thousand_bootstrap_samples?stat represent in boot_distribution_tv?Answer:
stat is the mean tvhours for the 100 people that were included in each bootstrap sample.We can pipe the bootstrap_distribution right into the visualize function from the infer package like so:
boot_distribution_tv %>%
visualize()It would be nice to see where our observed sample mean \(\bar{x}\) of tvhours falls within this bootstrap distribution.
Step 1: calculate the observed sample statistic \(\bar{x}\)
x_bar_tv <- gss_sample %>%
specify(response = tvhours) %>%
calculate(stat = "mean")
x_bar_tv| stat |
|---|
| 3.14 |
Step 2: add a vertical line showing the sample statistic \(\bar{x}\) for mean hours of TV consumed per day to our bootstrap distribution like so:
boot_distribution_tv %>%
visualize(obs_stat = x_bar_tv)We can now use the bootstrap distribution for tvhours to calculate a confidence interval. There are two methods for doing so. The get_ci function below asks the infer package for a 95% confidence interval, via the percentile method.
prct_ci_tv <- boot_distribution_tv %>%
get_ci(level = 0.95, type = "percentile")
prct_ci_tv| 2.5% | 97.5% |
|---|---|
| 2.52 | 3.8805 |
This method
We can visualize the results of the above calculation like so. Note that the green line on the left hits the x axis at the 2.5% CI value we just calculated, and the green line of the right hits the x axis at the 97.5% CI value we just calculated.
boot_distribution_tv %>%
visualize(endpoints = prct_ci_tv, direction = "between")Recall, if the bootstrap distribution is close to symmetric and bell-shaped, we can use the standard error method for determining the lower and upper endpoints of the confidence interval.
se_ci_tv <- boot_distribution_tv %>%
get_ci(level = 0.95, type = "se", point_estimate = x_bar_tv)
se_ci_tv| lower | upper |
|---|---|
| 2.457525 | 3.822475 |
Again, we can visualize the confidence interval
boot_distribution_tv %>%
visualize(endpoints = se_ci_tv, direction = "between")Questions:
tvhours?tvhours?Answers:
Now construct a 20% confidence interval for the number of TV hours watched per week, AND a 99% confidence interval. Use only one of the methods (quantile or standard error), your choice.
# 20% confidence interval
se_ci_tv_20 <- boot_distribution_tv %>%
get_ci(level = 0.20, type = "se", point_estimate = x_bar_tv)
se_ci_tv_20| lower | upper |
|---|---|
| 3.051782 | 3.228217 |
# 99% confidence interval
se_ci_tv_99 <- boot_distribution_tv %>%
get_ci(level = 0.99, type = "se", point_estimate = x_bar_tv)
se_ci_tv_99| lower | upper |
|---|---|
| 2.243076 | 4.036924 |
Questions:
Answers:
The procedure for generating a bootstrap sampling distribution is VERY similar for categorical data. As an example we will generate a bootstrap sampling distribution for the proportion of respondents that identified as a Person of Color.
p_hat_poc <- gss_sample %>%
specify(response = race, success = "POC") %>%
calculate(stat = "prop")
p_hat_poc| stat |
|---|
| 0.24 |
Note that the code differs in two respects from when we were working with a numeric variable:
success = "POC" to specify which level of the categorical variable we want to calculate the proportion for"prop" rather than "mean"boot_distribution_poc <- gss_sample %>%
specify(response = race, success = "POC") %>%
generate(reps = 10000, type = 'bootstrap') %>%
calculate(stat = "prop")Once you have generated the bootstrap distribution, the steps for calculating the confidence interval for a proportion are the same as for a numeric variable.
Use the get_ci function to calculate a 95% confidence interval for the population proportion \(p\) of individuals that identify as a POC. For the sake of less work, use only the standard error method (percentile can also be used for proportions). Think about what you use as your point estimate here!
se_ci_poc <- boot_distribution_poc %>%
get_ci(level = 0.95, type = "se", point_estimate = p_hat_poc)
se_ci_poc| lower | upper |
|---|---|
| 0.1566161 | 0.3233839 |
You will now apply what you have learned to calculating a confidence interval for the numeric variable age in our sample.
First generate a bootstrap sampling distribution using 1000 reps.
boot_distribution_poc_1000 <- gss_sample %>%
specify(response = race, success = "POC") %>%
generate(reps = 1000, type = 'bootstrap') %>%
calculate(stat = "prop")Calculate a 95% confidence interval for mean population age \(\mu\) using the percentile method or standard error method.
prct_ci_poc <- boot_distribution_poc_1000 %>%
get_ci(level = 0.95, type = "percentile")
prct_ci_poc| 2.5% | 97.5% |
|---|---|
| 0.16 | 0.32025 |
Calculate a 95% confidence interval for mean population age \(\mu\) using the standard error method. Remember you also need to calculate the sample mean \(\bar{x}\) to use the SE method.
# Calculate x_bar_age
x_bar_age <- gss_sample %>%
specify(response = age) %>%
calculate(stat = "mean")
x_bar_age| stat |
|---|
| 47.61 |
# 95% confidence interval using se method
boot_distribution_age <- gss_sample %>%
specify(response = age) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")se_ci_age <- boot_distribution_age %>%
get_ci(level = 0.95, type = "se", point_estimate = x_bar_age)
se_ci_age| lower | upper |
|---|---|
| 44.20164 | 51.01836 |
Imagine we only actually had a sample of 15 humans, like so:
gss_n15 <- sample_n(gss_sample, size = 15)Generate 1000 bootstrap samples based on this reduced sample gss_n15
boot_distribution15_age <- gss_n15 %>%
specify(response = age) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")Calculate a 95% confidence interval for the mean population age \(\mu\) based on this smaller sample. Use the standard error and percentile method.
# percentile
prct_ci_age15 <- boot_distribution15_age %>%
get_ci(level = 0.95, type = "percentile")
prct_ci_age15| 2.5% | 97.5% |
|---|---|
| 39.6 | 59 |
# x_bar age
x_bar_age <- gss_n15 %>%
specify(response = age) %>%
calculate(stat = "mean")
x_bar_age| stat |
|---|
| 49.4 |
# se method
se_ci_age <- boot_distribution15_age %>%
get_ci(level = 0.95, type = "se", point_estimate = x_bar_age)
se_ci_age| lower | upper |
|---|---|
| 39.3719 | 59.4281 |
Question:
Answer: