Collaboration

Please indicate who you collaborated with on this problem set:

Background

First load the necessary packages:

library(dplyr)
library(ggplot2)
library(janitor)
library(moderndive)
library(infer)
library(tidyverse)
library(readr)

For this exercise, we will again use data from the general social survey, an annual personal-interview survey conducted in the United States. The survey is designed to monitor changes in both social characteristics and attitudes. You will work with a small sample from one neighborhood to practice generating a bootstrap sampling distribution, and confidence intervals for numeric and categorical variables.

You have only a small sample of the data from the gss survey, specifically data on age, race race, and number of hours of TV watched a day tvhours for 100 respondents in 2014. The following will read in the data:

gss_sample <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSypSoDCMH2N76Vo2dZRPkw2q3t1mbvAXlOtgPDIsHg4NclAQFmER-BdvXH9_lrT40UQCVdPXOi_NMJ/pub?gid=257689625&single=true&output=csv")

Use the View() function in the console to take a look at the data in the viewer. Type ?gss_cat into the console to see a description of the variables in the full version of this data set.

We will use bootstrapping to:

  1. Estimate the population mean \(\mu\) respondent age
  2. Generate a confidence interval for this population mean \(\mu\)
  3. Estimate the population proportion \(p\) that identify as a person of color (POC)
  4. Generate a 95% confidence interval for this population proportion \(p\)

Note, if you get stuck as you are working through this, it will be helpful to go back and read Chapter 9 in ModernDive.


Two demos of bootstrap sampling & confidence intervals

Demo 1: Numeric variables

We will first run through an example of how to generate a bootstrap resample and a confidence interval using the numeric variable tvhours. For the purpose of understanding what a bootstrap is, we will start by taking only a single bootstrap resample and preview the first five rows:

bootstrap_sample_1 <- gss_sample %>% 
  rep_sample_n(size = 100, replace = TRUE, reps = 1)

bootstrap_sample_1 %>% 
  slice(1:5)
replicate age race tvhours
1 61 White 5
1 24 White 2
1 45 White 1
1 43 White 0
1 43 White 0

There are a a few important things to note about the above code, and bootstrap resampling in general:

  • bootstrap_sample_1 is the same size as our original sample (n = 100)
  • We have added replace = TRUE to indicate that we are resampling individual values from the gss_sample with replacement
  • bootstrap_sample_1 will not contain any unique values that are not in gss_sample
  • bootstrap_sample_1 can contain multiple responses from some people sampled in gss_sample, and will not include the responses of some of the individuals

a) Generate 1000 bootstrap samples

The process of bootstrapping allows us to use a single sample to generate many different resamples that can help us approximate a sampling distribution with a bootstrap distribution. Thus, to generate a bootstrap distribution, we need to create multiple bootstrap samples.

The code below accomplishes the same kind of bootstrap sampling that we just ran 1000 times. The code looks a little different than above, because it uses the format from the infer package. Take a moment to re-visit the figure just before Section 9.3 in ModernDive.

thousand_bootstrap_samples <- gss_sample %>% 
  specify(response = tvhours) %>% 
  generate(reps = 1000, type = "bootstrap")

The above code tells R to:

  • Take the data frame gss_sample
  • Set the variable tvhours to be the value we wish to resample using specify()
  • Generate 1000 bootstrap resamples
  • Save the results in thousand_bootstrap_samples

Take a look at thousand_bootstrap_samples in the data viewer.

Question:

  1. How many values of tvhours are recorded for each of the 1000 replicates?

Answer:

  1. 100

b) Calculate the mean of each bootstrap sample

We can add one more line to the code to ask R to ALSO calculate the mean tvhours for each bootstrap sample replicate like so.

boot_distribution_tv <- gss_sample %>% 
  specify(response = tvhours) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

Take a look at boot_distribution_tv in the data viewer. Compare the number of rows and contents with thousand_bootstrap_samples.

Question:

  1. Why are there only 1000 rows in boot_distribution_tv and not 100,000 like in thousand_bootstrap_samples?
  2. What does each value of stat represent in boot_distribution_tv?

Answer:

  1. Because for each of the 1000 replicates, we took the mean of the n=100 bootstrap resamples.
  2. Each value of stat is the mean tvhours for the 100 people that were included in each bootstrap sample.

c) Visualize the bootstrap sample

We can pipe the bootstrap_distribution right into the visualize function from the infer package like so:

boot_distribution_tv %>% 
  visualize()

d) Calculate and visulalize the observed sample statistic \(\bar{x}\)

It would be nice to see where our observed sample mean \(\bar{x}\) of tvhours falls within this bootstrap distribution.

Step 1: calculate the observed sample statistic \(\bar{x}\)

x_bar_tv <- gss_sample %>% 
  specify(response = tvhours) %>% 
  calculate(stat = "mean")

x_bar_tv
stat
3.14

Step 2: add a vertical line showing the sample statistic \(\bar{x}\) for mean hours of TV consumed per day to our bootstrap distribution like so:

boot_distribution_tv %>% 
  visualize(obs_stat = x_bar_tv)

e) Quantile method for confidence intervals

We can now use the bootstrap distribution for tvhours to calculate a confidence interval. There are two methods for doing so. The get_ci function below asks the infer package for a 95% confidence interval, via the percentile method.

prct_ci_tv <- boot_distribution_tv %>% 
  get_ci(level = 0.95, type = "percentile")

prct_ci_tv
2.5% 97.5%
2.52 3.8805

This method

  • Sorts the bootstrap sample means in order from low to high
  • Identifies the middle 95% (for a 95% confidence interval)
  • Finds the two values of bootstrap sample means that fall on the lower and upper end of this span

We can visualize the results of the above calculation like so. Note that the green line on the left hits the x axis at the 2.5% CI value we just calculated, and the green line of the right hits the x axis at the 97.5% CI value we just calculated.

boot_distribution_tv %>% 
  visualize(endpoints = prct_ci_tv, direction = "between")

f) Standard error method for Confidence Interval

Recall, if the bootstrap distribution is close to symmetric and bell-shaped, we can use the standard error method for determining the lower and upper endpoints of the confidence interval.

se_ci_tv <- boot_distribution_tv %>% 
  get_ci(level = 0.95, type = "se", point_estimate = x_bar_tv)
se_ci_tv
lower upper
2.457525 3.822475

Again, we can visualize the confidence interval

boot_distribution_tv %>% 
  visualize(endpoints = se_ci_tv, direction = "between")

g) Interpreting confidence intervals

Questions:

  1. How confident are you that the 95% confidence interval we calculated under the standard error method contains the true population mean \(\mu\) of tvhours?
  2. How confident are you that the confidence interval we calculated under the standard error method contains the sample statistic \(\bar{x}\) of tvhours?

Answers:

  1. I am 95% confident.
  2. I am 100% confident…regardless of the stated confidence level. This is by definition an interval that is built off of a sample mean.

h) Changing the confidence level

Now construct a 20% confidence interval for the number of TV hours watched per week, AND a 99% confidence interval. Use only one of the methods (quantile or standard error), your choice.

# 20% confidence interval 
se_ci_tv_20 <- boot_distribution_tv %>% 
  get_ci(level = 0.20, type = "se", point_estimate = x_bar_tv)

se_ci_tv_20
lower upper
3.051782 3.228217
# 99% confidence interval
se_ci_tv_99 <- boot_distribution_tv %>% 
  get_ci(level = 0.99, type = "se", point_estimate = x_bar_tv)

se_ci_tv_99
lower upper
2.243076 4.036924

Questions:

  1. Which confidence interval is wider? Why?
  2. Without looking at the census data: do either of these confidence intervals definitely contain the population mean \(\mu\) of number of TV hours watched per week?

Answers:

  1. The 99% confidence interval is wider. We are casting a wider net…are setting a larger confidence level.
  2. We don’t know. We are more confident that the 99% confidence interval contains it however.

Demo 2: Categorical variables

The procedure for generating a bootstrap sampling distribution is VERY similar for categorical data. As an example we will generate a bootstrap sampling distribution for the proportion of respondents that identified as a Person of Color.

a) Calculate the observed sample statistic \(\widehat{p}\)

p_hat_poc <- gss_sample %>% 
  specify(response = race, success = "POC") %>% 
  calculate(stat = "prop")

p_hat_poc
stat
0.24

Note that the code differs in two respects from when we were working with a numeric variable:

  • we added the argument success = "POC" to specify which level of the categorical variable we want to calculate the proportion for
  • the stat we calculate is now "prop" rather than "mean"

b) Generate the bootstrap sampling distribution

boot_distribution_poc <- gss_sample %>% 
  specify(response = race, success = "POC") %>% 
  generate(reps = 10000, type = 'bootstrap') %>% 
  calculate(stat = "prop")

Question 1

Once you have generated the bootstrap distribution, the steps for calculating the confidence interval for a proportion are the same as for a numeric variable.

Use the get_ci function to calculate a 95% confidence interval for the population proportion \(p\) of individuals that identify as a POC. For the sake of less work, use only the standard error method (percentile can also be used for proportions). Think about what you use as your point estimate here!

se_ci_poc <- boot_distribution_poc %>% 
  get_ci(level = 0.95, type = "se", point_estimate = p_hat_poc)
se_ci_poc
lower upper
0.1566161 0.3233839

Question 2

You will now apply what you have learned to calculating a confidence interval for the numeric variable age in our sample.

a) Bootstrap sample

First generate a bootstrap sampling distribution using 1000 reps.

boot_distribution_poc_1000 <- gss_sample %>% 
  specify(response = race, success = "POC") %>% 
  generate(reps = 1000, type = 'bootstrap') %>% 
  calculate(stat = "prop")

b) 95% Confidence interval

Calculate a 95% confidence interval for mean population age \(\mu\) using the percentile method or standard error method.

prct_ci_poc <- boot_distribution_poc_1000 %>% 
  get_ci(level = 0.95, type = "percentile")

prct_ci_poc
2.5% 97.5%
0.16 0.32025

Calculate a 95% confidence interval for mean population age \(\mu\) using the standard error method. Remember you also need to calculate the sample mean \(\bar{x}\) to use the SE method.

# Calculate x_bar_age
x_bar_age <- gss_sample %>% 
  specify(response = age) %>% 
  calculate(stat = "mean")

x_bar_age
stat
47.61
# 95% confidence interval using se method
boot_distribution_age <- gss_sample %>% 
  specify(response = age) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")
se_ci_age <- boot_distribution_age %>% 
  get_ci(level = 0.95, type = "se", point_estimate = x_bar_age)
se_ci_age
lower upper
44.20164 51.01836

c) Influence of sample size on confidence interval

Imagine we only actually had a sample of 15 humans, like so:

gss_n15 <- sample_n(gss_sample, size = 15)

Generate 1000 bootstrap samples based on this reduced sample gss_n15

boot_distribution15_age <- gss_n15 %>% 
  specify(response = age) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

Calculate a 95% confidence interval for the mean population age \(\mu\) based on this smaller sample. Use the standard error and percentile method.

# percentile
prct_ci_age15 <- boot_distribution15_age %>% 
  get_ci(level = 0.95, type = "percentile")

prct_ci_age15
2.5% 97.5%
39.6 59
# x_bar age 
x_bar_age <- gss_n15 %>% 
  specify(response = age) %>% 
  calculate(stat = "mean")

x_bar_age
stat
49.4
# se method
se_ci_age <- boot_distribution15_age %>% 
  get_ci(level = 0.95, type = "se", point_estimate = x_bar_age)
se_ci_age
lower upper
39.3719 59.4281

Question:

  1. How did changing the sample size influence the width of the 95% confidence interval you calculated. Why?

Answer:

  1. Making the sample size smaller made the confidence interval wider. I think this is because there is more margin for error to account for when we reduce the scope of our sample. And to continue to have a 95% confidence, we need a wider “net”.