Boot Strap the Covid

This document:
- creates a new data set from the covid and census data
- conducts a hypothesis test
- boot straps the new data set
- visualizes the results

Part One: create and examine a sample data set to bootstrap

Import the Census and COVID-19 case data set created from the Create Retrieve, Wrangle, Mutate, Merge document, and select the metrics: unemployment_rate, covid_mortality_quintile

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.1.3     ✔ stringr 1.4.0
## ✔ readr   1.4.0     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

boot_covid <- read_csv("~/STA 518/BrookemWalters-Portfolio/Stats 518 Final Project/Bootstrapping/covid_census_bs.csv") %>% 
  select(County, unemployment_rate, covid_mortality_quintile, Deaths_Per_Pop_Thousand) %>% 
  tibble()

## Warning: Missing column names filled in: 'X1' [1]

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   population19E = col_double(),
##   householdsE = col_double(),
##   median_ageE = col_double(),
##   median_incomeE = col_double(),
##   bach_degree_plus_a25E = col_double(),
##   unemployment_rate = col_double(),
##   public_assist_rate = col_double(),
##   percent_asian = col_double(),
##   percent_black = col_double(),
##   percent_native = col_double(),
##   percent_pacific_islander = col_double(),
##   percent_white = col_double(),
##   percent_hispanic = col_double(),
##   County = col_character(),
##   Total_Deaths = col_double(),
##   Deaths_Per_Pop_Thousand = col_double(),
##   covid_mortality_quintile = col_double()
## )

visualize the observations

Michigan County Unemployment Rate and COVID-19 deaths per 1000, Q1 = the lowest percentile & Q5 = the highest percentile for COVID-19 mortality

boxplot(boot_covid$unemployment_rate~boot_covid$covid_mortality_quintile, las = 1, ylab = "Unemployment Rate",
        xlab = "COVID-19 Mortality Quintile", main = "Unemployment Rate by COVID-19 Mortality Quintile")

To keep the bootstrap simple, I’ll just look at Q1 and Q5:

boot_covid <- boot_covid %>% 
  filter(covid_mortality_quintile == 1 |covid_mortality_quintile == 5)

Compare the two quintiles

boxplot(boot_covid$unemployment_rate~boot_covid$covid_mortality_quintile, las = 1, ylab = "Unemployment Rate", 
        xlab = "COVID-19 Mortality Quintile", main = "Unemployment Rate by COVID-19 Mortality Quintile")

Hypothesis Testing

The code below conducts “Welch Two Sample t-test”, I can compare the results to the bootstrap

Specify the Null Hypothesis: There is no difference in mean unemployment rate between Q1 and Q5
Specify the Alternative Hypothesis: There is a difference in mean unemployment rate between Q1 and Q5
Calculate the Test Statistic: T = -2.6822
Calculate the P-Value: p = 0.019
Drawing a Conclusion: Where alpha = 0.05, p < alpha, reject H0, there is a difference in average unemployment rate based on COVID-19 mortality rate (Q1 vs Q5)

** However, my sample set does not meet the conditions for a two-sample T test, I would not use this as evidence, so I’ll do a bootstrap and compare!

t.test(boot_covid$unemployment_rate~boot_covid$covid_mortality_quintile, paired = FALSE, var.eq = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  boot_covid$unemployment_rate by boot_covid$covid_mortality_quintile
## t = -2.6822, df = 29.274, p-value = 0.0119
## alternative hypothesis: true difference in means between group 1 and group 5 is not equal to 0
## 95 percent confidence interval:
##  -2.3996996 -0.3238298
## sample estimates:
## mean in group 1 mean in group 5 
##        5.288235        6.650000

Part Two

Calculate the Test Statistic for Boot Strapping

test_stat_one <-  abs(mean(boot_covid$unemployment_rate[boot_covid$covid_mortality_quintile == 5])) - mean(boot_covid$unemployment_rate[boot_covid$covid_mortality_quintile == 1])

round(test_stat_one, 2)

## [1] 1.36

Set a seed for reproducibility

set.seed(11062)

Set up the bootstrap

#n = number of observations
n <- length(boot_covid$unemployment_rate)

#b = number of bootstrap samples
B <- 10000

# assign the variable I'm testing a name for easier coding
variable <- boot_covid$unemployment_rate

Collect the bootstrap samples: 33 by 10,000 matrix where each column is a bootstrap re-sample

bootstrap_samples <-  matrix(sample(variable, size = n*B, replace = TRUE),
                             nrow = n, ncol = B)

boot_test_stat_one <- rep(0,B)

Boot Strap Loop!

for (i in 1:B) {
boot_test_stat_one[i] <- abs(mean(bootstrap_samples [1:16,i]) - 
                               mean(bootstrap_samples [17:33,i]))
}

round(boot_test_stat_one[1:20],2)

##  [1] 0.23 0.37 0.10 0.38 0.06 0.20 0.02 0.89 0.73 0.38 0.03 0.72 0.48 0.65 0.41
## [16] 0.54 0.07 0.42 0.21 0.48

Taking a look at the bootstraps samples

(boot_test_stat_one >= test_stat_one)[1:20]

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

time to calculate the bootstrap p-value 0 = False, 1 = True

mean(boot_test_stat_one >= test_stat_one)

## [1] 0.0125

Specify the Null Hypothesis: There is no difference in mean unemployment rate between Q1 and Q5
Specify the Alternative Hypothesis: There is a difference in mean unemployment rate between Q1 and Q5
Calculate the Test Statistic: T = 1.36
Calculate the P-Value: p = 0.0125
Drawing a Conclusion: Where alpha = 0.05, p < alpha, reject H0, there is a difference in average unemployment rate based on COVID-19 mortality rate (Q1 vs Q5)

The bootstrap results yielded the same conclusion as the two-sample T test.