This document:
- creates a new data set from the covid and census data
- conducts a hypothesis test
- bootstraps the new data set
- visualizes the results
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.1.3 ✔ stringr 1.4.0
## ✔ readr 1.4.0 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
boot_covid <- read_csv("~/STA 518/BrookemWalters-Portfolio/Stats 518 Final Project/Bootstrapping/covid_census_bs.csv") %>%
select(County, unemployment_rate, covid_mortality_quintile, Deaths_Per_Pop_Thousand) %>%
tibble()
## Warning: Missing column names filled in: 'X1' [1]
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## population19E = col_double(),
## householdsE = col_double(),
## median_ageE = col_double(),
## median_incomeE = col_double(),
## bach_degree_plus_a25E = col_double(),
## unemployment_rate = col_double(),
## public_assist_rate = col_double(),
## percent_asian = col_double(),
## percent_black = col_double(),
## percent_native = col_double(),
## percent_pacific_islander = col_double(),
## percent_white = col_double(),
## percent_hispanic = col_double(),
## County = col_character(),
## Total_Deaths = col_double(),
## Deaths_Per_Pop_Thousand = col_double(),
## covid_mortality_quintile = col_double()
## )
Michigan county unemployment rate by COVID-19 deaths per 1,000 residents. Q1 is the lowest quintile and Q5 is the highest quintile for COVID-19 mortality.
boxplot(boot_covid$unemployment_rate~boot_covid$covid_mortality_quintile, las = 1, ylab = "Unemployment Rate",
xlab = "COVID-19 Mortality Quintile", main = "Unemployment Rate by COVID-19 Mortality Quintile")
boot_covid <- boot_covid %>%
filter(covid_mortality_quintile == 1 |covid_mortality_quintile == 5)
boxplot(boot_covid$unemployment_rate~boot_covid$covid_mortality_quintile, las = 1, ylab = "Unemployment Rate",
xlab = "COVID-19 Mortality Quintile", main = "Unemployment Rate by COVID-19 Mortality Quintile")
The code below conducts a Welch two-sample t-test so that I can compare its results to the bootstrap.
Specify the Null Hypothesis: There is no difference in mean unemployment rate between Q1 and Q5
Specify the Alternative Hypothesis: There is a difference in mean unemployment rate between Q1 and Q5
Calculate the Test Statistic: T = -2.6822
Calculate the P-Value: p = 0.0119
Drawing a Conclusion: With alpha = 0.05, p < alpha, so reject H0. The mean unemployment rate differs between the lowest and highest COVID-19 mortality quintiles (Q1 vs Q5).
However, my sample does not meet the conditions for a two-sample t-test, so I would not use this result as evidence on its own. Instead, I'll run a bootstrap and compare.
t.test(boot_covid$unemployment_rate~boot_covid$covid_mortality_quintile, paired = FALSE, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: boot_covid$unemployment_rate by boot_covid$covid_mortality_quintile
## t = -2.6822, df = 29.274, p-value = 0.0119
## alternative hypothesis: true difference in means between group 1 and group 5 is not equal to 0
## 95 percent confidence interval:
## -2.3996996 -0.3238298
## sample estimates:
## mean in group 1 mean in group 5
## 5.288235 6.650000
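The conditions for a two-sample t-test mentioned above can be inspected directly. This is a minimal sketch, assuming the conditions of interest are group size and approximate normality within each group. The helper name `check_t_conditions` and the demo data are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: summarise each group and run a Shapiro-Wilk
# normality test (stats::shapiro.test) within it.
check_t_conditions <- function(values, groups) {
  pieces <- split(values, groups)
  data.frame(
    group     = names(pieces),
    n         = vapply(pieces, length, integer(1)),
    mean      = vapply(pieces, mean, numeric(1)),
    sd        = vapply(pieces, sd, numeric(1)),
    shapiro_p = vapply(pieces, function(x) shapiro.test(x)$p.value, numeric(1)),
    row.names = NULL
  )
}

# Made-up demo data standing in for the Q1 and Q5 counties (an assumption,
# not the real unemployment rates)
set.seed(1)
demo <- data.frame(
  rate     = c(rnorm(17, mean = 5.3, sd = 1), rexp(16, rate = 1 / 6.6)),
  quintile = rep(c(1, 5), times = c(17, 16))
)
conditions <- check_t_conditions(demo$rate, demo$quintile)
print(conditions)
```

With the real data, the call would be `check_t_conditions(boot_covid$unemployment_rate, boot_covid$covid_mortality_quintile)`. A small `shapiro_p` suggests a departure from normality, although the test has limited power in groups this small, so a boxplot or histogram is still worth checking.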
# Observed test statistic: absolute difference in mean unemployment rate (Q5 vs Q1)
test_stat_one <- abs(mean(boot_covid$unemployment_rate[boot_covid$covid_mortality_quintile == 5]) - mean(boot_covid$unemployment_rate[boot_covid$covid_mortality_quintile == 1]))
round(test_stat_one, 2)
## [1] 1.36
set.seed(11062)
# n = number of observations
n <- length(boot_covid$unemployment_rate)
# B = number of bootstrap samples
B <- 10000
# give the variable being tested a short name for easier coding
variable <- boot_covid$unemployment_rate
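# Draw B bootstrap samples of size n from the pooled unemployment rates;
# pooling both quintiles simulates the null hypothesis of no difference.
bootstrap_samples <- matrix(sample(variable, size = n*B, replace = TRUE),
nrow = n, ncol = B)
# For each sample (column), split the rows into two groups (rows 1-16 and 17-33)
# and store the absolute difference in group means.
boot_test_stat_one <- rep(0,B)
for (i in 1:B) {
boot_test_stat_one[i] <- abs(mean(bootstrap_samples[1:16,i]) -
mean(bootstrap_samples[17:33,i]))
}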
round(boot_test_stat_one[1:20],2)
## [1] 0.23 0.37 0.10 0.38 0.06 0.20 0.02 0.89 0.73 0.38 0.03 0.72 0.48 0.65 0.41
## [16] 0.54 0.07 0.42 0.21 0.48
(boot_test_stat_one >= test_stat_one)[1:20]
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
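Now calculate the bootstrap p-value. Taking the mean of a logical vector counts FALSE as 0 and TRUE as 1, so the result is the proportion of bootstrap statistics that are at least as large as the observed test statistic.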
mean(boot_test_stat_one >= test_stat_one)
## [1] 0.0125
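The same pooled-resampling idea can be written without hard-coded row ranges. The following is a minimal sketch, not the code used for the results above: the function name `pooled_boot_p` and the demo vectors are hypothetical. It reads the group sizes from the data and uses `colMeans()` in place of the `for` loop.

```r
# Hypothetical helper: bootstrap p-value for a difference in means,
# resampling from the pooled values to simulate the null hypothesis.
pooled_boot_p <- function(x, y, B = 10000) {
  observed <- abs(mean(x) - mean(y))
  pooled <- c(x, y)
  n_x <- length(x)
  n <- length(pooled)
  # Each column is one bootstrap sample drawn from the pooled data
  samples <- matrix(sample(pooled, size = n * B, replace = TRUE),
                    nrow = n, ncol = B)
  # First n_x rows play the role of group x, the remaining rows group y
  boot_stats <- abs(colMeans(samples[seq_len(n_x), , drop = FALSE]) -
                    colMeans(samples[(n_x + 1):n, , drop = FALSE]))
  # Proportion of bootstrap statistics at least as large as the observed one
  list(observed = observed, p_value = mean(boot_stats >= observed))
}

# Made-up demo values (an assumption, not the county data)
set.seed(11062)
group_a <- c(4.1, 5.0, 5.6, 4.8, 6.2, 5.3)
group_b <- c(6.9, 7.4, 5.8, 6.6, 7.1)
result <- pooled_boot_p(group_a, group_b, B = 2000)
result$p_value
```

Deriving the split point from `length(x)` keeps the simulated group sizes in step with the real ones if the filtered data set changes.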
Specify the Null Hypothesis: There is no difference in mean unemployment rate between Q1 and Q5
Specify the Alternative Hypothesis: There is a difference in mean unemployment rate between Q1 and Q5
Calculate the Test Statistic: T = 1.36
Calculate the P-Value: p = 0.0125
Drawing a Conclusion: With alpha = 0.05, p < alpha, so reject H0. The mean unemployment rate differs between the lowest and highest COVID-19 mortality quintiles (Q1 vs Q5).
The bootstrap yielded the same conclusion as the Welch two-sample t-test.
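One further way to compare the two approaches is to set a bootstrap confidence interval beside the interval reported by `t.test()`. The sketch below is an illustration only and uses a different resampling scheme from the one above: it resamples within each group rather than from the pooled data, then takes percentile limits. The function name `boot_diff_ci` and the demo vectors are hypothetical.

```r
# Hypothetical helper: percentile bootstrap confidence interval for
# mean(x) - mean(y), resampling each group separately.
boot_diff_ci <- function(x, y, B = 10000, level = 0.95) {
  diffs <- replicate(
    B,
    mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE))
  )
  alpha <- 1 - level
  quantile(diffs, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Made-up demo values (an assumption, not the county data)
set.seed(11062)
group_a <- c(4.1, 5.0, 5.6, 4.8, 6.2, 5.3)
group_b <- c(6.9, 7.4, 5.8, 6.6, 7.1)
ci <- boot_diff_ci(group_a, group_b, B = 2000)
ci
```

If the interval excludes 0, it points the same way as rejecting H0 at the matching alpha level.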