COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive will focus on regression modeling using a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.


Load Libraries and Dataset

To get started, we’ll load several R packages to assist with our analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(boot)
library(broom)
library(lindia)

# default theme for plots
theme_set(theme_minimal())

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ANOVA

We’ll start with an ANOVA (Analysis of Variance) test to determine whether a group factor is independent of a response variable.

For this COVID-19 dataset, the candidates for response variable would be weekly counts or rates for cases or deaths. We’ll select weekly death rate as the response variable since it would be of the highest concern.

For the explanatory variable, let’s select race/ethnicity as the group factor. In this dataset, the CDC has categorized people into one of 5 race/ethnicity groups:

  • American Indian/Alaska Native, Non-Hispanic (AI/AN, NH)
  • Asian/Pacific Islander, Non-Hispanic (Asian/PI, NH)
  • Black, Non-Hispanic (Black, NH)
  • Hispanic (Hispanic)
  • White, Non-Hispanic (White, NH)

For this test, we’ll create a dataframe with weekly COVID-19 data by race/ethnicity group.

# Weekly Totals by Race/Ethnicity Group
weekly_race_eth <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined != "Overall")

The null hypothesis for the ANOVA test will be:

\[ H_0 : \text{Average COVID-19 weekly death rate is equal across all race/ethnicity groups.} \]

ANOVA testing assumes the data are independent and identically distributed. The distribution within each group should be (approximately) normal, and the variance within each group should be consistent.

Let’s examine histograms of the weekly death rates for each race/ethnicity group to see whether the distributions are approximately normal.

# Histograms of weekly death rates by race/ethnicity group
weekly_race_eth |>
  ggplot() +
  geom_histogram(mapping = aes(x = death_crude_rate_suppressed_per_100k),
                 na.rm = TRUE, binwidth = 0.5) +
  labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Rate (per 100K population)",
       y = "Frequency") +
  facet_wrap(~ race_ethnicity_combined, nrow = 1) +
  theme(plot.subtitle = element_text(colour = "darkgray"))

As the histograms show, the distributions of death rate are more akin to lognormal than normal, as they are bound by zero and cannot have negative values. The distribution for the American Indian/Alaska Native group is the most different.

Let’s calculate the variance within each race/ethnicity group.

weekly_race_eth |>
  group_by(race_ethnicity_combined) |>
  summarise(
    death_var = var(death_crude_rate_suppressed_per_100k, na.rm = TRUE),
    death_sd = sd(death_crude_rate_suppressed_per_100k, na.rm = TRUE)
  )
## # A tibble: 5 × 3
##   race_ethnicity_combined death_var death_sd
##   <chr>                       <dbl>    <dbl>
## 1 AI/AN, NH                   6.94     2.63 
## 2 Asian/PI, NH                0.894    0.946
## 3 Black, NH                   5.57     2.36 
## 4 Hispanic                    2.26     1.50 
## 5 White, NH                   4.52     2.13

The standard deviations of each group are roughly similar, though the Asian/Pacific Islander group and Hispanic group have much lower variances compared to the other groups.

While our dataset might not be an ideal match to the assumptions for an ANOVA test, let’s proceed.

# ANOVA test summary
m <- aov(death_crude_rate_suppressed_per_100k ~ race_ethnicity_combined, data = weekly_race_eth)

summary(m)
##                          Df Sum Sq Mean Sq F value Pr(>F)    
## race_ethnicity_combined   4  590.3  147.57   38.22 <2e-16 ***
## Residuals               652 2517.3    3.86                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 313 observations deleted due to missingness

The ANOVA test has a very low p-value and a high F value, which both indicate that the null hypothesis should likely be rejected, leading us to assume that COVID-19 weekly death rates are not equal across race/ethnicity groups.

Next let’s conduct pairwise t-tests to identify which group(s) are different from the others.

# pairwise t-tests of race/ethnicity groups
pairwise.t.test(weekly_race_eth$death_crude_rate_suppressed_per_100k,
                weekly_race_eth$race_ethnicity_combined, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  weekly_race_eth$death_crude_rate_suppressed_per_100k and weekly_race_eth$race_ethnicity_combined 
## 
##              AI/AN, NH Asian/PI, NH Black, NH Hispanic
## Asian/PI, NH <2e-16    -            -         -       
## Black, NH    <2e-16    0.014        -         -       
## Hispanic     <2e-16    1.000        0.191     -       
## White, NH    <2e-16    0.110        1.000     1.000   
## 
## P value adjustment method: bonferroni

The pairwise t-tests indicate that the American Indian/Alaska Native group (AI/AN, NH) is very different from other four groups with a very low p-value across all the t-tests.

In addition, the Asian/Pacific Islander group (Asian/PI, NH) appears to be different from the Black group (Black, NH) with the t-test producing a low p-value (~0.01).

With these differences in mind, let’s run a bootstrapping simulation to generate 95% confidence intervals for each group’s mean weekly death rate.

Here is the function to run the bootstrap simulation.

# Bootstrap simulation to calculate Confidence Interval (CI)
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  # function to run on each iteration i
  boot_func <- \(x, i) func(x[i], na.rm = TRUE)
  
  # the boot instance, call function for each iteration
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  # return lower and upper values of CI
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

Here are the bootstrapped 95% confidence intervals for mean weekly death rate by race/ethnicity group.

# generate bootstrapped 95% CI for mean death rate by race/ethnicity group
df_ci <- weekly_race_eth |>
  group_by(race_ethnicity_combined) |>
  summarise(ci_lower = boot_ci(death_crude_rate_suppressed_per_100k, mean)['lower'],
            mean_rate = mean(death_crude_rate_suppressed_per_100k, na.rm = TRUE),
            ci_upper = boot_ci(death_crude_rate_suppressed_per_100k, mean)['upper'])

df_ci
## # A tibble: 5 × 4
##   race_ethnicity_combined ci_lower mean_rate ci_upper
##   <chr>                      <dbl>     <dbl>    <dbl>
## 1 AI/AN, NH                   4.52      5.53     6.52
## 2 Asian/PI, NH                1.04      1.18     1.36
## 3 Black, NH                   1.61      1.95     2.37
## 4 Hispanic                    1.18      1.43     1.66
## 5 White, NH                   1.47      1.78     2.14

Here is a plot of the bootstrapped confidence intervals.

# plot bootstrapped 95% CI for mean death rate by race/ethnicity group
df_ci |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = race_ethnicity_combined, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_rate, y = race_ethnicity_combined,
                           color = 'Group Mean'), shape = '|', size = 5) +
  scale_y_discrete(limits = unique(rev(df_ci$race_ethnicity_combined))) +
  scale_color_manual(values=c('black', 'red')) +
  labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Death Rate (per 100K population)",
       y = "Race/Ethnicity Group",
       color = '') +
  theme(plot.subtitle = element_text(colour = "darkgray"))

As seen in the ANOVA test and in the bootstrapped CI, the American Indian/Alaska Native group had a much higher weekly death rate.

Linear Regression

Next we’ll build a linear regression model for two continuous variables.

Once again, we’ll use weekly death rate as the response variable.

We’ll select weekly case rate as the explanatory variable.

For this modeling, we’ll create a dataframe that only includes the rows listing the “Overall” values for each week, leaving a single row for each week that provides summary data for all demographic groups combined.

# Only include rows that summarize overall totals for each week
weekly_overall <- covid |>
  filter(age_group == "Overall" & sex == "Overall" & race_ethnicity_combined == "Overall")

Let’s plot the weekly case rates vs. weekly death rates with a linear model fit to the data.

# scatter plot of case rate vs death rate with linear model
weekly_overall |>
  ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
                      y = death_crude_rate_suppressed_per_100k), na.rm = TRUE) +
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE, color = 'orange') +
  labs(title = "COVID-19 Case Rate vs. Death Rate (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Case Rate (per 100K population)",
       y = "Weekly Death Rate (per 100K population)") +
  theme(plot.subtitle = element_text(colour = "darkgray"))
## `geom_smooth()` using formula = 'y ~ x'

Here’s the summary for the linear model shown in the plot.

# linear model function: lm(y ~ x, data)
model <- lm(death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k, weekly_overall)

summary (model)
## 
## Call:
## lm(formula = death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k, 
##     data = weekly_overall)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2002 -0.6308 -0.3505  0.2534  4.7354 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         0.3270765  0.1337086   2.446   0.0153 *  
## case_crude_rate_suppressed_per_100k 0.0089341  0.0005486  16.286   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.382 on 192 degrees of freedom
## Multiple R-squared:  0.5801, Adjusted R-squared:  0.5779 
## F-statistic: 265.2 on 1 and 192 DF,  p-value: < 2.2e-16

The \(R^2\) value is 0.58 indicating that the linear model can account for about 58% of the variance in the response variable.

# linear model coefficients
model$coefficients
##                         (Intercept) case_crude_rate_suppressed_per_100k 
##                         0.327076523                         0.008934115

Let’s use the linear model coefficients (y-intercept and slope) to evaluate the model fit visually by plotting error as the “residual” distance from each point to the line.

# linear model coefficients
beta_0 <- 0.327076523
beta_1 <- 0.008934115

# linear model function
lm_ <- \(x) beta_1 * x + beta_0

# plot error for each point ("residual" distance from point to line)
weekly_overall |>
  ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
                      y = death_crude_rate_suppressed_per_100k), na.rm = TRUE) +
  geom_point(size = 1) +
  geom_segment(mapping = aes(xend = case_crude_rate_suppressed_per_100k,
                             y = death_crude_rate_suppressed_per_100k,
                             yend = lm_(case_crude_rate_suppressed_per_100k), 
                             color = 'Error'), linewidth = 0.25) +
  geom_smooth(method = "lm", se = FALSE, color = 'orange') +
  scale_color_manual(values = c('blue')) +
  labs(title = "COVID-19 Case Rate vs. Death Rate (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
       x = "Weekly Case Rate (per 100K population)",
       y = "Weekly Death Rate (per 100K population)",
       color = '') +
  theme(plot.subtitle = element_text(colour = "darkgray"))
## `geom_smooth()` using formula = 'y ~ x'

While this linear model minimizes the error, there still seems to be large errors across many data points.

It does seem logical that case rate would be related to death rate, as having a case of COVID-19 is a prerequisite to a death from COVID-19, and it seems logical that more cases (i.e., higher case rate) would lead to more deaths (i.e., higher death rate).

However, it may not make the most sense to expect a linear relationship between case rate and death rate in the same week. In reality, there is likely a lag in time before a case of COVID reported in one week becomes a death from COVID in some later week.

For example, if the average time from the onset of a case to death was 2 weeks, then it would make more sense to plot the case rate for week \(i\) versus the death rate for week \(i + 2\).


This dataset will be explored further in subsequent data dives.