COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive will focus on developing a logistic regression model as an example of a generalized linear model for a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.


Load Libraries and Dataset

To get started, we’ll load several R packages to assist with our analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(patchwork)
library(broom)
library(lindia)
theme_set(theme_minimal())

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
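
As the startup message suggests, the column types can be supplied explicitly or the message silenced on a re-read; a minimal sketch, reusing the same file path:

# optional: suppress the column-type message on subsequent reads
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",",
                    show_col_types = FALSE)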

Let’s convert end_of_week from a character format to a date format.

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
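
A quick sanity check (output not shown) is to confirm the parsed dates cover the expected reporting window of March 7, 2020 through November 18, 2023:

# confirm the conversion worked: date range and any failed parses (NA)
range(covid$end_of_week, na.rm = TRUE)
sum(is.na(covid$end_of_week))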

Let’s add a new column to indicate the order of the age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.

# Add new column to order age groups correctly
covid <- covid |>
  mutate(age_order = case_when(
    age_group == "Overall" ~ 0L,
    age_group == "0 - 4 Years" ~ 1L,
    age_group == "5 - 11 Years" ~ 2L,
    age_group == "12 - 15 Years" ~ 3L,
    age_group == "16 - 17 Years" ~ 4L,
    age_group == "18 - 29 Years" ~ 5L,
    age_group == "30 - 39 Years" ~ 6L,
    age_group == "40 - 49 Years" ~ 7L,
    age_group == "50 - 64 Years" ~ 8L,
    age_group == "65 - 74 Years" ~ 9L,
    age_group == "75+ Years" ~ 10L),
    .after = age_group) |>
  arrange(end_of_week, age_order, sex, race_ethnicity_combined)
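
As a design note, the same ordering could be achieved without a helper column by converting age_group to a factor with explicit levels; a sketch of that alternative (not used in the rest of this dive):

# alternative (not used below): order age groups via factor levels
age_levels <- c("Overall", "0 - 4 Years", "5 - 11 Years", "12 - 15 Years",
                "16 - 17 Years", "18 - 29 Years", "30 - 39 Years",
                "40 - 49 Years", "50 - 64 Years", "65 - 74 Years", "75+ Years")
covid |>
  mutate(age_group = factor(age_group, levels = age_levels)) |>
  arrange(end_of_week, age_group, sex, race_ethnicity_combined)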

Logistic Regression Model

For this data dive, we’ll construct a logistic regression model as an example of a generalized linear model.

Logistic regression is used to model a binary response variable, i.e., a value coded as 0 or 1. The response can come from a categorical variable with two levels or from a continuous variable that has been split into two categories.
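
In general form, the model relates the probability \(p\) that the response equals 1 to an explanatory variable \(x\) through the log-odds (logit) link:

\[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x \qquad\Longleftrightarrow\qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]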

Response Variable and Explanatory Variable

For this dataset, we’ll use age group as the response variable. There are 10 age groups in the dataset (as well as a value for Overall, representing all ages combined). However, for our model, we’ll create a calculated binary variable to code the age group as 0 or 1, depending on whether the age group is 75+ Years (or not).

For the explanatory variable, we’ll use the weekly death rate, which represents the number of deaths per 100K population. In the dataset, weekly death rates are reported for each demographic subgroup.
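
Before building the model, it can help to take a quick look at both variables (output not shown); the column names are the ones listed in the read-in specification above:

# quick look at the two variables used in the model
covid |> count(age_group)
summary(covid$death_crude_rate_suppressed_per_100k)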

Code Binary Values for Response Variable

First, let’s create a dataframe that excludes the rows providing Overall summary counts for the demographic subgroups, as well as rows with missing values (NA) for death rate.

# Filter data to exclude weekly summary rows ("Overall") and to exclude rows with missing data (NA) for death rate
weekly_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  filter(!is.na(death_crude_rate_suppressed_per_100k))
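
A quick check (output not shown) that the summary rows and missing rates really were removed:

# verify that no "Overall" rows or missing death rates remain
weekly_data |>
  summarise(
    any_overall = any(age_group == "Overall" | sex == "Overall" |
                        race_ethnicity_combined == "Overall"),
    any_na_rate = any(is.na(death_crude_rate_suppressed_per_100k))
  )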

Next, let’s add a new column named is_75_plus to code the age group as a binary value of 0 or 1, depending on whether the value is 75+ Years or not:

  • 0 = age group is less than 75 years
  • 1 = age group is 75+ years

# create coded binary variable for age group
weekly_data <- weekly_data |>
  mutate(is_75_plus = case_when(
    age_group != "75+ Years" ~ 0L,
    age_group == "75+ Years" ~ 1L),
    .after = age_order
  )
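
To confirm the coding, we can cross-tabulate the new column against age_group (output not shown); every 75+ Years row should map to 1 and every other age group to 0:

# cross-tabulate age group against the new binary indicator
weekly_data |> count(age_group, is_75_plus)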

Explanatory Variable vs. Response Variable

Let’s plot our explanatory variable (death rate) vs. the binary response variable (coded age group).

# plot of death rate vs. binary age group variable
weekly_data |>
  ggplot(mapping = aes(x = death_crude_rate_suppressed_per_100k, y = is_75_plus)) +
  geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  labs(title = "COVID-19 Death Rates vs. Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Death Rate (per 100K population)",
       y = "Is 75+ Years") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Transform Explanatory Variable

Weekly death rate is bounded below by zero, i.e., negative death rates are not possible. This likely produces a non-normal, right-skewed distribution of values.

For zero-bounded values (such as prices, counts, etc.), a log transform can often produce a distribution that is closer to normal and thus better suited for use in a generalized linear model.

For comparison, here is the histogram of death rates compared to the histogram of \(\log(\text{death rate})\).

# histogram of death rate
p1 <- ggplot(data = weekly_data) +
  geom_histogram(aes(x = death_crude_rate_suppressed_per_100k), binwidth = 5, color='white') +
  labs(x = "Death Rate")

# histogram of log(death rate)
p2 <- ggplot(data = weekly_data) +
  geom_histogram(aes(x = log(death_crude_rate_suppressed_per_100k)), binwidth = 0.25, color='white') +
  labs(x = "log(Death Rate)")

# display side-by-side
p1 + p2

As can be seen in the histograms, using a log transform on the death rate produces a more normal distribution of values for this explanatory variable.

Let’s create a logistic regression model that utilizes \(\log(\text{death rate})\) for the explanatory variable.

# logistic regression model using log(death rate)
glm_model2 <- glm(is_75_plus ~ I(log(death_crude_rate_suppressed_per_100k)), data = weekly_data, family = binomial(link = 'logit'))

glm_model2$coefficients
##                                  (Intercept) 
##                                    -2.576403 
## I(log(death_crude_rate_suppressed_per_100k)) 
##                                     1.051477

In this case, the linear model can be represented as:

\[ \text{log odds of $y$} = \log\left(\frac{p}{1 - p}\right) = -2.576 + 1.051\times\log(\text{death rate}) \]

(where \(\log = \ln\), the natural logarithm)

The beta coefficient for \(\log(\text{death rate})\) tells us that for every one-unit increase in \(\log(\text{death rate})\), the odds of the age group being 75+ Years are multiplied by \(e^{1.051} \approx 2.861\) (i.e., increase by about 186%).
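
The same odds ratio can be computed directly from the fitted model instead of by hand; a minimal sketch:

# exponentiate the coefficients: the slope term is the odds ratio
# associated with a one-unit increase in log(death rate)
exp(coef(glm_model2))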

The model coefficients can also be used to calculate the 50% probability threshold for the explanatory variable. At a probability of 50%, the odds are equal to 1, and since \(\log(1) = 0\):

\[ \begin{align} \log(\text{odds}) &= -2.576 + 1.051 \times \log(\text{death rate}) \newline \log(1) &= -2.576 + 1.051 \times \log(\text{death rate}) \newline 0 &= -2.576 + 1.051 \times \log(\text{death rate}) \newline 2.576 & = 1.051 \times \log(\text{death rate}) \newline \log(\text{death rate}) &= \frac{2.576}{1.051} = 2.451 \newline \text{death rate} &= e^{2.451} = 11.6 \end{align} \]

For this model using \(\log(\text{death rate})\), the 50% probability threshold is at 11.6 deaths per 100K population:

  • For death rates below 11.6, the age group is more likely to be less than 75 years.
  • For death rates above 11.6, the age group is more likely to be 75+ years.
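
As a quick check, the same 50% threshold can be recovered programmatically from the fitted coefficients:

# 50% probability threshold: solve 0 = b0 + b1 * log(rate) for the death rate
b <- coef(glm_model2)
exp(-b[1] / b[2])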

Let’s generate a plot of this logistic regression model.

# Coefficients from model
intercept <- -2.576403
log_death_rate_beta <- 1.051477

# Sigmoid function for model
sigmoid <- \(x) 1 / (1 + exp(-(intercept + log_death_rate_beta * x)))

# plot with points, sigmoid curve, and 50% probability threshold
weekly_data |>
  ggplot(mapping = aes(x = log(death_crude_rate_suppressed_per_100k), y = is_75_plus)) +
  geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  geom_vline(xintercept = (- intercept / log_death_rate_beta), color = 'red', linetype = "dashed") +
  annotate("text", x = 2.9, y = 0.5, label = round((- intercept / log_death_rate_beta), 2), color = 'red') +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  labs(title = "COVID-19 Log(Death Rate) vs. Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "log(Weekly Death Rate, per 100K population)",
       y = "Is 75+ Years") +
  theme(plot.subtitle = element_text(colour = "darkgray"))
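
As a design note, ggplot2 can also fit and draw the curve itself via geom_smooth(), which avoids hard-coding the coefficients; a sketch of that alternative:

# alternative: let ggplot2 fit the logistic curve directly
weekly_data |>
  ggplot(aes(x = log(death_crude_rate_suppressed_per_100k), y = is_75_plus)) +
  geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
  geom_smooth(method = "glm", method.args = list(family = binomial),
              se = FALSE, color = 'blue')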

We can get the 95% confidence intervals for the model coefficients.

# get model estimates and 95% CI
tidy(glm_model2, conf.int = TRUE) |>
  select(term, estimate, conf.low, conf.high)
## # A tibble: 2 × 4
##   term                                         estimate conf.low conf.high
##   <chr>                                           <dbl>    <dbl>     <dbl>
## 1 (Intercept)                                     -2.58   -2.79      -2.38
## 2 I(log(death_crude_rate_suppressed_per_100k))     1.05    0.963      1.14

For this model, the 95% confidence interval for the beta coefficient of \(\log(\text{death rate})\) ranges from 0.963 to 1.143.
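
The same interval can be expressed on the odds-ratio scale by exponentiating; broom’s tidy() supports this directly. A minimal sketch:

# estimates and confidence limits on the odds-ratio scale
tidy(glm_model2, conf.int = TRUE, exponentiate = TRUE) |>
  select(term, estimate, conf.low, conf.high)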


This dataset will be explored further in subsequent data dives.