COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive will focus on developing a logistic regression model as an example of a generalized linear model for a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.

In addition to overall weekly counts and rates, the weekly data is broken out by demographic combinations of age group, sex at birth, and race/ethnicity group.
If the weekly count of cases or deaths for a specific demographic subgroup was 5 or fewer, the CDC suppressed the count (resulting in explicit missing data), in order to protect the confidentiality of patients.
Additional background for this dataset can be reviewed in the Week 5 Data Dive or in the CDC documentation for the full dataset (which included all US regions).

Load Libraries and Dataset

To get started, we’ll load several R packages to assist with our analysis.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(patchwork)
library(broom)
library(lindia)
theme_set(theme_minimal())

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")

## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s convert end_of_week from a character format to a date format.

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")

Let’s add a new column to indicate the order of the age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.

# Add new column to order age groups correctly
covid <- covid |>
  mutate(age_order = case_when(
    age_group == "Overall" ~ 0L,
    age_group == "0 - 4 Years" ~ 1L,
    age_group == "5 - 11 Years" ~ 2L,
    age_group == "12 - 15 Years" ~ 3L,
    age_group == "16 - 17 Years" ~ 4L,
    age_group == "18 - 29 Years" ~ 5L,
    age_group == "30 - 39 Years" ~ 6L,
    age_group == "40 - 49 Years" ~ 7L,
    age_group == "50 - 64 Years" ~ 8L,
    age_group == "65 - 74 Years" ~ 9L,
    age_group == "75+ Years" ~ 10L),
    .after = age_group) |>
  arrange(end_of_week, age_order, sex, race_ethnicity_combined)

Logistic Regression Model

For this data dive, we’ll construct a logistic regression model as an example of a generalized linear model.

Logistic regression is used to model a binary response variable, i.e., a value that could be coded as 0 or 1. This could be a categorical variable or a continuous variable transformed into categorical levels.

Response Variable and Explanatory Variable

For this dataset, we’ll use age group as the response variable. There are 10 age groups in the dataset (as well as a value for Overall, representing all ages combined). However, for our model, we’ll create a calculated binary variable to code the age group as 0 or 1, depending on whether the age group is 75+ Years (or not).

For the explanatory variable, we’ll use weekly death rate which represents the number of deaths per 100K population. In the dataset, weekly death rates are reported for each demographic subgroup.

Code Binary Values for Response Variable

First, let’s create a dataframe that filters out the rows that provide Overall summary counts for the demographic subgroups and also filters out rows with missing values (NA) for death rate.

# Filter data to exclude weekly summary rows ("Overall") and to exclude rows with missing data (NA) for death rate
weekly_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  filter(!is.na(death_crude_rate_suppressed_per_100k))

Next, let’s add a new column named is_75_plus to code the age group as a binary value of 0 or 1, depending on whether the value is 75+ Years or not:

0 = age group is less than 75 years
1 = age group is 75+ years

# create coded binary variable for age group
weekly_data <- weekly_data |>
  mutate(is_75_plus = case_when(
    age_group != "75+ Years" ~ 0L,
    age_group == "75+ Years" ~ 1L),
    .after = age_order
  )

Explanatory Variable vs. Response Variable

Let’s plot our explanatory variable (death rate) vs. the binary response variable (coded age group).

# plot of death rate vs. binary age group variable
weekly_data |>
  ggplot(mapping = aes(x = death_crude_rate_suppressed_per_100k, y = is_75_plus)) +
  geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  labs(title = "COVID-19 Death Rates vs. Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Death Rate (per 100K population)",
       y = "Is 75+ Years") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

Link Function

Next, we use a link function to transform the value of $y$ into a linear model. A logistic regression model uses “logit” as its link function to transform a binary variable into a linear model.

The logit function represents the log-odds of of $y$ (or the natural logarithm of the odds of $y$). The logit function is a sigmoid function, meaning it has an S-shaped curve.

Let’s generate the logistic regression model for our explanatory variable and response variable.

# logistic regression model
glm_model <- glm(is_75_plus ~ death_crude_rate_suppressed_per_100k, data = weekly_data,
             family = binomial(link = 'logit'))

glm_model$coefficients

##                          (Intercept) death_crude_rate_suppressed_per_100k 
##                          -1.71143478                           0.09658452

Using these model coefficients, the linear model using the “logit” link function can be represented as:

\[ \text{log odds of $y$} = \log\left(\frac{p}{1 - p}\right) = -1.711 + 0.097\times\text{death rate} \]

(where $\log = \ln$, the natural logarithm)

The beta coefficient for death rate tells us that for every unit increase in the death rate, the odds of the age group being 75+ Years increases by $e^{0.097} \approx 1.101$ (i.e., by about 10%).

The model coefficients can also be used to calculate the 50% probability threshold for the explanatory variable. For a probability of 50%, the odds are equal to 1, and since $\log(1) = 0$:

\[ \begin{align} \log(\text{odds}) &= -1.711 + 0.097 \times \text{death rate} \newline \log(1) &= -1.711 + 0.097 \times \text{death rate} \newline 0 &= -1.711 + 0.097 \times \text{death rate} \newline 1.711 & = 0.097 \times \text{death rate} \newline \text{death rate} &= \frac{1.711}{0.097} = 17.6 \end{align} \]

This 50% probability threshold of 17.6 deaths per 100K population is a decision point (or inflection point) that tells us:

For death rates less than 17.6, the odds are more likely that the age group is less than 75 years (i.e., is not 75+ years).
For death rates above 17.6, the odds are more likely that the age group is 75+ years.

These model coefficients can be used to add a blue sigmoid curve to our plot that represents the model function, as well as a red vertical line representing the 50% probability threshold.

# Coefficients from logistic regression model
intercept <- -1.711
death_rate_beta <- 0.097

# Sigmoid function for model
sigmoid <- \(x) 1 / (1 + exp(-(intercept + death_rate_beta * x)))

# plot with points, sigmoid curve, and 50% probability threshold
weekly_data |>
  ggplot(mapping = aes(x = death_crude_rate_suppressed_per_100k, y = is_75_plus)) +
  geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  geom_vline(xintercept = (- intercept / death_rate_beta), color = 'red', linetype = "dashed") +
  annotate("text", x = 25, y = 0.5, label = round((- intercept / death_rate_beta), 1), color = 'red') +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  labs(title = "COVID-19 Death Rates vs. Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Death Rate (per 100K population)",
       y = "Is 75+ Years") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

The beta coefficient for the death rate was estimated to be 0.097. We can get the 95% confidence interval for this coefficient from the model.

# get model estimates and 95% CI
tidy(glm_model, conf.int = TRUE) |>
  group_by(term) |>
  summarise(
    term,
    estimate,
    conf.low,
    conf.high
  )

## # A tibble: 2 × 4
##   term                                 estimate conf.low conf.high
##   <chr>                                   <dbl>    <dbl>     <dbl>
## 1 (Intercept)                           -1.71    -1.84      -1.58 
## 2 death_crude_rate_suppressed_per_100k   0.0966   0.0865     0.107

The 95% confidence interval for the beta coefficient of death rate ranges from 0.086 to 0.107.

Transform Explanatory Variable

Weekly death rate is a variable bound by zero: i.e., negative death rates are not possible. This likely results in a non-normal distribution of values that is right-skewed.

For zero-bound values (such as price, counts, etc.) a log transform of the values can often be used to produce a distribution that is more normal and thus better suited to use in a generalized linear model.

For comparison, here is the histogram of death rates compared to the histogram of $\log(\text{death rate})$.

# histogram of death rate
p1 <- ggplot(data = weekly_data) +
  geom_histogram(aes(x = death_crude_rate_suppressed_per_100k), binwidth = 5, color='white') +
  labs(x = "Death Rate")

# histogram of log(death rate)
p2 <- ggplot(data = weekly_data) +
  geom_histogram(aes(x = log(death_crude_rate_suppressed_per_100k)), binwidth = 0.25, color='white') +
  labs(x = "log(Death Rate)")

# display side-by-side
p1 + p2

As can be seen in the histograms, using a log transform on the death rate produces a more normal distribution of values for this explanatory variable.

Let’s create a logistic regression model that utilizes $\log(\text{death rate})$ for the explanatory variable.

# logistic regression model using log(death rate)
glm_model2 <- glm(is_75_plus ~ I(log(death_crude_rate_suppressed_per_100k)), data = weekly_data, family = binomial(link = 'logit'))

glm_model2$coefficients

##                                  (Intercept) 
##                                    -2.576403 
## I(log(death_crude_rate_suppressed_per_100k)) 
##                                     1.051477

In this case, the linear model can be represented as:

\[ \text{log odds of $y$} = \log\left(\frac{p}{1 - p}\right) = -2.576 + 1.051\times\log(\text{death rate}) \]

(where $\log = \ln$, the natural logarithm)

The beta coefficient for death rate tells us that for every unit increase in the $\log(\text{death rate})$, the odds of the age group being 75+ Years increases by $e^{1.051} \approx 2.861$ (i.e., by about 186%).

Again, the model coefficients can also be used to calculate the 50% probability threshold for the explanatory variable. For a probability of 50%, the odds are equal to 1, and since $\log(1) = 0$:

\[ \begin{align} \log(\text{odds}) &= -2.576 + 1.051 \times \log(\text{death rate}) \newline \log(1) &= -2.576 + 1.051 \times \log(\text{death rate}) \newline 0 &= -2.576 + 1.051 \times \log(\text{death rate}) \newline 2.576 & = 1.051 \times \log(\text{death rate}) \newline \log(\text{death rate}) &= \frac{2.576}{1.051} = 2.451 \newline \text{death rate} &= e^{2.451} = 11.6 \end{align} \]

For this model using $\log(\text{death rate})$, the 50% probability threshold is at 11.6 deaths per 100K population:

For death rates less than 11.6, the odds are more likely that the age group is less than 75 years.
For death rates above 11.6, the odds are more likely that the age group is 75+ years.

Let’s generate a plot of this new logistic regression model.

# Coefficients from model
intercept <- -2.576403
log_death_rate_beta <- 1.051477

# Sigmoid function for model
sigmoid <- \(x) 1 / (1 + exp(-(intercept + log_death_rate_beta * x)))

# plot with points, sigmoid curve, and 50% probability threshold
weekly_data |>
  ggplot(mapping = aes(x = log(death_crude_rate_suppressed_per_100k), y = is_75_plus)) +
  geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  geom_vline(xintercept = (- intercept / log_death_rate_beta), color = 'red', linetype = "dashed") +
  annotate("text", x = 2.9, y = 0.5, label = round((- intercept / log_death_rate_beta), 2), color = 'red') +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  labs(title = "COVID-19 Log(Death Rate) vs. Age Group (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "log(Weekly Death Rate, per 100K population)",
       y = "Is 75+ Years") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

We can get the 95% confidence intervals for the model coefficients.

# get model estimates and 95% CI
tidy(glm_model2, conf.int = TRUE) |>
  group_by(term) |>
  summarise(
    term,
    estimate,
    conf.low,
    conf.high
  )

## # A tibble: 2 × 4
##   term                                         estimate conf.low conf.high
##   <chr>                                           <dbl>    <dbl>     <dbl>
## 1 (Intercept)                                     -2.58   -2.79      -2.38
## 2 I(log(death_crude_rate_suppressed_per_100k))     1.05    0.963      1.14

For this model, the 95% confidence interval for the beta coefficient of $\log(\text{death rate})$ ranges from 0.963 to 1.143.

This dataset will be explored further in subsequent data dives.

H510 Week 10 Data Dive

Michael Frontz

2024-03-25