This data dive will focus on developing a logistic regression model as an example of a generalized linear model for a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.
To get started, we’ll load several R packages to assist with our analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(patchwork)
library(broom)
library(lindia)
theme_set(theme_minimal())
Next, let’s read in the dataset from CSV.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s convert end_of_week from a character format to a
date format.
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
Let’s add a new column to indicate the order of the age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.
# Add new column to order age groups correctly
covid <- covid |>
mutate(age_order = case_when(
age_group == "Overall" ~ 0L,
age_group == "0 - 4 Years" ~ 1L,
age_group == "5 - 11 Years" ~ 2L,
age_group == "12 - 15 Years" ~ 3L,
age_group == "16 - 17 Years" ~ 4L,
age_group == "18 - 29 Years" ~ 5L,
age_group == "30 - 39 Years" ~ 6L,
age_group == "40 - 49 Years" ~ 7L,
age_group == "50 - 64 Years" ~ 8L,
age_group == "65 - 74 Years" ~ 9L,
age_group == "75+ Years" ~ 10L),
.after = age_group) |>
arrange(end_of_week, age_order, sex, race_ethnicity_combined)
For this data dive, we’ll construct a logistic regression model as an example of a generalized linear model.
Logistic regression is used to model a binary response variable, i.e., a value that could be coded as 0 or 1. This could be a categorical variable or a continuous variable transformed into categorical levels.
For this dataset, we’ll use age group as the
response variable. There are 10 age groups in the dataset (as well as a
value for Overall, representing all ages combined).
However, for our model, we’ll create a calculated binary variable to
code the age group as 0 or 1, depending on whether the age group is
75+ Years (or not).
For the explanatory variable, we’ll use weekly death rate which represents the number of deaths per 100K population. In the dataset, weekly death rates are reported for each demographic subgroup.
First, let’s create a dataframe that filters out the rows that
provide Overall summary counts for the demographic
subgroups and also filters out rows with missing values
(NA) for death rate.
# Filter data to exclude weekly summary rows ("Overall") and to exclude rows with missing data (NA) for death rate
weekly_data <- covid |>
filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
filter(!is.na(death_crude_rate_suppressed_per_100k))
Next, let’s add a new column named is_75_plus to code
the age group as a binary value of 0 or 1, depending on whether the
value is 75+ Years or not:
0 = age group is less than 75 years1 = age group is 75+ years# create coded binary variable for age group
weekly_data <- weekly_data |>
mutate(is_75_plus = case_when(
age_group != "75+ Years" ~ 0L,
age_group == "75+ Years" ~ 1L),
.after = age_order
)
Let’s plot our explanatory variable (death rate) vs. the binary response variable (coded age group).
# plot of death rate vs. binary age group variable
weekly_data |>
ggplot(mapping = aes(x = death_crude_rate_suppressed_per_100k, y = is_75_plus)) +
geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
labs(title = "COVID-19 Death Rates vs. Age Group (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "Weekly Death Rate (per 100K population)",
y = "Is 75+ Years") +
theme(plot.subtitle = element_text(colour = "darkgray"))
Next, we use a link function to transform the value of \(y\) into a linear model. A logistic regression model uses “logit” as its link function to transform a binary variable into a linear model.
The logit function represents the log-odds of of \(y\) (or the natural logarithm of the odds of \(y\)). The logit function is a sigmoid function, meaning it has an S-shaped curve.
Let’s generate the logistic regression model for our explanatory variable and response variable.
# logistic regression model
glm_model <- glm(is_75_plus ~ death_crude_rate_suppressed_per_100k, data = weekly_data,
family = binomial(link = 'logit'))
glm_model$coefficients
## (Intercept) death_crude_rate_suppressed_per_100k
## -1.71143478 0.09658452
Using these model coefficients, the linear model using the “logit” link function can be represented as:
\[ \text{log odds of $y$} = \log\left(\frac{p}{1 - p}\right) = -1.711 + 0.097\times\text{death rate} \]
(where \(\log = \ln\), the natural logarithm)
The beta coefficient for death rate tells us that for every unit increase in the death rate, the odds of the age group being 75+ Years increases by \(e^{0.097} \approx 1.101\) (i.e., by about 10%).
The model coefficients can also be used to calculate the 50% probability threshold for the explanatory variable. For a probability of 50%, the odds are equal to 1, and since \(\log(1) = 0\):
\[ \begin{align} \log(\text{odds}) &= -1.711 + 0.097 \times \text{death rate} \newline \log(1) &= -1.711 + 0.097 \times \text{death rate} \newline 0 &= -1.711 + 0.097 \times \text{death rate} \newline 1.711 & = 0.097 \times \text{death rate} \newline \text{death rate} &= \frac{1.711}{0.097} = 17.6 \end{align} \]
This 50% probability threshold of 17.6 deaths per 100K population is a decision point (or inflection point) that tells us:
These model coefficients can be used to add a blue sigmoid curve to our plot that represents the model function, as well as a red vertical line representing the 50% probability threshold.
# Coefficients from logistic regression model
intercept <- -1.711
death_rate_beta <- 0.097
# Sigmoid function for model
sigmoid <- \(x) 1 / (1 + exp(-(intercept + death_rate_beta * x)))
# plot with points, sigmoid curve, and 50% probability threshold
weekly_data |>
ggplot(mapping = aes(x = death_crude_rate_suppressed_per_100k, y = is_75_plus)) +
geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
geom_vline(xintercept = (- intercept / death_rate_beta), color = 'red', linetype = "dashed") +
annotate("text", x = 25, y = 0.5, label = round((- intercept / death_rate_beta), 1), color = 'red') +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
labs(title = "COVID-19 Death Rates vs. Age Group (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "Weekly Death Rate (per 100K population)",
y = "Is 75+ Years") +
theme(plot.subtitle = element_text(colour = "darkgray"))
The beta coefficient for the death rate was estimated to be 0.097. We can get the 95% confidence interval for this coefficient from the model.
# get model estimates and 95% CI
tidy(glm_model, conf.int = TRUE) |>
group_by(term) |>
summarise(
term,
estimate,
conf.low,
conf.high
)
## # A tibble: 2 × 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.71 -1.84 -1.58
## 2 death_crude_rate_suppressed_per_100k 0.0966 0.0865 0.107
The 95% confidence interval for the beta coefficient of death rate ranges from 0.086 to 0.107.
Weekly death rate is a variable bound by zero: i.e., negative death rates are not possible. This likely results in a non-normal distribution of values that is right-skewed.
For zero-bound values (such as price, counts, etc.) a log transform of the values can often be used to produce a distribution that is more normal and thus better suited to use in a generalized linear model.
For comparison, here is the histogram of death rates compared to the histogram of \(\log(\text{death rate})\).
# histogram of death rate
p1 <- ggplot(data = weekly_data) +
geom_histogram(aes(x = death_crude_rate_suppressed_per_100k), binwidth = 5, color='white') +
labs(x = "Death Rate")
# histogram of log(death rate)
p2 <- ggplot(data = weekly_data) +
geom_histogram(aes(x = log(death_crude_rate_suppressed_per_100k)), binwidth = 0.25, color='white') +
labs(x = "log(Death Rate)")
# display side-by-side
p1 + p2
As can be seen in the histograms, using a log transform on the death rate produces a more normal distribution of values for this explanatory variable.
Let’s create a logistic regression model that utilizes \(\log(\text{death rate})\) for the explanatory variable.
# logistic regression model using log(death rate)
glm_model2 <- glm(is_75_plus ~ I(log(death_crude_rate_suppressed_per_100k)), data = weekly_data, family = binomial(link = 'logit'))
glm_model2$coefficients
## (Intercept)
## -2.576403
## I(log(death_crude_rate_suppressed_per_100k))
## 1.051477
In this case, the linear model can be represented as:
\[ \text{log odds of $y$} = \log\left(\frac{p}{1 - p}\right) = -2.576 + 1.051\times\log(\text{death rate}) \]
(where \(\log = \ln\), the natural logarithm)
The beta coefficient for death rate tells us that for every unit increase in the \(\log(\text{death rate})\), the odds of the age group being 75+ Years increases by \(e^{1.051} \approx 2.861\) (i.e., by about 186%).
Again, the model coefficients can also be used to calculate the 50% probability threshold for the explanatory variable. For a probability of 50%, the odds are equal to 1, and since \(\log(1) = 0\):
\[ \begin{align} \log(\text{odds}) &= -2.576 + 1.051 \times \log(\text{death rate}) \newline \log(1) &= -2.576 + 1.051 \times \log(\text{death rate}) \newline 0 &= -2.576 + 1.051 \times \log(\text{death rate}) \newline 2.576 & = 1.051 \times \log(\text{death rate}) \newline \log(\text{death rate}) &= \frac{2.576}{1.051} = 2.451 \newline \text{death rate} &= e^{2.451} = 11.6 \end{align} \]
For this model using \(\log(\text{death rate})\), the 50% probability threshold is at 11.6 deaths per 100K population:
Let’s generate a plot of this new logistic regression model.
# Coefficients from model
intercept <- -2.576403
log_death_rate_beta <- 1.051477
# Sigmoid function for model
sigmoid <- \(x) 1 / (1 + exp(-(intercept + log_death_rate_beta * x)))
# plot with points, sigmoid curve, and 50% probability threshold
weekly_data |>
ggplot(mapping = aes(x = log(death_crude_rate_suppressed_per_100k), y = is_75_plus)) +
geom_jitter(width = 0, height = 0.1, shape = 'o', size = 2) +
geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
geom_vline(xintercept = (- intercept / log_death_rate_beta), color = 'red', linetype = "dashed") +
annotate("text", x = 2.9, y = 0.5, label = round((- intercept / log_death_rate_beta), 2), color = 'red') +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
labs(title = "COVID-19 Log(Death Rate) vs. Age Group (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "log(Weekly Death Rate, per 100K population)",
y = "Is 75+ Years") +
theme(plot.subtitle = element_text(colour = "darkgray"))
We can get the 95% confidence intervals for the model coefficients.
# get model estimates and 95% CI
tidy(glm_model2, conf.int = TRUE) |>
group_by(term) |>
summarise(
term,
estimate,
conf.low,
conf.high
)
## # A tibble: 2 × 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.58 -2.79 -2.38
## 2 I(log(death_crude_rate_suppressed_per_100k)) 1.05 0.963 1.14
For this model, the 95% confidence interval for the beta coefficient of \(\log(\text{death rate})\) ranges from 0.963 to 1.143.
This dataset will be explored further in subsequent data dives.