COVID-19 Weekly Cases and Deaths by Age, Race/Ethnicity, and Sex (Mar 2020 - Nov 2023)

This data dive will focus on developing a complex linear regression model and performing regression diagnostics using a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.


Load Libraries and Dataset

To get started, we’ll load several R packages to assist with our analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(broom)
library(lindia)

# default theme for plots
theme_set(theme_minimal())

Next, let’s read in the dataset from CSV.

covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s convert end_of_week from a character format to a date format.

covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")

Let’s add a new column to indicate the order of the age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.

# Add new column to order age groups correctly
covid <- covid |>
  mutate(age_order = case_when(
    age_group == "Overall" ~ 0L,
    age_group == "0 - 4 Years" ~ 1L,
    age_group == "5 - 11 Years" ~ 2L,
    age_group == "12 - 15 Years" ~ 3L,
    age_group == "16 - 17 Years" ~ 4L,
    age_group == "18 - 29 Years" ~ 5L,
    age_group == "30 - 39 Years" ~ 6L,
    age_group == "40 - 49 Years" ~ 7L,
    age_group == "50 - 64 Years" ~ 8L,
    age_group == "65 - 74 Years" ~ 9L,
    age_group == "75+ Years" ~ 10L),
    .after = age_group) |>
  arrange(end_of_week, age_order, sex, race_ethnicity_combined)

Linear Regression Models

We’ll be building upon a linear regression model developed in the previous data dive.

Model 1: Death Rate ~ Case Rate

The initial linear model examined the effect of a single explanatory variable on a response variable.

Weekly case rate was selected as the explanatory variable (independent variable), and weekly death rate was selected as the response variable (dependent variable).

For this model, we’ll create a dataframe that filters out the rows that provide “Overall” summary counts for the demographic subgroups and also filters out rows with missing values (NA) for case rate or death rate.

# Filter data to exclude weekly summary rows ("Overall") and to exclude rows with missing data (NA) for case rate or death rate
weekly_data <- covid |>
  filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  filter(!is.na(case_crude_rate_suppressed_per_100k)) |>
  filter(!is.na(death_crude_rate_suppressed_per_100k))

# Get count of number of rows in filtered dataset
nrow(weekly_data)
## [1] 2589

Let’s plot the weekly case rates vs. the weekly death rates and fit a linear model to the data.

# scatter plot of case rates vs death rates with linear model fitted
weekly_data |>
  ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
                      y = death_crude_rate_suppressed_per_100k)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Case Rate (per 100K population)",
       y = "Weekly Death Rate (per 100K population)") +
  theme(plot.subtitle = element_text(colour = "darkgray"))
## `geom_smooth()` using formula = 'y ~ x'

Here’s a summary of the linear model (Model 1) shown in the plot.

# linear model with 1 explanatory variable: lm(y ~ x, data)
model1 <- lm(death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k, weekly_data)

summary(model1)
## 
## Call:
## lm(formula = death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k, 
##     data = weekly_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.682  -8.142  -5.679   2.284 172.066 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         7.040634   0.491888   14.31   <2e-16 ***
## case_crude_rate_suppressed_per_100k 0.023962   0.002042   11.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.39 on 2587 degrees of freedom
## Multiple R-squared:  0.05053,    Adjusted R-squared:  0.05016 
## F-statistic: 137.7 on 1 and 2587 DF,  p-value: < 2.2e-16

For Model 1:

  • As the weekly case rate increases by 1 case per 100K population, the weekly death rate will increase, on average, by 0.024 deaths per 100K population.
    • An alternate way to interpret this relationship: As the weekly case rate increases by ~42 cases per 100K population, the weekly death rate will increase, on average, by 1 death per 100K population.
  • The p-value for case rate is very low, which allows us to reject the null hypothesis that there is no effect of case rate on death rate.
    • Thus, we will assume that the weekly case rate does affect the weekly death rate.
  • Although the model fit is statistically significant, the \(R^2\) (coefficient of determination) is only 0.05, which means this model can only account for 5% of the variance in weekly death rate.

Model 2: Death Rate ~ Case Rate + Sex At Birth

Let’s create a more complex linear model (Model 2) by adding sex at birth as an additional variable.

Sex at birth has two values (female or male), so it will act as a binary term in the model. R will automatically use “dummy coding” to recode the variable values as 0 or 1.

First, let’s confirm whether there is a linear relationship between case rate and death rate for each sex.

# Faceted plots of case rate vs. death rate by sex at birth
weekly_data |>
  ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
                      y = death_crude_rate_suppressed_per_100k)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Faceted by Sex At Birth - Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Case Rate (per 100K population)",
       y = "Weekly Death Rate (per 100K population)") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  facet_wrap(~sex)
## `geom_smooth()` using formula = 'y ~ x'

The plots show roughly similar patterns for each sex, though there are some differences. The graph for females has some points with much higher case rates, while the graph for males some points with much higher death rates. The regression line for male has a steeper slope (i.e., a larger increase in death rate for each unit increase in case rate).

Here’s a summary of Model 2, which adds sex at birth as a binary term to the linear model.

# linear model with 2 explanatory variables
model2 <- lm(death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k + sex, weekly_data)

summary(model2)
## 
## Call:
## lm(formula = death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k + 
##     sex, data = weekly_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.071  -8.620  -4.984   2.220 170.256 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         4.974853   0.619165   8.035 1.41e-15 ***
## case_crude_rate_suppressed_per_100k 0.024301   0.002032  11.959  < 2e-16 ***
## sexMale                             3.714026   0.682390   5.443 5.74e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.3 on 2586 degrees of freedom
## Multiple R-squared:  0.06128,    Adjusted R-squared:  0.06055 
## F-statistic: 84.41 on 2 and 2586 DF,  p-value: < 2.2e-16

For Model 2:

  • If the sex at birth remains the same and the weekly case rate increases by 1 case per 100K population, the weekly death rate will increase, on average, by 0.024 deaths per 100K population.
  • If the weekly case rate remains the same and the sex is male, the weekly death rate will increase, on average, by 3.71 deaths per 100K population.
  • The p-value for case rate is very low, which allows us to reject the null hypothesis that there is no effect of case rate on death rate.
    • Thus, we will assume that the weekly case rate does affect the weekly death rate.
  • The p-value for sex at birth is very low, which allows us to reject the null hypothesis that there is no effect of sex on death rate.
    • Thus, we will assume that sex does affect the weekly death rate.
  • Although the model fit is statistically significant, the adjusted \(R^2\) is only 0.06, which means this model can still only account for 6% of the variance in weekly death rate.
    • Thus, Model 2 is only a very slight improvement over the previous model (which accounted for 5% of the variance).

Model 3: Death Rate ~ Case Rate + Sex At Birth + Age Range

Let’s create a third iteration of the linear model (Model 3) by adding age range as an additional variable.

The age_group variable in the original covid dataset has 10 groups (plus a value for “Overall” to summarize data across all age groups). Let’s create a new column named age_range to reduce the number of age groups by combining them into reasonable ranges:

  • Under_18 = 0-4 Years, 5-11 Years, 12-15 Years, 16-17 Years
  • 18_49_Years = 18-29 Years, 30-39 Years, 40-49 Years
  • 50_74_Years = 50-64 Years, 65-74 Years
  • 75_Older = 75+ Years
# add new column for age ranges that combine age groups
weekly_data <- weekly_data |>
  mutate(age_range = case_when(
    age_group == "0 - 4 Years" ~ "Under_18",
    age_group == "5 - 11 Years" ~ "Under_18",
    age_group == "12 - 15 Years" ~ "Under_18",
    age_group == "16 - 17 Years" ~ "Under_18",
    age_group == "18 - 29 Years" ~ "18_49_Years",
    age_group == "30 - 39 Years" ~ "18_49_Years",
    age_group == "40 - 49 Years" ~ "18_49_Years",
    age_group == "50 - 64 Years" ~ "50_74_Years",
    age_group == "65 - 74 Years" ~ "50_74_Years",
    age_group == "75+ Years" ~ "75_Older"),
    .after = age_group)

# show number of rows for each age range
weekly_data |>
  group_by(age_range) |>
  summarise(n_rows = n())
## # A tibble: 3 × 2
##   age_range   n_rows
##   <chr>        <int>
## 1 18_49_Years    375
## 2 50_74_Years   1362
## 3 75_Older       852

Notice that there are no rows for the age range representing Under_18.


Why are there no rows for “Under 18”?

Let’s determine why there are no rows for the Under_18 age range in the weekly_data dataset.

First, let’s return to the original covid dataset to get a summary of the age groups under 18 years old.

# summary of rows in original dataset for age groups under 18 years old
covid |>
  filter(age_order > 0 & age_order < 5) |>
  group_by(age_group) |>
  summarise(
    n_rows = n(),
    cases = sum(case_count_suppressed, na.rm = TRUE),
    deaths = sum(death_count_suppressed, na.rm = TRUE),
    case_rate = mean(case_crude_rate_suppressed_per_100k, na.rm = TRUE),
    death_rate = mean(death_crude_rate_suppressed_per_100k, na.rm = TRUE)
  )
## # A tibble: 4 × 6
##   age_group     n_rows   cases deaths case_rate death_rate
##   <chr>          <int>   <dbl>  <dbl>     <dbl>      <dbl>
## 1 0 - 4 Years     3406 1999595      7      92.9       0.23
## 2 12 - 15 Years   3376 2364222      0     123.      NaN   
## 3 16 - 17 Years   3357 1425108      0     148.      NaN   
## 4 5 - 11 Years    3394 3402502      0     107.      NaN

There were only 7 deaths from COVID-19 recorded in the original dataset for individuals under 18 years of age, and all of those deaths occurred in the 0 - 4 Years age group.

Let’s get all the rows in the original dataset for the 0 - 4 Years age group which have a death count greater than zero.

# show all rows in original dataset for 0-4 Years age group with death count > 0
covid |>
  filter(age_group == "0 - 4 Years" & death_count_suppressed > 0) |>
  select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, cases = case_count_suppressed, case_rate = case_crude_rate_suppressed_per_100k, deaths = death_count_suppressed, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 8
##   end_of_week age_group   sex   race_ethnicity cases case_rate deaths death_rate
##   <date>      <chr>       <chr> <chr>          <dbl>     <dbl>  <dbl>      <dbl>
## 1 2022-01-08  0 - 4 Years Over… Overall        25739      829.      7       0.23

It turns out that those 7 deaths all occurred in the same week (week ending 01/08/2022) and were recorded in a single row which happens to be a weekly summary row (i.e., a row in which age group, sex at birth, or race/ethnicity group = Overall).

In the original dataset, the weekly summary rows include all case counts and death counts, including any cases or deaths that may have been suppressed in the other rows for that week that record data for specific demographic subgroups. As a reminder, if a case count or death count was 5 or fewer, it was suppressed by the CDC (and appears as NA in the dataset).

To verify this, let’s examine the non-summary rows for the week ending 01/08/2022 for the 0 - 4 Years age group to see the death counts and death rates for the demographic subgroups.

# show all rows in original dataset for week ending 01/08/20222 for 0-4 Years age group
# but exclude the weekly summary rows (Overall)
covid |>
  filter(end_of_week == "2022-01-08" & age_group == "0 - 4 Years" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
  select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, cases = case_count_suppressed, case_rate = case_crude_rate_suppressed_per_100k, deaths = death_count_suppressed, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 10 × 8
##    end_of_week age_group  sex   race_ethnicity cases case_rate deaths death_rate
##    <date>      <chr>      <chr> <chr>          <dbl>     <dbl>  <dbl>      <dbl>
##  1 2022-01-08  0 - 4 Yea… Fema… AI/AN, NH         53      706.     NA         NA
##  2 2022-01-08  0 - 4 Yea… Fema… Asian/PI, NH     398      685.     NA         NA
##  3 2022-01-08  0 - 4 Yea… Fema… Black, NH       1378      645.     NA         NA
##  4 2022-01-08  0 - 4 Yea… Fema… Hispanic        1411      735.     NA         NA
##  5 2022-01-08  0 - 4 Yea… Fema… White, NH       5060      519.     NA         NA
##  6 2022-01-08  0 - 4 Yea… Male  AI/AN, NH         69      871.     NA         NA
##  7 2022-01-08  0 - 4 Yea… Male  Asian/PI, NH     480      801.     NA         NA
##  8 2022-01-08  0 - 4 Yea… Male  Black, NH       1418      643.     NA         NA
##  9 2022-01-08  0 - 4 Yea… Male  Hispanic        1576      792.     NA         NA
## 10 2022-01-08  0 - 4 Yea… Male  White, NH       5644      550.     NA         NA

As we can see, each row for the demographic subgroups lists NA for the death count and death rate, which means the 7 deaths listed in the weekly summary row were suppressed (NA) in these non-summary rows, so we don’t know which of these demographic subgroups had deaths during this week.

Thus, the death rate was NA for all the non-summary rows across the entire age range under 18 years old.

When creating the weekly_data dataset, we filtered out all the summary rows, as well as any non-summary rows with missing values (NA) for case rate or death rate, which means this dataset has no rows for age groups under 18 years old. Mystery solved!


Returning to Model 3

Let’s continue with our work on Model 3. We’ll confirm whether there is a linear relationship between case rate and death rate for each age range.

# Faceted plots of case rate vs. death rate by sex at birth
weekly_data |>
  ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
                      y = death_crude_rate_suppressed_per_100k)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Faceted by Age Range - Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Case Rate (per 100K population)",
       y = "Weekly Death Rate (per 100K population)") +
  theme(plot.subtitle = element_text(colour = "darkgray")) +
  facet_wrap(~age_range)
## `geom_smooth()` using formula = 'y ~ x'

The faceted plots show a linear model applied to each age range. The patterns in the graphs vary, and the slopes of their regression lines vary. The regression line for the 18_49_Years age range is nearly horizontal, which could indicate that no linear relationship exists in that data subset.

Let’s calculate the Pearson’s correlation cofficient for the 18_49_Years age range.

# calculate correlation coefficient for 18-49 years age range
weekly_data_18_49 <- weekly_data |>
  filter(age_range == "18_49_Years")

round(cor(weekly_data_18_49$case_crude_rate_suppressed_per_100k, weekly_data_18_49$death_crude_rate_suppressed_per_100k), 2)
## [1] 0.11

The correlation coefficient for the 18_49_Years age range is very low, but it is non-zero, so we’ll proceed with including this age range in the model.

Here is a summary of Model 3, which adds age range to the linear model.

# linear model with 3 explanatory variables
model3 <- lm(death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k + sex + age_range, weekly_data)

summary(model3)
## 
## Call:
## lm(formula = death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k + 
##     sex + age_range, data = weekly_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.170  -6.672  -1.307   3.995 152.158 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -13.85224    1.01586 -13.636   <2e-16 ***
## case_crude_rate_suppressed_per_100k   0.03874    0.00181  21.410   <2e-16 ***
## sexMale                               5.22906    0.56994   9.175   <2e-16 ***
## age_range50_74_Years                 11.62884    0.89314  13.020   <2e-16 ***
## age_range75_Older                    28.53174    0.94822  30.090   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.39 on 2584 degrees of freedom
## Multiple R-squared:  0.3512, Adjusted R-squared:  0.3501 
## F-statistic: 349.6 on 4 and 2584 DF,  p-value: < 2.2e-16

Let’s plot the confidence intervals for the model coefficients.

# plot model coefficients with 95% CI
tidy(model3, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = conf.low, 
                               xmax = conf.high, 
                               height = 0.5)) +
  labs(title = "Model Coefficients 95% C.I.")

For Model 3:

  • If the sex at birth and age range remain the same and the weekly case rate increases by 1 case per 100K population, the weekly death rate will increase, on average, by 0.039 deaths per 100K population.
    • An alternate way to interpret this relationship: If the sex at birth and age range remain the same and the weekly case rate increases by ~26 cases per 100K population, the weekly death rate will increase, on average, by 1 death per 100K population.
  • If the weekly case rate and age range remain the same and the sex is male, the weekly death rate will increase, on average, by 5.23 deaths per 100K population.
    • The 95% Confidence Interval for the increase is 4.11 - 6.35 deaths per 100K population.
  • If the weekly case rate and sex at birth remain the same and the age range is 50-74 years, the weekly death rate will increase, on average, by 11.63 deaths per 100K population.
    • The 95% Confidence Interval for the increase is 9.88 - 13.38 deaths per 100K population.
  • If the weekly case rate and sex at birth remain the same and the age range is 75+ years, the weekly death rate will increase, on average, by 28.53 deaths per 100K population.
    • The 95% Confidence Interval for the increase is 26.67 - 30.39 deaths per 100K population.
  • The p-value for case rate is very low, which allows us to reject the null hypothesis that there is no effect of weekly case rate on weekly death rate.
    • Thus, we will assume that the weekly case rate does affect the weekly death rate.
  • The p-value for sex at birth is very low, which allows us to reject the null hypothesis that there is no effect of sex on weekly death rate.
    • Thus, we will assume that sex does affect the weekly death rate.
  • The p-value for age range is very low for both age range groups (50-74 years & 75+ years), which allows us to reject the null hypothesis that there is no effect of age range on weekly death rate.
    • Thus, we will assume that age range does affect the weekly death rate.
  • The adjusted \(R^2\) is now 0.35, which means this model can account for 35% of the variance in weekly death rate.
    • Thus, Model 3 is a large improvement over the previous model (which only accounted for 6% of the variance).

Diagnostic Plots for Model 3

Residuals vs. Fitted Values

Let’s examine the plot of the residual values vs. the fitted values (\(\hat{y}\) values).

# plot of residuals vs. fitted values (y hat values)
gg_resfitted(model3) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

One of the assumptions for linear regression is that the errors have constant variance across all predictions. If this is true, the plot of residuals vs. fitted values should ideally show that the variation in residuals is relatively uniform across the range of fitted values, with the blue best-fit line being close to a straight line.

In the plot above, we can see a “fanning” effect as the variance in errors increases for higher fitted values. The blue best-fit line is relatively close to a straight line but does begin to slope upwards for higher fitted values. Both of these are evidence that the errors do not have constant variance across predictions.

For higher fitted values, the larger positive residuals indicate that the model is underestimating the true value of the weekly death rates. This is not surprising, as infectious diseases, such as COVID-19, tend to follow an exponential growth pattern, rather than linear.

Residuals vs. X Values

Let’s examine the plots of the residual values vs. the \(x\) values for each variable in Model 3.

As with the previous diagnostic plot, the plot of residuals vs. \(x\) values should ideally show that the variation in residuals is relatively uniform across the range of \(x\) values for each \(x\) variable in the model.

# generate plots for residuals vs. x values (each x variable)
plots <- gg_resX(model3, plot.all = FALSE)

# show residuals for case rate
plots$case_crude_rate_suppressed_per_100k

In the plot of residual values for case rate, we can see a “fanning” effect as the variance in errors increases for case rates between approximately 100 to 750 (but above 750, the variance in errors becomes tighter and more uniform). This indicates the errors do not have constant variance across the range of case rates.

# show residals for sex at birth
plots$sex

For sex at birth, the plot of residuals shows that the interquartile range of the errors appears to be similar for females and males. However, males have more outlier observations with higher positive errors. For example, the highest positive residual for females is approximately 80, whereas the highest positive residual for males is almost double at approximately 155.

# show residuals for age range
plots$age_range

For age range, the interquartile ranges of errors are roughly similar for the 18-49 and 50-74 age ranges, though the 50-74 age range has a number of outlier observations with higher positive errors.

The interquartile range of errors is much wider for the 75+ age range, which also has outlier observations with even higher positive errors. Again, higher positive residuals indicate the model is underestimating the true value of the weekly death rate for these observations.

Residuals Histogram

Let’s plot a histogram of the residuals for Model 3.

# plot histogram of residuals
gg_reshist(model3, bins = 35)

Another assumption of linear regression is that errors are normally distributed over the prediction line. If this is true, the histogram of residuals should ideally be normal.

In the plot above, the histogram is roughly normal, though the distribution appears to be slightly right-skewed.

QQ Plot

Let’s examine the quantile-quantile (QQ) plot for Model 3.

# plot normal QQ plot (quantile-quantile)
gg_qqplot(model3)

The QQ plot is another way to determine if the residuals are normally distributed. The residual values are sorted smallest to largest and plotted, converted to quantiles, and plotted against a theoretical quantile based on a normal distribution.

If the model residuals have a perfectly normal distribution, they will follow a straight line, indicated by the red line in the QQ plot.

In the plot above, the residuals follow the theoretical line fairly closely, except for the right side (quantile values higher than 1.5). This is consistent with the slight right skew seen in the histogram of residuals.

Cook’s D by Observation

Let’s plot Cook’s Distance (D) for each observation used in Model 3.

Cook’s Distance measures how much the model’s fitted values would change if an observation were removed from the dataset. Thus, it indicates how much influence a given observation has on the overall model.

Cook’s D can be calculated for each observation in the dataset. If the Cook’s D value for a given observation is 3 times higher than the mean Cook’s D for the dataset, the observation might be an outlier.

First, we’ll show the plot without labels for the observations that are above the threshold (i.e., 3 times the mean Cook’s D), which is represented by the red dashed line in the plot.

# plot Cook's Distance for observations (without labels for obs above threshold)
gg_cooksd(model3, label = FALSE, threshold = "matlab")

There are numerous observations with a Cook’s D value above the threshold. They form 3 distinct clusters of observations with high Cook’s D values.

As a reminder, the original dataset was arranged into order by week (from March 2020 to November 2023), so the observation numbers are ordered by week (though each week has multiple observations, representing the various demographic subgroups). As will be shown later in this data dive, these 3 clusters of observations with high Cook’s D values correspond to the three largest peaks in COVID-19 death rates which occurred during spring 2020, winter 2020-2021, and winter 2021-2022.

Now, let’s view the plot with labels for the observations above the threshold. The labels represent the observation’s row number (index) within the dataset. Due to the high number of observations above the threshold, it will be challenging to read the labels due to overlapping text.

# plot Cook's Distance for observations (with labels for obs above threshold)
gg_cooksd(model3, label = TRUE, threshold = "matlab")

To get a better sense of how these high Cook’s D observations compare to the rest of the dataset, let’s identify the 3 observations with the highest Cook’s D value from each of the 3 clusters (for a total of 9 observations), and highlight these high influence points in a scatter plot of case rate vs. death rate.

# examples of observations with high Cook's D
high_influence = c(96, 149, 150, 643, 646, 703, 1847, 1852, 1878)

# scatter plot of case rate vs death rate showing high influence points in red
ggplot() +
  geom_point(data = weekly_data,
             mapping = aes(x = case_crude_rate_suppressed_per_100k,
                           y = death_crude_rate_suppressed_per_100k),
             color = "lightgray") +
  geom_point(data = slice(weekly_data, high_influence),
             mapping = aes(x = case_crude_rate_suppressed_per_100k,
                           y = death_crude_rate_suppressed_per_100k),
             color = "red") +
  annotate("text", x = 1000, y = 175, label = "Examples of High Cook's D", color = 'red') + 
  labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "Weekly Case Rate (per 100K population)",
       y = "Weekly Death Rate (per 100K population)") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

As we can see for these 9 example points highlighted in red, the observations with a high Cook’s D value tend to be found further out from the other points (shown in gray). These high influence points have either a very high case rate or very high death rate, which explains their higher influence on the linear model.

Even though we have only highlighted 9 examples, it might be that the full set of high influence points roughly balance each other, as the entire dataset has a roughly symmetrical shape that fans outwards from the origin. (This is merely an educated guess and would need to be verified.)

As mentioned previously, because the observations are ordered by date, we can verify that the 3 clusters of observations with high Cook’s D values correspond to the 3 major peaks in the COVID-19 death rates during the pandemic.

Let’s view the row for one of the high Cook’s D observations from the first cluster to see what date the observation is from.

# view row for observation 96 (high Cook's D)
weekly_data[96, ] |>
  select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 5
##   end_of_week age_group sex   race_ethnicity death_rate
##   <date>      <chr>     <chr> <chr>               <dbl>
## 1 2020-04-04  75+ Years Male  Black, NH            150.

As we can see, this example observation from the first cluster is from spring 2020. Other high Cook’s D observations in this first cluster occur close to this observation, which means they are from a similar date period.

Let’s view the row for one of the high Cook’s D observations from the second cluster to see what date the observation is from.

# view row for observation 643 (high Cook's D)
weekly_data[643, ] |>
  select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 5
##   end_of_week age_group sex   race_ethnicity death_rate
##   <date>      <chr>     <chr> <chr>               <dbl>
## 1 2020-11-07  75+ Years Male  AI/AN, NH            191.

As we can see, this example observation from the second cluster is from late 2020. Other high Cook’s D observations in this second cluster occur shortly before or shortly after this observation, again indicating a similar date period.

Finally, let’s view the row for one of the high Cook’s D observations from the third cluster to see what date the observation is from.

# view row for observation 1847 (high Cook's D)
weekly_data[1847, ] |>
  select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 5
##   end_of_week age_group     sex    race_ethnicity death_rate
##   <date>      <chr>         <chr>  <chr>               <dbl>
## 1 2022-01-01  30 - 39 Years Female Black, NH            3.09

As we can see, this example observation from the third cluster is from the start of 2022. Other high Cook’s D observations in this third cluster occur shortly before or shortly after this observation, again representing a similar date period.

For reference, let’s view a plot of the overall weekly death rates over time.

# line plot of weekly death rate over time
covid |>
  filter(age_group == "Overall" & race_ethnicity_combined == "Overall" & sex == "Overall") |>
  ggplot() +
  geom_line(mapping = aes(x = end_of_week,
                          y = death_crude_rate_suppressed_per_100k),
            na.rm = TRUE, color = "red") +
  scale_x_date(minor_breaks = as.Date(c("2020-04-01", "2020-07-01", "2020-10-01", "2021-04-01", "2021-07-01", "2021-10-01", "2022-04-01", "2022-07-01", "2022-10-01", "2023-04-01", "2023-07-01", "2023-10-01"))) +
  labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
       subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
       x = "", y = "Weekly Death Rate (per 100K population)") +
  theme(plot.subtitle = element_text(colour = "darkgray"))

As seen above, the three largest peaks in the weekly death rates occur in spring 2020 (peaking in April 2020), winter 2020-2021 (peaking in November 2020), and winter 2021-2022 (peaking in January 2022), which correspond precisely with the dates of the 3 clusters of high Cook’s D observations.

Because the overall case rates and death rates were much higher during these peaks, it would be much more likely to have observations for a specific demographic subgroup that are very different from the rest of the data (e.g., having a case rate or death rate that is either much higher or much lower) and thus could have greater influence on the model (i.e., higher Cook’s D).


Conclusion

By incorporating additional explanatory variables into the linear regression model, we were able to show the effect of each variable is statistically significant (as evidenced by very low p-values) and produces a model with greater explanatory power (as evidenced by a higher \(R^2\) value).

However, we also found diagnostic evidence to indicate that the assumptions of the linear regression model are not completely met, indicating that the pattern in the data is not completely linear (which is not surprising for observations of an infectious disease, which tend to follow exponential growth patterns).


This dataset will be explored further in subsequent data dives.