This data dive will focus on developing a complex linear regression model and performing regression diagnostics using a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.
To get started, we’ll load several R packages to assist with our analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(broom)
library(lindia)
# default theme for plots
theme_set(theme_minimal())
Next, let’s read in the dataset from CSV.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s convert end_of_week from a character format to a
date format.
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
Let’s add a new column to indicate the order of the age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.
# Add new column to order age groups correctly
covid <- covid |>
mutate(age_order = case_when(
age_group == "Overall" ~ 0L,
age_group == "0 - 4 Years" ~ 1L,
age_group == "5 - 11 Years" ~ 2L,
age_group == "12 - 15 Years" ~ 3L,
age_group == "16 - 17 Years" ~ 4L,
age_group == "18 - 29 Years" ~ 5L,
age_group == "30 - 39 Years" ~ 6L,
age_group == "40 - 49 Years" ~ 7L,
age_group == "50 - 64 Years" ~ 8L,
age_group == "65 - 74 Years" ~ 9L,
age_group == "75+ Years" ~ 10L),
.after = age_group) |>
arrange(end_of_week, age_order, sex, race_ethnicity_combined)
We’ll be building upon a linear regression model developed in the previous data dive.
The initial linear model examined the effect of a single explanatory variable on a response variable.
Weekly case rate was selected as the explanatory variable (independent variable), and weekly death rate was selected as the response variable (dependent variable).
For this model, we’ll create a dataframe that filters out the rows
that provide “Overall” summary counts for the demographic subgroups and
also filters out rows with missing values (NA) for case
rate or death rate.
# Filter data to exclude weekly summary rows ("Overall") and to exclude rows with missing data (NA) for case rate or death rate
weekly_data <- covid |>
filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
filter(!is.na(case_crude_rate_suppressed_per_100k)) |>
filter(!is.na(death_crude_rate_suppressed_per_100k))
# Get count of number of rows in filtered dataset
nrow(weekly_data)
## [1] 2589
Let’s plot the weekly case rates vs. the weekly death rates and fit a linear model to the data.
# scatter plot of case rates vs death rates with linear model fitted
weekly_data |>
ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
y = death_crude_rate_suppressed_per_100k)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "Weekly Case Rate (per 100K population)",
y = "Weekly Death Rate (per 100K population)") +
theme(plot.subtitle = element_text(colour = "darkgray"))
## `geom_smooth()` using formula = 'y ~ x'
Here’s a summary of the linear model (Model 1) shown in the plot.
# linear model with 1 explanatory variable: lm(y ~ x, data)
model1 <- lm(death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k, weekly_data)
summary(model1)
##
## Call:
## lm(formula = death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k,
## data = weekly_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.682 -8.142 -5.679 2.284 172.066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.040634 0.491888 14.31 <2e-16 ***
## case_crude_rate_suppressed_per_100k 0.023962 0.002042 11.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.39 on 2587 degrees of freedom
## Multiple R-squared: 0.05053, Adjusted R-squared: 0.05016
## F-statistic: 137.7 on 1 and 2587 DF, p-value: < 2.2e-16
For Model 1:
Let’s create a more complex linear model (Model 2) by adding sex at birth as an additional variable.
Sex at birth has two values (female or male), so it will act as a binary term in the model. R will automatically use “dummy coding” to recode the variable values as 0 or 1.
First, let’s confirm whether there is a linear relationship between case rate and death rate for each sex.
# Faceted plots of case rate vs. death rate by sex at birth
weekly_data |>
ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
y = death_crude_rate_suppressed_per_100k)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Faceted by Sex At Birth - Region 5 (IL, IN, MI, MN, OH, WI)",
x = "Weekly Case Rate (per 100K population)",
y = "Weekly Death Rate (per 100K population)") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
facet_wrap(~sex)
## `geom_smooth()` using formula = 'y ~ x'
The plots show roughly similar patterns for each sex, though there are some differences. The graph for females has some points with much higher case rates, while the graph for males some points with much higher death rates. The regression line for male has a steeper slope (i.e., a larger increase in death rate for each unit increase in case rate).
Here’s a summary of Model 2, which adds sex at birth as a binary term to the linear model.
# linear model with 2 explanatory variables
model2 <- lm(death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k + sex, weekly_data)
summary(model2)
##
## Call:
## lm(formula = death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k +
## sex, data = weekly_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.071 -8.620 -4.984 2.220 170.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.974853 0.619165 8.035 1.41e-15 ***
## case_crude_rate_suppressed_per_100k 0.024301 0.002032 11.959 < 2e-16 ***
## sexMale 3.714026 0.682390 5.443 5.74e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.3 on 2586 degrees of freedom
## Multiple R-squared: 0.06128, Adjusted R-squared: 0.06055
## F-statistic: 84.41 on 2 and 2586 DF, p-value: < 2.2e-16
For Model 2:
Let’s create a third iteration of the linear model (Model 3) by adding age range as an additional variable.
The age_group variable in the original
covid dataset has 10 groups (plus a value for “Overall” to
summarize data across all age groups). Let’s create a new column named
age_range to reduce the number of age groups by combining
them into reasonable ranges:
Under_18 = 0-4 Years, 5-11 Years, 12-15 Years, 16-17
Years18_49_Years = 18-29 Years, 30-39 Years, 40-49
Years50_74_Years = 50-64 Years, 65-74 Years75_Older = 75+ Years# add new column for age ranges that combine age groups
weekly_data <- weekly_data |>
mutate(age_range = case_when(
age_group == "0 - 4 Years" ~ "Under_18",
age_group == "5 - 11 Years" ~ "Under_18",
age_group == "12 - 15 Years" ~ "Under_18",
age_group == "16 - 17 Years" ~ "Under_18",
age_group == "18 - 29 Years" ~ "18_49_Years",
age_group == "30 - 39 Years" ~ "18_49_Years",
age_group == "40 - 49 Years" ~ "18_49_Years",
age_group == "50 - 64 Years" ~ "50_74_Years",
age_group == "65 - 74 Years" ~ "50_74_Years",
age_group == "75+ Years" ~ "75_Older"),
.after = age_group)
# show number of rows for each age range
weekly_data |>
group_by(age_range) |>
summarise(n_rows = n())
## # A tibble: 3 × 2
## age_range n_rows
## <chr> <int>
## 1 18_49_Years 375
## 2 50_74_Years 1362
## 3 75_Older 852
Notice that there are no rows for the age range
representing Under_18.
Let’s determine why there are no rows for the Under_18
age range in the weekly_data dataset.
First, let’s return to the original covid dataset to get
a summary of the age groups under 18 years old.
# summary of rows in original dataset for age groups under 18 years old
covid |>
filter(age_order > 0 & age_order < 5) |>
group_by(age_group) |>
summarise(
n_rows = n(),
cases = sum(case_count_suppressed, na.rm = TRUE),
deaths = sum(death_count_suppressed, na.rm = TRUE),
case_rate = mean(case_crude_rate_suppressed_per_100k, na.rm = TRUE),
death_rate = mean(death_crude_rate_suppressed_per_100k, na.rm = TRUE)
)
## # A tibble: 4 × 6
## age_group n_rows cases deaths case_rate death_rate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 - 4 Years 3406 1999595 7 92.9 0.23
## 2 12 - 15 Years 3376 2364222 0 123. NaN
## 3 16 - 17 Years 3357 1425108 0 148. NaN
## 4 5 - 11 Years 3394 3402502 0 107. NaN
There were only 7 deaths from COVID-19 recorded in the original
dataset for individuals under 18 years of age, and all of those deaths
occurred in the 0 - 4 Years age group.
Let’s get all the rows in the original dataset for the
0 - 4 Years age group which have a death count greater than
zero.
# show all rows in original dataset for 0-4 Years age group with death count > 0
covid |>
filter(age_group == "0 - 4 Years" & death_count_suppressed > 0) |>
select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, cases = case_count_suppressed, case_rate = case_crude_rate_suppressed_per_100k, deaths = death_count_suppressed, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 8
## end_of_week age_group sex race_ethnicity cases case_rate deaths death_rate
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2022-01-08 0 - 4 Years Over… Overall 25739 829. 7 0.23
It turns out that those 7 deaths all occurred in the same week (week
ending 01/08/2022) and were recorded in a single row which
happens to be a weekly summary row (i.e., a row in
which age group, sex at birth, or race/ethnicity group =
Overall).
In the original dataset, the weekly summary rows include
all case counts and death counts, including any cases
or deaths that may have been suppressed in the other rows for that week
that record data for specific demographic subgroups. As a reminder, if a
case count or death count was 5 or fewer, it was suppressed by the CDC
(and appears as NA in the dataset).
To verify this, let’s examine the non-summary rows for the week
ending 01/08/2022 for the 0 - 4 Years age group to see the
death counts and death rates for the demographic subgroups.
# show all rows in original dataset for week ending 01/08/20222 for 0-4 Years age group
# but exclude the weekly summary rows (Overall)
covid |>
filter(end_of_week == "2022-01-08" & age_group == "0 - 4 Years" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, cases = case_count_suppressed, case_rate = case_crude_rate_suppressed_per_100k, deaths = death_count_suppressed, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 10 × 8
## end_of_week age_group sex race_ethnicity cases case_rate deaths death_rate
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2022-01-08 0 - 4 Yea… Fema… AI/AN, NH 53 706. NA NA
## 2 2022-01-08 0 - 4 Yea… Fema… Asian/PI, NH 398 685. NA NA
## 3 2022-01-08 0 - 4 Yea… Fema… Black, NH 1378 645. NA NA
## 4 2022-01-08 0 - 4 Yea… Fema… Hispanic 1411 735. NA NA
## 5 2022-01-08 0 - 4 Yea… Fema… White, NH 5060 519. NA NA
## 6 2022-01-08 0 - 4 Yea… Male AI/AN, NH 69 871. NA NA
## 7 2022-01-08 0 - 4 Yea… Male Asian/PI, NH 480 801. NA NA
## 8 2022-01-08 0 - 4 Yea… Male Black, NH 1418 643. NA NA
## 9 2022-01-08 0 - 4 Yea… Male Hispanic 1576 792. NA NA
## 10 2022-01-08 0 - 4 Yea… Male White, NH 5644 550. NA NA
As we can see, each row for the demographic subgroups lists
NA for the death count and death rate, which means the 7
deaths listed in the weekly summary row were suppressed
(NA) in these non-summary rows, so we don’t know which of
these demographic subgroups had deaths during this week.
Thus, the death rate was NA for all the
non-summary rows across the entire age range under 18
years old.
When creating the weekly_data dataset, we filtered out
all the summary rows, as well as any non-summary rows with missing
values (NA) for case rate or death rate, which means this
dataset has no rows for age groups under 18 years old.
Mystery solved!
Let’s continue with our work on Model 3. We’ll confirm whether there is a linear relationship between case rate and death rate for each age range.
# Faceted plots of case rate vs. death rate by sex at birth
weekly_data |>
ggplot(mapping = aes(x = case_crude_rate_suppressed_per_100k,
y = death_crude_rate_suppressed_per_100k)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Faceted by Age Range - Region 5 (IL, IN, MI, MN, OH, WI)",
x = "Weekly Case Rate (per 100K population)",
y = "Weekly Death Rate (per 100K population)") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
facet_wrap(~age_range)
## `geom_smooth()` using formula = 'y ~ x'
The faceted plots show a linear model applied to each age range. The
patterns in the graphs vary, and the slopes of their regression lines
vary. The regression line for the 18_49_Years age range is
nearly horizontal, which could indicate that no linear relationship
exists in that data subset.
Let’s calculate the Pearson’s correlation cofficient for the
18_49_Years age range.
# calculate correlation coefficient for 18-49 years age range
weekly_data_18_49 <- weekly_data |>
filter(age_range == "18_49_Years")
round(cor(weekly_data_18_49$case_crude_rate_suppressed_per_100k, weekly_data_18_49$death_crude_rate_suppressed_per_100k), 2)
## [1] 0.11
The correlation coefficient for the 18_49_Years age
range is very low, but it is non-zero, so we’ll proceed with including
this age range in the model.
Here is a summary of Model 3, which adds age range to the linear model.
# linear model with 3 explanatory variables
model3 <- lm(death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k + sex + age_range, weekly_data)
summary(model3)
##
## Call:
## lm(formula = death_crude_rate_suppressed_per_100k ~ case_crude_rate_suppressed_per_100k +
## sex + age_range, data = weekly_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.170 -6.672 -1.307 3.995 152.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.85224 1.01586 -13.636 <2e-16 ***
## case_crude_rate_suppressed_per_100k 0.03874 0.00181 21.410 <2e-16 ***
## sexMale 5.22906 0.56994 9.175 <2e-16 ***
## age_range50_74_Years 11.62884 0.89314 13.020 <2e-16 ***
## age_range75_Older 28.53174 0.94822 30.090 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.39 on 2584 degrees of freedom
## Multiple R-squared: 0.3512, Adjusted R-squared: 0.3501
## F-statistic: 349.6 on 4 and 2584 DF, p-value: < 2.2e-16
Let’s plot the confidence intervals for the model coefficients.
# plot model coefficients with 95% CI
tidy(model3, conf.int = TRUE) |>
ggplot(mapping = aes(x = estimate, y = term)) +
geom_point() +
geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
geom_errorbarh(mapping = aes(xmin = conf.low,
xmax = conf.high,
height = 0.5)) +
labs(title = "Model Coefficients 95% C.I.")
For Model 3:
Let’s examine the plot of the residual values vs. the fitted values (\(\hat{y}\) values).
# plot of residuals vs. fitted values (y hat values)
gg_resfitted(model3) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
One of the assumptions for linear regression is that the errors have constant variance across all predictions. If this is true, the plot of residuals vs. fitted values should ideally show that the variation in residuals is relatively uniform across the range of fitted values, with the blue best-fit line being close to a straight line.
In the plot above, we can see a “fanning” effect as the variance in errors increases for higher fitted values. The blue best-fit line is relatively close to a straight line but does begin to slope upwards for higher fitted values. Both of these are evidence that the errors do not have constant variance across predictions.
For higher fitted values, the larger positive residuals indicate that the model is underestimating the true value of the weekly death rates. This is not surprising, as infectious diseases, such as COVID-19, tend to follow an exponential growth pattern, rather than linear.
Let’s examine the plots of the residual values vs. the \(x\) values for each variable in Model 3.
As with the previous diagnostic plot, the plot of residuals vs. \(x\) values should ideally show that the variation in residuals is relatively uniform across the range of \(x\) values for each \(x\) variable in the model.
# generate plots for residuals vs. x values (each x variable)
plots <- gg_resX(model3, plot.all = FALSE)
# show residuals for case rate
plots$case_crude_rate_suppressed_per_100k
In the plot of residual values for case rate, we can see a “fanning” effect as the variance in errors increases for case rates between approximately 100 to 750 (but above 750, the variance in errors becomes tighter and more uniform). This indicates the errors do not have constant variance across the range of case rates.
# show residals for sex at birth
plots$sex
For sex at birth, the plot of residuals shows that the interquartile range of the errors appears to be similar for females and males. However, males have more outlier observations with higher positive errors. For example, the highest positive residual for females is approximately 80, whereas the highest positive residual for males is almost double at approximately 155.
# show residuals for age range
plots$age_range
For age range, the interquartile ranges of errors are roughly similar for the 18-49 and 50-74 age ranges, though the 50-74 age range has a number of outlier observations with higher positive errors.
The interquartile range of errors is much wider for the 75+ age range, which also has outlier observations with even higher positive errors. Again, higher positive residuals indicate the model is underestimating the true value of the weekly death rate for these observations.
Let’s plot a histogram of the residuals for Model 3.
# plot histogram of residuals
gg_reshist(model3, bins = 35)
Another assumption of linear regression is that errors are normally distributed over the prediction line. If this is true, the histogram of residuals should ideally be normal.
In the plot above, the histogram is roughly normal, though the distribution appears to be slightly right-skewed.
Let’s examine the quantile-quantile (QQ) plot for Model 3.
# plot normal QQ plot (quantile-quantile)
gg_qqplot(model3)
The QQ plot is another way to determine if the residuals are normally distributed. The residual values are sorted smallest to largest and plotted, converted to quantiles, and plotted against a theoretical quantile based on a normal distribution.
If the model residuals have a perfectly normal distribution, they will follow a straight line, indicated by the red line in the QQ plot.
In the plot above, the residuals follow the theoretical line fairly closely, except for the right side (quantile values higher than 1.5). This is consistent with the slight right skew seen in the histogram of residuals.
Let’s plot Cook’s Distance (D) for each observation used in Model 3.
Cook’s Distance measures how much the model’s fitted values would change if an observation were removed from the dataset. Thus, it indicates how much influence a given observation has on the overall model.
Cook’s D can be calculated for each observation in the dataset. If the Cook’s D value for a given observation is 3 times higher than the mean Cook’s D for the dataset, the observation might be an outlier.
First, we’ll show the plot without labels for the observations that are above the threshold (i.e., 3 times the mean Cook’s D), which is represented by the red dashed line in the plot.
# plot Cook's Distance for observations (without labels for obs above threshold)
gg_cooksd(model3, label = FALSE, threshold = "matlab")
There are numerous observations with a Cook’s D value above the threshold. They form 3 distinct clusters of observations with high Cook’s D values.
As a reminder, the original dataset was arranged into order by week (from March 2020 to November 2023), so the observation numbers are ordered by week (though each week has multiple observations, representing the various demographic subgroups). As will be shown later in this data dive, these 3 clusters of observations with high Cook’s D values correspond to the three largest peaks in COVID-19 death rates which occurred during spring 2020, winter 2020-2021, and winter 2021-2022.
Now, let’s view the plot with labels for the observations above the threshold. The labels represent the observation’s row number (index) within the dataset. Due to the high number of observations above the threshold, it will be challenging to read the labels due to overlapping text.
# plot Cook's Distance for observations (with labels for obs above threshold)
gg_cooksd(model3, label = TRUE, threshold = "matlab")
To get a better sense of how these high Cook’s D observations compare to the rest of the dataset, let’s identify the 3 observations with the highest Cook’s D value from each of the 3 clusters (for a total of 9 observations), and highlight these high influence points in a scatter plot of case rate vs. death rate.
# examples of observations with high Cook's D
high_influence = c(96, 149, 150, 643, 646, 703, 1847, 1852, 1878)
# scatter plot of case rate vs death rate showing high influence points in red
ggplot() +
geom_point(data = weekly_data,
mapping = aes(x = case_crude_rate_suppressed_per_100k,
y = death_crude_rate_suppressed_per_100k),
color = "lightgray") +
geom_point(data = slice(weekly_data, high_influence),
mapping = aes(x = case_crude_rate_suppressed_per_100k,
y = death_crude_rate_suppressed_per_100k),
color = "red") +
annotate("text", x = 1000, y = 175, label = "Examples of High Cook's D", color = 'red') +
labs(title = "COVID-19 Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "Weekly Case Rate (per 100K population)",
y = "Weekly Death Rate (per 100K population)") +
theme(plot.subtitle = element_text(colour = "darkgray"))
As we can see for these 9 example points highlighted in red, the observations with a high Cook’s D value tend to be found further out from the other points (shown in gray). These high influence points have either a very high case rate or very high death rate, which explains their higher influence on the linear model.
Even though we have only highlighted 9 examples, it might be that the full set of high influence points roughly balance each other, as the entire dataset has a roughly symmetrical shape that fans outwards from the origin. (This is merely an educated guess and would need to be verified.)
As mentioned previously, because the observations are ordered by date, we can verify that the 3 clusters of observations with high Cook’s D values correspond to the 3 major peaks in the COVID-19 death rates during the pandemic.
Let’s view the row for one of the high Cook’s D observations from the first cluster to see what date the observation is from.
# view row for observation 96 (high Cook's D)
weekly_data[96, ] |>
select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 5
## end_of_week age_group sex race_ethnicity death_rate
## <date> <chr> <chr> <chr> <dbl>
## 1 2020-04-04 75+ Years Male Black, NH 150.
As we can see, this example observation from the first cluster is from spring 2020. Other high Cook’s D observations in this first cluster occur close to this observation, which means they are from a similar date period.
Let’s view the row for one of the high Cook’s D observations from the second cluster to see what date the observation is from.
# view row for observation 643 (high Cook's D)
weekly_data[643, ] |>
select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 5
## end_of_week age_group sex race_ethnicity death_rate
## <date> <chr> <chr> <chr> <dbl>
## 1 2020-11-07 75+ Years Male AI/AN, NH 191.
As we can see, this example observation from the second cluster is from late 2020. Other high Cook’s D observations in this second cluster occur shortly before or shortly after this observation, again indicating a similar date period.
Finally, let’s view the row for one of the high Cook’s D observations from the third cluster to see what date the observation is from.
# view row for observation 1847 (high Cook's D)
weekly_data[1847, ] |>
select(end_of_week, age_group, sex, race_ethnicity = race_ethnicity_combined, death_rate = death_crude_rate_suppressed_per_100k)
## # A tibble: 1 × 5
## end_of_week age_group sex race_ethnicity death_rate
## <date> <chr> <chr> <chr> <dbl>
## 1 2022-01-01 30 - 39 Years Female Black, NH 3.09
As we can see, this example observation from the third cluster is from the start of 2022. Other high Cook’s D observations in this third cluster occur shortly before or shortly after this observation, again representing a similar date period.
For reference, let’s view a plot of the overall weekly death rates over time.
# line plot of weekly death rate over time
covid |>
filter(age_group == "Overall" & race_ethnicity_combined == "Overall" & sex == "Overall") |>
ggplot() +
geom_line(mapping = aes(x = end_of_week,
y = death_crude_rate_suppressed_per_100k),
na.rm = TRUE, color = "red") +
scale_x_date(minor_breaks = as.Date(c("2020-04-01", "2020-07-01", "2020-10-01", "2021-04-01", "2021-07-01", "2021-10-01", "2022-04-01", "2022-07-01", "2022-10-01", "2023-04-01", "2023-07-01", "2023-10-01"))) +
labs(title = "COVID-19 Weekly Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "", y = "Weekly Death Rate (per 100K population)") +
theme(plot.subtitle = element_text(colour = "darkgray"))
As seen above, the three largest peaks in the weekly death rates occur in spring 2020 (peaking in April 2020), winter 2020-2021 (peaking in November 2020), and winter 2021-2022 (peaking in January 2022), which correspond precisely with the dates of the 3 clusters of high Cook’s D observations.
Because the overall case rates and death rates were much higher during these peaks, it would be much more likely to have observations for a specific demographic subgroup that are very different from the rest of the data (e.g., having a case rate or death rate that is either much higher or much lower) and thus could have greater influence on the model (i.e., higher Cook’s D).
By incorporating additional explanatory variables into the linear regression model, we were able to show the effect of each variable is statistically significant (as evidenced by very low p-values) and produces a model with greater explanatory power (as evidenced by a higher \(R^2\) value).
However, we also found diagnostic evidence to indicate that the assumptions of the linear regression model are not completely met, indicating that the pattern in the data is not completely linear (which is not surprising for observations of an infectious disease, which tend to follow exponential growth patterns).
This dataset will be explored further in subsequent data dives.