This data dive focuses on further practice with developing and diagnosing a linear model (LM) or generalized linear model (GLM) for a CDC dataset that contains weekly counts and rates of COVID-19 cases and deaths reported in Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, and Wisconsin) of the United States from March 7, 2020 through November 18, 2023.
To get started, we’ll load several R packages to assist with our analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
theme_set(theme_minimal())
Next, let’s read in the dataset from CSV.
covid <- read_delim("./COVID_weekly_cases_deaths_region5.csv", delim = ",")
## Rows: 37867 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): end_of_week, jurisdiction, age_group, sex, race_ethnicity_combined
## dbl (4): case_count_suppressed, death_count_suppressed, case_crude_rate_supp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s convert end_of_week from a character format to a
date format.
# convert date from char into date format
covid$end_of_week <- as.Date(covid$end_of_week, format="%m/%d/%y")
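Since lubridate is loaded as part of the tidyverse, an equivalent alternative (shown only as a sketch, not run here) would be its mdy() parser, which handles month/day/year strings with two-digit years:
# alternative: parse month/day/year strings with lubridate
covid$end_of_week <- mdy(covid$end_of_week)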
Let’s add a new column to indicate the order of the age groups, and then arrange all the rows in nested order by week, age, sex, and race/ethnicity.
# add new column to order age groups correctly
covid <- covid |>
mutate(age_order = case_when(
age_group == "Overall" ~ 0L,
age_group == "0 - 4 Years" ~ 1L,
age_group == "5 - 11 Years" ~ 2L,
age_group == "12 - 15 Years" ~ 3L,
age_group == "16 - 17 Years" ~ 4L,
age_group == "18 - 29 Years" ~ 5L,
age_group == "30 - 39 Years" ~ 6L,
age_group == "40 - 49 Years" ~ 7L,
age_group == "50 - 64 Years" ~ 8L,
age_group == "65 - 74 Years" ~ 9L,
age_group == "75+ Years" ~ 10L),
.after = age_group) |>
arrange(end_of_week, age_order, sex, race_ethnicity_combined)
Let’s create a linear regression model for our COVID-19 dataset.
First, let’s create a dataframe that filters out the rows that
provide Overall summary counts for the demographic
subgroups and also filters out rows with missing values
(NA) for case rate or death rate.
# Filter data to exclude weekly summary rows ("Overall") and to exclude rows with missing data (NA) for case rate or death rate
weekly_data <- covid |>
filter(age_group != "Overall" & sex != "Overall" & race_ethnicity_combined != "Overall") |>
filter(!is.na(case_crude_rate_suppressed_per_100k) & !is.na(death_crude_rate_suppressed_per_100k))
We’ll use weekly death rate as the response variable for our model since it is the variable of most importance in this dataset.
We’ll use weekly case rate as the initial explanatory variable for our model since it seems logical that this might affect the weekly death rate.
Weekly case rate is a variable bound by zero: i.e., negative case rates are not possible. This likely results in a non-normal distribution of values that is right-skewed.
For zero-bound values (such as price, counts, etc.) a log transform of the values can often be used to produce a distribution that is more normal and thus better suited to use in a generalized linear model.
For comparison, here is a histogram of the case rate compared to a histogram of \(\log(\text{case rate})\).
# histogram of case rate
p1 <- ggplot(data = weekly_data) +
geom_histogram(aes(x = case_crude_rate_suppressed_per_100k), binwidth = 100, color='white') +
labs(x = "Case Rate")
# histogram of log(case rate)
p2 <- ggplot(data = weekly_data) +
geom_histogram(aes(x = log(case_crude_rate_suppressed_per_100k)), binwidth = 0.5, color='white') +
labs(x = "log(Case Rate)")
# display side-by-side
p1 + p2
Based on the histograms, it appears that using a log transform on the case rate produces a more normal distribution of values for this explanatory variable.
Let’s add a new column to transform case rate using a log function.
# add new column for log(case rate)
weekly_data <- weekly_data |>
mutate(log_case_rate = log(case_crude_rate_suppressed_per_100k))
Weekly death rate is also a variable bound by zero: i.e., negative death rates are not possible. Again, this likely results in a non-normal distribution of values that is right-skewed, which could be transformed with a log function to produce a more normal distribution, suitable for use in a generalized linear model.
For comparison, here is a histogram of the death rate compared to a histogram of \(\log(\text{death rate})\).
# histogram of death rate
p3 <- ggplot(data = weekly_data) +
geom_histogram(aes(x = death_crude_rate_suppressed_per_100k), binwidth = 10, color='white') +
labs(x = "Death Rate")
# histogram of log(death rate)
p4 <- ggplot(data = weekly_data) +
geom_histogram(aes(x = log(death_crude_rate_suppressed_per_100k)), binwidth = 0.5, color='white') +
labs(x = "log(Death Rate)")
# display side-by-side
p3 + p4
Based on the histograms, it appears that using a log transform on the death rate will produce a more normal distribution of values for this response variable.
As further confirmation, we can also use the Box-Cox Transformation to determine the optimal \(\lambda\) (lambda) value such that the conditional distribution of our response variable (given our explanatory variable) is most normal.
# linear model
model <- lm(death_crude_rate_suppressed_per_100k ~ log_case_rate, weekly_data)
# Box-Cox Power Transformation
pT <- powerTransform(model, family = "bcnPower")
pT$lambda
## [1] 0.03852863
When \(\lambda \approx 0\), the Box-Cox transformation function approaches \(\log(y)\), so a log-transform of \(y\) becomes the best transformation.
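For reference, the one-parameter Box-Cox transformation of a response \(y > 0\) is defined as
\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[4pt]
\log(y), & \lambda = 0
\end{cases}
\]
so as \(\lambda \to 0\) the transformation reduces to \(\log(y)\).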
Since the calculated \(\lambda \approx 0\), let’s add a new column to transform death rate using a log function.
# add new column for log(death rate)
weekly_data <- weekly_data |>
mutate(log_death_rate = log(death_crude_rate_suppressed_per_100k))
Let’s plot \(\log(\text{case rate})\) vs. \(\log(\text{death rate})\) with a linear model fit to the data.
# scatter plot of log(case rate) vs log(death rate) with linear model
weekly_data |>
ggplot(mapping = aes(x = log_case_rate,
y = log_death_rate)) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE, color = 'orange') +
labs(title = "COVID-19 Weekly Case Rate vs. Death Rate (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (Illinois, Indiana, Michigan, Minnesota, Ohio, Wisconsin)",
x = "log(Case Rate)",
y = "log(Death Rate)") +
theme(plot.subtitle = element_text(colour = "darkgray"))
## `geom_smooth()` using formula = 'y ~ x'
Let’s calculate the correlation between the variables.
round(cor(weekly_data$log_case_rate, weekly_data$log_death_rate), 2)
## [1] 0.31
Here’s the summary for the linear model shown in the plot.
model1 <- lm(log_death_rate ~ log_case_rate, weekly_data)
summary(model1)
##
## Call:
## lm(formula = log_death_rate ~ log_case_rate, data = weekly_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9861 -1.0010 0.0788 1.1465 3.1419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9919 0.1507 -6.584 5.54e-11 ***
## log_case_rate 0.5088 0.0310 16.413 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.428 on 2587 degrees of freedom
## Multiple R-squared: 0.09431, Adjusted R-squared: 0.09396
## F-statistic: 269.4 on 1 and 2587 DF, p-value: < 2.2e-16
The beta coefficient for \(\log(\text{case rate})\) indicates that for every increase in \(\log(\text{case rate})\) by 1 unit, the \(\log(\text{death rate})\) will increase, on average, by 0.51 units.
The relationship between \(\log(\text{case rate})\) and \(\log(\text{death rate})\) is significant with a high F statistic and very low p-value.
However, the \(R^2\) value indicates this model only explains 9% of the variance in the response variable.
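Since the broom package is already loaded, one convenient way (a sketch) to collect these model-level statistics, such as \(R^2\), the F statistic, and the p-value, into a tidy one-row data frame is glance():
# tidy one-row summary of model-level statistics for Model 1
glance(model1)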
Let’s explore adding sex at birth as another explanatory variable in the model.
First, let’s confirm whether there is a linear relationship between \(\log(\text{case rate})\) and \(\log(\text{death rate})\) for each sex.
# Faceted plots of log(case rate) vs. log(death rate) by sex at birth
weekly_data |>
ggplot(mapping = aes(x = log_case_rate,
y = log_death_rate)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(title = "COVID-19 Weekly Case Rate vs. Death Rate (Mar 2020 - Nov 2023)",
subtitle = "Faceted by Sex at Birth - Region 5 (IL, IN, MI, MN, OH, WI)",
x = "log(Case Rate)",
y = "log(Death Rate)") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
facet_wrap(~sex)
## `geom_smooth()` using formula = 'y ~ x'
Let’s compare the correlation between \(\log(\text{case rate})\) and \(\log(\text{death rate})\) by sex at birth.
weekly_data |>
group_by(sex) |>
summarise(cor = round(cor(log_case_rate, log_death_rate), 2))
## # A tibble: 2 × 2
## sex cor
## <chr> <dbl>
## 1 Female 0.26
## 2 Male 0.35
Based on the plots and correlations, it appears there may be a difference in the linear relationship based on sex at birth. Let’s consider adding this to the model as another explanatory variable.
Next, let’s explore adding age group as another explanatory variable.
Let’s confirm whether there is a linear relationship between \(\log(\text{case rate})\) and \(\log(\text{death rate})\) for each age range.
# Faceted plots of log(case rate) vs. log(death rate) by age
weekly_data |>
ggplot(mapping = aes(x = log_case_rate,
y = log_death_rate)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(title = "COVID-19 Weekly Case Rate vs. Death Rate (Mar 2020 - Nov 2023)",
subtitle = "Faceted by Age Range - Region 5 (IL, IN, MI, MN, OH, WI)",
x = "log(Case Rate)",
y = "log(Death Rate)") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
facet_wrap(~age_group)
## `geom_smooth()` using formula = 'y ~ x'
Note: There is no plot for under 18 years of age because all the
original rows for this age range were filtered out when we removed rows
with NA for case rate or death rate.
Let’s compare the correlation between \(\log(\text{case rate})\) and \(\log(\text{death rate})\) by age group.
weekly_data |>
group_by(age_group) |>
summarise(cor = round(cor(log_case_rate, log_death_rate), 2))
## # A tibble: 6 × 2
## age_group cor
## <chr> <dbl>
## 1 18 - 29 Years -0.33
## 2 30 - 39 Years 0.24
## 3 40 - 49 Years 0.35
## 4 50 - 64 Years 0.62
## 5 65 - 74 Years 0.64
## 6 75+ Years 0.61
Based on the correlations, the age groups of 50+ years seem more similar to each other than to the age groups below 50 years.
Let’s add a new column to the dataset named is_50_plus
to identify whether an age group is 50+ years using TRUE or
FALSE values.
# create coded binary variable for age group
weekly_data <- weekly_data |>
mutate(is_50_plus = case_when(
age_group == "18 - 29 Years" ~ FALSE,
age_group == "30 - 39 Years" ~ FALSE,
age_group == "40 - 49 Years" ~ FALSE,
age_group == "50 - 64 Years" ~ TRUE,
age_group == "65 - 74 Years" ~ TRUE,
age_group == "75+ Years" ~ TRUE),
.after = age_group)
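As an aside, because the age_order column created earlier already encodes this ordering (values 8 through 10 correspond to the 50 - 64, 65 - 74, and 75+ groups), an equivalent and more compact definition would be the following sketch:
# equivalent definition using the existing age_order column (8, 9, 10 are the 50+ groups)
weekly_data <- weekly_data |>
mutate(is_50_plus = age_order >= 8, .after = age_group)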
Let’s create plots for the values of the is_50_plus age
range variable.
# Faceted plots of log(case rate) vs. log(death rate) by age 50+
weekly_data |>
ggplot(mapping = aes(x = log_case_rate,
y = log_death_rate)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(title = "COVID-19 Weekly Case Rate vs. Death Rate (Mar 2020 - Nov 2023)",
subtitle = "Faceted by Age 50+ Years - Region 5 (IL, IN, MI, MN, OH, WI)",
x = "log(Case Rate)",
y = "log(Death Rate)") +
theme(plot.subtitle = element_text(colour = "darkgray")) +
facet_wrap(~is_50_plus)
## `geom_smooth()` using formula = 'y ~ x'
Let’s compare the correlation values for these two subgroups.
weekly_data |>
group_by(is_50_plus) |>
summarise(cor = round(cor(log_case_rate, log_death_rate), 2))
## # A tibble: 2 × 2
## is_50_plus cor
## <lgl> <dbl>
## 1 FALSE 0.17
## 2 TRUE 0.5
Based on the plots and correlations, it appears there may be a
difference in the linear relationship based on the
is_50_plus age range variable. Let’s consider adding this
to the model as another explanatory variable.
Now, let’s compare three nested linear models using our explanatory variables: Model 1 uses log(case rate) as the only explanatory variable, Model 2 adds sex at birth as a second explanatory variable, and Model 3 adds is_50_plus as a third explanatory variable.
model1 <- lm(log_death_rate ~ log_case_rate, weekly_data)
model2 <- lm(log_death_rate ~ log_case_rate + sex, weekly_data)
model3 <- lm(log_death_rate ~ log_case_rate + sex + is_50_plus, weekly_data)
Let’s use ANOVA to test these nested models for equal residual variance (or deviance) as variables are introduced.
anova(model1, model2, model3)
## Analysis of Variance Table
##
## Model 1: log_death_rate ~ log_case_rate
## Model 2: log_death_rate ~ log_case_rate + sex
## Model 3: log_death_rate ~ log_case_rate + sex + is_50_plus
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2587 5278.8
## 2 2586 5224.3 1 54.6 36.575 1.682e-09 ***
## 3 2585 3858.9 1 1365.3 914.612 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As seen in the ANOVA summary, adding sex at birth in Model 2 does reduce the residual sum of squares slightly compared to Model 1.
Model 2 also has a high F statistic and very low p-value, indicating it is an improvement over Model 1.
Adding the age variable for is_50_plus in Model 3
greatly reduces the residual sum of squares. Model 3 has a very high F
statistic and very low p-value, indicating it is an improvement over
Model 2.
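Another common way to compare the candidate models, offered here as an optional check, is an information criterion such as AIC, which rewards fit while penalizing additional parameters (lower values are better):
# compare the nested models by AIC (lower is better)
AIC(model1, model2, model3)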
Let’s view the summary for Model 3.
summary(model3)
##
## Call:
## lm(formula = log_death_rate ~ log_case_rate + sex + is_50_plus,
## data = weekly_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7629 -0.8062 0.0309 1.0085 2.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.34372 0.16866 -25.754 <2e-16 ***
## log_case_rate 0.77535 0.02794 27.753 <2e-16 ***
## sexMale 0.39928 0.04831 8.264 <2e-16 ***
## is_50_plusTRUE 2.17913 0.07206 30.243 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.222 on 2585 degrees of freedom
## Multiple R-squared: 0.3379, Adjusted R-squared: 0.3372
## F-statistic: 439.8 on 3 and 2585 DF, p-value: < 2.2e-16
For Model 3: holding the other variables constant, a 1-unit increase in \(\log(\text{case rate})\) is associated with an average increase of 0.78 units in \(\log(\text{death rate})\); males have, on average, a \(\log(\text{death rate})\) 0.40 units higher than females; and the 50+ age groups have, on average, a \(\log(\text{death rate})\) 2.18 units higher than the younger age groups. All three coefficients are statistically significant with very low p-values, and the \(R^2\) value indicates this model explains about 34% of the variance in the response variable, a substantial improvement over Model 1.
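Before moving on to the residual diagnostics, two quick optional checks are sketched below: confidence intervals for the coefficient estimates, and variance inflation factors (VIF) from the car package to confirm the explanatory variables are not strongly collinear (VIF values near 1 indicate little collinearity):
# 95% confidence intervals for the Model 3 coefficients
confint(model3)
# variance inflation factors to check for multicollinearity
vif(model3)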
Let’s examine the plot of the residual values vs. the fitted values (\(\hat{y}\) values).
# plot of residuals vs. fitted values (y hat values)
gg_resfitted(model3) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
One of the assumptions for linear regression is that the errors have constant variance across all predictions. If this is true, the plot of residuals vs. fitted values should ideally show that the variation in residuals is relatively uniform across the range of fitted values, with the blue best-fit line being close to a straight line.
As seen in the plot, the variance in residuals is not completely uniform across the range of fitted values.
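As a complementary numeric check, the car package (already loaded) provides ncvTest(), a score test for non-constant error variance; a small p-value would support what the plot suggests. A sketch:
# score test for non-constant error variance (heteroscedasticity)
ncvTest(model3)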
Let’s examine the plots of the residual values vs. the \(x\) values for each variable in Model 3.
As with the previous diagnostic plot, the plot of residuals vs. \(x\) values should ideally show that the variation in residuals is relatively uniform across the range of \(x\) values for each \(x\) variable in the model.
# generate plots for residuals vs. x values (each x variable)
plots <- gg_resX(model3, plot.all = FALSE)
# show residuals for log(case rate)
plots$log_case_rate
As seen in the plot, the variance in residuals is not completely uniform across the range of \(\log(\text{case rate})\).
# show residuals for sex at birth
plots$sex
As seen in the boxplot, the variance in residuals appears to be similar for both sexes.
# show residuals for is_50_plus
plots$is_50_plus
As seen in the boxplot, the variance in residuals appears to be
roughly similar for the two age ranges defined by the
is_50_plus variable, though the variance appears to be
greater among the subgroup that is 50+ years (TRUE).
Let’s plot a histogram of the residuals for Model 3.
# plot histogram of residuals
gg_reshist(model3, bins = 35)
Another assumption of linear regression is that errors are normally distributed over the prediction line. If this is true, the histogram of residuals should ideally be normal.
In the plot above, the histogram seems roughly normal in distribution, though the left and right sides are not symmetrical.
Let’s examine the quantile-quantile (QQ) plot for Model 3.
# plot normal QQ plot (quantile-quantile)
gg_qqplot(model3)
The QQ plot is another way to determine if the residuals are normally distributed. The residual values are sorted from smallest to largest, converted to quantiles, and plotted against the theoretical quantiles of a normal distribution.
If the model residuals have a perfectly normal distribution, they will follow a straight line, indicated by the red line in the QQ plot.
In the plot above, the residuals follow the theoretical line closely, except for the lower left end (quantile values lower than -2.5) and upper right end (quantile values higher than 1).
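As a numeric complement to the histogram and QQ plot, we could also apply a formal normality test to the residuals, keeping in mind that with roughly 2,600 observations even small departures from normality will yield very small p-values. A sketch:
# Shapiro-Wilk normality test on the Model 3 residuals
shapiro.test(residuals(model3))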
Let’s plot Cook’s Distance (D) for each observation used in Model 3.
Cook’s Distance measures how much the model’s fitted values would change if an observation were removed from the dataset. Thus, it indicates how much influence a given observation has on the overall model.
Cook’s D can be calculated for each observation in the dataset. If the Cook’s D value for a given observation is more than 3 times the mean Cook’s D for the dataset, the observation might be an outlier.
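For reference, Cook’s Distance for observation \(i\) is defined as
\[
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, \hat{\sigma}^2}
\]
where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is excluded from the fit, \(p\) is the number of model coefficients, and \(\hat{\sigma}^2\) is the estimated residual variance.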
First, we’ll show the plot without labels for the observations that are above the threshold (i.e., 3 times the mean Cook’s D), which is represented by the red dashed line in the plot.
# plot Cook's Distance for observations (without labels for obs above threshold)
gg_cooksd(model3, label = FALSE, threshold = "matlab")
There are numerous observations with a Cook’s D value above the threshold.
Now, let’s view the plot with labels for the observations above the threshold. The labels represent the observation’s row number (index) within the dataset. Because so many observations are above the threshold, the overlapping labels will be difficult to read.
# plot Cook's Distance for observations (with labels for obs above threshold)
gg_cooksd(model3, label = TRUE, threshold = "matlab")
We can also identify these points within our dataset. First, we can get a vector of the calculated Cook’s Distance value for each observation used in our model. The order of the values in the vector matches the order of the observations (rows) in the dataset.
Then we can calculate the threshold for a high Cook’s D as 3 times the mean Cook’s D value across the dataset.
# get ordered vector of Cook's D values for observations
cooks_values <- cooks.distance(model3)
# calculate threshold for high Cook's D as 3 times mean Cook's Distance
cooks_threshold <- mean(cooks_values) * 3
cooks_threshold
## [1] 0.0010387
We can simply add the ordered vector of Cook’s D values to our dataset as a new column. Again, the order of the values in the vector corresponds to the observations (row indices) in our dataset.
# add ordered vector of Cook's D values as new column
weekly_data$cooks_d <- cooks_values
Next we can add another new column to indicate which rows have a high Cook’s D value (i.e., above the threshold we calculated).
# add new column to indicate observations with high Cook's D (above threshold)
weekly_data <- weekly_data |>
mutate(high_cooks_d = ifelse(cooks_d >= cooks_threshold, 1, 0))
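A quick count (sketch) shows how many observations are flagged as high influence:
# count observations flagged with a high Cook's D
sum(weekly_data$high_cooks_d)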
Finally, we can create a plot to show all the data points and highlight the points with a high Cook’s D value in red.
low_cooks <- weekly_data |>
filter(high_cooks_d == 0)
high_cooks <- weekly_data |>
filter(high_cooks_d == 1)
# scatter plot of log(case rate) vs. log(death rate) showing high influence points in red
ggplot() +
geom_point(data = low_cooks,
mapping = aes(x = log_case_rate,
y = log_death_rate),
color = "lightgray") +
geom_point(data = high_cooks,
mapping = aes(x = log_case_rate,
y = log_death_rate),
color = "red") +
annotate("text", x = 2, y = 2.5, label = "High Cook's D", color = 'red') +
labs(title = "COVID-19 Weekly Case Rates vs. Death Rates (Mar 2020 - Nov 2023)",
subtitle = "Region 5 (IL, IN, MI, MN, OH, WI)",
x = "log(Case Rate)",
y = "log(Death Rate)") +
theme(plot.subtitle = element_text(colour = "darkgray"))
As shown above, the observations with high Cook’s D values tend to occur along the edges of the scatter plot, which makes sense, since points at the extremes of the data can exert greater leverage on the linear model. However, not every point along the edges has a high Cook’s D value.
Furthermore, there is a cluster of points with high Cook’s D in the middle of the scatter plot.
While most of these high Cook’s D points don’t necessarily appear visually to be outliers, they do have higher influence on the model.
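One optional sensitivity check, sketched below with a hypothetical model3_refit object, is to refit Model 3 without the high-influence observations and compare the coefficients; large changes would suggest the fit is being driven by a small subset of points:
# refit Model 3 excluding high Cook's D observations and compare coefficients
model3_refit <- lm(log_death_rate ~ log_case_rate + sex + is_50_plus,
data = filter(weekly_data, high_cooks_d == 0))
coef(model3)
coef(model3_refit)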
By incorporating additional explanatory variables into the linear regression model, we were able to show that each added variable has a statistically significant effect (as evidenced by high F statistics and very low p-values), reduces the residual sum of squares, and produces a model with greater explanatory power (as evidenced by a higher \(R^2\) value).
However, we also found diagnostic evidence to indicate that the assumptions for a linear regression model are not completely met by this particular model, despite the transformations made to the explanatory and response variables.
This dataset will be explored further in subsequent data dives.