Since the first confirmed US case of coronavirus disease 2019 (COVID-19) on January 21, 2020 (https://www.cdc.gov/media/releases/2020/p0121-novel-coronavirus-travel-case.html), the fifty states have imposed varying restrictions limiting social interactions among their citizens. Can any of these restrictions be shown to be related to the progress of the disease? By examining the government responses to the pandemic on an equal footing, we may get some idea of their relative effectiveness. This analysis was performed using R statistical software. If you would like to see the R code, please press the “code” button on the right (the buttons may not appear on mobile phones or iPads).
#load the required libraries
library(tidyverse)
library(janitor)
library(lindia)
library(DT)
Data on the various Covid-19 lockdown measures that states have taken were obtained from the Wikipedia article “U.S. state and local government responses to the COVID-19 pandemic” (https://en.wikipedia.org/wiki/U.S._state_and_local_government_responses_to_the_COVID-19_pandemic). “soh” refers to a stay-at-home order. I didn’t include “soh-enacted” in the analysis because it was unclear whether the blank cells represented simply missing data or indicated that a stay-at-home order was not enacted.
#1 corresponds to "yes - this measure was enforced by the state",
#0 corresponds to "no - this measure was not enforced"
lock <- read_csv("state_lockdowns.csv")
datatable(head(lock), class = "cell-border stripe compact")
The data were initially cleaned to select the relevant variables and put them in the proper format.
lock1 <- lock %>%
  clean_names() %>%
  select(state, population_2019, pop_density, school_close,
         mask_pub_require, grp_ban, travel_restrict, daycare_close,
         bar_rest_close, retail_close) %>%
  mutate(across(school_close:retail_close, as_factor)) %>%
  mutate(state = as_factor(state)) %>% #factorize character variables
  filter(state != "DC") #exclude the District of Columbia
Data on state Covid-19 cases and deaths were obtained from the New York Times repository: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv. Geographic regions were added by me.
cov <- read_csv("Covid_States.csv")
datatable(head(cov), class = "cell-border stripe compact")
As an initial data exploration, here is a plot of cumulative Covid-19 deaths over time for each state.
cov %>%
  ggplot(aes(x = date, y = deaths, group = state, color = region)) +
  geom_line() +
  theme_classic() +
  labs(title = "Cumulative Covid-19 Deaths by State",
       y = "Deaths", x = "Date") +
  annotate(geom = "text", x = as.Date("2020-04-08"), y = 25810, label = "New York") +
  annotate(geom = "text", x = as.Date("2020-06-10"), y = 17200, label = "New Jersey")

The graph shows marked variation in the cumulative death counts among the states; New York and New Jersey stand out as the two most severe. We need a way to compare the death counts so that we have an idea of the relative epidemic severity in each state. To put the states’ cumulative death curves on a common scale, a linear model will be fit to each state’s curve, with deaths as the outcome variable and date as the explanatory variable. The slope of each model summarizes how fast deaths accumulated in that state and will be used to compare the state response measures. The y-intercept was set at zero since it is not possible to have fewer than zero deaths.
For example, the linear slopes for Washington and New York are as follows:
cov_wa <- cov %>%
  filter(state == "Washington")
cov_ny <- cov %>%
  filter(state == "New York")
m_wa <- lm(deaths ~ 0 + date, data = cov_wa)
paste("slope wa =", round(coef(m_wa), 4))
## [1] "slope wa = 0.0565"
m_ny <- lm(deaths ~ 0 + date, data = cov_ny)
paste("slope ny =", round(coef(m_ny), 4))
## [1] "slope ny = 1.3334"
This linear-model function will be mapped over the whole set of Covid death data, producing one model per state.
state_model <- function(df){
  lm(deaths ~ 0 + date, data = df)
}
Nest the data per state
models <- cov %>%
  group_by(state) %>%
  nest() %>%
  mutate(mod = data %>% map(state_model)) %>%
  #remove the island territories and DC
  filter(state != "Guam") %>%
  filter(state != "Northern Mariana Islands") %>%
  filter(state != "Virgin Islands") %>%
  filter(state != "Puerto Rico") %>%
  filter(state != "District of Columbia") %>%
  mutate(state = as_factor(state))
Extract the linear coefficients (slopes) from the death data
coefs <- numeric(50) #preallocate a numeric vector, one slot per state
for (i in 1:50) {
  coefs[[i]] <- round(pluck(models, 3, i, 1, 1), 4)
}
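An equivalent and more idiomatic alternative (shown as a sketch, not the code that produced the results below) is to map over the model column with purrr and pull each slope directly:

#same result with purrr: extract the single slope from each model
coefs <- map_dbl(models$mod, ~ round(coef(.x)[[1]], 4))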
Combine the lockdown data and the coefficients
lock2 <- lock1 %>%
  bind_cols(coefs) %>%
  rename(coefs = "...11") %>% #bind_cols names a bare vector "...11"
  select(-c(population_2019, state)) %>%
  #logs for future use
  mutate(l_pop_density = log(pop_density))
  #mutate(l_coefs = log(coefs))
m1 <- lm(coefs ~ 0 + pop_density + school_close +
           mask_pub_require + grp_ban + travel_restrict +
           daycare_close + bar_rest_close + retail_close, data = lock2)
summary(m1)
##
## Call:
## lm(formula = coefs ~ 0 + pop_density + school_close + mask_pub_require +
## grp_ban + travel_restrict + daycare_close + bar_rest_close +
## retail_close, data = lock2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25995 -0.09416 -0.00639 0.03246 1.02656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## pop_density 0.0003293 0.0001171 2.811 0.00768 **
## school_closeno_order 0.0133458 0.2376669 0.056 0.95551
## school_closeopen 0.0622903 0.2611046 0.239 0.81269
## school_closepartial -0.0663001 0.2576594 -0.257 0.79829
## mask_pub_require1 -0.0106164 0.0730868 -0.145 0.88526
## grp_ban>50 -0.0424904 0.1060837 -0.401 0.69095
## grp_banall 0.0905460 0.0722371 1.253 0.21750
## travel_restrict1 -0.0794097 0.0653034 -1.216 0.23129
## daycare_close1 0.0301930 0.1088303 0.277 0.78291
## bar_rest_close1 0.0371167 0.2723330 0.136 0.89229
## retail_close1 0.0102528 0.1522711 0.067 0.94666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1942 on 39 degrees of freedom
## Multiple R-squared: 0.4831, Adjusted R-squared: 0.3373
## F-statistic: 3.313 on 11 and 39 DF, p-value: 0.002745
The only variable that showed statistical significance was population density. The state-imposed restrictions had no significant effect on the normalized death curves.
Evaluation of the Model
Model Assumptions:
Normality
lock2 %>%
  ggplot(aes(sample = pop_density)) +
  stat_qq() +
  stat_qq_line() +
  theme_classic()
The continuous explanatory variable (pop_density) looks decidedly non-normal. Let’s try taking its log and see if that improves the appearance.
lock2 %>%
  ggplot(aes(sample = l_pop_density)) +
  stat_qq() +
  stat_qq_line() +
  theme_classic()
Well, that looks better.
Run the model again with logs:
m2 <- lm(coefs ~ 0 + l_pop_density + school_close +
           mask_pub_require + grp_ban + travel_restrict +
           daycare_close + bar_rest_close + retail_close, data = lock2)
summary(m2)
Compare AICs
AIC(m1, m2)
## df AIC
## m1 12 -10.405246
## m2 12 -9.394987
M1 comes out a little better (lower AIC is better), so we’ll go with that.
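As an optional cross-check (not part of the original analysis), the models can also be compared by BIC, which penalizes complexity more heavily; since both models estimate the same number of parameters here, it ranks them the same way AIC does:

#compare the two models by BIC as well
BIC(m1, m2)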
Homogeneity
lock2 %>%
  mutate(residuals = resid(m1)) %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(binwidth = .08) +
  theme_classic()

The residuals appear to be roughly normally distributed. The outliers will be addressed later.
lock2 %>%
  ggplot(aes(x = fitted(m1), y = resid(m1))) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)
The spread of the residuals looks relatively even across the fitted values, so the homogeneity (constant-variance) assumption seems reasonable.
Independence
lock2 %>%
  ggplot(aes(x = coefs, y = residuals(m1))) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

There doesn’t appear to be any clear pattern, so there is no obvious violation of independence.
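For a single consolidated view of these diagnostics, the lindia package loaded at the top can also draw the full panel of standard lm diagnostic plots (an optional extra, not part of the original sequence):

#all standard diagnostic plots for m1 in one panel
gg_diagnose(m1)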
You probably noticed the outliers (New York and New Jersey). Here is a more formal view, the Cook’s distances for the model (observation 32 is New York, 39 is New Jersey).
gg_cooksd(m1)

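To confirm numerically which observations are influential, the Cook’s distances can also be pulled directly from base R; the 4/n cutoff below is a common rule of thumb, not part of the original analysis:

#flag observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(m1)
which(cd > 4 / length(cd))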
Our model appears to be a reasonable linear model that does not greatly violate the underlying assumptions. Leaving the outliers in is, I think, important here, since they are major players in the story; removing them would more likely result in an overfitted model. The adjusted R² indicates that the model explains about 34% of the variation in the outcome, so there are other potential variables whose measurement might produce a more accurate model. Population density was the only explanatory variable showing statistical significance. It would be interesting to see whether variation in state requirements for vaccination and for universal testing had any influence on the outcome.
I chose deaths rather than cases as the outcome variable because the daily case count (the number of people testing positive) includes people who may have been infected months in the past, people who will never have symptoms or get sick, and false positives. A death, by contrast, is an unambiguous binary event, so the death count avoids the problems encountered when using cases as the outcome.
This model would tend to cast some doubt on the efficacy of mass lockdown measures in influencing the outcome of Covid-19 infections.