Introduction
The ongoing COVID-19 pandemic has impacted lives worldwide, but residents of different regions have been differentially impacted. Since COVID-19 regulations have often been handled at the national or state level, the impact of subregional influences on case counts needs to be addressed to allocate aid where it is needed most. Additional aid to vulnerable subregions could hinder the progress of the virus as it spreads throughout an entire country and keep more of the overall population healthy. One particular factor we are interested in is the population distribution over a country, and how differing population densities between subregions plays a role in transmission rates of the disease.
Main Question
This study looks to answer one main research question:
What is the relationship between population distribution and COVID-19 case rates within a country, and is it a causal relationship?
Our main hypothesis for this research question is that population density will have a positive relationship with COVID-19 case rates. Additionally, we hypothesize that a higher population density causes increased COVID-19 case rates. Since COVID-19 is a respiratory illness that spreads readily through the community via the air (Peters, 2020). The primary reasoning behind our hypotheses is that higher densities of people breathing in proximity would theoretically make it easier to transmit the virus. Thus, areas with more people per square mile will have a proportionally higher rate of transmission.
Data Summary
Our analysis will use the state of California as a proxy for other countries and we will look at daily case numbers from March 2020 to February 2021. The data encapsulates several axes of COVID-19 metrics, including the number of cases, number of deaths, and case-mortality rate. To isolate the problem, we will adjust for other regional socioeconomic factors, such as the percentage of the elderly and household income, to be sure that the results we see are the effect of population density. The sources of data include the country-level WHO dataset (2021), and New York Times data dataset, which has the added advantage of including county-level metrics. Population density, and compounding factors such as socioeconomic data and age demographics are also included. The summaries of and links to these datasets and what they provided is listed in Table 1 below. Model identification and diagnostics, and a propensity score analysis will address the question of the association between population density and COVID-19 rates. Policymakers could use the results of this report to specifically allocate funds to best address the needs of their county. Additionally, the general population or those in the healthcare field may be interested in the results of this report.
Linear Mixed-Effects Model
Our proposed model seeks to determine a casual relationship between an area’s metropolitan status and COVID-19 infection rates. As previously stated, we assume the independent fixed effects of elder ratio and poverty percent contribute underlying effects in determining both whether a county is labeled as metropolitan and its associated infection rates.
It is under our assumption that metropolitan locations tend to have lower numbers of elderly and higher levels of poverty compared to non-metropolitan areas.
The lme4 package was used to fit our Linear Mixed-Effects Models. Since the data contained a multitude of observations for different counties over the course of several days, we assumed existing random, unmeasurable effects were present in our data. In this model, we define county and date effects as random effects we cannot control.
Our Linear Mixed-Effects Model takes the following form:
\[y = {X\beta} + {Zu} + {\varepsilon}\] Where - \(y\) represents an \(N \times 1\) vector of our fatality rates.
\(X\) is a \(N \times p\) matrix of our predictor variables.
\(\beta\) is a \(p \times 1\) column of our fixed effects coefficients.
\(Z\) is a \(N \times q:groups\) matrix for the \(q\) random effects for our county and state variables.
Finally, \(\epsilon\) captures the portion of \(y\) not included by the rest of the model.
(McCullagh, P, & Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall/CRC Press.)
The distribution of infection rates was highly right skewed, so a \(log(x+1)\) transformation was applied to the numeric variables in the data.

A quick comparision of the average infection rates between our treatment and control indicates a higher average present in metropolitan coded counties.

A Mixed Effects Linear model was fit without the use of propensity score weighting.
In the case of this model, our treatment was MetroCode=1 (metropolitan county) and our control was MetroCode=0 (non-metropolitan county)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## new_casesrate ~ (elder_ratio) + factor(MetCode) + I(Poverty_Percent/100) +
## (1 | county) + (1 | date)
## Data: transformed
##
## REML criterion at convergence: 22965
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.0374 -0.5210 -0.0621 0.4523 10.5151
##
## Random effects:
## Groups Name Variance Std.Dev.
## date (Intercept) 0.3285 0.5732
## county (Intercept) 0.0202 0.1421
## Residual 0.1731 0.4161
## Number of obs: 19510, groups: date, 342; county, 58
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.10750 0.27164 55.43390 4.077 0.000147 ***
## elder_ratio -3.55893 0.61923 54.00154 -5.747 4.32e-07 ***
## factor(MetCode)1 0.12100 0.06207 53.99786 1.950 0.056427 .
## I(Poverty_Percent/100) 6.60808 7.10041 53.99249 0.931 0.356171
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) eldr_r f(MC)1
## elder_ratio -0.734
## fctr(MtCd)1 -0.737 0.749
## I(Pv_P/100) -0.883 0.368 0.457
We found the treatment effect to be statistically significant at \(p=0.1\). Poverty rates were not found to have a significant effect while elder ratios were statistically significant with \(p<0.001\).
However, our previous analysis leads us to believe both poverty rates and elder ratios influence whether an area is determined as metropolitan or not.
To assess the extend of this bias during assignment, we will compare the imbalance present.
MetroCodes against Elder Ratio
##
## Call:
## lm(formula = elder_ratio ~ MetCode, data = transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.070784 -0.020169 -0.002452 0.030677 0.067597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2072814 0.0003922 528.5 <2e-16 ***
## MetCode -0.0671659 0.0004897 -137.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03281 on 19508 degrees of freedom
## Multiple R-squared: 0.4909, Adjusted R-squared: 0.4909
## F-statistic: 1.881e+04 on 1 and 19508 DF, p-value: < 2.2e-16
MetroCodes against Poverty Ratio
##
## Call:
## lm(formula = (Poverty_Percent/100) ~ MetCode, data = transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0056979 -0.0020387 0.0000381 0.0019888 0.0060561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.713e-02 3.426e-05 792.08 <2e-16 ***
## MetCode -1.836e-03 4.278e-05 -42.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002866 on 19508 degrees of freedom
## Multiple R-squared: 0.08627, Adjusted R-squared: 0.08622
## F-statistic: 1842 on 1 and 19508 DF, p-value: < 2.2e-16
In both cases there appears to be strong imbalance between MetroCode levels and our two covariates as \(p<0.01\). This suggests there may be selection bias present. We aim to use propensity score to reduce these effects.
Propensity scores were determined using logistic regression with our treatment as the response against our two covariates. The probabilities estimated by the logistic regression model were processed and inverted. The final scores were used as weights in the final Linear Mixed Model.
Pscores
## 0% 25% 50% 75% 100%
## 0.06482176 0.79742913 0.92763157 0.97213783 0.99722439
##
## Call:
## glm(formula = MetCode ~ (elder_ratio) + I(Poverty_Percent/100),
## family = binomial(logit), data = transformed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3393 -0.1785 0.2490 0.4253 1.5826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 22.0902 0.3326 66.42 <2e-16 ***
## elder_ratio -56.6185 0.7890 -71.76 <2e-16 ***
## I(Poverty_Percent/100) -461.5966 9.4778 -48.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25467 on 19509 degrees of freedom
## Residual deviance: 11122 on 19507 degrees of freedom
## AIC: 11128
##
## Number of Fisher Scoring iterations: 6

The distribution of the propensity scores indicates there is not a significant difference in scoring between the treatment and control group despite the existence of confounding effects. The propensity scoring and matching method used may be improved upon with alternate methods.
Final Weighted Model
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## new_casesrate ~ (elder_ratio) + factor(MetCode) + I(Poverty_Percent/100) +
## (1 | county) + (1 | date)
## Data: transformed
## Weights: weight
##
## REML criterion at convergence: 34367.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -11.3501 -0.4006 -0.0246 0.3546 16.5991
##
## Random effects:
## Groups Name Variance Std.Dev.
## date (Intercept) 0.36993 0.6082
## county (Intercept) 0.01967 0.1402
## Residual 0.41492 0.6441
## Number of obs: 19510, groups: date, 342; county, 58
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.11238 0.26973 53.91770 4.124 0.000129 ***
## elder_ratio -3.56544 0.61396 52.21704 -5.807 3.83e-07 ***
## factor(MetCode)1 0.12003 0.06125 51.22446 1.960 0.055469 .
## I(Poverty_Percent/100) 6.48026 7.08297 53.50554 0.915 0.364345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) eldr_r f(MC)1
## elder_ratio -0.728
## fctr(MtCd)1 -0.730 0.742
## I(Pv_P/100) -0.883 0.362 0.451
The final mixed effects model with propensity score weights is not significantly different than the model without. The difference in estimates is not marginally different from the unweighted model. Our treatment group has a positive relationship with the infection rate (\(p<0.01\)), indicating metropolitan counties may have high infection rates than non-metropolitan counties.