1 Abstract

The COVID-19 pandemic, while ubiquitous worldwide, has differentially affected different segments of the population. Regional demographic factors, such as population density, could have an impact on this respiratory, community transmitted disease. This study addresses the association and impact of demographic factors on COVID-19 rates, particularly in California Counties. California’s counties were selected due to their large range in population densities (1.6 to 17,179.2 people per square mile) and diversity across the rural-urban continuum. State model type. In particular, this study calculates a propensity score to explore if there is a causal relationship between the population density and COVID-19 case rates. The linear mixed model indicates a negative relationship between elder ratio and infection rate and a positive relationship between metropolitan counties and infection rates.Selection bias persisted after propensity scores were included in the model, invalidating our ability to infer a causal relationship between metropolitan locations and infection rates. These findings indicate which areas are at higher risk during pandemic outbreaks and could critically help policymakers allocate funding and healthcare to these regions.

2 Introduction

The ongoing COVID-19 pandemic has impacted lives worldwide, but residents of different regions have been differentially impacted. Since COVID-19 regulations have often been handled at the national or state level, the impact of subregional influences on case counts needs to be addressed to allocate aid where it is needed most. Additional aid to vulnerable subregions could hinder the progress of the virus as it spreads throughout an entire country and keep more of the overall population healthy. One particular factor we are interested in is the population distribution over a country, and how differing population densities between subregions plays a role in transmission rates of the disease.

2.1 Main Question

This study looks to answer one main research question:

What is the relationship between population distribution and COVID-19 case rates within a country, and is it a causal relationship?

Our main hypothesis for this research question is that population density will have a positive relationship with COVID-19 case rates. Additionally, we hypothesize that a higher population density causes increased COVID-19 case rates. Since COVID-19 is a respiratory illness that spreads readily through the community via the air (Peters, 2020). The primary reasoning behind our hypotheses is that higher densities of people breathing in proximity would theoretically make it easier to transmit the virus. Thus, areas with more people per square mile will have a proportionally higher rate of transmission.

2.2 Data Summary

Our analysis will use the state of California as a proxy for other countries and we will look at daily case numbers from March 2020 to February 2021. The data encapsulates several axes of COVID-19 metrics, including the number of cases, number of deaths, and case-mortality rate. To isolate the problem, we will adjust for other regional socioeconomic factors, such as the percentage of the elderly and household income, to be sure that the results we see are the effect of population density. The sources of data include the country-level WHO dataset (2021), and New York Times data dataset, which has the added advantage of including county-level metrics. Population density, and compounding factors such as socioeconomic data and age demographics are also included. The summaries of and links to these datasets and what they provided is listed in Table 1 below. Model identification and diagnostics, and a propensity score analysis will address the question of the association between population density and COVID-19 rates. Policymakers could use the results of this report to specifically allocate funds to best address the needs of their county. Additionally, the general population or those in the healthcare field may be interested in the results of this report.

Dataset Components Used
WHO COVID Dataset Number of new cases and deaths are calculated by subtracting previous cumulative total counts from the current count.
NY Times COVID Dataset Total COVID cases and deaths per county in each state of the USA
County Household Sizes and Population Densities Population density for each county in the USA
Urban County Codes Population Estimate, Number of elders over 65, Housing Density
Demographic Information Poverty estimates, Median household income

3 Background

3.1 Proxy For Countries

Our choice to use California instead of any country was because of the data we had access to. While we could compare the population densities between nations and compare the effects of the pandemic on them, we chose not to for two reasons:

  1. Interpretability: The purpose of our analysis is to provide actionable insights for leaders of a region to determine how to distribute resources. There is no leader of the entire world who decides how to distribute resources among countries. Thus we must work at a scale that is governed by a single body. With California, the governor has control over how resources can be distributed to counties.

  2. Scale: We would like to have our regions subdivided into the smallest possible chunks to give the best description of how the population is distributed over a region. The level of concentration or distribution of a country’s population cannot be quantified without regional data. For example, Canadas population is very concentrated around the southern border despite having large amounts of relatively unpopulated land in the north. The national population density would be lowered by the large amounts of uninhabited area despite the population being relatively concentrated.

To show how our results from California can be used at a national level, we give two examples of countries that have a similar area and similar population density as California:

  1. Iraq: 169,235 square miles compared to 163,696 of California (Notably with a similar population density too, at 214 per square mile).

  2. Spain: 237 people per square mile compared to 254 of California.

Although we do not have access to regional level data for these countries, we can see how they are similar to California geographically.

3.2 Data Specifics

This project utilizes a few different data sources because they each shed a different light on the socio-demographic realities of COVID-19. The different data sources are summarized in Table 1.

3.2.1 WHO COVID-19 Dataset

The first data source, the WHO COVID-19 global dataset, was compiled by The World Health Organization to complete their global mission of freely communicating global health data. The target population for this dataset was the entire world population, reported by country of residence. The sampling mechanism was through the reporting of sovereign nations. This sampling mechanism may have had differences between countries as some have more stringent reporting requirements than others. The variables utilized from this dataset were the number of new cases and deaths at a country level. We used this dataset to enhance the analysis at a global level, incorporating nations with similar population densities to California.

3.2.2 New York Times Dataset

The second dataset is from the New York Times. This data contains county level information about the number of confirmed COVID-19 cases (both new and total). The data is compiled for the New York Times by journalists across the nation who combine info from public health press conferences and meetings. The target population is the entire United States, and the sampling mechanism is, theoretically, every case reported at a state or county level. If a state or a county refused or neglected to communicate that information publicly, the data would not be represented in this database. The variables we used from this dataset were the total number of COVID-19 cases. This dataset was critical to the county-level component of the study, which chose to isolate the effects of population density in subregions which are more evenly urban or rural than the state as a whole.

3.2.3 US Census Dataset

The third data source is the US census. This Organization is a part of the US government who counts the entire nation every decade. As a census, by definition the target population is every resident of the United States, and their sampling mechanism is to speak to every household in the entire country and count all residents. The main variable that we used from this dataset is the population density per county. This is calculated as the population per square mile. We used this dataset as it could give us the best idea of the true population density.

3.3 Other work

Similar studies have looked at the effects of population density on the COVID-19 pandemic (Peters, 2020; Wong, 2020; Almagro, 2020). Previous studies have shown that it is an important factor in community-spread diseases (Tarwater 2001). This report adds to the existing literature as it adds the analysis of causality in the discussion.

4 Data Exploration

In this section, we will be giving a comparative international view of the COVID-19 pandemic, complimented by county-level data in California. We will be looking at trends in the data and discussing any abnormalities found.

4.1 Iraq and Spain

First, we looked at the global pandemic rates in each country. Since we are interested in applying the county-level data in California on a national scale, we focused on Spain and Iraq. These two countries which have similar characteristics to California (area and population density).

We looked to compare the 7 day running average for Spain and Iraq.

4.2 California

4.2.1 Overall

Below are the plots for cases and deaths in California

We can see there are two waves of the pandemic, with the second being much worse than the first. The week to week fluctuations are suggest that people are catching the disease at certain times of the week (presumably the weekend) and these fluctuations are normalized by the 7-day running average.

4.2.2 Example Counties

Looking at a few counties as an example, below we can see San Francisco and Inyo counties. The first is a metropolitan county and the second is not.

San Francisco has fewer case fatalities compared to Inyo during the entire time period.

This graph shows the case fatality rates per month for the respective counties. We also intended to plot Alpine county due to the very low population density, however, people living in Alpine county had no mortalities. This could be due to the fact that nobody has yet had a fatal case of COVID-19. Alternatively, this could be because people in critical condition in this county are moved to another county to receive Critical Care treatment. It is impossible to know this from the information given.

Next, we analyzed the trends in California. Below is a figure depicting the population density of all 58 counties in California. This shows that San Francisco (in the Bay area, central California) has the highest population density. The counties further East (largely mountainous and desert regions) have the lowest population densities. Note the variation in population density across counties, across orders of magnitude

Figure X. Population Density in California. For mapping methods see: Making maps with R)

We also looked at the case count in each of the counties. These are represented in Figures (X-X) below. While the pure number of cases changed from July (a relatively low-case point) to January (a relatively high-case point), the relative number of cases in each of the counties does not seem to change. This observation requires further statistical analysis to explore.

Figure X. July 2020 case count. For mapping methods see: Making maps with R

Figure X. January 2020 case count. For mapping methods see: Making maps with R

4.3 Grouping Response Variable

To simplify our model we will be condensing our response variable by labeling each county as a metropolitan or non-metropolitan county. Codes corresponding to these categories were pulled from the California Demographic data. Having a binary response allows us to perform causal inference from the results of our model. This way we can use non-metropolitan counties as a control group and metropolitan counties as a treatment group.

Looking at the case numbers between our two groups the first smaller wave had more cases per-capita in metropolitan areas, but in the rising of the second wave, the non-metropolitan counties meet and exceed the cases per-capita of metropolitan counties. Interestingly, the dip around the peak of the second wave was almost entirely caused by metropolitan counties, perhaps suggesting people were leaving the city for the holidays.

5 Linear Mixed-Effects Model

Our proposed model seeks to determine a casual relationship between an area’s metropolitan status and COVID-19 infection rates. As previously stated, we assume the independent fixed effects of elder ratio and poverty percent contribute underlying effects in determining both whether a county is labeled as metropolitan and its associated infection rates.

It is under our assumption that metropolitan locations tend to have lower numbers of elderly and higher levels of poverty compared to non-metropolitan areas.

The lme4 package was used to fit our Linear Mixed-Effects Models. Since the data contained a multitude of observations for different counties over the course of several days, we assumed existing random, unmeasurable effects were present in our data. In this model, we define county and date effects as random effects we cannot control.

Our Linear Mixed-Effects Model takes the following form:

\[y = {X\beta} + {Zu} + {\varepsilon}\] Where - \(y\) represents an \(N \times 1\) vector of our fatality rates.

(McCullagh, P, & Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall/CRC Press.)

The distribution of infection rates was highly right skewed, so a \(log(x+1)\) transformation was applied to the numeric variables in the data.

A quick comparision of the average infection rates between our treatment and control indicates a higher average present in metropolitan coded counties.

A Mixed Effects Linear model was fit without the use of propensity score weighting.

In the case of this model, our treatment was MetroCode=1 (metropolitan county) and our control was MetroCode=0 (non-metropolitan county)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## new_casesrate ~ (elder_ratio) + factor(MetCode) + I(Poverty_Percent/100) +  
##     (1 | county) + (1 | date)
##    Data: transformed
## 
## REML criterion at convergence: 22965
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.0374 -0.5210 -0.0621  0.4523 10.5151 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  date     (Intercept) 0.3285   0.5732  
##  county   (Intercept) 0.0202   0.1421  
##  Residual             0.1731   0.4161  
## Number of obs: 19510, groups:  date, 342; county, 58
## 
## Fixed effects:
##                        Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)             1.10750    0.27164 55.43390   4.077 0.000147 ***
## elder_ratio            -3.55893    0.61923 54.00154  -5.747 4.32e-07 ***
## factor(MetCode)1        0.12100    0.06207 53.99786   1.950 0.056427 .  
## I(Poverty_Percent/100)  6.60808    7.10041 53.99249   0.931 0.356171    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) eldr_r f(MC)1
## elder_ratio -0.734              
## fctr(MtCd)1 -0.737  0.749       
## I(Pv_P/100) -0.883  0.368  0.457

We found the treatment effect to be statistically significant at \(p=0.1\). Poverty rates were not found to have a significant effect while elder ratios were statistically significant with \(p<0.001\).

However, our previous analysis leads us to believe both poverty rates and elder ratios influence whether an area is determined as metropolitan or not.

To assess the extend of this bias during assignment, we will compare the imbalance present.

MetroCodes against Elder Ratio

## 
## Call:
## lm(formula = elder_ratio ~ MetCode, data = transformed)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.070784 -0.020169 -0.002452  0.030677  0.067597 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.2072814  0.0003922   528.5   <2e-16 ***
## MetCode     -0.0671659  0.0004897  -137.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03281 on 19508 degrees of freedom
## Multiple R-squared:  0.4909, Adjusted R-squared:  0.4909 
## F-statistic: 1.881e+04 on 1 and 19508 DF,  p-value: < 2.2e-16

MetroCodes against Poverty Ratio

## 
## Call:
## lm(formula = (Poverty_Percent/100) ~ MetCode, data = transformed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0056979 -0.0020387  0.0000381  0.0019888  0.0060561 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.713e-02  3.426e-05  792.08   <2e-16 ***
## MetCode     -1.836e-03  4.278e-05  -42.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002866 on 19508 degrees of freedom
## Multiple R-squared:  0.08627,    Adjusted R-squared:  0.08622 
## F-statistic:  1842 on 1 and 19508 DF,  p-value: < 2.2e-16

In both cases there appears to be strong imbalance between MetroCode levels and our two covariates as \(p<0.01\). This suggests there may be selection bias present. We aim to use propensity score to reduce these effects.

Propensity scores were determined using logistic regression with our treatment as the response against our two covariates. The probabilities estimated by the logistic regression model were processed and inverted. The final scores were used as weights in the final Linear Mixed Model.

Pscores

##         0%        25%        50%        75%       100% 
## 0.06482176 0.79742913 0.92763157 0.97213783 0.99722439
## 
## Call:
## glm(formula = MetCode ~ (elder_ratio) + I(Poverty_Percent/100), 
##     family = binomial(logit), data = transformed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3393  -0.1785   0.2490   0.4253   1.5826  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              22.0902     0.3326   66.42   <2e-16 ***
## elder_ratio             -56.6185     0.7890  -71.76   <2e-16 ***
## I(Poverty_Percent/100) -461.5966     9.4778  -48.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25467  on 19509  degrees of freedom
## Residual deviance: 11122  on 19507  degrees of freedom
## AIC: 11128
## 
## Number of Fisher Scoring iterations: 6

The distribution of the propensity scores indicates there is not a significant difference in scoring between the treatment and control group despite the existence of confounding effects. The propensity scoring and matching method used may be improved upon with alternate methods.

5.1 Final Weighted Model

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## new_casesrate ~ (elder_ratio) + factor(MetCode) + I(Poverty_Percent/100) +  
##     (1 | county) + (1 | date)
##    Data: transformed
## Weights: weight
## 
## REML criterion at convergence: 34367.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -11.3501  -0.4006  -0.0246   0.3546  16.5991 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  date     (Intercept) 0.36993  0.6082  
##  county   (Intercept) 0.01967  0.1402  
##  Residual             0.41492  0.6441  
## Number of obs: 19510, groups:  date, 342; county, 58
## 
## Fixed effects:
##                        Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)             1.11238    0.26973 53.91770   4.124 0.000129 ***
## elder_ratio            -3.56544    0.61396 52.21704  -5.807 3.83e-07 ***
## factor(MetCode)1        0.12003    0.06125 51.22446   1.960 0.055469 .  
## I(Poverty_Percent/100)  6.48026    7.08297 53.50554   0.915 0.364345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) eldr_r f(MC)1
## elder_ratio -0.728              
## fctr(MtCd)1 -0.730  0.742       
## I(Pv_P/100) -0.883  0.362  0.451

The final mixed effects model with propensity score weights is not significantly different than the model without. The difference in estimates is not marginally different from the unweighted model. Our treatment group has a positive relationship with the infection rate (\(p<0.01\)), indicating metropolitan counties may have high infection rates than non-metropolitan counties.

6 Sensitivity Analysis

7 Causal Interpretation

After the addition of propensity scores in our Linear Mixed Model, we observed very little change in the treatment effect. The biases introduced by poverty and elder ratios are not significantly reduced after weighting. It is therefore highly possible that there still exists selection bias which invalidates our ability to conduct proper causal inference.

8 Conclusion

In this study we examined the relationship between COVID-19 infection rates per 10,000 people in California and population density. The data on infection rates was collected daily on the county level from March 2020 to February 2021. We proposed a linear mixed model with county and date effects as random effects. In an attempt to perform causal inference, we considered the code indicating whether a county was metropolitan or not as our treatment group. We assumed information about population density was contained within metropolitan scores. Elder and poverty ratios were included in the model as confounding effects on our treatment.

The linear mixed model indicates a negative relationship between elder ratio and infection rate and a positive relationship between metropolitan counties and infection rates.Selection bias persisted after propensity scores were included in the model, invalidating our ability to infer a causal relationship between metropolitan locations and infection rates.

9 References

  1. Almagro, M. (2020, September 22). Racial Disparities in Frontline Workers and Housing Crowding During COVID-19: Evidence from Geolocation Data by Milena Almagro, Joshua Coven, Arpit Gupta, Angelo Orane-Hutchinson :: SSRN. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3695249

  2. Making maps with R. (n.d.). Retrieved March 05, 2021, from https://eriqande.github.io/rep-res-web/lectures/making-maps-with-R.html

  3. Mortality Risk of COVID-19 - Statistics and Research. (2021). Our World in Data. https://ourworldindata.org/mortality-risk-covid

  4. Peters, D. J. (2020, June 1). Community Susceptibility and Resiliency to COVID-19 Across the Rural‐Urban Continuum in the United States. Wiley Online Library. https://onlinelibrary.wiley.com/doi/full/10.1111/jrh.12477

  5. Tarwater, P. M. (2001, July 1). Effects of population density on the spread of disease. Wiley Online Library. https://onlinelibrary.wiley.com/doi/abs/10.1002/cplx.10003#:%7E:text=Although%20an%20epidemic%20is%20likely,to%2098%25%20of%20the%20time

  6. Wong, D. W. S. (2020, December 23). Spreading of COVID-19: Density matters. PLOS ONE. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0242398

Session info

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] zoo_1.8-8        scales_1.1.1     lmerTest_3.1-3   lme4_1.1-26     
##  [5] Matrix_1.3-2     forcats_0.5.1    stringr_1.4.0    purrr_0.3.4     
##  [9] readr_1.4.0      tidyr_1.1.2      tibble_3.0.6     tidyverse_1.3.0 
## [13] gridExtra_2.3    hrbrthemes_0.8.0 plotly_4.9.3     dplyr_1.0.4     
## [17] ggplot2_3.3.3   
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-152        fs_1.5.0            lubridate_1.7.10   
##  [4] httr_1.4.2          numDeriv_2016.8-1.1 tools_4.0.4        
##  [7] backports_1.2.1     bslib_0.2.4         utf8_1.1.4         
## [10] R6_2.5.0            DBI_1.1.1           lazyeval_0.2.2     
## [13] colorspace_2.0-0    withr_2.4.1         tidyselect_1.1.0   
## [16] curl_4.3            compiler_4.0.4      extrafontdb_1.0    
## [19] cli_2.3.1           rvest_0.3.6         xml2_1.3.2         
## [22] labeling_0.4.2      sass_0.3.1          systemfonts_1.0.1  
## [25] digest_0.6.27       minqa_1.2.4         rmarkdown_2.7      
## [28] pkgconfig_2.0.3     htmltools_0.5.1.1   extrafont_0.17     
## [31] highr_0.8           dbplyr_2.1.0        htmlwidgets_1.5.3  
## [34] rlang_0.4.10        readxl_1.3.1        rstudioapi_0.13    
## [37] jquerylib_0.1.3     generics_0.1.0      farver_2.1.0       
## [40] jsonlite_1.7.2      crosstalk_1.1.1     magrittr_2.0.1     
## [43] Rcpp_1.0.6          munsell_0.5.0       fansi_0.4.2        
## [46] gdtools_0.2.3       lifecycle_1.0.0     stringi_1.5.3      
## [49] yaml_2.2.1          MASS_7.3-53         grid_4.0.4         
## [52] crayon_1.4.1        lattice_0.20-41     haven_2.3.1        
## [55] splines_4.0.4       hms_1.0.0           knitr_1.31         
## [58] pillar_1.5.0        boot_1.3-26         reprex_1.0.0       
## [61] glue_1.4.2          evaluate_0.14       data.table_1.14.0  
## [64] modelr_0.1.8        vctrs_0.3.6         nloptr_1.2.2.2     
## [67] Rttf2pt1_1.3.8      cellranger_1.1.0    gtable_0.3.0       
## [70] assertthat_0.2.1    xfun_0.21           broom_0.7.5        
## [73] viridisLite_0.4.0   statmod_1.4.35      ellipsis_0.3.1