Multiple Regression Analysis of COVID-19 Data by U.S. County


Continuing my regression analysis from last week, this time around I’ll add additional terms, including a quadratic term, a dichotomous term, and a quadratic-by-dichotomous interaction term, to build out a multiple regression.

From last week’s analysis, it appears that population estimates were a fairly good predictor of the number of confirmed COVID-19 cases at the county level. To make this more of a predictive model, we’ll determine whether population estimates, as well as other variables, can accurately predict the number of new confirmed cases for yesterday, April 21st, based on counts from previous days. If the model fits the data well, we could eventually see whether we can apply it to future dates.

To do this, we’ll first have to merge in some extra data. Fortunately, I have tidied other census data in past semesters and can read in a data file below that includes demographic, gender, age, education, income, and employment data at the county level.


Step 1: Downloading the Data

I decided to utilize the New York Times dataset that was made open source at the end of March. For more information about the methodology behind these confirmed case counts, you can find a detailed explanation here.

Additionally, I’ll again be using U.S. Census population estimates as a factor in my regression model; I found a large dataset on the Census website with estimates as recent as 2019.

And finally, as mentioned above, I downloaded the census data I compiled from last semester as an additional file for this week.

I saved all three of these files to my GitHub account and read them into R:

county_covid <- read.csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Homework13/Discussion13/us-counties_covid19_4222020.csv', header = TRUE)

county_population <- read.csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Homework12/Discussion12/county_population.csv', header = TRUE)

education_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data_606_cunysps/master/Final_Project/education_data_fip.csv', header = TRUE)

Step 2: Tidying the Data

With the data successfully loaded into R, I then had to do a bit of tidying to ensure that my county-level data was accurate and reflected the most up-to-date COVID-19 confirmed case counts.

First, I decided to tidy my COVID-19 data. Since the confirmed counts are cumulative and broken out by each day since late January, I needed to filter the dataset down to the window from April 12th through April 21st, 2020, compute the day-over-day differences (i.e., new daily cases), and hold out the April 21st counts as the values we’ll try to predict. The daily differences are then averaged into a week_avg term.

covid_week <- county_covid %>% 
  arrange(fips, date) %>% 
  filter((as.Date(date) >= as.Date('2020-04-12')) & (as.Date(date) <= as.Date('2020-04-21')))

covid_test <- covid_week %>% 
  filter((as.Date(date) == as.Date('2020-04-21')))

covid_week_pivot <- covid_week %>% 
  pivot_wider(id_cols = c(fips, state, county), names_from = date, values_from = cases)

diff_list <- c("413_diff", "414_diff", "415_diff", "416_diff", "417_diff", "418_diff", "419_diff", "420_diff")

# daily new cases: subtract each day's cumulative count from the following
# day's, indexing the date columns by name to avoid off-by-one errors
date_cols <- format(seq(as.Date('2020-04-12'), as.Date('2020-04-20'), by = 'day'))
for (j in 1:8) {
  covid_week_pivot[[diff_list[j]]] <- covid_week_pivot[[date_cols[j + 1]]] - covid_week_pivot[[date_cols[j]]]
}

covid_week_fnl <- covid_week_pivot %>% 
  select(fips, state, county, `413_diff`, `414_diff`, `415_diff`, `416_diff`, `417_diff`, `418_diff`, `419_diff`, `420_diff`, `2020-04-20`)


covid_week_fnl <- mutate(covid_week_fnl, week_avg = rowMeans(covid_week_fnl[, 4:11], na.rm = TRUE))
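
As an aside, the same daily differences could have been computed without pivoting at all, using dplyr’s grouped lag(). A minimal sketch (covid_daily and new_cases are hypothetical names, not used below):

# alternative: daily new cases computed directly in long format
covid_daily <- covid_week %>%
  group_by(fips) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(new_cases = cases - lag(cases)) %>%
  ungroup()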

Next, I noticed that the New York Times groups the confirmed counts for New York City into one row, instead of breaking them out across the five counties that make up New York City (New York, Kings, Queens, Bronx, and Richmond). Therefore, I decided to use the FIPS code for New York County (36061) as my identifier when I eventually merge the population data into the confirmed case data, and I later adjust the population estimate to reflect all five counties instead of just New York County (seen later).

# recode the combined New York City rows to New York County's FIPS code
covid_week_fnl$fips[covid_week_fnl$county == 'New York City'] <- '36061'

covid_test$fips[covid_test$county == 'New York City'] <- '36061'

covid_test <- covid_test %>% 
  select(fips, state, county, cases)

For the merge, I also thought it would be helpful to make the county names consistent across both datasets, so I appended ‘County’ to each county name in the COVID-19 file:

covid_week_fnl$county <- paste(covid_week_fnl$county, 'County')

With the COVID-19 data file ready for the merge, I then turned my attention to my population data file. In order to use county FIPS codes as my identifier in both datasets, I had to generate the FIPS codes in the population file.

county_population <- county_population %>% 
  select(STATE, COUNTY, STNAME, CTYNAME, POPESTIMATE2019)

# build the county FIPS code: the state code followed by the
# county code zero-padded to three digits
county_population$fips <- paste0(county_population$STATE, sprintf('%03d', county_population$COUNTY))

county_population <- county_population %>% 
  select(fips, CTYNAME, STNAME, POPESTIMATE2019)
names(county_population) <- c('fips', 'county', 'state', 'pop_estimate')

I also noticed that this file had overall state population estimates, so before merging, I made sure to filter these out. (Note that filtering on the word “County” also drops county-equivalents such as Louisiana parishes and Alaska boroughs, so those areas fall out of this analysis.)

county_population <- county_population %>% 
  filter(grepl("County",county))

With both data files ready to go, I then merged the population estimates into the COVID-19 data file and created one final dataframe. I also updated the population count for New York City so that it accounts not just for New York County, but for the other four boroughs as well.

fnl <- merge(covid_week_fnl, county_population, by='fips')

fnl <- fnl %>% 
  select(fips, county.x, state.x, `2020-04-20`, week_avg, pop_estimate)

names(fnl) <- c('fips', 'county', 'state', 'April20cases', 'week_avg', 'pop_estimate')

fnl$pop_estimate[fnl$county == 'New York City County'] <- 8398748
fnl$county[fnl$county == 'New York City County'] <- 'New York City'

fnl <- fnl %>% 
  arrange(desc(week_avg))

Here is a look at my final dataframe, ready to start my regression analysis:

kable(head(fnl, n=15L)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
fips county state April20cases week_avg pop_estimate
36061 New York City New York 136816 4293.1429 8398748
17031 Cook County Illinois 22101 946.7143 5150233
36059 Nassau County New York 30677 902.7143 1356924
36103 Suffolk County New York 27662 859.8571 1476601
36119 Westchester County New York 24306 645.8571 967506
6037 Los Angeles County California 13816 628.0000 10039107
34039 Union County New Jersey 9972 476.5714 556341
34017 Hudson County New Jersey 11150 467.2857 672391
25017 Middlesex County Massachusetts 9253 467.1429 1611699
34013 Essex County New Jersey 10729 442.1429 798975
34003 Bergen County New Jersey 13011 417.0000 932202
42101 Philadelphia County Pennsylvania 9553 391.8571 1584064
25025 Suffolk County Massachusetts 8314 390.7143 803907
34031 Passaic County New Jersey 8479 361.2857 501826
34023 Middlesex County New Jersey 8346 337.0000 825062

We’ll then merge in a few more terms from the other census dataset:

fnl <- merge(fnl, education_data, by='fips')

fnl <- fnl %>% 
  select(fips, county.x, state.x, April20cases, week_avg, pop_estimate, median_hh_inc, age65andolder_pct, nonwhite_pct, lesscollege_pct)

Before building any new terms, let’s take a quick look at the distribution of each variable with a grid of histograms:

par(mfrow=c(3,3))
hist(fnl$April20cases)
hist(fnl$week_avg)
hist(fnl$pop_estimate)
hist(fnl$median_hh_inc)
hist(fnl$age65andolder_pct)
hist(fnl$nonwhite_pct)
hist(fnl$lesscollege_pct)


Dichotomous Term

mean(fnl$median_hh_inc)
## [1] 48349.44

It looks like we could create a dichotomous term from our median household income variable. We’ll create a new variable called median_hh_inc_dichot, set to 1 if a county’s median household income is greater than $43,000 (a cutoff a bit below the overall mean of $48,349) and 0 otherwise:

# 1 if the county's median household income exceeds $43,000, 0 otherwise
fnl$median_hh_inc_dichot <- as.integer(fnl$median_hh_inc > 43000)

And here’s the split between the two factors of this new dichotomous variable:

table(fnl$median_hh_inc_dichot)
## 
##    0    1 
##  953 1690

Quadratic Term

From our initial histograms above, it looks like we can transform our nonwhite_pct variable by taking its square root to bring its distribution closer to normal. (Strictly speaking, a square root is a power of one-half rather than a squared term, but it will serve as the nonlinear “quadratic” term in this model.) Previewing the transformed variable below, the distribution now looks approximately normal:

par(mfrow=c(3,3))
hist(fnl$April20cases)
hist(fnl$week_avg)
hist(1 / log(fnl$pop_estimate))
hist(fnl$median_hh_inc)
hist(fnl$age65andolder_pct)
hist(sqrt(fnl$nonwhite_pct))
hist(fnl$lesscollege_pct)

fnl$pop_estimate <- 1 / log(fnl$pop_estimate)
fnl$nonwhite_pct <- sqrt(fnl$nonwhite_pct)

par(mfrow=c(3,3))

hist(fnl$April20cases)
hist(fnl$week_avg)
hist(fnl$pop_estimate)
hist(fnl$median_hh_inc)
hist(fnl$age65andolder_pct)
hist(fnl$nonwhite_pct)
hist(fnl$lesscollege_pct)

Additionally, as previewed above, I applied an inverse log transformation (1 / log(x)) to the population estimate term to make its distribution more normal as well.


Quadratic and Dichotomous Interaction Term

When we run our linear regression model, we’ll also include an interaction term between our transformed non-white percentage variable and our dichotomous median household income variable.
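
For reference, R’s formula syntax writes a pure interaction with a colon, while an asterisk expands to both main effects plus their interaction. A minimal sketch using just these two predictors (m1 and m2 are hypothetical names; the two calls fit identical models):

# explicit main effects plus the interaction term
m1 <- lm(April20cases ~ median_hh_inc_dichot + nonwhite_pct + median_hh_inc_dichot:nonwhite_pct, data = fnl)
# shorthand: `*` expands to the same three terms
m2 <- lm(April20cases ~ median_hh_inc_dichot * nonwhite_pct, data = fnl)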


Initial Analysis

With our variables and dataset ready to go, we can now take a quick look at the pairwise relationships between variables using the pairs() function:

fnl_pairs <- fnl %>% 
  select(April20cases, week_avg, pop_estimate, median_hh_inc, median_hh_inc_dichot, age65andolder_pct,nonwhite_pct, lesscollege_pct)
pairs(fnl_pairs)

If you look closely, there seems to be a fairly strong negative relationship between median household income and the percentage of a county’s population without a college degree, which makes sense intuitively. Other relationships are not as obvious, so we’ll see how these variables fare in our regression model.
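
To put numbers on these visual impressions, a correlation matrix over the same columns makes a quick companion check. A sketch, using complete observations since a handful of counties have missing values:

# pairwise correlations for the same set of variables
round(cor(fnl_pairs, use = "complete.obs"), 2)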


Building the Multiple Regression

With our factors ready to go, we can create a multiple linear regression model.

covid_lm <- lm(fnl$April20cases ~ fnl$week_avg + fnl$pop_estimate + fnl$median_hh_inc + fnl$median_hh_inc_dichot + fnl$age65andolder_pct + fnl$nonwhite_pct + fnl$lesscollege_pct + fnl$median_hh_inc_dichot:fnl$nonwhite_pct)

To get a more detailed look at the performance of our model, we can use the summary() function in R:

summary(covid_lm)
## 
## Call:
## lm(formula = fnl$April20cases ~ fnl$week_avg + fnl$pop_estimate + 
##     fnl$median_hh_inc + fnl$median_hh_inc_dichot + fnl$age65andolder_pct + 
##     fnl$nonwhite_pct + fnl$lesscollege_pct + fnl$median_hh_inc_dichot:fnl$nonwhite_pct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6797.7   -19.8    14.6    59.5  4860.8 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                               -2.861e+02  1.547e+02  -1.849
## fnl$week_avg                               3.079e+01  8.542e-02 360.452
## fnl$pop_estimate                           3.396e+03  8.295e+02   4.094
## fnl$median_hh_inc                         -3.765e-05  1.120e-03  -0.034
## fnl$median_hh_inc_dichot                   7.391e+01  4.244e+01   1.742
## fnl$age65andolder_pct                      5.995e-01  2.277e+00   0.263
## fnl$nonwhite_pct                          -3.333e+00  6.278e+00  -0.531
## fnl$lesscollege_pct                       -7.393e-01  1.291e+00  -0.573
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct -2.538e+01  8.695e+00  -2.919
##                                           Pr(>|t|)    
## (Intercept)                                0.06457 .  
## fnl$week_avg                               < 2e-16 ***
## fnl$pop_estimate                          4.37e-05 ***
## fnl$median_hh_inc                          0.97318    
## fnl$median_hh_inc_dichot                   0.08170 .  
## fnl$age65andolder_pct                      0.79234    
## fnl$nonwhite_pct                           0.59548    
## fnl$lesscollege_pct                        0.56700    
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct  0.00354 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 408.4 on 2615 degrees of freedom
##   (19 observations deleted due to missingness)
## Multiple R-squared:  0.9814, Adjusted R-squared:  0.9814 
## F-statistic: 1.727e+04 on 8 and 2615 DF,  p-value: < 2.2e-16

Backward Elimination

It looks like our median_hh_inc variable has the highest p-value (0.973), and others, including age65andolder_pct and lesscollege_pct, will likely be removed during backward elimination as well, since they also exhibit fairly high p-values. (nonwhite_pct also has a high p-value, but it participates in the significant interaction term, so we’d expect it to be retained.) We can double-check this by running the stepwise process:

covid_lm <- step(covid_lm, direction = 'backward', trace = FALSE)
summary(covid_lm)
## 
## Call:
## lm(formula = fnl$April20cases ~ fnl$week_avg + fnl$pop_estimate + 
##     fnl$median_hh_inc_dichot + fnl$nonwhite_pct + fnl$median_hh_inc_dichot:fnl$nonwhite_pct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6800.5   -19.7    13.7    59.7  4850.7 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                               -324.54516   81.98072  -3.959
## fnl$week_avg                                30.79764    0.08459 364.070
## fnl$pop_estimate                          3271.69836  740.05432   4.421
## fnl$median_hh_inc_dichot                    77.07769   41.10527   1.875
## fnl$nonwhite_pct                            -3.77270    5.98143  -0.631
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct  -25.23837    8.63175  -2.924
##                                           Pr(>|t|)    
## (Intercept)                               7.73e-05 ***
## fnl$week_avg                               < 2e-16 ***
## fnl$pop_estimate                          1.02e-05 ***
## fnl$median_hh_inc_dichot                   0.06089 .  
## fnl$nonwhite_pct                           0.52827    
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 408.2 on 2618 degrees of freedom
##   (19 observations deleted due to missingness)
## Multiple R-squared:  0.9814, Adjusted R-squared:  0.9814 
## F-statistic: 2.767e+04 on 5 and 2618 DF,  p-value: < 2.2e-16
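
For reference, the same elimination could be carried out by hand with update(), dropping one weak term per refit. A sketch, assuming the full model had been saved as covid_lm_full (a hypothetical name) before calling step(); step() minimizes AIC rather than p-values, but here it should arrive at the same five-term model:

# drop the weakest remaining term and refit, one pass at a time
covid_lm_manual <- update(covid_lm_full, . ~ . - fnl$median_hh_inc)
covid_lm_manual <- update(covid_lm_manual, . ~ . - fnl$age65andolder_pct)
covid_lm_manual <- update(covid_lm_manual, . ~ . - fnl$lesscollege_pct)
summary(covid_lm_manual)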

Evaluating the Model and Residual Analysis

After running the summary above, we can see a few things:

  • The median residual value is close to zero (at roughly 14 cases), which is a good sign.

  • The minimum and maximum residuals are roughly symmetric in magnitude, albeit leaning a bit more to the minimum side, at -6,801 and 4,851 respectively.

  • Our Multiple R-squared value is 0.9814, which indicates that these terms account for about 98.14% of the variability in the number of confirmed COVID-19 cases by county for April 20th, 2020. This seems to be a very strong result for our regression model.

Although this seems to be a good sign, we need to check our residuals to see if this qualifies as a suitable regression model.

When plotting residuals (below), we can see that residuals are not uniformly scattered around zero:

plot(fitted(covid_lm),resid(covid_lm))

This may be driven by a few very large outliers and the heavy right skew of the case counts. We can also see a fair number of outliers at both tails in the quantile-versus-quantile (Q-Q) plot and residual histogram below:

qqnorm(resid(covid_lm))
qqline(resid(covid_lm))

hist(resid(covid_lm))
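
For a fuller set of checks, base R’s plot() method for lm objects bundles four standard diagnostics (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage) into a single call:

par(mfrow = c(2, 2))
plot(covid_lm)  # built-in regression diagnostic plots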


Was the Linear Model Appropriate?

From this residual analysis, we can conclude that a linear model may not be appropriate here, or at least not accurate for areas with large numbers of confirmed cases (i.e., counties with large populations). It may still be reasonably effective for counties with smaller populations. At this point I would not call this linear model appropriate, but further manipulation of the initial dataset to exclude counties with very large populations may make it more effective.
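
As a rough sketch of that idea (the 1,000,000-resident cutoff is an arbitrary choice for illustration, and fnl_small and covid_lm_small are hypothetical names):

# keep counties under 1,000,000 residents; recall that pop_estimate now holds
# 1/log(population), which decreases as population grows, hence the '>'
cutoff <- 1 / log(1e6)
fnl_small <- fnl[fnl$pop_estimate > cutoff, ]
covid_lm_small <- lm(April20cases ~ week_avg + pop_estimate + median_hh_inc_dichot +
                       nonwhite_pct + median_hh_inc_dichot:nonwhite_pct, data = fnl_small)
summary(covid_lm_small)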

Predictions for April 21st, 2020

From the t-test below, the 95% confidence interval does include zero. However, the range between the lower and upper bounds isn’t very tight, indicating that predictions for confirmed cases could be drastically off for individual counties, especially those with smaller populations. This confirms that it would be beneficial to adjust the original dataset to build a more accurate model. (One caveat on the code below: because the model formula references fnl$ columns directly rather than using a data argument, predict() cannot remap the model terms onto covid_test, so these results should be read cautiously; refitting with lm(..., data = fnl) would enable proper out-of-sample prediction.)

predicted <- predict(covid_lm, covid_test)
delta <- predicted - covid_test$cases

t.test(delta, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  delta
## t = -0.16414, df = 2765, p-value = 0.8696
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -167.8447  141.9148
## sample estimates:
## mean of x 
## -12.96496
plot(delta)
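
Beyond the mean-centered t-test, a couple of simple scale metrics on delta would quantify how far off individual county predictions can be. A sketch:

# RMSE penalizes large misses; the median absolute error gives a robust
# sense of a typical miss
rmse <- sqrt(mean(delta^2, na.rm = TRUE))
med_abs_err <- median(abs(delta), na.rm = TRUE)
c(RMSE = rmse, MedianAbsError = med_abs_err)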