DATA 605 - Discussion #13

Multiple Regression Analysis of COVID-19 Data by U.S. County

Continuing my regression analysis from last week, this time around, I’ll add in additional terms, including a quadratic term, dichotomous term, and a quadratic vs. dichotomous interaction term to build out a multiple regression.

From last week’s analysis, it appears that population estimates were a fairly good predictor of the number of confirmed cases of COVID-19 at a county level. To make this more of a predictive model, we’ll determine whether population estimates, as well as other variables can accurately predict the number of new confirmed cases for yesterday, April 21st, based on counts from previous days. If the model fits the data well, we could eventually see if we can apply this to future dates.

To do this, we’ll first have to merge in some extra data. Fortunately, I have tidy’d other census data in past semesters and can read in the data file below that includes demographic, gender, age, education, income, and employment data on a county level.

Step 1: Downloading the Data

I decided to utilize the New York Times dataset that was made open-source at the end of March. For more information about the methodology of these confirmed case counts, you can find a detailed explanation here.

Additionally, I’ll be using U.S. Census data estimates as factors for my regression model. This week, I decided to utilize population estimates as my factor for my regression. I found a large dataset on the Census website with population estimates as recent as 2019.

And finally, as mentioned above, I downloaded the census data I compiled from last semester as an additional file for this week.

I saved all three of these files to my github account and read them into R:

county_covid <- read.csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Homework13/Discussion13/us-counties_covid19_4222020.csv', header = TRUE)

county_population <- read.csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Homework12/Discussion12/county_population.csv', header = TRUE)

education_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data_606_cunysps/master/Final_Project/education_data_fip.csv', header = TRUE)

Step 2: Tidying the Data

With the data successfully loaded into R, I then had to do a bit of tidying to ensure that my county-level data was accurate and reflected the most up-to-date COVID-19 confirmed case counts.

First, I decided to tidy my COVID-19 data. Since the confirmed counts are cumulative and broken out by each day since late January, I needed to filter the dataset to only include the confirmed counts for April 14th, 2020.

covid_week <- county_covid %>% 
  arrange(fips, date) %>% 
  filter((as.Date(date) >= as.Date('2020-04-12')) & (as.Date(date) <= as.Date('2020-04-21')))

covid_test <- covid_week %>% 
  filter((as.Date(date) == as.Date('2020-04-21')))

covid_week_pivot <- covid_week %>% 
  pivot_wider(id_cols = c(date, fips, state, county), names_from = date, values_from = cases)

diff_list <- c("413_diff", "414_diff", "415_diff", "416_diff", "417_diff", "418_diff", "419_diff", "420_diff")
covid_week_pivot <- cbind(covid_week_pivot, setNames( lapply(diff_list, 
                                         function(x) x=NA), diff_list) )

for(i in 1:length(covid_week_pivot$fips)){
  for(j in 1:8){
  covid_week_pivot[i,(j+12)] <- covid_week_pivot[i, (j+4)] - covid_week_pivot[i, (j+3)]
  }
}

covid_week_fnl <- covid_week_pivot %>% 
  select(fips, state, county, `413_diff`, `414_diff`, `415_diff`, `416_diff`, `417_diff`, `418_diff`, `419_diff`, `420_diff`, `2020-04-20`)


covid_week_fnl <- mutate(covid_week_fnl, week_avg = rowMeans(covid_week_fnl[, 4:11], na.rm = TRUE))

Next, I noticed that the New York Times groups the confirmed counts for New York City into one row, instead of breaking it into the five counties that comprise of “New York City” (Queens, Kings, New York, Bronx and Richmond). Therefore, I decided to use the FIPS code associated with New York County (36061) as my identifier when I eventually merge the population data into the confirmed case data. I then adjusted the population estimate to reflect the population of all five counties instead of just New York County (seen later).

covid_week_fnl <- within(covid_week_fnl, {
    f <- county == 'New York City'
    fips[f] <- '36061'
}) 

covid_test <- within(covid_test, {
    f <- county == 'New York City'
    fips[f] <- '36061'
}) 

covid_test <- covid_test %>% 
  select(fips, state, county, cases)

For the merge, I also thought it would be helpful to make the county names consistent across both datasets.

covid_week_fnl$county <- paste(covid_week_fnl$county, 'County')

With the COVID-19 data file ready for the merge, I then turned my attention to my population data file. In order to use county FIPS codes as my identifier in both datasets, I had to generate the FIPS codes in the population file.

county_population <- county_population %>% 
  select(STATE, COUNTY, STNAME, CTYNAME, POPESTIMATE2019)

county_population$fips <- NA
for (i in 1:length(county_population$STATE)) {
  if(county_population$COUNTY[i] < 10) {
    county_population$fips[i] <- paste0(county_population$STATE[i], '00', county_population$COUNTY[i])
  }
  if(county_population$COUNTY[i] < 100 & county_population$COUNTY[i] >= 10) {
    county_population$fips[i] <- paste0(county_population$STATE[i], '0', county_population$COUNTY[i])
  }
  if(county_population$COUNTY[i] >=100) {
    county_population$fips[i] <- paste0(county_population$STATE[i], county_population$COUNTY[i])
  }
}

county_population <- county_population %>% 
  select(fips, CTYNAME, STNAME, POPESTIMATE2019)
names(county_population) <- c('fips', 'county', 'state', 'pop_estimate')

I also noticed that this file had overall state population estimates, so before merging, I made sure to filter these out.

county_population <- county_population %>% 
  filter(grepl("County",county))

With both data files ready to go, I then merged the population estimates data into the COVID-19 data file and created one final dataframe. I also updated the population count for New York City to ensure that it didn’t just account for the population in New York County, but also the four other counties in the metropolitan area.

fnl <- merge(covid_week_fnl, county_population, by='fips')

fnl <- fnl %>% 
  select(fips, county.x, state.x, `2020-04-20`, week_avg, pop_estimate)

names(fnl) <- c('fips', 'county', 'state', 'April20cases', 'week_avg', 'pop_estimate')

fnl$pop_estimate[fnl$county == 'New York City County'] <- 8398748
fnl$county[fnl$county == 'New York City County'] <- 'New York City'

fnl <- fnl %>% 
  arrange(desc(week_avg))

Here is a look at my final dataframe, ready to start my regression analysis:

kable(head(fnl, n=15L)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

fips	county	state	April20cases	week_avg	pop_estimate
36061	New York City	New York	136816	4293.1429	8398748
17031	Cook County	Illinois	22101	946.7143	5150233
36059	Nassau County	New York	30677	902.7143	1356924
36103	Suffolk County	New York	27662	859.8571	1476601
36119	Westchester County	New York	24306	645.8571	967506
6037	Los Angeles County	California	13816	628.0000	10039107
34039	Union County	New Jersey	9972	476.5714	556341
34017	Hudson County	New Jersey	11150	467.2857	672391
25017	Middlesex County	Massachusetts	9253	467.1429	1611699
34013	Essex County	New Jersey	10729	442.1429	798975
34003	Bergen County	New Jersey	13011	417.0000	932202
42101	Philadelphia County	Pennsylvania	9553	391.8571	1584064
25025	Suffolk County	Massachusetts	8314	390.7143	803907
34031	Passaic County	New Jersey	8479	361.2857	501826
34023	Middlesex County	New Jersey	8346	337.0000	825062

We’ll then merge in a few more terms from the other census dataset:

fnl <- merge(fnl, education_data, by='fips')

fnl <- fnl %>% 
  select(fips, county.x, state.x, April20cases, week_avg, pop_estimate, median_hh_inc, age65andolder_pct, nonwhite_pct, lesscollege_pct)

par(mfrow=c(3,3))
hist(fnl$April20cases)
hist(fnl$week_avg)
hist(fnl$pop_estimate)
hist(fnl$median_hh_inc)
hist(fnl$age65andolder_pct)
hist(fnl$nonwhite_pct)
hist(fnl$lesscollege_pct)

Dichotomous Term

mean(fnl$median_hh_inc)

## [1] 48349.44

It looks like we could create a dichotomous term from our median household income variable. By doing this, we’ll create a new variable called median_hh_inc_dichot, which will be a value of 1 if the median household income is greater than $48,300 and 0 if the median household income is less than this value for each county:

for (i in 1:length(fnl$fips)) {
  if(fnl$median_hh_inc[i] > 43000){
    fnl$median_hh_inc_dichot[i] <- 1
  } else {
    fnl$median_hh_inc_dichot[i] <- 0
  }
}

And here’s the split between the two factors of this new dichotomous variable:

table(fnl$median_hh_inc_dichot)

## 
##    0    1 
##  953 1690

Quadratic Term

It looks like from our initial plots of histograms above, we can transform our nonwhite_pct variable by taking the square root, to make it more of a normal distribution. By doing this, we can now see that the distribution is approaching normal:

par(mfrow=c(3,3))
hist(fnl$April20cases)
hist(fnl$week_avg)
hist(1 / log(fnl$pop_estimate))
hist(fnl$median_hh_inc)
hist(fnl$age65andolder_pct)
hist(sqrt(fnl$nonwhite_pct))
hist(fnl$lesscollege_pct)

fnl$pop_estimate <- 1 / log(fnl$pop_estimate)
fnl$nonwhite_pct <- sqrt(fnl$nonwhite_pct)

par(mfrow=c(3,3))

hist(fnl$April20cases)
hist(fnl$week_avg)
hist(fnl$pop_estimate)
hist(fnl$median_hh_inc)
hist(fnl$age65andolder_pct)
hist(fnl$nonwhite_pct)
hist(fnl$lesscollege_pct)

Additionally, I did a log transformation on the population estimate term to make it more normalized as well.

Quadratic and Dichotomous Interaction Term

When we run our linear regression model, we’ll be sure to create an interaction term between our non-white percentage variable (quadratic transformation) and our dichotomous median household income variable.

Initial Analysis

With our variables and dataset ready to go, we will now take a quick look at interactions between variables. We can do this using the pairs() function:

fnl_pairs <- fnl %>% 
  select(April20cases, week_avg, pop_estimate, median_hh_inc, median_hh_inc_dichot, age65andolder_pct,nonwhite_pct, lesscollege_pct)
pairs(fnl_pairs)

If you look closely, we can see that there seems to be a somewhat strong interaction between median household income and counties with a larger percentage of their population with less than a college degree – this make sense intuitively. Other interactions are not as obvious, so we’ll see how this fairs in our regression model.

Building the Multiple Regression

With our factors ready to go, we can create a multiple linear regression model.

covid_lm <- lm(fnl$April20cases ~ fnl$April20cases + fnl$week_avg + fnl$pop_estimate + fnl$median_hh_inc + fnl$median_hh_inc_dichot + fnl$age65andolder_pct + fnl$nonwhite_pct + fnl$lesscollege_pct + fnl$median_hh_inc_dichot:fnl$nonwhite_pct)

Above, we can see the intercept and slope of our linear regression (8.962 and 0.5914 respectively). To get a more detailed outlook of the performance of our model, we can use the summary() function in R:

summary(covid_lm)

## 
## Call:
## lm(formula = fnl$April20cases ~ fnl$April20cases + fnl$week_avg + 
##     fnl$pop_estimate + fnl$median_hh_inc + fnl$median_hh_inc_dichot + 
##     fnl$age65andolder_pct + fnl$nonwhite_pct + fnl$lesscollege_pct + 
##     fnl$median_hh_inc_dichot:fnl$nonwhite_pct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6797.7   -19.8    14.6    59.5  4860.8 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                               -2.861e+02  1.547e+02  -1.849
## fnl$week_avg                               3.079e+01  8.542e-02 360.452
## fnl$pop_estimate                           3.396e+03  8.295e+02   4.094
## fnl$median_hh_inc                         -3.765e-05  1.120e-03  -0.034
## fnl$median_hh_inc_dichot                   7.391e+01  4.244e+01   1.742
## fnl$age65andolder_pct                      5.995e-01  2.277e+00   0.263
## fnl$nonwhite_pct                          -3.333e+00  6.278e+00  -0.531
## fnl$lesscollege_pct                       -7.393e-01  1.291e+00  -0.573
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct -2.538e+01  8.695e+00  -2.919
##                                           Pr(>|t|)    
## (Intercept)                                0.06457 .  
## fnl$week_avg                               < 2e-16 ***
## fnl$pop_estimate                          4.37e-05 ***
## fnl$median_hh_inc                          0.97318    
## fnl$median_hh_inc_dichot                   0.08170 .  
## fnl$age65andolder_pct                      0.79234    
## fnl$nonwhite_pct                           0.59548    
## fnl$lesscollege_pct                        0.56700    
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct  0.00354 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 408.4 on 2615 degrees of freedom
##   (19 observations deleted due to missingness)
## Multiple R-squared:  0.9814, Adjusted R-squared:  0.9814 
## F-statistic: 1.727e+04 on 8 and 2615 DF,  p-value: < 2.2e-16

Backward Elimination

It looks like our age65andolder_pct variable has the highest p-value, and others, including median_hh_inc and lesscollege_pct will likely be removed when we do backward elimination because they also exhibit pretty high p-values. We can double check this by running the stepwise process:

covid_lm <- step(covid_lm, direction = 'backward', trace = FALSE)
summary(covid_lm)

## 
## Call:
## lm(formula = fnl$April20cases ~ fnl$week_avg + fnl$pop_estimate + 
##     fnl$median_hh_inc_dichot + fnl$nonwhite_pct + fnl$median_hh_inc_dichot:fnl$nonwhite_pct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6800.5   -19.7    13.7    59.7  4850.7 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                               -324.54516   81.98072  -3.959
## fnl$week_avg                                30.79764    0.08459 364.070
## fnl$pop_estimate                          3271.69836  740.05432   4.421
## fnl$median_hh_inc_dichot                    77.07769   41.10527   1.875
## fnl$nonwhite_pct                            -3.77270    5.98143  -0.631
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct  -25.23837    8.63175  -2.924
##                                           Pr(>|t|)    
## (Intercept)                               7.73e-05 ***
## fnl$week_avg                               < 2e-16 ***
## fnl$pop_estimate                          1.02e-05 ***
## fnl$median_hh_inc_dichot                   0.06089 .  
## fnl$nonwhite_pct                           0.52827    
## fnl$median_hh_inc_dichot:fnl$nonwhite_pct  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 408.2 on 2618 degrees of freedom
##   (19 observations deleted due to missingness)
## Multiple R-squared:  0.9814, Adjusted R-squared:  0.9814 
## F-statistic: 2.767e+04 on 5 and 2618 DF,  p-value: < 2.2e-16

Evaluating the Model and Residual Analysis

After running the summary above, we can see a few things:

The median residual value is around zero (at 14 cases), which is a good sign.
Additionally, the minimum and maximum values of the residuals are roughly around the same, albeit leaning a bit more on the minimum side of at -7192 and 4883 respectively.
Our Multiple R-squared value is 0.9825, which indicates that these terms by county account for about 98.25% of the variability in the number of confirmed COVID-19 cases by county for April 20th, 2020. This seems to be a pretty good result for our regression model.

Although this seems to be a good sign, we need to check our residuals to see if this qualifies as a suitable regression model.

When plotting residuals (below), we can see that residuals are not uniformly scattered around zero:

plot(fitted(covid_lm),resid(covid_lm))

I’m not sure if this is due to some very large outliers and skewed cases data, but we can also see that there are a fair amount of outliers towards both ends of the q-q plot. This can be visualized in a quantile-versus-quantile (Q-Q) plot (see below):

qqnorm(resid(covid_lm))
qqline(resid(covid_lm))

hist(resid(covid_lm))

Was the Linear Model Appropriate?

From this residual analysis, we can conclude that a linear model here may not be very appropriate, or accurate for areas with larger numbers of confirmed cases (i.e. counties with larger populations). However, this may be effective to predict cases for counties with smaller populations. At this point, I think I would not call this linear model appropriate, however, further manipulation of the initial dataset to exclude counties with very large populations may make this a more effective model.

Predictions for April 21st, 2020

From the t-test below, it does look like a 95% confidence interval includes zero. However, it isn’t a very tight range between the lower and upper bounds, indicating that the predictions for confirmed cases could be drastically off, especially for smaller populations. This confirms that it would be beneficial to adjust the original dataset to build out a more productive and efficient model.

predicted <- predict(covid_lm, covid_test)
delta <- predicted - covid_test$cases

t.test(delta, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  delta
## t = -0.16414, df = 2765, p-value = 0.8696
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -167.8447  141.9148
## sample estimates:
## mean of x 
## -12.96496

plot(delta)