The prevalence and intractability of COVID-19 ‘vaccine hesitancy’ over the past 18 months have been alarming to watch, with resistance and misinformation around public health fanned by political figures, media outlets and social media, especially among vulnerable populations in the United States.
While COVID-19 vaccine hesitancy is a far more complex subject than can be addressed by this small project, there is a great deal of data available to begin exploring general relationships between vaccination rates and social science measures such as political alignment and social vulnerability.
Research Question:
Are the political alignment and CDC Social Vulnerability Index (SVI) scores of U.S. counties predictive of COVID-19 vaccination levels?
This project uses two datasets published by the Centers for Disease Control and Prevention (CDC), and one dataset published by the MIT Election Data & Science Lab.
The CDC COVID Data Tracker includes vaccination data reported by state and local Immunization Information Systems (IIS) for each U.S. county. These data reflect cumulative statistics at the county level as of November 2021.
The CDC Social Vulnerability Index (SVI) uses U.S. Census data such as socioeconomic status, household composition, housing type and transportation access to help emergency response planners meet the needs of vulnerable populations. These data reflect the most recent county-level indexes from 2020.
The MIT County Presidential Election Returns 2000-2020 dataset contains county-level returns for presidential elections from 2000 to 2020.
All data preparation steps are embedded in this workbook and include the following:
The resulting dataframe has 3098 observations, each representing a single U.S. county (or county equivalent). Regions with incomplete or obviously erroneous data were excluded from this analysis. Fields include:
The mean vaccination rate across all counties is 44.7% and the median is 44.4%, with a standard deviation of 12.24 percentage points.
The distribution of vaccination rates by county is nearly symmetrical, but with somewhat thicker tails than a normal distribution and some notable outliers on either side. Some of these outliers are easily identifiable as data errors by cross-checking with contemporary sources. For example, Chattahoochee, GA, at the top of our distribution, was never 99.9% vaccinated (Georgia has some of the lowest vaccination rates in the country), and Honolulu, HI, at the bottom of our distribution, did not have a 0.1% vaccination rate in November 2021.
However, most of these extreme values, while well outside the “1.5 IQR” range (shown below), do seem to be legitimate. I opted to single out and remove only these two obvious data errors from the most extreme ends of the distribution and leave the rest of the values in place.
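The “1.5 IQR” rule mentioned above can be sketched as follows. This is a minimal illustration on simulated data, not the workbook's own code: the data frame `demo`, its `county` and `vax_rate` columns, and the placeholder county names are all assumptions.

```r
# Simulated stand-in for the county data (NOT the real CDC figures).
# Column names `county` and `vax_rate` are assumptions for illustration.
set.seed(1)
demo <- data.frame(
  county   = paste0("county_", 1:500),
  vax_rate = c(rnorm(498, mean = 45, sd = 12), 99.9, 0.1)
)

# Tukey fences: 1.5 * IQR beyond the first and third quartiles
q     <- unname(quantile(demo$vax_rate, probs = c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
lower <- q[1] - fence
upper <- q[2] + fence

# Rows outside the fences are candidates for review, not automatic removals
flagged <- demo[demo$vax_rate < lower | demo$vax_rate > upper, ]

# Drop only the rows confirmed as data errors (placeholder names here)
known_errors <- c("county_499", "county_500")
cleaned <- demo[!demo$county %in% known_errors, ]
nrow(cleaned)
```

Flagging and removing are kept as separate steps so that legitimate extreme values stay in the data, which matches the choice described above.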
Graphing the response and explanatory variables together reveals moderate linear relationships between vaccination rates and the SVI indexes, some positive and some negative. A quick boxplot of the categorical variable party reveals similar variability (similar IQR) across its two levels. A linear model may therefore be useful for predicting vaccination rates from these variables.
Since the data include many possible explanatory variables and the response variable is numeric, I selected multiple linear regression for the model.
Response variable:
Explanatory variables:
Model Selection:
The first run of the model with all explanatory variables produced an R-squared of approximately 0.27, but several of the variables had p-values above 0.05. To improve the model I used backward elimination by p-value: drop the variable with the highest p-value, refit the model, and repeat until all remaining variables meet the cutoff of \(\alpha\) = 0.05. In this case only one step was required: when I removed svi_all and refit, all remaining explanatory variables met the cutoff.
##
## Call:
## lm(formula = vax_rate ~ party + svi_se + svi_hh + svi_ml + svi_ht,
##     data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -58.197  -5.043   0.503   6.250  45.196
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  56.7595     0.7099  79.954  < 2e-16 ***
## partyR      -10.5986     0.5358 -19.783  < 2e-16 ***
## svi_se      -19.9229     0.9698 -20.543  < 2e-16 ***
## svi_hh        1.9001     0.8424   2.256   0.0242 *
## svi_ml        4.1047     0.7231   5.676  1.5e-08 ***
## svi_ht        7.3905     0.8322   8.880  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.19 on 3092 degrees of freedom
## Multiple R-squared:  0.3083, Adjusted R-squared:  0.3072
## F-statistic: 275.6 on 5 and 3092 DF,  p-value: < 2.2e-16
Removing svi_all made intuitive sense, as it is essentially an ‘aggregate’ of the four individual SVI scores and a possible source of collinearity.
The resulting regression has an adjusted R-squared of 0.3072 and a p-value below 2.2e-16.
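The backward elimination procedure described above can be sketched roughly as follows. This is not the workbook's original code: the data are simulated, the helper `backward_eliminate()` is a hypothetical name, and the simulated effect sizes only loosely echo the fitted coefficients. It relies on base R's `drop1()` to get a p-value for each term.

```r
# Simulated county-like data (NOT the real dataset); names mirror the report's variables
set.seed(2021)
n <- 1000
demo <- data.frame(
  party  = factor(sample(c("D", "R"), n, replace = TRUE)),
  svi_se = runif(n), svi_hh = runif(n), svi_ml = runif(n), svi_ht = runif(n)
)
# Assumption: the overall score behaves like a noisy average of the four themes
demo$svi_all <- rowMeans(demo[, c("svi_se", "svi_hh", "svi_ml", "svi_ht")]) +
  rnorm(n, sd = 0.05)
demo$vax_rate <- 57 - 10 * (demo$party == "R") - 20 * demo$svi_se +
  2 * demo$svi_hh + 4 * demo$svi_ml + 7 * demo$svi_ht + rnorm(n, sd = 10)

# Hypothetical helper: drop the least significant term until all pass the cutoff
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(fit), "term.labels")) == 0) break
    # drop1() tests each term by removing it; row 1 is the "<none>" baseline
    tests <- drop1(fit, test = "F")
    pvals <- setNames(tests[["Pr(>F)"]][-1], rownames(tests)[-1])
    if (max(pvals) <= alpha) break
    worst <- names(which.max(pvals))
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full_fit  <- lm(vax_rate ~ party + svi_se + svi_hh + svi_ml + svi_ht + svi_all,
                data = demo)
final_fit <- backward_eliminate(full_fit)
formula(final_fit)
```

For single-coefficient terms such as these, the F-test from `drop1()` gives the same p-value as the t-test in `summary()`, so the stopping rule matches the one described in the text. Which terms survive on the simulated data depends on the random draw and says nothing about the real counties.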
Regression Equation:
\(\widehat{vax\_rate} = 56.7595 - (10.5986 \times partyR) - (19.9229 \times svi\_se) + (1.9001 \times svi\_hh) \\ + (4.1047 \times svi\_ml) + (7.3905 \times svi\_ht)\)
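As a worked example of the regression equation, the sketch below plugs in a hypothetical county. The input values are invented for illustration, and it assumes that the SVI theme scores are percentile ranks on a 0 to 1 scale and that partyR is 1 for a Republican-leaning county and 0 otherwise. The function name `predict_vax_rate()` is likewise hypothetical.

```r
# Coefficients copied from the regression output above
coefs <- c(
  intercept = 56.7595, partyR = -10.5986, svi_se = -19.9229,
  svi_hh = 1.9001, svi_ml = 4.1047, svi_ht = 7.3905
)

# Hypothetical helper: evaluate the regression equation for one county
predict_vax_rate <- function(party_r, svi_se, svi_hh, svi_ml, svi_ht) {
  x <- c(1, party_r, svi_se, svi_hh, svi_ml, svi_ht)  # 1 multiplies the intercept
  sum(coefs * x)
}

# Invented inputs: Republican-leaning county with all SVI themes at the median (0.5)
pred_r <- predict_vax_rate(party_r = 1, svi_se = 0.5, svi_hh = 0.5,
                           svi_ml = 0.5, svi_ht = 0.5)
# Same SVI scores, partyR = 0
pred_d <- predict_vax_rate(party_r = 0, svi_se = 0.5, svi_hh = 0.5,
                           svi_ml = 0.5, svi_ht = 0.5)
round(c(pred_r, pred_d), 2)
```

For these inputs the equation gives roughly 42.9 with partyR = 1 and roughly 53.5 with partyR = 0. The gap is the party coefficient, since every other term is held fixed.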
Multiple regression models generally depend on four conditions: normally distributed residuals, near-constant variability of residuals, independence of residuals, and a linear relationship between each explanatory variable and the response.
Normally-distributed residuals:
By examining a histogram and Q-Q plot of the residuals, we can see that the distribution is nearly normal, but (as noted) there are thick tails and several distinct outliers present. The condition is largely met, keeping in mind that two of the most extreme outliers, separately identified as data errors, have already been removed.
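A histogram and Q-Q plot of this kind can be produced with base R graphics. The sketch below fits a small model to simulated data, so the fitted object `fit` and the data frame `demo` are stand-ins rather than the model from this report.

```r
# Simulated data and model (stand-ins for the report's `df` and fitted model)
set.seed(7)
n <- 300
demo <- data.frame(
  party  = factor(sample(c("D", "R"), n, replace = TRUE)),
  svi_se = runif(n)
)
demo$vax_rate <- 57 - 10 * (demo$party == "R") - 20 * demo$svi_se +
  rnorm(n, sd = 10)
fit <- lm(vax_rate ~ party + svi_se, data = demo)
res <- residuals(fit)

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(res, breaks = 30, main = "Histogram of residuals", xlab = "Residual")

# Q-Q plot: points far from the reference line in the tails indicate thick tails
qqnorm(res)
qqline(res)
```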
Constant variability of residuals:
The residuals vs. fitted plot shows the residuals scattered randomly around zero (although there seem to be fewer points between fitted values of 50% and 55%), suggesting roughly constant variability of the residuals.
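One way to back up the visual check is to compare the spread of the residuals across bins of fitted values. The sketch below does this on simulated data. The model and data frame are stand-ins, and splitting the fitted values into quartile bins is an illustrative choice, not something taken from the workbook.

```r
# Simulated data and model (stand-ins, not the report's data)
set.seed(11)
n <- 400
demo <- data.frame(
  party  = factor(sample(c("D", "R"), n, replace = TRUE)),
  svi_se = runif(n)
)
demo$vax_rate <- 57 - 10 * (demo$party == "R") - 20 * demo$svi_se +
  rnorm(n, sd = 10)
fit <- lm(vax_rate ~ party + svi_se, data = demo)

# Residuals vs. fitted values, with a reference line at zero
plot(fitted(fit), residuals(fit),
     xlab = "Fitted vaccination rate", ylab = "Residual")
abline(h = 0, lty = 2)

# Numeric check: residual standard deviation within each quartile of fitted values
bins <- cut(fitted(fit),
            breaks = quantile(fitted(fit), probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)
spread <- tapply(residuals(fit), bins, sd)
round(spread, 2)
```

Similar standard deviations across the four bins would be consistent with constant variability, while a steady increase or decrease would suggest a fan shape.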
Independence of residuals:
Independence of the observations is especially important to establish when they are collected over time. In this case we do not have any time-series data; instead, the data are a ‘snapshot’ of total vaccination status as of November 2021. A simple plot against the original index (which is ordered by state) does show some of the ways in which the data vary by region: some of the states in question (Georgia, West Virginia, Nebraska and parts of Massachusetts among them) are known to have lower vaccination rates than average.
Keeping in mind that we do not have visibility into the collection of these data over time, I will assume the condition is met.
Linear relationship of predictors to residuals:
By plotting the residuals against each of the explanatory variables, we can spot changes in variability and understand any individual linear relationships. The histogram of residuals by party indicates very little difference in variability between the two levels. Similarly, the scatterplots of the numeric variables show fairly constant variability with fairly strong linear relationships. The condition is met.
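The per-variable residual plots can be sketched as below, again on simulated data with stand-in names. A boxplot is used for the categorical variable here as one reasonable choice, and the two numeric predictors are only a subset chosen to keep the example short.

```r
# Simulated data and model (stand-ins, not the report's data)
set.seed(3)
n <- 400
demo <- data.frame(
  party  = factor(sample(c("D", "R"), n, replace = TRUE)),
  svi_se = runif(n),
  svi_ht = runif(n)
)
demo$vax_rate <- 57 - 10 * (demo$party == "R") - 20 * demo$svi_se +
  7 * demo$svi_ht + rnorm(n, sd = 10)
fit <- lm(vax_rate ~ party + svi_se + svi_ht, data = demo)
res <- residuals(fit)

# Categorical predictor: compare the residual spread between the two levels
boxplot(res ~ demo$party, xlab = "party", ylab = "Residual")

# Numeric predictors: look for curvature or changing spread around zero
for (v in c("svi_se", "svi_ht")) {
  plot(demo[[v]], res, xlab = v, ylab = "Residual")
  abline(h = 0, lty = 2)
}
```

In these plots a patternless band around zero is the desired outcome. Curvature would point to a non-linear relationship that the model is missing.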
The following variables were found to have some predictive value in estimating the vaccination rates in U.S. counties:
“COVID Data Tracker | COVID-19 Vaccination Equity.” Centers for Disease Control and Prevention, Accessed November 10, 2021. https://covid.cdc.gov/covid-data-tracker.
“CDC/ATSDR SVI Data and Documentation Download | Place and Health | ATSDR.” Centers for Disease Control and Prevention, Accessed November 10, 2021. https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html.
“County Presidential Election Returns 2000-2020.” MIT Election Data and Science Lab, Harvard Dataverse, Accessed December 28, 2021. https://doi.org/10.7910/DVN/VOQCHQ.