CUNY Data 607 - Final Project

Overview

The spread and effects of COVID-19 misinformation have been widely reported as an obstacle to vaccination. Conspiracy theories and disinformation on social media - as well as some mainstream outlets - has been fanning public resistance to scientific research and legitimate public health advisories, especially among vulnerable populations in the United States.

Assuming the (hypothetical) role of a communications strategist advising NGOs or government agencies on vaccine hesitancy outreach, it would be helpful to identify targetable audiences and effective messaging.

In this analysis I will examine and attempt to quantify any fundamental relationships between county-level vaccination rates and metrics of trust/distrust regarding media, public health and government agencies.


Project Update

One major change from the original proposal is the selection of a alternative source of explanatory variables. I had been planning to examine socioeconomic datasets such as the Geography of Social Capital in America published by the U.S. Congress Joint Economic Committee, or the Social Vulnerability Index (SVI) published by the CDC.

However, both these datasets are only available as direct downloads - in order to fulfill the requirements of this project I opted for other sources of information that could be accessed via API.

Carnegie Mellon University sponsors an epidemiological research team (the Delphi Group) which publishes and maintains COVIDcast, the nation’s largest repository of real-time COVID-19 datasets, available via a public API. Among these data are national surveys of sociological indicators such as sentiment/attitudes towards media, public health and governmental officials with regard to COVID-19 at a county level.


Datasets

All data preparation steps are embedded in this workbook, and include the following steps:

  • Load, select, filter and aggregate relevant fields and rows from each dataset.
  • Format/clean data values and align naming conventions.
  • Join datasets at the county level using the County FIPS ID.

CDC COVID Data Tracker

The CDC COVID Data Tracker includes vaccination data reported by state and local Immunization Information Systems (IIS) per U.S. county. These data reflect cumulative national statistics at the county level as of November 2020. This dataset was retrieved manually from the CDC website and saved to the file covid19_vaccinations_equity.csv.

cdc_vax<- read.csv('data/covid19_vaccinations_equity.csv', skip=2)
County State SVI County.of.Residence.Reporting.Completeness Percent.of.total.population.fully.vaccinated Percent.of.12..pop.fully.vaccinated Percent.of.18..pop.fully.vaccinated Percent.of.65..pop.fully.vaccinated
Autauga County AL Low-Mod 92.6 36.2 42.5 44.3 64.2
Baldwin County AL Low 92.6 45.1 52.3 54.7 77.9
Barbour County AL High 92.6 39 45 46.9 67.7
Bibb County AL Mod-High 92.6 31.4 36.1 37.6 58.5
Blount County AL Low-Mod 92.6 28.3 33.3 35.2 51.3
Bullock County AL High 92.6 45.4 52.9 54 75.2

Federal Information Processing System (FIPS)

A reference dataset of Federal Information Processing System (FIPS) codes at the county level. For simplicity, this dataset was retrieved manually from a public github repo, verified against the U.S. Census ANSI site and saved to the file state_and_county_fips_master.csv.

county_fips <- read.csv('data/state_and_county_fips_master.csv')
fips name state
0 UNITED STATES NA
1000 ALABAMA NA
1001 Autauga County AL
1003 Baldwin County AL
1005 Barbour County AL
1007 Bibb County AL

Data Transformations

Unfortunately the CDC Covid Data Tracker data did not include the FIPS county codes, posing an additional challenge of matching county names as character strings, with some noted inconsistencies. This required some cleanup of both the vaccination and FIPS datasets, incorporating many alternative spellings and designations for county and country-equivalent regions:

vax <- cdc_vax %>%
  select(c(County,State,Percent.of.total.population.fully.vaccinated)) %>%
  rename(vax_rate = Percent.of.total.population.fully.vaccinated) %>%
  rename(name = County) %>%
  rename(state=State) %>%
  mutate(state = str_trim(state)) %>%
  mutate(name= tolower(name)) %>%
  mutate(vax_rate= as.numeric(vax_rate)) %>%
  mutate(name = str_replace_all(name,' and borough','')) %>%
  mutate(name = str_replace_all(name,' borough','')) %>%
  mutate(name = str_replace_all(name,'&bor','')) %>%
  mutate(name = str_replace_all(name,' census area','')) %>%
  mutate(name = str_replace_all(name,' ca','')) %>%
  mutate(name = str_replace_all(name,'rolina',' carolina')) %>%
  mutate(name = str_replace_all(name,' city','')) %>%
  mutate(name = str_replace_all(name,' cty','')) %>%
  mutate(name = str_replace_all(name,' county','')) %>%
  mutate(name = str_replace_all(name,' municipality','')) %>%
  mutate(name = str_replace_all(name,' muno','')) %>%
  mutate(name = str_replace_all(name,' muny','')) %>% 
  mutate(name = str_replace_all(name,' parish','')) %>%
  mutate(name = str_replace_all(name,'la salle','lasalle')) %>%
  mutate(name = str_replace_all(name,'do̱a','dona'))

And similar cleanup for the county_fips table:

fips <- county_fips %>%
  mutate(name= tolower(name)) %>%
  mutate(name = str_replace_all(name,' and borough','')) %>%
  mutate(name = str_replace_all(name,' borough','')) %>%
  mutate(name = str_replace_all(name,'&bor','')) %>%
  mutate(name = str_replace_all(name,' census area','')) %>%
  mutate(name = str_replace_all(name,' ca','')) %>%
  mutate(name = str_replace_all(name,'rolina',' carolina')) %>%
  mutate(name = str_replace_all(name,' city','')) %>%
  mutate(name = str_replace_all(name,' cty','')) %>%
  mutate(name = str_replace_all(name,' county','')) %>%
  mutate(name = str_replace_all(name,' municipality','')) %>%
  mutate(name = str_replace_all(name,' muno','')) %>%
  mutate(name = str_replace_all(name,' muny','')) %>% 
  mutate(name = str_replace_all(name,' parish','')) %>%
  mutate(name = str_replace_all(name,'la salle','lasalle')) %>%
  mutate(name = str_replace_all(name,'do̱a','dona'))

And then join the two for a dataframe of vaccination rates by county, with county FIPS code:

# join 
df_county_vax <- full_join(vax,fips) %>%
  filter(!is.na(fips))
name state vax_rate fips
autauga AL 36.2 1001
baldwin AL 45.1 1003
barbour AL 39.0 1005
bibb AL 31.4 1007
blount AL 28.3 1009
bullock AL 45.4 1011

Now for the survey data. Since all surveys follow an identical format, it will be easier to union them into one dataframe:

df_files <- data.frame() # initialize
files <- list.files(path='data/delphi')

for (file_name in files){
  file_path <- paste('data/delphi/', file_name, sep='')
  df_files <- union_all(df_files, read_csv(file_path, show_col_types = FALSE))
}

And now some cleanup of the survey data, renaming columns and converting the FIPS code to integer format:

df_surveys <- df_files %>%
  select(geo_value, signal, time_value, value, stderr, sample_size) %>%
  rename(fips = geo_value) %>%
  rename(date = time_value) %>%
  mutate(fips = as.integer(fips, length=0))
fips signal date value stderr sample_size
1000 smoothed_wbelief_masking_effective 2021-06-10 64.49579 1.733865 761.6943
1073 smoothed_wbelief_masking_effective 2021-06-10 72.52404 4.242234 110.7250
1089 smoothed_wbelief_masking_effective 2021-06-10 76.90033 3.687850 130.6133
2000 smoothed_wbelief_masking_effective 2021-06-10 69.72459 3.273444 197.0000
4000 smoothed_wbelief_masking_effective 2021-06-10 69.37468 3.064314 226.2638
4013 smoothed_wbelief_masking_effective 2021-06-10 80.38044 1.386473 820.3837

There is one abnormality in the survey data we need to account for. As part of the Delphi Group’s methodology, State-level FIPS codes (all ending in 000) were used to bucket daily county-level responses where a certain response threshold was not met. Since it is not practical to dis-aggregate these rows at the county level, we’ll discard them and keep only those rows where, on a particular day, response levels were high enough to be attributed to a specific county.

df_county_surveys <- df_surveys %>%
  filter(fips %% 1000 != 0)

For this analysis, we’ll calculate the mean response value for each county and each of the seven survey topics (aka ‘signals’), and then pivot_wider so we have a single row of these seven observations per county:

df_county_surveys <- df_county_surveys %>%
  group_by(fips,signal) %>%
  summarize(mean_value = mean(value, rm.na=TRUE)) %>%
  mutate(signal=str_replace(signal,'smoothed_w',''))

df_county_surveys_pvt <- df_county_surveys %>%
  pivot_wider(names_from=signal, values_from=mean_value)

And for the final step, we’ll join the transformed survey and vaccination datasets on the FIPS code.

df <- full_join(df_county_vax, df_county_surveys_pvt)
name state vax_rate fips belief_created_small_group belief_govt_exploitation belief_masking_effective hesitancy_reason_distrust_gov trust_covid_info_doctors trust_covid_info_experts trust_covid_info_journalists
autauga AL 36.2 1001 NA NA NA NA NA NA NA
baldwin AL 45.1 1003 27.73857 34.03297 63.2041 26.99626 62.26397 48.86253 6.387081
barbour AL 39.0 1005 NA NA NA NA NA NA NA
bibb AL 31.4 1007 NA NA NA NA NA NA NA
blount AL 28.3 1009 NA NA NA NA NA NA NA
bullock AL 45.4 1011 NA NA NA NA NA NA NA
butler AL 35.6 1013 NA NA NA NA NA NA NA
calhoun AL 42.0 1015 NA NA NA NA NA NA NA
chambers AL 28.3 1017 NA NA NA NA NA NA NA
cherokee AL 29.1 1019 NA NA NA NA NA NA NA
# some useful references
signals <- unique(df_county_surveys$signal)
state_fips <- fips %>% filter(fips %% 1000 == 0)

Exporatory Data Analysis

The dataset consists of 3207 rows representing individual U.S. counties, with 3138 mean vaccination rates, but a much sparser distribution of county-level survey results:

kbl(sapply(df, function(x) sum(!is.na(x)))) %>%
  kable_styling()
x
name 3207
state 3155
vax_rate 3138
fips 3207
belief_created_small_group 311
belief_govt_exploitation 310
belief_masking_effective 323
hesitancy_reason_distrust_gov 193
trust_covid_info_doctors 298
trust_covid_info_experts 297
trust_covid_info_journalists 295

The mean vaccination rate for all counties is 44.8 and the median is 44.5, with a standard deviation of 12.36.

The distribution of vaccination rates by county are nearly symmetrical but with somewhat thicker tails than a normal distribution, and some notable outliers on either side. Some of these outliers are easily identifiable as data errors by cross-checking with contemporary sources (for example, Chattahoochee, GA at the top of our distribution was never 99.9% vaccinated - Georgia has some the lowest vax rates in the country. And Honolulu, HI at the bottom of our distribution did not have an 0.1% vax rate in November 2020.)

ggplot(df, aes(x=vax_rate)) +
  geom_histogram(bins=150, alpha=0.5) +
  geom_vline(xintercept=median(df$vax_rate, na.rm=TRUE), size=2, color='blue', alpha=0.2) +
  geom_vline(xintercept=median(df$vax_rate, na.rm=TRUE)-(IQR(df$vax_rate, na.rm=TRUE) * 1.5), size=2, color='red', alpha=0.2) +
  geom_vline(xintercept=median(df$vax_rate, na.rm=TRUE)+(IQR(df$vax_rate, na.rm=TRUE) * 1.5), size=2, color='red', alpha=0.2) +
  labs(title="County Vaccination Rates", x = 'vax_rate')

However most of these extreme values, while well outside the “1.5 IQR” range (red lines in the graph), seem legitimate. I opted to cherry-pick and remove these two obvious data errors from the most extreme ends of the distribution and leave the rest of the values in place.

df <- df %>%
  filter(vax_rate > 0.1) %>%
  filter(vax_rate < 99.9)

Graphing the response and explanatory variables together reveals strong linear relationships between vaccination rates and our signals - some positive and some negative.

Where trust signals are strong (for example, trust in doctors, journalists, experts), there is a positive relationship with vaccination rates. Where distrust signals are strong (for example, belief in conspiracy theories or distrust in government), there is a negative relationship with vaccination rates.

for (s in signals){
  print(
    ggplot(df, aes_string(x = s, y='vax_rate')) + 
    geom_point(alpha=0.2) +
    geom_smooth(method='lm') + 
    labs(x = s, y = 'vax_rate')
  )
}


Linear Models

Given the strong linear relationships, linear models may be useful to predict vaccination rates based on these variables.

Running simple linear regressions of vaccination rates against each individual signal provides some insight into the corresponding scatterplots above.

For example, here is the summary of a linear regression of vaccination rates to the signal belief_created_small_group. The coefficent is -0.89, illustrating a strongly negative relationship to vaccination rates, and at a high level of statistical significance.

lm_vax_1 <- lm(vax_rate ~ belief_created_small_group, data=df)
summary(lm_vax_1) # .2269
## 
## Call:
## lm(formula = vax_rate ~ belief_created_small_group, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.149  -4.221  -0.066   4.069  26.225 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                73.78991    1.72075  42.882   <2e-16 ***
## belief_created_small_group -0.88640    0.09287  -9.545   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.479 on 306 degrees of freedom
##   (2828 observations deleted due to missingness)
## Multiple R-squared:  0.2294, Adjusted R-squared:  0.2269 
## F-statistic:  91.1 on 1 and 306 DF,  p-value: < 2.2e-16

This signal alone, however, only explains about 23% of the variance in the model as indicated by the Adjusted R-squared. We can run a multivariate linear regression instead with all seven of our signals to see if we get a better result:

lm_vax_2 <- lm(vax_rate ~ belief_created_small_group + 
                 belief_govt_exploitation + 
                 belief_masking_effective + 
                 hesitancy_reason_distrust_gov + 
                 trust_covid_info_doctors + 
                 trust_covid_info_experts + 
                 trust_covid_info_journalists, data=df)

summary(lm_vax_2) # .4479
## 
## Call:
## lm(formula = vax_rate ~ belief_created_small_group + belief_govt_exploitation + 
##     belief_masking_effective + hesitancy_reason_distrust_gov + 
##     trust_covid_info_doctors + trust_covid_info_experts + trust_covid_info_journalists, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.862  -2.449   0.329   4.016  13.983 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -71.1366    44.8933  -1.585   0.1148    
## belief_created_small_group      1.4633     0.3359   4.356 2.19e-05 ***
## belief_govt_exploitation       -0.1517     0.5191  -0.292   0.7704    
## belief_masking_effective        0.1736     0.2758   0.629   0.5299    
## hesitancy_reason_distrust_gov  -0.0970     0.1045  -0.928   0.3547    
## trust_covid_info_doctors        0.6579     0.3041   2.163   0.0318 *  
## trust_covid_info_experts        0.7133     0.3121   2.285   0.0234 *  
## trust_covid_info_journalists    0.4326     0.3732   1.159   0.2479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.333 on 184 degrees of freedom
##   (2944 observations deleted due to missingness)
## Multiple R-squared:  0.4681, Adjusted R-squared:  0.4479 
## F-statistic: 23.13 on 7 and 184 DF,  p-value: < 2.2e-16

Running the regression based on all seven signals produces a much higher Adjusted R-Squared of 45% with p-value of 2.2e-16, although several of the individual explanatory variables now have low levels of statistical significance. However, I found that backing out individual signals with higher p-values and re-fitting the model did not improve the Adjusted R-Squared.


Conclusions

While our explanatory variables only represent seven out of many available survey datasets from the COVIDcast project, we can reasonably conclude that they hold predictive value in estimating county-level vaccination rates in the United States, inferring a relationship between surveyed “trust metrics” and overall vaccine hesitancy in the population.

Of course the larger behavioral picture is much more complicated than a simple model can capture. In actual practice we would likely include many more survey signals, and work to identify and eliminate any sources of bias (for example, selection bias of social media-based surveys.) Any hypotheses formed from future analysis would need to be tested under experimental conditions - but, the model is already useful in proving that relationships exist between these variables, suggesting that they could be used to target and craft outreach.


References

“COVID Data Tracker | COVID-19 Vaccination Equity” Centers for Disease Control and Prevention, Accessed November 10, 2021. https://covid.cdc.gov/covid-data-tracker.

“County and State FIPS Codes” Github, Accessed November 10, 2021. https://github.com/kjhealy/fips-codes.

“COVID-19 Trends and Impact Survey” CMU Delphi Research Group, Accessed Jan 15, 2021 https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html


Appendix

Model Diagnostics

Multivariate regression models generally depend on meeting the following four conditions: normally distributed residuals, near-constant variability of residuals, independence of residuals, and linear relationships of the explanatory variables to the fitted values.

Normally-distributed residuals:

By examining a histogram and Q-Q plot of the residuals, we can see that the distribution is nearly normal, but (as noted) there are thick tails and several distinct outliers present. The condition is largely met, keeping in mind that two of the most extreme outliers, separately identified as data errors, have already been removed.

ggplot(lm_vax_2, aes(x=lm_vax_2$residuals)) +
  geom_histogram(bins=100, alpha=0.5) +
  labs(title="Residuals Distribution",
       x = 'residuals')

ggplot(lm_vax_2, aes(sample=lm_vax_2$residuals)) +
  geom_qq(alpha=0.5) +
  geom_qq_line(colour='blue') +
  labs(title="Residuals Q-Q")

Constant variability of residuals:

The Residuals x Fitted Plots show a random distribution of residuals around the fitted line, demonstrating the constant variability of the residuals.

lm_vax_2 %>%
  ggplot(aes(y=lm_vax_2$residuals, x=lm_vax_2$fitted.values)) + 
  geom_point(alpha=0.2) +
  geom_hline(yintercept=0, colour='blue') +
  labs(title="Residuals x Fitted",
       y = 'residuals',
       x = 'fitted vax_rate')

lm_vax_2 %>%
  ggplot(aes(y=abs(lm_vax_2$residuals), x=lm_vax_2$fitted.values)) + 
  geom_point(alpha=0.2) +
  geom_hline(yintercept=0, colour='blue') +
  labs(title="Abs Residuals x Fitted",
       y = 'abs(residuals)',
       x = 'fitted vax_rate')

Independence of observations

Independence of the observations is important to establish when they are collected over time. I will assume the condition is met.

vax %>% ggplot(aes(x=name,y=vax_rate)) + 
  geom_point() +
  theme(legend.position = 'none')

df_surveys %>% ggplot(aes(x=date,y=value)) +
  geom_point() +
  facet_grid(~ signal) +
  theme(legend.position = 'none')

Linear relationship of predictors to residuals

By plotting the residuals against each of the explanatory variables, we can spot changes in variability and understand any individual linear relationships. The histogram of residuals and party indicate very little difference in variability between the two factors. Similarly, the scatterplots of the numeric variables demonstrate a fairly constant state of variability with fairly strong linear relationships. The condition is met.

for (s in signals){
  i <- paste0('lm_vax_2$model$',s)
  print(
    ggplot(lm_vax_2, aes_string(x=i, y='lm_vax_2$residuals')) + 
      geom_point(alpha=0.2) +
      geom_smooth(method='lm') + 
      labs(x = s, y = 'residuals')
  )
}