CUNY Data 607 - Final Project

Overview

The spread and effects of COVID-19 misinformation have been widely reported as an obstacle to vaccination. Conspiracy theories and disinformation on social media - as well as some mainstream outlets - has been fanning public resistance to scientific research and legitimate public health advisories, especially among vulnerable populations in the United States.

Assuming the (hypothetical) role of a communications strategist advising NGOs or government agencies on vaccine hesitancy outreach, it would be helpful to identify targetable audiences and effective messaging.

In this analysis I will examine and attempt to quantify any fundamental relationships between county-level vaccination rates and metrics of trust/distrust regarding media, public health and government agencies.

Project Update

One major change from the original proposal is the selection of a alternative source of explanatory variables. I had been planning to examine socioeconomic datasets such as the Geography of Social Capital in America published by the U.S. Congress Joint Economic Committee, or the Social Vulnerability Index (SVI) published by the CDC.

However, both these datasets are only available as direct downloads - in order to fulfill the requirements of this project I opted for other sources of information that could be accessed via API.

Carnegie Mellon University sponsors an epidemiological research team (the Delphi Group) which publishes and maintains COVIDcast, the nation’s largest repository of real-time COVID-19 datasets, available via a public API. Among these data are national surveys of sociological indicators such as sentiment/attitudes towards media, public health and governmental officials with regard to COVID-19 at a county level.

Datasets

All data preparation steps are embedded in this workbook, and include the following steps:

Load, select, filter and aggregate relevant fields and rows from each dataset.
Format/clean data values and align naming conventions.
Join datasets at the county level using the County FIPS ID.

CDC COVID Data Tracker

The CDC COVID Data Tracker includes vaccination data reported by state and local Immunization Information Systems (IIS) per U.S. county. These data reflect cumulative national statistics at the county level as of November 2020. This dataset was retrieved manually from the CDC website and saved to the file covid19_vaccinations_equity.csv.

cdc_vax<- read.csv('data/covid19_vaccinations_equity.csv', skip=2)

County	State	SVI	County.of.Residence.Reporting.Completeness	Percent.of.total.population.fully.vaccinated	Percent.of.12..pop.fully.vaccinated	Percent.of.18..pop.fully.vaccinated	Percent.of.65..pop.fully.vaccinated
Autauga County	AL	Low-Mod	92.6	36.2	42.5	44.3	64.2
Baldwin County	AL	Low	92.6	45.1	52.3	54.7	77.9
Barbour County	AL	High	92.6	39	45	46.9	67.7
Bibb County	AL	Mod-High	92.6	31.4	36.1	37.6	58.5
Blount County	AL	Low-Mod	92.6	28.3	33.3	35.2	51.3
Bullock County	AL	High	92.6	45.4	52.9	54	75.2

Federal Information Processing System (FIPS)

A reference dataset of Federal Information Processing System (FIPS) codes at the county level. For simplicity, this dataset was retrieved manually from a public github repo, verified against the U.S. Census ANSI site and saved to the file state_and_county_fips_master.csv.

county_fips <- read.csv('data/state_and_county_fips_master.csv')

fips	name	state
0	UNITED STATES	NA
1000	ALABAMA	NA
1001	Autauga County	AL
1003	Baldwin County	AL
1005	Barbour County	AL
1007	Bibb County	AL

COVID-19 Trends and Impact Survey

The COVID-19 Trends and Impact Survey from the COVIDcast Epidata API, published by the Delphi Group at Carnegie Mellon University, includes multiple data streams covering the spread and impact of COVID-19 in the United States, and is accessed by a public API endpoint.

Special Note: While the COVIDcast project maintains an R library (covidcast) for easy retrieval, I was unable to install it correctly due to dependency issues on my local system. I was able to install the Python version, however, and in completing this project learned how to call Py code from within R using the reticulate library.

Per the Covidcast instructions I have saved these datasets locally after retrieving them via API, but the python code used for accessing them is included here for reference.

# python setup
import pandas as pd
import covidcast as cc
from datetime import date

From the Covidcast API, I selected several datasets in the Belief, experience, and information indicators category. These are Facebook-based surveys measuring respondent’s beliefs about COVID-19, their experiences of health care, their sources of information about COVID-19, and their degree of trust in different information sources. All available results for the calendar year 2021 were retrieved.

Survey Item G3: belief_masking_effective Estimated percentage of respondents who believe that wearing a face mask is either very or moderately effective for preventing the spread of COVID-19.

# Facebook Survey - belief that masking is effective
data_fb_g3 = cc.signal('fb-survey', 'smoothed_wbelief_masking_effective', 
  date(2021,6,10), date(2021,12,31), 'county')
data_fb_g3.to_csv('data/delphi/data_fb_g3.csv')

Survey Item V5abc: hesitancy_reason_distrust_gov Estimated percentage of respondents who say they are hesitant to get vaccinated because they don’t trust the government, among respondents who answered “Yes, probably”, “No, probably not”, or “No, definitely not” when asked if they would get vaccinated if offered.

# Facebook Survey - vaccine hesitancy reason: distrust gov
data_fb_v5abc = cc.signal('fb-survey', 'smoothed_whesitancy_reason_distrust_gov', 
  date(2021,2,26), date(2021,12,31), 'county') 
data_fb_v5abc.to_csv('data/delphi/data_fb_v5abc.csv')

Survey Item i3: belief_created_small_group Estimated percentage of people who believe that the statement “COVID-19 was deliberately created by a small group of people who secretly manipulate world events” is definitely or probably true.

# Facebook Survey - belief COVID created by small group
data_fb_i3 = cc.signal('fb-survey', 'smoothed_wbelief_created_small_group', 
  date(2021,5,20), date(2021,12,31), 'county') 
data_fb_i3.to_csv('data/delphi/data_fb_i3.csv')

Survey Item i4: belief_govt_exploitation Estimated percentage of people who indicate that the statement “The COVID-19 pandemic is being exploited by the government to control people” is definitely or probably true.

# Facebook Survey - belief in government exploitation
data_fb_i4 = cc.signal('fb-survey', 'smoothed_wbelief_govt_exploitation', 
  date(2021,5,20), date(2021,12,31), 'county') 
data_fb_i4.to_csv('data/delphi/data_fb_i4.csv')

Survey Item i6: trust_covid_info_experts Estimated percentage of respondents who trust scientists and other health experts to provide accurate news and information about COVID-19.

# Facebook Survey - Trust experts for covid info
data_fb_i6_xprt = cc.signal('fb-survey', 'smoothed_wtrust_covid_info_experts', 
  date(2021,5,19), date(2021,12,31), 'county') 
data_fb_i6_xprt.to_csv('data/delphi/data_fb_i6_xprt.csv')

Survey Item i6: trust_covid_info_journalists Estimated percentage of respondents who trust journalists to provide accurate news and information about COVID-19.

# Facebook Survey - Trust journalists for covid info
data_fb_i6_jrn = cc.signal('fb-survey', 'smoothed_wtrust_covid_info_journalists', 
  date(2021,5,19), date(2021,12,31), 'county') 
data_fb_i6_jrn.to_csv('data/delphi/data_fb_i6_jrn.csv')

Survey Item i6: trust_covid_info_docs Estimated percentage of respondents who trust doctors to provide accurate news and information about COVID-19.

# Facebook Survey - Trust doctors for covid info
data_fb_i6_jrn = cc.signal('fb-survey', 'smoothed_wtrust_covid_info_doctors', 
  date(2021,5,19), date(2021,12,31), 'county') 
data_fb_i6_jrn.to_csv('data/delphi/data_fb_i6_doc.csv')

Data Transformations

Unfortunately the CDC Covid Data Tracker data did not include the FIPS county codes, posing an additional challenge of matching county names as character strings, with some noted inconsistencies. This required some cleanup of both the vaccination and FIPS datasets, incorporating many alternative spellings and designations for county and country-equivalent regions:

vax <- cdc_vax %>%
  select(c(County,State,Percent.of.total.population.fully.vaccinated)) %>%
  rename(vax_rate = Percent.of.total.population.fully.vaccinated) %>%
  rename(name = County) %>%
  rename(state=State) %>%
  mutate(state = str_trim(state)) %>%
  mutate(name= tolower(name)) %>%
  mutate(vax_rate= as.numeric(vax_rate)) %>%
  mutate(name = str_replace_all(name,' and borough','')) %>%
  mutate(name = str_replace_all(name,' borough','')) %>%
  mutate(name = str_replace_all(name,'&bor','')) %>%
  mutate(name = str_replace_all(name,' census area','')) %>%
  mutate(name = str_replace_all(name,' ca','')) %>%
  mutate(name = str_replace_all(name,'rolina',' carolina')) %>%
  mutate(name = str_replace_all(name,' city','')) %>%
  mutate(name = str_replace_all(name,' cty','')) %>%
  mutate(name = str_replace_all(name,' county','')) %>%
  mutate(name = str_replace_all(name,' municipality','')) %>%
  mutate(name = str_replace_all(name,' muno','')) %>%
  mutate(name = str_replace_all(name,' muny','')) %>% 
  mutate(name = str_replace_all(name,' parish','')) %>%
  mutate(name = str_replace_all(name,'la salle','lasalle')) %>%
  mutate(name = str_replace_all(name,'do̱a','dona'))

And similar cleanup for the county_fips table:

fips <- county_fips %>%
  mutate(name= tolower(name)) %>%
  mutate(name = str_replace_all(name,' and borough','')) %>%
  mutate(name = str_replace_all(name,' borough','')) %>%
  mutate(name = str_replace_all(name,'&bor','')) %>%
  mutate(name = str_replace_all(name,' census area','')) %>%
  mutate(name = str_replace_all(name,' ca','')) %>%
  mutate(name = str_replace_all(name,'rolina',' carolina')) %>%
  mutate(name = str_replace_all(name,' city','')) %>%
  mutate(name = str_replace_all(name,' cty','')) %>%
  mutate(name = str_replace_all(name,' county','')) %>%
  mutate(name = str_replace_all(name,' municipality','')) %>%
  mutate(name = str_replace_all(name,' muno','')) %>%
  mutate(name = str_replace_all(name,' muny','')) %>% 
  mutate(name = str_replace_all(name,' parish','')) %>%
  mutate(name = str_replace_all(name,'la salle','lasalle')) %>%
  mutate(name = str_replace_all(name,'do̱a','dona'))

And then join the two for a dataframe of vaccination rates by county, with county FIPS code:

# join 
df_county_vax <- full_join(vax,fips) %>%
  filter(!is.na(fips))

name	state	vax_rate	fips
autauga	AL	36.2	1001
baldwin	AL	45.1	1003
barbour	AL	39.0	1005
bibb	AL	31.4	1007
blount	AL	28.3	1009
bullock	AL	45.4	1011

Now for the survey data. Since all surveys follow an identical format, it will be easier to union them into one dataframe:

df_files <- data.frame() # initialize
files <- list.files(path='data/delphi')

for (file_name in files){
  file_path <- paste('data/delphi/', file_name, sep='')
  df_files <- union_all(df_files, read_csv(file_path, show_col_types = FALSE))
}

And now some cleanup of the survey data, renaming columns and converting the FIPS code to integer format:

df_surveys <- df_files %>%
  select(geo_value, signal, time_value, value, stderr, sample_size) %>%
  rename(fips = geo_value) %>%
  rename(date = time_value) %>%
  mutate(fips = as.integer(fips, length=0))

fips	signal	date	value	stderr	sample_size
1000	smoothed_wbelief_masking_effective	2021-06-10	64.49579	1.733865	761.6943
1073	smoothed_wbelief_masking_effective	2021-06-10	72.52404	4.242234	110.7250
1089	smoothed_wbelief_masking_effective	2021-06-10	76.90033	3.687850	130.6133
2000	smoothed_wbelief_masking_effective	2021-06-10	69.72459	3.273444	197.0000
4000	smoothed_wbelief_masking_effective	2021-06-10	69.37468	3.064314	226.2638
4013	smoothed_wbelief_masking_effective	2021-06-10	80.38044	1.386473	820.3837

There is one abnormality in the survey data we need to account for. As part of the Delphi Group’s methodology, State-level FIPS codes (all ending in 000) were used to bucket daily county-level responses where a certain response threshold was not met. Since it is not practical to dis-aggregate these rows at the county level, we’ll discard them and keep only those rows where, on a particular day, response levels were high enough to be attributed to a specific county.

df_county_surveys <- df_surveys %>%
  filter(fips %% 1000 != 0)

For this analysis, we’ll calculate the mean response value for each county and each of the seven survey topics (aka ‘signals’), and then pivot_wider so we have a single row of these seven observations per county:

df_county_surveys <- df_county_surveys %>%
  group_by(fips,signal) %>%
  summarize(mean_value = mean(value, rm.na=TRUE)) %>%
  mutate(signal=str_replace(signal,'smoothed_w',''))

df_county_surveys_pvt <- df_county_surveys %>%
  pivot_wider(names_from=signal, values_from=mean_value)

And for the final step, we’ll join the transformed survey and vaccination datasets on the FIPS code.

df <- full_join(df_county_vax, df_county_surveys_pvt)

name	state	vax_rate	fips	belief_created_small_group	belief_govt_exploitation	belief_masking_effective	hesitancy_reason_distrust_gov	trust_covid_info_doctors	trust_covid_info_experts	trust_covid_info_journalists
autauga	AL	36.2	1001	NA	NA	NA	NA	NA	NA	NA
baldwin	AL	45.1	1003	27.73857	34.03297	63.2041	26.99626	62.26397	48.86253	6.387081
barbour	AL	39.0	1005	NA	NA	NA	NA	NA	NA	NA
bibb	AL	31.4	1007	NA	NA	NA	NA	NA	NA	NA
blount	AL	28.3	1009	NA	NA	NA	NA	NA	NA	NA
bullock	AL	45.4	1011	NA	NA	NA	NA	NA	NA	NA
butler	AL	35.6	1013	NA	NA	NA	NA	NA	NA	NA
calhoun	AL	42.0	1015	NA	NA	NA	NA	NA	NA	NA
chambers	AL	28.3	1017	NA	NA	NA	NA	NA	NA	NA
cherokee	AL	29.1	1019	NA	NA	NA	NA	NA	NA	NA

# some useful references
signals <- unique(df_county_surveys$signal)
state_fips <- fips %>% filter(fips %% 1000 == 0)

Exporatory Data Analysis

The dataset consists of 3207 rows representing individual U.S. counties, with 3138 mean vaccination rates, but a much sparser distribution of county-level survey results:

kbl(sapply(df, function(x) sum(!is.na(x)))) %>%
  kable_styling()

	x
name	3207
state	3155
vax_rate	3138
fips	3207
belief_created_small_group	311
belief_govt_exploitation	310
belief_masking_effective	323
hesitancy_reason_distrust_gov	193
trust_covid_info_doctors	298
trust_covid_info_experts	297
trust_covid_info_journalists	295

The mean vaccination rate for all counties is 44.8 and the median is 44.5, with a standard deviation of 12.36.

The distribution of vaccination rates by county are nearly symmetrical but with somewhat thicker tails than a normal distribution, and some notable outliers on either side. Some of these outliers are easily identifiable as data errors by cross-checking with contemporary sources (for example, Chattahoochee, GA at the top of our distribution was never 99.9% vaccinated - Georgia has some the lowest vax rates in the country. And Honolulu, HI at the bottom of our distribution did not have an 0.1% vax rate in November 2020.)

ggplot(df, aes(x=vax_rate)) +
  geom_histogram(bins=150, alpha=0.5) +
  geom_vline(xintercept=median(df$vax_rate, na.rm=TRUE), size=2, color='blue', alpha=0.2) +
  geom_vline(xintercept=median(df$vax_rate, na.rm=TRUE)-(IQR(df$vax_rate, na.rm=TRUE) * 1.5), size=2, color='red', alpha=0.2) +
  geom_vline(xintercept=median(df$vax_rate, na.rm=TRUE)+(IQR(df$vax_rate, na.rm=TRUE) * 1.5), size=2, color='red', alpha=0.2) +
  labs(title="County Vaccination Rates", x = 'vax_rate')

However most of these extreme values, while well outside the “1.5 IQR” range (red lines in the graph), seem legitimate. I opted to cherry-pick and remove these two obvious data errors from the most extreme ends of the distribution and leave the rest of the values in place.

df <- df %>%
  filter(vax_rate > 0.1) %>%
  filter(vax_rate < 99.9)

Graphing the response and explanatory variables together reveals strong linear relationships between vaccination rates and our signals - some positive and some negative.

Where trust signals are strong (for example, trust in doctors, journalists, experts), there is a positive relationship with vaccination rates. Where distrust signals are strong (for example, belief in conspiracy theories or distrust in government), there is a negative relationship with vaccination rates.

for (s in signals){
  print(
    ggplot(df, aes_string(x = s, y='vax_rate')) + 
    geom_point(alpha=0.2) +
    geom_smooth(method='lm') + 
    labs(x = s, y = 'vax_rate')
  )
}

Linear Models

Given the strong linear relationships, linear models may be useful to predict vaccination rates based on these variables.

Running simple linear regressions of vaccination rates against each individual signal provides some insight into the corresponding scatterplots above.

For example, here is the summary of a linear regression of vaccination rates to the signal belief_created_small_group. The coefficent is -0.89, illustrating a strongly negative relationship to vaccination rates, and at a high level of statistical significance.

lm_vax_1 <- lm(vax_rate ~ belief_created_small_group, data=df)
summary(lm_vax_1) # .2269

## 
## Call:
## lm(formula = vax_rate ~ belief_created_small_group, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.149  -4.221  -0.066   4.069  26.225 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                73.78991    1.72075  42.882   <2e-16 ***
## belief_created_small_group -0.88640    0.09287  -9.545   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.479 on 306 degrees of freedom
##   (2828 observations deleted due to missingness)
## Multiple R-squared:  0.2294, Adjusted R-squared:  0.2269 
## F-statistic:  91.1 on 1 and 306 DF,  p-value: < 2.2e-16

This signal alone, however, only explains about 23% of the variance in the model as indicated by the Adjusted R-squared. We can run a multivariate linear regression instead with all seven of our signals to see if we get a better result:

lm_vax_2 <- lm(vax_rate ~ belief_created_small_group + 
                 belief_govt_exploitation + 
                 belief_masking_effective + 
                 hesitancy_reason_distrust_gov + 
                 trust_covid_info_doctors + 
                 trust_covid_info_experts + 
                 trust_covid_info_journalists, data=df)

summary(lm_vax_2) # .4479

## 
## Call:
## lm(formula = vax_rate ~ belief_created_small_group + belief_govt_exploitation + 
##     belief_masking_effective + hesitancy_reason_distrust_gov + 
##     trust_covid_info_doctors + trust_covid_info_experts + trust_covid_info_journalists, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.862  -2.449   0.329   4.016  13.983 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -71.1366    44.8933  -1.585   0.1148    
## belief_created_small_group      1.4633     0.3359   4.356 2.19e-05 ***
## belief_govt_exploitation       -0.1517     0.5191  -0.292   0.7704    
## belief_masking_effective        0.1736     0.2758   0.629   0.5299    
## hesitancy_reason_distrust_gov  -0.0970     0.1045  -0.928   0.3547    
## trust_covid_info_doctors        0.6579     0.3041   2.163   0.0318 *  
## trust_covid_info_experts        0.7133     0.3121   2.285   0.0234 *  
## trust_covid_info_journalists    0.4326     0.3732   1.159   0.2479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.333 on 184 degrees of freedom
##   (2944 observations deleted due to missingness)
## Multiple R-squared:  0.4681, Adjusted R-squared:  0.4479 
## F-statistic: 23.13 on 7 and 184 DF,  p-value: < 2.2e-16

Running the regression based on all seven signals produces a much higher Adjusted R-Squared of 45% with p-value of 2.2e-16, although several of the individual explanatory variables now have low levels of statistical significance. However, I found that backing out individual signals with higher p-values and re-fitting the model did not improve the Adjusted R-Squared.

Conclusions

While our explanatory variables only represent seven out of many available survey datasets from the COVIDcast project, we can reasonably conclude that they hold predictive value in estimating county-level vaccination rates in the United States, inferring a relationship between surveyed “trust metrics” and overall vaccine hesitancy in the population.

Of course the larger behavioral picture is much more complicated than a simple model can capture. In actual practice we would likely include many more survey signals, and work to identify and eliminate any sources of bias (for example, selection bias of social media-based surveys.) Any hypotheses formed from future analysis would need to be tested under experimental conditions - but, the model is already useful in proving that relationships exist between these variables, suggesting that they could be used to target and craft outreach.

References

“COVID Data Tracker | COVID-19 Vaccination Equity” Centers for Disease Control and Prevention, Accessed November 10, 2021. https://covid.cdc.gov/covid-data-tracker.

“County and State FIPS Codes” Github, Accessed November 10, 2021. https://github.com/kjhealy/fips-codes.

“COVID-19 Trends and Impact Survey” CMU Delphi Research Group, Accessed Jan 15, 2021 https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html

Appendix

Model Diagnostics

Multivariate regression models generally depend on meeting the following four conditions: normally distributed residuals, near-constant variability of residuals, independence of residuals, and linear relationships of the explanatory variables to the fitted values.

Normally-distributed residuals:

By examining a histogram and Q-Q plot of the residuals, we can see that the distribution is nearly normal, but (as noted) there are thick tails and several distinct outliers present. The condition is largely met, keeping in mind that two of the most extreme outliers, separately identified as data errors, have already been removed.

ggplot(lm_vax_2, aes(x=lm_vax_2$residuals)) +
  geom_histogram(bins=100, alpha=0.5) +
  labs(title="Residuals Distribution",
       x = 'residuals')

ggplot(lm_vax_2, aes(sample=lm_vax_2$residuals)) +
  geom_qq(alpha=0.5) +
  geom_qq_line(colour='blue') +
  labs(title="Residuals Q-Q")

Constant variability of residuals:

The Residuals x Fitted Plots show a random distribution of residuals around the fitted line, demonstrating the constant variability of the residuals.

lm_vax_2 %>%
  ggplot(aes(y=lm_vax_2$residuals, x=lm_vax_2$fitted.values)) + 
  geom_point(alpha=0.2) +
  geom_hline(yintercept=0, colour='blue') +
  labs(title="Residuals x Fitted",
       y = 'residuals',
       x = 'fitted vax_rate')

lm_vax_2 %>%
  ggplot(aes(y=abs(lm_vax_2$residuals), x=lm_vax_2$fitted.values)) + 
  geom_point(alpha=0.2) +
  geom_hline(yintercept=0, colour='blue') +
  labs(title="Abs Residuals x Fitted",
       y = 'abs(residuals)',
       x = 'fitted vax_rate')

Independence of observations

Independence of the observations is important to establish when they are collected over time. I will assume the condition is met.

vax %>% ggplot(aes(x=name,y=vax_rate)) + 
  geom_point() +
  theme(legend.position = 'none')

df_surveys %>% ggplot(aes(x=date,y=value)) +
  geom_point() +
  facet_grid(~ signal) +
  theme(legend.position = 'none')

Linear relationship of predictors to residuals

By plotting the residuals against each of the explanatory variables, we can spot changes in variability and understand any individual linear relationships. The histogram of residuals and party indicate very little difference in variability between the two factors. Similarly, the scatterplots of the numeric variables demonstrate a fairly constant state of variability with fairly strong linear relationships. The condition is met.

for (s in signals){
  i <- paste0('lm_vax_2$model$',s)
  print(
    ggplot(lm_vax_2, aes_string(x=i, y='lm_vax_2$residuals')) + 
      geom_point(alpha=0.2) +
      geom_smooth(method='lm') + 
      labs(x = s, y = 'residuals')
  )
}