This is another one of my notebooks intended to show you how easy it is for you to get data and perform your own data analysis. Everything I do will be completely transparent. I give you the sources. I give you the code. I’ve commented the code so you can see what I’ve done. You can download the data and apply exactly this treatment and get the results I get. I’ve also been clear and open about assumptions I’ve made and where the analysis has issues.

Getting the Data

I pulled all data from the CDC’s http://data.cdc.gov site. Specifically, I will use the following three data sets. Navigate to that page, select the “Export” button, then choose the “CSV” option.

Fair warning, the vaccination data is by county across the whole US, so it is big. It will take a while to download, and a while to read and aggregate.

Current US COVID Cases

This looks like a lot of steps, but it’s really pretty straightforward. All I am doing is:

  1. Getting the dataset with all the cases and deaths in the US, then adding all new cases and new deaths pasted in the most recent dataset over the most recent 2 weeks.
  2. Getting the census data in order to have the population of every state.
  3. Getting the data of the percent of fully vaccinated per county and averaging.
  4. Joining all three datasets and producing population adjusted columns for cases and deaths (per 100K).
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(RColorBrewer))

# Build a Date Converter from a string formatted "mm/dd/YYYY" to a proper date datatype in R
setAs("character", "rpwDateConvert", function(from) as.Date(from, format="%m/%d/%Y") )


## 1. Get and format the case and death numbers for each ssate
USCovidCases <- read.csv('United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv',
                          colClasses=c("submission_date"="rpwDateConvert"),  # Convert to a date format
                          header=T) # Read the first line and use it as a header

maxDateInDF <- max(USCovidCases$submission_date) - 14 # Last 14 days of data CHANGE THIS IF YOU WANT
USCovidCases <- filter(USCovidCases, 
                       submission_date >= maxDateInDF, # Only take the latest results
                       state %in%state.abb) # Only consider the actual states
# Total the cases over all dated entries in the data set
AggreggatedUSCases <- summarise(group_by(USCovidCases, state), 
                                TotalNewCases=sum(new_case),
                                TotalNewDeaths=sum(new_death))


## 2. Get and Aggregate US Census numbers so we know population size for each state
filename <- "US_Census_Annual_Estimates_of_the_Resident_Population_for_Selected_Age_Groups_by_Sex_for_the_United_States.csv"
USCensusResults <- subset(mutate(filter(read.csv(filename, header=T),
                                        Age == "Total",                # Only include rows with state totals across age
                                        Gender == "Total",             # Only include rows with state totals across sex
                                        LocationAbbr %in% state.abb),  # Only include the states
                                 state = LocationAbbr,           # Rename the state column
                                 TotalPopulation = Data_Value),  # Give the population total a better name
                          select=c("state","TotalPopulation")) # Just take the two cols we care about


## 3. Get and average the vaccination rates for each state
RawCovidVaccinationData <- read.csv('COVID-19_Vaccinations_in_the_United_States_County.csv', 
                                    colClasses=c("Date"="rpwDateConvert"),  # Convert to a date format
                                    header=T)
maxDateInDF = max(RawCovidVaccinationData$Date)
FilteredCovidVaccinationData <- filter(RawCovidVaccinationData, 
                                       Date >= maxDateInDF,              # Only use the most recent data
                                       Recip_County != "Unknown County", # Exclude 'unknown county'
                                       Recip_State %in% state.abb)       # Make sure just the states
AvgCovidVaccinationData <- summarise(group_by(FilteredCovidVaccinationData, Recip_State),
                                     AverageVaccinationPct=mean(Series_Complete_Pop_Pct))  # Average all counties
CovidVaccinationData <- rename(filter(AvgCovidVaccinationData, 
                                      Recip_State %in% state.abb), # Make sure just the states
                               state = Recip_State)  # Rename the state column


# 4. Mortality and case rates are reported per 100K people, so adjust the data to report this way
AdjustedCovidCaseDataByState <- mutate(inner_join(inner_join(AggreggatedUSCases, CovidVaccinationData), USCensusResults), # Join the datasets
                                       NewCaseRate=100000*TotalNewCases/TotalPopulation,
                                       NewDeathRate=100000*TotalNewDeaths/TotalPopulation)

The Relationship Between Vaccination & New Cases

Let’s start by plotting the vaccination percentage against rate of new cases as of the most recent posting of the dataset by the CDC over the last two weeks. Note that if you want a different window (i.e., more/less days of new cases), you can change that in the code above. I marked it in a comment with “CHANGE THIS IF YOU WANT”.

p <- ggplot(AdjustedCovidCaseDataByState, aes(y=NewCaseRate, x=AverageVaccinationPct, label=state)) +
       #geom_point(shape=21, size=3, fill="firebrick") +
       geom_text(size=2) + 
       geom_smooth(method="lm", se=F, color="firebrick") +
       ylab("State-Wide New Cases (per 100K)") +
       xlab("Percent of Fully Vaccinated")
print(p)

Certainly it looks inversely correlated. Let’s see what the stats say on the raw data before we get “creative”:

model = lm(data=AdjustedCovidCaseDataByState, formula=AverageVaccinationPct ~ NewCaseRate)
summary(model)
## 
## Call:
## lm(formula = AverageVaccinationPct ~ NewCaseRate, data = AdjustedCovidCaseDataByState)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.748  -3.623   0.429   7.536  17.828 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.486118   3.746861  14.809  < 2e-16 ***
## NewCaseRate -0.021910   0.005158  -4.248 9.86e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.38 on 48 degrees of freedom
## Multiple R-squared:  0.2732, Adjusted R-squared:  0.2581 
## F-statistic: 18.04 on 1 and 48 DF,  p-value: 9.86e-05

So the correlation is most assuredly statistically significant: We can conclude that there is a statistical relationship between low vaccination rates and high rates of new cases. Let’s be careful: correlation is not cuasation, but we cannot deny that a relationship exists.

There are a couple of reasons why this is an unsatisfying model, though. First, the adjusted r-squared is a little low. The safest pedantic way to describe the results is: “We have very strong evidence to be convinced that there is a weak linear relationship” in the data. But there are some curious things of note.

First, there are clear outliers pulling the fit off target (e.g., Louisiana). Also, if you took the time to actually look at the data we downloaded above, you’d see some pretty strange results. For instance Hawaii and Texas have 0% vaccination rates? Unlikely. More likely they are unreported. GA, VA, and WV vaccination rates look wrong to me, as well. Since I am not inclined to trust my “looks wrong” subjectivity, I made the choice to keep GA, WV, VA, but I will address the 0 vaccination rate issue. I’m also disinclined to omit the outliers since I’ve no logical reason to do so. However, the fit does look non-linear to me, so I will generalize our model a bit. Specifically, since disease propogation is exponential, it seems plausible to me that the fit would be logarithmic.

To address the 0 vaccination rate problen, I looked up TX and HI on the NY Times Covid dashboard and used their vaccination rate numbers as of 8/13/2021: https://www.nytimes.com/interactive/2021/us/covid-cases.html

#subsetDF = filter(AdjustedCovidCaseDataByState, 
#                  AverageVaccinationPct > 30)
txIdx = which(AdjustedCovidCaseDataByState$state == "TX")  # Which index is TX
hiIdx = which(AdjustedCovidCaseDataByState$state == "HI")  # Which index is HI
vaxIdx = 4   # The AverageVaccinePct is the fourth column

# Got this from the following URL on 8/13/2021
# https://www.nytimes.com/interactive/2021/us/covid-cases.html
AdjustedCovidCaseDataByState[txIdx, vaxIdx] = 45
AdjustedCovidCaseDataByState[hiIdx, vaxIdx] = 54

model = lm(data=AdjustedCovidCaseDataByState, formula=AverageVaccinationPct ~ log(NewCaseRate))
summary(model)
## 
## Call:
## lm(formula = AverageVaccinationPct ~ log(NewCaseRate), data = AdjustedCovidCaseDataByState)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0161  -4.4751  -0.6976   5.3743  16.1035 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       119.033     15.236   7.812 4.23e-10 ***
## log(NewCaseRate)  -11.926      2.386  -4.999 8.08e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.401 on 48 degrees of freedom
## Multiple R-squared:  0.3424, Adjusted R-squared:  0.3287 
## F-statistic: 24.99 on 1 and 48 DF,  p-value: 8.079e-06

Here’s what that looks like:

p <- ggplot(AdjustedCovidCaseDataByState, aes(y=NewCaseRate, x=AverageVaccinationPct, label=state)) +
       #geom_point(shape=21, size=3, fill="firebrick") +
       geom_text(size=2) + 
       geom_smooth(method="lm", se=F, color="firebrick", formula=y~log(x)) +
       ylab("State-Wide New Cases (per 100K) Over 14 Days") +
       xlab("Percent of Fully Vaccinated")
print(p)

The Relationship Between Vaccination & Mortality

Now let’ look at the mortality data.

p <- ggplot(AdjustedCovidCaseDataByState, aes(y=NewDeathRate, x=AverageVaccinationPct, label=state)) +
       #geom_point(shape=21, size=3, fill="firebrick") +
       geom_text(size=2) + 
       geom_smooth(method="lm", se=F, color="firebrick", formula=y~log(x)) +
       ylab("State-Wide New Deaths (per 100K) over 14 Days") +
       xlab("Percent of Fully Vaccinated")
print(p)

Again, this looks very correlated. And it is. Let’s see:

model = lm(data=AdjustedCovidCaseDataByState, formula=AverageVaccinationPct ~ log(NewDeathRate))
summary(model)
## 
## Call:
## lm(formula = AverageVaccinationPct ~ log(NewDeathRate), data = AdjustedCovidCaseDataByState)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.3777  -4.0764  -0.7707   7.2401  16.0918 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         52.684      2.674  19.699  < 2e-16 ***
## log(NewDeathRate)   -7.403      1.819  -4.069 0.000175 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.933 on 48 degrees of freedom
## Multiple R-squared:  0.2565, Adjusted R-squared:  0.241 
## F-statistic: 16.56 on 1 and 48 DF,  p-value: 0.000175

As an aside, here is the way to interpret that p-value: The likelyhood that we would have seen this strong of a relationship in our sample under the assumption that there is no relationship is nearly zero. So we have strong evidence that there is an inverse relationship between vaccination rates and mortality rate.

The same is true of hospitalization, though I do not have that data here in this notebook.

Other Issues with This Analysis

The primary problem with my analysis is average vaccination rates across counties is not the same as the average vaccination rate of the whole state. Unfortunately, I haven’t yet found easy access to a single dataset containing the latter. I could go to the NY Times Dashboard, or even look at each state’s individual COVID dashboards, but that would involve human transcription, would be labor intensive, and would almost certainly result in errors. It also would be hard to updated periodically. So I elected the way that I did it. I suspect that the correlation would only be stronger if I had the state-wide percentages since the way I did it is undervaluing the impact of larger counties on the average, and larger counties likely have higher vaccination rates (and infection rates, of course). It’s difficult to say since I could be creating a Simpson’s Paradox with my current method.

Whether or not I average across counties, there is very clear evidence of a relationship.

Watch This Space

I just started this notebook. Just like my analysis of 2020 mortality, I will grow this one. I’d like to fix the above issue. I’d also like to perform a county-based analysis in SC. I’ve already done that, but with manually transcribed data. I’ve written to SC DHEC to see if I can get source data directly. I have almost what I need from the CDC data, but (again) I need population numbers for the counties in SC. I can probably find that somewhere online in a convenient form that doesn’t involve manual transcription. When I do, I’ll add that.

I’m also interested in grabbing WHO data and performing a similar analysis on the national level.

Why Do This?

All of these analyses have been done. There is am inverse correlation between vaccination rates and infection rates. This is well-known. My point with this is to show transparently and clearly how you can do this analysis for yourself. You don’t have to trust any media source (your favorite, or the other person’s). You can do it yourself very straightforwardly.