Introduction

For this assignment, I explored the Vaccine Adverse Event Reporting System (VAERS) dataset. VAERS is a system which allows healthcare providers to create reports for any negative health episode which a patient experiences after receiving a vaccine. Vaccines undergo rigorous testing before they are made available to the public, however as with any drug or biological product, there is always a potential for side-effects, which in some cases can be serious. The VAERS system was created by the Food and Drug Administration (FDA) and the Centers for Disease Control (CDC) to receive and track reports about adverse events which may be associated with vaccines. VAERS is used to continually monitor whether any vaccine or vaccine lot has a higher than expected rate of events.

In the case of an adverse health event following vaccination, vaccine providers are encouraged to submit a report even if they are not certain that the vaccine was the cause. Since it is difficult to distinguish a coincidental event from one truly caused by a vaccine, the VAERS database contains events of both types. The Department of Health and Human Services also notes that the data is collected through a passive surveillance system, and may be subject to further limitations such as under-reporting and reporting bias. Additionally, a report may involve more than one vaccine, as well as more than one adverse health event.[1] Ultimately, this data should be interpreted with significant caution.

Data Used in This Report

The particular VAERS dataset for this assignment was sourced from Kaggle.com: https://www.kaggle.com/ayushggarg/covid19-vaccine-adverse-reactions It is limited to patients who received the COVID-19 vaccination, but also includes information about other vaccines those patients may have received. A detailed description of the VAERS dataset can be found in the VAERS Data User Guide, provided by the HHS:

https://vaers.hhs.gov/docs/VAERSDataUseGuide_November2020.pdf

The Dataset consists of three files. 2021VAERSDATA.csv contains information about the patients such as sex, age, location, whether they were hospitalized, whether they died, and so forth. 2021VAERSSYMPTOMS.csv contains information about the symptoms the patient experienced subsequent to vaccination. 2021VAERSVAX.csv documents which vaccine type the patient received; Pfizer or Moderna, if this was included in the report. In this notebook we will use 2021VAERSDATA.csv and 2021VAERSVAX.csv.

2021VAERSDATA.csv contains 5351 observations of 35 variables. We will focus on the following variables:

VAERS_ID - A unique ID for each patient for whom a report was made STATE - State in which a patient received their vaccine AGE_YRS - Patient age in years at time of vaccination DIED - “Y” if it was reported that the patient died, NA otherwise CUR_ILL - Comments made by the report filer on any concurrent ailments the patient was experiencing

2021VAERSVAX.csv contains 5471 observations of 8 variables. It contains more observations than 2021VAERSDATA.csv because it contains multiple entries for patients who received other vaccines in addition to the Covid-19 vaccination, such as a flu shot. We will utilize the following:

VAERS_ID - A unique ID for each patient for whom a report was made; corresponds to the ID column in 2021VAERSDATA.csv VAX_TYPE - A string documenting the type of vaccine given, e.g. “COVID19”, “FLU”, “HPV”, etc. VAX_MANU - A string documenting the manufacturer of the vaccine which was administrated. For the Covid-19 vaccine, the possible values are: “PFIZER\BIONTECH”, “MODERNA”, or “UNKNOWN MANUFACTURER” if this data was not included in the report.

In this report, we will examine various factors to see what, if anything, influenced the likelihood that a patient would have died following vaccination. We will mainly consider the age and location of the patients, and the manufacturer of the vaccine they received. At the end, we will also take a quick look at the concurrent illnesses which were reported for patients who passed away.

load needed libraries:

library(tidyverse)
library(tm)
library(wordcloud)
library(RColorBrewer)

Load the VAERS data:

vaers <- read.csv("2021VAERSDATA.csv", encoding = "latin1")

Load the vaccine manufacturer data:

vax <- read.csv("2021VAERSVAX.csv", encoding = "latin1")

We’ll join these two datasets on the VAERS Id of each patient"

vaers_vax <- vaers %>% right_join(vax, by = "VAERS_ID")

Data Cleaning and Pre-processing

The data in 2021VAERSVAX.csv contains multiple entries for patients who received separate vaccines (for example, a flu shot in addition to the Covid-19 vaccination), and we are not going to examine any effects from this presently. Therefore, we will filter out entries containing any vaccine type other than Covid.

vaers_vax <- vaers_vax %>% filter(VAX_TYPE == "COVID19")
summary(vaers_vax$AGE_YRS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.08   40.00   56.00   57.12   74.00  105.00     936

The age column has some odd decimals, as well as a lot of NA entries.

Let’s round off to years:

vaers_vax$AGE_YRS <- vaers_vax$AGE_YRS %>% round()

We want to use the mean age in each state later on and the NA’s will get in the way, so we will go ahead and compute this without the NA entries, and store the result in a separate dataframe:

state_mean_ages <- vaers_vax %>% 
  drop_na(AGE_YRS) %>% 
  group_by(STATE) %>%
  summarise(mean_age = mean(AGE_YRS))
#state_mean_ages <- state_mean_ages[!is.na(state_mean_ages)]

Let’s look at the values in the “DIED” column:

table(vaers_vax$DIED)
## 
##         Y 
## 4480  799

The entries are either “Y” for patients who died, or they are blank.

For the sake of the plot, we’ll replace the labels in the column with more explicit ones:

vaers_vax$DIED <- replace(vaers_vax$DIED, vaers_vax$DIED != "Y", 
                          values = c("ALL OUTCOMES"))
vaers_vax$DIED <- replace(vaers_vax$DIED, vaers_vax$DIED == "Y", 
                          values = c("DIED"))

Analysis

Age of Patients Who Died

ggplot(vaers_vax) +
  geom_histogram(mapping = aes(x = AGE_YRS, fill = DIED), binwidth = 10,
                 center = 55, color = 'black') +
  labs(title = "Patient Death by Age", x = "Patient Age (yrs)",
       y = "Count", fill = "Outcome") + 
  scale_x_continuous(limits = c(0, 100), breaks = seq(0,100,10))

One thing I notice in this plot is that the most frequent occurrence of VAERS reports came from individuals between 35 and 45 years of age, surprisingly young, in my opinion. We would need to compare this with the age distribution in the overall population to see if it is in any way significant. We also see that older individuals were far more likely to die after receiving the vaccine, with those in the 80-90 year age range representing the highest occurrence of deaths. The average life expectancy in the US is 78.7 years [2], so this could potentially be explained by background effects/coincidence as well.

Incidence of Death by State

I was curious to see if there was any difference in patient outcomes between states. Naturally, we would expect the most populous states to have more reports, as well as more deaths. To account for this, we can divide the deaths per state by the total number of reports, to get a frequency statistic.

To do this, we need to get a dataframe which contains the patients who died, as well as the total number of all patients in each state. We will:

  • compute the two totals separately

  • filter out states with below a threshold (30) number of reports

  • combine them into a single dataframe with a join statement.

  • Finally, we’ll compute the death rate per state with mutate():

Get the number of patients who died in each state:

Some of the reports also contain blank entries for states, so we’ll get rid of these:

vaers_vax <- vaers_vax %>% filter(STATE != "")
died <- vaers_vax %>% 
  filter(DIED == "DIED") %>%
  group_by(STATE) %>%
  summarise(total = n())

Get the total patients in each state:

# Reports in each state
vaers_by_state <- vaers_vax %>%
  group_by(STATE) %>%
  summarise(total = n())

Join this with the mean ages computed earlier:

vaers_by_state <- vaers_by_state %>% left_join(state_mean_ages, by = "STATE")

States with a small number of reports would be easily skewed, for example, if a state had two reports and one death, it would not be significant enough to state a 50% death rate for that state.

So, we will filter out states which had fewer than 30 VAERS reports:

vaers_by_state <- vaers_by_state %>% filter(total > 30)

Combine the total reports and deaths per state so that we can compute the ratio:

deaths_by_state <-vaers_by_state %>% left_join(died, by="STATE")

Divide total deaths per state by total reports per state, and store the result in a new column:

deaths_by_state <- deaths_by_state %>% 
  mutate(freq = total.y/total.x)

Now we will replace the state abbreviations by their full names, so it is easier to read in our plot:

state_inds <- match(deaths_by_state$STATE, state.abb)
deaths_by_state$state_names <- state.name[state_inds]

Create a barplot:

deaths_by_state %>%
  ggplot(aes(y = reorder(state_names, freq), x = freq, fill = mean_age)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Deaths/Total Reports Ratio by State",
       x = "State",
       y = "Deaths/Total Reports",
       fill = "Mean Patient Age 
       (All Reports)")

It actually seems that some states had a surprisingly high death rate following vaccination compared to others. Patients in Kentucky were ~10 times as likely to die as patients in Oregon. We can see from the color scale that patients in Kentucky were, on average, much older than patients in Oregon or other states with a low death rate, however in between we can see some states younger average patient ages having higher death rates than states with older patients. In particular, Nevada was on the lower end of the age spectrum, but had more deaths per report than several states with older average patient ages. In contrast, Louisiana was at the upper end of the age spectrum, but had a higher death-to-report ratio than many states with younger average ages.

Difference in vaccine type

The file 2021VAERSVAX.csv describes which vaccine each patient received (Pfizer/Biontech or Moderna). Let’s examine whether one vaccine was more closely associated with patient mortality than the other.

table(vaers_vax$VAX_MANU)
## 
##              MODERNA     PFIZER\\BIONTECH UNKNOWN MANUFACTURER 
##                 1584                 2806                   12

We will go ahead and drop the patients where the vaccine type is unknown:

vaers_vax <- vaers_vax %>% filter(VAX_MANU != "UNKNOWN MANUFACTURER")

Of the patients who made VAERS reports, about twice as many had received the Pfizer vaccine as received the Moderna vaccine. We cannot infer any signifance here as it is entirely possible that this simply reflects how many individuals overall received the Pfizer or Moderna vaccines. What about the patients who died?

vaers_vax_died <- vaers_vax %>% 
  filter(DIED == "DIED")
table(vaers_vax_died$VAX_MANU)
## 
##          MODERNA PFIZER\\BIONTECH 
##              335              362

It appears that while far more people received the Pfizer vaccine than Moderna, the number of deaths from either vaccine was almost equal. This is surprising, and at a glance would seem to suggest that the Moderna vaccine is not as safe as the Pfizer vaccine. However, once again, we do not have enough information here to draw a conclusion.

We already saw that younger patients were less likely to die after their vaccine. Maybe they were more likely to have received the Pfizer version.

vaers_vax %>%
  ggplot() + 
  geom_histogram(mapping = aes(x = AGE_YRS, fill = VAX_MANU), color = 'black',
                 position = "dodge", binwidth = 10, center = 55) + 
  labs(title = "Vaccine Type by Age", x = "Patient Age (yrs)",
       y = "Count", fill = "Vaccine Type") + 
  scale_x_continuous(limits = c(0, 100), breaks = seq(0,100,10))

From this histogram, we can see that more of the younger patients received the Pfizer vaccine than Moderna, whereas older than 70 received them in roughly equal numbers. This likely accounts for at least some of the incongruence in death rates between the vaccine types.

Comorbidity Wordcloud

One more thing that I thought would be interesting to explore is the data in the concurrent illness column. This data consists of comments entered by the individuals who created each VAERS report, and is not standardized. We have not covered much in the way of language processing in class so far, so I do not want to spend too much time on it, but we can make a quick word cloud and see if anything jumps out.

Create a text Corpus using the tm library:

vaers_died <- vaers_vax %>% filter(DIED == "DIED")
text <- vaers_died$CUR_ILL
docs <- Corpus(VectorSource(text))

Quick clean-up of the text: Remove numbers, punctuation, and whitespace; change all text to lowercase, and remove stop words (common words that convey no useful information)

docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

Create a dataframe of each word and its relative frequency in the overall body of text; again, the tm library provides utilities for this:

dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
text_df <- data.frame(word = names(words),freq=words)

Finally, create the word cloud:

set.seed(111)
wordcloud(words = text_df$word, freq = text_df$freq, min.freq = 1,           
          max.words=200, random.order=FALSE, rot.per=0.35,            
          colors=brewer.pal(8, "Dark2"))

We can see that “unspecified” and “none” were some of the most common entries here, but unfortunately we do not know if that means there was no concurrent illness, or it simply was not reported. We could have removed these entries prior to creating the graphic, however, the intention here was not to do any careful analysis and it is also worth noting that these made up the most common entries. Looking past these, we can see that “anemia”, “dementia”, “diabetes”, and “hypertension” were apparently common comorbidities for patients who died subsequent to receiving their vaccine dose.

Conclusion

Summary

In this report, we examined the VAERS dataset, which contains records of adverse vaccine reactions, focusing on reactions to COVID-19 vaccination. As stated in the introduction, this data contains reports of any health event following a vaccination, and it cannot be assumed that the reaction was in any way caused by the vaccine itself. Any attempt to determine evidence of a causal relationship between vaccines and adverse health events would need to take into account the baseline rate at which these events would be expected to occur for a given patient group.

The dataset which we explored here was quite substantial, both in the number of observations recorded and the variables which were collected. Unfortunately, due to the passive reporting method which was used to collect the data, many reports did not include all variables, and the data is likely subject to an under-reporting bias as well.

Nonetheless, if any unusual patterns were to be found in this data, it could suggest a direction for further investigation.

We first examined the age of patients in the dataset, comparing the age of patients who died with the dataset as a whole. We found that middle-aged patients accounted for the highest number of overall reports, while elderly patients accounted for the majority of deaths. At a glance, this seems fairly congruent with the background rates for illnesses and deaths in the United States.

Next, we compared the death rates following vaccination on a state-by-state basis, while attempting to account for the difference in ages of patients in each state. We found that not only did patients have higher death rates in some states relative to others, but also that the average patient age in each state did not appear to fully account for these discrepancies. This was surprising, and could be worth looking into further. I am curious if there is a difference in quality or access to healthcare between states that could be playing a role here.

We then compared outcomes for patients who received the Pfizer/Biontech vaccine or the Moderna vaccine. We found that patients who received the Moderna vaccine were more likely to have expired following an adverse health event post- vaccination, however we also found that more elderly patients received the Moderna vaccine as opposed to the Pfizer/Biontech vaccine, which likely goes a long way towards accounting for this discrepancy.

Finally, we looked at the most frequently occurring words and phrases in the concurrent illness reports of patients who died. While the ailments that seemed to be prevalent were not entirely surprising, it could be worth investigating these correlations further. Perhaps it is the case that patients who are elderly and suffering from a particular health condition do not handle the vaccine as well as younger/healthier patients, though, it bears repeating that we do not have enough evidence to make such a conclusion.

Directions for Further Research

There were a number of things I would like to look into more. I would like to explore how gender played a role in patient reports and outcomes, what, if any factors influenced whether patients were hospitalized and for how long, the significance of any particular symptoms that appeared following vaccination, and also if any particular vaccine batches were implicated in more hospitalizations or deaths.

[1] https://vaers.hhs.gov/docs/VAERSDataUseGuide_November2020.pdf

[2] https://www.cdc.gov/nchs/fastats/life-expectancy.htm