Assignment 5 - Tidying and Transforming Vaccination Data

library(tidyverse)
library(readxl)
library(reshape)

Overview

We were asked to analyze data from August 2021 Israeli hospitalization (“Severe Cases”) rates for pople under 50 (assume “50 and under”) and over 50, for both unvaccinated and fully vaccinated populations.

Specific questions were as follows:

1) Do you have enough information to calculate the total population? What does this total population represent?

2) Calculate the Efficacy vs. Disease; Explain your results.

3) From your calculation of efficacy vs. disease, are you able to compare the rate of severe cases in unvaccinated individuals to that in vaccinated individuals?

Definitions

Severe Cases = hospitalized
Efficacy vs. severe disease = 1 - (% fully vaccinated severe cases per 100k / % not vaccinated severe cases per 100k)

Additional domain knowledge needed

1) Israel’s total population

According to Wikipedia the 2022 population estimate is 9,481,820

2) Who is eligible to receive vaccinations

According to an article in The Lancet as of June 2021 everyone over the age of 12 was eligible to receive a vaccine in Israel.

3) What does it mean to be fully vaccinated

The Wall Street Journal released an artice in August 2021 entitled: In Israel, Being Fully Vaccinated Now Means Three Shots, so this would lead me to believe being fully vaccinated means 3 shots, as opposed to the previously accepted 2 shots. However, being that the data for analysis was as of August 2021, fully vaccinated probably means at least 2 shots.

Read in data

df <- read_excel("./data/israeli_vaccination_data_analysis_start.xlsx",
           range = "A3:E6",
           col_names = c("age","not_vax","fully_vax","sev_not_vax_100k","sev_fully_vax_100k"))

df

## # A tibble: 4 × 5
##   age       not_vax   fully_vax sev_not_vax_100k sev_fully_vax_100k
##   <chr>       <dbl>       <dbl>            <dbl>              <dbl>
## 1 <50   1116834     3501118                   43                 11
## 2 <NA>        0.233       0.73                NA                 NA
## 3 >50    186078     2133516                  171                290
## 4 <NA>        0.079       0.904               NA                 NA

Reconfigure data

I would like to see the data frame transformed a bit, i.e. vaccination status rates, and age become their own columns.

# create cols
df$not_vax_pop_rate <- ""
df$vax_pop_rate <- ""
df

## # A tibble: 4 × 7
##   age       not_vax fully_vax sev_not_vax_100k sev_fully_vax_1… not_vax_pop_rate
##   <chr>       <dbl>     <dbl>            <dbl>            <dbl> <chr>           
## 1 <50   1116834       3.50e+6               43               11 ""              
## 2 <NA>        0.233   7.3 e-1               NA               NA ""              
## 3 >50    186078       2.13e+6              171              290 ""              
## 4 <NA>        0.079   9.04e-1               NA               NA ""              
## # … with 1 more variable: vax_pop_rate <chr>

# move rate values into correct cols
df$not_vax_pop_rate[1] <-  df$not_vax[2] 
df$vax_pop_rate[1] <- df$fully_vax[2]
df$not_vax_pop_rate[3] <-df$not_vax[4]
df$vax_pop_rate[3] <- df$fully_vax[4]

# convert new cols to numeric
df$not_vax_pop_rate <- as.numeric(df$not_vax_pop_rate)
df$vax_pop_rate <- as.numeric(df$vax_pop_rate)

# remove unecessary rows
df <- df[-c(2,4),]

df <- as.data.frame(df)

df

##   age not_vax fully_vax sev_not_vax_100k sev_fully_vax_100k not_vax_pop_rate
## 1 <50 1116834   3501118               43                 11            0.233
## 2 >50  186078   2133516              171                290            0.079
##   vax_pop_rate
## 1        0.730
## 2        0.904

Melt

Melt the data frame to get 4 rows with age and vaccination status

df_melt <- melt(df,id=c("age", "not_vax_pop_rate", "vax_pop_rate", "sev_not_vax_100k","sev_fully_vax_100k"),
     variable_name = "vax_status")

df_melt %>% 
  select(age,vax_status, everything())

##   age vax_status not_vax_pop_rate vax_pop_rate sev_not_vax_100k
## 1 <50    not_vax            0.233        0.730               43
## 2 >50    not_vax            0.079        0.904              171
## 3 <50  fully_vax            0.233        0.730               43
## 4 >50  fully_vax            0.079        0.904              171
##   sev_fully_vax_100k   value
## 1                 11 1116834
## 2                290  186078
## 3                 11 3501118
## 4                290 2133516

Move rate values into their own columns.

# create pop rate col
df_melt$pop_rate[1] <- df_melt$not_vax_pop_rate[1]
df_melt$pop_rate[2] <- df_melt$not_vax_pop_rate[2]
df_melt$pop_rate[3] <- df_melt$vax_pop_rate[1]
df_melt$pop_rate[4] <- df_melt$vax_pop_rate[2]

# create sev_rate_100k col
df_melt$sev_case_100[1] <- df_melt$sev_not_vax_100k[1]
df_melt$sev_case_100[2] <- df_melt$sev_not_vax_100k[2]
df_melt$sev_case_100[3] <- df_melt$sev_fully_vax_100k[1]
df_melt$sev_case_100[4] <- df_melt$sev_fully_vax_100k[2]

# delete unecessary columns
df_melt <- df_melt[,-c(2:5)]

# rename value to pop
df_melt <- df_melt %>% 
  rename(replace = c("value" = "pop"))

df_melt

##   age vax_status     pop pop_rate sev_case_100
## 1 <50    not_vax 1116834    0.233           43
## 2 >50    not_vax  186078    0.079          171
## 3 <50  fully_vax 3501118    0.730           11
## 4 >50  fully_vax 2133516    0.904          290

What story does the cleaned data tell us?

# plot pop rate vs age and vax status
ggplot(df_melt, aes(x = age, y = pop_rate, fill = vax_status)) +
  geom_bar(stat = 'identity') +
  labs(x = "Age", y = "Pop rate", subtitle = "Pop rate by Vax status & Age") +
  scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax")) +
  coord_flip()

Looking at the data grouped by age and population rate fully vaccinated vs not, we can clearly see that a larger percentage (~90 vs. ~73) of those >50 are fully vaccinated against COVID-19. However, these are stacked bar charts, and one can see that they are not up too 100%, there must be some other data unaccounted for. Let’s come back to this.

Let’s take a look at vaccination status and age in relation to severe cases of COVID-19.

# plot severe case vs age and vax status
ggplot(df_melt, aes(x = sev_case_100, y = age, fill = vax_status)) +
  geom_bar(stat = "identity") +
  labs(x = "Severe Case 100k", y = "Age", subtitle = "Severe case (per 100k) by Vax status & Age" ) +
  scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax"))

Looking at the data like this, it gives the impression that in folks >50 and fully vaccinated for COVID-19, had an increased risk of severe cases in comparison to those >50 not vaccinated.

Well, let’s look at how Severe Cases per 100k are calculated.

Severe Cases

How are severe cases calculated?

The data gives the quantity of severe cases per 100,000 people. However, to reflect rate we need to calculate incidence rate, or rate of severe cases per 100,000 people.

Incidence rate is calculated by the following:

severe case incidence rate = # of severe cases / pop at risk

Therefore, in the case of those >50 that are vaccinated with a population base of ~2.1 mm and 290 severe cases the calculation would be done as such:

290 / 2,133,516 = 0.00013593 x 100000 = 13.6 incidence rate

Those that are not vaccinated and >50 have a population base of ~186k with 171 severe cases:

171 / 186,078 = 0.00091897 X 100000 = 91.9 incidence rate

This means that in fully vaccinated people >50, there will be ~13 severe cases per 100,000. In people >50 that are not vaccinated, there will be ~91 severe cases per 100,000. So despite severe case count in those vaccinated being higher then those no vaccinated, the incidence rate is much higher in those not vaccinated. Therefore, not getting vaccinated for COVID-19 puts one at greater risk for developing a severe case.

Let’s create a column for Severe Case Incidence Rate to visualize this metric in rates as opposed to count.

df_melt <- df_melt %>% 
  mutate(sev_case_rate_100k = sev_case_100/pop * 100000)

df_melt %>% 
  arrange(age)

##   age vax_status     pop pop_rate sev_case_100 sev_case_rate_100k
## 1 <50    not_vax 1116834    0.233           43          3.8501693
## 2 <50  fully_vax 3501118    0.730           11          0.3141854
## 3 >50    not_vax  186078    0.079          171         91.8969464
## 4 >50  fully_vax 2133516    0.904          290         13.5925861

Now if we view severe cases per 100,000 by rate, the data tells a different story.

# plot severe case incidence rate per 100k vs age and vax status
ggplot(df_melt, aes(x = sev_case_rate_100k, y = age, fill = vax_status)) +
  geom_bar(stat = "identity", position = position_dodge()) +
    labs(x = "Severe Case Incidence rate per 100k", y = "Age", subtitle = "Severe case incidence rate (per 100k) by Vax status & Age" ) +
  scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax"))

Here we can see that those folks not vaccinated in either age group have a much higher Severe Case Incidence Rate per 100,000. The discrepancy is staggering the >50 age group.

Find the missing part of population

Remember the stacked bar charts when we looked at population rate, grouped by age and vaccination status? Well they didn’t equate to 100%, so some data was missing. Let’s quantify and explain that missing portion of data.

# add other to levels in factor column vax_status
levels(df_melt$vax_status) <- c("not_vax", "fully_vax", "other")

# add rows
df_melt[nrow(df_melt)+1,] <- c("<50","other",0,0,NA,NA)
df_melt[nrow(df_melt)+1,] <- c(">50","other",0,0,NA,NA)

# change columns to numeric
cols.num <- c("pop", "pop_rate","sev_case_100", "sev_case_rate_100k")
df_melt[cols.num] <- sapply(df_melt[cols.num], as.numeric)

df_melt %>% 
  arrange(age)

##   age vax_status     pop pop_rate sev_case_100 sev_case_rate_100k
## 1 <50    not_vax 1116834    0.233           43          3.8501693
## 2 <50  fully_vax 3501118    0.730           11          0.3141854
## 3 <50      other       0    0.000           NA                 NA
## 4 >50    not_vax  186078    0.079          171         91.8969464
## 5 >50  fully_vax 2133516    0.904          290         13.5925861
## 6 >50      other       0    0.000           NA                 NA

Calculate pop_rate for vax_status == other

# calc pop_rate for <50
df_melt$pop_rate[df_melt$vax_status == "other" & df_melt$age == "<50"] <- 1-sum(df_melt[df_melt$age == "<50",]$pop_rate)

# calc pop_rate for >50
df_melt$pop_rate[df_melt$vax_status == "other" & df_melt$age == ">50"] <-  1-sum(df_melt[df_melt$age == ">50",]$pop_rate)

Calculate pop for vax_status == other

# calculate pop for <50
df_melt$pop[df_melt$vax_status == "other" & df_melt$age == "<50"] <- (sum(df_melt[df_melt$age == "<50",]$pop) / sum(df_melt[df_melt$age == "<50" & df_melt$vax_status != "other",]$pop_rate)) - sum(df_melt[df_melt$age == "<50",]$pop)

# calculate pop for >50
df_melt$pop[df_melt$vax_status == "other" & df_melt$age == ">50"] <- (sum(df_melt[df_melt$age == ">50",]$pop) / sum(df_melt[df_melt$age == ">50" & df_melt$vax_status != "other",]$pop_rate)) - sum(df_melt[df_melt$age == ">50",]$pop)

df_melt %>% 
  arrange(age)

##   age vax_status        pop pop_rate sev_case_100 sev_case_rate_100k
## 1 <50    not_vax 1116834.00    0.233           43          3.8501693
## 2 <50  fully_vax 3501118.00    0.730           11          0.3141854
## 3 <50      other  177429.10    0.037           NA                 NA
## 4 >50    not_vax  186078.00    0.079          171         91.8969464
## 5 >50  fully_vax 2133516.00    0.904          290         13.5925861
## 6 >50      other   40115.05    0.017           NA                 NA

tot_pop <- sum(df_melt$pop)
cat("Total population now equates to",tot_pop)

## Total population now equates to 7155090

Unfortunately, the total population even with the missing data in pop_rate accounted for, does not equate to the ~9mm. This leads me to believe the missing data derived from pop_rate not equating to 100, are folks not fully vaccinated, but have received at least one dose, and the ~2mm extra are folks ineligible i.e. <12 years of age and religious exempt.

Visualize with 3rd variable in `vax_status`

Let’s look at the same bar chart with the third categorical variable in vax_status.

ggplot(df_melt, aes(x = age, y = pop_rate, fill = vax_status)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
      labs(x = "Age", y = "Pop rate", subtitle = "Pop rate by Age & Vax status" ) +
  scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax", "Other"))

Now we see the stacked bar charts equate to 1, or 100% of the population. Here we can see the is a small percentage of other in vaccination status in each age group. These could be folks that were not fully vaccinated but received at least one dose of the COVID-19 vaccine.

Calculate efficacy rate

Efficacy rate is calculated by taking the risk in the not vaccinated group, minus the risk among the fully vaccinated, divided by the risk among the not vaccinated.

So for >50 folks it would be:

~91 - ~13.5 / ~91 = 0.85

85% vaccine efficacy

For folks <50:

~3.85 - 0.31 / ~3.85 = 0.92

92% vaccine efficacy

# create efficacy rate col
df_melt$eff_rate <- ""

# calc efficacy rate for >50
df_melt[df_melt$age == ">50" & df_melt$vax_status != "other",]$eff_rate <- (df_melt[df_melt$age == ">50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k - df_melt[df_melt$age == ">50" & df_melt$vax_status == "fully_vax",]$sev_case_rate_100k) / df_melt[df_melt$age == ">50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k

# calc efficacy rate for <50
df_melt[df_melt$age == "<50" & df_melt$vax_status != "other",]$eff_rate <- (df_melt[df_melt$age == "<50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k - df_melt[df_melt$age == "<50" & df_melt$vax_status == "fully_vax",]$sev_case_rate_100k) / df_melt[df_melt$age == "<50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k

# convert to numeric
df_melt$eff_rate <- as.numeric(df_melt$eff_rate)

df_melt

##   age vax_status        pop pop_rate sev_case_100 sev_case_rate_100k  eff_rate
## 1 <50    not_vax 1116834.00    0.233           43          3.8501693 0.9183970
## 2 >50    not_vax  186078.00    0.079          171         91.8969464 0.8520888
## 3 <50  fully_vax 3501118.00    0.730           11          0.3141854 0.9183970
## 4 >50  fully_vax 2133516.00    0.904          290         13.5925861 0.8520888
## 5 <50      other  177429.10    0.037           NA                 NA        NA
## 6 >50      other   40115.05    0.017           NA                 NA        NA

Now let’s visualize efficacy rate for severe cases.

ggplot(df_melt, aes(x = eff_rate, y = age, fill = age)) +
  geom_bar(stat = 'identity', position = position_dodge()) +
  geom_text(aes(label = round(eff_rate,2)), hjust = 1.2, color = "white") +
  labs(y = "Age", x = "Efficacy Rate", subtitle = "Efficacy Rate by Age" )

Therefore, we can see the vaccines efficacy rate against severe cases of COVID-19 for folks >50 is 7% less then those <50. Still good efficacy rates for both age brackets, which would lead me to reccomend vaccination for those inclined

Conclusion

1) Do you have enough information to calculate the total population? What does this total population represent?

I believe I did have enough information to calculate the total population, however when I filled in the missing values it came to ~7mm, and according to Wikipedia it should be ~9mm. So that leads me to believe that the total population prior to the missing values were those not vaccinated or fully vaccinated and the missing part of the population were those partially vaccinated. Whereas the other ~2mm were folks ineligible i.e. <12 years of age or religious exemptions.

2) Calculate the Efficacy vs. Disease; Explain your results.

The calculation of Efficacy vs Disease in this case is only against Severe Disease or those hospitalized due to COVID-19. As explained above, the efficacy rate for those >50 is 85% and those <50 92%. Despite the misleading presentation of the initial data frame, it is still highly recommended to get the vaccine. Both efficacy rates regardless of age bracket, are extremely preventative against severe cases.

3) From your calculation of efficacy vs. disease, are you able to compare the rate of severe cases in unvaccinated individuals to that in vaccinated individuals?

Yes, I was able to compare this prior to obtaining efficacy rate, by doing a reverse calculation for incidence rate. We were given the amount of cases per 100,00 people, therefore imputing the numerator and dividing by population at risk to obtain incidence rate of severe disease per 100,000 people.

Assignment 5 - Tidying and Transforming Vaccination Data

Justin Williams

3/4/2022

Overview

Additional domain knowledge needed

Read in data

Reconfigure data

Melt

What story does the cleaned data tell us?

Severe Cases

Find the missing part of population

Visualize with 3rd variable in `vax_status`

Calculate efficacy rate

Conclusion

Assignment 5 - Tidying and Transforming Vaccination Data

Justin Williams

3/4/2022

Overview

Additional domain knowledge needed

Read in data

Reconfigure data

Melt

What story does the cleaned data tell us?

Severe Cases

Find the missing part of population

Visualize with 3rd variable in vax_status

Calculate efficacy rate

Conclusion

Visualize with 3rd variable in `vax_status`