library(tidyverse)
library(readxl)
library(reshape)
We were asked to analyze data from August 2021 Israeli hospitalization (“Severe Cases”) rates for pople under 50 (assume “50 and under”) and over 50, for both unvaccinated and fully vaccinated populations.
Specific questions were as follows:
1) Do you have enough information to calculate the total population? What does this total population represent?
2) Calculate the Efficacy vs. Disease; Explain your results.
3) From your calculation of efficacy vs. disease, are you able to compare the rate of severe cases in unvaccinated individuals to that in vaccinated individuals?
Definitions
Severe Cases = hospitalized
Efficacy vs. severe disease = 1 - (% fully vaccinated severe cases per 100k / % not vaccinated severe cases per 100k)
1) Israel’s total population
According to Wikipedia the 2022 population estimate is 9,481,820
2) Who is eligible to receive vaccinations
According to an article in The Lancet as of June 2021 everyone over the age of 12 was eligible to receive a vaccine in Israel.
3) What does it mean to be fully vaccinated
The Wall Street Journal released an artice in August 2021 entitled: In Israel, Being Fully Vaccinated Now Means Three Shots, so this would lead me to believe being fully vaccinated means 3 shots, as opposed to the previously accepted 2 shots. However, being that the data for analysis was as of August 2021, fully vaccinated probably means at least 2 shots.
df <- read_excel("./data/israeli_vaccination_data_analysis_start.xlsx",
range = "A3:E6",
col_names = c("age","not_vax","fully_vax","sev_not_vax_100k","sev_fully_vax_100k"))
df
## # A tibble: 4 × 5
## age not_vax fully_vax sev_not_vax_100k sev_fully_vax_100k
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 <50 1116834 3501118 43 11
## 2 <NA> 0.233 0.73 NA NA
## 3 >50 186078 2133516 171 290
## 4 <NA> 0.079 0.904 NA NA
I would like to see the data frame transformed a bit, i.e. vaccination status rates, and age become their own columns.
# create cols
df$not_vax_pop_rate <- ""
df$vax_pop_rate <- ""
df
## # A tibble: 4 × 7
## age not_vax fully_vax sev_not_vax_100k sev_fully_vax_1… not_vax_pop_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 <50 1116834 3.50e+6 43 11 ""
## 2 <NA> 0.233 7.3 e-1 NA NA ""
## 3 >50 186078 2.13e+6 171 290 ""
## 4 <NA> 0.079 9.04e-1 NA NA ""
## # … with 1 more variable: vax_pop_rate <chr>
# move rate values into correct cols
df$not_vax_pop_rate[1] <- df$not_vax[2]
df$vax_pop_rate[1] <- df$fully_vax[2]
df$not_vax_pop_rate[3] <-df$not_vax[4]
df$vax_pop_rate[3] <- df$fully_vax[4]
# convert new cols to numeric
df$not_vax_pop_rate <- as.numeric(df$not_vax_pop_rate)
df$vax_pop_rate <- as.numeric(df$vax_pop_rate)
# remove unecessary rows
df <- df[-c(2,4),]
df <- as.data.frame(df)
df
## age not_vax fully_vax sev_not_vax_100k sev_fully_vax_100k not_vax_pop_rate
## 1 <50 1116834 3501118 43 11 0.233
## 2 >50 186078 2133516 171 290 0.079
## vax_pop_rate
## 1 0.730
## 2 0.904
Melt the data frame to get 4 rows with age and vaccination status
df_melt <- melt(df,id=c("age", "not_vax_pop_rate", "vax_pop_rate", "sev_not_vax_100k","sev_fully_vax_100k"),
variable_name = "vax_status")
df_melt %>%
select(age,vax_status, everything())
## age vax_status not_vax_pop_rate vax_pop_rate sev_not_vax_100k
## 1 <50 not_vax 0.233 0.730 43
## 2 >50 not_vax 0.079 0.904 171
## 3 <50 fully_vax 0.233 0.730 43
## 4 >50 fully_vax 0.079 0.904 171
## sev_fully_vax_100k value
## 1 11 1116834
## 2 290 186078
## 3 11 3501118
## 4 290 2133516
Move rate values into their own columns.
# create pop rate col
df_melt$pop_rate[1] <- df_melt$not_vax_pop_rate[1]
df_melt$pop_rate[2] <- df_melt$not_vax_pop_rate[2]
df_melt$pop_rate[3] <- df_melt$vax_pop_rate[1]
df_melt$pop_rate[4] <- df_melt$vax_pop_rate[2]
# create sev_rate_100k col
df_melt$sev_case_100[1] <- df_melt$sev_not_vax_100k[1]
df_melt$sev_case_100[2] <- df_melt$sev_not_vax_100k[2]
df_melt$sev_case_100[3] <- df_melt$sev_fully_vax_100k[1]
df_melt$sev_case_100[4] <- df_melt$sev_fully_vax_100k[2]
# delete unecessary columns
df_melt <- df_melt[,-c(2:5)]
# rename value to pop
df_melt <- df_melt %>%
rename(replace = c("value" = "pop"))
df_melt
## age vax_status pop pop_rate sev_case_100
## 1 <50 not_vax 1116834 0.233 43
## 2 >50 not_vax 186078 0.079 171
## 3 <50 fully_vax 3501118 0.730 11
## 4 >50 fully_vax 2133516 0.904 290
# plot pop rate vs age and vax status
ggplot(df_melt, aes(x = age, y = pop_rate, fill = vax_status)) +
geom_bar(stat = 'identity') +
labs(x = "Age", y = "Pop rate", subtitle = "Pop rate by Vax status & Age") +
scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax")) +
coord_flip()
Looking at the data grouped by age and population rate fully vaccinated vs not, we can clearly see that a larger percentage (~90 vs. ~73) of those >50
are fully vaccinated against COVID-19. However, these are stacked bar charts, and one can see that they are not up too 100%, there must be some other data unaccounted for. Let’s come back to this.
Let’s take a look at vaccination status and age in relation to severe cases of COVID-19.
# plot severe case vs age and vax status
ggplot(df_melt, aes(x = sev_case_100, y = age, fill = vax_status)) +
geom_bar(stat = "identity") +
labs(x = "Severe Case 100k", y = "Age", subtitle = "Severe case (per 100k) by Vax status & Age" ) +
scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax"))
Looking at the data like this, it gives the impression that in folks >50 and fully vaccinated for COVID-19, had an increased risk of severe cases in comparison to those >50 not vaccinated.
Well, let’s look at how Severe Cases per 100k are calculated.
How are severe cases calculated?
The data gives the quantity of severe cases per 100,000 people. However, to reflect rate we need to calculate incidence rate, or rate of severe cases per 100,000 people.
Incidence rate is calculated by the following:
severe case incidence rate = # of severe cases / pop at risk
Therefore, in the case of those >50 that are vaccinated with a population base of ~2.1 mm and 290 severe cases the calculation would be done as such:
290 / 2,133,516 = 0.00013593 x 100000 = 13.6 incidence rate
Those that are not vaccinated and >50 have a population base of ~186k with 171 severe cases:
171 / 186,078 = 0.00091897 X 100000 = 91.9 incidence rate
This means that in fully vaccinated people >50, there will be ~13 severe cases per 100,000. In people >50 that are not vaccinated, there will be ~91 severe cases per 100,000. So despite severe case count in those vaccinated being higher then those no vaccinated, the incidence rate is much higher in those not vaccinated. Therefore, not getting vaccinated for COVID-19 puts one at greater risk for developing a severe case.
Let’s create a column for Severe Case Incidence Rate to visualize this metric in rates as opposed to count.
df_melt <- df_melt %>%
mutate(sev_case_rate_100k = sev_case_100/pop * 100000)
df_melt %>%
arrange(age)
## age vax_status pop pop_rate sev_case_100 sev_case_rate_100k
## 1 <50 not_vax 1116834 0.233 43 3.8501693
## 2 <50 fully_vax 3501118 0.730 11 0.3141854
## 3 >50 not_vax 186078 0.079 171 91.8969464
## 4 >50 fully_vax 2133516 0.904 290 13.5925861
Now if we view severe cases per 100,000 by rate, the data tells a different story.
# plot severe case incidence rate per 100k vs age and vax status
ggplot(df_melt, aes(x = sev_case_rate_100k, y = age, fill = vax_status)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(x = "Severe Case Incidence rate per 100k", y = "Age", subtitle = "Severe case incidence rate (per 100k) by Vax status & Age" ) +
scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax"))
Here we can see that those folks not vaccinated in either age group have a much higher Severe Case Incidence Rate per 100,000. The discrepancy is staggering the >50 age group.
Remember the stacked bar charts when we looked at population rate, grouped by age and vaccination status? Well they didn’t equate to 100%, so some data was missing. Let’s quantify and explain that missing portion of data.
# add other to levels in factor column vax_status
levels(df_melt$vax_status) <- c("not_vax", "fully_vax", "other")
# add rows
df_melt[nrow(df_melt)+1,] <- c("<50","other",0,0,NA,NA)
df_melt[nrow(df_melt)+1,] <- c(">50","other",0,0,NA,NA)
# change columns to numeric
cols.num <- c("pop", "pop_rate","sev_case_100", "sev_case_rate_100k")
df_melt[cols.num] <- sapply(df_melt[cols.num], as.numeric)
df_melt %>%
arrange(age)
## age vax_status pop pop_rate sev_case_100 sev_case_rate_100k
## 1 <50 not_vax 1116834 0.233 43 3.8501693
## 2 <50 fully_vax 3501118 0.730 11 0.3141854
## 3 <50 other 0 0.000 NA NA
## 4 >50 not_vax 186078 0.079 171 91.8969464
## 5 >50 fully_vax 2133516 0.904 290 13.5925861
## 6 >50 other 0 0.000 NA NA
Calculate pop_rate
for vax_status
== other
# calc pop_rate for <50
df_melt$pop_rate[df_melt$vax_status == "other" & df_melt$age == "<50"] <- 1-sum(df_melt[df_melt$age == "<50",]$pop_rate)
# calc pop_rate for >50
df_melt$pop_rate[df_melt$vax_status == "other" & df_melt$age == ">50"] <- 1-sum(df_melt[df_melt$age == ">50",]$pop_rate)
Calculate pop
for vax_status
== other
# calculate pop for <50
df_melt$pop[df_melt$vax_status == "other" & df_melt$age == "<50"] <- (sum(df_melt[df_melt$age == "<50",]$pop) / sum(df_melt[df_melt$age == "<50" & df_melt$vax_status != "other",]$pop_rate)) - sum(df_melt[df_melt$age == "<50",]$pop)
# calculate pop for >50
df_melt$pop[df_melt$vax_status == "other" & df_melt$age == ">50"] <- (sum(df_melt[df_melt$age == ">50",]$pop) / sum(df_melt[df_melt$age == ">50" & df_melt$vax_status != "other",]$pop_rate)) - sum(df_melt[df_melt$age == ">50",]$pop)
df_melt %>%
arrange(age)
## age vax_status pop pop_rate sev_case_100 sev_case_rate_100k
## 1 <50 not_vax 1116834.00 0.233 43 3.8501693
## 2 <50 fully_vax 3501118.00 0.730 11 0.3141854
## 3 <50 other 177429.10 0.037 NA NA
## 4 >50 not_vax 186078.00 0.079 171 91.8969464
## 5 >50 fully_vax 2133516.00 0.904 290 13.5925861
## 6 >50 other 40115.05 0.017 NA NA
tot_pop <- sum(df_melt$pop)
cat("Total population now equates to",tot_pop)
## Total population now equates to 7155090
Unfortunately, the total population even with the missing data in pop_rate
accounted for, does not equate to the ~9mm. This leads me to believe the missing data derived from pop_rate
not equating to 100, are folks not fully vaccinated, but have received at least one dose, and the ~2mm extra are folks ineligible i.e. <12 years of age and religious exempt.
vax_status
Let’s look at the same bar chart with the third categorical variable in vax_status
.
ggplot(df_melt, aes(x = age, y = pop_rate, fill = vax_status)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(x = "Age", y = "Pop rate", subtitle = "Pop rate by Age & Vax status" ) +
scale_fill_discrete(name = "Vax Status", labels = c("Not Vax", "Fully Vax", "Other"))
Now we see the stacked bar charts equate to 1, or 100% of the population. Here we can see the is a small percentage of other
in vaccination status in each age group. These could be folks that were not fully vaccinated but received at least one dose of the COVID-19 vaccine.
Efficacy rate is calculated by taking the risk in the not vaccinated group, minus the risk among the fully vaccinated, divided by the risk among the not vaccinated.
So for >50
folks it would be:
~91 - ~13.5 / ~91 = 0.85
or
85% vaccine efficacy
For folks <50
:
~3.85 - 0.31 / ~3.85 = 0.92
or
92% vaccine efficacy
# create efficacy rate col
df_melt$eff_rate <- ""
# calc efficacy rate for >50
df_melt[df_melt$age == ">50" & df_melt$vax_status != "other",]$eff_rate <- (df_melt[df_melt$age == ">50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k - df_melt[df_melt$age == ">50" & df_melt$vax_status == "fully_vax",]$sev_case_rate_100k) / df_melt[df_melt$age == ">50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k
# calc efficacy rate for <50
df_melt[df_melt$age == "<50" & df_melt$vax_status != "other",]$eff_rate <- (df_melt[df_melt$age == "<50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k - df_melt[df_melt$age == "<50" & df_melt$vax_status == "fully_vax",]$sev_case_rate_100k) / df_melt[df_melt$age == "<50" & df_melt$vax_status == "not_vax",]$sev_case_rate_100k
# convert to numeric
df_melt$eff_rate <- as.numeric(df_melt$eff_rate)
df_melt
## age vax_status pop pop_rate sev_case_100 sev_case_rate_100k eff_rate
## 1 <50 not_vax 1116834.00 0.233 43 3.8501693 0.9183970
## 2 >50 not_vax 186078.00 0.079 171 91.8969464 0.8520888
## 3 <50 fully_vax 3501118.00 0.730 11 0.3141854 0.9183970
## 4 >50 fully_vax 2133516.00 0.904 290 13.5925861 0.8520888
## 5 <50 other 177429.10 0.037 NA NA NA
## 6 >50 other 40115.05 0.017 NA NA NA
Now let’s visualize efficacy rate for severe cases.
ggplot(df_melt, aes(x = eff_rate, y = age, fill = age)) +
geom_bar(stat = 'identity', position = position_dodge()) +
geom_text(aes(label = round(eff_rate,2)), hjust = 1.2, color = "white") +
labs(y = "Age", x = "Efficacy Rate", subtitle = "Efficacy Rate by Age" )
Therefore, we can see the vaccines efficacy rate against severe cases of COVID-19 for folks >50
is 7% less then those <50
. Still good efficacy rates for both age brackets, which would lead me to reccomend vaccination for those inclined
1) Do you have enough information to calculate the total population? What does this total population represent?
I believe I did have enough information to calculate the total population, however when I filled in the missing values it came to ~7mm, and according to Wikipedia it should be ~9mm. So that leads me to believe that the total population prior to the missing values were those not vaccinated or fully vaccinated and the missing part of the population were those partially vaccinated. Whereas the other ~2mm were folks ineligible i.e. <12 years of age or religious exemptions.
2) Calculate the Efficacy vs. Disease; Explain your results.
The calculation of Efficacy vs Disease in this case is only against Severe Disease or those hospitalized due to COVID-19. As explained above, the efficacy rate for those >50
is 85% and those <50
92%. Despite the misleading presentation of the initial data frame, it is still highly recommended to get the vaccine. Both efficacy rates regardless of age bracket, are extremely preventative against severe cases.
3) From your calculation of efficacy vs. disease, are you able to compare the rate of severe cases in unvaccinated individuals to that in vaccinated individuals?
Yes, I was able to compare this prior to obtaining efficacy rate, by doing a reverse calculation for incidence rate. We were given the amount of cases per 100,00 people, therefore imputing the numerator and dividing by population at risk to obtain incidence rate of severe disease per 100,000 people.