DATA 607 Assignment Four

Overview

In this assignment, we are provided a spreadsheet that describes August 2021 data for Israeli hospitalization (“Severe Cases”) rates for people under 50 (assume “50 and under”) and over 50, for both un-vaccinated and fully vaccinated populations. We are required to analyze the data, and try to answer the questions below. High level domain knowledge is needed around:

Israel’s total population (per the World Bank it is 9.216 million as of 2020),
Who is eligible to receive vaccinations, and
What does it mean to be fully vaccinated? Please note any apparent discrepancies that you observe in your analysis.

Questions to Answer:

Create a .CSV file (or optionally, a relational database!) that includes the information in the spreadsheet. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis as described in the spreadsheet and above.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: 1) The URL to the .Rmd file in your GitHub repository, and 2) The URL for your rpubs.com web page.

Upon examination the question, we looked more closely at what appears to be the blog that this dataset was posted (see blog at: https://www.covid-datascience.com/post/israeli-data-how-can-efficacy-vs-severe-disease-be-strong-when-60-of-hospitalized-are-vaccinated). In summary, Jeffrey Morris, the author of the blog challenged what he perceived as a distortion of reports suggesting that vaccines do not work or have lost their effectiveness vs. severe diseases. Morris obtained the original dataset posted by the Isreali government. The dataset included age stratified data. Morris translated the header information from Hebrew to English, performed additional calculations to determine population data for all Isrealis at the age of 12 and higher.

As background, in this assignment an excel file was stored in the instructor’s GitHub account and downloaded to my hard drive. We then saved the file as a csv and uploaded into account GitHub accounts. Then, we imported the dataset into R along with the needed libraries.

Analysis

Loading of Libraries.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(ggplot2)

Load the .csv files and start tidying the data.

VaxData  = read_csv("https://raw.githubusercontent.com/professorfoy/DATA-607/main/israeli_vaccination_data_analysis_start.csv",show_col_types = FALSE)

## New names:
## * `` -> ...3
## * `` -> ...5

# Start tidying the data
glimpse(VaxData)

## Rows: 19
## Columns: 6
## $ Age            <chr> NA, "<50", NA, ">50", NA, NA, NA, NA, "Definitions", NA…
## $ `Population %` <chr> "Not Vax\n%", "1,116,834", "23.3%", "186,078", "7.9%", …
## $ ...3           <chr> "Fully Vax\n%", "3,501,118", "73.0%", "2,133,516", "90.…
## $ `Severe Cases` <chr> "Not Vax\nper 100K\n\n\np", "43", NA, "171", NA, NA, NA…
## $ ...5           <chr> "Fully Vax\nper 100K", "11", NA, "290", NA, NA, NA, NA,…
## $ Efficacy       <chr> "vs. severe disease", NA, NA, NA, NA, NA, NA, NA, NA, N…

#### The original .xlsx dataset contained some headers that were merged across two columns.  For exammple, columns two and three contained the header "Population %".  Upon conversion from the .xlsx to a .csv file, the merged text unmerged,  The Population % entered into the second column A blank cell was created in the third column This also happened with column four and five, where "Severe Cases" separated into column four and a blank cell in column five.  As the .csv dataset was loaded into RStudio, and the glimpse command initialted, RStudio created two new column names.  The third column was named "...3" and the fifth column remamed "...5".  We continued the data cleaning process by renaming the column headers.

# rename the column names
VaxData <- rename(VaxData, Not_Vax = `Population %`)
VaxData <- rename(VaxData, Fully_Vax = `...3`)
VaxData <- rename(VaxData, Not_Vax_Per_100k = `Severe Cases`)
VaxData <- rename(VaxData, Fully_Vax_Per_100k = `...5`)
VaxData <- rename(VaxData, Efficacy_vs_Severe_Disease = `Efficacy`)

We can view the dataset with new headers as the cleaning process develops.

view(VaxData)
str(VaxData)

## spec_tbl_df [19 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:19] NA "<50" NA ">50" ...
##  $ Not_Vax                   : chr [1:19] "Not Vax\n%" "1,116,834" "23.3%" "186,078" ...
##  $ Fully_Vax                 : chr [1:19] "Fully Vax\n%" "3,501,118" "73.0%" "2,133,516" ...
##  $ Not_Vax_Per_100k          : chr [1:19] "Not Vax\nper 100K\n\n\np" "43" NA "171" ...
##  $ Fully_Vax_Per_100k        : chr [1:19] "Fully Vax\nper 100K" "11" NA "290" ...
##  $ Efficacy_vs_Severe_Disease: chr [1:19] "vs. severe disease" NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We decided to remove uneeded rows. We did this by simply removing rows one and two and all rows greater than or equal to the fifth row.

# remove unneeded rows
VaxData <- VaxData %>%
  filter(row(VaxData) >= 2 & row(VaxData) <=5)

We can view the dataset without the uneccesary rows as the cleaning process develops.

view(VaxData)
str(VaxData)

## spec_tbl_df [4 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:4] "<50" NA ">50" NA
##  $ Not_Vax                   : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
##  $ Fully_Vax                 : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
##  $ Not_Vax_Per_100k          : chr [1:4] "43" NA "171" NA
##  $ Fully_Vax_Per_100k        : chr [1:4] "11" NA "290" NA
##  $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

When cleaning all the uneccessary rows, the percentages of the population of vax verse nonvax persons for the categories of under and over 50 years old appear as NA under the Age category. To address this, we added a ‘<50’ and ‘>50’ character strings.

VaxData$Age[[2]] <- c('<50')
VaxData$Age[[4]] <- c('>50')

We can view the dataset without the uneccesary rows as the cleaning process develops.

view(VaxData)
str(VaxData)

## spec_tbl_df [4 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:4] "<50" "<50" ">50" ">50"
##  $ Not_Vax                   : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
##  $ Fully_Vax                 : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
##  $ Not_Vax_Per_100k          : chr [1:4] "43" NA "171" NA
##  $ Fully_Vax_Per_100k        : chr [1:4] "11" NA "290" NA
##  $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We created two additional columns, Not_Vax_% and Fully_Vax_% to set up the ability to transpose the percentages rows to columns.

VaxData <- VaxData %>%
    mutate(`Not_Vax_%` = 0)

VaxData <- VaxData %>%
    mutate(`Fully_Vax_%` = 0)
view(VaxData)
str(VaxData)

## spec_tbl_df [4 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:4] "<50" "<50" ">50" ">50"
##  $ Not_Vax                   : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
##  $ Fully_Vax                 : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
##  $ Not_Vax_Per_100k          : chr [1:4] "43" NA "171" NA
##  $ Fully_Vax_Per_100k        : chr [1:4] "11" NA "290" NA
##  $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
##  $ Not_Vax_%                 : num [1:4] 0 0 0 0
##  $ Fully_Vax_%               : num [1:4] 0 0 0 0
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

The percentages in their own rows was deemed uneccesary. We copied the percentages in row two from the Not_Vax column and copied it to the first row of the Not_Vax_% column; and copied the percentages in row four from the Fully_Vax column and copied it to the third row of the Fully_Vax_% columnn.

VaxData$`Not_Vax_%`[[1]] <- VaxData$Not_Vax[[2]]
VaxData$`Fully_Vax_%`[[1]] <- VaxData$Fully_Vax[[2]]

VaxData$`Not_Vax_%`[[3]] <- VaxData$Not_Vax[[4]]
VaxData$`Fully_Vax_%`[[3]] <- VaxData$Fully_Vax[[4]]
view(VaxData)
str(VaxData)

## spec_tbl_df [4 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:4] "<50" "<50" ">50" ">50"
##  $ Not_Vax                   : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
##  $ Fully_Vax                 : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
##  $ Not_Vax_Per_100k          : chr [1:4] "43" NA "171" NA
##  $ Fully_Vax_Per_100k        : chr [1:4] "11" NA "290" NA
##  $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
##  $ Not_Vax_%                 : chr [1:4] "23.3%" "0" "7.9%" "0"
##  $ Fully_Vax_%               : chr [1:4] "73.0%" "0" "90.4%" "0"
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We removed the second and fourth rows by filtering out the percentages rows.

VaxData <- VaxData %>%
  filter(row(VaxData) == 1 | row(VaxData) == 3)
view(VaxData)
str(VaxData)

## spec_tbl_df [2 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:2] "<50" ">50"
##  $ Not_Vax                   : chr [1:2] "1,116,834" "186,078"
##  $ Fully_Vax                 : chr [1:2] "3,501,118" "2,133,516"
##  $ Not_Vax_Per_100k          : chr [1:2] "43" "171"
##  $ Fully_Vax_Per_100k        : chr [1:2] "11" "290"
##  $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
##  $ Not_Vax_%                 : chr [1:2] "23.3%" "7.9%"
##  $ Fully_Vax_%               : chr [1:2] "73.0%" "90.4%"
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We created columns to accommodate the normalized rates per 100,000.

VaxData <- VaxData %>%
    mutate(`Not_Vax_Normalized` = 0)

VaxData <- VaxData %>%
    mutate(`Fully_Vax_Normalized` = 0)
view(VaxData)
str(VaxData)

## spec_tbl_df [2 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:2] "<50" ">50"
##  $ Not_Vax                   : chr [1:2] "1,116,834" "186,078"
##  $ Fully_Vax                 : chr [1:2] "3,501,118" "2,133,516"
##  $ Not_Vax_Per_100k          : chr [1:2] "43" "171"
##  $ Fully_Vax_Per_100k        : chr [1:2] "11" "290"
##  $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
##  $ Not_Vax_%                 : chr [1:2] "23.3%" "7.9%"
##  $ Fully_Vax_%               : chr [1:2] "73.0%" "90.4%"
##  $ Not_Vax_Normalized        : num [1:2] 0 0
##  $ Fully_Vax_Normalized      : num [1:2] 0 0
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

To calculate the normalized rates, we converted the Fully_Vax_Per_100k, Not_Vax_Per_100k, Fully_Vax, and Not_Vax from characters to numeric values, removed the commas in the numbers with the gsub command, and then created new objects FullyVaxed_Normal_Population and NotVaxed_Normal_Population.

VaxData$Fully_Vax_Per_100k <- as.numeric(VaxData$Fully_Vax_Per_100k)
VaxData$Not_Vax_Per_100k <- as.numeric((VaxData$Not_Vax_Per_100k))
VaxData$Fully_Vax <- gsub(",", "", VaxData$Fully_Vax)
VaxData$Fully_Vax <- as.numeric(VaxData$Fully_Vax)
VaxData$Not_Vax <- gsub(",", "", VaxData$Not_Vax)
VaxData$Not_Vax <- as.numeric(VaxData$Not_Vax)

Fully_Vax_Normalized <- VaxData$Fully_Vax_Per_100k/((VaxData$Fully_Vax/100000))
Not_Vax_Normalized <- VaxData$Not_Vax_Per_100k/((VaxData$Not_Vax/100000))

Fully_Vax_Normalized

## [1]  0.3141854 13.5925861

Not_Vax_Normalized

## [1]  3.850169 91.896946

view(VaxData)
str(VaxData)

## spec_tbl_df [2 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:2] "<50" ">50"
##  $ Not_Vax                   : num [1:2] 1116834 186078
##  $ Fully_Vax                 : num [1:2] 3501118 2133516
##  $ Not_Vax_Per_100k          : num [1:2] 43 171
##  $ Fully_Vax_Per_100k        : num [1:2] 11 290
##  $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
##  $ Not_Vax_%                 : chr [1:2] "23.3%" "7.9%"
##  $ Fully_Vax_%               : chr [1:2] "73.0%" "90.4%"
##  $ Not_Vax_Normalized        : num [1:2] 0 0
##  $ Fully_Vax_Normalized      : num [1:2] 0 0
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

The Fully_Vax_Normalized and Not_Vax_Normalized have zero values which we converted to NA values, then replaced them with the Normalized values. This step sets up the calculation of the Efficacy_vs_Severe_Disease ratios.

VaxData$Fully_Vax_Normalized[VaxData$Fully_Vax_Normalized == 0] <- NA
VaxData$Not_Vax_Normalized[VaxData$Not_Vax_Normalized == 0] <- NA

VaxData$Fully_Vax_Normalized[which(is.na(VaxData$Fully_Vax_Normalized))] <- Fully_Vax_Normalized
VaxData$Not_Vax_Normalized[which(is.na(VaxData$Not_Vax_Normalized))] <- Not_Vax_Normalized

VaxData$Fully_Vax_Normalized <- round(VaxData$Fully_Vax_Normalized, 2)
VaxData$Not_Vax_Normalized <- round(VaxData$Not_Vax_Normalized, 2)

view(VaxData)
str(VaxData, digits = 2)

## spec_tbl_df [2 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:2] "<50" ">50"
##  $ Not_Vax                   : num [1:2] 1116834 186078
##  $ Fully_Vax                 : num [1:2] 3501118 2133516
##  $ Not_Vax_Per_100k          : num [1:2] 43 171
##  $ Fully_Vax_Per_100k        : num [1:2] 11 290
##  $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
##  $ Not_Vax_%                 : chr [1:2] "23.3%" "7.9%"
##  $ Fully_Vax_%               : chr [1:2] "73.0%" "90.4%"
##  $ Not_Vax_Normalized        : num [1:2] 3.9 91.9
##  $ Fully_Vax_Normalized      : num [1:2] 0.31 13.59
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Now still have NA values in the Efficacy_vs_Severe_Disease column. We need to replace the NA values with a formula to calculate the Efficacy_vs_Severe_Disease rates. Per the Morris analysis, we used the formula 1 - V/N, where V=rate of infection per 100k for fully vaccinated, N=rate of infection per 100k for unvaccinated. We created a new object, Efficacy_vs_Severe_Disease and replaced the NA values with the ratios for the Efficacy_vs_Severe_Disease column.

Efficacy_vs_Severe_Disease <- round(1 - (Fully_Vax_Normalized / Not_Vax_Normalized), 2)

VaxData$Efficacy_vs_Severe_Disease <- Efficacy_vs_Severe_Disease

view(VaxData)
str(VaxData)

## spec_tbl_df [2 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age                       : chr [1:2] "<50" ">50"
##  $ Not_Vax                   : num [1:2] 1116834 186078
##  $ Fully_Vax                 : num [1:2] 3501118 2133516
##  $ Not_Vax_Per_100k          : num [1:2] 43 171
##  $ Fully_Vax_Per_100k        : num [1:2] 11 290
##  $ Efficacy_vs_Severe_Disease: num [1:2] 0.92 0.85
##  $ Not_Vax_%                 : chr [1:2] "23.3%" "7.9%"
##  $ Fully_Vax_%               : chr [1:2] "73.0%" "90.4%"
##  $ Not_Vax_Normalized        : num [1:2] 3.85 91.9
##  $ Fully_Vax_Normalized      : num [1:2] 0.31 13.59
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_character(),
##   ..   `Population %` = col_character(),
##   ..   ...3 = col_character(),
##   ..   `Severe Cases` = col_character(),
##   ..   ...5 = col_character(),
##   ..   Efficacy = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We created a tibble to view the final outcome.

head(VaxData)

## # A tibble: 2 × 10
##   Age   Not_Vax Fully_Vax Not_Vax_Per_100k Fully_Vax_Per_100k Efficacy_vs_Sever…
##   <chr>   <dbl>     <dbl>            <dbl>              <dbl>              <dbl>
## 1 <50   1116834   3501118               43                 11               0.92
## 2 >50    186078   2133516              171                290               0.85
## # … with 4 more variables: Not_Vax_% <chr>, Fully_Vax_% <chr>,
## #   Not_Vax_Normalized <dbl>, Fully_Vax_Normalized <dbl>

Questions & Answers

Question One: Do you have enough information to calculate the population? What does this total population represent?

Answer One: In short, there is not enough information to calculate the population. There are a few issues with Morris’ presentation in a summary table. First, the data that he downloaded was stratified in age groups. The data included stata at 40-49 and 50-59. He presented his table for all below and above 50, but his data did included 50 year old persons. In his table, the age cutoff point should have been presented as greater than or equal to 50. The second issue is the population count. The original data excludes persons under 12 years old. That would partially reconcile the difference between the World Bank estimated population of Isreal at 9.216 million and the total population of everyone vaccinated or not vaccinated, under 50 or over 50 (total = 6,937,546) on the Morris table. However, this reported data is not accurate. Morris presents his table in two segments, either vaccinated or not vaccinated. He did not report on partially vaccinated persons, which he did disclose in his blog.

The original data set reported by the Isreali government did include partially vaccinated persons. When Morris reported the percentages of persons vaccinated or not, in both age groups, the reported percentage did not included partially vaccinated persons. Morris reported that 78.7% of those under 50 years old were fully vaccinated, but he included partially vaccinated in his calculation. From the information presented, the total population appears to be 6,937,546, but when using that figure, the total vaccinated population percentage is 81.2% (5,634,634/6,937,546). The inclusion of the partially vaccinated population should have been reported so the reader could arrive at a total population figure of 7,156,910, instead of 6,937,546. Morris should had considered inclusion of partially vaccinated as not being vaccinated and include those numbers or report partially vaccinated as well as no vaccination and fully vaccinated numbers. In the United States, until you have all shots, one is not fully vaccinated and considered not vaccinated. If Isreal follows the same policy, all the percentage figures that Morris reported may be considered misleading. The fully vaccinated percentages could bolster a case for further vaccine effectiveness.

Question Two: Calculate the Efficacy vs. Disease; Explain your results.

Answer Two: The table above summarizes the Efficacy vs. Disease calculations. For the population under 50 years old, the effectiveness of the vaccination is higher than those over 50 years. The results show that the vaccines are very effective. The effectiveness is lesser for those over 50 and can be explained due to the fact that those over 50 are more prone to serious diseases.

Question Three: From your calculation of efficacy vs. disease, are you able to compare the rate of severe cases in unvaccinated individuals to that in vaccinated individuals?

Yes, we can compare the rates. See the calculations above.