Overview
In this assignment, we are provided a spreadsheet that describes August 2021 data for Israeli hospitalization (“Severe Cases”) rates for people under 50 (assume “50 and under”) and over 50, for both un-vaccinated and fully vaccinated populations. We are required to analyze the data, and try to answer the questions below. High level domain knowledge is needed around:
(1) Israel’s total population (per the World Bank it is 9.216 million as of 2020),
(2) Who is eligible to receive vaccinations, and
(3) What does it mean to be fully vaccinated? Please note any apparent discrepancies that you observe in your analysis.
Questions to Answer:
1. Create a .CSV file (or optionally, a relational database!) that includes the information in the spreadsheet. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
3. Perform analysis as described in the spreadsheet and above.
4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: 1) The URL to the .Rmd file in your GitHub repository, and 2) The URL for your rpubs.com web page.
Upon examination the question, we looked more closely at what appears to be the blog that this dataset was posted (see blog at: https://www.covid-datascience.com/post/israeli-data-how-can-efficacy-vs-severe-disease-be-strong-when-60-of-hospitalized-are-vaccinated). In summary, Jeffrey Morris, the author of the blog challenged what he perceived as a distortion of reports suggesting that vaccines do not work or have lost their effectiveness vs. severe diseases. Morris obtained the original dataset posted by the Isreali government. The dataset included age stratified data. Morris translated the header information from Hebrew to English, performed additional calculations to determine population data for all Isrealis at the age of 12 and higher.
As background, in this assignment an excel file was stored in the instructor’s GitHub account and downloaded to my hard drive. We then saved the file as a csv and uploaded into account GitHub accounts. Then, we imported the dataset into R along with the needed libraries.
Loading of Libraries.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
Load the .csv files and start tidying the data.
VaxData = read_csv("https://raw.githubusercontent.com/professorfoy/DATA-607/main/israeli_vaccination_data_analysis_start.csv",show_col_types = FALSE)
## New names:
## * `` -> ...3
## * `` -> ...5
# Start tidying the data
glimpse(VaxData)
## Rows: 19
## Columns: 6
## $ Age <chr> NA, "<50", NA, ">50", NA, NA, NA, NA, "Definitions", NA~
## $ `Population %` <chr> "Not Vax\n%", "1,116,834", "23.3%", "186,078", "7.9%", ~
## $ ...3 <chr> "Fully Vax\n%", "3,501,118", "73.0%", "2,133,516", "90.~
## $ `Severe Cases` <chr> "Not Vax\nper 100K\n\n\np", "43", NA, "171", NA, NA, NA~
## $ ...5 <chr> "Fully Vax\nper 100K", "11", NA, "290", NA, NA, NA, NA,~
## $ Efficacy <chr> "vs. severe disease", NA, NA, NA, NA, NA, NA, NA, NA, N~
#### The original .xlsx dataset contained some headers that were merged across two columns. For exammple, columns two and three contained the header "Population %". Upon conversion from the .xlsx to a .csv file, the merged text unmerged, The Population % entered into the second column A blank cell was created in the third column This also happened with column four and five, where "Severe Cases" separated into column four and a blank cell in column five. As the .csv dataset was loaded into RStudio, and the glimpse command initialted, RStudio created two new column names. The third column was named "...3" and the fifth column remamed "...5". We continued the data cleaning process by renaming the column headers.
# rename the column names
VaxData <- rename(VaxData, Not_Vax = `Population %`)
VaxData <- rename(VaxData, Fully_Vax = `...3`)
VaxData <- rename(VaxData, Not_Vax_Per_100k = `Severe Cases`)
VaxData <- rename(VaxData, Fully_Vax_Per_100k = `...5`)
VaxData <- rename(VaxData, Efficacy_vs_Severe_Disease = `Efficacy`)
We decided to remove uneeded rows. We did this by simply removing rows one and two and all rows greater than or equal to the fifth row.
# remove unneeded rows
VaxData <- VaxData %>%
filter(row(VaxData) >= 2 & row(VaxData) <=5)
We can view the dataset without the uneccesary rows as the cleaning process develops.
view(VaxData)
str(VaxData)
## spec_tbl_df [4 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:4] "<50" NA ">50" NA
## $ Not_Vax : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
## $ Fully_Vax : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
## $ Not_Vax_Per_100k : chr [1:4] "43" NA "171" NA
## $ Fully_Vax_Per_100k : chr [1:4] "11" NA "290" NA
## $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
When cleaning all the uneccessary rows, the percentages of the population of vax verse nonvax persons for the categories of under and over 50 years old appear as NA under the Age category. To address this, we added a ‘<50’ and ‘>50’ character strings.
VaxData$Age[[2]] <- c('<50')
VaxData$Age[[4]] <- c('>50')
We can view the dataset without the uneccesary rows as the cleaning process develops.
view(VaxData)
str(VaxData)
## spec_tbl_df [4 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:4] "<50" "<50" ">50" ">50"
## $ Not_Vax : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
## $ Fully_Vax : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
## $ Not_Vax_Per_100k : chr [1:4] "43" NA "171" NA
## $ Fully_Vax_Per_100k : chr [1:4] "11" NA "290" NA
## $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We created two additional columns, Not_Vax_% and Fully_Vax_% to set up the ability to transpose the percentages rows to columns.
VaxData <- VaxData %>%
mutate(`Not_Vax_%` = 0)
VaxData <- VaxData %>%
mutate(`Fully_Vax_%` = 0)
view(VaxData)
str(VaxData)
## spec_tbl_df [4 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:4] "<50" "<50" ">50" ">50"
## $ Not_Vax : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
## $ Fully_Vax : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
## $ Not_Vax_Per_100k : chr [1:4] "43" NA "171" NA
## $ Fully_Vax_Per_100k : chr [1:4] "11" NA "290" NA
## $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
## $ Not_Vax_% : num [1:4] 0 0 0 0
## $ Fully_Vax_% : num [1:4] 0 0 0 0
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The percentages in their own rows was deemed uneccesary. We copied the percentages in row two from the Not_Vax column and copied it to the first row of the Not_Vax_% column; and copied the percentages in row four from the Fully_Vax column and copied it to the third row of the Fully_Vax_% columnn.
VaxData$`Not_Vax_%`[[1]] <- VaxData$Not_Vax[[2]]
VaxData$`Fully_Vax_%`[[1]] <- VaxData$Fully_Vax[[2]]
VaxData$`Not_Vax_%`[[3]] <- VaxData$Not_Vax[[4]]
VaxData$`Fully_Vax_%`[[3]] <- VaxData$Fully_Vax[[4]]
view(VaxData)
str(VaxData)
## spec_tbl_df [4 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:4] "<50" "<50" ">50" ">50"
## $ Not_Vax : chr [1:4] "1,116,834" "23.3%" "186,078" "7.9%"
## $ Fully_Vax : chr [1:4] "3,501,118" "73.0%" "2,133,516" "90.4%"
## $ Not_Vax_Per_100k : chr [1:4] "43" NA "171" NA
## $ Fully_Vax_Per_100k : chr [1:4] "11" NA "290" NA
## $ Efficacy_vs_Severe_Disease: chr [1:4] NA NA NA NA
## $ Not_Vax_% : chr [1:4] "23.3%" "0" "7.9%" "0"
## $ Fully_Vax_% : chr [1:4] "73.0%" "0" "90.4%" "0"
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We removed the second and fourth rows by filtering out the percentages rows.
VaxData <- VaxData %>%
filter(row(VaxData) == 1 | row(VaxData) == 3)
view(VaxData)
str(VaxData)
## spec_tbl_df [2 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:2] "<50" ">50"
## $ Not_Vax : chr [1:2] "1,116,834" "186,078"
## $ Fully_Vax : chr [1:2] "3,501,118" "2,133,516"
## $ Not_Vax_Per_100k : chr [1:2] "43" "171"
## $ Fully_Vax_Per_100k : chr [1:2] "11" "290"
## $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
## $ Not_Vax_% : chr [1:2] "23.3%" "7.9%"
## $ Fully_Vax_% : chr [1:2] "73.0%" "90.4%"
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We created columns to accommodate the normalized rates per 100,000.
VaxData <- VaxData %>%
mutate(`Not_Vax_Normalized` = 0)
VaxData <- VaxData %>%
mutate(`Fully_Vax_Normalized` = 0)
view(VaxData)
str(VaxData)
## spec_tbl_df [2 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:2] "<50" ">50"
## $ Not_Vax : chr [1:2] "1,116,834" "186,078"
## $ Fully_Vax : chr [1:2] "3,501,118" "2,133,516"
## $ Not_Vax_Per_100k : chr [1:2] "43" "171"
## $ Fully_Vax_Per_100k : chr [1:2] "11" "290"
## $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
## $ Not_Vax_% : chr [1:2] "23.3%" "7.9%"
## $ Fully_Vax_% : chr [1:2] "73.0%" "90.4%"
## $ Not_Vax_Normalized : num [1:2] 0 0
## $ Fully_Vax_Normalized : num [1:2] 0 0
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
To calculate the normalized rates, we converted the Fully_Vax_Per_100k, Not_Vax_Per_100k, Fully_Vax, and Not_Vax from characters to numeric values, removed the commas in the numbers with the gsub command, and then created new objects FullyVaxed_Normal_Population and NotVaxed_Normal_Population.
VaxData$Fully_Vax_Per_100k <- as.numeric(VaxData$Fully_Vax_Per_100k)
VaxData$Not_Vax_Per_100k <- as.numeric((VaxData$Not_Vax_Per_100k))
VaxData$Fully_Vax <- gsub(",", "", VaxData$Fully_Vax)
VaxData$Fully_Vax <- as.numeric(VaxData$Fully_Vax)
VaxData$Not_Vax <- gsub(",", "", VaxData$Not_Vax)
VaxData$Not_Vax <- as.numeric(VaxData$Not_Vax)
Fully_Vax_Normalized <- VaxData$Fully_Vax_Per_100k/((VaxData$Fully_Vax/100000))
Not_Vax_Normalized <- VaxData$Not_Vax_Per_100k/((VaxData$Not_Vax/100000))
Fully_Vax_Normalized
## [1] 0.3141854 13.5925861
Not_Vax_Normalized
## [1] 3.850169 91.896946
view(VaxData)
str(VaxData)
## spec_tbl_df [2 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:2] "<50" ">50"
## $ Not_Vax : num [1:2] 1116834 186078
## $ Fully_Vax : num [1:2] 3501118 2133516
## $ Not_Vax_Per_100k : num [1:2] 43 171
## $ Fully_Vax_Per_100k : num [1:2] 11 290
## $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
## $ Not_Vax_% : chr [1:2] "23.3%" "7.9%"
## $ Fully_Vax_% : chr [1:2] "73.0%" "90.4%"
## $ Not_Vax_Normalized : num [1:2] 0 0
## $ Fully_Vax_Normalized : num [1:2] 0 0
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The Fully_Vax_Normalized and Not_Vax_Normalized have zero values which we converted to NA values, then replaced them with the Normalized values. This step sets up the calculation of the Efficacy_vs_Severe_Disease ratios.
VaxData$Fully_Vax_Normalized[VaxData$Fully_Vax_Normalized == 0] <- NA
VaxData$Not_Vax_Normalized[VaxData$Not_Vax_Normalized == 0] <- NA
VaxData$Fully_Vax_Normalized[which(is.na(VaxData$Fully_Vax_Normalized))] <- Fully_Vax_Normalized
VaxData$Not_Vax_Normalized[which(is.na(VaxData$Not_Vax_Normalized))] <- Not_Vax_Normalized
VaxData$Fully_Vax_Normalized <- round(VaxData$Fully_Vax_Normalized, 2)
VaxData$Not_Vax_Normalized <- round(VaxData$Not_Vax_Normalized, 2)
view(VaxData)
str(VaxData, digits = 2)
## spec_tbl_df [2 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : chr [1:2] "<50" ">50"
## $ Not_Vax : num [1:2] 1116834 186078
## $ Fully_Vax : num [1:2] 3501118 2133516
## $ Not_Vax_Per_100k : num [1:2] 43 171
## $ Fully_Vax_Per_100k : num [1:2] 11 290
## $ Efficacy_vs_Severe_Disease: chr [1:2] NA NA
## $ Not_Vax_% : chr [1:2] "23.3%" "7.9%"
## $ Fully_Vax_% : chr [1:2] "73.0%" "90.4%"
## $ Not_Vax_Normalized : num [1:2] 3.8 91.9
## $ Fully_Vax_Normalized : num [1:2] 0.31 13.59
## - attr(*, "spec")=
## .. cols(
## .. Age = col_character(),
## .. `Population %` = col_character(),
## .. ...3 = col_character(),
## .. `Severe Cases` = col_character(),
## .. ...5 = col_character(),
## .. Efficacy = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We created a tibble to view the final outcome.
head(VaxData)
## # A tibble: 2 x 10
## Age Not_Vax Fully_Vax Not_Vax_Per_100k Fully_Vax_Per_100k Efficacy_vs_Sever~
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <50 1116834 3501118 43 11 0.92
## 2 >50 186078 2133516 171 290 0.85
## # ... with 4 more variables: Not_Vax_% <chr>, Fully_Vax_% <chr>,
## # Not_Vax_Normalized <dbl>, Fully_Vax_Normalized <dbl>