Assignment – Tidying and Transforming Vaccination Data

The chart above describes August 2021 data for Israeli hospitalization (“Severe Cases”) rates for people under 50 (assume “50 and under”) and over 50, for both unvaccinated and fully vaccinated populations. Analyze the data, and try to answer the questions posed in the spreadsheet. You’ll need some high-level domain knowledge around (1) Israel’s total population, (2) who is eligible to receive vaccinations, and (3) what it means to be fully vaccinated. Please note any apparent discrepancies that you observe in your analysis.

  • (1) Create a .CSV file (or optionally, a relational database!) that includes all the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  • (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  • (3) Perform analysis as described in the spreadsheet and above.

  • (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:

    • The URL to the .Rmd file in your GitHub repository, and
    • The URL for your rpubs.com web page.
#Import libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(tidyr)
library(stringr)
#Load the .csv file
theUrl <- "https://raw.githubusercontent.com/letisalba/Data607_Assignment_Week5/main/Israel_Vax_Data.csv"

israel_vax <- read.csv(file = theUrl, header = TRUE, sep = ",", na.strings = ".", skip = 1)
## Warning in read.table(file = file, header = header, sep = sep, quote
## = quote, : incomplete final line found by readTableHeader on 'https://
## raw.githubusercontent.com/letisalba/Data607_Assignment_Week5/main/
## Israel_Vax_Data.csv'
head(israel_vax)
##      X Not.Vax.. Fully.Vax.. Not.Vax.per.100K Fully.Vax.per.100K
## 1 < 50 1,116,834   3,501,118               43                 11
## 2         23.30%      73.00%               NA                 NA
## 3 > 50   186,078   2,133,516              171                290
## 4          7.90%      90.40%               NA                 NA
##   vs..severe.disease
## 1                 NA
## 2                 NA
## 3                 NA
## 4                 NA
glimpse(israel_vax)
## Rows: 4
## Columns: 6
## $ X                  <chr> "< 50", "", "> 50", ""
## $ Not.Vax..          <chr> "1,116,834", "23.30%", "186,078", "7.90%"
## $ Fully.Vax..        <chr> "3,501,118", "73.00%", "2,133,516", "90.40%"
## $ Not.Vax.per.100K   <int> 43, NA, 171, NA
## $ Fully.Vax.per.100K <int> 11, NA, 290, NA
## $ vs..severe.disease <lgl> NA, NA, NA, NA
Tidy Data

The goal when tidying this data set is to separate the rows into not vaccinated vs. fully vaccinated categories. Once that is done, I can combine the appropriate rows into new columns for age, vaccination status, population, percent of population, and severe cases within each category.
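For comparison, the same goal can be sketched with tidyr and dplyr verbs instead of manual re-entry. This is only an illustration, not the method used in this report: the data frame `raw` is re-typed by hand from the `head(israel_vax)` output above, and every column name in it (`NotVax`, `Severe_NotVax`, and so on) is a hypothetical name chosen so that `pivot_longer()` can split it on the underscore.

```r
library(dplyr)
library(tidyr)

# Hand-typed stand-in for the imported CSV (values copied from the output above)
raw <- data.frame(
  Age             = c("< 50", "", "> 50", ""),
  NotVax          = c("1,116,834", "23.30%", "186,078", "7.90%"),
  FullyVax        = c("3,501,118", "73.00%", "2,133,516", "90.40%"),
  Severe_NotVax   = c(43L, NA, 171L, NA),
  Severe_FullyVax = c(11L, NA, 290L, NA),
  stringsAsFactors = FALSE
)

# Carry each age label down onto the percentage row beneath it
filled <- raw %>%
  mutate(Age = na_if(Age, "")) %>%
  fill(Age)

# Rows that hold counts also hold the severe case numbers
counts <- filled %>%
  filter(!is.na(Severe_NotVax)) %>%
  rename(Population_NotVax = NotVax, Population_FullyVax = FullyVax)

# Rows that hold percentages have NA in the severe case columns
percents <- filled %>%
  filter(is.na(Severe_NotVax)) %>%
  select(Age, Percent_NotVax = NotVax, Percent_FullyVax = FullyVax)

# One row per age group and vaccination status
tidy_vax <- counts %>%
  inner_join(percents, by = "Age") %>%
  pivot_longer(
    cols = -Age,
    names_to = c(".value", "Vaccination_Status"),
    names_sep = "_"
  ) %>%
  mutate(
    Population = as.numeric(gsub(",", "", Population)),
    Percent    = as.numeric(sub("%", "", Percent))
  )

tidy_vax
```

Because the vaccination status comes from the column names, each severe case number stays attached to its own age group and status without being re-entered by hand.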

#Assign placeholder column names
colnames(israel_vax) <- c("X1", "X2", "X3", "X4", "X5", "X6")
israel_vax
##     X1        X2        X3  X4  X5 X6
## 1 < 50 1,116,834 3,501,118  43  11 NA
## 2         23.30%    73.00%  NA  NA NA
## 3 > 50   186,078 2,133,516 171 290 NA
## 4          7.90%    90.40%  NA  NA NA
#Add ID column 1 to 4 before column X1
israel_vax2 <- israel_vax %>% 
  add_column(X0 = 1:4, .before = "X1")
israel_vax2
##   X0   X1        X2        X3  X4  X5 X6
## 1  1 < 50 1,116,834 3,501,118  43  11 NA
## 2  2         23.30%    73.00%  NA  NA NA
## 3  3 > 50   186,078 2,133,516 171 290 NA
## 4  4          7.90%    90.40%  NA  NA NA
#Extract all Percentages
percent_pop <- unlist(str_extract_all(israel_vax2, "\\d+(\\.\\d+){0,1}%"))
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing
#Separate using ","
paste( unlist(percent_pop), collapse = ',')
## [1] "23.30%,7.90%,73.00%,90.40%"
#Create df
percent_pop <- data.frame(matrix(unlist(percent_pop), nrow=length(percent_pop), byrow=TRUE),stringsAsFactors=FALSE)
percent_pop
##   matrix.unlist.percent_pop...nrow...length.percent_pop...byrow...TRUE.
## 1                                                                23.30%
## 2                                                                 7.90%
## 3                                                                73.00%
## 4                                                                90.40%
#Rename column
colnames(percent_pop) <- c("Percent_from_Pop")

#Add column ID
percent_pop <- percent_pop %>% 
  add_column(ID = 1:4, .before = "Percent_from_Pop")
percent_pop
##   ID Percent_from_Pop
## 1  1           23.30%
## 2  2            7.90%
## 3  3           73.00%
## 4  4           90.40%
#Arrange percentages so each one lines up with its population count
percent_pop2 <- percent_pop %>% 
  arrange(factor(Percent_from_Pop, levels = c("23.30%", "73.00%", "7.90%", "90.40%")))

#Create new column and identify the values
x <- c("1,116,834", "3,501,118", "186,078", "2,133,516")
percent_pop2 <- percent_pop2 %>% 
  add_column(Population = x, .before = "Percent_from_Pop")
percent_pop2
##   ID Population Percent_from_Pop
## 1  1  1,116,834           23.30%
## 2  3  3,501,118           73.00%
## 3  2    186,078            7.90%
## 4  4  2,133,516           90.40%
#Extract all severe case numbers
#Note: this regex matches every run of 2-3 digits, including pieces of the population counts and percentages, so the extra matches have to be dropped in the steps below
Severe_Cases <- unlist(str_extract_all(israel_vax2, "[:digit:]{2,3}"))
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing
Severe_Cases
##  [1] "50"  "50"  "116" "834" "23"  "30"  "186" "078" "90"  "501" "118" "73" 
## [13] "00"  "133" "516" "90"  "40"  "43"  "171" "11"  "290"
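The extra matches above could be avoided in a couple of ways, sketched here on small stand-in vectors. The names `x4`, `x5`, and `cells` are hypothetical and are re-typed from the output above rather than taken from `israel_vax2`. The first option uses the fact that the severe case columns are already integers. The second anchors the regex so that only whole cells of two or three digits match.

```r
library(stringr)

# Stand-ins for the two severe case columns (X4 and X5) shown above
x4 <- c(43L, NA, 171L, NA)
x5 <- c(11L, NA, 290L, NA)

# Option 1: the columns are already integers, so just drop the NAs
severe_direct <- c(x4[!is.na(x4)], x5[!is.na(x5)])
severe_direct

# Option 2: anchor the pattern with ^ and $ so partial matches are rejected
cells <- c("< 50", "1,116,834", "23.30%", "43", "171", "11", "290")
severe_regex <- str_subset(cells, "^\\d{2,3}$")
severe_regex
```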
#Convert into df
severe_cases <- data.frame(matrix(unlist(Severe_Cases), nrow=length(Severe_Cases), byrow=TRUE),stringsAsFactors=FALSE)

#Add a new column ID
severe_cases2 <- severe_cases %>% 
  add_column(ID = 1:21, .before = "matrix.unlist.Severe_Cases...nrow...length.Severe_Cases...byrow...TRUE.")

#Drop rows not needed (keep only the last four matches)
severe_cases3 <- severe_cases2[-c(1:17),] 
severe_cases3
##    ID matrix.unlist.Severe_Cases...nrow...length.Severe_Cases...byrow...TRUE.
## 18 18                                                                      43
## 19 19                                                                     171
## 20 20                                                                      11
## 21 21                                                                     290
#Create column names for new df
colnames(severe_cases3) <- c("ID2","Severe Cases")

severe_cases3
##    ID2 Severe Cases
## 18  18           43
## 19  19          171
## 20  20           11
## 21  21          290
#cbind tables 
vaccination_israel <-  cbind(israel_vax2, percent_pop2, severe_cases3)
#Drop columns not needed in df
vaccination_israel <- vaccination_israel[, -c(1, 3:8, 11)]

#Assign column names
colnames(vaccination_israel) <- c("Age","Population" ,"Percent_from_Pop", "Severe Cases")

vaccination_israel
##     Age Population Percent_from_Pop Severe Cases
## 18 < 50  1,116,834           23.30%           43
## 19       3,501,118           73.00%          171
## 20 > 50    186,078            7.90%           11
## 21       2,133,516           90.40%          290
#The severe case values above are paired with the wrong rows, so re-enter them in row order and use the new column [df severe_values] in place of the old one [df vaccination_israel]

#Assign values
value <- c("43", "11", "171", "290")
values2 <- c("Not Vax", "Fully Vax", "Not Vax", "Fully Vax")

#Add new columns
severe_values <- vaccination_israel %>% 
  add_column(Severe_Cases = value, Vaccination_Status = values2)

#Drop column 4
severe_values <- severe_values[, -c(4)]
severe_values
##     Age Population Percent_from_Pop Severe_Cases Vaccination_Status
## 18 < 50  1,116,834           23.30%           43            Not Vax
## 19       3,501,118           73.00%           11          Fully Vax
## 20 > 50    186,078            7.90%          171            Not Vax
## 21       2,133,516           90.40%          290          Fully Vax
#Reorder columns
col_order <- c("Age", "Vaccination_Status", "Population",
               "Percent_from_Pop", "Severe_Cases")
vax_israel <- severe_values[, col_order]
vax_israel
##     Age Vaccination_Status Population Percent_from_Pop Severe_Cases
## 18 < 50            Not Vax  1,116,834           23.30%           43
## 19               Fully Vax  3,501,118           73.00%           11
## 20 > 50            Not Vax    186,078            7.90%          171
## 21               Fully Vax  2,133,516           90.40%          290

To answer the questions:

    1. Do you have enough information to calculate the total population? What does this total population represent?

Based on this data alone, we do not have enough information to calculate the total population of Israel. The table only describes people under and over the age of 50, split by vaccination status, together with their severe cases, and we do not know whether those groups account for the whole population. It also raises the question of what “fully vaccinated” means in Israel (one dose or two). The total population can be estimated from these figures, but an estimate will not be exact.
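As a rough illustration of such an estimate, the four population counts in the table can be summed and each age group's percentages checked. This is only a sketch: the object names are made up for the example, and reading the leftover percentage in each age group as people who are neither unvaccinated nor fully vaccinated (for example, partially vaccinated) is an assumption, not something the table states.

```r
# Population counts and percentages re-typed from the table above
pop <- data.frame(
  Age        = c("< 50", "< 50", "> 50", "> 50"),
  Status     = c("Not Vax", "Fully Vax", "Not Vax", "Fully Vax"),
  Population = c(1116834, 3501118, 186078, 2133516),
  Percent    = c(23.30, 73.00, 7.90, 90.40)
)

# Total of the people actually listed in the table
listed_total <- sum(pop$Population)
listed_total  # 6937546

# Per age group: people listed and the share of the group they cover
by_age <- aggregate(cbind(Population, Percent) ~ Age, data = pop, FUN = sum)

# Implied size of each age group if the percentages are taken at face value
by_age$Implied_Group <- by_age$Population / (by_age$Percent / 100)
by_age
```

The percentages cover 96.3% of the under-50 group and 98.3% of the over-50 group, so the listed total understates the population these groups are drawn from, and it says nothing about anyone who falls outside both groups.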

    2. Calculate the Efficacy vs. Disease; Explain your results.

Efficacy vs. severe disease = 1 - (fully vaxed severe cases per 100K / not vaxed severe cases per 100K)

Efficacy vs. severe disease for under fifty = 0.7442 or 74.42%

Efficacy vs. severe disease for over fifty = -0.6959 or -69.59%

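Based on these results, the vaccine appears much more effective at preventing severe disease in the under-fifty group than in the over-fifty group. The negative value for the over-fifty group means that, in this table, the fully vaccinated figure (290) is higher than the not vaccinated figure (171).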

#Under Fifty
not_vax <- (43 / 100000)   #severe cases per 100K, not vaccinated
vax <- (11 / 100000)       #severe cases per 100K, fully vaccinated

ESD_under_fifty <- 1 - (vax / not_vax)
ESD_under_fifty
## [1] 0.744186
#Over Fifty
not_vax2 <- (171 / 100000)   #severe cases per 100K, not vaccinated
vax2 <- (290 / 100000)       #severe cases per 100K, fully vaccinated


ESD_over_fifty <- 1 - (vax2 / not_vax2)
ESD_over_fifty
## [1] -0.6959064
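The two calculations above follow the same formula, so they could also be written once as a small function and applied to both age groups. The function name `efficacy_vs_severe` and the data frame `rates` are hypothetical names used only for this sketch. The inputs are the severe case figures per 100K as labelled in the table.

```r
# Efficacy vs. severe disease = 1 - (vaccinated rate / unvaccinated rate)
efficacy_vs_severe <- function(rate_vax, rate_not_vax) {
  1 - (rate_vax / rate_not_vax)
}

# Severe case figures per 100K, re-typed from the table above
rates <- data.frame(
  Age                = c("< 50", "> 50"),
  Not_Vax_per_100K   = c(43, 171),
  Fully_Vax_per_100K = c(11, 290)
)

rates$Efficacy <- efficacy_vs_severe(rates$Fully_Vax_per_100K,
                                     rates$Not_Vax_per_100K)
round(rates$Efficacy, 4)
```

Because the formula is a ratio of two rates on the same scale, the result is the same whether the raw figures or the figures divided by 100,000 are used.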
    3. From your calculation of efficacy vs. disease, are you able to compare the rate of severe cases in unvaccinated individuals to that in vaccinated individuals?

For those over the age of fifty, vaccinated or not, the risk of severe disease is higher than for those under fifty. This may be due to underlying conditions that older individuals are more likely to have. That is not surprising, since health issues generally become more common with age, whereas a younger person may, though not always, be in better health.

Conclusion:

Overall, older age does compromise health regardless of vaccination status. From this data we cannot tell how much vaccination status and severe disease affect a person’s body, or how quickly it recovers. We do know that being vaccinated plays a part, and it would help to have more information about the whole population and its vaccination percentages to see whether there is a correlation.