(1) Create a .CSV file (or optionally, a relational database!) that includes all the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
(3) Perform analysis as described in the spreadsheet and above.
(4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
#Import libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(tidyr)
library(stringr)
#Load the .csv file
theUrl <- "https://raw.githubusercontent.com/letisalba/Data607_Assignment_Week5/main/Israel_Vax_Data.csv"
israel_vax <- read.csv(file = theUrl, header = TRUE, sep = ",", na.strings = ".", skip = 1)
## Warning in read.table(file = file, header = header, sep = sep, quote
## = quote, : incomplete final line found by readTableHeader on 'https://
## raw.githubusercontent.com/letisalba/Data607_Assignment_Week5/main/
## Israel_Vax_Data.csv'
head(israel_vax)
## X Not.Vax.. Fully.Vax.. Not.Vax.per.100K Fully.Vax.per.100K
## 1 < 50 1,116,834 3,501,118 43 11
## 2 23.30% 73.00% NA NA
## 3 > 50 186,078 2,133,516 171 290
## 4 7.90% 90.40% NA NA
## vs..severe.disease
## 1 NA
## 2 NA
## 3 NA
## 4 NA
glimpse(israel_vax)
## Rows: 4
## Columns: 6
## $ X <chr> "< 50", "", "> 50", ""
## $ Not.Vax.. <chr> "1,116,834", "23.30%", "186,078", "7.90%"
## $ Fully.Vax.. <chr> "3,501,118", "73.00%", "2,133,516", "90.40%"
## $ Not.Vax.per.100K <int> 43, NA, 171, NA
## $ Fully.Vax.per.100K <int> 11, NA, 290, NA
## $ vs..severe.disease <lgl> NA, NA, NA, NA
The goal when tidying this data set is to sort the rows into two categories, not vaccinated and fully vaccinated. Once that is done, I can combine the matching rows into new columns for age, vaccination status, population, percent of population, and severe cases within each category.
#Rename columns
colnames(israel_vax) <- c("X1", "X2", "X3", "X4", "X5", "X6")
israel_vax
## X1 X2 X3 X4 X5 X6
## 1 < 50 1,116,834 3,501,118 43 11 NA
## 2 23.30% 73.00% NA NA NA
## 3 > 50 186,078 2,133,516 171 290 NA
## 4 7.90% 90.40% NA NA NA
#Add ID column 1 to 4 before column X1
israel_vax2 <- israel_vax %>%
add_column(X0 = 1:4, .before = "X1")
israel_vax2
## X0 X1 X2 X3 X4 X5 X6
## 1 1 < 50 1,116,834 3,501,118 43 11 NA
## 2 2 23.30% 73.00% NA NA NA
## 3 3 > 50 186,078 2,133,516 171 290 NA
## 4 4 7.90% 90.40% NA NA NA
#Extract all Percentages
percent_pop <- unlist(str_extract_all(israel_vax2, "\\d+(\\.\\d+){0,1}%"))
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing
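The warning above appears because a whole data frame, rather than a character vector, is passed to `str_extract_all()`. A minimal sketch of the same extraction on a plain character vector is shown here; the `cells` vector is a hand-typed copy of columns X2 and X3 from the table above, an assumption made so the example runs without the download.

```r
library(stringr)

# Hand-typed copy of columns X2 and X3 (hypothetical stand-in for the data frame)
cells <- c("1,116,834", "23.30%", "186,078", "7.90%",
           "3,501,118", "73.00%", "2,133,516", "90.40%")

# str_extract() returns NA where the pattern does not match
pcts <- str_extract(cells, "\\d+(\\.\\d+)?%")

# Keep only the cells that actually contained a percentage
pcts <- pcts[!is.na(pcts)]
pcts
```

Because the input is already an atomic vector, no coercion is needed and the percentages come back in the same column-by-column order.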
#Separate using ","
paste( unlist(percent_pop), collapse = ',')
## [1] "23.30%,7.90%,73.00%,90.40%"
#Create df
percent_pop <- data.frame(matrix(unlist(percent_pop), nrow=length(percent_pop), byrow=TRUE),stringsAsFactors=FALSE)
percent_pop
## matrix.unlist.percent_pop...nrow...length.percent_pop...byrow...TRUE.
## 1 23.30%
## 2 7.90%
## 3 73.00%
## 4 90.40%
#Rename column
colnames(percent_pop) <- c("Percent_from_Pop")
#Add column ID
percent_pop <- percent_pop %>%
add_column(ID = 1:4, .before = "Percent_from_Pop")
percent_pop
## ID Percent_from_Pop
## 1 1 23.30%
## 2 2 7.90%
## 3 3 73.00%
## 4 4 90.40%
#Arrange Percentage from Population
percent_pop2 <- percent_pop %>%
arrange(factor(Percent_from_Pop, levels = c("23.30%", "73.00%", "7.90%", "90.40%")))
#Add the population counts as a new column
x <- c("1,116,834", "3,501,118", "186,078", "2,133,516")
percent_pop2 <- percent_pop2 %>%
add_column(Population = x, .before = "Percent_from_Pop")
percent_pop2
## ID Population Percent_from_Pop
## 1 1 1,116,834 23.30%
## 2 3 3,501,118 73.00%
## 3 2 186,078 7.90%
## 4 4 2,133,516 90.40%
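At this point Population and Percent_from_Pop are still character strings, so they cannot be used in arithmetic. One possible conversion to numeric values is sketched below in base R; the vectors are copied by hand from the table above, and the names `population_num` and `percent_num` are illustrative only.

```r
# Values copied from the table above
population <- c("1,116,834", "3,501,118", "186,078", "2,133,516")
percent <- c("23.30%", "73.00%", "7.90%", "90.40%")

# Remove the thousands separators, then convert to numeric
population_num <- as.numeric(gsub(",", "", population))

# Remove the percent sign and rescale to a proportion between 0 and 1
percent_num <- as.numeric(sub("%", "", percent)) / 100

population_num
percent_num
```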
#Extract all severe cases numbers
Severe_Cases <- unlist(str_extract_all(israel_vax2, "[:digit:]{2,3}")) #This regex matches every 2- to 3-digit run in the table, not only the severe case counts, so the extra matches have to be dropped in the steps below
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing
Severe_Cases
## [1] "50" "50" "116" "834" "23" "30" "186" "078" "90" "501" "118" "73"
## [13] "00" "133" "516" "90" "40" "43" "171" "11" "290"
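Since the regex picks up digits from every column, a narrower alternative is to read the severe case counts straight from the two per-100K columns. This is only a sketch: `X4` and `X5` below are hand-typed copies of those columns as printed earlier, and `severe_direct` is a hypothetical name.

```r
# Hand-typed copies of columns X4 (not vaccinated) and X5 (fully vaccinated)
X4 <- c(43L, NA, 171L, NA)
X5 <- c(11L, NA, 290L, NA)

# Drop the NA rows that belong to the percentage lines
severe_direct <- c(X4[!is.na(X4)], X5[!is.na(X5)])
severe_direct
```

The four values come out in the same order as the last four regex matches above, without the extra matches that need to be removed afterwards.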
#Convert into df
severe_cases <- data.frame(matrix(unlist(Severe_Cases), nrow=length(Severe_Cases), byrow=TRUE),stringsAsFactors=FALSE)
#Add a new column ID
severe_cases2 <- severe_cases %>%
add_column(ID = 1:21, .before = "matrix.unlist.Severe_Cases...nrow...length.Severe_Cases...byrow...TRUE.")
#Drop rows not needed
severe_cases3 <- severe_cases2[-c(1:17),]
severe_cases3
## ID matrix.unlist.Severe_Cases...nrow...length.Severe_Cases...byrow...TRUE.
## 18 18 43
## 19 19 171
## 20 20 11
## 21 21 290
#Create column names for new df
colnames(severe_cases3) <- c("ID2","Severe Cases")
severe_cases3
## ID2 Severe Cases
## 18 18 43
## 19 19 171
## 20 20 11
## 21 21 290
#cbind tables
vaccination_israel <- cbind(israel_vax2, percent_pop2, severe_cases3)
#Drop columns not needed in df
vaccination_israel <- vaccination_israel[, -c(1, 3:8, 11)]
#Assign column names
colnames(vaccination_israel) <- c("Age","Population" ,"Percent_from_Pop", "Severe Cases")
vaccination_israel
## Age Population Percent_from_Pop Severe Cases
## 18 < 50 1,116,834 23.30% 43
## 19 3,501,118 73.00% 171
## 20 > 50 186,078 7.90% 11
## 21 2,133,516 90.40% 290
#The 'Severe Cases' column above is out of order, so replace it with correctly ordered values and add the vaccination status
#Assign values
value <- c("43", "11", "171", "290")
values2 <- c("Not Vax", "Fully Vax", "Not Vax", "Fully Vax")
#Add new columns
severe_values <- vaccination_israel %>%
add_column(Severe_Cases = value, Vaccination_Status = values2)
#Drop column 4
severe_values <- severe_values[, -c(4)]
severe_values
## Age Population Percent_from_Pop Severe_Cases Vaccination_Status
## 18 < 50 1,116,834 23.30% 43 Not Vax
## 19 3,501,118 73.00% 11 Fully Vax
## 20 > 50 186,078 7.90% 171 Not Vax
## 21 2,133,516 90.40% 290 Fully Vax
#Reorder columns
col_order <- c("Age", "Vaccination_Status", "Population",
"Percent_from_Pop", "Severe_Cases")
vax_israel <- severe_values[, col_order]
vax_israel
## Age Vaccination_Status Population Percent_from_Pop Severe_Cases
## 18 < 50 Not Vax 1,116,834 23.30% 43
## 19 Fully Vax 3,501,118 73.00% 11
## 20 > 50 Not Vax 186,078 7.90% 171
## 21 Fully Vax 2,133,516 90.40% 290
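For comparison, a long table of the same shape can also be sketched with `pivot_longer()` from tidyr. This is not the method used above, only an illustration: the `counts` tibble is a hand-typed stand-in for the two count rows of the raw data, and its column names are assumptions chosen so that the names can be split into a vaccination status and a measure.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table holding only the count rows (rows 1 and 3 of the raw data)
counts <- tibble(
  Age = c("< 50", "> 50"),
  Not_Vax_Population = c("1,116,834", "186,078"),
  Fully_Vax_Population = c("3,501,118", "2,133,516"),
  Not_Vax_Severe = c(43L, 171L),
  Fully_Vax_Severe = c(11L, 290L)
)

tidy_sketch <- counts %>%
  # Turn strings such as "1,116,834" into numbers
  mutate(across(ends_with("Population"), ~ as.numeric(gsub(",", "", .x)))) %>%
  # Split each column name into a status part and a measure part;
  # ".value" keeps Population and Severe as separate columns
  pivot_longer(
    cols = -Age,
    names_to = c("Vaccination_Status", ".value"),
    names_pattern = "(.*_Vax)_(.*)"
  )

tidy_sketch
```

Each age group ends up with one row per vaccination status, so no manual reordering of values is needed.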
Based on this data, we do not have enough information to calculate the total population of Israel. The data only represents people above and below the age of 50, with severe cases reported for each group, and we do not know whether it actually reflects the total population that is or is not fully vaccinated. It is also unclear how "fully vaccinated" is defined in Israel (one or two doses). The total population can be estimated, but because it is an estimate it will not be exact.
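The kind of estimate mentioned above can be sketched as follows. It assumes that each percentage is the share of that age group's total population, which the source table does not state explicitly, so the result should be read as a rough figure only. The variable names are illustrative.

```r
# Not vaccinated counts and their shares, copied from the tidy table
not_vax_count <- c(under_50 = 1116834, over_50 = 186078)
not_vax_share <- c(under_50 = 0.233, over_50 = 0.079)

# Assumption: share = count / group total, so group total = count / share
implied_total <- not_vax_count / not_vax_share

# Rough totals per age group and overall
round(implied_total)
round(sum(implied_total))
```

Repeating the calculation with the fully vaccinated counts and shares gives slightly different totals because the percentages are rounded, which is one reason the estimate is not exact.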
Efficacy vs. severe disease = 1 - (% fully vaxed severe cases per 100K / % not vaxed severe cases per 100K)
Efficacy vs. severe disease for under fifty = 0.7442 or 74.42%
Efficacy vs. severe disease for over fifty = -0.6959 or -69.59%
Based on these results, the vaccine is far more effective at preventing severe disease in the under-fifty group than in the over-fifty group, where the calculated efficacy is negative.
#Under Fifty
#Severe cases per 100K expressed as rates (the ratio below does not depend on the divisor)
not_vax <- (43 / 100000)
vax <- (11 / 100000)
ESD_under_fifty <- 1 - (vax / not_vax)
ESD_under_fifty
## [1] 0.744186
#Over Fifty
#Severe cases per 100K expressed as rates (the ratio below does not depend on the divisor)
not_vax2 <- (171 / 100000)
vax2 <- (290 / 100000)
ESD_over_fifty <- 1 - (vax2 / not_vax2)
ESD_over_fifty
## [1] -0.6959064
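The two calculations above follow the same formula, so they can also be written once as a small function and applied to both age groups. This is a minimal sketch; the function name `efficacy` and the `rates` data frame are illustrative, with the per-100K values copied from the table.

```r
# Efficacy vs. severe disease = 1 - (rate among fully vaccinated / rate among not vaccinated)
efficacy <- function(vax_rate, not_vax_rate) {
  1 - vax_rate / not_vax_rate
}

# Severe cases per 100K for each age group
rates <- data.frame(
  Age = c("< 50", "> 50"),
  Not_Vax_per_100K = c(43, 171),
  Fully_Vax_per_100K = c(11, 290)
)

# The function is vectorized, so both groups are handled in one call
rates$Efficacy <- efficacy(rates$Fully_Vax_per_100K, rates$Not_Vax_per_100K)
rates
```

Because both rates share the same per-100K denominator, it cancels in the ratio and the raw counts per 100K can be used directly.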
People over the age of fifty, whether vaccinated or not, face a higher risk of severe disease than those under fifty. This may be due to underlying conditions that older individuals have. It is not surprising, since health issues are known to become more common with age, whereas a younger person is more likely, though not certain, to be in better health.
Overall, older age compromises health regardless of vaccination status. We do not really know how much a person's vaccination status and a severe disease affect the body, nor how quickly it recovers. We do know that being vaccinated plays a part, and it would be helpful to have more information about the whole population and its vaccination percentages to see whether there is a correlation.