TidyVerse EXTEND Assignment
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions, which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for Data Science. (https://purrr.tidyverse.org/)
The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input. (https://purrr.tidyverse.org/reference/map.html)
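As a minimal sketch of the statement above, the following uses a small made-up vector (not the COVID-19 data) to show that map() returns a list with one element per input element, while the typed variant map_dbl() returns a double vector instead. The names values, squares_list and squares_dbl are chosen here for illustration only.

```r
library(purrr)

# Hypothetical toy input, used only for illustration
values <- c(1, 2, 3)

# map() always returns a list of the same length as its input
squares_list <- map(values, function(x) x^2)

# map_dbl() applies the same function but returns a double vector
squares_dbl <- map_dbl(values, function(x) x^2)

length(squares_list)  # same length as `values`
squares_dbl
```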
In this vignette I will demonstrate the purrr::map() and purrr::pmap() functions from the purrr library.
The data set comes from the CSSEGISandData COVID-19 repository (https://github.com/CSSEGISandData/COVID-19). This vignette uses the US daily report for 12-31-2021 (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports_us/12-31-2021.csv).
library(tidyverse)
library(stringr)
covid_19_data <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports_us/12-31-2021.csv")
## Rows: 58 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Province_State, Country_Region, ISO3
## dbl (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
## lgl (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
## dttm (1): Last_Update
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid_19_data$Province_State <- replace_na(covid_19_data$Province_State, "")
covid_19_data <- covid_19_data %>% select(Province_State, Country_Region, Lat, Long_, Confirmed, Deaths) %>% arrange(desc(Confirmed))
head(covid_19_data)
## # A tibble: 6 × 6
## Province_State Country_Region Lat Long_ Confirmed Deaths
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 California US 36.1 -120. 5517926 76478
## 2 Texas US 31.1 -97.6 4623370 75748
## 3 Florida US 27.8 -81.7 4209927 62504
## 4 New York US 42.2 -74.9 3480280 59508
## 5 Illinois US 40.3 -89.0 2149574 31017
## 6 Pennsylvania US 40.6 -77.2 2036424 36705
purrr::map() applies a function to each element of a list or vector. Here the function takes a single argument.
Let us first create our function: for instance, we will determine what percentage of the total confirmed cases each location makes up. Finally, we will add the result as a new Percent_Total column.
percent_total <- function(x) {
  return(x/sum(x) * 100)
}
new_col <- covid_19_data %>% dplyr::select(Confirmed) %>% purrr::map(percent_total)
class(new_col)
## [1] "list"
new_col[[1]][1:5]
## [1] 10.050012 8.420723 7.667703 6.338769 3.915102
covid_19_data <- covid_19_data %>% mutate("Percent_Total" = round(unlist(new_col),2))
covid_19_data
## # A tibble: 58 × 7
## Province_State Country_Region Lat Long_ Confirmed Deaths Percent_Total
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 California US 36.1 -120. 5517926 76478 10.0
## 2 Texas US 31.1 -97.6 4623370 75748 8.42
## 3 Florida US 27.8 -81.7 4209927 62504 7.67
## 4 New York US 42.2 -74.9 3480280 59508 6.34
## 5 Illinois US 40.3 -89.0 2149574 31017 3.92
## 6 Pennsylvania US 40.6 -77.2 2036424 36705 3.71
## 7 Ohio US 40.4 -82.8 2016082 31897 3.67
## 8 Georgia US 33.0 -83.6 1839879 31443 3.35
## 9 Michigan US 43.3 -84.5 1710325 29020 3.12
## 10 North Carolina US 35.6 -79.8 1686667 19426 3.07
## # … with 48 more rows
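Because a data frame is a list of columns, map() in the example above calls percent_total() once on the whole Confirmed column rather than once per row. The sketch below illustrates this on a hypothetical two-column tibble named toy; its values are invented for illustration and are not taken from the CSSE data.

```r
library(purrr)
library(tibble)

percent_total <- function(x) {
  x / sum(x) * 100
}

# Hypothetical toy data, not taken from the CSSE data set
toy <- tibble(Confirmed = c(50, 30, 20), Deaths = c(5, 3, 2))

# map() iterates over the columns, so the result has one element per column
shares <- map(toy, percent_total)

names(shares)
shares$Confirmed
```

Since percent_total() is already vectorised, calling it directly inside mutate(), for example mutate(Percent_Total = percent_total(Confirmed)), should produce the same column without the intermediate list.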
purrr::pmap() is a variation of map() that applies a function with multiple arguments, iterating over several vectors in parallel.
Let us create a function with multiple arguments. For example, we will create a new column that concatenates the Province_State column with the Country_Region column, separated by a comma.
comma_function <- function(x, y) {
  if (x == "") {
    column_value <- y
  } else {
    column_value <- stringr::str_c(x, y, sep = ", ")
  }
  return(column_value)
}
new_argument_list <- list(x = covid_19_data$Province_State, y = covid_19_data$Country_Region )
covid_19_data <- covid_19_data %>% mutate("Location" = unlist(purrr::pmap(new_argument_list, comma_function)))
head(covid_19_data$Location, 5)
## [1] "California, US" "Texas, US" "Florida, US" "New York, US"
## [5] "Illinois, US"
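A possible alternative, sketched here on hypothetical toy vectors, is the typed variant pmap_chr(), which returns a character vector directly so that unlist() is not needed. The list toy_args and its values are assumptions made up for this illustration; the empty string stands in for a missing Province_State.

```r
library(purrr)
library(stringr)

comma_function <- function(x, y) {
  if (x == "") {
    y
  } else {
    str_c(x, y, sep = ", ")
  }
}

# Hypothetical toy inputs; the empty string mimics a missing Province_State
toy_args <- list(
  x = c("California", "", "Texas"),
  y = c("US", "Canada", "US")
)

# pmap_chr() calls comma_function(x[i], y[i]) for each position i
locations <- pmap_chr(toy_args, comma_function)
locations
```

The list names x and y match the argument names of comma_function(), which is how pmap_chr() knows which vector feeds which argument.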
Display the first rows of the final data frame.
covid_19_data <- covid_19_data %>% select(Location, Lat, Long_, Confirmed, Percent_Total, Deaths)
head(covid_19_data)
## # A tibble: 6 × 6
## Location Lat Long_ Confirmed Percent_Total Deaths
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 California, US 36.1 -120. 5517926 10.0 76478
## 2 Texas, US 31.1 -97.6 4623370 8.42 75748
## 3 Florida, US 27.8 -81.7 4209927 7.67 62504
## 4 New York, US 42.2 -74.9 3480280 6.34 59508
## 5 Illinois, US 40.3 -89.0 2149574 3.92 31017
## 6 Pennsylvania, US 40.6 -77.2 2036424 3.71 36705
Extend Assignment By Ivan Tikhonov
covid_19_data[,1]
## # A tibble: 58 × 1
## Location
## <chr>
## 1 California, US
## 2 Texas, US
## 3 Florida, US
## 4 New York, US
## 5 Illinois, US
## 6 Pennsylvania, US
## 7 Ohio, US
## 8 Georgia, US
## 9 Michigan, US
## 10 North Carolina, US
## # … with 48 more rows
ggplot(covid_19_data, aes(y=Percent_Total, x=Location)) +
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19") +
  scale_fill_brewer()
library(ggpubr) # for arranging plots
# ex. 1
Percent_Total <- ggplot(covid_19_data, aes(y=Percent_Total, x=Location)) +
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19")
# ex. 2
Deaths <- ggplot(covid_19_data, aes(y=Deaths, x=Location)) +
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19")
# ex. 3
Lat <- ggplot(covid_19_data, aes(y=Lat, x=Location)) +
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19")
# ex. 4
Confirmed <- ggplot(covid_19_data, aes(y=Confirmed, x=Location)) +
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19")
# Put plots together
ggarrange(Percent_Total, Deaths, Lat, Confirmed,
          ncol = 2, nrow = 2)
## Warning: Removed 2 rows containing missing values (`position_stack()`).
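Conclusion: California had the most confirmed cases (5,517,926, about 10% of the total) and, among the rows displayed above, the most deaths (76,478). New York ranked fourth in confirmed cases. Note that the data are reported by state, so there is no separate row for New York City.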