TidyVerse EXTEND Assignment

The purrr package and map() function

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions, which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for Data Science. (https://purrr.tidyverse.org/)

The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input. (https://purrr.tidyverse.org/reference/map.html)
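As a small illustration of the "same length as the input" property, here is a minimal sketch on a toy vector. The values and the names x, roots_list and roots_dbl are made up for illustration and are not part of the COVID-19 data used later.

```r
library(purrr)

# Toy input: three numbers (illustrative only).
x <- c(1, 4, 9)

# map() always returns a list with one element per input element.
roots_list <- map(x, sqrt)

# map_dbl() is the typed variant: same iteration, but it returns a double vector.
roots_dbl <- map_dbl(x, sqrt)

length(roots_list)  # same length as x
roots_dbl
```

The typed variants (map_dbl(), map_chr(), and so on) are often handy when the result is meant to become a column, since they avoid a later unlist().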

In this vignette I will demonstrate the purrr::map() and purrr::pmap() functions from the purrr package.

The data set comes from the CSSE COVID-19 repository (https://github.com/CSSEGISandData/COVID-19). This vignette uses the US daily report for 12-31-2021 (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports_us/12-31-2021.csv).

Loading the required libraries

library(tidyverse)
library(stringr)

covid_19_data <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports_us/12-31-2021.csv")

## Rows: 58 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): Province_State, Country_Region, ISO3
## dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
## lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
## dttm  (1): Last_Update
## date  (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

covid_19_data$Province_State <- replace_na(covid_19_data$Province_State, "")
covid_19_data <- covid_19_data %>% select(Province_State, Country_Region, Lat, Long_, Confirmed, Deaths) %>% arrange(desc(Confirmed)) 
head(covid_19_data)

## # A tibble: 6 × 6
##   Province_State Country_Region   Lat  Long_ Confirmed Deaths
##   <chr>          <chr>          <dbl>  <dbl>     <dbl>  <dbl>
## 1 California     US              36.1 -120.    5517926  76478
## 2 Texas          US              31.1  -97.6   4623370  75748
## 3 Florida        US              27.8  -81.7   4209927  62504
## 4 New York       US              42.2  -74.9   3480280  59508
## 5 Illinois       US              40.3  -89.0   2149574  31017
## 6 Pennsylvania   US              40.6  -77.2   2036424  36705

purrr::map()

This function applies a function to each element of a list or vector. A data frame is a list of columns, so calling map() on a data frame applies the function to each column in turn.

Let us first create our function. As an example, we will compute each location's share of the total confirmed cases, expressed as a percentage, and then add the result as a new Percent_Total column.

percent_total <- function(x) {
  return(x/sum(x) * 100)
}

new_col <- covid_19_data %>% dplyr::select(Confirmed) %>% purrr::map(percent_total)
class(new_col)

## [1] "list"

new_col[[1]][1:5]

## [1] 10.050012  8.420723  7.667703  6.338769  3.915102

covid_19_data <- covid_19_data %>% mutate("Percent_Total" = round(unlist(new_col),2))
covid_19_data

## # A tibble: 58 × 7
##    Province_State Country_Region   Lat  Long_ Confirmed Deaths Percent_Total
##    <chr>          <chr>          <dbl>  <dbl>     <dbl>  <dbl>         <dbl>
##  1 California     US              36.1 -120.    5517926  76478         10.0 
##  2 Texas          US              31.1  -97.6   4623370  75748          8.42
##  3 Florida        US              27.8  -81.7   4209927  62504          7.67
##  4 New York       US              42.2  -74.9   3480280  59508          6.34
##  5 Illinois       US              40.3  -89.0   2149574  31017          3.92
##  6 Pennsylvania   US              40.6  -77.2   2036424  36705          3.71
##  7 Ohio           US              40.4  -82.8   2016082  31897          3.67
##  8 Georgia        US              33.0  -83.6   1839879  31443          3.35
##  9 Michigan       US              43.3  -84.5   1710325  29020          3.12
## 10 North Carolina US              35.6  -79.8   1686667  19426          3.07
## # … with 48 more rows
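The class(new_col) output above is "list" because map() treats the one-column data frame as a list of columns and returns one element per column. A minimal sketch of that behaviour follows; the toy data frame and its values are hypothetical and stand in for covid_19_data.

```r
library(purrr)

# Hypothetical toy data standing in for covid_19_data.
toy <- data.frame(Confirmed = c(50, 30, 20), Deaths = c(5, 3, 2))

percent_total <- function(x) x / sum(x) * 100

# map() visits each column, so the result has one named element per column.
by_column <- map(toy, percent_total)
names(by_column)

# Each element is a full vector with one value per row.
by_column$Confirmed
```

Because percent_total() is already vectorised over a whole column, the same values could also be obtained without map() by calling percent_total(toy$Confirmed) directly; map() pays off when the same function has to be applied to several columns at once.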

purrr::pmap()

This function is a variation of map() that applies a function with multiple arguments across several vectors in parallel. The vectors are supplied together as a list.

Let us create a function with multiple arguments. As an example, we will create a new Location column that joins the Province_State column with the Country_Region column, separated by a comma.

comma_function <- function(x, y) {
  if (x == "") {
    column_value <- y
  } else {
    column_value <- stringr::str_c(x, y, sep = ", ")
  }
  return(column_value)
}

new_argument_list <- list(x = covid_19_data$Province_State,  y = covid_19_data$Country_Region )
covid_19_data <- covid_19_data %>% mutate("Location" = unlist(purrr::pmap(new_argument_list, comma_function)))
head(covid_19_data$Location, 5)

## [1] "California, US" "Texas, US"      "Florida, US"    "New York, US"  
## [5] "Illinois, US"
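The unlist() step above can be avoided with the typed variant pmap_chr(), which returns a character vector directly. The sketch below uses hypothetical toy vectors (states, countries) and a hypothetical helper name (label_location) rather than the real data.

```r
library(purrr)

# Hypothetical toy inputs; the empty string mimics a missing Province_State.
states <- c("California", "", "Texas")
countries <- c("US", "US", "US")

# Same logic as comma_function: fall back to the country when the state is empty.
label_location <- function(x, y) {
  if (x == "") y else paste(x, y, sep = ", ")
}

# pmap_chr() walks the listed vectors in parallel, matching list names to argument names.
locations <- pmap_chr(list(x = states, y = countries), label_location)

# With exactly two inputs, map2_chr() is an equivalent shortcut.
locations2 <- map2_chr(states, countries, label_location)

locations
```

Either result could be assigned straight into mutate() without unlist(), since it is already a character vector of the right length.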

Display the first rows of the final data frame.

covid_19_data <- covid_19_data %>% select(Location, Lat, Long_, Confirmed, Percent_Total, Deaths)
head(covid_19_data)

## # A tibble: 6 × 6
##   Location           Lat  Long_ Confirmed Percent_Total Deaths
##   <chr>            <dbl>  <dbl>     <dbl>         <dbl>  <dbl>
## 1 California, US    36.1 -120.    5517926         10.0   76478
## 2 Texas, US         31.1  -97.6   4623370          8.42  75748
## 3 Florida, US       27.8  -81.7   4209927          7.67  62504
## 4 New York, US      42.2  -74.9   3480280          6.34  59508
## 5 Illinois, US      40.3  -89.0   2149574          3.92  31017
## 6 Pennsylvania, US  40.6  -77.2   2036424          3.71  36705

Extend Assignment By Ivan Tikhonov

covid_19_data[,1]

## # A tibble: 58 × 1
##    Location          
##    <chr>             
##  1 California, US    
##  2 Texas, US         
##  3 Florida, US       
##  4 New York, US      
##  5 Illinois, US      
##  6 Pennsylvania, US  
##  7 Ohio, US          
##  8 Georgia, US       
##  9 Michigan, US      
## 10 North Carolina, US
## # … with 48 more rows

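# Bar chart of each location's share of confirmed cases.
# scale_fill_brewer() was dropped: no fill aesthetic is mapped, so it had no effect.
ggplot(covid_19_data, aes(y=Percent_Total, x=Location)) + 
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19")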

library(ggpubr) # for arranging plots

# ex. 1
Percent_Total <- ggplot(covid_19_data, aes(y=Percent_Total, x=Location)) + 
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19") 
  

# ex. 2
Deaths <- ggplot(covid_19_data, aes( y=Deaths, x=Location)) + 
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19") 
  

# ex. 3
Lat <- ggplot(covid_19_data, aes( y=Lat, x=Location)) + 
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19") 
  

# ex.4
Confirmed <- ggplot(covid_19_data, aes( y=Confirmed, x=Location)) + 
  geom_bar(position="stack", stat="identity") +
  ggtitle("covid_19") 
  

# Put plots together
ggarrange(Percent_Total, Deaths, Lat, Confirmed,
          ncol = 2, nrow = 2)

## Warning: Removed 2 rows containing missing values (`position_stack()`).
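The four plots above differ only in the column mapped to y, so they are a natural fit for map(). The following is a minimal sketch under stated assumptions: the toy data frame, the metrics vector and the plots list are hypothetical names introduced for illustration, and coord_flip() is added only to keep location labels readable.

```r
library(ggplot2)
library(purrr)

# Hypothetical toy data standing in for covid_19_data.
toy <- data.frame(
  Location = c("A", "B", "C"),
  Confirmed = c(50, 30, 20),
  Deaths = c(5, 3, 2)
)

# Naming the vector makes map() return a list named by metric.
metrics <- set_names(c("Confirmed", "Deaths"))

# One bar chart per metric; geom_col() is shorthand for geom_bar(stat = "identity").
plots <- map(metrics, function(metric) {
  ggplot(toy, aes(x = Location, y = .data[[metric]])) +
    geom_col() +
    coord_flip() +
    ggtitle(metric)
})

names(plots)
```

The resulting list could then be handed to ggpubr::ggarrange() through its plotlist argument, for example ggarrange(plotlist = plots, ncol = 2), instead of naming each plot by hand.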

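Conclusion

California had the most confirmed cases, about 10% of the total, and also the most deaths among the ten locations listed above. Texas and Florida followed, and New York ranked fourth in confirmed cases.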

DATA 607: Data Acquisition and Management

Melvin Matanos, Fall 2022

2022-11-19
