Using purrr::map() Instead of For Loops in R

In many other programming languages, for loops are extremely important. However, R is a functional programming language, which means that R has the ability “to wrap up for loops in a function, and call that function instead of using the for loop directly” (R for Data Science, pg. 322).
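As a minimal sketch of that idea, the snippet below computes square roots twice: once with an explicit for loop and once with purrr::map_dbl(). The vector `values` and the result names are made up for illustration and are not part of the COVID-19 example that follows.

```r
library(purrr)

# Hypothetical input vector, used only for this illustration
values <- c(1, 4, 9, 16)

# A for loop: pre-allocate the output, then fill it element by element
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# The same work with the loop wrapped up inside map_dbl()
roots_map <- map_dbl(values, sqrt)

# Both approaches should produce the same numeric vector
all.equal(roots_loop, roots_map)
```

The loop version has to manage the index and the output vector by hand, while the map version only names the input and the function to apply.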

Many people who know R are familiar with the apply family of functions in base R (e.g. apply(), vapply(), lapply()). These functions, while incredibly useful, are inconsistent in their argument order and return types, which can make them harder to learn and often intimidating. This is where the purrr functions come in. Like the apply functions from base R, they let you apply a function to every element of a vector or list. The package was built with consistency in mind, making it easier to learn and use than its apply counterparts. Additionally, purrr is part of the tidyverse, so it can be used in conjunction with the other tidyverse packages.

There are many different functions in the purrr library. For this vignette I will explain just two: purrr::map() and purrr::pmap().

To demonstrate how these functions work, we will work with the most recent COVID-19 data set as of March 25, 2020, provided by the Johns Hopkins Whiting School of Engineering. The data set is part of the CSSEGISandData/COVID-19 repository on GitHub, at the URL used in the code below.

Before jumping in, I’ll load the necessary libraries as well as the data. I’ll also remove some columns that we won’t be using in this demonstration.

library(tidyverse)
library(stringr)
covid <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-25-2020.csv")
covid$Province_State <- replace_na(covid$Province_State, "")
covid <- covid %>% select(Province_State, Country_Region, Lat, Long_, Confirmed, Deaths) %>% arrange(desc(Confirmed)) 

head(covid)
## # A tibble: 6 x 6
##   Province_State Country_Region   Lat  Long_ Confirmed Deaths
##   <chr>          <chr>          <dbl>  <dbl>     <dbl>  <dbl>
## 1 ""             Italy           41.9  12.6      74386   7503
## 2 "Hubei"        China           31.0 112.       67801   3163
## 3 ""             Spain           40.5  -3.75     49515   3647
## 4 ""             Germany         51.2  10.5      37323    206
## 5 ""             Iran            32.4  53.7      27017   2077
## 6 ""             France          46.2   2.21     25233   1331

Note: There may be better, more elegant ways to do what I am demonstrating without the purrr::map() and purrr::pmap() functions. For the sake of example, however, I will use these functions.

purrr::map()

As an example, let’s say that we are curious about what percentage of total Confirmed cases each location makes up. To find this out we want to add a new column called “Percent_of_Total” that will hold the calculation. One way to do this would be to use the purrr::map() function. This function applies any single-argument function we create to every element of a list or vector, in essence doing the same work a for loop would do, but in a functional way. Because a data frame is a list of columns, calling purrr::map() on a data frame applies the function once to each column. To accomplish our goal we will create a function that takes the whole Confirmed column and divides each value by the sum of the column, and then apply it to that column.

Let’s first create our function:

#estimator function
percent_of_total <- function(x) {
  return(x/sum(x) * 100)
}

The function above takes an argument “x”, divides each of its values by the sum of “x” (the sum of the entire vector), and then multiplies the result by 100. This gives us each value as a percentage of the total.
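To check the arithmetic, here is a small sketch that applies the same function to a made-up vector (`toy_counts` is a hypothetical name, not a column of the COVID-19 data). The function is repeated so the snippet runs on its own.

```r
# Same definition as above, repeated so this snippet is self-contained
percent_of_total <- function(x) {
  return(x / sum(x) * 100)
}

# Hypothetical counts: the total is 4, so the shares are 1/4, 1/4 and 2/4
toy_counts <- c(1, 1, 2)
toy_percent <- percent_of_total(toy_counts)
toy_percent
# 25 25 50

# The percentages always add up to 100
sum(toy_percent)
```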

As mentioned above, one of the benefits of using the purrr package is that it can be used with other Tidyverse functions. To create our new column, we will use dplyr::select() to select the Confirmed column. Next, we will apply the function to that column (vector) by using purrr::map(), passing in our percent_of_total function as an argument.

new_col <- covid %>% dplyr::select(Confirmed) %>% purrr::map(percent_of_total)
class(new_col)
## [1] "list"
new_col[[1]][1:10]
##  [1] 15.908245 14.499972 10.589315  7.981924  5.777876  5.396348  3.818697
##  [8]  2.330441  2.037879  1.954046

In the output above you will notice that the result of purrr::map() is a list. Here it holds a single element, the transformed Confirmed column. If we want to add these percentages as a new column in our data frame, we can use the dplyr::mutate() function in combination with unlist(). In this instance unlist() simply converts the list to a vector, allowing it to be easily added to the data frame.

covid <- covid %>% mutate("Percent_of_Total" = round(unlist(new_col),2))
covid
## # A tibble: 3,420 x 7
##    Province_State Country_Region   Lat  Long_ Confirmed Deaths Percent_of_Total
##    <chr>          <chr>          <dbl>  <dbl>     <dbl>  <dbl>            <dbl>
##  1 ""             Italy           41.9  12.6      74386   7503            15.9 
##  2 "Hubei"        China           31.0 112.       67801   3163            14.5 
##  3 ""             Spain           40.5  -3.75     49515   3647            10.6 
##  4 ""             Germany         51.2  10.5      37323    206             7.98
##  5 ""             Iran            32.4  53.7      27017   2077             5.78
##  6 ""             France          46.2   2.21     25233   1331             5.4 
##  7 "New York"     US              40.8 -74.0      17856    199             3.82
##  8 ""             Switzerland     46.8   8.23     10897    153             2.33
##  9 ""             United Kingdom  55.4  -3.44      9529    465             2.04
## 10 ""             Korea, South    35.9 128.        9137    126             1.95
## # … with 3,410 more rows

Looking at the data frame above, you can see that utilizing the purrr::map() function enabled us to create the Percent_of_Total column very easily and without a for loop.
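One detail worth seeing in isolation is what purrr::map() iterates over when it is given a data frame. The sketch below uses a made-up two-column data frame (`toy` is a hypothetical name) and suggests that the function is called once per column, not once per row.

```r
library(purrr)

# Hypothetical data frame with two numeric columns
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# map() treats the data frame as a list of columns
col_sums <- map(toy, sum)

length(col_sums)  # one result per column
names(col_sums)   # the column names are carried over
col_sums$a        # sum of column a
col_sums$b        # sum of column b
```

This is why a function such as percent_of_total can use sum(x) inside it: x is the entire column, not a single value.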

purrr::pmap()

What happens if you have a function with multiple arguments that you would like to apply to a vector? This is where purrr::pmap() comes in. This function is a variation of purrr::map() that lets you work with functions that take any number of arguments. The one change you will need to make is to pass purrr::pmap() a list() holding the function arguments. I will demonstrate this below with an example.

Let’s say, for example, that we want to create a new column where we concatenate the Province_State column with the Country_Region column. More specifically, for those locations with both a Province_State and a Country_Region value, we want to separate the two values with a comma. If there is no Province_State value, then we just want to return the Country_Region value. We can do this easily with purrr::pmap().

We’ll first create a function called “add_comma” that takes two arguments, x and y, which will end up being the Province_State column and the Country_Region column, respectively. Inside the function, I use an if statement to see if x (Province_State) is empty. If it is, then I just return y (Country_Region). If it’s not empty, then I concatenate the two values, separated by a comma.

We will apply this function to each row in the same way we did in the previous example, with two differences. First, we’ll need to create a list of the arguments we want to pass to the function, which I’m calling “arg_list”. Second, instead of chaining functions like we did before, we will make this code more concise by placing purrr::pmap() directly inside the dplyr::mutate() call. To do this, we first pass the argument list to purrr::pmap(), and then the function we wish to call. Like map(), pmap() returns a list, so we will need to call unlist() to transform the list into a vector in order to create the new column in our data frame.

add_comma <- function(x, y) {
  if (x == "") {
    col_val <- y
  } else {
    col_val <- stringr::str_c(x, y, sep = ", ")
  }
  return(col_val)
}

arg_list <- list(x = covid$Province_State,  y = covid$Country_Region )
covid <- covid %>% mutate("Location" = unlist(purrr::pmap(arg_list, add_comma)))
head(covid$Location, 10)
##  [1] "Italy"          "Hubei, China"   "Spain"          "Germany"       
##  [5] "Iran"           "France"         "New York, US"   "Switzerland"   
##  [9] "United Kingdom" "Korea, South"
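To make the pairing of arguments easier to see, here is a small sketch on made-up inputs (`provinces`, `countries` and `toy_labels` are hypothetical names). It assumes the same rule as add_comma but uses base paste() so that it only depends on purrr.

```r
library(purrr)

# Hypothetical inputs; the i-th province is paired with the i-th country
provinces <- c("", "Hubei", "New York")
countries <- c("Italy", "China", "US")

# pmap() calls the function once per position, matching list names to argument names
toy_labels <- pmap(
  list(x = provinces, y = countries),
  function(x, y) {
    if (x == "") y else paste(x, y, sep = ", ")
  }
)

# pmap() returns a list, so flatten it to a character vector
unlist(toy_labels)
# "Italy" "Hubei, China" "New York, US"
```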

Let’s take a look at the final data frame reordered and cleaned up:

covid <- covid %>% select(Location, Lat, Long_, Confirmed, Percent_of_Total, Deaths)
head(covid)
## # A tibble: 6 x 6
##   Location       Lat  Long_ Confirmed Percent_of_Total Deaths
##   <chr>        <dbl>  <dbl>     <dbl>            <dbl>  <dbl>
## 1 Italy         41.9  12.6      74386            15.9    7503
## 2 Hubei, China  31.0 112.       67801            14.5    3163
## 3 Spain         40.5  -3.75     49515            10.6    3647
## 4 Germany       51.2  10.5      37323             7.98    206
## 5 Iran          32.4  53.7      27017             5.78   2077
## 6 France        46.2   2.21     25233             5.4    1331

As you can see above, the purrr::pmap() function worked seamlessly. As I mentioned earlier, there are many other functions in the purrr library. Many of them return a vector of a specific type instead of a list, such as map_int(), map_chr(), pmap_int(), and pmap_chr(). Among other applications, these functions can make it so you don’t need to call unlist() when working with the output.
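As a rough sketch of those typed variants, the snippet below uses map_dbl() and pmap_chr() on made-up inputs. The variable names and the helper functions are hypothetical and chosen only for illustration.

```r
library(purrr)

# map_dbl() returns a double vector directly, so no unlist() is needed
halves <- map_dbl(c(2, 4, 6), function(x) x / 2)
halves
# 1 2 3

# pmap_chr() returns a character vector; unnamed list elements are matched by position
codes <- pmap_chr(
  list(c("a", "b"), c("x", "y")),
  function(first, second) paste0(first, second)
)
codes
# "ax" "by"
```

The typed variants also act as a check: they raise an error if the function returns something that does not fit the requested type.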

The purrr library is an incredible tool to help make your code more concise and easier to read by replacing for loops with function calls and taking advantage of R’s functional abilities.

(Extension)

In this extension of Christian’s well-done tutorial, I am going to showcase three functions and one important concept.

Tutorial 1. map_df(X, Func) where X is a list or atomic vector and Func can be a function, formula, or vector

Purpose:
map_df takes a list and a function, applies the function to each element, and combines the results by row into a single data frame.

Example:
This example is meant to help you speed up your file loading locally.

Here, myfiles is a character vector of the file names that start with vgsales and end with the .csv extension. map_df takes myfiles as its first argument and read_csv as its second, and applies read_csv to each file name in turn. The end result is a single data frame containing the rows from every file.

# match the filename that begins with vgsales and ends with .csv
myfiles <- list.files(pattern = "^(vgsales.+)\\.csv")

mydf <- map_df(myfiles, read_csv)
## Warning: 271 parsing failures.
##  row  col expected actual                 file
## 1975 Year a double    N/A 'vgsales_pre_2k.csv'
## 1976 Year a double    N/A 'vgsales_pre_2k.csv'
## 1977 Year a double    N/A 'vgsales_pre_2k.csv'
## 1978 Year a double    N/A 'vgsales_pre_2k.csv'
## 1979 Year a double    N/A 'vgsales_pre_2k.csv'
## .... .... ........ ...... ....................
## See problems(...) for more details.
mydf
## # A tibble: 16,598 x 11
##     Rank Name  Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
##    <dbl> <chr> <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
##  1   133 Poké… GB        2000 Role… Nintendo      2.55     1.56     1.29
##  2   174 Fina… PS        2000 Role… SquareSo…     1.62     0.77     2.78
##  3   224 Driv… PS        2000 Acti… Atari         2.36     2.1      0.02
##  4   226 Tony… PS        2000 Spor… Activisi…     3.05     1.41     0.02
##  5   243 Drag… PS        2000 Role… Enix Cor…     0.2      0.14     4.1 
##  6   295 Tekk… PS2       2000 Figh… Namco Ba…     1.68     1.51     0.51
##  7   334 Spyr… PS        2000 Plat… Sony Com…     1.93     1.58     0   
##  8   359 WWF … PS        2000 Figh… THQ           2.01     1.35     0.06
##  9   368 Rugr… PS        2000 Acti… THQ           1.96     1.33     0   
## 10   394 Cras… PS        2000 Misc  Sony Com…     1.56     1.47     0.19
## # … with 16,588 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

Note that I intentionally brought in the records that have N/A in Year. They cause the 271 parsing failures shown above, and read_csv stores those Year values as NA.
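If you do not have the vgsales files at hand, the following sketch reproduces the same pattern with two tiny CSV files written to a temporary directory. The file names, column names and values are made up for illustration, and the snippet assumes readr and dplyr are installed alongside purrr.

```r
library(purrr)
library(readr)

# Create a scratch directory and two small, hypothetical CSV files
demo_dir <- tempfile("sales_demo")
dir.create(demo_dir)
write_csv(data.frame(Name = c("A", "B"), Year = c(1999, 2000)),
          file.path(demo_dir, "demo_part1.csv"))
write_csv(data.frame(Name = "C", Year = 2001),
          file.path(demo_dir, "demo_part2.csv"))

# Collect the matching file paths, then read and row-bind them in one step
demo_files <- list.files(demo_dir, pattern = "^demo_.+\\.csv$", full.names = TRUE)
combined <- map_df(demo_files, read_csv)

nrow(combined)   # rows from both files
names(combined)  # shared column names
```

Using full.names = TRUE keeps the directory in each path, so the files can be read from any working directory.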

Important concept: What is an anonymous function? It is a function that doesn’t have a name.
To see the difference, compare a named function with an anonymous one.

A normal (named) function, round, is used in the following call: map_dbl(my_vector, round). Here map_dbl applies round to each element of its first argument, my_vector, and returns the results as a double vector (dbl stands for double precision), like the typed variants Christian mentioned above.

An anonymous function looks like this: map_dbl(my_vector, ~ .x + 10). The second argument starts with ~, followed by .x (a lone . is also accepted) and + 10. This defines an unnamed function that takes each element of my_vector and adds 10 to it. The ~ creates a formula, which purrr converts into a function, and .x (or .) stands for the current element of the first argument.
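The sketch below runs both kinds of call on a small made-up vector (the values in `my_vector` are hypothetical). It also includes the base R function(x) form, which is another way to write the same anonymous function.

```r
library(purrr)

# Hypothetical numeric vector for illustration
my_vector <- c(1.4, 2.6, 3.2)

# Named function: round() is applied to each element
rounded <- map_dbl(my_vector, round)
rounded
# 1 3 3

# Anonymous function, purrr formula shorthand: .x is the current element
plus_ten_formula <- map_dbl(my_vector, ~ .x + 10)

# Anonymous function, base R syntax: equivalent to the formula above
plus_ten_function <- map_dbl(my_vector, function(x) x + 10)

all.equal(plus_ten_formula, plus_ten_function)
```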

Tutorial 2. keep(X, Func) keeps all elements that match Func

Purpose:
keep takes a list or vector, applies the function Func to each element, and keeps only the elements for which Func returns TRUE.

Example:
Here keep retains only the numeric columns of the data frame mydf, using is.numeric. I then go on to summarize the sales attributes by year.

# keep() retains only the columns that satisfy is.numeric()
# select(-Rank) removes the Rank column
# filter() drops the rows where Year is NA
# group_by(Year) and summarise_all(sum) total each sales column by year

mydf %>%
  keep(is.numeric) %>%
  select(-Rank) %>%
  filter(!is.na(Year)) %>%
  group_by(Year) %>%
  summarise_all(sum)
## # A tibble: 39 x 6
##     Year NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
##    <dbl>    <dbl>    <dbl>    <dbl>       <dbl>        <dbl>
##  1  1980    10.6      0.67      0          0.12         11.4
##  2  1981    33.4      1.96      0          0.32         35.8
##  3  1982    26.9      1.65      0          0.31         28.9
##  4  1983     7.76     0.8       8.1        0.14         16.8
##  5  1984    33.3      2.1      14.3        0.7          50.4
##  6  1985    33.7      4.74     14.6        0.92         53.9
##  7  1986    12.5      2.84     19.8        1.93         37.1
##  8  1987     8.46     1.41     11.6        0.2          21.7
##  9  1988    23.9      6.59     15.8        0.99         47.2
## 10  1989    45.2      8.44     18.4        1.5          73.4
## # … with 29 more rows
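For a version that does not depend on the vgsales data, here is a small sketch of keep() on a made-up list (`mixed` and its element names are hypothetical).

```r
library(purrr)

# Hypothetical list mixing numeric and character elements
mixed <- list(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(9.5, 7, 8)
)

# keep() retains the elements for which the predicate returns TRUE
numeric_only <- keep(mixed, is.numeric)
names(numeric_only)
# "id" "score"
```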

Tutorial 3. discard(X, Func) drops all elements that match Func

Purpose:
discard takes a list or vector and applies the function Func to each element. It drops the elements for which Func returns TRUE and returns the rest.

Example:
Here discard removes all the numeric columns from the data frame mydf.

# note that several columns in mydf are of type numeric
str(mydf)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 16598 obs. of  11 variables:
##  $ Rank        : num  133 174 224 226 243 295 334 359 368 394 ...
##  $ Name        : chr  "Pokémon Crystal Version" "Final Fantasy IX" "Driver 2" "Tony Hawk's Pro Skater 2" ...
##  $ Platform    : chr  "GB" "PS" "PS" "PS" ...
##  $ Year        : num  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ Genre       : chr  "Role-Playing" "Role-Playing" "Action" "Sports" ...
##  $ Publisher   : chr  "Nintendo" "SquareSoft" "Atari" "Activision" ...
##  $ NA_Sales    : num  2.55 1.62 2.36 3.05 0.2 1.68 1.93 2.01 1.96 1.56 ...
##  $ EU_Sales    : num  1.56 0.77 2.1 1.41 0.14 1.51 1.58 1.35 1.33 1.47 ...
##  $ JP_Sales    : num  1.29 2.78 0.02 0.02 4.1 0.51 0 0.06 0 0.19 ...
##  $ Other_Sales : num  0.99 0.14 0.25 0.2 0.02 0.35 0.19 0.16 0.23 0.17 ...
##  $ Global_Sales: num  6.39 5.3 4.73 4.68 4.47 4.05 3.71 3.58 3.52 3.39 ...
# discard() drops the numeric columns, leaving only the character columns
mydf %>% discard(is.numeric) %>% str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    16598 obs. of  4 variables:
##  $ Name     : chr  "Pokémon Crystal Version" "Final Fantasy IX" "Driver 2" "Tony Hawk's Pro Skater 2" ...
##  $ Platform : chr  "GB" "PS" "PS" "PS" ...
##  $ Genre    : chr  "Role-Playing" "Role-Playing" "Action" "Sports" ...
##  $ Publisher: chr  "Nintendo" "SquareSoft" "Atari" "Activision" ...
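As a final sketch, discard() can be tried on small made-up inputs as well (`mixed` and the other names below are hypothetical). The second call uses the formula shorthand for an anonymous predicate.

```r
library(purrr)

# Hypothetical list mixing numeric and character elements
mixed <- list(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(9.5, 7, 8)
)

# discard() is the complement of keep(): matching elements are dropped
non_numeric <- discard(mixed, is.numeric)
names(non_numeric)
# "name"

# discard() also works on atomic vectors; here the even numbers are dropped
odd_values <- discard(1:10, ~ .x %% 2 == 0)
odd_values
# 1 3 5 7 9
```

Between them, keep() and discard() split the input into two parts: every element lands in exactly one of the two results.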