Using purrr::map() Instead of For Loops in R

In many other programming languages, for loops are extremely important. However, R is a functional programming language, which means that R has the ability “to wrap up for loops in a function, and call that function instead of using the for loop directly” (R for Data Science, pg. 322).

Many people familiar with R are fimiliar with the apply family of functions in base R (i.e. apply(), vapply(), lapply()). These functions, while incredibly useful, can be inconsistent in their application and can make understanding/using them more difficult and often intimidating. This is where the purrr functions come in. Similar to the apply functions from base R, these functions allow you to apply a function to all elements of a vector. This library was built with consistency in mind, making it easier to learn and use than its apply counterpart. Additionally, this library is part of Tidyverse and so can be used in conjunction with all the other functions that are part of Tidyverse.

There many different functions inside of the purrr library. For this vignette I will explain just two:

In order to demonstrate how these functions work, we will work with the most recent COVID-19 data set as of March 25, 2020, provided by the Johns Hopkins Whiting School of Engineering. This data set can be found here as part of this GitHub.

Before jumping in, I’ll load the necessary libraries as well as the data. I’ll also remove some columns that we won’t be using in this demonstration.

library(tidyverse)
library(stringr)
covid <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-25-2020.csv")
covid$Province_State <- replace_na(covid$Province_State, "")
covid <- covid %>% select(Province_State, Country_Region, Lat, Long_, Confirmed, Deaths) %>% arrange(desc(Confirmed)) 

head(covid)
## # A tibble: 6 x 6
##   Province_State Country_Region   Lat  Long_ Confirmed Deaths
##   <chr>          <chr>          <dbl>  <dbl>     <dbl>  <dbl>
## 1 ""             Italy           41.9  12.6      74386   7503
## 2 "Hubei"        China           31.0 112.       67801   3163
## 3 ""             Spain           40.5  -3.75     49515   3647
## 4 ""             Germany         51.2  10.5      37323    206
## 5 ""             Iran            32.4  53.7      27017   2077
## 6 ""             France          46.2   2.21     25233   1331

Note: There may be better, more elegant ways to do what I am demonstrating without using the purrr::map() and purrr:::pmap() functions, however, for the sake of example, I will use these functions.

purrr:map()

As an example, let’s say that we are curious about what percentage of total Confirmed cases each location makes up. To find this out we want to add a new column called “Percent_of_Total” that will hold the calculation. One way to do this would be to use the purrr::map() function. This funtion will allow us to apply any single argument function we create to every row of our data set, in essence, doing the same work a for loop would do, but in a functional way. To accomplish our goal we will need to create a function that looks at a single row’s Confirmed value and divides it by the total sum of the Confirmed values and apply it to every row of the vector.

Let’s first create our function:

#estimator function
percent_of_total <- function(x) {
  return(x/sum(x) * 100)
}

The funciton above takes an argument “x” and divides it by the sum of “x” (sum of the entire vector) and then multiplies that value by 100. This will calculate our percentage.

As mentioned above, one of the benefits of using the purrr package is that it can be used with other Tidyverse functions. To create our new column, we will use dplyr::select() to select the Confirmed column. Next, we will apply the function to that column (vector) by using purrr::map(), passing in our percent_of_total function as an argument.

new_col <- covid %>% dplyr::select(Confirmed) %>% purrr::map(percent_of_total)
class(new_col)
## [1] "list"
new_col[[1]][1:10]
##  [1] 15.908245 14.499972 10.589315  7.981924  5.777876  5.396348  3.818697
##  [8]  2.330441  2.037879  1.954046

In the output above you will notice that the output of purrr::map() is a list. If we want to add these percentages as a new column in our data frame, we can use the dplyr::mutate() function in combination with unlist(). Unlist in this instance is simply changing the list to a vector, allowing it to be easily added to the data frame.

covid <- covid %>% mutate("Percent_of_Total" = round(unlist(new_col),2))
covid
## # A tibble: 3,420 x 7
##    Province_State Country_Region   Lat  Long_ Confirmed Deaths Percent_of_Total
##    <chr>          <chr>          <dbl>  <dbl>     <dbl>  <dbl>            <dbl>
##  1 ""             Italy           41.9  12.6      74386   7503            15.9 
##  2 "Hubei"        China           31.0 112.       67801   3163            14.5 
##  3 ""             Spain           40.5  -3.75     49515   3647            10.6 
##  4 ""             Germany         51.2  10.5      37323    206             7.98
##  5 ""             Iran            32.4  53.7      27017   2077             5.78
##  6 ""             France          46.2   2.21     25233   1331             5.4 
##  7 "New York"     US              40.8 -74.0      17856    199             3.82
##  8 ""             Switzerland     46.8   8.23     10897    153             2.33
##  9 ""             United Kingdom  55.4  -3.44      9529    465             2.04
## 10 ""             Korea, South    35.9 128.        9137    126             1.95
## # ... with 3,410 more rows

Looking at the data frame above, you can see that utilizing the purrr::map() function enabled us to create the Percent_of_Total column very easily and without a for loop.

purrr:pmap()

What happens if you have a function with multiple arguments that you would like to apply to a vector? This is where purrr::pmap() comes in. This function is a variation of purrr:map() but allows you to work with functions with any number of variables as arguments. The one change you will need to make is that you will have to pass in a list() with the function arguments to purrr:pmap(). I will demonstrate this below with an example.

Let’s say for example, that we want to create a new column where we concatenate the Province_State column with the Country_Region column. More specifically, for those locations with both a Province_State and Country_Region value, we want to seperate the concatenated value with a comma. If there is no Province_State value, then we just want to return the Country_Region value. We can do this easily with purrr:pmap(). We’ll first create a function called “add_comma” that takes two arguments, x and y, which will end up being the Province_state column and the Country_Region column, respectively. Inside the function, I use an if statement to see if x (Province_State) is empty. If it is, then I just return y (Country_Region). If it’s not empty, then I concatenate the two columns together, seperated by a comma. We will apply this function to each row in the same way we did in the previous example with two distinct differences. First, we’ll need to create a list of arguments we want to pass to the function, here I’m calling it “arg_list”. Second, instead of chaining funtions like we did before, we will make this code more consise by directly placing purrr::pmap as an argument to the dplyr::mutate function. In order to do this, we need to first pass the argument list into purrr::pmap, then we need to pass the function we wish to call. As before, pmap, also returns a list, so we will need to call unlist() to tranform the list to a vector in order to create the new column in our data frame.

add_comma <- function(x, y) {
  if (x == "") {
    col_val <- y
  } else {
    col_val <- stringr::str_c(x, y, sep = ", ")
  }
  return(col_val)
}

arg_list <- list(x = covid$Province_State,  y = covid$Country_Region )
covid <- covid %>% mutate("Location" = unlist(purrr::pmap(arg_list, add_comma)))
head(covid$Location, 10)
##  [1] "Italy"          "Hubei, China"   "Spain"          "Germany"       
##  [5] "Iran"           "France"         "New York, US"   "Switzerland"   
##  [9] "United Kingdom" "Korea, South"

Let’s take a look at the final data frame reordered and cleaned up:

covid <- covid %>% select(Location, Lat, Long_, Confirmed, Percent_of_Total, Deaths)
head(covid)
## # A tibble: 6 x 6
##   Location       Lat  Long_ Confirmed Percent_of_Total Deaths
##   <chr>        <dbl>  <dbl>     <dbl>            <dbl>  <dbl>
## 1 Italy         41.9  12.6      74386            15.9    7503
## 2 Hubei, China  31.0 112.       67801            14.5    3163
## 3 Spain         40.5  -3.75     49515            10.6    3647
## 4 Germany       51.2  10.5      37323             7.98    206
## 5 Iran          32.4  53.7      27017             5.78   2077
## 6 France        46.2   2.21     25233             5.4    1331

As you can see above, the purrr::pmap function worked seamlessly. As I mentioned earlier, there are many other functions in the purrr library. Many of them allow you to return specific data type objects instead of lists such as map_int(), map_chr(), pmap_int(), and pmap_char(). Among other applications, these other functions can make it so you don’t need to use the unlist() function when working with the output.

The purrr library is an incredible tool to help make your code faster and more efficient by eliminating for loops and taking advantage of R’s functional abilities.