In many other programming languages, for loops are extremely important. However, R is a functional programming language, which means that R has the ability “to wrap up for loops in a function, and call that function instead of using the for loop directly” (R for Data Science, pg. 322).
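To make the quoted idea concrete, here is a minimal sketch in base R of what it means to wrap a for loop in a function. The helper apply_to_each() is a hypothetical name invented for this illustration; it is not part of base R or purrr.

```r
# Hypothetical helper: hides the for loop so callers never write it themselves
apply_to_each <- function(x, f) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]])
  }
  out
}

# Callers pass a function instead of writing a loop
squares <- apply_to_each(1:3, function(v) v^2)
str(squares)
```

The loop still exists, but it lives in one place, and the caller only states what should happen to each element.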
Many people familiar with R are familiar with the apply family of functions in base R (e.g. apply(), vapply(), lapply()). These functions, while incredibly useful, are inconsistent in their argument order and return types, which can make them harder to learn and often intimidating. This is where the purrr functions come in. Like the apply functions from base R, these functions allow you to apply a function to every element of a vector or list. This library was built with consistency in mind, making it easier to learn and use than its apply counterparts. Additionally, this library is part of the Tidyverse and so can be used in conjunction with all the other functions that are part of the Tidyverse.
There are many different functions in the purrr library. For this vignette I will explain just two: purrr::map() and purrr::pmap().
In order to demonstrate how these functions work, we will work with the most recent COVID-19 data set as of March 25, 2020, provided by the Johns Hopkins Whiting School of Engineering. The data set is part of the CSSEGISandData/COVID-19 repository on GitHub.
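As one small illustration of the consistency point, the sketch below compares sapply() with purrr::map_dbl() on made-up input. The variable names are hypothetical and the example is not taken from the original vignette.

```r
library(purrr)

# sapply() chooses its return type from the data it happens to see
a <- sapply(1:3, function(i) i * 2)         # a numeric vector
b <- sapply(integer(0), function(i) i * 2)  # an empty list, not a numeric vector

# map_dbl() always returns a double vector, even for empty input
c1 <- map_dbl(1:3, function(i) i * 2)
c2 <- map_dbl(integer(0), function(i) i * 2)

class(a)
class(b)
class(c1)
class(c2)
```

Because the return type of map_dbl() is fixed by its name, code that consumes the result does not need to handle a surprise list.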
Before jumping in, I’ll load the necessary libraries as well as the data. I’ll also remove some columns that we won’t be using in this demonstration.
library(tidyverse)
library(stringr)
covid <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-25-2020.csv")
covid$Province_State <- replace_na(covid$Province_State, "")
covid <- covid %>% select(Province_State, Country_Region, Lat, Long_, Confirmed, Deaths) %>% arrange(desc(Confirmed))
head(covid)
## # A tibble: 6 x 6
## Province_State Country_Region Lat Long_ Confirmed Deaths
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 "" Italy 41.9 12.6 74386 7503
## 2 "Hubei" China 31.0 112. 67801 3163
## 3 "" Spain 40.5 -3.75 49515 3647
## 4 "" Germany 51.2 10.5 37323 206
## 5 "" Iran 32.4 53.7 27017 2077
## 6 "" France 46.2 2.21 25233 1331
Note: There may be better, more elegant ways to do what I am demonstrating without using the purrr::map() and purrr::pmap() functions. However, for the sake of example, I will use these functions.
As an example, let’s say that we are curious about what percentage of total Confirmed cases each location makes up. To find this out, we want to add a new column called “Percent_of_Total” that will hold the calculation. One way to do this is with the purrr::map() function. This function applies a single-argument function to every element of a list or vector. Because a data frame is a list of columns, calling purrr::map() on a data frame applies the function to each column in turn, doing the same work a for loop would do, but in a functional way. To accomplish our goal, we need a function that takes the entire Confirmed column, divides each value by the sum of all Confirmed values, and returns the resulting vector of percentages.
Let’s first create our function:
# percentage function: converts a numeric vector to percentages of its total
percent_of_total <- function(x) {
  return(x / sum(x) * 100)
}
The function above takes a numeric vector “x”, divides each element by the sum of “x” (the sum of the entire vector), and then multiplies the result by 100. This gives each element’s percentage of the total.
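As a quick check of the arithmetic, here is the same function applied to a small made-up vector (the name cases and its values are hypothetical, not taken from the COVID-19 data).

```r
percent_of_total <- function(x) {
  x / sum(x) * 100
}

# Hypothetical case counts for three locations; they sum to 4
cases <- c(2, 1, 1)

whole_vector <- percent_of_total(cases)     # 50 25 25
single_value <- percent_of_total(cases[1])  # 100, because sum(x) is just x here

whole_vector
single_value
```

The second call shows why the function has to receive the whole column at once: given a single value, the sum is that value itself and the answer is always 100.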
As mentioned above, one of the benefits of using the purrr package is that it can be used with other Tidyverse functions. To create our new column, we will use dplyr::select() to select the Confirmed column. Next, we will apply the function to that column (vector) by using purrr::map(), passing in our percent_of_total function as an argument.
new_col <- covid %>% dplyr::select(Confirmed) %>% purrr::map(percent_of_total)
class(new_col)
## [1] "list"
new_col[[1]][1:10]
## [1] 15.908245 14.499972 10.589315 7.981924 5.777876 5.396348 3.818697
## [8] 2.330441 2.037879 1.954046
In the output above you will notice that purrr::map() returns a list. If we want to add these percentages as a new column in our data frame, we can use the dplyr::mutate() function in combination with unlist(). Here, unlist() simply converts the list to a vector, allowing it to be easily added to the data frame.
covid <- covid %>% mutate("Percent_of_Total" = round(unlist(new_col),2))
covid
## # A tibble: 3,420 x 7
## Province_State Country_Region Lat Long_ Confirmed Deaths Percent_of_Total
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "" Italy 41.9 12.6 74386 7503 15.9
## 2 "Hubei" China 31.0 112. 67801 3163 14.5
## 3 "" Spain 40.5 -3.75 49515 3647 10.6
## 4 "" Germany 51.2 10.5 37323 206 7.98
## 5 "" Iran 32.4 53.7 27017 2077 5.78
## 6 "" France 46.2 2.21 25233 1331 5.4
## 7 "New York" US 40.8 -74.0 17856 199 3.82
## 8 "" Switzerland 46.8 8.23 10897 153 2.33
## 9 "" United Kingdom 55.4 -3.44 9529 465 2.04
## 10 "" Korea, South 35.9 128. 9137 126 1.95
## # … with 3,410 more rows
Looking at the data frame above, you can see that utilizing the purrr::map() function enabled us to create the Percent_of_Total column very easily and without a for loop.
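To show how purrr::map() treats a data frame, here is a minimal sketch on a made-up two-column data frame (toy and its values are hypothetical). It suggests that the function is called once per column, and that the result is a list with one element per column.

```r
library(purrr)

percent_of_total <- function(x) x / sum(x) * 100

# Hypothetical data: two numeric columns
toy <- data.frame(Confirmed = c(2, 1, 1), Deaths = c(3, 1, 0))

# map() calls percent_of_total() once per column
result <- map(toy, percent_of_total)

names(result)     # one list element per column
result$Confirmed  # 50 25 25
result$Deaths     # 75 25  0
```

This is why selecting only the Confirmed column before calling purrr::map() produced a list of length one in the example above.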
What happens if you have a function with multiple arguments that you would like to apply across several vectors at once? This is where purrr::pmap() comes in. This function is a variation of purrr::map() that works with functions taking any number of arguments. The one change you will need to make is that you pass purrr::pmap() a list() holding one vector per function argument. I will demonstrate this below with an example.
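Before the larger example, here is a minimal sketch of the list-of-arguments form with three made-up vectors. The names arg_list, x, y, and z are chosen for illustration only.

```r
library(purrr)

# One vector per function argument; names match the function's parameters
arg_list <- list(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  z = c(100, 200, 300)
)

# pmap() calls the function once per position: f(1, 10, 100), f(2, 20, 200), ...
sums <- pmap(arg_list, function(x, y, z) x + y + z)

unlist(sums)  # 111 222 333
```

Naming the list elements after the function's parameters keeps the matching explicit; unnamed elements would be matched by position instead.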
Let’s say, for example, that we want to create a new column that concatenates the Province_State column with the Country_Region column. More specifically, for locations with both a Province_State and a Country_Region value, we want to separate the two values with a comma. If there is no Province_State value, we just want to return the Country_Region value. We can do this easily with purrr::pmap(). We’ll first create a function called “add_comma” that takes two arguments, x and y, which will end up being the Province_State column and the Country_Region column, respectively. Inside the function, I use an if statement to check whether x (Province_State) is empty. If it is, I just return y (Country_Region). If it’s not empty, I concatenate the two values, separated by a comma.
We will apply this function to each row, with two differences from the previous example. First, we need to create a list of the arguments we want to pass to the function; here I’m calling it “arg_list”. Second, instead of chaining functions like we did before, we will make the code more concise by placing purrr::pmap() directly inside the call to dplyr::mutate(). To do this, we first pass the argument list to purrr::pmap(), followed by the function we wish to call. As before, purrr::pmap() returns a list, so we need to call unlist() to transform the list into a vector before it can become the new column in our data frame.
add_comma <- function(x, y) {
  if (x == "") {
    col_val <- y
  } else {
    col_val <- stringr::str_c(x, y, sep = ", ")
  }
  return(col_val)
}
arg_list <- list(x = covid$Province_State, y = covid$Country_Region)
covid <- covid %>% mutate("Location" = unlist(purrr::pmap(arg_list, add_comma)))
head(covid$Location, 10)
## [1] "Italy" "Hubei, China" "Spain" "Germany"
## [5] "Iran" "France" "New York, US" "Switzerland"
## [9] "United Kingdom" "Korea, South"
Let’s take a look at the final data frame reordered and cleaned up:
covid <- covid %>% select(Location, Lat, Long_, Confirmed, Percent_of_Total, Deaths)
head(covid)
## # A tibble: 6 x 6
## Location Lat Long_ Confirmed Percent_of_Total Deaths
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Italy 41.9 12.6 74386 15.9 7503
## 2 Hubei, China 31.0 112. 67801 14.5 3163
## 3 Spain 40.5 -3.75 49515 10.6 3647
## 4 Germany 51.2 10.5 37323 7.98 206
## 5 Iran 32.4 53.7 27017 5.78 2077
## 6 France 46.2 2.21 25233 5.4 1331
As you can see above, the purrr::pmap() function worked seamlessly. As I mentioned earlier, there are many other functions in the purrr library. Many of them, such as map_int(), map_chr(), pmap_int(), and pmap_chr(), return an atomic vector of a specific type instead of a list. Among other applications, these functions can make it so you don’t need to call unlist() on the output.
The purrr library is an incredible tool for making your code more concise and readable by replacing for loops with function calls and taking advantage of R’s functional abilities.
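As a sketch of how a typed variant removes the unlist() step, the example below rebuilds the location labels on three made-up rows. The vectors province and country are hypothetical, and base paste() stands in for stringr::str_c() only to keep the example self-contained.

```r
library(purrr)

# Same logic as add_comma above; paste() is used here instead of stringr::str_c()
add_comma <- function(x, y) {
  if (x == "") y else paste(x, y, sep = ", ")
}

# Hypothetical rows
province <- c("", "Hubei", "New York")
country  <- c("Italy", "China", "US")

# pmap_chr() returns a character vector directly, so no unlist() is needed
location <- pmap_chr(list(x = province, y = country), add_comma)

# With exactly two inputs, map2_chr() is an equivalent alternative
location2 <- map2_chr(province, country, add_comma)

location
```

A typed variant also fails loudly if the function ever returns something that is not a single string, which unlist() would silently accept.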
In this extension of Christian’s well-done tutorial, I’m going to showcase three functions and one important concept.
Purpose:
map_df() takes a list (or vector) and a function, applies the function to each element, and returns the results combined into a single data frame.
Example:
This example shows how to load several local files in one step.
Here, myfiles is a character vector of the file names in the working directory that begin with vgsales and end with the .csv extension. map_df() takes myfiles as its first argument and read_csv as its second, and applies read_csv to each file name in turn. The end result is a single data frame containing the rows of every file.
# match file names that begin with vgsales and end with .csv
myfiles <- list.files(pattern = "^vgsales.+\\.csv$")
# read each file and row-bind the results into one data frame
mydf <- map_df(myfiles, read_csv)
## Warning: 271 parsing failures.
## row col expected actual file
## 1975 Year a double N/A 'vgsales_pre_2k.csv'
## 1976 Year a double N/A 'vgsales_pre_2k.csv'
## 1977 Year a double N/A 'vgsales_pre_2k.csv'
## 1978 Year a double N/A 'vgsales_pre_2k.csv'
## 1979 Year a double N/A 'vgsales_pre_2k.csv'
## .... .... ........ ...... ....................
## See problems(...) for more details.
mydf
## # A tibble: 16,598 x 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 133 Poké… GB 2000 Role… Nintendo 2.55 1.56 1.29
## 2 174 Fina… PS 2000 Role… SquareSo… 1.62 0.77 2.78
## 3 224 Driv… PS 2000 Acti… Atari 2.36 2.1 0.02
## 4 226 Tony… PS 2000 Spor… Activisi… 3.05 1.41 0.02
## 5 243 Drag… PS 2000 Role… Enix Cor… 0.2 0.14 4.1
## 6 295 Tekk… PS2 2000 Figh… Namco Ba… 1.68 1.51 0.51
## 7 334 Spyr… PS 2000 Plat… Sony Com… 1.93 1.58 0
## 8 359 WWF … PS 2000 Figh… THQ 2.01 1.35 0.06
## 9 368 Rugr… PS 2000 Acti… THQ 1.96 1.33 0
## 10 394 Cras… PS 2000 Misc Sony Com… 1.56 1.47 0.19
## # … with 16,588 more rows, and 2 more variables: Other_Sales <dbl>,
## # Global_Sales <dbl>
Note that I intentionally brought in the records that have N/A in Year. readr cannot parse these values as numbers, which produces the 271 parsing failures above.
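Since the vgsales files are local, here is a self-contained sketch of the same pattern using two tiny CSV files written to a temporary directory. The file names, the folder name purrr_demo, and the data are all made up for illustration, and base read.csv() stands in for read_csv(). It assumes dplyr is installed, which map_df() needs for row-binding.

```r
library(purrr)

# Write two small CSV files to a temporary folder (toy data, not the vgsales files)
tmp <- file.path(tempdir(), "purrr_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, sales = c(1.5, 2.5)),
          file.path(tmp, "demo_a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, sales = c(3.5, 4.5)),
          file.path(tmp, "demo_b.csv"), row.names = FALSE)

# Collect the matching file paths, then read and row-bind them in one call
files <- list.files(tmp, pattern = "^demo_.+\\.csv$", full.names = TRUE)
combined <- map_df(files, read.csv)

combined
```

In purrr 1.0.0 and later, map_df() is marked as superseded; the documentation suggests map() followed by list_rbind() as the replacement, though map_df() still works.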
Important concept: What is an anonymous function? It is a function that doesn’t have a name.
e.g.
A normal (named) function, round, is used in the following call: map_dbl(my_vector, round). Here the map function applies round to each element of its first argument, my_vector, and returns the results as a dbl (double-precision) vector, like the typed variants Christian mentioned above.
An anonymous function would be something like the following: map_dbl(my_vector, ~ .x + 10), where the second argument starts with a ~ followed by .x + 10 (. alone is also acceptable in place of .x). This is a function with no name that takes each element of my_vector and adds 10 to it. The ~ is purrr’s shorthand for defining such a function: purrr converts the one-sided formula into a function, and .x (or .) stands for each element of the first argument.
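The three ways of supplying the function can be compared side by side. This is a small sketch on a made-up vector; add_ten is a hypothetical name used only here.

```r
library(purrr)

my_vector <- c(1, 2, 3)

# 1. A named function
add_ten <- function(x) x + 10
named <- map_dbl(my_vector, add_ten)

# 2. An anonymous function written with purrr's formula shorthand
shorthand <- map_dbl(my_vector, ~ .x + 10)

# 3. An anonymous function written with the function keyword
classic <- map_dbl(my_vector, function(x) x + 10)

named  # 11 12 13
```

All three calls return the same vector; the shorthand simply saves typing when the function is short and used only once.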
Purpose:
keep() takes a list or vector and a predicate function, and keeps only the elements for which the predicate returns TRUE.
Example:
Here, keep() retains only the numeric columns of the data frame mydf by using is.numeric. I then go on to summarize the sales attributes by year.
# keep() retains only the columns that satisfy is.numeric()
# then drop Rank and filter out the rows where Year is NA
# finally group by Year and summarise every remaining column with sum()
mydf %>%
  keep(is.numeric) %>%
  select(-Rank) %>%
  filter(!is.na(Year)) %>%
  group_by(Year) %>%
  summarise_all(sum)
## # A tibble: 39 x 6
## Year NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1980 10.6 0.67 0 0.12 11.4
## 2 1981 33.4 1.96 0 0.32 35.8
## 3 1982 26.9 1.65 0 0.31 28.9
## 4 1983 7.76 0.8 8.1 0.14 16.8
## 5 1984 33.3 2.1 14.3 0.7 50.4
## 6 1985 33.7 4.74 14.6 0.92 53.9
## 7 1986 12.5 2.84 19.8 1.93 37.1
## 8 1987 8.46 1.41 11.6 0.2 21.7
## 9 1988 23.9 6.59 15.8 0.99 47.2
## 10 1989 45.2 8.44 18.4 1.5 73.4
## # … with 29 more rows
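For a smaller view of what keep() does, here is a sketch on a made-up list and a made-up vector (mixed and its contents are hypothetical).

```r
library(purrr)

# A hypothetical list with elements of different types
mixed <- list(a = 1.5, b = "text", c = 3L, d = TRUE)

# Keep only the elements for which is.numeric() returns TRUE
kept <- keep(mixed, is.numeric)
names(kept)  # "a" "c"

# The predicate can also be an anonymous function
big <- keep(c(1, 5, 10), ~ .x > 4)
big  # 5 10
```

Note that the logical element d is not kept, because is.numeric() returns FALSE for logical values.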
Purpose:
discard() takes a list or vector and a predicate function, and drops every element for which the predicate returns TRUE. It is the opposite of keep().
Example:
Here, discard() removes all the numeric columns from the data frame mydf.
# note that a number of columns in mydf are of type numeric
str(mydf)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 16598 obs. of 11 variables:
## $ Rank : num 133 174 224 226 243 295 334 359 368 394 ...
## $ Name : chr "Pokémon Crystal Version" "Final Fantasy IX" "Driver 2" "Tony Hawk's Pro Skater 2" ...
## $ Platform : chr "GB" "PS" "PS" "PS" ...
## $ Year : num 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ Genre : chr "Role-Playing" "Role-Playing" "Action" "Sports" ...
## $ Publisher : chr "Nintendo" "SquareSoft" "Atari" "Activision" ...
## $ NA_Sales : num 2.55 1.62 2.36 3.05 0.2 1.68 1.93 2.01 1.96 1.56 ...
## $ EU_Sales : num 1.56 0.77 2.1 1.41 0.14 1.51 1.58 1.35 1.33 1.47 ...
## $ JP_Sales : num 1.29 2.78 0.02 0.02 4.1 0.51 0 0.06 0 0.19 ...
## $ Other_Sales : num 0.99 0.14 0.25 0.2 0.02 0.35 0.19 0.16 0.23 0.17 ...
## $ Global_Sales: num 6.39 5.3 4.73 4.68 4.47 4.05 3.71 3.58 3.52 3.39 ...
# discard() drops the numeric columns, leaving only the character columns
mydf %>% discard(is.numeric) %>% str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 16598 obs. of 4 variables:
## $ Name : chr "Pokémon Crystal Version" "Final Fantasy IX" "Driver 2" "Tony Hawk's Pro Skater 2" ...
## $ Platform : chr "GB" "PS" "PS" "PS" ...
## $ Genre : chr "Role-Playing" "Role-Playing" "Action" "Sports" ...
## $ Publisher: chr "Nintendo" "SquareSoft" "Atari" "Activision" ...
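To show that discard() and keep() split a data frame into complementary halves, here is a sketch on a made-up data frame (toy and its columns are hypothetical).

```r
library(purrr)

# Hypothetical data frame with one character column and two numeric columns
toy <- data.frame(
  name  = c("a", "b"),
  rank  = c(1, 2),
  sales = c(0.5, 1.5),
  stringsAsFactors = FALSE
)

dropped_numeric <- names(discard(toy, is.numeric))  # "name"
kept_numeric    <- names(keep(toy, is.numeric))     # "rank" "sales"

# discard() also works on plain vectors, for example to drop missing values
no_missing <- discard(c(1, NA, 3), is.na)

dropped_numeric
kept_numeric
no_missing
```

Every column ends up in exactly one of the two results, so the pair is a convenient way to separate a data frame by column type.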