9_4_map_variants

Summary

Every column means a type of return, governed by a family

and based on different number/type of arguments, every family has a series of members.


List
Atomic Same type Nothing
One argument map() map_lgl(), … modify() walk()
Two arguments map2() map2_lgl(), … modify2() walk2()
One argument + index imap() imap_lgl(), … imodify() iwalk()
N arguments pmap() pmap_lgl(), …
pwalk()

9.4.1 modify(): Same type of output as input:

How do we double every column in a dataframe?

map would return us a list instead of a dataframe we want.

df <- data.frame(
  a <- 1:10,
  b <- 25:16
)

map(df, ~.x * 2)
$a....1.10
 [1]  2  4  6  8 10 12 14 16 18 20

$b....25.16
 [1] 50 48 46 44 42 40 38 36 34 32

modify

We can use modify() to keep the output the same type as input

(df <- modify(df, .f = ~.x * 2))
   a....1.10 b....25.16
1          2         50
2          4         48
3          6         46
4          8         44
5         10         42
6         12         40
7         14         38
8         16         36
9         18         34
10        20         32
  • as indicated above, modify does not modify in place, meaning we need to assign if we want to preserve the modified value.

Core of modify is basically:

simple_modify <- function(x, f, ...) {
  for (i in seq_along(x)) {
    x[[i]] <- f(x[[i]], ...)
  }
  x
} # simplified a little bit than the actual one.

modify_if

what about a mixed dataframe with both character columns and numeric columns?

we can use modify_if() to only modify the numeric ones.

modify_if(iris, .p = is.numeric, .f = ~.x * 100) %>% as_tibble()
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          510         350          140          20 setosa 
 2          490         300          140          20 setosa 
 3          470         320          130          20 setosa 
 4          460         310          150          20 setosa 
 5          500         360          140          20 setosa 
 6          540         390          170          40 setosa 
 7          460         340          140          30 setosa 
 8          500         340          150          20 setosa 
 9          440         290          140          20 setosa 
10          490         310          150          10 setosa 
# ℹ 140 more rows

9.4.2 map2() and friends: 2 inputs

Recall map() only vectorise the .x, for the rest arguments fed to .f, even if they are in vector forms, they remain the same instead of being vectorised.

But what if we need to be able to provide the extra argument as a vector?
For example, sometimes we want to calculate mean for each group by weights

# 8 groups of 10 numbers
sample_data_points <- map(1:8, ~ runif(10))

sample_data_points[[1]][1] <- NA

# 8 groups of weights, each weight is a vector of length 10 corresponding to the 10 numbers in the group.
weight <- map(1:8, ~ rpois(n = 10, lambda =  5) + 1)

For one group we just do

sample_data_points[[2]] %>% 
  weighted.mean(w = weight[[2]])
[1] 0.6030941

To calculate weighted mean for all groups, we will face Error if we go for map()

map_dbl(sample_data_points, .f = weighted.mean, w = weight)

# Error: ! 'x' and 'w' must have the same length

This is because, for each group, weight is not vectorised and simply fed to w as a list of weights.

map2

map2() can help: it is vectorised over two arguments. This means both .x and .y are varied in each call to .f:

map2_dbl(.x = sample_data_points, 
         .y = weight, 
         .f = weighted.mean)
[1]        NA 0.6030941 0.6082897 0.5487233 0.5430378 0.5943945 0.5288921
[8] 0.5265478

Compared to map()

two vectors come before the function, rather than one. Additional arguments still go afterwards

nice property: recycling .y

map2 recycles .y when its length is not sufficient, meaning if .y is of length 1, map2(x, y, f) is essentially equivalent to map(x, f, y)

9.4.3 walk() and friends: No outputs

Most functions are called for the value that they return, so it makes sense to capture and store the value with a map() function. But some functions are called primarily for their side-effects(e.g. cat(), write.csv(), or ggsave())

names <- c("Danny", "Daniel")
map(names, ~ print(paste0("Hello, ", .x) ))
[1] "Hello, Danny"
[1] "Hello, Daniel"
[[1]]
[1] "Hello, Danny"

[[2]]
[1] "Hello, Daniel"

The main purpose is to print the message to users. Returning a list is somewhat redundant.

We can muffle the ouput by assigning the value to a variable we will never use.

useless_variable <- map(names, ~ print(paste0("Hello, ", .x) ))
[1] "Hello, Danny"
[1] "Hello, Daniel"

But this makes our code hard to read and not purpose-clear.

walk

walk ignores the return values of the .f and instead return .x invisibly

walk(names, ~ print(paste0("Hello, ", .x) ))
[1] "Hello, Danny"
[1] "Hello, Daniel"

walk2

Very useful! As we often want to output a file for each of the element in a list. And that requires a pair of arguments, the object and the path that you want to save it to.


For example we have some dataframes we want to output respectively as .csv into a folder.

# assign a path to store the output files.
path_folder <- "/Users/handingzhang/Desktop/Advanced_R_Notes/functional/9_4_output"
if(!file.exists(path_folder)) dir.create(path_folder)

# split the dfs, each to be written to a csv in the path.
cyls <- split(mtcars, mtcars$cyl)

# create a vector of paths for the outputs to go to.
csv_paths <- file.path(path_folder, 
                   paste0("cyl-", 
                          names(cyls), 
                          ".csv") # make the file name for each csv using the df name.
                   )

# walk2 would iterate through the dataframes and write them out as csv.
walk2(.x = cyls, .y = csv_paths, .f = write.csv)

# check the files in the folder to make sure they have created!
dir(path_folder)
#> [1] "cyl-4.csv" "cyl-6.csv" "cyl-8.csv"

Here the walk2() is equivalent to write.csv(cyls[[1]], paths[[1]]), write.csv(cyls[[2]], paths[[2]]), write.csv(cyls[[3]], paths[[3]]).

9.4.4 Iterating over values and indices

There are three basic ways to loop over a vector with a for loop:

  • Loop over the elements: for (x in xs)

  • Loop over the numeric indices: for (i in seq_along(xs))

  • Loop over the names: for (nm in names(xs))

The first form is analogous to the map() family.

imap

The second and third forms are equivalent to the imap() family which allows you to iterate over the values and the indices of a vector in parallel

properties

imap(x, f) is equivalent to map2(x, names(x), f) if x has names.

imap(x, f) is equivalent to map2(x, seq_along(x), f) if x has no names.

over names

recall a dataframe is essentially a list of same-length, named vectors.

imap_chr(iris, ~ paste0("The first value of ", .y, " is ", .x[[1]]))
                            Sepal.Length 
"The first value of Sepal.Length is 5.1" 
                             Sepal.Width 
 "The first value of Sepal.Width is 3.5" 
                            Petal.Length 
"The first value of Petal.Length is 1.4" 
                             Petal.Width 
 "The first value of Petal.Width is 0.2" 
                                 Species 
  "The first value of Species is setosa" 

basically using imap(), we have access to .y

another example:

imap(.x = mtcars, .f = function(x, y){paste0("Index(name) is: ", y, ". Corresponding column has first value as: ", x[[1]])})
$mpg
[1] "Index(name) is: mpg. Corresponding column has first value as: 21"

$cyl
[1] "Index(name) is: cyl. Corresponding column has first value as: 6"

$disp
[1] "Index(name) is: disp. Corresponding column has first value as: 160"

$hp
[1] "Index(name) is: hp. Corresponding column has first value as: 110"

$drat
[1] "Index(name) is: drat. Corresponding column has first value as: 3.9"

$wt
[1] "Index(name) is: wt. Corresponding column has first value as: 2.62"

$qsec
[1] "Index(name) is: qsec. Corresponding column has first value as: 16.46"

$vs
[1] "Index(name) is: vs. Corresponding column has first value as: 0"

$am
[1] "Index(name) is: am. Corresponding column has first value as: 1"

$gear
[1] "Index(name) is: gear. Corresponding column has first value as: 4"

$carb
[1] "Index(name) is: carb. Corresponding column has first value as: 4"

over indices

If the vector is unnamed, the second argument will be the index:

# generate 6 groups of 10 numbers.
x <- map(1:6, ~ sample(x = 1000, size = 10))

imap_chr(x, ~ paste0("The highest value of the ", .y, "th group is ", max(.x)))
[1] "The highest value of the 1th group is 989"
[2] "The highest value of the 2th group is 961"
[3] "The highest value of the 3th group is 793"
[4] "The highest value of the 4th group is 725"
[5] "The highest value of the 5th group is 830"
[6] "The highest value of the 6th group is 965"

imap() is a useful helper if you want to work with the values in a vector along with their positions.

9.4.5 pmap() and friends: Any number of inputs

pmap

Instead of generalising map2() to an arbitrary number of arguments, purrr takes a slightly different tack with pmap(): you supply it a single list, which contains any number of arguments. In most cases, that will be a list of equal-length vectors, i.e. something very similar to a data frame.

Thus:

map2(x, y, f) is equivalent to pmap(list(x, y), f)

Let’s use the previous example in map2()

# sample_data_points: list 8 groups of 10 numbers, with the first group containing NA.
# weight: list of 8 groups of weights, each weight is a vector of length 10 corresponding to the 10 numbers in the group.

pmap_dbl(list(sample_data_points, weight), weighted.mean)
[1]        NA 0.6030941 0.6082897 0.5487233 0.5430378 0.5943945 0.5288921
[8] 0.5265478
#> [1]    NA 0.451 0.603 0.452 0.563 0.510 0.342 0.464

Just as in map2() , the exatra constant argument(s) can come after.

pmap_dbl(list(sample_data_points, weight), weighted.mean, na.rm = T)
[1] 0.3550598 0.6030941 0.6082897 0.5487233 0.5430378 0.5943945 0.5288921
[8] 0.5265478

name the argument in the list

We have nice control over the arguments as we can name the elements in the .l list that contains the arguments to feed into the .f

trims <- c(0, 0.1, 0.2, 0.5)
x <- rcauchy(1000)

pmap_dbl(list(trim = trims), mean, x = x)
[1] -1.71627511 -0.05947227 -0.05067121 -0.04380706
#> [1] -6.6740  0.0210  0.0235  0.0151

making it crystal clear that for the .f = mean, the trim argument is supplied with trims vector as value. x argument is supplied with x as value.

whereas if we do equivalently in map_dbl()

map_dbl(.x = trims, 
        .f = mean, 
        x = x)
[1] -1.71627511 -0.05947227 -0.05067121 -0.04380706

we are counting on the reader or ourselves to remember that the first argument after x is trim in the mean() function for this call to make sense.

tip:

it’s good practice to name the components of the list to make it very clear how the function will be called.

argument dataframe!

It’s often convenient to call pmap() with a data frame. A handy way to create that data frame is with tibble::tribble()

Because thereby we can describe the arguments row by row, less likely to make mistakes!

Each row essentially corresponds to one linear combination of arguments to a call of .f

Example:

# Recall dplyr::tribble is a good way to create small and easy-to-read tibbles in a row-by-row style.
params <- dplyr::tribble(
  ~ n, ~ min, ~ max,
   1L,     0,     1,
   2L,    10,   100,
   3L,   100,  1000
)

pmap(params, runif)
[[1]]
[1] 0.6960803

[[2]]
[1] 65.87241 45.44729

[[3]]
[1] 490.2665 479.6596 466.4164

In the above example then, pmap(params, runif) is equivalent to runif(n = 1L, min = 0, max = 1), runif(n = 2, min = 10, max = 100), runif(n = 3L, min = 100, max = 1000)

if the names of the argument dataframe does not match well with the .f function . Simply use dplyr::rename to remedy.

9.4.6 Exercises

Q1

  1. Explain the results of modify(mtcars, 1).

    modify(mtcars,1)
       mpg cyl disp  hp drat   wt  qsec vs am gear carb
    1   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    2   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    3   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    4   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    5   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    6   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    7   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    8   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    9   21   6  160 110  3.9 2.62 16.46  0  1    4    4
    10  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    11  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    12  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    13  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    14  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    15  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    16  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    17  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    18  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    19  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    20  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    21  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    22  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    23  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    24  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    25  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    26  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    27  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    28  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    29  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    30  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    31  21   6  160 110  3.9 2.62 16.46  0  1    4    4
    32  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Well we take a look at as_mapper(1) and know that basically modify(mtcars, .f = 1) is essentially indexing position 1 for us using pluck_raw in elements of the list, mtcars, which are the columns. So we see the value of the first element in each column in our output. Meanwhile modify() tries to maintain the structure of the output to be the same as the input, which is a dataframe of 32 rows. Thus it recycles the index of length one 32 times. Eventually showed us the 32 rows of the same value, the first element of each column.

official solution

Q2

Rewrite the following code to use iwalk() instead of walk2(). What are the advantages and disadvantages?

# author did this earlier
temp <- tempfile()
dir.create(temp)

cyls <- split(mtcars, mtcars$cyl)
paths <- file.path(temp, paste0("cyl-", names(cyls), ".csv"))
walk2(cyls, paths, write.csv)

dir(temp)
[1] "cyl-4.csv" "cyl-6.csv" "cyl-8.csv"
list.files(temp)
[1] "cyl-4.csv" "cyl-6.csv" "cyl-8.csv"

let’s use iwalk

#reset the folder
unlink(temp)
temp <- tempfile()
dir.create(temp)

# make sure it is empty.
dir(temp)
character(0)
# recreate the path names based on the new tempfile.
paths <- file.path(temp, paste0("cyl-", names(cyls), ".csv"))


iwalk(paths, .f = ~write.csv(x = cyls[[.y]], file = .x))


# check to ensure they are correctly created.
dir(temp)
[1] "cyl-4.csv" "cyl-6.csv" "cyl-8.csv"

book solution

A: iwalk() allows us to use a single variable, storing the output path in the names.

temp <- tempfile()
dir.create(temp)

cyls <- split(mtcars, mtcars$cyl)
names(cyls) <- file.path(temp, paste0("cyl-", names(cyls), ".csv"))
iwalk(cyls, ~ write.csv(.x, .y))

Copy

We could do this in a single pipe by taking advantage of set_names():

mtcars %>%
  split(mtcars$cyl) %>%
  set_names(~ file.path(temp, paste0("cyl-", .x, ".csv"))) %>%
  iwalk(~ write.csv(.x, .y))

Q3

  1. Explain how the following code transforms a data frame using functions stored in a list.
trans <- list(
  disp = function(x) x * 0.0163871,
  am = function(x) factor(x, labels = c("auto", "manual"))
)

nm <- names(trans)
mtcars[nm] <- map2(trans, mtcars[nm], function(f, var) f(var))

Compare and contrast the map2() approach to this map() approach:

# reset mtcars
data(mtcars)

(mtcars[nm] <- map(nm, ~ trans[[.x]](mtcars[[.x]])))
[[1]]
 [1] 2.621936 2.621936 1.769807 4.227872 5.899356 3.687098 5.899356 2.403988
 [9] 2.307304 2.746478 2.746478 4.519562 4.519562 4.519562 7.734711 7.538066
[17] 7.210324 1.289665 1.240503 1.165123 1.968091 5.211098 4.981678 5.735485
[25] 6.554840 1.294581 1.971368 1.558413 5.751872 2.376130 4.932517 1.982839

[[2]]
 [1] manual manual manual auto   auto   auto   auto   auto   auto   auto  
[11] auto   auto   auto   auto   auto   auto   auto   manual manual manual
[21] auto   auto   auto   auto   auto   manual manual manual manual manual
[31] manual manual
Levels: auto manual

my solution

Basically trans stores two named functions with names corresponding to 2 column names in mtcars. Each name corresponds to the index of the column and the function for the actions we want to take with that column.

nm then serves as the vector of the column names we will index.

finally, map2(trans, mtcars[nm], function(f, var) f(var)) basically align the name of the function with the column with the same name, creating two modified columns.

In the alternative way mtcars[nm] <- map(nm, ~ trans[.x]) nm is serving as both the function names to index from our function list, trans and the column names to index from the dataframe. This way also works but we can easily make mistakes if even one pair of name of the function in trans and name of the column in mtcars are not perfectly matching.

book solution

A: In the first approach

mtcars[nm] <- map2(trans, mtcars[nm], function(f, var) f(var))

the list of the 2 functions (trans) and the 2 appropriately selected data frame columns (mtcars[nm]) are supplied to map2(). map2() creates an anonymous function (f(var)) which applies the functions to the variables when map2() iterates over their (similar) indices. On the left-hand side, the respective 2 elements of mtcars are being replaced by their new transformations.

The map() variant

mtcars[nm] <- map(nm, ~ trans[[.x]](mtcars[[.x]]))

does basically the same. However, it directly iterates over the names (nm) of the transformations. Therefore, the data frame columns are selected during the iteration.

Besides the iteration pattern, the approaches differ in the possibilities for appropriate argument naming in the .f argument. In the map2() approach we iterate over the elements of x and y. Therefore, it is possible to choose appropriate placeholders like f and var. This makes the anonymous function more expressive at the cost of making it longer. We think using the formula interface in this way is preferable compared to the rather cryptic mtcars[nm] <- map2(trans, mtcars[nm], ~ .x(.y)).

In the map() approach we map over the variable names. It is therefore not possible to introduce placeholders for the function and variable names. The formula syntax together with the .x pronoun is pretty compact. The object names and the brackets clearly indicate the application of transformations to specific columns of mtcars. In this case the iteration over the variable names comes in handy, as it highlights the importance of matching between trans and mtcars element names. Together with the replacement form on the left-hand side, this line is relatively easy to inspect.

To summarise, in situations where map() and map2() provide solutions for an iteration problem, several points may be considered before deciding for one or the other approach.

Q4

Q4: What does write.csv() return, i.e. what happens if you use it with map2() instead of walk2()?

A: write.csv() returns NULL. As we call the function for its side effect (creating a CSV file), walk2() would be appropriate here. Otherwise, we receive a rather uninformative list of NULLs.

cyls <- split(mtcars, mtcars$cyl)

temp1 <- tempfile()
dir.create(temp1)

paths <- file.path(temp1, paste0("cyl-", names(cyls), ".csv"))

map2(cyls, paths, write.csv)
$`4`
NULL

$`6`
NULL

$`8`
NULL