Iteration with Purrr

In functions, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Reducing code duplication has three main benefits:

It’s easier to see the intent of your code, because your eyes are drawn to what’s different, not what stays the same.

It’s easier to respond to changes in requirements. As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.

You’re likely to have fewer bugs because each line of code is used in more places.

Another tool for reducing duplication is iteration, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. In this chapter you’ll learn about two important iteration paradigms: imperative programming and functional programming.

On the imperative side you have tools like for loops and while loops, which are a great place to start because they make iteration very explicit, so it’s obvious what’s happening. However, for loops are quite verbose, and require quite a bit of bookkeeping code that is duplicated for every for loop.

Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.

Prerequisites

library(tidyverse)
package ‘tidyverse’ was built under R version 3.3.3
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -----------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

for Loops

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df

We want to compute the median of each column. You could do this with copy-and-paste:

median(df$a)
[1] 0.05681125
median(df$b)
[1] 0.5622278
median(df$c)
[1] 0.07211131
median(df$d)
[1] 0.3104609

But that breaks our rule of thumb: never copy and paste more than twice. Instead, we could use a for loop:

output <- vector("double", ncol(df))
for (i in seq_along(df)) {            
  output[[i]] <- median(df[[i]])      
}
output
[1] 0.05681125 0.56222775 0.07211131 0.31046093

Every for loop has three components:

The output: output <- vector("double", ncol(df)). Before you start the loop, you must always allocate sufficient space for the output. This is very important for efficiency: if you grow the output at each iteration using c() (for example), your for loop will be very slow.

A general way of creating an empty vector of given length is the vector() function. It has two arguments: the type of the vector (“logical”, “integer”, “double”, “character”, etc) and the length of the vector.

The sequence: i in seq_along(df). This determines what to loop over: each run of the for loop will assign i to a different value from seq_along(df). It’s useful to think of i as a pronoun, like “it”.

You might not have seen seq_along() before. It’s a safe version of the familiar 1:length(x), with an important difference: if you have a zero-length vector, seq_along() does the right thing:

y <- vector("double", 0)
seq_along(y)
integer(0)
1:length(y)
[1] 1 0

You probably won’t create a zero-length vector deliberately, but it’s easy to create them accidentally. If you use 1:length(x) instead of seq_along(x), you’re likely to get a confusing error message.

The body: output[[i]] <- median(df[[i]]). This is the code that does the work. It’s run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.
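As a small sketch of the helpers described above, the following shows what vector() returns for a few types, and why seq_along() is safer than 1:length() when the input is empty. The names empty, visited_bad and visited_good are illustrative and not part of the original example.

```r
# vector() creates an "empty" vector of a given type and length
vector("logical", 2)    # FALSE FALSE
vector("double", 3)     # 0 0 0
vector("character", 2)  # "" ""

# With a zero-length input, 1:length(x) counts down from 1 to 0,
# so the loop body runs twice; seq_along(x) runs it zero times.
empty <- vector("double", 0)

visited_bad <- 0
for (i in 1:length(empty)) visited_bad <- visited_bad + 1

visited_good <- 0
for (i in seq_along(empty)) visited_good <- visited_good + 1

visited_bad   # 2
visited_good  # 0
```

In a real loop body, the two spurious iterations would typically try to read element 1 of an empty vector, which is where the confusing error message comes from.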

Exercises

Write for loops to:

Compute the mean of every column in mtcars.

output <- vector("double", ncol(mtcars))
names(output) <- names(mtcars)
for (i in names(mtcars)) {
  # index by column name so each mean lands in its own slot
  output[[i]] <- mean(mtcars[[i]])
}
output
# a named vector with one mean per column (for example, mpg is about 20.09)

Determine the type of each column in nycflights13::flights.

data("flights", package = "nycflights13")
output <-  vector("list", ncol(flights))
names(output) <-  names(flights)
for (i in names(flights)){
  output[[i]] <-  class(flights[[i]])
}
output
$year
[1] "integer"

$month
[1] "integer"

$day
[1] "integer"

$dep_time
[1] "integer"

$sched_dep_time
[1] "integer"

$dep_delay
[1] "numeric"

$arr_time
[1] "integer"

$sched_arr_time
[1] "integer"

$arr_delay
[1] "numeric"

$carrier
[1] "character"

$flight
[1] "integer"

$tailnum
[1] "character"

$origin
[1] "character"

$dest
[1] "character"

$air_time
[1] "numeric"

$distance
[1] "numeric"

$hour
[1] "numeric"

$minute
[1] "numeric"

$time_hour
[1] "POSIXct" "POSIXt" 

Compute the number of unique values in each column of iris.

data(iris)
iris_uniq <- vector("double", ncol(iris))
names(iris_uniq) <- names(iris)
for (i in names(iris)) {
  iris_uniq[i] <- length(unique(iris[[i]]))
}
iris_uniq
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          35           23           43           22            3 

Generate 10 random normals for each of μ = -10, 0, 10, and 100. Think about the output, sequence, and body before you start writing the loop.

# number to draw
n <- 10
# values of the mean
mu <- c(-10, 0, 10, 100)
normals <- vector("list", length(mu))
for (i in seq_along(normals)) {
  normals[[i]] <- rnorm(n, mean = mu[i])
}
normals
[[1]]
 [1]  -9.891964  -9.872125  -9.409878 -11.415893  -9.431410 -10.563044 -10.467774
 [8]  -8.692471  -8.636785  -9.666098

[[2]]
 [1]  0.9846059  0.2957479 -0.3794097  0.9516208 -0.1972844 -0.9125486 -2.2860760
 [8]  1.2632451  0.3307835 -0.9994685

[[3]]
 [1]  9.224396  9.529922  9.197843  9.946666 10.476068 10.076336 10.540870
 [8]  9.657992  9.790516  9.796507

[[4]]
 [1] 101.26619 100.57311  99.28083  98.09691  99.41012 101.47679 100.87982
 [8]  99.39403 100.61694  99.94372

Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

out <- ""
letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
for (x in letters) {
  out <- stringr::str_c(out, x)
}
out
[1] "abcdefghijklmnopqrstuvwxyz"

Solution: str_c already works with vectors, so simply use str_c with the collapse argument to return a single string.

stringr::str_c(letters, collapse = "")
[1] "abcdefghijklmnopqrstuvwxyz"

Write a for loop that prints() the lyrics to the children’s song “Alice the camel”.

humps <- c("five", "four", "three", "two", "one", "no")
for (i in humps) {
  cat(stringr::str_c("Alice the camel has ", rep(i, 3), " humps.",
             collapse = "\n"), "\n")
  if (i == "no") {
    cat("Now Alice is a horse.\n")
  } else {
    cat("So go, Alice, go.\n")
  }
  cat("\n")
}
Alice the camel has five humps.
Alice the camel has five humps.
Alice the camel has five humps. 
So go, Alice, go.

Alice the camel has four humps.
Alice the camel has four humps.
Alice the camel has four humps. 
So go, Alice, go.

Alice the camel has three humps.
Alice the camel has three humps.
Alice the camel has three humps. 
So go, Alice, go.

Alice the camel has two humps.
Alice the camel has two humps.
Alice the camel has two humps. 
So go, Alice, go.

Alice the camel has one humps.
Alice the camel has one humps.
Alice the camel has one humps. 
So go, Alice, go.

Alice the camel has no humps.
Alice the camel has no humps.
Alice the camel has no humps. 
Now Alice is a horse.

Convert the nursery rhyme “ten in the bed” to a function. Generalise it to any number of people in any sleeping structure.

numbers <- c("ten", "nine", "eight", "seven", "six", "five",
             "four", "three", "two", "one")
for (i in numbers) {
  cat(stringr::str_c("There were ", i, " in the bed\n"))
  cat("and the little one said\n")
  if (i == "one") {
    cat("I'm lonely...")
    } else {
    cat("Roll over, roll over\n")
    cat("So they all rolled over and one fell out.\n")
      }
      cat("\n")
  }
There were ten in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were nine in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were eight in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were seven in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were six in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were five in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were four in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were three in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were two in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were one in the bed
and the little one said
I'm lonely...
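
The loop above is not yet a function. One possible way to wrap it up, as the exercise asks, is sketched below. The function name ten_in_the_bed and its arguments n and place are hypothetical, and it prints digits rather than number words to keep the example short.

```r
# Hypothetical generalisation: n sleepers in an arbitrary sleeping place
ten_in_the_bed <- function(n = 10, place = "bed") {
  stopifnot(n >= 1)
  for (i in seq(n, 1)) {
    cat("There were ", i, " in the ", place, "\n", sep = "")
    cat("and the little one said\n")
    if (i == 1) {
      cat("I'm lonely...\n")
    } else {
      cat("Roll over, roll over\n")
      cat("So they all rolled over and one fell out.\n\n")
    }
  }
}

ten_in_the_bed(3, "hammock")
```

Counting down with seq(n, 1) is safe here only because the function first checks that n is at least 1.
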
Convert the song “99 bottles of beer on the wall” to a function. Generalise to any number of any vessel containing any liquid on any surface.

# Format a count of bottles, handling the singular and zero cases
bottles <- function(i) {
  if (i > 1) {
    stringr::str_c(i, " bottles")
  } else if (i == 1) {
    "1 bottle"
  } else {
    "no more bottles"
  }
}
beer_bottles <- function(n) {
  # the countdown only makes sense for at least one bottle
  stopifnot(n >= 1)
  for (i in seq(n, 1)) {
     cat(stringr::str_c(bottles(i), " of beer on the wall, ", bottles(i), " of beer.\n"))
     cat(stringr::str_c("Take one down and pass it around, ", bottles(i - 1),
                " of beer on the wall.\n\n"))
  }
  cat("No more bottles of beer on the wall, no more bottles of beer.\n")
  cat(stringr::str_c("Go to the store and buy some more, ", bottles(n), " of beer on the wall.\n"))
}
beer_bottles(4)
4 bottles of beer on the wall, 4 bottles of beer.
Take one down and pass it around, 3 bottles of beer on the wall.

3 bottles of beer on the wall, 3 bottles of beer.
Take one down and pass it around, 2 bottles of beer on the wall.

2 bottles of beer on the wall, 2 bottles of beer.
Take one down and pass it around, 1 bottle of beer on the wall.

1 bottle of beer on the wall, 1 bottle of beer.
Take one down and pass it around, no more bottles of beer on the wall.

No more bottles of beer on the wall, no more bottles of beer.
Go to the store and buy some more, 4 bottles of beer on the wall.

It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:

output <- vector("integer", 0)
for (i in seq_along(x)) {
  output <- c(output, lengths(x[[i]]))
}
output
[1] 1

I’ll use the package microbenchmark to time this. Microbenchmark will run an R expression a number of times and time it.

Define a function that appends to an integer vector.

library(microbenchmark)
package ‘microbenchmark’ was built under R version 3.3.3
add_to_vector <- function(n) {
  output <- vector("integer", 0)
  for (i in seq_len(n)) {
    output <- c(output, i)
  }
  output  
}
microbenchmark(add_to_vector(10000), times = 3)
Unit: milliseconds
                 expr      min      lq     mean  median       uq      max neval
 add_to_vector(10000) 257.1821 271.989 293.0816 286.796 311.0313 335.2667     3

And one that pre-allocates it.

add_to_vector_2 <- function(n) {
  output <- vector("integer", n)
  for (i in seq_len(n)) {
    output[[i]] <- i
  }
  output
}
microbenchmark(add_to_vector_2(10000), times = 3)
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max
 add_to_vector_2(10000) 26.80056 26.90097 27.38167 27.00137 27.67223 28.34308
 neval
     3

The pre-allocated vector is about 10 times faster here (a median of roughly 27 ms versus 287 ms)! Your mileage may vary, but the longer the vector and the bigger the objects, the more pre-allocation will outperform appending.

For Loop Variations

Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don’t forget about them once you’ve mastered the FP techniques you’ll learn about in the next section.

There are four variations on the basic theme of the for loop:

  • Modifying an existing object, instead of creating a new object.
  • Looping over names or values, instead of indices.
  • Handling outputs of unknown length.
  • Handling sequences of unknown length.

    Modifying an existing object

    Sometimes you want to use a for loop to modify an existing object. For example, remember our challenge from functions. We wanted to rescale every column in a data frame:

    df <- tibble(
      a = rnorm(10),
      b = rnorm(10),
      c = rnorm(10),
      d = rnorm(10)
    )
    rescale01 <- function(x) {
      rng <- range(x, na.rm = TRUE)
      (x - rng[1]) / (rng[2] - rng[1])
    }
    df$a <- rescale01(df$a)
    df$b <- rescale01(df$b)
    df$c <- rescale01(df$c)
    df$d <- rescale01(df$d)

    To solve this with a for loop we again think about the three components:

    Output: we already have the output - it’s the same as the input!
    Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).
    Body: apply rescale01().

    This gives us:

    for (i in seq_along(df)) {
      df[[i]] <- rescale01(df[[i]])
    }
    df
    Typically you’ll be modifying a list or data frame with this sort of loop, so remember to use [[, not [. You might have spotted that I used [[ in all my for loops: I think it’s better to use [[ even for atomic vectors because it makes it clear that I want to work with a single element.

    Looping Patterns

    There are three basic ways to loop over a vector. So far I’ve shown you the most general: looping over the numeric indices with for (i in seq_along(xs)), and extracting the value with xs[[i]]. There are two other forms:

    Loop over the elements: for (x in xs). This is most useful if you only care about side-effects, like plotting or saving a file, because it’s difficult to save the output efficiently.

    Loop over the names: for (nm in names(xs)). This gives you a name, which you can use to access the value with xs[[nm]]. This is useful if you want to use the name in a plot title or a file name. If you’re creating named output, make sure to name the results vector like so:

    results <- vector("list", length(xs))
    names(results) <- names(xs)
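
To make the three looping patterns concrete, here is a minimal sketch that runs each of them over the same small named vector. The data in xs and the doubling step are illustrative assumptions, not part of the original text.

```r
xs <- c(a = 1.5, b = 2.5, c = 4)   # illustrative data

# 1. Loop over the elements: fine for side effects such as printing,
#    but there is no index available for storing results.
for (x in xs) print(x)

# 2. Loop over the names: pre-allocate and name the results first.
results <- vector("list", length(xs))
names(results) <- names(xs)
for (nm in names(xs)) {
  results[[nm]] <- xs[[nm]] * 2
}

# 3. Loop over the indices: the position gives both the name and the value.
for (i in seq_along(xs)) {
  cat(names(xs)[[i]], "=", xs[[i]], "\n")
}
```

The index form is the most general of the three, because the name and the value can both be recovered from the position.
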

    Unknown Output Length

    Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:

    means <- c(0, 1, 2)
    output <- double()
    for (i in seq_along(means)) {
      n <- sample(100, 1)
      output <- c(output, rnorm(n, means[[i]]))
    }
    str(output)
     num [1:207] -1.3229 2.1883 0.2413 0.0751 -1.9295 ...

    But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get “quadratic” (O(n^2)) behaviour, which means that a loop with three times as many elements would take nine times as long to run.

    A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:

    out <- vector("list", length(means))
    for (i in seq_along(means)) {
      n <- sample(100, 1)
      out[[i]] <- rnorm(n, means[[i]])
    }
    str(out)
    List of 3
     $ : num [1:50] -0.0931 0.0333 0.3342 -0.0619 1.4863 ...
     $ : num [1:100] 0.437 0.643 0.651 1.468 0.681 ...
     $ : num [1:94] 2.868 1.49 1.013 2.033 0.656 ...
    str(unlist(out))
     num [1:244] -0.0931 0.0333 0.3342 -0.0619 1.4863 ...
    Here I’ve used unlist() to flatten a list of vectors into a single vector. A stricter option is to use purrr::flatten_dbl() - it will throw an error if the input isn’t a list of doubles.

    This pattern occurs in other places too:

    You might be generating a long string. Instead of paste()ing together each iteration with the previous, save the output in a character vector and then combine that vector into a single string with paste(output, collapse = "").

    You might be generating a big data frame. Instead of sequentially rbind()ing in each iteration, save the output in a list, then use dplyr::bind_rows(output) to combine the output into a single data frame.

    Watch out for this pattern. Whenever you see it, switch to a more complex result object, and then combine in one step at the end.
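
The string and data frame cases can be sketched in the same "collect, then combine once" style. The names pieces and frames are illustrative, and do.call(rbind, ...) is used here only as a base R stand-in for dplyr::bind_rows() so that the sketch runs without extra packages.

```r
# Strings: collect the pieces in a character vector, then collapse once.
pieces <- vector("character", 3)
for (i in seq_along(pieces)) {
  pieces[[i]] <- paste0("step", i)
}
combined <- paste(pieces, collapse = "")
combined   # "step1step2step3"

# Data frames: collect them in a list, then bind once at the end.
frames <- vector("list", 3)
for (i in seq_along(frames)) {
  frames[[i]] <- data.frame(id = i, square = i^2)
}
all_rows <- do.call(rbind, frames)   # stand-in for dplyr::bind_rows(frames)
nrow(all_rows)   # 3
```
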

    Unknown Sequence Length

    Sometimes you don’t even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can’t do that sort of iteration with the for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body.

    A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as a for loop:

    for (i in seq_along(x)) {
      # body
    }
    # Equivalent to
    i <- 1
    while (i <= length(x)) {
      # body
      i <- i + 1 
    }

    Here’s how we could use a while loop to find how many tries it takes to get three heads in a row:

    flip <- function() sample(c("T", "H"), 1)
    flips <- 0
    nheads <- 0
    while (nheads < 3) {
      if (flip() == "H") {
        nheads <- nheads + 1
      } else {
        nheads <- 0
      }
      flips <- flips + 1
    }
    flips
    [1] 3

    I mention while loops only briefly, because I hardly ever use them. They’re most often used for simulation, which is outside the scope of this book. However, it is good to know they exist so that you’re prepared for problems where the number of iterations is not known in advance.

    Exercises

    Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.

    df <- vector("list", length(files))
    # loop over positions so the same index reads the path and stores the result
    for (i in seq_along(files)) {
      df[[i]] <- read_csv(files[[i]])
    }
    df <- bind_rows(df)

    What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?

    When there are no names for the vector, it does not run the code in the loop (it runs zero iterations of the loop):

    x <- 1:3
    print(names(x))
    NULL
    for (nm in names(x)) {
      print(nm)
      print(x[[nm]])
    }

    If only some elements are named, we get an error when we try to access an element without a name. Note that, oddly, nm == "" for an element with no name.

    x <- c(a = 1, 2, c = 3)
    names(x)
    [1] "a" ""  "c"
    for (nm in names(x)) {
      print(nm)
      print(x[[nm]])
    }
    [1] "a"
    [1] 1
    [1] ""
    Error in x[[nm]] : subscript out of bounds

    Finally, if there are duplicate names, then x[[nm]] will give the first element with that name. There is no way to access duplicately named elements by name.

    x <- c(a = 1, a = 2, c = 3)
    names(x)
    [1] "a" "a" "c"
    for (nm in names(x)) {
      print(nm)
      print(x[[nm]])
    }
    [1] "a"
    [1] 1
    [1] "a"
    [1] 1
    [1] "c"
    [1] 3
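
A sketch of one way around both problems is to loop over positions instead of names, since every element has a position even when its name is missing or duplicated. The vector x and the name seen below are illustrative.

```r
x <- c(a = 1, a = 2, 3)   # a duplicated name and an unnamed element

seen <- vector("double", length(x))
for (i in seq_along(x)) {
  nm <- names(x)[[i]]     # "" for the unnamed element
  cat("position", i, "name", shQuote(nm), "value", x[[i]], "\n")
  seen[[i]] <- x[[i]]
}
seen   # every element is reached exactly once
```
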

    Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:

    show_mean <- function(df, digits = 2) {
      # Get max length of any variable in the dataset
      maxstr <- max(stringr::str_length(names(df)))
      for (nm in names(df)) {
        if (is.numeric(df[[nm]])) {
          cat(stringr::str_c(stringr::str_pad(stringr::str_c(nm, ":"), maxstr + 1L, side = "right"),
                    format(mean(df[[nm]]), digits = digits, nsmall = digits),
                    sep = " "),
              "\n")
        }
      }
    }
    show_mean(iris)
    Sepal.Length: 5.84 
    Sepal.Width:  3.06 
    Petal.Length: 3.76 
    Petal.Width:  1.20 

    What does this code do? How does it work?

    trans <- list( 
      disp = function(x) x * 0.0163871,
      am = function(x) {
        factor(x, labels = c("auto", "manual"))
      }
    )
    for (var in names(trans)) {
      mtcars[[var]] <- trans[[var]](mtcars[[var]])
    }
    

    This code mutates the disp and am columns.

    disp is multiplied by 0.0163871, and am is replaced by a factor variable. The code works by looping over a named list of functions. It calls the named function in the list on the column of mtcars with the same name, and replaces the values of that column.

    For Loops vs functionals

    For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly.

    To see why this is important, consider (again) this simple data frame:

    df <- tibble(
      a = rnorm(10),
      b = rnorm(10),
      c = rnorm(10),
      d = rnorm(10)
    )

    Imagine you want to compute the mean of every column. You could do that with a for loop:

    output <- vector("double", length(df))
    for (i in seq_along(df)) {
      output[[i]] <- mean(df[[i]])
    }
    output
    [1] -0.2436047  0.6860735 -0.3421132  0.6683193

    You realise that you’re going to want to compute the means of every column pretty frequently, so you extract it out into a function:

    col_mean <- function(df) {
      output <- vector("double", length(df))
      for (i in seq_along(df)) {
        output[i] <- mean(df[[i]])
      }
      output
    }
    

    But then you think it’d also be helpful to be able to compute the median, and the standard deviation, so you copy and paste your col_mean() function and replace the mean() with median() and sd():

    col_median <- function(df) {
      output <- vector("double", length(df))
      for (i in seq_along(df)) {
        output[i] <- median(df[[i]])
      }
      output
    }
    col_sd <- function(df) {
      output <- vector("double", length(df))
      for (i in seq_along(df)) {
        output[i] <- sd(df[[i]])
      }
      output
    }

    Uh oh! You’ve copied-and-pasted this code twice, so it’s time to think about how to generalise it. Notice that most of this code is for-loop boilerplate and it’s hard to see the one thing (mean(), median(), sd()) that is different between the functions.

    We can generalise col_mean(), col_median() and col_sd() by adding an argument that supplies the function to apply to each column:

    col_summary <- function(df, fun) {
      out <- vector("double", length(df))
      for (i in seq_along(df)) {
        out[i] <- fun(df[[i]])
      }
      out
    }
    col_summary(df, median)
    [1] -0.1096523  0.5112549 -0.3405418  0.6707342
    col_summary(df, mean)
    [1] -0.2436047  0.6860735 -0.3421132  0.6683193

    Passing a function to another function is an extremely powerful idea, and it’s one of the behaviours that makes R a functional programming language. It might take you a while to wrap your head around the idea, but it’s worth the investment. In the rest of the chapter, you’ll learn about and use the purrr package, which provides functions that eliminate the need for many common for loops. The apply family of functions in base R (apply(), lapply(), tapply(), etc) solve a similar problem, but purrr is more consistent and thus is easier to learn.

    The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:

    How can you solve the problem for a single element of the list? Once you’ve solved that problem, purrr takes care of generalising your solution to every element in the list.

    If you’re solving a complex problem, how can you break it down into bite-sized pieces that allow you to advance one small step towards a solution? With purrr, you get lots of small pieces that you can compose together with the pipe.

    This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
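
As a minimal sketch of those two ideas, the example below first solves a problem for one element and then lets purrr apply it to every element, before composing two small steps with the pipe. The helper count_missing and the measurements list are hypothetical names introduced only for illustration.

```r
library(purrr)

# Step 1: solve the problem for a single element.
count_missing <- function(x) sum(is.na(x))
count_missing(c(1, NA, 3))   # 1

# Step 2: let purrr generalise that solution to every element.
measurements <- list(a = c(1, NA, 3), b = c(NA, NA), c = 1:4)  # illustrative data
n_missing <- map_int(measurements, count_missing)
n_missing

# Small pieces compose with the pipe: drop the NAs, then count what is left.
n_present <- measurements %>%
  map(~ .x[!is.na(.x)]) %>%
  map_int(length)
n_present
```
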

    Exercises

    Read the documentation for apply(). In the 2d case, what two for loops does it generalise?

    It generalises looping over the rows or columns of a matrix or data-frame.
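
The two loops can be written out explicitly. This sketch uses a small illustrative matrix m and sum() as the summary function, and checks each loop against the corresponding apply() call.

```r
m <- matrix(1:6, nrow = 2)   # illustrative 2 x 3 matrix

# Loop over rows: the explicit form of apply(m, 1, sum)
row_sums <- vector("double", nrow(m))
for (i in seq_len(nrow(m))) {
  row_sums[[i]] <- sum(m[i, ])
}

# Loop over columns: the explicit form of apply(m, 2, sum)
col_sums <- vector("double", ncol(m))
for (j in seq_len(ncol(m))) {
  col_sums[[j]] <- sum(m[, j])
}

all.equal(row_sums, as.numeric(apply(m, 1, sum)))   # TRUE
all.equal(col_sums, as.numeric(apply(m, 2, sum)))   # TRUE
```
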

    Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column.

    col_summary2 <- function(df, fun) {
      # test whether each column is numeric
      numeric_cols <- vector("logical", length(df))
      for (i in seq_along(df)) {
        numeric_cols[[i]] <- is.numeric(df[[i]])
      }
      # indexes of numeric columns
      idxs <- seq_along(df)[numeric_cols]
      # number of numeric columns
      n <- sum(numeric_cols)
      out <- vector("double", n)
      # j is the position in the output; idxs[[j]] is the matching column of df
      for (j in seq_along(idxs)) {
        out[[j]] <- fun(df[[idxs[[j]]]])
      }
      out
    }
    df <- tibble(
      a = rnorm(10),
      b = rnorm(10),
      c = letters[1:10],
      d = rnorm(10)
    )
    col_summary2(df, mean)
    [1] -0.02891776  0.52704384 -0.45001795

    The Map functions

    There is one function for each type of output:

  • map() makes a list.
  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.

    Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function.

    Once you master these functions, you’ll find it takes much less time to solve iteration problems. But you should never feel bad about using a for loop instead of a map function. The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work. The important thing is that you solve the problem that you’re working on, not write the most concise and elegant code (although that’s definitely something you want to strive towards!).

    Some people will tell you to avoid for loops because they are slow. They’re wrong! (Well at least they’re rather out of date, as for loops haven’t been slow for many years). The chief benefit of using functions like map() is not speed, but clarity: they make your code easier to write and to read.
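
A quick sketch of the suffixes described above, applied to one small illustrative list (the list x and the result names are assumptions made for this example):

```r
library(purrr)

x <- list(a = 1:3, b = c(2.5, 3.5), c = 10)   # illustrative input

as_list <- map(x, length)          # a list
is_num  <- map_lgl(x, is.numeric)  # a logical vector
lens    <- map_int(x, length)      # an integer vector
means   <- map_dbl(x, mean)        # a double vector
types   <- map_chr(x, typeof)      # a character vector

lens    # a = 3, b = 2, c = 1
means   # a = 2, b = 3, c = 10
types   # a = "integer", b = "double", c = "double"
```

Each result has the same length and names as x; only the type of the container changes with the suffix.
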

    We can use these functions to perform the same computations as the last for loop. Those summary functions returned doubles, so we need to use map_dbl():

    map_dbl(df, mean)
    argument is not numeric or logical: returning NA
              a           b           c           d 
    -0.02891776  0.52704384          NA -0.45001795 
    map_dbl(df, median)
    argument is not numeric or logical: returning NA
             a          b          c          d 
    -0.2058552  0.7948718         NA -0.3792446 
    map_dbl(df, sd)
    NAs introduced by coercion
           a        b        c        d 
    1.133854 1.388394       NA 0.785551 

    Compared to using a for loop, the focus is on the operation being performed (i.e. mean(), median(), sd()), not the bookkeeping required to loop over every element and store the output. This is even more apparent if we use the pipe:

    df %>% map_dbl(mean)
    argument is not numeric or logical: returning NA
              a           b           c           d 
    -0.02891776  0.52704384          NA -0.45001795 
    df %>% map_dbl(median)
    argument is not numeric or logical: returning NA
             a          b          c          d 
    -0.2058552  0.7948718         NA -0.3792446 
    df %>% map_dbl(sd)
    NAs introduced by coercion
           a        b        c        d 
    1.133854 1.388394       NA 0.785551 

    There are a few differences between map_*() and col_summary():

    All purrr functions are implemented in C. This makes them a little faster at the expense of readability.

    The second argument, .f, the function to apply, can be a formula, a character vector, or an integer vector. You’ll learn about those handy shortcuts in the next section.

    map_*() uses ... (“dot dot dot”) to pass along additional arguments to .f each time it’s called:

    map_dbl(df, mean, trim = 0.5)
    argument is not numeric or logical: returning NA
             a          b          c          d 
    -0.2058552  0.7948718         NA -0.3792446 

    The map functions also preserve names:

    z <- list(x = 1:3, y = 4:5)
    map_int(z, length)
    x y 
    3 2 

    There are a few shortcuts that you can use with .f in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits the mtcars dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:

    models <- mtcars %>% 
      split(.$cyl) %>% 
      map(function(df) lm(mpg ~ wt, data = df))
    models
    $`4`
    
    Call:
    lm(formula = mpg ~ wt, data = df)
    
    Coefficients:
    (Intercept)           wt  
         39.571       -5.647  
    
    
    $`6`
    
    Call:
    lm(formula = mpg ~ wt, data = df)
    
    Coefficients:
    (Intercept)           wt  
          28.41        -2.78  
    
    
    $`8`
    
    Call:
    lm(formula = mpg ~ wt, data = df)
    
    Coefficients:
    (Intercept)           wt  
         23.868       -2.192  
    The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula. Note: The lm() function runs a linear regression. It is covered in the Model Basics chapter.

    From r-bloggers.com:
    In FP, naming and applying a function are two separate operations: you don’t need to give a function a name in order to call it. So, calling this function

    powfun <- function(x, pow) {
        x^pow
    }
    powfun(2, 10)
    [1] 1024

    is, to the interpreter, exactly the same as applying the arguments to the anonymous function:

    # anonymous equivalent
    (function(x, pow) {
        x^pow
    })(2, 10)
    [1] 1024

    With the one-sided formula shortcut, the earlier model-fitting code becomes:

    models <- mtcars %>% 
      split(.$cyl) %>% 
      map(~lm(mpg ~ wt, data = .))
    models
    $`4`
    
    Call:
    lm(formula = mpg ~ wt, data = .)
    
    Coefficients:
    (Intercept)           wt  
         39.571       -5.647  
    
    
    $`6`
    
    Call:
    lm(formula = mpg ~ wt, data = .)
    
    Coefficients:
    (Intercept)           wt  
          28.41        -2.78  
    
    
    $`8`
    
    Call:
    lm(formula = mpg ~ wt, data = .)
    
    Coefficients:
    (Intercept)           wt  
         23.868       -2.192  
    Here I’ve used . as a pronoun: it refers to the current list element (in the same way that i referred to the current index in the for loop).

    When you’re looking at many models, you might want to extract a summary statistic like the R^2. To do that we need to first run summary() and then extract the component called r.squared. We could do that using the shorthand for anonymous functions:

    models %>% 
      map(summary) %>% 
      map_dbl(~.$r.squared)
            4         6         8 
    0.5086326 0.4645102 0.4229655 

    But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.

    models %>% 
      map(summary) %>% 
      map_dbl("r.squared")

    You can also use an integer to select elements by position:

    x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
    x %>% map_dbl(2)
    [1] 2 5 8

    Base R

    If you’re familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:

    lapply() is basically identical to map(), except that map() is consistent with all the other functions in purrr, and you can use the shortcuts for .f.

    Base sapply() is a wrapper around lapply() that automatically simplifies the output. This is useful for interactive work but is problematic in a function because you never know what sort of output you’ll get:

    x1 <- list(
      c(0.27, 0.37, 0.57, 0.91, 0.20),
      c(0.90, 0.94, 0.66, 0.63, 0.06), 
      c(0.21, 0.18, 0.69, 0.38, 0.77)
    )
    x2 <- list(
      c(0.50, 0.72, 0.99, 0.38, 0.78), 
      c(0.93, 0.21, 0.65, 0.13, 0.27), 
      c(0.39, 0.01, 0.38, 0.87, 0.34)
    )
    threshold <- function(x, cutoff = 0.8) x[x > cutoff]
    x1 %>% sapply(threshold) %>% str()
    List of 3
     $ : num 0.91
     $ : num [1:2] 0.9 0.94
     $ : num(0) 
    x2 %>% sapply(threshold) %>% str()
     num [1:3] 0.99 0.93 0.87
    vapply() is a safe alternative to sapply() because you supply an additional argument that defines the type. The only problem with vapply() is that it’s a lot of typing: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric). One advantage of vapply() over purrr’s map functions is that it can also produce matrices - the map functions only ever produce vectors.
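
    A short sketch of both points, using the built-in mtcars data: vapply() and map_lgl() should agree when each result is a single logical, and vapply() returns a matrix when each result has length two. The choice of range() as the length-two function is mine, purely for illustration.

    ```r
    library(purrr)

    # Same answer, different amounts of typing
    by_vapply <- vapply(mtcars, is.numeric, logical(1))
    by_map    <- map_lgl(mtcars, is.numeric)

    # range() returns two numbers per column, so vapply() builds a matrix
    # with one column per variable
    ranges <- vapply(mtcars, range, numeric(2))
    dim(ranges)
    ```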

    Exercises

    Write code that uses one of the map functions to:

    Compute the mean of every column in mtcars.

    map_dbl(mtcars, mean)
           mpg        cyl       disp         hp       drat         wt       qsec 
     20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
            vs         am       gear       carb 
      0.437500   0.406250   3.687500   2.812500 

    Determine the type of each column in nycflights13::flights.

    map(nycflights13::flights, class)
    $year
    [1] "integer"
    
    $month
    [1] "integer"
    
    $day
    [1] "integer"
    
    $dep_time
    [1] "integer"
    
    $sched_dep_time
    [1] "integer"
    
    $dep_delay
    [1] "numeric"
    
    $arr_time
    [1] "integer"
    
    $sched_arr_time
    [1] "integer"
    
    $arr_delay
    [1] "numeric"
    
    $carrier
    [1] "character"
    
    $flight
    [1] "integer"
    
    $tailnum
    [1] "character"
    
    $origin
    [1] "character"
    
    $dest
    [1] "character"
    
    $air_time
    [1] "numeric"
    
    $distance
    [1] "numeric"
    
    $hour
    [1] "numeric"
    
    $minute
    [1] "numeric"
    
    $time_hour
    [1] "POSIXct" "POSIXt" 

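    I had to use map() rather than map_chr() since class() can return more than one value for a column: time_hour has two classes, "POSIXct" and "POSIXt". If "type" is taken to mean typeof() instead, map_chr() would work, because typeof() always returns a single string.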
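
    If "type" is read as typeof() rather than class(), a character-returning map works, because typeof() gives exactly one string per column. A minimal sketch, using the built-in iris data instead of flights so that it runs without nycflights13:

    ```r
    library(purrr)

    # typeof() reports the underlying storage type, one string per column
    col_types <- map_chr(iris, typeof)
    col_types
    ```

    Note that the factor column Species is stored as integer codes, so typeof() and class() can disagree for the same column.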

    Compute the number of unique values in each column of iris.

    map_int(iris, ~ length(unique(.)))
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
              35           23           43           22            3 

    Generate 10 random normals for each of \(\mu = -10\), \(0\), \(10\), and \(100\).

    map(c(-10, 0, 10, 100), rnorm, n = 10)
    [[1]]
     [1] -12.319979  -9.515587  -7.726388  -9.017238 -10.437171  -8.877139  -9.526525
     [8]  -9.290557  -9.901963  -9.491832
    
    [[2]]
     [1] -0.2397833  1.3283208  0.7593858  1.1392236 -1.1767585 -0.3524574 -0.7176960
     [8]  0.2637527  0.3423694  0.9854774
    
    [[3]]
     [1]  9.464330 10.941647  9.479644  9.312160  8.755509 11.193252  7.601449
     [8]  8.920083  8.864263 10.266669
    
    [[4]]
     [1]  99.48009  99.73046 100.14210  99.92110 100.07910 100.11577 101.17778
     [8] 100.83154  99.90260  99.41518

    How can you create a single vector that for each column in a data frame indicates whether or not it’s a factor?

    Use map_lgl(), which applies a function to each element of a vector and returns a logical vector, with the function is.factor():

    map_lgl(mtcars, is.factor)
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 

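    What happens when you use the map functions on vectors that aren’t lists? What does map(1:5, runif) do? Why?

    map() applies the function to each element of the vector, whether or not the vector is a list. Here it calls runif(1), runif(2), and so on up to runif(5), so the result is a list of five numeric vectors with lengths 1 to 5: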

    map(1:5, runif)
    [[1]]
    [1] 0.9602394
    
    [[2]]
    [1] 0.9062845 0.3908404
    
    [[3]]
    [1] 0.7121115 0.6018702 0.2382705
    
    [[4]]
    [1] 0.4452197 0.4119878 0.5517169 0.9214921
    
    [[5]]
    [1] 0.7207585 0.2955939 0.5625326 0.9537604 0.7442990

    What does map(-2:2, rnorm, n = 5) do? Why?

    map(-2:2, rnorm, n = 5)
    [[1]]
    [1] -1.893329 -2.439700 -1.148182 -3.201482 -1.696993
    
    [[2]]
    [1]  0.0234227 -1.2800125 -0.5399861 -0.4143731 -1.0108234
    
    [[3]]
    [1] 1.0361046 1.6727639 1.8509029 0.5924434 0.3846546
    
    [[4]]
    [1] 0.3298516 0.9574716 0.9820513 2.0092424 1.0857084
    
    [[5]]
    [1] 0.7368190 4.1516435 2.0429288 3.5349912 0.9127184
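    This draws samples of size n = 5 from normal distributions with means -2, -1, 0, 1, and 2, and returns a list in which each element is a numeric vector of length 5. Because n is supplied by name, each element of -2:2 is matched to the next argument of rnorm(), which is mean.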

    What does map_dbl(-2:2, rnorm, n = 5) do? Why?

    map_dbl(-2:2, rnorm, n = 5)
    Error: Result 1 is not a length 1 atomic vector

    However, if we use map_dbl() it throws an error: map_dbl() expects the function to return a numeric vector of length one, but here each call returns five values. If we wanted a single numeric vector, we could use map() followed by flatten_dbl():

    flatten_dbl(map(-2:2, rnorm, n = 5))
     [1] -2.7346927 -3.0902229 -2.3386143 -3.1495769 -1.9779044 -2.4560725 -2.3933565
     [8] -0.8794021 -1.8144065 -1.3836109 -2.0641389 -1.2068642 -1.1286256 -1.2974258
    [15]  3.4572218  3.1948030  0.8738478  0.3903348  0.7967609  3.7033513  1.7926023
    [22]  1.4496549  0.7881413  1.4840187  2.4098862
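
    By contrast, map_dbl() works when each call returns a single number. As a sketch, summarising each sample with mean() (my choice of summary, purely for illustration) gives one value per input:

    ```r
    library(purrr)

    set.seed(1)  # arbitrary seed so the sketch is repeatable

    # Each call draws 5 values and collapses them to one number
    sample_means <- map_dbl(-2:2, ~ mean(rnorm(5, mean = .x)))
    length(sample_means)
    ```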

    Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.

    map(list(mtcars), ~ lm(mpg ~ wt, data = .))
    [[1]]
    
    Call:
    lm(formula = mpg ~ wt, data = .)
    
    Coefficients:
    (Intercept)           wt  
         37.285       -5.344  
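
    Strictly speaking, the formula shortcut is still an anonymous function in disguise. Another reading of the exercise, sketched here on the assumption that x is a list of data frames such as mtcars split by cyl, is to pass lm itself and supply the model formula as a named extra argument. Because formula is matched by name, each data frame lands in the next argument of lm(), which is data.

    ```r
    library(purrr)

    # Assumed input: a list of data frames (the exercise does not define x)
    x <- split(mtcars, mtcars$cyl)

    # No anonymous function: arguments after .f are passed on to lm()
    fits <- map(x, lm, formula = mpg ~ wt)

    # For comparison, the formula-shortcut version
    fits_formula <- map(x, ~ lm(mpg ~ wt, data = .x))

    all.equal(coef(fits[["4"]]), coef(fits_formula[["4"]]))
    ```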

    Dealing with Failure

    When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail. When this happens, you’ll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn’t ruin the whole barrel?

    In this section you’ll learn how to deal with this situation with a new function: safely(). safely() is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:

    result is the original result. If there was an error, this will be NULL.

    error is an error object. If the operation was successful, this will be NULL.

    Let’s illustrate this with a simple example: log():

    safe_log <- safely(log)
    str(safe_log(10))
    List of 2
     $ result: num 2.3
     $ error : NULL
    str(safe_log("a"))
    List of 2
     $ result: NULL
     $ error :List of 2
      ..$ message: chr "non-numeric argument to mathematical function"
      ..$ call   : language .f(...)
      ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
    When the function succeeds, the result element contains the result and the error element is NULL. When the function fails, the result element is NULL and the error element contains an error object.
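
    To get a feel for what an adverb like safely() does, here is a rough base R sketch built on tryCatch(). The name my_safely is hypothetical and this is not purrr's actual implementation (which, for instance, also takes an otherwise argument); it only mimics the result/error shape described above.

    ```r
    # Hypothetical helper: wrap f so that errors are returned, not thrown
    my_safely <- function(f) {
      function(...) {
        tryCatch(
          list(result = f(...), error = NULL),
          error = function(e) list(result = NULL, error = e)
        )
      }
    }

    safe_log2 <- my_safely(log)
    ok  <- safe_log2(10)   # result filled in, error is NULL
    bad <- safe_log2("a")  # result is NULL, error holds the condition
    ```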

    safely() is designed to work with map:

    x <- list(1, 10, "a")
    y <- x %>% map(safely(log))
    str(y)
    List of 3
     $ :List of 2
      ..$ result: num 0
      ..$ error : NULL
     $ :List of 2
      ..$ result: num 2.3
      ..$ error : NULL
     $ :List of 2
      ..$ result: NULL
      ..$ error :List of 2
      .. ..$ message: chr "non-numeric argument to mathematical function"
      .. ..$ call   : language .f(...)
      .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

    This would be easier to work with if we had two lists: one of all the errors and one of all the output. That’s easy to get with purrr::transpose():

    y <- y %>% transpose()
    str(y)
    List of 2
     $ result:List of 3
      ..$ : num 0
      ..$ : num 2.3
      ..$ : NULL
     $ error :List of 3
      ..$ : NULL
      ..$ : NULL
      ..$ :List of 2
      .. ..$ message: chr "non-numeric argument to mathematical function"
      .. ..$ call   : language .f(...)
      .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

    It’s up to you how to deal with the errors, but typically you’ll either look at the values of x where y is an error, or work with the values of y that are ok:

    is_ok <- y$error %>% map_lgl(is_null)
    x[!is_ok]
    [[1]]
    [1] "a"
    y$result[is_ok] %>% flatten_dbl()
    [1] 0.000000 2.302585
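
    If you also want to know why the failures happened, one option is to pull the message out of each error object. This is a sketch of my own that rebuilds x and y so it runs by itself, and it relies on base R's conditionMessage():

    ```r
    library(purrr)

    x <- list(1, 10, "a")
    y <- transpose(map(x, safely(log)))

    # TRUE for the inputs whose error slot is not NULL
    failed <- !map_lgl(y$error, is.null)

    # One message per failed input
    messages <- map_chr(y$error[failed], conditionMessage)
    messages
    ```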

    Purrr provides two other useful adverbs:

    Like safely(), possibly() always succeeds. It’s simpler than safely(), because you give it a default value to return when there is an error.

    x <- list(1, 10, "a")
    x %>% map_dbl(possibly(log, NA_real_))
    [1] 0.000000 2.302585       NA

    quietly() performs a similar role to safely(), but instead of capturing errors, it captures printed output, messages, and warnings:

    x <- list(1, -1)
    x %>% map(quietly(log)) %>% str()
    List of 2
     $ :List of 4
      ..$ result  : num 0
      ..$ output  : chr ""
      ..$ warnings: chr(0) 
      ..$ messages: chr(0) 
     $ :List of 4
      ..$ result  : num NaN
      ..$ output  : chr ""
      ..$ warnings: chr "NaNs produced"
      ..$ messages: chr(0) 
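
    Once the warnings are captured, you can ask which inputs produced one. A minimal sketch, where warned is just an illustrative name:

    ```r
    library(purrr)

    x <- list(1, -1)
    out <- map(x, quietly(log))

    # TRUE where at least one warning was captured for that input
    warned <- map_lgl(out, ~ length(.x$warnings) > 0)
    warned
    ```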

    Mapping over multiple arguments

    So far we’ve mapped along a single input. But often you have multiple related inputs that you need to iterate along in parallel. That’s the job of the map2() and pmap() functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with map():

    mu <- list(5, 10, -3)
    mu %>% 
      map(rnorm, n = 5) %>% 
      str()
    List of 3
     $ : num [1:5] 5.79 5.33 4.87 6.87 5.43
     $ : num [1:5] 11.74 8.58 9.09 10.38 9.83
     $ : num [1:5] -2.44 -3.47 -3.42 -3.77 -2.71

    What if you also want to vary the standard deviation? One way to do that would be to iterate over the indices and index into vectors of means and sds:

    sigma <- list(1, 5, 10)
    seq_along(mu) %>% 
      map(~rnorm(5, mu[[.]], sigma[[.]])) %>% 
      str()
    List of 3
     $ : num [1:5] 4.34 6.5 4.39 5.3 4.07
     $ : num [1:5] 11.85 0.901 13.068 16.143 7.794
     $ : num [1:5] -11.01 -5.97 -4.67 -15.62 -14.86

    But that obfuscates the intent of the code. Instead we could use map2() which iterates over two vectors in parallel:

    map2(mu, sigma, rnorm, n = 5) %>% str()
    List of 3
     $ : num [1:5] 5.18 3.86 3.88 5.8 5.15
     $ : num [1:5] 11.5 1.1 8.3 10.5 15.7
     $ : num [1:5] -2.15 1.42 -8.27 -7.34 -11.78

    map2() generates this series of function calls: rnorm(5, 1, n = 5), rnorm(10, 5, n = 5), and rnorm(-3, 10, n = 5).

    Note that in the call to map2(), the arguments that vary for each call come before the function; arguments that are the same for every call come after. Like map(), map2() is just a wrapper around a for loop:

    map2 <- function(x, y, f, ...) {
      out <- vector("list", length(x))
      for (i in seq_along(x)) {
        out[[i]] <- f(x[[i]], y[[i]], ...)
      }
      out
    }
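
    As a quick check of that claim, the sketch below compares the simplified loop with purrr's own map2() under the same random seed. I have renamed the loop version simple_map2 (a hypothetical name) so that it does not mask purrr::map2(); the seed value is arbitrary.

    ```r
    library(purrr)

    # The simplified wrapper from the text, under a different name
    simple_map2 <- function(x, y, f, ...) {
      out <- vector("list", length(x))
      for (i in seq_along(x)) {
        out[[i]] <- f(x[[i]], y[[i]], ...)
      }
      out
    }

    mu <- list(5, 10, -3)
    sigma <- list(1, 5, 10)

    set.seed(42)
    by_loop <- simple_map2(mu, sigma, rnorm, n = 5)

    set.seed(42)
    by_purrr <- map2(mu, sigma, rnorm, n = 5)

    all.equal(by_loop, by_purrr)
    ```

    The real map2() does more than this (input checking and the .f shortcuts, for example), so the loop is best read as a mental model rather than a replacement.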

    You could also imagine map3(), map4(), map5(), map6(), etc., but that would get tedious quickly. Instead, purrr provides pmap(), which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:

    n <- list(1, 3, 5)
    args1 <- list(n, mu, sigma)
    args1 %>%
      pmap(rnorm) %>% 
      str()
    List of 3
     $ : num 5.47
     $ : num [1:3] 14.45 1.39 5.61
     $ : num [1:5] -9.07 -17.17 -1.43 9.34 -3.74

    That looks like rnorm(1, 5, 1), rnorm(3, 10, 5), and rnorm(5, -3, 10).

    If you don’t name the elements of the list, pmap() will use positional matching when calling the function. That’s a little fragile, and makes the code harder to read, so it’s better to name the arguments:

    args2 <- list(mean = mu, sd = sigma, n = n)
    args2 %>% 
      pmap(rnorm) %>% 
      str()
    List of 3
     $ : num 4.36
     $ : num [1:3] 19.1 3.25 14
     $ : num [1:5] 1.57 9.64 24.07 3.58 -5.16

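    That generates longer, but safer, calls: rnorm(mean = 5, sd = 1, n = 1), rnorm(mean = 10, sd = 5, n = 3), and rnorm(mean = -3, sd = 10, n = 5).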

    Since the arguments are all the same length, it makes sense to store them in a data frame:

    params <- tribble(
      ~mean, ~sd, ~n,
        5,     1,  1,
       10,     5,  3,
       -3,    10,  5
    )
    params %>% 
      pmap(rnorm)
    [[1]]
    [1] 7.035716
    
    [[2]]
    [1]  1.485603  6.994151 15.032192
    
    [[3]]
    [1] -14.558272 -13.772941 -18.629398   2.936900  -9.220714
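
    To see which calls pmap() builds from a data frame, one option is to have the function return a description of its arguments instead of random numbers. This is only a sketch: the describing function and the use of sprintf() are my own illustration, and I use data.frame() here so the block needs nothing beyond purrr.

    ```r
    library(purrr)

    params <- data.frame(
      mean = c(5, 10, -3),
      sd   = c(1, 5, 10),
      n    = c(1, 3, 5)
    )

    # Columns are matched to arguments by name, one row per call
    calls <- pmap_chr(params, function(mean, sd, n) {
      sprintf("rnorm(mean = %s, sd = %s, n = %s)", mean, sd, n)
    })
    calls

    # The real thing: sample sizes follow the n column
    draws <- pmap(params, rnorm)
    lengths(draws)
    ```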

    Invoking Different Functions

    There’s one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:

    f <- c("runif", "rnorm", "rpois")
    param <- list(
      list(min = -1, max = 1), 
      list(sd = 5), 
      list(lambda = 10)
    )

    To handle this case, you can use invoke_map():

    invoke_map(f, param, n = 5) %>% str()
    List of 3
     $ : num [1:5] -0.0417 -0.7741 -0.5943 -0.3098 0.7438
     $ : num [1:5] -2.36 -6.97 -9.15 7.47 -12.81
     $ : int [1:5] 14 15 10 4 7

    The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
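
    As far as I know, invoke_map() is deprecated from purrr 1.0.0 onward, so it may warn on a current installation. The same idea can be sketched with map2() and base R's do.call(), which accepts a function name given as a string; the variable name draws is just illustrative.

    ```r
    library(purrr)

    f <- c("runif", "rnorm", "rpois")
    param <- list(
      list(min = -1, max = 1),
      list(sd = 5),
      list(lambda = 10)
    )

    # Pair each function name with its own arguments, then add the shared n = 5
    draws <- map2(f, param, function(fn, args) {
      do.call(fn, c(args, list(n = 5)))
    })
    str(draws)
    ```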

    And again, you can use tribble() to make creating these matching pairs a little easier:

    