Disclaimer: The contents of this document come from Chapter 17, “Iteration with purrr,” of R for Data Science (Wickham & Grolemund, 2017). This document was prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech, in Spring 2019. For any questions, contact the instructor, Yongsung Lee, Ph.D., via yongsung.lee(at)gatech.edu.
This document is also published on RPubs.
install.packages("tidyverse", repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(tidyverse)

1. Intro

What we do:

  1. Iterate over multiple inputs
    1. Learn for loops
    2. Learn purrr::map functions

Why we do it:

  1. Reduce repetition and redundancy in your code.

2. For loops

The example below shows the three basic components of a for loop: output, sequence, and body.

df <- tibble(
  a = rnorm(10), 
  b = rnorm(10), 
  c = rnorm(10), 
  d = rnorm(10)
)

median(df$a)
median(df$b)
median(df$c)
median(df$d)

output <- vector("double", ncol(df))    # output (arguments: data type + # of elements) 
for (i in seq_along(df)){               # sequence
  output[[i]] <- median(df[[i]])        # body 
}
output
  1. output: Create an empty vector with vector("double", ncol(df)) to store the individual results. The first argument sets the data type and the second sets the number of elements:
vector("double", ncol(df)) 
vector("integer", ncol(df)) 
vector("logical", ncol(df)) 
vector("character", ncol(df)) 
  2. sequence: Usually i in seq_along(df)

  3. body: this is run repeatedly, each time with a different value of i
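A brief sketch of why seq_along() is usually the safer choice for the sequence: it agrees with 1:length(x) for non-empty vectors, but the two differ when the input has length zero. The name `empty` is made up for this illustration.

```r
# `empty` is a hypothetical zero-length input used only for this illustration
empty <- vector("double", 0)

seq_along(empty)   # an empty integer vector, so a loop over it never runs
1:length(empty)    # counts down from 1 to 0, so a loop over it runs twice

# Count how many times the body runs under each sequence
n_safe <- 0
for (i in seq_along(empty)) n_safe <- n_safe + 1

n_unsafe <- 0
for (i in 1:length(empty)) n_unsafe <- n_unsafe + 1

c(seq_along = n_safe, colon = n_unsafe)
```

With 1:length(x), the body would run with indices 1 and 0, neither of which points at a real element.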

Exercises

No1. Write for loops to:

  1. Compute the mean of every column in mtcars.
  2. Determine the type of each column in nycflights13::flights.
  3. Compute the number of unique values in each column of iris.
  4. Generate 10 random normals for each of μ = −10, 0, 10, and 100.

Think about the output, sequence, and body before you start writing the loop.

output <- vector("double", length(mtcars))
for (i in seq_along(mtcars)){
  output[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
output 

library(nycflights13)
output <- vector("character", length(flights))
for (i in seq_along(flights)){
  output[[i]] <- typeof(flights[[i]])
}
output 

output <- vector("integer", length(iris))
for (i in seq_along(iris)){
  output[[i]] <- n_distinct(iris[[i]])
}
output

means <- c(-10, 0, 10, 100)
output <- vector("list", length(means))
for (i in seq_along(means)){
  output[[i]] <- rnorm(n = 10, mean = means[[i]])
}
output 

No4. It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:

output <- vector("integer", 0)
for (i in seq_along(x)) {
  output <- c(output, lengths(x[[i]]))
}
output

How does this affect performance? Design and execute an experiment.
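One possible experiment, as a rough sketch rather than a definitive benchmark: wrap each strategy in a function, confirm that both return the same result, and time them with system.time(). The input `x`, the function names grow() and prealloc(), and the simplified body (storing length(x[[i]]), one integer per element) are assumptions made for this illustration.

```r
# Hypothetical test input: a list of 10,000 short integer vectors
x <- rep(list(1:3), 10000)

# Strategy 1: grow the output vector at every step
grow <- function(x) {
  output <- vector("integer", 0)
  for (i in seq_along(x)) {
    output <- c(output, length(x[[i]]))   # copies the whole vector each time
  }
  output
}

# Strategy 2: preallocate the output, then fill it in
prealloc <- function(x) {
  output <- vector("integer", length(x))
  for (i in seq_along(x)) {
    output[[i]] <- length(x[[i]])
  }
  output
}

# Both strategies should give the same answer
identical(grow(x), prealloc(x))

# Elapsed times depend on the machine, so no expected numbers are shown
system.time(grow(x))
system.time(prealloc(x))
```

Repeating the timing with longer inputs should show how the gap changes as the number of iterations grows.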

3. For Loop Variations

3-1. Modifying an Existing Object
df <- tibble(
  a = rnorm(10), 
  b = rnorm(10), 
  c = rnorm(10), 
  d = rnorm(10)
)

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1])/(rng[2] - rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
                                # output: not needed, because df is modified in place 
for (i in seq_along(df)){       # sequence
  df[[i]] <- rescale01(df[[i]]) # body 
}
3-2. Looping Patterns
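  1. Loop over the numeric indices: for (i in seq_along(xs)), extracting each value with xs[[i]]
  2. Loop over the elements: for (x in xs)
  3. Loop over the names: for (nm in names(xs)), extracting each value with xs[[nm]]

Looping over the numeric indices is the most general form, because the position gives access to both the name and the value:

results <- vector("list", length(xs))
names(results) <- names(xs)

for (i in seq_along(xs)){
  name <- names(xs)[[i]]   # name of the i-th element
  value <- xs[[i]]         # value of the i-th element
}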
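The three looping patterns can be compared side by side on a small input. This is a minimal sketch; the named vector `xs` and its values are made up for illustration.

```r
# `xs` is a hypothetical named vector used only for this illustration
xs <- c(a = 1.5, b = 2.5, c = 4)

# 1. Loop over numeric indices: position, name, and value are all available
for (i in seq_along(xs)) {
  cat(i, names(xs)[[i]], xs[[i]], "\n")
}

# 2. Loop over elements: simplest, but position and name are not available
total <- 0
for (x in xs) {
  total <- total + x
}
total

# 3. Loop over names: convenient for labels, with the value looked up by name
for (nm in names(xs)) {
  cat(nm, "=", xs[[nm]], "\n")
}
```

Pattern 2 is hard to combine with saving output efficiently, since the loop does not know which position it is at.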
3-3. Unknown Output Length

A less efficient approach

means <- c(0, 1, 2)
output <- double()

for (i in seq_along(means)){
  n <- sample(100, 1)                       # one integer draw from [1, 100]
  output <- c(output, rnorm(n, means[[i]])) # at each iteration, n random numbers with the mean of means[[i]]
}
str(output)

A more efficient approach

means <- c(0, 1, 2)
out <- vector("list", length(means))        # create an empty list 

for (i in seq_along(means)){
  n <- sample(100, 1) 
  out[[i]] <- rnorm(n, means[[i]])          # fill each element of the list 
}
str(out)                                    # still a list 
str(unlist(out))                            # unlist  

str(flatten_dbl(out))                       # stricter 

# The same idea applies to data frames:
# rbind() inside a loop row-binds one data frame after another, copying the result each time 
# dplyr::bind_rows(out) combines a list of data frames into a single data frame in one step 
Exercises

No1. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.
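One way to approach this exercise, sketched under assumptions: so that the code runs anywhere, it first writes two small CSV files to a temporary directory (the file names and contents are made up), then follows the list-then-combine pattern from Section 3-3. In the exercise itself, `files` would come from dir("data/", ...).

```r
library(readr)
library(dplyr)

# Assumption: create two small CSV files in a temporary directory for the demo
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
write_csv(data.frame(id = 1:2, value = c(0.1, 0.2)), file.path(tmp, "part1.csv"))
write_csv(data.frame(id = 3:4, value = c(0.3, 0.4)), file.path(tmp, "part2.csv"))

files <- dir(tmp, pattern = "\\.csv$", full.names = TRUE)

# Preallocate a list, read each file into one element, then combine once
out <- vector("list", length(files))
for (i in seq_along(files)) {
  out[[i]] <- read_csv(files[[i]])
}
combined <- bind_rows(out)
combined
```

Storing each data frame in a list and calling bind_rows() once avoids growing a data frame inside the loop.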

No3. Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:

show_mean(iris)
#> Sepal.Length: 5.84
#> Sepal.Width:  3.06
#> Petal.Length: 3.76
#> Petal.Width:  1.20

(Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?)
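A possible show_mean(), offered as a sketch rather than the intended answer: it loops over the names of the numeric columns and pads each label to a common width with format(), which is one option for the alignment (stringr::str_pad() would be another). The helper variable names are made up.

```r
show_mean <- function(df) {
  # Names of the numeric columns only
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]

  # Width of the longest label, counting the trailing colon
  width <- max(nchar(num_cols)) + 1

  for (nm in num_cols) {
    # format() left-justifies the label and pads it with spaces to `width`
    label <- format(paste0(nm, ":"), width = width)
    cat(label, " ", sprintf("%.2f", mean(df[[nm]])), "\n", sep = "")
  }
}

show_mean(iris)
```

This sketch assumes the data frame has at least one numeric column; it does not handle the case where there are none.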

4. For Loops Versus Functionals

“A higher-order function is a function that takes a function as an input or returns a function as output. … [A] functional, a function that takes a function as an input and returns a vector as output.”

df <- tibble(
  a = rnorm(10), 
  b = rnorm(10), 
  c = rnorm(10), 
  d = rnorm(10)
)

output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]])
}
output

# define a function that has a for loop in its body 
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- mean(df[[i]])
  }
  output
}

# define a similar function for median 
col_median <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- median(df[[i]])
  }
  output
}

# define a similar function for standard deviation 
col_sd <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- sd(df[[i]])
  }
  output
}

Here, we need one more step up in generalization. Let’s start with a simple example:

# generalize the three functions below, which differ only in the exponent 
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3

# use an index i inside a function (as the second argument)
f <- function(x, i) abs(x - mean(x)) ^ i
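A quick check, on a made-up vector `v`, that the single function f() reproduces each of the three original functions once the exponent is passed as an argument:

```r
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
f  <- function(x, i) abs(x - mean(x)) ^ i

# `v` is a hypothetical input used only for this check
v <- c(1, 2, 6)

all.equal(f(v, 1), f1(v))
all.equal(f(v, 2), f2(v))
all.equal(f(v, 3), f3(v))
```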

Then, come back to our problem.

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

col_summary(df, median)
col_summary(df, mean)

Note that fun is an argument of col_summary(), and it can be any summary function, such as median or mean. That is, col_summary() is a function that takes another function as an input.

Exercises

No2. Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column.
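A minimal sketch of one possible answer. The helper is named is_numeric_col() here (a hypothetical name, standing in for the is_numeric() suggested by the exercise), and col_summary_num() is likewise a made-up name so that the original col_summary() stays untouched.

```r
# Hypothetical helper: TRUE for each numeric column of a data frame
is_numeric_col <- function(df) {
  vapply(df, is.numeric, logical(1))
}

# Variant of col_summary() that skips non-numeric columns
col_summary_num <- function(df, fun) {
  df_num <- df[is_numeric_col(df)]        # keep numeric columns only
  out <- vector("double", length(df_num))
  names(out) <- names(df_num)             # label results by column name
  for (i in seq_along(df_num)) {
    out[[i]] <- fun(df_num[[i]])
  }
  out
}

col_summary_num(iris, mean)     # Species (a factor) is skipped
col_summary_num(iris, median)
```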

5. The map Functions

Each map function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input.

map_dbl(df, mean)
map_dbl(df, median)
map_dbl(df, sd)

# with pipes 
df %>% map_dbl(mean)
df %>% map_dbl(median)
df %>% map_dbl(sd)
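The suffix of each map function determines the type of the output. The sketch below uses the built-in mtcars and iris data sets; the object names are made up for illustration.

```r
library(purrr)

as_list <- map(mtcars[1:3], mean)                        # map() always returns a list
as_dbl  <- map_dbl(mtcars[1:3], mean)                    # named double vector
as_lgl  <- map_lgl(iris, is.numeric)                     # named logical vector
as_int  <- map_int(iris, function(x) length(unique(x)))  # named integer vector
as_chr  <- map_chr(iris, class)                          # named character vector

# The typed variants are strict: asking for doubles when the function
# returns character strings is an error rather than a silent conversion
res <- tryCatch(map_dbl(iris, class), error = function(e) e)
inherits(res, "error")
```

Because the typed variants fail loudly on a type mismatch, they are a safer default than map() followed by unlist() when the expected type is known.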
Shortcuts
models <- mtcars %>%
  split(.$cyl) %>%                           # output - a list of three data frames by cylinder type  
  map(function(df) lm(mpg ~ wt, data = df))  # an anonymous function, instead of passing extra arguments 
models

models <- mtcars %>% 
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .))              # . refers to the current list element  
models

models %>% 
  map(summary) %>%
  str()

models %>% 
  map(summary) %>% 
  map_dbl(~ .$r.squared)

models %>% 
  map(summary) %>% 
  map_dbl("r.squared")                       # a special shortcut that calls a named component 
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)                             # extract the second element from each list 
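The position and name shortcuts can be read as shorthand for an explicit extractor function. A small sketch of the equivalence; the list `people` is made up for illustration.

```r
library(purrr)

x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))

# An integer extracts by position ...
by_position <- map_dbl(x, 2)
# ... which matches writing the extractor out by hand
by_function <- map_dbl(x, function(el) el[[2]])
identical(by_position, by_function)

# `people` is a hypothetical list of named lists
people <- list(list(name = "a", age = 30), list(name = "b", age = 41))

# A string extracts by name
by_name  <- map_dbl(people, "age")
by_hand  <- map_dbl(people, function(p) p[["age"]])
identical(by_name, by_hand)
```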
Exercises

No1. Write code that uses one of the map functions to:

  1. Compute the mean of every column in mtcars.
  2. Determine the type of each column in nycflights13::flights.
  3. Compute the number of unique values in each column of iris.
  4. Generate 10 random normals for each of μ = −10, 0, 10, and 100.
map_dbl(mtcars, mean)
map_chr(flights, typeof)
map_int(iris, n_distinct)
map(c(-10, 0, 10, 100), ~ rnorm(n = 10, mean = .))