purrr
These notes cover iteration with purrr, following R for Data Science (Wickham & Grolemund, 2017). They were prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any questions, contact the instructor, Yongsung Lee, Ph.D., via yongsung.lee(at)gatech.edu.
install.packages("tidyverse", repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(tidyverse)
What we do:
for loops
purrr::map functions
Why we do:
For loops
See the three basic components from an example.
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
median(df$a)
median(df$b)
median(df$c)
median(df$d)
output <- vector("double", ncol(df))  # output (arguments: data type + # of elements)
for (i in seq_along(df)) {            # sequence
  output[[i]] <- median(df[[i]])      # body
}
output
output: vector("double", ncol(df)), which will store individual outputs. Choose the type that matches what the body returns:
vector("double", ncol(df))
vector("integer", ncol(df))
vector("logical", ncol(df))
vector("character", ncol(df))
sequence: usually i in seq_along(df)
body: the code that is run repeatedly, each time with a different value of i
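One reason the sequence is usually written with seq_along() rather than 1:length() is how the two behave on a zero-length vector. A minimal sketch in base R; y is a hypothetical empty vector and is not part of the example above:

```r
# y is a hypothetical zero-length vector used only for illustration
y <- vector("double", 0)

safe_seq  <- seq_along(y)   # integer(0): a loop over this runs zero times
risky_seq <- 1:length(y)    # 1:0 counts down, so a loop over this runs twice

iterations <- 0
for (i in safe_seq) {
  iterations <- iterations + 1
}
iterations  # stays at 0 because the loop body never runs
```

With risky_seq the body would run for i = 1 and i = 0, which usually leads to an error or a silent bug.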
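No1. Write for loops to:
Compute the mean of every column in mtcars.
Determine the type of each column in nycflights13::flights.
Compute the number of unique values in each column of iris.
Generate 10 random normals for each of the means -10, 0, 10, and 100.
Think about the output, sequence, and body before you start writing the loop.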
# 1. mean of every column in mtcars
output <- vector("double", length(mtcars))
for (i in seq_along(mtcars)) {
  output[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
output
# 2. type of each column in nycflights13::flights
library(nycflights13)
output <- vector("character", length(flights))
for (i in seq_along(flights)) {
  output[[i]] <- typeof(flights[[i]])
}
output
# 3. number of unique values in each column of iris
output <- vector("integer", length(iris))
for (i in seq_along(iris)) {
  output[[i]] <- n_distinct(iris[[i]])
}
output
# 4. 10 random normals for each mean
means <- c(-10, 0, 10, 100)
output <- vector("list", length(means))
for (i in seq_along(means)) {
  output[[i]] <- rnorm(n = 10, mean = means[[i]])
}
output
No4. It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:
# x is assumed to be an existing list
output <- vector("integer", 0)
for (i in seq_along(x)) {
  output <- c(output, lengths(x[[i]]))
}
output
How does this affect performance? Design and execute an experiment. Interesting experiments
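One way to set up such an experiment is to time the two strategies on the same task with system.time(). This is only a sketch: the function names grow_vector() and preallocate_vector(), the task (squaring the integers 1 to n), and the size n are all made up for illustration, and the timings will differ by machine and R version.

```r
# Hypothetical experiment: both functions return the squares of 1..n
grow_vector <- function(n) {
  output <- vector("double", 0)
  for (i in seq_len(n)) {
    output <- c(output, i^2)   # builds a new, longer vector at every step
  }
  output
}

preallocate_vector <- function(n) {
  output <- vector("double", n)
  for (i in seq_len(n)) {
    output[[i]] <- i^2         # fills a slot that already exists
  }
  output
}

n <- 20000  # arbitrary size; increase it to make the gap easier to see
time_grow     <- system.time(res_grow <- grow_vector(n))
time_prealloc <- system.time(res_prealloc <- preallocate_vector(n))

identical(res_grow, res_prealloc)  # both approaches produce the same values
time_grow[["elapsed"]]
time_prealloc[["elapsed"]]
```

Repeating the comparison for several values of n shows how the cost of each approach scales with the length of the output.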
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
# no creation of an output vector
for (i in seq_along(df)) {         # sequence
  df[[i]] <- rescale01(df[[i]])    # body
}
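There are three basic ways to loop over a vector xs:
Loop over the numeric indices: for (i in seq_along(xs)), extracting the value with xs[[i]]
Loop over the elements: for (x in xs)
Loop over the names: for (nm in names(xs)), extracting the value with xs[[nm]]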
results <- vector("list", length(x))
names(results) <- names(x)
for (i in seq_along(x)) {
  name <- names(x)[[i]]   # name of the current element (avoids masking names())
  value <- x[[i]]         # value of the current element
}
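For comparison, looping over the names directly lets each result be stored under the matching name. A minimal sketch; the named list x and the choice of sum() as the body are hypothetical:

```r
# x is a hypothetical named list used only for illustration
x <- list(a = 1:3, b = c(10, 20), c = 5)

results <- vector("list", length(x))
names(results) <- names(x)

for (nm in names(x)) {
  results[[nm]] <- sum(x[[nm]])  # store each result under the matching name
}
str(results)
```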
A less efficient approach
means <- c(0, 1, 2)
output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)                        # one integer draw from [1, 100]
  output <- c(output, rnorm(n, means[[i]]))  # at each iteration, append n random numbers with mean means[[i]]
}
str(output)
A more efficient approach
means <- c(0, 1, 2)
out <- vector("list", length(means)) # create an empty list
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])  # fill each element of the list
}
str(out) # still a list
str(unlist(out)) # unlist
str(flatten_dbl(out)) # stricter
rbind() # sequentially row-bind one data frame after another
dplyr::bind_rows(output) # when an output is a list of data frames -> a single data frame
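The same pattern applies when each iteration produces a data frame: collect the pieces in a list, then combine them once at the end. A minimal sketch, assuming dplyr is installed; the column names group and value and the five draws per group are made up for illustration:

```r
library(dplyr)

means <- c(0, 1, 2)
pieces <- vector("list", length(means))  # one slot per data frame

for (i in seq_along(means)) {
  pieces[[i]] <- tibble(
    group = i,
    value = rnorm(5, means[[i]])
  )
}

combined <- bind_rows(pieces)  # a single data frame built in one step
dim(combined)
```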
No1. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.
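One possible approach is sketched below, assuming readr and dplyr are installed. So that it can run anywhere, the sketch first writes two small CSV files to a temporary folder; the folder name csv_demo, the file names, and the columns id and score are hypothetical stand-ins for the contents of "data/".

```r
library(readr)
library(dplyr)

# Hypothetical setup: two small CSV files in a temporary folder
data_dir <- file.path(tempdir(), "csv_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble(id = 1:2, score = c(0.5, 0.7)), file.path(data_dir, "part1.csv"))
write_csv(tibble(id = 3:4, score = c(0.9, 0.1)), file.path(data_dir, "part2.csv"))

files <- dir(data_dir, pattern = "\\.csv$", full.names = TRUE)

tables <- vector("list", length(files))  # output: one slot per file
for (i in seq_along(files)) {            # sequence
  tables[[i]] <- read_csv(files[[i]])    # body
}
all_data <- bind_rows(tables)            # combine the list into one data frame
all_data
```

Binding once after the loop avoids growing a data frame row block by row block inside it.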
No3. Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:
show_mean(iris)
#> Sepal.Length: 5.84
#> Sepal.Width:  3.06
#> Petal.Length: 3.76
#> Petal.Width:  1.20
(Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?)
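A possible solution can be sketched in base R. Here the labels are padded to a common width with format(); this is one option among several and not necessarily the function the exercise has in mind.

```r
show_mean <- function(df) {
  # keep only the names of the numeric columns
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  # format() pads a character vector to a common width, which aligns the numbers
  labels <- format(paste0(numeric_cols, ":"))
  for (i in seq_along(numeric_cols)) {
    col_mean <- mean(df[[numeric_cols[[i]]]], na.rm = TRUE)
    cat(labels[[i]], " ", sprintf("%.2f", col_mean), "\n", sep = "")
  }
  invisible(df)
}

show_mean(iris)
```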
“A higher-order function is a function that takes a function as an input or returns a function as output. … [A] functional, a function that takes a function as an input and returns a vector as output.”
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]])
}
output
# define a function that has a for loop in its body
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- mean(df[[i]])
  }
  output
}
# define a similar function for median
col_median <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- median(df[[i]])
  }
  output
}
# define a similar function for standard deviation
col_sd <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- sd(df[[i]])
  }
  output
}
Here, we need one more step-up in generalization. Let’s start with a simple example:
# three functions that differ only in the exponent
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
# generalize them by passing the exponent i as a second argument
f <- function(x, i) abs(x - mean(x)) ^ i
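A quick check that the generalized function reproduces the specific ones; the input vector x below is hypothetical:

```r
f  <- function(x, i) abs(x - mean(x)) ^ i
f2 <- function(x) abs(x - mean(x)) ^ 2

x <- c(1, 4, 7, 10)  # hypothetical input

f(x, 2)                    # same values as f2(x)
all.equal(f(x, 2), f2(x))  # TRUE when the two results agree
```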
Then, come back to our problem.
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
col_summary(df, mean)
Note that fun is an argument of the new function, and it can be any summary function, such as median or mean. That is, a function takes other functions as inputs.
No2. Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column.
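One way this could look is sketched below. The name col_summary_numeric() is hypothetical, and the sketch uses base R's is.numeric() with vapply() in place of a hand-written is_numeric() helper.

```r
# Hypothetical variant of col_summary() that skips non-numeric columns
col_summary_numeric <- function(df, fun) {
  is_num <- vapply(df, is.numeric, logical(1))  # TRUE for each numeric column
  idx <- which(is_num)                          # positions of the numeric columns
  out <- vector("double", length(idx))
  for (i in seq_along(idx)) {
    out[[i]] <- fun(df[[idx[[i]]]])
  }
  names(out) <- names(df)[idx]
  out
}

col_summary_numeric(iris, mean)  # the factor column Species is skipped
```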
map Functions
map() makes a list
map_lgl() makes a logical vector
map_int() makes an integer vector
map_dbl() makes a double vector
map_chr() makes a character vector
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that's the same length (and has the same names) as the input.
map_dbl(df, mean)
map_dbl(df, median)
map_dbl(df, sd)
# with pipes
df %>% map_dbl(mean)
df %>% map_dbl(median)
df %>% map_dbl(sd)
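To see that each variant returns a vector of a specific type with the same length and names as its input, here is a small sketch, assuming purrr is installed; the named list scores is hypothetical:

```r
library(purrr)

# scores is a hypothetical named list used only for illustration
scores <- list(north = c(1, 2, 3), south = c(10, 20), east = 5)

lengths_int <- map_int(scores, length)           # integer vector
means_dbl   <- map_dbl(scores, mean)             # double vector
is_long_lgl <- map_lgl(scores, ~ length(.) > 1)  # logical vector
types_chr   <- map_chr(scores, typeof)           # character vector

names(means_dbl)                      # names carried over from scores
length(means_dbl) == length(scores)   # same length as the input
```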
models <- mtcars %>%
  split(.$cyl) %>%                            # output: a list of three data frames, one per cylinder count
  map(function(df) lm(mpg ~ wt, data = df))   # an anonymous function rather than additional arguments
models
models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .))               # . refers to the current list element
models
models %>%
  map(summary) %>%
  str()
models %>%
  map(summary) %>%
  map_dbl(~ .$r.squared)
models %>%
  map(summary) %>%
  map_dbl("r.squared")                        # a shortcut that extracts a named component
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2) # extract the second element from each list
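No1. Repeat the for loop exercise with map functions:
Compute the mean of every column in mtcars.
Determine the type of each column in nycflights13::flights.
Compute the number of unique values in each column of iris.
Generate 10 random normals for each of the means -10, 0, 10, and 100.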
.map_dbl(mtcars, mean)
map_chr(flights, typeof)
map_int(iris, n_distinct)
map(c(-10, 0, 10, 100), ~ rnorm(n = 10, mean = .))