Summaries and exercises from R for Data Science by Hadley Wickham & Garrett Grolemund.

Pipes with magrittr


Using the Pipe

The pipe lets you express a series of operations in a logical order: Foo Foo hops, then scoops, then bops.

> foo_foo %>%
+   hop(through = forest) %>%
+   scoop(up = field_mice) %>%
+   bop(on = head)

The pipe won’t work for two classes of functions:

  1. Functions that use the current environment, like assign(), get(), and load().
  2. Functions that use lazy evaluation, like tryCatch(), try(), suppressMessages(), and suppressWarnings().
> library(tidyverse)
> library(magrittr)
> assign("x", 10)
> x
[1] 10
> #Does not work because the pipe assigns 
> #it to a temporary environment
> "x" %>% assign(100)
> x
[1] 10
> #Must specify the environment
> env <- environment()
> "x" %>% assign(100, envir = env)
> x
[1] 100
> #Lazy functions only evaluate the arguments
> #when the function uses them, so the pipe
> #won't work
> tryCatch(stop("!"), error = function(e) "An error")
[1] "An error"
> stop("!") %>% 
+   tryCatch(error = function(e) "An error")
Error in eval(lhs, parent, parent): !

It’s best not to use the pipe when:

  • Your pipes are longer than 10 steps
  • You have multiple inputs and outputs
  • You are expressing complex relationships

Other Tools from magrittr

The “tee” pipe, %T>%, returns the left-hand side instead of the right-hand side.

> rnorm(100) %>%
+   matrix(ncol = 2) %>%
+   plot() %>%
+   str()
 NULL
> rnorm(100) %>%
+   matrix(ncol = 2) %T>%
+   plot() %>%
+   str()
 num [1:50, 1:2] 0.251 -0.978 -0.244 0.603 -0.464 ...

For many functions like cor(), where you’re passing them vectors that come from a data frame and not the data frame itself, you can use %$%.

> mtcars %$%
+   cor(disp, mpg)
[1] -0.8475514

You can use %<>% to pipe an object through a function and assign the result back to the same object at the same time.

> #Typical way to assign
> mtcars <- mtcars %>% 
+   transform(cyl = cyl * 2)
> # Assign with magrittr
> mtcars %<>% transform(cyl = cyl * 2)

Functions


When Should You Write a Function?

Imagine you were doing the following transformations:

> df <- tibble::tibble(
+   a = rnorm(10),
+   b = rnorm(10),
+   c = rnorm(10),
+   d = rnorm(10)
+ )
> 
> df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
+   (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
> #Spot the error
> df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
+   (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
> df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
+   (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
> df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
+   (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

Normally you would copy and paste, but that often introduces errors — here df$b mistakenly uses min(df$a, na.rm = TRUE) in the denominator.

The following code has one input, df$a, which can be substituted with \(x\).

> (df$a - min(df$a, na.rm = TRUE)) /
+   (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
 [1] 0.5309992 0.0000000 0.6226931 0.6557572 0.4130389 0.7992905 0.6612732
 [8] 0.0466836 0.2791986 1.0000000
> x <- df$a
> (x - min(x, na.rm = TRUE)) / 
+ (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
 [1] 0.5309992 0.0000000 0.6226931 0.6557572 0.4130389 0.7992905 0.6612732
 [8] 0.0466836 0.2791986 1.0000000

We can simplify it further by using an existing function, range(), which computes the minimum and maximum in a single call.

> rng <- range(x, na.rm = TRUE)
> (x - rng[1]) / (rng[2] - rng[1])
 [1] 0.5309992 0.0000000 0.6226931 0.6557572 0.4130389 0.7992905 0.6612732
 [8] 0.0466836 0.2791986 1.0000000

Now we can create a custom function.

> rescale01 <- function(x) {
+   rng <- range(x, na.rm = TRUE)
+   (x - rng[1]) / (rng[2] - rng[1])
+ }
> rescale01(c(0, 5, 10))
[1] 0.0 0.5 1.0

There are three steps to creating a function:

  1. Pick a name, like rescale01().
  2. List the inputs, or arguments, inside function().
  3. Place the code you have developed in the body of the function, inside { }.

It’s best to test a few inputs.

> rescale01(c(-10, 0, 10))
> rescale01(c(1, 2, 3, NA, 5))
[1] 0.0 0.5 1.0
[1] 0.00 0.25 0.50   NA 1.00

Now we can simplify the original example.

> df$a <- rescale01(df$a)
> df$b <- rescale01(df$b)
> df$c <- rescale01(df$c)
> df$d <- rescale01(df$d)

And if there’s a bug, or requirements change, you only have to fix the code in one place. For example, rescale01() mishandles infinite values:

> x <- c(1:10, Inf)
> rescale01(x)
 [1]   0   0   0   0   0   0   0   0   0   0 NaN
> rescale01 <- function(x) {
+   rng <- range(x, na.rm = TRUE, finite = TRUE)
+   (x - rng[1]) / (rng[2] - rng[1])
+ }
> rescale01(x)
 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
 [8] 0.7777778 0.8888889 1.0000000       Inf

Exercises

1. Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?
  • Because range() is called with finite = TRUE, all non-finite elements (including NA) are already dropped when computing the range, so adding na.rm as a parameter doesn’t change the output: the missing value still comes back as NA, but the rest of the vector is rescaled correctly.
> #Original Function
> rescale01 <- function(x) {
+   rng <- range(x, na.rm = TRUE, finite = TRUE)
+   (x - rng[1]) / (rng[2] - rng[1])
+ }
> rescale01(c(NA, 1:5))
[1]   NA 0.00 0.25 0.50 0.75 1.00
> #Updated Function
> rescale02 <- function(x, na.rm = FALSE) {
+   rng <- range(x, na.rm = na.rm, finite = TRUE)
+   (x - rng[1]) / (rng[2] - rng[1])
+ }
> 
> rescale02(c(NA, 1:5), na.rm = FALSE)
> rescale02(c(NA, 1:5), na.rm = TRUE)
[1]   NA 0.00 0.25 0.50 0.75 1.00
[1]   NA 0.00 0.25 0.50 0.75 1.00
2. In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.
> rescale03 <- function(x) {
+   rng <- range(x, na.rm = TRUE, finite = TRUE)
+   y <- (x - rng[1]) / (rng[2] - rng[1])
+   y[y == -Inf] <- 0
+   y[y == Inf] <- 1
+   y
+ }
> 
> rescale03(c(Inf, -Inf, 0:5, NA))
[1] 1.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0  NA
3. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
> mean(is.na(x))
> 
> x / sum(x, na.rm = TRUE)
> 
> sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
  • mean(is.na(x)) calculates the proportion of NA values in a vector.
> prop_na <- function(x) {
+   mean(is.na(x))
+ }
> prop_na(c(0, 1, NA, 3, 4, NA, 6))
[1] 0.2857143
  • x / sum(x, na.rm = TRUE) standardizes a vector so that it sums to one.
> sum_to_one <- function(x, na.rm = FALSE) {
+   x / sum(x, na.rm = na.rm)
+ }
> sum_to_one(c(1:6))
[1] 0.04761905 0.09523810 0.14285714 0.19047619 0.23809524 0.28571429
  • sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) calculates the coefficient of variation.
> coef_variation <- function(x, na.rm = FALSE) {
+   sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
+ }
> 
> coef_variation(c(1:10, NA), na.rm = TRUE)
[1] 0.5504819
4. Write your own functions to compute the variance and skewness of a numeric vector. Variance is defined as

\[Var(x)=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2\]

where \(\bar{x}=\left(\sum_{i=1}^{n}x_i/n\right)\) is the sample mean. Skewness is defined as

\[Skew(x)=\frac{\frac{1}{n-2}\left(\sum_{i=1}^{n}(x_i-\bar{x})^3\right)}{Var(x)^{3/2}}\]

> variance <- function(x, na.rm = TRUE) {
+   # drop missing values first so that n, the mean,
+   # and the sum are all computed on the same elements
+   if (na.rm) x <- x[!is.na(x)]
+   n <- length(x)
+   m <- mean(x)
+   sq_err <- (x - m)^2
+   sum(sq_err) / (n - 1)
+ }
> 
> var(1:10)
> variance(1:10)
[1] 9.166667
[1] 9.166667
> skewness <- function(x, na.rm = FALSE) {
+   # as above, drop missing values up front so n and
+   # the sum of cubed deviations stay consistent
+   if (na.rm) x <- x[!is.na(x)]
+   n <- length(x)
+   m <- mean(x)
+   v <- var(x)
+   (sum((x - m) ^ 3) / (n - 2)) / v ^ (3 / 2)
+ }
> 
> skewness(c(5, 10, 15, 100))
[1] 1.463378
5. Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.
> both_na <- function(x, y) {
+   stopifnot(length(x)==length(y))
+   sum(is.na(x) & is.na(y))
+ }
> 
> both_na(
+   c(NA, NA, 1, 2),
+   c(NA, 1, NA, 2)
+ )
[1] 1
> both_na(
+   c(NA, NA, 1, 2, NA, NA, 1),
+   c(NA, 1, NA, 2, NA, NA, 1)
+ )
[1] 3
> both_na(
+   c(NA, NA, 1, 2, NA, NA, 1, 3),
+   c(NA, 1, NA, 2, NA, NA, 1)
+ )
Error in both_na(c(NA, NA, 1, 2, NA, NA, 1, 3), c(NA, 1, NA, 2, NA, NA, : length(x) == length(y) is not TRUE
6. What do the following functions do? Why are they useful even though they are so short?
> is_directory <- function(x) file.info(x)$isdir
> is_readable <- function(x) file.access(x, 4) == 0
  • is_directory() checks whether the path in x is a directory.
  • is_readable() checks whether the path in x exists and the user has permission to open it.

Functions Are for Humans and Computers

The formatting should be easy to read and understand.

> # Too short
> f()
> 
> # Not a verb, or descriptive
> my_awesome_function()
> 
> # Long, but clear
> impute_missing()
> collapse_years()

You should use “snake_case”, where each lowercase word is separated by an underscore, or “camelCase”; whichever style you pick, be consistent!

> # Try not to switch around styles, pick one.
> col_mins <- function(x, y) {}
> rowMaxes <- function(y, x) {}

With a common prefix, autocomplete lets you see all the members of a function family:

> # Good
> input_select()
> input_checkbox()
> input_text()
> 
> # Not so good
> select_input()
> checkbox_input()
> text_input()

Avoid overriding existing functions and variables.

> # Don't do this!
> T <- FALSE
> c <- 10
> mean <- function(x) sum(x)

You should use comments (#) to explain why you’re doing something. You should also break your file into readable chunks with section breaks (Cmd/Ctrl + Shift + R). They will then be displayed in the code navigation drop-down at the bottom-left of the editor.

> #(Cmd/Ctrl + Shift + R) Section breaks
> 
> # Load data --------------------------------------
> 
> # Plot data --------------------------------------

Exercises

1. Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
> f1 <- function(string, prefix) {
+   substr(string, 1, nchar(prefix)) == prefix
+ }
> 
> f2 <- function(x) {
+   if (length(x) <= 1) return(NULL)
+   x[-length(x)]
+ }
> 
> f3 <- function(x, y) {
+   rep(y, length.out = length(x))
+ }
  • f1 tests whether each element of the character vector starts with the string prefix. Rename = has_prefix()
  • f2 drops the last element of the vector x. Rename = drop_last()
  • f3 repeats y once for each element of x. Rename = repeat_value()
> f1(c("abc", "abcde", "ad"), "ab")
> f2(1:4)
> f3(1:4, 5)
[1]  TRUE  TRUE FALSE
[1] 1 2 3
[1] 5 5 5 5
2. Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?
  • rnorm() samples from the univariate normal distribution, while MASS::mvrnorm() samples from the multivariate normal distribution. The main arguments in rnorm() are n, mean, sd. The main arguments in MASS::mvrnorm() are n, mu, Sigma. To be consistent they should have the same names.
3. Make a case for why norm_r(), norm_d() etc would be better than rnorm(), dnorm(). Make a case for the opposite.
  • norm = normal distribution, r = random sample, d = density
  • norm_r() and norm_d() start with a distribution type and end with a function.
  • rnorm() and dnorm() start with a function and end with the distribution. R uses this format.
  • Both make sense, so it’s a matter of preference.

Conditional Execution

You’ll often have to conditionally execute code.

> if (condition) {
+   # code executed when condition is TRUE
+ } else {
+   # code executed when condition is FALSE
+ }

The goal of this function is to return a logical vector describing whether or not each element of a vector is named. A function returns the last value that it computed.

> has_name <- function(x) {
+   nms <- names(x)
+   if (is.null(nms)) {
+     rep(FALSE, length(x))
+   } else {
+     !is.na(nms) & nms != ""
+   }
+ }
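
A quick check of has_name() on a couple of inputs (the partially named vector is made up for illustration):

> has_name(c(a = 1, 2, b = 3))
[1]  TRUE FALSE  TRUE
> has_name(1:3)
[1] FALSE FALSE FALSE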

Conditions

The condition must evaluate to either TRUE or FALSE. If it’s a vector, you’ll get a warning message; if it’s an NA, you’ll get an error.

> if (c(TRUE, FALSE)) {}
> #> Warning in if (c(TRUE, FALSE)) {: the 
> #condition has length > 1 and only the
> #> first element will be used
> #> NULL
> 
> if (NA) {}
> #> Error in if (NA) {: missing value where 
> #TRUE/FALSE needed

You can use || (or) and && (and) to combine multiple logical expressions. These operators are “short-circuiting”: as soon as || sees the first TRUE it returns TRUE without computing anything else. As soon as && sees the first FALSE it returns FALSE.

You should never use | or & in an if statement: these are vectorised operations that apply to multiple values (vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.). If you do have a logical vector, you can use any() or all() to collapse it to a single value.
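
For example, a minimal sketch of collapsing a logical vector with any() or all() before using it as a condition (the vector x is made up for illustration):

> x <- c(2, 4, 7)
> if (any(x > 5)) "at least one is large" else "none are large"
[1] "at least one is large"
> if (all(x > 5)) "all are large" else "not all are large"
[1] "not all are large"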

== is also vectorised, which means that it’s easy to get more than one output. Either check the length is already 1, collapse with all() or any(), or use the non-vectorised identical(). identical() is very strict: it always returns either a single TRUE or a single FALSE, and doesn’t coerce types. This means that you need to be careful when comparing integers and doubles:

> identical(0L, 0)
[1] FALSE

You also need to be wary of floating point numbers. Instead use dplyr::near() for comparisons.

> x <- sqrt(2) ^ 2
> x
> x == 2
> x - 2
> dplyr::near(x,2)
[1] 2
[1] FALSE
[1] 4.440892e-16
[1] TRUE

Multiple Conditions

You can add numerous else if statements if there are a number of conditions.

> if (this) {
+   # do that
+ } else if (that) {
+   # do something else
+ } else {
+   # 
+ }

Or, the switch() function allows you to evaluate selected code based on position or name.

>  function(x, y, op) {
+   switch(op,
+        plus = x + y,
+        minus = x - y,
+        times = x * y,
+        divide = x / y,
+        stop("Unknown op!")
+       )
+     }
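
If that anonymous function were given a name, say arith() (a name assumed here for illustration), it could be called with op supplied as a string; an unmatched op falls through to the unnamed default, stop():

> arith <- function(x, y, op) {
+   switch(op,
+        plus = x + y,
+        minus = x - y,
+        times = x * y,
+        divide = x / y,
+        stop("Unknown op!")
+       )
+ }
> arith(6, 3, "times")
[1] 18
> arith(6, 3, "modulo")
Error in arith(6, 3, "modulo"): Unknown op!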

Also, cut() can be used to discretise continuous variables.
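
A minimal sketch of cut() (the breaks and labels here are made up for illustration):

> ages <- c(3, 17, 25, 64, 81)
> cut(ages, breaks = c(0, 18, 65, Inf),
+     labels = c("child", "adult", "senior"))
[1] child  child  adult  adult  senior
Levels: child adult senior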

Code Style

Both if and function should (almost) always be followed by squiggly brackets ({}), and the contents should be indented by two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.

An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else. Always indent the code inside curly braces.

> function(x,y){
+ # Good
+    if (y < 0 && debug) {
+      message("Y is negative")
+    }
+ 
+    if (y == 0) {
+      log(x)
+    } else {
+      y ^ x
+    }
+ }
> 
> function(x,y){
+ # Bad
+ if (y < 0 && debug)
+ message("Y is negative")
+ 
+ if (y == 0) {
+   log(x)
+ } 
+ else {
+   y ^ x
+ }
+ }

For very short statements one line is okay.

> y <- 10
> x <- if (y < 20) "Too low" else "Too high"

But multiple lines is easier to read.

> if (y < 20) {
+   x <- "Too low" 
+ } else {
+   x <- "Too high"
+ }

Exercises

1. What’s the difference between if and ifelse()?
  • if tests a single condition and executes one branch of code, while ifelse() is vectorised: it tests each element of a logical vector and returns the corresponding yes or no value for each element.
> y <- c(3,4,5)
> ifelse(sum(y)>10, "yes", "no")
[1] "yes"
2. Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)
> library(lubridate)
> greet <- function(time = lubridate::now()) {
+   hr <- lubridate::hour(time)
+   if (hr < 12) {
+     print("good morning")
+   } else if (hr < 17) {
+     print("good afternoon")
+   } else {
+     print("good evening")
+   }
+ }
> 
> greet()
> greet(ymd_h("2017-01-08 05"))
> greet(ymd_h("2017-01-08 13"))
> greet(ymd_h("2017-01-08 20"))
[1] "good afternoon"
[1] "good morning"
[1] "good afternoon"
[1] "good evening"
3. Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.
> fizzbuzz <- function(x) {
+   stopifnot(length(x) == 1)
+   stopifnot(is.numeric(x))
+   if (!(x %% 3) && !(x %% 5)) {
+     "fizzbuzz"
+   } else if (!(x %% 3)) {
+     "fizz"
+   } else if (!(x %% 5)) {
+     "buzz"
+   } else {
+     as.character(x)
+   }
+ }
> 
> fizzbuzz(6)
> fizzbuzz(10)
> fizzbuzz(15)
> fizzbuzz(2)
[1] "fizz"
[1] "buzz"
[1] "fizzbuzz"
[1] "2"
> fizzbuzz_vec <- function(x) {
+   case_when(!(x %% 3) & !(x %% 5) ~ "fizzbuzz",
+             !(x %% 3) ~ "fizz",
+             !(x %% 5) ~ "buzz",
+             TRUE ~ as.character(x)
+   )
+ }
> fizzbuzz_vec(c(0, 1, 2, 3, 5, 9, 10, 12, 15))
[1] "fizzbuzz" "1"        "2"        "fizz"     "buzz"     "fizz"     "buzz"    
[8] "fizz"     "fizzbuzz"
> fizzbuzz_vec2 <- function(x) {
+   y <- as.character(x)
+   # put the individual cases first - any elements divisible by both 3 and 5
+   # will be overwritten with fizzbuzz later
+   y[!(x %% 3)] <- "fizz"
+   y[!(x %% 5)] <- "buzz"
+   y[!(x %% 3) & !(x %% 5)] <- "fizzbuzz"
+   y
+ }
> 
> fizzbuzz_vec2(c(0, 1, 2, 3, 5, 9, 10, 12, 15))
[1] "fizzbuzz" "1"        "2"        "fizz"     "buzz"     "fizz"     "buzz"    
[8] "fizz"     "fizzbuzz"
4. How could you use cut() to simplify this set of nested if-else statements?
> if (temp <= 0) {
+   "freezing"
+ } else if (temp <= 10) {
+   "cold"
+ } else if (temp <= 20) {
+   "cool"
+ } else if (temp <= 30) {
+   "warm"
+ } else {
+   "hot"
+ }
How would you change the call to cut() if I’d used < instead of <=?
> #less than or equal to
> temp <- seq(-10, 50, by = 5)
> cut(temp, c(-Inf, 0, 10, 20, 30, Inf),
+     right = TRUE,
+     labels = c("freezing", "cold", 
+                "cool", "warm", "hot")
+ )
 [1] freezing freezing freezing cold     cold     cool     cool     warm    
 [9] warm     hot      hot      hot      hot     
Levels: freezing cold cool warm hot
> #less than
> temp <- seq(-10, 50, by = 5)
> cut(temp, c(-Inf, 0, 10, 20, 30, Inf),
+     right = FALSE,
+     labels = c("freezing", "cold", 
+                "cool", "warm", "hot")
+ )
 [1] freezing freezing cold     cold     cool     cool     warm     warm    
 [9] hot      hot      hot      hot      hot     
Levels: freezing cold cool warm hot
5. What happens if you use switch() with numeric values?
  • If n is numeric, it will return the nth argument.
> switch(1, "apple", "banana", "cantaloupe")
> switch(2, "apple", "banana", "cantaloupe")
[1] "apple"
[1] "banana"
> # only uses the integer part
> switch(1.2, "apple", "banana", "cantaloupe")
> switch(2.8, "apple", "banana", "cantaloupe")
[1] "apple"
[1] "banana"
6. What does this switch() call do? What happens if x is “e”?
> x <- "e"
> switch(x,
+   a = ,
+   b = "ab",
+   c = ,
+   d = "cd"
+ )
  • When switch() encounters an argument with an empty right-hand side, like a = ,, it “falls through” and returns the value of the next argument that does have a value.
  • If the supplied name doesn’t match any argument (as with “e”) and there is no unnamed default, switch() returns NULL invisibly, which is why the calls with “e” and “f” below print nothing.
> switcheroo <- function(x) {
+   switch(x,
+          a = ,
+          b = "ab",
+          c = ,
+          d = "cd"
+   )
+ }
> 
> switcheroo("a")
> switcheroo("b")
> switcheroo("c")
> switcheroo("d")
> switcheroo("e")
> switcheroo("f")
[1] "ab"
[1] "ab"
[1] "cd"
[1] "cd"

Function Arguments

A function’s arguments typically fall into two sets: the data to compute on, and the details that control the computation. For example:

  • log() - the data is x, and the detail is the base of the logarithm.

  • mean() - the data is x, and the details are how much data to trim from the ends (trim) and how to handle missing values (na.rm).

  • t.test() - the data are x and y, and the details of the test are alternative, mu, paired, var.equal, and conf.level.

  • str_c() - you can supply any number of strings to ..., and the details of the concatenation are controlled by sep and collapse.

The data arguments come first, followed by the detail arguments, which should usually have default values.

> # Compute confidence interval around mean 
> # using normal approximation
> mean_ci <- function(x, conf = 0.95) {
+   se <- sd(x) / sqrt(length(x))
+   alpha <- 1 - conf
+   mean(x) + se * qnorm(c(alpha / 2, 
+                          1 - alpha / 2))
+ }
> 
> x <- runif(100)
> mean_ci(x)
[1] 0.4002615 0.5165031
> mean_ci(x, conf = 0.99)
[1] 0.3819986 0.5347660

If you override the default value of a detail argument, you should use the full name:

> # Good
> mean(1:10, na.rm = TRUE)
> 
> # Bad
> mean(x = 1:10, , FALSE)
> mean(, TRUE, x = c(1:10, NA))

You should place a space around = in function calls, and always put a space after a comma. Using whitespace makes it easier to skim the function for the important components.

> # Good
> average <- mean(feet / 12 + inches, na.rm = TRUE)
> 
> # Bad
> average<-mean(feet/12+inches,na.rm=TRUE)

Choosing Names

Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names.

  • x, y, z: vectors.
  • w: a vector of weights.
  • df: a data frame.
  • i, j: numeric indices (typically rows and columns).
  • n: length, or number of rows.
  • p: number of columns.

Otherwise, consider matching names of arguments in existing R functions. For example, use na.rm to determine if missing values should be removed.

Checking values

It’s easy to call your function with invalid inputs. To avoid this problem, it’s often useful to make constraints explicit.

> wt_mean <- function(x, w) {
+   sum(x * w) / sum(w)
+ }
> wt_var <- function(x, w) {
+   mu <- wt_mean(x, w)
+   sum(w * (x - mu) ^ 2) / sum(w)
+ }
> wt_sd <- function(x, w) {
+   sqrt(wt_var(x, w))
+ }

What happens if x and w are not the same length?

> wt_mean(1:6, 1:3)
[1] 7.666667

In this case, because of R’s vector recycling rules, we don’t get an error.

It’s good practice to check important preconditions, and throw an error (with stop()), if they are not true:

> wt_mean <- function(x, w) {
+   if (length(x) != length(w)) {
+     stop("`x` and `w` must be the same length", call. = FALSE)
+   }
+   sum(w * x) / sum(w)
+ }

Be careful not to take this too far, though; there’s a tradeoff between how robust your function is and how long it takes to write. For example, checking na.rm this thoroughly is probably more effort than it’s worth:

> wt_mean <- function(x, w, na.rm = FALSE) {
+   if (!is.logical(na.rm)) {
+     stop("`na.rm` must be logical")
+   }
+   if (length(na.rm) != 1) {
+     stop("`na.rm` must be length 1")
+   }
+   if (length(x) != length(w)) {
+     stop("`x` and `w` must be the same length", call. = FALSE)
+   }
+   
+   if (na.rm) {
+     miss <- is.na(x) | is.na(w)
+     x <- x[!miss]
+     w <- w[!miss]
+   }
+   sum(w * x) / sum(w)
+ }

This is a lot of extra work for little additional gain. A useful compromise is the built-in stopifnot(): it checks that each argument is TRUE, and produces a generic error message if not.

> wt_mean <- function(x, w, na.rm = FALSE) {
+   stopifnot(is.logical(na.rm), length(na.rm) == 1)
+   stopifnot(length(x) == length(w))
+   
+   if (na.rm) {
+     miss <- is.na(x) | is.na(w)
+     x <- x[!miss]
+     w <- w[!miss]
+   }
+   sum(w * x) / sum(w)
+ }
> wt_mean(1:6, 6:1, na.rm = "foo")
Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE

Dot-dot-dot (…)

Many functions in R take an arbitrary number of inputs:

> sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
[1] 55
> stringr::str_c("a", "b", "c", "d", "e", "f")
[1] "abcdef"

They rely on a special argument: ... (pronounced dot-dot-dot). This special argument captures any number of arguments that aren’t otherwise matched.

It’s useful because you can then send those ... on to another function. This is a useful catch-all if your function primarily wraps another function.

> commas <- function(...) stringr::str_c(..., 
+                               collapse = ", ")
> commas(letters[1:10])
[1] "a, b, c, d, e, f, g, h, i, j"
> rule <- function(..., pad = "-") {
+   title <- paste0(...)
+   width <- getOption("width") - 
+            nchar(title) - 5
+   cat(title, " ", stringr::str_dup(pad, 
+                   width), "\n", sep = "")
+ }
> rule("Important output")
Important output -----------------------------------------------------------

However, any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:

> x <- c(1, 2)
> sum(x, na.mr = TRUE)
[1] 4

If you just want to capture the values of the ..., use list(...).
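
A small sketch (count_args() is a made-up name) that captures the arguments in ... and counts them:

> count_args <- function(...) {
+   args <- list(...)
+   length(args)
+ }
> count_args(1, "a", TRUE)
[1] 3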

Lazy Evaluation

Arguments in R are lazily evaluated: they’re not computed until they’re needed, so if an argument is never used, it’s never evaluated.
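
A small illustration (f() here is a made-up example): because y is never used in the body, the stop() supplied as its value is never evaluated:

> f <- function(x, y) {
+   x * 2
+ }
> f(10, stop("this argument is never evaluated"))
[1] 20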

Exercises

1. What does commas(letters, collapse = "-") do? Why?
  • It throws an error because collapse is passed through ... to str_c(), which already receives collapse = ", " from the function body, so the argument is matched by multiple actual arguments.
> commas <- function(...) {
+   str_c(..., collapse = ", ")
+ }
> 
> commas(letters)
[1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z"
> commas(letters, collapse = "-")
Error in str_c(..., collapse = ", "): formal argument "collapse" matched by multiple actual arguments
> commas <- function(..., collapse = ", ") {
+   str_c(..., collapse = collapse)
+ }
> commas(letters, collapse = "-")
[1] "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"
2. It’d be nice if you could supply multiple characters to the pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this currently work? How could you fix it?
  • str_dup(pad, width) repeats the whole pad string width times, so a multi-character pad produces a rule that is too long. Dividing the width by the number of characters in pad, stringr::str_length(pad), fixes it.
> rule <- function(..., pad = "-") {
+   title <- paste0(...)
+   width <- getOption("width") - nchar(title) - 5
+   cat(title, " ", str_dup(pad, width), "\n", sep = "")
+ }
> 
> rule("Important output")
Important output -----------------------------------------------------------
> rule("Valuable output", pad = "-+")
Valuable output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> rule2 <- function(..., pad = "-") {
+   title <- paste0(...)
+   width <- getOption("width") - nchar(title) - 5
+   cat(title, " ", str_dup(pad, 
+     width/stringr::str_length(pad)), 
+     "\n", sep = "")
+ }
> rule2("Valuable output", pad = "-+")
Valuable output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3. What does the trim argument to mean() do? When might you use it?
  • The trim argument trims a fraction of observations from each end of the vector before calculating the mean. This is useful for reducing the influence of outliers.
4. The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?
  • It means that the method argument can take one of those three values. The first value, “pearson”, is used by default.

Return Values

There are two things you should consider when returning a value:

  1. Does returning early make your function easier to read?
  2. Can you make your function pipeable?

Explicit Return Statements

You can use return() to end the function early.

> complicated_function <- function(x, y, z) {
+   if (length(x) == 0 || length(y) == 0) {
+     return(0)
+   }
+     
+   # Complicated code here
+ }

If the first block is very long, by the time you get to the else, you may have forgotten the condition.

> f <- function() {
+   if (x) {
+     # Do 
+     # something
+     # that
+     # takes
+     # many
+     # lines
+     # to
+     # express
+   } else {
+     # return something short
+   }
+ }

Instead, use an early return for the simple case.

> f <- function() {
+   if (!x) {
+     return(something_short)
+   }
+ 
+   # Do 
+   # something
+   # that
+   # takes
+   # many
+   # lines
+   # to
+   # express
+ }

Writing Pipeable Functions

If you want to write your own pipeable functions, it’s important to think about the return value. Knowing the return value’s object type will mean that your pipeline will “just work”. For example, with dplyr and tidyr the object type is the data frame.

There are two basic types of pipeable functions: transformations and side-effects. With transformations, an object is passed to the function’s first argument and a modified object is returned. With side-effects, the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file. Side-effects functions should “invisibly” return the first argument, so that while they’re not printed they can still be used in a pipeline. For example, this simple function prints the number of missing values in a data frame:

> show_missings <- function(df) {
+   n <- sum(is.na(df))
+   cat("Missing values: ", n, "\n", sep = "")
+   
+   invisible(df)
+ }

If we call it interactively, the invisible() means that the input df doesn’t get printed out:

> show_missings(mtcars)
Missing values: 0

But it’s still there, it’s just not printed by default:

> x <- show_missings(mtcars) 
Missing values: 0
> class(x)
[1] "data.frame"
> dim(x)
[1] 32 11

And we can still use it in a pipe:

> mtcars %>% 
+   show_missings() %>% 
+   mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>% 
+   show_missings() 
Missing values: 0
Missing values: 18

Environment

The last component of a function is its environment. The environment of a function controls how R finds the value associated with a name. For example, take this function:

> f <- function(x) {
+   x + y
+ } 

In many programming languages, this would be an error, because y is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to find the value associated with a name. Since y is not defined inside the function, R will look in the environment where the function was defined:

> y <- 100
> f(10)
[1] 110
> y <- 1000
> f(10)
[1] 1010

This allows you to do devious things like:

> `+` <- function(x, y) {
+   if (runif(1) < 0.1) {
+     sum(x, y)
+   } else {
+     sum(x, y) * 1.1
+   }
+ }
> table(replicate(1000, 1 + 2))

  3 3.3 
 88 912 
> rm(`+`)

Vectors


There are two types of vectors.

  • Atomic vectors, of which there are six types: logical, integer and double (numeric), character, complex, and raw.
  • Lists, which are sometimes called recursive vectors because lists can contain other lists.

Atomic vectors are homogeneous, while lists can be heterogeneous.

NULL is often used to represent the absence of a vector, while NA is used to represent the absence of a value in a vector.

You can check the type of a vector with typeof():

> typeof(letters)
> typeof(1:10)
[1] "character"
[1] "integer"

You can check its length with length():

> x <- list("a", "b", 1:10)
> length(x)
[1] 3

Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors, which build in additional behaviour. There are three important types of augmented vector:

  • Factors are built on top of integer vectors.
  • Dates and date-times are built on top of numeric vectors.
  • Data frames and tibbles are built on top of lists.

Important Types of Atomic Vector

The four most important types of atomic vectors are logical, integer, double, and character.

Logical

Logical vectors can take only three possible values: FALSE, TRUE, and NA.

> 1:10 %% 3 == 0
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
> c(TRUE, TRUE, FALSE, NA)
[1]  TRUE  TRUE FALSE    NA

Numeric

Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place an L after the number:

> typeof(1)
[1] "double"
> typeof(1L)
[1] "integer"
> 1.5L
[1] 1.5

Integers vs. Doubles

  1. Doubles are approximations. Doubles represent floating point numbers that can not always be precisely represented with a fixed amount of memory.
> x <- sqrt(2) ^ 2
> x
[1] 2
> x - 2
[1] 4.440892e-16

Instead of comparing floating point numbers using ==, you should use dplyr::near() which allows for some numerical tolerance.

  2. Integers have one special value: NA, while doubles have four: NA, NaN, Inf and -Inf. All three special values NaN, Inf and -Inf can arise during division:
> c(-1, 0, 1) / 0
[1] -Inf  NaN  Inf

Avoid using == to check for these other special values. Instead use the helper functions.

> is.finite(0)
> is.infinite(Inf)
> is.na(NA)
> is.nan(NaN)
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE

Character

Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.

Each unique string is only stored in memory once, and every use of the string points to that representation. This reduces the amount of memory needed by duplicated strings.
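
A rough way to see this, assuming the lobstr package is available (sizes vary by system, so no output is shown):

> x <- "This is a reasonably long string."
> lobstr::obj_size(x)
> # 1,000 pointers to the same string take far less
> # memory than 1,000 separate copies would
> lobstr::obj_size(rep(x, 1000))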

Missing Values

Each type of atomic vector has its own missing value:

> NA            # logical
> NA_integer_   # integer
> NA_real_      # double
> NA_character_ # character
[1] NA
[1] NA
[1] NA
[1] NA

You can always use NA and it will be converted to the correct type. However, there are some functions that are strict about their inputs.

Exercises

1. Describe the difference between is.finite(x) and !is.infinite(x).
  • is.finite() returns FALSE for NA, NaN, Inf, and -Inf, whereas is.infinite() is TRUE only for Inf and -Inf; so !is.infinite() also returns TRUE for NA and NaN, as the comparison below shows.
> x <- c(0, NA, NaN, Inf, -Inf)
> is.finite(x)
> !is.infinite(x)
[1]  TRUE FALSE FALSE FALSE FALSE
[1]  TRUE  TRUE  TRUE FALSE FALSE
2. Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work?
> dplyr::near
function (x, y, tol = .Machine$double.eps^0.5) 
{
    abs(x - y) < tol
}
<bytecode: 0x000000001cf0a838>
<environment: namespace:dplyr>
  • It checks whether the absolute difference between x and y is less than tol, which defaults to the square root of .Machine$double.eps (the machine epsilon: the smallest positive floating point number eps for which 1 + eps is distinguishable from 1).
3. A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use google to do some research.

The range of integer values that R can represent in an integer vector is \(\pm(2^{31}-1)\):

> .Machine$integer.max
[1] 2147483647

A double can represent numbers in a range of about \(\pm 1.8 \times 10^{308}\):

> .Machine$double.xmax
[1] 1.797693e+308
4. Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise.
> tibble(
+   x = c(1.8, 1.5, 1.2, 0.8, 0.5, 0.2, 
+         -0.2, -0.5, -0.8, -1.2, -1.5, -1.8),
+   `Round down` = floor(x),
+   `Round up` = ceiling(x),
+   `Round towards zero` = trunc(x),
+   `Nearest, round half to even` = round(x)
+ ) 
# A tibble: 12 x 5
       x `Round down` `Round up` `Round towards zero` `Nearest, round half to e~
   <dbl>        <dbl>      <dbl>                <dbl>                      <dbl>
 1   1.8            1          2                    1                          2
 2   1.5            1          2                    1                          2
 3   1.2            1          2                    1                          1
 4   0.8            0          1                    0                          1
 5   0.5            0          1                    0                          0
 6   0.2            0          1                    0                          0
 7  -0.2           -1          0                    0                          0
 8  -0.5           -1          0                    0                          0
 9  -0.8           -1          0                    0                         -1
10  -1.2           -2         -1                   -1                         -1
11  -1.5           -2         -1                   -1                         -2
12  -1.8           -2         -1                   -1                         -2
5. What functions from the readr package allow you to turn a string into logical, integer, and double vector?
> parse_logical(c("TRUE", "FALSE", "1", 
+                 "0", "true", "t", "NA"))
[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE    NA
> parse_integer(c("1235", "0134", "NA"))
[1] 1235  134   NA
> parse_number(c("1.0", "3.5", "$1,000.00", 
+                "NA", "ABCD12234.90", 
+                "1234ABC", "A123B", "A1B2C"))
[1]     1.0     3.5  1000.0      NA 12234.9  1234.0   123.0     1.0

Using Atomic Vectors

It’s useful to review some of the important tools for working with vectors.

  1. How to convert from one type to another, and when that happens automatically.

  2. How to tell if an object is a specific type of vector.

  3. What happens when you work with vectors of different lengths.

  4. How to name the elements of a vector.

  5. How to pull out elements of interest.

Coercion

There are two ways to convert, or coerce, one type of vector to another:

  • Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.

  • Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.

The most important type of implicit coercion is using a logical vector in a numeric context. In this case TRUE is converted to 1 and FALSE converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:

> x <- sample(1:20, 100, replace = TRUE)
> z <- x > 10
> sum(z)  # how many are greater than 10?
> # what proportion are greater than 10?
> scales::percent(mean(z)) 
[1] 56
[1] "56%"

It’s also important to understand what happens when you try and create a vector containing multiple types with c(): the most complex type always wins.

> typeof(c(TRUE, 1L))
> typeof(c(1L, 1.5))
> typeof(c(1.5, "a"))
[1] "integer"
[1] "double"
[1] "character"

Test Functions

To test whether a vector is of a given type, you can use the is_* functions provided by purrr:

> is_logical(TRUE)              
> is_integer(2L)                
> is_double(2.5)                
> is_numeric(3.5)   
> is_character("x")             
> is_atomic(1)  
> is_list(list(1:10,2:5))                   
> is_vector(c("x","y"))
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE

Each predicate also comes with a “scalar” version, like is_scalar_atomic(), which checks that the length is 1. This is useful, for example, if you want to check that an argument to your function is a single logical value.
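
For example, using is_scalar_logical(), a sibling of is_scalar_atomic() from purrr:

> is_scalar_logical(TRUE)
[1] TRUE
> is_scalar_logical(c(TRUE, FALSE))
[1] FALSE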

Scalars and Recycling Rules

  • recycling - the shorter vector is repeated, or recycled, to the same length as the longer vector.

  • vectorized - the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.

> sample(10) + 100
 [1] 102 105 103 104 109 108 101 110 106 107
> runif(10) > 0.5
 [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE

R will expand the shortest vector to the same length as the longest, so called recycling. This is silent except when the length of the longer is not an integer multiple of the length of the shorter:

> 1:10 + 1:3
Warning in 1:10 + 1:3: longer object length is not a multiple of shorter object
length
 [1]  2  4  6  5  7  9  8 10 12 11

The vectorised functions in the tidyverse will throw errors when you recycle anything other than a scalar. If you do want to recycle, you’ll need to do it yourself with rep():

> tibble(x = 1:4, y = 1:2)
Error: Tibble columns must have compatible sizes.
* Size 4: Existing data.
* Size 2: Column `y`.
i Only values of size one are recycled.
> tibble(x = 1:4, y = rep(1:2, 2))
# A tibble: 4 x 2
      x     y
  <int> <int>
1     1     1
2     2     2
3     3     1
4     4     2
> tibble(x = 1:4, y = rep(1:2, each = 2))
# A tibble: 4 x 2
      x     y
  <int> <int>
1     1     1
2     2     1
3     3     2
4     4     2

Naming Vectors

All types of vectors can be named. You can name them during creation with c():

> c(x = 1, y = 2, z = 4)
x y z 
1 2 4 

Or after the fact with purrr::set_names():

> set_names(1:3, c("a", "b", "c"))
a b c 
1 2 3 

Subsetting

filter() only works with tibbles (and data frames), so we’ll need a new tool for vectors: [. [ is the subsetting function, and is called like x[a]. There are four types of things that you can subset a vector with:

  1. A numeric vector containing only integers. The integers must either be all positive, all negative, or zero.

Subsetting with positive integers keeps the elements at those positions:

> x <- c("one", "two", "three", "four", "five")
> x[c(3, 2, 5)]
[1] "three" "two"   "five" 

By repeating a position, you can actually make a longer output than input:

> x[c(1, 1, 5, 5, 5, 2)]
[1] "one"  "one"  "five" "five" "five" "two" 

Negative values drop the elements at the specified positions:

> x[c(-1, -3, -5)]
[1] "two"  "four"

It’s an error to mix positive and negative values:

> x[c(1, -1)]
Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts

The error message mentions subsetting with zero, which returns no values:

> x[0]
character(0)
  2. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.
> x <- c(10, 3, NA, 5, 8, 1, NA)
> 
> # All non-missing values of x
> x[!is.na(x)]
[1] 10  3  5  8  1
> # All even (or missing!) values of x
> x[x %% 2 == 0]
[1] 10 NA  8 NA
  3. If you have a named vector, you can subset it with a character vector:
> x <- c(abc = 1, def = 2, xyz = 5)
> x[c("xyz", "def")]
xyz def 
  5   2 
  4. The simplest type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but it is useful when subsetting matrices (and other high dimensional structures) because it lets you select all the rows or all the columns, by leaving that index blank. For example, if x is 2d, x[1, ] selects the first row and all the columns, and x[, -1] selects all rows and all columns except the first.

There is an important variation of [ called [[. [[ only ever extracts a single element, and always drops names. It’s a good idea to use it whenever you want to make it clear that you’re extracting a single item, as in a for loop.
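
A quick illustration with the named vector x from above: [ keeps the name, while [[ extracts the bare value.

> x["xyz"]
xyz 
  5 
> x[["xyz"]]
[1] 5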

Exercises

1. What does mean(is.na(x)) tell you about a vector x? What about sum(!is.finite(x))?
> x <- c(-Inf, -1, 0, 1, Inf, NA, NaN)
> # 2/7 Na, NaN
> mean(is.na(x))
[1] 0.2857143
> # 4 -Inf, Inf, Na, NaN
> sum(!is.finite(x))
[1] 4
2. Carefully read the documentation of is.vector(). What does it actually test for? Why does is.atomic() not agree with the definition of atomic vectors above?
  • The function is.vector() only checks whether the object has no attributes other than names.
> is.vector(list(a = 1, b = 2))
[1] TRUE
  • The function is.atomic() explicitly checks whether an object is one of the atomic types (“logical”, “integer”, “numeric”, “complex”, “character”, and “raw”) or NULL.
> is.atomic(1:10)
[1] TRUE
> is.atomic(list(a = 1))
[1] FALSE
3. Compare and contrast setNames() with purrr::set_names().
  • The function setNames() takes two arguments, a vector to be named and a vector of names to apply to its elements.
> setNames(1:4, c("a", "b", "c", "d"))
a b c d 
1 2 3 4 
> setNames(nm = c("a", "b", "c", "d"))
  a   b   c   d 
"a" "b" "c" "d" 
  • The function set_names() has more ways to set the names than setNames().
> purrr::set_names(1:4, c("a", "b", "c", "d"))
a b c d 
1 2 3 4 
> purrr::set_names(1:4, "a", "b", "c", "d")
a b c d 
1 2 3 4 
> purrr::set_names(c("a", "b", "c", "d"))
  a   b   c   d 
"a" "b" "c" "d" 
> purrr::set_names(c(a = 1, b = 2, c = 3), toupper)
A B C 
1 2 3 
> purrr::set_names(1:4, c("a", "b"))
Error: `nm` must be `NULL` or a character vector the same length as `x`
4. Create functions that take a vector as input and returns:
\(\space\space a.\) The last value. Should you use [ or [[?
> last_value <- function(x) {
+   if (length(x)) {
+     x[[length(x)]]
+   } else {
+     x
+   }
+ }
> 
> last_value(numeric())
numeric(0)
> last_value(1)
[1] 1
> last_value(1:10)
[1] 10
\(\space\space b.\) The elements at even numbered positions.
> even_indices <- function(x) {
+   if (length(x)) {
+     x[seq_along(x) %% 2 == 0]
+   } else {
+     x
+   }
+ }
> even_indices(numeric())
numeric(0)
> even_indices(1)
numeric(0)
> even_indices(1:10)
[1]  2  4  6  8 10
> even_indices(letters)
 [1] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" "v" "x" "z"
\(\space\space c.\) Every element except the last value.
> not_last <- function(x) {
+   n <- length(x)
+   if (n) {
+     x[-n]
+   } else {
+     # n == 0
+     x
+   }
+ }
> 
> not_last(1:3)
[1] 1 2
\(\space\space d.\) Only even numbers (and no missing values).
> even_numbers2 <- function(x) {
+   x[!is.na(x) & !is.infinite(x) & (x %% 2 == 0)]
+ }
> even_numbers2(c(0:4, NA, NaN, Inf, -Inf))
[1] 0 2 4
5. Why is x[-which(x > 0)] not the same as x[x <= 0]?
> x <- c(-1:1, Inf, -Inf, NaN, NA)
> x[-which(x > 0)]
[1]   -1    0 -Inf  NaN   NA
> x[x <= 0]
[1]   -1    0 -Inf   NA   NA
  • They return the same values except that x[-which(x > 0)] keeps NaN as NaN, while x[x <= 0] returns NA in its place: which() simply skips NA and NaN when finding the positions of positive elements, whereas NaN <= 0 evaluates to NA, so logical subsetting returns NA for that element.
6. What happens when you subset with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?
> x <- c(a = 10, b = 20)
> x[3:5]
<NA> <NA> <NA> 
  NA   NA   NA 
> x[1:5]
   a    b <NA> <NA> <NA> 
  10   20   NA   NA   NA 

Recursive Vectors (Lists)

Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with list():

> x <- list(1, 2, 3)
> x
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

A very useful tool for working with lists is str() because it focuses on the structure, not the contents.

> str(x)
List of 3
 $ : num 1
 $ : num 2
 $ : num 3
> x_named <- list(a = 1, b = 2, c = 3)
> str(x_named)
List of 3
 $ a: num 1
 $ b: num 2
 $ c: num 3

Unlike atomic vectors, list() can contain a mix of objects:

> y <- list("a", 1L, 1.5, TRUE)
> str(y)
List of 4
 $ : chr "a"
 $ : int 1
 $ : num 1.5
 $ : logi TRUE

Lists can even contain other lists.

> z <- list(list(1, 2), list(3, 4))
> str(z)
List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ :List of 2
  ..$ : num 3
  ..$ : num 4

Visualizing Lists

To explain more complicated list manipulation functions, it’s helpful to have a visual representation of lists. For example, take these three lists:

> x1 <- list(c(1, 2), c(3, 4))
> x2 <- list(list(1, 2), list(3, 4))
> x3 <- list(1, list(2, list(3)))

Subsetting

There are three ways to subset a list, which can be illustrated with a list named a:

> a <- list(a = 1:3, b = "a string", 
+           c = pi, d = list(-1, -5))
  • [ extracts a sub-list. The result will always be a list.
> str(a[1:2])
List of 2
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
> str(a[4])
List of 1
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5

Like with vectors, you can subset with a logical, integer, or character vector.

[[ extracts a single component from a list. It removes a level of hierarchy from the list.

> str(a[[1]])
 int [1:3] 1 2 3
> str(a[[4]])
List of 2
 $ : num -1
 $ : num -5

$ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.

> a$a
[1] 1 2 3
> a[["a"]]
[1] 1 2 3

The distinction between [ and [[ is really important for lists, because [[ drills down into the list while [ returns a new, smaller list. Compare the code and output above with the visual representation.

Exercises

1. What happens if you subset a tibble as if you’re subsetting a list? What are the key differences between a list and a tibble?

Subsetting a tibble works the same way as a list; a data frame can be thought of as a list of columns. The key difference between a list and a tibble is that all the elements (columns) of a tibble must have the same length (number of rows). Lists can have vectors with different lengths as elements.

> x <- tibble(a = 1:2, b = 3:4)
> x[[1]]
[1] 1 2

Attributes

Any vector can contain arbitrary additional metadata through its attributes. You can think of attributes as a named list of vectors that can be attached to any object. You can get and set individual attribute values with attr() or see them all at once with attributes().

> x <- 1:10
> attr(x, "greeting")
NULL
> attr(x, "greeting") <- "Hi!"
> attr(x, "farewell") <- "Bye!"
> attributes(x)
$greeting
[1] "Hi!"

$farewell
[1] "Bye!"

There are three very important attributes that are used to implement fundamental parts of R:

  1. Names are used to name the elements of a vector.
  2. Dimensions (dims, for short) make a vector behave like a matrix or array.
  3. Class is used to implement the S3 object oriented system.

Here’s what a typical generic function looks like:

> as.Date
function (x, ...) 
UseMethod("as.Date")
<bytecode: 0x0000000012fe60a0>
<environment: namespace:base>

The call to “UseMethod” means that this is a generic function, and it will call a specific method, a function, based on the class of the first argument. (All methods are functions; not all functions are methods). You can list all the methods for a generic with methods():

> methods("as.Date")
[1] as.Date.character   as.Date.default     as.Date.factor     
[4] as.Date.numeric     as.Date.POSIXct     as.Date.POSIXlt    
[7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
see '?methods' for accessing help and source code

For example, if x is a character vector, as.Date() will call as.Date.character(); if it’s a factor, it’ll call as.Date.factor().

You can see the specific implementation of a method with getS3method():

> getS3method("as.Date", "default")
function (x, ...) 
{
    if (inherits(x, "Date")) 
        x
    else if (is.null(x)) 
        .Date(numeric())
    else if (is.logical(x) && all(is.na(x))) 
        .Date(as.numeric(x))
    else stop(gettextf("do not know how to convert '%s' to class %s", 
        deparse1(substitute(x)), dQuote("Date")), domain = NA)
}
<bytecode: 0x000000001d0dc8e8>
<environment: namespace:base>
> getS3method("as.Date", "numeric")
function (x, origin, ...) 
{
    if (missing(origin)) {
        if (!length(x)) 
            return(.Date(numeric()))
        if (!any(is.finite(x))) 
            return(.Date(x))
        stop("'origin' must be supplied")
    }
    as.Date(origin, ...) + x
}
<bytecode: 0x000000001e2d03e8>
<environment: namespace:base>

The most important S3 generic is print(): it controls how the object is printed when you type its name at the console. Other important generics are the subsetting functions [, [[, and $.

Augmented Vectors

Atomic vectors and lists are the building blocks for other important vector types like factors and dates. These are called augmented vectors, because they are vectors with additional attributes, including class. Because augmented vectors have a class, they behave differently to the atomic vector on which they are built. Four important augmented vectors:

  • Factors
  • Dates
  • Date-times
  • Tibbles

Factors

Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:

> x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
> typeof(x)
[1] "integer"
> attributes(x)
$levels
[1] "ab" "cd" "ef"

$class
[1] "factor"

Dates and Date-Times

Dates in R are numeric vectors that represent the number of days since 1 January 1970.

> x <- as.Date("1971-01-01")
> unclass(x)
[1] 365
> typeof(x)
[1] "double"
> attributes(x)
$class
[1] "Date"

Date-times are numeric vectors with class POSIXct that represent the number of seconds since 1 January 1970. (“POSIXct” stands for “Portable Operating System Interface”, calendar time.)

> x <- lubridate::ymd_hm("1970-01-01 01:00")
> unclass(x)
[1] 3600
attr(,"tzone")
[1] "UTC"
> typeof(x)
[1] "double"
> attributes(x)
$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] "UTC"

The tzone attribute is optional. It controls how the time is printed, not what absolute time it refers to.

> attr(x, "tzone") <- "US/Pacific"
> x
[1] "1969-12-31 17:00:00 PST"
> attr(x, "tzone") <- "US/Eastern"
> x
[1] "1969-12-31 20:00:00 EST"

There is another type of date-times called POSIXlt. These are built on top of named lists:

> y <- as.POSIXlt(x)
> typeof(y)
[1] "list"
> attributes(y)
$names
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"  
 [9] "isdst"  "zone"   "gmtoff"

$class
[1] "POSIXlt" "POSIXt" 

$tzone
[1] "US/Eastern" "EST"        "EDT"       

POSIXct is always easier to work with, so if you find you have a POSIXlt, you should convert it back to a regular date-time with lubridate::as_datetime().
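
For example, a quick sketch converting the POSIXlt y from above back to POSIXct (z is just a throwaway name):

> z <- lubridate::as_datetime(y)
> class(z)
[1] "POSIXct" "POSIXt" 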

Tibbles

Tibbles are augmented lists: they have class “tbl_df” + “tbl” + “data.frame”, and names (column names) and row.names attributes:

> tb <- tibble::tibble(x = 1:5, y = 5:1)
> typeof(tb)
[1] "list"
> attributes(tb)
$names
[1] "x" "y"

$row.names
[1] 1 2 3 4 5

$class
[1] "tbl_df"     "tbl"        "data.frame"

The difference between a tibble and a list is that all the elements of a data frame must be vectors with the same length. All functions that work with tibbles enforce this constraint.

Traditional data.frames have a very similar structure:

> df <- data.frame(x = 1:5, y = 5:1)
> typeof(df)
[1] "list"
> attributes(df)
$names
[1] "x" "y"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5

The main difference is the class. The class of tibble includes “data.frame” which means tibbles inherit the regular data frame behaviour by default.

Exercises

1. What does hms::hms(3600) return? How does it print? What primitive type is the augmented vector built on top of? What attributes does it use?
  • hms::hms(3600) returns an object representing 3600 seconds, printed in “%H:%M:%S” format (01:00:00). It is built on top of a double and uses the units and class attributes.
> x <- hms::hms(3600)
> x
01:00:00
> class(x)
[1] "hms"      "difftime"
> typeof(x)
[1] "double"
> attributes(x)
$units
[1] "secs"

$class
[1] "hms"      "difftime"
2. Try and make a tibble that has columns with different lengths. What happens?
  • A scalar is repeated to the length of the longer vector.
> tibble(x = 1, y = 1:5)
# A tibble: 5 x 2
      x     y
  <dbl> <int>
1     1     1
2     1     2
3     1     3
4     1     4
5     1     5
  • A tibble with two vectors of different lengths (other than one) throws an error.
> tibble(x = 1:3, y = 1:4)
Error: Tibble columns must have compatible sizes.
* Size 3: Existing data.
* Size 4: Column `y`.
i Only values of size one are recycled.
3. Based on the definition above, is it ok to have a list as a column of a tibble?
> tibble(x = 1:3, y = list("a", 1, list(1:3)))
# A tibble: 3 x 2
      x y         
  <int> <list>    
1     1 <chr [1]> 
2     2 <dbl [1]> 
3     3 <list [1]>

Iteration with purrr


Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

  • Imperative programming - On the imperative side you have tools like for loops and while loops, which make iteration very explicit, so it’s obvious what’s happening. However, for loops are quite verbose, and require quite a bit of bookkeeping code that is duplicated for every for loop.

  • Functional programming (FP) - offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. With FP you can solve many common iteration problems with less code, more ease, and fewer errors.

For Loops

Imagine we have this simple tibble:

> df <- tibble(
+   a = rnorm(10),
+   b = rnorm(10),
+   c = rnorm(10),
+   d = rnorm(10)
+ )

You want to compute the median of each column. You could do it with copy-and-paste:

> median(df$a)
> median(df$b)
> median(df$c)
> median(df$d)
[1] -0.3534617
[1] 0.305703
[1] -0.481693
[1] 0.3886924

However, that could be a long and messy process. Instead, we could use a for loop:

> output <- vector("double", ncol(df))  # 1. output
> for (i in seq_along(df)) {            # 2. sequence
+   output[[i]] <- median(df[[i]])      # 3. body
+ }
> output
[1] -0.3534617  0.3057030 -0.4816930  0.3886924

Every for loop has three components:

  1. The output: output <- vector("double", length(x)). Before you start the loop, you must always allocate sufficient space for the output. This is very important for efficiency: if you grow the output at each iteration using c() (for example), your for loop will be very slow.

A general way of creating an empty vector of given length is the vector() function. It has two arguments: the type of the vector (“logical”, “integer”, “double”, “character”, etc) and the length of the vector.
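
For example:

> vector("double", 3)
[1] 0 0 0
> vector("character", 2)
[1] "" ""
> vector("logical", 2)
[1] FALSE FALSE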

  2. The sequence: i in seq_along(df). This determines what to loop over: each run of the for loop will assign i to a different value from seq_along(df). It’s useful to think of i as a pronoun, like “it”.

You might not have seen seq_along() before. It’s a safe version of the familiar 1:length(l), with an important difference: if you have a zero-length vector, seq_along() does the right thing:

> y <- vector("double", 0)
> seq_along(y)
integer(0)
> 1:length(y)
[1] 1 0
  3. The body: output[[i]] <- median(df[[i]]). This is the code that does the work. It’s run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.

Exercises

1. Write for loops to:
\(\space\space a.\) Compute the mean of every column in mtcars.
\(\space\space b.\) Determine the type of each column in nycflights13::flights.
\(\space\space c.\) Compute the number of unique values in each column of iris.
\(\space\space d.\) Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
Think about the output, sequence, and body before you start writing the loop.

mean of every column in mtcars

> output <- vector("double", ncol(mtcars))
> names(output) <- names(mtcars)
> for (i in names(mtcars)) {
+   output[i] <- mean(mtcars[[i]])
+ }
> output
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625  24.750000 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

type of each column in nycflights13::flights

> output <- vector("list", ncol(nycflights13::flights))
> names(output) <- names(nycflights13::flights)
> for (i in names(nycflights13::flights)) {
+   output[[i]] <- class(nycflights13::flights[[i]])
+ }
> output
$year
[1] "integer"

$month
[1] "integer"

$day
[1] "integer"

$dep_time
[1] "integer"

$sched_dep_time
[1] "integer"

$dep_delay
[1] "numeric"

$arr_time
[1] "integer"

$sched_arr_time
[1] "integer"

$arr_delay
[1] "numeric"

$carrier
[1] "character"

$flight
[1] "integer"

$tailnum
[1] "character"

$origin
[1] "character"

$dest
[1] "character"

$air_time
[1] "numeric"

$distance
[1] "numeric"

$hour
[1] "numeric"

$minute
[1] "numeric"

$time_hour
[1] "POSIXct" "POSIXt" 

the number of unique values in each column of iris

> data("iris")
> iris_uniq <- vector("double", ncol(iris))
> names(iris_uniq) <- names(iris)
> for (i in names(iris)) {
+   iris_uniq[i] <- n_distinct(iris[[i]])
+ }
> iris_uniq
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          35           23           43           22            3 

10 random normals from distributions with means of -10, 0, 10, and 100

> n <- 10
> # values of the mean
> mu <- c(-10, 0, 10, 100)
> normals <- vector("list", length(mu))
> for (i in seq_along(normals)) {
+   normals[[i]] <- rnorm(n, mean = mu[i])
+ }
> normals
[[1]]
 [1] -11.305450  -8.969553 -10.597783  -9.490975 -12.033553  -8.355739
 [7]  -8.497958  -7.966495  -9.667332 -10.892709

[[2]]
 [1]  0.73334559 -0.39124276  0.02113921  1.01163921  0.59841898 -1.04622886
 [7]  1.04824229  0.21131357  1.18336142 -1.82724122

[[3]]
 [1] 10.671651 10.374447 11.521582 10.735073  9.589511 10.806080 10.519606
 [8]  9.582236 11.563617 10.219076

[[4]]
 [1] 100.06090 101.17170 100.95919 100.57750 101.63769 102.98548 102.08713
 [8] 100.72449  98.89897  99.43802
2. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

\(A\)

> out <- ""
> for (x in letters) {
+   out <- stringr::str_c(out, x)
+ }
> out
[1] "abcdefghijklmnopqrstuvwxyz"

Answer

> str_c(letters, collapse = "")
[1] "abcdefghijklmnopqrstuvwxyz"

\(B\)

> x <- sample(100)
> std <- 0
> for (i in seq_along(x)) {
+   std <- std + (x[i] - mean(x)) ^ 2
+ }
> std <- sqrt(std / (length(x) - 1))
> std
[1] 29.01149

Answer

> sd(x)
[1] 29.01149

\(C\)

> x <- runif(100)
> out <- vector("numeric", length(x))
> out[1] <- x[1]
> for (i in 2:length(x)) {
+   out[i] <- out[i - 1] + x[i]
+ }
> out
  [1]  0.1440869  1.0679018  1.3891570  2.1269189  2.6886382  3.3821259
  [7]  3.9923716  4.6016887  4.9330813  5.3848960  5.7287877  5.7609090
 [13]  6.6276744  7.3797277  8.0035553  8.3853359  8.9505014  9.2988127
 [19]  9.7329107 10.4536404 11.4071862 12.2734312 12.2970024 12.7295530
 [25] 13.1116046 13.2240063 13.7072301 14.2046590 15.0448627 15.9517964
 [31] 16.0222470 16.6757728 16.7059309 17.3781322 17.4546802 18.0476092
 [37] 18.7081376 19.6130017 20.1667295 20.7087546 21.6915076 22.6413271
 [43] 23.5487731 24.0550506 25.0443236 25.5013848 26.4970597 27.1068480
 [49] 27.5307976 27.5383474 28.4668664 29.0783425 29.6132066 30.4023030
 [55] 31.3647640 31.5877037 31.9190192 32.4822475 33.2450028 33.9801238
 [61] 34.5088791 34.6028091 35.0269705 35.1175262 35.3690433 35.5147748
 [67] 35.6804987 36.2560811 36.5100650 36.9092297 36.9531377 37.5071875
 [73] 37.7281001 38.4011549 39.2863424 39.7743467 40.3965827 40.9600051
 [79] 41.4264888 42.1911425 42.4171137 42.9757523 43.9440758 44.3726887
 [85] 44.6519331 44.8845706 45.4585461 45.5974488 46.1936807 46.8595879
 [91] 47.8160022 48.8087514 48.9079801 49.8920008 49.9395958 50.2672004
 [97] 51.0444748 51.8900169 52.4357557 53.1680187

Answer

> cumsum(x)
  [1]  0.1440869  1.0679018  1.3891570  2.1269189  2.6886382  3.3821259
  [7]  3.9923716  4.6016887  4.9330813  5.3848960  5.7287877  5.7609090
 [13]  6.6276744  7.3797277  8.0035553  8.3853359  8.9505014  9.2988127
 [19]  9.7329107 10.4536404 11.4071862 12.2734312 12.2970024 12.7295530
 [25] 13.1116046 13.2240063 13.7072301 14.2046590 15.0448627 15.9517964
 [31] 16.0222470 16.6757728 16.7059309 17.3781322 17.4546802 18.0476092
 [37] 18.7081376 19.6130017 20.1667295 20.7087546 21.6915076 22.6413271
 [43] 23.5487731 24.0550506 25.0443236 25.5013848 26.4970597 27.1068480
 [49] 27.5307976 27.5383474 28.4668664 29.0783425 29.6132066 30.4023030
 [55] 31.3647640 31.5877037 31.9190192 32.4822475 33.2450028 33.9801238
 [61] 34.5088791 34.6028091 35.0269705 35.1175262 35.3690433 35.5147748
 [67] 35.6804987 36.2560811 36.5100650 36.9092297 36.9531377 37.5071875
 [73] 37.7281001 38.4011549 39.2863424 39.7743467 40.3965827 40.9600051
 [79] 41.4264888 42.1911425 42.4171137 42.9757523 43.9440758 44.3726887
 [85] 44.6519331 44.8845706 45.4585461 45.5974488 46.1936807 46.8595879
 [91] 47.8160022 48.8087514 48.9079801 49.8920008 49.9395958 50.2672004
 [97] 51.0444748 51.8900169 52.4357557 53.1680187
3. Combine your function writing and for loop skills:
\(\space\space a.\) Write a for loop that prints() the lyrics to the children’s song “Alice the camel”.
\(\space\space b.\) Convert the nursery rhyme “ten in the bed” to a function. Generalise it to any number of people in any sleeping structure.
\(\space\space c.\) Convert the song “99 bottles of beer on the wall” to a function. Generalise to any number of any vessel containing any liquid on any surface.

\(A\)

> humps <- c("five", "four", "three", "two", "one", "no")
> for (i in humps) {
+   cat(str_c("Alice the camel has ", rep(i, 3), " humps.",
+     collapse = "\n"
+   ), "\n")
+   if (i == "no") {
+     cat("Now Alice is a horse.\n")
+   } else {
+     cat("So go, Alice, go.\n")
+   }
+   cat("\n")
+ }
Alice the camel has five humps.
Alice the camel has five humps.
Alice the camel has five humps. 
So go, Alice, go.

Alice the camel has four humps.
Alice the camel has four humps.
Alice the camel has four humps. 
So go, Alice, go.

Alice the camel has three humps.
Alice the camel has three humps.
Alice the camel has three humps. 
So go, Alice, go.

Alice the camel has two humps.
Alice the camel has two humps.
Alice the camel has two humps. 
So go, Alice, go.

Alice the camel has one humps.
Alice the camel has one humps.
Alice the camel has one humps. 
So go, Alice, go.

Alice the camel has no humps.
Alice the camel has no humps.
Alice the camel has no humps. 
Now Alice is a horse.

\(B\)

> numbers <- c(
+   "ten", "nine", "eight", "seven", "six", "five",
+   "four", "three", "two", "one"
+ )
> for (i in numbers) {
+   cat(str_c("There were ", i, " in the bed\n"))
+   cat("and the little one said\n")
+   if (i == "one") {
+     cat("I'm lonely...")
+   } else {
+     cat("Roll over, roll over\n")
+     cat("So they all rolled over and one fell out.\n")
+   }
+   cat("\n")
+ }
There were ten in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were nine in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were eight in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were seven in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were six in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were five in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were four in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were three in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were two in the bed
and the little one said
Roll over, roll over
So they all rolled over and one fell out.

There were one in the bed
and the little one said
I'm lonely...

\(C\)

> bottles <- function(n) {
+   if (n > 1) {
+     str_c(n, " bottles")
+   } else if (n == 1) {
+     "1 bottle"
+   } else {
+     "no more bottles"
+   }
+ }
> 
> beer_bottles <- function(total_bottles) {
+   # print each lyric
+   for (current_bottles in seq(total_bottles, 0)) {
+     # first line
+     cat(str_to_sentence(str_c(bottles(current_bottles), " of beer on the wall, ", bottles(current_bottles), " of beer.\n")))   
+     # second line
+     if (current_bottles > 0) {
+       cat(str_c(
+         "Take one down and pass it around, ", bottles(current_bottles - 1),
+         " of beer on the wall.\n"
+       ))          
+     } else {
+       cat(str_c("Go to the store and buy some more, ", bottles(total_bottles), " of beer on the wall.\n"))                }
+     cat("\n")
+   }
+ }
> beer_bottles(3)
3 bottles of beer on the wall, 3 bottles of beer.
Take one down and pass it around, 2 bottles of beer on the wall.

2 bottles of beer on the wall, 2 bottles of beer.
Take one down and pass it around, 1 bottle of beer on the wall.

1 bottle of beer on the wall, 1 bottle of beer.
Take one down and pass it around, no more bottles of beer on the wall.

No more bottles of beer on the wall, no more bottles of beer.
Go to the store and buy some more, 3 bottles of beer on the wall.
4. It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step. How does this affect performance? Design and execute an experiment.
> add_to_vector <- function(n) {
+   output <- vector("integer", 0)
+   for (i in seq_len(n)) {
+     output <- c(output, i)
+   }
+   output
+ }
> add_to_vector(10)
 [1]  1  2  3  4  5  6  7  8  9 10
> add_to_vector_2 <- function(n) {
+   output <- vector("integer", n)
+   for (i in seq_len(n)) {
+     output[[i]] <- i
+   }
+   output
+ }
> add_to_vector_2(10)
 [1]  1  2  3  4  5  6  7  8  9 10
> library(microbenchmark)
> timings <- microbenchmark(add_to_vector(10000), add_to_vector_2(10000), times = 10)
> timings
Unit: microseconds
                   expr     min      lq     mean   median      uq      max
   add_to_vector(10000) 74960.7 76436.5 82950.63 79618.15 82265.7 117413.2
 add_to_vector_2(10000)   374.0   382.0   638.15   555.90   936.8    951.4
 neval cld
    10   b
    10  a 

Appending to a vector at each iteration takes far longer than pre-allocating the output vector.

For Loop Variations

There are four variations on the basic theme of the for loop:

  1. Modifying an existing object, instead of creating a new object.
  2. Looping over names or values, instead of indices.
  3. Handling outputs of unknown length.
  4. Handling sequences of unknown length.

Modifying an Existing Object

Sometimes you want to use a for loop to modify an existing object. For example, when we wanted to rescale every column in a data frame, the copy-and-paste approach wasn’t very efficient:

> df <- tibble(
+   a = rnorm(10),
+   b = rnorm(10),
+   c = rnorm(10),
+   d = rnorm(10)
+ )
> rescale01 <- function(x) {
+   rng <- range(x, na.rm = TRUE)
+   (x - rng[1]) / (rng[2] - rng[1])
+ }
> 
> df$a <- rescale01(df$a)
> df$b <- rescale01(df$b)
> df$c <- rescale01(df$c)
> df$d <- rescale01(df$d)

To solve this with a for loop we again think about the three components:

  1. Output: we already have the output — it’s the same as the input.

  2. Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).

  3. Body: apply rescale01().

This gives us:

> for (i in seq_along(df)) {
+   df[[i]] <- rescale01(df[[i]])
+ }

Typically you’ll be modifying a list or data frame with this sort of loop, so remember to use [[, not [.
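
As a quick illustration with the df above (the exact values are random, so just check the classes): [ keeps the data-frame wrapper, while [[ extracts the column itself.

> class(df[1])
[1] "tbl_df"     "tbl"        "data.frame"
> class(df[[1]])
[1] "numeric"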

Looping Patterns

There are three basic ways to loop over a vector.

  1. Looping over the numeric indices with for (i in seq_along(xs)) and extracting the value with xs[[i]].

  2. Loop over the elements: for (x in xs). This is most useful if you only care about side-effects, like plotting or saving a file, because it’s difficult to save the output efficiently.

  3. Loop over the names: for (nm in names(xs)). This gives you the name, which you can use to access the value with xs[[nm]]. This is useful if you want to use the name in a plot title or a file name. If you’re creating named output, make sure to name the results vector like so:

> results <- vector("list", length(xs))
> names(results) <- names(xs)

Iteration over the numeric indices is the most general form, because given the position you can extract both the name and the value:

> for (i in seq_along(xs)) {
+   name <- names(xs)[[i]]
+   value <- xs[[i]]
+ }

Unknown Output Length

Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:

> means <- c(0, 1, 2)
> 
> output <- double()
> for (i in seq_along(means)) {
+   n <- sample(100, 1)
+   output <- c(output, rnorm(n, means[[i]]))
+ }
> str(output)
 num [1:144] -0.831 0.219 -0.806 1.523 -0.697 ...

But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:

> out <- vector("list", length(means))
> for (i in seq_along(means)) {
+   n <- sample(100, 1)
+   out[[i]] <- rnorm(n, means[[i]])
+ }
> str(out)
List of 3
 $ : num [1:87] 0.2105 0.13719 0.00106 0.4588 -1.41733 ...
 $ : num [1:54] 0.965 2.433 -0.512 1.621 -0.187 ...
 $ : num [1:11] 4.676 1.363 -0.211 1.046 2.086 ...
> str(unlist(out))
 num [1:152] 0.2105 0.13719 0.00106 0.4588 -1.41733 ...

unlist() will flatten a list of vectors into a single vector. A stricter option is to use purrr::flatten_dbl() — it will throw an error if the input isn’t a list of doubles.
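
For example, applying it to the out list from above gives the same flattened vector, but would error if any element were not a double:

> out %>% purrr::flatten_dbl() %>% str()
 num [1:152] 0.2105 0.13719 0.00106 0.4588 -1.41733 ...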

This pattern occurs in other places too:

  1. You might be generating a long string. Instead of paste()ing together each iteration with the previous, save the output in a character vector and then combine that vector into a single string with paste(output, collapse = "").

  2. You might be generating a big data frame. Instead of sequentially rbind()ing in each iteration, save the output in a list, then use dplyr::bind_rows(output) to combine the output into a single data frame.

Watch out for this pattern. Whenever you see it, switch to a more complex result object, and then combine in one step at the end.
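
For instance, a minimal sketch of the string case: collect the pieces in a pre-allocated character vector, then collapse once at the end.

> pieces <- vector("character", length(letters))
> for (i in seq_along(letters)) {
+   pieces[[i]] <- letters[[i]]
+ }
> paste(pieces, collapse = "")
[1] "abcdefghijklmnopqrstuvwxyz"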

Unknown Sequence Length

Sometimes you don’t even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can’t do that sort of iteration with a for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body:

> while (condition) {
+   # body
+ }

A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as a for loop:

> for (i in seq_along(x)) {
+   # body
+ }
> 
> # Equivalent to
> i <- 1
> while (i <= length(x)) {
+   # body
+   i <- i + 1 
+ }

Here’s how we could use a while loop to find how many tries it takes to get three heads in a row:

> flip <- function() sample(c("T", "H"), 1)
> 
> flips <- 0
> nheads <- 0
> 
> while (nheads < 3) {
+   if (flip() == "H") {
+     nheads <- nheads + 1
+   } else {
+     nheads <- 0
+   }
+   flips <- flips + 1
+ }
> flips
[1] 9

Exercises

1. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.
> files <- dir("data/", pattern = "\\.csv$", full.names = TRUE)
> files
> #> [1] "data//file1.csv" "data//file2.csv" "data//file3.csv"
> df_list <- vector("list", length(files))
> for (i in seq_along(files)) {
+   df_list[[i]] <- read_csv(files[[i]])
+ }
> #> Parsed with column specification:
> #> cols(
> #>   X1 = col_double(),
> #>   X2 = col_character()
> #> )
> #> Parsed with column specification:
> #> cols(
> #>   X1 = col_double(),
> #>   X2 = col_character()
> #> )
> #> Parsed with column specification:
> #> cols(
> #>   X1 = col_double(),
> #>   X2 = col_character()
> #> )
> print(df_list)
> #> [[1]]
> #> # A tibble: 2 x 2
> #>      X1 X2   
> #>   <dbl> <chr>
> #> 1     1 a    
> #> 2     2 b    
> #> 
> #> [[2]]
> #> # A tibble: 2 x 2
> #>      X1 X2   
> #>   <dbl> <chr>
> #> 1     3 c    
> #> 2     4 d    
> #> 
> #> [[3]]
> #> # A tibble: 2 x 2
> #>      X1 X2   
> #>   <dbl> <chr>
> #> 1     5 e    
> #> 2     6 f
> df <- bind_rows(df_list)
> print(df)
> #> # A tibble: 6 x 2
> #>      X1 X2   
> #>   <dbl> <chr>
> #> 1     1 a    
> #> 2     2 b    
> #> 3     3 c    
> #> 4     4 d    
> #> 5     5 e    
> #> 6     6 f

Alternatively

> df2_list <- vector("list", length(files))
> names(df2_list) <- files
> for (fname in files) {
+   df2_list[[fname]] <- read_csv(fname)
+ }
2. What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?
  • If x has no names, names(x) is NULL, which has length zero, so the loop body never runs.
  • If only some elements are named, the loop throws a subscript-out-of-bounds error when it reaches an element with an empty name.
  • If the names are not unique, x[[nm]] always returns the first element with that name.
> x <- c(11, 12, 13)
> print(names(x))
> #> NULL
> for (nm in names(x)) {
+   print(nm)
+   print(x[[nm]])
+ }
> length(NULL)
> #> [1] 0
> x <- c(a = 11, 12, c = 13)
> names(x)
> #> [1] "a" ""  "c"
> for (nm in names(x)) {
+   print(nm)
+   print(x[[nm]])
+ }
> #> [1] "a"
> #> [1] 11
> #> [1] ""
> #> Error in x[[nm]]: subscript out of bounds
> x <- c(a = 11, a = 12, c = 13)
> names(x)
> #> [1] "a" "a" "c"
> 
> for (nm in names(x)) {
+   print(nm)
+   print(x[[nm]])
+ }
> #> [1] "a"
> #> [1] 11
> #> [1] "a"
> #> [1] 11
> #> [1] "c"
> #> [1] 13
3. Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:
> show_mean(iris)
> #> Sepal.Length: 5.84
> #> Sepal.Width:  3.06
> #> Petal.Length: 3.76
> #> Petal.Width:  1.20
> show_mean <- function(df, digits = 2) {
+   # Get max length of all variable names in the dataset
+   maxstr <- max(str_length(names(df)))
+   for (nm in names(df)) {
+     if (is.numeric(df[[nm]])) {
+       cat(
+         str_c(str_pad(str_c(nm, ":"), maxstr + 1L, side = "right"),
+           format(mean(df[[nm]]), digits = digits, nsmall = digits),
+           sep = " "
+         ),
+         "\n"
+       )
+     }
+   }
+ }
> show_mean(iris)
Sepal.Length: 5.84 
Sepal.Width:  3.06 
Petal.Length: 3.76 
Petal.Width:  1.20 
4. What does this code do? How does it work?
> trans <- list( 
+   disp = function(x) x * 0.0163871,
+   am = function(x) {
+     factor(x, labels = c("auto", "manual"))
+   }
+ )
> for (var in names(trans)) {
+   mtcars[[var]] <- trans[[var]](mtcars[[var]])
+ }

This code mutates the disp and am columns:

  • disp is multiplied by 0.0163871
  • am is replaced by a factor variable

The code works by looping over a named list of functions. It calls the named function in the list on the column of mtcars with the same name, and replaces the values of that column.

For example, this is one of those functions:

> trans[["disp"]]

And this applies that function to the column of mtcars with the same name:

> trans[["disp"]](mtcars[["disp"]])

For Loops Versus Functionals

For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly.

To see why this is important, consider (again) this simple data frame:

> df <- tibble(
+   a = rnorm(10),
+   b = rnorm(10),
+   c = rnorm(10),
+   d = rnorm(10)
+ )

Imagine you want to compute the mean of every column. You could do that with a for loop:

> output <- vector("double", length(df))
> for (i in seq_along(df)) {
+   output[[i]] <- mean(df[[i]])
+ }
> output
[1]  0.2969105  0.1263920 -0.4363532  0.1929435

You realise that you’re going to want to compute the means of every column pretty frequently, so you extract it out into a function:

> col_mean <- function(df) {
+   output <- vector("double", length(df))
+   for (i in seq_along(df)) {
+     output[i] <- mean(df[[i]])
+   }
+   output
+ }

But then you think it’d also be helpful to be able to compute the median, and the standard deviation, so you copy and paste your col_mean() function and replace the mean() with median() and sd():

> col_median <- function(df) {
+   output <- vector("double", length(df))
+   for (i in seq_along(df)) {
+     output[i] <- median(df[[i]])
+   }
+   output
+ }
> 
> col_sd <- function(df) {
+   output <- vector("double", length(df))
+   for (i in seq_along(df)) {
+     output[i] <- sd(df[[i]])
+   }
+   output
+ }

Now it’s time to think about how to generalize it.

What would you do if you saw a set of functions like this:

> f1 <- function(x) abs(x - mean(x)) ^ 1
> f2 <- function(x) abs(x - mean(x)) ^ 2
> f3 <- function(x) abs(x - mean(x)) ^ 3

You’d notice that there’s a lot of duplication, and extract it out into an additional argument:

> f <- function(x, i) abs(x - mean(x)) ^ i

We can do exactly the same thing with col_mean(), col_median() and col_sd() by adding an argument that supplies the function to apply to each column:

> col_summary <- function(df, fun) {
+   out <- vector("double", length(df))
+   for (i in seq_along(df)) {
+     out[i] <- fun(df[[i]])
+   }
+   out
+ }
> col_summary(df, median)
[1]  0.6253409  0.2379445 -0.2843706  0.4614654
> col_summary(df, mean)
[1]  0.2969105  0.1263920 -0.4363532  0.1929435

The purrr package provides functions that eliminate the need for many common for loops. The apply family of functions in base R (apply(), lapply(), tapply(), etc) solve a similar problem, but purrr is more consistent.

The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:

  1. How can you solve the problem for a single element of the list? Once you’ve solved that problem, purrr takes care of generalizing your solution to every element in the list.

  2. If you’re solving a complex problem, how can you break it down into bite-sized pieces that allow you to advance one small step towards a solution? With purrr, you get lots of small pieces that you can compose together with the pipe.

This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.

Exercises

1. Read the documentation for apply(). In the 2d case, what two for loops does it generalise?

For a two-dimensional object, such as a matrix or data frame, apply() generalises for loops over the rows (MARGIN = 1) or the columns (MARGIN = 2). It is used like apply(X, MARGIN, FUN, ...), where X is a matrix or array, FUN is the function to apply, and ... are additional arguments passed to FUN.

When MARGIN = 1, the function is applied to each row. For example, the following calculates the row means of a matrix.

> X <- matrix(rnorm(15), nrow = 5)
> X
           [,1]       [,2]        [,3]
[1,]  0.6944777  0.5483574 -0.07868662
[2,]  1.3358468 -1.4439711  0.44619633
[3,] -0.6631026 -0.3226765  0.45548651
[4,]  1.2309291  0.9318699 -1.02463961
[5,]  0.2355328 -1.8492699  1.48251032
> apply(X, 1, mean)
[1]  0.38804949  0.11269068 -0.17676421  0.37938648 -0.04374227

That is equivalent to this for-loop.

> X_row_means <- vector("numeric", length = nrow(X))
> for (i in seq_len(nrow(X))) {
+   X_row_means[[i]] <- mean(X[i, ])
+ }
> X_row_means
[1]  0.38804949  0.11269068 -0.17676421  0.37938648 -0.04374227
> X <- matrix(rnorm(15), nrow = 5)
> X
           [,1]       [,2]       [,3]
[1,] -0.5445332  0.6140945 -0.3722522
[2,]  0.8828511  0.8138215  1.2358328
[3,]  0.2435959 -0.9422863 -1.4996260
[4,] -0.7618232 -0.3257123  0.4565981
[5,] -1.5681726  0.1806481 -0.6348601

When MARGIN = 2, apply() is equivalent to a for-loop looping over columns.

> apply(X, 2, mean)
[1] -0.34961640  0.06811311 -0.16286146
> X_col_means <- vector("numeric", length = ncol(X))
> for (i in seq_len(ncol(X))) {
+   X_col_means[[i]] <- mean(X[, i])
+ }
> X_col_means
[1] -0.34961640  0.06811311 -0.16286146
2. Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector with a TRUE corresponding to each numeric column.

The original col_summary() function is:

> col_summary <- function(df, fun) {
+   out <- vector("double", length(df))
+   for (i in seq_along(df)) {
+     out[i] <- fun(df[[i]])
+   }
+   out
+ }

The adapted version adds extra logic to only apply the function to numeric columns.

> col_summary2 <- function(df, fun) {
+   # create an empty vector which will store whether each
+   # column is numeric
+   numeric_cols <- vector("logical", length(df))
+   # test whether each column is numeric
+   for (i in seq_along(df)) {
+     numeric_cols[[i]] <- is.numeric(df[[i]])
+   }
+   # find the indexes of the numeric columns
+   idxs <- which(numeric_cols)
+   # find the number of numeric columns
+   n <- sum(numeric_cols)
+   # create a vector to hold the results
+   out <- vector("double", n)
+   # apply the function only to numeric vectors
+   for (i in seq_along(idxs)) {
+     out[[i]] <- fun(df[[idxs[[i]]]])
+   }
+   # name the vector
+   names(out) <- names(df)[idxs]
+   out
+ }

Let’s test that col_summary2() works by creating a small data frame with some numeric and non-numeric columns.

> df <- tibble(
+   X1 = c(1, 2, 3),
+   X2 = c("A", "B", "C"),
+   X3 = c(0, -1, 5),
+   X4 = c(TRUE, FALSE, TRUE)
+ )
> col_summary2(df, mean)
      X1       X3 
2.000000 1.333333 

The Map Functions

The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output:

  • map() makes a list.
  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.

Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function.

The chief benefit of using functions like map() is not speed, but clarity: they make your code easier to write and to read.

We can use these functions to perform the same computations as the last for loop. Those summary functions returned doubles, so we need to use map_dbl():

> df <- tibble(
+   a = rnorm(10),
+   b = rnorm(10),
+   c = rnorm(10),
+   d = rnorm(10)
+ )
> 
> map_dbl(df, mean)
          a           b           c           d 
 0.08452007 -0.31266076 -0.12060485  0.17055680 
> map_dbl(df, median)
         a          b          c          d 
 0.2558613 -0.2808854 -0.1146915  0.1351632 
> map_dbl(df, sd)
        a         b         c         d 
0.9527619 1.0748197 0.9835811 0.9954018 

Compared to using a for loop, focus is on the operation being performed (i.e. mean(), median(), sd()), not the bookkeeping required to loop over every element and store the output. This is even more apparent if we use the pipe:

> df %>% map_dbl(mean)
          a           b           c           d 
 0.08452007 -0.31266076 -0.12060485  0.17055680 
> df %>% map_dbl(median)
         a          b          c          d 
 0.2558613 -0.2808854 -0.1146915  0.1351632 
> df %>% map_dbl(sd)
        a         b         c         d 
0.9527619 1.0748197 0.9835811 0.9954018 

There are a few differences between map_*() and col_summary():

  • All purrr functions are implemented in C. This makes them a little faster at the expense of readability.

  • The second argument, .f, the function to apply, can be a formula, a character vector, or an integer vector.

  • map_*() uses ... (dot-dot-dot) to pass along additional arguments to .f each time it’s called:

> map_dbl(df, mean, trim = 0.5)
         a          b          c          d 
 0.2558613 -0.2808854 -0.1146915  0.1351632 

The map functions also preserve names:

> z <- list(x = 1:3, y = 4:5)
> map_int(z, length)
x y 
3 2 

Shortcuts

There are a few shortcuts that you can use with .f in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits up the mtcars dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:

> models <- mtcars %>% 
+   split(.$cyl) %>% 
+   map(function(df) lm(mpg ~ wt, data = df))

The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.

> models <- mtcars %>% 
+   split(.$cyl) %>% 
+   map(~lm(mpg ~ wt, data = .))

Here we’ve used . as a pronoun: it refers to the current list element (in the same way that i referred to the current index in the for loop).

When you’re looking at many models, you might want to extract a summary statistic like the \(R^2\). To do that we need to first run summary() and then extract the component called r.squared. We could do that using the shorthand for anonymous functions:

> models %>% 
+   map(summary) %>% 
+   map_dbl(~.$r.squared)
       16        24        32 
0.5086326 0.4645102 0.4229655 

But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.

> models %>% 
+   map(summary) %>% 
+   map_dbl("r.squared")
       16        24        32 
0.5086326 0.4645102 0.4229655 

You can also use an integer to select elements by position:

> x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
> x %>% map_dbl(2)
[1] 2 5 8

Base R

The apply family of functions in base R have some similarities with the purrr functions:

  • lapply() is basically identical to map(), except that map() is consistent with all the other functions in purrr, and you can use the shortcuts for .f.

  • Base sapply() is a wrapper around lapply() that automatically simplifies the output. This is useful for interactive work but is problematic in a function because you never know what sort of output you’ll get:

> x1 <- list(
+   c(0.27, 0.37, 0.57, 0.91, 0.20),
+   c(0.90, 0.94, 0.66, 0.63, 0.06), 
+   c(0.21, 0.18, 0.69, 0.38, 0.77)
+ )
> x2 <- list(
+   c(0.50, 0.72, 0.99, 0.38, 0.78), 
+   c(0.93, 0.21, 0.65, 0.13, 0.27), 
+   c(0.39, 0.01, 0.38, 0.87, 0.34)
+ )
> 
> threshold <- function(x, cutoff = 0.8) x[x > cutoff]
> x1 %>% sapply(threshold) %>% str()
List of 3
 $ : num 0.91
 $ : num [1:2] 0.9 0.94
 $ : num(0) 
> x2 %>% sapply(threshold) %>% str()
 num [1:3] 0.99 0.93 0.87
  • vapply() is a safe alternative to sapply() because you supply an additional argument that defines the type. The only problem with vapply() is that it’s a lot of typing: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric). One advantage of vapply() over purrr’s map functions is that it can also produce matrices — the map functions only ever produce vectors.
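
For instance, using iris as an example data frame, both calls mentioned above should return the same named logical vector:

> # both return TRUE for the four numeric columns and FALSE for Species
> vapply(iris, is.numeric, logical(1))
> map_lgl(iris, is.numeric)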

Exercises

1. Write code that uses one of the map functions to:
\(\space\space a.\) Compute the mean of every column in mtcars.
\(\space\space b.\) Determine the type of each column in nycflights13::flights.
\(\space\space c.\) Compute the number of unique values in each column of iris.
\(\space\space d.\) Generate 10 random normals from distributions with means of -10, 0, 10, and 100.

\(A\) mean of every column in mtcars.

> map_dbl(mtcars, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625  24.750000 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

\(B\) type of each column in nycflights13::flights.

> map_chr(nycflights13::flights, typeof)
          year          month            day       dep_time sched_dep_time 
     "integer"      "integer"      "integer"      "integer"      "integer" 
     dep_delay       arr_time sched_arr_time      arr_delay        carrier 
      "double"      "integer"      "integer"       "double"    "character" 
        flight        tailnum         origin           dest       air_time 
     "integer"    "character"    "character"    "character"       "double" 
      distance           hour         minute      time_hour 
      "double"       "double"       "double"       "double" 

\(C\) number of unique values in each column of iris.

> map_int(iris, n_distinct)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          35           23           43           22            3 
> map_int(iris, ~length(unique(.)))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          35           23           43           22            3 

\(D\) 10 random normals from distributions with means of -10, 0, 10, and 100

> map(c(-10, 0, 10, 100), ~rnorm(n = 10, mean = .))
[[1]]
 [1] -10.765580 -10.354792  -8.630696  -9.222892  -9.983079 -10.862739
 [7]  -8.446143 -10.687488 -11.758374  -9.282130

[[2]]
 [1] -0.1717519  0.3488467 -0.8176360  0.2375417 -0.6415341  0.5376673
 [7] -1.2825508 -0.9099092  1.0666013  0.2527361

[[3]]
 [1]  9.023351  9.738028  9.988458 10.318579  9.676457 10.322253 10.330490
 [8] 11.816240 11.948324 11.313721

[[4]]
 [1]  99.46283  99.29963 100.60308  99.22069 101.46597 100.22511  98.90328
 [8]  98.23925 100.98210  99.17732
2. How can you create a single vector that for each column in a data frame indicates whether or not it’s a factor?
> map_lgl(diamonds, is.factor)
  carat     cut   color clarity   depth   table   price       x       y       z 
  FALSE    TRUE    TRUE    TRUE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE 
3. What happens when you use the map functions on vectors that aren’t lists? What does map(1:5, runif) do? Why?

The map functions work with any type of vector, not just lists; map() itself always returns a list.

> map(1:5, runif)
[[1]]
[1] 0.135653

[[2]]
[1] 0.7360124 0.3761619

[[3]]
[1] 0.9430826 0.3133162 0.2449193

[[4]]
[1] 0.2607224 0.7893160 0.2471994 0.4369824

[[5]]
[1] 0.61941998 0.45527702 0.37159786 0.04816784 0.88527555

The map() function loops through the numbers 1 to 5. For each value, it calls runif() with that number as its first argument, which is the number of samples to draw. The result is a list of length five containing numeric vectors of sizes one through five, each with random samples from a uniform distribution.

4. What does map(-2:2, rnorm, n = 5) do? Why? What does map_dbl(-2:2, rnorm, n = 5) do? Why?
> map(-2:2, rnorm, n = 5)
[[1]]
[1] -1.065540 -1.428568 -1.889852 -2.568519 -2.483179

[[2]]
[1]  0.5256277 -1.5588777 -1.4982390 -0.5576297 -1.0534056

[[3]]
[1]  0.91434208  1.11712743 -0.05561116 -0.63470759 -0.47005863

[[4]]
[1] 1.7068155 1.5203229 1.9596955 0.3492204 1.3292990

[[5]]
[1] 1.040654 1.068034 1.237006 1.469892 2.135265

This expression takes samples of size five from five normal distributions with means of -2, -1, 0, 1, and 2, and the same standard deviation (1). It returns a list in which each element is a numeric vector of length 5.

However, if instead, we use map_dbl(), the expression raises an error.

> map_dbl(-2:2, rnorm, n = 5)
Error: Result 1 must be a single double, not a double vector of length 5

This is because the map_dbl() function requires the function it applies to each element to return a numeric vector of length one.

To return a numeric vector, use flatten_dbl() to coerce the list returned by map() to a numeric vector.

> map(-2:2, rnorm, n = 5) %>%
+   flatten_dbl()
 [1] -4.3235094 -2.8643875 -1.3219671 -1.5742265 -1.9908245 -0.3568805
 [7] -0.4634469  0.6073714 -2.3118770 -0.2788020  0.7030964 -2.7938192
[13]  1.1452780  1.1432346  0.8491727  0.4047326 -1.3385323  3.3168344
[19]  0.4558347  1.3934177  2.8659056  1.3200135  1.9807593  1.8088077
[25]  1.6873271
5. Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.
> x <- split(mtcars, mtcars$cyl)
> map(x, ~ lm(mpg ~ wt, data = .))
$`16`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


$`24`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


$`32`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     23.868       -2.192  

Dealing with Failure

When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail. When this happens, you’ll get an error message, and no output.

safely() is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:

  1. result is the original result. If there was an error, this will be NULL.

  2. error is an error object. If the operation was successful, this will be NULL.

safely() is similar to the try() function in base R, but try() is more difficult to work with because it sometimes returns the original result and sometimes returns an error object.
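
A quick sketch of that inconsistency (silent = TRUE just suppresses the printed error message):

> ok <- try(log(10), silent = TRUE)
> bad <- try(log("a"), silent = TRUE)
> class(ok)
[1] "numeric"
> class(bad)
[1] "try-error"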

Let’s illustrate this with a simple example: log():

> safe_log <- safely(log)
> str(safe_log(10))
List of 2
 $ result: num 2.3
 $ error : NULL
> str(safe_log("a"))
List of 2
 $ result: NULL
 $ error :List of 2
  ..$ message: chr "non-numeric argument to mathematical function"
  ..$ call   : language .Primitive("log")(x, base)
  ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

When the function succeeds, the result element contains the result and the error element is NULL. When the function fails, the result element is NULL and the error element contains an error object.

safely() is designed to work with map:

> x <- list(1, 10, "a")
> y <- x %>% map(safely(log))
> str(y)
List of 3
 $ :List of 2
  ..$ result: num 0
  ..$ error : NULL
 $ :List of 2
  ..$ result: num 2.3
  ..$ error : NULL
 $ :List of 2
  ..$ result: NULL
  ..$ error :List of 2
  .. ..$ message: chr "non-numeric argument to mathematical function"
  .. ..$ call   : language .Primitive("log")(x, base)
  .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

This would be easier to work with if we had two lists: one of all the errors and one of all the output. That’s easy to get with purrr::transpose():

> y <- y %>% transpose()
> str(y)
List of 2
 $ result:List of 3
  ..$ : num 0
  ..$ : num 2.3
  ..$ : NULL
 $ error :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ :List of 2
  .. ..$ message: chr "non-numeric argument to mathematical function"
  .. ..$ call   : language .Primitive("log")(x, base)
  .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

Typically you’ll either look at the values of x where y is an error, or work with the values of y that are ok:

> is_ok <- y$error %>% map_lgl(is_null)
> x[!is_ok]
[[1]]
[1] "a"
> y$result[is_ok] %>% flatten_dbl()
[1] 0.000000 2.302585

Purrr provides two other useful adverbs:

  • Like safely(), possibly() always succeeds. It’s simpler than safely(), because you give it a default value to return when there is an error.
> x <- list(1, 10, "a")
> x %>% map_dbl(possibly(log, NA_real_))
[1] 0.000000 2.302585       NA

  • quietly() performs a similar role to safely(), but instead of capturing errors, it captures printed output, messages, and warnings:

> x <- list(1, -1)
> x %>% map(quietly(log)) %>% str()
List of 2
 $ :List of 4
  ..$ result  : num 0
  ..$ output  : chr ""
  ..$ warnings: chr(0) 
  ..$ messages: chr(0) 
 $ :List of 4
  ..$ result  : num NaN
  ..$ output  : chr ""
  ..$ warnings: chr "NaNs produced"
  ..$ messages: chr(0) 

Mapping Over Multiple Arguments

Often you have multiple related inputs that you need to iterate over in parallel. That’s the job of the map2() and pmap() functions. For example, imagine you want to simulate some random normals with different means.

> mu <- list(5, 10, -3)
> mu %>% 
+   map(rnorm, n = 5) %>% 
+   str()
List of 3
 $ : num [1:5] 5.1 5.02 6.59 3.8 3.81
 $ : num [1:5] 10.88 10.87 8.13 11.11 12
 $ : num [1:5] -1.7 -3.83 -2.77 -3.57 -1.6

What if you also want to vary the standard deviation? One way to do that would be to iterate over the indices and index into vectors of means and sds:

> sigma <- list(1, 5, 10)
> seq_along(mu) %>% 
+   map(~rnorm(5, mu[[.]], sigma[[.]])) %>% 
+   str()
List of 3
 $ : num [1:5] 5.82 5.71 5.92 5.47 4.45
 $ : num [1:5] 2.56 6.63 9.02 14.79 13.53
 $ : num [1:5] 1.82 8.18 -15.91 -15.49 -7.08

But that obfuscates the intent of the code. Instead we could use map2() which iterates over two vectors in parallel:

> map2(mu, sigma, rnorm, n = 5) %>% str()
List of 3
 $ : num [1:5] 7.64 4.3 5.76 4.39 4.22
 $ : num [1:5] 1.3 11.19 8.13 11.87 7.21
 $ : num [1:5] 4.052 15.024 -0.904 -11.016 -11.34

map2() generates this series of function calls:
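
> # Roughly, with the mu and sigma lists defined above:
> #   rnorm(mu[[1]], sigma[[1]], n = 5)
> #   rnorm(mu[[2]], sigma[[2]], n = 5)
> #   rnorm(mu[[3]], sigma[[3]], n = 5)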

Note that the arguments that vary for each call come before the function; arguments that are the same for every call come after.

Like map(), map2() is just a wrapper around a for loop:

> map2 <- function(x, y, f, ...) {
+   out <- vector("list", length(x))
+   for (i in seq_along(x)) {
+     out[[i]] <- f(x[[i]], y[[i]], ...)
+   }
+   out
+ }

You could also imagine map3(), map4(), map5(), map6() etc, but that would get tedious quickly. Instead, purrr provides pmap() which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:

> n <- list(1, 3, 5)
> args1 <- list(n, mu, sigma)
> args1 %>%
+   pmap(rnorm) %>% 
+   str()
List of 3
 $ : num 5.44
 $ : num [1:3] 6.5 3.27 11.96
 $ : num [1:5] 9.856 -0.434 -5.895 -14.533 -10.192

That looks like:
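
> # Roughly, with unnamed arguments matched by position:
> #   rnorm(n[[1]], mu[[1]], sigma[[1]])
> #   rnorm(n[[2]], mu[[2]], sigma[[2]])
> #   rnorm(n[[3]], mu[[3]], sigma[[3]])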

If you don’t name the list’s elements, pmap() will use positional matching when calling the function. That’s a little fragile, and makes the code harder to read, so it’s better to name the arguments:

> args2 <- list(mean = mu, sd = sigma, n = n)
> args2 %>% 
+   pmap(rnorm) %>% 
+   str()

That generates longer, but safer, calls:
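
> # Roughly:
> #   rnorm(mean = 5,  sd = 1,  n = 1)
> #   rnorm(mean = 10, sd = 5,  n = 3)
> #   rnorm(mean = -3, sd = 10, n = 5)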

Since the arguments are all the same length, it makes sense to store them in a data frame:

> params <- tribble(
+   ~mean, ~sd, ~n,
+     5,     1,  1,
+    10,     5,  3,
+    -3,    10,  5
+ )
> params %>% 
+   pmap(rnorm)
[[1]]
[1] 4.920801

[[2]]
[1] 14.479088  5.613026  1.594877

[[3]]
[1]   4.354443  10.125541 -11.431938 -23.574923   4.859846

Invoking Different Functions

There’s one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:

> f <- c("runif", "rnorm", "rpois")
> param <- list(
+   list(min = -1, max = 1), 
+   list(sd = 5), 
+   list(lambda = 10)
+ )

To handle this case, you can use invoke_map():

> invoke_map(f, param, n = 5) %>% str()
List of 3
 $ : num [1:5] -0.3153 0.3146 -0.8008 0.6843 0.0278
 $ : num [1:5] 6.11 5.3 -1.24 1.07 -1.55
 $ : int [1:5] 8 12 7 13 11

The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.

And again, you can use tribble() to make creating these matching pairs a little easier:

> sim <- tribble(
+   ~f,      ~params,
+   "runif", list(min = -1, max = 1),
+   "rnorm", list(sd = 5),
+   "rpois", list(lambda = 10)
+ )
> sim %>% 
+   mutate(sim = invoke_map(f, params, n = 10))
# A tibble: 3 x 3
  f     params           sim       
  <chr> <list>           <list>    
1 runif <named list [2]> <dbl [10]>
2 rnorm <named list [1]> <dbl [10]>
3 rpois <named list [1]> <int [10]>

Walk

Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here’s a very simple example:

> x <- list(1, "a", 3)
> 
> x %>% 
+   walk(print)
[1] 1
[1] "a"
[1] 3

walk() is generally not that useful compared to walk2() or pwalk(). For example, if you had a list of plots and a vector of file names, you could use pwalk() to save each file to the corresponding location on disk:

> library(ggplot2)
> plots <- mtcars %>% 
+   split(.$cyl) %>% 
+   map(~ggplot(., aes(mpg, wt)) + geom_point())
> paths <- stringr::str_c(names(plots), ".pdf")
> 
> pwalk(list(paths, plots), ggsave, path = tempdir())

walk(), walk2() and pwalk() all invisibly return .x, the first argument. This makes them suitable for use in the middle of pipelines.
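
For example, a small sketch using the x list from above: walk() prints each element, then hands the same list on to the next step of the pipeline.

> x %>% 
+   walk(print) %>% 
+   map_int(length)
[1] 1
[1] "a"
[1] 3
[1] 1 1 1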

Other Patterns of for Loops

Purrr provides a number of other functions that abstract over other types of for loops. They’re used less frequently than the map functions, but they’re useful to know about.

Predicate Functions

A number of functions work with predicate functions that return either a single TRUE or FALSE.

keep() and discard() keep elements of the input where the predicate is TRUE or FALSE respectively:

> iris %>% 
+   keep(is.factor) %>% 
+   str()
'data.frame':   150 obs. of  1 variable:
 $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> iris %>% 
+   discard(is.factor) %>% 
+   str()
'data.frame':   150 obs. of  4 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

some() and every() determine if the predicate is true for any or for all of the elements.

> x <- list(1:5, letters, list(10))
> 
> x %>% 
+   some(is_character)
[1] TRUE
> x %>% 
+   every(is_vector)
[1] TRUE

detect() finds the first element where the predicate is true; detect_index() returns its position.

> x <- sample(10)
> x
 [1]  7  2 10  6  8  4  3  5  9  1
> x %>% 
+   detect(~ . > 5)
[1] 7
> x %>% 
+   detect_index(~ . > 5)
[1] 1

head_while() and tail_while() take elements from the start or end of a vector while a predicate is true:

> x %>% 
+   head_while(~ . > 5)
[1] 7
> x %>% 
+   tail_while(~ . > 5)
integer(0)

Reduce and Accumulate

Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This is useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:

> dfs <- list(
+   age = tibble(name = "John", age = 30),
+   sex = tibble(name = c("John", "Mary"), sex = c("M", "F")),
+   trt = tibble(name = "Mary", treatment = "A")
+ )
> 
> dfs %>% reduce(full_join)
# A tibble: 2 x 4
  name    age sex   treatment
  <chr> <dbl> <chr> <chr>    
1 John     30 M     <NA>     
2 Mary     NA F     A        

Or maybe you have a list of vectors, and want to find the intersection:

> vs <- list(
+   c(1, 3, 5, 6, 10),
+   c(1, 2, 3, 7, 8, 10),
+   c(1, 2, 3, 4, 8, 9, 10)
+ )
> 
> vs %>% reduce(intersect)
[1]  1  3 10

The reduce function takes a “binary” function (i.e. a function with two primary inputs), and applies it repeatedly to a list until there is only a single element left.
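
In other words, for the vs example above, reduce(vs, intersect) unrolls into a nested call:

> intersect(intersect(vs[[1]], vs[[2]]), vs[[3]])
[1]  1  3 10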

Accumulate is similar but it keeps all the interim results. You could use it to implement a cumulative sum:

> x <- sample(10)
> x
 [1]  3  5  2  8  7  1 10  4  6  9
> x %>% accumulate(`+`)
 [1]  3  8 10 18 25 26 36 40 46 55

Exercises

1. Implement your own version of every() using a for loop.
> # Use ... to pass arguments to the function
> every2 <- function(.x, .p, ...) {
+   for (i in .x) {
+     if (!.p(i, ...)) {
+       # If any is FALSE we know not all of them were TRUE
+       return(FALSE)
+     }
+   }
+   # if nothing was FALSE, then it is TRUE
+   TRUE
+ }
> 
> every2(1:3, function(x) {
+   x > 1
+ })
[1] FALSE
> every2(1:3, function(x) {
+   x > 0
+ })
[1] TRUE
2. Create an enhanced col_summary() that applies a summary function to every numeric column in a data frame.
> col_sum2 <- function(df, f, ...) {
+   map(keep(df, is.numeric), f, ...)
+ }
3. A possible base R equivalent of col_summary() is:
> col_sum3 <- function(df, f) {
+   is_num <- sapply(df, is.numeric)
+   df_num <- df[, is_num]
+ 
+   sapply(df_num, f)
+ }
> col_sum2(iris, mean)
$Sepal.Length
[1] 5.843333

$Sepal.Width
[1] 3.057333

$Petal.Length
[1] 3.758

$Petal.Width
[1] 1.199333
But it has a number of bugs, as illustrated by the following inputs:
> df <- tibble(
+   x = 1:3, 
+   y = 3:1,
+   z = c("a", "b", "c")
+ )
> # OK
> col_sum3(df, mean)
> # Has problems: don't always return numeric vector
> col_sum3(df[1:2], mean)
> col_sum3(df[1], mean)
> col_sum3(df[0], mean)
What causes the bugs?

The cause of these bugs is the behavior of sapply(). The sapply() function does not guarantee the type of vector it returns, and will return different types of vectors depending on its inputs. If no columns are selected, instead of returning an empty logical vector, it returns an empty list. This causes an error, since we can’t use a list to subset with [.