do.call, and sys.call vs. match.call

There’s a significant performance difference between using sys.call and match.call inside a function that’s invoked by do.call.

library(microbenchmark)
library(ggplot2)  # provides the diamonds data set

syscall <- function(x) {
  sys.call()
}
matchcall <- function(x) {
  match.call()
}

microbenchmark(
  syscall(diamonds),
  do.call(syscall, list(diamonds)),
  do.call(syscall, list(quote(diamonds))),

  matchcall(diamonds),
  do.call(matchcall, list(diamonds)),
  do.call(matchcall, list(quote(diamonds))),
  unit = "us"
)
## Unit: microseconds
##                                       expr     min        lq   median        uq      max neval
##                          syscall(diamonds)   0.545    0.8885    1.163     2.218    12.78   100
##           do.call(syscall, list(diamonds))   1.867    2.4635    3.432     8.083    28.12   100
##    do.call(syscall, list(quote(diamonds)))   1.908    2.4410    3.300     8.434    21.43   100
##                        matchcall(diamonds)   1.804    2.4510    3.069     7.723    21.86   100
##         do.call(matchcall, list(diamonds)) 619.160 1607.1600 1760.490 18493.735 89446.12   100
##  do.call(matchcall, list(quote(diamonds)))   3.416    4.8420    5.620    10.846    36.95   100

Notice that in the results above, there’s essentially no difference between the second and third conditions, where sys.call is used via do.call with diamonds unquoted and then quoted. With match.call, however, the unquoted do.call is several hundred times slower than the quoted one.

One important difference between sys.call and match.call is that the latter adds in argument names if they weren’t explicit:

df <- data.frame(x = 11, y = 22)
syscall(df)
## syscall(df)
matchcall(df)
## matchcall(x = df)

You can get the same output if you use explicit arguments with sys.call:

syscall(x = df)
## syscall(x = df)
matchcall(df)
## matchcall(x = df)

# Another way of showing it
compare_call <- function(x) identical(sys.call(), match.call())
compare_call(df)
## [1] FALSE
compare_call(x = df)  # This comes out TRUE at the console, FALSE in knitr. Bug?
## [1] FALSE
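
Adding names isn’t the only normalization match.call performs. As far as I know, it also expands partially matched argument names and puts the arguments in the order of the formals. Here is a small sketch of that, using a made-up function `area` that isn’t part of the examples above:

```r
# `area` is a hypothetical function used only for illustration
area <- function(width, height) match.call()

positional <- area(2, 3)          # no names supplied
abbreviated <- area(h = 3, w = 2) # partial names, reversed order

# Both calls deparse to the same fully named, formal-ordered call
deparse(positional)
deparse(abbreviated)
```

Comparing the deparsed text sidesteps the identical() oddity noted in the comment above.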

If we use explicit arguments, does this fix the slowness of match.call?

microbenchmark(
  syscall(diamonds),
  do.call(syscall, list(diamonds)),
  do.call(syscall, list(x = diamonds)),
  matchcall(diamonds),
  do.call(matchcall, list(diamonds)),
  do.call(matchcall, list(x = diamonds)),
  unit = "us"
)
## Unit: microseconds
##                                    expr     min      lq   median      uq      max neval
##                       syscall(diamonds)   0.560   1.082    1.588    5.27    11.48   100
##        do.call(syscall, list(diamonds))   2.228   3.796    8.930   14.41    23.95   100
##    do.call(syscall, list(x = diamonds))   2.542   4.585    9.666   15.48    36.35   100
##                     matchcall(diamonds)   2.299   4.749    6.895   10.43    24.24   100
##      do.call(matchcall, list(diamonds)) 508.191 813.774 1556.076 1814.69 38024.79   100
##  do.call(matchcall, list(x = diamonds)) 496.136 659.027 1378.517 1711.32 45996.42   100

Nope.

Compared to using sys.call directly, there’s a small penalty to using sys.call and do.call together. For match.call, the penalty for using do.call is much greater.

When do.call is invoked, R doesn’t actually build up the huge call object (which constructs the data frame explicitly) and then evaluate it; it probably does something more efficient. It’s confusing because this is what R prints out:

res <- do.call(syscall, list(df))
res
## (function (x) 
## {
##     sys.call()
## })(list(x = 11, y = 22))

But notice that this isn’t actually the same as what went in: df is a data frame, not a list. The printing loses some of the information that’s preserved in the call. If we look at the returned call object directly, we can see that the object is still a data frame:

str(res[[2]])
## 'data.frame':    1 obs. of  2 variables:
##  $ x: num 11
##  $ y: num 22

It looks like R is printing out something different from what’s actually there.

Setting aside the data frame vs. list issue, if we take the exact printed output of res and turn it into a quoted expression, it’s clearly not the same as res:

expr <- quote(
  (function (x) 
  {
      sys.call()
  })(list(x = 11, y = 22))
)

str(expr[[2]])
##  language list(x = 11, y = 22)

The second element of expr isn’t a list; it’s a quoted expression which generates a list when evaluated. In contrast, the second element of res is a data frame.

The lesson from this is that we can’t trust R to print a call object in a way that honestly represents what’s going on. R makes it look like it’s building up a language object which creates the data object, but that’s only because the printed formatting of a call object obscures what’s really there.
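
Since the printed form can mislead, one option is to look at the class of each argument slot instead. The helper below, `arg_classes`, is a made-up name for this sketch and not something from base R:

```r
# Hypothetical helper: report the class of each argument slot in a call,
# skipping the first element (the function being called)
arg_classes <- function(call) {
  vapply(as.list(call)[-1], function(arg) class(arg)[1], character(1))
}

syscall <- function(x) sys.call()
df <- data.frame(x = 11, y = 22)

# Call captured via do.call: the slot holds the data frame itself
res <- do.call(syscall, list(df))
# Call written out by hand: the slot holds a call to list()
expr <- quote(syscall(list(x = 11, y = 22)))

arg_classes(res)
arg_classes(expr)
```

The two calls print almost identically, but the first reports a data frame in its argument slot and the second reports a call.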

Memory used by objects from sys.call and match.call

Let’s take a look at the sizes of the objects created by sys.call and match.call. First, some utility functions for calculating size (you can ignore the contents).

library(pryr)
# Utility functions for calculating sizes
obj_size <- function(expr, .env = parent.frame()) {
  size_n <- function(n = 1) {
    objs <- lapply(1:n, function(x) eval(expr, .env))
    as.numeric(do.call(object_size, objs))
  }

  data.frame(one = size_n(1), incremental = size_n(2) - size_n(1))
}

obj_sizes <- function(..., .env = parent.frame()) {
  exprs <- as.list(match.call(expand.dots = FALSE)$...)
  names(exprs) <- lapply(1:length(exprs),
    FUN = function(n) {
      name <- names(exprs)[n]
      if (is.null(name) || name == "") paste(deparse(exprs[[n]]), collapse = " ")
      else name
    })

  sizes <- mapply(obj_size, exprs, MoreArgs = list(.env = .env), SIMPLIFY = FALSE)
  do.call(rbind, sizes)
}

How much memory does the returned call object use for one call? How much memory does each additional call use?

obj_sizes(
  do.call(syscall, list(diamonds)),
  do.call(matchcall, list(x = diamonds))
)
##                                            one incremental
## do.call(syscall, list(diamonds))       3469792         112
## do.call(matchcall, list(x = diamonds)) 3469792     3454416

It looks like match.call creates a copy of the data each time it’s called, while sys.call does not.

If we take the size of the diamonds data along with the call object that sys.call returns, it’s obvious that most of the memory is shared:

# Size of the returned call object
object_size(do.call(syscall, list(diamonds)))
## 3.47 MB
# Size of diamonds data and the returned call object
object_size(diamonds, do.call(syscall, list(diamonds)))
## 3.47 MB
# Subtract off size of diamonds data
object_size(diamonds, do.call(syscall, list(diamonds))) - object_size(diamonds)
## 14 kB

The same is not true for match.call, which again indicates that it makes copies of the data:

# Size of the returned call object
object_size(do.call(matchcall, list(diamonds)))
## 3.47 MB
# Size of diamonds data and the returned call object
object_size(diamonds, do.call(matchcall, list(diamonds)))
## 6.92 MB
# Subtract off size of diamonds data
object_size(diamonds, do.call(matchcall, list(diamonds))) - object_size(diamonds)
## 3.47 MB

This is more evidence that do.call doesn’t actually build a huge, verbose call object which re-creates the diamonds data set. The real problem is that match.call creates a copy of the data when the call has directly-bound argument values (instead of quoted expressions), which is how do.call works.
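
If the copy comes from values being bound directly into the call, one possible workaround is to hand do.call symbols and keep the values in an environment. The sketch below assumes that approach. `do_call_by_symbol` is a hypothetical helper, not something from base R, and whether it actually avoids the copy on a given R version is something to measure rather than assume.

```r
# Hypothetical helper: call `fun` with the values in `args`, but put
# symbols (not the values themselves) into the argument slots of the call.
# Assumes every element of `args` is named.
do_call_by_symbol <- function(fun, args) {
  stopifnot(is.list(args), !is.null(names(args)), all(nzchar(names(args))))
  # Bind each value to its name in a fresh environment
  env <- list2env(args, parent = parent.frame())
  # Build arguments of the form name = name, as symbols
  symbol_args <- lapply(names(args), as.name)
  names(symbol_args) <- names(args)
  # Evaluate the call where the symbols resolve to the values
  do.call(fun, symbol_args, envir = env)
}

matchcall <- function(x) match.call()
nrow_of <- function(x) nrow(x)
df <- data.frame(x = 11, y = 22)

captured <- do_call_by_symbol(matchcall, list(x = df))
is.symbol(captured$x)   # the call holds a symbol, not the data frame

rows <- do_call_by_symbol(nrow_of, list(x = df))  # the function still sees the data
```

This mirrors the do.call(matchcall, list(quote(diamonds))) case from the first benchmark, but works for values that don’t already have a name in the calling environment.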

Manipulating call objects

Below is a demonstration of how to manipulate the argument portion of a call object, and what it means for the arguments to be quoted expressions vs. directly bound to values.

# A simple quoted expression, and how to evaluate it
expr <- quote(rowSums(df))
expr
## rowSums(df)
eval(expr)
## [1] 33

# The argument is a language object, AKA quoted expression
str(expr[[2]])
##  symbol df
# It's the same as quote(df):
str(quote(df))
##  symbol df

# Let's replace the language object with the actual value of df
expr[[2]] <- df
# The second item is now a value, not a quoted expression
str(expr[[2]])
## 'data.frame':    1 obs. of  2 variables:
##  $ x: num 11
##  $ y: num 22

# Evaluating the expression gives the same result as before
eval(expr)
## [1] 33

# But printing it out looks different (and is inaccurate)
expr
## rowSums(list(x = 11, y = 22))

# If you try to actually run this, you'll get an error:
rowSums(list(x = 11, y = 22))
## Error: 'x' must be an array of at least two dimensions

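When a function is called the normal way, the call never holds an evaluated object such as a data frame in its argument slots. Instead, each argument is an unevaluated expression (a symbol, a call, or a literal constant), which R evaluates lazily when the function first uses it.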
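
To make the contrast concrete, here is a small sketch that pulls the first argument slot out of the call under different calling styles. `first_arg` is a made-up helper for this illustration:

```r
# Hypothetical helper: return whatever sits in the first argument slot
# of the call that invoked this function, without evaluating x
first_arg <- function(x) sys.call()[[2]]

df <- data.frame(x = 11, y = 22)

from_symbol <- first_arg(df)                  # a symbol
from_call <- first_arg(df[1, ])               # an unevaluated call to `[`
from_do_call <- do.call(first_arg, list(df))  # the data frame itself

class(from_symbol)
class(from_call)
class(from_do_call)
```

Only the do.call version ends up with the evaluated object inside the call, which is the situation where match.call was slow in the benchmarks.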
