This document is motivated by some experiments creating a replacement for reference classes.

How expensive is it to define functions inline inside other functions, in terms of time and memory? How expensive is it to copy a function and assign the copy a new environment?

What is the cost of calling a function?

# Some setup stuff
library(microbenchmark)
library(pryr)

# Utility functions for calculating sizes
obj_size <- function(expr, .env = parent.frame()) {
  # `expr` should be an unevaluated expression (as captured by obj_sizes
  # below); it is re-evaluated once per instance.
  size_n <- function(n = 1) {
    objs <- lapply(seq_len(n), function(x) eval(expr, .env))
    # object_size counts memory shared between objects only once
    as.numeric(do.call(object_size, objs))
  }

  # "one" is the size of a single instance; "incremental" is the extra
  # memory needed for a second instance (small when memory is shared)
  data.frame(one = size_n(1), incremental = size_n(2) - size_n(1))
}

obj_sizes <- function(..., .env = parent.frame()) {
  # Capture the ... arguments as unevaluated expressions
  exprs <- as.list(match.call(expand.dots = FALSE)$...)
  # Use supplied names where present; fall back to the deparsed expression
  names(exprs) <- lapply(seq_along(exprs),
    FUN = function(n) {
      name <- names(exprs)[n]
      if (is.null(name) || name == "") paste(deparse(exprs[[n]]), collapse = " ")
      else name
    })

  sizes <- mapply(obj_size, exprs, MoreArgs = list(.env = .env), SIMPLIFY = FALSE)
  do.call(rbind, sizes)
}
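
As a quick illustration of how these helpers are used, here is a hypothetical call (the expressions and names are our own examples; the exact numbers will vary by platform and R version):

obj_sizes(
  int_vec = 1:1000,       # a plain atomic vector
  fn      = function() 3  # a small closure
)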

Environments

A new environment created with new.env() uses 328 bytes, regardless of what the parent environment is.

as.numeric(object_size(new.env()))
## [1] 328
as.numeric(object_size(new.env(), new.env()) - object_size(new.env()))
## [1] 328

as.numeric(object_size(new.env(parent = emptyenv())))
## [1] 328
as.numeric(object_size(new.env(parent = asNamespace('pryr'))))
## [1] 328

But much of that space is actually taken up by a hash table. A hash table speeds up access when an environment holds many objects, but with only a few objects it probably doesn’t help much, if at all.

as.numeric(object_size(new.env(hash = FALSE)))
## [1] 56

The list2env function uses 100 items as the threshold for adding a hash table to a newly created environment. This seems like a reasonable number to me.
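
You can see this threshold in list2env’s own argument defaults (inspecting the formals like this is our own addition):

formals(list2env)$hash
# In recent versions of R this prints: (length(x) > 100)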

Creating a new environment is pretty quick – on the order of 1 microsecond.

microbenchmark(
  new.env(),
  new.env(hash = FALSE),
  unit = "us"
)
## Unit: microseconds
##                   expr   min     lq median    uq    max neval
##              new.env() 0.800 0.9085 1.0025 1.144  5.378   100
##  new.env(hash = FALSE) 0.688 0.7920 0.8905 1.054 35.458   100

Memory footprint of hashed vs. non-hashed environments.

obj_sizes(
  hashed = list2env(list(a = 1, b = function() 3), hash = TRUE),
  unhashed = list2env(list(a = 1, b = function() 3), hash = FALSE)
)
##           one incremental
## hashed   9016         552
## unhashed 8744         280
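
In both the first instance and each additional one, the hashed version costs 272 bytes more. That matches the difference between hashed and unhashed empty environments measured above (328 - 56 = 272), so the entire gap is the hash table itself.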

Functions

Time calling functions

Calling a function takes a minimum of about 0.15 microseconds, and each additional argument adds a little more time. Whether the signature uses ... or explicitly named arguments seems to make little difference to the speed.

blank <- function() NULL
dots <- function(...) NULL
xyz <- function(x, y, z) NULL

invisible(gc())
microbenchmark(
  blank(),
  dots(),
  dots(1),
  dots(1, 2),
  dots(1, 2, 3),
  xyz(1, 2, 3),
  unit = "us"
)
## Unit: microseconds
##           expr   min     lq median     uq    max neval
##        blank() 0.121 0.1320 0.1440 0.2050  0.746   100
##         dots() 0.132 0.1390 0.1465 0.1970  8.169   100
##        dots(1) 0.160 0.1670 0.2410 0.3260 34.033   100
##     dots(1, 2) 0.177 0.2035 0.3095 0.3895  0.975   100
##  dots(1, 2, 3) 0.198 0.2240 0.3910 0.4785  1.250   100
##   xyz(1, 2, 3) 0.237 0.3245 0.4205 0.5165  2.685   100

Time handling and evaluating arguments

noArg_noEval <- function() { 3; NULL }  # no argument at all
arg_noEval <- function(x) { 3; NULL }   # argument passed but never used
arg_eval <- function(x) { x; NULL }     # argument forced by using it

invisible(gc())
microbenchmark(
  noArg_noEval(),
  arg_noEval(3),
  arg_eval(3),
  unit = "us"
)
## Unit: microseconds
##            expr   min    lq median     uq    max neval
##  noArg_noEval() 0.181 0.191  0.193 0.2270  1.812   100
##   arg_noEval(3) 0.213 0.228  0.257 0.3540 36.479   100
##     arg_eval(3) 0.244 0.253  0.261 0.3805  4.757   100
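
The difference between arg_noEval() and arg_eval() reflects R’s lazy evaluation: an argument is wrapped in a promise and evaluated only when it is first used, so an argument that is never used is never evaluated. A quick illustration (our own example, not part of the original benchmarks):

# An unused argument is never evaluated, so even an expression that
# would throw an error is harmless:
never_forced <- function(x) NULL
never_forced(stop("this error is never raised"))
## NULL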

Copying functions

Memory footprint of copied functions

How much memory does a copy of a function take up?

We’ll start by looking at a large-ish function, lm.

as.numeric(object_size(lm))
## [1] 48000

Making a copy of it takes up no extra space (other than keeping track of the new “pointer” to the function, which object_size doesn’t capture):

lm2 <- lm
as.numeric(object_size(lm, lm2))
## [1] 48000

But if we change the environment of the copied function, it does take a little bit more memory:

e <- new.env()
environment(lm2) <- e
as.numeric(object_size(lm, lm2) - object_size(lm)) 
## [1] 384

Maybe the extra memory is just from the new environment. If we assign another copy to an environment that’s already in use, it might not use more memory, or at least not much more.

lm3 <- lm
environment(lm3) <- e
as.numeric(object_size(lm, lm2, lm3) - object_size(lm)) 
## [1] 440

It looks like it uses 56 more bytes when you make a copy and point it to an environment that’s already in use. (This also accounts for the 384 bytes above: the 328-byte environment plus 56 bytes for the modified copy of the function. The 56 bytes appears to be just the closure object itself; the formals and body are still shared with the original.) Oddly, if you copy a function and then assign it the same environment it started with, this also uses 56 bytes.

lm2 <- lm
as.numeric(object_size(lm, lm2) - object_size(lm))
## [1] 0

environment(lm2) <- environment(lm2)
identical(lm, lm2)
## [1] TRUE
as.numeric(object_size(lm, lm2) - object_size(lm))
## [1] 56

Time to create a function

  • Does it cost time and/or memory to define a function within another function?

If we call a function which creates a function, how quickly does R create the function? Does it matter how large that function is?

We’ll test it with these two functions. The first returns an extremely simple function, and the second returns the same thing as the stats::lm function. (You can ignore the contents of this function – its only purpose here is to be a long function.)

# This function returns a very simple function
create_inline_short <- function() { 
  function() 3
}

# The long function returns the lm function (spelled out)
create_inline_long <- function() {
  function(formula, data, subset, weights, na.action, method = "qr",
      model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
      contrasts = NULL, offset, ...) {
    ret.x <- x
    ret.y <- y
    cl <- match.call()
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "subset", "weights", "na.action",
        "offset"), names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf$drop.unused.levels <- TRUE
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval(mf, parent.frame())
    if (method == "model.frame")
        return(mf)
    else if (method != "qr")
        warning(gettextf("method = '%s' is not supported. Using 'qr'",
            method), domain = NA)
    mt <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- as.vector(model.weights(mf))
    if (!is.null(w) && !is.numeric(w))
        stop("'weights' must be a numeric vector")
    offset <- as.vector(model.offset(mf))
    if (!is.null(offset)) {
        if (length(offset) != NROW(y))
            stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
                length(offset), NROW(y)), domain = NA)
    }
    if (is.empty.model(mt)) {
        x <- NULL
        z <- list(coefficients = if (is.matrix(y)) matrix(, 0,
            3) else numeric(), residuals = y, fitted.values = 0 *
            y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
            0) else if (is.matrix(y)) nrow(y) else length(y))
        if (!is.null(offset)) {
            z$fitted.values <- offset
            z$residuals <- y - offset
        }
    }
    else {
        x <- model.matrix(mt, mf, contrasts)
        z <- if (is.null(w))
            lm.fit(x, y, offset = offset, singular.ok = singular.ok,
                ...)
        else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
            ...)
    }
    class(z) <- c(if (is.matrix(y)) "mlm", "lm")
    z$na.action <- attr(mf, "na.action")
    z$offset <- offset
    z$contrasts <- attr(x, "contrasts")
    z$xlevels <- .getXlevels(mt, mf)
    z$call <- cl
    z$terms <- mt
    if (model)
        z$model <- mf
    if (ret.x)
        z$x <- x
    if (ret.y)
        z$y <- y
    if (!qr)
        z$qr <- NULL
    z
  }
}

microbenchmark(
  create_inline_short(),
  create_inline_long(),
  unit = "us"
)
## Unit: microseconds
##                   expr   min    lq median     uq    max neval
##  create_inline_short() 0.264 0.276 0.2985 0.4015 37.915   100
##   create_inline_long() 0.275 0.286 0.3110 0.3875  3.463   100

Surprisingly, it takes almost exactly the same amount of time to create a very long function as it does to create a short function, and both are very fast. This is probably because everything is already parsed by the time the outer function is called, so the inner function merely needs to be returned.

Let’s dig a little deeper. Presumably some part of creating a larger function takes more time, but whatever it is isn’t reflected in the test above, where the functions were defined inline.

We can break down the creation of a function into two stages: parsing the text into an unevaluated expression, and evaluating that expression. With the inline functions above, all of the text has been parsed by the time we call the outer function. Evaluating the parsed expression seems not to depend on the length of the function contained within; R probably just returns the already-created parse tree for the function, and assigns it a new environment.

We can test these stages directly. Given a string representation of a function, how long does it take to parse it into an unevaluated expression? How long does it take to evaluate the expression? And does it matter how long the expression is?

We’ll create the same function in three different ways here.

short_string <- "function() 3"
# Parse the string into an unevaluated expression
parse_string_short <- function() parse(text = short_string, keep.source = FALSE)

short_expr <- parse_string_short()
# Evaluate the (parsed) expression
eval_expr_short <- function() eval(short_expr, envir = parent.frame())

# Same for long functions
long_string <- deparse(lm)  # This is a string representation of lm
parse_string_long <- function() parse(text = long_string, keep.source = FALSE)
long_expr <- parse_string_long()
eval_expr_long <- function() eval(long_expr, envir = parent.frame())

# We'll also compare it to versions above, where the function was defined inline in a function
invisible(gc())
microbenchmark(
  create_inline_short(),
  create_inline_long(),
  parse_string_short(),
  parse_string_long(),
  eval_expr_short(),
  eval_expr_long(),
  unit = "us"
)
## Unit: microseconds
##                   expr     min       lq   median       uq     max neval
##  create_inline_short()   0.256   0.3810   0.4775   0.6100   2.329   100
##   create_inline_long()   0.272   0.4125   0.5685   0.7515   9.681   100
##   parse_string_short()   7.726   8.3280   9.3200   9.7415  46.711   100
##    parse_string_long() 454.085 457.5030 458.6815 462.0220 532.169   100
##      eval_expr_short()   1.473   1.9860   2.4250   2.9465  10.075   100
##       eval_expr_long()   1.494   1.9805   2.3685   2.9690  14.690   100

It appears that parsing is the slowest step and, not surprisingly, that its cost depends on the length of the text.

Once the text has been parsed into an unevaluated expression, evaluating it is very fast, and apparently not dependent on the length or complexity of the function that’s returned. Does this mean that, as soon as an expression containing a function definition is parsed, the function is already defined somewhere? We can test this by looking at memory usage.

Memory footprint of new functions

When we create functions inside other functions, how much memory does it take? Is the memory shared between instances of these functions?

Similarly, is memory shared between functions if we first create an unevaluated expression that returns the function, then evaluate that expression multiple times? If so, that suggests that when a function is created, R simply returns the expression representing the function (with an environment added).

To test these things, we need to be a little more careful than we were previously; in the cases above, the environments of the created functions weren’t always the same. The wrapper functions below ensure that the environment of each created function is always the same:

create_inline_short_env <- function() { 
  f <- create_inline_short()
  environment(f) <- parent.frame()
  f
}
create_inline_long_env <- function() {
  f <- create_inline_long()
  environment(f) <- parent.frame()
  f
}

# Create the function by parsing a string and evaluating it
eval_parse_string_short <- function() {
  f <- eval(parse_string_short())
  environment(f) <- parent.frame()
  f
}
eval_parse_string_long <- function() {
  f <- eval(parse_string_long())
  environment(f) <- parent.frame()
  f
}

We now have three functions, create_inline_short_env(), eval_expr_short(), eval_parse_string_short(), which will create functions that are exactly the same. (The same is true for the long versions as well.)

For each of these ways of creating functions, we can now test how much memory is required to create a single instance of a function, and how much memory is required to create each subsequent instance.

obj_sizes(
  create_inline_short_env(),
  eval_expr_short(),
  eval_parse_string_short(),
  create_inline_long_env(),
  eval_expr_long(),
  eval_parse_string_long()
)
##                              one incremental
## create_inline_short_env()  60816         112
## eval_expr_short()            104          56
## eval_parse_string_short()    104         104
## create_inline_long_env()  102904         112
## eval_expr_long()           30448          56
## eval_parse_string_long()   30448       24920

Oddly, creating functions inline in a larger function seems to take a lot of memory for the first instance (probably because the returned function carries a srcref pointing back at its source; see the srcref section below). Creating one more copy takes only 112 bytes.

Creating a function by evaluating a pre-parsed expression takes much less memory for the first instance, and each additional copy takes only 56 bytes: the copies share their formals and body with the first instance.

Creating a function by parsing a string and then evaluating it takes the same amount of memory for the first instance, but each additional copy takes close to the same amount of memory as the first copy, because the string is re-parsed from scratch every time and nothing is shared.

These results suggest that the memory used by a function is allocated largely when the function is parsed, not when the function is actually created.

Size of functions with srcref
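
When keep.source is TRUE (as it typically is in interactive use), a function defined inline carries a srcref attribute recording its source code, and this attribute can account for most of a small function’s apparent size. Below, f() returns a function with its srcref intact, while g() strips it: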

f <- function() {
  res <- function() 3
  res
}
g <- function() {
  res <- function() 3
  attr(res, "srcref") <- NULL
  res
}

obj_sizes(f(), g())
##      one incremental
## f() 9280         224
## g()  272         168

Speed of :: operator

microbenchmark(stats::lm, lm, unit="us")
## Unit: microseconds
##       expr    min     lq median     uq    max neval
##  stats::lm 11.360 12.637 13.495 14.491 57.739   100
##         lm  0.031  0.041  0.077  0.092  2.122   100
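
The gap exists because :: is itself a function call: stats::lm looks lm up in the stats namespace every time it is evaluated, while bare lm is a simple symbol lookup. In performance-sensitive code, a common workaround is to do the namespace lookup once and reuse the result (a sketch; lm_cached is our own name):

# Pay the namespace lookup cost once, up front
lm_cached <- stats::lm

microbenchmark(
  stats::lm,  # namespace lookup on every evaluation
  lm_cached,  # plain symbol lookup
  unit = "us"
)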

Appendix

sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pryr_0.1             microbenchmark_1.3-0
## 
## loaded via a namespace (and not attached):
## [1] codetools_0.2-8  evaluate_0.5.3   formatR_0.10     knitr_1.5.33    
## [5] Rcpp_0.11.1      rmarkdown_0.1.99 stringr_0.6.2    tools_3.1.0     
## [9] yaml_2.1.11

Old code for calculating object sizes