Speeding up (some) tree-walking operations

tl;dr: You can speed up some tree walking operations by passing an additional reference object when the function recurses.

Tree-walking tests

Here’s the challenge: find every unique symbol used in a package namespace.

The “normal” way of going about it is to recurse into each node.

For leaf nodes (symbols), return a string representation of the symbol
For non-leaf nodes (expressions, functions, pairlists, environments), recurse, collect the values from the children, and call unique() on the results.

find_symbols <- function(x) {

  if (is.symbol(x)) {
    as.character(x)

  } else if (is.language(x) || is.pairlist(x) || is.environment(x)) {
    unique(unlist(lapply(x, find_symbols)))

  } else if (is.function(x)) {
    unique(unlist(c(
      find_symbols(body(x)),
      find_symbols(formals(x))
    )))
    
  } else {
    NULL
  }
}

t <- system.time({
  syms <- find_symbols(getNamespace('stats'))
})
t
#>    user  system elapsed 
#>   0.791   0.019   0.811

To find all the symbols used in the stats package, it takes 0.81 seconds.

Here’s another implementation, which has two modifications:

When it recurses, it passes along a reference object (an environment), which is used to store each unique symbol. The reference object is mutated as it walks the tree; this is in contrast to the normal method, where each node returns results which are then collected and filtered (with unique()) by the parent.
Instead of lapply(), it uses map_null(). This just applies the user-supplied function in a for-loop. It’s faster than lapply() because it does not collect the results – but we don’t need to colect the results because the reference object (the environment) is doing that.

find_symbols2 <- function(x) {
  sym_table <- find_symbols2_impl(x)
  ls(sym_table, all.names = TRUE)
}

# Like lapply, but faster because it doesn't collect return values.
map_null <- function(.x, .f, ...) {
  for (i in seq_along(.x)) {
    .f(.x[[i]], ...)
  }
  NULL
}

find_symbols2_impl <- function(x, sym_table = new.env()) {
  if (is.symbol(x)) {
    x_str <- as.character(x)
    if (nzchar(x_str)) {
      sym_table[[x_str]] <- TRUE
    }

  } else if (is.language(x) || is.pairlist(x)) {
    map_null(x, find_symbols2_impl, sym_table)

  } else if (is.function(x)) {
    find_symbols2_impl(body(x), sym_table)
    find_symbols2_impl(formals(x), sym_table)

  } else if (is.environment(x)) {
    map_null(as.list.environment(x), find_symbols2_impl, sym_table)
  }
  # Ignore everything else

  sym_table
}


t2 <- system.time({
  syms2 <- find_symbols2(getNamespace('stats'))
})
t2
#>    user  system elapsed 
#>   0.267   0.003   0.270

This takes about 0.27 seconds, so it’s 3x as fast as the normal method.

Both of these algorithms visit 170872 nodes in the stats package, so the normal method visits nodes at a rate of ~210000 nodes per second, while the reference object method visits ~630000 per second.

Other comments

This method can’t be used for every use of lapply().

In the example above, it speeds things up by reducing the number of objects that are created and garbage-collected. In other cases, it could speed things up by reducing the amount of computation needed; for example, you may need to do a computation only the first time you seed a node of type X.

The reference object doesn’t have to be a plain environment. It could, for example, be an R6 object.