tl;dr: You can speed up some tree walking operations by passing an additional reference object when the function recurses.

Tree-walking tests

Here’s the challenge: find every unique symbol used in a package namespace.

The “normal” way of going about it is to recurse into each node.

find_symbols <- function(x) {

  if (is.symbol(x)) {
    as.character(x)

  } else if (is.language(x) || is.pairlist(x) || is.environment(x)) {
    unique(unlist(lapply(x, find_symbols)))

  } else if (is.function(x)) {
    unique(unlist(c(
      find_symbols(body(x)),
      find_symbols(formals(x))
    )))
    
  } else {
    NULL
  }
}

t <- system.time({
  syms <- find_symbols(getNamespace('stats'))
})
t
#>    user  system elapsed 
#>   0.791   0.019   0.811

To find all the symbols used in the stats package, it takes 0.81 seconds.

Here’s another implementation, which has two modifications:

find_symbols2 <- function(x) {
  sym_table <- find_symbols2_impl(x)
  ls(sym_table, all.names = TRUE)
}

# Like lapply, but faster because it doesn't collect return values.
map_null <- function(.x, .f, ...) {
  for (i in seq_along(.x)) {
    .f(.x[[i]], ...)
  }
  NULL
}

find_symbols2_impl <- function(x, sym_table = new.env()) {
  if (is.symbol(x)) {
    x_str <- as.character(x)
    if (nzchar(x_str)) {
      sym_table[[x_str]] <- TRUE
    }

  } else if (is.language(x) || is.pairlist(x)) {
    map_null(x, find_symbols2_impl, sym_table)

  } else if (is.function(x)) {
    find_symbols2_impl(body(x), sym_table)
    find_symbols2_impl(formals(x), sym_table)

  } else if (is.environment(x)) {
    map_null(as.list.environment(x), find_symbols2_impl, sym_table)
  }
  # Ignore everything else

  sym_table
}


t2 <- system.time({
  syms2 <- find_symbols2(getNamespace('stats'))
})
t2
#>    user  system elapsed 
#>   0.267   0.003   0.270

This takes about 0.27 seconds, so it’s 3x as fast as the normal method.

Both of these algorithms visit 170872 nodes in the stats package, so the normal method visits nodes at a rate of ~210000 nodes per second, while the reference object method visits ~630000 per second.

Other comments

This method can’t be used for every use of lapply().

In the example above, it speeds things up by reducing the number of objects that are created and garbage-collected. In other cases, it could speed things up by reducing the amount of computation needed; for example, you may need to do a computation only the first time you seed a node of type X.

The reference object doesn’t have to be a plain environment. It could, for example, be an R6 object.