tl;dr: You can speed up some tree walking operations by passing an additional reference object when the function recurses.
Here’s the challenge: find every unique symbol used in a package namespace.
The “normal” way of going about it is to recurse into each node.
unique() on the results.find_symbols <- function(x) {
if (is.symbol(x)) {
as.character(x)
} else if (is.language(x) || is.pairlist(x) || is.environment(x)) {
unique(unlist(lapply(x, find_symbols)))
} else if (is.function(x)) {
unique(unlist(c(
find_symbols(body(x)),
find_symbols(formals(x))
)))
} else {
NULL
}
}
t <- system.time({
syms <- find_symbols(getNamespace('stats'))
})
t
#> user system elapsed
#> 0.791 0.019 0.811
To find all the symbols used in the stats package, it takes 0.81 seconds.
Here’s another implementation, which has two modifications:
unique()) by the parent.lapply(), it uses map_null(). This just applies the user-supplied function in a for-loop. It’s faster than lapply() because it does not collect the results – but we don’t need to colect the results because the reference object (the environment) is doing that.find_symbols2 <- function(x) {
sym_table <- find_symbols2_impl(x)
ls(sym_table, all.names = TRUE)
}
# Like lapply, but faster because it doesn't collect return values.
map_null <- function(.x, .f, ...) {
for (i in seq_along(.x)) {
.f(.x[[i]], ...)
}
NULL
}
find_symbols2_impl <- function(x, sym_table = new.env()) {
if (is.symbol(x)) {
x_str <- as.character(x)
if (nzchar(x_str)) {
sym_table[[x_str]] <- TRUE
}
} else if (is.language(x) || is.pairlist(x)) {
map_null(x, find_symbols2_impl, sym_table)
} else if (is.function(x)) {
find_symbols2_impl(body(x), sym_table)
find_symbols2_impl(formals(x), sym_table)
} else if (is.environment(x)) {
map_null(as.list.environment(x), find_symbols2_impl, sym_table)
}
# Ignore everything else
sym_table
}
t2 <- system.time({
syms2 <- find_symbols2(getNamespace('stats'))
})
t2
#> user system elapsed
#> 0.267 0.003 0.270
This takes about 0.27 seconds, so it’s 3x as fast as the normal method.
Both of these algorithms visit 170872 nodes in the stats package, so the normal method visits nodes at a rate of ~210000 nodes per second, while the reference object method visits ~630000 per second.
This method can’t be used for every use of lapply().
In the example above, it speeds things up by reducing the number of objects that are created and garbage-collected. In other cases, it could speed things up by reducing the amount of computation needed; for example, you may need to do a computation only the first time you seed a node of type X.
The reference object doesn’t have to be a plain environment. It could, for example, be an R6 object.