options(boomer.safe_print = TRUE)
stylize_table <- function(table) {
  kableExtra::kable(table, row.names = FALSE) |>
    kableExtra::kable_paper()
}

This document is an attempt to reproduce the main steps taken during
work on the {tableboom} package. I hope it will be useful
for myself in the future, but also for any reader interested in
metaprogramming. Unfortunately, perhaps the most interesting part
happens not in the {tableboom} package, but in
{boomer}[^1], because that is where the core algorithm used to
inspect calls lives. That said, please note that this vignette
focuses only on what happens before and after the call is
enclosed in the boomer::boom() function, not inside it. The
main difference between {tableboom} and
{boomer} is that the former makes it possible to inspect a
whole R script (i.e. all top-level calls in the script), while the latter
assumes that the user will manually point to the chosen call by wrapping it
in boomer::boom(). I therefore focus on one problem: how
to automatically wrap all calls in some function?
path_to_script <- file.path(system.file(package = "tableboom", "vignette_helpers"),
                            "script_code_as_data.R")

Let’s take the script below as an example (with some traps!):
# define data
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b

my_df

sum(my_df$col_b); mean(my_df$col_c)
And the mentioned traps are:
Although in theory we could treat the script as
text (and read it using e.g. readLines()), that won’t take
us too far, because source code is not just words. Instead, we can
use getParseData() to find the calls we need. This function
needs an object which has a srcref attribute. That
can be a function (if it was sourced with srcref kept,
e.g. when the script containing the function is sourced using
Ctrl + Shift + Enter in the RStudio IDE on Windows, not just
Ctrl + Enter) or an object returned by
parse() with keep.source = TRUE.
attributes(parse(path_to_script, keep.source = TRUE))
$srcref
$srcref[[1]]
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

$srcref[[2]]
my_df$col_c <- my_df$col_b + 2

$srcref[[3]]
my_df

$srcref[[4]]
sum(my_df$col_b)

$srcref[[5]]
mean(my_df$col_c)

$srcfile
C:/Users/gsmolinski/Documents/R/R-4.2.1/library/tableboom/vignette_helpers/script_code_as_data.R

$wholeSrcref
# define data
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b

my_df

sum(my_df$col_b); mean(my_df$col_c)

Now, getParseData() returns a table with precise
information about each element in the source code.
getParseData(parse(path_to_script, keep.source = TRUE)) |>
  stylize_table()

| line1 | col1 | line2 | col2 | id | parent | token | terminal | text |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 13 | 1 | -57 | COMMENT | TRUE | # define data |
| 2 | 1 | 3 | 36 | 57 | 0 | expr | FALSE | |
| 2 | 1 | 2 | 5 | 4 | 6 | SYMBOL | TRUE | my_df |
| 2 | 1 | 2 | 5 | 6 | 57 | expr | FALSE | |
| 2 | 7 | 2 | 8 | 5 | 57 | LEFT_ASSIGN | TRUE | <- |
| 2 | 10 | 3 | 36 | 55 | 57 | expr | FALSE | |
| 2 | 10 | 2 | 19 | 7 | 9 | SYMBOL_FUNCTION_CALL | TRUE | data.frame |
| 2 | 10 | 2 | 19 | 9 | 55 | expr | FALSE | |
| 2 | 20 | 2 | 20 | 8 | 55 | ‘(’ | TRUE | ( |
| 2 | 21 | 2 | 25 | 10 | 55 | SYMBOL_SUB | TRUE | col_a |
| 2 | 27 | 2 | 27 | 11 | 55 | EQ_SUB | TRUE | = |
| 2 | 29 | 2 | 43 | 27 | 55 | expr | FALSE | |
| 2 | 29 | 2 | 29 | 12 | 14 | SYMBOL_FUNCTION_CALL | TRUE | c |
| 2 | 29 | 2 | 29 | 14 | 27 | expr | FALSE | |
| 2 | 30 | 2 | 30 | 13 | 27 | ‘(’ | TRUE | ( |
| 2 | 31 | 2 | 35 | 15 | 17 | STR_CONST | TRUE | “one” |
| 2 | 31 | 2 | 35 | 17 | 27 | expr | FALSE | |
| 2 | 36 | 2 | 36 | 16 | 27 | ‘,’ | TRUE | , |
| 2 | 38 | 2 | 42 | 21 | 23 | STR_CONST | TRUE | “two” |
| 2 | 38 | 2 | 42 | 23 | 27 | expr | FALSE | |
| 2 | 43 | 2 | 43 | 22 | 27 | ‘)’ | TRUE | ) |
| 2 | 44 | 2 | 44 | 28 | 55 | ‘,’ | TRUE | , |
| 3 | 21 | 3 | 25 | 33 | 55 | SYMBOL_SUB | TRUE | col_b |
| 3 | 27 | 3 | 27 | 34 | 55 | EQ_SUB | TRUE | = |
| 3 | 29 | 3 | 35 | 50 | 55 | expr | FALSE | |
| 3 | 29 | 3 | 29 | 35 | 37 | SYMBOL_FUNCTION_CALL | TRUE | c |
| 3 | 29 | 3 | 29 | 37 | 50 | expr | FALSE | |
| 3 | 30 | 3 | 30 | 36 | 50 | ‘(’ | TRUE | ( |
| 3 | 31 | 3 | 31 | 38 | 39 | NUM_CONST | TRUE | 1 |
| 3 | 31 | 3 | 31 | 39 | 50 | expr | FALSE | |
| 3 | 32 | 3 | 32 | 40 | 50 | ‘,’ | TRUE | , |
| 3 | 34 | 3 | 34 | 44 | 45 | NUM_CONST | TRUE | 2 |
| 3 | 34 | 3 | 34 | 45 | 50 | expr | FALSE | |
| 3 | 35 | 3 | 35 | 46 | 50 | ‘)’ | TRUE | ) |
| 3 | 36 | 3 | 36 | 51 | 55 | ‘)’ | TRUE | ) |
| 5 | 1 | 5 | 30 | 79 | 0 | expr | FALSE | |
| 5 | 1 | 5 | 11 | 66 | 79 | expr | FALSE | |
| 5 | 1 | 5 | 5 | 62 | 64 | SYMBOL | TRUE | my_df |
| 5 | 1 | 5 | 5 | 64 | 66 | expr | FALSE | |
| 5 | 6 | 5 | 6 | 63 | 66 | ‘$’ | TRUE | $ |
| 5 | 7 | 5 | 11 | 65 | 66 | SYMBOL | TRUE | col_c |
| 5 | 13 | 5 | 14 | 67 | 79 | LEFT_ASSIGN | TRUE | <- |
| 5 | 16 | 5 | 30 | 78 | 79 | expr | FALSE | |
| 5 | 16 | 5 | 26 | 72 | 78 | expr | FALSE | |
| 5 | 16 | 5 | 20 | 68 | 70 | SYMBOL | TRUE | my_df |
| 5 | 16 | 5 | 20 | 70 | 72 | expr | FALSE | |
| 5 | 21 | 5 | 21 | 69 | 72 | ‘$’ | TRUE | $ |
| 5 | 22 | 5 | 26 | 71 | 72 | SYMBOL | TRUE | col_b |
| 5 | 28 | 5 | 28 | 73 | 78 | ‘+’ | TRUE | + |
| 5 | 30 | 5 | 30 | 74 | 75 | NUM_CONST | TRUE | 2 |
| 5 | 30 | 5 | 30 | 75 | 78 | expr | FALSE | |
| 5 | 32 | 5 | 59 | 76 | -79 | COMMENT | TRUE | # add 2 to each row in col_b |
| 7 | 1 | 7 | 5 | 84 | 86 | SYMBOL | TRUE | my_df |
| 7 | 1 | 7 | 5 | 86 | 0 | expr | FALSE | |
| 9 | 1 | 9 | 16 | 103 | 0 | expr | FALSE | |
| 9 | 1 | 9 | 3 | 91 | 93 | SYMBOL_FUNCTION_CALL | TRUE | sum |
| 9 | 1 | 9 | 3 | 93 | 103 | expr | FALSE | |
| 9 | 4 | 9 | 4 | 92 | 103 | ‘(’ | TRUE | ( |
| 9 | 5 | 9 | 15 | 98 | 103 | expr | FALSE | |
| 9 | 5 | 9 | 9 | 94 | 96 | SYMBOL | TRUE | my_df |
| 9 | 5 | 9 | 9 | 96 | 98 | expr | FALSE | |
| 9 | 10 | 9 | 10 | 95 | 98 | ‘$’ | TRUE | $ |
| 9 | 11 | 9 | 15 | 97 | 98 | SYMBOL | TRUE | col_b |
| 9 | 16 | 9 | 16 | 99 | 103 | ‘)’ | TRUE | ) |
| 9 | 17 | 9 | 17 | 104 | 0 | ‘;’ | TRUE | ; |
| 9 | 19 | 9 | 35 | 119 | 0 | expr | FALSE | |
| 9 | 19 | 9 | 22 | 107 | 109 | SYMBOL_FUNCTION_CALL | TRUE | mean |
| 9 | 19 | 9 | 22 | 109 | 119 | expr | FALSE | |
| 9 | 23 | 9 | 23 | 108 | 119 | ‘(’ | TRUE | ( |
| 9 | 24 | 9 | 34 | 114 | 119 | expr | FALSE | |
| 9 | 24 | 9 | 28 | 110 | 112 | SYMBOL | TRUE | my_df |
| 9 | 24 | 9 | 28 | 112 | 114 | expr | FALSE | |
| 9 | 29 | 9 | 29 | 111 | 114 | ‘$’ | TRUE | $ |
| 9 | 30 | 9 | 34 | 113 | 114 | SYMBOL | TRUE | col_c |
| 9 | 35 | 9 | 35 | 115 | 119 | ‘)’ | TRUE | ) |
From this table, we would like to retrieve information about the
expr token. expr stands for expression, and we
can think of an expression as a compound piece of code:
> An expression is any member of the set of base types created by parsing code: constant scalars, symbols, call objects, and pairlists.[^2]

This is not visible in the table above, because the text
(the value in the text column) is hidden for
expr rows by default, but we can override this.
getParseData(parse(path_to_script, keep.source = TRUE), includeText = TRUE) |>
  head(20) |>
  stylize_table()

| line1 | col1 | line2 | col2 | id | parent | token | terminal | text |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 13 | 1 | -57 | COMMENT | TRUE | # define data |
| 2 | 1 | 3 | 36 | 57 | 0 | expr | FALSE | my_df <- data.frame(col_a = c(“one”, “two”), col_b = c(1, 2)) |
| 2 | 1 | 2 | 5 | 4 | 6 | SYMBOL | TRUE | my_df |
| 2 | 1 | 2 | 5 | 6 | 57 | expr | FALSE | my_df |
| 2 | 7 | 2 | 8 | 5 | 57 | LEFT_ASSIGN | TRUE | <- |
| 2 | 10 | 3 | 36 | 55 | 57 | expr | FALSE | data.frame(col_a = c(“one”, “two”), col_b = c(1, 2)) |
| 2 | 10 | 2 | 19 | 7 | 9 | SYMBOL_FUNCTION_CALL | TRUE | data.frame |
| 2 | 10 | 2 | 19 | 9 | 55 | expr | FALSE | data.frame |
| 2 | 20 | 2 | 20 | 8 | 55 | ‘(’ | TRUE | ( |
| 2 | 21 | 2 | 25 | 10 | 55 | SYMBOL_SUB | TRUE | col_a |
| 2 | 27 | 2 | 27 | 11 | 55 | EQ_SUB | TRUE | = |
| 2 | 29 | 2 | 43 | 27 | 55 | expr | FALSE | c(“one”, “two”) |
| 2 | 29 | 2 | 29 | 12 | 14 | SYMBOL_FUNCTION_CALL | TRUE | c |
| 2 | 29 | 2 | 29 | 14 | 27 | expr | FALSE | c |
| 2 | 30 | 2 | 30 | 13 | 27 | ‘(’ | TRUE | ( |
| 2 | 31 | 2 | 35 | 15 | 17 | STR_CONST | TRUE | “one” |
| 2 | 31 | 2 | 35 | 17 | 27 | expr | FALSE | “one” |
| 2 | 36 | 2 | 36 | 16 | 27 | ‘,’ | TRUE | , |
| 2 | 38 | 2 | 42 | 21 | 23 | STR_CONST | TRUE | “two” |
| 2 | 38 | 2 | 42 | 23 | 27 | expr | FALSE | “two” |
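The same lookup can be tried without the package's helper script. The snippet below is only a minimal sketch: the two-line `code` vector is a made-up example (not the vignette's script), and filtering on `token == "expr"` with `parent == 0` is my reading of the tables above rather than the package's own code.

```r
# Hypothetical two-line script, parsed from text instead of a file
code <- c("x <- c(1, 2) # a comment", "x")
parsed <- parse(text = code, keep.source = TRUE)

# includeText = TRUE also fills the `text` column for non-terminal rows
pd <- utils::getParseData(parsed, includeText = TRUE)

# Top expressions: `expr` tokens whose parent is 0 (the comment has a negative parent)
top <- pd[pd$token == "expr" & pd$parent == 0,
          c("line1", "col1", "line2", "col2", "text")]
top$text
```

Each element of `top$text` is the full source text of one top expression, without the trailing comment.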
An expression is built from the different base types created by parsing code
and from other expressions, i.e. in a script, an expression can belong
to other expression(s). {boomer} lets us
inspect the intermediate steps of a call, where a call means a function call
(the usage of a function). Thus we could say that all we need is to find the top
expression, which is true, but what about this difference between
an expression and a call? Do we need to care about it?
Actually, yes.
Taking the information from the {boomer} description for granted,
we need to remember that with boomer::boom() we can inspect
a call and, as we saw in the table above, an expr can
be an expression that doesn’t contain a call:
getParseData(parse(path_to_script, keep.source = TRUE), includeText = TRUE)[52:55, ] |>
  stylize_table()

| line1 | col1 | line2 | col2 | id | parent | token | terminal | text |
|---|---|---|---|---|---|---|---|---|
| 5 | 32 | 5 | 59 | 76 | -79 | COMMENT | TRUE | # add 2 to each row in col_b |
| 7 | 1 | 7 | 5 | 84 | 86 | SYMBOL | TRUE | my_df |
| 7 | 1 | 7 | 5 | 86 | 0 | expr | FALSE | my_df |
| 9 | 1 | 9 | 16 | 103 | 0 | expr | FALSE | sum(my_df$col_b) |
my_df is a non-nested expression (its value in the
parent column is equal to 0) and it is just a
SYMBOL, not a SYMBOL_FUNCTION_CALL
(SYMBOL_FUNCTION_CALL marks a call).
boomer::boom() does not inspect anything if there is no call at
all:
boomer::boom(1, print = dplyr::glimpse)
[1] 1
boomer::boom(c(1), print = dplyr::glimpse)
< > c(1)
 num 1
[1] 1
identical(1, c(1))
[1] TRUE

The output is different: in the first case 1 is just
printed, not inspected. This holds even though there is no
practical difference between 1 and c(1). The
consequence is that if something is not a call, we need something
other than boomer::boom(). Luckily, we can directly use
the function passed as an argument to boomer::boom()
above, i.e. glimpse() from the {dplyr} package[^3].
dplyr::glimpse(1)
 num 1

To differentiate between a call and any other item, we can check whether
the row right after the expression (i.e. the next row) is an expr token
with parent equal to 0. If it is not, then we have a call.
Our aim is to construct a data.frame with information
about each expr having 0 as parent (i.e. non-nested
expressions); we could end up with the result below[^4]:
parse_data <- getParseData(parse(path_to_script, keep.source = TRUE))
tableboom:::find_exprs(parse_data) |>
  stylize_table()

| line1 | col1 | line2 | col2 | id | fun |
|---|---|---|---|---|---|
| 2 | 1 | 3 | 36 | 57 | boomer::boom |
| 5 | 1 | 5 | 30 | 79 | boomer::boom |
| 7 | 1 | 7 | 5 | 86 | dplyr::glimpse |
| 9 | 1 | 9 | 16 | 103 | boomer::boom |
| 9 | 19 | 9 | 35 | 119 | boomer::boom |
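A table of this shape can be approximated in a few lines. The sketch below is not `tableboom:::find_exprs()`: the function name `classify_top_exprs()` is hypothetical, and instead of looking at the next row of the parse data it asks `is.call()` about each parsed top expression, which is a different technique. It assumes every top expression shows up as an `expr` token (true for the `<-` assignments used here) so that rows and parsed expressions line up one to one.

```r
# Hypothetical helper, not part of {tableboom}
classify_top_exprs <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(parsed)

  # positions of the top (non-nested) expressions, in source order
  top <- pd[pd$token == "expr" & pd$parent == 0,
            c("line1", "col1", "line2", "col2", "id")]

  # a call gets boomer::boom, anything else (symbol, constant) gets dplyr::glimpse
  is_call <- vapply(parsed, is.call, logical(1))
  top$fun <- ifelse(is_call, "boomer::boom", "dplyr::glimpse")
  rownames(top) <- NULL
  top
}

res <- classify_top_exprs(c("x <- c(1, 2)", "x", "sum(x); mean(x)"))
res
```

Only the function names are stored as text here, so neither {boomer} nor {dplyr} needs to be installed to run the sketch.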
We already know where and what to put in the script to modify it
(this information was retrieved at the end of the previous chapter). To
summarize: we want to wrap each top expression in some function
(boomer::boom() or dplyr::glimpse()).
In the simplest but naive approach, we could use the information about
the starting line (column line1), the ending line (column
line2) and the function to use (column fun) and
just paste everything together (at the beginning of the line it would be
e.g. "boomer::boom(" and at the end of the line a closing bracket,
")"). However, problems occur if there is a comment after
the expression (we would paste the closing bracket at the end of the comment,
not of the expression) or if one line contains more than one call (separated by
a semicolon); the problem in this case is the same as with the
comment after the expression. These problems exist because, on the one hand,
we talk about expressions as language elements and, on the other hand,
about the text file in which the expressions live (so in this case we treat
source code as text).
Luckily, in addition to the starting and ending
line, we also know exactly where the first character of the expression
(col1) and its last character (col2) are. Thus we can
imagine the following steps:
1. read the script as text using readLines() (an
alternative might be getParseText())
2. use tableboom:::find_exprs() to get
information about the starting line, first character, ending line and last
character, so that the function is inserted at the correct location
3. if something has already been inserted in the same line, col1 or col2 must be shifted
accordingly

These steps in the form of a function could look like this:
tableboom:::insert_fun
function(exprs_df, temp_path, script_path) {
  file <- readLines(script_path, warn = FALSE)
  file_orig <- file
  for (i in seq_along(rownames(exprs_df))) {
    split_str <- unlist(strsplit(file[exprs_df[i, "line1"]], ""), use.names = FALSE)
    chars_added <- nchar(file[exprs_df[i, "line1"]]) - nchar(file_orig[exprs_df[i, "line1"]])
    # minus 1 below, because we want to add function "before" first character, not "after"
    str_modified <- append(split_str, paste0(exprs_df[i, "fun"], "("), after = (exprs_df[i, "col1"] - 1) + chars_added)
    file[exprs_df[i, "line1"]] <- paste0(str_modified, collapse = "")
    split_str <- unlist(strsplit(file[exprs_df[i, "line2"]], ""), use.names = FALSE)
    chars_added <- nchar(file[exprs_df[i, "line2"]]) - nchar(file_orig[exprs_df[i, "line2"]])
    str_modified <- append(split_str, ")", after = exprs_df[i, "col2"] + chars_added)
    file[exprs_df[i, "line2"]] <- paste0(str_modified, collapse = "")
  }
  writeLines(file, temp_path)
}
<bytecode: 0x00000236f6a28c48>
<environment: namespace:tableboom>

And the modified script would look like this:
top_exprs <- tableboom:::find_exprs(parse_data)
temp_path <- tempfile(fileext = ".R")
tableboom:::insert_fun(top_exprs, temp_path, path_to_script)

# define data
boomer::boom(my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2)))

boomer::boom(my_df$col_c <- my_df$col_b + 2) # add 2 to each row in col_b

dplyr::glimpse(my_df)

boomer::boom(sum(my_df$col_b)); boomer::boom(mean(my_df$col_c))
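The column-based insertion can also be sketched without tracking how many characters were already added. The version below is an alternative, not the package's `insert_fun()`: the name `wrap_exprs()` and the small `exprs_df` are made up for illustration, and it assumes the rows of `exprs_df` are ordered by position in the script. It walks the expressions from last to first, so an insertion never moves the columns of an expression that is still waiting to be processed.

```r
# Hypothetical helper, not part of {tableboom}
wrap_exprs <- function(lines, exprs_df) {
  # last expression first: text is only ever inserted to the right of
  # the positions that remain to be handled
  for (i in rev(seq_len(nrow(exprs_df)))) {
    l1 <- exprs_df$line1[i]; c1 <- exprs_df$col1[i]
    l2 <- exprs_df$line2[i]; c2 <- exprs_df$col2[i]
    # closing bracket right after the last character of the expression
    lines[l2] <- paste0(substr(lines[l2], 1, c2), ")", substring(lines[l2], c2 + 1))
    # function name and opening bracket right before the first character
    lines[l1] <- paste0(substr(lines[l1], 1, c1 - 1), exprs_df$fun[i], "(",
                        substring(lines[l1], c1))
  }
  lines
}

lines <- c("x <- 1 # comment", "sum(x); mean(x)")
exprs_df <- data.frame(line1 = c(1, 2, 2), col1 = c(1, 1, 9),
                       line2 = c(1, 2, 2), col2 = c(6, 6, 15),
                       fun = "boomer::boom", stringsAsFactors = FALSE)
wrapped <- wrap_exprs(lines, exprs_df)
wrapped
```

Both traps are covered by this toy input: the closing bracket lands before the trailing comment, and the two calls separated by a semicolon are wrapped separately.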
Evaluation is, for our purposes, the same as executing code. We
have used parse() many times before: when a script is passed to
parse(), it returns all top expressions
in the form of a list (an expression vector). This list can be subset to
get nested expressions or simpler elements. The length of this list is the
number of top expressions; it doesn’t contain
e.g. comments.
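A short sketch of what this means in practice, using a made-up three-line `code` vector rather than the vignette's script:

```r
# Hypothetical script: one comment line and three top expressions
code <- c("# a comment", "x <- c(1, 2)", "sum(x); mean(x)")
parsed <- parse(text = code, keep.source = TRUE)

length(parsed)                   # number of top expressions; the comment is not counted
parsed[[1]]                      # the whole assignment, which is a call to `<-`
nested <- parsed[[1]][[3]]       # the nested call on the right-hand side
fun_name <- as.character(nested[[1]])  # name of the function called by the nested call
```

Subsetting with `[[` goes one level deeper each time: from the expression vector to a call, and from a call to its function name and arguments.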
Having the modified script, we now need to read it, run the code
and capture the output (i.e. the output from boomer::boom() or
dplyr::glimpse()). And because we are now dealing with
language elements rather than text, we won’t use readLines()
but, of course, parse().
The eval() function is responsible for performing evaluation and,
before we move to our case, it is worth showing its usage on a
simpler example:
parse(text = "2 + 2") # expression before evaluation, see also expression() or quote()
expression(2 + 2)
typeof(parse(text = "2 + 2")) # expression type
[1] "expression"
eval(parse(text = "2 + 2")) # evaluation
[1] 4

Often it may be safer to evaluate an expression in a different environment, especially if we don’t know how the expression will affect the default environment. Evaluation can e.g. define a variable, and if a variable of the same name already exists in our environment, this will be a side effect we most likely don’t want:
a <- 3
eval(parse(text = "a <- 2 + 2"))
a
[1] 4
a <- 3
e <- new.env()
eval(parse(text = "a <- 2 + 2"), envir = e)
a
[1] 3
e$a
[1] 4

One could think that the alternative to parse() and
eval() would simply be source(), but we want
to evaluate expressions one by one to be able to capture the output of
each expression. We also want to do two additional things:
- evaluate the original expression as well, because calls wrapped in
boomer::boom() do not affect the environment, i.e. if we use
boomer::boom(a <- 2), variable a won’t be
defined. And if the next expression needs variable a,
we would end up with an error saying that a does not
exist
- use try() to silently return the
error and not stop

tableboom:::get_output # just a helper function
function(parsed_file, envir) {
  utils::capture.output(try(eval(parsed_file, envir = envir), silent = TRUE))
}
<bytecode: 0x00000236f9ee6370>
<environment: namespace:tableboom>
tableboom:::eval_file # main function to eval original and modified script
function(parsed_mod_file, parsed_orig_file) {
  e <- new.env()
  output <- vector("list", length(parsed_mod_file))
  for (i in seq_along(parsed_orig_file)) {
    try(eval(parsed_orig_file[[i]], envir = e), silent = TRUE)
    output[[i]] <- tableboom:::get_output(parsed_mod_file[[i]], envir = e)
  }
  output
}
<bytecode: 0x00000236f9f65098>
<environment: namespace:tableboom>

To capture the output, capture.output() is used, and the result is then
modified a little as well (some unimportant parts of the output are
removed using other functions in {tableboom}). All of this
gives us access to the original source code and the output from the modified
source code, which we can now combine as we want, e.g. in the form of a
table, where one row contains the original code (expression) and the second
row contains the output of the inspecting function for the same expression.
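The evaluate-one-by-one idea can be tried in isolation. The following is a simplified sketch, not the package's `eval_file()`: the name `run_each()` is hypothetical, it evaluates only one version of the script (no wrapped copy), and it does not use {boomer} at all. It keeps the three ingredients discussed above: a separate environment, `try()` so that an error does not stop the loop, and `capture.output()` per expression.

```r
# Hypothetical helper, not part of {tableboom}
run_each <- function(code) {
  parsed <- parse(text = code)
  e <- new.env()                         # shared by all expressions of the script
  output <- vector("list", length(parsed))
  for (i in seq_along(parsed)) {
    # capture what the expression prints; a failing expression does not stop the loop
    output[[i]] <- utils::capture.output(
      try(eval(parsed[[i]], envir = e), silent = TRUE)
    )
  }
  output
}

out <- run_each(c("x <- c(1, 2)", "stop('failure in the middle')", "sum(x)"))
out[[3]]
```

Because all expressions share the environment `e`, the third expression can still use `x` even though the second one failed.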
Dealing with data we don’t know upfront is always risky, and
that’s also true when code is the data: many things may happen and it
is very hard to anticipate everything. Although we have talked about a
way to avoid side effects when evaluating an expression (using a separate
environment), a script can contain many different things for which
using a new environment won’t be enough (e.g. <<- or a
library() call):
"package:tableboom" %in% search()
[1] FALSE
eval(quote(library(tableboom)), envir = new.env())
"package:tableboom" %in% search()
[1] TRUE
detach(package:tableboom)
"package:tableboom" %in% search()
[1] FALSE

We have used a new environment, but the search path has been affected. The
safest way to evaluate an expression would be to do it in a separate R
session, and we can do this using the {callr} package[^5].
"package:tableboom" %in% search()
[1] FALSE
# the code below returns the value of `library()`, i.e. the names of the attached
# packages, so we can see that {tableboom} is attached in the separate R session
callr::r(function() eval(quote(library(tableboom))))
[1] "tableboom" "stats" "graphics" "grDevices" "utils" "datasets"
[7] "methods" "base"
"package:tableboom" %in% search() # but is not present in our session
[1] FALSE

At the end of this chapter, let’s do something a little more
complicated to show that {callr} indeed works
very well for our purposes: we will use a function from
{dplyr} which would not run (an error would occur) if
{dplyr} is not loaded.
"package:dplyr" %in% search() # dplyr is not loaded
[1] FALSE
tryCatch(mutate(data.frame(a = 1), new_col = "new"),
         error = function(e) "Error occurred") # error, because can't find `mutate()` function
[1] "Error occurred"
callr::r(function() {
  invisible(library(dplyr))
  mutate(data.frame(a = 1), new_col = "new")
}
) # data.frame is returned correctly
  a new_col
1 1     new
"package:dplyr" %in% search() # dplyr is still not loaded
[1] FALSE

Our path led us from the difference between text (words) and
source code (expressions), where we used both views of the script
(readLines(), parse(),
getParseData()) to load code as data, through script
modification (in this second step we stuck to string
manipulation; a language or expression object can
also be manipulated, but it is tricky to keep the srcref
attribute after the manipulation), to code evaluation
(eval()). It was a short road: many things have been left
aside, mostly because of gaps in my knowledge. I think, however,
that the signs we saw (Advanced R, quote(), the
language type, environments) will be a good starting point
for future exploration if needed.
[^1]: Fabri A (2021). boomer: Debugging Tools to Inspect the Intermediate Steps of a Call. R package version 0.1.0, https://CRAN.R-project.org/package=boomer.
[^2]: Wickham H. Advanced R. Second Edition, chapter 18.3.
[^3]: Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.10, https://CRAN.R-project.org/package=dplyr.
[^4]: In the code block we use the triple colon to get a function which exists in the package but is not exported, i.e. is not intended to be used directly by the package user.
[^5]: Csárdi G, Chang W (2022). callr: Call R from R. R package version 3.7.2, https://CRAN.R-project.org/package=callr.