Treating Code As Data - Notes

Code
options(boomer.safe_print = TRUE)

stylize_table <- function(table) {
  kableExtra::kable(table, row.names = FALSE) |> 
    kableExtra::kable_paper()
}

Intention

This document attempts to reproduce the main steps taken while working on the {tableboom} package. I hope it will be useful to me in the future, and also to any reader interested in metaprogramming. Unfortunately, perhaps the most interesting part happens not in {tableboom} but in {boomer} [1], which contains the core algorithm used to inspect calls. That said, please note that this vignette focuses only on what happens before and after a call is enclosed in boomer::boom(), not on what happens inside it. The main difference between {tableboom} and {boomer} is that the former makes it possible to inspect a whole R script (i.e. all top-level calls in the script), while the latter assumes that the user manually points to a chosen call by wrapping it in boomer::boom(). I therefore focus on one problem: how to automatically wrap all calls in some function?

Expression

Code
path_to_script <- file.path(system.file(package = "tableboom", "vignette_helpers"),
                            "script_code_as_data.R")

Let’s take the script below as an example (with some traps!):

# define data
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b

my_df

sum(my_df$col_b); mean(my_df$col_c)

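The traps mentioned above are: the first call spans two lines, a comment follows an expression on the same line, one top-level expression (my_df) is a bare symbol rather than a call, and the last line holds two calls separated by a semicolon.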

Although in theory we could treat the script as text (and read it using e.g. readLines()), that won’t take us very far, because source code is not just words. Instead, we can use getParseData() to find the calls we need. This function needs an object that carries a srcref attribute. That can be a function (if it was sourced with srcref kept, e.g. when a function in the script is sourced using Ctrl + Shift + Enter in the RStudio IDE on Windows, not just Ctrl + Enter) or an object returned by parse().
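A minimal sketch of that difference, using a small in-memory script instead of the vignette’s file (the `script` vector below is made up for illustration): as text we only get lines, while parse() with keep.source = TRUE gives expressions plus the source references that getParseData() relies on.

```r
# Hypothetical mini-script, held in memory instead of a file
script <- c(
  "# a comment line",
  "x <- c(1,",
  "       2)",
  "x; sum(x)"
)

# As text: four lines, with no notion of where an expression starts or ends
n_lines <- length(script)

# As code: three top-level expressions (the comment is not one of them)
exprs <- parse(text = script, keep.source = TRUE)
n_exprs <- length(exprs)

# Parse data is available only when source references were kept
with_src <- utils::getParseData(exprs)
without_src <- utils::getParseData(parse(text = script, keep.source = FALSE))

n_lines
n_exprs
is.data.frame(with_src)
is.null(without_src)
```

The line count and the expression count differ because a comment is not an expression, one expression spans two lines, and one line holds two expressions.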

Code
attributes(parse(path_to_script, keep.source = TRUE))
$srcref
$srcref[[1]]
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

$srcref[[2]]
my_df$col_c <- my_df$col_b + 2

$srcref[[3]]
my_df

$srcref[[4]]
sum(my_df$col_b)

$srcref[[5]]
mean(my_df$col_c)


$srcfile
C:/Users/gsmolinski/Documents/R/R-4.2.1/library/tableboom/vignette_helpers/script_code_as_data.R 

$wholeSrcref
# define data
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b

my_df

sum(my_df$col_b); mean(my_df$col_c)

Now, getParseData() returns a table with precise information about each element in source code.

Code
getParseData(parse(path_to_script, keep.source = TRUE)) |> 
 stylize_table()
line1 col1 line2 col2 id parent token terminal text
1 1 1 13 1 -57 COMMENT TRUE # define data
2 1 3 36 57 0 expr FALSE
2 1 2 5 4 6 SYMBOL TRUE my_df
2 1 2 5 6 57 expr FALSE
2 7 2 8 5 57 LEFT_ASSIGN TRUE <-
2 10 3 36 55 57 expr FALSE
2 10 2 19 7 9 SYMBOL_FUNCTION_CALL TRUE data.frame
2 10 2 19 9 55 expr FALSE
2 20 2 20 8 55 ‘(’ TRUE (
2 21 2 25 10 55 SYMBOL_SUB TRUE col_a
2 27 2 27 11 55 EQ_SUB TRUE =
2 29 2 43 27 55 expr FALSE
2 29 2 29 12 14 SYMBOL_FUNCTION_CALL TRUE c
2 29 2 29 14 27 expr FALSE
2 30 2 30 13 27 ‘(’ TRUE (
2 31 2 35 15 17 STR_CONST TRUE “one”
2 31 2 35 17 27 expr FALSE
2 36 2 36 16 27 ‘,’ TRUE ,
2 38 2 42 21 23 STR_CONST TRUE “two”
2 38 2 42 23 27 expr FALSE
2 43 2 43 22 27 ‘)’ TRUE )
2 44 2 44 28 55 ‘,’ TRUE ,
3 21 3 25 33 55 SYMBOL_SUB TRUE col_b
3 27 3 27 34 55 EQ_SUB TRUE =
3 29 3 35 50 55 expr FALSE
3 29 3 29 35 37 SYMBOL_FUNCTION_CALL TRUE c
3 29 3 29 37 50 expr FALSE
3 30 3 30 36 50 ‘(’ TRUE (
3 31 3 31 38 39 NUM_CONST TRUE 1
3 31 3 31 39 50 expr FALSE
3 32 3 32 40 50 ‘,’ TRUE ,
3 34 3 34 44 45 NUM_CONST TRUE 2
3 34 3 34 45 50 expr FALSE
3 35 3 35 46 50 ‘)’ TRUE )
3 36 3 36 51 55 ‘)’ TRUE )
5 1 5 30 79 0 expr FALSE
5 1 5 11 66 79 expr FALSE
5 1 5 5 62 64 SYMBOL TRUE my_df
5 1 5 5 64 66 expr FALSE
5 6 5 6 63 66 ‘$’ TRUE $
5 7 5 11 65 66 SYMBOL TRUE col_c
5 13 5 14 67 79 LEFT_ASSIGN TRUE <-
5 16 5 30 78 79 expr FALSE
5 16 5 26 72 78 expr FALSE
5 16 5 20 68 70 SYMBOL TRUE my_df
5 16 5 20 70 72 expr FALSE
5 21 5 21 69 72 ‘$’ TRUE $
5 22 5 26 71 72 SYMBOL TRUE col_b
5 28 5 28 73 78 ‘+’ TRUE +
5 30 5 30 74 75 NUM_CONST TRUE 2
5 30 5 30 75 78 expr FALSE
5 32 5 59 76 -79 COMMENT TRUE # add 2 to each row in col_b
7 1 7 5 84 86 SYMBOL TRUE my_df
7 1 7 5 86 0 expr FALSE
9 1 9 16 103 0 expr FALSE
9 1 9 3 91 93 SYMBOL_FUNCTION_CALL TRUE sum
9 1 9 3 93 103 expr FALSE
9 4 9 4 92 103 ‘(’ TRUE (
9 5 9 15 98 103 expr FALSE
9 5 9 9 94 96 SYMBOL TRUE my_df
9 5 9 9 96 98 expr FALSE
9 10 9 10 95 98 ‘$’ TRUE $
9 11 9 15 97 98 SYMBOL TRUE col_b
9 16 9 16 99 103 ‘)’ TRUE )
9 17 9 17 104 0 ‘;’ TRUE ;
9 19 9 35 119 0 expr FALSE
9 19 9 22 107 109 SYMBOL_FUNCTION_CALL TRUE mean
9 19 9 22 109 119 expr FALSE
9 23 9 23 108 119 ‘(’ TRUE (
9 24 9 34 114 119 expr FALSE
9 24 9 28 110 112 SYMBOL TRUE my_df
9 24 9 28 112 114 expr FALSE
9 29 9 29 111 114 ‘$’ TRUE $
9 30 9 34 113 114 SYMBOL TRUE col_c
9 35 9 35 115 119 ‘)’ TRUE )

From this table, we would like to retrieve information about the expr token. expr stands for expression, and we can think of an expression as text with structure:

An expression is any member of the set of base types created by parsing code: constant scalars, symbols, call objects, and pairlists. [2]

This is not visible in the table above, because the text (the value in the text column) is hidden for expr rows by default, but we can override this with includeText = TRUE.

Code
getParseData(parse(path_to_script, keep.source = TRUE), includeText = TRUE) |> 
  head(20) |> 
  stylize_table()
line1 col1 line2 col2 id parent token terminal text
1 1 1 13 1 -57 COMMENT TRUE # define data
2 1 3 36 57 0 expr FALSE my_df <- data.frame(col_a = c(“one”, “two”), col_b = c(1, 2))
2 1 2 5 4 6 SYMBOL TRUE my_df
2 1 2 5 6 57 expr FALSE my_df
2 7 2 8 5 57 LEFT_ASSIGN TRUE <-
2 10 3 36 55 57 expr FALSE data.frame(col_a = c(“one”, “two”), col_b = c(1, 2))
2 10 2 19 7 9 SYMBOL_FUNCTION_CALL TRUE data.frame
2 10 2 19 9 55 expr FALSE data.frame
2 20 2 20 8 55 ‘(’ TRUE (
2 21 2 25 10 55 SYMBOL_SUB TRUE col_a
2 27 2 27 11 55 EQ_SUB TRUE =
2 29 2 43 27 55 expr FALSE c(“one”, “two”)
2 29 2 29 12 14 SYMBOL_FUNCTION_CALL TRUE c
2 29 2 29 14 27 expr FALSE c
2 30 2 30 13 27 ‘(’ TRUE (
2 31 2 35 15 17 STR_CONST TRUE “one”
2 31 2 35 17 27 expr FALSE “one”
2 36 2 36 16 27 ‘,’ TRUE ,
2 38 2 42 21 23 STR_CONST TRUE “two”
2 38 2 42 23 27 expr FALSE “two”

An expression is built from the base types created by parsing code and from other expressions, i.e. in the script an expression can be nested in other expression(s). {boomer} lets us inspect the intermediate steps of a call, where a call means a function call (usage). Thus we could say that all we need is to find the top-level expressions, which is true, but what about the difference between an expression and a call? Do we need to care about it? Actually, yes.

Taking the {boomer} description for granted, we need to remember that boomer::boom() inspects a call and, as we saw in the table above, an expr can be an expression that does not contain a call:

Code
getParseData(parse(path_to_script, keep.source = TRUE), includeText = TRUE)[52:55, ] |> 
  stylize_table()
line1 col1 line2 col2 id parent token terminal text
5 32 5 59 76 -79 COMMENT TRUE # add 2 to each row in col_b
7 1 7 5 84 86 SYMBOL TRUE my_df
7 1 7 5 86 0 expr FALSE my_df
9 1 9 16 103 0 expr FALSE sum(my_df$col_b)

my_df is a non-nested expression (the value in the parent column is 0) and it is just a SYMBOL, not a SYMBOL_FUNCTION_CALL (the token that marks a call). boomer::boom() does not work if there is no call at all:

Code
boomer::boom(1, print = dplyr::glimpse)
[1] 1
boomer::boom(c(1), print = dplyr::glimpse)
<  >  c(1) 
 num 1
[1] 1
identical(1, c(1))
[1] TRUE

The output is different: in the first case 1 is just printed, not inspected. This holds even though there is no practical difference between 1 and c(1). The consequence is that if something is not a call, we need something other than boomer::boom(). Luckily, we can directly use the function passed as an argument to boomer::boom() above, i.e. glimpse() from the {dplyr} package [3].

Code
dplyr::glimpse(1)
 num 1

To differentiate between a call and any other item, we can check whether the row right after a top-level expression (i.e. the next row in the table) is again an expr token with parent equal to 0. If it is not, we have a call.

Our aim is to construct a data.frame with information about each expr that has 0 as its parent (i.e. non-nested expressions); we could end up with a result like the one below [4]:

Code
parse_data <- getParseData(parse(path_to_script, keep.source = TRUE))
tableboom:::find_exprs(parse_data) |> 
  stylize_table()
line1 col1 line2 col2 id fun
2 1 3 36 57 boomer::boom
5 1 5 30 79 boomer::boom
7 1 7 5 86 dplyr::glimpse
9 1 9 16 103 boomer::boom
9 19 9 35 119 boomer::boom
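A table of this shape can be sketched with base R alone. The sketch below is not how tableboom:::find_exprs is implemented; it is an assumption-laden alternative that keeps the top-level expr rows from the parse data and decides between the two wrappers with is.call() on the parsed expressions, instead of the next-row check described above. It also assumes that assignments in the script use `<-`, so that every top-level expression is reported with the expr token.

```r
# The example script, held in memory so that the sketch is self-contained
script <- c(
  '# define data',
  'my_df <- data.frame(col_a = c("one", "two"),',
  '                    col_b = c(1, 2))',
  '',
  'my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b',
  '',
  'my_df',
  '',
  'sum(my_df$col_b); mean(my_df$col_c)'
)

parsed <- parse(text = script, keep.source = TRUE)
parse_data <- utils::getParseData(parsed)

# Non-nested expressions: token "expr" with parent 0
top <- parse_data[parse_data$parent == 0 & parse_data$token == "expr",
                  c("line1", "col1", "line2", "col2", "id")]

# One parsed element per top-level expression, in the same order as the rows
is_call <- vapply(as.list(parsed), is.call, logical(1))
stopifnot(length(is_call) == nrow(top))

# Calls can be inspected by boom(); anything else falls back to glimpse()
top$fun <- ifelse(is_call, "boomer::boom", "dplyr::glimpse")
top
```

Only the function names are stored as strings here, so neither {boomer} nor {dplyr} has to be installed to build the table.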

Script modification

We already know where and what to put in the script to modify it (this information was retrieved at the end of the previous chapter). To summarize: we want to wrap each top-level expression in some function (boomer::boom() or dplyr::glimpse()).

In the simplest but naive approach, we could use the starting line (column line1), the end line (column line2) and the function to use (column fun) and just paste everything together (at the beginning of the line it would be e.g. "boomer::boom(" and at the end of the line a closing bracket, ")"). However, problems occur if there is a comment after the expression (we would paste the closing bracket at the end of the comment, not of the expression) or if one line holds more than one call (separated by a semicolon), which fails for the same reason. These problems exist because on the one hand we talk about expressions as language elements, and on the other hand about the text file in which the expressions live (so in this case we treat source code as text).
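Both failure modes can be checked without running anything, because the naively wrapped lines are not even valid R. In the sketch below, `naive_wrap()` and `parses()` are helper names made up for illustration; only parse() is used, so {boomer} does not need to be installed.

```r
# Two lines taken from the example script
with_comment <- "my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b"
two_calls    <- "sum(my_df$col_b); mean(my_df$col_c)"

# Naive wrapping: prefix at the start of the line, bracket at the very end
naive_wrap <- function(line, fun = "boomer::boom") paste0(fun, "(", line, ")")

# TRUE if the text is syntactically valid R (nothing is evaluated here)
parses <- function(text) {
  tryCatch({
    parse(text = text)
    TRUE
  }, error = function(e) FALSE)
}

naive_wrap(with_comment)         # the closing bracket ends up inside the comment
parses(naive_wrap(with_comment))
parses(naive_wrap(two_calls))    # a semicolon is not allowed inside a call
parses("boomer::boom(sum(my_df$col_b)); boomer::boom(mean(my_df$col_c))")
```

Only the last string, where each call gets its own wrapper, is accepted by the parser.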

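Luckily, in addition to the starting and ending line, we also know exactly where the first character of the expression is (col1) and where the last one is (col2). Thus we can imagine the following steps: read the script as lines of text; for each top-level expression, insert the function name and an opening bracket just before col1 on line1, then insert a closing bracket just after col2 on line2, in both cases shifting the position by the number of characters already added to that line; finally, write the modified lines to a temporary file.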

These steps in the form of function could look like this:

Code
tableboom:::insert_fun
function(exprs_df, temp_path, script_path) {
  file <- readLines(script_path, warn = FALSE)
  file_orig <- file
  for (i in seq_along(rownames(exprs_df))) {

    split_str <- unlist(strsplit(file[exprs_df[i, "line1"]], ""), use.names = FALSE)
    chars_added <- nchar(file[exprs_df[i, "line1"]]) - nchar(file_orig[exprs_df[i, "line1"]])
    # minus 1 below, because we want to add function "before" first character, not "after"
    str_modified <- append(split_str, paste0(exprs_df[i, "fun"], "("), after = (exprs_df[i, "col1"] - 1) + chars_added)
    file[exprs_df[i, "line1"]] <- paste0(str_modified, collapse = "")

    split_str <- unlist(strsplit(file[exprs_df[i, "line2"]], ""), use.names = FALSE)
    chars_added <- nchar(file[exprs_df[i, "line2"]]) - nchar(file_orig[exprs_df[i, "line2"]])
    str_modified <- append(split_str, ")", after = exprs_df[i, "col2"] + chars_added)
    file[exprs_df[i, "line2"]] <- paste0(str_modified, collapse = "")

  }
  writeLines(file, temp_path)
}
<bytecode: 0x00000236f6a28c48>
<environment: namespace:tableboom>
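The role of chars_added is easiest to see on the one line that holds two calls. The sketch below is a simplified, single-line illustration of the same idea, not the package function itself; `insert_after()` is a made-up helper, and the column positions are copied from the getParseData() table shown earlier.

```r
# Insert `text` into `line` so that it follows character position `after`
insert_after <- function(line, text, after) {
  paste0(substr(line, 1, after), text, substr(line, after + 1, nchar(line)))
}

orig <- "sum(my_df$col_b); mean(my_df$col_c)"
# Positions of the two top-level expressions, as reported by getParseData()
exprs <- data.frame(col1 = c(1, 19), col2 = c(16, 35))

line <- orig
for (i in seq_len(nrow(exprs))) {
  # characters added so far shift every original column to the right
  shift <- nchar(line) - nchar(orig)
  line <- insert_after(line, "boomer::boom(", exprs$col1[i] - 1 + shift)

  shift <- nchar(line) - nchar(orig)
  line <- insert_after(line, ")", exprs$col2[i] + shift)
}
line
```

Without the shift, the second wrapper would be inserted at columns that were only valid for the unmodified line. The approach relies on the expressions being processed from left to right, so that everything already inserted lies before the current position.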

And the modified script would look like this:

Code
top_exprs <- tableboom:::find_exprs(parse_data)
temp_path <- tempfile(fileext = ".R")
tableboom:::insert_fun(top_exprs, temp_path, path_to_script)
# define data
boomer::boom(my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2)))

boomer::boom(my_df$col_c <- my_df$col_b + 2) # add 2 to each row in col_b

dplyr::glimpse(my_df)

boomer::boom(sum(my_df$col_b)); boomer::boom(mean(my_df$col_c))

Evaluation

Evaluation is, for our purposes, the same as executing code. We have used parse() many times before: when a script is passed to it, it returns all top-level expressions as a list-like object (an expression vector). This object can be subset to get nested expressions or simpler elements. Its length is the number of top-level expressions; it doesn’t contain e.g. comments.
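A short sketch of that subsetting, on a made-up three-line script:

```r
exprs <- parse(text = c("# a comment", "x <- c(1, 2)", "x"), keep.source = FALSE)

length(exprs)       # top-level expressions only; the comment is not counted
first <- exprs[[1]] # the call `x <- c(1, 2)`
first[[1]]          # the function being called: `<-`
first[[3]]          # a nested call: c(1, 2)
first[[3]][[2]]     # a constant
class(exprs[[2]])   # the second top-level expression is a bare symbol
```

The last line is the same situation as my_df in the example script: a top-level expression that is not a call.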

Having modified the script, we now need to read it, run the code and capture the output (i.e. the output from boomer::boom() or dplyr::glimpse()). And because we are now dealing with language elements rather than text, we won’t use readLines() but, of course, parse().

The eval() function performs the evaluation. Before we move on to our case, it is worth showing its usage on a simpler example:

Code
parse(text = "2 + 2") # expression before evaluation, see also expression() or quote()
expression(2 + 2)
typeof(parse(text = "2 + 2")) # expression type
[1] "expression"
eval(parse(text = "2 + 2")) # evaluation
[1] 4

Often it may be safer to evaluate an expression in a different environment, especially if we don’t know how the expression will affect the default environment. Evaluation can e.g. define a variable, and if a variable of the same name already exists in our environment, overwriting it is a side effect we most likely don’t want:

Code
a <- 3
eval(parse(text = "a <- 2 + 2"))
a
[1] 4

a <- 3
e <- new.env()
eval(parse(text = "a <- 2 + 2"), envir = e)
a
[1] 3
e$a
[1] 4

One could think that a simple alternative to parse() and eval() would be source(), but we want to evaluate the expressions one by one so that we can capture the output of each. We also want to do two additional things:

Code
tableboom:::get_output # just a helper function
function(parsed_file, envir) {
  utils::capture.output(try(eval(parsed_file, envir = envir), silent = TRUE))
}
<bytecode: 0x00000236f9ee6370>
<environment: namespace:tableboom>

tableboom:::eval_file # main function to eval original and modified script
function(parsed_mod_file, parsed_orig_file) {
  e <- new.env()
  output <- vector("list", length(parsed_mod_file))
  for (i in seq_along(parsed_orig_file)) {
    try(eval(parsed_orig_file[[i]], envir = e), silent = TRUE)
    output[[i]] <- tableboom:::get_output(parsed_mod_file[[i]], envir = e)
  }
  output
}
<bytecode: 0x00000236f9f65098>
<environment: namespace:tableboom>
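The pattern of evaluating expressions one by one, in a shared environment, while capturing what each of them prints can be tried on a tiny made-up script. This is a simplified sketch: unlike eval_file() it works on a single parsed script and evaluates each expression once.

```r
exprs <- parse(
  text = c("x <- 1:3", "sum(x)", "stop('something went wrong')", "x * 2"),
  keep.source = FALSE
)

e <- new.env()
output <- vector("list", length(exprs))
for (i in seq_along(exprs)) {
  # try() keeps the loop going when one expression fails
  output[[i]] <- utils::capture.output(
    try(eval(exprs[[i]], envir = e), silent = TRUE)
  )
}

output[[2]]
output[[4]]
exists("x", envir = e, inherits = FALSE)
```

Because all expressions share the environment `e`, the value assigned by the first expression is available to the later ones, and the failing third expression does not prevent the fourth from being evaluated.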

To capture the output, capture.output() is used, and the captured output is then trimmed a little (unimportant parts are removed by other functions in {tableboom}). All of this gives us access to the original source code and to the output from the modified source code, which we can now combine as we want, e.g. in a table where one row contains the original code (expression) and the next row contains the output of the inspecting function for the same expression.

Dealing with data we don’t know upfront is always risky, and that is also true when the code is the data: many things may happen and it is very hard to anticipate everything. Although we have talked about a way to avoid side effects when evaluating an expression (using a separate environment), a script can contain many things for which a new environment won’t be enough (e.g. <<- or a library() call):

Code
"package:tableboom" %in% search()
[1] FALSE
eval(quote(library(tableboom)), envir = new.env())
"package:tableboom" %in% search()
[1] TRUE
detach(package:tableboom)
"package:tableboom" %in% search()
[1] FALSE

We have used a new environment, but the search path has been affected. The safest way to evaluate an expression is to do it in a separate R session, and we can do this using the {callr} package [5].

Code
"package:tableboom" %in% search()
[1] FALSE
# the code below returns the value of the `library()` call, i.e. the names
# of the attached packages, so we can see that {tableboom} is present in the
# separate R session
callr::r(function() eval(quote(library(tableboom))))
[1] "tableboom" "stats"     "graphics"  "grDevices" "utils"     "datasets" 
[7] "methods"   "base"     
"package:tableboom" %in% search() # but is not present in our session
[1] FALSE
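The other leak mentioned earlier, `<<-`, can be shown without attaching any package. In this minimal sketch `outer` and `inner` are names made up for illustration; `inner` plays the role of the supposedly safe evaluation environment.

```r
outer <- new.env()
assign("counter", 0, envir = outer)

# `inner` plays the role of the "safe" evaluation environment
inner <- new.env(parent = outer)
eval(quote(counter <<- counter + 1), envir = inner)

get("counter", envir = outer)                      # modified outside `inner`
exists("counter", envir = inner, inherits = FALSE) # nothing was created in `inner`
```

The super-assignment skips the evaluation environment and modifies the variable it finds in an enclosing one, which is exactly the kind of side effect a separate R session avoids.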

At the end of this chapter, let’s do something a little more complicated that shows that {callr} indeed works very well for our purposes: we will use a function from {dplyr} which would not run (an error would occur) if {dplyr} is not attached.

Code
"package:dplyr" %in% search() # dplyr is not loaded
[1] FALSE
tryCatch(mutate(data.frame(a = 1), new_col = "new"),
         error = function(e) "Error occurred") # error, because can't find `mutate()` function
[1] "Error occurred"

callr::r(function() {
  invisible(library(dplyr))
  mutate(data.frame(a = 1), new_col = "new")
  }
) # data.frame is returned correctly
  a new_col
1 1     new

"package:dplyr" %in% search() # dplyr is still not loaded
[1] FALSE

Summary

Our path led us from the difference between text (words) and source code (expressions), where we used both views of a script (readLines(), parse(), getParseData()) to load code as data, through script modification (in this second step we stuck to string manipulation; language objects or expressions can also be manipulated, but it is tricky to keep the srcref attribute after manipulation), to code evaluation (eval()). It was a short road: many things have been left aside, mostly because of the limits of my knowledge. I think, however, that the signposts we saw (Advanced R, quote(), the language type, environments) will be a good starting point for further exploration if needed.


  1. Fabri A (2021). boomer: Debugging Tools to Inspect the Intermediate Steps of a Call. R package version 0.1.0, https://CRAN.R-project.org/package=boomer.

  2. Wickham H. Advanced R. Second Edition, chapter 18.3.

  3. Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.10, https://CRAN.R-project.org/package=dplyr.

  4. In the code block we use the triple colon to get a function which exists in the package but is not exported, i.e. not intended to be used directly by the package user.

  5. Csárdi G, Chang W (2022). callr: Call R from R. R package version 3.7.2, https://CRAN.R-project.org/package=callr.