Treating Code As Data

Intention

This document attempts to reproduce the main steps taken while working on the {tableboom} package. I hope it will be useful for me in the future, and also for any reader interested in metaprogramming. Admittedly, perhaps the most interesting part happens not in {tableboom} but in {boomer}¹, because that is where the core algorithm used to inspect calls lives. Please note, then, that this vignette focuses only on what happens before and after a call is wrapped in boomer::boom(), not on what happens inside it. The main difference between {tableboom} and {boomer} is that the former makes it possible to inspect a whole R script (i.e. all top-level calls in the script), while the latter assumes that the user will manually point to a chosen call by wrapping it in boomer::boom(). The problem I focus on is therefore: how can we automatically wrap all calls in some function?

Expression

Code

path_to_script <- file.path(system.file(package = "tableboom", "vignette_helpers"),
                            "script_code_as_data.R")

Let’s take the script below as an example (with some traps!):

# define data
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b

my_df

sum(my_df$col_b); mean(my_df$col_c)

The traps are:

a comment on its own line
a comment on the same line as a call
two calls on the same line, separated by a semicolon

Although in theory we could treat the script as plain text (and read it using e.g. readLines()), that won’t take us very far, because source code is not just words. Instead, we can use getParseData() to find the calls we need. This function needs an object that carries a srcref attribute. That can be a function (if it was sourced with srcref kept, e.g. when a function in the script is sourced using Ctrl + Shift + Enter in RStudio IDE on Windows, not just Ctrl + Enter) or an object returned by parse().
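The dependence on srcref can be checked without the package. The sketch below is only an illustration: it assumes a small inline script (the `code` vector is made up for this example) instead of the vignette's file.

```r
# A tiny, made-up script passed as text instead of a file path
code <- c("# a comment", "x <- c(1, 2)", "sum(x)")

# With keep.source = TRUE the parsed object carries srcref information ...
with_src <- parse(text = code, keep.source = TRUE)
pd_with <- utils::getParseData(with_src)

# ... and without it there is nothing for getParseData() to work with
without_src <- parse(text = code, keep.source = FALSE)
pd_without <- utils::getParseData(without_src)

is.data.frame(pd_with)          # parse data is available
is.null(pd_without)             # no srcref, so no parse data
"COMMENT" %in% pd_with$token    # comments are kept as tokens
```

The comment survives only in the parse data; the parsed expressions themselves do not contain it.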

Code

attributes(parse(path_to_script, keep.source = TRUE))
$srcref
$srcref[[1]]
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

$srcref[[2]]
my_df$col_c <- my_df$col_b + 2

$srcref[[3]]
my_df

$srcref[[4]]
sum(my_df$col_b)

$srcref[[5]]
mean(my_df$col_c)


$srcfile
C:/Users/gsmolinski/Documents/R/R-4.2.1/library/tableboom/vignette_helpers/script_code_as_data.R 

$wholeSrcref
# define data
my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2))

my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b

my_df

sum(my_df$col_b); mean(my_df$col_c)

Now, getParseData() returns a table with precise information about each element of the source code.

Code

getParseData(parse(path_to_script, keep.source = TRUE)) |> 
 stylize_table()

line1	col1	line2	col2	id	parent	token	terminal	text
1	1	1	13	1	-57	COMMENT	TRUE	# define data
2	1	3	36	57	0	expr	FALSE
2	1	2	5	4	6	SYMBOL	TRUE	my_df
2	1	2	5	6	57	expr	FALSE
2	7	2	8	5	57	LEFT_ASSIGN	TRUE	<-
2	10	3	36	55	57	expr	FALSE
2	10	2	19	7	9	SYMBOL_FUNCTION_CALL	TRUE	data.frame
2	10	2	19	9	55	expr	FALSE
2	20	2	20	8	55	‘(’	TRUE	(
2	21	2	25	10	55	SYMBOL_SUB	TRUE	col_a
2	27	2	27	11	55	EQ_SUB	TRUE	=
2	29	2	43	27	55	expr	FALSE
2	29	2	29	12	14	SYMBOL_FUNCTION_CALL	TRUE	c
2	29	2	29	14	27	expr	FALSE
2	30	2	30	13	27	‘(’	TRUE	(
2	31	2	35	15	17	STR_CONST	TRUE	“one”
2	31	2	35	17	27	expr	FALSE
2	36	2	36	16	27	‘,’	TRUE	,
2	38	2	42	21	23	STR_CONST	TRUE	“two”
2	38	2	42	23	27	expr	FALSE
2	43	2	43	22	27	‘)’	TRUE	)
2	44	2	44	28	55	‘,’	TRUE	,
3	21	3	25	33	55	SYMBOL_SUB	TRUE	col_b
3	27	3	27	34	55	EQ_SUB	TRUE	=
3	29	3	35	50	55	expr	FALSE
3	29	3	29	35	37	SYMBOL_FUNCTION_CALL	TRUE	c
3	29	3	29	37	50	expr	FALSE
3	30	3	30	36	50	‘(’	TRUE	(
3	31	3	31	38	39	NUM_CONST	TRUE	1
3	31	3	31	39	50	expr	FALSE
3	32	3	32	40	50	‘,’	TRUE	,
3	34	3	34	44	45	NUM_CONST	TRUE	2
3	34	3	34	45	50	expr	FALSE
3	35	3	35	46	50	‘)’	TRUE	)
3	36	3	36	51	55	‘)’	TRUE	)
5	1	5	30	79	0	expr	FALSE
5	1	5	11	66	79	expr	FALSE
5	1	5	5	62	64	SYMBOL	TRUE	my_df
5	1	5	5	64	66	expr	FALSE
5	6	5	6	63	66	‘$’	TRUE	$
5	7	5	11	65	66	SYMBOL	TRUE	col_c
5	13	5	14	67	79	LEFT_ASSIGN	TRUE	<-
5	16	5	30	78	79	expr	FALSE
5	16	5	26	72	78	expr	FALSE
5	16	5	20	68	70	SYMBOL	TRUE	my_df
5	16	5	20	70	72	expr	FALSE
5	21	5	21	69	72	‘$’	TRUE	$
5	22	5	26	71	72	SYMBOL	TRUE	col_b
5	28	5	28	73	78	‘+’	TRUE	+
5	30	5	30	74	75	NUM_CONST	TRUE	2
5	30	5	30	75	78	expr	FALSE
5	32	5	59	76	-79	COMMENT	TRUE	# add 2 to each row in col_b
7	1	7	5	84	86	SYMBOL	TRUE	my_df
7	1	7	5	86	0	expr	FALSE
9	1	9	16	103	0	expr	FALSE
9	1	9	3	91	93	SYMBOL_FUNCTION_CALL	TRUE	sum
9	1	9	3	93	103	expr	FALSE
9	4	9	4	92	103	‘(’	TRUE	(
9	5	9	15	98	103	expr	FALSE
9	5	9	9	94	96	SYMBOL	TRUE	my_df
9	5	9	9	96	98	expr	FALSE
9	10	9	10	95	98	‘$’	TRUE	$
9	11	9	15	97	98	SYMBOL	TRUE	col_b
9	16	9	16	99	103	‘)’	TRUE	)
9	17	9	17	104	0	‘;’	TRUE	;
9	19	9	35	119	0	expr	FALSE
9	19	9	22	107	109	SYMBOL_FUNCTION_CALL	TRUE	mean
9	19	9	22	109	119	expr	FALSE
9	23	9	23	108	119	‘(’	TRUE	(
9	24	9	34	114	119	expr	FALSE
9	24	9	28	110	112	SYMBOL	TRUE	my_df
9	24	9	28	112	114	expr	FALSE
9	29	9	29	111	114	‘$’	TRUE	$
9	30	9	34	113	114	SYMBOL	TRUE	col_c
9	35	9	35	115	119	‘)’	TRUE	)

From this table, we want to retrieve the rows with the expr token. expr stands for expression, which can be loosely described as a piece of code that the parser treats as a single unit:

An expression is any member of the set of base types created by parsing code: constant scalars, symbols, call objects, and pairlists.²

This is not visible in the table above, because the text column is empty for expr rows by default, but we can override this with includeText = TRUE.

Code

getParseData(parse(path_to_script, keep.source = TRUE), includeText = TRUE) |> 
  head(20) |> 
  stylize_table()

line1	col1	line2	col2	id	parent	token	terminal	text
1	1	1	13	1	-57	COMMENT	TRUE	# define data
2	1	3	36	57	0	expr	FALSE	my_df <- data.frame(col_a = c(“one”, “two”), col_b = c(1, 2))
2	1	2	5	4	6	SYMBOL	TRUE	my_df
2	1	2	5	6	57	expr	FALSE	my_df
2	7	2	8	5	57	LEFT_ASSIGN	TRUE	<-
2	10	3	36	55	57	expr	FALSE	data.frame(col_a = c(“one”, “two”), col_b = c(1, 2))
2	10	2	19	7	9	SYMBOL_FUNCTION_CALL	TRUE	data.frame
2	10	2	19	9	55	expr	FALSE	data.frame
2	20	2	20	8	55	‘(’	TRUE	(
2	21	2	25	10	55	SYMBOL_SUB	TRUE	col_a
2	27	2	27	11	55	EQ_SUB	TRUE	=
2	29	2	43	27	55	expr	FALSE	c(“one”, “two”)
2	29	2	29	12	14	SYMBOL_FUNCTION_CALL	TRUE	c
2	29	2	29	14	27	expr	FALSE	c
2	30	2	30	13	27	‘(’	TRUE	(
2	31	2	35	15	17	STR_CONST	TRUE	“one”
2	31	2	35	17	27	expr	FALSE	“one”
2	36	2	36	16	27	‘,’	TRUE	,
2	38	2	42	21	23	STR_CONST	TRUE	“two”
2	38	2	42	23	27	expr	FALSE	“two”

An expression is built from the base types created by parsing code and from other expressions, i.e. in the script an expression can be nested inside other expression(s). {boomer} lets us inspect the intermediate steps of a call, where a call is the use of a function. We could therefore say that all we need is to find the top-level expressions, which is true, but what about the difference between an expression and a call? Do we need to care about it? Actually, yes.

Taking the {boomer} description for granted, we need to remember that boomer::boom() inspects a call and, as we saw in the table above, an expr is not necessarily a call:

Code

getParseData(parse(path_to_script, keep.source = TRUE), includeText = TRUE)[52:55, ] |> 
  stylize_table()

line1	col1	line2	col2	id	parent	token	terminal	text
5	32	5	59	76	-79	COMMENT	TRUE	# add 2 to each row in col_b
7	1	7	5	84	86	SYMBOL	TRUE	my_df
7	1	7	5	86	0	expr	FALSE	my_df
9	1	9	16	103	0	expr	FALSE	sum(my_df$col_b)

my_df is a non-nested expression (its value in the parent column is 0) and it is just a SYMBOL, not a SYMBOL_FUNCTION_CALL (which marks a call). boomer::boom() has nothing to inspect if there is no call at all:

Code

boomer::boom(1, print = dplyr::glimpse)
[1] 1
boomer::boom(c(1), print = dplyr::glimpse)
<  >  c(1) 
 num 1
[1] 1
identical(1, c(1))
[1] TRUE

The output differs: in the first case 1 is just printed, not inspected, even though there is no practical difference between 1 and c(1). The consequence is that if something is not a call, we need something other than boomer::boom(). Luckily, we can directly use the function passed as the print argument to boomer::boom() above, i.e. glimpse() from the {dplyr} package³.

Code

dplyr::glimpse(1)
 num 1

To differentiate between a call and any other item, we can look at the row that follows a top-level expr row (one with parent equal to 0). If the next row is again an expr with parent equal to 0, the current expression is not a call; if it is not, we have a call.

Our aim is to construct a data.frame describing each expr that has 0 as its parent (i.e. each non-nested expression); we could end up with a result like the one below⁴:

Code

parse_data <- getParseData(parse(path_to_script, keep.source = TRUE))
tableboom:::find_exprs(parse_data) |> 
  stylize_table()

line1	col1	line2	col2	id	fun
2	1	3	36	57	boomer::boom
5	1	5	30	79	boomer::boom
7	1	7	5	86	dplyr::glimpse
9	1	9	16	103	boomer::boom
9	19	9	35	119	boomer::boom
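A table of this shape can be approximated without the package. The sketch below is not the code of tableboom:::find_exprs(); it is a hypothetical variation that, instead of looking at the next row, counts the direct children of each top-level expr and treats an expression with a single child (a bare symbol or constant) as "not a call". The script is passed inline as the `code` vector, which is an assumption made so that the example is self-contained.

```r
# The example script from this vignette, passed as text
code <- c(
  "# define data",
  "my_df <- data.frame(col_a = c(\"one\", \"two\"),",
  "                    col_b = c(1, 2))",
  "",
  "my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b",
  "",
  "my_df",
  "",
  "sum(my_df$col_b); mean(my_df$col_c)"
)

pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Keep only non-nested expressions (parent equal to 0)
top <- pd[pd$token == "expr" & pd$parent == 0,
          c("line1", "col1", "line2", "col2", "id")]

# A call has several direct children (function, brackets, arguments, ...);
# a bare symbol or constant has exactly one
n_children <- vapply(top$id, function(id) sum(pd$parent == id), integer(1))

# Assumed mapping: calls go to boomer::boom, everything else to dplyr::glimpse
top$fun <- ifelse(n_children > 1, "boomer::boom", "dplyr::glimpse")
top
```

Comments and the semicolon do not end up in the result: comments have a negative parent and the semicolon is not an expr token.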

Script modification

We already know where to insert code in the script and what to insert (this information was retrieved at the end of the previous chapter). To summarize: we want to wrap each top-level expression in a function (boomer::boom() or dplyr::glimpse()).

In the simplest, naive approach, we could use the starting line (column line1), the ending line (column line2) and the function to use (column fun), and just paste everything together: e.g. "boomer::boom(" at the beginning of the first line and a closing bracket, ")", at the end of the last line. However, problems occur if there is a comment after the expression (the closing bracket would be pasted at the end of the comment, not the expression) or if one line contains more than one call (separated by a semicolon), which causes the same kind of problem. These problems exist because, on the one hand, we talk about expressions as language elements and, on the other hand, about the text file in which the expressions live (so in this case we treat source code as text).
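The comment trap can be reproduced on a single line. The sketch below is only an illustration: the `line` string is taken from the example script, and the column value 30 is the col2 reported for that expression in the parse data.

```r
line <- "my_df$col_c <- my_df$col_b + 2 # add 2 to each row in col_b"

# Naive approach: wrap the whole line, so ")" lands inside the comment
naive <- paste0("boomer::boom(", line, ")")
naive_result <- try(parse(text = naive), silent = TRUE)
inherits(naive_result, "try-error")  # the bracket is never closed

# Position-aware approach: close the bracket right after column 30 (col2)
col2 <- 30
fixed <- paste0(
  "boomer::boom(",
  substr(line, 1, col2),
  ")",
  substr(line, col2 + 1, nchar(line))
)
fixed
length(parse(text = fixed))  # parses as one expression again
```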

Luckily, in addition to the starting and ending lines, we also know exactly where the first character of the expression (col1) and its last character (col2) are. Thus we can imagine the following steps:

read the original script using readLines() (an alternative might be getParseText())
use the table returned by tableboom:::find_exprs() to get the starting line, first character, ending line and last character, and insert the function at the correct location
before each insertion, check whether the line to be modified has already changed - if it has, some characters were already added to this line, so the value in col1 or col2 must be shifted accordingly
save the modified script

Written as a function, these steps could look like this:

Code

tableboom:::insert_fun
function(exprs_df, temp_path, script_path) {
  file <- readLines(script_path, warn = FALSE)
  file_orig <- file
  for (i in seq_along(rownames(exprs_df))) {

    split_str <- unlist(strsplit(file[exprs_df[i, "line1"]], ""), use.names = FALSE)
    chars_added <- nchar(file[exprs_df[i, "line1"]]) - nchar(file_orig[exprs_df[i, "line1"]])
    # minus 1 below, because we want to add function "before" first character, not "after"
    str_modified <- append(split_str, paste0(exprs_df[i, "fun"], "("), after = (exprs_df[i, "col1"] - 1) + chars_added)
    file[exprs_df[i, "line1"]] <- paste0(str_modified, collapse = "")

    split_str <- unlist(strsplit(file[exprs_df[i, "line2"]], ""), use.names = FALSE)
    chars_added <- nchar(file[exprs_df[i, "line2"]]) - nchar(file_orig[exprs_df[i, "line2"]])
    str_modified <- append(split_str, ")", after = exprs_df[i, "col2"] + chars_added)
    file[exprs_df[i, "line2"]] <- paste0(str_modified, collapse = "")

  }
  writeLines(file, temp_path)
}
<bytecode: 0x00000236f6a28c48>
<environment: namespace:tableboom>

And the modified script would look like this:

Code

top_exprs <- tableboom:::find_exprs(parse_data)
temp_path <- tempfile(fileext = ".R")
tableboom:::insert_fun(top_exprs, temp_path, path_to_script)

# define data
boomer::boom(my_df <- data.frame(col_a = c("one", "two"),
                    col_b = c(1, 2)))

boomer::boom(my_df$col_c <- my_df$col_b + 2) # add 2 to each row in col_b

dplyr::glimpse(my_df)

boomer::boom(sum(my_df$col_b)); boomer::boom(mean(my_df$col_c))

Evaluation

Evaluation is - for our purposes - the same as executing code. We have used parse() many times before: when a script is passed to parse(), it returns all top-level expressions as an expression vector, which behaves much like a list. It can be subset to get nested expressions or simpler elements. Its length is the number of top-level expressions - it doesn’t contain e.g. comments.

Having modified the script, we now need to read it, run the code and capture the output (i.e. the output from boomer::boom() or dplyr::glimpse()). And because we are now dealing with language elements rather than text, we won’t use readLines() but - of course - parse().

The eval() function performs the evaluation. Before we move on to our case, it is worth showing its usage on a simpler example:
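A short sketch of what subsetting the result of parse() looks like; the three-line script is made up for this example.

```r
exprs <- parse(text = c("# a comment", "x <- 1 + 2", "x"), keep.source = FALSE)

length(exprs)               # 2: the comment is not an expression
typeof(exprs)               # "expression"

first <- exprs[[1]]         # the call `x <- 1 + 2`
is.call(first)              # TRUE
as.character(first[[1]])    # "<-", the function being called
deparse(first[[3]])         # "1 + 2", a nested call

second <- exprs[[2]]        # the bare symbol `x`
is.symbol(second)           # TRUE, so it is not a call
```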

Code

parse(text = "2 + 2") # expression before evaluation, see also expression() or quote()
expression(2 + 2)
typeof(parse(text = "2 + 2")) # expression type
[1] "expression"
eval(parse(text = "2 + 2")) # evaluation
[1] 4

It is often safer to evaluate an expression in a different environment, especially if we don’t know how the expression will affect the default environment. Evaluation can e.g. define a variable, and if a variable of the same name already exists in our environment, it will be overwritten - a side effect we most likely don’t want:

Code

a <- 3
eval(parse(text = "a <- 2 + 2"))
a
[1] 4

a <- 3
e <- new.env()
eval(parse(text = "a <- 2 + 2"), envir = e)
a
[1] 3
e$a
[1] 4

One could think that the alternative to parse() and eval() would be simply source(), but we want to evaluate the expressions one by one to be able to capture the output of each. We also want to do two additional things:

before we evaluate an expression from the modified script, we evaluate the previous expression from the original script, because boomer::boom() does not affect the environment, i.e. if we use boomer::boom(a <- 2), the variable a won’t be defined, and if the next expression needs a, we would end up with an error saying that a does not exist
if evaluating an expression returns an error, we want to continue and evaluate the next expressions; to be able to do this, we need to catch the error, so we use try() to return the error silently instead of stopping

Code

tableboom:::get_output # just a helper function
function(parsed_file, envir) {
  utils::capture.output(try(eval(parsed_file, envir = envir), silent = TRUE))
}
<bytecode: 0x00000236f9ee6370>
<environment: namespace:tableboom>

tableboom:::eval_file # main function to eval original and modified script
function(parsed_mod_file, parsed_orig_file) {
  e <- new.env()
  output <- vector("list", length(parsed_mod_file))
  for (i in seq_along(parsed_orig_file)) {
    try(eval(parsed_orig_file[[i - 1]], envir = e), silent = TRUE)
    output[[i]] <- tableboom:::get_output(parsed_mod_file[[i]], envir = e)
  }
  output
}
<bytecode: 0x00000236f9f65098>
<environment: namespace:tableboom>
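The core of this loop can be tried in isolation. The sketch below is a simplification and not the package code: it assumes a single, made-up list of expressions (no separate original and modified scripts, no boomer::boom()) and only shows that output is captured per expression and that an error does not stop the loop.

```r
# Made-up expressions; the third one fails on purpose
exprs <- parse(text = c(
  "a <- 2",
  "a + 1",
  "stop('something went wrong')",
  "a * 10"
))

e <- new.env()                       # keep side effects out of our environment
output <- vector("list", length(exprs))
for (i in seq_along(exprs)) {
  output[[i]] <- utils::capture.output(
    try(eval(exprs[[i]], envir = e), silent = TRUE)
  )
}

output[[2]]                          # printed result of `a + 1`
output[[4]]                          # still evaluated after the failure
exists("a", envir = e, inherits = FALSE)
```

Evaluating everything in `e` means the variable `a` is created there and not in the calling environment.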

To capture the output, capture.output() is used, and the result is then tidied up a little (some unimportant parts of the output are removed by other functions in {tableboom}). Note that in the first iteration of eval_file() the index i - 1 is 0, so there is no previous expression to evaluate; the resulting subsetting error is silently swallowed by try(). All of this gives us access to the original source code and to the output from the modified source code, which we can now combine as we want, e.g. in the form of a table, where one row contains the original code (expression) and the next row contains the output of the inspecting function for the same expression.

Dealing with data we don’t know upfront is always risky, and that is also true when the data is code - many things may happen and it is very hard to anticipate everything. Although we have talked about a way to avoid side effects when evaluating an expression (using a separate environment), a script may contain many things for which a new environment won’t be enough (e.g. <<- or a library() call):

Code

"package:tableboom" %in% search()
[1] FALSE
eval(quote(library(tableboom)), envir = new.env())
"package:tableboom" %in% search()
[1] TRUE
detach(package:tableboom)
"package:tableboom" %in% search()
[1] FALSE

We have used a new environment, but the search path has still been affected. The safest way to evaluate an expression is to do it in a separate R session - and we can do this using the {callr} package⁵.

Code

"package:tableboom" %in% search()
[1] FALSE
# the code below returns the value of `library()`, i.e. the names of the attached
# packages, so we can see that {tableboom} is attached in the separate R session
callr::r(function() eval(quote(library(tableboom))))
[1] "tableboom" "stats"     "graphics"  "grDevices" "utils"     "datasets" 
[7] "methods"   "base"     
"package:tableboom" %in% search() # but is not present in our session
[1] FALSE

At the end of this chapter, let’s do something a little more complicated to show that {callr} indeed works very well for our purposes - we will use a function from {dplyr} which would not run (an error would occur) if {dplyr} were not loaded.

Code

"package:dplyr" %in% search() # dplyr is not loaded
[1] FALSE
tryCatch(mutate(data.frame(a = 1), new_col = "new"),
         error = function(e) "Error occurred") # error, because can't find `mutate()` function
[1] "Error occurred"

callr::r(function() {
  invisible(library(dplyr))
  mutate(data.frame(a = 1), new_col = "new")
  }
) # data.frame is returned correctly
  a new_col
1 1     new

"package:dplyr" %in% search() # dplyr is still not loaded
[1] FALSE

Summary

Our path led us from the difference between text (words) and source code (expressions) - we used both views of a script (readLines(), parse(), getParseData()) to load code as data - through script modification (in this second step we stuck to string manipulation; language objects or expressions can also be manipulated directly, but it is tricky to keep the srcref attribute afterwards), to code evaluation (eval()). It was a short road - many things have been left aside, mostly because of the limits of my knowledge. I think, however, that the signposts we saw (Advanced R, quote(), the language type, environments) will be a good starting point for future exploration if needed.
