Reference classes in R are very useful for some situations, but they don’t come for free. In this document, I’ll explore the costs in memory and speed of standard R reference classes vs. other reference objects which are created in different ways.
I won’t cover the other major cost of reference classes, which is the complexity and fickleness of using S4 (since reference classes are built on S4).
# Some setup stuff
library(microbenchmark)
library(pryr)
library(testclasses) # For Winston's simple reference classes
Here are a number of ways of creating reference objects in R, starting with the most complicated (standard R reference class) and ending with the simplest (an environment created by a closure).
A_rc <- setRefClass("A_rc",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 1) .self$x <<- x,
    inc = function(n = 1) x <<- x + n
  )
)
This is a simpler implementation of reference objects, from the testclasses package.
B_wrc <- createRefClass("B_wrc",
  members = list(
    x = NULL,
    initialize = function(x = 1) self$x <<- x,
    inc = function(n = 1) x <<- x + n
  )
)
Objects of this type also have an automatically-created self member:
print(B_wrc$new())
## B_wrc object
## inc: function
## initialize: function
## self: environment
## x: 1
This is a variant on the previous type of reference class, but this version has public and private members.
C_wrc_priv <- createRefClass2("C_wrc_priv",
  private = list(x = NULL),
  public = list(
    initialize = function(x = 1) private$x <<- x,
    inc = function(n = 1) x <<- x + n
  )
)
Instead of a single self object which refers to all items in an object, these objects have public and private.
print(C_wrc_priv$new())
## <C_wrc_priv>
## Public:
## inc: function
## initialize: function
## private: environment
## public: environment
## Private:
## private: environment
## public: environment
## x: 1
This is simply an environment with a class attached to it.
D_closure_class <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  structure(environment(), class = "D_closure")
}
This doesn’t have a self member, although that could be added.
Even though x isn’t declared in the function body, it gets captured because it’s an argument to the function.
# Roundabout way to print the contents of a D object
str(as.list.environment(D_closure_class()))
## List of 2
## $ inc:function (n = 1)
## ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 2 10 2 36 10 36 2 2
## .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fb79b1d91f0>
## $ x : num 1
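As noted above, a self member could be added to this kind of object. Here is a minimal sketch of one way to do it. D_with_self is a hypothetical name used only for illustration, and returning self invisibly from inc() is my own choice (it allows chaining), not something the other objects in this document do.

```r
# Hypothetical variant of D_closure_class with an explicit self member
D_with_self <- function(x = 1) {
  # Capture the function's environment; this becomes the object itself
  self <- environment()
  inc <- function(n = 1) {
    # Modify the field through self rather than with <<-
    self$x <- self$x + n
    invisible(self)
  }
  structure(self, class = "D_closure")
}

obj <- D_with_self()
obj$inc()$inc(5)  # chaining works because inc() returns self
obj$x
```

Because the environment is a reference object, the self stored inside it points back at the same object that the constructor returns.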
This is the simplest type of reference object:
E_closure_noclass <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  environment()
}
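To make it concrete what "reference object" means here, this short sketch shows that a second name bound to the same object sees changes made through the first. The constructor is repeated so the example runs on its own; the variable names counter and alias are just for illustration.

```r
# Same constructor as above, repeated so this example is self-contained
E_closure_noclass <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  environment()
}

counter <- E_closure_noclass()
alias <- counter   # no copy is made; both names refer to one environment
alias$inc()        # modify through the alias
counter$x          # the change is visible through the original name
```

With an ordinary list, the assignment to alias would behave like a copy and counter would be left unchanged.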
For all the timings using microbenchmark(), the results are reported in microseconds, and the most useful value is probably the median column.
How much memory does a single instance of each object take, and how much memory does each additional object take?
# Utility functions for calculating sizes
obj_size <- function(newfun) {
  data.frame(
    one = as.numeric(object_size(newfun())),
    incremental = as.numeric(object_size(newfun(), newfun()) - object_size(newfun()))
  )
}
obj_sizes <- function(...) {
  dots <- list(...)
  sizes <- Map(obj_size, dots)
  do.call(rbind, sizes)
}
Sizes of each type of object, in bytes:
obj_sizes(
  A_rc = A_rc$new,
  B_wrc = B_wrc$new,
  C_wrc_priv = C_wrc_priv$new,
  D_closure_class = D_closure_class,
  E_closure_noclass = E_closure_noclass
)
## one incremental
## A_rc 470664 1368
## B_wrc 11080 616
## C_wrc_priv 11744 840
## D_closure_class 9304 400
## E_closure_noclass 8232 344
It looks like the standard reference class object takes up a huge amount of memory, but much of that is shared between reference classes. Adding another object from a different reference class doesn’t require much more memory, only about 38KB:
A_rc2 <- setRefClass("A_rc2",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 2) .self$x <<- x,
    inc = function(n = 2) x <<- x * n
  )
)
# Size of a new A_rc2 object, over and above an A_rc object
as.numeric(object_size(A_rc$new(), A_rc2$new()) - object_size(A_rc$new()))
## [1] 37688
How much time does it take to create one of these objects?
# Garbage collect now so that we (probably) won't do it in the middle of a run
invisible(gc())
speed <- microbenchmark(
  A_rc = A_rc$new(),
  B_wrc = B_wrc$new(),
  C_wrc_priv = C_wrc_priv$new(),
  D_closure_class = D_closure_class(),
  E_closure_noclass = E_closure_noclass(),
  unit = "us"
)
speed
## Unit: microseconds
## expr min lq median uq max neval
## A_rc 321.118 337.682 362.034 470.760 1662.109 100
## B_wrc 44.832 47.737 53.237 62.462 114.351 100
## C_wrc_priv 61.043 67.573 74.205 87.382 249.420 100
## D_closure_class 10.976 13.198 16.269 22.483 31.282 100
## E_closure_noclass 0.717 1.329 1.621 2.185 4.073 100
Standard R reference classes are much slower to instantiate than the other types of classes, with a median of 0.362 milliseconds per instantiation.
D is slower than E. The only difference between them is that the D object has a class attribute. The slowdown appears to come mostly from the structure() call made when creating a D object:
microbenchmark(
  structure = structure(1, class = "foo"),
  unit = "us"
)
## Unit: microseconds
## expr min lq median uq max neval
## structure 9.148 9.614 9.953 10.34 59.95 100
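If the structure() call is the expensive part, one option might be to set the class attribute directly with class<-. The sketch below is untimed: I haven't benchmarked it here, so whether it is actually faster is an assumption that would need to be checked. D_closure_class_alt is a hypothetical name.

```r
# Hypothetical variant of D_closure_class that avoids calling structure()
D_closure_class_alt <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  env <- environment()
  # Environments are modified in place, so this sets the attribute
  # on the object itself (env also ends up as a member of the object)
  class(env) <- "D_closure"
  env
}

d <- D_closure_class_alt()
d$inc()
d$x
```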
How much overhead is there when calling a method from one of these objects?
A <- A_rc$new()
B <- B_wrc$new()
C <- C_wrc_priv$new()
D <- D_closure_class()
E <- E_closure_noclass()
invisible(gc())
microbenchmark(
  A_rc = A$inc(),
  B_wrc = B$inc(),
  C_wrc_priv = C$inc(),
  D_closure_class = D$inc(),
  E_closure_noclass = E$inc(),
  unit = "us"
)
## Unit: microseconds
## expr min lq median uq max neval
## A_rc 37.144 68.303 70.445 74.675 391.70 100
## B_wrc 2.690 5.319 6.369 7.215 13.05 100
## C_wrc_priv 2.537 3.410 6.071 7.313 21.81 100
## D_closure_class 2.093 4.235 5.184 5.813 14.77 100
## E_closure_noclass 0.639 1.358 1.723 2.034 10.38 100
As expected, standard reference classes are the slowest by a large margin. The difference between D and E is probably due to S3 method lookup for the $ function – there could be a $.myclass method which would be called for myclass objects.
When we manually remove the class from the B object, it performs on par with E.
# Really create a reference object without a class
B2 <- B_wrc$new()
class(B2) <- NULL
microbenchmark(
  B_wrc_noclass = B2$inc(),
  E_closure_noclass = E$inc(),
  unit = "us"
)
## Unit: microseconds
## expr min lq median uq max neval
## B_wrc_noclass 0.594 0.639 0.716 0.8245 16.91 100
## E_closure_noclass 0.570 0.621 0.690 0.8485 34.30 100
With standard reference class objects, you can modify fields using the <<- operator, or by using the .self object. For example, compare the inc() methods of these two classes:
rc_no_self <- setRefClass("rc_no_self",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 1) .self$x <<- x,
    inc = function(n = 1) x <<- x + n
  )
)
rc_self <- setRefClass("rc_self",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 1) .self$x <<- x,
    inc = function(n = 1) .self$x <- x + n
  )
)
Winston’s reference classes are similar, except they use self instead of .self:
wrc_no_self <- createRefClass("wrc_no_self",
  members = list(
    x = 1,
    inc = function(n = 1) x <<- x + n
  )
)
wrc_self <- createRefClass("wrc_self",
  members = list(
    x = 1,
    inc = function(n = 1) self$x <- self$x + n
  )
)
rc_no_self_obj <- rc_no_self$new()
rc_self_obj <- rc_self$new()
wrc_no_self_obj <- wrc_no_self$new()
wrc_self_obj <- wrc_self$new()
invisible(gc())
microbenchmark(
  rc_no_self = rc_no_self_obj$inc(),
  rc_self = rc_self_obj$inc(),
  wrc_no_self = wrc_no_self_obj$inc(),
  wrc_self = wrc_self_obj$inc(),
  unit = "us"
)
## Unit: microseconds
## expr min lq median uq max neval
## rc_no_self 38.580 40.044 41.285 47.610 262.523 100
## rc_self 61.047 64.189 66.593 74.271 151.730 100
## wrc_no_self 2.819 3.567 4.181 4.717 9.865 100
## wrc_self 8.509 9.752 10.835 11.648 62.110 100
Using .self or self adds some overhead, which makes sense when you consider how R searches for objects.
When the method accesses x without using .self, R first searches in the execution environment but doesn’t find x there, so it then searches in the parent environment, finds x there, and assigns the value.
When using .self, R searches for .self in the function’s execution environment but doesn’t find it there, so it looks in the parent environment (which also happens to be the object environment, as well as the environment that .self points to) and does find it there. Then it looks in the .self environment for x, and assigns the value.
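The lookup chain described above can be checked directly on a plain closure-based object. This is a sketch using a hand-rolled object rather than a standard reference class, so it illustrates the same scoping rules but not the internals of setRefClass(). The names make_counter and describe_lookup are hypothetical.

```r
# Hypothetical closure-based object with a method that inspects its own scope
make_counter <- function(x = 1) {
  self <- environment()
  describe_lookup <- function() {
    exec_env <- environment()  # the method's execution environment
    list(
      # x is not defined in the execution environment...
      x_in_exec_env = exists("x", envir = exec_env, inherits = FALSE),
      # ...but it is found one level up, in the parent environment
      x_in_parent = exists("x", envir = parent.env(exec_env), inherits = FALSE),
      # and that parent is the object environment that self points to
      parent_is_object = identical(parent.env(exec_env), self)
    )
  }
  self
}

lookup <- make_counter()$describe_lookup()
str(lookup)
```

Going through self means one extra lookup (finding self) followed by a member access on it, which is consistent with the extra overhead in the timings.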
This compares member access time for lists vs. environments, with and without a class attribute. If the object has a class attribute, R will use method lookup for $, which adds overhead.
list_noclass <- list(x = 10)
list_class <- structure(list(x = 10), class = "foo")
env_noclass <- new.env()
env_noclass$x <- 10
env_class <- structure(new.env(), class = "foo")
env_class$x <- 10
invisible(gc())
microbenchmark(
  list_noclass = list_noclass$x,
  list_class = list_class$x,
  env_noclass = env_noclass$x,
  env_class = env_class$x,
  unit = "us"
)
## Unit: microseconds
## expr min lq median uq max neval
## list_noclass 0.200 0.2365 0.3095 0.363 6.841 100
## list_class 1.504 1.6105 1.7200 1.841 36.184 100
## env_noclass 0.201 0.2390 0.3175 0.367 10.274 100
## env_class 1.551 1.6250 1.6990 1.772 24.421 100
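To see why a class attribute makes R do method lookup for $, here is a small sketch in which a $ method really is defined for the class. The class name myclass and the calls counter are made up for this illustration. When no such method exists, R still has to look for one before falling back to the default behavior.

```r
# Count how many times the S3 method for $ is invoked
calls <- 0

# S3 method for $ on objects of (hypothetical) class "myclass";
# the member name arrives as a character string
`$.myclass` <- function(e, name) {
  calls <<- calls + 1
  # Use get() rather than $ here to avoid dispatching recursively
  get(name, envir = e, inherits = FALSE)
}

obj <- structure(new.env(), class = "myclass")
assign("x", 10, envir = obj)

value <- obj$x  # dispatches to `$.myclass`
value
calls
```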
Standard reference class objects take more memory and are slower than other, simpler types of reference objects. They do provide additional features, such as type checking of fields and class inheritance, but these aren’t, in my opinion, enough to offset the performance penalty and especially the pain of dealing with S4 (which reference classes are built on). Type checking of fields is only a minor benefit, and it would be simple to add inheritance to Winston’s reference class implementation. It would also be simple to implement private and public members, though there would be a small performance penalty.
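For reference, this is what the field type checking mentioned above looks like in practice. It is a minimal sketch: Typed_rc is a hypothetical class defined only for this example, and the error is caught with tryCatch() so the snippet runs to completion.

```r
library(methods)

# Hypothetical class with one numeric field
Typed_rc <- setRefClass("Typed_rc", fields = list(x = "numeric"))
obj <- Typed_rc$new(x = 1)

# Assigning a value of the wrong type to a typed field raises an error
outcome <- tryCatch({
  obj$x <- "not a number"
  "assigned"
}, error = function(e) "rejected")

outcome
obj$x  # the field keeps its previous value
```

The simpler environment-based objects would accept the same assignment without complaint, so any checking there has to be written by hand.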
Another drawback of standard R reference class objects is that it’s not entirely clear how they work, as many advanced R developers can attest. The other types of reference objects used here are simple and can be understood completely, as long as one understands how environments work.
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] testclasses_0.1 pryr_0.1 microbenchmark_1.3-0
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-8 evaluate_0.5.3 formatR_0.10 knitr_1.5.33
## [5] Rcpp_0.11.1 rmarkdown_0.1.95 stringr_0.6.2 tools_3.1.0
## [9] yaml_2.1.11