As time goes on, our R scripts tend to get longer and more complicated, and timing parts of them can save us precious time when re-running code over and over again. There are many different ways to benchmark R code. With small data this usually doesn't matter much, but with a lot of data we'd better make sure our code is well optimized. A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time).
In this blog post, I'll go through the syntax for using base R, and then for each of those three packages.
Sys.time()
Sys.time() takes a "snapshot" of the current time, so it can be used to record the start and end times of a piece of code.
start_time <- Sys.time()
a <- runif(5000000, 0, 100)
b <- rnorm(6000000, 0, 10)
end_time <- Sys.time()
# To calculate the elapsed time, we simply subtract the two (the result is a difftime object)
end_time - start_time
## Time difference of 1.046068 secs
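For comparison, base R also provides system.time(), which wraps an expression directly and reports user, system, and elapsed times in seconds. A minimal sketch repeating the same measurement:

system.time({
  a <- runif(5000000, 0, 100)
  b <- rnorm(6000000, 0, 10)
})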
tictoc
The functions tic and toc are used for benchmarking in the same manner as the Sys.time approach just demonstrated, but tictoc adds a lot of convenience, including nested timings. Let's see the package in action by benchmarking a linear model fit and, before that, the data generation. To start timing you call tic(), and to stop it you call toc(); the elapsed time is reported automatically.
library(tictoc)
tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()
## data generation: 5.64 sec elapsed
tic("model fitting")
model <- lm(y ~ X)
toc()
## model fitting: 80.68 sec elapsed
toc()
## total: 86.33 sec elapsed
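tictoc can also accumulate timings in a log instead of printing them right away, which is handy inside loops. Here is a small sketch, assuming the logging helpers tic.clearlog(), toc(log = TRUE), and tic.log() that the package provides:

tic.clearlog()                   # discard any previously logged timings
for (i in 1:3) {
  tic(paste("iteration", i))     # label each iteration's timer
  Sys.sleep(0.1)
  toc(log = TRUE, quiet = TRUE)  # record the timing without printing it
}
tic.log(format = TRUE)           # retrieve the logged timings as formatted text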
rbenchmark
The documentation for the function benchmark from the rbenchmark package describes it as "a simple wrapper around system.time", the base R timer mentioned above. However, it adds a lot of convenience compared to bare system.time calls. For example, a single benchmark call can time multiple replications of multiple expressions. Additionally, the returned results are conveniently organized in a data frame. As an example, let's compare the time required to compute linear regression coefficients using three alternative computational procedures: 1. lm, 2. the Moore-Penrose pseudoinverse, 3. the Moore-Penrose pseudoinverse but without an explicit matrix inverse.
library(rbenchmark)
benchmark("lm" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- lm(y ~ X + 0)$coef
},
"pseudoinverse" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X, t(X) %*% y)
},
replications = 1000,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
## test replications elapsed relative user.self sys.self
## 3 linear system 1000 0.26 1.000 0.27 0.00
## 1 lm 1000 1.47 5.654 1.47 0.00
## 2 pseudoinverse 1000 0.46 1.769 0.44 0.02
Here, elapsed, user.self, and sys.self have the same meaning as in system.time, while relative is simply the time ratio relative to the fastest test. Interestingly enough, lm is by far the slowest of the three procedures.
microbenchmark
Much like benchmark from the rbenchmark package, the function microbenchmark can be used to compare running times of multiple R code chunks, but it offers a great deal of additional convenience and functionality. One particularly nice feature of microbenchmark is the ability to automatically check the results of the benchmarked expressions with a user-specified function. The following code compares the same three methods for computing the coefficient vector of a linear model.
library(microbenchmark)
set.seed(2017)
n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
check_for_equal_coefs <- function(values) {
  # 'values' is a list holding the result of each benchmarked expression;
  # the check passes only if all pairwise differences are below the tolerance
  tol <- 1e-12
  max_error <- max(c(abs(values[[1]] - values[[2]]),
                     abs(values[[2]] - values[[3]]),
                     abs(values[[1]] - values[[3]])))
  max_error < tol
}
mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
"pseudoinverse" = {
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
b <- solve(t(X) %*% X, t(X) %*% y)
},
check = check_for_equal_coefs)
mbm
## Unit: milliseconds
## expr min lq mean median uq max neval
## lm 138.6434 165.4714 212.1756 185.5570 224.5906 871.7433 100
## pseudoinverse 160.0347 179.0712 1085.1908 207.7549 263.8873 85502.7999 100
## linear system 98.6864 108.5913 152.0144 128.6509 168.2286 470.8723 100
Notice we get a nicely formatted table of summary statistics. The times can be reported in anything from seconds down to nanoseconds. Even better, we can name the pieces of code being compared and get an easy-to-read output, like so:
sleep <- microbenchmark(sleepy = Sys.sleep(0.1),
                        sleepier = Sys.sleep(0.2),
                        sleepiest = Sys.sleep(0.3),
                        times = 10,
                        unit = "s")
One reason I like this package is its great integration with ggplot2. Here is an example showing just that:
library(ggplot2)
autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
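If ggplot2 is not available, the package also provides a boxplot method for its result objects (at least in recent versions), so a quick base-graphics view of the same timings is simply:

boxplot(mbm)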
The examples of the different benchmarking libraries given above are surely not exhaustive. Nevertheless, I drew some conclusions: