As time goes on, our R scripts tend to get longer and more complicated, and timing parts of them can save us precious time when re-running code over and over again. There are many different ways to benchmark R code. With small data this usually doesn't matter much, but with a lot of data we'd better make sure our code is well optimized. A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time).
In this blog post, I'll go through the syntax for using base R, and then for each of those three packages.
Sys.time()
Sys.time() takes a "snapshot" of the current time, so it can be used to record the start and end times of a piece of code.
start_time <- Sys.time()
a <- runif(5000000, 0, 100)
b <- rnorm(6000000, 0, 10)
end_time <- Sys.time()
# To calculate the elapsed time, we simply subtract the two (the result is a difftime object)
end_time - start_time
## Time difference of 1.046068 secs
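For comparison, base R also provides system.time(), which wraps an expression directly and reports user, system, and elapsed times in seconds. A minimal sketch repeating the same measurement:

system.time({
  a <- runif(5000000, 0, 100)
  b <- rnorm(6000000, 0, 10)
})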
tictoc
The functions tic and toc are used for benchmarking in the same manner as the Sys.time approach just demonstrated, but tictoc adds a lot of convenience, including nested timings. Let's see the package in action by benchmarking a linear model fit and, before that, the data generation. To start timing you call tic(), and to stop it you call toc(); the elapsed time is reported automatically.
library(tictoc)
tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()
## data generation: 5.64 sec elapsed
tic("model fitting")
model <- lm(y ~ X)
toc()
## model fitting: 80.68 sec elapsed
toc()
## total: 86.33 sec elapsed
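tictoc can also accumulate timings in a log instead of printing them right away, which is handy inside loops. Here is a small sketch, assuming the logging helpers tic.clearlog(), toc(log = TRUE), and tic.log() that the package provides:

tic.clearlog()                   # discard any previously logged timings
for (i in 1:3) {
  tic(paste("iteration", i))     # label each iteration's timer
  Sys.sleep(0.1)
  toc(log = TRUE, quiet = TRUE)  # record the timing without printing it
}
tic.log(format = TRUE)           # retrieve the logged timings as formatted text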
rbenchmark
The documentation for the function benchmark from the rbenchmark package describes it as "a simple wrapper around system.time", the base R timer mentioned above. However, it adds a lot of convenience compared to bare system.time calls. For example, a single benchmark call can time multiple replications of multiple expressions. Additionally, the returned results are conveniently organized in a data frame. As an example, let's compare the time required to compute linear regression coefficients using three alternative computational procedures: 1. lm, 2. the Moore-Penrose pseudoinverse, 3. the Moore-Penrose pseudoinverse but without an explicit matrix inverse.
library(rbenchmark)
benchmark("lm" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- lm(y ~ X + 0)$coef
},
"pseudoinverse" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
X <- matrix(rnorm(1000), 100, 10)
y <- X %*% sample(1:10, 10) + rnorm(100)
b <- solve(t(X) %*% X, t(X) %*% y)
},
replications = 1000,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
## test replications elapsed relative user.self sys.self
## 3 linear system 1000 0.26 1.000 0.27 0.00
## 1 lm 1000 1.47 5.654 1.47 0.00
## 2 pseudoinverse 1000 0.46 1.769 0.44 0.02
Here, elapsed, user.self, and sys.self have the same meaning as in system.time, while relative is simply the time ratio relative to the fastest test. Interestingly enough, lm is by far the slowest of the three procedures.
microbenchmark
Much like benchmark from the rbenchmark package, the function microbenchmark can be used to compare running times of multiple R code chunks, but it offers a great deal of additional convenience and functionality. One particularly nice feature of microbenchmark is the ability to automatically check the results of the benchmarked expressions with a user-specified function. The following code compares the same three methods for computing the coefficient vector of a linear model.
library(microbenchmark)
set.seed(2017)
n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
check_for_equal_coefs <- function(values) {
  # 'values' is a list holding the result of each benchmarked expression;
  # the check passes only if all pairwise differences are below the tolerance
  tol <- 1e-12
  max_error <- max(c(abs(values[[1]] - values[[2]]),
                     abs(values[[2]] - values[[3]]),
                     abs(values[[1]] - values[[3]])))
  max_error < tol
}
mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
"pseudoinverse" = {
b <- solve(t(X) %*% X) %*% t(X) %*% y
},
"linear system" = {
b <- solve(t(X) %*% X, t(X) %*% y)
},
check = check_for_equal_coefs)
mbm
## Unit: milliseconds
## expr min lq mean median uq max neval
## lm 138.6434 165.4714 212.1756 185.5570 224.5906 871.7433 100
## pseudoinverse 160.0347 179.0712 1085.1908 207.7549 263.8873 85502.7999 100
## linear system 98.6864 108.5913 152.0144 128.6509 168.2286 470.8723 100
Notice we get a nicely formatted table of summary statistics. The times can be reported in anything from seconds down to nanoseconds. Even better, we can name the pieces of code being compared and get an easy-to-read output, like so:
sleep <- microbenchmark(sleepy = Sys.sleep(0.1),
                        sleepier = Sys.sleep(0.2),
                        sleepiest = Sys.sleep(0.3),
                        times = 10,
                        unit = "s")
One reason I like this package is its great integration with ggplot2. Here is an example showing just that:
library(ggplot2)
autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
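If ggplot2 is not available, the package also provides a boxplot method for its result objects (at least in recent versions), so a quick base-graphics view of the same timings is simply:

boxplot(mbm)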
The examples of the different benchmarking libraries given above are surely not exhaustive. Nevertheless, I drew some conclusions: