The Art Of Benchmarking

R has unfortunate reputation of being slow. In a competiton of raw speed with a language like C, R would come second. This is a unfair comparison to make. Instead the total programming time is made of three components. Thinking, Coding & Running. In many statistical analysis you want to try/test multiple algorithms. Therefore you want the thinking and coding part optimised and this is R’s strength.

By the time you have loaded data in R, created a scatterplot and fitted a regression line our friendly C programmer would have just launched their code editor and would be checking on stack overflow how to read a csv file. We use R because it is good for statistics. However with the advent of big data and complex statistical algorithms we may need to optimise our code.

Note:

Premature optimization is the root of all evil. That is optimize when necessary.
Always keep R up-to-date. New versions of R rarely break code. New versions of R often provide speed boost such as improved handling of data frames so your code goes a little bit faster.(ex. v2.13 introduces byte compiler for speeding up functions, v3.0 supports large vectors.). Just by upgrading your code runs faster.

Main R releases happen in April with smaller incremental releases throughout the year.

# Print the R version details using version
version

##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.3                         
## year           2019                        
## month          03                          
## day            11                          
## svn rev        76217                       
## language       R                           
## version.string R version 3.5.3 (2019-03-11)
## nickname       Great Truth

# Assign the variable major to the major component
major <- version$major

# Assign the variable minor to the minor component
minor <- version$minor

Benchmarking

Every R programmer at one point or another has complained that my R code is slow. But, what do you mean by slow is it 1 sec slow, 1 min slow or 1 hour slow. This is obviously problem depenedent. What you need is code that is fast enough. to determine if it is worth changing your code you need to compare your existing solutions with one or more alternatives. This is Benchmarking. You simply run and time all the solutions/options that you have with all other things kept same and then select the fastest.

Benchmarking is a two step process

Construct a function around a feature you wish to benchmark. Typically function has an argument which allows you to vary the complexity of the task.(For ex. a parameter n that alters data size)
Time this function under different scenarios (ex. different data size)

Suppose you want to generte a sequence of integers

We look for three different ways this can be done
Wrap this inside a function and allow the sequence length n to be passed as an argument to the function.
To find how long the function takes, we wrap the function inside system.time()

We get three numbers

user time: CPU time charged for the execution of user instructions.

system time: CPU time charged for execution by the system on behalf ofthe calling process.

elapsed time: Approximately the sum of user and system,this is the number we typically care about.

colon <- function(n) 1:n
system.time(colon(1e8))

##    user  system elapsed 
##       0       0       0

# default n argument
seq_default <- function(n) seq(1, n)
system.time(seq_default(1e8))

##    user  system elapsed 
##       0       0       0

# explicit;y specify n size
seq_by <- function(n) seq(1, n, by = 1)
system.time(seq_by(1e8))

##    user  system elapsed 
##    1.08    0.28    1.36

Sometime we also need the result of the code. In this case we use the arrow (<-) operator. Using arrow operator inside a function call performs two tasks.Using the <- operator inside a function call will create a new (or overwrite an existing) object.This makes ‘<-’ useful inside system.time() since we can store the result.

Arguemnt passing
Object assignment

This allows us to time and store the operation. Where as the equals operator (=) performs only one thing argument passing or object assignment.

Along with calculating elapsed time it is also worth calculating relative time which is simply a ratio. In the below example you can see that using relative time we can say that seq_by function is 26 times slower than the colon function.

As with all things in R there is a package that simplifies benchmarking. The microbenchmark package is a wrapper around system.time function and makes it straight forward when comparing multiple functions. The key function in this package is microbenchmark. The times argument allows us to tell how many times should we run each function. In the result the cld column gives us a statistical ranking of each function.

library("microbenchmark")
n <- 1e8
microbenchmark(colon(n), seq_default(n), seq_by(n), times = 10) # Run each function 10 times

## Unit: nanoseconds
##            expr        min         lq       mean     median         uq
##        colon(n)        400        700     113050       3250       7400
##  seq_default(n)       6200       8800     210010      36350      69400
##       seq_by(n) 1199049600 1281619700 1348972310 1315636650 1388946600
##         max neval
##     1091700    10
##     1802900    10
##  1576603100    10

One of the most common tasks we perform is reading in data from CSV files. However, for large CSV files this can be slow. One neat trick is to read in the data and save as an R binary file (rds) using saveRDS(). To read in the rds file, we use readRDS().

Note: Since rds is R’s native format for storing single objects, you have not introduced any third-party dependencies that may change in the future.

# How long does it take to read movies from RDS?
system.time(movies_rds <- readRDS("movies.rds"))

##    user  system elapsed 
##    0.05    0.00    0.05

# Write the data to csv
write.csv(movies_rds, "movies.csv")

# How long does it take to read movies from CSV?
system.time(movies_csv <- read.csv("movies.csv"))

##    user  system elapsed 
##    0.40    0.03    0.43

The microbenchmark() function makes it easier to compare multiple function calls at once by compiling all the relevant information in a data frame. It does this by running each function call multiple times, recording the time it took for the function to run each time, then computing summary statistics for each expression as you can see here.

# Load the microbenchmark package
library("microbenchmark")

# Compare the two functions
compare <- microbenchmark(read.csv("movies.csv"), 
                          readRDS("movies.rds"), 
                          times = 10)

# Print compare
compare

## Unit: milliseconds
##                    expr      min       lq      mean    median       uq      max
##  read.csv("movies.csv") 403.9977 404.9286 448.29266 420.32195 458.0394 582.2420
##   readRDS("movies.rds")  52.5306  53.9747  55.98524  55.74855  57.7381  61.9119
##  neval
##     10
##     10

How good is your machine?

benchmarkme package allows you to know where your machine stands or helps you decide whether you need a machine upgrade. We basically run the same piece of code on our machone and compare the results.

The main function in this packages is benchmark_std(). These becnhmarks are standard R operations such as loops and matrix operations. On a normal machine this code will take around 4 minutes to run. Once the benchmark is complete you can upload/compare your result for other people to compare/use.

 # Load the benchmarkme package
library("benchmarkme")

# Assign the variable `ram` to the amount of RAM on this machine
ram <- get_ram()
ram

# Assign the variable `cpu` to the cpu specs
cpu <- get_cpu()
cpu

# Run each benchmark 3 times
res <- benchmark_std(runs = 3)
plot(res)

The benchmarkme package allows you to run a set of standardized benchmarks and compare your results to other users. One set of benchmarks tests is reading and writing speeds.The function call benchmark_io() records the length of time it takes to read and write a 5MB(size argument) file.

# Load the package
library("benchmarkme")

# Run the benchmark
res <- benchmark_io(runs = 1, size = 5)

# Plot the results
plot(res)

Fine Tuning - Efficient Base R

Memory allocation

The first rule of R club - Never ever grow a vector. Preallocation

If we had programmed in C we would have been aware of memory allocation since that language hold us accountable for reclaiming memory. In R memory allocation happens automatically. When you assign a variable R has to allocate memory in RAM to store that variable. This takes time so an important way to make your code run faster is to minimize the amount of memory allocation that R has to perform.

In the below example you can see three differnt methods to generate a sequence of integers

## Method 1 - The obvious and best way
x <- 1:n

## Method 2 - Not so bad
## Start with a vector of length n and then loop to change the vector entries. Length of x doesn't change in 
## the loop

x <- vector("numeric", n) # length n
for(i in 1:n)
 x[i] <- i

## Method 3 - Don't ever do this!
## the object starts empty and gradually fill up the vector with integers

x <- NULL # Length zero
for(i in 1:n)
 x <- c(x, i)

The reason that method 3 above is slowest because requesting memory is a relatively slow operation this introduces terrible bottlenecks. So remember never ever grow a vector. In the first two methods x is pre-allocated / length of x does not change in the loop.

The growing() function defined below generates n random standard normal numbers, but grows the size of the vector each time an element is added!

Pre-allocating the vector is significantly faster than growing the vector!

# Method 1 - Growing
growing <- function(n) {
    x = NULL
    for(i in 1:n) 
        x = c(x, rnorm(1))
    x
}

# Use `<-` with system.time() to store the result as res_grow
system.time(res_grow <- growing(30000))

##    user  system elapsed 
##    0.79    0.18    0.99

# Method 2 - Preallocating
pre_allocate <- function(n) {
    x = numeric(n) # Create a vector of length n
    for(i in 1:n) 
        x[i] = rnorm(1)
    x
}

# Use `<-` with system.time() to store the result as res_allocate
n <- 30000
system.time(res_allocate <- pre_allocate(n))

##    user  system elapsed 
##    0.04    0.00    0.04

Importance of vectorizing your code

The second rule of R club - Use a vectorized solution wherever possible. For loops in R are just slow!

When you call a R function you eventually call a C/FORTRAN function. This underlying code is heavily optimised. A general rule for making your code run faster is to access the underlying optimised C/FORTRAN code as quickly as possible. The fewer the functions call the better. This basically means vectorized code. Many R functions are vectorized example

rnorm() accepts a single number but return a vector.
mean() accepts a vector but returns a single number.

In the below example we are generating a million random numbers using standard normal distribution.We have preallocated the vector but still the vectorized version of rnorm is 40 times faster.

library(microbenchmark)
n <- 1e6
x <- vector("numeric", n)

microbenchmark(
 x <- rnorm(n),
  {
  for(i in seq_along(x)) x[i] <- rnorm(1)
  },
times = 10 )

## Unit: milliseconds
##                                              expr       min        lq
##                                     x <- rnorm(n)   51.7433   51.9396
##  {     for (i in seq_along(x)) x[i] <- rnorm(1) } 1538.0340 1567.0962
##        mean     median        uq       max neval
##    55.09353   51.95465   52.0414   83.1987    10
##  1590.35734 1580.25295 1598.5305 1682.2149    10

Why is the loop slow?

There are three parts the solution is divided into

1.Allocation: Similar in both the methods x <- vector("numeric", n)

2.Generation: In the Forloop we have one million calls to the rnorm function whereas in the vectorized version we have just one call

3.Assignment: One million calls to the assignment method in the for loop whereas there is just one assignment call in the vectroized method.

So in conclusion it is not that the for loop in R is slow it is just that there are 2 million more calls to the same function when you use the for loop.

Example 1

x <- rnorm(10)
x2 <- numeric(length(x))

microbenchmark(
x2_imp <- x * x,
for(i in 1:10)
    x2[i] <- x[i] * x[i]
,times = 10 )

## Unit: nanoseconds
##                                  expr     min      lq    mean  median      uq
##                       x2_imp <- x * x     300     400    3230    2100    4400
##  for (i in 1:10) x2[i] <- x[i] * x[i] 1689400 1943800 2211760 2153000 2517800
##      max neval
##    12500    10
##  2996700    10

Example 2

# Initial code
n <- 100
total <- 0
x <- runif(n)

microbenchmark(
for(i in 1:n) 
    total <- total + log(x[i]),

# Rewrite in a single line. Store the result in log_sum
log_sum <- sum(log(x)),

times = 10 )

## Unit: microseconds
##                                       expr    min     lq    mean  median     uq
##  for (i in 1:n) total <- total + log(x[i]) 1523.8 1660.7 2062.93 2042.50 2258.3
##                     log_sum <- sum(log(x))    4.2    5.1    9.46    5.75    8.6
##     max neval
##  3010.6    10
##    37.8    10

Data frames and matrices

The third rule of R club - Use a matrix whenever appropriate

The key data structure in R is data frame. These objects are so useful that they are copied in other languages(ex. Python Pandas Dataframes). Dataframe is a tabular structure for representing data. Columns are variables of interest and rows are observations. Column in a data frame must be of the same type but row in a data frame can be of different types.

Dataframe is actually a list of vectors where each column is a single vector. This has implications for storage. As each column is stored at a separate location you just need to find the column and retrieve the entire object. However if you want to retrieve entire row you need to find the starting locations of every single column and then select what you need which is time consuming.

Matrix have similar rectangular structure and has ususal subsetting and extracting operations. The crucial difference is that a matrix can have only one data type(no mix and match). This makes storage easier as the entire matrix is stored as one continuous block. Selecting columns is easy we find the start of the column and retrieve the data. Selecting rows is also straightforward - Find the first value and then increment along the matrix.

# Which is faster, mat[, 1] or df[, 1]? 
microbenchmark(mat[,1],df[,1])

class(df[1,])

Diagnose problems - Code profiling

Benchmarking is for timining multiple functions and selecting the optimal function. However this doesn’t help us finding the bottlenecks, if we don’t know where the bottleneck is then how are we going to compare different functions. To find bottlenecks you need to use code profiling. We run the code and take snapshots of the call stack at regular intervals. This gives you data on how long each function took to run.

R comes with a built in tool for code profiling called as Rprof(). Unfortunately Rprof() isn’t that user friendly. An alternate tool is profvis package. Also profvis is integrated into R studio so it is easy to use.

Using R Studio - Select the lines of code and just select the Proflie Selected line(s) from the Profile tab
profvis() - It records the call stack at regular intervals.This generates an interactive page that describes the amount of time we spend on each line of code. The two measurements returned by profvis are amount of memory used and time spent with each function. Don’t worry on the units but focus on the relative contribution of each component.

library(profvis)
profvis({
data(movies, package ="ggplot2movies")


braveheart <- movies[7288,]
movies <- movies[movies$Action==1,]
plot(movies$year, movies$rating, xlab = "Year", ylab = "Rating")

# local regression line
model <- loess(rating ~ year, data = movies)

j <- order(movies$year) 

lines(movies$year[j], model$fitted[j], col = "forestgreen")

points(braveheart$year, braveheart$rating, pch = 21, bg = "steelblue")
})

# Load the microbenchmark package
library("microbenchmark")

# The previous data frame solution is defined
# d() Simulates 6 dices rolls
d <- function() {
  data.frame(
    d1 = sample(1:6, 3, replace = TRUE),
    d2 = sample(1:6, 3, replace = TRUE)
  )
}

# Complete the matrix solution
m <- function() {
  matrix(sample(1:6, 6, replace = TRUE), ncol = 2)
}

# Use microbenchmark to time m() and d()
microbenchmark(
 data.frame_solution = d(),
 matrix_solution     = m()
)

## Unit: microseconds
##                 expr   min     lq    mean median    uq    max neval
##  data.frame_solution 130.7 134.05 171.728 137.15 147.6 2676.7   100
##      matrix_solution   5.0   5.80  27.270   7.35   8.1 1995.1   100

# Example data
rolls <- matrix(c(2,4,5),c(4,1,4))

## Warning in matrix(c(2, 4, 5), c(4, 1, 4)): data length [3] is not a sub-multiple
## or multiple of the number of rows [4]

# Define the previous solution 
app <- function(x) {
    apply(x, 1, sum)
}

# Define the new solution
r_sum <- function(x) {
    rowSums(x)
}

# Compare the methods
microbenchmark(
    app_sol = app(rolls),
    r_sum_sol = r_sum(rolls)
)

## Unit: microseconds
##       expr  min   lq   mean median   uq    max neval
##    app_sol 15.9 16.8 29.882   17.2 18.2 1097.8   100
##  r_sum_sol  3.2  3.4 16.168    3.6  3.8 1115.6   100

The & operator will always evaluate both its arguments. That is, if you type x & y, R will always try to work out what x and y are. There are some cases where this is inefficient. For example, if x is FALSE, then x & y will always be FALSE, regardless of the value of y. Thus, you can save a little processing time by not calculating it. The && operator takes advantage of this trick, and doesn’t bother to calculate y if it doesn’t make a difference to the overall result.

One thing to note is that && only works on single logical values, i.e., logical vectors of length 1 (like you would pass into an if condition), but & also works on vectors of length greater than 1.

# Example data
is_double <- c(FALSE,TRUE,TRUE)

# Define the previous solution
move <- function(is_double) {
    if (is_double[1] & is_double[2] & is_double[3]) {
        current <- 11 # Go To Jail
    }
}

# Define the improved solution
improved_move <- function(is_double) {
    if (is_double[1] && is_double[2] && is_double[3]) {
        current <- 11 # Go To Jail
    }
}

## microbenchmark both solutions
microbenchmark(move, improved_move, times = 1e5)

## Warning in microbenchmark(move, improved_move, times = 1e+05): Could not measure
## a positive execution time for 30268 evaluations.

## Unit: nanoseconds
##           expr min lq  mean median uq   max neval
##           move   0  0 2.028      0  0 20700 1e+05
##  improved_move   0  0 1.952      0  0 20100 1e+05

Turbo Charged Code - Parallel Programming

Faster the CPU, faster your code will run. For last decade the speed fo CPU’s have stabilised because when a CPU becomes faster it also becomes hotter and the problem is that it is difficult to keep them cool. So computer manafacturers moved from making single core machine to multi-core machine. Unfortunately R is single threaded. This means that by default it will will use only one core. But by using parallel package you can make sections of your code use all available cores.

library(parallel)

detectCores()

## [1] 8

library(benchmarkme)

get_cpu()

## $vendor_id
## [1] "GenuineIntel"
## 
## $model_name
## [1] "Intel(R) Core(TM) i5-1035G4 CPU @ 1.10GHz"
## 
## $no_of_cores
## [1] 8

What sort of problems benefit from parallel computing?

The Rule of thumb to check if a loop can leverage multi-core computing is to check if it can be run backwards/reverse without affecting the results.

The vast majority of statistical methods haven’t been designed with parallel computing in mind. This is starting to change with the advent of big data but many standard methods don’t fit in the parallel paradigm.

Embarrasingly Parallel - Easiest type of parallel computing since a little effort is need to divide the problem into a set of parallel tasks. Below is an example of eight monte carlo simulation which can be run in parallel and then combined using combine function.

# Can be run in parallel - YES
for(i in 1:8)
sims[i] <- monte_carlo()
combine(sims)

In Below example we cannot use parallel computing because order of evaluation is important.But order of evaluation in parallel computing can’t be predicted and hence we’ll get the wrong answer, since x[3] may get evaluated before x[2]. In sequential algorithms The ith value depends on the previous value.

x <- 1:8
for(i in 2:8)
 x[i] <- x[i-1]

total <- no_of_rolls <- 0 # Initialise
while(total < 10) {
  total <- total + sample(1:6, 1)

  if(total %% 2 == 0) total <- 0  # If even. Reset to 0

  no_of_rolls <- no_of_rolls + 1
}
no_of_rolls

## [1] 7

play <- function() {
  total <- no_of_rolls <- 0
  while(total < 10) {
    total <- total + sample(1:6, 1)

    # If even. Reset to 0
    if(total %% 2 == 0) total <- 0 
    no_of_rolls <- no_of_rolls + 1
  }
  no_of_rolls
}
# and construct a loop to play the game:

results <- numeric(100)
for(i in seq_along(results)) 
    results[i] <- play()

parApply

The parallel package comes with R and enable us to write code that will work on operating systems. It has parallel version of standard functions. A simple version to run in parallel is the apply(glorified for loop) function. In apply we are applying a function to each row or column of a matrix. in the below example we will be using parApply

# Using traditional apply
m <- matrix(rnorm(100000), ncol = 10)
res <- apply(m, 1, median) # middle arguemnt 1 is for rows(2 - columns)

# Using parApply from parallel package
library("parallel")

# No fo cores to be used
copies_of_r <- 4
cl <- makeCluster(copies_of_r)

# Run the loop
parApply(cl, m, 1, median)

# Stop the cluster/ free up the resources
stopCluster(cl)

Unfortunately there is an additional overhead when running in parallel. When we are uisng multiple cores this requires communication between CPUs. If the job is really very fast then the communication can swamp the potential benefit. Therefore you need to benchmark both single core and parallel core options.

The parallel package - parSapply

sapply() is just another way off writing for loop. Applying a function to each value of the vector. There is also a little speed difference between apply and a std loop. To run this code in parallel you simply replace sapply with parSapply

for(i in 1:10)
x[i] <- simulate(i)

#Can be written as
sapply(1:10, simulate)

Example where parallel computing can be applied is that of Bootstrapping. The idea behind bootstrapping is that in real world you cannot sample from the population. Instead you resample from the original sample. You need to follow two steps

Sample from the original data set with replacement(this means data points from the original dataset could appear multiple times in the new sample). So if our orignal dataset was of size 100 then our new sample will also be of size 100.
Calculate the correlation coefficient of our new sample.
Repeat the step 1 & 2 number of times. The distribution of correlation values gives us measure of uncertainty(confidence interval) for our statistic.

Also please note you will observe/notice the difference in speed/boost in performance while using parallel computing when the no of sample size is large(>1000)

bootstrap <- function(data_set) {
# Sample with replacement
s <- sample(1:nrow(data_set), replace = TRUE)
new_data <- data_set[s,]

# Calculate the correlation
cor(new_data$Attack, new_data$Defense)
}

# 100 independent bootstrap simulations
sapply(1:100, function(i) bootstrap(pokemon))

# Do the similar steps using parallel computing
library("parallel")

no_of_cores <- 4

cl <- makeCluster(no_of_cores)

clusterExport(cl, c("bootstrap", "pokemon")) # Export function and dataset

parSapply(cl, 1:100, function(i) bootstrap(pokemon))

stopCluster(cl)

play <- function() {
  total <- no_of_rolls <- 0
  while(total < 10) {
    total <- total + sample(1:6, 1)

    # If even. Reset to 0
    if(total %% 2 == 0) total <- 0 
    no_of_rolls <- no_of_rolls + 1
  }
  no_of_rolls
}

# Using loop or sapply
res <- sapply(1:100, function(i) play())

# Create a cluster via makeCluster (2 cores)
cl <- makeCluster(2)

# Export the play() function to the cluster
clusterExport(cl, "play")

# Parallelize this code
res <- parSapply(cl,1:100, function(i) play())

# Stop the cluster
stopCluster(cl)

# Set the number of games to play
no_of_games <- 1e5

## Time serial version
system.time(serial <- sapply(1:no_of_games, function(i) play()))

##    user  system elapsed 
##    7.79    0.00    7.81

## Set up cluster
cl <- makeCluster(4)
clusterExport(cl, "play")

## Time parallel version
system.time(par <- parSapply(cl, 1:no_of_games, function(i) play()))

##    user  system elapsed 
##    0.05    0.00    3.03

## Stop cluster
stopCluster(cl)

Writing Efficient Code in R

Amar Kapote