R inferno

If you are coming from a C or Java background, you will probably complain that R is slow. If you are new to programming, you may have heard people say that R code is slow.

In this post, I hope to show you how to make your R code faster, given that R's language design is very different from that of C/C++, Java, or even Python.

Before we proceed, always remember the famous saying by Donald Knuth that premature optimization is the root of all evil.

Disclaimer: The content of this post combines my two years of experience using R, the famous book The R Inferno, Advanced R by Hadley Wickham, and the online DataCamp course on writing efficient R code. Special thanks to Hadley for making his Advanced R book free online.

0.1 Common pitfalls to avoid

0.1.1 Never grow your data in R.

It is perhaps the number one thing you should consider when you are writing any R code.

Suppose we want to create a vector from 1 to 1000. How do we do that in R?

There are four possible ways:

Add a number to the sequence using a for loop.
Pre-allocate a vector of size 1000 and change the values using a for loop.
Use the seq function in base R.
Use the colon : operator.

Let us compare the computation time for these four methods.

First, we will define the functions to do these calculations.

# Add a number to the sequence using a for loop. 
growing=function(n){
  x=NULL
  for(i in 1:n){
    x=append(x,i)
  }
  return(x)
}

# pre-allocate a vector
preallocate=function(n){
  x=vector(mode="integer", length=n)
  for(i in 1:n){
    x[i]=i
  }
  return(x)
}

# Using seq function in base R.
seq(1,100,by=1)

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

# use the colon : operator
1:100

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

Next, fire up the microbenchmark package to benchmark them.

library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(1000),
  preallocate(1000),
  seq(1,1000),
  1:1000,
  times=1000L
)%>%print()

## Unit: nanoseconds
##               expr     min        lq        mean    median        uq
##      growing(1000) 1872548 2020690.0 2393326.988 2043141.5 2810033.5
##  preallocate(1000)   50957   56159.0   63897.181   57427.0   59268.0
##       seq(1, 1000)    4645    5531.0   10211.340    8166.5   14229.5
##             1:1000     555     708.5     978.008     855.5    1174.5
##       max neval
##  34404698  1000
##   2506822  1000
##     53383  1000
##      5595  1000

It is easy to see that the colon operator is the clear winner here.

The difference becomes more pronounced when we build a longer vector.

library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(10000),
  preallocate(10000),
  seq(1,10000),
  1:10000,
  times=100L
)%>%print()

## Unit: microseconds
##                expr        min         lq         mean     median
##      growing(10000) 101114.150 103258.904 117568.07945 109242.104
##  preallocate(10000)    462.800    472.873    505.83326    480.292
##       seq(1, 10000)      7.386     10.488     26.00096     23.695
##             1:10000      3.751      4.575     14.32613      4.921
##          uq        max neval
##  129284.965 162259.991   100
##     538.015    711.325   100
##      28.235    616.911   100
##       5.533    925.939   100

Growing a vector is slow because each call to append builds a new, longer vector and copies every existing element into it, so the total amount of copying grows roughly quadratically with n. Pre-allocating, as the preallocate function does, removes that repeated allocation and copying. But preallocate still runs its loop body once per element in the R interpreter, while seq and the colon operator each build the whole vector in a single call. The difference between seq and the colon operator came as a surprise to me. A likely explanation is that the colon operator is a primitive implemented directly in C, while seq is an S3 generic whose default method does its argument checking in R code before building the result. Let me know if you can explain the gap more precisely.
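One way to probe the gap between seq and the colon operator is to look at how each is implemented. The sketch below uses only base R: is.primitive reports whether a function is implemented directly in C with no R-level wrapper. It does not measure anything, so treat it as a hint about where the overhead may come from rather than a full explanation.

```r
# `:` is a primitive: calling it goes straight to compiled C code.
colon_is_primitive <- is.primitive(`:`)

# seq is an ordinary R closure (an S3 generic), so a call passes through
# method dispatch and R-level argument handling before any values are built.
seq_is_primitive <- is.primitive(seq)

print(colon_is_primitive)  # TRUE
print(seq_is_primitive)    # FALSE

# Both still produce the same values.
same_values <- all(seq(1, 1000) == 1:1000)
print(same_values)         # TRUE
```

Typing seq.default at the console prints the R source of the default method, which shows the checks that run on every call.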

The gap between the preallocate function and seq is the gap between non-vectorized and vectorized code.

0.1.2 Always vectorize

R is faster when you use a vectorized form of calculation. Luckily, most R functions do calculations in a vectorized fashion.
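As a small illustration of what "vectorized" means here, the sketch below applies a few base R operations to a whole vector at once, with no explicit loop. The variable names are made up for this example.

```r
x <- c(1, 4, 9, 16)

doubled <- x * 2    # arithmetic is applied element by element
roots   <- sqrt(x)  # so are most math functions
is_big  <- x > 5    # comparisons return one logical value per element

print(doubled)  # 2 8 18 32
print(roots)    # 1 2 3 4
print(is_big)   # FALSE FALSE TRUE TRUE
```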

Suppose we want to multiply each element of a vector by 2.

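non_vector_mutiply2=function(x){
  # pre-allocate the result as doubles, since x[i]*2 is a double
  x_multi=vector(mode="numeric", length=length(x))
  # seq_along(x) is empty for a zero-length x, unlike 1:length(x)
  for(i in seq_along(x)){
    x_multi[i]=x[i]*2
  }
  return(x_multi)
}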

non_vector_mutiply2(c(1,2))

## [1] 2 4

Let us benchmark.

library(microbenchmark)
# use the fastest methods in generating a sequence
vector=1:1000
microbenchmark(
  non_vector_mutiply2(vector),
  vector*2,
  times=1000L
)%>%print()

## Unit: microseconds
##                         expr    min      lq      mean  median     uq
##  non_vector_mutiply2(vector) 66.639 67.4415 72.777498 68.4425 70.136
##                   vector * 2  1.805  1.9825  3.281393  2.1120  2.344
##       max neval
##  1352.305  1000
##  1069.528  1000

Suppose we want to get the cumulative sum of a sequence. Let us benchmark the vectorized R function cumsum and a non-vectorized function.

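non_vector_cumsum=function(x){
  # pre-allocate the result, then fill it one element at a time
  x_cumsum=vector(mode="numeric", length=length(x))
  x_cumsum[1]=x[1]
  # seq_along(x)[-1] is empty when x has a single element, unlike 1:(length(x)-1)
  for(i in seq_along(x)[-1]){
    x_cumsum[i]=x_cumsum[i-1]+x[i]
  }
  return(x_cumsum)
}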

non_vector_cumsum(c(2,4,8))

## [1]  2  6 14

library(microbenchmark)
vector=1:1000
microbenchmark(
  non_vector_cumsum(vector),
  cumsum(vector),
  times = 100L
)%>%print()

## Unit: microseconds
##                       expr     min       lq     mean   median      uq
##  non_vector_cumsum(vector) 106.627 107.1985 108.3275 107.6815 108.800
##             cumsum(vector)   2.538   2.6520   2.9222   2.7065   2.921
##      max neval
##  119.567   100
##   13.886   100

Many R functions do their calculations in a vectorized way, with the loop running in compiled C or Fortran code. Thus, it is always a good idea to use a vectorized function when one is available.
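The same idea carries over to loops that contain a condition. Here is a minimal sketch, using two hypothetical helper functions that are not part of the benchmarks above, of turning such a loop into a vectorized expression with logical indexing.

```r
# Loop version: visit each element and add its square if it is even.
sum_even_squares_loop <- function(x) {
  total <- 0
  for (value in x) {
    if (value %% 2 == 0) {
      total <- total + value^2
    }
  }
  total
}

# Vectorized version: filter with a logical index, square, then sum.
sum_even_squares_vec <- function(x) {
  sum(x[x %% 2 == 0]^2)
}

x <- 1:10
print(sum_even_squares_loop(x))  # 220
print(sum_even_squares_vec(x))   # 220
```

Both versions return the same value; the vectorized one hands the element-by-element work to compiled code.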

That concludes the two simple methods to make your R code faster.

Zijun Lu

December 20, 2017
