If you come from a C or Java background, you will almost certainly complain that R is slow. If you are new to programming, you may have heard people say that R code is slow.
In this post, I hope to show you how to make your R code faster, because R's language design is very different from that of C/C++, Java, or even Python.
Before we proceed, always remember the famous saying by Donald Knuth that premature optimization is the root of all evil.
Disclaimer: The content of this post combines my two years of experience using R, the well-known book The R Inferno, Advanced R by Hadley Wickham, and the online DataCamp class on writing efficient R code. Special thanks to Hadley for making Advanced R free online.
Not growing a vector inside a loop is perhaps the number one thing you should consider when writing any R code.
Suppose we want to create a vector from 1 to 1000. How do we do that in R?
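There are four possible ways: grow a vector one element at a time in a for loop, fill a pre-allocated vector in a for loop, call the seq function, or use the colon operator.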
Let us compare the computation time for these four methods.
First, we will define the functions to do these calculations.
# Add a number to the sequence using a for loop.
growing <- function(n) {
  x <- NULL
  # seq_len(n) is empty when n is 0, whereas 1:n would count down from 1 to 0
  for (i in seq_len(n)) {
    x <- append(x, i)
  }
  return(x)
}
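# Pre-allocate a vector, then fill it in a for loop.
preallocate <- function(n) {
  x <- vector(mode = "integer", length = n)
  # seq_len(n) is empty when n is 0, whereas 1:n would count down from 1 to 0
  for (i in seq_len(n)) {
    x[i] <- i
  }
  return(x)
}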
# Use the seq function in base R.
seq(1, 100, by = 1)
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
# Use the colon operator (:).
1:100
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
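Before timing anything, it is worth confirming that the four approaches really build the same vector. The check below is a minimal sketch: it repeats the growing and preallocate definitions in compact form (using seq_len(n) in place of 1:n) so that it runs on its own, and the size n and the names results and same_values are only illustrative.

```r
# Two loop-based builders, repeated here so the snippet is self-contained.
growing <- function(n) {
  x <- NULL
  for (i in seq_len(n)) x <- append(x, i)
  x
}

preallocate <- function(n) {
  x <- vector(mode = "integer", length = n)
  for (i in seq_len(n)) x[i] <- i
  x
}

n <- 10  # illustrative size; any positive integer works
results <- list(
  growing     = growing(n),
  preallocate = preallocate(n),
  seq         = seq(1, n),
  colon       = 1:n
)

# Compare every result element-wise against the colon operator's output.
same_values <- vapply(results, function(r) all(r == 1:n), logical(1))
print(same_values)
```

All four entries should come back TRUE, so any timing difference reflects how the vector is built and not what it contains.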
Next, fire up the microbenchmark package to benchmark them.
library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(1000),
  preallocate(1000),
  seq(1, 1000),
  1:1000,
  times = 1000L
) %>% print()
## Unit: nanoseconds
##               expr     min        lq        mean    median        uq      max neval
##      growing(1000) 1872548 2020690.0 2393326.988 2043141.5 2810033.5 34404698  1000
##  preallocate(1000)   50957   56159.0   63897.181   57427.0   59268.0  2506822  1000
##       seq(1, 1000)    4645    5531.0   10211.340    8166.5   14229.5    53383  1000
##             1:1000     555     708.5     978.008     855.5    1174.5     5595  1000
It is easy to see that the colon operator is the clear winner here.
The difference becomes even more pronounced when we build a longer vector.
library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(10000),
  preallocate(10000),
  seq(1, 10000),
  1:10000,
  times = 100L
) %>% print()
## Unit: microseconds
##                expr        min         lq         mean     median         uq        max neval
##      growing(10000) 101114.150 103258.904 117568.07945 109242.104 129284.965 162259.991   100
##  preallocate(10000)    462.800    472.873    505.83326    480.292    538.015    711.325   100
##       seq(1, 10000)      7.386     10.488     26.00096     23.695     28.235    616.911   100
##             1:10000      3.751      4.575     14.32613      4.921      5.533    925.939   100
Allocating memory is computationally expensive in R, which is why the growing function is so slow: every call to append builds a new, longer vector and copies the existing elements into it. Pre-allocating the vector, as the preallocate function does, gets rid of that repeated allocation. But the preallocate function still makes about a thousand times as many function calls as seq or the colon operator. The difference between seq and the colon operator came as a surprise to me. It presumably comes from differences in the code underneath the two: seq is an R-level generic that dispatches to seq.default and checks its arguments before doing any work, whereas the colon operator is a primitive implemented directly in C. Let me know if you can explain the rest of the gap.
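To get a feel for why growing is so costly, here is a back-of-the-envelope count. It assumes a simplified cost model in which each append call allocates a fresh vector and copies every existing element into it. The function name elements_copied is made up for this illustration, and real timings also depend on the garbage collector and the R version.

```r
# Simplified cost model (an assumption, not a measurement):
# at step i, append() builds a vector of length i, which means
# copying the i - 1 elements that were already there.
elements_copied <- function(n) {
  sum(seq_len(n) - 1)
}

elements_copied(1000)   # 499500 element copies to build 1:1000
elements_copied(10000)  # 49995000, roughly 100 times more for 10 times the length
```

Under this model the work grows quadratically with the length of the vector. That is at least consistent with the benchmarks above, where the median time of growing rises by a factor of about 50 for a tenfold longer vector, while preallocate rises by a factor of less than 10.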
The difference between the preallocate function and seq is the difference between non-vectorized and vectorized code.
R is faster when you express a calculation in vectorized form. Luckily, most R functions are vectorized.
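As a quick illustration of what vectorized means in practice, the snippet below applies a few base R functions and operators to whole vectors at once, with no explicit loop. The vector x is just example data.

```r
x <- c(1, 4, 9, 16)  # example data

sqrt(x)                 # square root of every element: 1 2 3 4
x > 5                   # element-wise comparison: FALSE FALSE TRUE TRUE
x + c(10, 20, 30, 40)   # element-wise addition: 11 24 39 56
paste0("id_", 1:3)      # character vectors work too: "id_1" "id_2" "id_3"
```

In each case the loop over the elements still happens, but inside compiled code and not in the R interpreter.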
Suppose we want to multiply each element of a vector by 2.
non_vector_mutiply2 <- function(x) {
  x_multi <- vector(mode = "integer", length = length(x))
  # seq_along(x) is safe for an empty x, unlike 1:length(x)
  for (i in seq_along(x)) {
    x_multi[i] <- x[i] * 2
  }
  return(x_multi)
}
non_vector_mutiply2(c(1, 2))
## [1] 2 4
Let us benchmark.
library(microbenchmark)
# Use the fastest method to generate the sequence.
vector <- 1:1000
microbenchmark(
  non_vector_mutiply2(vector),
  vector * 2,
  times = 1000L
) %>% print()
## Unit: microseconds
##                         expr    min      lq      mean  median     uq      max neval
##  non_vector_mutiply2(vector) 66.639 67.4415 72.777498 68.4425 70.136 1352.305  1000
##                   vector * 2  1.805  1.9825  3.281393  2.1120  2.344 1069.528  1000
Suppose we want to get the cumulative sum of a sequence. Let us benchmark the vectorized R function cumsum and a non-vectorized function.
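non_vector_cumsum <- function(x) {
  x_multi <- vector(mode = "integer", length = length(x))
  x_multi[1] <- x[1]
  # seq_len() yields an empty loop for a length-1 input, unlike 1:(length(x) - 1)
  for (i in seq_len(length(x) - 1)) {
    x_multi[i + 1] <- x_multi[i] + x[i + 1]
  }
  return(x_multi)
}
non_vector_cumsum(c(2, 4, 8))
## [1]  2  6 14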
library(microbenchmark)
vector <- 1:1000
microbenchmark(
  non_vector_cumsum(vector),
  cumsum(vector),
  times = 100L
) %>% print()
## Unit: microseconds
##                       expr     min       lq     mean   median      uq     max neval
##  non_vector_cumsum(vector) 106.627 107.1985 108.3275 107.6815 108.800 119.567   100
##             cumsum(vector)   2.538   2.6520   2.9222   2.7065   2.921  13.886   100
R implements many calculations in a vectorized way using C and FORTRAN. Thus, it is always a good idea to use a vectorized function when one is available.
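As one more sketch of the same idea, a running mean can be written either as a loop or with cumsum. The function names running_mean_loop and running_mean_vec are made up for this example. Both should return the same values, but only the second hands the element-by-element work to compiled code.

```r
# Loop version: accumulates the total element by element.
running_mean_loop <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total / i
  }
  out
}

# Vectorized version: cumulative sum divided by the number of elements so far.
running_mean_vec <- function(x) {
  cumsum(x) / seq_along(x)
}

x <- c(2, 4, 8)
running_mean_loop(x)  # 2 3 4.666667
running_mean_vec(x)   # same values
all.equal(running_mean_loop(x), running_mean_vec(x))  # TRUE
```

I have not benchmarked this pair here; timing the two versions with microbenchmark in the same way as the examples above would show how large the gap is on your machine.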
That concludes the two simple methods to make your R code faster.