Will Nicholson
October 7, 2014
(Borrowed from CS 5220)
1. One should not sacrifice correctness for speed.
2. One should not re-invent (or re-tune) the wheel.
3. Your time matters more than computer time.
I've been worried for some time that R isn't going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.
In light of this, I [Ihaka] have come to the conclusion that rather than “fixing” R, it would be much more productive to simply start over and build something better. I think the best you could hope for by fixing the efficiency problems in R would be to boost performance by a small multiple, or perhaps as much as an order of magnitude. This probably isn’t enough to justify the effort (Luke Tierney has been working on R compilation for over a decade now).
(Summarized from Hadley Wickham's “Advanced R”)
Most R code is poorly written (either by programming novices or due to rule 3)
R has pass-by-value (copy-on-modify) semantics, so you are often modifying a copy of an object rather than the object itself. This is true even outside of functions.
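For example, a quick way to see copy-on-modify in action is with tracemem() (a rough illustration, not from the original slides):
# tracemem() reports whenever the object it tracks gets duplicated
x <- runif(1e6)
tracemem(x)
y <- x        # no copy yet; x and y share the same memory
y[1] <- 0     # modifying y triggers a copy of the entire vector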
R is a completely dynamic language, so every operation requires a lookup (e.g., am I operating on an integer? A double?).
x <- 0L
for (i in 1:1e6) {
  x <- x + 1
}
At each loop iteration, R does not know what type of object it is operating on and must search for the correct method.
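For comparison, a vectorized equivalent dispatches once on a whole vector rather than a million times (an illustrative benchmark, not from the original slides; exact timings will vary by machine):
library(microbenchmark)
microbenchmark(
  loop       = { x <- 0L; for (i in 1:1e6) x <- x + 1L },  # one dispatch per iteration
  vectorized = sum(rep(1L, 1e6)),                          # one dispatch on a length-1e6 vector
  times = 10L)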
We (and our advisors) all know it.
Strong support community and large user base.
Excellent graphics and support for data analysis.
Free.
Very flexible and easy to write external modules.
Good integration with other languages.
In the past, we could count on Moore's law (the number of transistors on an integrated circuit doubles roughly every two years) to deliver major hardware improvements, resulting in substantially faster processors every few years.
However, single-core processor speeds have largely stagnated in recent years.
Consequently, developing good code is more important now than ever before, as we can no longer rely on major hardware improvements.
This has led to a resurgence in both programming in fast compiled languages (such as C++) as well as parallel computing.

My research consists of estimating high-dimensional vector autoregressions.
Consider a multivariate time series of dimensions \( T\times k \).
Let \( \{ \mathbf{y}_t \in \mathbb{R}^k\}_{t = 1}^T \) denote a \( k \)-dimensional vector time series. A \( p \)th order vector autoregression, VAR\( _k(p) \), may be expressed as \[ \mathbf{y}_t=\mathbf{\nu}+\sum_{\ell=1}^p\mathbf{B}_\ell\mathbf{y}_{t-\ell}+\mathbf{u}_t. \]
In compact matrix notation, this system is equivalent to \( \mathbf{Y}=\mathbf{\nu}\mathbf{1}^{'}+\mathbf{B}\mathbf{Z}+\mathbf{U} \), where \[ \begin{array}{ll} \mathbf{Y}=(\mathbf{y}_1,\mathbf{y}_2,\dots,\mathbf{y}_T) & (k\times T)\\ \mathbf{B}=(\mathbf{B}_1,\mathbf{B}_2,\dots,\mathbf{B}_p) & (k\times kp)\\ \mathbf{z}_t=(\mathbf{y}_{t-1}^{'},\dots,\mathbf{y}_{t-p}^{'})^{'} & (kp\times 1)\\ \mathbf{Z}=(\mathbf{z}_1,\dots,\mathbf{z}_{T}) & (kp\times T)\\ \mathbf{U}=(\mathbf{u}_1,\dots,\mathbf{u}_T) & (k\times T)\\ \mathbf{1}=(1,\dots,1)^{'} & (T \times 1) \end{array} \]
The least squares objective takes the form \( \frac{1}{2}\|\mathbf{Y}-\mathbf{B}\mathbf{Z}\|_F^2 \).
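Schematically, each estimator below adds a structured penalty (call it \( P_\lambda(\mathbf{B}) \)) to this objective and solves \[ \hat{\mathbf{B}} = \underset{\mathbf{B}}{\operatorname{argmin}}\;\frac{1}{2}\|\mathbf{Y}-\mathbf{B}\mathbf{Z}\|_F^2 + P_\lambda(\mathbf{B}), \] with the penalties and corresponding solution algorithms listed in the table: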
| Model Structure | Penalty | Solution Algorithm |
|---|---|---|
| Lasso | \( \lambda\|B\|_1 \) | Coordinate Descent |
| Lag Group Lasso | \( \lambda\sum_{\ell=1}^p\|B_\ell\|_F \) | Block Coordinate Descent |
| Own/Other Lag Group Lasso | \( \lambda(\rho_1\sum_{\ell=1}^p \|\text{diag}(B_{\ell})\|_F+\gamma_1\sum_{\ell=1}^p\|B_{\ell}^{-}\|_F) \) | Block Coordinate Descent |
| Lag Sparse Group Lasso | \( \lambda(1-\alpha) \sum_{\ell=1}^{p}\|B_{\ell}\|_{F}+\lambda\alpha\|B\|_{1} \) | Generalized Gradient Descent |
| Own/Other Sparse Group Lasso | \( \lambda(1-\alpha)[\rho_1\sum_{\ell=1}^p \|\text{diag}(B_{\ell})\|_F+\gamma_1\sum_{\ell=1}^p\|B_{\ell}^{-}\|_F]+\alpha\lambda\|B\|_{1} \) | Generalized Gradient Descent |
\( \lambda>0 \) is a penalty parameter estimated by cross-validation, \( 0\leq \alpha\leq 1 \) is an additional penalty parameter set to \( \frac{1}{k+1} \) to control within-group sparsity. \( \rho_1=\sqrt{k} \) and \( \gamma_1=\sqrt{k(k-1)} \) are weights accounting for the number of elements in each group.
In order to account for time dependence, cross-validation is conducted in a rolling manner. Define time indices \( T_1=\left\lfloor \frac{T}{3} \right\rfloor \) and \( T_2=\left\lfloor \frac{2T}{3} \right\rfloor \).
The period \( T_1+1 \) through \( T_2 \) is used as a training period to select \( \lambda \).
The period \( T_2+1 \) through \( T \) is used to evaluate forecast accuracy, again in a rolling manner.
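A schematic sketch of the rolling validation loop is given below; fitVAR(), forecastVAR(), and lambda.grid are hypothetical placeholders for the actual estimation routine, one-step forecast, and penalty grid.
Tobs <- ncol(Y)                                   # Y is (k x T)
T1 <- floor(Tobs / 3); T2 <- floor(2 * Tobs / 3)
msfe <- matrix(NA, length(lambda.grid), T2 - T1)
for (t in (T1 + 1):T2) {
  for (j in seq_along(lambda.grid)) {
    fit <- fitVAR(Y[, 1:(t - 1)], p, lambda.grid[j])        # estimate using data up to t-1
    msfe[j, t - T1] <- sum((Y[, t] - forecastVAR(fit))^2)   # one-step-ahead squared error
  }
}
lambda.opt <- lambda.grid[which.min(rowMeans(msfe))]        # lambda minimizing average MSFE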
The coefficient matrix has dimension \( k\times kp \). If \( k=20 \) and \( p=12 \), that is 4,800 parameters!
The Lasso doesn't have a closed form solution. The structured approaches don't even have closed form solutions for their one-block subproblems.
I have to re-estimate the entire coefficient matrix for a grid of \( \lambda \) values at every time point in the validation stage.
Hence, even in low dimensions, these problems can appear computationally intractable if coded inefficiently.
library(microbenchmark)
A <- matrix(runif(1e4^2), nrow = 1e4, ncol = 1e4)
test1 <- microbenchmark(
  # Accessing 101 contiguous columns (fast: R stores matrices in column-major order)
  B = A[, 200:300],
  # Accessing 101 contiguous rows (bad)
  B1 = A[200:300, ],
  # Accessing 99 non-contiguous rows (really bad)
  B2 = A[seq(100, 9900, by = 100), ],
  # Accessing 99 non-contiguous columns (doesn't matter that much)
  B3 = A[, seq(100, 9900, by = 100)],
  times = 100L)
Unit: milliseconds
expr min lq mean median uq max neval
B 10.99 11.60 12.38 11.71 13.92 14.47 100
B1 36.43 38.90 39.93 39.17 40.46 43.68 100
B2 54.23 54.86 56.23 55.29 58.33 61.08 100
B3 10.96 11.52 12.23 11.68 12.25 19.33 100
To invert a symmetric positive definite matrix, use chol2inv(chol(A)) instead of solve(A).
If you only need eigenvalues, use eigen's only.values option.
crossprod and tcrossprod are almost always faster than %*%.
One thing I haven't looked into is using sparse matrices.
They have advantages with respect to storage and matrix operations
The benefits are dependent upon the degree of sparsity.
Also, look-up costs become non-trivial!
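As a quick illustration of the storage point, using the Matrix package (not part of the original code):
library(Matrix)
A.dense <- matrix(0, 1e3, 1e3)
A.dense[sample(length(A.dense), 1e4)] <- runif(1e4)   # roughly 1% nonzero entries
A.sparse <- Matrix(A.dense, sparse = TRUE)            # compressed sparse column format
object.size(A.dense)    # about 8 MB for the dense representation
object.size(A.sparse)   # far smaller, since only the nonzeros (plus indices) are stored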
Golub/Van Loan (or CS 6210) is a good source for learning efficient matrix algorithms.
library(microbenchmark)
A <- matrix(runif(1e2^2), nrow = 1e2, ncol = 1e2)
B <- matrix(runif(1e2^2), nrow = 1e2, ncol = 1e2)
test1 <- microbenchmark(
  B = t(A) %*% A,
  B1 = crossprod(A),
  times = 50L)
Unit: microseconds
expr min lq mean median uq max neval
B 1284.7 1404.5 1876.3 1619.5 1966.5 12070 50
B1 679.9 681.3 739.4 683.2 698.1 2926 50
A <- matrix(runif(1e2^2), nrow = 1e2, ncol = 1e2)
A <- t(A) %*% A  # make A symmetric positive definite
test1 <- microbenchmark(
  B = solve(A),
  B1 = chol2inv(chol(A)),
  B2 = qr.solve(A),
  times = 50L)
Unit: microseconds
expr min lq mean median uq max neval
B 2583.4 2622.9 2788.7 2650.7 2702.8 7821 50
B1 747.5 769.7 976.7 781.8 801.3 9614 50
B2 3584.8 3629.1 3764.6 3662.7 3785.7 5572 50
A <- matrix(runif(1e2^2), nrow = 1e2, ncol = 1e2)
B <- matrix(runif(1e2^2), nrow = 1e2, ncol = 1e2)
test1 <- microbenchmark(
  C = A %*% B,
  B1 = crossprod(t(A), B),
  times = 50L)
Unit: milliseconds
expr min lq mean median uq max neval
C 1.251 1.275 1.460 1.285 1.643 2.127 50
B1 1.355 1.365 1.384 1.372 1.386 1.533 50
A <- matrix(runif(1e2^2), nrow = 1e2, ncol = 1e2)
test1 <- microbenchmark(
  C = eigen(A),
  B1 = eigen(A, only.values = TRUE),
  times = 50L)
Unit: milliseconds
expr min lq mean median uq max neval
C 16.73 16.91 17.26 17.03 17.25 21.71 50
B1 10.15 10.25 10.47 10.39 10.50 11.48 50
microbenchmark can be useful to compare different approaches, but R also has a good profiler that can give you some insight.
YYY <- MultVarSim(k, A, p, diag(k), nsims)
ZZZ <- Zmat2(YYY, p, k)
Z <- ZZZ$Z
Y <- ZZZ$Y
gran2 <- 50
gamm <- LambdaGrid(40, gran2, Y, Z)
B <- betaCreate(k, p = 2, gran2)
# Profile the benchmarked lasso fit
Rprof()
op <- microbenchmark(
  B1 <- lassoVAR(B, Z, Y, gamm, 1e-3),
  times = 50L)
Rprof(NULL)
summaryRprof()
$by.self
self.time self.pct total.time total.pct
"lasso" 4.60 43.48 9.94 93.95
"%*%" 1.92 18.15 1.92 18.15
"ST" 1.44 13.61 2.28 21.55
"seq" 0.52 4.91 0.90 8.51
"sum" 0.38 3.59 0.38 3.59
"seq.default" 0.22 2.08 0.34 3.21
"-" 0.22 2.08 0.22 2.08
"FUN" 0.20 1.89 0.22 2.08
"*" 0.18 1.70 0.18 1.70
"max" 0.18 1.70 0.18 1.70
"apply" 0.12 1.13 0.50 4.73
"mode" 0.12 1.13 0.12 1.13
"abs" 0.06 0.57 0.06 0.57
"gc" 0.06 0.57 0.06 0.57
"sign" 0.06 0.57 0.06 0.57
"lassoVAR" 0.04 0.38 10.50 99.24
"unlist" 0.04 0.38 0.08 0.76
"match.fun" 0.04 0.38 0.04 0.38
".Call" 0.02 0.19 10.52 99.43
"lapply" 0.02 0.19 0.04 0.38
"any" 0.02 0.19 0.02 0.19
"aperm.default" 0.02 0.19 0.02 0.19
"matrix" 0.02 0.19 0.02 0.19
"mean.default" 0.02 0.19 0.02 0.19
"ncol" 0.02 0.19 0.02 0.19
"nrow" 0.02 0.19 0.02 0.19
"prod" 0.02 0.19 0.02 0.19
$by.total
total.time total.pct self.time self.pct
"block_exec" 10.58 100.00 0.00 0.00
"call_block" 10.58 100.00 0.00 0.00
"eval" 10.58 100.00 0.00 0.00
"evaluate" 10.58 100.00 0.00 0.00
"evaluate_call" 10.58 100.00 0.00 0.00
"handle" 10.58 100.00 0.00 0.00
"in_dir" 10.58 100.00 0.00 0.00
"knit" 10.58 100.00 0.00 0.00
"microbenchmark" 10.58 100.00 0.00 0.00
"process_file" 10.58 100.00 0.00 0.00
"process_group" 10.58 100.00 0.00 0.00
"process_group.block" 10.58 100.00 0.00 0.00
"withCallingHandlers" 10.58 100.00 0.00 0.00
"withVisible" 10.58 100.00 0.00 0.00
".Call" 10.52 99.43 0.02 0.19
"lassoVAR" 10.50 99.24 0.04 0.38
"lasso" 9.94 93.95 4.60 43.48
"ST" 2.28 21.55 1.44 13.61
"%*%" 1.92 18.15 1.92 18.15
"seq" 0.90 8.51 0.52 4.91
"apply" 0.50 4.73 0.12 1.13
"sum" 0.38 3.59 0.38 3.59
"seq.default" 0.34 3.21 0.22 2.08
"-" 0.22 2.08 0.22 2.08
"FUN" 0.22 2.08 0.20 1.89
"*" 0.18 1.70 0.18 1.70
"max" 0.18 1.70 0.18 1.70
"mode" 0.12 1.13 0.12 1.13
"unlist" 0.08 0.76 0.04 0.38
"abs" 0.06 0.57 0.06 0.57
"gc" 0.06 0.57 0.06 0.57
"sign" 0.06 0.57 0.06 0.57
"match.fun" 0.04 0.38 0.04 0.38
"lapply" 0.04 0.38 0.02 0.19
"any" 0.02 0.19 0.02 0.19
"aperm.default" 0.02 0.19 0.02 0.19
"matrix" 0.02 0.19 0.02 0.19
"mean.default" 0.02 0.19 0.02 0.19
"ncol" 0.02 0.19 0.02 0.19
"nrow" 0.02 0.19 0.02 0.19
"prod" 0.02 0.19 0.02 0.19
"aperm" 0.02 0.19 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 10.58
Profiling gives you an idea of where your program spends most of its time.
This can give you some insight into how to optimize your code within R, or whether to move parts of it to C++.
For example, this procedure spends a lot of time in the Soft-Threshold function. This can easily be ported to C++.
Alternatively, I could use vectorized operations… What will be faster?
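For concreteness, here is roughly what the two alternatives might look like; softThreshR() and softThreshCpp() are illustrative sketches, not the actual ST() or lassoVAR2() code. The timing comparison follows below.
library(Rcpp)
# Vectorized R soft-threshold: S(z, gam) = sign(z) * max(|z| - gam, 0)
softThreshR <- function(z, gam) sign(z) * pmax(abs(z) - gam, 0)
# Inline C++ version compiled with Rcpp
cppFunction('
NumericVector softThreshCpp(NumericVector z, double gam) {
  int n = z.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    double a = (z[i] >= 0 ? z[i] : -z[i]) - gam;
    out[i] = (a > 0.0) ? (z[i] > 0.0 ? a : -a) : 0.0;
  }
  return out;
}')
softThreshR(c(-2, 0.5, 3), 1)    # -1  0  2
softThreshCpp(c(-2, 0.5, 3), 1)  # same result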
Unit: milliseconds
                                     expr    min     lq   mean median     uq    max neval
   Orig <- lassoVAR(B, Z, Y, gamm, 0.001) 102.29 103.38 106.52 104.35 107.54 164.74    50
   Cpp <- lassoVAR2(B, Z, Y, gamm, 0.001) 103.83 105.68 109.01 106.50 109.74 164.07    50
 Vec <- lassoVARVec(B, Z, Y, gamm, 0.001)  47.67  49.29  51.59  52.27  52.85  55.75    50
Let's see what the performance difference is between vectorized code and C++ code
Unit: milliseconds
                                         expr    min     lq   mean median     uq    max neval
       Orig <- lassoVAR(B, Z, Y, gamm, 0.001) 391.20 394.92 399.00 396.76 398.81 449.27    50
     CppST <- lassoVAR2(B, Z, Y, gamm, 0.001) 396.54 402.92 407.76 405.53 408.60 460.06    50
     Vec <- lassoVARVec(B, Z, Y, gamm, 0.001) 162.85 165.80 166.91 166.83 167.64 173.49    50
 FullCpp <- lassoVARCPP(B, Z, Y, gamm, 0.001)  33.81  34.66  35.83  35.22  36.48  40.47    50
There are many algorithms that can be used to solve the Lasso problem.
Which one is going to be the fastest? Generally, it is context dependent.
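To make the gradient-based alternative concrete, a bare-bones proximal gradient (ISTA) sketch for the plain lasso subproblem might look like the following; softThresh() and istaVAR() are illustrative stand-ins, and the actual comparison below uses the accelerated (FISTA) variant.
softThresh <- function(M, gam) sign(M) * pmax(abs(M) - gam, 0)
istaVAR <- function(Y, Z, lambda, maxit = 500, tol = 1e-6) {
  B <- matrix(0, nrow(Y), nrow(Z))
  # step size 1/L, where L = largest eigenvalue of Z Z' (Lipschitz constant of the gradient)
  t.step <- 1 / max(eigen(tcrossprod(Z), only.values = TRUE)$values)
  for (it in 1:maxit) {
    G <- (B %*% Z - Y) %*% t(Z)                       # gradient of 0.5*||Y - BZ||_F^2
    B.new <- softThresh(B - t.step * G, t.step * lambda)
    if (max(abs(B.new - B)) < tol) { B <- B.new; break }
    B <- B.new
  }
  B
}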
Unit: milliseconds
                                                      expr    min     lq   mean median     uq    max neval
         CD <- lassoVARCPP(B, Z, Y, gamm, 0.001)            19.639 19.778 20.314 20.129 20.670 21.887    25
 FISTA <- .lassoVARFist(B1a, Z, Y, gamm, 0.001, p, FALSE)    4.815  4.856  4.932  4.923  4.976  5.161    25
Rcpp provides an interface to C++ that lets you directly incorporate R data structures.
Can be done inline or via sourceCpp
Doesn't require that much knowledge of C++ to use effectively
Also provides an interface to the Armadillo numerical linear algebra library, which allows for fast and relatively painless matrix operations.
I started using Rcpp in August 2013 with no knowledge of C++
The general strategy to port to C++ is to start from the inside and work your way out.
What results in the biggest performance gains?
Can you profile Rcpp code? Yes, but it requires a bit of work.
The interface between R and C++ has some overhead. Avoid calling a C++ function multiple times from R.
PASS BY REFERENCE!!! This is one of the biggest time-sinks in R: all R functions pass by value. You can avoid unnecessary copies in Rcpp by adding & before an Armadillo object and _ after an Rcpp object (a small example follows below).
Don't try to be fancy. Take advantage of Rcpp and Armadillo's functionality (remember rule 2).
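For instance, a minimal RcppArmadillo sketch compiled inline might look like the following; ridgeCoef() and its closed-form ridge solution are illustrative assumptions, not part of the actual code. Note the const arma::mat& arguments, which avoid extra copies on the C++ side.
library(Rcpp)   # RcppArmadillo must also be installed
sourceCpp(code = '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export]]
arma::mat ridgeCoef(const arma::mat& Z, const arma::mat& Y, double lambda) {
  // closed-form ridge solution B = Y * t(Z) * inv(Z * t(Z) + lambda * I),
  // with Y of dimension (k x T) and Z of dimension (kp x T)
  mat G = Z * Z.t();
  G.diag() += lambda;
  return Y * Z.t() * inv_sympd(G);
}')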
Since single-core processor performance has been relatively stagnant, attention has shifted to parallel and distributed computing, as most modern computers have multiple cores.
R is primarily single-threaded, though linking an external BLAS library provides some multithreaded support for linear algebra.
If you convert all of your Rcpp objects to Armadillo or native C++ objects, you can utilize OpenMP within C++ to parallelize for loops.
There is also a package RcppParallel in development.
Within R, there are facilities for parallel computing in packages snow, foreach, doMC, etc.
These procedures involve spawning separate R processes and aggregating the results.
Aggregated data are usually stored in lists that will need to be collapsed.
On your own system, parallel code might look something like this:
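A rough sketch with foreach and doMC (an illustration only, assuming lassoVAR() can be called with a single penalty value; in the code above the whole grid gamm is passed at once):
library(foreach)
library(doMC)
registerDoMC(cores = 4)            # use 4 worker processes (Unix-alikes only)
res <- foreach(g = gamm) %dopar% {
  lassoVAR(B, Z, Y, g, 1e-3)       # fit the penalized VAR for one penalty value g
}
# res is a list with one entry per penalty value, which then needs to be collapsed,
# e.g. with simplify2array(res)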
We have a computing cluster available for research
It has 30 nodes, each with 8 processors.
This is slightly different from parallel computing on your own machine.
More care is required when setting up the cluster, as you will need the addresses of nodes.
Typically, you write a bash script and run it through a scheduler rather than run R code directly.
Unfortunately, the cluster is fairly dated and technical issues need to be addressed with CAC.
As an alternative, it is possible to create an “on-demand” cluster using Amazon EC2.
One can create virtual “instances,” each of which can be thought of as one “node” of a cluster
You have full control over these instances and can create up to 20 at a time.
You don't need to use a scheduler; you instead execute your script with something like:
nohup R CMD BATCH Script.R Res.txt &
Though R has severe limitations, it is possible to achieve good performance through good coding practices.
We will not be using R forever. Hopefully the eventual successor will correct its flaws.
Julia looks particularly promising, but in my opinion, it is still too experimental for everyday use.