The apply function in R computes a function separately over each column (or row) of a dense matrix. However, it does not work well with sparse matrices because it always coerces them to a dense matrix first. Here we examine several possibilities for apply-style functionality on sparse Matrix objects (specifically, the column-oriented dgCMatrix class from package Matrix). For simplicity, we always assume the function is applied separately to each column of the matrix, and that it is computed only on the nonzero elements of each column. We consider the following possible approaches:
- qlcMatrix approach: as.simple_triplet_matrix, then slam::rollup(Y, 2, FUN = max)
- data.table approach: group by column and summarize
- vapply approach: convert to a list of columns, where each element is a vector containing the nonzero values of that column

Note that we are not trying to replace packages like sparseMatrixStats, which provide column-wise application of common summary statistics like mean, median, and max. Rather, our goal is to allow application of arbitrary user-defined functions that operate only on the nonzero values.
Each method will be tested with and without parallelization. We use the median as the summary function, since it gives a different answer depending on whether zeros are included in the computation.
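For example (with arbitrary values), excluding the zeros changes the median:
x<-c(0,0,0,2,4)
median(x) #zeros included: 0
median(x[x!=0]) #only the nonzero elements: 3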
library(Matrix)
library(qlcMatrix) #also loads slam
## Loading required package: slam
## Loading required package: sparsesvd
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:slam':
##
## rollup
library(microbenchmark)
library(pryr)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
##
## Attaching package: 'pryr'
## The following object is masked from 'package:data.table':
##
## address
library(parallel)
setDTthreads(1) #limit data.table to one thread so the serial comparisons are fair
Create a large matrix for testing:
m<-rsparsematrix(2000,10000,.05)
Next we define several ways to convert a sparse matrix into a list of the nonzero values of each column:
listCols<-function(m){
  #converts a sparse Matrix into a list of its columns
  #m@p holds cumulative counts of nonzeros per column, so findInterval
  #maps each position in m@x to its column index
  res<-split(m@x, findInterval(seq_len(nnzero(m)), m@p, left.open=TRUE))
  names(res)<-colnames(m)
  res
}
listCols2<-function(m){
  #using slam simple sparse array
  m2<-as.simple_sparse_array(m)
  res<-split(m2$v,m2$i[,2]) #second column of the index matrix holds column indices
  names(res)<-colnames(m)
  res
}
listCols3<-function(m){
  #using slam simple triplet matrix
  m2<-as.simple_triplet_matrix(m)
  res<-split(m2$v,m2$j)
  names(res)<-colnames(m)
  res
}
listCols4<-function(m){
  #using the data.frame returned by summary()
  m2<-summary(m)
  res<-split(m2$x,m2$j)
  names(res)<-colnames(m)
  res
}
listCols5<-function(m){
  #using data.table
  m2<-as.data.table(summary(m))
  res<-split(m2$x,m2$j)
  names(res)<-colnames(m)
  res
}
res<-microbenchmark(listCols(m),
                    listCols2(m),
                    listCols3(m),
                    listCols4(m),
                    listCols5(m),
                    unit="s",times=10)
boxplot(res,unit="s")
Ranking of methods from fastest to slowest: all but the slam simple sparse array approach (listCols2) had very similar speed.
We will use the findInterval approach (listCols) henceforth for comparisons, since it minimizes dependencies.
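As a quick sanity check (a small hand-built matrix, not part of the benchmarks), listCols recovers the nonzero values of each column:
m0<-sparseMatrix(i=c(1,3,2,5,4),j=c(1,1,2,2,3),x=c(2,4,1,3,5),dims=c(5,3))
vapply(listCols(m0),median,0.0) #per-column medians of the nonzeros: 3 2 5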
Next we test different baseline approaches:
m<-rsparsematrix(500,1000,.05)
baseline1<-function(m,f){
  #extracts each column as a dense vector (zeros included)
  vapply(seq_len(ncol(m)),function(t){f(m[,t])},0.0)
}
baseline2<-function(m,f){
  Z<-m!=0 #indicator of nonzero elements
  vapply(seq_len(ncol(m)),function(t){f(m[Z[,t],t])},0.0)
}
baseline3<-function(m,f){
  #applies f to the nonzero values of each column
  vapply(listCols(m), f, FUN.VALUE=0.0)
}
f<-median
res<-microbenchmark(baseline1(m,f),
                    baseline2(m,f),
                    baseline3(m,f),
                    unit="s",times=10)
boxplot(res,unit="s")
Ranking of methods from fastest to slowest.
Note that baseline3 would be easy to parallelize (i.e., replace vapply with mclapply). Also note that built-in functions like Matrix::colSums, and linear algebra tricks like premultiplying by a vector of ones, will be far faster for the limited set of functions (such as sum) that can be computed that way.
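As an aside, here is a minimal sketch of that linear algebra trick for sum (not one of the methods benchmarked below): premultiplying by a vector of ones yields the column sums directly.
ones<-rep(1,nrow(m))
all.equal(as.numeric(ones%*%m),Matrix::colSums(m)) #should be TRUE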
m<-rsparsematrix(2000,5000,.05)
slam0<-function(m,f,...){
  #rollup over dimension 1 (rows) applies f within each column
  as(slam::rollup(m,1,FUN=f,...), "sparseVector")
}
slam1<-function(m,f){ #this is most similar to rowMax from qlcMatrix
  slam0(as.simple_triplet_matrix(drop0(m)),f)
}
slam2<-function(m,f,expand="none"){
  slam0(as.simple_sparse_array(drop0(m)),f,EXPAND=expand)
}
slam3<-function(m,f){
  #compare to slam2 with expand="all"
  colapply_simple_triplet_matrix(as.simple_triplet_matrix(m),f)
  #as.data.frame(do.call(rbind,res))
}
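To see what rollup is doing inside slam0, here is a small illustrative example with made-up values: rolling up dimension 1 collapses the rows, leaving one value per column.
m1<-as.simple_triplet_matrix(Matrix(c(1,0,0,2,3,4),2,3,sparse=TRUE))
as.matrix(slam::rollup(m1,1,FUN=sum)) #1 x 3 result containing the column sums 1 2 7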
f<-median
res<-microbenchmark(baseline3(m,f),
                    slam1(m,f),
                    slam3(m,f),
                    slam2(m,f),
                    slam2(m,f,expand="all"),
                    slam0(m,f),
                    unit="s",times=10)
boxplot(res,unit="s")
The ranking from fastest to slowest is shown in the boxplot.
Now let's try data.table:
m<-rsparsematrix(2000,5000,.05)
dt1<-function(m,f){
  d<-as.data.table(summary(m)) #triplet form with columns i,j,x, ordered by j
  d[,f(x),keyby=j]$V1 #apply f to the x values within each column j
}
f<-median
res<-microbenchmark(baseline3(m,f),
                    dt1(m,f),
                    slam1(m,f),
                    slam3(m,f),
                    unit="s",times=50)
boxplot(res,unit="s")
Method ranking from fastest to slowest. To check correctness, we compare each method against Matrix::colSums using sum as the function:
tru<-Matrix::colSums(m)
max(abs(dt1(m,sum)-tru))
## [1] 2.131628e-14
max(abs(slam1(m,sum)-tru))
## [1] 2.04281e-14
max(abs(baseline3(m,sum)-tru))
## [1] 2.131628e-14
As expected, all the results are close to the true value. Let's now examine the memory consumption of each approach.
mem_change(Matrix::colSums(m))
## -114 kB
mem_change(dt1(m,sum))
## 68.2 kB
mem_change(slam1(m,sum))
## 528 B
mem_change(baseline3(m,sum))
## 528 B
#slam detailed memory change profile
rm(Y)
## Warning in rm(Y): object 'Y' not found
mem_change(Y<-as.simple_triplet_matrix(drop0(m)))
## 8 MB
mem_change(as(slam::rollup(Y,1,FUN=sum), "sparseVector"))
## 584 B
#data.table detailed memory change profile
rm(d)
## Warning in rm(d): object 'd' not found
mem_change(d<-as.data.table(summary(m)))
## 8.02 MB
mem_change(d[,sum(x),keyby=j]$V1)
## 54.3 kB
#baseline3 detailed memory change profile
rm(cl)
## Warning in rm(cl): object 'cl' not found
mem_change(cl<-listCols(m))
## 4.32 MB
mem_change(vapply(cl, sum, FUN.VALUE=0.0))
## 584 B
Ranking of memory consumption from least to most: memory consumption is dominated by the creation of the intermediate objects that enable fast computation of the summary statistics.
We now test with the median function to see whether each method operates on all elements or only on the nonzero elements of the matrix.
tru<-apply(m,2,median)
#methods that operate only on nonzero elements
max(abs(dt1(m,median)-tru))
## [1] 0.45
max(abs(slam1(m,median)-tru))
## [1] 0.45
max(abs(baseline3(m,median)-tru))
## [1] 0.45
#methods that operate on all elements
max(abs(slam3(m,median)-tru))
## [1] 0
max(abs(slam0(m,median)-tru))
## [1] 0
max(abs(slam2(m,median,expand="all")-tru)) #operates on all elements
## [1] 0
max(abs(baseline1(m,median)-tru)) #operates on all elements
## [1] 0
Based on this, for serial processing we make the following recommendations:
- Functions that need all elements of each column: the clear winner here is to first convert to a slam simple triplet matrix, then use colapply_simple_triplet_matrix. This is implemented above as slam3.
- Functions that need only the nonzero elements: there are three viable approaches: 1) convert to a data.table, 2) convert to a slam simple triplet matrix and use rollup, or 3) convert to a list of columns. In terms of speed all three are similar, although the slam triplet rollup seems slightly slower. In terms of memory consumption and minimizing dependencies, conversion to a list of columns is best. In terms of parallelization, both the data.table and list of columns approaches support it. We therefore do not recommend the slam approach, since it is dominated by data.table on speed and by the list of columns on memory. A sketch combining these two recommendations is given below.
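Here is a minimal sketch of a convenience wrapper combining the two recommendations (the name applySparse is hypothetical, and parallelization is omitted for simplicity):
applySparse<-function(m,f,nonzeroOnly=TRUE){
  if(nonzeroOnly){
    vapply(listCols(m),f,FUN.VALUE=0.0) #list of columns, as in baseline3
  } else {
    colapply_simple_triplet_matrix(as.simple_triplet_matrix(m),f) #as in slam3
  }
}
head(applySparse(m,median)) #median of the nonzero values in each column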
We now compare the list of columns approach against data.table with parallel processing.
m<-rsparsematrix(5000,10000,.05)
f<-function(x){
  Sys.sleep(1e-3) #simulate a slow user-defined function
  sum(x)
}
#does parallelization speed up the list of columns approach?
cl<-listCols(m)
system.time(res<-vapply(cl,f,FUN.VALUE=0.0)) #14 sec
## user system elapsed
## 0.543 0.246 13.806
system.time(res2<-unlist(mclapply(cl,f,mc.cores=2))) #6.7 sec
## user system elapsed
## 0.007 0.008 6.384
When the function being applied is very fast, serial processing is faster; but when the function is slow, parallel processing helps. We now compare against the multithreaded data.table approach.
baseline<-function(m,f,nproc=2){
  unlist(mclapply(listCols(m),f,mc.cores=nproc))
}
system.time(res3<-baseline(m,f,nproc=2)) #6.8 sec
## user system elapsed
## 0.167 0.039 6.758
setDTthreads(2)
system.time(res4<-dt1(m,f)) #14.5 sec
## user system elapsed
## 0.703 0.285 13.769
setDTthreads(1)
The baseline approach using listCols combined with parallel processing (mclapply) is faster than the multithreaded data.table approach.