The apply function in R computes a function separately over each column (or row) of a dense matrix. However, it does not work well with sparse matrices because it always coerces them to a dense matrix first. Here we examine several possibilities for apply-style functionality on sparse Matrix objects (specifically, the column-oriented dgCMatrix class from package Matrix). For simplicity, we always assume the function is applied separately to each column of the matrix, and that it is computed only on the nonzero elements of each column. We consider the following possible approaches:
- qlcMatrix approach: as.simple_triplet_matrix, then slam::rollup(Y, 2, FUN = max)
- data.table approach: group by column and summarize
- vapply approach: convert to a list of columns, where each element is a vector containing the nonzero values of that column

Note that we are not trying to replace packages like sparseMatrixStats, which provide column-wise application of common summary statistics like mean, median, and max. Rather, our goal is to allow application of arbitrary user-defined functions that operate only on the nonzero values.
Each method will be tested with and without parallelization. We use the median as the summary function, since it gives a different answer depending on whether zeros are included in the computation.
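For example (with arbitrary values), excluding the zeros changes the median:
x<-c(0,0,0,2,4)
median(x) #zeros included: 0
median(x[x!=0]) #only the nonzero elements: 3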
library(Matrix)
library(qlcMatrix) #also loads slam
## Loading required package: slam
## Loading required package: sparsesvd
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:slam':
##
## rollup
library(microbenchmark)
library(pryr)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
##
## Attaching package: 'pryr'
## The following object is masked from 'package:data.table':
##
## address
library(parallel)
setDTthreads(1) #limit data.table to one thread so the serial comparisons are fair
Create a large matrix for testing:
m<-rsparsematrix(2000,10000,.05)
Next we define several ways to convert a sparse matrix into a list of the nonzero values of each column:
listCols<-function(m){
  #converts a sparse Matrix into a list of its columns
  #m@p holds cumulative counts of nonzeros per column, so findInterval
  #maps each position in m@x to its column index
  res<-split(m@x, findInterval(seq_len(nnzero(m)), m@p, left.open=TRUE))
  names(res)<-colnames(m)
  res
}
listCols2<-function(m){
  #using slam simple sparse array
  m2<-as.simple_sparse_array(m)
  res<-split(m2$v,m2$i[,2]) #second column of the index matrix holds column indices
  names(res)<-colnames(m)
  res
}
listCols3<-function(m){
  #using slam simple triplet matrix
  m2<-as.simple_triplet_matrix(m)
  res<-split(m2$v,m2$j)
  names(res)<-colnames(m)
  res
}
listCols4<-function(m){
  #using the data.frame returned by summary()
  m2<-summary(m)
  res<-split(m2$x,m2$j)
  names(res)<-colnames(m)
  res
}
listCols5<-function(m){
  #using data.table
  m2<-as.data.table(summary(m))
  res<-split(m2$x,m2$j)
  names(res)<-colnames(m)
  res
}
res<-microbenchmark(listCols(m),
                    listCols2(m),
                    listCols3(m),
                    listCols4(m),
                    listCols5(m),
                    unit="s",times=10)
boxplot(res,unit="s")
Ranking of methods from fastest to slowest: all but the slam simple sparse array approach (listCols2) had very similar speed.
We will use the findInterval approach (listCols) henceforth for comparisons, since it minimizes dependencies.
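As a quick sanity check (a small hand-built matrix, not part of the benchmarks), listCols recovers the nonzero values of each column:
m0<-sparseMatrix(i=c(1,3,2,5,4),j=c(1,1,2,2,3),x=c(2,4,1,3,5),dims=c(5,3))
vapply(listCols(m0),median,0.0) #per-column medians of the nonzeros: 3 2 5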
Next we test different baseline approaches:
m<-rsparsematrix(500,1000,.05)
baseline1<-function(m,f){
  #extracts each column as a dense vector (zeros included)
  vapply(seq_len(ncol(m)),function(t){f(m[,t])},0.0)
}
baseline2<-function(m,f){
  Z<-m!=0 #indicator of nonzero elements
  vapply(seq_len(ncol(m)),function(t){f(m[Z[,t],t])},0.0)
}
baseline3<-function(m,f){
  #applies f to the nonzero values of each column
  vapply(listCols(m), f, FUN.VALUE=0.0)
}
f<-median
res<-microbenchmark(baseline1(m,f),
                    baseline2(m,f),
                    baseline3(m,f),
                    unit="s",times=10)
boxplot(res,unit="s")
Ranking of methods from fastest to slowest.
Note that baseline3 would be easy to parallelize (i.e., replace vapply with mclapply). Also note that built-in functions like Matrix::colSums, and linear algebra tricks like premultiplying by a vector of ones, will be far faster for the limited set of functions (such as sum) that can be computed that way.
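As an aside, here is a minimal sketch of that linear algebra trick for sum (not one of the methods benchmarked below): premultiplying by a vector of ones yields the column sums directly.
ones<-rep(1,nrow(m))
all.equal(as.numeric(ones%*%m),Matrix::colSums(m)) #should be TRUE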
m<-rsparsematrix(2000,5000,.05)
slam0<-function(m,f,...){
  #rollup over dimension 1 (rows) applies f within each column
  as(slam::rollup(m,1,FUN=f,...), "sparseVector")
}
slam1<-function(m,f){ #this is most similar to rowMax from qlcMatrix
  slam0(as.simple_triplet_matrix(drop0(m)),f)
}
slam2<-function(m,f,expand="none"){
  slam0(as.simple_sparse_array(drop0(m)),f,EXPAND=expand)
}
slam3<-function(m,f){
  #compare to slam2 with expand="all"
  colapply_simple_triplet_matrix(as.simple_triplet_matrix(m),f)
  #as.data.frame(do.call(rbind,res))
}
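To see what rollup is doing inside slam0, here is a small illustrative example with made-up values: rolling up dimension 1 collapses the rows, leaving one value per column.
m1<-as.simple_triplet_matrix(Matrix(c(1,0,0,2,3,4),2,3,sparse=TRUE))
as.matrix(slam::rollup(m1,1,FUN=sum)) #1 x 3 result containing the column sums 1 2 7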
f<-median
res<-microbenchmark(baseline3(m,f),
                    slam1(m,f),
                    slam3(m,f),
                    slam2(m,f),
                    slam2(m,f,expand="all"),
                    slam0(m,f),
                    unit="s",times=10)
boxplot(res,unit="s")
The ranking from fastest to slowest is shown in the boxplot.
Now let's try data.table:
m<-rsparsematrix(2000,5000,.05)
dt1<-function(m,f){
  d<-as.data.table(summary(m)) #triplet form with columns i,j,x, ordered by j
  d[,f(x),keyby=j]$V1 #apply f to the x values within each column j
}
f<-median
res<-microbenchmark(baseline3(m,f),
                    dt1(m,f),
                    slam1(m,f),
                    slam3(m,f),
                    unit="s",times=50)
boxplot(res,unit="s")
Method ranking from fastest to slowest. To check correctness, we compare each method against Matrix::colSums using sum as the function:
tru<-Matrix::colSums(m)
max(abs(dt1(m,sum)-tru))
## [1] 2.131628e-14
max(abs(slam1(m,sum)-tru))
## [1] 2.04281e-14
max(abs(baseline3(m,sum)-tru))
## [1] 2.131628e-14
As expected, all the results are close to the true value. Let's now examine the memory consumption of each approach.
mem_change(Matrix::colSums(m))
## -114 kB
mem_change(dt1(m,sum))
## 68.2 kB
mem_change(slam1(m,sum))
## 528 B
mem_change(baseline3(m,sum))
## 528 B
#slam detailed memory change profile
rm(Y)
## Warning in rm(Y): object 'Y' not found
mem_change(Y<-as.simple_triplet_matrix(drop0(m)))
## 8 MB
mem_change(as(slam::rollup(Y,1,FUN=sum), "sparseVector"))
## 584 B
#data.table detailed memory change profile
rm(d)
## Warning in rm(d): object 'd' not found
mem_change(d<-as.data.table(summary(m)))
## 8.02 MB
mem_change(d[,sum(x),keyby=j]$V1)
## 54.3 kB
#baseline3 detailed memory change profile
rm(cl)
## Warning in rm(cl): object 'cl' not found
mem_change(cl<-listCols(m))
## 4.32 MB
mem_change(vapply(cl, sum, FUN.VALUE=0.0))
## 584 B
Ranking of memory consumption from least to most: memory consumption is dominated by the creation of the intermediate objects that enable fast computation of the summary statistics.
We now test with the median function to see whether each method operates on all elements or only on the nonzero elements of the matrix.
tru<-apply(m,2,median)
#methods that operate only on nonzero elements
max(abs(dt1(m,median)-tru))
## [1] 0.45
max(abs(slam1(m,median)-tru))
## [1] 0.45
max(abs(baseline3(m,median)-tru))
## [1] 0.45
#methods that operate on all elements
max(abs(slam3(m,median)-tru))
## [1] 0
max(abs(slam0(m,median)-tru))
## [1] 0
max(abs(slam2(m,median,expand="all")-tru)) #operates on all elements
## [1] 0
max(abs(baseline1(m,median)-tru)) #operates on all elements
## [1] 0
Based on this, for serial processing we make the following recommendations:
- Functions that need all elements of each column: the clear winner here is to first convert to a slam simple triplet matrix, then use colapply_simple_triplet_matrix. This is implemented above as slam3.
- Functions that need only the nonzero elements: there are three viable approaches: 1) convert to a data.table, 2) convert to a slam simple triplet matrix and use rollup, or 3) convert to a list of columns. In terms of speed all three are similar, although the slam triplet rollup seems slightly slower. In terms of memory consumption and minimizing dependencies, conversion to a list of columns is best. In terms of parallelization, both the data.table and list of columns approaches support it. We therefore do not recommend the slam approach, since it is dominated by data.table on speed and by the list of columns on memory. A sketch combining these two recommendations is given below.
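Here is a minimal sketch of a convenience wrapper combining the two recommendations (the name applySparse is hypothetical, and parallelization is omitted for simplicity):
applySparse<-function(m,f,nonzeroOnly=TRUE){
  if(nonzeroOnly){
    vapply(listCols(m),f,FUN.VALUE=0.0) #list of columns, as in baseline3
  } else {
    colapply_simple_triplet_matrix(as.simple_triplet_matrix(m),f) #as in slam3
  }
}
head(applySparse(m,median)) #median of the nonzero values in each column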
We now compare the list of columns approach against data.table with parallel processing.
m<-rsparsematrix(5000,10000,.05)
f<-function(x){
  Sys.sleep(1e-3) #simulate a slow user-defined function
  sum(x)
}
#does parallelization speed up the list of columns approach?
cl<-listCols(m)
system.time(res<-vapply(cl,f,FUN.VALUE=0.0)) #14 sec
## user system elapsed
## 0.543 0.246 13.806
system.time(res2<-unlist(mclapply(cl,f,mc.cores=2))) #6.7 sec
## user system elapsed
## 0.007 0.008 6.384
When the function being applied is very fast, serial processing is faster; but when the function is slow, parallel processing helps. We now compare against the multithreaded data.table approach.
baseline<-function(m,f,nproc=2){
  unlist(mclapply(listCols(m),f,mc.cores=nproc))
}
system.time(res3<-baseline(m,f,nproc=2)) #6.8 sec
## user system elapsed
## 0.167 0.039 6.758
setDTthreads(2)
system.time(res4<-dt1(m,f)) #14.5 sec
## user system elapsed
## 0.703 0.285 13.769
setDTthreads(1)
The baseline approach using listCols combined with parallel processing (mclapply) is faster than the multithreaded data.table approach.