Contents

1 Overview

This package implements several matrix operations for Matrix and DelayedMatrix objects as well as for HDF5 data files. It also provides basic algebra operations that are useful to implement statistical analyses based on standard methodologies such as principal component analysis (PCA) or least squares estimation. The package also contains specific statistical methods mainly used in omic data analysis, such as lasso regression. All procedures related to HDF5 can be found in the BigDataStatMeth_hdf5 vignette.

2 Prerequisites

The package requires other packages from CRAN and Bioconductor to be installed.

As the package can also deal with hdf5 files [see the BigDataStatMeth_hdf5 vignette], the HDF5Array and rhdf5 packages are also required. The user can execute this code to install all the required packages:

# Install BiocManager (if not previously installed)
install.packages("BiocManager") 

# Install required packages
BiocManager::install(c("Matrix", "RcppEigen", "RSpectra",
                       "beachmat", "DelayedArray",
                       "HDF5Array", "rhdf5"))

Our package needs to be installed from source. In that case a collection of build tools (e.g. C, C++ and Fortran compilers) is required, mainly for Windows users. These programs can be installed using Rtools.

3 Install package

Once required packages and Rtools are installed, BigDataStatMeth package can be installed from our GitHub repository as follows:

# Install devtools and load library (if not previously installed)
install.packages("devtools") 
library(devtools)

# Install BigDataStatMeth 
install_github("isglobal-brge/BigDataStatMeth")

4 Getting started

First, let us load the packages required to describe the main capabilities of the package:

library(Matrix)
library(DelayedArray)
library(BigDataStatMeth)

This package is also required to reproduce this vignette:

library(microbenchmark)

5 Previous knowledge

5.1 DelayedMatrix

A DelayedMatrix is a two-dimensional DelayedArray, which is like an ordinary array in R but allows the data to be in memory, on disk in a file, or even hosted on a remote server.

The DelayedArray framework supports some on-disk backends for genomic data. We can deal with HDF5 files via rhdf5 and the HDF5Array package, GenomicDataStorage (GDS) via the gdsfmt and GDSArray package, and Variant Calling Format (VCF) files via the VCFArray package. However, this can be extended to support other on-disk backends. In theory, it should be possible to implement a DelayedArray backend for any file format that has the capability to store array data with fast random access.

Having data in this format allows very fast operations on a given matrix. This has been implemented in the DelayedMatrixStats package. Several examples of how to deal with DelayedArray objects can be found in this workshop.
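As a quick illustration (a minimal sketch; the HDF5-backed example assumes HDF5Array is installed, as recommended above), the same matrix can be wrapped in memory or written to disk:

# In-memory DelayedMatrix wrapping an ordinary R matrix
m <- matrix(rnorm(20), nrow = 4)
dm <- DelayedArray(m)

# On-disk, HDF5-backed DelayedMatrix: the data live in an HDF5 file
dm_h5 <- HDF5Array::writeHDF5Array(m)
seed(dm_h5)  # the seed shows the object is backed by an HDF5 file on disk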

6 Matrix Multiplication

In this section, different products of matrices and vectors are introduced. The methods implement different strategies including block multiplication algorithms and the use of parallel implementations.

6.1 Block matrix multiplication

A block matrix, or partitioned matrix, is a matrix that is interpreted as having been broken into sections called blocks or submatrices. Intuitively, a block matrix can be visualized as the original matrix overlaid with a collection of horizontal and vertical lines, which break it up, or partition it, into a collection of smaller matrices. The implementation has been adapted from the Fox algorithm [1].

\[A*B=\begin{pmatrix} { A }_{ 11 } & { A }_{ 12 } \\ { A }_{ 21 } & { A }_{ 22 } \end{pmatrix}*\begin{pmatrix} { B }_{ 11 } & B_{ 12 } \\ B_{ 21 } & B_{ 22 } \end{pmatrix}=\begin{pmatrix} { C }_{ 11 } & C_{ 12 } \\ C_{ 21 } & C_{ 22 } \end{pmatrix}=C\]
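To make the idea concrete, here is a small plain-R sketch of 2x2 block multiplication (illustrative only; the package performs this in compiled code, with the block size as a parameter):

# Plain-R illustration of 2x2 block matrix multiplication
n <- 4
A0 <- matrix(rnorm(n*n), n); B0 <- matrix(rnorm(n*n), n)
idx1 <- 1:(n/2); idx2 <- (n/2 + 1):n
C0 <- matrix(0, n, n)
for (i in list(idx1, idx2)) {
  for (j in list(idx1, idx2)) {
    # C_ij = A_i1 %*% B_1j + A_i2 %*% B_2j
    C0[i, j] <- A0[i, idx1] %*% B0[idx1, j] + A0[i, idx2] %*% B0[idx2, j]
  }
}
all.equal(C0, A0 %*% B0)  # TRUE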

Let us simulate two sets of matrices to illustrate the use of the function across the entire document. First, we simulate a simple case with two matrices A and B with dimensions 500x500 and 500x750, respectively. Second, another example with dimensions 1000x1000 and 1000x10000 is used to evaluate the performance on large matrices. Examples with big datasets will be illustrated using real data from different omic settings. We can simulate two matrices with the desired dimensions by:


# Define small matrix A and B
set.seed(123456)
n <- 500
p <- 750
A <- matrix(rnorm(n*n), nrow=n, ncol=n)
B <- matrix(rnorm(n*p), nrow=n, ncol=p)


# Define big matrix Abig and Bbig
n <- 1000
p <- 10000
Abig <- matrix(rnorm(n*n), nrow=n, ncol=n)
Bbig <- matrix(rnorm(n*p), nrow=n, ncol=p)

They can be converted into DelayedMatrix objects by simply:

DA <- DelayedArray(A)
DB <- DelayedArray(B)

Matrix multiplication using block matrices is implemented in the blockmult() function. It only requires setting the argument block_size (by default, block_size = 128). An optimal block size is important for better performance, but it is difficult to assess the optimum block size for each matrix.

# Use 10x10 blocks
AxB <- blockmult(A, B, block_size = 10)

# Use 256x256 blocks
AxBDelay <- blockmult(DA, DB, block_size = 256 )

As expected, the results obtained with this procedure are the correct ones:

all.equal(AxBDelay,A%*%B)
[1] TRUE
all.equal(AxB, AxBDelay)
[1] TRUE

Note that when the argument block_size is larger than any of the dimensions of matrices A or B, block_size is set to min(ncol(A), nrow(A), ncol(B), nrow(B)).
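As a quick sanity check of this clamping behavior (a sketch reusing A and B from above):

# block_size exceeds every dimension of A and B, so it is clamped internally
AxBclamp <- blockmult(A, B, block_size = 10000)
all.equal(AxBclamp, A %*% B)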

The process can be sped up by performing the computations in parallel with paral = TRUE. The optional parameter threads can be used to indicate the number of threads to launch simultaneously; if threads = NULL the function uses the available threads minus one, leaving one free for the user.

AxB <- blockmult(A, B, block_size = 10, paral = TRUE)
AxBDelay <- blockmult(DA, DB, block_size = 10, paral = TRUE )
AxBDelayT <- blockmult(DA, DB, block_size = 10, paral = TRUE, threads =  3 )

all.equal(AxBDelay,A%*%B)
[1] TRUE
all.equal(AxB, AxBDelay)
[1] TRUE
all.equal(AxBDelay, AxBDelayT)
[1] TRUE

To work with big matrices, blockmult() can save the matrices in hdf5 file format so that it can operate on them in blocks without overloading memory. By default, matrices are considered large if the number of rows or columns is greater than 5000, but this can be changed with the bigmatrix argument. We can also force the matrix multiplication to be executed with data in memory by setting the parameter onmemory = TRUE.

DAbig <- DelayedArray(Abig)
DBbig <- DelayedArray(Bbig)

# We want to force it to run in memory
AxBNOBig <- blockmult(Abig, Bbig, onmemory = TRUE) 

# Run matrix multiplication with data on memory using submatrices of 256x256
AxBBig3000 <- blockmult(DAbig, DBbig, block_size = 256 , onmemory = TRUE)

Here, we compare the performance of the block method with the different options.

bench1 <- microbenchmark( 
  # Parallel block size = 128
  Paral128Mem = blockmult(Abig, Bbig, paral = TRUE), 
  # On disk parallel block size = 256
  Paral256Disk = blockmult(Abig, Bbig, block_size=256, paral=TRUE), 
  Paral256Mem = blockmult(Abig, Bbig, block_size=256, 
                          paral=TRUE, onmemory=TRUE),
  Paral1024Mem = blockmult(Abig, Bbig, block_size=1024, 
                           paral=TRUE, onmemory=TRUE), times = 3 )

bench1
Unit: milliseconds
         expr       min       lq      mean    median         uq        max neval cld
  Paral128Mem 9492.7368 9673.295 9866.7438 9853.8523 10053.7473 10253.6423     3    d
 Paral256Disk 3750.9919 3810.253 3992.0555 3869.5138  4112.5873  4355.6607     3   c
  Paral256Mem  662.7442  693.227  725.3804  723.7099   756.6985   789.6872     3 a
 Paral1024Mem 1513.6794 1514.524 1525.8138 1515.3689  1531.8809  1548.3930     3  b

We can observe that the shortest execution time is achieved with block_size = 256. The same information is depicted in the next figure, which compares the different assessed methods:

ggplot2::autoplot(bench1)

6.2 Sparse matrix multiplication

A sparse matrix or sparse array is a matrix in which most of the elements are zero. BigDataStatMeth can perform matrix multiplication with sparse matrices using the function blockmult_sparse. At least one of the two matrices must be defined as sparse in R using the dgCMatrix class. An example of a sparse matrix could be:

\[ \begin{bmatrix} a_{11} & a_{12} & 0 & \cdots & \cdots & \cdots & \cdots & 0 \\ a_{21} & a_{22} & a_{23} & \ddots & && & \vdots \\ 0 & a_{32} & a_{33} & a_{34} & \ddots & & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \ddots & \ddots & \ddots& \vdots\\ \vdots & & & & \ddots & a_{n-1,n-2} & a_{n-1,n-1} & a_{n-1,n}\\ 0 & \cdots & \cdots & \cdots & \cdots & 0 & a_{n,n-1} & a_{n,n} \\ \end{bmatrix} \]

k <- 1e3

# Generate 2 sparse matrix x_sparse and y_sparse
set.seed(1)
x_sparse <- sparseMatrix(
   i = sample(x = k, size = k),
   j = sample(x = k, size = k),
   x = rnorm(n = k)
)

set.seed(2)
y_sparse <- sparseMatrix(
   i = sample(x = k, size = k),
   j = sample(x = k, size = k),
   x = rnorm(n = k)
)

d <- blockmult_sparse(x_sparse, y_sparse)
f <- x_sparse%*%y_sparse

all.equal(d,f)
[1] TRUE

Here, we compare the performance of sparse matrix multiplication on sparse matrices against blockmult multiplication with the same matrices not declared as sparse.

res <- microbenchmark( 
  sparse_mult = blockmult_sparse(x_sparse, y_sparse),
  matrix_mult = blockmult(as.matrix(x_sparse), as.matrix(y_sparse)), 
  RSparse_mult = x_sparse%*% y_sparse, 
  times = 3 )

res
Unit: microseconds
         expr     min       lq    mean  median       uq     max neval cld
  sparse_mult    43.7    45.65    63.3    47.6    73.10    98.6     3  a
  matrix_mult 61779.0 63112.25 66635.2 64445.5 69063.30 73681.1     3   b
 RSparse_mult   116.8   122.30   149.9   127.8   166.45   205.1     3  a

The same information is depicted in the next figure, which compares the different assessed methods:

ggplot2::autoplot(res)

We can observe the huge difference in execution time.

7 Cross-product and Transposed Cross-product

To perform a cross-product (similar to crossprod()) and a transposed cross-product (similar to tcrossprod()) with BigDataStatMeth we use the bdCrossprod() and bdtCrossprod() functions, respectively. Like other functions implemented in BigDataStatMeth, they work with DelayedArray or standard R objects.

Notice that bdCrossprod() and bdtCrossprod() allow parallelization and can compute the product of two different matrices, since the arguments of the functions are:

args(bdCrossprod)
function (A, B = NULL, block_size = 256, paral = TRUE, threads = NULL) 
NULL

Here we show some examples:

n <- 500
m <- 250
A <- matrix(rnorm(n*m), nrow=n, ncol=m)
DA <- DelayedArray(A)

# Cross Product of a standard R matrix
cpA <- bdCrossprod(A)
# Result with DelayedArray data type
cpDA <- bdCrossprod(DA)

We obtain the expected values computed with the crossprod() R function:

all.equal(cpDA, crossprod(A))
[1] TRUE
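The two-matrix form mirrors crossprod(A, B), i.e. \(A^tB\) (a minimal sketch; B2 is a hypothetical second matrix with a conformable number of rows):

# Cross-product of two different matrices: t(A) %*% B2
B2 <- matrix(rnorm(nrow(A) * 100), nrow = nrow(A))
cpAB <- bdCrossprod(A, B2)
all.equal(cpAB, crossprod(A, B2))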

The same for the transposed cross-product:

# Transposed Cross Product R matrices
tcpA <- bdtCrossprod(A)
# With DelayeArray data types
tcpDA <- bdtCrossprod(DA)

We obtain the expected values computed with the tcrossprod() function:

all.equal(tcpDA, tcrossprod(A))
[1] TRUE

We can show that the implemented version clearly improves on the computational speed of the R implementation.

res <- microbenchmark(
  bdcrossp_tr = bdtCrossprod(A),
  rcrossp_tr = tcrossprod(A),
  bdcrossp = bdCrossprod(A),
  rcrossp = crossprod(A),
  times = 3)
res
Unit: milliseconds
        expr     min       lq      mean  median       uq     max neval cld
 bdcrossp_tr  6.3396  6.34385  6.451467  6.3481  6.50740  6.6667     3  b
  rcrossp_tr 25.7072 25.75945 25.908267 25.8117 26.00880 26.2059     3    d
    bdcrossp  2.7252  2.73280  2.749533  2.7404  2.76170  2.7830     3 a
     rcrossp 15.6620 15.69405 15.797567 15.7261 15.86535 16.0046     3   c
ggplot2::autoplot(res)

8 Multiplying matrices and vectors

Sometimes the user needs to perform different matrix and vector multiplications. The simplest multiplication is \(Xw\), which can be performed in parallel using the blockmult_vector() function.
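A minimal sketch of its use (assuming blockmult_vector() takes the matrix and the vector as its first two arguments; this call is not shown elsewhere in this vignette):

# Assumed usage: parallel matrix-vector product X0 %*% w0
X0 <- matrix(rnorm(100 * 100), nrow = 100)
w0 <- runif(100)
Xw <- blockmult_vector(X0, w0)
all.equal(as.vector(Xw), as.vector(X0 %*% w0))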

Other useful operations involve a vector of weights \(w\), with \(W = \mathrm{diag}(w)\):

\[XWX^t \quad \text{and} \quad X^tWX\]

These can be performed using the bdwproduct() function, whose argument op selects the required operation ("xwxt" or "xtwx").

Next, we show some examples:

n <- 250
X <- matrix(rnorm(n*n), nrow=n, ncol=n)
u <- runif(n)
w <- u * (1 - u)
DX <- DelayedArray(X)
Dw <- DelayedArray(as.matrix(w))

wcpX <- bdwproduct(X, w, op="xwxt")
wcpDX <- bdwproduct(DX, Dw, op= "xwxt") # with DelayedArray

wcpDX[1:5,1:5]
           [,1]      [,2]       [,3]       [,4]       [,5]
[1,] 44.1890362  1.032160  0.7061483 -1.8762456 -1.2572584
[2,]  1.0321595 43.919531 -2.2856272 -1.3729013  1.1013293
[3,]  0.7061483 -2.285627 46.4180971 -3.6621389  2.1444686
[4,] -1.8762456 -1.372901 -3.6621389 39.1658810 -0.5414553
[5,] -1.2572584  1.101329  2.1444686 -0.5414553 39.1266511

We check whether we obtain the expected values:

all.equal( wcpDX, X%*%diag(w)%*%t(X) )
[1] TRUE

Another example:

wtcpX <- bdwproduct(X, w, op = "xtwx")
wtcpDX <- bdwproduct(DX, Dw, op = "xtwx") # with DelayedArray

wtcpDX[1:5,1:5]
            [,1]        [,2]       [,3]       [,4]       [,5]
[1,] 42.10759651 -0.05701354 -0.4767347 -6.3149197 -0.4659907
[2,] -0.05701354 50.97505787 -3.9501323 -0.7959214 -0.0952895
[3,] -0.47673475 -3.95013234 43.4955384 -0.5409623  4.2215983
[4,] -6.31491965 -0.79592141 -0.5409623 43.0341187 -2.0391817
[5,] -0.46599066 -0.09528950  4.2215983 -2.0391817 43.7745996

Again, we obtain the expected values

all.equal(wtcpDX, t(X)%*%diag(w)%*%X)
[1] TRUE

9 Multiplying matrices

In some situations the user is interested in computing

\[X^tWX \quad \text{or} \quad XWX^t\]

where \(W\) is a matrix of weights.

This can be performed efficiently in BigDataStatMeth using the functions Crossprod_Weighted() and tCrossprod_Weighted(), respectively. Here is one example (it also works with DelayedArray matrices):

n <- k <- 400
X <- matrix(rnorm(n*k), nrow=n, ncol=k)
W <- matrix(rnorm(n*k), nrow=k, ncol=n)

# Serial execution
Serie <- tCrossprod_Weighted(X, W, paral = FALSE)

# Parallel execution with 3 threads and blocks 256x256
Par_3cor <-  tCrossprod_Weighted(X, W, 
                                paral = TRUE,
                                block_size = 256,
                                threads = 3)

We can check whether we obtain the expected results

all.equal(Serie, X%*%W%*%t(X))
[1] TRUE
all.equal(Par_3cor, X%*%W%*%t(X))
[1] TRUE

10 Inverse Cholesky

The Cholesky factorization is widely used for solving a system of linear equations whose coefficient matrix is symmetric and positive definite.

\[A = LL^t = U^tU\]

where \(L\) is a lower triangular matrix and \(U\) is an upper triangular matrix. To compute the matrix inverse through the Cholesky factorization we can use the function bdInvCholesky(). Let us start by illustrating how to do these computations on simulated data:

# Generate a positive definite matrix
Posdef <- function (n, ev = runif(n, 0, 10))
{
  Z <- matrix(ncol=n, rnorm(n^2))
  decomp <- qr(Z)
  Q <- qr.Q(decomp)
  R <- qr.R(decomp)
  d <- diag(R)
  ph <- d / abs(d)
  O <- Q %*% diag(ph)
  Z <- t(O) %*% diag(ev) %*% O
  return(Z)
}

A <- Posdef(n = 500, ev = 1:500)
DA <- DelayedArray(A)

invchol <- bdInvCholesky(A)
Dinvchol <- bdInvCholesky(DA)

round(invchol[1:5,1:5],8)
           [,1]        [,2]        [,3]        [,4]        [,5]
[1,] 0.01257496  0.00234614  0.00096388  0.00015561  0.00246762
[2,] 0.00234614  0.01185093  0.00239172 -0.00057207  0.00013259
[3,] 0.00096388  0.00239172  0.01246749 -0.00132076  0.00096416
[4,] 0.00015561 -0.00057207 -0.00132076  0.01072538 -0.00139334
[5,] 0.00246762  0.00013259  0.00096416 -0.00139334  0.01463243

We can check whether this function returns the expected values obtained with the standard R function solve():

all.equal(Dinvchol, solve(A))
[1] TRUE
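For reference, base R can obtain the same inverse from the Cholesky factor with chol2inv(), which follows directly from the factorization above:

# A = U^t U  implies  solve(A) = chol2inv(U), with U = chol(A)
all.equal(chol2inv(chol(A)), solve(A))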

We can show that the implemented version clearly improves on the computational speed of the R implementation:

res <- microbenchmark(invchol = bdInvCholesky(A),
                      invcholDelayedA = bdInvCholesky(DA),
                      invcholR = solve(A),
                      times = 3)
res
Unit: milliseconds
            expr     min       lq     mean  median       uq     max neval cld
         invchol 31.4085 31.76325 31.90663 32.1180 32.15570 32.1934     3 a
 invcholDelayedA 36.0496 36.60800 36.94377 37.1664 37.39085 37.6153     3  b
        invcholR 95.5738 95.96695 96.99200 96.3601 97.70110 99.0421     3   c
ggplot2::autoplot(res)

11 Pseudo-Inverse

The Moore-Penrose pseudoinverse is a direct application of the SVD. The inverse of a matrix \(A\) can be used to solve the equation \(Ax = b\), but when the system of equations has zero or infinitely many solutions the inverse does not exist and the equation cannot be solved that way. Writing the SVD as \(A = UDV^T\), the pseudoinverse is obtained as: \[{A}^+= {VD}^+{U}^T\] In BigDataStatMeth the Moore-Penrose pseudoinverse is implemented in the function bdpseudoinv. We now compute the pseudoinverse of a simulated matrix:


m <- 1300
n <- 1200

A <- matrix(rnorm(n*m), nrow=n, ncol=m)

pseudoinv <- bdpseudoinv(A)
zapsmall(pseudoinv)[1:5,1:5]
             [,1]         [,2]         [,3]         [,4]         [,5]
[1,] -0.000929938 -0.000219023  0.004998460  0.003535350 -0.001476193
[2,] -0.001946806  0.002491788 -0.005024868 -0.002086451  0.004296669
[3,]  0.000010026  0.003979115 -0.000634740 -0.000881813  0.001094804
[4,] -0.000500903 -0.001114744 -0.001527819 -0.000728396  0.003100715
[5,] -0.001369427 -0.008313211  0.000877194  0.003920593 -0.003083451

12 QR Factorization

QR decomposition, also known as QR factorization or QU factorization, is a decomposition of a matrix A into a product \[A = QR\] of an orthogonal matrix Q and an upper triangular matrix R. QR decomposition is often used to solve linear least squares problems.

In BigDataStatMeth the QR decomposition is implemented in the function bdQR; it can be performed on R objects or DelayedArray objects. To show how to use this function we perform a QR decomposition of the matrix A simulated in the previous section.


# Get Delayed array from A
DA <- DelayedArray(A)

QR_A <- bdQR(A)
QR_DA <- bdQR(DA)
QR_R <- qr(A)

# Show results for Q
zapsmall(QR_A$Q[1:5,1:5])
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,] -0.04008586 -0.02472723  0.02550645  0.02997907  0.02737843
[2,]  0.03923424 -0.06262526  0.03758668 -0.05461130  0.00665810
[3,]  0.03167128  0.02161421 -0.01659119 -0.06438171 -0.00175184
[4,]  0.00577572 -0.01218693 -0.03905370 -0.04780413 -0.00092894
[5,] -0.00267035 -0.00386603  0.06940760  0.06123214 -0.03928808

# Show results for R
zapsmall(QR_A$R[1:5,1:5])
         [,1]     [,2]     [,3]     [,4]      [,5]
[1,] 34.75369  0.12262 -0.08356 -1.27886  -0.30175
[2,]  0.00000 34.73975 -0.93004  0.95486   0.29010
[3,]  0.00000  0.00000 35.35888  0.20043   1.58712
[4,]  0.00000  0.00000  0.00000 34.11577  -1.22742
[5,]  0.00000  0.00000  0.00000  0.00000 -34.88482

# Test Q 
all.equal(QR_A$Q, qr.Q(QR_R), check.attributes=FALSE)
[1] TRUE
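As noted above, QR is the standard route to linear least squares; here is a short base-R sketch on a fresh tall matrix (X0 and y0 are illustrative names):

# Least squares via QR: solve R b = Q^t y for the minimizer of ||X0 b - y0||
set.seed(2021)
X0 <- matrix(rnorm(100 * 5), nrow = 100)
y0 <- rnorm(100)
qrX <- qr(X0)
b <- backsolve(qr.R(qrX), crossprod(qr.Q(qrX), y0))
all.equal(as.vector(b), as.vector(coef(lm(y0 ~ X0 - 1))))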

13 Solve matrix equation

In BigDataStatMeth we implemented the function bdSolve that computes the solution to a real system of linear equations

\[A*X = B\] where A is an N-by-N matrix and X and B are N-by-K matrices.

Here we solve a matrix equation with a square matrix A (1000 by 1000) and a matrix B (1000 by 2):


# Simulate data
m <- 1000
n <- 1000

A <- matrix(runif(n*m), nrow = n, ncol = m)
B <- matrix(runif(n*2), nrow = n)

# Solve matrix equation
X <- bdSolve(A, B)

# Show results
X[1:5,]
           [,1]       [,2]
[1,] -3.3413699 -1.8256915
[2,] -1.7191686  0.2768148
[3,]  1.6582799  1.6851948
[4,] -0.9308524  1.7915137
[5,] -4.4860380  0.7196707

Now we check the results by multiplying matrix A by the solution X; if everything is correct, \(A X = B\):


testB <- blockmult(A,X)

B[1:5,]
           [,1]       [,2]
[1,] 0.89672669 0.58589763
[2,] 0.07182663 0.09126186
[3,] 0.33021122 0.43776093
[4,] 0.76410828 0.95991296
[5,] 0.84585130 0.78990866
testB[1:5,]
           [,1]       [,2]
[1,] 0.89672669 0.58589763
[2,] 0.07182663 0.09126186
[3,] 0.33021122 0.43776093
[4,] 0.76410828 0.95991296
[5,] 0.84585130 0.78990866

all.equal(B, testB)
[1] TRUE

14 Singular Value Decomposition (SVD)

The SVD of an \(m \times n\) real or complex matrix \(A\) is a factorization of the form:

\[A = U\Sigma V^{T}\]

where:

- \(U\) is an \(m \times m\) real or complex unitary matrix
- \(\Sigma\) is an \(m \times n\) rectangular diagonal matrix with non-negative real numbers on the diagonal
- \(V\) is an \(n \times n\) real or complex unitary matrix


14.1 Simple Singular Value Decomposition

We have implemented the SVD for R matrices and DelayedArray objects in the function bdSVD(). The method, so far, only allows the SVD of real matrices \(A\). This code illustrates how to perform such computations:

# Matrix simulation
set.seed(413)
n <- 500
A <- matrix(rnorm(n*n), nrow=n, ncol=n)
# Get a Delayed Array object
DA <- DelayedArray(A)

# SVD
bsvd <- bdSVD(A)
Dbsvd <- bdSVD(DA)

# Singular values, and right and left singular vectors
bsvd$d[1:5]
[1] 44.27037 43.95609 43.31037 43.01980 42.81728
bsvd$u[1:5,1:5]
             [,1]        [,2]         [,3]         [,4]         [,5]
[1,]  0.028712445  0.10169472  0.019783036  0.032804622 -0.048915416
[2,]  0.001193891  0.02296487  0.009024529  0.059683600  0.027888834
[3,] -0.008878497 -0.02414059 -0.006035909 -0.004867886  0.024570252
[4,]  0.037323916 -0.04675198  0.035928628 -0.021097593 -0.073485962
[5,] -0.054073083  0.04382626 -0.009027050 -0.002269443 -0.003071978
bsvd$v[1:5,1:5]
            [,1]        [,2]        [,3]        [,4]         [,5]
[1,] -0.07105335  0.04554495 -0.03850442  0.07270812  0.078124410
[2,]  0.05989691 -0.02957792 -0.07508850 -0.01171385  0.001581823
[3,] -0.00200688 -0.04179449  0.02798939 -0.05686766 -0.073573833
[4,] -0.01284403 -0.05024617 -0.09038993 -0.05725360  0.023632656
[5,] -0.01751359  0.07320492 -0.02120536 -0.02397367 -0.067643671

We get the expected results obtained with standard R functions:

all.equal( sqrt( svd( tcrossprod( scale(A) ) )$d[1:10] ), bsvd$d[1:10] )
[1] TRUE
all.equal( sqrt( svd( tcrossprod( scale(A) ) )$d[1:10] ), Dbsvd$d[1:10] )
[1] TRUE

By default you get the \(\sigma_i\), \(U\) and \(V\) of the normalized matrix \(A\); if you want to perform the SVD of the non-normalized matrix \(A\), you have to set the parameters bcenter = FALSE and bscale = FALSE.

bsvd <- bdSVD(A, bcenter = FALSE, bscale = FALSE)
Dbsvd <- bdSVD(DA, bcenter = FALSE, bscale = FALSE)


bsvd$d[1:5]
[1] 44.44007 43.89640 43.38384 43.23563 42.82658
bsvd$u[1:5,1:5]
              [,1]        [,2]         [,3]         [,4]        [,5]
[1,]  3.075740e-02  0.09861737 -0.026705796  0.001569226 -0.04384078
[2,]  6.480471e-05  0.04151230 -0.049215808  0.054763599 -0.01436232
[3,] -1.857597e-02 -0.02463540 -0.001246071  0.015838141  0.05725309
[4,]  3.555600e-02 -0.05715730 -0.009596880 -0.040247700 -0.03888504
[5,] -5.290501e-02  0.04784627 -0.006624289  0.010168748  0.01934154
bsvd$v[1:5,1:5]
              [,1]        [,2]         [,3]        [,4]        [,5]
[1,] -0.0677291931  0.03646765 -0.072013679 -0.06460296  0.03834539
[2,]  0.0558086446 -0.02816709 -0.067619533  0.02478961  0.01316491
[3,]  0.0024305441 -0.04679000  0.064115513  0.04134591 -0.03519768
[4,] -0.0007861777 -0.06602592 -0.062833191  0.06307396  0.03651487
[5,] -0.0165396063  0.07316212  0.002600021  0.07030477 -0.07916813

all.equal( sqrt(svd(tcrossprod(A))$d[1:10]), bsvd$d[1:10] )
[1] TRUE
all.equal( sqrt(svd(tcrossprod(A))$d[1:10]), Dbsvd$d[1:10] )
[1] TRUE

14.2 Block Singular Value Decomposition

A method developed by M. A. Iwen and B. W. Ong uses a distributed and incremental SVD algorithm that is useful for agglomerative data analysis on large networks. The algorithm calculates the singular values and left singular vectors of a matrix A by first partitioning it by columns. This creates a set of submatrices of A with the same number of rows but only some of its columns. After that, the SVD of each submatrix is computed. The final step merges these partial results and computes the SVD of the resulting matrix.
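The key step can be illustrated in a few lines of plain R: the SVDs of the column blocks are merged by concatenating the \(U_iD_i\) factors, which preserves the singular values and left singular vectors of the full matrix (a sketch of the idea, not the package's HDF5 implementation):

# One-level block SVD merge: split Ablk by columns, SVD each block,
# then take the SVD of the concatenated U_i %*% D_i factors
set.seed(42)
Ablk <- matrix(rnorm(100 * 40), nrow = 100)
s1 <- svd(Ablk[, 1:20]); s2 <- svd(Ablk[, 21:40])
merged <- cbind(s1$u %*% diag(s1$d), s2$u %*% diag(s2$d))
sm <- svd(merged)
all.equal(sm$d, svd(Ablk)$d)  # singular values of Ablk are recovered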

This approach is only implemented for HDF5 files, in the bdSVD_hdf5 function. This function works directly on the hdf5 data format, loading into memory only the data needed for each calculation and saving the results back to the hdf5 file for later use. The user is referred to section 7.3.1 of this vignette.

15 Using algebra procedures for DelayedMatrix objects to implement basic statistical methods

15.1 Principal component analysis (PCA)

Let us illustrate how to perform a PCA using miRNA data obtained from TCGA corresponding to 3 different tumor types: melanoma (ME), leukemia (LE) and central nervous system (CNS). The data are available in BigDataStatMeth and can be loaded by simply:

data(miRNA)
data(cancer)
dim(miRNA)
[1]  21 537

We observe that there are a total of 21 individuals and 537 miRNAs. The vector cancer contains the type of tumor of each individual. For each type we have:

table(cancer)
cancer
CNS  LE  ME 
  6   6   9 

Now, the typical principal component analysis on the samples can be run on the miRNA matrix since it has miRNAs in columns and individuals in rows:

pc <- prcomp(miRNA)

We can plot the two first components with:

plot(pc$x[, 1], pc$x[, 2],
     main = "miRNA data on tumor samples",
     xlab = "PC1", ylab = "PC2", type="n")
abline(h=0, v=0, lty=2)
points(pc$x[, 1], pc$x[, 2], col = cancer,
       pch=16, cex=1.2)
legend("topright", levels(cancer), pch=16, col=1:3)

The same analysis can be performed using the SVD, with miRNA as a DelayedMatrix object. The PCA is equivalent to performing the SVD on the centered data, where the centering occurs on the columns. Since we center the data ourselves here, bdSVD is called with the arguments bcenter and bscale set to FALSE. Notice that the sweep() and colMeans() functions can be applied to a DelayedMatrix object since these methods are implemented for that type of object in the DelayedArray package.

miRNAD <- DelayedArray(miRNA)
miRNAD.c <- DelayedArray::sweep(miRNAD, 2,
      DelayedArray::colMeans(miRNAD), "-")
svd.da <- bdSVD(miRNAD.c, bcenter = FALSE, bscale = FALSE)

The PCA plot for the first two principal components can then be obtained by:

plot(svd.da$u[, 1], svd.da$u[, 2],
     main = "miRNA data on tumor samples",
     xlab = "PC1", ylab = "PC2", type="n")
abline(h=0, v=0, lty=2)
points(svd.da$u[, 1], svd.da$u[, 2], col = cancer,
       pch=16, cex=1.2)
legend("topright", levels(cancer), pch=16, col=1:3)

We can observe that both figures are equal up to a sign change of the second component (which can happen in SVD).
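This correspondence can be checked numerically: the prcomp() scores are \(UD\), so up to a sign flip per component they match the scaled left singular vectors (a quick sketch, assuming the components come out in the same order):

# PCA scores equal U %*% diag(d), up to a sign flip per component
all.equal(abs(pc$x[, 1]), abs(svd.da$u[, 1] * svd.da$d[1]),
          check.attributes = FALSE)
all.equal(abs(pc$x[, 2]), abs(svd.da$u[, 2] * svd.da$d[2]),
          check.attributes = FALSE)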

16 Session information

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
[5] LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] rhdf5_2.32.2           microbenchmark_1.4-7  
 [3] DelayedArray_0.14.1    IRanges_2.22.2        
 [5] S4Vectors_0.26.1       BiocGenerics_0.34.0   
 [7] matrixStats_0.57.0     Matrix_1.2-18         
 [9] BigDataStatMeth_0.99.1 BiocStyle_2.16.0      
[11] knitr_1.29            

loaded via a namespace (and not attached):
 [1] RcppEigen_0.3.3.9.1 zoo_1.8-8           tidyselect_1.1.0   
 [4] xfun_0.16           beachmat_2.4.0      purrr_0.3.4        
 [7] splines_4.0.2       lattice_0.20-41     colorspace_2.0-0   
[10] vctrs_0.3.7         generics_0.0.2      htmltools_0.5.1.1  
[13] yaml_2.2.1          utf8_1.2.1          survival_3.2-3     
[16] rlang_0.4.10        pillar_1.6.0        glue_1.4.2         
[19] multcomp_1.4-13     lifecycle_1.0.0     stringr_1.4.0      
[22] munsell_0.5.0       gtable_0.3.0        mvtnorm_1.1-1      
[25] codetools_0.2-16    evaluate_0.14       fansi_0.4.2        
[28] TH.data_1.0-10      Rcpp_1.0.6          scales_1.1.1       
[31] BiocManager_1.30.10 magick_2.4.0        RcppParallel_5.1.2 
[34] farver_2.1.0        ggplot2_3.3.3       digest_0.6.27      
[37] stringi_1.4.6       bookdown_0.20       dplyr_1.0.2        
[40] grid_4.0.2          tools_4.0.2         sandwich_2.5-1     
[43] magrittr_2.0.1      tibble_3.1.0        crayon_1.4.1       
[46] pkgconfig_2.0.3     MASS_7.3-52         ellipsis_0.3.1     
[49] rsconnect_0.8.16    rmarkdown_2.7       Rhdf5lib_1.10.1    
[52] R6_2.5.0            compiler_4.0.2