ZINB-WaVE

June 27, 2017

biorxiv.org/content/early/2017/04/06/125112

bioconductor.org/packages/zinbwave

Outline

Motivation
The ZINB-WaVE model
Estimation procedure
Comparison with existing approaches

Motivation

Single-cell signal is low-dimensional

Wagner, Regev, Yosef (2016)

Single-cell signal is noisy

High level of technical and biological variance.
High drop-out rate (zero inflation).
Amplification bias (high expression outliers).
Unique Molecular Identifiers (UMIs) help but do not solve all these problems.

Problematic with typical workflow: global scaling + log + PCA

Characterization of Layer 5 cortical neurons

PCA is affected by sample quality

The ZINB-WaVE model

General and flexible a zero-inflated negative binomial (ZINB) model.

Extract low-dimensional signal from the data.

Accounting for zero inflation (dropouts), over-dispersion, and the count nature of the data.

The model includes a sample-level intercept, which serves as a global-scaling normalization factor.

Gives the user the ability to include both gene-level and sample-level covariates.

Inclusion of observed and unobserved sample-level covariates enables normalization for complex, non-linear effects (batch effects)

Gene-level covariates may be used to adjust for sequence composition effects, such as gene length and GC-content effects.

The negative binomial distribution

For any \(\mu\geq 0\) and \(\theta>0\), the probability mass function (PMF) of the negative binomial (NB) distribution is

\[ f_N(y;\mu,\theta) = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)} \left(\frac{\theta}{\theta+\mu}\right)^{\theta} \left(\frac{\mu}{\mu + \theta}\right)^y, \quad \forall y\in\mathbb{N}. \]

The mean of the NB distribution is \(\mu\) and its variance is: \[ \sigma^2 = \mu + \frac{\mu^2}{\theta}. \] In particular, the NB distribution boils down to a Poisson distribution when \(\theta = +\infty\).

The zero-inflated negative binomial

For any \(\pi\in[0,1]\), the PMF of the zero-inflated negative binomial (ZINB) distribution is given by

\[ f(y;\mu,\theta, \pi) = \pi \delta_0(y) + (1 - \pi) f_N(y;\mu,\theta), \quad \forall y\in\mathbb{N}, \]

where \(\delta_0(\cdot)\) is the Dirac function.

Here, \(\pi\) can be interpreted as the probability that a \(0\) is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.

The ZINB-WaVE model

Given \(n\) samples and \(J\) genes, let \(Y_{ij}\) denote the count of gene \(j\) (for \(j=1,\ldots,J\)) for sample \(i\) (for \(i=1,\ldots,n\)).

The ZINB-WaVE model

The \(W \alpha\) term refers to unknown low-dimensional variation, that could be due to unwanted technical effects, such as batch effects, or to biological effects of interest, such as cell cycle or cell differentiation.

\(W\) is the same, but \(X\) and \(V\) could differ in the modeling of \(\mu\) and \(\pi\), if we assume that some known factors do not affect both.

When \(X = 1_n\) and \(V = 1_J\), the model is a factor model akin to PCA, where \(W\) is a factor matrix and \(\alpha_\mu\) and \(\alpha_\pi\) are loading matrices.

Note that the intercept in \(V\) acts as a scaling factor, hence the model can be applied to raw counts (no need for prior normalization).

The model is more general, allowing the inclusion of additional sample and gene-level covariates that might help to infer the unknown factors.

The model includes a gene-wise dispersion parameter, but in practice a single common dispersion works well.

Estimation procedure

To estimate the parameters \((\alpha, \beta, \gamma, W, \theta)\), we follow a penalized maximum likelihood approach, to reduce overfitting and improve the numerical stability of the optimization problem in the setting of many parameters.

\[ \max_{\beta,\gamma,W,\alpha,\zeta} \left\{\ell(\beta,\gamma,W,\alpha,\zeta) - \text{Pen}(\beta,\gamma,W,\alpha,\zeta) \right\}\,, \]

where \(\zeta = \log \theta\) and

\[ \text{Pen}(\beta,\gamma,W,\alpha,\zeta) = \frac{\epsilon_{\beta}}{2} \|\beta^0\|^2 + \frac{\epsilon_{\gamma}}{2} \|\gamma^0\|^2 + \frac{\epsilon_{W}}{2}\|W\|^2 + \frac{\epsilon_{\alpha}}{2}\|\alpha\|^2 + \frac{\epsilon_{\zeta}}{2}\text{Var}(\zeta) \,. \]

Estimation procedure

\(\beta^0\) and \(\gamma^0\) denote the matrices \(\beta\) and \(\gamma\) without the rows corresponding to the intercepts if an unpenalized intercept is included in the model.

\(\|\cdot\|\) is the Frobenius matrix norm ( \(\|A\| = \sqrt{\text{tr}(A^*A)}\), where \(A^*\) denotes the conjugate transpose of \(A\)).

The penalty shrinks the estimated parameters to \(0\), except for the cell and gene-specific intercepts which are not penalized and the dispersion parameters which are shrunk towards a constant value across genes (cf. edgeR).

The penalty also ensures that at the optimum \(W\) and \(\alpha^T\) have orthogonal columns, which is useful for visualization or interpretation of latent factors.

Optimization procedure

The penalized likelihood is not concave, making its maximization computationally challenging.

We instead find a local maximum, starting from a smart initialization and iterating a numerical optimization scheme until local convergence.

Optimization procedure

Dispersion optimization

\[ \hat{\zeta} \leftarrow \arg\max_{\zeta} \left\{\ell(\hat{\beta}, \hat{\gamma}, \hat{W}, \hat{\alpha}, \zeta) - \frac{\epsilon_\zeta}{2} \text{Var}(\zeta) \right\} \,. \]

Left factor (cell-specific) optimization

\[ \left(\hat{\gamma}, \hat{W} \right) \leftarrow \arg\max_{\left(\gamma, W\right)} \left\{\ell(\hat{\beta}, \gamma, W, \hat{\alpha}, \hat{\zeta}) - \frac{\epsilon_\gamma}{2} \|\gamma^0\|^2 - \frac{\epsilon_W}{2} \|W\|^2 \right\}\,. \]

Right factor (sample-specific) optimization

\[ \left(\hat{\beta}, \hat{\alpha} \right) \leftarrow \arg\max_{\left(\beta, \alpha \right)} \left\{\ell(\beta, \hat{\gamma},\hat{W}, \alpha, \hat{\zeta}) - \frac{\epsilon_\beta}{2} \|\beta^0\|^2 - \frac{\epsilon_\alpha}{2} \|\alpha\|^2\right\}\,. \]

Orthogonalization

\[ \left(\hat{W}, \hat{\alpha} \right) \leftarrow \arg\min_{(W,\alpha)\,:\,W\alpha = \hat{W}\hat{\alpha}} \frac{1}{2} \left( \epsilon_W \|W\|^2 + \epsilon_\alpha \|\alpha\|^2 \right)\,. \]

Comparison with existing approaches

Simulated data (correlation)

Simulated data (silhouette width)

Real data (silhouette width)

Back to layer 5 neurons…

PCA is affected by sample quality

ZINB-WaVE adjusts for QC

qcpca <- prcomp(qc, scale. = TRUE)

qcmod <- model.matrix(~qcpca$x[,1:2])
head(qcmod)

##   (Intercept) qcpca$x[, 1:2]PC1 qcpca$x[, 1:2]PC2
## 1           1        -2.0656901        0.53766716
## 2           1         1.8332237       -0.11599891
## 3           1         1.2944881       -0.03171802
## 4           1         0.5874294        0.87552522
## 5           1        -2.0671591        1.12292102
## 6           1        -2.1200918        0.94022293

zinbres <- zinbFit(filtered, K = 3, X = qcmod)

ZINB-WaVE adjusts for QC

ZINB-WaVE yields biologically meaningful signal

Validation

Conclusions and future directions

ZINB-WaVE is a general and flexible method for low-dimensional signal extraction from noisy, zero-inflated count data.

It works well for projecting single-cell data in two dimensions as a dimensionality reduction step prior to clustering or pseudotime inference.

We are extending the model to provide normalized values for visualization (residuals).

And to supervised differential expression problems (test \(\beta = 0\)).

Acknowledgements

Co-authors

Fanny Perradeau
Svetlana Gribkova
Sandrine Dudoit
Jean-Philippe Vert

John Ngai Lab

John Ngai
David Stafford
Justin Choi

Thank you!

ZINB-Wave preprint: doi.org/10.1101/125112

zinbwave package: bioconductor.org/packages/zinbwave

These slides: rpubs.com/daviderisso

Contacts

Github: @drisso

Twitter: @drisso1893

Email: dar2062@med.cornell.edu