June 27, 2017

Outline

  • Motivation

  • The ZINB-WaVE model

  • Estimation procedure

  • Comparison with existing approaches

Motivation

Single-cell signal is low-dimensional

Wagner, Regev, Yosef (2016)

Single-cell signal is noisy

  • High level of technical and biological variance.
  • High drop-out rate (zero inflation).
  • Amplification bias (high expression outliers).
  • Unique Molecular Identifiers (UMIs) help but do not solve all these problems.

Problematic with typical workflow: global scaling + log + PCA

Characterization of Layer 5 cortical neurons

PCA is affected by sample quality

The ZINB-WaVE model

The ZINB-WaVE model

  • General and flexible a zero-inflated negative binomial (ZINB) model.
  • Extract low-dimensional signal from the data.
  • Accounting for zero inflation (dropouts), over-dispersion, and the count nature of the data.
  • The model includes a sample-level intercept, which serves as a global-scaling normalization factor.
  • Gives the user the ability to include both gene-level and sample-level covariates.
  • Inclusion of observed and unobserved sample-level covariates enables normalization for complex, non-linear effects (batch effects)
  • Gene-level covariates may be used to adjust for sequence composition effects, such as gene length and GC-content effects.

The negative binomial distribution

For any \(\mu\geq 0\) and \(\theta>0\), the probability mass function (PMF) of the negative binomial (NB) distribution is

\[ f_N(y;\mu,\theta) = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)} \left(\frac{\theta}{\theta+\mu}\right)^{\theta} \left(\frac{\mu}{\mu + \theta}\right)^y, \quad \forall y\in\mathbb{N}. \]

The mean of the NB distribution is \(\mu\) and its variance is: \[ \sigma^2 = \mu + \frac{\mu^2}{\theta}. \] In particular, the NB distribution boils down to a Poisson distribution when \(\theta = +\infty\).

The zero-inflated negative binomial

For any \(\pi\in[0,1]\), the PMF of the zero-inflated negative binomial (ZINB) distribution is given by

\[ f(y;\mu,\theta, \pi) = \pi \delta_0(y) + (1 - \pi) f_N(y;\mu,\theta), \quad \forall y\in\mathbb{N}, \]

where \(\delta_0(\cdot)\) is the Dirac function.

Here, \(\pi\) can be interpreted as the probability that a \(0\) is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.

The ZINB-WaVE model

Given \(n\) samples and \(J\) genes, let \(Y_{ij}\) denote the count of gene \(j\) (for \(j=1,\ldots,J\)) for sample \(i\) (for \(i=1,\ldots,n\)).

The ZINB-WaVE model

  • The \(W \alpha\) term refers to unknown low-dimensional variation, that could be due to unwanted technical effects, such as batch effects, or to biological effects of interest, such as cell cycle or cell differentiation.
  • \(W\) is the same, but \(X\) and \(V\) could differ in the modeling of \(\mu\) and \(\pi\), if we assume that some known factors do not affect both.
  • When \(X = 1_n\) and \(V = 1_J\), the model is a factor model akin to PCA, where \(W\) is a factor matrix and \(\alpha_\mu\) and \(\alpha_\pi\) are loading matrices.
  • Note that the intercept in \(V\) acts as a scaling factor, hence the model can be applied to raw counts (no need for prior normalization).
  • The model is more general, allowing the inclusion of additional sample and gene-level covariates that might help to infer the unknown factors.
  • The model includes a gene-wise dispersion parameter, but in practice a single common dispersion works well.

Estimation procedure

To estimate the parameters \((\alpha, \beta, \gamma, W, \theta)\), we follow a penalized maximum likelihood approach, to reduce overfitting and improve the numerical stability of the optimization problem in the setting of many parameters.

\[ \max_{\beta,\gamma,W,\alpha,\zeta} \left\{\ell(\beta,\gamma,W,\alpha,\zeta) - \text{Pen}(\beta,\gamma,W,\alpha,\zeta) \right\}\,, \]

where \(\zeta = \log \theta\) and

\[ \text{Pen}(\beta,\gamma,W,\alpha,\zeta) = \frac{\epsilon_{\beta}}{2} \|\beta^0\|^2 + \frac{\epsilon_{\gamma}}{2} \|\gamma^0\|^2 + \frac{\epsilon_{W}}{2}\|W\|^2 + \frac{\epsilon_{\alpha}}{2}\|\alpha\|^2 + \frac{\epsilon_{\zeta}}{2}\text{Var}(\zeta) \,. \]

Estimation procedure

  • \(\beta^0\) and \(\gamma^0\) denote the matrices \(\beta\) and \(\gamma\) without the rows corresponding to the intercepts if an unpenalized intercept is included in the model.
  • \(\|\cdot\|\) is the Frobenius matrix norm ( \(\|A\| = \sqrt{\text{tr}(A^*A)}\), where \(A^*\) denotes the conjugate transpose of \(A\)).
  • The penalty shrinks the estimated parameters to \(0\), except for the cell and gene-specific intercepts which are not penalized and the dispersion parameters which are shrunk towards a constant value across genes (cf. edgeR).
  • The penalty also ensures that at the optimum \(W\) and \(\alpha^T\) have orthogonal columns, which is useful for visualization or interpretation of latent factors.

Optimization procedure

The penalized likelihood is not concave, making its maximization computationally challenging.

We instead find a local maximum, starting from a smart initialization and iterating a numerical optimization scheme until local convergence.

Optimization procedure

  • Dispersion optimization

\[ \hat{\zeta} \leftarrow \arg\max_{\zeta} \left\{\ell(\hat{\beta}, \hat{\gamma}, \hat{W}, \hat{\alpha}, \zeta) - \frac{\epsilon_\zeta}{2} \text{Var}(\zeta) \right\} \,. \]

  • Left factor (cell-specific) optimization

\[ \left(\hat{\gamma}, \hat{W} \right) \leftarrow \arg\max_{\left(\gamma, W\right)} \left\{\ell(\hat{\beta}, \gamma, W, \hat{\alpha}, \hat{\zeta}) - \frac{\epsilon_\gamma}{2} \|\gamma^0\|^2 - \frac{\epsilon_W}{2} \|W\|^2 \right\}\,. \]

  • Right factor (sample-specific) optimization

\[ \left(\hat{\beta}, \hat{\alpha} \right) \leftarrow \arg\max_{\left(\beta, \alpha \right)} \left\{\ell(\beta, \hat{\gamma},\hat{W}, \alpha, \hat{\zeta}) - \frac{\epsilon_\beta}{2} \|\beta^0\|^2 - \frac{\epsilon_\alpha}{2} \|\alpha\|^2\right\}\,. \]

  • Orthogonalization

\[ \left(\hat{W}, \hat{\alpha} \right) \leftarrow \arg\min_{(W,\alpha)\,:\,W\alpha = \hat{W}\hat{\alpha}} \frac{1}{2} \left( \epsilon_W \|W\|^2 + \epsilon_\alpha \|\alpha\|^2 \right)\,. \]

Comparison with existing approaches

Simulated data (correlation)

Simulated data (silhouette width)

Real data (silhouette width)

Back to layer 5 neurons…

PCA is affected by sample quality

ZINB-WaVE adjusts for QC

qcpca <- prcomp(qc, scale. = TRUE)
qcmod <- model.matrix(~qcpca$x[,1:2])
head(qcmod)
##   (Intercept) qcpca$x[, 1:2]PC1 qcpca$x[, 1:2]PC2
## 1           1        -2.0656901        0.53766716
## 2           1         1.8332237       -0.11599891
## 3           1         1.2944881       -0.03171802
## 4           1         0.5874294        0.87552522
## 5           1        -2.0671591        1.12292102
## 6           1        -2.1200918        0.94022293
zinbres <- zinbFit(filtered, K = 3, X = qcmod)

ZINB-WaVE adjusts for QC

ZINB-WaVE yields biologically meaningful signal

ZINB-WaVE yields biologically meaningful signal

Validation

Conclusions and future directions

  • ZINB-WaVE is a general and flexible method for low-dimensional signal extraction from noisy, zero-inflated count data.
  • It works well for projecting single-cell data in two dimensions as a dimensionality reduction step prior to clustering or pseudotime inference.
  • We are extending the model to provide normalized values for visualization (residuals).
  • And to supervised differential expression problems (test \(\beta = 0\)).

Acknowledgements

Co-authors

  • Fanny Perradeau
  • Svetlana Gribkova
  • Sandrine Dudoit
  • Jean-Philippe Vert

John Ngai Lab

  • John Ngai
  • David Stafford
  • Justin Choi

Thank you!