August 3, 2016

Outline

  • Motivation
  • Experimental design: batch effects and confounding
  • Normalization: accounting for sample quality
  • SCONE: An R package to normalize single-cell RNA-seq


slides available at: http://rpubs.com/daviderisso

Illuminating cellular diversity in the brain

The brain is made up of hundreds, if not thousands, of different cell types.
We need a rational way to identify and classify them.

Single-cell sequencing of S1 cortex

  • 285 Layer 4 cells
  • 657 Layer 5 cells
  • 307 Layer 6 cells

Experimental design: batch effects and confounding

Experimental process

Low sequencing depth: 192 cells per Illumina lane (average 1.2M reads per cell)

Snapshot of the data

Dataset              Olfactory   Cortical
Mice                        51         41
C1 runs                     61         40
Illumina lanes              19          7
Cells                    2,627      1,249
Cells passing QC         2,190      1,042
Sequenced reads        4,000 M    1,500 M

Ideal design: Factorial experiment

Each level of the factor of interest, say layer of origin, is observed in each batch.

Technical limitation: Complete confounding

Partial solution: Replication

Need to have multiple batches per condition and account for batch effects in the design matrix (nested design).

Note that mouse and C1 run effects are still confounded!
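The three layouts discussed above (factorial, completely confounded, replicated/nested) can be told apart by cross-tabulating condition against batch. The following is a minimal sketch, not part of the original analysis; the function name `classify_design` and the labels it returns are hypothetical.

```python
from collections import defaultdict

def classify_design(samples):
    """Classify a layout given (condition, batch) pairs, one pair per cell.

    Returns "factorial", "partial", "nested", or "confounded".
    """
    batches_by_condition = defaultdict(set)
    conditions_by_batch = defaultdict(set)
    for condition, batch in samples:
        batches_by_condition[condition].add(batch)
        conditions_by_batch[batch].add(condition)

    all_batches = set(conditions_by_batch)
    # Factorial: every condition is observed in every batch.
    if all(b == all_batches for b in batches_by_condition.values()):
        return "factorial"
    # Some batch holds several conditions, but the crossing is incomplete.
    if any(len(c) > 1 for c in conditions_by_batch.values()):
        return "partial"
    # Each batch holds a single condition: the design is nested only if
    # every condition is replicated across at least two batches.
    if all(len(b) > 1 for b in batches_by_condition.values()):
        return "nested"
    # At least one condition cannot be separated from its only batch.
    return "confounded"
```

For example, `classify_design([("L4", "b1"), ("L5", "b2")])` returns `"confounded"`, because each layer is observed in exactly one batch.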


See also in bioRxiv:
Tung et al. (2016) http://dx.doi.org/10.1101/062919

Partial solution: Nested design

To account for batch \(j(i)\) in condition \(i\), we can model the log-expression of each sample \(k\) as

\[ y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}, \]

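subject to the \(n+1\) constraints

\[ \sum_{i=1}^n \alpha_i = 0 \quad \text{and} \quad \sum_{j=1}^{n_i} \beta_{j(i)} = 0 \quad \text{for each } i = 1, \ldots, n, \]

where \(n\) is the number of conditions and \(n_i\) is the number of batches in condition \(i\).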
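Under these sum-to-zero constraints the least-squares estimates have a closed form: the fitted value of every cell is its batch mean, \(\mu\) is the mean of the condition means, \(\alpha_i\) is the deviation of condition \(i\) from \(\mu\), and \(\beta_{j(i)}\) is the deviation of batch \(j\) from its condition mean. The sketch below illustrates this for a single gene; the function name `fit_nested` and its input layout are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def fit_nested(observations):
    """Fit y_ijk = mu + alpha_i + beta_j(i) + error for one gene.

    observations: iterable of (condition, batch, y), batch nested in condition.
    Returns (mu, alpha, beta); alpha is keyed by condition and beta by
    (condition, batch).
    """
    values = defaultdict(list)
    for condition, batch, y in observations:
        values[(condition, batch)].append(y)

    # The least-squares fitted value for each cell is its batch mean.
    batch_mean = {key: mean(ys) for key, ys in values.items()}

    # Condition mean = unweighted mean of that condition's batch means.
    by_condition = defaultdict(list)
    for (condition, _), m in batch_mean.items():
        by_condition[condition].append(m)
    condition_mean = {c: mean(ms) for c, ms in by_condition.items()}

    mu = mean(list(condition_mean.values()))
    alpha = {c: m - mu for c, m in condition_mean.items()}       # sums to 0
    beta = {(c, b): m - condition_mean[c]                        # sums to 0 within c
            for (c, b), m in batch_mean.items()}
    return mu, alpha, beta
```

By construction, `mu + alpha[c] + beta[(c, b)]` reproduces the mean of batch `b`, so the condition effect is estimated from between-batch variation within the nested layout.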

Sample quality influences gene expression

Absolute correlation between the principal components of log(TPM+1) and the QC scores.

Sample quality differs among batches

PC1 of the QC score matrix stratified by batch.

Normalization: accounting for sample quality

A general normalization framework

RUV – negative control genes (RUVg)

  1. Identify a subset of negative control genes, i.e., non-DE genes, for which \(\beta_c = 0\). Then, \[ \log E[Y_c | W, X] = W \, \alpha_c. \]
  2. Perform the singular value decomposition (SVD) of \(\log Y_c\), \[ \log Y_c = U \Lambda V^T. \] For a given \(k\), estimate \(W\) by \[ \widehat{W} = U \Lambda_k, \] where \(\Lambda_k\) retains only the \(k\) largest singular values of \(\Lambda\).
  3. Substitute \(\widehat{W}\) into the model for the full set of \(J\) genes and estimate both \(\alpha\) and \(\beta\) by GLM regression.
  4. (Optionally) Define normalized read counts as the residuals from ordinary least squares (OLS) regression of \(\log Y\) on \(\widehat{W}\).

Risso et al. (2014) [http://dx.doi.org/10.1038/nbt.2931]
RUVSeq package [http://bioconductor.org/packages/RUVSeq/]
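Steps 2 and 4 can be sketched in a few lines of linear algebra. This is an illustrative approximation rather than the RUVSeq implementation: it assumes NumPy is available, works on a samples-by-genes matrix of log-expression values, centers each gene before the SVD (an assumption of this sketch), and skips the GLM fit of step 3. The function name `ruvg_sketch` is hypothetical.

```python
import numpy as np

def ruvg_sketch(log_y, control_idx, k):
    """Estimate k factors of unwanted variation from negative control genes.

    log_y: samples x genes array of log-expression, e.g. log(count + 1).
    control_idx: column indices of the negative control genes.
    Returns (w_hat, normalized), where normalized holds the OLS residuals
    of log_y on w_hat with the gene means added back.
    """
    log_y = np.asarray(log_y, dtype=float)
    # Center each gene so the SVD describes variation around the gene mean.
    centered = log_y - log_y.mean(axis=0)
    # Step 2: SVD restricted to the control genes.
    u, s, _ = np.linalg.svd(centered[:, control_idx], full_matrices=False)
    w_hat = u[:, :k] * s[:k]  # first k left singular vectors, scaled
    # Step 4: regress every gene on w_hat and remove the fitted part.
    coef, *_ = np.linalg.lstsq(w_hat, centered, rcond=None)
    normalized = log_y - w_hat @ coef
    return w_hat, normalized
```

Because the residuals depend only on the column space of `w_hat`, scaling the singular vectors by the singular values does not change the normalized values.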

RUV captures sample quality

Should we scale the data before RUV?
QC or RUV? How many factors? Batch?

SCONE: An R package to normalize single-cell RNA-seq

SCONE

  1. Apply a (combination of) normalization method(s).

    • Global scaling, e.g., DESeq, TMM.
    • Full-quantile (FQ) normalization.
    • Unknown factors of unwanted variation, e.g., RUV.
    • Known factors of unwanted variation, e.g., regression on QC measures, C1 batch.
  2. Rank the normalizations using a set of performance scores.

R package available at https://github.com/YosefLab/scone



Michael Cole, Elizabeth Purdom, Nir Yosef, Sandrine Dudoit

SCONE performance metrics

  • BIO_SIL: Average silhouette width by biological condition.
  • BATCH_SIL: Average silhouette width by batch.
  • PAM_SIL: Maximum average silhouette width for PAM clusterings, for a range of user-supplied numbers of clusters.
  • EXP_QC_COR: Maximum squared Spearman correlation between count PCs and QC measures.
  • EXP_UV_COR: Maximum squared Spearman correlation between count PCs and factors of unwanted variation (preferably derived from a set of negative control genes other than the one used in RUV).
  • EXP_WV_COR: Maximum squared Spearman correlation between count PCs and factors of wanted variation (derived from positive control genes).
  • RLE_MED: Mean squared median relative log expression (RLE).
  • RLE_IQR: Mean inter-quartile range (IQR) of RLE.
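As an illustration of the last two metrics, the sketch below computes RLE_MED and RLE_IQR as they are described in the list above. It is not the scone implementation; the function name `rle_metrics`, the input layout (one list of log-expression values per sample), and the quartile convention are assumptions.

```python
from statistics import mean, median, quantiles

def rle_metrics(log_expr):
    """Compute (RLE_MED, RLE_IQR) from log-expression values.

    log_expr: list of samples, each a list of log-expression values
    with genes in the same order.
    """
    n_genes = len(log_expr[0])
    # Reference profile: median log-expression of each gene across samples.
    gene_median = [median(sample[g] for sample in log_expr)
                   for g in range(n_genes)]

    squared_medians, iqrs = [], []
    for sample in log_expr:
        # Relative log expression: deviation from the reference profile.
        rle = [value - m for value, m in zip(sample, gene_median)]
        q1, _, q3 = quantiles(rle, n=4, method="inclusive")
        squared_medians.append(median(rle) ** 2)
        iqrs.append(q3 - q1)

    # Average the per-sample summaries over all samples.
    return mean(squared_medians), mean(iqrs)
```

For well-normalized data both values should be small: each sample's RLE distribution is then centered near zero and has a similar, narrow spread.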

Application to the BRAIN dataset

Apply and evaluate 99 combinations

  • Scaling methods: None, TMM, FQ.
  • UV factors: None, RUV (\(k=1, \ldots, 5\)), QC (\(k=1, \ldots, 5\)).
  • Adjust for batch: yes/no.

Select a normalization procedure

Top three methods by average score:

  1. no scaling, RUV (k=4), batch corrected.
  2. TMM scaling, RUV (k=4), batch corrected.
  3. FQ scaling, QC (k=5), batch corrected.
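One plausible way to combine several metrics into a single ordering is to orient each metric so that larger is better, rank the methods per metric, and average the ranks. The sketch below shows that idea; it is not necessarily how scone computes its average score, and the choice of which metrics count as "lower is better" is an assumption based on the metric descriptions. The function name `rank_normalizations` is hypothetical.

```python
# Metrics for which a lower value is assumed to indicate a better
# normalization (an assumption of this sketch).
LOWER_IS_BETTER = {"BATCH_SIL", "EXP_QC_COR", "EXP_UV_COR", "RLE_MED", "RLE_IQR"}

def rank_normalizations(scores):
    """Order methods best-first by their mean rank across metrics.

    scores: {method: {metric: value}}, with the same metrics for every method.
    Returns (ordered_methods, mean_rank).
    """
    methods = list(scores)
    metrics = list(next(iter(scores.values())))
    total_rank = {m: 0.0 for m in methods}
    for metric in metrics:
        sign = -1.0 if metric in LOWER_IS_BETTER else 1.0
        oriented = {m: sign * scores[m][metric] for m in methods}
        # Rank 1 is worst; tied methods share the average of their ranks.
        for m in methods:
            below = sum(oriented[o] < oriented[m] for o in methods)
            equal = sum(oriented[o] == oriented[m] for o in methods)
            total_rank[m] += below + (equal + 1) / 2
    mean_rank = {m: total_rank[m] / len(metrics) for m in methods}
    ordered = sorted(methods, key=lambda m: mean_rank[m], reverse=True)
    return ordered, mean_rank
```

Averaging ranks rather than raw values keeps metrics on different scales (silhouette widths, squared correlations, RLE summaries) from dominating one another.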

Exploring scone results

Points color-coded by average score.

Scaling methods behave similarly

Batch effects must be taken into account…

Importance of preserving the biological difference in a nested design.

…but additional factors of UV are needed!

TPM normalization

Heatmap of the 100 most variable genes

RUV (k=4) + nested batch

Heatmap of the 100 most variable genes

Summary

  • Sample quality influences (single-cell) RNA-seq expression data.
  • Sample quality may differ between batches.
    • Hence the need for an appropriate experimental design and statistical model.
  • We propose a flexible linear model to account for known and unknown factors of unwanted variation.
  • scone helps explore different normalization schemes and rank them according to performance scores.
  • R package available at https://github.com/YosefLab/scone
