Design and normalization of single-cell RNA-seq

August 3, 2016

Outline

Motivation
Experimental design: batch effects and confounding
Normalization: accounting for sample quality
SCONE: An R package to normalize single-cell RNA-seq

slides available at: http://rpubs.com/daviderisso

Illuminating cellular diversity in the brain

The brain is made up of hundreds, if not thousands, of different cell types.
We need a rational way to identify and classify them.

Single-cell sequencing of S1 cortex

285 Layer 4 cells
657 Layer 5 cells
307 Layer 6 cells

Experimental design: batch effects and confounding

Experimental process

Low sequencing depth: 192 cells per Illumina lane (average 1.2M reads per cell)

Snapshot of the data

Measure	Olfactory	Cortical
Mice	51	41
C1 runs	61	40
Illumina Lanes	19	7
Cells	2,627	1,249
Cells pass QC	2,190	1,042
Sequenced reads	4,000 M	1,500 M

Ideal design: Factorial experiment

Each level of the factor of interest, say layer of origin, is observed in each batch.

Technical limitation: Complete confounding

We can only isolate one cell type per animal / batch.

See also on bioRxiv:
Hicks et al. (2015) http://dx.doi.org/10.1101/025528

Partial solution: Replication

Need to have multiple batches per condition and account for batch effects in the design matrix (nested design).

Note that mouse and C1 run effects are still confounded!

See also on bioRxiv:
Tung et al. (2016) http://dx.doi.org/10.1101/062919
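To make the three layouts above concrete (factorial, completely confounded, replicated), here is a minimal sketch that classifies a design from per-cell (condition, batch) labels. The function name `classify_design` and the returned labels are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def classify_design(samples):
    """Classify a layout given (condition, batch) pairs, one per cell.

    Hypothetical helper for illustration; the category names are ours.
    """
    batches_by_condition = defaultdict(set)
    conditions_by_batch = defaultdict(set)
    for condition, batch in samples:
        batches_by_condition[condition].add(batch)
        conditions_by_batch[batch].add(condition)

    all_conditions = set(batches_by_condition)

    # Factorial: every condition is observed in every batch.
    if len(all_conditions) > 1 and all(
        conds == all_conditions for conds in conditions_by_batch.values()
    ):
        return "factorial"

    # Some batch mixes conditions, but not all of them.
    if any(len(conds) > 1 for conds in conditions_by_batch.values()):
        return "partially crossed"

    # From here on each batch holds exactly one condition (nested layouts).
    if all(len(b) == 1 for b in batches_by_condition.values()):
        return "completely confounded"
    if all(len(b) > 1 for b in batches_by_condition.values()):
        return "nested with replication"
    return "nested, some conditions unreplicated"
```

For example, `classify_design([("L4", "b1"), ("L5", "b2")])` reports complete confounding, while giving each layer two or more batches moves the design to the replicated, nested case.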

Partial solution: Nested design

To account for batch \(j(i)\) in condition \(i\), we can model the log-expression of each sample \(k\) as

\[ y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}, \]

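subject to the \(n+1\) sum-to-zero constraints (one on the condition effects, plus one per condition on its batch effects)

\[ \sum_{i=1}^n \alpha_i = 0; \qquad \sum_{j=1}^{n_i} \beta_{j(i)} = 0, \quad i = 1, \ldots, n, \]

where \(n\) is the number of conditions and \(n_i\) is the number of batches in condition \(i\).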
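For a single gene, the least-squares fit of the nested model under these sum-to-zero constraints can be written directly in terms of batch means. The sketch below assumes that setting; `fit_nested` and its input format are hypothetical names for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_nested(observations):
    """Least-squares estimates for y_ijk = mu + alpha_i + beta_j(i) + eps_ijk.

    `observations` is an iterable of (condition, batch, y) triples.
    Returns (mu, alpha, beta), with alpha keyed by condition and beta keyed
    by (condition, batch), satisfying the sum-to-zero constraints.
    """
    values = defaultdict(list)
    for condition, batch, y in observations:
        values[(condition, batch)].append(y)

    # The fitted value of every batch is its own mean (saturated in batches).
    cell_mean = {key: mean(v) for key, v in values.items()}

    # mu + alpha_i is the unweighted mean of the batch means in condition i,
    # because the beta_j(i) sum to zero within each condition.
    by_condition = defaultdict(list)
    for (condition, _batch), m in cell_mean.items():
        by_condition[condition].append(m)
    condition_mean = {c: mean(ms) for c, ms in by_condition.items()}

    # mu is the unweighted mean over conditions, because the alpha_i sum to zero.
    mu = mean(list(condition_mean.values()))
    alpha = {c: cm - mu for c, cm in condition_mean.items()}
    beta = {key: m - condition_mean[key[0]] for key, m in cell_mean.items()}
    return mu, alpha, beta
```

Note that the condition effect is estimated from differences between batch means, so with a single batch per condition it cannot be separated from the batch effect.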

Sample quality influences gene expression

Absolute correlation between the principal components of log(TPM+1) and the QC scores.
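A minimal sketch of how such a correlation matrix could be computed with NumPy. It assumes a samples-by-genes matrix of log(TPM+1) values and a samples-by-measures QC matrix, and it uses Pearson correlation, which is an assumption since the slide does not say which correlation was used. The function name is hypothetical.

```python
import numpy as np


def pc_qc_abs_correlation(log_expr, qc, n_pcs=3):
    """Absolute Pearson correlation between expression PCs and QC measures.

    log_expr: (samples x genes) array, e.g. log(TPM + 1).
    qc:       (samples x measures) array of quality-control scores.
    Returns an (n_pcs x measures) array.
    """
    n_samples = log_expr.shape[0]
    centered = log_expr - log_expr.mean(axis=0)
    # Principal component scores from the SVD of the gene-centered matrix.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pcs = u[:, :n_pcs] * s[:n_pcs]
    # Pearson r is the mean product of z-scores (population standard deviation).
    pcs_z = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)
    qc_z = (qc - qc.mean(axis=0)) / qc.std(axis=0)
    return np.abs(pcs_z.T @ qc_z) / n_samples
```

Large entries in the first rows indicate that the leading components of expression track sample quality rather than biology.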

Sample quality differs among batches

PC1 of the QC score matrix stratified by batch.

Normalization: accounting for sample quality

A general normalization framework

RUV can be used to estimate \(W \alpha\) using negative control genes.
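The model itself is not reproduced in the text of this slide. Assuming it is the log-linear model of Risso et al. (2014), written to agree with the negative-control equation on the RUVg slide, it can be sketched as

\[ \log E[Y | W, X] = W \alpha + X \beta, \]

where \(Y\) is the samples-by-genes count matrix, \(W\) holds the unobserved factors of unwanted variation with coefficients \(\alpha\), and \(X\) holds the covariates of interest with coefficients \(\beta\).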

Gagnon-Bartsch & Speed (2012) [http://dx.doi.org/10.1093/biostatistics/kxr034]
Risso et al. (2014) [http://dx.doi.org/10.1038/nbt.2931]
RUVSeq package [http://bioconductor.org/packages/RUVSeq/]

RUV – negative control genes (RUVg)

Identify a subset of negative control genes, i.e., non-DE genes, for which \(\beta_c = 0\). Then, \[ \log E[Y_c | W, X] = W \, \alpha_c. \]
Perform the singular value decomposition (SVD) of \(\log Y_c\), \[ \log Y_c = U \Lambda V^T. \] For a given \(k\), estimate \(W\) by \[ \widehat{W} = U \Lambda_k. \]
Substitute \(\widehat{W}\) into the model for the full set of \(J\) genes and estimate both \(\alpha\) and \(\beta\) by GLM regression.
(Optionally) Define normalized read counts as the residuals from ordinary least squares (OLS) regression of \(\log Y\) on \(\widehat{W}\).
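The steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the RUVSeq implementation: it takes a samples-by-genes matrix of log counts, centers each gene before the SVD (an assumption), and covers only the SVD step and the optional OLS residual step, not the GLM fit of \(\alpha\) and \(\beta\). The function name `ruvg` and its arguments are chosen for this example.

```python
import numpy as np


def ruvg(log_y, control_idx, k):
    """Sketch of RUVg on a (samples x genes) matrix of log counts.

    control_idx: column indices of the negative control genes.
    k:           number of factors of unwanted variation to estimate.
    Returns (w_hat, normalized), both with one row per sample.
    """
    # Center each gene so the SVD captures variation across samples.
    centered = log_y - log_y.mean(axis=0)
    # SVD of the negative control genes only: log Y_c = U Lambda V^T.
    u, s, _ = np.linalg.svd(centered[:, control_idx], full_matrices=False)
    # W_hat = U Lambda_k: keep the k leading left singular vectors, scaled.
    w_hat = u[:, :k] * s[:k]
    # OLS regression of log Y on W_hat for all genes (no intercept), then
    # keep the residuals with each gene's mean left in place.
    alpha_hat, *_ = np.linalg.lstsq(w_hat, log_y, rcond=None)
    normalized = log_y - w_hat @ alpha_hat
    return w_hat, normalized
```

Because the columns of `w_hat` sum to zero, removing `w_hat @ alpha_hat` leaves gene means unchanged. If the unwanted factors are correlated with the biology of interest, this residual step also removes part of the biological signal, which is one reason to estimate \(\alpha\) and \(\beta\) jointly as in the third step.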

Risso et al. (2014) [http://dx.doi.org/10.1038/nbt.2931]
RUVSeq package [http://bioconductor.org/packages/RUVSeq/]

RUV captures sample quality

Should we scale the data before RUV?
QC or RUV? How many factors? Batch?

SCONE: An R package to normalize single-cell RNA-seq

SCONE

Apply a (combination of) normalization method(s).
- Global scaling, e.g., DESeq, TMM.
- Full-quantile (FQ)
- Unknown factors of unwanted variation, e.g., RUV.
- Known factors of unwanted variation, e.g., regression on QC measures, C1 batch.
Rank the normalizations using a set of performance scores.
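As an illustration of one scaling step listed above, here is a simplified sketch of full-quantile normalization in plain Python. It is not the scone implementation: ties are broken by gene order rather than averaged, and the function name is hypothetical.

```python
def full_quantile_normalize(counts):
    """Force every sample to share the same empirical distribution.

    counts: list of samples, each a list of per-gene values (equal lengths).
    Returns a new list of lists with the same shape.
    """
    n_genes = len(counts[0])
    sorted_samples = [sorted(sample) for sample in counts]
    # Reference distribution: mean across samples at each rank.
    reference = [
        sum(s[rank] for s in sorted_samples) / len(counts)
        for rank in range(n_genes)
    ]
    normalized = []
    for sample in counts:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda g: sample[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized
```

After this step, every sample has identical sorted values, so any remaining differences between samples lie in which genes occupy which ranks.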

R package available at https://github.com/YosefLab/scone

Michael Cole, Elizabeth Purdom, Nir Yosef, Sandrine Dudoit

SCONE performance metrics

BIO_SIL: Average silhouette width by biological condition.
BATCH_SIL: Average silhouette width by batch.
PAM_SIL: Maximum average silhouette width for PAM clusterings, for a range of user-supplied numbers of clusters.
EXP_QC_COR: Maximum squared Spearman correlation between count PCs and QC measures.
EXP_UV_COR: Maximum squared Spearman correlation between count PCs and factors of unwanted variation (preferably derived from a set of negative control genes other than the one used in RUV).
EXP_WV_COR: Maximum squared Spearman correlation between count PCs and factors of wanted variation (derived from positive control genes).
RLE_MED: Mean squared median relative log expression (RLE).
RLE_IQR: Mean inter-quartile range (IQR) of RLE.
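Two of the building blocks behind these metrics, the average silhouette width and the RLE summaries, can be sketched in plain Python. This is an illustration of the definitions above, not the scone code. It assumes Euclidean distance, at least two groups, and a type-7 ("inclusive") quantile rule for the IQR, and the function names are hypothetical.

```python
from math import dist
from statistics import mean, median, quantiles


def average_silhouette_width(points, labels):
    """Mean silhouette width of `points` grouped by `labels` (>= 2 groups)."""
    widths = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from p to every other point, collected per group.
        distances = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                distances.setdefault(other, []).append(dist(p, q))
        if label not in distances:  # singleton group: width set to 0
            widths.append(0.0)
            continue
        a = mean(distances[label])  # cohesion within own group
        b = min(mean(d) for g, d in distances.items() if g != label)
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(widths)


def rle_metrics(log_expr):
    """Return (RLE_MED, RLE_IQR) for a list of samples of log expression."""
    n_genes = len(log_expr[0])
    # Per-gene median across samples is the reference for relative expression.
    gene_median = [median(s[g] for s in log_expr) for g in range(n_genes)]
    squared_medians, iqrs = [], []
    for sample in log_expr:
        rle = [sample[g] - gene_median[g] for g in range(n_genes)]
        q1, _, q3 = quantiles(rle, n=4, method="inclusive")
        squared_medians.append(median(rle) ** 2)
        iqrs.append(q3 - q1)
    return mean(squared_medians), mean(iqrs)
```

Passing biological condition as `labels` gives a BIO_SIL-style score (higher is better), while passing batch gives a BATCH_SIL-style score (lower is better). Both RLE summaries are near zero when samples are comparable.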

Application to the BRAIN dataset

Apply and evaluate 99 combinations

Scaling methods: None, TMM, FQ.
UV factors: None, RUV (\(k=1, \ldots, 5\)), QC (\(k=1, \ldots, 5\)).
Adjust for batch: yes/no.

Select a normalization procedure

Top three methods by average score:

no scaling, RUV (k=4), batch corrected.
TMM scaling, RUV (k=4), batch corrected.
FQ scaling, QC (k=5), batch corrected.
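One simple way to turn several metrics into a single ranking is to rank the methods on each metric, after flipping the sign of metrics where lower is better, and then average the ranks. The sketch below assumes that aggregation. Whether it matches the scoring used for the list above is not stated in the slides, and the function name and input format are hypothetical.

```python
def rank_normalizations(metrics, signs):
    """Order methods by their average rank across metrics.

    metrics: {method: {metric_name: value}}
    signs:   {metric_name: +1 if higher is better, -1 if lower is better}
    Returns a list of (method, average_rank), best first.
    """
    methods = list(metrics)
    rank_sums = {m: 0.0 for m in methods}
    for name, sign in signs.items():
        # Rank 1 is the worst method and len(methods) the best on this metric.
        # Ties are broken by input order (a simplification).
        ordered = sorted(methods, key=lambda m: sign * metrics[m][name])
        for rank, m in enumerate(ordered, start=1):
            rank_sums[m] += rank
    scores = {m: rank_sums[m] / len(signs) for m in methods}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A usage example with made-up numbers for three placeholder methods:

```python
toy = {
    "A": {"BIO_SIL": 0.5, "BATCH_SIL": 0.1},
    "B": {"BIO_SIL": 0.3, "BATCH_SIL": 0.4},
    "C": {"BIO_SIL": 0.4, "BATCH_SIL": 0.2},
}
ranking = rank_normalizations(toy, {"BIO_SIL": 1, "BATCH_SIL": -1})
# ranking == [("A", 3.0), ("C", 2.0), ("B", 1.0)]
```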

Exploring scone results

Points color-coded by average score.

Scaling methods behave similarly

Batch effects must be taken into account…

Importance of preserving the biological difference in a nested design.

…but additional factors of UV are needed!

TPM normalization

Heatmap of the 100 most variable genes

RUV (k=4) + nested batch

Heatmap of the 100 most variable genes

Summary

Sample quality influences (single-cell) RNA-seq expression data.
Sample quality may differ between batches.
- Hence, need for appropriate experimental design and statistical model.
We propose a flexible linear model to account for known and unknown factors of unwanted variation.
scone helps explore different normalization schemes and rank them according to performance scores.
R package available at https://github.com/YosefLab/scone

Acknowledgements

References

RUVSeq R package: http://bioconductor.org/packages/RUVSeq/
RUVSeq paper: http://rdcu.be/jrCZ

scone R package: https://github.com/YosefLab/scone
scone tutorial: https://github.com/drisso/bioc2016singlecell

These slides: http://rpubs.com/daviderisso