Normalization, clustering, and differential expression of single-cell RNA-seq data

August 19, 2016

Outline

Motivation
Experimental design: batch effects and confounding
Normalization: accounting for sample quality
Clustering: resampling and sequential strategies

slides available at: http://rpubs.com/daviderisso/iisa2016

Illuminating cellular diversity in the brain

The brain is made up of hundreds, if not thousands, of different cell types.
We need a rational way to identify and classify them.

Single-cell sequencing of S1 cortex

285 Layer 4 cells
657 Layer 5 cells
307 Layer 6 cells

Experimental design: batch effects and confounding

Experimental process

Low sequencing depth: 192 cells per Illumina lane (average 1.2M reads per cell)

Snapshot of the data

_	Olfactory	Cortical
Mice	51	41
C1 runs	61	40
Illumina Lanes	19	7
Cells	2,627	1,249
Cells pass QC	2,190	1,042
Sequenced reads	4,000 M	1,500 M

Ideal design: Factorial experiment

Each level of the factor of interest, say layer of origin, is observed in each batch.

Technical limitation: Complete confounding

We can only isolate one cell type per animal / batch.

See also on bioRxiv: Hicks et al. (2015) http://dx.doi.org/10.1101/025528

Partial solution: Replication

Need to have multiple batches per condition and account for batch effects in the design matrix (nested design).

Note that mouse and C1 run effects are still confounded!

See also on bioRxiv: Tung et al. (2016) http://dx.doi.org/10.1101/062919

Partial solution: Nested design

To account for batch \(j(i)\) in condition \(i\), we can model the log-expression of each sample \(k\) as

\[ y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}, \]

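subject to the \(n+1\) constraints (one across the \(n\) conditions, plus one per condition across its \(n_i\) batches)

\[ \sum_{i=1}^n \alpha_i = 0; \qquad \sum_{j=1}^{n_i} \beta_{j(i)} = 0 \quad \text{for each } i = 1, \ldots, n. \]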
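As a rough illustration of the nested model for a single gene, the sketch below recovers \(\mu\), \(\alpha_i\), and \(\beta_{j(i)}\) from batch means under the sum-to-zero constraints. The function name `fit_nested` and the toy data are hypothetical and are not part of any package mentioned in these slides. The sketch relies on the fact that the model is saturated at the batch level, so the fitted value for each batch is its sample mean.

```python
import numpy as np

def fit_nested(y, condition, batch):
    """Estimate mu, alpha_i, beta_j(i) for one gene (illustrative sketch).

    y         : log-expression values, one per sample
    condition : condition label of each sample
    batch     : batch label of each sample (batches are nested in conditions)
    """
    y = np.asarray(y, dtype=float)
    condition = np.asarray(condition)
    batch = np.asarray(batch)

    conds = sorted(set(condition.tolist()))
    batch_mean = {}   # fitted value of each (condition, batch) cell
    cond_level = {}   # mu + alpha_i for each condition
    for c in conds:
        batches = sorted(set(batch[condition == c].tolist()))
        for b in batches:
            batch_mean[(c, b)] = y[(condition == c) & (batch == b)].mean()
        # sum_j beta_j(i) = 0 implies mu + alpha_i is the unweighted
        # average of the batch means within condition i
        cond_level[c] = np.mean([batch_mean[(c, b)] for b in batches])

    # sum_i alpha_i = 0 implies mu is the unweighted average over conditions
    mu = float(np.mean([cond_level[c] for c in conds]))
    alpha = {c: float(cond_level[c] - mu) for c in conds}
    beta = {cb: float(m - cond_level[cb[0]]) for cb, m in batch_mean.items()}
    return mu, alpha, beta

# Toy example: two conditions, two batches each
mu, alpha, beta = fit_nested(
    y=[1, 3, 5, 10, 12, 14],
    condition=["A", "A", "A", "B", "B", "B"],
    batch=["a1", "a1", "a2", "b1", "b2", "b2"],
)
print(mu, alpha, beta)
```

Because every batch belongs to exactly one condition, the condition effect is only identifiable relative to the average of its batches, which is why replication (several batches per condition) is needed.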

Sample quality influences gene expression

Absolute correlation between the principal components of log(TPM+1) and the QC scores.

Sample quality differs among batches

PC1 of the QC score matrix stratified by batch.

Normalization: accounting for sample quality

A general normalization framework

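Model the log of the expected counts \(Y\) as the sum of unwanted variation \(W \alpha\) and variation of interest \(X \beta\): \[ \log E[Y | W, X] = W \alpha + X \beta. \]
RUV can be used to estimate \(W \alpha\) using negative control genes.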

Gagnon-Bartsch & Speed (2012) http://dx.doi.org/10.1093/biostatistics/kxr034
Risso et al. (2014) http://dx.doi.org/10.1038/nbt.2931
RUVSeq package http://bioconductor.org/packages/RUVSeq/

RUV – negative control genes (RUVg)

Identify a subset of negative control genes, i.e., non-DE genes, for which \(\beta_c = 0\). Then, \[ \log E[Y_c | W, X] = W \, \alpha_c. \]
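Perform the singular value decomposition (SVD) of \(\log Y_c\), \[ \log Y_c = U \Lambda V^T. \] For a given \(k\), estimate \(W\) by \[ \widehat{W} = U \Lambda_k, \] where \(\Lambda_k\) retains the \(k\) largest singular values of \(\Lambda\) and sets the others to zero.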
Substitute \(\widehat{W}\) into the model for the full set of \(J\) genes and estimate both \(\alpha\) and \(\beta\) by GLM regression.
(Optionally) Define normalized read counts as the residuals from ordinary least squares (OLS) regression of \(\log Y\) on \(\widehat{W}\).
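The steps above can be sketched with NumPy as follows. This is a minimal illustration, not the RUVSeq implementation: the function name `ruvg_sketch`, the pseudocount of 1, and the per-gene centering before the SVD are assumptions made here. It covers the estimation of \(\widehat{W}\) and the optional OLS residual step only, and skips the GLM fit.

```python
import numpy as np

def ruvg_sketch(counts, control_idx, k):
    """Illustrative RUVg-style normalization.

    counts      : samples x genes array of non-negative counts
    control_idx : column indices of the negative control genes
    k           : number of factors of unwanted variation to estimate
    """
    log_y = np.log(np.asarray(counts) + 1.0)   # assumption: pseudocount of 1
    gene_means = log_y.mean(axis=0)

    # SVD of the (centered) log counts of the negative controls
    log_yc = log_y[:, control_idx] - gene_means[control_idx]
    u, s, _ = np.linalg.svd(log_yc, full_matrices=False)

    # W_hat = U * Lambda_k, keeping the columns of the k largest singular values
    w_hat = u[:, :k] * s[:k]

    # OLS regression of the centered log counts on W_hat, for all genes
    alpha_hat, *_ = np.linalg.lstsq(w_hat, log_y - gene_means, rcond=None)

    # Normalized log expression: remove the estimated unwanted variation
    normalized = log_y - w_hat @ alpha_hat
    return w_hat, normalized

# Simulated example: 12 samples, 40 genes, first 10 genes used as controls
rng = np.random.default_rng(1)
w_true = rng.normal(size=(12, 2))
log_mu = 3.0 + w_true @ rng.normal(scale=0.5, size=(2, 40))
counts = rng.poisson(np.exp(log_mu))
w_hat, normalized = ruvg_sketch(counts, np.arange(10), k=2)
print(w_hat.shape, normalized.shape)
```

Since the control genes are centered before the SVD, the columns of the estimated \(\widehat{W}\) sum to zero, so the regression leaves each gene's mean log expression unchanged.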

Risso et al. (2014) http://dx.doi.org/10.1038/nbt.2931
RUVSeq package http://bioconductor.org/packages/RUVSeq/

RUV captures sample quality

Should we scale the data before RUV?
QC or RUV? How many factors? Batch?

SCONE

Apply a (combination of) normalization method(s).
- Global scaling, e.g., DESeq, TMM.
- Full-quantile (FQ)
- Unknown factors of unwanted variation, e.g., RUV.
- Known factors of unwanted variation, e.g., regression on QC measures, C1 batch.
Rank the normalizations using a set of performance scores.

R package available at https://github.com/YosefLab/scone

Michael Cole, Nir Yosef, Sandrine Dudoit

SCONE performance metrics

BIO_SIL: Average silhouette width by biological condition.
BATCH_SIL: Average silhouette width by batch.
PAM_SIL: Maximum average silhouette width for PAM clusterings, for a range of user-supplied numbers of clusters.
EXP_QC_COR: Maximum squared Spearman correlation between count PCs and QC measures.
EXP_UV_COR: Maximum squared Spearman correlation between count PCs and factors of unwanted variation (preferably derived from a set of negative control genes other than the one used in RUV).
EXP_WV_COR: Maximum squared Spearman correlation between count PCs and factors of wanted variation (derived from positive control genes).
RLE_MED: Mean squared median relative log expression (RLE).
RLE_IQR: Mean inter-quartile range (IQR) of RLE.
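To make the two RLE metrics concrete, here is a small sketch of how they could be computed from a samples-by-genes matrix of log expression. It follows the definitions listed above; the function name `rle_metrics` is hypothetical and this is not the scone code.

```python
import numpy as np

def rle_metrics(log_expr):
    """Return (RLE_MED, RLE_IQR) for a samples x genes log-expression matrix."""
    log_expr = np.asarray(log_expr, dtype=float)

    # Relative log expression: deviation of each gene from its median across samples
    rle = log_expr - np.median(log_expr, axis=0)

    # Per-sample summaries of the RLE distribution across genes
    sample_median = np.median(rle, axis=1)
    q75, q25 = np.percentile(rle, [75, 25], axis=1)

    rle_med = float(np.mean(sample_median ** 2))   # mean squared median RLE
    rle_iqr = float(np.mean(q75 - q25))            # mean IQR of RLE
    return rle_med, rle_iqr

# Toy example: the third sample is shifted upward by one log unit
base = np.array([1.0, 2.0, 3.0, 4.0])
print(rle_metrics(np.vstack([base, base, base + 1.0])))
```

For well-normalized data, the per-sample RLE distributions should be centered near zero and similarly spread, so smaller values of both metrics are better.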

Exploring scone results

Points color-coded by average score.

TPM normalization

Heatmap of the 100 most variable genes

RUV (k=4) + nested batch

Heatmap of the 100 most variable genes

Clustering: resampling and sequential strategies

Clustering of single-cell RNA-seq data

In the literature, most approaches can be summarized by three steps.

Dimensionality reduction (e.g., PCA, t-SNE, most variable genes).
Compute a distance matrix between samples in the reduced space.
Clustering based on a partitioning method (e.g., PAM, k-means).

For each step there are many tuning parameters. E.g.,

How many principal components?
Which distance?
How many clusters?
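The three steps and their tuning parameters can be sketched as follows. This is an illustration under simplifying assumptions: PCA is done by SVD, the distance is Euclidean, and the partitioning step is a simplified k-medoids (alternating assignment and medoid update) rather than the full PAM swap procedure. All function names are hypothetical.

```python
import numpy as np

def pca_scores(x, n_pcs):
    """Step 1: project samples (rows of x) onto the first n_pcs principal components."""
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]

def euclidean_distances(z):
    """Step 2: pairwise Euclidean distances between rows of z."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def k_medoids(dist, k, n_iter=50):
    """Step 3: simplified k-medoids on a distance matrix (not the full PAM)."""
    # Deterministic start: most central sample, then farthest-point additions
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < k:
        medoids.append(int(dist[:, medoids].min(axis=1).argmax()))
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = []
        for j in range(k):
            members = np.flatnonzero(labels == j)
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[within.argmin()]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return dist[:, medoids].argmin(axis=1)

# Toy data: 20 cells x 50 genes, two well-separated groups of 10 cells
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 50))
x[10:] += 5.0
labels = k_medoids(euclidean_distances(pca_scores(x, n_pcs=2)), k=2)
print(labels)
```

Each of `n_pcs`, the distance function, and `k` is a tuning parameter, and changing any one of them can change the resulting partition.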

Resampling-based Sequential Ensemble Clustering (RSEC)

Given a base clustering algorithm:

Generate a single candidate clustering using
- resampling (to find robust clusters)
- sequential clustering (to find stable clusters)
Repeat the procedure for different algorithms and tuning parameters to generate a collection of candidate clusterings
Identify a consensus over the different candidates

Implemented in the R/Bioconductor package clusterExperiment: http://bioconductor.org/packages/clusterExperiment

Elizabeth Purdom

Subsampling

Given an underlying clustering strategy, e.g., k-means or PAM with a particular choice of k, we repeat the following:

Subsample the data, e.g. 70% of samples.
Find clusters on the subsample.
Create a co-clustering matrix D:
- for each pair of samples, the % of subsamples in which the two samples were in the same cluster.
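A minimal sketch of the subsampling procedure is given below. It is not the clusterExperiment implementation: the base clusterer is a small k-means written here for self-containment, and D is expressed as a proportion computed only over the subsamples in which both samples of a pair were drawn, which is an assumption of this sketch. Function names are hypothetical.

```python
import numpy as np

def simple_kmeans(x, k, n_iter=25):
    """Small k-means with farthest-point initialization (illustrative only)."""
    centers = [x[0]]
    while len(centers) < k:
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def co_clustering(x, k, n_subsamples=50, fraction=0.7, seed=0):
    """Proportion of subsamples in which each pair of samples co-clustered."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    together = np.zeros((n, n))   # times a pair fell in the same cluster
    sampled = np.zeros((n, n))    # times a pair was drawn together
    size = int(round(fraction * n))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=size, replace=False)
        labels = simple_kmeans(x[idx], k)
        same = labels[:, None] == labels[None, :]
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    # Pairs never drawn together are left undefined (NaN)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(sampled > 0, together / sampled, np.nan)

# Toy data: two well-separated groups of 10 samples in 2 dimensions
rng = np.random.default_rng(42)
x = rng.normal(size=(20, 2))
x[10:] += 10.0
D = co_clustering(x, k=2)
print(np.round(D[:3, :3], 2))
```

Entries of D close to 1 indicate pairs that cluster together regardless of which samples are drawn; a final clustering can then be obtained by clustering D itself.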

Sequential clustering

Our sequential clustering works as follows.

Range over k in PAM clustering using the subsampling strategy.
The cluster that remains stable across values of k is identified and removed.
Repeat until no more stable clusters are found.

Inspired by the "tight clustering" algorithm
Tseng and Wong (2005) http://dx.doi.org/10.1111/j.0006-341X.2005.031032.x

Clustering reveals L5 sub-populations

Differential expression

Find cluster gene expression signatures (marker genes).
Standard solutions:
- F-test for any difference between clusters
- all pairwise comparisons
Our solution:
- Create a hierarchy of clusters
- Select appropriate contrasts that compare sister nodes
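As an illustration of contrasts between sister nodes, the sketch below walks a binary cluster hierarchy and, at each internal node, builds a contrast comparing the average of the clusters under one child with the average of the clusters under the other. The tree encoding (nested tuples of cluster names), the equal weighting of clusters within a node, and the function names are assumptions made for this example and do not reflect the clusterExperiment API.

```python
def leaves(node):
    """Return the cluster names below a node of the hierarchy."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def sister_contrasts(tree, clusters):
    """One contrast (dict of cluster -> coefficient) per internal node."""
    contrasts = []

    def visit(node):
        if isinstance(node, str):
            return  # a leaf has no sister comparison below it
        left, right = node
        left_leaves, right_leaves = leaves(left), leaves(right)
        coef = {c: 0.0 for c in clusters}
        for c in left_leaves:
            coef[c] = 1.0 / len(left_leaves)
        for c in right_leaves:
            coef[c] = -1.0 / len(right_leaves)
        contrasts.append(coef)
        visit(left)
        visit(right)

    visit(tree)
    return contrasts

# Toy hierarchy: clusters A and B are sisters, and their parent is sister to C
tree = (("A", "B"), "C")
for contrast in sister_contrasts(tree, ["A", "B", "C"]):
    print(contrast)
```

A binary hierarchy over K clusters yields K - 1 such contrasts, compared with K(K - 1)/2 for all pairwise comparisons.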

clusterExperiment R package

Functions to

Generate a collection of candidate cluster labels.
Find a consensus clustering.
Merge clusters.
Find cluster signatures.
Visualize clustering results.

Available at http://bioconductor.org/packages/clusterExperiment

clusterExperiment shiny app (coming soon!)

Liam Purvis and Elizabeth Purdom

Summary

Sample quality influences (single-cell) RNA-seq expression data.
We propose a flexible linear model to account for known and unknown factors of unwanted variation.
scone helps explore different normalization schemes and rank them according to performance scores.
Resampling-based sequential clustering strategies can help achieve stable and robust clusters.
clusterExperiment provides a framework for comparing and visualizing different clustering techniques.

Acknowledgements

References

RUVSeq R package: http://bioconductor.org/packages/RUVSeq/
RUVSeq paper: http://rdcu.be/jrCZ

scone: https://github.com/YosefLab/scone
clusterExperiment: http://bioconductor.org/packages/clusterExperiment
tutorial: https://github.com/drisso/bioc2016singlecell

These slides: http://rpubs.com/daviderisso/iisa2016