Normalization and clustering of single-cell RNA-seq data

davide.risso@berkeley.edu

September 12, 2016

Outline

Motivation
Experimental design: batch effects and confounding
Normalization: accounting for sample quality
Clustering: resampling and sequential strategies
Lineage reconstruction with single-cell RNA-seq

slides available at: http://rpubs.com/daviderisso/compgen2016

RNA-seq

Wang et al. (2009) Nature Reviews Genetics 10, 57-63.

Single-cell RNA-seq

Owens (2012) Nature 491, 27–29.

The Fluidigm C1 system

www.fluidigm.com

Illuminating cellular diversity in the brain

The brain is made up of hundreds, if not thousands, of different cell types.
We need a rational way to identify and classify them.

Single-cell sequencing of S1 cortex

285 Layer 4 cells
657 Layer 5 cells
307 Layer 6 cells

Experimental design: batch effects and confounding

Experimental process

Low sequencing depth: 192 cells per Illumina lane (average 1.2M reads per cell)

Snapshot of the data

| | Olfactory | Cortical |
|---|---|---|
| Mice | 51 | 41 |
| C1 runs | 61 | 40 |
| Illumina lanes | 19 | 7 |
| Cells | 2,627 | 1,249 |
| Cells passing QC | 2,190 | 1,042 |
| Sequenced reads | 4,000 M | 1,500 M |

Ideal design: Factorial experiment

Each level of the factor of interest, say layer of origin, is observed in each batch.

Technical limitation: Complete confounding

We can only isolate one cell type per animal / batch.

See also on bioRxiv: Hicks et al. (2015) http://dx.doi.org/10.1101/025528

Partial solution: Replication

Need to have multiple batches per condition and account for batch effects in the design matrix (nested design).

Note that mouse and C1 run effects are still confounded!

See also on bioRxiv: Tung et al. (2016) http://dx.doi.org/10.1101/062919

Partial solution: Nested design

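To account for batch $j(i)$ nested within condition $i$, we can model the log-expression of each sample $k$ as

$y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk},$

where $\alpha_i$ is the effect of condition $i$ and $\beta_{j(i)}$ is the effect of batch $j$ within condition $i$, subject to the $n+1$ constraints

$\sum_{i=1}^n \alpha_i = 0; \quad \sum_{j=1}^{n_i} \beta_{j(i)} = 0 \quad \text{for each } i = 1, \ldots, n.$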
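
A minimal Python sketch of how the sum-to-zero constraints pin down the parameters of the nested model for a single gene. The function name `fit_nested` and the toy values are illustrative assumptions, not part of any package mentioned in these slides. Because the model gives each batch its own mean, the least-squares fit reproduces the batch means, and the constraints then split them into $\mu$, $\alpha_i$ and $\beta_{j(i)}$.

```python
from statistics import mean

def fit_nested(y):
    """y[i][j] is the list of log-expression values for batch j of condition i.

    Returns (mu, alpha, beta) with sum(alpha) == 0 and sum(beta[i]) == 0
    for every condition i (hypothetical helper, for illustration only).
    """
    batch_means = [[mean(cells) for cells in batches] for batches in y]
    # Mean of the batch means within each condition estimates mu + alpha_i.
    cond_means = [mean(bm) for bm in batch_means]
    mu = mean(cond_means)
    alpha = [c - mu for c in cond_means]
    beta = [[b - c for b in bm] for bm, c in zip(batch_means, cond_means)]
    return mu, alpha, beta

# Toy data: 2 conditions, 2 batches per condition, 2 cells per batch.
y = [[[1.0, 3.0], [5.0, 7.0]],
     [[9.0, 11.0], [13.0, 15.0]]]
mu, alpha, beta = fit_nested(y)
print(mu, alpha, beta)
```

With $n = 2$ conditions there are three constraints: one on the $\alpha_i$ and one on the batch effects within each condition.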

Sample quality influences gene expression

Absolute correlation between the principal components of log(TPM+1) and the QC scores.

Sample quality differs among batches

PC1 of the QC score matrix stratified by batch.

Normalization: accounting for sample quality

A general normalization framework

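We model the expected log-counts as $\log E[Y | W, X] = W \alpha + X \beta$, where $X$ holds the covariates of interest and $W$ the factors of unwanted variation.
RUV can be used to estimate $W \alpha$ using negative control genes.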

Gagnon-Bartsch & Speed (2012) [http://dx.doi.org/10.1093/biostatistics/kxr034]
Risso et al. (2014) [http://dx.doi.org/10.1038/nbt.2931]
RUVSeq package [http://bioconductor.org/packages/RUVSeq/]

RUV – negative control genes (RUVg)

Identify a subset of negative control genes, i.e., non-DE genes, for which $\beta_c = 0$. Then, $\log E[Y_c | W, X] = W \, \alpha_c.$
Perform the singular value decomposition (SVD) of $\log Y_c$, $\log Y_c = U \Lambda V^T.$ For a given $k$, estimate $W$ by $\widehat{W} = U \Lambda_k.$
Substitute $\widehat{W}$ into the model for the full set of $J$ genes and estimate both $\alpha$ and $\beta$ by GLM regression.
(Optionally) Define normalized read counts as the residuals from ordinary least squares (OLS) regression of $\log Y$ on $\widehat{W}$.
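
The steps above can be sketched in plain Python for the special case $k = 1$. This is a toy illustration under stated assumptions, not the RUVSeq implementation: the function names are hypothetical, the control-gene matrix is centered per gene before the decomposition, the first singular vector is found by power iteration rather than a full SVD, and the GLM step is skipped in favor of the optional OLS step.

```python
import math

def center_columns(m):
    """Subtract each column (gene) mean so every gene has mean zero across samples."""
    n = len(m)
    means = [sum(row[g] for row in m) / n for g in range(len(m[0]))]
    return [[row[g] - means[g] for g in range(len(row))] for row in m]

def first_factor(z, iters=50):
    """Return the first left singular vector of z scaled by its singular value
    (U Lambda_k with k = 1), using power iteration on z z^T.
    Assumes z is not all zeros and the start vector is not orthogonal to the answer."""
    n, c = len(z), len(z[0])
    u = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    for _ in range(iters):
        v = [sum(z[i][g] * u[i] for i in range(n)) for g in range(c)]   # z^T u
        u_new = [sum(z[i][g] * v[g] for g in range(c)) for i in range(n)]  # z v
        norm = math.sqrt(sum(x * x for x in u_new))
        u = [x / norm for x in u_new]
    v = [sum(z[i][g] * u[i] for i in range(n)) for g in range(c)]
    sigma = math.sqrt(sum(x * x for x in v))
    return [x * sigma for x in u]

def ruvg_k1(log_y, controls):
    """log_y[i][g]: log-count of gene g in sample i; controls: control-gene columns."""
    z_c = center_columns([[row[g] for g in controls] for row in log_y])
    w = first_factor(z_c)
    ww = sum(x * x for x in w)
    corrected = [row[:] for row in log_y]
    for g in range(len(log_y[0])):
        # OLS slope of gene g on w (w has mean zero, so no intercept is needed).
        a = sum(w[i] * log_y[i][g] for i in range(len(w))) / ww
        for i in range(len(w)):
            corrected[i][g] = log_y[i][g] - a * w[i]  # remove the fitted unwanted term
    return w, corrected

# Toy data: genes 0-2 are controls, gene 3 also carries a biological signal.
nuisance = [-1.5, -0.5, 0.5, 1.5]
biology = [1.0, -1.0, -1.0, 1.0]
log_y = [[5 + 1.0 * q, 6 + 2.0 * q, 7 + 0.5 * q, 4 + 1.0 * q + 2.0 * b]
         for q, b in zip(nuisance, biology)]
w_hat, corrected = ruvg_k1(log_y, controls=[0, 1, 2])
```

In this noise-free toy example the corrected control genes become flat across samples, while gene 3 keeps its biological signal because that signal is orthogonal to the nuisance factor.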

Risso et al. (2014) [http://dx.doi.org/10.1038/nbt.2931]
RUVSeq package [http://bioconductor.org/packages/RUVSeq/]

RUV captures sample quality

Should we scale the data before RUV?
QC or RUV? How many factors? Batch?

SCONE

Apply a (combination of) normalization method(s).
- Global scaling, e.g., DESeq, TMM.
- Full-quantile (FQ)
- Unknown factors of unwanted variation, e.g., RUV.
- Known factors of unwanted variation, e.g., regression on QC measures, C1 batch.
Rank the normalizations using a set of performance scores.

R package available at https://github.com/YosefLab/scone

Michael Cole, Nir Yosef, Sandrine Dudoit

SCONE performance metrics

BIO_SIL: Average silhouette width by biological condition.
BATCH_SIL: Average silhouette width by batch.
PAM_SIL: Maximum average silhouette width for PAM clusterings, for a range of user-supplied numbers of clusters.
EXP_QC_COR: Maximum squared Spearman correlation between count PCs and QC measures.
EXP_UV_COR: Maximum squared Spearman correlation between count PCs and factors of unwanted variation (preferably derived from a set of negative control genes other than the one used in RUV).
EXP_WV_COR: Maximum squared Spearman correlation between count PCs and factors of wanted variation (derived from positive control genes).
RLE_MED: Mean squared median relative log expression (RLE).
RLE_IQR: Mean inter-quartile range (IQR) of RLE.
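
As an illustration of the last two metrics, here is a small Python sketch that computes them exactly as worded in the list above. It is not the scone code: the function name `rle_metrics`, the gene-by-sample layout, and the choice of the "inclusive" quantile method are assumptions made for this example.

```python
from statistics import mean, median, quantiles

def rle_metrics(log_expr):
    """log_expr[g][s]: log-expression of gene g in sample s.

    Returns (RLE_MED, RLE_IQR) as described in the list above
    (hypothetical helper, for illustration only).
    """
    # RLE: deviation of each gene from its median across samples.
    rle = [[v - median(gene) for v in gene] for gene in log_expr]
    per_sample = list(zip(*rle))  # one tuple of RLE values per sample
    medians = [median(s) for s in per_sample]
    iqrs = []
    for s in per_sample:
        q1, _, q3 = quantiles(s, n=4, method="inclusive")
        iqrs.append(q3 - q1)
    rle_med = mean([m * m for m in medians])  # mean squared median RLE
    rle_iqr = mean(iqrs)                      # mean IQR of RLE
    return rle_med, rle_iqr

# Toy data: the third sample is shifted up by 1 for every gene.
log_expr = [[1.0, 1.0, 2.0],
            [2.0, 2.0, 3.0],
            [5.0, 5.0, 6.0]]
rle_med, rle_iqr = rle_metrics(log_expr)
print(rle_med, rle_iqr)
```

A sample-wide shift, as in the toy data, shows up in RLE_MED, while RLE_IQR stays at zero because the shift is the same for every gene.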

Exploring scone results

Points color-coded by average score.

TPM normalization

Heatmap of the 100 most variable genes

RUV (k=4) + nested batch

Heatmap of the 100 most variable genes

Clustering: resampling and sequential strategies

Clustering of single-cell RNA-seq data

In the literature, most approaches can be summarized by three steps.

Dimensionality reduction (e.g., PCA, t-SNE, most variable genes).
Compute a distance matrix between samples in the reduced space.
Clustering based on a partitioning method (e.g., PAM, k-means).

For each step there are many tuning parameters. E.g.,

How many principal components?
Which distance?
How many clusters?

Resampling-based Sequential Ensemble Clustering (RSEC)

Given a base cluster algorithm

Generate a single candidate clustering using
- resampling (to find robust clusters)
- sequential clustering (to find stable clusters)
Repeat the procedure for different algorithms and tuning parameters to generate a collection of candidate clusterings
Identify a consensus over the different candidates

Implemented in the R/Bioconductor package clusterExperiment: http://bioconductor.org/packages/clusterExperiment

Elizabeth Purdom

Subsampling

Given an underlying clustering strategy, e.g., k-means or PAM with a particular choice of k, we repeat the following:

Subsample the data, e.g. 70% of samples.
Find clusters on the subsample.
Create a co-clustering matrix D:
- $D_{ij}$ is the percentage of subsamples in which samples $i$ and $j$ were in the same cluster.
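
A minimal Python sketch of the subsampling loop, to make the co-clustering matrix concrete. It is not the clusterExperiment implementation. The toy base clusterer `two_means_1d`, the function `co_clustering`, and its parameters are hypothetical. One further assumption: each entry is computed as a fraction over the subsamples that contained both samples.

```python
import random
from itertools import combinations

def two_means_1d(values):
    """Toy base clusterer (assumption for illustration): 2-means on scalars,
    initialised at the minimum and maximum value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(20):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def co_clustering(data, cluster_fn, n_resamples=100, fraction=0.7, seed=0):
    """D[i][j]: fraction of subsamples containing both i and j in which
    they were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(data)
    size = max(2, round(fraction * n))
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)          # subsample without replacement
        labels = cluster_fn([data[i] for i in idx])
        for (a, la), (b, lb) in combinations(zip(idx, labels), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    D = [[1.0 if i == j else float("nan") for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        if sampled[i][j]:
            D[i][j] = D[j][i] = together[i][j] / sampled[i][j]
    return D

# Two well-separated groups of three samples each.
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
D = co_clustering(data, two_means_1d)
```

Entries of D near 1 mark pairs that cluster together regardless of which samples were drawn. A matrix like this can then be used as a similarity for a final clustering step.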

Sequential clustering

Our sequential clustering works as follows.

Range over k in PAM clustering using the subsampling strategy.
The cluster that remains stable across values of k is identified and removed.
Repeat until no more stable clusters are found.

Inspired by the "tight clustering" algorithm: Tseng and Wong (2005) http://dx.doi.org/10.1111/j.0006-341X.2005.031032.x

Clustering reveals L5 sub-populations

Differential expression

Find cluster gene expression signatures (marker genes).
Standard solutions:
- F-test for any difference between clusters
- all pairwise comparisons
Our solution:
- Create a hierarchy of clusters
- Select appropriate contrasts that compare sister nodes
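
To make the sister-node idea concrete, here is a hedged Python sketch that turns a cluster hierarchy into one contrast per internal node. The nested-tuple tree representation, the function names, and the equal weighting of clusters under each node are assumptions for illustration. This is not how clusterExperiment stores its hierarchy or builds its contrasts.

```python
def leaves(node):
    """Cluster labels under a node; a node is a label (str) or a (left, right) pair."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def sister_contrasts(tree, clusters):
    """One contrast per internal node: mean of the clusters under the left child
    minus mean of the clusters under the right child."""
    contrasts = []

    def visit(node):
        if isinstance(node, str):
            return
        left, right = node
        l, r = leaves(left), leaves(right)
        contrasts.append([
            1 / len(l) if c in l else -1 / len(r) if c in r else 0.0
            for c in clusters
        ])
        visit(left)
        visit(right)

    visit(tree)
    return contrasts

# Hierarchy in which c1 and c2 are sisters and c3 joins them at the root.
tree = (("c1", "c2"), "c3")
contrasts = sister_contrasts(tree, ["c1", "c2", "c3"])
print(contrasts)
```

A binary hierarchy of $m$ clusters has $m - 1$ internal nodes, so this yields $m - 1$ contrasts instead of the $m(m-1)/2$ needed for all pairwise comparisons.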

clusterExperiment R package

Functions to

Generate a collection of candidate cluster labels.
Find a consensus clustering.
Merge clusters.
Find cluster signatures.
Visualize clustering results.

Available at http://bioconductor.org/packages/clusterExperiment

clusterExperiment shiny app (coming soon!)

Liam Purvis, Elizabeth Purdom

Lineage reconstruction with single-cell RNA-seq

The olfactory epithelium (OE)

Problem description

GOAL: High-resolution view of transcriptional changes during differentiation and neurogenesis.

Questions:

Where does the neuronal lineage branch off?
Which genes are differentially expressed throughout this process?
Which genes are driving cell fate decisions?

Kelly Street

The experiment

Isolate cells from the OE system by FACS.
Synchronize cells temporally by conditional knockout of p63, an inhibitor of differentiation.
Quantify gene expression with single-cell RNA sequencing.

Russell Fletcher

The approach

Kelly Street

slingshot: R package for lineage reconstruction

Flexible, supervised branching lineage reconstruction.

Input: scRNA-seq data after normalization, clustering and dimensionality reduction.

R package available at: https://github.com/kstreet13/slingshot

Kelly Street, Elizabeth Purdom, Sandrine Dudoit

slingshot: Lineage identification

Kelly Street

slingshot: Curve fitting and pseudotime ordering

Kelly Street

Results on the OE system

Kelly Street, Russell Fletcher, Diya Das

Summary

Sample quality influences (single-cell) RNA-seq expression data.
We propose a flexible linear model to account for known and unknown factors of unwanted variation.
scone helps explore different normalization schemes and rank them according to performance scores.
Resampling-based sequential clustering strategies can help achieve stable and robust clusters.
clusterExperiment provides a framework for comparing and visualizing different clustering techniques.
slingshot gives flexible and robust estimates of branching differentiation lineages.

Acknowledgements

References

RUVSeq R package: http://bioconductor.org/packages/RUVSeq/
RUVSeq paper: http://rdcu.be/jrCZ

scone: https://github.com/YosefLab/scone
clusterExperiment: http://bioconductor.org/packages/clusterExperiment
slingshot: https://github.com/kstreet13/slingshot
tutorial: https://github.com/drisso/bioc2016singlecell

These slides: http://rpubs.com/daviderisso/compgen16