Normalization and clustering of single-cell RNA-seq data

davide.risso@berkeley.edu

September 12, 2016

Outline

  • Motivation
  • Experimental design: batch effects and confounding
  • Normalization: accounting for sample quality
  • Clustering: resampling and sequential strategies
  • Lineage reconstruction with single-cell RNA-seq


slides available at: http://rpubs.com/daviderisso/compgen2016

RNA-seq

Wang et al. (2009) Nature Reviews Genetics 10, 57-63.

Single-cell RNA-seq

Owens (2012) Nature 491, 27–29.

The Fluidigm C1 system


www.fluidigm.com

Illuminating cellular diversity in the brain

The brain is made up of hundreds, if not thousands, of different cell types.
We need a rational way to identify and classify them.

Single-cell sequencing of S1 cortex

  • 285 Layer 4 cells
  • 657 Layer 5 cells
  • 307 Layer 6 cells

Experimental design: batch effects and confounding

Experimental process

Low sequencing depth: 192 cells per Illumina lane (average 1.2M reads per cell)

Snapshot of the data

                    Olfactory    Cortical
Mice                       51          41
C1 runs                    61          40
Illumina Lanes             19           7
Cells                   2,627       1,249
Cells pass QC           2,190       1,042
Sequenced reads       4,000 M     1,500 M

Ideal design: Factorial experiment

Each level of the factor of interest, say layer of origin, is observed in each batch.

Technical limitation: Complete confounding

Partial solution: Replication

Need to have multiple batches per condition and account for batch effects in the design matrix (nested design).

Note that mouse and C1 run effects are still confounded!
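As a rough illustration of the three designs discussed here, the sketch below classifies a list of (condition, batch) pairs as factorial, completely confounded, or nested. The function name `design_type` and the labels it returns are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def design_type(samples):
    """Classify (condition, batch) pairs by how condition and batch are crossed."""
    batches_by_condition = defaultdict(set)
    conditions_by_batch = defaultdict(set)
    for condition, batch in samples:
        batches_by_condition[condition].add(batch)
        conditions_by_batch[batch].add(condition)

    n_conditions = len(batches_by_condition)
    # Factorial: every condition is observed in every batch.
    if all(len(c) == n_conditions for c in conditions_by_batch.values()):
        return "factorial"
    # Each batch contains a single condition.
    if all(len(c) == 1 for c in conditions_by_batch.values()):
        # One batch per condition: batch and condition cannot be separated.
        if all(len(b) == 1 for b in batches_by_condition.values()):
            return "completely confounded"
        # At least one condition is replicated over several batches.
        return "nested"
    return "partially crossed"
```

For example, `design_type([("L4", "b1"), ("L5", "b2")])` returns `"completely confounded"`, while two batches per layer gives `"nested"`.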


See also in bioRxiv:
Tung et al. (2016) http://dx.doi.org/10.1101/062919

Partial solution: Nested design

To account for batch j(i) in condition i, we can model the log-expression of each sample k as

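$$y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk},$$

subject to the $n+1$ constraints

$$\sum_{i=1}^{n} \alpha_i = 0; \qquad \sum_{j=1}^{n_i} \beta_{j(i)} = 0, \quad i = 1, \ldots, n.$$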
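A minimal sketch of how the nested model could be fitted for a single gene, assuming least squares and the sum-to-zero constraints above. Because the model has one free mean per batch, the fitted values are the batch means, and the constraints then determine the overall mean, the condition effects, and the batch-within-condition effects. The function name `fit_nested` and the input layout are hypothetical.

```python
from statistics import mean

def fit_nested(data):
    """data: {condition: {batch: [log-expression values]}} -> (mu, alpha, beta)."""
    # Least-squares fitted value for every sample in a batch is the batch mean.
    batch_means = {i: {j: mean(v) for j, v in batches.items()}
                   for i, batches in data.items()}
    # Unweighted average over batches, so that the beta's sum to zero per condition.
    condition_means = {i: mean(list(m.values())) for i, m in batch_means.items()}
    # Unweighted average over conditions, so that the alpha's sum to zero.
    mu = mean(list(condition_means.values()))
    alpha = {i: condition_means[i] - mu for i in data}
    beta = {(i, j): batch_means[i][j] - condition_means[i]
            for i in data for j in data[i]}
    return mu, alpha, beta
```

With two layers and two batches each, e.g. `{"L4": {"b1": [1.0, 3.0], "b2": [4.0, 6.0]}, "L5": {"b3": [7.0], "b4": [9.0, 11.0]}}`, the batch means are 2, 5, 7, 10, so the sketch gives an overall mean of 6, layer effects of -2.5 and 2.5, and batch effects of -1.5 and 1.5 within each layer.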

Sample quality influences gene expression

Absolute correlation between PCA of log(TPM+1) and QC scores.

Sample quality differs among batches

PC1 of the QC score matrix stratified by batch.

Normalization: accounting for sample quality

A general normalization framework

RUV – negative control genes (RUVg)

  1. Identify a subset of negative control genes, i.e., non-DE genes, for which $\beta_c = 0$. Then, $\log E[Y_c \mid W, X] = W \alpha_c$.
  2. Perform the singular value decomposition (SVD) of $\log Y_c$, $\log Y_c = U \Lambda V^T$. For a given $k$, estimate $W$ by $\hat{W} = U \Lambda_k$.
  3. Substitute $\hat{W}$ into the model for the full set of $J$ genes and estimate both $\alpha$ and $\beta$ by GLM regression.
  4. (Optionally) Define normalized read counts as the residuals from ordinary least squares (OLS) regression of $\log Y$ on $\hat{W}$.
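The following is a simplified sketch of steps 1, 2 and 4 for a single factor (k = 1), not the RUVSeq implementation. It assumes log counts are already available as a samples x genes matrix, centers each gene before the decomposition (an assumption of this sketch), replaces the full SVD with power iteration for the leading singular vector, and skips the GLM fit of step 3. The function name `ruvg_k1` is hypothetical.

```python
import math

def ruvg_k1(log_y, controls, iters=100):
    """log_y: samples x genes matrix of log counts; controls: control gene indices.

    Returns (w_hat, normalized) for one factor of unwanted variation (k = 1).
    Assumes the control genes are not all constant across samples.
    """
    n, n_genes = len(log_y), len(log_y[0])
    gene_mean = [sum(row[g] for row in log_y) / n for g in range(n_genes)]
    # Center each gene so the factor captures between-sample variation only.
    z = [[row[g] - gene_mean[g] for g in range(n_genes)] for row in log_y]
    zc = [[row[g] for g in controls] for row in z]
    n_ctl = len(controls)

    # Step 2: leading left singular vector of the control matrix (power iteration).
    u = [float(i + 1) for i in range(n)]  # arbitrary start vector
    for _ in range(iters):
        v = [sum(zc[i][c] * u[i] for i in range(n)) for c in range(n_ctl)]
        u = [sum(zc[i][c] * v[c] for c in range(n_ctl)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    v = [sum(zc[i][c] * u[i] for i in range(n)) for c in range(n_ctl)]
    sigma = math.sqrt(sum(x * x for x in v))  # leading singular value
    w_hat = [sigma * x for x in u]

    # Step 4: OLS of each centered gene on w_hat; subtract the fitted part.
    ww = sum(w * w for w in w_hat)
    normalized = [row[:] for row in log_y]
    for g in range(n_genes):
        alpha_g = sum(w_hat[i] * z[i][g] for i in range(n)) / ww
        for i in range(n):
            normalized[i][g] = log_y[i][g] - w_hat[i] * alpha_g
    return w_hat, normalized
```

The sign of the estimated factor is arbitrary, as with any singular vector; the normalized values do not depend on it.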

Risso et al. (2014) [http://dx.doi.org/10.1038/nbt.2931]
RUVSeq package [http://bioconductor.org/packages/RUVSeq/]

RUV captures sample quality

Should we scale the data before RUV?
QC or RUV? How many factors? Batch?

SCONE

  1. Apply a (combination of) normalization method(s).

    • Global scaling, e.g., DESeq, TMM.
    • Full-quantile (FQ).
    • Unknown factors of unwanted variation, e.g., RUV.
    • Known factors of unwanted variation, e.g., regression on QC measures, C1 batch.
  2. Rank the normalizations using a set of performance scores.

R package available at https://github.com/YosefLab/scone



Michael Cole, Nir Yosef, Sandrine Dudoit

SCONE performance metrics

  • BIO_SIL: Average silhouette width by biological condition.
  • BATCH_SIL: Average silhouette width by batch.
  • PAM_SIL: Maximum average silhouette width for PAM clusterings, for a range of user-supplied numbers of clusters.
  • EXP_QC_COR: Maximum squared Spearman correlation between count PCs and QC measures.
  • EXP_UV_COR: Maximum squared Spearman correlation between count PCs and factors of unwanted variation (preferably derived from a set of negative control genes different from the one used in RUV).
  • EXP_WV_COR: Maximum squared Spearman correlation between count PCs and factors of wanted variation (derived from positive control genes).
  • RLE_MED: Mean squared median relative log expression (RLE).
  • RLE_IQR: Mean inter-quartile range (IQR) of RLE.
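To make the two RLE scores concrete, here is a small sketch that follows the definitions in the list above; it is not the scone implementation. It assumes a genes x samples matrix of log expression with at least two genes, and the quartile convention is the default of Python's `statistics.quantiles`, which is an arbitrary choice. The function name `rle_scores` is hypothetical.

```python
from statistics import median, quantiles

def rle_scores(log_expr):
    """log_expr: genes x samples matrix of log expression -> (RLE_MED, RLE_IQR)."""
    n_samples = len(log_expr[0])
    # RLE: each gene's log expression minus its median across samples.
    rle = []
    for gene in log_expr:
        m = median(gene)
        rle.append([x - m for x in gene])

    medians, iqrs = [], []
    for s in range(n_samples):
        column = [row[s] for row in rle]  # RLE values of one sample
        q1, _, q3 = quantiles(column, n=4)
        medians.append(median(column))
        iqrs.append(q3 - q1)

    rle_med = sum(m * m for m in medians) / n_samples  # mean squared median
    rle_iqr = sum(iqrs) / n_samples                    # mean IQR
    return rle_med, rle_iqr
```

Both scores are zero when all samples have identical expression profiles. A sample whose genes are all shifted by the same amount raises RLE_MED while leaving RLE_IQR unchanged.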

Exploring scone results

Points color-coded by average score.

TPM normalization

Heatmap of the 100 most variable genes

RUV (k=4) + nested batch

Heatmap of the 100 most variable genes

Clustering: resampling and sequential strategies

Clustering of single-cell RNA-seq data

In the literature, most approaches can be summarized by three steps.

  1. Dimensionality reduction (e.g., PCA, t-SNE, most variable genes).
  2. Compute a distance matrix between samples in the reduced space.
  3. Clustering based on a partitioning method (e.g., PAM, k-means).


For each step there are many tuning parameters. E.g.,

  • How many principal components?
  • Which distance?
  • How many clusters?

Resampling-based Sequential Ensemble Clustering (RSEC)

Given a base clustering algorithm:

  • Generate a single candidate clustering using
    • resampling (to find robust clusters)
    • sequential clustering (to find stable clusters)
  • Repeat the procedure for different algorithms and tuning parameters to generate a collection of candidate clusterings
  • Identify a consensus over the different candidates

Implemented in the R/Bioconductor package clusterExperiment: http://bioconductor.org/packages/clusterExperiment


Elizabeth Purdom

Subsampling

Given an underlying clustering strategy, e.g., k-means or PAM with a particular choice of k, we repeat the following:

  • Subsample the data, e.g., 70% of samples.
  • Find clusters on the subsample.
  • Create a co-clustering matrix D:
    • entry (i, j) is the % of subsamples in which samples i and j were in the same cluster.
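The subsampling loop can be sketched as follows. This is an illustration under stated assumptions, not the clusterExperiment code: the base clusterer `largest_gap_split` is a hypothetical stand-in for k-means or PAM with k = 2 on one-dimensional data, and each entry of D is computed as a fraction of the subsamples that contained both samples, which is one possible reading of the definition above.

```python
import random

def largest_gap_split(values):
    """Toy base clusterer: label 1-D values 0/1 by cutting at the largest gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    cut = max(range(len(order) - 1),
              key=lambda p: values[order[p + 1]] - values[order[p]])
    labels = [0] * len(values)
    for p in range(cut + 1, len(order)):
        labels[order[p]] = 1
    return labels

def co_clustering(values, cluster_fn, n_subsamples=100, fraction=0.7, seed=0):
    """Fraction of subsamples in which each pair of samples shared a cluster."""
    n = len(values)
    size = max(2, round(fraction * n))
    rng = random.Random(seed)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were both drawn
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), size)
        labels = cluster_fn([values[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    # Pairs never drawn together get 0.0 by convention in this sketch.
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]
```

Entries of D close to 1 indicate pairs that cluster together regardless of which samples were drawn; 1 - D can then be used as a distance for a final clustering step.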

Sequential clustering


Our sequential clustering works as follows.

  • Range over k in PAM clustering using the subsampling strategy.
  • The cluster that remains stable across values of k is identified and removed.
  • Repeat until no more stable clusters are found.
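A deliberately simplified sketch of the remove-and-repeat idea above. Several choices here are assumptions made for illustration only: PAM with subsampling is replaced by a toy one-dimensional clusterer (`gap_clusters`, hypothetical), only two consecutive values of k are compared, and a cluster counts as stable when its Jaccard index with some cluster at the next k exceeds a threshold.

```python
def gap_clusters(values, ids, k):
    """Toy base clusterer: cut the sorted 1-D values of `ids` at the k - 1 largest gaps."""
    order = sorted(ids, key=lambda i: values[i])

    def gap(p):
        return values[order[p + 1]] - values[order[p]]

    cuts = sorted(sorted(range(len(order) - 1), key=gap, reverse=True)[:k - 1])
    clusters, start = [], 0
    for p in cuts:
        clusters.append(frozenset(order[start:p + 1]))
        start = p + 1
    clusters.append(frozenset(order[start:]))
    return clusters

def jaccard(a, b):
    return len(a & b) / len(a | b)

def sequential_clusters(values, k0=2, min_jaccard=0.9, min_size=2):
    """Repeatedly remove the cluster that is most stable between k0 and k0 + 1."""
    remaining = list(range(len(values)))
    found = []
    while len(remaining) > k0:
        at_k = gap_clusters(values, remaining, k0)
        at_next_k = gap_clusters(values, remaining, k0 + 1)
        # Stability of a cluster: best overlap with any cluster at the next k.
        candidates = [(max(jaccard(c, d) for d in at_next_k), c)
                      for c in at_k if len(c) >= min_size]
        if not candidates:
            break
        score, best = max(candidates, key=lambda t: t[0])
        if score < min_jaccard:
            break  # no stable cluster left
        found.append(best)
        remaining = [i for i in remaining if i not in best]
    return found, remaining
```

On the values `[0.0, 0.1, 0.2, 5.0, 5.1, 5.3, 9.0, 9.1]` the sketch first removes samples 0 to 2, then samples 6 and 7, and leaves samples 3 to 5 unassigned because that group splits differently at the next value of k. Leaving some samples unclustered is a feature of this kind of strategy.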




Inspired by the "tight clustering" algorithm
Tseng and Wong (2005) http://dx.doi.org/10.1111/j.0006-341X.2005.031032.x

Clustering reveals L5 sub-populations

Differential expression

  • Find cluster gene expression signatures (marker genes).
  • Standard solutions:
    • F-test for any difference between clusters
    • all pairwise comparisons
  • Our solution:
    • Create a hierarchy of clusters
    • Select appropriate contrasts that compare sister nodes
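As an illustration of contrasts between sister nodes, the sketch below walks a binary cluster hierarchy and emits one contrast per internal node: the average of the clusters in the left subtree minus the average of the clusters in the right subtree. The tree encoding (nested pairs of cluster labels), the equal weighting of clusters, and the function names are assumptions of this example, not the clusterExperiment interface.

```python
def leaves(node):
    """Cluster labels under a node; a node is a label or a (left, right) pair."""
    if isinstance(node, tuple):
        return leaves(node[0]) + leaves(node[1])
    return [node]

def sister_contrasts(tree):
    """One contrast (dict of cluster -> coefficient) per internal node of the tree."""
    if not isinstance(tree, tuple):
        return []  # a single cluster has no sister comparison below it
    left, right = leaves(tree[0]), leaves(tree[1])
    contrast = {c: 1 / len(left) for c in left}
    contrast.update({c: -1 / len(right) for c in right})
    return [contrast] + sister_contrasts(tree[0]) + sister_contrasts(tree[1])
```

For the hierarchy `(("c1", "c2"), "c3")` this yields two contrasts, (c1 + c2)/2 - c3 and c1 - c2, compared with three pairwise comparisons. In general a binary hierarchy over m clusters gives m - 1 contrasts instead of m(m - 1)/2.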

clusterExperiment R package

clusterExperiment shiny app (coming soon!)

Liam Purvis, Elizabeth Purdom

Lineage reconstruction with single-cell RNA-seq

The olfactory epithelium (OE)

Problem description

GOAL: High-resolution view of transcriptional changes during differentiation and neurogenesis.

Questions:

  • Where does the neuronal lineage branch off?
  • Which genes are differentially expressed throughout this process?
  • Which genes are driving cell fate decisions?




Kelly Street

The experiment

  • Isolate cells from the OE system by FACS.
  • Synchronize cells temporally by conditional knockout of p63, an inhibitor of differentiation.
  • Quantify gene expression with single-cell RNA sequencing.


Russell Fletcher

The approach


Kelly Street

slingshot: R package for lineage reconstruction

Flexible, supervised branching lineage reconstruction.


Input: scRNA-seq data after normalization, clustering and dimensionality reduction.


R package available at: https://github.com/kstreet13/slingshot


Kelly Street, Elizabeth Purdom, Sandrine Dudoit

slingshot: Lineage identification

Kelly Street

slingshot: Curve fitting and pseudotime ordering

Kelly Street

Results on the OE system

Kelly Street, Russell Fletcher, Diya Das

Summary

  • Sample quality influences (single-cell) RNA-seq expression data.
  • We propose a flexible linear model to account for known and unknown factors of unwanted variation.
  • scone helps explore different normalization schemes and rank them according to performance scores.
  • Resampling-based sequential clustering strategies can help achieve stable and robust clusters.
  • clusterExperiment provides a framework for comparing and visualizing different clustering techniques.
  • slingshot gives flexible and robust estimates of branching differentiation lineages.

Acknowledgements

References