- Motivation
- Experimental design: batch effects and confounding
- Normalization: accounting for sample quality
- Clustering: resampling and sequential strategies
August 19, 2016
The brain is made up of hundreds, if not thousands, of different cell types.
We need a rational way to identify and classify them.
Low sequencing depth: 192 cells per Illumina lane (average 1.2M reads per cell)
| | Olfactory | Cortical |
|---|---|---|
| Mice | 51 | 41 |
| C1 runs | 61 | 40 |
| Illumina Lanes | 19 | 7 |
| Cells | 2,627 | 1,249 |
| Cells pass QC | 2,190 | 1,042 |
| Sequenced reads | 4,000 M | 1,500 M |
Each level of the factor of interest, say layer of origin, is observed in each batch.
We can only isolate one cell type per animal / batch.
See also in bioRxiv:
Hicks et al. (2015) http://dx.doi.org/10.1101/025528
Need to have multiple batches per condition and account for batch effects in the design matrix (nested design).
Note that mouse and C1 run effects are still confounded!
See also in bioRxiv:
Tung et al. (2016) http://dx.doi.org/10.1101/062919
To account for batch \(j(i)\) in condition \(i\), we can model the log-expression of each sample \(k\) as
\[ y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}, \]
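subject to the \(n+1\) constraints (one on the \(n\) condition effects, plus one on the \(n_i\) batch effects within each condition \(i\))
\[ \sum_{i=1}^n \alpha_i = 0; \quad \sum_{j=1}^{n_i} \beta_{j(i)} = 0, \quad i = 1, \ldots, n. \]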
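As a minimal sketch of the nested model above for a single gene: under the sum-to-zero constraints, the least-squares fit of each batch is its sample mean, so \(\mu\), \(\alpha_i\), and \(\beta_{j(i)}\) can be recovered by successively averaging batch means. The function name `fit_nested`, the condition and batch labels, and the toy values are illustrative assumptions, not part of the original analysis.

```python
from collections import defaultdict
from statistics import mean


def fit_nested(samples):
    """Fit y_ijk = mu + alpha_i + beta_j(i) + error for one gene.

    samples: list of (condition, batch, log_expression) tuples, with each
    batch belonging to exactly one condition (nested design).
    Returns (mu, alpha, beta) satisfying the sum-to-zero constraints.
    """
    # Group observations by (condition, batch).
    by_batch = defaultdict(list)
    for cond, batch, y in samples:
        by_batch[(cond, batch)].append(y)

    # The least-squares fitted value of each batch is its sample mean.
    batch_mean = {key: mean(vals) for key, vals in by_batch.items()}

    # mu + alpha_i is the unweighted mean of the batch means in condition i.
    per_cond = defaultdict(list)
    for (cond, _batch), m in batch_mean.items():
        per_cond[cond].append(m)
    cond_mean = {cond: mean(ms) for cond, ms in per_cond.items()}

    # mu is the unweighted mean of the condition means.
    mu = mean(list(cond_mean.values()))
    alpha = {cond: m - mu for cond, m in cond_mean.items()}
    beta = {key: m - cond_mean[key[0]] for key, m in batch_mean.items()}
    return mu, alpha, beta


# Toy data (made up): two conditions, two batches each, two cells per batch.
samples = [
    ("layer_A", "batch_1", 6.6), ("layer_A", "batch_1", 6.4),
    ("layer_A", "batch_2", 5.6), ("layer_A", "batch_2", 5.4),
    ("layer_B", "batch_3", 4.35), ("layer_B", "batch_3", 4.15),
    ("layer_B", "batch_4", 3.85), ("layer_B", "batch_4", 3.65),
]
mu, alpha, beta = fit_nested(samples)
print("mu:", mu)
print("alpha:", alpha)
print("beta:", beta)
```

Because each batch appears in only one condition, the batch effects are identifiable only as deviations from their own condition mean, which is what the per-condition constraints encode.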
Absolute correlation between PCA of log(TPM+1) and QC scores.
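One way such a correlation summary could be computed is sketched below with NumPy: take the principal component scores of log(TPM+1) and correlate each with each QC measure. The function name `pc_qc_abs_correlation`, the matrix orientations, and the simulated data are assumptions for illustration only; the original figure may have been produced differently.

```python
import numpy as np


def pc_qc_abs_correlation(tpm, qc, n_pcs=3):
    """Absolute Pearson correlation between expression PCs and QC measures.

    tpm: genes x cells matrix of TPM values.
    qc:  cells x measures matrix of per-cell QC scores.
    Returns an n_pcs x measures matrix.
    """
    log_expr = np.log1p(tpm)  # log(TPM + 1)
    # Center each gene across cells before PCA.
    centered = log_expr - log_expr.mean(axis=1, keepdims=True)
    # SVD of the cells x genes matrix; u * s gives the cell scores on each PC.
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    scores = u[:, :n_pcs] * s[:n_pcs]
    # corrcoef treats rows as variables, so pass PCs and QC measures as rows.
    corr = np.corrcoef(scores.T, qc.T)
    return np.abs(corr[:n_pcs, n_pcs:])


# Simulated example (made up): one quality factor drives expression.
rng = np.random.default_rng(0)
n_genes, n_cells = 200, 50
quality = rng.normal(size=n_cells)
loadings = rng.normal(size=n_genes)
noise = rng.normal(size=(n_genes, n_cells))
tpm = np.exp(2.0 + 0.5 * np.outer(loadings, quality) + 0.1 * noise)
# First QC measure is the quality factor, the second is unrelated noise.
qc = np.column_stack([quality, rng.normal(size=n_cells)])

abs_corr = pc_qc_abs_correlation(tpm, qc, n_pcs=2)
print(abs_corr.round(2))
```

A large entry in the first row would indicate that the leading expression PC tracks sample quality rather than biology.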
PC1 of the QC score matrix stratified by batch.
RUV can be used to estimate \(W \alpha\) using negative control genes.
Gagnon-Bartsch & Speed (2012) http://dx.doi.org/10.1093/biostatistics/kxr034
Risso et al. (2014) http://dx.doi.org/10.1038/nbt.2931
RUVSeq package http://bioconductor.org/packages/RUVSeq/
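The idea of estimating \(W \alpha\) from negative control genes can be sketched as follows, assuming unwanted variation enters the log-expression matrix as a product \(W \alpha\) (\(W\): samples by \(k\) unwanted factors, \(\alpha\): \(k\) by genes). This is a simplified NumPy illustration in the spirit of the control-gene approach, not the RUVSeq API; the function name `ruv_sketch` and the simulated data are hypothetical.

```python
import numpy as np


def ruv_sketch(log_expr, control_idx, k=1):
    """Estimate and remove k factors of unwanted variation.

    log_expr:    samples x genes matrix of log-expression.
    control_idx: column indices of negative control genes, assumed to be
                 unaffected by the biology of interest.
    Returns (W, alpha, corrected).
    """
    # Center each gene so the factors capture variation, not mean levels.
    centered = log_expr - log_expr.mean(axis=0, keepdims=True)
    # Variation among control genes is assumed to be unwanted only,
    # so their leading left singular vectors estimate W.
    u, _, _ = np.linalg.svd(centered[:, control_idx], full_matrices=False)
    W = u[:, :k]
    # Regress every gene on W to estimate alpha, then subtract W @ alpha.
    alpha, *_ = np.linalg.lstsq(W, centered, rcond=None)
    corrected = log_expr - W @ alpha
    return W, alpha, corrected


# Simulated example (made up): 40 samples, 100 genes, first 30 are controls.
rng = np.random.default_rng(1)
n_samples, n_genes, n_controls = 40, 100, 30
group = np.repeat([0.0, 1.0], n_samples // 2)       # biology of interest
effect = np.zeros(n_genes)
effect[n_controls:] = rng.normal(size=n_genes - n_controls)
unwanted = rng.normal(size=n_samples)               # true unwanted factor
alpha_true = rng.normal(size=n_genes)
log_expr = (5.0 + np.outer(group, effect) + np.outer(unwanted, alpha_true)
            + 0.05 * rng.normal(size=(n_samples, n_genes)))

W, alpha, corrected = ruv_sketch(log_expr, np.arange(n_controls), k=1)
factor_corr = abs(np.corrcoef(W[:, 0], unwanted)[0, 1])
print("correlation of estimated and true factor:", round(factor_corr, 3))
```

Note that subtracting \(W \alpha\) for all genes can also remove biology that happens to be correlated with the estimated factors, which is one reason the choice of control genes and of \(k\) matters.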
Apply a (combination of) normalization method(s).
Rank the normalizations using a set of performance scores.
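The ranking step can be illustrated with a small sketch. It assumes that every performance score is oriented so that higher is better and that normalizations are compared by their average rank across scores; the function name `rank_normalizations`, the normalization labels, and the score values are made up for illustration and are not results from the original study.

```python
def rank_normalizations(scores):
    """Rank normalization schemes by their average rank across metrics.

    scores: dict mapping normalization name -> dict of metric -> score,
            where a higher score is assumed to be better for every metric.
    Returns a list of (name, average_rank), best first.
    Ties within a metric are broken by input order (toy simplification).
    """
    names = list(scores)
    metrics = list(next(iter(scores.values())))
    avg_rank = {name: 0.0 for name in names}
    for metric in metrics:
        # Sort worst to best so the best scheme gets the largest rank.
        ordered = sorted(names, key=lambda n: scores[n][metric])
        for rank, name in enumerate(ordered, start=1):
            avg_rank[name] += rank / len(metrics)
    return sorted(avg_rank.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores for three normalization schemes on two metrics.
toy_scores = {
    "none": {"metric_1": 0.1, "metric_2": 0.2},
    "scaling": {"metric_1": 0.5, "metric_2": 0.4},
    "scaling+ruv": {"metric_1": 0.7, "metric_2": 0.9},
}
ranking = rank_normalizations(toy_scores)
for name, rank in ranking:
    print(name, rank)
```

Working with ranks rather than raw scores keeps metrics on different scales from dominating the average.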
Michael Cole, Nir Yosef, Sandrine Dudoit
Points color-coded by average score.
Heatmap of the 100 most variable genes
In the literature, most approaches can be summarized by three steps.
For each step there are many tuning parameters. E.g.,
Given a base cluster algorithm
Implemented in the R/Bioconductor package clusterExperiment: http://bioconductor.org/packages/clusterExperiment
Elizabeth Purdom
Given an underlying clustering strategy, e.g., k-means or PAM with a particular choice of k, we repeat the following:
Our sequential clustering works as follows.
Inspired by the "tight clustering" algorithm
Tseng and Wong (2005) http://dx.doi.org/10.1111/j.0006-341X.2005.031032.x
Functions to
Available at http://bioconductor.org/packages/clusterExperiment
Liam Purvis and Elizabeth Purdom