- Motivation
- Experimental design: batch effects and confounding
- Normalization: accounting for sample quality
- SCONE: An R package to normalize single-cell RNA-seq
August 3, 2016
The brain is made up of hundreds, if not thousands, of different cell types.
We need a rational way to identify and classify them.
Low sequencing depth: 192 cells per Illumina lane (average 1.2M reads per cell)
| | Olfactory | Cortical |
|---|---|---|
| Mice | 51 | 41 |
| C1 runs | 61 | 40 |
| Illumina lanes | 19 | 7 |
| Cells | 2,627 | 1,249 |
| Cells passing QC | 2,190 | 1,042 |
| Sequenced reads | 4,000 M | 1,500 M |
Each level of the factor of interest, say layer of origin, is observed in each batch.
We can only isolate one cell type per animal / batch.
See also in bioRxiv:
Hicks et al. (2015) http://dx.doi.org/10.1101/025528
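A minimal sketch of why one cell type per batch is a problem, using Python with NumPy and toy labels that are not from this study. The helper `design` is hypothetical: it builds an intercept plus indicator columns for condition and batch. When each condition lives in its own batch, the condition and batch columns coincide and the design matrix loses rank, so the two effects cannot be separated.

```python
import numpy as np

def design(conditions, batches):
    """Intercept + condition indicators + batch indicators (first level dropped)."""
    def dummies(labels):
        levels = sorted(set(labels))[1:]  # first level acts as the reference
        return np.array([[1.0 if lab == lev else 0.0 for lev in levels]
                         for lab in labels])
    intercept = np.ones((len(conditions), 1))
    return np.hstack([intercept, dummies(conditions), dummies(batches)])

# Confounded layout: each condition is observed in a single batch.
X_conf = design(["A", "A", "B", "B"], ["b1", "b1", "b2", "b2"])

# Ideal layout: every condition is observed in every batch.
X_ideal = design(["A", "B", "A", "B"], ["b1", "b1", "b2", "b2"])

rank_conf = np.linalg.matrix_rank(X_conf)    # fewer than the 3 columns
rank_ideal = np.linalg.matrix_rank(X_ideal)  # full column rank
```

In the confounded layout the condition column equals the batch column, which is the algebraic form of "we cannot tell biology from batch".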
Need to have multiple batches per condition and account for batch effects in the design matrix (nested design).
Note that mouse and C1 run effects are still confounded!
See also in bioRxiv:
Tung et al. (2016) http://dx.doi.org/10.1101/062919
To account for batch \(j(i)\) in condition \(i\), we can model the log-expression of each sample \(k\) as
\[ y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}, \]
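subject to the \(n+1\) constraints (one on the condition effects, plus one per condition on its batch effects)
\[ \sum_{i=1}^n \alpha_i = 0; \qquad \sum_{j=1}^{n_i} \beta_{j(i)} = 0 \quad \text{for each } i = 1, \ldots, n, \]
where \(n\) is the number of conditions and \(n_i\) is the number of batches in condition \(i\).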
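As an illustration of the nested model above, here is a small sketch in plain Python, run on made-up numbers. The function `fit_nested` and the labels `cond1`, `run1`, and so on are hypothetical. It uses the fact that, under the sum-to-zero constraints, the batch means determine the parameters: \(\mu\) is the average of the condition means, \(\alpha_i\) is a condition mean minus \(\mu\), and \(\beta_{j(i)}\) is a batch mean minus its condition mean.

```python
from statistics import mean

def fit_nested(data):
    """data: {condition: {batch: [log-expression values]}} for one gene."""
    # Least-squares estimate of each batch's expected value is its sample mean.
    batch_means = {c: {b: mean(v) for b, v in bs.items()}
                   for c, bs in data.items()}
    # Condition mean = unweighted average of that condition's batch means.
    cond_means = {c: mean(list(bm.values())) for c, bm in batch_means.items()}
    mu = mean(list(cond_means.values()))
    alpha = {c: m - mu for c, m in cond_means.items()}
    beta = {c: {b: m - cond_means[c] for b, m in bm.items()}
            for c, bm in batch_means.items()}
    return mu, alpha, beta

# Toy values for a single gene (illustrative only, not data from the study).
toy = {
    "cond1": {"run1": [5.0, 7.0], "run2": [9.0, 11.0]},
    "cond2": {"run3": [1.0, 3.0], "run4": [3.0, 5.0]},
}
mu, alpha, beta = fit_nested(toy)
```

By construction the estimated \(\alpha_i\) sum to zero across conditions, and the \(\beta_{j(i)}\) sum to zero within each condition.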
Absolute correlation between the principal components (PCs) of log(TPM+1) expression and the QC scores.
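A summary of this kind can be computed roughly as follows. This is a sketch on simulated data with NumPy, not the code behind the figure. The function name `pc_qc_abs_correlation` and the simulated QC columns are assumptions.

```python
import numpy as np

def pc_qc_abs_correlation(log_expr, qc, n_pcs=3):
    """Absolute correlation between expression PCs and QC metrics.

    log_expr: cells x genes matrix, e.g. log(TPM + 1).
    qc: cells x metrics matrix of per-cell QC scores.
    Returns an n_pcs x metrics matrix.
    """
    centered = log_expr - log_expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pcs = u[:, :n_pcs] * s[:n_pcs]             # PC scores per cell
    corr = np.corrcoef(pcs, qc, rowvar=False)  # joint correlation matrix
    return np.abs(corr[:n_pcs, n_pcs:])        # keep the PC-vs-QC block

# Simulated example: one QC metric drives expression, one is unrelated.
rng = np.random.default_rng(0)
n_cells, n_genes = 60, 40
quality = rng.normal(size=n_cells)
unrelated = rng.normal(size=n_cells)
qc = np.column_stack([quality, unrelated])
log_expr = (5.0 + np.outer(quality, rng.normal(size=n_genes))
            + rng.normal(scale=0.5, size=(n_cells, n_genes)))
abs_cor = pc_qc_abs_correlation(log_expr, qc)
```

A large entry in the first row means the leading PC tracks sample quality rather than biology. This is the motivation for accounting for sample quality in normalization.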
PC1 of the QC score matrix stratified by batch.
RUV can be used to estimate \(W \alpha\) using negative control genes.
Gagnon-Bartsch & Speed (2012) http://dx.doi.org/10.1093/biostatistics/kxr034
Risso et al. (2014) http://dx.doi.org/10.1038/nbt.2931
RUVSeq package http://bioconductor.org/packages/RUVSeq/
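The idea of estimating \(W \alpha\) from negative control genes can be sketched in a few lines of NumPy. This is a simplified illustration on simulated data, not the RUVSeq implementation. The function `ruv_control_sketch`, the choice of `k`, and the toy data are assumptions. Control genes are assumed to carry no biological signal, so the leading singular vectors of their centered log-expression serve as an estimate of \(W\).

```python
import numpy as np

def ruv_control_sketch(log_expr, control_idx, k=1):
    """Estimate k unwanted factors from control genes and remove them.

    log_expr: cells x genes matrix of log-expression.
    control_idx: column indices of negative control genes (assumed given).
    """
    centered = log_expr - log_expr.mean(axis=0)
    # Variation among control genes is attributed to unwanted factors only.
    u, s, _ = np.linalg.svd(centered[:, control_idx], full_matrices=False)
    W = u[:, :k] * s[:k]
    # Regress every gene on W to estimate alpha, then subtract W @ alpha.
    alpha, *_ = np.linalg.lstsq(W, centered, rcond=None)
    return W, log_expr - W @ alpha

# Simulated toy data: one unwanted factor w affects all genes, while the
# biological signal is absent from the first n_controls genes.
rng = np.random.default_rng(1)
n_cells, n_genes, n_controls = 40, 50, 10
w = rng.normal(size=n_cells)
loadings = rng.uniform(0.5, 1.5, size=n_genes)
biology = np.repeat([0.0, 1.0], n_cells // 2)
beta = np.concatenate([np.zeros(n_controls),
                       rng.normal(size=n_genes - n_controls)])
noise = rng.normal(scale=0.2, size=(n_cells, n_genes))
log_expr = 5.0 + np.outer(w, loadings) + np.outer(biology, beta) + noise

W, corrected = ruv_control_sketch(log_expr, np.arange(n_controls), k=1)
recovered = abs(np.corrcoef(W[:, 0], w)[0, 1])  # how well W tracks w
```

One caveat of this simplified version: if the unwanted factor happens to be correlated with the biology of interest, regressing it out of all genes can also remove some biological signal.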
Apply a (combination of) normalization method(s).
Rank the normalizations using a set of performance scores.
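The two steps above can be illustrated with a toy version in Python. This sketch does not use the SCONE package or its actual metrics. The three normalizations, the two scores, and all function names are assumptions chosen for brevity. Each candidate is applied to simulated counts, scored so that higher is better, and then ordered by its average rank across scores.

```python
import numpy as np

def scale_by(counts, size):
    """Divide each cell (row) by its size factor, rescaled to mean 1."""
    return counts / (size / size.mean())[:, None]

# Candidate normalizations (illustrative subset).
normalizations = {
    "none": lambda c: c,
    "total_count": lambda c: scale_by(c, c.sum(axis=1)),
    "upper_quartile": lambda c: scale_by(c, np.percentile(c, 75, axis=1)),
}

def pc1_depth_score(log_expr, depth):
    """Higher is better: PC1 should not track sequencing depth."""
    centered = log_expr - log_expr.mean(axis=0)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return -abs(np.corrcoef(u[:, 0], depth)[0, 1])

def rle_score(log_expr):
    """Higher is better: per-cell median relative log expression near 0."""
    rle = log_expr - np.median(log_expr, axis=0)
    return -np.mean(np.abs(np.median(rle, axis=1)))

def rank_normalizations(counts, depth):
    names = list(normalizations)
    scores = []
    for name in names:
        log_expr = np.log1p(normalizations[name](counts))
        scores.append([pc1_depth_score(log_expr, depth), rle_score(log_expr)])
    scores = np.array(scores)
    # Rank within each score (1 = worst), then average across scores.
    ranks = scores.argsort(axis=0).argsort(axis=0) + 1
    mean_rank = ranks.mean(axis=1)
    order = np.argsort(-mean_rank)
    return [(names[i], float(mean_rank[i])) for i in order]

# Simulated counts whose only unwanted variation is sequencing depth.
rng = np.random.default_rng(2)
n_cells, n_genes = 50, 100
depth_factor = rng.uniform(0.5, 2.0, size=n_cells)
gene_means = rng.uniform(20, 200, size=n_genes)
counts = rng.poisson(np.outer(depth_factor, gene_means)).astype(float)
ranking = rank_normalizations(counts, counts.sum(axis=1))  # best first
```

Averaging ranks rather than raw scores keeps scores on different scales comparable. In this simulation, leaving the data unnormalized ends up at the bottom of the ranking.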
Michael Cole, Elizabeth Purdom, Nir Yosef, Sandrine Dudoit
Apply and evaluate 99 combinations
Select a normalization procedure
Top three methods by average score:
Points color-coded by average score.
Importance of preserving the biological difference in a nested design.
Heatmap of the 100 most variable genes