We have a single-cell dataset stored as raw_counts, that has been filtered for genes based on minimum expression:
load("../data/raw_counts.Rdata")
print(dim(raw_counts))
## [1] 24005 175
is_ercc <- grepl("ERCC", rownames(raw_counts))
print(table(is_ercc))
## is_ercc
## FALSE TRUE
## 23922 83
Now calculated two sets of size-factors - one using TMM and one using RLE:
sf_tmm <- calcNormFactors(raw_counts, method = "TMM")
sf_rle <- calcNormFactors(raw_counts, method = "RLE")
We can compare them to each other:
qplot(sf_tmm, sf_rle) + xlab("TMM size factor") + ylab("RLE size factor")
Now make normalised count matrices:
norm_counts_tmm <- t(t(raw_counts) / sf_tmm)
norm_counts_rle <- t(t(raw_counts) / sf_rle)
And compare the mean expression of each gene:
data_frame(tmm_mean = rowMeans(norm_counts_tmm),
rle_mean = rowMeans(norm_counts_rle),
is_ercc) %>%
ggplot(aes(x = tmm_mean, y = rle_mean, color = is_ercc)) +
geom_point(alpha = 0.7) + scale_color_brewer(palette = "Set1") +
xlab("Mean expression (TMM normalisation)") +
ylab("Mean expression (RLE normalisation)") +
scale_x_log10() + scale_y_log10()
We see good agreement, except in highly-expressed spike ins where a bizarre splitting happens.
We can subsequently make the mean-CV2 plots of the Brennecke et al. Nature Methods paper. Light grey dots are ``biological’’ genes and red are spike-ins.
make_brennecke_plot <- function(normc, is_ercc) {
mean_exprs <- rowMeans(normc)
var_exprs <- matrixStats::rowVars(normc)
CV2 <- var_exprs / mean_exprs^2
dd <- data_frame(CV2 = CV2, mean = mean_exprs, is_ercc)
dd <- filter(dd, mean > 0.001)
cols <- brewer.pal(3, "Set1")
ggplot(dd, aes(x = mean, y = CV2)) + geom_point(color = "darkgrey", size = 0.5) +
geom_point(data = filter(dd, is_ercc), fill = cols[1], size = 2, shape = 21, color = "black") +
scale_x_log10() + scale_y_log10() + xlab(expression(mu)) + ylab(expression(CV^2))
}
Plot for RLE:
make_brennecke_plot(norm_counts_rle, is_ercc)
then for TMM:
make_brennecke_plot(norm_counts_tmm, is_ercc)
The odd disagreement of the ERCC spike-ins clearly has a huge effect on these CV plots, with the RLE method suggesting that the spike-ins are highly variable.
Technical info:
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 edgeR_3.12.1 limma_3.26.9
## [4] dplyr_0.5.0 ggplot2_2.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 knitr_1.13 magrittr_1.5 munsell_0.4.3
## [5] colorspace_1.2-6 R6_2.1.2 stringr_1.1.0 plyr_1.8.4
## [9] tools_3.3.0 grid_3.3.0 gtable_0.2.0 DBI_0.4-1
## [13] htmltools_0.3.5 yaml_2.1.13 digest_0.6.10 assertthat_0.1
## [17] tibble_1.1 formatR_1.4 evaluate_0.9 rmarkdown_1.0
## [21] stringi_1.1.1 scales_0.4.0