Mysteries of single-cell normalisation

We have a single-cell dataset stored as raw_counts, that has been filtered for genes based on minimum expression:

load("../data/raw_counts.Rdata")

print(dim(raw_counts))

## [1] 24005   175

is_ercc <- grepl("ERCC", rownames(raw_counts))
print(table(is_ercc))

## is_ercc
## FALSE  TRUE 
## 23922    83

Now calculated two sets of size-factors - one using TMM and one using RLE:

sf_tmm <- calcNormFactors(raw_counts, method = "TMM")
sf_rle <- calcNormFactors(raw_counts, method = "RLE")

We can compare them to each other:

qplot(sf_tmm, sf_rle) + xlab("TMM size factor") + ylab("RLE size factor")

Now make normalised count matrices:

norm_counts_tmm <- t(t(raw_counts) / sf_tmm)
norm_counts_rle <- t(t(raw_counts) / sf_rle)

And compare the mean expression of each gene:

data_frame(tmm_mean = rowMeans(norm_counts_tmm), 
           rle_mean = rowMeans(norm_counts_rle),
           is_ercc) %>% 
  ggplot(aes(x = tmm_mean, y = rle_mean, color = is_ercc)) + 
  geom_point(alpha = 0.7) + scale_color_brewer(palette = "Set1") +
  xlab("Mean expression (TMM normalisation)") +
  ylab("Mean expression (RLE normalisation)") +
  scale_x_log10() + scale_y_log10()

We see good agreement, except in highly-expressed spike ins where a bizarre splitting happens.

We can subsequently make the mean-CV2 plots of the Brennecke et al. Nature Methods paper. Light grey dots are ``biological’’ genes and red are spike-ins.

make_brennecke_plot <- function(normc, is_ercc) {
  mean_exprs <- rowMeans(normc)
  var_exprs <- matrixStats::rowVars(normc)
  CV2 <- var_exprs / mean_exprs^2
  dd <- data_frame(CV2 = CV2, mean = mean_exprs, is_ercc)
  
  dd <- filter(dd, mean > 0.001)
  cols <- brewer.pal(3, "Set1")
  
  ggplot(dd, aes(x = mean, y = CV2)) + geom_point(color = "darkgrey", size = 0.5) +
    geom_point(data = filter(dd, is_ercc), fill = cols[1], size = 2, shape = 21, color = "black") + 
    scale_x_log10() + scale_y_log10() + xlab(expression(mu)) + ylab(expression(CV^2)) 
}

Plot for RLE:

make_brennecke_plot(norm_counts_rle, is_ercc)

then for TMM:

make_brennecke_plot(norm_counts_tmm, is_ercc)

The odd disagreement of the ERCC spike-ins clearly has a huge effect on these CV plots, with the RLE method suggesting that the spike-ins are highly variable.

Technical info:

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-2 edgeR_3.12.1       limma_3.26.9      
## [4] dplyr_0.5.0        ggplot2_2.1.0     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6      knitr_1.13       magrittr_1.5     munsell_0.4.3   
##  [5] colorspace_1.2-6 R6_2.1.2         stringr_1.1.0    plyr_1.8.4      
##  [9] tools_3.3.0      grid_3.3.0       gtable_0.2.0     DBI_0.4-1       
## [13] htmltools_0.3.5  yaml_2.1.13      digest_0.6.10    assertthat_0.1  
## [17] tibble_1.1       formatR_1.4      evaluate_0.9     rmarkdown_1.0   
## [21] stringi_1.1.1    scales_0.4.0

Mysteries of single-cell normalisation

Kieran Campbell

30 August 2016