Accounting for technical noise in single-cell RNA-seq experiments

Ian Donaldson
November 25th, 2015

blgc jclub

Background

Single-cell gene expression comparisons are now possible because we can isolate single cells and sequence from picogram quantities of RNA.

Previously, when comparing populations of cells, we could “ignore” technical noise.

Now it is essential to quantify it in order to distinguish it from biological noise.

See Supplementary note 1.

Figure 1

Dilution series of A. thaliana RNA

Comparing replicates to visualize noise

Figure 1a

5000 pg

note: good reproducibility down to 10^2 read counts

Figure 1d

only 10 pg of total RNA

note: still good reproducibility at 10^5 reads

Main idea of paper

1. Technical noise is related to a gene's average read count.

2. This relationship can be inferred from spike-ins.

Figure 2

Panels a and b compare normalized counts between two replicates

Blue = HeLa RNA spike-in

Brown = A. thaliana RNA

Figure 2a

read counts around 10^2 have maximal variability

detection of biological variability is impossible here

Figure 2b

Additional noise is “apparent” here and presumably corresponds to biological noise

These are already “normalized” counts

Figure 2c

HeLa spike-in

coefficient of variation versus average normalized read count

each point is a gene

the solid red line represents the fitted variance-mean dependence

the dashed lines mark the 95% confidence interval
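
A minimal sketch of how a variance-mean trend could be fitted to spike-in genes. It assumes the technical noise follows CV^2 = a0 + a1 / mean, a commonly used form that these slides do not confirm, and it uses ordinary least squares as a stand-in for whatever fitting procedure the paper applies. The function names and the spike-in values are hypothetical.

```python
def fit_cv2_trend(means, cv2s):
    """Least-squares fit of cv2 = a0 + a1 / mean. Returns (a0, a1)."""
    # The assumed model is linear in x = 1 / mean, so simple linear regression suffices.
    xs = [1.0 / m for m in means]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(cv2s) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, cv2s))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1


# Hypothetical spike-in summaries, generated exactly from cv2 = 0.05 + 2 / mean.
spike_means = [5.0, 20.0, 100.0, 1000.0, 10000.0]
spike_cv2 = [0.05 + 2.0 / m for m in spike_means]

a0, a1 = fit_cv2_trend(spike_means, spike_cv2)


def expected_technical_cv2(mean):
    """Technical CV^2 predicted by the fitted trend at a given mean count."""
    return a0 + a1 / mean
```

Under this assumed form, a1 / mean dominates at low counts and a0 sets the noise floor for highly expressed genes, which matches the idea that technical noise depends on a gene's average read count.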

Normalized counts

s_j is the size factor for sample j, a measure of “how good the sample is”

k_ij is the raw read count of gene i in sample j

norm_counts = k_ij / s_j

Does not take into account transcript length

See Supplementary note 4
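
A sketch of the normalization step, dividing each raw count by its sample's size factor. How the size factors are estimated is not stated on these slides; the code assumes the DESeq-style median-of-ratios estimate, which relies on the per-gene geometric mean across samples. The function names and the toy count matrix are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """counts[i][j] is the raw read count of gene i in sample j.

    Returns one size factor per sample (median-of-ratios, an assumed estimator).
    """
    n_samples = len(counts[0])
    # Keep only genes with non-zero counts in every sample,
    # so that the geometric mean is defined and positive.
    usable = [row for row in counts if all(k > 0 for k in row)]
    # Geometric mean of each usable gene across samples.
    geo_means = [
        math.exp(sum(math.log(k) for k in row) / n_samples) for row in usable
    ]
    # For each sample, take the median ratio of its counts to the geometric means.
    return [
        median(row[j] / g for row, g in zip(usable, geo_means))
        for j in range(n_samples)
    ]


def normalize(counts, factors):
    """Divide every count by the size factor of its sample."""
    return [[k / s for k, s in zip(row, factors)] for row in counts]


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [0, 4],
    [50, 100],
]
s = size_factors(counts)
norm_counts = normalize(counts, s)
```

In this toy example the second sample gets a size factor twice that of the first, so genes that differ only by sequencing depth end up with equal normalized counts.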

Fitting coefficient of variation to mean normalized read counts

Figure 2d

Magenta points are genes whose biological CV significantly exceeds 50%

the dashed line represents the expected position of genes with 50% biological CV

False-discovery rate is controlled at 10%
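
A sketch of what controlling the false-discovery rate at 10% can look like in practice. The slides do not name the procedure; the code assumes the Benjamini-Hochberg adjustment, and the per-gene p-values are hypothetical placeholders for the output of a variance test.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per gene.
p_values = [0.001, 0.04, 0.03, 0.5, 0.008]
padj = benjamini_hochberg(p_values)

# Genes called highly variable at a 10% FDR threshold.
highly_variable = [i for i, q in enumerate(padj) if q < 0.10]
```

With these placeholder values, every gene except the one with p = 0.5 passes the 10% threshold.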

Figure 3

The large number of RNA species in the HeLa spike-in provides a lot of statistical power

It also takes up about 50% of the reads.

The ERCC92 spike-in covers a wide range of RNAs but takes up fewer reads

It may not have sufficient power for experiments with a small number of cells (Supp. Figure 11).

But it is adequate for large numbers of cells (e.g. 96)

Figure 3

Blue points are ERCC spike-ins

Magenta points have 50% or more biological variation at 10% FDR

Discussion points

What are the positive and negative controls used to assess the method in the paper?

How severe are the limitations of the method?

Coefficient of variation

The coefficient of variation is a measure of spread that describes the amount of variability relative to the mean. Because the coefficient of variation is unitless, you can use it instead of the standard deviation to compare the spread of data sets that have different units or different means.
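
A small illustration of why the coefficient of variation is useful for comparing spread across different means. The two data sets are made up for the example; the function name is hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(values) / mean(values)


# Two made-up data sets whose means differ by a factor of 100.
low = [8, 10, 12]
high = [800, 1000, 1200]

cv_low = coefficient_of_variation(low)
cv_high = coefficient_of_variation(high)
```

The standard deviations differ by a factor of 100 (2 versus 200), yet both data sets have the same CV of 0.2, so their relative spread is identical.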

Explanation of geometric mean
