This is a living document (i.e., it will be updated) in which I compare results from eXpress and Salmon on some RNAseq datasets.

Rice RNAseq with genome-based reference

First up: a single sample of Oryza sativa paired-end RNAseq using the MSU rice v7 genome-based transcriptome as the reference.

Expression distribution comparison

I always visualise the distribution of transcript quantification results between different libraries, samples, conditions and methods. It’s an easy way to spot major problems like failed sequencing runs or alignment runs that have errored out without being noticed.

In this plot we exclude all the transcripts with FPKM < 0.01 because they are fairly meaningless but can skew correlations.

[Figure: distributions of transcript FPKM from eXpress and Salmon, excluding transcripts with FPKM < 0.01]

The distributions are broadly similar, but there are noticeable differences between them. In particular, Salmon places more transcripts in the very low (<0.5 FPKM) and mid (~10 FPKM) ranges, while eXpress places more in the 0.5-2 FPKM range. The differences are mostly concentrated at the low end of the scale, so given only this evidence they are unlikely to greatly affect downstream biological interpretation.
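As a rough illustration of the filtering and binning described above, here is a minimal Python sketch. It is not the code used to make the plot: the function name `log_histogram`, the bin width and the FPKM values are all made up for the example, and I'm assuming the 0.01 cutoff is applied to each method separately.

```python
import math
from collections import Counter

FPKM_FLOOR = 0.01  # cutoff quoted in the text


def log_histogram(fpkm_values, floor=FPKM_FLOOR, bins_per_decade=2):
    """Count transcripts per log10(FPKM) bin, dropping values below the floor.

    Bin index b covers log10(FPKM) in [b / bins_per_decade, (b + 1) / bins_per_decade).
    """
    counts = Counter()
    for value in fpkm_values:
        if value < floor:
            continue  # near-zero estimates are excluded, as in the text
        counts[math.floor(math.log10(value) * bins_per_decade)] += 1
    return dict(sorted(counts.items()))


# Hypothetical FPKM estimates for five transcripts (not real data).
express_fpkm = {"t1": 0.005, "t2": 0.8, "t3": 1.5, "t4": 12.0, "t5": 0.0}
salmon_fpkm = {"t1": 0.2, "t2": 0.3, "t3": 9.0, "t4": 11.0, "t5": 0.0}

express_hist = log_histogram(express_fpkm.values())
salmon_hist = log_histogram(salmon_fpkm.values())

# Print the per-bin transcript counts side by side.
for b in sorted(set(express_hist) | set(salmon_hist)):
    print(b, express_hist.get(b, 0), salmon_hist.get(b, 0))
```

Binning on a log scale is what makes differences at the low end of the range visible at all; on a linear scale they would be squashed into the first bin.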

Transcript-level correlation

Next we should look at the individual transcripts - how does their expression compare as measured by the two programs?

[Figure: expression of individual transcripts, eXpress versus Salmon]

It seems as though a large number of transcripts are consistently given higher expression by Salmon than eXpress, right across the range of expression levels. In some cases the expression is thousands of times higher in Salmon. It’s hard to see how many transcripts are affected by this, so let’s try to expose it…

[Figure: the transcript-level comparison with a density gradient and contour map overlaid]

Here we’ve overlaid a density gradient and contour map. The map has 20 levels, and the majority of the skewed points are outside any of the contours. This tells us that fewer than 5% of the transcripts are in the heavily skewed set. The lowest contour bin shows a small expansion towards the Salmon side, so perhaps as many as 10% of transcripts have some appreciable level of expression difference between the two methods.

Does this matter? For FPKM < 1, a two-fold difference probably corresponds to a difference of less than one transcript per cell. For FPKM < 10, the difference is still small and should not affect the calls of any decent differential expression software.
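Reading the proportion of skewed transcripts off a contour map is only approximate. A more direct check is to count the transcripts whose Salmon estimate exceeds the eXpress estimate by some fold change. The sketch below is one way to do that, not the analysis behind the plots: `fraction_skewed`, the two-fold default and the example values are all hypothetical, and I'm assuming both estimates are in FPKM and keyed by the same transcript IDs.

```python
import math


def fraction_skewed(express, salmon, min_fold=2.0, floor=0.01):
    """Fraction of shared transcripts where Salmon is >= min_fold times eXpress.

    Transcripts below `floor` in either method are ignored, because ratios
    of near-zero estimates are dominated by noise.
    """
    shared = [
        t for t in express
        if t in salmon and express[t] >= floor and salmon[t] >= floor
    ]
    if not shared:
        return 0.0
    cutoff = math.log2(min_fold)
    skewed = sum(
        1 for t in shared if math.log2(salmon[t] / express[t]) >= cutoff
    )
    return skewed / len(shared)


# Hypothetical FPKM estimates for four transcripts (not real data).
express_fpkm = {"a": 1.0, "b": 2.0, "c": 5.0, "d": 10.0}
salmon_fpkm = {"a": 1.1, "b": 8.0, "c": 5.0, "d": 9.0}

# Only transcript "b" is at least two-fold higher in Salmon.
print(fraction_skewed(express_fpkm, salmon_fpkm))
```

Raising `min_fold` to something like 1000 would isolate the heavily skewed set, while a value closer to 1.5 would capture the broader group with some appreciable difference.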

TODO:

To be continued…