Transcriptome assembly quality assessment

Hellen Butungi, Judith Alawa, Wangwe Ibrahim, Francis Mwatuni, Dawit Kidanemariam, Triza Tonui, Fatma Hussein, Donwilliams Otieno, Job Areba
5th December 2015

1. Loading the assembly data

Human RNAseq data assembled with two assemblers: Oases and Soap. We want to compare the quality of the two assemblies.

# 1. load the files
oases <- read.csv('./oases.csv')
soap <- read.csv('./soapdenovotrans.csv')

# add an 'assembler' column
oases$assembler <- 'oases'
soap$assembler <- 'soap'

# create a merged data frame
assemblies <- rbind(oases, soap)

2. Number of contigs in each assembly

plot of chunk unnamed-chunk-3

3. Blast hits against human reference transcriptome

plot of chunk unnamed-chunk-4

4. Total number of reference transcripts found

total_hits <- length(levels(assemblies[!is.na(assemblies$hits), 'hits']))
print(total_hits)

[1] 15949

5. How many reference transcripts in each assembly?

hits_each <- data.frame(
  oases = length(levels(oases[!is.na(oases$hits), 'hits'])),
  soap = length(levels(soap[!is.na(soap$hits), 'hits']))
)
print(hits_each)

  oases  soap
1 14141 11841

6. How many hits unique to each assembly?

oases_hits <- levels(oases[!is.na(oases$hits), 'hits'])
soap_hits <- levels(soap[!is.na(soap$hits), 'hits'])
unique_hits <- data.frame(
  oases = length(setdiff(oases_hits, soap_hits)),
  soap = length(setdiff(soap_hits, oases_hits))
)
print(unique_hits)

  oases soap
1  4108 1808

7. Average reference coverage for each assembly

oases_coverage <- oases[!is.na(oases$reference_coverage), 'reference_coverage']
soap_coverage <- soap[!is.na(soap$reference_coverage), 'reference_coverage']
average_coverage <- data.frame(
  calculation = c('mean'),
  oases = c(mean(oases_coverage)),
  soap = c(mean(soap_coverage))
)
print(average_coverage)

  calculation     oases      soap
1        mean 0.6581914 0.6473458

8. Distribution of reference coverage

plot of chunk unnamed-chunk-9

9. How many well-assembled contigs in each assembly?

plot of chunk unnamed-chunk-10

10. What proportion of the 198,634 human transcripts were found?

plot of chunk unnamed-chunk-11

11. Overall assembly score

Mean coverage multipled by proportion of reference transcripts found

plot of chunk unnamed-chunk-12

Conclusions

Both assemblers had over 10,000 reference hits
Oases seems better overall
But both recover less than 5% of the total transcriptome
Both contain thousands of unique transcripts - so both are useful