Saline Isolates August, 2019

There are two hypersaline isolates slated for assembly and annotation: AR, the Kuwait BTEX strain, and WPC, a domestic isolate.

Much of this work was already done before starting this document. Briefly, WPC assembled normally and an annotated genome is ready for upload. However, AR showed what I interpreted as signs of contamination.

1 Strain AR

1.1 Assembly pipeline overview

  1. Subsample reads down to approximately 100X coverage, or 3.4 million reads, using SeqTK
  2. Quality trim with Trimmomatic using a 4:30 sliding window
  3. Assemble with SPAdes, using either no coverage cutoff or a coverage cutoff of 15X
  4. Remove all contigs shorter than 500bp
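
The relationship between read count and coverage in step 1 can be sketched as below. This is only an illustration of the arithmetic, not the calculation actually used for the subsampling: the function names are hypothetical, and the 5 Mb genome size and 150 bp read length in the first call are assumed values, since neither is recorded in these notes.

```r
# Reads needed to reach a target coverage (counting single reads, not pairs).
reads_for_coverage <- function(target.cov, genome.size, read.length) {
  ceiling(target.cov * genome.size / read.length)
}

# Mean read length implied by a fixed read count at a target coverage.
implied_read_length <- function(target.cov, genome.size, n.reads) {
  target.cov * genome.size / n.reads
}

# Assumed inputs (hypothetical): 5 Mb genome, 150 bp reads.
reads_for_coverage(100, 5e6, 150)

# Using the final assembly length (4,552,116 bp) and the 3.4 million reads from step 1.
implied_read_length(100, 4552116, 3.4e6)
```

The second call is a sanity check only: it shows what mean read length would make 3.4 million reads equal 100X over an assembly of this size.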

Some basic stats on the assembly with no minimum coverage:

library(Biostrings)  # readDNAStringSet()

AR.1 <- readDNAStringSet("assemblies/AR/scaffolds.fasta")
name <- names(AR.1)
seq <- paste(AR.1)

# SPAdes headers look like NODE_<n>_length_<bp>_cov_<x>:
# field 4 is the scaffold length, field 6 is the coverage
len.split <- function(x){
  strsplit(x,"_")[[1]][4]
}
scaffold.length <- as.numeric(unlist(lapply(name, len.split)))

cov.split <- function(x){
  strsplit(x,"_")[[1]][6]
}
coverage <- as.numeric(unlist(lapply(name, cov.split)))

AR.1.info <- data.frame(name,scaffold.length,coverage,stringsAsFactors = FALSE)

# sum the scaffold.length column to get the total assembly length
paste("total length is",sum(AR.1.info$scaffold.length))
## [1] "total length is 4552116"
library(ggplot2)
ggplot(AR.1.info, aes(x=scaffold.length, y=coverage)) + geom_point()

The subsets of lower coverage contigs become evident in this plot.
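
To make the visual impression from the plot a bit more explicit, low-coverage contigs could be flagged relative to a length-weighted median coverage. This is a minimal sketch and not something that was run on the AR assembly: the helper names, the toy data frame and the 0.25 fraction are all assumptions chosen for illustration.

```r
# Length-weighted median: the coverage value at which half of the assembled bases are reached.
weighted_median <- function(x, w) {
  ord <- order(x)
  cum <- cumsum(w[ord]) / sum(w)
  x[ord][which(cum >= 0.5)[1]]
}

# Flag contigs whose coverage is below a fraction of the weighted median (fraction is arbitrary).
flag_low_coverage <- function(info, fraction = 0.25) {
  ref <- weighted_median(info$coverage, info$scaffold.length)
  info$low.cov <- info$coverage < fraction * ref
  info
}

# Toy table with the same columns as AR.1.info (values are made up).
toy <- data.frame(
  name = c("NODE_1", "NODE_2", "NODE_3", "NODE_4"),
  scaffold.length = c(500000, 300000, 12000, 800),
  coverage = c(95, 105, 8, 3),
  stringsAsFactors = FALSE
)
flagged <- flag_low_coverage(toy)
flagged[flagged$low.cov, ]
```

Weighting by length keeps the many short, low-coverage contigs from dragging the reference coverage down.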

library(DT)  # datatable()

# sort scaffolds from highest to lowest coverage
AR.1.covsort <- AR.1.info[order(-AR.1.info$coverage),]
datatable(AR.1.covsort)
paste(sum(AR.1.covsort$scaffold.length),"total assembly length of contigs >499bp")
## [1] "4552116 total assembly length of contigs >499bp"
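
Total length says little about contiguity, so an N50 would be a natural companion statistic. The sketch below is an addition, not part of the original analysis; `n50` is a hypothetical helper and the lengths are toy values.

```r
# N50: the length of the contig at which the running total of lengths,
# taken from longest to shortest, first reaches half of the assembly size.
n50 <- function(lengths) {
  sorted <- sort(lengths, decreasing = TRUE)
  sorted[which(cumsum(sorted) >= sum(sorted) / 2)[1]]
}

toy.lengths <- c(100, 200, 300, 400, 500)
n50(toy.lengths)
```

On the real table this would be called as `n50(AR.1.info$scaffold.length)`.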

Among the low coverage contigs, there is one, NODE_88, which is 11.9kb long. BLAST a few segments of this against the NCBI nr database; perhaps the results will be informative as to whether this is contamination or not.
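
Cutting out "a few segments" can be done by hand, but a small helper makes it repeatable. This is a sketch under assumptions: `sample_segments` is a hypothetical function, the segment count and width are arbitrary, and the contig here is random sequence of the same length as NODE_88, not the real scaffold.

```r
# Cut n evenly spaced segments of a given width out of one sequence string.
sample_segments <- function(sequence, n = 3, width = 1000) {
  len <- nchar(sequence)
  stopifnot(len >= width)
  starts <- round(seq(1, len - width + 1, length.out = n))
  segs <- substring(sequence, starts, starts + width - 1)
  names(segs) <- paste0("seg_", starts, "_", starts + width - 1)
  segs
}

# Random 11.9 kb stand-in for a contig (made-up sequence).
set.seed(1)
toy.contig <- paste(sample(c("A", "C", "G", "T"), 11900, replace = TRUE), collapse = "")
segs <- sample_segments(toy.contig, n = 3, width = 1000)
nchar(segs)

# FASTA-formatted records, ready to paste into a BLAST query box.
fasta <- paste0(">", names(segs), "\n", segs)
```

With the real data, the input would be the NODE_88 entry converted to a character string, e.g. via `as.character()` on the corresponding element of `AR.1`.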

Strangely, the only hit is from our own KBTEX submission from the MAG. This was a megablast search (highly similar sequences), though, not blastn.

Check another long contig with low coverage, NODE_58: same result, only one hit. Try a normal blastn search. All of the hits, although weak, are from various Gammaproteobacteria; thus, this is consistent with it being part of an Arhodomonas genome. The danger now is that the putative contaminant could also be a gammaproteobacterium. I checked another one, NODE_260, and found stronger hits to various Pseudomonas genomes.

1.2 Checking assembly quality/contamination

1.2.1 CheckM

Within a python2 conda environment, run:

checkm lineage_wf -f AR.checkM.results --pplacer_threads 6 -t 6 -x fa AR.ass checkM.AR
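library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box(), %>%

# skip = 3 drops the leading dashed line, the header and the second dashed line;
# comment.char = "-" makes read.table ignore the trailing dashed line
AR.cm <- read.table("checkM.AR/AR.checkM.results", as.is = TRUE, comment.char = "-", skip=3)
# the marker lineage, e.g. "c__Gammaproteobacteria (UID4267)", splits on whitespace
# into two columns: Marker_lineage and lineage_code
colnames(AR.cm) <- c("Bin_Id","Marker_lineage","lineage_code","genomes","markers","marker_sets",
                   "V0","V1","V2","V3","V4","V5+","Completeness","Contamination","Strain_heterogeneity")
kable(AR.cm) %>%
  kable_styling() %>%
  scroll_box(width = "100%")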
Bin_Id        Marker_lineage          lineage_code  genomes  markers  marker_sets  V0  V1   V2  V3  V4  V5+  Completeness  Contamination  Strain_heterogeneity
AR.scaffolds  c__Gammaproteobacteria  (UID4267)     119      544      284          31  479  32  2   0   0    93.29         10.22          73.68

10% estimated contamination is troubling. It isn’t clear from this alone where that is coming from; I presume duplicated single-copy conserved genes, since the table shows 32 markers found twice and 2 found three times. Manual examination of the SCCP suggests that CheckM is very liberal with detection, i.e. there are many false positives, which might inflate the contamination number.
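
As a rough cross-check, the marker copy counts in the CheckM row can be turned into naive per-marker percentages. This is a simplified sketch, not CheckM's actual calculation: as far as I understand it, CheckM averages over collocated marker sets rather than over individual markers, which is why these numbers differ from the reported 93.29 and 10.22. The variable names are my own.

```r
# Number of markers found 0, 1, 2, 3, 4 and 5+ times, copied from the CheckM row above.
copies <- c(`0` = 31, `1` = 479, `2` = 32, `3` = 2, `4` = 0, `5+` = 0)
n.markers <- sum(copies)

# Naive completeness: share of markers seen at least once.
naive.completeness <- 100 * (n.markers - copies[["0"]]) / n.markers

# Naive contamination: surplus copies per marker (5+ counted as 5, so a lower bound).
extra.copies <- sum(copies * c(0, 0, 1, 2, 3, 4))
naive.contamination <- 100 * extra.copies / n.markers

round(c(completeness = naive.completeness, contamination = naive.contamination), 2)
```

The copy counts sum to 544, matching the markers column, so the row was parsed into the right columns.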

1.2.2 MiGA

The MiGA report was less troubling: 93.7% completeness and 3.6% contamination. This is totally acceptable. The MyTaxaScan also suggests continuity in the genome.

The taxa pattern is consistent throughout the assembly for the most part: a mix of Proteobacteria, Nitrococcus, Halomonas and Chromohalobacter, all things to be expected from an Arhodomonas. These are all Gammaproteobacteria, although from two different orders, Oceanospirillales and Chromatiales. All are halophilic AFAIK. The consistency of the pattern throughout the assembly makes me more comfortable, however; if there is a contaminant, it is fairly minor and is likely closely related to the Arhodomonas.

1.3 Purity conclusions

The total assembly size of ~4.5 Mb, the MiGA contamination report, and the MyTaxaScan make me comfortable that this is a good quality draft genome.

1.4 Annotation

in progress

RWMurdoch

October 4, 2019