Saline Isolates August, 2019

Two hypersaline isolates are slated for assembly and annotation: AR, the Kuwait BTEX strain, and WPC, a domestic isolate.

Much of this work was already done before starting this document. Briefly, WPC assembled normally and an annotated genome is ready for upload. However, AR showed what I interpreted as signs of contamination.

1 strain AR

1.1 assembly pipeline overview

Subsample reads down to approximately 100X coverage, or 3.4 million reads, using SeqTK
Quality trim with Trimmomatic using a 4:30 sliding window (window size 4, minimum average quality 30)
Assemble with SPAdes, either with no coverage cutoff or with a coverage cutoff of 15X
Remove all contigs shorter than 500 bp
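The relationship between read count and coverage in the subsampling step can be sketched with a little arithmetic. This is only an illustration: the genome size is taken as the approximate assembly size reported later in these notes, the 150 bp read length is a hypothetical value, and whether "reads" means single reads or pairs is not stated here.

```r
# Number of reads needed to reach a target coverage.
reads_for_coverage <- function(target_cov, genome_size, read_length) {
  ceiling(target_cov * genome_size / read_length)
}

# Mean read length implied by a given read count at a given coverage.
implied_read_length <- function(n_reads, target_cov, genome_size) {
  target_cov * genome_size / n_reads
}

genome_size <- 4.5e6  # assumption: approximate assembly size, not a measured genome size

# Hypothetical 150 bp reads (read length is not given in these notes)
reads_for_coverage(100, genome_size, read_length = 150)

# Read length that would make 3.4 million reads equal 100X on a 4.5 Mb genome
implied_read_length(3.4e6, 100, genome_size)
```

Under these assumptions, 3.4 million reads at 100X corresponds to an average of roughly 130 bp per read, which is plausible for quality-trimmed short reads.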

Some basic stats on the assembly with no minimum coverage:

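library(Biostrings)

# Read the SPAdes scaffolds; length and coverage are encoded in each header
AR.1 <- readDNAStringSet("assemblies/AR/scaffolds.fasta")
name <- names(AR.1)
seq <- paste(AR.1)

# Scaffold length is the 4th underscore-delimited field of the header
len.split <- function(x){
  strsplit(x,"_")[[1]][4]
}
scaffold.length <- as.numeric(unlist(lapply(name, len.split)))

# Coverage is the 6th underscore-delimited field of the header
cov.split <- function(x){
  strsplit(x,"_")[[1]][6]
}
coverage <- as.numeric(unlist(lapply(name, cov.split)))

AR.1.info <- data.frame(name,scaffold.length,coverage,stringsAsFactors = F)

# Total assembled length in bp
paste("total length is",sum(AR.1.info$scaffold.length))

## [1] "total length is 4552116"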
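As an alternative to splitting on underscores, the header fields can be pulled out with a single regular expression that also fails loudly when a header does not match. This is a minimal base R sketch, assuming SPAdes-style headers of the form NODE_<id>_length_<bp>_cov_<coverage>; the function name parse_spades_headers and the example headers are made up for illustration.

```r
# Parse SPAdes-style headers: NODE_<id>_length_<bp>_cov_<coverage>
parse_spades_headers <- function(headers) {
  pattern <- "^NODE_([0-9]+)_length_([0-9]+)_cov_([0-9.]+).*$"
  ok <- grepl(pattern, headers)
  if (!all(ok)) {
    stop("unexpected header format: ", paste(headers[!ok], collapse = ", "))
  }
  data.frame(
    name = headers,
    node = as.integer(sub(pattern, "\\1", headers)),
    scaffold.length = as.numeric(sub(pattern, "\\2", headers)),
    coverage = as.numeric(sub(pattern, "\\3", headers)),
    stringsAsFactors = FALSE
  )
}

# Made-up headers, not taken from the AR assembly
example <- c("NODE_1_length_250000_cov_98.5", "NODE_2_length_1200_cov_12.3")
parsed <- parse_spades_headers(example)
sum(parsed$scaffold.length)
```

Referring to columns by their full names (for example parsed$scaffold.length) keeps the totals from silently depending on partial matching by `$`.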

library(ggplot2)
ggplot(AR.1.info, aes(x=scaffold.length, y=coverage)) + geom_point()

The subsets of lower-coverage contigs become evident in this plot.

library(DT)
# Sort contigs by decreasing coverage and show them as an interactive table
AR.1.covsort <- AR.1.info[order(-AR.1.info$coverage),]
datatable(AR.1.covsort)

paste(sum(AR.1.covsort$scaffold.length),"total assembly length of contigs >499bp")

## [1] "4552116 total assembly length of contigs >499bp"
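One way to make "low coverage" less subjective is to flag contigs whose coverage falls well below the length-weighted median coverage of the assembly. The sketch below is only a possible approach, not what was done here: the function name flag_low_coverage, the 0.25 fraction, and the toy data frame are all assumptions for illustration. It expects the same scaffold.length and coverage columns as AR.1.info.

```r
# Flag contigs whose coverage is below a fraction of the length-weighted median coverage.
flag_low_coverage <- function(info, fraction = 0.25) {
  ord <- order(info$coverage)
  cum_len <- cumsum(info$scaffold.length[ord])
  # Coverage of the contig that contains the midpoint of the total assembled length
  half <- sum(info$scaffold.length) / 2
  weighted_median <- info$coverage[ord][which(cum_len >= half)[1]]
  info$low.coverage <- info$coverage < fraction * weighted_median
  info
}

# Toy values, not taken from the AR assembly
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  scaffold.length = c(500000, 300000, 12000, 800),
  coverage = c(100, 95, 15, 4),
  stringsAsFactors = FALSE
)
flagged <- flag_low_coverage(toy)
flagged[flagged$low.coverage, ]
```

Weighting by length keeps a large number of tiny contigs from dragging the reference coverage down.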

Among the low-coverage contigs there is one, NODE_88, which is 11.9 kb long. BLAST a few segments of it at NCBI (nr); perhaps the results will indicate whether this is contamination or not.
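Picking "a few segments" of a contig for BLAST can be done reproducibly by cutting evenly spaced windows from the sequence string. This is a base R sketch under assumed settings: the function name sample_segments, the choice of three 1 kb windows, and the random toy sequence are illustrative, not what was actually submitted.

```r
# Cut n evenly spaced segments of seg_len bases from a sequence string.
sample_segments <- function(sequence, n = 3, seg_len = 1000) {
  total <- nchar(sequence)
  if (total <= seg_len) return(sequence)
  starts <- as.integer(round(seq(1, total - seg_len + 1, length.out = n)))
  ends <- starts + as.integer(seg_len) - 1L
  segs <- substring(sequence, starts, ends)
  names(segs) <- paste0("pos_", starts, "_", ends)
  segs
}

# Random toy sequence with the same length as an 11.9 kb contig
set.seed(1)
toy_contig <- paste(sample(c("A", "C", "G", "T"), 11900, replace = TRUE), collapse = "")
segs <- sample_segments(toy_contig, n = 3, seg_len = 1000)
nchar(segs)
```

With a real assembly, the input string could come from the seq vector built earlier, indexed by the position of the contig of interest.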

Strangely, the only hit is from our own KBTEX submission from the MAG. This was with megablast (highly similar sequences), though, not blastn.

Check another long contig with low coverage, NODE_58… same result, only one hit. Try a normal blastn search. All of the hits, although weak, are from various Gammaproteobacteria; thus, this is consistent with it being part of an Arhodomonas genome. The danger now is that the putative contaminant could also be a gammaproteobacterium. I checked another one, NODE_260, and found stronger hits to various Pseudomonas genomes.

1.2 Checking assembly quality/contamination

1.2.1 CheckM

Within a python2 conda environment, run:

checkm lineage_wf -f AR.checkM.results --pplacer_threads 6 -t 6 -x fa AR.ass checkM.AR

library(knitr)
library(kableExtra)

# Skip the header lines; comment.char = "-" drops the dashed separator lines
AR.cm <- read.table("checkM.AR/AR.checkM.results", as.is = T, comment.char = "-", skip=3)
colnames(AR.cm) <- c("Bin_Id","Marker_lineage","lineage_code","genomes","markers","marker_sets",
                   "V0","V1","V2","V3","V4","V5+","Completeness","Contamination","Strain_heterogeneity")
kable(AR.cm) %>%
  kable_styling() %>%
  scroll_box(width = "100%")

Bin_Id	Marker_lineage	lineage_code	genomes	markers	marker_sets	V0	V1	V2	V3	V4	V5+	Completeness	Contamination	Strain_heterogeneity
AR.scaffolds	c__Gammaproteobacteria	(UID4267)	119	544	284	31	479	32	2	0	0	93.29	10.22	73.68

An estimated 10% contamination is troubling. It isn’t clear from this where it is coming from; duplicated single-copy conserved genes, I presume. Manual examination of the SCCP suggests that CheckM is very liberal with detection, i.e. there are many false positives that might inflate the contamination number.
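To see how the multi-copy markers feed into these numbers, the copy-number columns of the CheckM table (V0 to V5+) can be turned into rough per-marker estimates. This is a deliberately naive sketch: CheckM itself groups markers into collocated marker sets and averages within them, so its reported 93.29 and 10.22 will not match a simple count over the 544 markers.

```r
# Marker counts by number of copies found, taken from the CheckM table (V0..V5+)
copies <- c("0" = 31, "1" = 479, "2" = 32, "3" = 2, "4" = 0, "5+" = 0)
n_markers <- 544
stopifnot(sum(copies) == n_markers)

# Naive per-marker estimates (not CheckM's marker-set-weighted calculation)
naive_completeness <- 100 * (n_markers - copies[["0"]]) / n_markers

# Every copy beyond the first counts as one extra: 1 per doubled marker, 2 per tripled marker
extra_copies <- sum(copies[c("2", "3")] * c(1, 2))
naive_contamination <- 100 * extra_copies / n_markers

round(c(completeness = naive_completeness, contamination = naive_contamination), 2)
```

The 34 markers found in more than one copy are the ones worth inspecting by hand when judging whether the contamination estimate is inflated.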

1.2.2 MiGA

The MiGA report was less troubling: 93.7% completeness and 3.6% contamination. This is totally acceptable. The MyTaxaScan also suggests continuity in the genome.

Consistent taxa pattern throughout the assembly for the most part: a mix of Proteobacteria, Nitrococcus, Halomonas, Chromohalobacter, all things to be expected from an Arhodomonas. These are all Gammaproteobacteria, although from two different orders, Oceanospirillales and Chromatiales. All halophilic AFAIK. The consistency of the pattern throughout the assembly makes me more comfortable, however; if there is a contaminant, it is fairly minor and is likely closely related to the Arhodomonas.

1.3 Purity conclusions

The total assembly size of ~4.5 Mb, the MiGA contamination report, and the MyTaxaScan make me comfortable that this is a good-quality draft genome.

1.4 Annotation

in progress