Saline Isolates August, 2019
There are two hypersaline isolates slated for assembly and annotation: AR, which is the Kuwait BTEX strain, and WPC, which is a domestic isolate.
Much of this work was already done before starting this document. Briefly, WPC assembled normally and an annotated genome is ready for upload. However, AR showed what I interpreted as signs of contamination.
1 strain AR
1.1 assembly pipeline overview
- subsample reads down to approximately 100X coverage, or 3.4 million reads, using SeqTK
- Quality trim with Trimmomatic using a 4:30 sliding window
- Assemble with SPAdes using either no coverage cutoff or a coverage cutoff of 15X
- remove all contigs shorter than 500bp
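The subsampling target in the first step follows from the usual coverage relation, reads = coverage × genome size / read length. A minimal sketch in R, assuming a hypothetical helper `reads_for_coverage()`; the genome size and read length used here are illustrative placeholders, not values taken from this sequencing run:

```r
# Hypothetical helper: number of reads needed to reach a target coverage.
# coverage = (n_reads * read_length) / genome_size, solved for n_reads.
reads_for_coverage <- function(genome_size, coverage, read_length) {
  ceiling(coverage * genome_size / read_length)
}

# Illustrative values only (assumed, not from this project):
# a 4.5 Mb genome, a 100X target, and 150 bp reads.
n_reads <- reads_for_coverage(genome_size = 4.5e6, coverage = 100, read_length = 150)
print(n_reads)
```

For paired-end data the same arithmetic applies per read, so the number of read pairs would be half of this.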
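Some basic stats on the assembly with no minimum coverage:
library(Biostrings)  # provides readDNAStringSet()
AR.1 <- readDNAStringSet("assemblies/AR/scaffolds.fasta")
name <- names(AR.1)
seq <- paste(AR.1)
# SPAdes headers look like NODE_<id>_length_<bp>_cov_<coverage>; field 4 is the length
len.split <- function(x){
  strsplit(x,"_")[[1]][4]
}
scaffold.length <- as.numeric(unlist(lapply(name, len.split)))
# field 6 is the k-mer coverage
cov.split <- function(x){
  strsplit(x,"_")[[1]][6]
}
coverage <- as.numeric(unlist(lapply(name, cov.split)))
AR.1.info <- data.frame(name,scaffold.length,coverage,stringsAsFactors = F)
# the column is scaffold.length; AR.1.info$len matches no column, so its sum is 0
paste("total length is",sum(AR.1.info$scaffold.length))
## [1] "total length is 4552116"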
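The header parsing above relies on SPAdes naming scaffolds in the form NODE_<id>_length_<bp>_cov_<coverage>. As a sketch, the same fields can be pulled out with one regular expression, which returns NA for any header that does not match instead of silently picking the wrong field. The function name `parse_spades_headers` and the example headers are made up for illustration:

```r
# Hypothetical helper: extract length and coverage from SPAdes-style headers.
parse_spades_headers <- function(headers) {
  pattern <- "^NODE_([0-9]+)_length_([0-9]+)_cov_([0-9.]+).*$"
  ok <- grepl(pattern, headers)
  len <- rep(NA_real_, length(headers))
  cov <- rep(NA_real_, length(headers))
  # Only convert headers that match, so malformed names stay NA
  len[ok] <- as.numeric(sub(pattern, "\\2", headers[ok]))
  cov[ok] <- as.numeric(sub(pattern, "\\3", headers[ok]))
  data.frame(name = headers, scaffold.length = len, coverage = cov,
             stringsAsFactors = FALSE)
}

# Made-up headers, including one that does not follow the SPAdes pattern
example_headers <- c("NODE_1_length_250000_cov_98.5",
                     "NODE_2_length_11900_cov_12.25",
                     "not_a_spades_header")
info <- parse_spades_headers(example_headers)
print(info)
```

Referring to columns by their full names (`info$scaffold.length`) also avoids the silent NULL that a mistyped `$` lookup returns.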
library(ggplot2)
ggplot(AR.1.info, aes(x=scaffold.length, y=coverage)) + geom_point()
The subsets of lower-coverage contigs become evident in this plot.
AR.1.covsort <- AR.1.info[order(-AR.1.info$coverage),]
library(DT)  # provides datatable()
datatable(AR.1.covsort)
paste(sum(AR.1.covsort$scaffold.length),"total assembly length of contigs >499bp")
## [1] "4552116 total assembly length of contigs >499bp"
Among the low-coverage contigs there is one, NODE_88, which is 11.9 kb long. BLAST a few segments of this against NCBI nr; perhaps the results will indicate whether this is contamination or not.
Strangely, the only hit is from our own KBTEX submission from the MAG. This was from megablast (highly similar sequences), though, not blastn.
Check another long contig with low coverage, NODE_58: same result, only one hit. Try a normal blastn search. All of the hits, although weak, are from various Gammaproteobacteria; thus, this is consistent with it being part of an Arhodomonas genome. The danger now is that the putative contaminant could also be a gammaproteobacterium. I checked another one, NODE_260, and found stronger hits to various Pseudomonas genomes.
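Picking contigs to BLAST can be made less ad hoc by flagging long contigs whose coverage falls well below the typical coverage of the assembly. A minimal sketch, assuming a data frame with the same columns as AR.1.info; the function names, the 0.25 fraction, the 5 kb minimum length, and the toy data are all arbitrary illustrative choices, not values used in this analysis:

```r
# Length-weighted median: the coverage at which half of the assembled bases
# sit in contigs with equal or lower coverage.
weighted_median <- function(x, w) {
  o <- order(x)
  cw <- cumsum(w[o])
  x[o][which(cw >= sum(w) / 2)[1]]
}

# Hypothetical helper: contigs that are long enough to BLAST and whose
# coverage is below a fraction of the length-weighted median coverage.
flag_low_coverage <- function(info, fraction = 0.25, min_length = 5000) {
  typical <- weighted_median(info$coverage, info$scaffold.length)
  keep <- info$coverage < fraction * typical & info$scaffold.length >= min_length
  info[keep, , drop = FALSE]
}

# Toy data standing in for AR.1.info (values invented for the example)
toy <- data.frame(name = c("NODE_1", "NODE_2", "NODE_3", "NODE_4"),
                  scaffold.length = c(900000, 400000, 12000, 800),
                  coverage = c(100, 95, 10, 5),
                  stringsAsFactors = FALSE)
flagged <- flag_low_coverage(toy)
print(flagged$name)
```

Weighting by length keeps a swarm of short, low-coverage contigs from dragging the reference coverage down.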
1.2 Checking assembly quality/contamination
1.2.1 CheckM
Within a python2 conda environment, run:
checkm lineage_wf -f AR.checkM.results --pplacer_threads 6 -t 6 -x fa AR.ass checkM.AR
AR.cm <- read.table("checkM.AR/AR.checkM.results", as.is = T, comment.char = "-", skip=3)
colnames(AR.cm) <- c("Bin_Id","Marker_lineage","lineage_code","genomes","markers","marker_sets",
"V0","V1","V2","V3","V4","V5+","Completeness","Contamination","Strain_heterogeneity")
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box() and the %>% pipe
kable(AR.cm) %>%
  kable_styling() %>%
  scroll_box(width = "100%")
| Bin_Id | Marker_lineage | lineage_code | genomes | markers | marker_sets | V0 | V1 | V2 | V3 | V4 | V5+ | Completeness | Contamination | Strain_heterogeneity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AR.scaffolds | c__Gammaproteobacteria | (UID4267) | 119 | 544 | 284 | 31 | 479 | 32 | 2 | 0 | 0 | 93.29 | 10.22 | 73.68 |
An estimated 10% contamination is troubling. It isn’t clear from this where that is coming from; duplicated single-copy conserved genes, I presume. Manual examination of the SCCP suggests that CheckM is very liberal with detection, i.e. there are many false positives which might inflate the contamination number.
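For a rough sense of where the contamination figure comes from, the marker copy counts in the table can be tallied directly. This is a deliberately simplified, per-marker calculation. CheckM itself averages over collocated marker sets, so its reported values (93.29 and 10.22) differ from what this sketch gives. The variable names are mine:

```r
# Marker copy-number counts from the CheckM row above (columns V0 .. V5+).
copies <- c(`0` = 31, `1` = 479, `2` = 32, `3` = 2, `4` = 0, `5+` = 0)
copy_number <- c(0, 1, 2, 3, 4, 5)  # "5+" treated as 5; its count is 0 here

n_markers <- sum(copies)                              # total marker genes
multi_copy_markers <- sum(copies[copy_number >= 2])   # markers seen more than once
extra_copies <- sum(copies * pmax(copy_number - 1, 0))

# Simplified per-marker estimates (NOT CheckM's marker-set weighting)
naive_completeness  <- 100 * (n_markers - copies[["0"]]) / n_markers
naive_contamination <- 100 * extra_copies / n_markers

print(c(markers = n_markers, multi_copy = multi_copy_markers,
        completeness = naive_completeness, contamination = naive_contamination))
```

The tally puts the contamination estimate on a small set of multi-copy markers, which are the ones worth examining by hand.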
1.2.2 MiGA
The MiGA report was less troubling: 93.7% completeness and 3.6% contamination. This is totally acceptable. The MyTaxaScan also suggests continuity in the genome.
Consistent taxa pattern throughout the assembly for the most part: a mix of Proteobacteria, Nitrococcus, Halomonas and Chromohalobacter, all things to be expected from an Arhodomonas. These are all Gammaproteobacteria, although from two different orders, Oceanospirillales and Chromatiales. All halophilic AFAIK. The consistency of the pattern throughout the assembly makes me more comfortable, however; if there is a contaminant, it is fairly minor and is likely closely related to the Arhodomonas.
1.3 Purity conclusions
The total assembly size of ~4.5 Mb, the MiGA contamination report and the MyTaxaScan make me comfortable that this is a good-quality draft genome.
1.4 Annotation
in progress