Two SRA datasets from Golfo Dulce were downloaded and assembled with metaSPAdes on KBase. The assemblies were then downloaded and genes were called with Prodigal:

prodigal -i 533.fasta -o 533.gbk -a 533.faa -p meta

The predicted proteins were then searched with blastp, using the DHB DCM cassette proteins as queries.

makeblastdb -in 533.faa -dbtype prot -out blastdb/533_db

mkdir 533_DCM_search

blastp -db blastdb/533_db -query DHB.faa \
-out 533_DCM_search/blast.out \
-outfmt 6 \
-evalue 0.001
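The tabular output written by `-outfmt 6` has no header row, which is why the R code further down refers to columns as V1, V2, and so on. As a minimal sketch (the helper name `add_blast_header` is hypothetical and not part of BLAST+), the 12 default column names can be prepended so the file is easier to read in a spreadsheet:

```shell
# add_blast_header: hypothetical helper, not part of BLAST+.
# Prints the 12 default -outfmt 6 column names, then the file contents.
# Assumes the search was run with plain "-outfmt 6" (no custom column list).
add_blast_header() {
    printf 'qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore\n'
    cat "$1"
}
```

For example, `add_blast_header 533_DCM_search/blast.out > 533_DCM_search/blast.header.tsv` (the output filename is only an illustration). Under this layout, V3 in the R code corresponds to `pident` (percent identity) and V4 to `length` (alignment length).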

See this project directory for all relevant scripts used.

SRR5839047 (Golfo Dulce, 165 meters)

SRR3880533 (Golfo Dulce, 90 meters)

GD047 results

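library(DT)  # provides datatable()
# BLAST -outfmt 6 output has no header row; V3 is percent identity
blastp.047 <- read.csv("047/047_DCM_search/blast.out", sep = "\t", header = FALSE)
blastp.047 <- blastp.047[order(-blastp.047$V3), ]  # sort by percent identity, descending
datatable(blastp.047)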

The maximum identity to DcmE is 33.75%, far below any threshold I would consider. The top hits were generally to DcmB (corrinoid), max 45% identity (not bad), and DcmH (DUF4445), which is not relevant at its maximum of 42% identity. Close examination of the BLAST output in Excel confirmed no interesting synteny or identity.
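The per-protein maxima quoted above can also be pulled straight from the tabular output without opening Excel. The following is a rough sketch, assuming the default `-outfmt 6` layout (column 1 is the query ID, column 3 is percent identity); `max_identity_per_query` is a hypothetical helper name:

```shell
# max_identity_per_query: hypothetical helper.
# Input:  a BLAST -outfmt 6 file (col 1 = query ID, col 3 = percent identity).
# Output: one line per query with its highest percent identity,
#         sorted from highest to lowest identity.
max_identity_per_query() {
    awk -F '\t' '
        # keep the largest percent identity seen so far for each query
        !($1 in best) || $3 + 0 > best[$1] + 0 { best[$1] = $3 }
        END { for (q in best) printf "%s\t%s\n", q, best[q] }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

Running `max_identity_per_query 047/047_DCM_search/blast.out` would list each cassette protein next to its best identity in that sample. This only summarizes identity; it says nothing about alignment length or synteny, which still need to be checked by hand.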

GD533 results

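# V3 is percent identity; sort on it as for GD047 (the original sorted on V4, alignment length)
blastp.533 <- read.csv("533/533_DCM_search/blast.out", sep = "\t", header = FALSE)
blastp.533 <- blastp.533[order(-blastp.533$V3), ]
datatable(blastp.533)  # display the 533 table, not the 047 one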

The highest identity to our core protein (2757325853) is ~25%, far below anything I would regard as a likely homolog.

assemblies and protein files