Two SRA datasets from Golfo Dulce were downloaded and assembled with metaSPAdes on KBase. The assemblies were then downloaded and genes were called using Prodigal:
prodigal -i 533.fasta -o 533.gbk -a 533.faa -p meta
The predicted proteins were formatted as a BLAST database and searched with blastp, using the DHB DCM cassette proteins as queries.
makeblastdb -in 533.faa -dbtype prot -out blastdb/533_db
mkdir 533_DCM_search
blastp -db blastdb/533_db -query DHB.faa \
-out 533_DCM_search/blast.out \
-outfmt 6 \
-evalue 0.001
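The tabular output (-outfmt 6) can be screened on the command line before loading it into R. The following is a minimal sketch, not part of the original workflow: the function name filter_hits and the cutoff passed to it are hypothetical, and it assumes the default outfmt 6 column order, where column 3 is percent identity.

```shell
# Hypothetical helper: keep BLAST outfmt 6 hits at or above an identity cutoff,
# sorted from highest to lowest percent identity.
filter_hits() {
  # $1 = path to a tab-separated BLAST outfmt 6 file
  # $2 = minimum percent identity (illustrative cutoff, chosen by the user)
  awk -F'\t' -v min_id="$2" '$3 + 0 >= min_id + 0' "$1" \
    | sort -t "$(printf '\t')" -k3,3nr
}
```

For example, filter_hits 533_DCM_search/blast.out 40 would list only hits with at least 40% identity; the value 40 is an arbitrary illustration, not a threshold used in this analysis.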
See this project directory for all relevant scripts used.
SRR5839047 (Golfo Dulce, 165 meters)
SRR3880533 (Golfo Dulce, 90 meters)
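# Load the tabular blastp output for SRR5839047 (columns V1..V12 follow outfmt 6)
blastp.047 <- read.csv("047/047_DCM_search/blast.out", sep = "\t", header = FALSE)
# Sort by percent identity (V3), highest first
blastp.047 <- blastp.047[order(-blastp.047$V3),]
datatable(blastp.047)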
The maximum hit to DcmE is 33.75% identity, far beneath any threshold I would consider. The top hits were generally to DcmB (corrinoid protein), at a maximum of 45% identity (not bad), and to DcmH (DUF4445), which is not relevant at a maximum of 42% identity. Close inspection of the BLAST output in Excel confirmed no interesting synteny or identity.
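The per-protein maxima quoted above can also be pulled directly from the BLAST output. This is a rough sketch under the same outfmt 6 assumptions (column 1 is the query ID, column 3 is percent identity); the function name best_identity_per_query is hypothetical and was not part of the original analysis.

```shell
# Hypothetical helper: report the highest percent identity observed
# for each query protein in a BLAST outfmt 6 file.
best_identity_per_query() {
  # $1 = path to a tab-separated BLAST outfmt 6 file
  awk -F'\t' '
    # Record the identity if this query is new or this hit beats its current best
    !($1 in best) || $3 + 0 > best[$1] { best[$1] = $3 + 0 }
    END { for (q in best) printf "%s\t%.2f\n", q, best[q] }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}
```

Running best_identity_per_query 047/047_DCM_search/blast.out would print one line per query protein, ordered by its best identity.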
blastp.533 <- read.csv("533/533_DCM_search/blast.out", sep = "\t", header = FALSE)
# Sort by percent identity (V3), matching the 047 analysis
blastp.533 <- blastp.533[order(-blastp.533$V3),]
datatable(blastp.533)
The highest hit to our core protein (2757325853) is ~25% identity, far below anything I would regard as a likely homolog.