Searching for HgcA homologs in existing microbial genomes

JGI IMG database indexes and annotates all publically available microbial genomes. They currently have over 80,000 genomes.

HgcA is annotated in the pfam system as pfam03599.

A function search of the IMG database targetting all genomes, all completion status and type, for pfam03599 is conducted. This includes metagenome-annotated genomes and single-cell genomes.

Results are encouraging, 3574 genes in 1756 genomes.

Characterizing the genomes that contain pfam03599 by phylum

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pfam03599.genomes <- read.csv("pfam03599.genomes.csv", sep = "\t", header = TRUE)
pfam03599.phylum <- dplyr::group_by(pfam03599.genomes, Phylum)
pfam03599.phylum.summary <- dplyr::summarize(pfam03599.phylum, phylum.count = n())
pfam03599.phylum.summary <- pfam03599.phylum.summary[order(-pfam03599.phylum.summary$phylum.count),]
kable(pfam03599.phylum.summary)
Phylum phylum.count
Firmicutes 612
Euryarchaeota 275
Proteobacteria 275
Chloroflexi 137
Nitrospirae 60
unclassified 49
Spirochaetes 43
Candidatus Omnitrophica 42
Planctomycetes 38
Elusimicrobia 23
Candidatus Bathyarchaeota 21
Bacteroidetes 20
Actinobacteria 19
Aminicenantes 13
Candidatus Desantisbacteria 12
Armatimonadetes 11
Thermodesulfobacteria 11
Crenarchaeota 10
Lentisphaerae 9
Candidatus Bipolaricaulota 8
Candidatus Aminicenantes 7
Candidatus Raymondbacteria 7
Acidobacteria 5
Atribacteria 5
Candidatus Melainabacteria 5
Nitrospinae 4
Candidatus Firestonebacteria 3
Cyanobacteria 3
Thaumarchaeota 3
Aerophobetes 2
Candidatus Thorarchaeota 2
Cloacimonetes 2
Acetothermia 1
candidate division KD3-62 1
candidate division WOR-3 1
Candidatus Aerophobetes 1
Candidatus Atribacteria 1
Candidatus Calescamantes 1
Candidatus Fraserbacteria 1
Candidatus Heimdallarchaeota 1
Candidatus Lokiarchaeota 1
Candidatus Marinimicrobia 1
Candidatus Pacebacteria 1
Candidatus Poribacteria 1
Candidatus Schekmanbacteria 1
Candidatus Wallbacteria 1
Candidatus Woesearchaeota 1
Deferribacteres 1
Ignavibacteriae 1
Synergistetes 1
Thermotogae 1
Verrucomicrobia 1

Characterizing by family

library(dplyr)
pfam03599.genomes <- read.csv("pfam03599.genomes.csv", sep = "\t", header = TRUE)
pfam03599.family <- dplyr::group_by(pfam03599.genomes, Family)
pfam03599.family.summary <- dplyr::summarize(pfam03599.family, family.count = n())
pfam03599.family.summary <- pfam03599.family.summary[order(-pfam03599.family.summary$family.count),]
kable(pfam03599.family.summary)
Family family.count
unclassified 726
Peptostreptococcaceae 348
Methanosarcinaceae 111
Peptococcaceae 57
Lachnospiraceae 50
Clostridiaceae 41
Thermoanaerobacteraceae 37
Desulfobacteraceae 34
Methanobacteriaceae 29
Geobacteraceae 28
Archaeoglobaceae 22
Dehalococcoidaceae 22
Desulfovibrionaceae 22
Eubacteriaceae 14
Syntrophaceae 14
Desulfobulbaceae 12
Nitrospiraceae 11
Thermodesulfobacteriaceae 11
Methanococcaceae 10
Spirochaetaceae 10
Methanocaldococcaceae 9
Candidatus Brocadiaceae 8
Desulfarculaceae 8
Ruminococcaceae 8
Sporomusaceae 8
Syntrophobacteraceae 8
Hungateiclostridiaceae 7
Methanomicrobiaceae 7
Methanosaetaceae 6
Desulfonatronaceae 5
Desulfuromonadaceae 5
Halobacteroidaceae 5
Desulfohalobiaceae 4
Desulfomicrobiaceae 4
Methanomassiliicoccaceae 4
Methanoregulaceae 4
Thermoanaerobacterales Family III. Incertae Sedis 4
Gastranaerophilaceae 3
Methanopyraceae 3
Pseudonocardiaceae 3
Thermococcaceae 3
Bacteroidaceae 2
Clostridiales Family XVI. Incertae Sedis 2
Marinilabiliaceae 2
Methanocellaceae 2
Methanocorpusculaceae 2
Nitrospinaceae 2
Rhodobacteraceae 2
Peptococcaceae 1
Anaerolineaceae 1
Bacillaceae 1
Bradyrhizobiaceae 1
Candidatus Methanoperedenaceae 1
Desulfurococcaceae 1
Halanaerobiaceae 1
Holophagaceae 1
Kosmotogaceae 1
Methanospirillaceae 1
Methanothermaceae 1
Methermicoccaceae 1
Planctomycetaceae 1
Streptomycetaceae 1
Sulfolobaceae 1
Syntrophomonadaceae 1
Syntrophorhabdaceae 1

Attempting to expand the search window

Exploration of full annotations of these 3574 genes. Does any other annotation system captures the majority of these genes?

pfam03599 annotations in other systems

A snippet of the full gene annotation table

library(dplyr)
pfam03599.genes <- read.csv("pfam03599.gene.annotations.csv", sep = "\t", header = TRUE)
pfam03599.annot <- pfam03599.genes[-c(1:6)]
kable(head(pfam03599.annot))
COG Pfam Tigrfam KO X
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit pfam03598===CdhC<<>>pfam03599===CdhD TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit pfam03599===CdhD<<>>pfam03598===CdhC TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit pfam03598===CdhC<<>>pfam03599===CdhD TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit pfam03599===CdhD<<>>pfam03598===CdhC TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit pfam03598===CdhC<<>>pfam03599===CdhD TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit pfam03598===CdhC<<>>pfam03599===CdhD TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit NA

Summarize by COG

library(dplyr)
pfam03599.COG <- dplyr::group_by(pfam03599.annot, COG)
pfam03599.COG.summary <- dplyr::summarise(pfam03599.COG, COG.count = n()/nrow(pfam03599.COG))
pfam03599.COG.summary$COG[pfam03599.COG.summary$COG == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.COG.summary$COG == "", value =
## structure(c(NA, : invalid factor level, NA generated
kable(pfam03599.COG.summary)
COG COG.count
NA 0.2115277
COG0294===Dihydropteroate synthase 0.0025182
COG0685===5,10-methylenetetrahydrofolate reductase<<>>COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein) 0.0005596
COG1064===D-arabinose 1-dehydrogenase, Zn-dependent alcohol dehydrogenase family 0.0002798
COG1104===Cysteine sulfinate desulfinase/cysteine desulfurase or related enzyme 0.0002798
COG1151===Hydroxylamine reductase (hybrid-cluster protein) 0.0005596
COG1151===Hydroxylamine reductase (hybrid-cluster protein)<<>>COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein) 0.0002798
COG1456===CO dehydrogenase/acetyl-CoA synthase gamma subunit (corrinoid Fe-S protein) 0.4029099
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit 0.0047566
COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein) 0.3757694
COG3640===CO dehydrogenase nickel-insertion accessory protein CooC1 0.0002798
COG3894===Uncharacterized 2Fe-2 and 4Fe-4S clusters-containing protein, contains DUF4445 domain 0.0002798

Summarize by TigrFAM

library(dplyr)
pfam03599.Tigr <- dplyr::group_by(pfam03599.annot, Tigrfam)
pfam03599.Tigr.summary <- dplyr::summarise(pfam03599.Tigr, Tigr.count = n()/nrow(pfam03599.annot))
pfam03599.Tigr.summary$Tigrfam[pfam03599.COG.summary$Tigrfam == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.COG.summary$Tigrfam == "", value
## = structure(1:14, .Label = c("", : invalid factor level, NA generated
## Warning: Unknown or uninitialised column: 'Tigrfam'.
kable(pfam03599.Tigr.summary)
Tigrfam Tigr.count
0.7526581
TIGR00284===dihydropteroate synthase-related protein 0.0025182
TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit 0.0005596
TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit 0.0008394
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit 0.2364298
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit 0.0033576
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR01445===intein N-terminal splicing region<<>>TIGR01443===intein C-terminal splicing region 0.0002798
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR02794===TolA protein 0.0002798
TIGR01445===intein N-terminal splicing region<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR01443===intein C-terminal splicing region 0.0002798
TIGR01702===carbon-monoxide dehydrogenase, catalytic subunit 0.0002798
TIGR01702===carbon-monoxide dehydrogenase, catalytic subunit<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit 0.0005596
TIGR02794===TolA protein<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit 0.0002798
TIGR03402===cysteine desulfurase NifS 0.0002798
TIGR04256===GxxExxY protein 0.0013990

Summarize by KEGG

library(dplyr)
pfam03599.KO <- dplyr::group_by(pfam03599.annot, KO)
pfam03599.KO.summary <- dplyr::summarise(pfam03599.KO, KO.count = n()/nrow(pfam03599.KO))
pfam03599.KO.summary$KO[pfam03599.KO.summary$KO == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.KO.summary$KO == "", value =
## structure(c(NA, : invalid factor level, NA generated
kable(pfam03599.KO.summary)
KO KO.count
NA 0.1684387
KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245] 0.3875210
KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245]<<>>KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4] 0.0005596
KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245] 0.4418019
KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4] 0.0008394
KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4]<<>>KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245] 0.0002798
KO:K04487===iscS, NFS1===cysteine desulfurase [EC:2.8.1.7] 0.0002798
KO:K13953===adhP===alcohol dehydrogenase, propanol-preferring [EC:1.1.1.1] 0.0002798

Summarize by pfam

library(dplyr)
pfam03599.Pfam <- dplyr::group_by(pfam03599.annot, Pfam)
pfam03599.Pfam.summary <- dplyr::summarise(pfam03599.Pfam, Pfam.count = n()/nrow(pfam03599.Pfam))
pfam03599.Pfam.summary$Pfam[pfam03599.Pfam.summary$Pfam == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.Pfam.summary$Pfam == "", value
## = structure(1:36, .Label = c("pfam00037===Fer4<<>>pfam03599===CdhD", :
## invalid factor level, NA generated
pfam03599.Pfam.summary <- pfam03599.Pfam.summary[order(-pfam03599.Pfam.summary$Pfam.count),]
kable(pfam03599.Pfam.summary)
Pfam Pfam.count
pfam03599===CdhD 0.5554001
pfam03599===CdhD<<>>pfam04060===FeS 0.2269166
pfam04060===FeS<<>>pfam03599===CdhD 0.1829882
pfam03599===CdhD<<>>pfam00037===Fer4 0.0058758
pfam03599===CdhD<<>>pfam02219===MTHFR 0.0036374
pfam03598===CdhC<<>>pfam03599===CdhD 0.0030778
pfam02219===MTHFR<<>>pfam03599===CdhD 0.0027980
pfam00037===Fer4<<>>pfam03599===CdhD 0.0025182
pfam03599===CdhD<<>>pfam03598===CdhC 0.0016788
pfam03599===CdhD<<>>pfam14251===DUF4346 0.0016788
pfam03599===CdhD<<>>pfam12838===Fer4_7 0.0013990
pfam13187===Fer4_9<<>>pfam03599===CdhD 0.0013990
pfam03599===CdhD<<>>pfam13187===Fer4_9 0.0011192
pfam03599===CdhD<<>>pfam13649===Methyltransf_25 0.0011192
pfam03599===CdhD<<>>pfam03063===Prismane 0.0008394
pfam13366===PDDEXK_3<<>>pfam03599===CdhD 0.0008394
pfam03599===CdhD<<>>pfam04060===FeS<<>>pfam00809===Pterin_bind 0.0005596
pfam03599===CdhD<<>>pfam13366===PDDEXK_3 0.0005596
pfam12838===Fer4_7<<>>pfam03599===CdhD 0.0005596
pfam14251===DUF4346<<>>pfam03599===CdhD 0.0005596
pfam00809===Pterin_bind<<>>pfam03599===CdhD<<>>pfam04060===FeS 0.0002798
pfam03599===CdhD<<>>pfam00266===Aminotran_5 0.0002798
pfam03599===CdhD<<>>pfam01656===CbiA 0.0002798
pfam03599===CdhD<<>>pfam13187===Fer4_9<<>>pfam08241===Methyltransf_11 0.0002798
pfam03599===CdhD<<>>pfam13237===Fer4_10 0.0002798
pfam04060===FeS<<>>pfam03599===CdhD<<>>pfam00809===Pterin_bind 0.0002798
pfam04060===FeS<<>>pfam03599===CdhD<<>>pfam12683===DUF3798 0.0002798
pfam04536===TPM_phosphatase<<>>pfam03599===CdhD 0.0002798
pfam08240===ADH_N<<>>pfam00107===ADH_zinc_N<<>>pfam03599===CdhD 0.0002798
pfam08241===Methyltransf_11<<>>pfam03599===CdhD<<>>pfam13187===Fer4_9 0.0002798
pfam13181===TPR_8<<>>pfam03599===CdhD<<>>pfam02219===MTHFR<<>>pfam07719===TPR_2 0.0002798
pfam13237===Fer4_10<<>>pfam03599===CdhD 0.0002798
pfam13649===Methyltransf_25<<>>pfam03599===CdhD 0.0002798
pfam14574===DUF4445<<>>pfam04060===FeS<<>>pfam00111===Fer2<<>>pfam03599===CdhD 0.0002798
pfam14890===Intein_splicing<<>>pfam03598===CdhC<<>>pfam14528===LAGLIDADG_3<<>>pfam03599===CdhD 0.0002798
pfam14890===Intein_splicing<<>>pfam14528===LAGLIDADG_3<<>>pfam03599===CdhD<<>>pfam03598===CdhC 0.0002798

The other major systems are inconsistent.

Searching for HgcB

Based on “The Genetic Basis for Bacterial Mercury Methylation” (http://science.sciencemag.org/content/339/6125/1332), a handful of Desulfovibrio and Geobacter strains have proven Hgc activity and their HgcB genes have been identified;

The goal is to identify potential HgcB analogs among the 137 genomes with HgcA. We need a handle to confidently identify these however. Check the functional annotations of these five genes first.

Gather proven HgcB genes and detail their functional annotations

HgcB.proven.genomes <- read.csv("HgcB.proven.genomes.csv", sep = "\t", header = TRUE)
kable(HgcB.proven.genomes$Genome.Name...Sample.Name)
x
Desulfovibrio africanus Walvis Bay
Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631
Geobacter sulfurreducens PCA
Geobacter metallireducens GS-15
Desulfovibrio desulfuricans ND132

I could find 5 of the 6 proven HgcB genomes. D. propionicus has no genome or it’s name has been changed.

Identify HgcB genes and explore annotations

The HgcB genes identified in the aforementioned studies are pointed out above. They are analyzed for functional annotations:

HgcB.proven.annotations <- read.csv("Hgc.proven.annot.csv", sep = "\t", header = TRUE)
kable(HgcB.proven.annotations[-c(1,2,4,6)])
Gene.Product.Name Genome.Name COG Pfam Tigrfam KO X
4Fe-4S dicluster domain-containing protein Desulfovibrio desulfuricans ND132 NA pfam00037===Fer4 NA NA NA
4Fe-4S dicluster domain-containing protein Desulfovibrio africanus Walvis Bay NA pfam13187===Fer4_9 NA NA NA
4Fe-4S ferredoxin iron-sulfur-binding domain-containing protein Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631 NA pfam13187===Fer4_9 NA NA NA
Ferredoxin family protein Geobacter metallireducens GS-15 NA pfam13187===Fer4_9 NA NA NA
ferredoxin family protein Geobacter sulfurreducens PCA NA pfam13187===Fer4_9 NA NA NA

As can be seen, only pfam annotates these HgcB proteins, and does so inconsistently. However, they are all flagged as “ferrodoxin” or “4Fe-4S”. This is an imprecise technique but is something.

pfam13187 is not a terrible starting point

Refining the set of putative HgcA genes/proteins

An initial examination of the details of various pfam03359 genes reveals a huge variety in gene size and neighborhood (not shown). pfam03599 might capture all of HgcA proteins, but it is also potentially capturing many more proteins with other functions. We should move towards narrowing it down. Two ways to handle this, generally speaking

Check functional annotations of proven HgcA proteins

HgcA.proven.annot <- read.csv("HgcA.proven.annot.csv", header = TRUE, sep = "\t")
kable(HgcA.proven.annot[-c(1,2,3,4,6)])
Genome.Name COG Pfam Tigrfam KO X
Geobacter metallireducens GS-15 NA pfam03599===CdhD NA KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245] NA
Geobacter sulfurreducens PCA NA pfam03599===CdhD NA KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245] NA
Desulfovibrio desulfuricans ND132 NA pfam03599===CdhD NA NA
Desulfovibrio africanus Walvis Bay NA pfam03599===CdhD NA NA
Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631 NA pfam03599===CdhD NA NA

As is evident, COG does not pick this up at all, while KEGG is inconsistent.

Two potentially useful rules emerge:

  • No COG annotation
  • only ONE pfam annotation

Apply these rules to the pfam03599 set:

pfam03599.rules <- pfam03599.genes
pfam03599.rules <- pfam03599.rules[pfam03599.rules$COG == "",] # include only those with no COG annotation
pfam03599.rules <- pfam03599.rules[pfam03599.rules$Pfam == "pfam03599===CdhD",] #include only those with the single pfam

This restricts us to a more narrow set of 638 genes.

Note that there are some other interesting genes in there which contain both pfam03559 and a ferredoxin, Worth examining later, but this set is a good place to start.

We export the list of IMG gene IDs corresponding to this more restricted set and import to IMG

pfam03599.rules.ID <- data.frame(pfam03599.rules$Gene.ID)
write.csv(pfam03599.rules.ID,"pfam03599.rules.ID.csv")

Phylogeny of restricted gene set

library(dplyr)
pfam03599.rules.genomes <- read.csv("pfam03599.rules.genomes.csv", sep = "\t", header = TRUE)
pfam03599.rules.phylum <- dplyr::group_by(pfam03599.rules.genomes, Phylum)
pfam03599.rules.phylum.summary <- dplyr::summarize(pfam03599.rules.phylum, phylum.count = n())
pfam03599.rules.phylum.summary <- pfam03599.rules.phylum.summary[order(-pfam03599.rules.phylum.summary$phylum.count),]
kable(pfam03599.rules.phylum.summary)
Phylum phylum.count
Proteobacteria 210
Firmicutes 62
Euryarchaeota 54
Chloroflexi 35
Nitrospirae 34
Spirochaetes 34
unclassified 33
Bacteroidetes 16
Elusimicrobia 15
Candidatus Omnitrophica 8
Planctomycetes 7
Candidatus Raymondbacteria 6
Lentisphaerae 5
Nitrospinae 4
Actinobacteria 3
Armatimonadetes 3
Candidatus Aminicenantes 3
Candidatus Firestonebacteria 3
Cyanobacteria 3
Acidobacteria 2
Atribacteria 2
Candidatus Bipolaricaulota 2
Candidatus Melainabacteria 2
Thaumarchaeota 2
Aminicenantes 1
candidate division WOR-3 1
Candidatus Aerophobetes 1
Candidatus Atribacteria 1
Candidatus Bathyarchaeota 1
Candidatus Lokiarchaeota 1
Candidatus Schekmanbacteria 1
Candidatus Thorarchaeota 1
Crenarchaeota 1
Ignavibacteriae 1
Synergistetes 1

Compared to the original complete pfam03599 gene set, restricting based on COG and pfam profile has greatly enriched for Proteobacteria.

Searching for HgcB in the pfam03599 subset

These genomes have a huge number of pfam13187, the ferredoxin domain which is consistent with HgcB

This means that we need a different way to identify potential HgcAB pairs

Manual selection

Using the gene neighborhood diagrams presented by IMG, it is not unreasonable to sort through ~600 sets to flag syntenys that look reasonable under the basic criteria of having a small ferredoxin within 5kb downstream of the pfam03599. However this is imprecise and very clunky (not easy to do in img)

Conclusion

Working from annotations is sloppy and leads to many arbitrary choices. I could sift through these to find pairs of 03599 + ferredoxins and might have to, but this is where I will leave this for now

Homology-based searches

First make a set of 5 proven HgcA genes based on the 2013 Science paper:

HgcA.proven <- read.csv("HgcA.proven.csv", header = TRUE, sep = "\t")
kable(HgcA.proven[-c(6,7)])
Gene.ID Locus.Tag Gene.Product.Name Genome.ID Genome.Name
637778520 Gmet_1240 acetyl-CoA decarbonylase/synthase complex subunit gamma 637000119 Geobacter metallireducens GS-15
637126114 GSU1440 acetyl-CoA decarbonylase/synthase complex subunit gamma 637000120 Geobacter sulfurreducens PCA
2503785873 DND132_1056 CO dehydrogenase/acetyl-CoA synthase delta subunit 2503754015 Desulfovibrio desulfuricans ND132
2503792115 DesafDRAFT_0437 CO dehydrogenase/acetyl-CoA synthase delta subunit 2503754016 Desulfovibrio africanus Walvis Bay
649853478 Daes_2662 CO dehydrogenase/acetyl-CoA synthase subunit gamma (corrinoid Fe-S protein)-like protein 649633037 Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631

Perform all-IMG genome homolog searches and create a union set

This yields a set of 193 genes from 191 genomes (including the 5 used to search). Phylum-level summary:

library(dplyr)
HgcA.homolog.genomes <- read.csv("HgcA_homologs_genomes.csv", header = T, sep = "\t")
HgcA.homolog.phylum <- dplyr::group_by(HgcA.homolog.genomes, Phylum)
HgcA.homolog.phylum <- dplyr::summarise(HgcA.homolog.phylum, Phylum.count = n())
HgcA.homolog.phylum <- HgcA.homolog.phylum[order(-HgcA.homolog.phylum$Phylum.count),]
kable(HgcA.homolog.phylum)
Phylum Phylum.count
Proteobacteria 126
Firmicutes 35
Spirochaetes 6
unclassified 5
Lentisphaerae 4
Candidatus Aminicenantes 3
Nitrospirae 3
Bacteroidetes 2
Candidatus Raymondbacteria 2
Actinobacteria 1
Chloroflexi 1
Elusimicrobia 1
Planctomycetes 1
Synergistetes 1

The question now is: Which of these genomes which contain an HgcA homolog also contains an HgcB homolog?

A quick visual inspection suggests that most of these have a small ferredoxin shortly downstream. Taking a more formal approach:

Collect homologs for each of the five proven HgcB proteins yields a set of 179 genes from 178 genomes. Taxonomy summary;

library(dplyr)
HgcB.homolog.genomes <- read.csv("HgcB_homologs_genomes.csv", header = T, sep = "\t")
HgcB.homolog.phylum <- dplyr::group_by(HgcB.homolog.genomes, Phylum)
HgcB.homolog.phylum <- dplyr::summarise(HgcB.homolog.phylum, Phylum.count = n())
HgcB.homolog.phylum <- HgcB.homolog.phylum[order(-HgcB.homolog.phylum$Phylum.count),]
kable(HgcB.homolog.phylum)
Phylum Phylum.count
Proteobacteria 102
Firmicutes 27
Nitrospirae 10
Spirochaetes 9
unclassified 8
Elusimicrobia 7
Lentisphaerae 4
Bacteroidetes 2
Candidatus Aminicenantes 2
Planctomycetes 2
Actinobacteria 1
Candidatus Schekmanbacteria 1
Chloroflexi 1
Euryarchaeota 1
Synergistetes 1

Create an “inner-join” union set, genomes in which a homolog for each appears:

library(dplyr)
HgcAB.genome.union <- dplyr::inner_join(HgcA.homolog.genomes, HgcB.homolog.genomes, by = "Genome.Name...Sample.Name")
## Warning: Column `Genome.Name...Sample.Name` joining factors with different
## levels, coercing to character vector
head(HgcAB.genome.union$Genome.Name...Sample.Name)
## [1] "Lentisphaerae bacterium RIFOXYA12_FULL_48_11"                                
## [2] "Desulfobulbus propionicus 1pr3, DSM 2032"                                    
## [3] "Lentisphaerae bacterium GWF2_57_35"                                          
## [4] "Geopsychrobacter electrodiphilus DSM 16401"                                  
## [5] "Desulfobulbus sp. Tol-SR"                                                    
## [6] "Composite genome from Trout Bog Hypolimnion pan-assembly TBhypo.metabat.3815"

This generates a set of 149 genomes that are shared between the sets. This might be lumping duplications together…

Create an “inner-join” based on scaffold, only include scaffolds found in each gene set:

library(dplyr)
HgcA.homolog <- read.csv("HgcA_homologs.csv", header = T, sep = "\t")
colnames(HgcA.homolog) <- paste("HgcA",colnames(HgcA.homolog),sep=".")
colnames(HgcA.homolog)[9] <- "Scaffold.ID"
HgcB.homolog <- read.csv("HgcB_homologs.csv", header = T, sep = "\t")
colnames(HgcB.homolog) <- paste("HgcB",colnames(HgcB.homolog),sep=".")
colnames(HgcB.homolog)[10] <- "Scaffold.ID"
HgcAB.scaffold.full <- dplyr::full_join(HgcA.homolog, HgcB.homolog, by = "Scaffold.ID")
HgcAB.scaffold.union <- dplyr::inner_join(HgcA.homolog, HgcB.homolog, by = "Scaffold.ID")

This identifies 135 scaffolds that are found in both sets vs 237 which are found in either. In other words, of the 179 HgcB and 193 HgcA homologs, 135 are found on the same scaffold.

This does not establish co-localization however. I can compare “HcgA.End.Coord” and “HcgB.Start.Coord”:

HgcAB.same.scaffold <- HgcAB.scaffold.union[c(5,4,9,1,7,8,11,18,19)]
colnames(HgcAB.same.scaffold)[c(1,2)] <- c("Genome.Name","Genome.ID")
HgcAB.same.scaffold$A.B.dist <- HgcAB.same.scaffold$HgcA.End.Coord - HgcAB.same.scaffold$HgcB.Start.Coord

Plotting a histogram of gene distances:

hist(HgcAB.same.scaffold$A.B.dist, breaks = 1000)

This demonstrates fairly definitively that, with exception of two cases, the genes are in very close proximity to one another.