JGI IMG database indexes and annotates all publically available microbial genomes. They currently have over 80,000 genomes.
HgcA is annotated in the pfam system as pfam03599.
A function search of the IMG database targetting all genomes, all completion status and type, for pfam03599 is conducted. This includes metagenome-annotated genomes and single-cell genomes.
Results are encouraging, 3574 genes in 1756 genomes.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pfam03599.genomes <- read.csv("pfam03599.genomes.csv", sep = "\t", header = TRUE)
pfam03599.phylum <- dplyr::group_by(pfam03599.genomes, Phylum)
pfam03599.phylum.summary <- dplyr::summarize(pfam03599.phylum, phylum.count = n())
pfam03599.phylum.summary <- pfam03599.phylum.summary[order(-pfam03599.phylum.summary$phylum.count),]
kable(pfam03599.phylum.summary)
| Phylum | phylum.count |
|---|---|
| Firmicutes | 612 |
| Euryarchaeota | 275 |
| Proteobacteria | 275 |
| Chloroflexi | 137 |
| Nitrospirae | 60 |
| unclassified | 49 |
| Spirochaetes | 43 |
| Candidatus Omnitrophica | 42 |
| Planctomycetes | 38 |
| Elusimicrobia | 23 |
| Candidatus Bathyarchaeota | 21 |
| Bacteroidetes | 20 |
| Actinobacteria | 19 |
| Aminicenantes | 13 |
| Candidatus Desantisbacteria | 12 |
| Armatimonadetes | 11 |
| Thermodesulfobacteria | 11 |
| Crenarchaeota | 10 |
| Lentisphaerae | 9 |
| Candidatus Bipolaricaulota | 8 |
| Candidatus Aminicenantes | 7 |
| Candidatus Raymondbacteria | 7 |
| Acidobacteria | 5 |
| Atribacteria | 5 |
| Candidatus Melainabacteria | 5 |
| Nitrospinae | 4 |
| Candidatus Firestonebacteria | 3 |
| Cyanobacteria | 3 |
| Thaumarchaeota | 3 |
| Aerophobetes | 2 |
| Candidatus Thorarchaeota | 2 |
| Cloacimonetes | 2 |
| Acetothermia | 1 |
| candidate division KD3-62 | 1 |
| candidate division WOR-3 | 1 |
| Candidatus Aerophobetes | 1 |
| Candidatus Atribacteria | 1 |
| Candidatus Calescamantes | 1 |
| Candidatus Fraserbacteria | 1 |
| Candidatus Heimdallarchaeota | 1 |
| Candidatus Lokiarchaeota | 1 |
| Candidatus Marinimicrobia | 1 |
| Candidatus Pacebacteria | 1 |
| Candidatus Poribacteria | 1 |
| Candidatus Schekmanbacteria | 1 |
| Candidatus Wallbacteria | 1 |
| Candidatus Woesearchaeota | 1 |
| Deferribacteres | 1 |
| Ignavibacteriae | 1 |
| Synergistetes | 1 |
| Thermotogae | 1 |
| Verrucomicrobia | 1 |
library(dplyr)
pfam03599.genomes <- read.csv("pfam03599.genomes.csv", sep = "\t", header = TRUE)
pfam03599.family <- dplyr::group_by(pfam03599.genomes, Family)
pfam03599.family.summary <- dplyr::summarize(pfam03599.family, family.count = n())
pfam03599.family.summary <- pfam03599.family.summary[order(-pfam03599.family.summary$family.count),]
kable(pfam03599.family.summary)
| Family | family.count |
|---|---|
| unclassified | 726 |
| Peptostreptococcaceae | 348 |
| Methanosarcinaceae | 111 |
| Peptococcaceae | 57 |
| Lachnospiraceae | 50 |
| Clostridiaceae | 41 |
| Thermoanaerobacteraceae | 37 |
| Desulfobacteraceae | 34 |
| Methanobacteriaceae | 29 |
| Geobacteraceae | 28 |
| Archaeoglobaceae | 22 |
| Dehalococcoidaceae | 22 |
| Desulfovibrionaceae | 22 |
| Eubacteriaceae | 14 |
| Syntrophaceae | 14 |
| Desulfobulbaceae | 12 |
| Nitrospiraceae | 11 |
| Thermodesulfobacteriaceae | 11 |
| Methanococcaceae | 10 |
| Spirochaetaceae | 10 |
| Methanocaldococcaceae | 9 |
| Candidatus Brocadiaceae | 8 |
| Desulfarculaceae | 8 |
| Ruminococcaceae | 8 |
| Sporomusaceae | 8 |
| Syntrophobacteraceae | 8 |
| Hungateiclostridiaceae | 7 |
| Methanomicrobiaceae | 7 |
| Methanosaetaceae | 6 |
| Desulfonatronaceae | 5 |
| Desulfuromonadaceae | 5 |
| Halobacteroidaceae | 5 |
| Desulfohalobiaceae | 4 |
| Desulfomicrobiaceae | 4 |
| Methanomassiliicoccaceae | 4 |
| Methanoregulaceae | 4 |
| Thermoanaerobacterales Family III. Incertae Sedis | 4 |
| Gastranaerophilaceae | 3 |
| Methanopyraceae | 3 |
| Pseudonocardiaceae | 3 |
| Thermococcaceae | 3 |
| Bacteroidaceae | 2 |
| Clostridiales Family XVI. Incertae Sedis | 2 |
| Marinilabiliaceae | 2 |
| Methanocellaceae | 2 |
| Methanocorpusculaceae | 2 |
| Nitrospinaceae | 2 |
| Rhodobacteraceae | 2 |
| Peptococcaceae | 1 |
| Anaerolineaceae | 1 |
| Bacillaceae | 1 |
| Bradyrhizobiaceae | 1 |
| Candidatus Methanoperedenaceae | 1 |
| Desulfurococcaceae | 1 |
| Halanaerobiaceae | 1 |
| Holophagaceae | 1 |
| Kosmotogaceae | 1 |
| Methanospirillaceae | 1 |
| Methanothermaceae | 1 |
| Methermicoccaceae | 1 |
| Planctomycetaceae | 1 |
| Streptomycetaceae | 1 |
| Sulfolobaceae | 1 |
| Syntrophomonadaceae | 1 |
| Syntrophorhabdaceae | 1 |
Exploration of full annotations of these 3574 genes. Does any other annotation system captures the majority of these genes?
A snippet of the full gene annotation table
library(dplyr)
pfam03599.genes <- read.csv("pfam03599.gene.annotations.csv", sep = "\t", header = TRUE)
pfam03599.annot <- pfam03599.genes[-c(1:6)]
kable(head(pfam03599.annot))
| COG | Pfam | Tigrfam | KO | X |
|---|---|---|---|---|
| COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit | pfam03598===CdhC<<>>pfam03599===CdhD | TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | NA | |
| COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit | pfam03599===CdhD<<>>pfam03598===CdhC | TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | NA | |
| COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit | pfam03598===CdhC<<>>pfam03599===CdhD | TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | NA | |
| COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit | pfam03599===CdhD<<>>pfam03598===CdhC | TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | NA | |
| COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit | pfam03598===CdhC<<>>pfam03599===CdhD | TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | NA | |
| COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit | pfam03598===CdhC<<>>pfam03599===CdhD | TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | NA |
library(dplyr)
pfam03599.COG <- dplyr::group_by(pfam03599.annot, COG)
pfam03599.COG.summary <- dplyr::summarise(pfam03599.COG, COG.count = n()/nrow(pfam03599.COG))
pfam03599.COG.summary$COG[pfam03599.COG.summary$COG == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.COG.summary$COG == "", value =
## structure(c(NA, : invalid factor level, NA generated
kable(pfam03599.COG.summary)
| COG | COG.count |
|---|---|
| NA | 0.2115277 |
| COG0294===Dihydropteroate synthase | 0.0025182 |
| COG0685===5,10-methylenetetrahydrofolate reductase<<>>COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein) | 0.0005596 |
| COG1064===D-arabinose 1-dehydrogenase, Zn-dependent alcohol dehydrogenase family | 0.0002798 |
| COG1104===Cysteine sulfinate desulfinase/cysteine desulfurase or related enzyme | 0.0002798 |
| COG1151===Hydroxylamine reductase (hybrid-cluster protein) | 0.0005596 |
| COG1151===Hydroxylamine reductase (hybrid-cluster protein)<<>>COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein) | 0.0002798 |
| COG1456===CO dehydrogenase/acetyl-CoA synthase gamma subunit (corrinoid Fe-S protein) | 0.4029099 |
| COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit | 0.0047566 |
| COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein) | 0.3757694 |
| COG3640===CO dehydrogenase nickel-insertion accessory protein CooC1 | 0.0002798 |
| COG3894===Uncharacterized 2Fe-2 and 4Fe-4S clusters-containing protein, contains DUF4445 domain | 0.0002798 |
library(dplyr)
pfam03599.Tigr <- dplyr::group_by(pfam03599.annot, Tigrfam)
pfam03599.Tigr.summary <- dplyr::summarise(pfam03599.Tigr, Tigr.count = n()/nrow(pfam03599.annot))
pfam03599.Tigr.summary$Tigrfam[pfam03599.COG.summary$Tigrfam == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.COG.summary$Tigrfam == "", value
## = structure(1:14, .Label = c("", : invalid factor level, NA generated
## Warning: Unknown or uninitialised column: 'Tigrfam'.
kable(pfam03599.Tigr.summary)
| Tigrfam | Tigr.count |
|---|---|
| 0.7526581 | |
| TIGR00284===dihydropteroate synthase-related protein | 0.0025182 |
| TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | 0.0005596 |
| TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit | 0.0008394 |
| TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit | 0.2364298 |
| TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit | 0.0033576 |
| TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR01445===intein N-terminal splicing region<<>>TIGR01443===intein C-terminal splicing region | 0.0002798 |
| TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR02794===TolA protein | 0.0002798 |
| TIGR01445===intein N-terminal splicing region<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR01443===intein C-terminal splicing region | 0.0002798 |
| TIGR01702===carbon-monoxide dehydrogenase, catalytic subunit | 0.0002798 |
| TIGR01702===carbon-monoxide dehydrogenase, catalytic subunit<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit | 0.0005596 |
| TIGR02794===TolA protein<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit | 0.0002798 |
| TIGR03402===cysteine desulfurase NifS | 0.0002798 |
| TIGR04256===GxxExxY protein | 0.0013990 |
library(dplyr)
pfam03599.KO <- dplyr::group_by(pfam03599.annot, KO)
pfam03599.KO.summary <- dplyr::summarise(pfam03599.KO, KO.count = n()/nrow(pfam03599.KO))
pfam03599.KO.summary$KO[pfam03599.KO.summary$KO == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.KO.summary$KO == "", value =
## structure(c(NA, : invalid factor level, NA generated
kable(pfam03599.KO.summary)
| KO | KO.count |
|---|---|
| NA | 0.1684387 |
| KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245] | 0.3875210 |
| KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245]<<>>KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4] | 0.0005596 |
| KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245] | 0.4418019 |
| KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4] | 0.0008394 |
| KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4]<<>>KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245] | 0.0002798 |
| KO:K04487===iscS, NFS1===cysteine desulfurase [EC:2.8.1.7] | 0.0002798 |
| KO:K13953===adhP===alcohol dehydrogenase, propanol-preferring [EC:1.1.1.1] | 0.0002798 |
library(dplyr)
pfam03599.Pfam <- dplyr::group_by(pfam03599.annot, Pfam)
pfam03599.Pfam.summary <- dplyr::summarise(pfam03599.Pfam, Pfam.count = n()/nrow(pfam03599.Pfam))
pfam03599.Pfam.summary$Pfam[pfam03599.Pfam.summary$Pfam == ""] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, pfam03599.Pfam.summary$Pfam == "", value
## = structure(1:36, .Label = c("pfam00037===Fer4<<>>pfam03599===CdhD", :
## invalid factor level, NA generated
pfam03599.Pfam.summary <- pfam03599.Pfam.summary[order(-pfam03599.Pfam.summary$Pfam.count),]
kable(pfam03599.Pfam.summary)
| Pfam | Pfam.count |
|---|---|
| pfam03599===CdhD | 0.5554001 |
| pfam03599===CdhD<<>>pfam04060===FeS | 0.2269166 |
| pfam04060===FeS<<>>pfam03599===CdhD | 0.1829882 |
| pfam03599===CdhD<<>>pfam00037===Fer4 | 0.0058758 |
| pfam03599===CdhD<<>>pfam02219===MTHFR | 0.0036374 |
| pfam03598===CdhC<<>>pfam03599===CdhD | 0.0030778 |
| pfam02219===MTHFR<<>>pfam03599===CdhD | 0.0027980 |
| pfam00037===Fer4<<>>pfam03599===CdhD | 0.0025182 |
| pfam03599===CdhD<<>>pfam03598===CdhC | 0.0016788 |
| pfam03599===CdhD<<>>pfam14251===DUF4346 | 0.0016788 |
| pfam03599===CdhD<<>>pfam12838===Fer4_7 | 0.0013990 |
| pfam13187===Fer4_9<<>>pfam03599===CdhD | 0.0013990 |
| pfam03599===CdhD<<>>pfam13187===Fer4_9 | 0.0011192 |
| pfam03599===CdhD<<>>pfam13649===Methyltransf_25 | 0.0011192 |
| pfam03599===CdhD<<>>pfam03063===Prismane | 0.0008394 |
| pfam13366===PDDEXK_3<<>>pfam03599===CdhD | 0.0008394 |
| pfam03599===CdhD<<>>pfam04060===FeS<<>>pfam00809===Pterin_bind | 0.0005596 |
| pfam03599===CdhD<<>>pfam13366===PDDEXK_3 | 0.0005596 |
| pfam12838===Fer4_7<<>>pfam03599===CdhD | 0.0005596 |
| pfam14251===DUF4346<<>>pfam03599===CdhD | 0.0005596 |
| pfam00809===Pterin_bind<<>>pfam03599===CdhD<<>>pfam04060===FeS | 0.0002798 |
| pfam03599===CdhD<<>>pfam00266===Aminotran_5 | 0.0002798 |
| pfam03599===CdhD<<>>pfam01656===CbiA | 0.0002798 |
| pfam03599===CdhD<<>>pfam13187===Fer4_9<<>>pfam08241===Methyltransf_11 | 0.0002798 |
| pfam03599===CdhD<<>>pfam13237===Fer4_10 | 0.0002798 |
| pfam04060===FeS<<>>pfam03599===CdhD<<>>pfam00809===Pterin_bind | 0.0002798 |
| pfam04060===FeS<<>>pfam03599===CdhD<<>>pfam12683===DUF3798 | 0.0002798 |
| pfam04536===TPM_phosphatase<<>>pfam03599===CdhD | 0.0002798 |
| pfam08240===ADH_N<<>>pfam00107===ADH_zinc_N<<>>pfam03599===CdhD | 0.0002798 |
| pfam08241===Methyltransf_11<<>>pfam03599===CdhD<<>>pfam13187===Fer4_9 | 0.0002798 |
| pfam13181===TPR_8<<>>pfam03599===CdhD<<>>pfam02219===MTHFR<<>>pfam07719===TPR_2 | 0.0002798 |
| pfam13237===Fer4_10<<>>pfam03599===CdhD | 0.0002798 |
| pfam13649===Methyltransf_25<<>>pfam03599===CdhD | 0.0002798 |
| pfam14574===DUF4445<<>>pfam04060===FeS<<>>pfam00111===Fer2<<>>pfam03599===CdhD | 0.0002798 |
| pfam14890===Intein_splicing<<>>pfam03598===CdhC<<>>pfam14528===LAGLIDADG_3<<>>pfam03599===CdhD | 0.0002798 |
| pfam14890===Intein_splicing<<>>pfam14528===LAGLIDADG_3<<>>pfam03599===CdhD<<>>pfam03598===CdhC | 0.0002798 |
The other major systems are inconsistent.
Based on “The Genetic Basis for Bacterial Mercury Methylation” (http://science.sciencemag.org/content/339/6125/1332), a handful of Desulfovibrio and Geobacter strains have proven Hgc activity and their HgcB genes have been identified;
The goal is to identify potential HgcB analogs among the 137 genomes with HgcA. We need a handle to confidently identify these however. Check the functional annotations of these five genes first.
HgcB.proven.genomes <- read.csv("HgcB.proven.genomes.csv", sep = "\t", header = TRUE)
kable(HgcB.proven.genomes$Genome.Name...Sample.Name)
| x |
|---|
| Desulfovibrio africanus Walvis Bay |
| Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631 |
| Geobacter sulfurreducens PCA |
| Geobacter metallireducens GS-15 |
| Desulfovibrio desulfuricans ND132 |
I could find 5 of the 6 proven HgcB genomes. D. propionicus has no genome or it’s name has been changed.
The HgcB genes identified in the aforementioned studies are pointed out above. They are analyzed for functional annotations:
HgcB.proven.annotations <- read.csv("Hgc.proven.annot.csv", sep = "\t", header = TRUE)
kable(HgcB.proven.annotations[-c(1,2,4,6)])
| Gene.Product.Name | Genome.Name | COG | Pfam | Tigrfam | KO | X |
|---|---|---|---|---|---|---|
| 4Fe-4S dicluster domain-containing protein | Desulfovibrio desulfuricans ND132 | NA | pfam00037===Fer4 | NA | NA | NA |
| 4Fe-4S dicluster domain-containing protein | Desulfovibrio africanus Walvis Bay | NA | pfam13187===Fer4_9 | NA | NA | NA |
| 4Fe-4S ferredoxin iron-sulfur-binding domain-containing protein | Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631 | NA | pfam13187===Fer4_9 | NA | NA | NA |
| Ferredoxin family protein | Geobacter metallireducens GS-15 | NA | pfam13187===Fer4_9 | NA | NA | NA |
| ferredoxin family protein | Geobacter sulfurreducens PCA | NA | pfam13187===Fer4_9 | NA | NA | NA |
As can be seen, only pfam annotates these HgcB proteins, and does so inconsistently. However, they are all flagged as “ferrodoxin” or “4Fe-4S”. This is an imprecise technique but is something.
An initial examination of the details of various pfam03359 genes reveals a huge variety in gene size and neighborhood (not shown). pfam03599 might capture all of HgcA proteins, but it is also potentially capturing many more proteins with other functions. We should move towards narrowing it down. Two ways to handle this, generally speaking
HgcA.proven.annot <- read.csv("HgcA.proven.annot.csv", header = TRUE, sep = "\t")
kable(HgcA.proven.annot[-c(1,2,3,4,6)])
| Genome.Name | COG | Pfam | Tigrfam | KO | X |
|---|---|---|---|---|---|
| Geobacter metallireducens GS-15 | NA | pfam03599===CdhD | NA | KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245] | NA |
| Geobacter sulfurreducens PCA | NA | pfam03599===CdhD | NA | KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245] | NA |
| Desulfovibrio desulfuricans ND132 | NA | pfam03599===CdhD | NA | NA | |
| Desulfovibrio africanus Walvis Bay | NA | pfam03599===CdhD | NA | NA | |
| Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631 | NA | pfam03599===CdhD | NA | NA |
As is evident, COG does not pick this up at all, while KEGG is inconsistent.
Two potentially useful rules emerge:
Apply these rules to the pfam03599 set:
pfam03599.rules <- pfam03599.genes
pfam03599.rules <- pfam03599.rules[pfam03599.rules$COG == "",] # include only those with no COG annotation
pfam03599.rules <- pfam03599.rules[pfam03599.rules$Pfam == "pfam03599===CdhD",] #include only those with the single pfam
This restricts us to a more narrow set of 638 genes.
We export the list of IMG gene IDs corresponding to this more restricted set and import to IMG
pfam03599.rules.ID <- data.frame(pfam03599.rules$Gene.ID)
write.csv(pfam03599.rules.ID,"pfam03599.rules.ID.csv")
library(dplyr)
pfam03599.rules.genomes <- read.csv("pfam03599.rules.genomes.csv", sep = "\t", header = TRUE)
pfam03599.rules.phylum <- dplyr::group_by(pfam03599.rules.genomes, Phylum)
pfam03599.rules.phylum.summary <- dplyr::summarize(pfam03599.rules.phylum, phylum.count = n())
pfam03599.rules.phylum.summary <- pfam03599.rules.phylum.summary[order(-pfam03599.rules.phylum.summary$phylum.count),]
kable(pfam03599.rules.phylum.summary)
| Phylum | phylum.count |
|---|---|
| Proteobacteria | 210 |
| Firmicutes | 62 |
| Euryarchaeota | 54 |
| Chloroflexi | 35 |
| Nitrospirae | 34 |
| Spirochaetes | 34 |
| unclassified | 33 |
| Bacteroidetes | 16 |
| Elusimicrobia | 15 |
| Candidatus Omnitrophica | 8 |
| Planctomycetes | 7 |
| Candidatus Raymondbacteria | 6 |
| Lentisphaerae | 5 |
| Nitrospinae | 4 |
| Actinobacteria | 3 |
| Armatimonadetes | 3 |
| Candidatus Aminicenantes | 3 |
| Candidatus Firestonebacteria | 3 |
| Cyanobacteria | 3 |
| Acidobacteria | 2 |
| Atribacteria | 2 |
| Candidatus Bipolaricaulota | 2 |
| Candidatus Melainabacteria | 2 |
| Thaumarchaeota | 2 |
| Aminicenantes | 1 |
| candidate division WOR-3 | 1 |
| Candidatus Aerophobetes | 1 |
| Candidatus Atribacteria | 1 |
| Candidatus Bathyarchaeota | 1 |
| Candidatus Lokiarchaeota | 1 |
| Candidatus Schekmanbacteria | 1 |
| Candidatus Thorarchaeota | 1 |
| Crenarchaeota | 1 |
| Ignavibacteriae | 1 |
| Synergistetes | 1 |
Compared to the original complete pfam03599 gene set, restricting based on COG and pfam profile has greatly enriched for Proteobacteria.
These genomes have a huge number of pfam13187, the ferredoxin domain which is consistent with HgcB
This means that we need a different way to identify potential HgcAB pairs
Using the gene neighborhood diagrams presented by IMG, it is not unreasonable to sort through ~600 sets to flag syntenys that look reasonable under the basic criteria of having a small ferredoxin within 5kb downstream of the pfam03599. However this is imprecise and very clunky (not easy to do in img)
Working from annotations is sloppy and leads to many arbitrary choices. I could sift through these to find pairs of 03599 + ferredoxins and might have to, but this is where I will leave this for now
First make a set of 5 proven HgcA genes based on the 2013 Science paper:
HgcA.proven <- read.csv("HgcA.proven.csv", header = TRUE, sep = "\t")
kable(HgcA.proven[-c(6,7)])
| Gene.ID | Locus.Tag | Gene.Product.Name | Genome.ID | Genome.Name |
|---|---|---|---|---|
| 637778520 | Gmet_1240 | acetyl-CoA decarbonylase/synthase complex subunit gamma | 637000119 | Geobacter metallireducens GS-15 |
| 637126114 | GSU1440 | acetyl-CoA decarbonylase/synthase complex subunit gamma | 637000120 | Geobacter sulfurreducens PCA |
| 2503785873 | DND132_1056 | CO dehydrogenase/acetyl-CoA synthase delta subunit | 2503754015 | Desulfovibrio desulfuricans ND132 |
| 2503792115 | DesafDRAFT_0437 | CO dehydrogenase/acetyl-CoA synthase delta subunit | 2503754016 | Desulfovibrio africanus Walvis Bay |
| 649853478 | Daes_2662 | CO dehydrogenase/acetyl-CoA synthase subunit gamma (corrinoid Fe-S protein)-like protein | 649633037 | Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631 |
Perform all-IMG genome homolog searches and create a union set
This yields a set of 193 genes from 191 genomes (including the 5 used to search). Phylum-level summary:
library(dplyr)
HgcA.homolog.genomes <- read.csv("HgcA_homologs_genomes.csv", header = T, sep = "\t")
HgcA.homolog.phylum <- dplyr::group_by(HgcA.homolog.genomes, Phylum)
HgcA.homolog.phylum <- dplyr::summarise(HgcA.homolog.phylum, Phylum.count = n())
HgcA.homolog.phylum <- HgcA.homolog.phylum[order(-HgcA.homolog.phylum$Phylum.count),]
kable(HgcA.homolog.phylum)
| Phylum | Phylum.count |
|---|---|
| Proteobacteria | 126 |
| Firmicutes | 35 |
| Spirochaetes | 6 |
| unclassified | 5 |
| Lentisphaerae | 4 |
| Candidatus Aminicenantes | 3 |
| Nitrospirae | 3 |
| Bacteroidetes | 2 |
| Candidatus Raymondbacteria | 2 |
| Actinobacteria | 1 |
| Chloroflexi | 1 |
| Elusimicrobia | 1 |
| Planctomycetes | 1 |
| Synergistetes | 1 |
The question now is: Which of these genomes which contain an HgcA homolog also contains an HgcB homolog?
A quick visual inspection suggests that most of these have a small ferredoxin shortly downstream. Taking a more formal approach:
Collect homologs for each of the five proven HgcB proteins yields a set of 179 genes from 178 genomes. Taxonomy summary;
library(dplyr)
HgcB.homolog.genomes <- read.csv("HgcB_homologs_genomes.csv", header = T, sep = "\t")
HgcB.homolog.phylum <- dplyr::group_by(HgcB.homolog.genomes, Phylum)
HgcB.homolog.phylum <- dplyr::summarise(HgcB.homolog.phylum, Phylum.count = n())
HgcB.homolog.phylum <- HgcB.homolog.phylum[order(-HgcB.homolog.phylum$Phylum.count),]
kable(HgcB.homolog.phylum)
| Phylum | Phylum.count |
|---|---|
| Proteobacteria | 102 |
| Firmicutes | 27 |
| Nitrospirae | 10 |
| Spirochaetes | 9 |
| unclassified | 8 |
| Elusimicrobia | 7 |
| Lentisphaerae | 4 |
| Bacteroidetes | 2 |
| Candidatus Aminicenantes | 2 |
| Planctomycetes | 2 |
| Actinobacteria | 1 |
| Candidatus Schekmanbacteria | 1 |
| Chloroflexi | 1 |
| Euryarchaeota | 1 |
| Synergistetes | 1 |
Create an “inner-join” union set, genomes in which a homolog for each appears:
library(dplyr)
HgcAB.genome.union <- dplyr::inner_join(HgcA.homolog.genomes, HgcB.homolog.genomes, by = "Genome.Name...Sample.Name")
## Warning: Column `Genome.Name...Sample.Name` joining factors with different
## levels, coercing to character vector
head(HgcAB.genome.union$Genome.Name...Sample.Name)
## [1] "Lentisphaerae bacterium RIFOXYA12_FULL_48_11"
## [2] "Desulfobulbus propionicus 1pr3, DSM 2032"
## [3] "Lentisphaerae bacterium GWF2_57_35"
## [4] "Geopsychrobacter electrodiphilus DSM 16401"
## [5] "Desulfobulbus sp. Tol-SR"
## [6] "Composite genome from Trout Bog Hypolimnion pan-assembly TBhypo.metabat.3815"
This generates a set of 149 genomes that are shared between the sets. This might be lumping duplications together…
Create an “inner-join” based on scaffold, only include scaffolds found in each gene set:
library(dplyr)
HgcA.homolog <- read.csv("HgcA_homologs.csv", header = T, sep = "\t")
colnames(HgcA.homolog) <- paste("HgcA",colnames(HgcA.homolog),sep=".")
colnames(HgcA.homolog)[9] <- "Scaffold.ID"
HgcB.homolog <- read.csv("HgcB_homologs.csv", header = T, sep = "\t")
colnames(HgcB.homolog) <- paste("HgcB",colnames(HgcB.homolog),sep=".")
colnames(HgcB.homolog)[10] <- "Scaffold.ID"
HgcAB.scaffold.full <- dplyr::full_join(HgcA.homolog, HgcB.homolog, by = "Scaffold.ID")
HgcAB.scaffold.union <- dplyr::inner_join(HgcA.homolog, HgcB.homolog, by = "Scaffold.ID")
This identifies 135 scaffolds that are found in both sets vs 237 which are found in either. In other words, of the 179 HgcB and 193 HgcA homologs, 135 are found on the same scaffold.
This does not establish co-localization however. I can compare “HcgA.End.Coord” and “HcgB.Start.Coord”:
HgcAB.same.scaffold <- HgcAB.scaffold.union[c(5,4,9,1,7,8,11,18,19)]
colnames(HgcAB.same.scaffold)[c(1,2)] <- c("Genome.Name","Genome.ID")
HgcAB.same.scaffold$A.B.dist <- HgcAB.same.scaffold$HgcA.End.Coord - HgcAB.same.scaffold$HgcB.Start.Coord
Plotting a histogram of gene distances:
hist(HgcAB.same.scaffold$A.B.dist, breaks = 1000)
This demonstrates fairly definitively that, with exception of two cases, the genes are in very close proximity to one another.