Hgc.IMG

Searching for HgcA homologs in existing microbial genomes

JGI IMG database indexes and annotates all publically available microbial genomes. They currently have over 80,000 genomes.

HgcA is annotated in the pfam system as pfam03599.

A function search of the IMG database targetting all genomes, all completion status and type, for pfam03599 is conducted. This includes metagenome-annotated genomes and single-cell genomes.

Results are encouraging, 3574 genes in 1756 genomes.

Characterizing the genomes that contain pfam03599 by phylum

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

pfam03599.genomes <- read.csv("pfam03599.genomes.csv", sep = "\t", header = TRUE)
pfam03599.phylum <- dplyr::group_by(pfam03599.genomes, Phylum)
pfam03599.phylum.summary <- dplyr::summarize(pfam03599.phylum, phylum.count = n())
pfam03599.phylum.summary <- pfam03599.phylum.summary[order(-pfam03599.phylum.summary$phylum.count),]
kable(pfam03599.phylum.summary)

Phylum	phylum.count
Firmicutes	612
Euryarchaeota	275
Proteobacteria	275
Chloroflexi	137
Nitrospirae	60
unclassified	49
Spirochaetes	43
Candidatus Omnitrophica	42
Planctomycetes	38
Elusimicrobia	23
Candidatus Bathyarchaeota	21
Bacteroidetes	20
Actinobacteria	19
Aminicenantes	13
Candidatus Desantisbacteria	12
Armatimonadetes	11
Thermodesulfobacteria	11
Crenarchaeota	10
Lentisphaerae	9
Candidatus Bipolaricaulota	8
Candidatus Aminicenantes	7
Candidatus Raymondbacteria	7
Acidobacteria	5
Atribacteria	5
Candidatus Melainabacteria	5
Nitrospinae	4
Candidatus Firestonebacteria	3
Cyanobacteria	3
Thaumarchaeota	3
Aerophobetes	2
Candidatus Thorarchaeota	2
Cloacimonetes	2
Acetothermia	1
candidate division KD3-62	1
candidate division WOR-3	1
Candidatus Aerophobetes	1
Candidatus Atribacteria	1
Candidatus Calescamantes	1
Candidatus Fraserbacteria	1
Candidatus Heimdallarchaeota	1
Candidatus Lokiarchaeota	1
Candidatus Marinimicrobia	1
Candidatus Pacebacteria	1
Candidatus Poribacteria	1
Candidatus Schekmanbacteria	1
Candidatus Wallbacteria	1
Candidatus Woesearchaeota	1
Deferribacteres	1
Ignavibacteriae	1
Synergistetes	1
Thermotogae	1
Verrucomicrobia	1

Characterizing by family

library(dplyr)
pfam03599.genomes <- read.csv("pfam03599.genomes.csv", sep = "\t", header = TRUE)
pfam03599.family <- dplyr::group_by(pfam03599.genomes, Family)
pfam03599.family.summary <- dplyr::summarize(pfam03599.family, family.count = n())
pfam03599.family.summary <- pfam03599.family.summary[order(-pfam03599.family.summary$family.count),]
kable(pfam03599.family.summary)

Family	family.count
unclassified	726
Peptostreptococcaceae	348
Methanosarcinaceae	111
Peptococcaceae	57
Lachnospiraceae	50
Clostridiaceae	41
Thermoanaerobacteraceae	37
Desulfobacteraceae	34
Methanobacteriaceae	29
Geobacteraceae	28
Archaeoglobaceae	22
Dehalococcoidaceae	22
Desulfovibrionaceae	22
Eubacteriaceae	14
Syntrophaceae	14
Desulfobulbaceae	12
Nitrospiraceae	11
Thermodesulfobacteriaceae	11
Methanococcaceae	10
Spirochaetaceae	10
Methanocaldococcaceae	9
Candidatus Brocadiaceae	8
Desulfarculaceae	8
Ruminococcaceae	8
Sporomusaceae	8
Syntrophobacteraceae	8
Hungateiclostridiaceae	7
Methanomicrobiaceae	7
Methanosaetaceae	6
Desulfonatronaceae	5
Desulfuromonadaceae	5
Halobacteroidaceae	5
Desulfohalobiaceae	4
Desulfomicrobiaceae	4
Methanomassiliicoccaceae	4
Methanoregulaceae	4
Thermoanaerobacterales Family III. Incertae Sedis	4
Gastranaerophilaceae	3
Methanopyraceae	3
Pseudonocardiaceae	3
Thermococcaceae	3
Bacteroidaceae	2
Clostridiales Family XVI. Incertae Sedis	2
Marinilabiliaceae	2
Methanocellaceae	2
Methanocorpusculaceae	2
Nitrospinaceae	2
Rhodobacteraceae	2
Peptococcaceae	1
Anaerolineaceae	1
Bacillaceae	1
Bradyrhizobiaceae	1
Candidatus Methanoperedenaceae	1
Desulfurococcaceae	1
Halanaerobiaceae	1
Holophagaceae	1
Kosmotogaceae	1
Methanospirillaceae	1
Methanothermaceae	1
Methermicoccaceae	1
Planctomycetaceae	1
Streptomycetaceae	1
Sulfolobaceae	1
Syntrophomonadaceae	1
Syntrophorhabdaceae	1

Attempting to expand the search window

Exploration of full annotations of these 3574 genes. Does any other annotation system captures the majority of these genes?

pfam03599 annotations in other systems

A snippet of the full gene annotation table

library(dplyr)
pfam03599.genes <- read.csv("pfam03599.gene.annotations.csv", sep = "\t", header = TRUE)
pfam03599.annot <- pfam03599.genes[-c(1:6)]
kable(head(pfam03599.annot))

COG	Pfam	Tigrfam	X
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit	pfam03598===CdhC<<>>pfam03599===CdhD	TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit	pfam03599===CdhD<<>>pfam03598===CdhC	TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit	pfam03598===CdhC<<>>pfam03599===CdhD	TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit	pfam03599===CdhD<<>>pfam03598===CdhC	TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit	pfam03598===CdhC<<>>pfam03599===CdhD	TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	NA
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit	pfam03598===CdhC<<>>pfam03599===CdhD	TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	NA

Summarize by COG

library(dplyr)
pfam03599.COG <- dplyr::group_by(pfam03599.annot, COG)
pfam03599.COG.summary <- dplyr::summarise(pfam03599.COG, COG.count = n()/nrow(pfam03599.COG))
pfam03599.COG.summary$COG[pfam03599.COG.summary$COG == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, pfam03599.COG.summary$COG == "", value =
## structure(c(NA, : invalid factor level, NA generated

kable(pfam03599.COG.summary)

COG	COG.count
NA	0.2115277
COG0294===Dihydropteroate synthase	0.0025182
COG0685===5,10-methylenetetrahydrofolate reductase<<>>COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein)	0.0005596
COG1064===D-arabinose 1-dehydrogenase, Zn-dependent alcohol dehydrogenase family	0.0002798
COG1104===Cysteine sulfinate desulfinase/cysteine desulfurase or related enzyme	0.0002798
COG1151===Hydroxylamine reductase (hybrid-cluster protein)	0.0005596
COG1151===Hydroxylamine reductase (hybrid-cluster protein)<<>>COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein)	0.0002798
COG1456===CO dehydrogenase/acetyl-CoA synthase gamma subunit (corrinoid Fe-S protein)	0.4029099
COG1614===CO dehydrogenase/acetyl-CoA synthase beta subunit	0.0047566
COG2069===CO dehydrogenase/acetyl-CoA synthase delta subunit (corrinoid Fe-S protein)	0.3757694
COG3640===CO dehydrogenase nickel-insertion accessory protein CooC1	0.0002798
COG3894===Uncharacterized 2Fe-2 and 4Fe-4S clusters-containing protein, contains DUF4445 domain	0.0002798

Summarize by TigrFAM

library(dplyr)
pfam03599.Tigr <- dplyr::group_by(pfam03599.annot, Tigrfam)
pfam03599.Tigr.summary <- dplyr::summarise(pfam03599.Tigr, Tigr.count = n()/nrow(pfam03599.annot))
pfam03599.Tigr.summary$Tigrfam[pfam03599.COG.summary$Tigrfam == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, pfam03599.COG.summary$Tigrfam == "", value
## = structure(1:14, .Label = c("", : invalid factor level, NA generated

## Warning: Unknown or uninitialised column: 'Tigrfam'.

kable(pfam03599.Tigr.summary)

Tigrfam	Tigr.count
	0.7526581
TIGR00284===dihydropteroate synthase-related protein	0.0025182
TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	0.0005596
TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit	0.0008394
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit	0.2364298
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR00316===CO dehydrogenase/CO-methylating acetyl-CoA synthase complex, beta subunit	0.0033576
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR01445===intein N-terminal splicing region<<>>TIGR01443===intein C-terminal splicing region	0.0002798
TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR02794===TolA protein	0.0002798
TIGR01445===intein N-terminal splicing region<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit<<>>TIGR01443===intein C-terminal splicing region	0.0002798
TIGR01702===carbon-monoxide dehydrogenase, catalytic subunit	0.0002798
TIGR01702===carbon-monoxide dehydrogenase, catalytic subunit<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit	0.0005596
TIGR02794===TolA protein<<>>TIGR00381===CO dehydrogenase/acetyl-CoA synthase, delta subunit	0.0002798
TIGR03402===cysteine desulfurase NifS	0.0002798
TIGR04256===GxxExxY protein	0.0013990

Summarize by KEGG

library(dplyr)
pfam03599.KO <- dplyr::group_by(pfam03599.annot, KO)
pfam03599.KO.summary <- dplyr::summarise(pfam03599.KO, KO.count = n()/nrow(pfam03599.KO))
pfam03599.KO.summary$KO[pfam03599.KO.summary$KO == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, pfam03599.KO.summary$KO == "", value =
## structure(c(NA, : invalid factor level, NA generated

kable(pfam03599.KO.summary)

KO	KO.count
NA	0.1684387
KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245]	0.3875210
KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245]<<>>KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4]	0.0005596
KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245]	0.4418019
KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4]	0.0008394
KO:K00198===cooS, acsA===carbon-monoxide dehydrogenase catalytic subunit [EC:1.2.99.2 1.2.7.4]<<>>KO:K00194===cdhD, acsD===acetyl-CoA decarbonylase/synthase complex subunit delta [EC:2.1.1.245]	0.0002798
KO:K04487===iscS, NFS1===cysteine desulfurase [EC:2.8.1.7]	0.0002798
KO:K13953===adhP===alcohol dehydrogenase, propanol-preferring [EC:1.1.1.1]	0.0002798

Summarize by pfam

library(dplyr)
pfam03599.Pfam <- dplyr::group_by(pfam03599.annot, Pfam)
pfam03599.Pfam.summary <- dplyr::summarise(pfam03599.Pfam, Pfam.count = n()/nrow(pfam03599.Pfam))
pfam03599.Pfam.summary$Pfam[pfam03599.Pfam.summary$Pfam == ""] <- "NA"

## Warning in `[<-.factor`(`*tmp*`, pfam03599.Pfam.summary$Pfam == "", value
## = structure(1:36, .Label = c("pfam00037===Fer4<<>>pfam03599===CdhD", :
## invalid factor level, NA generated

pfam03599.Pfam.summary <- pfam03599.Pfam.summary[order(-pfam03599.Pfam.summary$Pfam.count),]
kable(pfam03599.Pfam.summary)

Pfam	Pfam.count
pfam03599===CdhD	0.5554001
pfam03599===CdhD<<>>pfam04060===FeS	0.2269166
pfam04060===FeS<<>>pfam03599===CdhD	0.1829882
pfam03599===CdhD<<>>pfam00037===Fer4	0.0058758
pfam03599===CdhD<<>>pfam02219===MTHFR	0.0036374
pfam03598===CdhC<<>>pfam03599===CdhD	0.0030778
pfam02219===MTHFR<<>>pfam03599===CdhD	0.0027980
pfam00037===Fer4<<>>pfam03599===CdhD	0.0025182
pfam03599===CdhD<<>>pfam03598===CdhC	0.0016788
pfam03599===CdhD<<>>pfam14251===DUF4346	0.0016788
pfam03599===CdhD<<>>pfam12838===Fer4_7	0.0013990
pfam13187===Fer4_9<<>>pfam03599===CdhD	0.0013990
pfam03599===CdhD<<>>pfam13187===Fer4_9	0.0011192
pfam03599===CdhD<<>>pfam13649===Methyltransf_25	0.0011192
pfam03599===CdhD<<>>pfam03063===Prismane	0.0008394
pfam13366===PDDEXK_3<<>>pfam03599===CdhD	0.0008394
pfam03599===CdhD<<>>pfam04060===FeS<<>>pfam00809===Pterin_bind	0.0005596
pfam03599===CdhD<<>>pfam13366===PDDEXK_3	0.0005596
pfam12838===Fer4_7<<>>pfam03599===CdhD	0.0005596
pfam14251===DUF4346<<>>pfam03599===CdhD	0.0005596
pfam00809===Pterin_bind<<>>pfam03599===CdhD<<>>pfam04060===FeS	0.0002798
pfam03599===CdhD<<>>pfam00266===Aminotran_5	0.0002798
pfam03599===CdhD<<>>pfam01656===CbiA	0.0002798
pfam03599===CdhD<<>>pfam13187===Fer4_9<<>>pfam08241===Methyltransf_11	0.0002798
pfam03599===CdhD<<>>pfam13237===Fer4_10	0.0002798
pfam04060===FeS<<>>pfam03599===CdhD<<>>pfam00809===Pterin_bind	0.0002798
pfam04060===FeS<<>>pfam03599===CdhD<<>>pfam12683===DUF3798	0.0002798
pfam04536===TPM_phosphatase<<>>pfam03599===CdhD	0.0002798
pfam08240===ADH_N<<>>pfam00107===ADH_zinc_N<<>>pfam03599===CdhD	0.0002798
pfam08241===Methyltransf_11<<>>pfam03599===CdhD<<>>pfam13187===Fer4_9	0.0002798
pfam13181===TPR_8<<>>pfam03599===CdhD<<>>pfam02219===MTHFR<<>>pfam07719===TPR_2	0.0002798
pfam13237===Fer4_10<<>>pfam03599===CdhD	0.0002798
pfam13649===Methyltransf_25<<>>pfam03599===CdhD	0.0002798
pfam14574===DUF4445<<>>pfam04060===FeS<<>>pfam00111===Fer2<<>>pfam03599===CdhD	0.0002798
pfam14890===Intein_splicing<<>>pfam03598===CdhC<<>>pfam14528===LAGLIDADG_3<<>>pfam03599===CdhD	0.0002798
pfam14890===Intein_splicing<<>>pfam14528===LAGLIDADG_3<<>>pfam03599===CdhD<<>>pfam03598===CdhC	0.0002798

The other major systems are inconsistent.

Searching for HgcB

Based on “The Genetic Basis for Bacterial Mercury Methylation” (http://science.sciencemag.org/content/339/6125/1332), a handful of Desulfovibrio and Geobacter strains have proven Hgc activity and their HgcB genes have been identified;

The goal is to identify potential HgcB analogs among the 137 genomes with HgcA. We need a handle to confidently identify these however. Check the functional annotations of these five genes first.

Gather proven HgcB genes and detail their functional annotations

HgcB.proven.genomes <- read.csv("HgcB.proven.genomes.csv", sep = "\t", header = TRUE)
kable(HgcB.proven.genomes$Genome.Name...Sample.Name)

x
Desulfovibrio africanus Walvis Bay
Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631
Geobacter sulfurreducens PCA
Geobacter metallireducens GS-15
Desulfovibrio desulfuricans ND132

I could find 5 of the 6 proven HgcB genomes. D. propionicus has no genome or it’s name has been changed.

Identify HgcB genes and explore annotations

The HgcB genes identified in the aforementioned studies are pointed out above. They are analyzed for functional annotations:

HgcB.proven.annotations <- read.csv("Hgc.proven.annot.csv", sep = "\t", header = TRUE)
kable(HgcB.proven.annotations[-c(1,2,4,6)])

Gene.Product.Name	Genome.Name	COG	Pfam	Tigrfam	KO	X
4Fe-4S dicluster domain-containing protein	Desulfovibrio desulfuricans ND132	NA	pfam00037===Fer4	NA	NA	NA
4Fe-4S dicluster domain-containing protein	Desulfovibrio africanus Walvis Bay	NA	pfam13187===Fer4_9	NA	NA	NA
4Fe-4S ferredoxin iron-sulfur-binding domain-containing protein	Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631	NA	pfam13187===Fer4_9	NA	NA	NA
Ferredoxin family protein	Geobacter metallireducens GS-15	NA	pfam13187===Fer4_9	NA	NA	NA
ferredoxin family protein	Geobacter sulfurreducens PCA	NA	pfam13187===Fer4_9	NA	NA	NA

As can be seen, only pfam annotates these HgcB proteins, and does so inconsistently. However, they are all flagged as “ferrodoxin” or “4Fe-4S”. This is an imprecise technique but is something.

pfam13187 is not a terrible starting point

Refining the set of putative HgcA genes/proteins

An initial examination of the details of various pfam03359 genes reveals a huge variety in gene size and neighborhood (not shown). pfam03599 might capture all of HgcA proteins, but it is also potentially capturing many more proteins with other functions. We should move towards narrowing it down. Two ways to handle this, generally speaking

Check to see if any other annotation picks up proven HgcA while not picking up the other pfam03359 ** This is described below; it leads to a tangled mess with too many arbitrary choices in my opinion
Refine by another search pattern such as protein length or lack of other annotation
Phylogenetic analysis
Protein homology-based approach ** This strategy appears to be fruitful, leading to ~ 100 sets of HcgAB homologs

Check functional annotations of proven HgcA proteins

HgcA.proven.annot <- read.csv("HgcA.proven.annot.csv", header = TRUE, sep = "\t")
kable(HgcA.proven.annot[-c(1,2,3,4,6)])

Genome.Name	COG	Pfam	Tigrfam	KO	X
Geobacter metallireducens GS-15	NA	pfam03599===CdhD	NA	KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245]	NA
Geobacter sulfurreducens PCA	NA	pfam03599===CdhD	NA	KO:K00197===cdhE, acsC===acetyl-CoA decarbonylase/synthase complex subunit gamma [EC:2.1.1.245]	NA
Desulfovibrio desulfuricans ND132	NA	pfam03599===CdhD	NA		NA
Desulfovibrio africanus Walvis Bay	NA	pfam03599===CdhD	NA		NA
Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631	NA	pfam03599===CdhD	NA		NA

As is evident, COG does not pick this up at all, while KEGG is inconsistent.

Two potentially useful rules emerge:

No COG annotation
only ONE pfam annotation

Apply these rules to the pfam03599 set:

pfam03599.rules <- pfam03599.genes
pfam03599.rules <- pfam03599.rules[pfam03599.rules$COG == "",] # include only those with no COG annotation
pfam03599.rules <- pfam03599.rules[pfam03599.rules$Pfam == "pfam03599===CdhD",] #include only those with the single pfam

This restricts us to a more narrow set of 638 genes.

Note that there are some other interesting genes in there which contain both pfam03559 and a ferredoxin, Worth examining later, but this set is a good place to start.

We export the list of IMG gene IDs corresponding to this more restricted set and import to IMG

pfam03599.rules.ID <- data.frame(pfam03599.rules$Gene.ID)
write.csv(pfam03599.rules.ID,"pfam03599.rules.ID.csv")

Phylogeny of restricted gene set

library(dplyr)
pfam03599.rules.genomes <- read.csv("pfam03599.rules.genomes.csv", sep = "\t", header = TRUE)
pfam03599.rules.phylum <- dplyr::group_by(pfam03599.rules.genomes, Phylum)
pfam03599.rules.phylum.summary <- dplyr::summarize(pfam03599.rules.phylum, phylum.count = n())
pfam03599.rules.phylum.summary <- pfam03599.rules.phylum.summary[order(-pfam03599.rules.phylum.summary$phylum.count),]
kable(pfam03599.rules.phylum.summary)

Phylum	phylum.count
Proteobacteria	210
Firmicutes	62
Euryarchaeota	54
Chloroflexi	35
Nitrospirae	34
Spirochaetes	34
unclassified	33
Bacteroidetes	16
Elusimicrobia	15
Candidatus Omnitrophica	8
Planctomycetes	7
Candidatus Raymondbacteria	6
Lentisphaerae	5
Nitrospinae	4
Actinobacteria	3
Armatimonadetes	3
Candidatus Aminicenantes	3
Candidatus Firestonebacteria	3
Cyanobacteria	3
Acidobacteria	2
Atribacteria	2
Candidatus Bipolaricaulota	2
Candidatus Melainabacteria	2
Thaumarchaeota	2
Aminicenantes	1
candidate division WOR-3	1
Candidatus Aerophobetes	1
Candidatus Atribacteria	1
Candidatus Bathyarchaeota	1
Candidatus Lokiarchaeota	1
Candidatus Schekmanbacteria	1
Candidatus Thorarchaeota	1
Crenarchaeota	1
Ignavibacteriae	1
Synergistetes	1

Compared to the original complete pfam03599 gene set, restricting based on COG and pfam profile has greatly enriched for Proteobacteria.

Searching for HgcB in the pfam03599 subset

These genomes have a huge number of pfam13187, the ferredoxin domain which is consistent with HgcB

This means that we need a different way to identify potential HgcAB pairs

Manual selection

Using the gene neighborhood diagrams presented by IMG, it is not unreasonable to sort through ~600 sets to flag syntenys that look reasonable under the basic criteria of having a small ferredoxin within 5kb downstream of the pfam03599. However this is imprecise and very clunky (not easy to do in img)

Conclusion

Working from annotations is sloppy and leads to many arbitrary choices. I could sift through these to find pairs of 03599 + ferredoxins and might have to, but this is where I will leave this for now

Homology-based searches

First make a set of 5 proven HgcA genes based on the 2013 Science paper:

HgcA.proven <- read.csv("HgcA.proven.csv", header = TRUE, sep = "\t")
kable(HgcA.proven[-c(6,7)])

Gene.ID	Locus.Tag	Gene.Product.Name	Genome.ID	Genome.Name
637778520	Gmet_1240	acetyl-CoA decarbonylase/synthase complex subunit gamma	637000119	Geobacter metallireducens GS-15
637126114	GSU1440	acetyl-CoA decarbonylase/synthase complex subunit gamma	637000120	Geobacter sulfurreducens PCA
2503785873	DND132_1056	CO dehydrogenase/acetyl-CoA synthase delta subunit	2503754015	Desulfovibrio desulfuricans ND132
2503792115	DesafDRAFT_0437	CO dehydrogenase/acetyl-CoA synthase delta subunit	2503754016	Desulfovibrio africanus Walvis Bay
649853478	Daes_2662	CO dehydrogenase/acetyl-CoA synthase subunit gamma (corrinoid Fe-S protein)-like protein	649633037	Pseudodesulfovibrio aespoeensis Aspo-2, DSM 10631

Perform all-IMG genome homolog searches and create a union set

This yields a set of 193 genes from 191 genomes (including the 5 used to search). Phylum-level summary:

library(dplyr)
HgcA.homolog.genomes <- read.csv("HgcA_homologs_genomes.csv", header = T, sep = "\t")
HgcA.homolog.phylum <- dplyr::group_by(HgcA.homolog.genomes, Phylum)
HgcA.homolog.phylum <- dplyr::summarise(HgcA.homolog.phylum, Phylum.count = n())
HgcA.homolog.phylum <- HgcA.homolog.phylum[order(-HgcA.homolog.phylum$Phylum.count),]
kable(HgcA.homolog.phylum)

Phylum	Phylum.count
Proteobacteria	126
Firmicutes	35
Spirochaetes	6
unclassified	5
Lentisphaerae	4
Candidatus Aminicenantes	3
Nitrospirae	3
Bacteroidetes	2
Candidatus Raymondbacteria	2
Actinobacteria	1
Chloroflexi	1
Elusimicrobia	1
Planctomycetes	1
Synergistetes	1

The question now is: Which of these genomes which contain an HgcA homolog also contains an HgcB homolog?

A quick visual inspection suggests that most of these have a small ferredoxin shortly downstream. Taking a more formal approach:

Collect homologs for each of the five proven HgcB proteins yields a set of 179 genes from 178 genomes. Taxonomy summary;

library(dplyr)
HgcB.homolog.genomes <- read.csv("HgcB_homologs_genomes.csv", header = T, sep = "\t")
HgcB.homolog.phylum <- dplyr::group_by(HgcB.homolog.genomes, Phylum)
HgcB.homolog.phylum <- dplyr::summarise(HgcB.homolog.phylum, Phylum.count = n())
HgcB.homolog.phylum <- HgcB.homolog.phylum[order(-HgcB.homolog.phylum$Phylum.count),]
kable(HgcB.homolog.phylum)

Phylum	Phylum.count
Proteobacteria	102
Firmicutes	27
Nitrospirae	10
Spirochaetes	9
unclassified	8
Elusimicrobia	7
Lentisphaerae	4
Bacteroidetes	2
Candidatus Aminicenantes	2
Planctomycetes	2
Actinobacteria	1
Candidatus Schekmanbacteria	1
Chloroflexi	1
Euryarchaeota	1
Synergistetes	1

Create an “inner-join” union set, genomes in which a homolog for each appears:

library(dplyr)
HgcAB.genome.union <- dplyr::inner_join(HgcA.homolog.genomes, HgcB.homolog.genomes, by = "Genome.Name...Sample.Name")

## Warning: Column `Genome.Name...Sample.Name` joining factors with different
## levels, coercing to character vector

head(HgcAB.genome.union$Genome.Name...Sample.Name)

## [1] "Lentisphaerae bacterium RIFOXYA12_FULL_48_11"                                
## [2] "Desulfobulbus propionicus 1pr3, DSM 2032"                                    
## [3] "Lentisphaerae bacterium GWF2_57_35"                                          
## [4] "Geopsychrobacter electrodiphilus DSM 16401"                                  
## [5] "Desulfobulbus sp. Tol-SR"                                                    
## [6] "Composite genome from Trout Bog Hypolimnion pan-assembly TBhypo.metabat.3815"

This generates a set of 149 genomes that are shared between the sets. This might be lumping duplications together…

Create an “inner-join” based on scaffold, only include scaffolds found in each gene set:

library(dplyr)
HgcA.homolog <- read.csv("HgcA_homologs.csv", header = T, sep = "\t")
colnames(HgcA.homolog) <- paste("HgcA",colnames(HgcA.homolog),sep=".")
colnames(HgcA.homolog)[9] <- "Scaffold.ID"
HgcB.homolog <- read.csv("HgcB_homologs.csv", header = T, sep = "\t")
colnames(HgcB.homolog) <- paste("HgcB",colnames(HgcB.homolog),sep=".")
colnames(HgcB.homolog)[10] <- "Scaffold.ID"
HgcAB.scaffold.full <- dplyr::full_join(HgcA.homolog, HgcB.homolog, by = "Scaffold.ID")
HgcAB.scaffold.union <- dplyr::inner_join(HgcA.homolog, HgcB.homolog, by = "Scaffold.ID")

This identifies 135 scaffolds that are found in both sets vs 237 which are found in either. In other words, of the 179 HgcB and 193 HgcA homologs, 135 are found on the same scaffold.

This does not establish co-localization however. I can compare “HcgA.End.Coord” and “HcgB.Start.Coord”:

HgcAB.same.scaffold <- HgcAB.scaffold.union[c(5,4,9,1,7,8,11,18,19)]
colnames(HgcAB.same.scaffold)[c(1,2)] <- c("Genome.Name","Genome.ID")
HgcAB.same.scaffold$A.B.dist <- HgcAB.same.scaffold$HgcA.End.Coord - HgcAB.same.scaffold$HgcB.Start.Coord

Plotting a histogram of gene distances:

hist(HgcAB.same.scaffold$A.B.dist, breaks = 1000)

This demonstrates fairly definitively that, with exception of two cases, the genes are in very close proximity to one another.