The general goal is to begin to compare the metabolic capacities of A.BTEX to two existing genomes; A.aquaeoli and A.2007 (Seminole?). Initially, this will be performed by comparing functional annotations aiming for KEGG annotations of aromatic degradation genes.
A.2007 is currently represented only by a set of proteins; nucleotide sequences are not available, neither is the genome.
This is a fast protein annotation tool provided by the KEGG group. Submitted all three protein sets. The KO assignment is 1:1 (protein:KO).
KEGG maintains a large “reference pathway” called “degradation of aromoatic compounds” https://www.genome.jp/kegg-bin/show_pathway?map=map01220&show_description=show This is a container for many “modules”: for example, Toluene degradation: https://www.genome.jp/kegg-bin/show_module?M00418 This also contains “reaction modules”, which seem to be ge
The LEVEL2 category “Xenobiotics biodegradation and metabolism” is more useful. It contains 690 mappings of KO families particular pathways, such as benzoate, toluene, PAH metabolism.
However, these are very large maps… they are good for exploration, not so good for summarizing
GhostKoala accessions: A.aqueoli: https://www.kegg.jp/kegg-bin/blastkoala_result?id=32bb94f58c06ac0b21024fc41e53579590efc4b9&passwd=mgZ6D9&type=ghostkoala A.2007: https://www.kegg.jp/kegg-bin/blastkoala_result?id=ee34628779e008d2bb05da4d70c04439b2d4d83e&passwd=jgOiAL&type=ghostkoala A.BTEX: https://www.kegg.jp/kegg-bin/blastkoala_result?id=068dfcc08becc834efb50defbda6105baef45b90&passwd=jAzaXi&type=ghostkoala
pathway mapping in database type format is not helpful. The pathway maps are too large and integrated. What ARE helpful are “modules”.
First we can examine the Reference Pathway for Degradation of aromatic compounds with mappings for each organism
These are VERY large maps, but they permit us to see where the action is. The Module system is a different hierarchy from the pathway system. It also maps KOs to larger systems, but goes KO -> Module -> Reference Pathway or Pathway Module. The Module Mappin tool is the way to get at this directly from KOs (https://www.genome.jp/kegg/tool/map_module.html). This corresponds DIRECTLY to the output from GhostKoala.
So, within KEGG system, the goal is to: 1. Assign genes to KO groups 2. Compare KEGGs between organisms, identify differences (look into Venn diagrams) 2. map KO to Modules 3. complete modules must be analyzed directly or parsed out and compared accordingly
For each module, only modules which are missing 2 blocks at most are included. The result is in an awkward format:
The most straightforward way to handle this is with a grep for M plus 5 numbers (e.g.“M00345”) and then remove every thing else: grep -o ‘[^ ]M[0-9][0-9][0-9][0-9][0-9].’ kegg.module.txt | sed ’s/ .*//’ > kegg.module.name.txt the module descriptions can then be added in using a mapping table downloaded from kegg
library(dplyr)
module.btex <- read.csv("A.BTEX/kegg.module.name.txt",fill=T,sep="\ ",header=F)
head(module.btex)
## V1
## 1 M00001
## 2 M00002
## 3 M00307
## 4 M00009
## 5 M00010
## 6 M00011
kegg.module <- read.csv("../../../KEGG/KEGG.modules.txt", sep="\t",stringsAsFactors = F, as.is = T, header=F)
kegg.module <- kegg.module[c(1,ncol(kegg.module))]
head(kegg.module)
## V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00004
## 5 M00005
## 6 M00006
## V18
## 1 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2 Glycolysis, core module involving three-carbon compounds
## 3 Gluconeogenesis, oxaloacetate => fructose-6P
## 4 Pentose phosphate pathway (Pentose phosphate cycle)
## 5 PRPP biosynthesis, ribose 5P => PRPP
## 6 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P
module.btex <- dplyr::left_join(module.btex, kegg.module, by = "V1")
module.btex <- module.btex[order(module.btex$V1),]
head(module.btex)
## V1
## 1 M00001
## 2 M00002
## 10 M00005
## 7 M00006
## 8 M00007
## 11 M00008
## V18
## 1 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2 Glycolysis, core module involving three-carbon compounds
## 10 PRPP biosynthesis, ribose 5P => PRPP
## 7 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P
## 8 Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P
## 11 Entner-Doudoroff pathway, glucose-6P => glyceraldehyde-3P + pyruvate
library(dplyr)
module.aqua <- read.csv("A.aquaeolei/kegg.module.name.txt",fill=T,sep="\ ",header=F)
head(module.aqua)
## V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010
kegg.module <- read.csv("../../../KEGG/KEGG.modules.txt", sep="\t",stringsAsFactors = F, as.is = T, header=F)
kegg.module <- kegg.module[c(1,ncol(kegg.module))]
head(kegg.module)
## V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00004
## 5 M00005
## 6 M00006
## V18
## 1 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2 Glycolysis, core module involving three-carbon compounds
## 3 Gluconeogenesis, oxaloacetate => fructose-6P
## 4 Pentose phosphate pathway (Pentose phosphate cycle)
## 5 PRPP biosynthesis, ribose 5P => PRPP
## 6 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P
module.aqua <- dplyr::left_join(module.aqua, kegg.module, by = "V1")
head(module.aqua)
## V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010
## V18
## 1 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2 Glycolysis, core module involving three-carbon compounds
## 3 Gluconeogenesis, oxaloacetate => fructose-6P
## 4 Pyruvate oxidation, pyruvate => acetyl-CoA
## 5 Citrate cycle (TCA cycle, Krebs cycle)
## 6 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate
library(dplyr)
module.2007 <- read.csv("Arhodomonas_2007//kegg.module.name.txt",fill=T,sep="\ ",header=F)
head(module.2007)
## V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010
kegg.module <- read.csv("../../../KEGG/KEGG.modules.txt", sep="\t",stringsAsFactors = F, as.is = T, header=F)
kegg.module <- kegg.module[c(1,ncol(kegg.module))]
head(kegg.module)
## V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00004
## 5 M00005
## 6 M00006
## V18
## 1 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2 Glycolysis, core module involving three-carbon compounds
## 3 Gluconeogenesis, oxaloacetate => fructose-6P
## 4 Pentose phosphate pathway (Pentose phosphate cycle)
## 5 PRPP biosynthesis, ribose 5P => PRPP
## 6 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P
module.2007 <- dplyr::left_join(module.2007, kegg.module, by = "V1")
head(module.2007)
## V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010
## V18
## 1 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2 Glycolysis, core module involving three-carbon compounds
## 3 Gluconeogenesis, oxaloacetate => fructose-6P
## 4 Pyruvate oxidation, pyruvate => acetyl-CoA
## 5 Citrate cycle (TCA cycle, Krebs cycle)
## 6 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate
cat(paste("A.BTEX has",nrow(module.btex),"modules\n"))
## A.BTEX has 204 modules
cat(paste("A.aquaeoli has",nrow(module.aqua),"modules\n"))
## A.aquaeoli has 225 modules
cat(paste("A.2007 has",nrow(module.2007),"modules"))
## A.2007 has 237 modules
The key questions here are: 1. metabolic capacities 2. what is different about A.BTEX
question 1 can only be examined line by line with a biological perspective. The tables will be first combined and tabulated:
module.2007$strain <- "A.2007"
module.aqua$strain <- "A.aquaeoli"
module.btex$strain <- "A.BTEX"
module.sum <- rbind(module.aqua, module.2007, module.btex)
module.sum <- module.sum[order(module.sum$V1),]
head(module.sum)
## V1 V18
## 1 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 226 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 1100 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2 M00002 Glycolysis, core module involving three-carbon compounds
## 227 M00002 Glycolysis, core module involving three-carbon compounds
## 2100 M00002 Glycolysis, core module involving three-carbon compounds
## strain
## 1 A.aquaeoli
## 226 A.2007
## 1100 A.BTEX
## 2 A.aquaeoli
## 227 A.2007
## 2100 A.BTEX
First I take counts of each module and make three sub-tables:
module.counts <- dplyr::count(module.sum, V1)
module.counts <- dplyr::left_join(module.counts,kegg.module,by="V1")
module.counts.1 <- module.counts[module.counts$n == 1,]
module.counts.2 <- module.counts[module.counts$n == 2,]
module.counts.3 <- module.counts[module.counts$n == 3,]
cat(paste("there are",nrow(module.counts.3),"modules found in all strains\n",nrow(module.counts.2),"found in only two strains\nand",nrow(module.counts.1),"modules found in only one strain"))
## there are 188 modules found in all strains
## 36 found in only two strains
## and 30 modules found in only one strain
Attempt a full-join of the three module lists instead:
module.fulljoin <- dplyr::full_join(module.2007,module.aqua, by="V1")
module.fulljoin <- dplyr::full_join(module.fulljoin,module.btex,by="V1")
module.fulljoin <- dplyr::left_join(module.fulljoin, kegg.module, by="V1")
module.fulljoin <- module.fulljoin[c(1,8,3,5,7)]
colnames(module.fulljoin) <- c("module","description","A.2007","A.aqua","A.BTEX")
module.fulljoin[is.na(module.fulljoin)] <- 0
module.fulljoin[c(3,4,5)] <- lapply(module.fulljoin[c(3,4,5)], function(x) sub("A.2007|A.BTEX|A.aquaeoli",1,x))
module.fulljoin[c(3,4,5)] <- lapply(module.fulljoin[c(3,4,5)], function(x) as.numeric(x))
module.fulljoin$sum <- rowSums(module.fulljoin[c(3,4,5)])
module.fulljoin <- module.fulljoin[order(module.fulljoin$sum),]
write.csv(module.fulljoin, "Arhodomonas.kegg.module.comparison.csv", row.names = F)
head(module.fulljoin)
## module
## 41 M00098
## 55 M00338
## 73 M00136
## 75 M00045
## 94 M00573
## 95 M00577
## description
## 41 Acylglycerol degradation
## 55 Cysteine biosynthesis, homocysteine + serine => cysteine
## 73 GABA biosynthesis, prokaryotes, putrescine => GABA
## 75 Histidine degradation, histidine => N-formiminoglutamate => glutamate
## 94 Biotin biosynthesis, BioI pathway, long-chain-acyl-ACP => pimeloyl-ACP => biotin
## 95 Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin
## A.2007 A.aqua A.BTEX sum
## 41 1 0 0 1
## 55 1 0 0 1
## 73 1 0 0 1
## 75 1 0 0 1
## 94 1 0 0 1
## 95 1 0 0 1
This gives us a quickly searchable overview comparison of the three genomes from a well-understood pathway point of view. This is now written to a file called “Arhodomonas.kegg.module.comparison.csv”
library(VennDiagram)
M.2007 <- as.list(module.2007[1])
M.aqua <- as.list(module.aqua[1])
M.btex <- as.list(module.btex[1])
VD.overlap <- calculate.overlap(x = list("A.2007" = M.2007,"A.aquaeoli" = M.aqua, "A.BTEX" = M.btex))
#venn.diagram(area1=length(M.2007), area2=length(M.aqua), area3=length(M.btex), filename = NULL)
vd <- venn.diagram(c("A.2007"= M.2007, "A.aquaeoli" = M.aqua, "A.BTEX" = M.btex), filename = NULL, resolution = 150, height = 800, width = 800,
fill = c("cornflowerblue", "green","darkorchid1"), main = "KEGG Metabolic Modules in Arhodomonas Genomes",
alpha = 0.2, main.cex = 1, cex = 1, cat.cex = .75, cat.dist = c(.05,.05,.05), main.pos = c(0.5,0.95))
png("module.vd.png")
grid.draw(vd)
dev.off()
## png
## 2