Arhodomonas Metabolic Overview via KEGG Modules

Introduction

The general goal is to begin to compare the metabolic capacities of A.BTEX to two existing genomes; A.aquaeoli and A.2007 (Seminole?). Initially, this will be performed by comparing functional annotations aiming for KEGG annotations of aromatic degradation genes.

A.2007 is currently represented only by a set of proteins; nucleotide sequences are not available, neither is the genome.

Full modules lists are provided in the file “Arhodomonas.kegg.module.comparison.csv”

GhostKoala

This is a fast protein annotation tool provided by the KEGG group. Submitted all three protein sets. The KO assignment is 1:1 (protein:KO).

Identifying KO families consistent with aromatic degradation

KEGG maintains a large “reference pathway” called “degradation of aromoatic compounds” https://www.genome.jp/kegg-bin/show_pathway?map=map01220&show_description=show This is a container for many “modules”: for example, Toluene degradation: https://www.genome.jp/kegg-bin/show_module?M00418 This also contains “reaction modules”, which seem to be ge

The LEVEL2 category “Xenobiotics biodegradation and metabolism” is more useful. It contains 690 mappings of KO families particular pathways, such as benzoate, toluene, PAH metabolism.

However, these are very large maps… they are good for exploration, not so good for summarizing

GhostKoala accessions: A.aqueoli: https://www.kegg.jp/kegg-bin/blastkoala_result?id=32bb94f58c06ac0b21024fc41e53579590efc4b9&passwd=mgZ6D9&type=ghostkoala A.2007: https://www.kegg.jp/kegg-bin/blastkoala_result?id=ee34628779e008d2bb05da4d70c04439b2d4d83e&passwd=jgOiAL&type=ghostkoala A.BTEX: https://www.kegg.jp/kegg-bin/blastkoala_result?id=068dfcc08becc834efb50defbda6105baef45b90&passwd=jAzaXi&type=ghostkoala

Revision of Strategy

pathway mapping in database type format is not helpful. The pathway maps are too large and integrated. What ARE helpful are “modules”.

First we can examine the Reference Pathway for Degradation of aromatic compounds with mappings for each organism

These are VERY large maps, but they permit us to see where the action is. The Module system is a different hierarchy from the pathway system. It also maps KOs to larger systems, but goes KO -> Module -> Reference Pathway or Pathway Module. The Module Mappin tool is the way to get at this directly from KOs (https://www.genome.jp/kegg/tool/map_module.html). This corresponds DIRECTLY to the output from GhostKoala.

So, within KEGG system, the goal is to: 1. Assign genes to KO groups 2. Compare KEGGs between organisms, identify differences (look into Venn diagrams) 2. map KO to Modules 3. complete modules must be analyzed directly or parsed out and compared accordingly

Parsing the Module Mapping

For each module, only modules which are missing 2 blocks at most are included. The result is in an awkward format:

The most straightforward way to handle this is with a grep for M plus 5 numbers (e.g.“M00345”) and then remove every thing else: grep -o ‘[^ ]M[0-9][0-9][0-9][0-9][0-9].’ kegg.module.txt | sed ’s/ .*//’ > kegg.module.name.txt the module descriptions can then be added in using a mapping table downloaded from kegg

A.BTEX

library(dplyr)
module.btex <- read.csv("A.BTEX/kegg.module.name.txt",fill=T,sep="\ ",header=F)
head(module.btex)

##       V1
## 1 M00001
## 2 M00002
## 3 M00307
## 4 M00009
## 5 M00010
## 6 M00011

kegg.module <- read.csv("../../../KEGG/KEGG.modules.txt", sep="\t",stringsAsFactors = F, as.is = T, header=F)
kegg.module <- kegg.module[c(1,ncol(kegg.module))]
head(kegg.module)

##       V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00004
## 5 M00005
## 6 M00006
##                                                                     V18
## 1             Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2              Glycolysis, core module involving three-carbon compounds
## 3                          Gluconeogenesis, oxaloacetate => fructose-6P
## 4                   Pentose phosphate pathway (Pentose phosphate cycle)
## 5                                  PRPP biosynthesis, ribose 5P => PRPP
## 6 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P

module.btex <- dplyr::left_join(module.btex, kegg.module, by = "V1")
module.btex <- module.btex[order(module.btex$V1),]
head(module.btex)

##        V1
## 1  M00001
## 2  M00002
## 10 M00005
## 7  M00006
## 8  M00007
## 11 M00008
##                                                                         V18
## 1                 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2                  Glycolysis, core module involving three-carbon compounds
## 10                                     PRPP biosynthesis, ribose 5P => PRPP
## 7     Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P
## 8  Pentose phosphate pathway, non-oxidative phase, fructose 6P => ribose 5P
## 11     Entner-Doudoroff pathway, glucose-6P => glyceraldehyde-3P + pyruvate

A.aquaeoli

library(dplyr)
module.aqua <- read.csv("A.aquaeolei/kegg.module.name.txt",fill=T,sep="\ ",header=F)
head(module.aqua)

##       V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010

kegg.module <- read.csv("../../../KEGG/KEGG.modules.txt", sep="\t",stringsAsFactors = F, as.is = T, header=F)
kegg.module <- kegg.module[c(1,ncol(kegg.module))]
head(kegg.module)

##       V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00004
## 5 M00005
## 6 M00006
##                                                                     V18
## 1             Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2              Glycolysis, core module involving three-carbon compounds
## 3                          Gluconeogenesis, oxaloacetate => fructose-6P
## 4                   Pentose phosphate pathway (Pentose phosphate cycle)
## 5                                  PRPP biosynthesis, ribose 5P => PRPP
## 6 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P

module.aqua <- dplyr::left_join(module.aqua, kegg.module, by = "V1")
head(module.aqua)

##       V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010
##                                                                     V18
## 1             Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2              Glycolysis, core module involving three-carbon compounds
## 3                          Gluconeogenesis, oxaloacetate => fructose-6P
## 4                            Pyruvate oxidation, pyruvate => acetyl-CoA
## 5                                Citrate cycle (TCA cycle, Krebs cycle)
## 6 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate

A.2007

library(dplyr)
module.2007 <- read.csv("Arhodomonas_2007//kegg.module.name.txt",fill=T,sep="\ ",header=F)
head(module.2007)

##       V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010

kegg.module <- read.csv("../../../KEGG/KEGG.modules.txt", sep="\t",stringsAsFactors = F, as.is = T, header=F)
kegg.module <- kegg.module[c(1,ncol(kegg.module))]
head(kegg.module)

##       V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00004
## 5 M00005
## 6 M00006
##                                                                     V18
## 1             Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2              Glycolysis, core module involving three-carbon compounds
## 3                          Gluconeogenesis, oxaloacetate => fructose-6P
## 4                   Pentose phosphate pathway (Pentose phosphate cycle)
## 5                                  PRPP biosynthesis, ribose 5P => PRPP
## 6 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P

module.2007 <- dplyr::left_join(module.2007, kegg.module, by = "V1")
head(module.2007)

##       V1
## 1 M00001
## 2 M00002
## 3 M00003
## 4 M00307
## 5 M00009
## 6 M00010
##                                                                     V18
## 1             Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2              Glycolysis, core module involving three-carbon compounds
## 3                          Gluconeogenesis, oxaloacetate => fructose-6P
## 4                            Pyruvate oxidation, pyruvate => acetyl-CoA
## 5                                Citrate cycle (TCA cycle, Krebs cycle)
## 6 Citrate cycle, first carbon oxidation, oxaloacetate => 2-oxoglutarate

Summarizing the modules

cat(paste("A.BTEX has",nrow(module.btex),"modules\n"))

## A.BTEX has 204 modules

cat(paste("A.aquaeoli has",nrow(module.aqua),"modules\n"))

## A.aquaeoli has 225 modules

cat(paste("A.2007 has",nrow(module.2007),"modules"))

## A.2007 has 237 modules

Comparing the modules

The key questions here are: 1. metabolic capacities 2. what is different about A.BTEX

question 1 can only be examined line by line with a biological perspective. The tables will be first combined and tabulated:

module.2007$strain <- "A.2007"
module.aqua$strain <- "A.aquaeoli"
module.btex$strain <- "A.BTEX"

module.sum <- rbind(module.aqua, module.2007, module.btex)
module.sum <- module.sum[order(module.sum$V1),]
head(module.sum)

##          V1                                                       V18
## 1    M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 226  M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 1100 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
## 2    M00002  Glycolysis, core module involving three-carbon compounds
## 227  M00002  Glycolysis, core module involving three-carbon compounds
## 2100 M00002  Glycolysis, core module involving three-carbon compounds
##          strain
## 1    A.aquaeoli
## 226      A.2007
## 1100     A.BTEX
## 2    A.aquaeoli
## 227      A.2007
## 2100     A.BTEX

First I take counts of each module and make three sub-tables:

module.counts <- dplyr::count(module.sum, V1)
module.counts <- dplyr::left_join(module.counts,kegg.module,by="V1")
module.counts.1 <- module.counts[module.counts$n == 1,]
module.counts.2 <- module.counts[module.counts$n == 2,]
module.counts.3 <- module.counts[module.counts$n == 3,]
cat(paste("there are",nrow(module.counts.3),"modules found in all strains\n",nrow(module.counts.2),"found in only two strains\nand",nrow(module.counts.1),"modules found in only one strain"))

## there are 188 modules found in all strains
##  36 found in only two strains
## and 30 modules found in only one strain

Attempt a full-join of the three module lists instead:

module.fulljoin <- dplyr::full_join(module.2007,module.aqua, by="V1")
module.fulljoin <- dplyr::full_join(module.fulljoin,module.btex,by="V1")
module.fulljoin <- dplyr::left_join(module.fulljoin, kegg.module, by="V1")
module.fulljoin <- module.fulljoin[c(1,8,3,5,7)]
colnames(module.fulljoin) <- c("module","description","A.2007","A.aqua","A.BTEX")
module.fulljoin[is.na(module.fulljoin)] <- 0
module.fulljoin[c(3,4,5)] <- lapply(module.fulljoin[c(3,4,5)], function(x) sub("A.2007|A.BTEX|A.aquaeoli",1,x))
module.fulljoin[c(3,4,5)] <- lapply(module.fulljoin[c(3,4,5)], function(x) as.numeric(x))
module.fulljoin$sum <- rowSums(module.fulljoin[c(3,4,5)])
module.fulljoin <- module.fulljoin[order(module.fulljoin$sum),]
write.csv(module.fulljoin, "Arhodomonas.kegg.module.comparison.csv", row.names = F)
head(module.fulljoin)

##    module
## 41 M00098
## 55 M00338
## 73 M00136
## 75 M00045
## 94 M00573
## 95 M00577
##                                                                         description
## 41                                                         Acylglycerol degradation
## 55                         Cysteine biosynthesis, homocysteine + serine => cysteine
## 73                               GABA biosynthesis, prokaryotes, putrescine => GABA
## 75            Histidine degradation, histidine => N-formiminoglutamate => glutamate
## 94 Biotin biosynthesis, BioI pathway, long-chain-acyl-ACP => pimeloyl-ACP => biotin
## 95            Biotin biosynthesis, BioW pathway, pimelate => pimeloyl-CoA => biotin
##    A.2007 A.aqua A.BTEX sum
## 41      1      0      0   1
## 55      1      0      0   1
## 73      1      0      0   1
## 75      1      0      0   1
## 94      1      0      0   1
## 95      1      0      0   1

This gives us a quickly searchable overview comparison of the three genomes from a well-understood pathway point of view. This is now written to a file called “Arhodomonas.kegg.module.comparison.csv”

Venn Diagram

library(VennDiagram)
M.2007 <- as.list(module.2007[1])
M.aqua <- as.list(module.aqua[1])
M.btex <- as.list(module.btex[1])
VD.overlap <- calculate.overlap(x = list("A.2007" = M.2007,"A.aquaeoli" = M.aqua, "A.BTEX" = M.btex))
#venn.diagram(area1=length(M.2007), area2=length(M.aqua), area3=length(M.btex), filename = NULL)
vd <- venn.diagram(c("A.2007"= M.2007, "A.aquaeoli" = M.aqua, "A.BTEX" = M.btex), filename = NULL, resolution = 150, height = 800, width = 800,
             fill = c("cornflowerblue", "green","darkorchid1"), main = "KEGG Metabolic Modules in Arhodomonas Genomes",
             alpha = 0.2, main.cex = 1, cex = 1, cat.cex = .75, cat.dist = c(.05,.05,.05), main.pos = c(0.5,0.95))
png("module.vd.png")
grid.draw(vd)
dev.off()

## png 
##   2