How does the number and types of oxygenase systems in 273 compare to other Pseudomonads? Are there any unusual types or an unusually large number of such systems?
There are ~480 “oxygenase” gene families in the KEGG system: https://www.genome.jp/dbget-bin/www_bfind_sub?mode=bfind&max_hit=1000&locale=en&serv=kegg&dbkey=orthology&keywords=oxygenase&page=1
Import these and parse:
oxy.kegg.all <- read.csv("KEGG.all.oxygenase.txt", stringsAsFactors = F, sep="\t",
header = F)
rows = nrow(oxy.kegg.all)
oxy.KEGG.all.table <- data.frame(KO = character(), description = character(), stringsAsFactors = F)
for (x in c(0:(rows/2-1))) {
oxy.KEGG.all.table[x+1, 1] <- oxy.kegg.all[x*2+1, 1]
oxy.KEGG.all.table[x+1, 2] <- oxy.kegg.all[x*2+2, 1]
}
head(oxy.KEGG.all.table)
## KO
## 1 K00446
## 2 K00448
## 3 K00449
## 4 K00450
## 5 K00451
## 6 K00452
## description
## 1 dmpB, xylE; catechol 2,3-dioxygenase [EC:1.13.11.2]
## 2 pcaG; protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3]
## 3 pcaH; protocatechuate 3,4-dioxygenase, beta subunit [EC:1.13.11.3]
## 4 E1.13.11.4; gentisate 1,2-dioxygenase [EC:1.13.11.4]
## 5 HGD, hmgA; homogentisate 1,2-dioxygenase [EC:1.13.11.5]
## 6 HAAO; 3-hydroxyanthranilate 3,4-dioxygenase [EC:1.13.11.6]
This will cast a very wide net
Read in the 273 annotation table (from EggNOG) and limit it to KO numbers and locus names
annot.273 <- read.csv("../P273.full.annotations.csv", stringsAsFactors = F)
annot.273 <- annot.273[c(2,5)]
colnames(annot.273)[2] <- "KO"
head(annot.273)
## locus KO
## 1 P273_00001 K15125
## 2 P273_00002 K07559
## 3 P273_00003 K07484
## 4 P273_00004 K07746
## 5 P273_00005 K19092
## 6 P273_00006
merge the 273 annotation onto the KO full list. Any KOs that occur multiple times in 273 will yield additional rows.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
KO.oxy.273 <- dplyr::left_join(oxy.KEGG.all.table, annot.273, by = "KO")
KO.oxy.273 <- KO.oxy.273[complete.cases(KO.oxy.273),]
by.KO <- KO.oxy.273 %>% group_by(KO)
KO.count.273 <- dplyr::summarise(by.KO, n = n())
KO.count.273 <- KO.count.273[order(-KO.count.273$n),]
#add descriptions
KO.count.273 <- dplyr::left_join(KO.count.273, oxy.KEGG.all.table, by="KO")
#remove rows with n = 0
KO.count.273 <- KO.count.273[KO.count.273$n > 0, ]
kable(head(KO.count.273, n=20))
| KO | n | description |
|---|---|---|
| K04091 | 9 | ssuD; alkanesulfonate monooxygenase [EC:1.14.14.5] |
| K00459 | 4 | ncd2, npd; nitronate monooxygenase [EC:1.13.12.16] |
| K03863 | 4 | vanB; vanillate monooxygenase ferredoxin subunit |
| K09018 | 4 | rutA; pyrimidine oxygenase [EC:1.14.99.46] |
| K03119 | 3 | tauD; taurine dioxygenase [EC:1.14.11.17] |
| K10621 | 3 | cmtC, dhbA; 2,3-dihydroxy-p-cumate/2,3-dihydroxybenzoate 3,4-dioxygenase [EC:1.13.11.- 1.13.11.14] |
| K16303 | 3 | cmtAc; p-cumate 2,3-dioxygenase subunit beta [EC:1.14.12.25] |
| K00457 | 2 | HPD, hppD; 4-hydroxyphenylpyruvate dioxygenase [EC:1.13.11.27] |
| K00499 | 2 | CMO; choline monooxygenase [EC:1.14.15.7] |
| K00529 | 2 | hcaD; 3-phenylpropionate/trans-cinnamate dioxygenase ferredoxin reductase component [EC:1.18.1.3] |
| K05599 | 2 | antA; anthranilate 1,2-dioxygenase (deaminating, decarboxylating) large subunit [EC:1.14.12.1] |
| K16048 | 2 | hsaB; 3-hydroxy-9,10-secoandrosta-1,3,5(10)-triene-9,17-dione monooxygenase reductase component [EC:1.5.1.-] |
| K00446 | 1 | dmpB, xylE; catechol 2,3-dioxygenase [EC:1.13.11.2] |
| K00448 | 1 | pcaG; protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3] |
| K00449 | 1 | pcaH; protocatechuate 3,4-dioxygenase, beta subunit [EC:1.13.11.3] |
| K00451 | 1 | HGD, hmgA; homogentisate 1,2-dioxygenase [EC:1.13.11.5] |
| K00452 | 1 | HAAO; 3-hydroxyanthranilate 3,4-dioxygenase [EC:1.13.11.6] |
| K00455 | 1 | hpaD, hpcB; 3,4-dihydroxyphenylacetate 2,3-dioxygenase [EC:1.13.11.15] |
| K00481 | 1 | pobA; p-hydroxybenzoate 3-monooxygenase [EC:1.14.13.2] |
| K00496 | 1 | alkB1_2, alkM; alkane 1-monooxygenase [EC:1.14.15.3] |
## strain 273 has 66 oxygenase genes (detected by the KEGG system) from 38 ortholog families
There are some suspiscious-sounding gene families here in high copy number. However, before chasing butterflies we should examine their frequency in psuedomonads in general. First we need a clean list of the suspect KO numbers:
write.csv(paste("KO:",oxy.KEGG.all.table$KO,sep=""), "KO.oxygenases.csv", row.names = F, col.names = NULL)
This creates a function list which can be uploaded to IMG and then used to query their Psuedomonads.
There are 2392 Pseudomonas genomes available in IMG. Many of these are likely to be strains of a P. putida and aeurginosa; I can filter my table down to one per species when I get the table in hand.
IMG only allows bulk function profiles for 500 genomes at a time. Importing the profiles of each of the subsets and combining
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
set1 <- read.csv("IMG.set1.csv", stringsAsFactors = F, sep = "\t")
set1 <- set1[-ncol(set1)]
set2 <- read.csv("IMG.set2.csv", stringsAsFactors = F, sep = "\t")
set2 <- set2[-c(1,2,ncol(set2))]
set3 <- read.csv("IMG.set3.csv", stringsAsFactors = F, sep = "\t")
set3 <- set3[-c(1,2,ncol(set3))]
set4 <- read.csv("IMG.set4.csv", stringsAsFactors = F, sep = "\t")
set4 <- set4[-c(1,2,ncol(set4))]
set5 <- read.csv("IMG.set5.csv", stringsAsFactors = F, sep = "\t")
set5 <- set5[-c(1,2,ncol(set5))]
set.all <- cbind(set1,set2,set3,set4,set5)
write.csv(set4,"test.csv")
The species name is very difficult to parse in this particular file. I will ignore for now; the genomes can be regarded as a random subsample of Pseudomonads for the time being. Let’s get some basic stats for each KO.
Q1.calc <- function(x) {
quantile(x,0.25)
}
Q3.calc <- function(x) {
quantile(x,0.75)
}
cleanup <- function(x) {
substring(x,4)
}
set1.1 <- set.all[-2]
KO <- apply(set1.1[1],1, cleanup)
Q3 <- apply(set1.1[-1],1, Q3.calc)
Q1 <- apply(set1.1[-1],1, Q1.calc)
median.x <- apply(set1.1[-1],1, median)
IQR.x <- apply(set1.1[-1],1, IQR)
set1.stats <- data.frame(KO, Q1, Q3, median.x, IQR.x, stringsAsFactors = F)
set1.stats$outlier <- set1.stats$Q3 + set1.stats$IQR.x
write.csv(set1.stats,"test.csv")
head(set1.stats)
## KO Q1 Q3 median.x IQR.x outlier
## 1 K00446 0 0 0 0 0
## 2 K00448 1 1 1 0 1
## 3 K00449 1 1 1 0 1
## 4 K00450 0 1 0 1 2
## 5 K00451 1 1 1 0 1
## 6 K00452 0 0 0 0 0
We now have stats for each of the KEGG oxygenase families along with a threshold for outlier determination
We can now join in 273 and create a list only of ortholog families which are present at or above the outlier threshold.
library(dplyr)
set1.273 <- dplyr::left_join(set1.stats, KO.count.273, by="KO")
set1.273$above.outlier <- set1.273$n - set1.273$outlier
set1.273 <- set1.273[order(-set1.273$above.outlier),]
write.csv(set1.273, "test.csv")
outliers <-subset(set1.273, above.outlier >= 0)
kable(outliers)
| KO | Q1 | Q3 | median.x | IQR.x | outlier | n | description | above.outlier | |
|---|---|---|---|---|---|---|---|---|---|
| 106 | K09018 | 0 | 0 | 0 | 0 | 0 | 4 | rutA; pyrimidine oxygenase [EC:1.14.99.46] | 4 |
| 119 | K10621 | 0 | 0 | 0 | 0 | 0 | 3 | cmtC, dhbA; 2,3-dihydroxy-p-cumate/2,3-dihydroxybenzoate 3,4-dioxygenase [EC:1.13.11.- 1.13.11.14] | 3 |
| 264 | K16303 | 0 | 0 | 0 | 0 | 0 | 3 | cmtAc; p-cumate 2,3-dioxygenase subunit beta [EC:1.14.12.25] | 3 |
| 38 | K00499 | 0 | 0 | 0 | 0 | 0 | 2 | CMO; choline monooxygenase [EC:1.14.15.7] | 2 |
| 48 | K00529 | 0 | 0 | 0 | 0 | 0 | 2 | hcaD; 3-phenylpropionate/trans-cinnamate dioxygenase ferredoxin reductase component [EC:1.18.1.3] | 2 |
| 58 | K03863 | 0 | 1 | 1 | 1 | 2 | 4 | vanB; vanillate monooxygenase ferredoxin subunit | 2 |
| 59 | K04091 | 1 | 4 | 2 | 3 | 7 | 9 | ssuD; alkanesulfonate monooxygenase [EC:1.14.14.5] | 2 |
| 253 | K16048 | 0 | 0 | 0 | 0 | 0 | 2 | hsaB; 3-hydroxy-9,10-secoandrosta-1,3,5(10)-triene-9,17-dione monooxygenase reductase component [EC:1.5.1.-] | 2 |
| 1 | K00446 | 0 | 0 | 0 | 0 | 0 | 1 | dmpB, xylE; catechol 2,3-dioxygenase [EC:1.13.11.2] | 1 |
| 6 | K00452 | 0 | 0 | 0 | 0 | 0 | 1 | HAAO; 3-hydroxyanthranilate 3,4-dioxygenase [EC:1.13.11.6] | 1 |
| 53 | K03379 | 0 | 0 | 0 | 0 | 0 | 1 | chnB; cyclohexanone monooxygenase [EC:1.14.13.22] | 1 |
| 118 | K10619 | 0 | 0 | 0 | 0 | 0 | 1 | cmtAb; p-cumate 2,3-dioxygenase subunit alpha [EC:1.14.12.25] | 1 |
| 120 | K10676 | 0 | 0 | 0 | 0 | 0 | 1 | tfdB; 2,4-dichlorophenol 6-monooxygenase [EC:1.14.13.20] | 1 |
| 177 | K14581 | 0 | 0 | 0 | 0 | 0 | 1 | nahAa, nagAa, ndoR, nbzAa, dntAa; naphthalene 1,2-dioxygenase ferredoxin reductase component [EC:1.18.1.7] | 1 |
| 219 | K15751 | 0 | 0 | 0 | 0 | 0 | 1 | carAa; carbazole 1,9a-dioxygenase [EC:1.14.12.22] | 1 |
| 2 | K00448 | 1 | 1 | 1 | 0 | 1 | 1 | pcaG; protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3] | 0 |
| 3 | K00449 | 1 | 1 | 1 | 0 | 1 | 1 | pcaH; protocatechuate 3,4-dioxygenase, beta subunit [EC:1.13.11.3] | 0 |
| 5 | K00451 | 1 | 1 | 1 | 0 | 1 | 1 | HGD, hmgA; homogentisate 1,2-dioxygenase [EC:1.13.11.5] | 0 |
| 11 | K00457 | 2 | 2 | 2 | 0 | 2 | 2 | HPD, hppD; 4-hydroxyphenylpyruvate dioxygenase [EC:1.13.11.27] | 0 |
| 13 | K00459 | 2 | 3 | 2 | 1 | 4 | 4 | ncd2, npd; nitronate monooxygenase [EC:1.13.12.16] | 0 |
| 26 | K00481 | 1 | 1 | 1 | 0 | 1 | 1 | pobA; p-hydroxybenzoate 3-monooxygenase [EC:1.14.13.2] | 0 |
| 49 | K01633 | 1 | 1 | 1 | 0 | 1 | 1 | folB; 7,8-dihydroneopterin aldolase/epimerase/oxygenase [EC:4.1.2.25 5.1.99.8 1.13.11.81] | 0 |
| 75 | K05599 | 0 | 1 | 1 | 1 | 2 | 2 | antA; anthranilate 1,2-dioxygenase (deaminating, decarboxylating) large subunit [EC:1.14.12.1] | 0 |
| 84 | K05916 | 1 | 1 | 1 | 0 | 1 | 1 | hmp, YHB1; nitric oxide dioxygenase [EC:1.14.12.17] | 0 |
| 91 | K07215 | 1 | 1 | 1 | 0 | 1 | 1 | pigA, hemO; heme oxygenase (biliverdin-IX-beta and delta-forming) [EC:1.14.99.58] | 0 |
| 105 | K08967 | 1 | 1 | 1 | 0 | 1 | 1 | mtnD, mtnZ, ADI1; 1,2-dihydroxy-3-keto-5-methylthiopentene dioxygenase [EC:1.13.11.53 1.13.11.54] | 0 |
| 114 | K10253 | 1 | 1 | 1 | 0 | 1 | 1 | K10253; DOPA 4,5-dioxygenase [EC:1.14.99.-] | 0 |
| 236 | K15777 | 1 | 1 | 1 | 0 | 1 | 1 | DOPA; 4,5-DOPA dioxygenase extradiol [EC:1.13.11.-] | 0 |
## There is a set of 28 gene families which exist in copy numbers well above what are typically seen in Pseudomonads
.