Quick summary

scroll to the bottom to see the list of potentially high copy number / unusual oxygenases

Oxygenases in Pseudomonas 273

How does the number and types of oxygenase systems in 273 compare to other Pseudomonads? Are there any unusual types or an unusually large number of such systems?

  1. define a set of oxygenase functional annotations in the KEGG system (or alternatively pfam) which encapsualate most types of oxygenases found in Bacteria
  2. quanitify these in strain 273 using existing EggNOG system annotations
  3. determine counts and distributions in Psudomonad genomes via IMG system

Defining a set of oxygenase functional annotations

There are ~480 “oxygenase” gene families in the KEGG system: https://www.genome.jp/dbget-bin/www_bfind_sub?mode=bfind&max_hit=1000&locale=en&serv=kegg&dbkey=orthology&keywords=oxygenase&page=1

Import these and parse:

oxy.kegg.all <- read.csv("KEGG.all.oxygenase.txt", stringsAsFactors = F, sep="\t",
                         header = F)
rows = nrow(oxy.kegg.all)

oxy.KEGG.all.table <- data.frame(KO = character(), description = character(), stringsAsFactors = F)

for (x in c(0:(rows/2-1))) {
  oxy.KEGG.all.table[x+1, 1] <- oxy.kegg.all[x*2+1, 1]
  oxy.KEGG.all.table[x+1, 2] <- oxy.kegg.all[x*2+2, 1]
}

head(oxy.KEGG.all.table)
##       KO
## 1 K00446
## 2 K00448
## 3 K00449
## 4 K00450
## 5 K00451
## 6 K00452
##                                                           description
## 1                 dmpB, xylE; catechol 2,3-dioxygenase [EC:1.13.11.2]
## 2 pcaG; protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3]
## 3  pcaH; protocatechuate 3,4-dioxygenase, beta subunit [EC:1.13.11.3]
## 4                E1.13.11.4; gentisate 1,2-dioxygenase [EC:1.13.11.4]
## 5             HGD, hmgA; homogentisate 1,2-dioxygenase [EC:1.13.11.5]
## 6          HAAO; 3-hydroxyanthranilate 3,4-dioxygenase [EC:1.13.11.6]

This will cast a very wide net

Filter out oxygenases in the 273 genome

Read in the 273 annotation table (from EggNOG) and limit it to KO numbers and locus names

annot.273 <- read.csv("../P273.full.annotations.csv", stringsAsFactors = F)
annot.273 <- annot.273[c(2,5)]
colnames(annot.273)[2] <- "KO"
head(annot.273)
##        locus     KO
## 1 P273_00001 K15125
## 2 P273_00002 K07559
## 3 P273_00003 K07484
## 4 P273_00004 K07746
## 5 P273_00005 K19092
## 6 P273_00006

merge the 273 annotation onto the KO full list. Any KOs that occur multiple times in 273 will yield additional rows.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
KO.oxy.273 <- dplyr::left_join(oxy.KEGG.all.table, annot.273, by = "KO")
KO.oxy.273 <- KO.oxy.273[complete.cases(KO.oxy.273),]
by.KO <- KO.oxy.273 %>% group_by(KO)
KO.count.273 <- dplyr::summarise(by.KO, n = n())
KO.count.273 <- KO.count.273[order(-KO.count.273$n),]
#add descriptions
KO.count.273 <- dplyr::left_join(KO.count.273, oxy.KEGG.all.table, by="KO")
#remove rows with n = 0
KO.count.273 <- KO.count.273[KO.count.273$n > 0, ]
kable(head(KO.count.273, n=20))
KO n description
K04091 9 ssuD; alkanesulfonate monooxygenase [EC:1.14.14.5]
K00459 4 ncd2, npd; nitronate monooxygenase [EC:1.13.12.16]
K03863 4 vanB; vanillate monooxygenase ferredoxin subunit
K09018 4 rutA; pyrimidine oxygenase [EC:1.14.99.46]
K03119 3 tauD; taurine dioxygenase [EC:1.14.11.17]
K10621 3 cmtC, dhbA; 2,3-dihydroxy-p-cumate/2,3-dihydroxybenzoate 3,4-dioxygenase [EC:1.13.11.- 1.13.11.14]
K16303 3 cmtAc; p-cumate 2,3-dioxygenase subunit beta [EC:1.14.12.25]
K00457 2 HPD, hppD; 4-hydroxyphenylpyruvate dioxygenase [EC:1.13.11.27]
K00499 2 CMO; choline monooxygenase [EC:1.14.15.7]
K00529 2 hcaD; 3-phenylpropionate/trans-cinnamate dioxygenase ferredoxin reductase component [EC:1.18.1.3]
K05599 2 antA; anthranilate 1,2-dioxygenase (deaminating, decarboxylating) large subunit [EC:1.14.12.1]
K16048 2 hsaB; 3-hydroxy-9,10-secoandrosta-1,3,5(10)-triene-9,17-dione monooxygenase reductase component [EC:1.5.1.-]
K00446 1 dmpB, xylE; catechol 2,3-dioxygenase [EC:1.13.11.2]
K00448 1 pcaG; protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3]
K00449 1 pcaH; protocatechuate 3,4-dioxygenase, beta subunit [EC:1.13.11.3]
K00451 1 HGD, hmgA; homogentisate 1,2-dioxygenase [EC:1.13.11.5]
K00452 1 HAAO; 3-hydroxyanthranilate 3,4-dioxygenase [EC:1.13.11.6]
K00455 1 hpaD, hpcB; 3,4-dihydroxyphenylacetate 2,3-dioxygenase [EC:1.13.11.15]
K00481 1 pobA; p-hydroxybenzoate 3-monooxygenase [EC:1.14.13.2]
K00496 1 alkB1_2, alkM; alkane 1-monooxygenase [EC:1.14.15.3]
## strain 273 has 66 oxygenase genes (detected by the KEGG system) from 38 ortholog families

There are some suspiscious-sounding gene families here in high copy number. However, before chasing butterflies we should examine their frequency in psuedomonads in general. First we need a clean list of the suspect KO numbers:

write.csv(paste("KO:",oxy.KEGG.all.table$KO,sep=""), "KO.oxygenases.csv", row.names = F, col.names = NULL)

This creates a function list which can be uploaded to IMG and then used to query their Psuedomonads.

There are 2392 Pseudomonas genomes available in IMG. Many of these are likely to be strains of a P. putida and aeurginosa; I can filter my table down to one per species when I get the table in hand.

IMG only allows bulk function profiles for 500 genomes at a time. Importing the profiles of each of the subsets and combining

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
set1 <- read.csv("IMG.set1.csv", stringsAsFactors = F, sep = "\t")
set1 <- set1[-ncol(set1)]
set2 <- read.csv("IMG.set2.csv", stringsAsFactors = F, sep = "\t")
set2 <- set2[-c(1,2,ncol(set2))]
set3 <- read.csv("IMG.set3.csv", stringsAsFactors = F, sep = "\t")
set3 <- set3[-c(1,2,ncol(set3))]
set4 <- read.csv("IMG.set4.csv", stringsAsFactors = F, sep = "\t")
set4 <- set4[-c(1,2,ncol(set4))]
set5 <- read.csv("IMG.set5.csv", stringsAsFactors = F, sep = "\t")
set5 <- set5[-c(1,2,ncol(set5))]

set.all <- cbind(set1,set2,set3,set4,set5)

write.csv(set4,"test.csv")

The species name is very difficult to parse in this particular file. I will ignore for now; the genomes can be regarded as a random subsample of Pseudomonads for the time being. Let’s get some basic stats for each KO.

Q1.calc <- function(x) {
  quantile(x,0.25)
}

Q3.calc <- function(x) {
  quantile(x,0.75)
}

cleanup <- function(x) {
  substring(x,4)
}

set1.1 <- set.all[-2]
KO <- apply(set1.1[1],1, cleanup)
Q3 <- apply(set1.1[-1],1, Q3.calc)
Q1 <- apply(set1.1[-1],1, Q1.calc)
median.x <- apply(set1.1[-1],1, median)
IQR.x <- apply(set1.1[-1],1, IQR)

set1.stats <- data.frame(KO, Q1, Q3, median.x, IQR.x, stringsAsFactors = F)

set1.stats$outlier <- set1.stats$Q3 + set1.stats$IQR.x

write.csv(set1.stats,"test.csv")

head(set1.stats)
##       KO Q1 Q3 median.x IQR.x outlier
## 1 K00446  0  0        0     0       0
## 2 K00448  1  1        1     0       1
## 3 K00449  1  1        1     0       1
## 4 K00450  0  1        0     1       2
## 5 K00451  1  1        1     0       1
## 6 K00452  0  0        0     0       0

We now have stats for each of the KEGG oxygenase families along with a threshold for outlier determination

We can now join in 273 and create a list only of ortholog families which are present at or above the outlier threshold.

library(dplyr)
set1.273 <- dplyr::left_join(set1.stats, KO.count.273, by="KO")
set1.273$above.outlier <- set1.273$n - set1.273$outlier
set1.273 <- set1.273[order(-set1.273$above.outlier),]
write.csv(set1.273, "test.csv")
outliers <-subset(set1.273, above.outlier >= 0)
kable(outliers)
KO Q1 Q3 median.x IQR.x outlier n description above.outlier
106 K09018 0 0 0 0 0 4 rutA; pyrimidine oxygenase [EC:1.14.99.46] 4
119 K10621 0 0 0 0 0 3 cmtC, dhbA; 2,3-dihydroxy-p-cumate/2,3-dihydroxybenzoate 3,4-dioxygenase [EC:1.13.11.- 1.13.11.14] 3
264 K16303 0 0 0 0 0 3 cmtAc; p-cumate 2,3-dioxygenase subunit beta [EC:1.14.12.25] 3
38 K00499 0 0 0 0 0 2 CMO; choline monooxygenase [EC:1.14.15.7] 2
48 K00529 0 0 0 0 0 2 hcaD; 3-phenylpropionate/trans-cinnamate dioxygenase ferredoxin reductase component [EC:1.18.1.3] 2
58 K03863 0 1 1 1 2 4 vanB; vanillate monooxygenase ferredoxin subunit 2
59 K04091 1 4 2 3 7 9 ssuD; alkanesulfonate monooxygenase [EC:1.14.14.5] 2
253 K16048 0 0 0 0 0 2 hsaB; 3-hydroxy-9,10-secoandrosta-1,3,5(10)-triene-9,17-dione monooxygenase reductase component [EC:1.5.1.-] 2
1 K00446 0 0 0 0 0 1 dmpB, xylE; catechol 2,3-dioxygenase [EC:1.13.11.2] 1
6 K00452 0 0 0 0 0 1 HAAO; 3-hydroxyanthranilate 3,4-dioxygenase [EC:1.13.11.6] 1
53 K03379 0 0 0 0 0 1 chnB; cyclohexanone monooxygenase [EC:1.14.13.22] 1
118 K10619 0 0 0 0 0 1 cmtAb; p-cumate 2,3-dioxygenase subunit alpha [EC:1.14.12.25] 1
120 K10676 0 0 0 0 0 1 tfdB; 2,4-dichlorophenol 6-monooxygenase [EC:1.14.13.20] 1
177 K14581 0 0 0 0 0 1 nahAa, nagAa, ndoR, nbzAa, dntAa; naphthalene 1,2-dioxygenase ferredoxin reductase component [EC:1.18.1.7] 1
219 K15751 0 0 0 0 0 1 carAa; carbazole 1,9a-dioxygenase [EC:1.14.12.22] 1
2 K00448 1 1 1 0 1 1 pcaG; protocatechuate 3,4-dioxygenase, alpha subunit [EC:1.13.11.3] 0
3 K00449 1 1 1 0 1 1 pcaH; protocatechuate 3,4-dioxygenase, beta subunit [EC:1.13.11.3] 0
5 K00451 1 1 1 0 1 1 HGD, hmgA; homogentisate 1,2-dioxygenase [EC:1.13.11.5] 0
11 K00457 2 2 2 0 2 2 HPD, hppD; 4-hydroxyphenylpyruvate dioxygenase [EC:1.13.11.27] 0
13 K00459 2 3 2 1 4 4 ncd2, npd; nitronate monooxygenase [EC:1.13.12.16] 0
26 K00481 1 1 1 0 1 1 pobA; p-hydroxybenzoate 3-monooxygenase [EC:1.14.13.2] 0
49 K01633 1 1 1 0 1 1 folB; 7,8-dihydroneopterin aldolase/epimerase/oxygenase [EC:4.1.2.25 5.1.99.8 1.13.11.81] 0
75 K05599 0 1 1 1 2 2 antA; anthranilate 1,2-dioxygenase (deaminating, decarboxylating) large subunit [EC:1.14.12.1] 0
84 K05916 1 1 1 0 1 1 hmp, YHB1; nitric oxide dioxygenase [EC:1.14.12.17] 0
91 K07215 1 1 1 0 1 1 pigA, hemO; heme oxygenase (biliverdin-IX-beta and delta-forming) [EC:1.14.99.58] 0
105 K08967 1 1 1 0 1 1 mtnD, mtnZ, ADI1; 1,2-dihydroxy-3-keto-5-methylthiopentene dioxygenase [EC:1.13.11.53 1.13.11.54] 0
114 K10253 1 1 1 0 1 1 K10253; DOPA 4,5-dioxygenase [EC:1.14.99.-] 0
236 K15777 1 1 1 0 1 1 DOPA; 4,5-DOPA dioxygenase extradiol [EC:1.13.11.-] 0
## There is a set of 28 gene families which exist in copy numbers well above what are typically seen in Pseudomonads

.