HADase and related proteins in metagenomes

1 Introduction

The prevalence of HADases in prokaryotic genomes has been explored in and demonstrated in other projects. Using phylogenetic methods, I have previously shown that K01560 is a strong tool for delineating true HADases. The goal of this document is to lay the groundwork for exploring the prevalence of HADases in real environmental systems, both from a proportion of genomes perspective and from a gene-expression perspective.

2 HADase in metagenomes

2.1 Single-copy gene selection

Single-copy genes allow us to estimate genome copies present in a given metagenome.

2.1.1 RpoB

Kostas’s group has used RpoB in the past. Oddly, the KEGG system contains four RpoB KO’s:

  • K03043 (the highest number in metagenomes)
  • K03044
  • K03045
  • K03046 (about the same as K03043)

K03046 is actually rpoC, but is mis-named in the KEGG system. This explains why in all metagenomes, the K03043 to K03046 ratio is ~1.

That leaves two additional KO numbers. According the the KEGG database descriptions, these (03044 and 03045) are Archaeal. Therefore, I will sum the first three for future work.

However, these are likely better defined by COG families

2.1.2 Expanding universal single-copy gene set

http://science.sciencemag.org/content/311/5765/1283 This manuscript delineates a set of 31 genes which are universally distributed. this is from 2006 however. This has since been updated in https://peerj.com/articles/243/, who propose a set of 37 “elite” marker gene families that have largely congruent phylogenetic histories. These genes are somewhat described here: https://phylosift.wordpress.com/tutorials/scripts-markers/. However, there is no indication of appropriate annotations for them.

I will use the COG system, since it is very trustworthy when it comes to well-described genes and does so in an orthology-driven manner. If I require unambivalent mappings of the checkM set to COG, I end up with a set of 19 COGs:

marker.set <- read.csv("phylosift.markers.csv", header = T, stringsAsFactors = F)
datatable(marker.set)

RpoB is NOT a part of this set.

2.2 Building a representative set of metagenomes

Obtaining function copy numbers is NOT something that can reasonably be performed on a HUGE number of metagenomes. There is only one IMG feature, “Abundance Profile Overview”, that can bulk extrapolate not just presence of COGs/KEGGs or number of distinct genes present, but also provide copy number, which is crucial for this analysis. Based on a converastion with team members and on personal discretion, several categories of metagenomes were defined (listed in the tables below). Three high-quality metagenomes were selected to represent each category. The metagenomes used are listed below;

mgenome.metadata <- read.csv("IMG.mgenome.data.tab", sep = "\t", header = T)
colnames(mgenome.metadata)[6] <- "Genome.Name"
mgenome.metadata.short <- mgenome.metadata[c(3,1,5,6)]
datatable(mgenome.metadata.short)

2.2.1 Summarizing the metagenomes

The categories used, average gene count for metagenomes in each category, and estimated distinct microbial genomes (assuming ~3000 genes per genome) are listed below.

library(dplyr)
mgenome.metadata.type <- mgenome.metadata %>% 
  group_by(Category) %>% summarise(n())

mgenome.metadata.genes <- mgenome.metadata %>% 
  group_by(Category) %>% summarise(average.gene.count = round(mean(Gene.Count.....assembled)))

mgenome.metadata.type <- dplyr::left_join(mgenome.metadata.type,mgenome.metadata.genes)
mgenome.metadata.type$est.distinct.genomes <- round(mgenome.metadata.type$average.gene.count / 3000)

head(mgenome.metadata.type, n=16)
## # A tibble: 16 x 4
##    Category                       `n()` average.gene.count est.distinct.genomes
##    <fct>                          <int>              <dbl>                <dbl>
##  1 Aerobic.Municipal.Wastewater       3            2746159                  915
##  2 Anaerobic.Municipal.Wastewater     3            1887023                  629
##  3 Attine.Fungal.Garden               3             404119                  135
##  4 Deep.Subsurface                    4             990656                  330
##  5 Freshwater.Sediment                3            5720044                 1907
##  6 Freshwater.Surface                 3            4169331                 1390
##  7 Groundwater                        3            3032083                 1011
##  8 Human.Gut                          3             730968                  244
##  9 Marine.Pelagic                     3            1083774                  361
## 10 Marine.Sediment                    3            8205493                 2735
## 11 Peatland.Bog                       4            1558091                  519
## 12 Rhizosphere                        3            5760309                 1920
## 13 Rumen                              3            8137436                 2712
## 14 Soil.Field.and.Agriculture         3            6428998                 2143
## 15 Soil.Forest                        3           18136866                 6046
## 16 Termite                            3            3194838                 1065

2.3 Calculating genome copies in each metagenome

2.3.1 Gathering COG function tables

The copy number of all genes with a given annotation are gathered and then filtered by the appropriate annotation

2.3.2 COG copy number

COG is gathered and then filtered using the single-copy list defined above. This is used to calculate a range of genome copy number estimations

library(stringr)
mgenome.COG <- read.csv("HAD.mgenome.COG.counts.tab", sep="\t", header = T)
mgenome.COG <- mgenome.COG[mgenome.COG$Func_id %in% unlist(marker.set$COG.id),]
mgenome.COG <- mgenome.COG[-2]
mgenome.COG <- t(mgenome.COG)
datatable(mgenome.COG)
write.csv(mgenome.COG, "mgenome.COG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair

2.3.2.1 COG copy number consistency

How consistent are the copy number estimations for these genes? Ideally, they are very similar to one another within a given metagenome.

library(tidyr)
library(ggplot2)
mgenome.COG <- read.csv("mgenome.COG.clean.csv", header=T)
mgenome.category <- mgenome.metadata.short[c(4,1)]
mgenome.cat.COG <- left_join(mgenome.category, mgenome.COG, by="Genome.Name")
mgenome.cat.COG[c(1,2)] <- mgenome.cat.COG[c(2,1)]
colnames(mgenome.cat.COG)[c(1,2)] <- colnames(mgenome.cat.COG)[c(2,1)]
x <- ncol(mgenome.cat.COG)
mgenome.cat.COG$mean <- round(rowMeans( mgenome.cat.COG[c(3:x)]))
mgenome.cat.COG$SD <- apply (  mgenome.cat.COG[c(3:x)], 1, sd)
mgenome.cat.COG$SD <- round(mgenome.cat.COG$SD)

mgenome.COG.summary <- ggplot(mgenome.cat.COG, aes(x=Category, y=mean, fill=Genome.Name)) +
  geom_bar(position = position_dodge(), stat="identity",color="black") +
  guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
  scale_y_continuous(trans='log10') +
  ylab("mean genome copy estimate") +
  xlab("metagenome source environment") +
  geom_errorbar(width = 0.2, aes(ymax = mean + SD, ymin=mean - SD), position = position_dodge(0.9))
  

mgenome.COG.summary

long.mgenome.COG <- mgenome.cat.COG  %>%   gather("COG", "n", 3:ncol(mgenome.cat.COG))

#mgenome.means
datatable((mgenome.cat.COG)[c(1,22,23)])

3 Obtaining HADase gene counts

3.1 Gathering K01560 sequences

This proceeds similarly as above, but is based on the KEGG system. This system was chosen because K01560 was shown to be the only ortholog group in any database that consistently detects proven HADases

library(stringr)
#the original IMG annotation overview file is find-and-replaced, "KO:" -> ""
KEGG.dehalo.list <- c("K01560","K01561","K01563")
mgenome.KEGG <- read.csv("HAD.mgenome.KEGG.counts.tab", sep="\t", header = T)
mgenome.KEGG <- mgenome.KEGG[mgenome.KEGG$Func_id %in% KEGG.dehalo.list,]
mgenome.KEGG <- t(mgenome.KEGG)
datatable(mgenome.KEGG)
write.csv(mgenome.KEGG, "mgenome.KEGG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair

3.2 Filtering K01560 with hmm HADase model

In the 200401 hmm.build pipeline, I built and tested an HADase hmm model, itself built on a phylogenetic reconstruction of all genome-derived K01560 with delineation of the HADase clade informed by proven HADases, both from the literature and from our own work. This model, when applying a bitscore cutoff of 140, has rougly 0.5% false negative and positive detection rates. This model must be applied to the K01560 genes identified in the metagenomes.

In my original analysis, I used the IMG system to provide copy numbers directly via Compare Genomes > Abundance Profiles > All Functions > KO, and harvest read counts directly from the output matrix. In this analysis I have to of course have the sequences in-hand. Thus, I will simply use the funciton search feature

I’ve saved the new K01560 gene sets in IMG as “K01560.env.seqs.setA/B”

I then download the gene tables with copy number displayed and download the protein multifastas

3.2.1 hmmscan

The two environmental protein sets are scanned against the HADase hmm model

mkdir hmmscan

hmmscan -o hmmscan/hmm.logfileA.txt \
--tblout hmmscan/resultsA.tbl.txt \
--noali \
--notextw \
--cpu 6 \
../200401.hmm.build/likely.hadase.hmm \
metagenome.seqs/K01560.env.seqs.setA.faa

hmmscan -o hmmscan/hmm.logfileB.txt \
--tblout hmmscan/resultsB.tbl.txt \
--noali \
--notextw \
--cpu 6 \
../200401.hmm.build/likely.hadase.hmm \
metagenome.seqs/K01560.env.seqs.setB.faa

The results are then read in, parsed, filtered by 140 bitscore, used to filter the total gene table, which is then used to produce HADase gene copies per metagenome

hmmscan.A <- read.table("hmmscan/resultsA.tbl.txt")
hmmscan.A <- hmmscan.A[c(1,3,5,6)]
hmmscan.B <- read.table("hmmscan/resultsB.tbl.txt")
hmmscan.B <- hmmscan.B[c(1,3,5,6)]
hmmscan <- rbind(hmmscan.A, hmmscan.B)
colnames(hmmscan) <- c("proxy","gene","eval","bitscore")

library(ggplot2)

p <- ggplot(hmmscan, aes(x=proxy, y=bitscore)) +
  geom_jitter(size=0.1) +
  geom_hline( yintercept = 140, color="red")

p

hmm.pos <- filter(hmmscan, bitscore >=140)

print(paste("of the ",nrow(hmmscan)," K01560 genes, ",nrow(hmm.pos)," were determined to be likely HADases based on the HADase hmm model",sep = ""))
## [1] "of the 20533 K01560 genes, 3142 were determined to be likely HADases based on the HADase hmm model"

It is surprising that the results are so smooth and evenly distributed. What this says to me is that many gene fragments are being thrown away by this procedure.

3.2.2 Generating the filtered K01560 -> likely HADase table

mgenome.KEGG.A <- read.csv("metagenome.seqs/K01560.env.seqs.setA.xls", sep="\t")
mgenome.KEGG.B <- read.csv("metagenome.seqs/K01560.env.seqs.setB.xls", sep="\t")
mgenome.KEGG.all <- rbind(mgenome.KEGG.A, mgenome.KEGG.B)
colnames(hmm.pos)[2] <- "Locus.Tag"
HAD.env <- left_join(hmm.pos,mgenome.KEGG.all)
mgenome.KEGG.all <- HAD.env %>% group_by(Genome.Name) %>%
  summarize(HAD = sum(Scaffold.Read.Depth))

3.3 Calculating HADase per genome copy

Something was lost along the way with the termite samples; remove them

All other NA values for HAD are truly zero and have to be fixed

#mgenome.KEGG <- read.csv("mgenome.KEGG.clean.csv", header=T)
#colnames(mgenome.KEGG)[1] <- "Genome.Name"
mgenome.KEGG <- data.frame(mgenome.KEGG.all)
mgenome.means <- mgenome.cat.COG[c(2,22,23)]
mgenome.KEGG <- left_join(mgenome.means, mgenome.KEGG, by = "Genome.Name")
mgenome.KEGG <- mgenome.KEGG[-c(48:50),]
mgenome.KEGG[is.na(mgenome.KEGG)] <- 0
mgenome.KEGG$HAD.per.genome <- round(mgenome.KEGG$HAD / mgenome.KEGG$mean, 2)
mgenome.KEGG.cat <- left_join(mgenome.category, mgenome.KEGG, by = "Genome.Name")
mgenome.KEGG.cat <- mgenome.KEGG.cat[-c(48:50),]
datatable(mgenome.KEGG.cat)

3.4 HADase per genome copy

mgenome.HAD <- ggplot(mgenome.KEGG.cat, aes(x=Category, y=HAD.per.genome, fill=Genome.Name)) +
  theme_linedraw() +
  geom_bar(position = position_dodge(), stat="identity",color="black") +
  guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
  ylab("HADase per genome copy") +
  xlab("metagenome source environment") 

mgenome.HAD

This figure will need to be tweaked and improved

These are suprisingly high numbers. In my previous work using an rpoB + rpoC standard (K03043+K03046), which was certainly somewhat mistakenly high due to mistaken inclusion of two large orthologs, I had much lower numbers, certainly lower than 1 per genome. Is RpoB actually a multi-copy gene? Initial investigations suggets yes.

4 HADase in metatranscriptomes

The IMG database also has a large number of metatranscriptomes. There is an important issue to resolve first though:

  • Are the metatranscriptomes clearly associated with a metagenome?

Taking the JGI project Gs0053071, “Peatlands soil microbial communities from Germany and Austria, that are sulfate reducing”, I have to say no, they are not paired. The metatranscriptomes seem to be handled de-novo using the same pipeline as the metagenomes. That means that I should simply locate, again, high-quality representative metatranscriptomes and move forward.

4.1 How to standardize representations of HADase gene expression

4.1.1 Standardize to particular functional genes

First, check expression levels of the single-copy genes. In a single trial, the expression of the ribosomal proteins were very consistent, ranging from 378 to 786. Other single-copy genes are less stable, which makes good sense.

Given that we cannot make any wide-ranging statements as to what types of metabolism might be underway in any given sample, ribosomal proteins makes perfect sense. All prokaryotes use them if they are metabolically active, they are well-annotated. In two metatranscriptomes, one peat and one in arctic permafrost (!), we see about 0.2 HADase mRNA for each ribosomal protein.

In both systems, the ribosomal proteins are among the most highly expressed, in the top 2 - 10% of all functions.

4.1.2 Conclusion

Use the ribosomal protein subset of the single-copy COG list.

4.2 Normalizing HADase expression

A subset of the 19 single-copy genes consisting only of ribosomal proteins will be used for initial normalization.

mtrans.marker.set <- read.csv("phylosift.markers.mtransciptome.csv", header = T, stringsAsFactors = F)
datatable(mtrans.marker.set)

4.3 Creating a metatranscriptome set

This is built using the same guidelines where possible. Some categories are not be possible due to simple lack of availability issues.

  • Attine fungal gardens are excluded due to lack of availability

  • Aerobic municipal wastewater is three samples from the same project

  • There are no human gut samples due to none in IMG system

  • Only a single Termite sample is available

mtranscriptome.metadata <- read.csv("mtranscriptome.IMG.csv", header = T)
colnames(mtranscriptome.metadata)[4] <- "Genome.Name"
mtranscriptome.metadata.short <- mtranscriptome.metadata[c(2,1,3,4)]
datatable(mtranscriptome.metadata.short)

4.4 Category summary

library(dplyr)
mtranscriptome.metadata.type <- mtranscriptome.metadata %>% 
  group_by(Category) %>% summarise(n())

mtranscriptome.metadata.genes <- mtranscriptome.metadata %>% 
  group_by(Category) %>% summarise(average.gene.count = round(mean(Gene.Count.....assembled)))

mtranscriptome.metadata.type <- dplyr::left_join(mtranscriptome.metadata.type,mtranscriptome.metadata.genes)
mtranscriptome.metadata.type$est.distinct.genomes <- round(mtranscriptome.metadata.type$average.gene.count / 3000)

head(mtranscriptome.metadata.type, n=16)
## # A tibble: 13 x 4
##    Category                       `n()` average.gene.count est.distinct.genomes
##    <fct>                          <int>              <dbl>                <dbl>
##  1 Aerobic.Municipal.Wastewater       3             419294                  140
##  2 Anaerobic.Municipal.Wastewater     3             438360                  146
##  3 Freshwater.Sediment                3            1465509                  489
##  4 Freshwater.Surface                 4            1349119                  450
##  5 Groundwater                        4            1320186                  440
##  6 Marine.Pelagic                     3            1290780                  430
##  7 Marine.Sediment                    3            1188308                  396
##  8 Peatland.Bog                       2             802818                  268
##  9 Rhizosphere                        3            2428798                  810
## 10 Rumen                              3            1541073                  514
## 11 Soil.Field.and.Agriculture         3            1207581                  403
## 12 Soil.Forest                        3            3310079                 1103
## 13 Termite                            1              92185                   31

4.5 Calculating ribosomal protein transcript counts

4.5.1 Gathering COG function tables

The copy number of all genes with a given annotation are gathered and then filtered by the appropriate annotation

4.5.2 COG copy number

COG is gathered and then filtered using the ribosomal protein annotation list defined above. This is used as a standard with which to normalize HAD transcript counts.

library(stringr)
mtranscriptome.COG <- read.csv("mtranscriptome.COG.tab", sep="\t", header = T)
mtranscriptome.COG <- mtranscriptome.COG[mtranscriptome.COG$Func_id %in% unlist(marker.set$COG.id),]
mtranscriptome.COG <- mtranscriptome.COG[-2]
mtranscriptome.COG <- t(mtranscriptome.COG)
datatable(mtranscriptome.COG)
write.csv(mtranscriptome.COG, "mtranscriptome.COG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair

4.6 COG ribosomal protein transcript count consistency

How consistent are the copy number estimations for these genes? Ideally, they are very similar to one another within a given metagenome.

library(tidyr)
library(ggplot2)
mtranscriptome.COG <- read.csv("mtranscriptome.COG.clean.csv", header=T)
mtranscriptome.category <- mtranscriptome.metadata.short[c(4,1)]
mtranscriptome.cat.COG <- left_join(mtranscriptome.category, mtranscriptome.COG, by="Genome.Name")
mtranscriptome.cat.COG[c(1,2)] <- mtranscriptome.cat.COG[c(2,1)]
colnames(mtranscriptome.cat.COG)[c(1,2)] <- colnames(mtranscriptome.cat.COG)[c(2,1)]
x <- ncol(mtranscriptome.cat.COG)
mtranscriptome.cat.COG$mean <- round(rowMeans( mtranscriptome.cat.COG[c(3:x)]))
mtranscriptome.cat.COG$SD <- apply (  mtranscriptome.cat.COG[c(3:x)], 1, sd)
mtranscriptome.cat.COG$SD <- round(mtranscriptome.cat.COG$SD)

mtranscriptome.COG.summary <- ggplot(mtranscriptome.cat.COG, aes(x=Category, y=mean, fill=Genome.Name)) +
  geom_bar(position = position_dodge(), stat="identity",color="black") +
  guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
  scale_y_continuous(trans='log10') +
  ggtitle("mean ribosomal protein gene transcript estimates") +
  ylab("mean mRNA count") +
  xlab("metatranscriptome source environment") +
  geom_errorbar(width = 0.2, aes(ymax = mean + SD, ymin=mean - SD), position = position_dodge(0.9))
  

mtranscriptome.COG.summary

long.mtranscriptome.COG <- mtranscriptome.cat.COG  %>%   gather("COG", "n", 3:ncol(mtranscriptome.cat.COG))

datatable((mtranscriptome.cat.COG)[c(1,22,23)])

4.7 Obtaining HADase transcript counts

This proceeds similarly as it was for metagenomes

library(stringr)
#the original IMG annotation overview file is find-and-replaced, "KO:" -> ""
KEGG.dehalo.list <- c("K01560","K01561","K01563")
mtranscriptome.KEGG <- read.csv("mtranscriptome.KEGG.tab", sep="\t", header = T)
mtranscriptome.KEGG <- mtranscriptome.KEGG[mtranscriptome.KEGG$Func_id %in% KEGG.dehalo.list,]
mtranscriptome.KEGG <- t(mtranscriptome.KEGG)
datatable(mtranscriptome.KEGG)
write.csv(mtranscriptome.KEGG, "mtranscriptome.KEGG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair

4.8 Calculating HADase per average ribosomal protein transcript

mtranscriptome.KEGG <- read.csv("mtranscriptome.KEGG.clean.csv", header=T)
colnames(mtranscriptome.KEGG)[1] <- "Genome.Name"
mtranscriptome.means <- mtranscriptome.cat.COG[c(2,22,23)]
mtranscriptome.KEGG <- left_join(mtranscriptome.KEGG, mtranscriptome.means, by="Genome.Name")
mtranscriptome.KEGG$K01560.per.transcriptome <- round(mtranscriptome.KEGG$K01560 / mtranscriptome.KEGG$mean, 2)
mtranscriptome.KEGG$K01561.per.transcriptome <- round(mtranscriptome.KEGG$K01561 / mtranscriptome.KEGG$mean, 2)
mtranscriptome.KEGG$K01563.per.transcriptome <- round(mtranscriptome.KEGG$K01563 / mtranscriptome.KEGG$mean, 2)
mtranscriptome.KEGG.cat <- left_join(mtranscriptome.category, mtranscriptome.KEGG, by = "Genome.Name")
datatable(mtranscriptome.KEGG.cat)

4.9 HADase transcripts per ribosomal protein transcript

mtranscriptome.K01560 <- ggplot(mtranscriptome.KEGG.cat, aes(x=Category, y=K01560.per.transcriptome, fill=Genome.Name)) +
  geom_bar(position = position_dodge(), stat="identity",color="black") +
  guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
  ggtitle("HADase transcripts per ribosomal protein transcript") +
  ylab("K01560 mRNA / ribosomal prot. mRNA") +
  xlab("metatranscriptome source environment") 

mtranscriptome.K01560

4.10 Fluoroacetate dehalogenase

mtranscriptome.K01561 <- ggplot(mtranscriptome.KEGG.cat, aes(x=Category, y=K01561.per.transcriptome, fill=Genome.Name)) +
  geom_bar(position = position_dodge(), stat="identity",color="black") +
  guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
  ggtitle("FADase transcripts per ribosomal protein transcript") +
  ylab("K01561 mRNA / ribosomal prot. mRNA") +
  xlab("metatranscriptome source environment") 

mtranscriptome.K01561

4.11 Haloalkane dehalogenase

mtranscriptome.K01563 <- ggplot(mtranscriptome.KEGG.cat, aes(x=Category, y=K01561.per.transcriptome, fill=Genome.Name)) +
  geom_bar(position = position_dodge(), stat="identity",color="black") +
  guides(fill=FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
  ggtitle("Haloalkane dehalogenase transcripts per ribosomal protein transcript") +
  ylab("K01563 mRNA / ribosomal prot. mRNA") +
  xlab("metatranscriptome source environment") 

mtranscriptome.K01563