HADase and related proteins in metagenomes
1 Introduction
The prevalence of HADases in prokaryotic genomes has been explored in and demonstrated in other projects. Using phylogenetic methods, I have previously shown that K01560 is a strong tool for delineating true HADases. The goal of this document is to lay the groundwork for exploring the prevalence of HADases in real environmental systems, both from a proportion of genomes perspective and from a gene-expression perspective.
2 HADase in metagenomes
2.1 Single-copy gene selection
Single-copy genes allow us to estimate genome copies present in a given metagenome.
2.1.1 RpoB
Kostas’s group has used RpoB in the past. Oddly, the KEGG system contains four RpoB KO’s:
- K03043 (the highest number in metagenomes)
- K03044
- K03045
- K03046 (about the same as K03043)
K03046 is actually rpoC, but is mis-named in the KEGG system. This explains why in all metagenomes, the K03043 to K03046 ratio is ~1.
That leaves two additional KO numbers. According the the KEGG database descriptions, these (03044 and 03045) are Archaeal. Therefore, I will sum the first three for future work.
However, these are likely better defined by COG families
2.1.2 Expanding universal single-copy gene set
http://science.sciencemag.org/content/311/5765/1283 This manuscript delineates a set of 31 genes which are universally distributed. this is from 2006 however. This has since been updated in https://peerj.com/articles/243/, who propose a set of 37 “elite” marker gene families that have largely congruent phylogenetic histories. These genes are somewhat described here: https://phylosift.wordpress.com/tutorials/scripts-markers/. However, there is no indication of appropriate annotations for them.
I will use the COG system, since it is very trustworthy when it comes to well-described genes and does so in an orthology-driven manner. If I require unambivalent mappings of the checkM set to COG, I end up with a set of 19 COGs:
marker.set <- read.csv("phylosift.markers.csv", header = T, stringsAsFactors = F)
datatable(marker.set)RpoB is NOT a part of this set.
2.2 Building a representative set of metagenomes
Obtaining function copy numbers is NOT something that can reasonably be performed on a HUGE number of metagenomes. There is only one IMG feature, “Abundance Profile Overview”, that can bulk extrapolate not just presence of COGs/KEGGs or number of distinct genes present, but also provide copy number, which is crucial for this analysis. Based on a converastion with team members and on personal discretion, several categories of metagenomes were defined (listed in the tables below). Three high-quality metagenomes were selected to represent each category. The metagenomes used are listed below;
mgenome.metadata <- read.csv("IMG.mgenome.data.tab", sep = "\t", header = T)
colnames(mgenome.metadata)[6] <- "Genome.Name"
mgenome.metadata.short <- mgenome.metadata[c(3,1,5,6)]
datatable(mgenome.metadata.short)2.2.1 Summarizing the metagenomes
The categories used, average gene count for metagenomes in each category, and estimated distinct microbial genomes (assuming ~3000 genes per genome) are listed below.
library(dplyr)
mgenome.metadata.type <- mgenome.metadata %>%
group_by(Category) %>% summarise(n())
mgenome.metadata.genes <- mgenome.metadata %>%
group_by(Category) %>% summarise(average.gene.count = round(mean(Gene.Count.....assembled)))
mgenome.metadata.type <- dplyr::left_join(mgenome.metadata.type,mgenome.metadata.genes)
mgenome.metadata.type$est.distinct.genomes <- round(mgenome.metadata.type$average.gene.count / 3000)
head(mgenome.metadata.type, n=16)## # A tibble: 16 x 4
## Category `n()` average.gene.count est.distinct.genomes
## <fct> <int> <dbl> <dbl>
## 1 Aerobic.Municipal.Wastewater 3 2746159 915
## 2 Anaerobic.Municipal.Wastewater 3 1887023 629
## 3 Attine.Fungal.Garden 3 404119 135
## 4 Deep.Subsurface 4 990656 330
## 5 Freshwater.Sediment 3 5720044 1907
## 6 Freshwater.Surface 3 4169331 1390
## 7 Groundwater 3 3032083 1011
## 8 Human.Gut 3 730968 244
## 9 Marine.Pelagic 3 1083774 361
## 10 Marine.Sediment 3 8205493 2735
## 11 Peatland.Bog 4 1558091 519
## 12 Rhizosphere 3 5760309 1920
## 13 Rumen 3 8137436 2712
## 14 Soil.Field.and.Agriculture 3 6428998 2143
## 15 Soil.Forest 3 18136866 6046
## 16 Termite 3 3194838 1065
2.3 Calculating genome copies in each metagenome
2.3.1 Gathering COG function tables
The copy number of all genes with a given annotation are gathered and then filtered by the appropriate annotation
2.3.2 COG copy number
COG is gathered and then filtered using the single-copy list defined above. This is used to calculate a range of genome copy number estimations
library(stringr)
mgenome.COG <- read.csv("HAD.mgenome.COG.counts.tab", sep="\t", header = T)
mgenome.COG <- mgenome.COG[mgenome.COG$Func_id %in% unlist(marker.set$COG.id),]
mgenome.COG <- mgenome.COG[-2]
mgenome.COG <- t(mgenome.COG)
datatable(mgenome.COG)write.csv(mgenome.COG, "mgenome.COG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair2.3.2.1 COG copy number consistency
How consistent are the copy number estimations for these genes? Ideally, they are very similar to one another within a given metagenome.
library(tidyr)
library(ggplot2)
mgenome.COG <- read.csv("mgenome.COG.clean.csv", header=T)
mgenome.category <- mgenome.metadata.short[c(4,1)]
mgenome.cat.COG <- left_join(mgenome.category, mgenome.COG, by="Genome.Name")
mgenome.cat.COG[c(1,2)] <- mgenome.cat.COG[c(2,1)]
colnames(mgenome.cat.COG)[c(1,2)] <- colnames(mgenome.cat.COG)[c(2,1)]
x <- ncol(mgenome.cat.COG)
mgenome.cat.COG$mean <- round(rowMeans( mgenome.cat.COG[c(3:x)]))
mgenome.cat.COG$SD <- apply ( mgenome.cat.COG[c(3:x)], 1, sd)
mgenome.cat.COG$SD <- round(mgenome.cat.COG$SD)
mgenome.COG.summary <- ggplot(mgenome.cat.COG, aes(x=Category, y=mean, fill=Genome.Name)) +
geom_bar(position = position_dodge(), stat="identity",color="black") +
guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
scale_y_continuous(trans='log10') +
ylab("mean genome copy estimate") +
xlab("metagenome source environment") +
geom_errorbar(width = 0.2, aes(ymax = mean + SD, ymin=mean - SD), position = position_dodge(0.9))
mgenome.COG.summarylong.mgenome.COG <- mgenome.cat.COG %>% gather("COG", "n", 3:ncol(mgenome.cat.COG))
#mgenome.means
datatable((mgenome.cat.COG)[c(1,22,23)])3 Obtaining HADase gene counts
3.1 Gathering K01560 sequences
This proceeds similarly as above, but is based on the KEGG system. This system was chosen because K01560 was shown to be the only ortholog group in any database that consistently detects proven HADases
library(stringr)
#the original IMG annotation overview file is find-and-replaced, "KO:" -> ""
KEGG.dehalo.list <- c("K01560","K01561","K01563")
mgenome.KEGG <- read.csv("HAD.mgenome.KEGG.counts.tab", sep="\t", header = T)
mgenome.KEGG <- mgenome.KEGG[mgenome.KEGG$Func_id %in% KEGG.dehalo.list,]
mgenome.KEGG <- t(mgenome.KEGG)
datatable(mgenome.KEGG)write.csv(mgenome.KEGG, "mgenome.KEGG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair3.2 Filtering K01560 with hmm HADase model
In the 200401 hmm.build pipeline, I built and tested an HADase hmm model, itself built on a phylogenetic reconstruction of all genome-derived K01560 with delineation of the HADase clade informed by proven HADases, both from the literature and from our own work. This model, when applying a bitscore cutoff of 140, has rougly 0.5% false negative and positive detection rates. This model must be applied to the K01560 genes identified in the metagenomes.
In my original analysis, I used the IMG system to provide copy numbers directly via Compare Genomes > Abundance Profiles > All Functions > KO, and harvest read counts directly from the output matrix. In this analysis I have to of course have the sequences in-hand. Thus, I will simply use the funciton search feature
I’ve saved the new K01560 gene sets in IMG as “K01560.env.seqs.setA/B”
I then download the gene tables with copy number displayed and download the protein multifastas
3.2.1 hmmscan
The two environmental protein sets are scanned against the HADase hmm model
mkdir hmmscan
hmmscan -o hmmscan/hmm.logfileA.txt \
--tblout hmmscan/resultsA.tbl.txt \
--noali \
--notextw \
--cpu 6 \
../200401.hmm.build/likely.hadase.hmm \
metagenome.seqs/K01560.env.seqs.setA.faa
hmmscan -o hmmscan/hmm.logfileB.txt \
--tblout hmmscan/resultsB.tbl.txt \
--noali \
--notextw \
--cpu 6 \
../200401.hmm.build/likely.hadase.hmm \
metagenome.seqs/K01560.env.seqs.setB.faa
The results are then read in, parsed, filtered by 140 bitscore, used to filter the total gene table, which is then used to produce HADase gene copies per metagenome
hmmscan.A <- read.table("hmmscan/resultsA.tbl.txt")
hmmscan.A <- hmmscan.A[c(1,3,5,6)]
hmmscan.B <- read.table("hmmscan/resultsB.tbl.txt")
hmmscan.B <- hmmscan.B[c(1,3,5,6)]
hmmscan <- rbind(hmmscan.A, hmmscan.B)
colnames(hmmscan) <- c("proxy","gene","eval","bitscore")
library(ggplot2)
p <- ggplot(hmmscan, aes(x=proxy, y=bitscore)) +
geom_jitter(size=0.1) +
geom_hline( yintercept = 140, color="red")
phmm.pos <- filter(hmmscan, bitscore >=140)
print(paste("of the ",nrow(hmmscan)," K01560 genes, ",nrow(hmm.pos)," were determined to be likely HADases based on the HADase hmm model",sep = ""))## [1] "of the 20533 K01560 genes, 3142 were determined to be likely HADases based on the HADase hmm model"
It is surprising that the results are so smooth and evenly distributed. What this says to me is that many gene fragments are being thrown away by this procedure.
3.2.2 Generating the filtered K01560 -> likely HADase table
mgenome.KEGG.A <- read.csv("metagenome.seqs/K01560.env.seqs.setA.xls", sep="\t")
mgenome.KEGG.B <- read.csv("metagenome.seqs/K01560.env.seqs.setB.xls", sep="\t")
mgenome.KEGG.all <- rbind(mgenome.KEGG.A, mgenome.KEGG.B)
colnames(hmm.pos)[2] <- "Locus.Tag"
HAD.env <- left_join(hmm.pos,mgenome.KEGG.all)
mgenome.KEGG.all <- HAD.env %>% group_by(Genome.Name) %>%
summarize(HAD = sum(Scaffold.Read.Depth))3.3 Calculating HADase per genome copy
Something was lost along the way with the termite samples; remove them
All other NA values for HAD are truly zero and have to be fixed
#mgenome.KEGG <- read.csv("mgenome.KEGG.clean.csv", header=T)
#colnames(mgenome.KEGG)[1] <- "Genome.Name"
mgenome.KEGG <- data.frame(mgenome.KEGG.all)
mgenome.means <- mgenome.cat.COG[c(2,22,23)]
mgenome.KEGG <- left_join(mgenome.means, mgenome.KEGG, by = "Genome.Name")
mgenome.KEGG <- mgenome.KEGG[-c(48:50),]
mgenome.KEGG[is.na(mgenome.KEGG)] <- 0
mgenome.KEGG$HAD.per.genome <- round(mgenome.KEGG$HAD / mgenome.KEGG$mean, 2)
mgenome.KEGG.cat <- left_join(mgenome.category, mgenome.KEGG, by = "Genome.Name")
mgenome.KEGG.cat <- mgenome.KEGG.cat[-c(48:50),]
datatable(mgenome.KEGG.cat)3.4 HADase per genome copy
mgenome.HAD <- ggplot(mgenome.KEGG.cat, aes(x=Category, y=HAD.per.genome, fill=Genome.Name)) +
theme_linedraw() +
geom_bar(position = position_dodge(), stat="identity",color="black") +
guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
ylab("HADase per genome copy") +
xlab("metagenome source environment")
mgenome.HADThis figure will need to be tweaked and improved
These are suprisingly high numbers. In my previous work using an rpoB + rpoC standard (K03043+K03046), which was certainly somewhat mistakenly high due to mistaken inclusion of two large orthologs, I had much lower numbers, certainly lower than 1 per genome. Is RpoB actually a multi-copy gene? Initial investigations suggets yes.
4 HADase in metatranscriptomes
The IMG database also has a large number of metatranscriptomes. There is an important issue to resolve first though:
- Are the metatranscriptomes clearly associated with a metagenome?
Taking the JGI project Gs0053071, “Peatlands soil microbial communities from Germany and Austria, that are sulfate reducing”, I have to say no, they are not paired. The metatranscriptomes seem to be handled de-novo using the same pipeline as the metagenomes. That means that I should simply locate, again, high-quality representative metatranscriptomes and move forward.
4.1 How to standardize representations of HADase gene expression
4.1.1 Standardize to particular functional genes
First, check expression levels of the single-copy genes. In a single trial, the expression of the ribosomal proteins were very consistent, ranging from 378 to 786. Other single-copy genes are less stable, which makes good sense.
Given that we cannot make any wide-ranging statements as to what types of metabolism might be underway in any given sample, ribosomal proteins makes perfect sense. All prokaryotes use them if they are metabolically active, they are well-annotated. In two metatranscriptomes, one peat and one in arctic permafrost (!), we see about 0.2 HADase mRNA for each ribosomal protein.
In both systems, the ribosomal proteins are among the most highly expressed, in the top 2 - 10% of all functions.
4.1.2 Conclusion
Use the ribosomal protein subset of the single-copy COG list.
4.2 Normalizing HADase expression
A subset of the 19 single-copy genes consisting only of ribosomal proteins will be used for initial normalization.
mtrans.marker.set <- read.csv("phylosift.markers.mtransciptome.csv", header = T, stringsAsFactors = F)
datatable(mtrans.marker.set)4.3 Creating a metatranscriptome set
This is built using the same guidelines where possible. Some categories are not be possible due to simple lack of availability issues.
Attine fungal gardens are excluded due to lack of availability
Aerobic municipal wastewater is three samples from the same project
There are no human gut samples due to none in IMG system
Only a single Termite sample is available
mtranscriptome.metadata <- read.csv("mtranscriptome.IMG.csv", header = T)
colnames(mtranscriptome.metadata)[4] <- "Genome.Name"
mtranscriptome.metadata.short <- mtranscriptome.metadata[c(2,1,3,4)]
datatable(mtranscriptome.metadata.short)4.4 Category summary
library(dplyr)
mtranscriptome.metadata.type <- mtranscriptome.metadata %>%
group_by(Category) %>% summarise(n())
mtranscriptome.metadata.genes <- mtranscriptome.metadata %>%
group_by(Category) %>% summarise(average.gene.count = round(mean(Gene.Count.....assembled)))
mtranscriptome.metadata.type <- dplyr::left_join(mtranscriptome.metadata.type,mtranscriptome.metadata.genes)
mtranscriptome.metadata.type$est.distinct.genomes <- round(mtranscriptome.metadata.type$average.gene.count / 3000)
head(mtranscriptome.metadata.type, n=16)## # A tibble: 13 x 4
## Category `n()` average.gene.count est.distinct.genomes
## <fct> <int> <dbl> <dbl>
## 1 Aerobic.Municipal.Wastewater 3 419294 140
## 2 Anaerobic.Municipal.Wastewater 3 438360 146
## 3 Freshwater.Sediment 3 1465509 489
## 4 Freshwater.Surface 4 1349119 450
## 5 Groundwater 4 1320186 440
## 6 Marine.Pelagic 3 1290780 430
## 7 Marine.Sediment 3 1188308 396
## 8 Peatland.Bog 2 802818 268
## 9 Rhizosphere 3 2428798 810
## 10 Rumen 3 1541073 514
## 11 Soil.Field.and.Agriculture 3 1207581 403
## 12 Soil.Forest 3 3310079 1103
## 13 Termite 1 92185 31
4.5 Calculating ribosomal protein transcript counts
4.5.1 Gathering COG function tables
The copy number of all genes with a given annotation are gathered and then filtered by the appropriate annotation
4.5.2 COG copy number
COG is gathered and then filtered using the ribosomal protein annotation list defined above. This is used as a standard with which to normalize HAD transcript counts.
library(stringr)
mtranscriptome.COG <- read.csv("mtranscriptome.COG.tab", sep="\t", header = T)
mtranscriptome.COG <- mtranscriptome.COG[mtranscriptome.COG$Func_id %in% unlist(marker.set$COG.id),]
mtranscriptome.COG <- mtranscriptome.COG[-2]
mtranscriptome.COG <- t(mtranscriptome.COG)
datatable(mtranscriptome.COG)write.csv(mtranscriptome.COG, "mtranscriptome.COG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair4.6 COG ribosomal protein transcript count consistency
How consistent are the copy number estimations for these genes? Ideally, they are very similar to one another within a given metagenome.
library(tidyr)
library(ggplot2)
mtranscriptome.COG <- read.csv("mtranscriptome.COG.clean.csv", header=T)
mtranscriptome.category <- mtranscriptome.metadata.short[c(4,1)]
mtranscriptome.cat.COG <- left_join(mtranscriptome.category, mtranscriptome.COG, by="Genome.Name")
mtranscriptome.cat.COG[c(1,2)] <- mtranscriptome.cat.COG[c(2,1)]
colnames(mtranscriptome.cat.COG)[c(1,2)] <- colnames(mtranscriptome.cat.COG)[c(2,1)]
x <- ncol(mtranscriptome.cat.COG)
mtranscriptome.cat.COG$mean <- round(rowMeans( mtranscriptome.cat.COG[c(3:x)]))
mtranscriptome.cat.COG$SD <- apply ( mtranscriptome.cat.COG[c(3:x)], 1, sd)
mtranscriptome.cat.COG$SD <- round(mtranscriptome.cat.COG$SD)
mtranscriptome.COG.summary <- ggplot(mtranscriptome.cat.COG, aes(x=Category, y=mean, fill=Genome.Name)) +
geom_bar(position = position_dodge(), stat="identity",color="black") +
guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
scale_y_continuous(trans='log10') +
ggtitle("mean ribosomal protein gene transcript estimates") +
ylab("mean mRNA count") +
xlab("metatranscriptome source environment") +
geom_errorbar(width = 0.2, aes(ymax = mean + SD, ymin=mean - SD), position = position_dodge(0.9))
mtranscriptome.COG.summarylong.mtranscriptome.COG <- mtranscriptome.cat.COG %>% gather("COG", "n", 3:ncol(mtranscriptome.cat.COG))
datatable((mtranscriptome.cat.COG)[c(1,22,23)])4.7 Obtaining HADase transcript counts
This proceeds similarly as it was for metagenomes
library(stringr)
#the original IMG annotation overview file is find-and-replaced, "KO:" -> ""
KEGG.dehalo.list <- c("K01560","K01561","K01563")
mtranscriptome.KEGG <- read.csv("mtranscriptome.KEGG.tab", sep="\t", header = T)
mtranscriptome.KEGG <- mtranscriptome.KEGG[mtranscriptome.KEGG$Func_id %in% KEGG.dehalo.list,]
mtranscriptome.KEGG <- t(mtranscriptome.KEGG)
datatable(mtranscriptome.KEGG)write.csv(mtranscriptome.KEGG, "mtranscriptome.KEGG.csv") #this table has to be manipulated in excel; the genome names have been mutilated beyond repair4.8 Calculating HADase per average ribosomal protein transcript
mtranscriptome.KEGG <- read.csv("mtranscriptome.KEGG.clean.csv", header=T)
colnames(mtranscriptome.KEGG)[1] <- "Genome.Name"
mtranscriptome.means <- mtranscriptome.cat.COG[c(2,22,23)]
mtranscriptome.KEGG <- left_join(mtranscriptome.KEGG, mtranscriptome.means, by="Genome.Name")
mtranscriptome.KEGG$K01560.per.transcriptome <- round(mtranscriptome.KEGG$K01560 / mtranscriptome.KEGG$mean, 2)
mtranscriptome.KEGG$K01561.per.transcriptome <- round(mtranscriptome.KEGG$K01561 / mtranscriptome.KEGG$mean, 2)
mtranscriptome.KEGG$K01563.per.transcriptome <- round(mtranscriptome.KEGG$K01563 / mtranscriptome.KEGG$mean, 2)
mtranscriptome.KEGG.cat <- left_join(mtranscriptome.category, mtranscriptome.KEGG, by = "Genome.Name")
datatable(mtranscriptome.KEGG.cat)4.9 HADase transcripts per ribosomal protein transcript
mtranscriptome.K01560 <- ggplot(mtranscriptome.KEGG.cat, aes(x=Category, y=K01560.per.transcriptome, fill=Genome.Name)) +
geom_bar(position = position_dodge(), stat="identity",color="black") +
guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
ggtitle("HADase transcripts per ribosomal protein transcript") +
ylab("K01560 mRNA / ribosomal prot. mRNA") +
xlab("metatranscriptome source environment")
mtranscriptome.K015604.10 Fluoroacetate dehalogenase
mtranscriptome.K01561 <- ggplot(mtranscriptome.KEGG.cat, aes(x=Category, y=K01561.per.transcriptome, fill=Genome.Name)) +
geom_bar(position = position_dodge(), stat="identity",color="black") +
guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
ggtitle("FADase transcripts per ribosomal protein transcript") +
ylab("K01561 mRNA / ribosomal prot. mRNA") +
xlab("metatranscriptome source environment")
mtranscriptome.K015614.11 Haloalkane dehalogenase
mtranscriptome.K01563 <- ggplot(mtranscriptome.KEGG.cat, aes(x=Category, y=K01561.per.transcriptome, fill=Genome.Name)) +
geom_bar(position = position_dodge(), stat="identity",color="black") +
guides(fill=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=c("grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey","grey")) +
ggtitle("Haloalkane dehalogenase transcripts per ribosomal protein transcript") +
ylab("K01563 mRNA / ribosomal prot. mRNA") +
xlab("metatranscriptome source environment")
mtranscriptome.K01563