Kuwait.BTEX.enrichment.mgenomics.rmd

16S analysis results
Depth requirement calculations
Metagenome shotgun library coverage estimations using nonpareil
KBase assembly pipeline characteristics of assembly
MaxBin2
MiGA analysis of bins
Coverage metrics and combination of bins
BowTie strict mapping and retention of unmapped reads
Submission of unassembled reads to MG-RAST for functional and phylogenetic characterization
Assembly of non-mapped reads (no trimming required)

16S amplicon library results

The amplicon library of the BTEX enrichment was processed using qiime2 pipeline with a pre-trained Silva v132 classifier and dada2 actual-sequence-variant denoising. The results clearly revealed a community almost entirely dominated by an Arhodomonas species.

OTU.table <- read.csv("rich.OTU.table.BTEX.csv", header = TRUE)
taxonomy <- data.frame(do.call('rbind', strsplit(as.character(OTU.table$taxonomy),';',fixed=TRUE)))
colnames(taxonomy) <- c("kingdom","phylum","class","order","family","genus","species")
#OTU.table <- cbind(taxonomy, OTU.table[-1])
kable(OTU.table[c(1:3)])

taxonomy	seq.count	relative.abund.
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Nitrococcales;D_4__Nitrococcaceae;D_5__Arhodomonas;D_6__uncultured bacterium	453177	0.9931036
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Clostridiaceae 1;D_5__Clostridium sensu stricto 13;D_6__uncultured bacterium	1276	0.0027963
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;;;;;	591	0.0012951
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Peptococcaceae;D_5__Candidatus Dichloromethanomonas;D_6__uncultured bacterium	201	0.0004405
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__uncultured compost bacterium	191	0.0004186
D_0__Bacteria;D_1__Proteobacteria;D_2__Deltaproteobacteria;D_3__Desulfovibrionales;D_4__Desulfovibrionaceae;D_5__Desulfovibrio;D_6__Desulfovibrio desulfuricans	101	0.0002213
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pseudomonadales;D_4__Pseudomonadaceae;D_5__Pseudomonas;D_6__Pseudomonas fluorescens	93	0.0002038
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;;;	90	0.0001972
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Nitrococcales;D_4__Nitrococcaceae;D_5__Arhodomonas;D_6__uncultured bacterium	78	0.0001709
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;;	75	0.0001644
D_0__Bacteria;D_1__Thermotogae;D_2__Thermotogae;D_3__Petrotogales;D_4__Petrotogaceae;D_5__SC103;D_6__metagenome	45	0.0000986
D_0__Bacteria;D_1__Cloacimonetes;D_2__Cloacimonadia;D_3__Cloacimonadales;;;;	43	0.0000942
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Peptococcaceae;D_5__uncultured;;	39	0.0000855
D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Rikenellaceae;D_5__DMER64;D_6__uncultured bacterium	37	0.0000811
D_0__Bacteria;D_1__Spirochaetes;D_2__Spirochaetia;D_3__Spirochaetales;D_4__Spirochaetaceae;D_5__uncultured;D_6__uncultured bacterium	36	0.0000789
D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Bacteroidetes vadinHA17;D_5__uncultured bacterium;D_6__uncultured bacterium	20	0.0000438
D_0__Bacteria;D_1__Bacteroidetes;D_2__Rhodothermia;D_3__Balneolales;D_4__Balneolaceae;D_5__uncultured;D_6__uncultured bacterium	19	0.0000416
D_0__Bacteria;D_1__Chloroflexi;D_2__Anaerolineae;D_3__Anaerolineales;D_4__Anaerolineaceae;D_5__Longilinea;D_6__uncultured bacterium	18	0.0000394
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Xanthomonadales;D_4__Xanthomonadaceae;D_5__Stenotrophomonas;;	18	0.0000394
D_0__Bacteria;D_1__Proteobacteria;D_2__Deltaproteobacteria;D_3__Syntrophobacterales;D_4__Syntrophaceae;D_5__Smithella;D_6__uncultured bacterium	17	0.0000373
D_0__Bacteria;D_1__Chlamydiae;D_2__LD1-PA32;D_3__uncultured bacterium;D_4__uncultured bacterium;D_5__uncultured bacterium;D_6__uncultured bacterium	17	0.0000373
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;;	16	0.0000351
D_0__Bacteria;D_1__Synergistetes;D_2__Synergistia;D_3__Synergistales;D_4__Synergistaceae;D_5__Lactivibrio;D_6__uncultured bacterium	15	0.0000329
D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;;;;	15	0.0000329
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Oceanospirillales;D_4__Halomonadaceae;D_5__Chromohalobacter;;	14	0.0000307
D_0__Bacteria;D_1__Cloacimonetes;D_2__Cloacimonadia;D_3__Cloacimonadales;;;;	12	0.0000263
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanomicrobia;D_3__Methanomicrobiales;D_4__Methanocorpusculaceae;D_5__Methanocorpusculum;D_6__uncultured archaeon	10	0.0000219
D_0__Bacteria;D_1__Proteobacteria;D_2__Deltaproteobacteria;D_3__Myxococcales;D_4__Phaselicystidaceae;D_5__Phaselicystis;D_6__uncultured bacterium	10	0.0000219
D_0__Bacteria;D_1__Spirochaetes;D_2__Spirochaetia;D_3__Spirochaetales;D_4__Spirochaetaceae;D_5__uncultured;D_6__uncultured bacterium	9	0.0000197
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__DTU014;D_4__uncultured bacterium;D_5__uncultured bacterium;D_6__uncultured bacterium	8	0.0000175
D_0__Bacteria;D_1__Hydrogenedentes;D_2__Hydrogenedentia;D_3__Hydrogenedentiales;D_4__Hydrogenedensaceae;D_5__uncultured bacterium;D_6__uncultured bacterium	7	0.0000153
D_0__Bacteria;;;;;;;	7	0.0000153
D_0__Bacteria;D_1__Atribacteria;D_2__Caldatribacteriia;D_3__Caldatribacteriales;D_4__Caldatribacteriaceae;D_5__Candidatus Caldatribacterium;D_6__uncultured bacterium	6	0.0000131
D_0__Bacteria;D_1__Actinobacteria;D_2__Thermoleophilia;D_3__Gaiellales;D_4__uncultured;D_5__metagenome;D_6__metagenome	4	0.0000088
D_0__Bacteria;D_1__Actinobacteria;D_2__Actinobacteria;D_3__Micrococcales;D_4__Cellulomonadaceae;D_5__Cellulomonas;D_6__uncultured bacterium	3	0.0000066
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Christensenellaceae;D_5__Christensenellaceae R-7 group;;	3	0.0000066
D_0__Bacteria;D_1__Verrucomicrobia;D_2__Verrucomicrobiae;D_3__Pedosphaerales;D_4__Pedosphaeraceae;;;	3	0.0000066

A number of other minor community members are possibly present, although at very low relative abundances. Indeed, the second most abundant community member is present in the amplicon library at 0.3%. This creates significant challenge for shotgun metagenomic analysis, as the vast majority of sequencing reads will be consumed by the dominant species.

Estimating shotgun metagenomics read-depth requirements

We can do a very crude assessment. Assume average prokaryotic genome is 4mb. You need about 10x coverage to confidently assemble. That means, lets say ~40mb of sequencing per genome present. If we assume that the 16S amplicon library accurately represents the community, the second-most abundant species represents 0.3% of the total genomic DNA present. the amount of sequencing required is roughly equivalent to that which would be required to sequence 40mb / 0.003 = 13,333mb or 13.3 gbases (to reach 10X coverage).

A full MiSeq run yields a maximum of roughly 10 gbases. This will yield very approximately 7X coverage of the second-most abundant species; not enough to assemble with any confidence but enough to send down an assembly-independent pipeline.

Note that there is no guarantee that this is the actual sequencing requirement (the amplicon library is imprecise and shotgun sequencing is unpredictable).

https://www.nature.com/articles/srep01968 presents a more formalized approach:

They present an equation for estimating, assuming genus level uniqueness of the target species:

Sequencing required (A) = 0.0706 * (proportion) ^ -0.9959 / 2 A = 0.0706 * (0.003) ^ -0.9959 / 2

This method estimates 11.49 gbases required for 10x coverage, very similar estimate.

The actual metagenome yielded 4.7 gbases following QC (detailed below), thus we can expect ~4X coverage of the second-most abundant community member and thus partial assembly.

Shotgun sequence assembly

Shotgun sequencing of the BTEX enrichment yielded 6.7 Gbases of sequence data.
This was trimmed using Trimmomatic (default settings, sliding window of 6:25). This resulted in 4.7 Gbases. Reads were assembled using SPAdes v 3.12, with “careful” setting and error correction and a minimum contig size of 500bp. Assessing the assembly in QUAST:

Statistics.without.reference	Kuwait2018.mgenome.SPAdes.con.
# contigs	1066.00
# contigs (>= 0 bp)	1066.00
# contigs (>= 1000 bp)	446.00
# contigs (>= 10000 bp)	121.00
# contigs (>= 100000 bp)	1.00
# contigs (>= 1000000 bp)	0.00
Largest contig	179935.00
Total length	5104356.00
Total length (>= 0 bp)	5104356.00
Total length (>= 1000 bp)	4695727.00
Total length (>= 10000 bp)	3664180.00
Total length (>= 100000 bp)	179935.00
Total length (>= 1000000 bp)	0.00
N50	24189.00
N75	8250.00
L50	47.00
L75	140.00
GC (%)	67.96
Mismatches	NA
# N’s	8511.00
# N’s per 100 kbp	166.74
Predicted genes	NA
# predicted genes (unique)	4909.00
# predicted genes (>= 0 bp)	4910.00
# predicted genes (>= 300 bp)	4273.00
# predicted genes (>= 1500 bp)	610.00
# predicted genes (>= 3000 bp)	67.00

Total assembly length is ~5.1mb

Total genome size of the only existing Arhodomonas genome is 3.96mb, implying an extra 1mb of assembly (assuming that this Arhodomonas is similar in that respect)

GC content is also similar to that found in studied Arhodomonas genomes, ~70%:

Binning the metagenome assembly:

The assembly was binned using MaxBin2 v2.2.4. This yielded two bins:

Bins: 2 Input Contigs: 1066 Binned Contigs: 368 (34.5%) Unbinned Contigs: 698 (65.5%) Contigs Too Short: 620 (58.2%) Summed Length of Binned Contigs: 4204219 (82.4%) Summed Length of Unbinned Contigs: 900137 (17.6%) Summed Length of Short Contigs: 408629 (8.0%)

Bin.Name.	Read.Coverage.	GC.Content.	Number.of.Contigs.	Total.Contig.Length.
Bin.001.fasta	0.925	0.698	245	3528385
Bin.002.fasta	0.028	0.674	123	675834

Given the similarity in GC content, what we already know about the community make-up, and the total length of the combined bins (4.2mb) compared to existing Arhodomonas genome (3.96mb), it is assumed that these two bins together represent the Arhodomonas

Comparing the combined BTEX enrichment bin with A. aqueoli genome

Whole genome alignment using Mauve

The genome of A. aqeuoli was downloaded from JGI IMG and annotate using prokka v1.13.3 (https://github.com/tseemann/prokka) under default recommended settings. The combined bin was annotated using the same pipeline. Initial whole genome alignment using Mauve yielded a scattered picture:

Iterative contig realignment made the contiguous regions more evident:

There remains approximately 50kb of the binned genome which is not accounted for in the reference genome. These contigs will be recovered and their functional annotations explored. These might represent novel genomic material in the Arhodomonas or they might represent mis-binned contigs.

The non-aligned contigs are named in this file and will be examined later in the pipeline:

V2	V3	V4	V5	V6
BTEX1_347	chromosome	forward	3718756	3722692
BTEX1_290	chromosome	complement	3722693	3728982
BTEX1_76	chromosome	forward	3728983	3739463
BTEX1_337	chromosome	forward	3739464	3740876
BTEX1_105	chromosome	complement	3740877	3741996
BTEX1_5	chromosome	forward	3741997	3744665
This is only	the first fi	ve of 139 con	tigs which	do not seem to align with the reference genome. These must derive from either:

un-binned Arhodomonas genomic contigs
satellite community member genomes

Average Nucleotide Identity (ANI) calculation

FastANI estimates average nucleotide identity between genomes by calculating pairwise nucleotide identity between homologous genes.

Results: 97.67% identity, out of the total 1218 sequence fragments in the BTEX bin (BTEX1), 1044 were determined to be orthologous to A. aquaeoli.

This is regarded as within-species level of identity based on recent observations described in https://www.biorxiv.org/content/early/2017/11/27/225342

ANI scores between pairs of 90,000 bacterial genomes show that an ANI above 96% is consistent with genomes from the same species

This alignment visualization is based on the FastANI calculations and shows the reference, A. aquaeoli (top) vs. the BTEX enrichment bin (bottom)

Binned genome quality report

Conducted at MiGA web server:

Essential genes found: 98/111.

Completeness: 88.3%.
Contamination: 3.6%.

Multiple copies:

2 tRNA-synt_1d: tRNA synthetases class I (R).
2 TIGR00064: ftsY: signal recognition particle-docking protein FtsY.
2 TIGR00344: alaS: alanine–tRNA ligase.
2 TIGR00436: era: GTP-binding protein Era.

Missing genes:

PGK: Phosphoglycerate kinase.
TIGR00009: L28: ribosomal protein L28.
TIGR00362: DnaA: chromosomal replication initiator protein DnaA.
TIGR00388: glyQ: glycine–tRNA ligase, alpha subunit.
TIGR00389: glyS_dimeric: glycine–tRNA ligase.
TIGR00408: proS_fam_I: proline–tRNA ligase.
TIGR00471: pheT_arch: phenylalanine–tRNA ligase, beta subunit.
TIGR00663: dnan: DNA polymerase III, beta subunit.
TIGR00775: NhaD: Na+/H+ antiporter, NhaD family.
TIGR01030: rpmH_bact: ribosomal protein L34.
TIGR01059: gyrB: DNA gyrase, B subunit.
TIGR01063: gyrA: DNA gyrase, A subunit.
TIGR02387: rpoC1_cyan: DNA-directed RNA polymerase, gamma subunit.

Separating metagenomic reads which are not part of the Arhodomonas BTEX1 genome.

The raw reads were mapped back with very strict requirements to the scaffolds of BTEX1. Reads which did not map back might represent two categories of genomic material: 1 BTEX1 genomic material which is not captured during the binning process and is also not found in the reference Arhodoomonas genome. 2 genomic material from other members of the community.

Bowtie configuration

BowTie2 default parameters are somewhat permissive. Our read-length is maximum 250 bp. The default minimum score is -0.6 + -0.6 * L(ength), or -150.6. Mismatches between high quality reads impose a -6 penalty each. Thus, this default setting allows up to ~25 high quality mismatches, or ~10%. Given the knowledge that the other community members are unlikely to be closely related, this is acceptable.

Thus, the bowtie2 call used was:

bowtie2 -x (index) -1 (forward reads) -2 (reverse reads) –score-min L,-0.6,-0.6 –un-conc (paired reads that fail to map condordantly)

The unmapped reads are expected to be quite a minimal set, representing ~0.003 of the 4.8gbases of trimmed sequencing effort, or 10.5 mbases. However, there are likely still scaffolds representing un-binned parts of the BTEX1 genome also.

paired-read mapping leads to pairs that map only once to BTEX1 ending up in the unmapped sequence bin. This is good for assembly-based analysis, where the paired-end information is essential to good assembly and the contaminating BTEX1 pairs will not have any impact on the assembly. However, this is potentilly very bad for assembly-free pipelines, as BTEX1 reads will still be annotated. The question; How many more reads make it into the unmapped paired data set vs the unmapped unpaired data sets. The difference represents the amount of BTEX1 contamination.

BowTie2 results, paired mapping:

Forward reads: 11154399 reads; of these: 11154399 (100.00%) were unpaired; of these: 1479305 (13.26%) aligned 0 times 8963463 (80.36%) aligned exactly 1 time 711631 (6.38%) aligned >1 times 86.74% overall alignment rate

Reverse reads: 11154399 reads; of these: 11154399 (100.00%) were unpaired; of these: 1468625 (13.17%) aligned 0 times 8971133 (80.43%) aligned exactly 1 time 714641 (6.41%) aligned >1 times 86.83% overall alignment rate

Analyzing novel shotgun metagenome sequences via MG-RAST

Goal of MG-RAST analysis:

We want to confirm who is there based on 16S gene fragment identity
We want to explore what the additional community members might be doing metabolically that is markedly different than the Arhodomonas, i.e. what novel functions do the minor community members bring the the table?
This approach is complicated (unavoidably) by the presence of reads representing un-binned Arhodomonas genome fragments. We are avoiding this by not considering functions in the unmapped read set that are present at greater than 500 (arbitrary cutoff).

Methodology

Two data sets:

Full metagenome library
only the unmapped unpaired reads (via BowTie2)

Annotations which are present at a ratio of at least 0.5 unmapped/full metagenome and are present with less than 500 reads are considered.

Reads processed at MG-RAST without dereplication and without additional quality trimming (data has already been trimmed via Trimmomatic as described above)

Create tables of Silva SSU and RDP non-Arhodomonas taxonomy

Create table of pathways over-represented in the unmapped annotations via KEGG, COG, and Subsystems. Focus on a handful of stories: * putrescine uptake * phosphotransferase modules for carbohydrate uptake

Taxonomy counts

Note that the 16S rRNA gene was NOT included in the BTEX1 genome bin, thus these reads are present in the unmapped read set. Also note that this is not a “classifier” -based approach, it only reports the best hit, it does not present and resolve conflicts.

Reads that map to the RDP database under the following stringency thresholds: e-value = -30, %identity = 70%, length = 50, minimum count = 2

RDP

RDP30 <- read.csv("MG-RAST/final.RDP.e30.tsv", header=TRUE, sep="\t")
RDP30 <- RDP30[-8]
RDP30 <- RDP30[order(-RDP30$BTEX1.unmapped.FR.fastq),]
RDP30$BTEX1.unmapped.FR.fastq <- RDP30$BTEX1.unmapped.FR.fastq / sum(RDP30$BTEX1.unmapped.FR.fastq)
RDP30$Kuwait.1810.mgenome.fastq <- RDP30$Kuwait.1810.mgenome.fastq / sum(RDP30$Kuwait.1810.mgenome.fastq)
colnames(RDP30) <- c("domain","phylum","class","order","family","genus","unmapped","full.mgenome")
RDP30 <- RDP30[-c(1:4),]
write.csv(RDP30, "RDP30.csv")
kable(RDP30)

	domain	phylum	class	order	family	genus	unmapped	full.mgenome
4	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Bacillus	0.0006074	0.0003704
8	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Geobacillus	0.0002916	0.0003010
9	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Marinococcus	0.0001701	0.0002315
2	Bacteria	Firmicutes	Bacilli	Bacillales	Alicyclobacillaceae	Alicyclobacillus	0.0001458	0.0001158
5	Bacteria	Firmicutes	Clostridia	Clostridiales	Peptococcaceae	Desulfotomaculum	0.0000972	0.0000000
16	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Virgibacillus	0.0000486	0.0000695
1	Bacteria	Firmicutes	Clostridia	Clostridiales	Eubacteriaceae	Acetobacterium	0.0000000	0.0000695
6	Bacteria	Proteobacteria	Gammaproteobacteria	Chromatiales	Ectothiorhodospiraceae	Ectothiorhodospira	0.0000000	0.0000926
7	Bacteria	Firmicutes	Clostridia	Clostridiales	Eubacteriaceae	Eubacterium	0.0000000	0.0000695
10	Bacteria	Proteobacteria	Gammaproteobacteria	unclassified (derived from Gammaproteobacteria)	unclassified (derived from Gammaproteobacteria)	Methylonatrum	0.0000000	0.0000463
11	Bacteria	Proteobacteria	Gammaproteobacteria	Chromatiales	Ectothiorhodospiraceae	Nitrococcus	0.0000000	0.0006020
12	Bacteria	Firmicutes	Bacilli	Bacillales	Paenibacillaceae	Paenibacillus	0.0000000	0.0000926
13	Bacteria	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Moraxellaceae	Psychrobacter	0.0000000	0.0001158
14	Bacteria	Firmicutes	Bacilli	Bacillales	Sporolactobacillaceae	Sporolactobacillus	0.0000000	0.0000695
15	Bacteria	Proteobacteria	Gammaproteobacteria	Chromatiales	Ectothiorhodospiraceae	Thioalkalivibrio	0.0000000	0.0000695
19	Bacteria	Proteobacteria	unclassified (derived from Proteobacteria)	unclassified (derived from Proteobacteria)	unclassified (derived from Proteobacteria)	unclassified (derived from Proteobacteria)	0.0000000	0.0000695

Silva SSU

Conducted using the same specificity settings as the RDP

SILVASSU30 <- read.csv("MG-RAST/final.SilvaSSU.e30.tsv", header=TRUE, sep="\t")
SILVASSU30 <- SILVASSU30[-8]
SILVASSU30 <- SILVASSU30[order(-SILVASSU30$BTEX1.unmapped.FR.fastq),]
SILVASSU30$BTEX1.unmapped.FR.fastq <- SILVASSU30$BTEX1.unmapped.FR.fastq / sum(SILVASSU30$BTEX1.unmapped.FR.fastq)
SILVASSU30$Kuwait.1810.mgenome.fastq <- SILVASSU30$Kuwait.1810.mgenome.fastq / sum(SILVASSU30$Kuwait.1810.mgenome.fastq)
colnames(SILVASSU30) <- c("domain","phylum","class","order","family","genus","unmapped","full.mgenome")
SILVASSU30 <- SILVASSU30[-c(1:4),]
kable(SILVASSU30)

	domain	phylum	class	order	family	genus	unmapped	full.mgenome
4	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Bacillus	0.0006802	0.0004398
11	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Geobacillus	0.0002915	0.0003009
12	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Marinococcus	0.0001700	0.0002315
2	Bacteria	Firmicutes	Bacilli	Bacillales	Alicyclobacillaceae	Alicyclobacillus	0.0001457	0.0001157
7	Bacteria	Firmicutes	Clostridia	Clostridiales	Peptococcaceae	Desulfotomaculum	0.0000972	0.0000000
9	Bacteria	Firmicutes	Bacilli	Lactobacillales	Enterococcaceae	Enterococcus	0.0000729	0.0000000
6	Bacteria	Firmicutes	Clostridia	Clostridiales	Peptococcaceae	Desulfosporosinus	0.0000486	0.0000000
19	Bacteria	Firmicutes	Bacilli	Bacillales	Bacillaceae	Virgibacillus	0.0000486	0.0000694
1	Bacteria	Firmicutes	Clostridia	Clostridiales	Eubacteriaceae	Acetobacterium	0.0000000	0.0000694
5	Eukaryota	Nematoda	Chromadorea	Rhabditida	Rhabditidae	Caenorhabditis	0.0000000	0.0000463
8	Bacteria	Proteobacteria	Gammaproteobacteria	Chromatiales	Ectothiorhodospiraceae	Ectothiorhodospira	0.0000000	0.0000926
10	Bacteria	Firmicutes	Clostridia	Clostridiales	Eubacteriaceae	Eubacterium	0.0000000	0.0000694
13	Bacteria	Proteobacteria	Gammaproteobacteria	unclassified (derived from Gammaproteobacteria)	unclassified (derived from Gammaproteobacteria)	Methylonatrum	0.0000000	0.0000463
14	Bacteria	Proteobacteria	Gammaproteobacteria	Chromatiales	Ectothiorhodospiraceae	Nitrococcus	0.0000000	0.0006018
15	Bacteria	Firmicutes	Bacilli	Bacillales	Paenibacillaceae	Paenibacillus	0.0000000	0.0001157
16	Bacteria	Proteobacteria	Gammaproteobacteria	Pseudomonadales	Moraxellaceae	Psychrobacter	0.0000000	0.0001157
17	Bacteria	Firmicutes	Bacilli	Bacillales	Sporolactobacillaceae	Sporolactobacillus	0.0000000	0.0000694
18	Bacteria	Proteobacteria	Gammaproteobacteria	Chromatiales	Ectothiorhodospiraceae	Thioalkalivibrio	0.0000000	0.0002315
22	Bacteria	Proteobacteria	unclassified (derived from Proteobacteria)	unclassified (derived from Proteobacteria)	unclassified (derived from Proteobacteria)	unclassified (derived from Proteobacteria)	0.0000000	0.0000694

library(ggplot2)
library(RColorBrewer)
SILVASSU30.c <- SILVASSU30[c(1:8),-8]
SILVASSU30.c$concat <- paste(SILVASSU30.c$order,";",SILVASSU30.c$family,";",SILVASSU30.c$genus," = (",round(SILVASSU30.c$unmapped*100, 4),"%)", sep="")
write.csv(SILVASSU30.c, "SILVASSU30.c.csv")

par(mar=c(3,3,7,7))
plot <- pie(SILVASSU30.c$unmapped, labels = SILVASSU30.c$concat, main="non-Arhodomonas 16S rRNA gene fragments", 
            cex = 0.6, col = brewer.pal(n = 8, name = "Dark2"))

plot

## NULL

Not an ideal way to represent the data but it’s something

Summarizing the results by Order

library(dplyr)
RDP30 <- dplyr::group_by(RDP30, order)
summary.RDP <- dplyr::summarise(RDP30, RDP = sum(unmapped))
summary.RDP$order <- as.character(summary.RDP$order)
SILVASSU30 <- dplyr::group_by(SILVASSU30, order)
summary.silva <- dplyr::summarise(SILVASSU30, SILVASSU30 = sum(unmapped))
summary.silva$order <- as.character(summary.silva$order)
summary.16S.order <- dplyr::full_join(summary.RDP, summary.silva, by="order")
summary.16S.order <- summary.16S.order[order(-summary.16S.order$SILVASSU30),]
colnames(summary.16S.order) <- c("order","RDP","Silva.SSU")
kable(summary.16S.order)

order	RDP	Silva.SSU
Bacillales	0.0012634	0.0013360
Clostridiales	0.0000972	0.0001457
Lactobacillales	NA	0.0000729
Chromatiales	0.0000000	0.0000000
Pseudomonadales	0.0000000	0.0000000
unclassified (derived from Gammaproteobacteria)	0.0000000	0.0000000
unclassified (derived from Proteobacteria)	0.0000000	0.0000000
Rhabditida	NA	0.0000000

The results from the two databases are very close to one another, indicating the presence of a handful of genomes from orders Bacillales and Clostriales. This is largely in accordance with the 16S amplicon library results. The 16S library suggested ~0.0030% other organisms, while the metagenome suggests ~0.0014% other organisms. At most, we should expect to find ~20 reads from any given fragment of genomes of minor community members.

Function counts

This table is the result of annotation of all reads with the Subsytems database. Annotations are displayed using the stringency measures: e-value = -10 %identity minimum = 60% minimum length of match = 15 minimum counts = 2 maximum counts = 500

(relative counts and min/max read counts were applied in Excel)

subsystems <- read.csv("MG-RAST/final.subsystems.function.tsv", sep="\t", header=TRUE)
subsystems <- subsystems[-6]
subsystems <- subsystems[order(-subsystems$rel),]
subsystems <- subsystems[subsystems$rel > 0.5,] #remove all relative fractions 50% and below
kable(head(subsystems))

	level1	level2	level3	function.	BTEX1.unmapped.FR.fastq	Kuwait.1810.mgenome.fastq	test	rel
20	RNA Metabolism	RNA processing and modification	tRNA_modification_Archaea	Asparaginyl-tRNA synthetase (EC 6.1.1.22)	6	3	6	2.000000
21	Carbohydrates	Central carbohydrate metabolism	Pyruvate_metabolism_II:_acetyl-CoA,_acetogenesis_from_pyruvate	Pyruvate dehydrogenase E1 component alpha subunit (EC 1.2.4.1)	4	2	4	2.000000
22	Carbohydrates	One-carbon Metabolism	Serine-glyoxylate_cycle	Phosphoenolpyruvate carboxykinase [ATP] (EC 4.1.1.49)	9	5	9	1.800000
23	Cell Wall and Capsule	Capsular and extracellular polysacchrides	Alginate_metabolism	Probable poly(beta-D-mannuronate) O-acetylase (EC 2.3.1.-)	7	4	7	1.750000
24	Membrane Transport	NULL	Tricarboxylate_transport_system	Tricarboxylate transport protein TctC	7	4	7	1.750000
25	Clustering-based subsystems	NULL	Cluster_containing_CofD-like_protein_and_co-occuring_with_DNA_repair	FIG002813: LPPG:FO 2-phospho-L-lactate transferase like, CofD-like	5	3	5	1.666667

This is a small subset of the subsystems analysis results

print(paste("There are",nrow(subsystems),"functions annoted by the subsystems database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"))

## [1] "There are 227 functions annoted by the subsystems database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"

Summarize by Subsystems pathway

subsystems.level3 <- dplyr::group_by(subsystems, level3)
subsystems.level3 <- dplyr::summarise(subsystems.level3, subsystems.path = n())
subsystems.level3 <- subsystems.level3[order(-subsystems.level3$subsystems.path),]
kable(subsystems.level3)

level3	subsystems.path
Sporulation_gene_orphans	6
Purine_conversions	5
Sugar_utilization_in_Thermotogales	5
D-Galacturonate_and_D-Glucuronate_Utilization	4
Heme,_hemin_uptake_and_utilization_systems_in_GramPositives	4
L-rhamnose_utilization	4
Mannitol_Utilization	4
COG3533	3
Inositol_catabolism	3
Menaquinone_and_Phylloquinone_Biosynthesis	3
Phosphate_metabolism	3
Proteolysis_in_bacteria,_ATP-dependent	3
SigmaB_stress_responce_regulation	3
Trehalose_Uptake_and_Utilization	3
Bacterial_hemoglobins	2
Calvin-Benson_cycle	2
CBSS-257314.1.peg.676	2
CBSS-393133.3.peg.2787	2
Control_of_cell_elongation_-_division_cycle_in_Bacilli	2
DNA_processing_cluster	2
DNA_repair,_bacterial	2
ECF_class_transporters	2
Exopolysaccharide_Biosynthesis	2
Experimental_-_Histidine_Degradation	2
(GlcNAc)2_Catabolic_Operon	2
Oxidative_stress	2
Polyglycerolphosphate_lipoteichoic_acid_biosynthesis	2
Rhamnose_containing_glycans	2
Serine-glyoxylate_cycle	2
Tryptophan_catabolism	2
Xylose_utilization	2
4-Hydroxyphenylacetic_acid_catabolic_pathway	1
Acetone_Butanol_Ethanol_Synthesis	1
Adenosyl_nucleosidases	1
Alginate_metabolism	1
Arginine_and_Ornithine_Degradation	1
Arsenic_resistance	1
At4g10620_At3g57180_At3g47450	1
At5g63420	1
Bacillibactin_Siderophore	1
Bacillus_biofilm_matrix_protein_component_TasA_and_homologs	1
Bacitracin_Stress_Response	1
Bacterial_Chemotaxis	1
Beta-Glucoside_Metabolism	1
Beta-lactamase_cluster_in_Streptococcus	1
Biogenesis_of_c-type_cytochromes	1
Biotin_biosynthesis	1
Broadly_distributed_proteins_not_in_subsystems	1
Cadmium_resistance	1
Carbon_Starvation	1
Carboxysome	1
CBSS-176280.1.peg.1561	1
CBSS-176299.4.peg.1292	1
CBSS-193567.1.peg.90	1
CBSS-262719.3.peg.410	1
CBSS-269801.1.peg.1715	1
CBSS-288681.3.peg.1039	1
CBSS-393121.3.peg.1913	1
CBSS-393121.3.peg.2760	1
CBSS-393130.3.peg.129	1
Chitin_and_N-acetylglucosamine_utilization	1
Cluster_containing_CofD-like_protein_and_co-occuring_with_DNA_repair	1
Cluster_with_phosphopentomutase_paralog	1
Coenzyme_A_Biosynthesis	1
Coenzyme_PQQ_synthesis	1
COG0451	1
COG0523	1
COG2363	1
COG2509	1
COG3146_experimental	1
Competence_or_DNA_damage-inducible_protein_CinA_and_related_protein_families	1
Conserved_gene_cluster_associated_with_Met-tRNA_formyltransferase	1
Cysteine_Biosynthesis	1
Dehydrogenase_complexes	1
De_Novo_Purine_Biosynthesis	1
De_Novo_Pyrimidine_Synthesis	1
Deoxyribose_and_Deoxynucleoside_Catabolism	1
D-galactarate,_D-glucarate_and_D-glycerate_catabolism	1
Dihydroxyacetone_kinases	1
Dipicolinate_Synthesis	1
DMT_transporter	1
DNA_repair,_bacterial_DinG_and_relatives	1
DNA_Repair_Base_Excision	1
DNA_replication_cluster_1	1
D-Tagatose_and_Galactitol_Utilization	1
EcsAB_transporter_affecting_expression_and_secretion_of_secretory_preproteins	1
Fermentations:_Mixed_acid	1
Ferrous_iron_transporter_EfeUOB,_low-pH-induced	1
Formate_hydrogenase	1
Fosfomycin_resistance	1
Fructooligosaccharides(FOS)_and_Raffinose_Utilization	1
Galactose-inducible_PTS	1
Glutamate_and_Aspartate_uptake_in_Bacteria	1
Glutamate-mediated_methylamine_utilization_pathway	1
Glycerolipid_and_Glycerophospholipid_Metabolism_in_Bacteria	1
Glycine_Biosynthesis	1
Glycogen_metabolism_cluster	1
Glycolysis_and_Gluconeogenesis	1
Heme_and_Siroheme_Biosynthesis	1
HMG_CoA_Synthesis	1
HPr_catabolite_repression_system	1
HPr_kinase_and_hprK_operon_in_Gram-positive_organisms	1
Hydantoin_metabolism	1
Isobutyryl-CoA_to_Propionyl-CoA_Module	1
Isoleucine_degradation	1
Isoprenoid_Biosynthesis	1
Lactate_utilization	1
Lactose_and_Galactose_Uptake_and_Utilization	1
L-Arabinose_utilization	1
Listeria_Pathogenicity_Island_LIPI-1_extended	1
LMPTP_YwlE_cluster	1
Lysine_Biosynthesis_DAP_Pathway	1
Mannose_Metabolism	1
Mediator_of_hyperadherence_YidE_in_Enterobacteria_and_its_conserved_region	1
Metallocarboxypeptidases_(EC_3.4.17.-)	1
Metalloendopeptidases_(EC_3.4.24.-)	1
Methionine_Biosynthesis	1
Muconate_lactonizing_enzyme_family	1
Murein_hydrolase_regulation_and_cell_death	1
Nitrosative_stress	1
Nudix_proteins_(nucleoside_triphosphate_hydrolases)	1
Omega_peptidases_(EC_3.4.19.-)	1
p-Aminobenzoyl-Glutamate_Utilization	1
Pentose_phosphate_pathway	1
Peptidoglycan_Crosslinking_of_Peptide_Stems	1
Phage_replication	1
Phd-Doc,YdcE-YdcD_toxin-antitoxin(programmed_cell_death)_systems	1
Phenylalanine_and_Tyrosine_Branches_from_Chorismate	1
Photorespiration_(oxidative_C2_cycle)	1
Polyamine_Metabolism	1
Polymyxin_Synthetase_Gene_Cluster_in_Bacillus	1
Potassium_homeostasis	1
Proline,_4-hydroxyproline_uptake_and_utilization	1
PROSC	1
Protocatechuate_branch_of_beta-ketoadipate_pathway	1
pyrimidine_conversions	1
Pyruvate_metabolism_I:_anaplerotic_reactions,_PEP	1
Pyruvate_metabolism_II:_acetyl-CoA,_acetogenesis_from_pyruvate	1
Restriction-Modification_System	1
RNA_methylation	1
RNA_modification_cluster	1
Sialic_Acid_Metabolism	1
Siderophore_[Alcaligin-like]	1
Siderophore_assembly_kit	1
Single-Rhodanese-domain_proteins	1
Spore_Coat	1
Spore_Core_Dehydration	1
Spore_germination	1
Sporulation-associated_proteins_with_broader_functions	1
Sporulation_Cluster	1
Sporulation-related_Hypotheticals	1
SpoVS_protein_family	1
Streptococcus_pyogenes_recombinatorial_zone	1
Sulfatases_and_sulfatase_modifying_factor_1	1
Synthesis_of_osmoregulated_periplasmic_glucans	1
Teichoic_and_lipoteichoic_acids_biosynthesis	1
Terminal_cytochrome_C_oxidases	1
Terminal_cytochrome_oxidases	1
Thiamin_biosynthesis	1
Tolerance_to_colicin_E2	1
Trehalose_Biosynthesis	1
Tricarboxylate_transport_system	1
tRNA_modification_Archaea	1
tRNA_modification_Bacteria	1
tRNA_modification_yeast_cytoplasmic	1
tRNA_processing	1
Two-component_Response_Regulator_of_Virulence_ResDE	1
Ubiquinone_Menaquinone-cytochrome_c_reductase_complexes	1
Universal_GTPases	1
Unspecified_monosaccharide_transport_cluster	1
YgfZ	1

The Subsystems database is advantageous in that it is large and well annotated, but the organization leaves much to desire.

The presence of sporulation-related genes confirms the 16S results, that there are gram-positives in the community at low levels.

Summarizing Subsystems database results at higher level of organization

subsystems.level1 <- dplyr::group_by(subsystems, level1)
subsystems.level1 <- dplyr::summarise(subsystems.level1, subsystems.category = n())
subsystems.level1 <- subsystems.level1[order(-subsystems.level1$subsystems.category),]
kable(subsystems.level1)

level1	subsystems.category
Carbohydrates	54
Clustering-based subsystems	26
Miscellaneous	18
Amino Acids and Derivatives	13
Dormancy and Sporulation	13
Nucleosides and Nucleotides	11
Cell Wall and Capsule	10
Cofactors, Vitamins, Prosthetic Groups, Pigments	10
Stress Response	9
Iron acquisition and metabolism	8
Virulence, Disease and Defense	8
DNA Metabolism	7
Protein Metabolism	7
Membrane Transport	5
Respiration	5
RNA Metabolism	5
Regulation and Cell signaling	4
Phosphorus Metabolism	3
Cell Division and Cell Cycle	2
Fatty Acids, Lipids, and Isoprenoids	2
Metabolism of Aromatic Compounds	2
Phages, Prophages, Transposable elements, Plasmids	2
Motility and Chemotaxis	1
Nitrogen Metabolism	1
Potassium metabolism	1

COG annotation results

COG <- read.csv("MG-RAST/final.COG.function.tsv", sep="\t", header=TRUE)
COG <- COG[order(-COG$rel),]
COG <- COG[COG$rel > 0.5,] #remove all relative fractions 50% and below
kable(head(COG))

level1	level2	function.	BTEX1.unmapped.FR.fastq	Kuwait.1810.mgenome.fastq	test	rel
CELLULAR PROCESSES AND SIGNALING	Cell wall/membrane/envelope biogenesis	Predicted membrane protein involved in D-alanine export	5	2	5	2.5
METABOLISM	Coenzyme transport and metabolism	Na+/panthothenate symporter	6	3	6	2.0
POORLY CHARACTERIZED	General function prediction only	Predicted phosphohydrolases	4	2	4	2.0
METABOLISM	Energy production and conversion	Phosphoenolpyruvate carboxykinase (ATP)	9	5	9	1.8
INFORMATION STORAGE AND PROCESSING	Replication, recombination and repair	ATPase involved in DNA replication	6	4	6	1.5
INFORMATION STORAGE AND PROCESSING	Replication, recombination and repair	Adenine specific DNA methylase Mod	3	2	3	1.5

This is a small subset of the full COG database results

print(paste("There are",nrow(COG),"functions annoted by the COG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"))

## [1] "There are 145 functions annoted by the COG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"

Summarizing by COG pathway

COG.level2 <- dplyr::group_by(COG, level2)
COG.level2 <- dplyr::summarise(COG.level2, COG.path = n())
COG.level2 <- COG.level2[order(-COG.level2$COG.path),]
kable(COG.level2)

level2	COG.path
Carbohydrate transport and metabolism	27
General function prediction only	26
Amino acid transport and metabolism	16
Inorganic ion transport and metabolism	13
Energy production and conversion	12
Cell wall/membrane/envelope biogenesis	8
Replication, recombination and repair	7
Coenzyme transport and metabolism	6
Secondary metabolites biosynthesis, transport and catabolism	6
Signal transduction mechanisms	6
Nucleotide transport and metabolism	5
Transcription	5
Lipid transport and metabolism	3
Posttranslational modification, protein turnover, chaperones	2
Cell motility	1
Function unknown	1
Translation, ribosomal structure and biogenesis	1

This is only consistent with the subsystems results in that carbohydrate metabolism is the main category containing unique genes.

COG.level1 <- dplyr::group_by(COG, level1)
COG.level1 <- dplyr::summarise(COG.level1, COG.category = n())
COG.level1 <- COG.level1[order(-COG.level1$COG.category),]
kable(COG.level1)

level1	COG.category
METABOLISM	88
POORLY CHARACTERIZED	27
CELLULAR PROCESSES AND SIGNALING	17
INFORMATION STORAGE AND PROCESSING	13

KEGG database results

KEGG <- read.csv("MG-RAST/final.KEGG.function.tsv", sep="\t", header=TRUE)
KEGG <- KEGG[order(-KEGG$rel),]
KEGG <- KEGG[KEGG$rel > 0.5,] #remove all relative fractions 50% and below
kable(head(KEGG))

level1	level2	level3	function.	BTEX1.unmapped.FR.fastq	Kuwait.1810.mgenome.fastq	test	rel
Environmental Information Processing	Membrane transport	02010 ABC transporters [PATH:ko02010]	araG; L-arabinose transport system ATP-binding protein [EC:3.6.3.17]	14	6	14	2.333333
Metabolism	Carbohydrate metabolism	00052 Galactose metabolism [PATH:ko00052]	bgaB, lacA; beta-galactosidase [EC:3.2.1.23]	4	2	4	2.000000
Environmental Information Processing	Signal transduction	02020 Two-component system [PATH:ko02020]	desR; two-component system, NarL family, response regulator DesR	4	2	4	2.000000
Metabolism	Carbohydrate metabolism	00500 Starch and sucrose metabolism [PATH:ko00500]	treS; maltose alpha-D-glucosyltransferase/ alpha-amylase [EC:5.4.99.16 3.2.1.1]	6	3	6	2.000000
Metabolism	Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	E4.1.1.49, pckA; phosphoenolpyruvate carboxykinase (ATP) [EC:4.1.1.49]	12	7	12	1.714286
Metabolism	Amino acid metabolism	00270 Cysteine and methionine metabolism [PATH:ko00270]	E2.1.1.10, mmuM; homocysteine S-methyltransferase [EC:2.1.1.10]	8	5	8	1.600000

print(paste("There are",nrow(KEGG),"functions annoted by the KEGG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"))

## [1] "There are 144 functions annoted by the KEGG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"

KEGG.level3 <- dplyr::group_by(KEGG, level3)
KEGG.level3 <- dplyr::summarise(KEGG.level3, KEGG.pathway = n())
KEGG.level3 <- KEGG.level3[order(-KEGG.level3$KEGG.pathway),]
kable(KEGG.level3)

level3	KEGG.pathway
02010 ABC transporters [PATH:ko02010]	13
02020 Two-component system [PATH:ko02020]	13
02060 Phosphotransferase system (PTS) [PATH:ko02060]	10
00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	5
00130 Ubiquinone and other terpenoid-quinone biosynthesis [PATH:ko00130]	5
00270 Cysteine and methionine metabolism [PATH:ko00270]	5
00040 Pentose and glucuronate interconversions [PATH:ko00040]	4
00051 Fructose and mannose metabolism [PATH:ko00051]	4
00190 Oxidative phosphorylation [PATH:ko00190]	4
00230 Purine metabolism [PATH:ko00230]	4
00500 Starch and sucrose metabolism [PATH:ko00500]	4
00260 Glycine, serine and threonine metabolism [PATH:ko00260]	3
00520 Amino sugar and nucleotide sugar metabolism [PATH:ko00520]	3
00562 Inositol phosphate metabolism [PATH:ko00562]	3
00030 Pentose phosphate pathway [PATH:ko00030]	2
00052 Galactose metabolism [PATH:ko00052]	2
00053 Ascorbate and aldarate metabolism [PATH:ko00053]	2
00280 Valine, leucine and isoleucine degradation [PATH:ko00280]	2
00330 Arginine and proline metabolism [PATH:ko00330]	2
00340 Histidine metabolism [PATH:ko00340]	2
00350 Tyrosine metabolism [PATH:ko00350]	2
00561 Glycerolipid metabolism [PATH:ko00561]	2
00620 Pyruvate metabolism [PATH:ko00620]	2
00730 Thiamine metabolism [PATH:ko00730]	2
00770 Pantothenate and CoA biosynthesis [PATH:ko00770]	2
00860 Porphyrin and chlorophyll metabolism [PATH:ko00860]	2
00940 Phenylpropanoid biosynthesis [PATH:ko00940]	2
01053 Biosynthesis of siderophore group nonribosomal peptides [PATH:ko01053]	2
04146 Peroxisome [PATH:ko04146]	2
00020 Citrate cycle (TCA cycle) [PATH:ko00020]	1
00140 Steroid hormone biosynthesis [PATH:ko00140]	1
00240 Pyrimidine metabolism [PATH:ko00240]	1
00250 Alanine, aspartate and glutamate metabolism [PATH:ko00250]	1
00300 Lysine biosynthesis [PATH:ko00300]	1
00310 Lysine degradation [PATH:ko00310]	1
00360 Phenylalanine metabolism [PATH:ko00360]	1
00362 Benzoate degradation [PATH:ko00362]	1
00380 Tryptophan metabolism [PATH:ko00380]	1
00440 Phosphonate and phosphinate metabolism [PATH:ko00440]	1
00511 Other glycan degradation [PATH:ko00511]	1
00521 Streptomycin biosynthesis [PATH:ko00521]	1
00550 Peptidoglycan biosynthesis [PATH:ko00550]	1
00564 Glycerophospholipid metabolism [PATH:ko00564]	1
00630 Glyoxylate and dicarboxylate metabolism [PATH:ko00630]	1
00633 Nitrotoluene degradation [PATH:ko00633]	1
00650 Butanoate metabolism [PATH:ko00650]	1
00680 Methane metabolism [PATH:ko00680]	1
00760 Nicotinate and nicotinamide metabolism [PATH:ko00760]	1
00785 Lipoic acid metabolism [PATH:ko00785]	1
00900 Terpenoid backbone biosynthesis [PATH:ko00900]	1
00930 Caprolactam degradation [PATH:ko00930]	1
00970 Aminoacyl-tRNA biosynthesis [PATH:ko00970]	1
02040 Flagellar assembly [PATH:ko02040]	1
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008]	1
03013 RNA transport [PATH:ko03013]	1
03018 RNA degradation [PATH:ko03018]	1
03020 RNA polymerase [PATH:ko03020]	1
03030 DNA replication [PATH:ko03030]	1
03410 Base excision repair [PATH:ko03410]	1
03430 Mismatch repair [PATH:ko03430]	1
03440 Homologous recombination [PATH:ko03440]	1
04142 Lysosome [PATH:ko04142]	1
05146 Amoebiasis [PATH:ko05146]	1

KEGG.level2 <- dplyr::group_by(KEGG, level2)
KEGG.level2 <- dplyr::summarise(KEGG.level2, KEGG.level2 = n())
KEGG.level2 <- KEGG.level2[order(-KEGG.level2$KEGG.level2),]
kable(KEGG.level2)

level2	KEGG.level2
Carbohydrate metabolism	34
Membrane transport	23
Amino acid metabolism	21
Metabolism of cofactors and vitamins	13
Signal transduction	13
Energy metabolism	5
Nucleotide metabolism	5
Lipid metabolism	4
Replication and repair	4
Biosynthesis of other secondary metabolites	3
Translation	3
Transport and catabolism	3
Xenobiotics biodegradation and metabolism	3
Glycan biosynthesis and metabolism	2
Metabolism of terpenoids and polyketides	2
Cell motility	1
Folding, sorting and degradation	1
Infectious diseases	1
Metabolism of other amino acids	1
Metabolism of Terpenoids and Polyketides	1
Transcription	1

Mapping KEGG functions to pathways

#this cell imports and parses the MG-RAST kegg orthology file output to create a name to KO mapping file
library(dplyr)
library(stringr)
substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))}
KO.name.full <- read.csv("MG-RAST/ontology.csv", header = TRUE, sep = "\t", stringsAsFactors = F, row.names=NULL)
KO.name <- KO.name.full[13]
KO.name <- data.frame(KO.name[!duplicated(KO.name),])
colnames(KO.name) <- "gene"
KO.name <- data.frame(stringr::str_split_fixed(KO.name$gene, ",", 2), stringsAsFactors = FALSE)
colnames(KO.name) <- c("KO","KO.function")
KO.name$KO <- substr(KO.name$KO, nchar(KO.name$KO)-6, nchar(KO.name$KO)-1)
KO.name$KO.function <- substr(KO.name$KO.function, 11, nchar(KO.name$KO.function))
KO.name$KO.function <- substr(KO.name$KO.function, 0, nchar(KO.name$KO.function) - 1)
KO.name <- KO.name[c(2,1)]

#this chunk takes the KEGG table output from MGRAST analysis and maps the kegg names to KO numbers
KEGG.name <- KEGG[c(4,5)]
KEGG.name.2 <- data.frame(stringr::str_split_fixed(KEGG.name$function.,"; ",2))
KEGG.name <- cbind(KEGG.name.2[2],KEGG.name[2])
colnames(KEGG.name) <- c("KO.function","count")
KEGG.name$KO.function <- as.character(KEGG.name$KO.function)
KEGG.name.KO <- dplyr::left_join(KEGG.name,KO.name,by="KO.function")

Some of the MG-RAST file is not possible to parse cleanly. ~10% of function names are mapped to KO terms by hand. This file is read back in and turned into a list of KO terms that can be fed into KEGG mapper (https://www.genome.jp/kegg/tool/map_pathway1.html)

unmapped.novel.KO <- read.csv("KEGG.name.KO.csv", header = TRUE, stringsAsFactors = FALSE)
KO.terms <- data.frame(unmapped.novel.KO$KO)
write.csv(KO.terms,"KO.terms.csv")

At KEGG mapper website, KO terms can be pasted in as such:

This in turn leads to an index of pathways to which the query functions map.

The individual KEGG maps can be used to guide biological investigations of the implications of the functional annotations.

Example investigations of functions and pathways novel to the minor community members

Sarcosine oxidase

Mannitol uptake and metabolism

Trehalose uptake and metabolism

Vitamin K synthesis

Lack of complete Vitamin K synthesis in Arhodomonas genome

These pathways represent only an initial survey of the functional annotations

Conclusions and future work

Initial analysis confirms the presence of several minor community members from orders Bacilli and Clostridia. This is supported by analysis of 16S rRNA gene fragments in the metagenome and by the presence of functions unique to gram positive bacteria such as teichoic acid metabolism and sporulation. Their unique activities include uptake of carbohydrates (sugars), waste products (sarcosine), and osmoprotectants (trehalose) while enabling synthesis of Vitamin K (menaqionone). Altogether, this paints a picture of organisms surviving by scavenging trace sugars and waste products and providing an essential vitamin to the community.

Testable hypotheses:

Arodomonas requires Vitamin K for optimal activity. This can be investigated by looking at predicted functions in the dominant Arhodomonas by biologically informed bioinformatics approaches. Additionally, the Arhodomonas (BTEX1) can be isolated and the consequences of Vitamin K supplementation investigated.
BTEX1, when growing on BTEX, produces sarcosine and other waste products which accumulate in the medium and negatively affect growth and/or metabolism.