16S amplicon library results

The amplicon library of the BTEX enrichment was processed using qiime2 pipeline with a pre-trained Silva v132 classifier and dada2 actual-sequence-variant denoising. The results clearly revealed a community almost entirely dominated by an Arhodomonas species.

OTU.table <- read.csv("rich.OTU.table.BTEX.csv", header = TRUE)
taxonomy <- data.frame(do.call('rbind', strsplit(as.character(OTU.table$taxonomy),';',fixed=TRUE)))
colnames(taxonomy) <- c("kingdom","phylum","class","order","family","genus","species")
#OTU.table <- cbind(taxonomy, OTU.table[-1])
kable(OTU.table[c(1:3)])
taxonomy seq.count relative.abund.
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Nitrococcales;D_4__Nitrococcaceae;D_5__Arhodomonas;D_6__uncultured bacterium 453177 0.9931036
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Clostridiaceae 1;D_5__Clostridium sensu stricto 13;D_6__uncultured bacterium 1276 0.0027963
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;;;;; 591 0.0012951
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Peptococcaceae;D_5__Candidatus Dichloromethanomonas;D_6__uncultured bacterium 201 0.0004405
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__uncultured compost bacterium 191 0.0004186
D_0__Bacteria;D_1__Proteobacteria;D_2__Deltaproteobacteria;D_3__Desulfovibrionales;D_4__Desulfovibrionaceae;D_5__Desulfovibrio;D_6__Desulfovibrio desulfuricans 101 0.0002213
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Pseudomonadales;D_4__Pseudomonadaceae;D_5__Pseudomonas;D_6__Pseudomonas fluorescens 93 0.0002038
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;;; 90 0.0001972
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Nitrococcales;D_4__Nitrococcaceae;D_5__Arhodomonas;D_6__uncultured bacterium 78 0.0001709
D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;; 75 0.0001644
D_0__Bacteria;D_1__Thermotogae;D_2__Thermotogae;D_3__Petrotogales;D_4__Petrotogaceae;D_5__SC103;D_6__metagenome 45 0.0000986
D_0__Bacteria;D_1__Cloacimonetes;D_2__Cloacimonadia;D_3__Cloacimonadales;;;; 43 0.0000942
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Peptococcaceae;D_5__uncultured;; 39 0.0000855
D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Rikenellaceae;D_5__DMER64;D_6__uncultured bacterium 37 0.0000811
D_0__Bacteria;D_1__Spirochaetes;D_2__Spirochaetia;D_3__Spirochaetales;D_4__Spirochaetaceae;D_5__uncultured;D_6__uncultured bacterium 36 0.0000789
D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Bacteroidetes vadinHA17;D_5__uncultured bacterium;D_6__uncultured bacterium 20 0.0000438
D_0__Bacteria;D_1__Bacteroidetes;D_2__Rhodothermia;D_3__Balneolales;D_4__Balneolaceae;D_5__uncultured;D_6__uncultured bacterium 19 0.0000416
D_0__Bacteria;D_1__Chloroflexi;D_2__Anaerolineae;D_3__Anaerolineales;D_4__Anaerolineaceae;D_5__Longilinea;D_6__uncultured bacterium 18 0.0000394
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Xanthomonadales;D_4__Xanthomonadaceae;D_5__Stenotrophomonas;; 18 0.0000394
D_0__Bacteria;D_1__Proteobacteria;D_2__Deltaproteobacteria;D_3__Syntrophobacterales;D_4__Syntrophaceae;D_5__Smithella;D_6__uncultured bacterium 17 0.0000373
D_0__Bacteria;D_1__Chlamydiae;D_2__LD1-PA32;D_3__uncultured bacterium;D_4__uncultured bacterium;D_5__uncultured bacterium;D_6__uncultured bacterium 17 0.0000373
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;; 16 0.0000351
D_0__Bacteria;D_1__Synergistetes;D_2__Synergistia;D_3__Synergistales;D_4__Synergistaceae;D_5__Lactivibrio;D_6__uncultured bacterium 15 0.0000329
D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;;;; 15 0.0000329
D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Oceanospirillales;D_4__Halomonadaceae;D_5__Chromohalobacter;; 14 0.0000307
D_0__Bacteria;D_1__Cloacimonetes;D_2__Cloacimonadia;D_3__Cloacimonadales;;;; 12 0.0000263
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanomicrobia;D_3__Methanomicrobiales;D_4__Methanocorpusculaceae;D_5__Methanocorpusculum;D_6__uncultured archaeon 10 0.0000219
D_0__Bacteria;D_1__Proteobacteria;D_2__Deltaproteobacteria;D_3__Myxococcales;D_4__Phaselicystidaceae;D_5__Phaselicystis;D_6__uncultured bacterium 10 0.0000219
D_0__Bacteria;D_1__Spirochaetes;D_2__Spirochaetia;D_3__Spirochaetales;D_4__Spirochaetaceae;D_5__uncultured;D_6__uncultured bacterium 9 0.0000197
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__DTU014;D_4__uncultured bacterium;D_5__uncultured bacterium;D_6__uncultured bacterium 8 0.0000175
D_0__Bacteria;D_1__Hydrogenedentes;D_2__Hydrogenedentia;D_3__Hydrogenedentiales;D_4__Hydrogenedensaceae;D_5__uncultured bacterium;D_6__uncultured bacterium 7 0.0000153
D_0__Bacteria;;;;;;; 7 0.0000153
D_0__Bacteria;D_1__Atribacteria;D_2__Caldatribacteriia;D_3__Caldatribacteriales;D_4__Caldatribacteriaceae;D_5__Candidatus Caldatribacterium;D_6__uncultured bacterium 6 0.0000131
D_0__Bacteria;D_1__Actinobacteria;D_2__Thermoleophilia;D_3__Gaiellales;D_4__uncultured;D_5__metagenome;D_6__metagenome 4 0.0000088
D_0__Bacteria;D_1__Actinobacteria;D_2__Actinobacteria;D_3__Micrococcales;D_4__Cellulomonadaceae;D_5__Cellulomonas;D_6__uncultured bacterium 3 0.0000066
D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Christensenellaceae;D_5__Christensenellaceae R-7 group;; 3 0.0000066
D_0__Bacteria;D_1__Verrucomicrobia;D_2__Verrucomicrobiae;D_3__Pedosphaerales;D_4__Pedosphaeraceae;;; 3 0.0000066

A number of other minor community members are possibly present, although at very low relative abundances. Indeed, the second most abundant community member is present in the amplicon library at 0.3%. This creates significant challenge for shotgun metagenomic analysis, as the vast majority of sequencing reads will be consumed by the dominant species.

Estimating shotgun metagenomics read-depth requirements

We can do a very crude assessment. Assume average prokaryotic genome is 4mb. You need about 10x coverage to confidently assemble. That means, lets say ~40mb of sequencing per genome present. If we assume that the 16S amplicon library accurately represents the community, the second-most abundant species represents 0.3% of the total genomic DNA present. the amount of sequencing required is roughly equivalent to that which would be required to sequence 40mb / 0.003 = 13,333mb or 13.3 gbases (to reach 10X coverage).

A full MiSeq run yields a maximum of roughly 10 gbases. This will yield very approximately 7X coverage of the second-most abundant species; not enough to assemble with any confidence but enough to send down an assembly-independent pipeline.

Note that there is no guarantee that this is the actual sequencing requirement (the amplicon library is imprecise and shotgun sequencing is unpredictable).

https://www.nature.com/articles/srep01968 presents a more formalized approach:

They present an equation for estimating, assuming genus level uniqueness of the target species:

Sequencing required (A) = 0.0706 * (proportion) ^ -0.9959 / 2 A = 0.0706 * (0.003) ^ -0.9959 / 2

This method estimates 11.49 gbases required for 10x coverage, very similar estimate.

The actual metagenome yielded 4.7 gbases following QC (detailed below), thus we can expect ~4X coverage of the second-most abundant community member and thus partial assembly.

Shotgun sequence assembly

Shotgun sequencing of the BTEX enrichment yielded 6.7 Gbases of sequence data.
This was trimmed using Trimmomatic (default settings, sliding window of 6:25). This resulted in 4.7 Gbases. Reads were assembled using SPAdes v 3.12, with “careful” setting and error correction and a minimum contig size of 500bp. Assessing the assembly in QUAST:

Statistics.without.reference Kuwait2018.mgenome.SPAdes.con.
# contigs 1066.00
# contigs (>= 0 bp) 1066.00
# contigs (>= 1000 bp) 446.00
# contigs (>= 10000 bp) 121.00
# contigs (>= 100000 bp) 1.00
# contigs (>= 1000000 bp) 0.00
Largest contig 179935.00
Total length 5104356.00
Total length (>= 0 bp) 5104356.00
Total length (>= 1000 bp) 4695727.00
Total length (>= 10000 bp) 3664180.00
Total length (>= 100000 bp) 179935.00
Total length (>= 1000000 bp) 0.00
N50 24189.00
N75 8250.00
L50 47.00
L75 140.00
GC (%) 67.96
Mismatches NA
# N’s 8511.00
# N’s per 100 kbp 166.74
Predicted genes NA
# predicted genes (unique) 4909.00
# predicted genes (>= 0 bp) 4910.00
# predicted genes (>= 300 bp) 4273.00
# predicted genes (>= 1500 bp) 610.00
# predicted genes (>= 3000 bp) 67.00

Total assembly length is ~5.1mb

Total genome size of the only existing Arhodomonas genome is 3.96mb, implying an extra 1mb of assembly (assuming that this Arhodomonas is similar in that respect)

GC content is also similar to that found in studied Arhodomonas genomes, ~70%:

Binning the metagenome assembly:

The assembly was binned using MaxBin2 v2.2.4. This yielded two bins:

Bins: 2 Input Contigs: 1066 Binned Contigs: 368 (34.5%) Unbinned Contigs: 698 (65.5%) Contigs Too Short: 620 (58.2%) Summed Length of Binned Contigs: 4204219 (82.4%) Summed Length of Unbinned Contigs: 900137 (17.6%) Summed Length of Short Contigs: 408629 (8.0%)

Bin.Name. Read.Coverage. GC.Content. Number.of.Contigs. Total.Contig.Length.
Bin.001.fasta  0.925 0.698 245 3528385
Bin.002.fasta  0.028 0.674 123 675834

Given the similarity in GC content, what we already know about the community make-up, and the total length of the combined bins (4.2mb) compared to existing Arhodomonas genome (3.96mb), it is assumed that these two bins together represent the Arhodomonas

Comparing the combined BTEX enrichment bin with A. aqueoli genome

Whole genome alignment using Mauve

The genome of A. aqeuoli was downloaded from JGI IMG and annotate using prokka v1.13.3 (https://github.com/tseemann/prokka) under default recommended settings. The combined bin was annotated using the same pipeline. Initial whole genome alignment using Mauve yielded a scattered picture:

Iterative contig realignment made the contiguous regions more evident:

There remains approximately 50kb of the binned genome which is not accounted for in the reference genome. These contigs will be recovered and their functional annotations explored. These might represent novel genomic material in the Arhodomonas or they might represent mis-binned contigs.

The non-aligned contigs are named in this file and will be examined later in the pipeline:

V2 V3 V4 V5 V6
BTEX1_347 chromosome forward 3718756 3722692
BTEX1_290 chromosome complement 3722693 3728982
BTEX1_76 chromosome forward 3728983 3739463
BTEX1_337 chromosome forward 3739464 3740876
BTEX1_105 chromosome complement 3740877 3741996
BTEX1_5 chromosome forward 3741997 3744665
This is only the first fi ve of 139 con tigs which do not seem to align with the reference genome. These must derive from either:
  • un-binned Arhodomonas genomic contigs
  • satellite community member genomes

Average Nucleotide Identity (ANI) calculation

FastANI estimates average nucleotide identity between genomes by calculating pairwise nucleotide identity between homologous genes.

Results: 97.67% identity, out of the total 1218 sequence fragments in the BTEX bin (BTEX1), 1044 were determined to be orthologous to A. aquaeoli.

This is regarded as within-species level of identity based on recent observations described in https://www.biorxiv.org/content/early/2017/11/27/225342

ANI scores between pairs of 90,000 bacterial genomes show that an ANI above 96% is consistent with genomes from the same species

ANI scores between pairs of 90,000 bacterial genomes show that an ANI above 96% is consistent with genomes from the same species

This alignment visualization is based on the FastANI calculations and shows the reference, A. aquaeoli (top) vs. the BTEX enrichment bin (bottom)

Binned genome quality report

Conducted at MiGA web server:

Essential genes found: 98/111.

  • Completeness: 88.3%.
  • Contamination: 3.6%.

Multiple copies:

  • 2 tRNA-synt_1d: tRNA synthetases class I (R).
  • 2 TIGR00064: ftsY: signal recognition particle-docking protein FtsY.
  • 2 TIGR00344: alaS: alanine–tRNA ligase.
  • 2 TIGR00436: era: GTP-binding protein Era.

Missing genes:

  • PGK: Phosphoglycerate kinase.
  • TIGR00009: L28: ribosomal protein L28.
  • TIGR00362: DnaA: chromosomal replication initiator protein DnaA.
  • TIGR00388: glyQ: glycine–tRNA ligase, alpha subunit.
  • TIGR00389: glyS_dimeric: glycine–tRNA ligase.
  • TIGR00408: proS_fam_I: proline–tRNA ligase.
  • TIGR00471: pheT_arch: phenylalanine–tRNA ligase, beta subunit.
  • TIGR00663: dnan: DNA polymerase III, beta subunit.
  • TIGR00775: NhaD: Na+/H+ antiporter, NhaD family.
  • TIGR01030: rpmH_bact: ribosomal protein L34.
  • TIGR01059: gyrB: DNA gyrase, B subunit.
  • TIGR01063: gyrA: DNA gyrase, A subunit.
  • TIGR02387: rpoC1_cyan: DNA-directed RNA polymerase, gamma subunit.

Separating metagenomic reads which are not part of the Arhodomonas BTEX1 genome.

The raw reads were mapped back with very strict requirements to the scaffolds of BTEX1. Reads which did not map back might represent two categories of genomic material: 1 BTEX1 genomic material which is not captured during the binning process and is also not found in the reference Arhodoomonas genome. 2 genomic material from other members of the community.

Bowtie configuration

BowTie2 default parameters are somewhat permissive. Our read-length is maximum 250 bp. The default minimum score is -0.6 + -0.6 * L(ength), or -150.6. Mismatches between high quality reads impose a -6 penalty each. Thus, this default setting allows up to ~25 high quality mismatches, or ~10%. Given the knowledge that the other community members are unlikely to be closely related, this is acceptable.

Thus, the bowtie2 call used was:

bowtie2 -x (index) -1 (forward reads) -2 (reverse reads) –score-min L,-0.6,-0.6 –un-conc (paired reads that fail to map condordantly)

The unmapped reads are expected to be quite a minimal set, representing ~0.003 of the 4.8gbases of trimmed sequencing effort, or 10.5 mbases. However, there are likely still scaffolds representing un-binned parts of the BTEX1 genome also.

paired-read mapping leads to pairs that map only once to BTEX1 ending up in the unmapped sequence bin. This is good for assembly-based analysis, where the paired-end information is essential to good assembly and the contaminating BTEX1 pairs will not have any impact on the assembly. However, this is potentilly very bad for assembly-free pipelines, as BTEX1 reads will still be annotated. The question; How many more reads make it into the unmapped paired data set vs the unmapped unpaired data sets. The difference represents the amount of BTEX1 contamination.

BowTie2 results, paired mapping:

Forward reads: 11154399 reads; of these: 11154399 (100.00%) were unpaired; of these: 1479305 (13.26%) aligned 0 times 8963463 (80.36%) aligned exactly 1 time 711631 (6.38%) aligned >1 times 86.74% overall alignment rate

Reverse reads: 11154399 reads; of these: 11154399 (100.00%) were unpaired; of these: 1468625 (13.17%) aligned 0 times 8971133 (80.43%) aligned exactly 1 time 714641 (6.41%) aligned >1 times 86.83% overall alignment rate

Analyzing novel shotgun metagenome sequences via MG-RAST

Goal of MG-RAST analysis:

Methodology

Two data sets:

  • Full metagenome library
  • only the unmapped unpaired reads (via BowTie2)

Annotations which are present at a ratio of at least 0.5 unmapped/full metagenome and are present with less than 500 reads are considered.

Reads processed at MG-RAST without dereplication and without additional quality trimming (data has already been trimmed via Trimmomatic as described above)

Create tables of Silva SSU and RDP non-Arhodomonas taxonomy

Create table of pathways over-represented in the unmapped annotations via KEGG, COG, and Subsystems. Focus on a handful of stories: * putrescine uptake * phosphotransferase modules for carbohydrate uptake

Taxonomy counts

Note that the 16S rRNA gene was NOT included in the BTEX1 genome bin, thus these reads are present in the unmapped read set. Also note that this is not a “classifier” -based approach, it only reports the best hit, it does not present and resolve conflicts.

Reads that map to the RDP database under the following stringency thresholds: e-value = -30, %identity = 70%, length = 50, minimum count = 2

RDP

RDP30 <- read.csv("MG-RAST/final.RDP.e30.tsv", header=TRUE, sep="\t")
RDP30 <- RDP30[-8]
RDP30 <- RDP30[order(-RDP30$BTEX1.unmapped.FR.fastq),]
RDP30$BTEX1.unmapped.FR.fastq <- RDP30$BTEX1.unmapped.FR.fastq / sum(RDP30$BTEX1.unmapped.FR.fastq)
RDP30$Kuwait.1810.mgenome.fastq <- RDP30$Kuwait.1810.mgenome.fastq / sum(RDP30$Kuwait.1810.mgenome.fastq)
colnames(RDP30) <- c("domain","phylum","class","order","family","genus","unmapped","full.mgenome")
RDP30 <- RDP30[-c(1:4),]
write.csv(RDP30, "RDP30.csv")
kable(RDP30)
domain phylum class order family genus unmapped full.mgenome
4 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Bacillus 0.0006074 0.0003704
8 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Geobacillus 0.0002916 0.0003010
9 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Marinococcus 0.0001701 0.0002315
2 Bacteria Firmicutes Bacilli Bacillales Alicyclobacillaceae Alicyclobacillus 0.0001458 0.0001158
5 Bacteria Firmicutes Clostridia Clostridiales Peptococcaceae Desulfotomaculum 0.0000972 0.0000000
16 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Virgibacillus 0.0000486 0.0000695
1 Bacteria Firmicutes Clostridia Clostridiales Eubacteriaceae Acetobacterium 0.0000000 0.0000695
6 Bacteria Proteobacteria Gammaproteobacteria Chromatiales Ectothiorhodospiraceae Ectothiorhodospira 0.0000000 0.0000926
7 Bacteria Firmicutes Clostridia Clostridiales Eubacteriaceae Eubacterium 0.0000000 0.0000695
10 Bacteria Proteobacteria Gammaproteobacteria unclassified (derived from Gammaproteobacteria) unclassified (derived from Gammaproteobacteria) Methylonatrum 0.0000000 0.0000463
11 Bacteria Proteobacteria Gammaproteobacteria Chromatiales Ectothiorhodospiraceae Nitrococcus 0.0000000 0.0006020
12 Bacteria Firmicutes Bacilli Bacillales Paenibacillaceae Paenibacillus 0.0000000 0.0000926
13 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter 0.0000000 0.0001158
14 Bacteria Firmicutes Bacilli Bacillales Sporolactobacillaceae Sporolactobacillus 0.0000000 0.0000695
15 Bacteria Proteobacteria Gammaproteobacteria Chromatiales Ectothiorhodospiraceae Thioalkalivibrio 0.0000000 0.0000695
19 Bacteria Proteobacteria unclassified (derived from Proteobacteria) unclassified (derived from Proteobacteria) unclassified (derived from Proteobacteria) unclassified (derived from Proteobacteria) 0.0000000 0.0000695

Silva SSU

Conducted using the same specificity settings as the RDP

SILVASSU30 <- read.csv("MG-RAST/final.SilvaSSU.e30.tsv", header=TRUE, sep="\t")
SILVASSU30 <- SILVASSU30[-8]
SILVASSU30 <- SILVASSU30[order(-SILVASSU30$BTEX1.unmapped.FR.fastq),]
SILVASSU30$BTEX1.unmapped.FR.fastq <- SILVASSU30$BTEX1.unmapped.FR.fastq / sum(SILVASSU30$BTEX1.unmapped.FR.fastq)
SILVASSU30$Kuwait.1810.mgenome.fastq <- SILVASSU30$Kuwait.1810.mgenome.fastq / sum(SILVASSU30$Kuwait.1810.mgenome.fastq)
colnames(SILVASSU30) <- c("domain","phylum","class","order","family","genus","unmapped","full.mgenome")
SILVASSU30 <- SILVASSU30[-c(1:4),]
kable(SILVASSU30)
domain phylum class order family genus unmapped full.mgenome
4 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Bacillus 0.0006802 0.0004398
11 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Geobacillus 0.0002915 0.0003009
12 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Marinococcus 0.0001700 0.0002315
2 Bacteria Firmicutes Bacilli Bacillales Alicyclobacillaceae Alicyclobacillus 0.0001457 0.0001157
7 Bacteria Firmicutes Clostridia Clostridiales Peptococcaceae Desulfotomaculum 0.0000972 0.0000000
9 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus 0.0000729 0.0000000
6 Bacteria Firmicutes Clostridia Clostridiales Peptococcaceae Desulfosporosinus 0.0000486 0.0000000
19 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Virgibacillus 0.0000486 0.0000694
1 Bacteria Firmicutes Clostridia Clostridiales Eubacteriaceae Acetobacterium 0.0000000 0.0000694
5 Eukaryota Nematoda Chromadorea Rhabditida Rhabditidae Caenorhabditis 0.0000000 0.0000463
8 Bacteria Proteobacteria Gammaproteobacteria Chromatiales Ectothiorhodospiraceae Ectothiorhodospira 0.0000000 0.0000926
10 Bacteria Firmicutes Clostridia Clostridiales Eubacteriaceae Eubacterium 0.0000000 0.0000694
13 Bacteria Proteobacteria Gammaproteobacteria unclassified (derived from Gammaproteobacteria) unclassified (derived from Gammaproteobacteria) Methylonatrum 0.0000000 0.0000463
14 Bacteria Proteobacteria Gammaproteobacteria Chromatiales Ectothiorhodospiraceae Nitrococcus 0.0000000 0.0006018
15 Bacteria Firmicutes Bacilli Bacillales Paenibacillaceae Paenibacillus 0.0000000 0.0001157
16 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter 0.0000000 0.0001157
17 Bacteria Firmicutes Bacilli Bacillales Sporolactobacillaceae Sporolactobacillus 0.0000000 0.0000694
18 Bacteria Proteobacteria Gammaproteobacteria Chromatiales Ectothiorhodospiraceae Thioalkalivibrio 0.0000000 0.0002315
22 Bacteria Proteobacteria unclassified (derived from Proteobacteria) unclassified (derived from Proteobacteria) unclassified (derived from Proteobacteria) unclassified (derived from Proteobacteria) 0.0000000 0.0000694
library(ggplot2)
library(RColorBrewer)
SILVASSU30.c <- SILVASSU30[c(1:8),-8]
SILVASSU30.c$concat <- paste(SILVASSU30.c$order,";",SILVASSU30.c$family,";",SILVASSU30.c$genus," = (",round(SILVASSU30.c$unmapped*100, 4),"%)", sep="")
write.csv(SILVASSU30.c, "SILVASSU30.c.csv")

par(mar=c(3,3,7,7))
plot <- pie(SILVASSU30.c$unmapped, labels = SILVASSU30.c$concat, main="non-Arhodomonas 16S rRNA gene fragments", 
            cex = 0.6, col = brewer.pal(n = 8, name = "Dark2"))

plot
## NULL

Not an ideal way to represent the data but it’s something

Summarizing the results by Order

library(dplyr)
RDP30 <- dplyr::group_by(RDP30, order)
summary.RDP <- dplyr::summarise(RDP30, RDP = sum(unmapped))
summary.RDP$order <- as.character(summary.RDP$order)
SILVASSU30 <- dplyr::group_by(SILVASSU30, order)
summary.silva <- dplyr::summarise(SILVASSU30, SILVASSU30 = sum(unmapped))
summary.silva$order <- as.character(summary.silva$order)
summary.16S.order <- dplyr::full_join(summary.RDP, summary.silva, by="order")
summary.16S.order <- summary.16S.order[order(-summary.16S.order$SILVASSU30),]
colnames(summary.16S.order) <- c("order","RDP","Silva.SSU")
kable(summary.16S.order)
order RDP Silva.SSU
Bacillales 0.0012634 0.0013360
Clostridiales 0.0000972 0.0001457
Lactobacillales NA 0.0000729
Chromatiales 0.0000000 0.0000000
Pseudomonadales 0.0000000 0.0000000
unclassified (derived from Gammaproteobacteria) 0.0000000 0.0000000
unclassified (derived from Proteobacteria) 0.0000000 0.0000000
Rhabditida NA 0.0000000

The results from the two databases are very close to one another, indicating the presence of a handful of genomes from orders Bacillales and Clostriales. This is largely in accordance with the 16S amplicon library results. The 16S library suggested ~0.0030% other organisms, while the metagenome suggests ~0.0014% other organisms. At most, we should expect to find ~20 reads from any given fragment of genomes of minor community members.

Function counts

This table is the result of annotation of all reads with the Subsytems database. Annotations are displayed using the stringency measures: e-value = -10 %identity minimum = 60% minimum length of match = 15 minimum counts = 2 maximum counts = 500

(relative counts and min/max read counts were applied in Excel)

subsystems <- read.csv("MG-RAST/final.subsystems.function.tsv", sep="\t", header=TRUE)
subsystems <- subsystems[-6]
subsystems <- subsystems[order(-subsystems$rel),]
subsystems <- subsystems[subsystems$rel > 0.5,] #remove all relative fractions 50% and below
kable(head(subsystems))
level1 level2 level3 function. BTEX1.unmapped.FR.fastq Kuwait.1810.mgenome.fastq test rel
20 RNA Metabolism RNA processing and modification tRNA_modification_Archaea Asparaginyl-tRNA synthetase (EC 6.1.1.22) 6 3 6 2.000000
21 Carbohydrates Central carbohydrate metabolism Pyruvate_metabolism_II:_acetyl-CoA,_acetogenesis_from_pyruvate Pyruvate dehydrogenase E1 component alpha subunit (EC 1.2.4.1) 4 2 4 2.000000
22 Carbohydrates One-carbon Metabolism Serine-glyoxylate_cycle Phosphoenolpyruvate carboxykinase [ATP] (EC 4.1.1.49) 9 5 9 1.800000
23 Cell Wall and Capsule Capsular and extracellular polysacchrides Alginate_metabolism Probable poly(beta-D-mannuronate) O-acetylase (EC 2.3.1.-) 7 4 7 1.750000
24 Membrane Transport NULL Tricarboxylate_transport_system Tricarboxylate transport protein TctC 7 4 7 1.750000
25 Clustering-based subsystems NULL Cluster_containing_CofD-like_protein_and_co-occuring_with_DNA_repair FIG002813: LPPG:FO 2-phospho-L-lactate transferase like, CofD-like 5 3 5 1.666667

This is a small subset of the subsystems analysis results

print(paste("There are",nrow(subsystems),"functions annoted by the subsystems database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"))
## [1] "There are 227 functions annoted by the subsystems database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"

Summarize by Subsystems pathway

subsystems.level3 <- dplyr::group_by(subsystems, level3)
subsystems.level3 <- dplyr::summarise(subsystems.level3, subsystems.path = n())
subsystems.level3 <- subsystems.level3[order(-subsystems.level3$subsystems.path),]
kable(subsystems.level3)
level3 subsystems.path
Sporulation_gene_orphans 6
Purine_conversions 5
Sugar_utilization_in_Thermotogales 5
D-Galacturonate_and_D-Glucuronate_Utilization 4
Heme,_hemin_uptake_and_utilization_systems_in_GramPositives 4
L-rhamnose_utilization 4
Mannitol_Utilization 4
COG3533 3
Inositol_catabolism 3
Menaquinone_and_Phylloquinone_Biosynthesis 3
Phosphate_metabolism 3
Proteolysis_in_bacteria,_ATP-dependent 3
SigmaB_stress_responce_regulation 3
Trehalose_Uptake_and_Utilization 3
Bacterial_hemoglobins 2
Calvin-Benson_cycle 2
CBSS-257314.1.peg.676 2
CBSS-393133.3.peg.2787 2
Control_of_cell_elongation_-_division_cycle_in_Bacilli 2
DNA_processing_cluster 2
DNA_repair,_bacterial 2
ECF_class_transporters 2
Exopolysaccharide_Biosynthesis 2
Experimental_-_Histidine_Degradation 2
(GlcNAc)2_Catabolic_Operon 2
Oxidative_stress 2
Polyglycerolphosphate_lipoteichoic_acid_biosynthesis 2
Rhamnose_containing_glycans 2
Serine-glyoxylate_cycle 2
Tryptophan_catabolism 2
Xylose_utilization 2
4-Hydroxyphenylacetic_acid_catabolic_pathway 1
Acetone_Butanol_Ethanol_Synthesis 1
Adenosyl_nucleosidases 1
Alginate_metabolism 1
Arginine_and_Ornithine_Degradation 1
Arsenic_resistance 1
At4g10620_At3g57180_At3g47450 1
At5g63420 1
Bacillibactin_Siderophore 1
Bacillus_biofilm_matrix_protein_component_TasA_and_homologs 1
Bacitracin_Stress_Response 1
Bacterial_Chemotaxis 1
Beta-Glucoside_Metabolism 1
Beta-lactamase_cluster_in_Streptococcus 1
Biogenesis_of_c-type_cytochromes 1
Biotin_biosynthesis 1
Broadly_distributed_proteins_not_in_subsystems 1
Cadmium_resistance 1
Carbon_Starvation 1
Carboxysome 1
CBSS-176280.1.peg.1561 1
CBSS-176299.4.peg.1292 1
CBSS-193567.1.peg.90 1
CBSS-262719.3.peg.410 1
CBSS-269801.1.peg.1715 1
CBSS-288681.3.peg.1039 1
CBSS-393121.3.peg.1913 1
CBSS-393121.3.peg.2760 1
CBSS-393130.3.peg.129 1
Chitin_and_N-acetylglucosamine_utilization 1
Cluster_containing_CofD-like_protein_and_co-occuring_with_DNA_repair 1
Cluster_with_phosphopentomutase_paralog 1
Coenzyme_A_Biosynthesis 1
Coenzyme_PQQ_synthesis 1
COG0451 1
COG0523 1
COG2363 1
COG2509 1
COG3146_experimental 1
Competence_or_DNA_damage-inducible_protein_CinA_and_related_protein_families 1
Conserved_gene_cluster_associated_with_Met-tRNA_formyltransferase 1
Cysteine_Biosynthesis 1
Dehydrogenase_complexes 1
De_Novo_Purine_Biosynthesis 1
De_Novo_Pyrimidine_Synthesis 1
Deoxyribose_and_Deoxynucleoside_Catabolism 1
D-galactarate,_D-glucarate_and_D-glycerate_catabolism 1
Dihydroxyacetone_kinases 1
Dipicolinate_Synthesis 1
DMT_transporter 1
DNA_repair,_bacterial_DinG_and_relatives 1
DNA_Repair_Base_Excision 1
DNA_replication_cluster_1 1
D-Tagatose_and_Galactitol_Utilization 1
EcsAB_transporter_affecting_expression_and_secretion_of_secretory_preproteins 1
Fermentations:_Mixed_acid 1
Ferrous_iron_transporter_EfeUOB,_low-pH-induced 1
Formate_hydrogenase 1
Fosfomycin_resistance 1
Fructooligosaccharides(FOS)_and_Raffinose_Utilization 1
Galactose-inducible_PTS 1
Glutamate_and_Aspartate_uptake_in_Bacteria 1
Glutamate-mediated_methylamine_utilization_pathway 1
Glycerolipid_and_Glycerophospholipid_Metabolism_in_Bacteria 1
Glycine_Biosynthesis 1
Glycogen_metabolism_cluster 1
Glycolysis_and_Gluconeogenesis 1
Heme_and_Siroheme_Biosynthesis 1
HMG_CoA_Synthesis 1
HPr_catabolite_repression_system 1
HPr_kinase_and_hprK_operon_in_Gram-positive_organisms 1
Hydantoin_metabolism 1
Isobutyryl-CoA_to_Propionyl-CoA_Module 1
Isoleucine_degradation 1
Isoprenoid_Biosynthesis 1
Lactate_utilization 1
Lactose_and_Galactose_Uptake_and_Utilization 1
L-Arabinose_utilization 1
Listeria_Pathogenicity_Island_LIPI-1_extended 1
LMPTP_YwlE_cluster 1
Lysine_Biosynthesis_DAP_Pathway 1
Mannose_Metabolism 1
Mediator_of_hyperadherence_YidE_in_Enterobacteria_and_its_conserved_region 1
Metallocarboxypeptidases_(EC_3.4.17.-) 1
Metalloendopeptidases_(EC_3.4.24.-) 1
Methionine_Biosynthesis 1
Muconate_lactonizing_enzyme_family 1
Murein_hydrolase_regulation_and_cell_death 1
Nitrosative_stress 1
Nudix_proteins_(nucleoside_triphosphate_hydrolases) 1
Omega_peptidases_(EC_3.4.19.-) 1
p-Aminobenzoyl-Glutamate_Utilization 1
Pentose_phosphate_pathway 1
Peptidoglycan_Crosslinking_of_Peptide_Stems 1
Phage_replication 1
Phd-Doc,YdcE-YdcD_toxin-antitoxin(programmed_cell_death)_systems 1
Phenylalanine_and_Tyrosine_Branches_from_Chorismate 1
Photorespiration_(oxidative_C2_cycle) 1
Polyamine_Metabolism 1
Polymyxin_Synthetase_Gene_Cluster_in_Bacillus 1
Potassium_homeostasis 1
Proline,_4-hydroxyproline_uptake_and_utilization 1
PROSC 1
Protocatechuate_branch_of_beta-ketoadipate_pathway 1
pyrimidine_conversions 1
Pyruvate_metabolism_I:_anaplerotic_reactions,_PEP 1
Pyruvate_metabolism_II:_acetyl-CoA,_acetogenesis_from_pyruvate 1
Restriction-Modification_System 1
RNA_methylation 1
RNA_modification_cluster 1
Sialic_Acid_Metabolism 1
Siderophore_[Alcaligin-like] 1
Siderophore_assembly_kit 1
Single-Rhodanese-domain_proteins 1
Spore_Coat 1
Spore_Core_Dehydration 1
Spore_germination 1
Sporulation-associated_proteins_with_broader_functions 1
Sporulation_Cluster 1
Sporulation-related_Hypotheticals 1
SpoVS_protein_family 1
Streptococcus_pyogenes_recombinatorial_zone 1
Sulfatases_and_sulfatase_modifying_factor_1 1
Synthesis_of_osmoregulated_periplasmic_glucans 1
Teichoic_and_lipoteichoic_acids_biosynthesis 1
Terminal_cytochrome_C_oxidases 1
Terminal_cytochrome_oxidases 1
Thiamin_biosynthesis 1
Tolerance_to_colicin_E2 1
Trehalose_Biosynthesis 1
Tricarboxylate_transport_system 1
tRNA_modification_Archaea 1
tRNA_modification_Bacteria 1
tRNA_modification_yeast_cytoplasmic 1
tRNA_processing 1
Two-component_Response_Regulator_of_Virulence_ResDE 1
Ubiquinone_Menaquinone-cytochrome_c_reductase_complexes 1
Universal_GTPases 1
Unspecified_monosaccharide_transport_cluster 1
YgfZ 1

The Subsystems database is advantageous in that it is large and well annotated, but the organization leaves much to desire.

The presence of sporulation-related genes confirms the 16S results, that there are gram-positives in the community at low levels.

Summarizing Subsystems database results at higher level of organization

subsystems.level1 <- dplyr::group_by(subsystems, level1)
subsystems.level1 <- dplyr::summarise(subsystems.level1, subsystems.category = n())
subsystems.level1 <- subsystems.level1[order(-subsystems.level1$subsystems.category),]
kable(subsystems.level1)
level1 subsystems.category
Carbohydrates 54
Clustering-based subsystems 26
Miscellaneous 18
Amino Acids and Derivatives 13
Dormancy and Sporulation 13
Nucleosides and Nucleotides 11
Cell Wall and Capsule 10
Cofactors, Vitamins, Prosthetic Groups, Pigments 10
Stress Response 9
Iron acquisition and metabolism 8
Virulence, Disease and Defense 8
DNA Metabolism 7
Protein Metabolism 7
Membrane Transport 5
Respiration 5
RNA Metabolism 5
Regulation and Cell signaling 4
Phosphorus Metabolism 3
Cell Division and Cell Cycle 2
Fatty Acids, Lipids, and Isoprenoids 2
Metabolism of Aromatic Compounds 2
Phages, Prophages, Transposable elements, Plasmids 2
Motility and Chemotaxis 1
Nitrogen Metabolism 1
Potassium metabolism 1

COG annotation results

COG <- read.csv("MG-RAST/final.COG.function.tsv", sep="\t", header=TRUE)
COG <- COG[order(-COG$rel),]
COG <- COG[COG$rel > 0.5,] #remove all relative fractions 50% and below
kable(head(COG))
level1 level2 function. BTEX1.unmapped.FR.fastq Kuwait.1810.mgenome.fastq test rel
CELLULAR PROCESSES AND SIGNALING Cell wall/membrane/envelope biogenesis Predicted membrane protein involved in D-alanine export 5 2 5 2.5
METABOLISM Coenzyme transport and metabolism Na+/panthothenate symporter 6 3 6 2.0
POORLY CHARACTERIZED General function prediction only Predicted phosphohydrolases 4 2 4 2.0
METABOLISM Energy production and conversion Phosphoenolpyruvate carboxykinase (ATP) 9 5 9 1.8
INFORMATION STORAGE AND PROCESSING Replication, recombination and repair ATPase involved in DNA replication 6 4 6 1.5
INFORMATION STORAGE AND PROCESSING Replication, recombination and repair Adenine specific DNA methylase Mod 3 2 3 1.5

This is a small subset of the full COG database results

print(paste("There are",nrow(COG),"functions annoted by the COG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"))
## [1] "There are 145 functions annoted by the COG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"

Summarizing by COG pathway

COG.level2 <- dplyr::group_by(COG, level2)
COG.level2 <- dplyr::summarise(COG.level2, COG.path = n())
COG.level2 <- COG.level2[order(-COG.level2$COG.path),]
kable(COG.level2)
level2 COG.path
Carbohydrate transport and metabolism 27
General function prediction only 26
Amino acid transport and metabolism 16
Inorganic ion transport and metabolism 13
Energy production and conversion 12
Cell wall/membrane/envelope biogenesis 8
Replication, recombination and repair 7
Coenzyme transport and metabolism 6
Secondary metabolites biosynthesis, transport and catabolism 6
Signal transduction mechanisms 6
Nucleotide transport and metabolism 5
Transcription 5
Lipid transport and metabolism 3
Posttranslational modification, protein turnover, chaperones 2
Cell motility 1
Function unknown 1
Translation, ribosomal structure and biogenesis 1

This is only consistent with the subsystems results in that carbohydrate metabolism is the main category containing unique genes.

COG.level1 <- dplyr::group_by(COG, level1)
COG.level1 <- dplyr::summarise(COG.level1, COG.category = n())
COG.level1 <- COG.level1[order(-COG.level1$COG.category),]
kable(COG.level1)
level1 COG.category
METABOLISM 88
POORLY CHARACTERIZED 27
CELLULAR PROCESSES AND SIGNALING 17
INFORMATION STORAGE AND PROCESSING 13

KEGG database results

KEGG <- read.csv("MG-RAST/final.KEGG.function.tsv", sep="\t", header=TRUE)
KEGG <- KEGG[order(-KEGG$rel),]
KEGG <- KEGG[KEGG$rel > 0.5,] #remove all relative fractions 50% and below
kable(head(KEGG))
level1 level2 level3 function. BTEX1.unmapped.FR.fastq Kuwait.1810.mgenome.fastq test rel
Environmental Information Processing Membrane transport 02010 ABC transporters [PATH:ko02010] araG; L-arabinose transport system ATP-binding protein [EC:3.6.3.17] 14 6 14 2.333333
Metabolism Carbohydrate metabolism 00052 Galactose metabolism [PATH:ko00052] bgaB, lacA; beta-galactosidase [EC:3.2.1.23] 4 2 4 2.000000
Environmental Information Processing Signal transduction 02020 Two-component system [PATH:ko02020] desR; two-component system, NarL family, response regulator DesR 4 2 4 2.000000
Metabolism Carbohydrate metabolism 00500 Starch and sucrose metabolism [PATH:ko00500] treS; maltose alpha-D-glucosyltransferase/ alpha-amylase [EC:5.4.99.16 3.2.1.1] 6 3 6 2.000000
Metabolism Carbohydrate metabolism 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] E4.1.1.49, pckA; phosphoenolpyruvate carboxykinase (ATP) [EC:4.1.1.49] 12 7 12 1.714286
Metabolism Amino acid metabolism 00270 Cysteine and methionine metabolism [PATH:ko00270] E2.1.1.10, mmuM; homocysteine S-methyltransferase [EC:2.1.1.10] 8 5 8 1.600000
print(paste("There are",nrow(KEGG),"functions annoted by the KEGG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"))
## [1] "There are 144 functions annoted by the KEGG database which meet the criteria for being present primarily in non-Arhodomonas menbers of the community"
KEGG.level3 <- dplyr::group_by(KEGG, level3)
KEGG.level3 <- dplyr::summarise(KEGG.level3, KEGG.pathway = n())
KEGG.level3 <- KEGG.level3[order(-KEGG.level3$KEGG.pathway),]
kable(KEGG.level3)
level3 KEGG.pathway
02010 ABC transporters [PATH:ko02010] 13
02020 Two-component system [PATH:ko02020] 13
02060 Phosphotransferase system (PTS) [PATH:ko02060] 10
00010 Glycolysis / Gluconeogenesis [PATH:ko00010] 5
00130 Ubiquinone and other terpenoid-quinone biosynthesis [PATH:ko00130] 5
00270 Cysteine and methionine metabolism [PATH:ko00270] 5
00040 Pentose and glucuronate interconversions [PATH:ko00040] 4
00051 Fructose and mannose metabolism [PATH:ko00051] 4
00190 Oxidative phosphorylation [PATH:ko00190] 4
00230 Purine metabolism [PATH:ko00230] 4
00500 Starch and sucrose metabolism [PATH:ko00500] 4
00260 Glycine, serine and threonine metabolism [PATH:ko00260] 3
00520 Amino sugar and nucleotide sugar metabolism [PATH:ko00520] 3
00562 Inositol phosphate metabolism [PATH:ko00562] 3
00030 Pentose phosphate pathway [PATH:ko00030] 2
00052 Galactose metabolism [PATH:ko00052] 2
00053 Ascorbate and aldarate metabolism [PATH:ko00053] 2
00280 Valine, leucine and isoleucine degradation [PATH:ko00280] 2
00330 Arginine and proline metabolism [PATH:ko00330] 2
00340 Histidine metabolism [PATH:ko00340] 2
00350 Tyrosine metabolism [PATH:ko00350] 2
00561 Glycerolipid metabolism [PATH:ko00561] 2
00620 Pyruvate metabolism [PATH:ko00620] 2
00730 Thiamine metabolism [PATH:ko00730] 2
00770 Pantothenate and CoA biosynthesis [PATH:ko00770] 2
00860 Porphyrin and chlorophyll metabolism [PATH:ko00860] 2
00940 Phenylpropanoid biosynthesis [PATH:ko00940] 2
01053 Biosynthesis of siderophore group nonribosomal peptides [PATH:ko01053] 2
04146 Peroxisome [PATH:ko04146] 2
00020 Citrate cycle (TCA cycle) [PATH:ko00020] 1
00140 Steroid hormone biosynthesis [PATH:ko00140] 1
00240 Pyrimidine metabolism [PATH:ko00240] 1
00250 Alanine, aspartate and glutamate metabolism [PATH:ko00250] 1
00300 Lysine biosynthesis [PATH:ko00300] 1
00310 Lysine degradation [PATH:ko00310] 1
00360 Phenylalanine metabolism [PATH:ko00360] 1
00362 Benzoate degradation [PATH:ko00362] 1
00380 Tryptophan metabolism [PATH:ko00380] 1
00440 Phosphonate and phosphinate metabolism [PATH:ko00440] 1
00511 Other glycan degradation [PATH:ko00511] 1
00521 Streptomycin biosynthesis [PATH:ko00521] 1
00550 Peptidoglycan biosynthesis [PATH:ko00550] 1
00564 Glycerophospholipid metabolism [PATH:ko00564] 1
00630 Glyoxylate and dicarboxylate metabolism [PATH:ko00630] 1
00633 Nitrotoluene degradation [PATH:ko00633] 1
00650 Butanoate metabolism [PATH:ko00650] 1
00680 Methane metabolism [PATH:ko00680] 1
00760 Nicotinate and nicotinamide metabolism [PATH:ko00760] 1
00785 Lipoic acid metabolism [PATH:ko00785] 1
00900 Terpenoid backbone biosynthesis [PATH:ko00900] 1
00930 Caprolactam degradation [PATH:ko00930] 1
00970 Aminoacyl-tRNA biosynthesis [PATH:ko00970] 1
02040 Flagellar assembly [PATH:ko02040] 1
03008 Ribosome biogenesis in eukaryotes [PATH:ko03008] 1
03013 RNA transport [PATH:ko03013] 1
03018 RNA degradation [PATH:ko03018] 1
03020 RNA polymerase [PATH:ko03020] 1
03030 DNA replication [PATH:ko03030] 1
03410 Base excision repair [PATH:ko03410] 1
03430 Mismatch repair [PATH:ko03430] 1
03440 Homologous recombination [PATH:ko03440] 1
04142 Lysosome [PATH:ko04142] 1
05146 Amoebiasis [PATH:ko05146] 1
KEGG.level2 <- dplyr::group_by(KEGG, level2)
KEGG.level2 <- dplyr::summarise(KEGG.level2, KEGG.level2 = n())
KEGG.level2 <- KEGG.level2[order(-KEGG.level2$KEGG.level2),]
kable(KEGG.level2)
level2 KEGG.level2
Carbohydrate metabolism 34
Membrane transport 23
Amino acid metabolism 21
Metabolism of cofactors and vitamins 13
Signal transduction 13
Energy metabolism 5
Nucleotide metabolism 5
Lipid metabolism 4
Replication and repair 4
Biosynthesis of other secondary metabolites 3
Translation 3
Transport and catabolism 3
Xenobiotics biodegradation and metabolism 3
Glycan biosynthesis and metabolism 2
Metabolism of terpenoids and polyketides 2
Cell motility 1
Folding, sorting and degradation 1
Infectious diseases 1
Metabolism of other amino acids 1
Metabolism of Terpenoids and Polyketides 1
Transcription 1

Mapping KEGG functions to pathways

#this cell imports and parses the MG-RAST kegg orthology file output to create a name to KO mapping file
library(dplyr)
library(stringr)
substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))}
KO.name.full <- read.csv("MG-RAST/ontology.csv", header = TRUE, sep = "\t", stringsAsFactors = F, row.names=NULL)
KO.name <- KO.name.full[13]
KO.name <- data.frame(KO.name[!duplicated(KO.name),])
colnames(KO.name) <- "gene"
KO.name <- data.frame(stringr::str_split_fixed(KO.name$gene, ",", 2), stringsAsFactors = FALSE)
colnames(KO.name) <- c("KO","KO.function")
KO.name$KO <- substr(KO.name$KO, nchar(KO.name$KO)-6, nchar(KO.name$KO)-1)
KO.name$KO.function <- substr(KO.name$KO.function, 11, nchar(KO.name$KO.function))
KO.name$KO.function <- substr(KO.name$KO.function, 0, nchar(KO.name$KO.function) - 1)
KO.name <- KO.name[c(2,1)]
#this chunk takes the KEGG table output from MGRAST analysis and maps the kegg names to KO numbers
KEGG.name <- KEGG[c(4,5)]
KEGG.name.2 <- data.frame(stringr::str_split_fixed(KEGG.name$function.,"; ",2))
KEGG.name <- cbind(KEGG.name.2[2],KEGG.name[2])
colnames(KEGG.name) <- c("KO.function","count")
KEGG.name$KO.function <- as.character(KEGG.name$KO.function)
KEGG.name.KO <- dplyr::left_join(KEGG.name,KO.name,by="KO.function")

Some of the MG-RAST file is not possible to parse cleanly. ~10% of function names are mapped to KO terms by hand. This file is read back in and turned into a list of KO terms that can be fed into KEGG mapper (https://www.genome.jp/kegg/tool/map_pathway1.html)

unmapped.novel.KO <- read.csv("KEGG.name.KO.csv", header = TRUE, stringsAsFactors = FALSE)
KO.terms <- data.frame(unmapped.novel.KO$KO)
write.csv(KO.terms,"KO.terms.csv")

At KEGG mapper website, KO terms can be pasted in as such:

This in turn leads to an index of pathways to which the query functions map.

The individual KEGG maps can be used to guide biological investigations of the implications of the functional annotations.

Example investigations of functions and pathways novel to the minor community members

Sarcosine oxidase

Mannitol uptake and metabolism

Trehalose uptake and metabolism

Vitamin K synthesis

Lack of complete Vitamin K synthesis in Arhodomonas genome

These pathways represent only an initial survey of the functional annotations

Conclusions and future work

Initial analysis confirms the presence of several minor community members from orders Bacilli and Clostridia. This is supported by analysis of 16S rRNA gene fragments in the metagenome and by the presence of functions unique to gram positive bacteria such as teichoic acid metabolism and sporulation. Their unique activities include uptake of carbohydrates (sugars), waste products (sarcosine), and osmoprotectants (trehalose) while enabling synthesis of Vitamin K (menaqionone). Altogether, this paints a picture of organisms surviving by scavenging trace sugars and waste products and providing an essential vitamin to the community.

Testable hypotheses: