16S and Shotgun Metagenomics integration

Published

November 5, 2025

Abstract

Three pipelines were compared in the present analysis (see Table 1).

Table 1- Pipelines used for the analysis of 18 samples. The table shows the databases and tools used in each pipeline

Data changes

Pipeline 1

  • 16S rRNA gene: All taxa originally labeled as “–” were renamed to reflect the lowest available taxonomic level (d: domain, p: phylum, c: class, o: order, f: family). Each of these taxa was subsequently tagged with “NID” to indicate “Not Identified.”

Pipelines 2 and 3

  • 16S rRNA gene: the following taxa were merged into a single taxon: Bacteroides and [Bacteroides], Clostridium and [Clostridium], Eubacterium x2 and [Eubacterium] x2. All taxa originally labeled as “–” were renamed to reflect the lowest available taxonomic level (d: domain, p: phylum, c: class, o: order, f: family). Each of these taxa was subsequently tagged with “NID” to indicate “Not Identified.”
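The relabeling and merging steps described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the lineage layout (domain to genus), the label format `f:Name NID`, and the function names `genus_label` and `merge_bracketed` are all assumptions made for the example.

```python
from collections import defaultdict

# Rank prefixes for the levels above genus, as listed in the text.
RANKS = ["d", "p", "c", "o", "f"]
# Placeholders treated as "not identified" (the en dash comes from the source tables).
UNASSIGNED = {"–", "-", ""}


def genus_label(lineage):
    """Return the genus name, or the lowest named rank tagged with NID.

    `lineage` is assumed to be [domain, phylum, class, order, family, genus].
    The output format "f:Name NID" is hypothetical.
    """
    *higher, genus = lineage
    if genus not in UNASSIGNED:
        return genus
    # Walk upward from family to domain until a named rank is found.
    for prefix, name in reversed(list(zip(RANKS, higher))):
        if name not in UNASSIGNED:
            return f"{prefix}:{name} NID"
    return "NID"


def merge_bracketed(counts):
    """Sum counts of taxa whose names differ only by square brackets."""
    merged = defaultdict(int)
    for taxon, n in counts.items():
        merged[taxon.strip("[]")] += n
    return dict(merged)
```

For example, a lineage ending in "–" at genus level with family Lachnospiraceae would be labeled `f:Lachnospiraceae NID`, and `merge_bracketed` would add the counts of `[Bacteroides]` to those of `Bacteroides`.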

Data check

Prevalence was calculated for each genus by sequencing type. Results are detailed in Tables 2 to 4, one table per pipeline.
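As a rough sketch of this check, prevalence can be taken as the number of samples in which a taxon has a non-zero count, computed separately for each sequencing type. The table layout and the function name `prevalence` are assumptions for illustration only.

```python
def prevalence(table):
    """Count the samples in which each taxon is present (count > 0).

    `table` is assumed to map taxon -> {sample_id: count} for one
    sequencing type (16S or shotgun).
    """
    return {
        taxon: sum(1 for count in samples.values() if count > 0)
        for taxon, samples in table.items()
    }


# Hypothetical toy counts for one sequencing type.
counts_16s = {
    "Bacteroides": {"S1": 120, "S2": 0, "S3": 45},
    "Blautia": {"S1": 0, "S2": 0, "S3": 7},
}
print(prevalence(counts_16s))
```

Running the same function on the 16S table and on the shotgun table gives the two columns of a per-pipeline prevalence table.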

Table 2- Number of samples per species by sequencing type (pipeline 1)

Note:

GreenGenes2 (GG2) was redesigned (2023–2024) to provide updated, genome-based taxonomy, aligned with GTDB. The identifiers like g__01-FULL-36-10b follow a systematic internal nomenclature used in GG2 to denote unnamed taxa that are distinct, but not yet formally described.

Breakdown of the name:

g__: Genus-level classification

01-FULL: Cluster prefix that identifies one of GG2’s internal genome clusters (“FULL” means full-length 16S reference sequence).

36-10b: Cluster index, unique within that cluster, often indicating subclades or genomic bins.

So g__01-FULL-36-10b is a placeholder genus in the GG2 tree: an unnamed, genome-defined group used for genomes that form a coherent genus-level clade but have not yet been given a formal name in GTDB/NCBI.

It is a real microbial clade, not a formatting artifact; it simply has not been taxonomically described yet.

Table 3- Number of samples per species by sequencing type (pipeline 2)
Table 4- Number of samples per species by sequencing type (pipeline 3)

Unique and shared taxa

The unique and shared taxa were then calculated for each pipeline.
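A minimal sketch of this comparison, assuming the taxa detected by each sequencing type are available as simple name lists (the function name and dictionary keys are illustrative):

```python
def unique_and_shared(taxa_16s, taxa_shotgun):
    """Split taxa into those seen only by 16S, only by shotgun, or by both."""
    a, b = set(taxa_16s), set(taxa_shotgun)
    return {
        "unique_16S": a - b,
        "unique_shotgun": b - a,
        "shared": a & b,
    }


# Hypothetical taxon lists for one pipeline.
result = unique_and_shared(
    ["Bacteroides", "Blautia", "Roseburia"],
    ["Bacteroides", "Akkermansia"],
)
print({key: len(value) for key, value in result.items()})
```

The same function applies to the comparison between two pipelines by passing their taxon lists instead.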

Table 5- Number of unique and shared species

Unique and shared taxa between pipelines 2 and 3

Table 6- Number of samples per species by sequencing pipeline (pipelines 2 and 3)
Table 7- Number of unique and shared species (pipelines 2 and 3)

Beta diversity

First, PCoA plots without rarefaction were generated for each pipeline.
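The two dissimilarities used in the figures can be written out directly. The sketch below uses the standard definitions of Bray-Curtis (abundance-based) and Jaccard (presence/absence) on count vectors that share the same taxon order; it is not the code used in the analysis, and the function names are illustrative.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(u) + sum(v)
    return numerator / denominator if denominator else 0.0


def jaccard(u, v):
    """Jaccard distance on presence/absence (count > 0 means present)."""
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    union = present_u | present_v
    if not union:
        return 0.0
    return 1 - len(present_u & present_v) / len(union)


def distance_matrix(samples, metric):
    """Pairwise distance matrix for a list of count vectors."""
    n = len(samples)
    return [[metric(samples[i], samples[j]) for j in range(n)] for i in range(n)]
```

PCoA then takes such a matrix, double-centers the squared distances, and uses the leading eigenvectors as plot axes; that step normally relies on a linear algebra library and is omitted here.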

Pipeline 1

No rarefaction

Figure 1- PCoA plots by sequencing type without rarefaction (Bray-Curtis)
Figure 2- PCoA plot by sequencing type without rarefaction (Jaccard)

Shared taxa

Figure 3- PCoA plots for shared taxa by sequencing type without rarefaction (Bray-Curtis)

Pipeline 2

No rarefaction

Figure 4- PCoA plots by sequencing type without rarefaction (Bray-Curtis)
Figure 5- PCoA plot by sequencing type without rarefaction (Jaccard)

Shared taxa

Figure 6- PCoA plots for shared taxa by sequencing type without rarefaction (Bray-Curtis)
Figure 7- PCoA plot for shared taxa by sequencing type without rarefaction (Jaccard)

Pipeline 3

No rarefaction

Figure 8- PCoA plots by sequencing type without rarefaction (Bray-Curtis)
Figure 9- PCoA plots by sequencing type without rarefaction (Jaccard)

Shared taxa

Figure 10- PCoA plots for shared taxa by sequencing type without rarefaction (Bray-Curtis)
Figure 11- PCoA plots for shared taxa by sequencing type without rarefaction (Jaccard)

Rarefaction and Beta diversity

Data were rarefied to 10,000 reads per sample, and beta diversity indices were calculated without any filtering.
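Rarefaction to a fixed depth amounts to drawing reads at random without replacement from each sample. The sketch below is one simple way to do that; the fixed seed and the choice to return `None` for samples with fewer reads than the target depth are assumptions, since the report does not state how such samples were handled.

```python
import random
from collections import Counter


def rarefy(counts, depth=10_000, seed=0):
    """Subsample one sample's taxon counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads
    (an assumed policy, not taken from the report).
    """
    if sum(counts.values()) < depth:
        return None
    # Expand counts into one entry per read, then draw `depth` of them.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


# Hypothetical sample with 20,010 reads.
sample = {"Bacteroides": 15_000, "Blautia": 5_000, "Roseburia": 10}
rarefied = rarefy(sample)
print(sum(rarefied.values()))
```

Expanding the counts into a read pool is memory-hungry for deep samples; it is used here only because it keeps the logic easy to follow.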

Pipeline 1

Figure 12- PCoA plots by sequencing type (Bray-Curtis)
Figure 13- PCoA plots by sequencing type (Jaccard)

Pipeline 2

Figure 14- PCoA plots by sequencing type (Bray-Curtis)
Figure 15- PCoA plots by sequencing type (Jaccard)

Pipeline 3

Figure 16- PCoA plots by sequencing type (Bray-Curtis)
Figure 17- PCoA plots by sequencing type (Jaccard)

Filter by relative abundance

Each dataset was filtered separately for 16S rRNA gene and shotgun metagenomics data: all taxa with a median relative abundance below 0.01% were removed.
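The filter can be sketched as below, assuming relative abundance is computed per sample as a percentage of that sample's total reads and the median is taken across all samples of one sequencing type. The table layout and function name are illustrative.

```python
from statistics import median


def filter_by_median_abundance(table, threshold_pct=0.01):
    """Drop taxa whose median relative abundance (%) is below the threshold.

    `table` is assumed to map sample_id -> {taxon: count} for one
    sequencing type; taxa missing from a sample count as zero.
    """
    taxa = {taxon for counts in table.values() for taxon in counts}
    rel = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        rel[sample] = {
            t: (100 * counts.get(t, 0) / total if total else 0.0) for t in taxa
        }
    keep = {t for t in taxa if median(rel[s][t] for s in table) >= threshold_pct}
    return {
        sample: {t: c for t, c in counts.items() if t in keep}
        for sample, counts in table.items()
    }
```

Calling the function with `threshold_pct=0.001` gives the more permissive variant of the filter.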

Table 8- Number of unique and shared species after filtering by relative abundance (0.01%)

Relative abundance 0.001%

Table 9- Number of unique and shared species after filtering by relative abundance (>0.001%)

Unique and shared taxa after filtering (pipelines 2 and 3)

Beta diversity after filtering

Pipeline 1 0.01%

Figure 18- PCoA plots after filtering by relative abundance (Bray-Curtis)
Figure 19- PCoA plots after filtering by relative abundance (Jaccard)

Pipeline 1 0.001%

Figure 20- PCoA plots after filtering by relative abundance 0.001% (Bray-Curtis)
Figure 21- PCoA plots after filtering by relative abundance 0.001% (Jaccard)

Pipeline 2 0.01%

Figure 22- PCoA plots after filtering by relative abundance (Bray-Curtis)
Figure 23- PCoA plots after filtering by relative abundance (Jaccard)

Pipeline 2 0.001%

Figure 24- PCoA plots after filtering by relative abundance (Bray-Curtis)
Figure 25- PCoA plots after filtering by relative abundance (Jaccard)

Pipeline 3 0.01%

Figure 26- PCoA plots after filtering by relative abundance (Bray-Curtis)
Figure 27- PCoA plots after filtering by relative abundance (Jaccard)

Pipeline 3 0.001%

Figure 28- PCoA plots after filtering by relative abundance (Bray-Curtis)
Figure 29- PCoA plots after filtering by relative abundance (Jaccard)

Variance explained

The percentage of variance explained by subject and by Dataset was calculated for each matrix across all pipelines. Table 10 shows the R² values for subject after the various data processing steps, and Table 11 shows the values for sequencing technique.

Table 10- Variance explained by subject in each pipeline
Table 11- Variance explained by sequencing technique in each pipeline
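The report does not state how R² was obtained. One common choice for distance matrices is the PERMANOVA-style partition of sums of squares (as in the `adonis2` function of the R package vegan), where R² is the share of total variation that lies between groups. The sketch below assumes that approach for a single grouping factor; it is an illustration, not the method confirmed by the source.

```python
from collections import defaultdict


def permanova_r2(dist, groups):
    """R² of one grouping factor from a square distance matrix.

    Assumes the PERMANOVA partition: SS_total is the sum of squared
    pairwise distances divided by N, SS_within sums the same quantity
    inside each group, and R² = 1 - SS_within / SS_total.
    """
    n = len(dist)
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n

    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)

    ss_within = 0.0
    for idxs in members.values():
        pair_sum = sum(
            dist[i][j] ** 2 for pos, i in enumerate(idxs) for j in idxs[pos + 1:]
        )
        ss_within += pair_sum / len(idxs)
    return 1 - ss_within / ss_total


# Toy example: four points on a line, two tight groups far apart.
points = [0, 1, 10, 11]
toy_dist = [[abs(a - b) for b in points] for a in points]
print(permanova_r2(toy_dist, ["a", "a", "b", "b"]))
```

Passing subject labels gives the kind of value reported per subject, and passing sequencing-type labels gives the value for sequencing technique.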

Data integration

Data from the three pipelines were merged into a single dataset, and beta diversity was calculated.
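How the merge was done is not described, so the following is only one plausible sketch: the per-pipeline tables are stacked into one sample-by-taxon table over the union of taxa, with zeros for taxa a pipeline did not report. The row key format `pipeline|sample` and the function name are assumptions.

```python
def merge_pipelines(tables):
    """Stack per-pipeline count tables into one table over the union of taxa.

    `tables` is assumed to map pipeline -> {sample_id: {taxon: count}}.
    Returns the sorted taxon list and a dict of count vectors keyed by
    "pipeline|sample" (a hypothetical naming scheme).
    """
    taxa = sorted(
        {t for samples in tables.values() for counts in samples.values() for t in counts}
    )
    merged = {}
    for pipeline, samples in tables.items():
        for sample, counts in samples.items():
            merged[f"{pipeline}|{sample}"] = [counts.get(t, 0) for t in taxa]
    return taxa, merged


# Hypothetical toy input with two pipelines and one sample each.
taxa, merged = merge_pipelines({
    "P1": {"S1": {"Bacteroides": 5, "Blautia": 2}},
    "P2": {"S1": {"Bacteroides": 4, "Akkermansia": 1}},
})
print(taxa, merged)
```

The resulting count vectors share one taxon order, so they can be passed directly to a pairwise distance calculation.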

Figure 30- PCoA plots for all pipelines (Jaccard)

The dataset was then rarefied to 10,000 reads per sample, and beta diversity was recalculated.

Figure 31- PCoA plots after integration and rarefaction