This tutorial demonstrates basic usage of curatedMetagenomicData, prerequisite steps for analysis, estimation of alpha diversity, analysis of beta diversity, and differential abundance analysis. To make the tables and figures in this tutorial, a number of R packages are required, as follows. If you are using app.orchestra.cancerdatasci.org, these packages are already installed for you; if not, you’ll have to make sure the packages are installed.
library(curatedMetagenomicData)
library(dplyr)
library(stringr)
library(scater)
library(snakecase)
library(forcats)
library(gtsummary)
library(mia)
library(ggplot2)
library(ggsignif)
library(hrbrthemes)
library(vegan)
library(uwot)
library(ANCOMBC)
library(tibble)
library(tidyr)
library(knitr)
library(ggrepel)
With the requisite R packages loaded and a basic working knowledge of metagenomics analysis (as was covered in the workshop), you’re ready to begin. Just one last note, in most of the code chunks below messages have been suppressed; more verbose output should be expected when using the R console.
First, let’s address what the curatedMetagenomicData R/Bioconductor package is; here’s an brief description.
The curatedMetagenomicData package provides standardized, curated human microbiome data for novel analyses. It includes gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples collected from different body sites. The bacterial, fungal, and archaeal taxonomic abundances for each sample were calculated with MetaPhlAn3 and metabolic functional potential was calculated with HUMAnN3. The manually curated sample metadata and standardized metagenomic data are available as (Tree)SummarizedExperiment objects.
Be sure to refer to the SummarizedExperiment and TreeSummarizedExperiment vignettes if the data structures are unclear. There is also a reference website for curatedMetagenomicData at waldronlab.io/curatedMetagenomicData, if you need it.
Any metagenomic analysis that uses curatedMetagenomicData is likely to begin with an exploration of sample metadata (i.e. sampleMetadata). The sampleMetadata object is a data.frame in the package that contains sample metadata that has been manually curated by humans for each and every sample. To give you an idea of some of the columns that are available, dplyr syntax is used below to take a random sampling of 10 rows (slice_sample). Then, columns containing NA values are removed, and the first 10 remaining columns are selected. The data.frame is sorted alphabetically by study_name prior to being returned.
sampleMetadata |>
slice_sample(n = 10) |>
select(where(~ !any(is.na(.x)))) |>
select(1:10) |>
arrange(study_name)
## study_name sample_id
## 1 AsnicarF_2021 SAMEA7045003
## 2 HMP_2019_ibdmdb HSM67VEI
## 3 HMP_2019_t2d HMP2_J05418_M_ST_T0_B0_0120_ZLZNCLZ-03_C934HANXX
## 4 LifeLinesDeep_2016 EGAR00001420875_9002000001448300
## 5 LomanNJ_2013 OBK4198
## 6 VatanenT_2016 G69159
## 7 WampachL_2018 MV_C118
## 8 YachidaS_2019 SAMD00114799
## 9 ZeeviD_2015 PNP_Main_550
## 10 ZeeviD_2015 PNP_Main_168
## subject_id body_site study_condition disease age_category
## 1 predict1_MTG_0150 stool control healthy adult
## 2 H4006 stool IBD IBD child
## 3 ZLZNCLZ stool IGT IGT adult
## 4 sub_9002000001448300LL stool control healthy adult
## 5 OBK4198 stool STEC STEC adult
## 6 P006862 stool control healthy child
## 7 M_C118 vagina control healthy adult
## 8 sub_10265 stool CRC CRC senior
## 9 PNP_Main_550 stool control healthy adult
## 10 PNP_Main_168 stool control healthy adult
## country non_westernized sequencing_platform
## 1 GBR no IlluminaNovaSeq
## 2 USA no IlluminaHiSeq
## 3 USA no IlluminaHiSeq
## 4 NLD no IlluminaHiSeq
## 5 DEU no IlluminaHiSeq
## 6 RUS no IlluminaHiSeq
## 7 LUX no IlluminaHiSeq
## 8 JPN no IlluminaHiSeq
## 9 ISR no IlluminaHiSeq
## 10 ISR no IlluminaHiSeq
The resources available in curatedMetagenomicData are organized by study_name and can be discovered with the curatedMetagenomicData() function. When provided with a string, it will return the names of available resources. Each study_name will have 6 data types (gene_families, marker_abundance, marker_presence, pathway_abundance, pathway_coverage, and relative_abundance), these follow the study_name and are separated by a dot. The date that precedes study_name is for versioning, but it’s unimportant here.
curatedMetagenomicData("AsnicarF")
## 2021-03-31.AsnicarF_2017.gene_families
## 2021-03-31.AsnicarF_2017.marker_abundance
## 2021-03-31.AsnicarF_2017.marker_presence
## 2021-03-31.AsnicarF_2017.pathway_abundance
## 2021-03-31.AsnicarF_2017.pathway_coverage
## 2021-03-31.AsnicarF_2017.relative_abundance
## 2021-03-31.AsnicarF_2021.gene_families
## 2021-03-31.AsnicarF_2021.marker_abundance
## 2021-03-31.AsnicarF_2021.marker_presence
## 2021-03-31.AsnicarF_2021.pathway_abundance
## 2021-03-31.AsnicarF_2021.pathway_coverage
## 2021-03-31.AsnicarF_2021.relative_abundance
The first argument passed to curatedMetagenomicData(), pattern, is actually a regular expression. So things like .+ (match any character one or more times) work, and the names multiple resources are returned if they match. Below, you can see there are two AsnicarF resources for the relative_abundance data type.
curatedMetagenomicData("AsnicarF.+.relative_abundance")
## 2021-03-31.AsnicarF_2017.relative_abundance
## 2021-03-31.AsnicarF_2021.relative_abundance
As is clear from the examples above, simply searching for the AsnicarF studies did not download any curated metagenomic data. To do that you must provide another argument, dryrun = FALSE, to the curatedMetagenomicData() function. Doing so will download the matching resources from ExperimentHub (or load them from the local cache). When dryrun = FALSE, curatedMetagenomicData() will always return a list of SummarizedExperiment and/or TreeSummarizedExperiment objects.
curatedMetagenomicData("AsnicarF.+.relative_abundance", dryrun = FALSE)
## $`2021-03-31.AsnicarF_2017.relative_abundance`
## class: TreeSummarizedExperiment
## dim: 301 24
## metadata(0):
## assays(1): relative_abundance
## rownames(301):
## k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_bifidum
## ...
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_gordonii
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Aerococcaceae|g__Abiotrophia|s__Abiotrophia_sp_HMSC24B09
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(24): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... MV_MIM5_t2M14
## MV_MIM5_t3F15
## colData names(22): study_name subject_id ... lactating curator
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (301 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
##
## $`2021-03-31.AsnicarF_2021.relative_abundance`
## class: TreeSummarizedExperiment
## dim: 639 1098
## metadata(0):
## assays(1): relative_abundance
## rownames(639):
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_vulgatus
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_stercoris
## ...
## k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Pyramidobacter|s__Pyramidobacter_sp_C12_8
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Brevibacteriaceae|g__Brevibacterium|s__Brevibacterium_aurantiacum
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(1098): SAMEA7041133 SAMEA7041134 ... SAMEA7045952 SAMEA7045953
## colData names(21): study_name subject_id ... curator BMI
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (639 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
It can be useful to have the list elements returned by curatedMetagenomicData() merged into a single SummarizedExperiment or TreeSummarizedExperiment object. This is accomplished using the mergeData() function – simply pipe (|>) the output of curatedMetagenomicData() to mergeData() and a single object will be returned.
curatedMetagenomicData("AsnicarF.+.relative_abundance", dryrun = FALSE) |>
mergeData()
## class: TreeSummarizedExperiment
## dim: 681 1122
## metadata(0):
## assays(1): relative_abundance
## rownames(681):
## k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_bifidum
## ...
## k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Pyramidobacter|s__Pyramidobacter_sp_C12_8
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Brevibacteriaceae|g__Brevibacterium|s__Brevibacterium_aurantiacum
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(1122): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... SAMEA7045952
## SAMEA7045953
## colData names(24): study_name subject_id ... antibiotics_current_use
## BMI
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (681 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
There is just one more function in the curatedMetagenomicData package to know about, returnSamples(). It is used when you want to return samples across studies that match a certain set of conditions. To return a specific set of samples, you first subset the sampleMetadata data.frame to include only the desired samples. In the example below, stool (body_site == "stool") samples from healthy (disease == "healthy") adults (age >= 18) living in Ireland or Italy (str_detect(country, "IRL|ITA")) are included before columns of all NA values are dropped. When the sampleMetadata data.frame is subset as desired, it is piped to the returnSamples() function. As you can see below, there is another argument, counts, which has not yet been mentioned. When counts = TRUE, relative abundance proportions are multiplied by read depth and rounded to the nearest integer prior to being returned – the argument applies to both returnSamples() and curatedMetagenomicData() when requesting the relative_abundance data type.
countryData <-
filter(sampleMetadata, body_site == "stool") |>
filter(disease == "healthy") |>
filter(age >= 18) |>
filter(str_detect(country, "IRL|ITA")) |>
select(where(~ !all(is.na(.x)))) |>
returnSamples("relative_abundance", counts = TRUE)
First, have a look at the countryData object you just created to get a better understanding of what it contains. You will be using it for the rest of this tutorial.
countryData
## class: TreeSummarizedExperiment
## dim: 906 286
## metadata(0):
## assays(1): relative_abundance
## rownames(906):
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_adolescentis
## ...
## k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_sp_D5
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_algidus
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(286): SID01.BA.VG.2 SID01.BA.V_metag ... CRC_MR_SBJ83H_17
## CRC_MR_SBJ84H_17
## colData names(32): study_name subject_id ... dyastolic_p systolic_p
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (906 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
As can be seen above, the first assay of the TreeSummarizedExperiment is named relative_abundance. This is usually of little consequence, but you will need to rename the assay to counts for the transformation in the next step; do so by setting the assayNames as follows.
assayNames(countryData) <-
"counts"
Next, use logNormCounts() from the scater package to create a second assay logcounts that contains log-normalized species abundances.
countryData <-
logNormCounts(countryData)
By taking a look again, you can see there are now two assays as was desired.
countryData
## class: TreeSummarizedExperiment
## dim: 906 286
## metadata(0):
## assays(2): counts logcounts
## rownames(906):
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_adolescentis
## ...
## k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_sp_D5
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_algidus
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(286): SID01.BA.VG.2 SID01.BA.V_metag ... CRC_MR_SBJ83H_17
## CRC_MR_SBJ84H_17
## colData names(33): study_name subject_id ... systolic_p sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (906 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
Before diving into analysis, you should create a quick summary table to understand the metadata better and be prepared to handle potential biases in analysis. The steps below use dplyr syntax to clean up a few of the variables before they are output in the summary table.
colData(countryData) |>
as.data.frame() |>
select(study_name, age, gender, country, sequencing_platform, curator, diet) |>
mutate(gender = to_title_case(gender)) |>
mutate(gender = fct_explicit_na(gender)) |>
mutate(curator = str_replace_all(curator, "_", " ")) |>
mutate(curator = str_replace_all(curator, ";", ", ")) |>
mutate(diet = to_title_case(diet)) |>
mutate(diet = fct_explicit_na(diet)) |>
rename_with(to_title_case) |>
tbl_summary()
| Characteristic | N = 2861 |
|---|---|
| Study Name | |
| DeFilippisF_2019 | 97 (34%) |
| FerrettiP_2018 | 20 (7.0%) |
| KeohaneDM_2020 | 117 (41%) |
| RampelliS_2015 | 11 (3.8%) |
| ThomasAM_2018a | 14 (4.9%) |
| ThomasAM_2018b | 27 (9.4%) |
| Age | 39 (31, 52) |
| Gender | |
| Female | 140 (49%) |
| Male | 126 (44%) |
| (Missing) | 20 (7.0%) |
| Country | |
| IRL | 117 (41%) |
| ITA | 169 (59%) |
| Sequencing Platform | |
| IlluminaHiSeq | 189 (66%) |
| IlluminaNextSeq | 97 (34%) |
| Curator | |
| Francesca DeFilippis, Edoardo Pasolli | 97 (34%) |
| Pamela Ferretti | 20 (7.0%) |
| Paolo Manghi | 155 (54%) |
| Paolo Manghi, Marisa Metzger | 14 (4.9%) |
| Diet | |
| Omnivore | 23 (8.0%) |
| Vegan | 36 (13%) |
| Vegetarian | 38 (13%) |
| (Missing) | 189 (66%) |
|
1
n (%); Median (IQR)
|
|
Alpha diversity estimation seeks to quantify variation within a sample; the Shannon index (H’) is probably the most widely used measure. It accounts for both richness (i.e. how many types of bacteria are present) and evenness (i.e. how equal the proportions of types is); it is assessed below using the estimateDiversity() function from the mia package. Then, the plotColData() function from the scater package is used to generate a basic ggplot2 plot, which is stylized further using the ggsignif and hrbrthemes packages.
countryData |>
estimateDiversity(abund_values = "logcounts", index = "shannon") |>
plotColData(x = "country", y = "shannon", colour_by = "country", shape_by = "country") +
geom_signif(comparisons = list(c("IRL", "ITA")), test = "t.test", map_signif_level = TRUE) +
labs(
title = "Alpha Diversity by Country, Shannon Index (H')",
subtitle = "Stool Samples of Healthy Adults",
x = "Country",
y = "Alpha Diversity (H')"
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
Figure 1: Alpha Diversity by Country
Alpha Diversity of stool samples from healthy adults, as measured by the Shannon index (H’), is significantly \((Pr(T < t) < 0.001)\) lower among Irish samples, as compared to Italian samples.
Beta diversity is a measurement of between sample variation that is usually qualitatively assessed using a low-dimensional embedding; classically this has been done in metagenomics with a principal coordinates analysis (PCoA) of Bray-Curtis dissimilarity.1 Bray-Curtis dissimilarity is not a metric distance and does not satisfy the triangle inequality, but it is generally accepted in metagenomics. This is done below using the runMDS() function from the scater package with the vegdist function from the vegan package. The plotReducedDim() function from the scater package is used to generate a basic ggplot2 plot, which is styled further.
countryData |>
runMDS(FUN = vegdist, method = "bray", exprs_values = "logcounts", name = "BrayCurtis") |>
plotReducedDim("BrayCurtis", colour_by = "country", shape_by = "country", text_by = "country") +
labs(
title = "Beta Diversity by Country, Bray-Curtis PCoA",
subtitle = "Stool Samples of Healthy Adults",
x = "PCo 1",
y = "PCo 2"
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
Figure 2: Beta Diversity by Country, Bray-Curtis PCoA
The first two principal coordinates demonstrate good seperation of Irish and Italian stool samples from healthy adults, suggesting differences in gut microbial composition between the two populations.
To address the issue of using principal coordinates analysis (PCoA) with a dissimilarity, beta diversity can be assessed using the UMAP (Uniform Manifold Approximation and Projection) algorithm instead.2 McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [stat.ML] (2018) The runUMAP() function from the scater package is used below; otherwise, the syntax is largely the same as above.
countryData |>
runUMAP(exprs_values = "logcounts", name = "UMAP") |>
plotReducedDim("UMAP", colour_by = "country", shape_by = "country", text_by = "country") +
labs(
title = "Beta Diversity by Country, UMAP Embedding",
subtitle = "Stool Samples of Healthy Adults",
x = "UMAP 1",
y = "UMAP 2"
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
Figure 3: Beta Diversity by Country, UMAP Embedding
The two-dimensional UMAP embedding demonstrates good seperation of Irish and Italian stool samples from healthy adults, suggesting differences in gut microbial composition between the two populations.
Where beta diversity embeddings suggest there are features that separate Irish and Italian stool samples from healthy adults, you might like to know which are most differentially abundance. To assess this the ANCOMBC package has a metagenomics-specific additive log-ratio model for the task. However, the ancombc() function requires a phyloseq class objects – one can be created using the makePhyloseqFromTreeSummarizedExperiment() function from the mia package, as shown below.
ancombcResults <-
makePhyloseqFromTreeSummarizedExperiment(countryData) |>
ancombc("country")
The results of the ANCOMBC model are in a strange list structure and have to be coerced into a data.frame before they can be displayed; the bind_cols() function from the dplyr package is used below.
ancombcTable <-
bind_cols(ancombcResults[["res"]])
Yet, the column names of the results table are missing and have to be assigned, as shown below.
colnames(ancombcTable) <-
names(ancombcResults[["res"]])
The row names of the results table are big long strings of microbial taxonomies that will need some editing if they are to be displayed nicely. the rownames_to_column() function from the tibble package is used below to turn them into a column so they can be edited.
ancombcTable <-
rownames_to_column(ancombcTable)
Before the row names are split into 7 pieces, the names of columns that each piece will be assigned to are created below.
rankNames <-
c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
The row names of the results table are then transformed using tidyr, dplyr, and stringr, as shown below.
ancombcTable[["rowname"]] <-
separate(ancombcTable, rowname, rankNames, sep = "\\|") |>
select(all_of(rankNames)) |>
mutate(across(.fns = ~ str_remove_all(.x, ".__"))) |>
mutate(across(.fns = ~ str_replace_all(.x, "_", " "))) |>
mutate(label = Species) |>
pull(label)
Once the results table is in good shape, it can be filtered to include only bacterial species that exhibited large (e.g. abs(beta) > 1) and significant (-log10(q_val) > 5) differences in abundances between the two countries. In the example below, the table is sorted by effect size and a number of formatting conventions are applied before displaying the results table.
filter(ancombcTable, abs(beta) > 1) |>
filter(-log10(q_val) > 5) |>
select(rowname, beta, se, p_val, q_val) |>
arrange(-abs(beta)) |>
column_to_rownames() |>
mutate(across(.fns = ~ round(.x, digits = 3))) |>
mutate(beta = format(beta)) |>
mutate(beta = str_replace(beta, " ", " ")) |>
mutate(p_val = if_else(p_val == 0, "< 0.001", format(p_val, nsmall = 3))) |>
mutate(q_val = if_else(q_val == 0, "< 0.001", format(q_val, nsmall = 3))) |>
kable(col.names = c("β", "SE", "P", "Q"), align = "cccc", escape = FALSE)
| β | SE | P | Q | |
|---|---|---|---|---|
| Catenibacterium mitsuokai | -8.689 | 0.618 | < 0.001 | < 0.001 |
| Holdemanella biformis | -8.007 | 0.575 | < 0.001 | < 0.001 |
| Eubacterium sp CAG 180 | -7.361 | 0.687 | < 0.001 | < 0.001 |
| Bacteroides dorei | Â 6.364 | 0.551 | < 0.001 | < 0.001 |
| Lactobacillus ruminis | -6.029 | 0.562 | < 0.001 | < 0.001 |
| Phascolarctobacterium succinatutens | -5.947 | 0.706 | < 0.001 | < 0.001 |
| Alistipes putredinis | Â 5.733 | 0.622 | < 0.001 | < 0.001 |
| Ruminococcus torques | -5.591 | 0.442 | < 0.001 | < 0.001 |
| Eubacterium hallii | -5.245 | 0.437 | < 0.001 | < 0.001 |
| Bifidobacterium angulatum | -5.185 | 0.550 | < 0.001 | < 0.001 |
| Prevotella sp CAG 279 | -5.093 | 0.626 | < 0.001 | < 0.001 |
| Lawsonibacter asaccharolyticus | Â 4.976 | 0.408 | < 0.001 | < 0.001 |
| Dorea formicigenerans | -4.855 | 0.366 | < 0.001 | < 0.001 |
| Prevotella sp 885 | -4.646 | 0.588 | < 0.001 | < 0.001 |
| Methanobrevibacter smithii | -4.514 | 0.642 | < 0.001 | < 0.001 |
| Coprococcus comes | -4.476 | 0.417 | < 0.001 | < 0.001 |
| Slackia isoflavoniconvertens | -4.356 | 0.581 | < 0.001 | < 0.001 |
| Roseburia intestinalis | -4.332 | 0.559 | < 0.001 | < 0.001 |
| Oscillibacter sp CAG 241 | -4.328 | 0.540 | < 0.001 | < 0.001 |
| Ruminococcus callidus | -4.294 | 0.517 | < 0.001 | < 0.001 |
| Eubacterium ramulus | -4.224 | 0.489 | < 0.001 | < 0.001 |
| Coprococcus catus | -4.206 | 0.421 | < 0.001 | < 0.001 |
| Intestinibacter bartlettii | -4.007 | 0.534 | < 0.001 | < 0.001 |
| Alistipes finegoldii | Â 3.849 | 0.544 | < 0.001 | < 0.001 |
| Eubacterium sp CAG 251 | -3.826 | 0.652 | < 0.001 | < 0.001 |
| Dorea longicatena | -3.811 | 0.327 | < 0.001 | < 0.001 |
| Prevotella sp AM42 24 | -3.567 | 0.560 | < 0.001 | < 0.001 |
| Roseburia faecis | -3.483 | 0.491 | < 0.001 | < 0.001 |
| Firmicutes bacterium CAG 170 | -3.393 | 0.577 | < 0.001 | < 0.001 |
| Bacteroides plebeius | Â 3.353 | 0.522 | < 0.001 | < 0.001 |
| Alistipes shahii | Â 3.259 | 0.532 | < 0.001 | < 0.001 |
| Bacteroides eggerthii | Â 3.212 | 0.477 | < 0.001 | < 0.001 |
| Butyricimonas virosa | Â 3.171 | 0.457 | < 0.001 | < 0.001 |
| Parasutterella excrementihominis | Â 3.167 | 0.477 | < 0.001 | < 0.001 |
| Parabacteroides distasonis | Â 3.114 | 0.529 | < 0.001 | < 0.001 |
| Alistipes indistinctus | Â 3.074 | 0.516 | < 0.001 | < 0.001 |
| Eubacterium rectale | -3.044 | 0.410 | < 0.001 | < 0.001 |
| Blautia wexlerae | -3.029 | 0.428 | < 0.001 | < 0.001 |
| Clostridium disporicum | -2.965 | 0.441 | < 0.001 | < 0.001 |
| Anaerostipes hadrus | -2.797 | 0.457 | < 0.001 | < 0.001 |
| Bacteroides sp CAG 530 | -2.790 | 0.411 | < 0.001 | < 0.001 |
| Blautia obeum | -2.751 | 0.405 | < 0.001 | < 0.001 |
| Intestinimonas butyriciproducens | Â 2.642 | 0.465 | < 0.001 | < 0.001 |
| Roseburia inulinivorans | -2.535 | 0.365 | < 0.001 | < 0.001 |
| Prevotella stercorea | -2.525 | 0.457 | < 0.001 | < 0.001 |
| Collinsella aerofaciens | -2.333 | 0.322 | < 0.001 | < 0.001 |
| Butyricimonas synergistica | Â 2.311 | 0.411 | < 0.001 | < 0.001 |
| Fusicatenibacter saccharivorans | -1.539 | 0.274 | < 0.001 | < 0.001 |
While the table above is somewhat interesting, the results are better summarized as a volcano plot (i.e. statistical significance versus fold change) and one can be made using the results table. To shorten labels for readability, stringr is first used to abbreviate species names by replacing the first word of the name with a single letter followed by a period. Next, the same filtering of the table as above is undertaken, and the color scheme used in all the plots above is applied. Both labeling and coloring are only applied where effect size and significance thresholds are met, as denoted by the dotted lines.
ancombcTable |>
mutate(rowname = str_replace(rowname, "^([A-Z])([a-z])+ ", "\\1. ")) |>
mutate(q_val = -log10(q_val)) |>
mutate(label = if_else(abs(beta) > 1, rowname, NA_character_)) |>
mutate(label = if_else(q_val > 5, label, NA_character_)) |>
mutate(color = if_else(beta > 0, "#FF9E4A", "#729ECE")) |>
mutate(color = if_else(is.na(label), "#000000", color)) |>
ggplot(mapping = aes(beta, q_val, color = I(color), label = label, shape = I(1))) +
geom_point() +
geom_hline(linetype = "dotted", yintercept = 5) +
geom_vline(linetype = "dotted", xintercept = 1) +
geom_vline(linetype = "dotted", xintercept = -1) +
geom_label_repel(min.segment.length = 0, force = 10, max.overlaps = 20, na.rm = TRUE) +
labs(
title = "Significance vs. Effect Size, ANCOM-BC",
subtitle = "Stool Samples of Healthy Adults",
x = expression(beta),
y = expression(-~log[10]~Q)
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
Figure 4: Volcano Plot of Differentially Abundance Bacterial Species
In the model and the figure, Irish samples are the reference group such that bacterial species in blue at the left are significantly more abundant in Irish samples and those in yellow at the right are significantly more abundant in Italian samples.
As a bonus, do a Google or a PubMed search for Holdemanella biformis to see what condition(s) it is associated with and then explore differences between the two countries using GBD Compare visualizations.