Taxon-level analysis lets us identify microbial candidates that differ between sample groups. After obtaining an ASV (e.g. ACTT…GTTAA), ASVs are assigned taxonomic labels (e.g. Kingdom, Phylum, Class, Order, Family, Genus.. ) by determining the most likely origin of the sequence using a database of known sequence-to-taxon mappings.
We can use our ASV matrix and sample groupings to ask if there exist particular microbial taxa that differ between samples groups (e.g. case vs control) or along a continuous covariate of interest (e.g. soil pH). In the discrete case, we can test if taxon abundance differs between different groups. In the continuous case, we can model taxon abundance as a function of the continuous variable of interest to test for a statistically significant association.
First before we start on our objectives identified above, we have to set up a work environment.
You will be working with two datasets today. The first one is the bees dataset that you have already seen throughout the lab. The bees data only has discrete variables (bee type and hive), and so we will consider a different dataset which has discrete and continuous variables (for objective #4) as well as is a little larger than the bees dataset (which will help us explore efficiency gains better in objective #5).
The ‘pcap’ study goals were to identify how parasitic exposure / success is linked to the gut microbiome. Zebrafish were exposed to a parasite (Pseudocapillaria tomentosa - or ‘pcap’ for short), and the gut microbiome was assessed post exposure. Each fish sample then is associated with both a discrete variable (i.e. if the fish was exposed or unexposed) and a continuous variable (i.e. the worm burden). If you want to read more about that paper see Gaulke et al 2019. 10.1186/s40168-019-0622-9.
Lets set up our work space first before we start the lab. This first
section loads packages and points to the input files in the data
directory. Only the working directory (setwd) and/or the
inDir variable may need to be modified for your given
specifics.
setwd(".") # may need to change for your own data later
inDir <- "Data" # may need to change for your own data later
inDir
## [1] "Data"
Below there are some convenience functions that make extracting data
from phyloseq objects as data.tables a bit
easier. This section shouldn’t need to be modified.
# Returns metadata associated with a phyloseq object in a human readable format.
sample.data.table <- function(ps) {
as(sample_data(ps), "data.frame") %>%
as.data.table(keep.rownames = "Sample") %>%
setkey(Sample) %>%
return()
}
# Returns ASV table associated with a phyloseq object in a human readable format.
asv.data.table <- function(ps) {
if (taxa_are_rows(ps)) {
otu.mat <- t(as(otu_table(ps), "matrix"))
} else {
otu.mat <- as(otu_table(ps), "matrix")
}
return(as.data.table(otu.mat, keep.rownames = "Sample") %>% setkey(Sample))
}
# A function for printing things
cat.n <- function(x) cat(x, sep = "\n")
We import the data we’ll be using for the rest of the practical, and put it into a format that makes working with it easier.
# Load in dataset one - bees
# This is the dataset we have been working with throughout the lab.
bees.ps <- readRDS(file.path(inDir, "PS.rds"))
bees.data <- sample.data.table(bees.ps)
bees.asvs <- asv.data.table(bees.ps)
bees.asv.ids <- taxa_names(bees.ps)
bees.dt <- bees.data[bees.asvs]
# Take a look at our dataset to understand structure.
## 42 Bee Samples
## 72 different taxa
bees.ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 73 taxa and 42 samples ]
## sample_data() Sample Data: [ 42 samples by 2 sample variables ]
## tax_table() Taxonomy Table: [ 73 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 73 tips and 72 internal nodes ]
# The covariates of interest for this data available include bee type - remember
# there are three different bee types
table(bees.dt$bee_type)
##
## F H W
## 14 14 14
# Load in dataset two - pcap
# This has more samples and taxa for our lab today than the bees data.
pcap.ps <- readRDS(file.path(inDir, "pcap.rds"))
pcap.data <- sample.data.table(pcap.ps)
pcap.asvs <- asv.data.table(pcap.ps)
pcap.genera.ids <- taxa_names(pcap.ps)
pcap.dt <- pcap.data[pcap.asvs]
# Take a look at Dataset #2 (pcap dataset) to understand its structure.
## 237 Fish Samples
## 566 different taxa
## A larger dataset from the bees data
pcap.ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 566 taxa and 237 samples ]
## sample_data() Sample Data: [ 237 samples by 18 sample variables ]
## The covariates of interest for this data will be fish that were exposed
## to a parasite ("Exposed") and fish that have not had the exposure ("Unexposed")
## This is simpler to consider since there are just two groups of interest
## compared to the bees dataset.
table(pcap.dt$Exposure) # discrete covariate we will use later
##
## Exposed Unexposed
## 118 119
hist(pcap.dt$Worms.histo) # continuous covariate we will use later
Let’s consider differential abundance testing across the pcap dataset first since it represents the most simple case of considering difference across two different sample groupings (“exposed” vs. “unexposed”).
Second, we will look at the bees dataset because it has more than two discrete sample groupings.
For this example we will serially conduct Wilcoxon tests on the Pcap data set. So, for our dataset of interest, the question becomes, does \(taxon_{i}\) differ between exposed and unexposed individuals? For each taxa:
\(H_0\): \(taxon_{i}\) is not differentially abundant between groups.
\(H_A\): \(taxon_{i}\) is differentially abundant between groups.
The Wilcoxon test is the non-parametric (distribution-free) version of the Student’s t-test and will tell us if the abundances of each taxon differ significantly between our two groups: Unexposed vs Exposed. After running the tests, we correct for multiple tests by adjusting the p-values. There are two examples of methods for adjusting the p-values provided.
# For each taxa (in this case, we are using a genus label), perform wildcox
wilcox.tests <- lapply(pcap.genera.ids, function(genus) {
frm <- paste(genus, "~ Exposure") # this combined with the as.formula() function below makes it possible to generate formulas in R programmatically
return(
wilcox.test(as.formula(frm), data = pcap.dt)
)
})
# Return p values for each of the tests above.
pcap.raw.pvals <- sapply(wilcox.tests, function(mod) {return(mod$p.value)})
# Correction with method #1 - Bonferroni
pcap.bon.pvals <- p.adjust(pcap.raw.pvals, method = "bonferroni")
# Printing the number of associations with adjusted p-values using bonferroni
# Method for correction:
paste("BON: Significant associations:", length(pcap.bon.pvals[pcap.bon.pvals <= 0.05])) %>%
cat.n()
## BON: Significant associations: 17
# Printing the number of associations with adjusted p-values using FDR
# Method for correction:
pcap.fdr.pvals <- p.adjust(pcap.raw.pvals, method = "fdr")
paste("FDR: Significant associations:", length(pcap.fdr.pvals[pcap.fdr.pvals <= 0.05])) %>%
cat.n()
## FDR: Significant associations: 76
Which correction for multiple hypothesis testing was more conservative and why might you use one vs the other? Here is another way to visualize the p values for correction of bonferroni method vs. fdr.
pvals = as_tibble(as.data.frame(cbind("FDR" = pcap.fdr.pvals, "BON" = pcap.bon.pvals))) %>%
mutate(bon_sig = case_when(BON < 0.05 ~ "p < 0.05",
TRUE ~ "p > 0.05")) %>%
mutate(fdr_sig = case_when(FDR < 0.05 ~ "p < 0.05",
TRUE ~ "p > 0.05"))
ggplot(pvals) +
geom_point(aes(x = BON, y = FDR, color = bon_sig, shape = fdr_sig))
Now that we have looked at the most simple case of considering two different groups, we can do the same thing to ask if taxon abundance differs across multiple groups with a couple of modifications. We will use the bees dataset for this
We will conduct serial Kruskal-Wallis tests
(kruskal.test) to determine if there are differences in
abundances for each taxon between the three Bee Types. This is the
non-parametric version of ANOVA and we’re using it here instead of the
Wilcoxon test because there are more than two groups (F, H, and W).
After running the tests, we will correct for multiple tests using
p.adjust.
kw.tests <- lapply(bees.asv.ids, function(asv) {
frm <- paste(asv, "~ bee_type")
return(
kruskal.test(as.formula(frm), data = bees.dt)
)
})
bees.raw.pvals <- sapply(kw.tests, function(mod) {return(mod$p.value)})
bees.bon.pvals <- p.adjust(bees.raw.pvals, method = "bonferroni")
paste(
"Significant associations:",
length(bees.bon.pvals[bees.bon.pvals <= 0.05])
) %>%
cat.n()
## Significant associations: 27
bees.fdr.pvals <- p.adjust(bees.raw.pvals, method = "fdr")
paste(
"Significant associations:",
length(bees.fdr.pvals[bees.fdr.pvals <= 0.05])
) %>%
cat.n()
## Significant associations: 43
The following chunk takes the raw data from the Pcap data set and plots the abundances of taxa (by Exposure) for only those taxa whose abundance was significantly different between Unexposed and Exposed individuals according to the Wilcoxon tests (with Bonferroni-adjusted p-values).
# "melting" an array takes a number of columns and reduces them to two columns:
# one with the labels and another with the value
# this can be useful for a number of reasons, including taking advantage of
# ggplot2's facet_wrap() and facet_grid() functions to display multiple plots
pcap.melt.dt <- melt(
pcap.dt,
measure.vars = pcap.genera.ids,
variable.name = "Genus",
value.name = "Abundance"
)
sig.genera <- pcap.genera.ids[pcap.bon.pvals <= 0.05]
pcap.plot.dt <- pcap.melt.dt[Genus %in% sig.genera]
pcap.plot.dt[, Exposure := factor(Exposure, levels = c("Unexposed", "Exposed"))]
# the default is for levels to be in alphanumeric order, this way the control
# group, Unexposed, will be first along the x-axes
pcap.plot <- ggplot(pcap.plot.dt, aes(x = Exposure, y = Abundance)) +
geom_beeswarm(alpha = 0.6, cex = .3, size = 0.3) +
stat_summary(
fun.data = "mean_cl_boot", # This summary statistical calculates the 95% C.I. of the mean via non-parametric methods
color = "red",
geom = "errorbar",
width = 0.8
) +
facet_wrap(~ Genus, scales = "free_y") +
theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
theme(strip.text = element_text(size = 7))
pcap.plot
Below, you will make a similar visualization as in the example, but with the raw data from the bees data set and plots the abundances of taxa (by bee type) for only those taxa whose abundance was significantly different for at least one of the three groups, according to the Kruskal-Wallis tests (with adjusted p-values).
bees.melt.dt <- melt(
bees.dt,
measure.vars = bees.asv.ids,
variable.name = "ASV",
value.name = "Abundance"
)
head(bees.melt.dt)
## Sample bee_type hive ASV Abundance
## <char> <char> <char> <fctr> <int>
## 1: A1 F DM ASV11 490
## 2: A2 H DM ASV11 124
## 3: A3 W DM ASV11 0
## 4: B1 F GC ASV11 799
## 5: B2 H GC ASV11 142
## 6: B3 W GC ASV11 0
sig.asvs <- bees.asv.ids[bees.bon.pvals <= 0.05] # or bees.fdr.pvals, Their choice
bees.plot.dt <- bees.melt.dt[ASV %in% sig.asvs]
ggplot(bees.plot.dt, aes(x = bee_type, y = Abundance)) +
geom_beeswarm(alpha = 0.6, size = 0.3, cex = 0.3) +
stat_summary(
fun.data = "mean_cl_boot",
color = "red",
geom = "errorbar",
width = 0.8
) +
facet_wrap(~ ASV, scales = "free_y") +
theme(strip.text = element_text(size = 10))
Rather than performing non-parametric tests, we can also attempt to model the underlying data distribution.
Historically, the poisson distribution was used to model counts to test for differential expression in count data. However, in a Poisson distribution, the mean and variance are equal. Consideration of biological count data has lead to the understanding that the Poisson distribution is too restrictive, and that it predicts smaller variance than what is seen commonly in biological data. Therefore, the resulting test does not control type-I error well (the probability of false discoveries).
To address this multiple methods have modeled count data as negative binomial, which allows for more flexible models in that mean and variance do not have to be equal. Methods such as edgeR and DESeq have been developed to allow for testing of differential count data. Read more about DeSeq here: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106 or here: https://chipster.csc.fi/manual/deseq2.html
DESeq is a method to infer differential gene expression based on a distribution which is more appropriate (but still with some caveats) for microbial count data - a negative binomial distribution, with variance and mean linked by local regression.
Below, we will run through an example of using DESeq2
with the Pcap data set. First we will convert our Pcap
phyloseq object into a DESeqDataSet, which is
a special type of object used to store the input values, intermediate
calculations, and results of an analysis of differential expression with
DESeq.
After we have our input in the correct format, we’ll run the
DESeq function, which takes an input table of raw counts
and then produces a table of significantly differentially expressed
ASVs. The steps that DESeq performs are:
Estimates size factors for each gene across all samples. Corrects for composition bias of different read depths.
Obtains dispersion estimates for each ASV through a model fit procedure.
Fits a negative binomial generalized linear regression model for each ASV and then uses Wald test for significance testing.
*Automatically detects outliers and removes these. Removes genes with low counts to improve multiple testing adjustment less harsh.
At the end of the chunk, we simply print the results
table from running DESeq.
Note: The %>% operator used below is a “pipe”
operator. It means “take the results of the function before it and send
them to the function after it”.
pcap.ps@sam_data$Exposure <- factor(
pcap.ps@sam_data$Exposure,
levels = c("Unexposed", "Exposed")
)
pcap.deseq <- phyloseq_to_deseq2(pcap.ps, ~ Exposure) # this function does all the heavy lifting for us to turn a phyloseq object into a DESeq2 object
pcap.dds <- DESeq(pcap.deseq)
pcap.res.dt <- results(pcap.dds) %>% as.data.table(keep.rownames = "Genus")
pcap.sig.dt <- pcap.res.dt[padj <= 0.05]
gt(pcap.sig.dt[order(log2FoldChange)]) %>%
fmt_number(columns = 2:ncol(pcap.sig.dt), decimals = 3)
| Genus | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| Paraclostridium | 8.783 | −5.752 | 0.694 | −8.292 | 0.000 | 0.000 |
| Plesiomonas | 1,438.125 | −4.786 | 0.304 | −15.762 | 0.000 | 0.000 |
| Exiguobacterium | 0.586 | −2.952 | 1.024 | −2.883 | 0.004 | 0.017 |
| Chitinophaga | 2.618 | −1.972 | 0.652 | −3.023 | 0.003 | 0.012 |
| Fluviicola | 4.962 | −1.829 | 0.680 | −2.690 | 0.007 | 0.029 |
| Cetobacterium | 1,778.887 | −0.866 | 0.261 | −3.318 | 0.001 | 0.006 |
| NONE | 196.546 | 0.836 | 0.185 | 4.509 | 0.000 | 0.000 |
| Pelomonas | 65.840 | 1.079 | 0.328 | 3.284 | 0.001 | 0.006 |
| Gemmobacter | 2.281 | 1.134 | 0.451 | 2.515 | 0.012 | 0.042 |
| Mycoplasma | 18.554 | 1.418 | 0.473 | 2.996 | 0.003 | 0.012 |
| Phreatobacter | 2.770 | 1.489 | 0.424 | 3.510 | 0.000 | 0.005 |
| Anaerotruncus | 3.378 | 1.720 | 0.638 | 2.695 | 0.007 | 0.029 |
| Pseudomonas | 133.346 | 1.829 | 0.274 | 6.666 | 0.000 | 0.000 |
| Fusobacterium | 5.669 | 1.960 | 0.635 | 3.085 | 0.002 | 0.011 |
| Hyphomicrobium | 1.092 | 1.976 | 0.601 | 3.288 | 0.001 | 0.006 |
| Weissella | 3.053 | 2.039 | 0.616 | 3.310 | 0.001 | 0.006 |
| Peptostreptococcus | 1.047 | 2.074 | 0.780 | 2.659 | 0.008 | 0.030 |
| Lactococcus | 1.054 | 2.280 | 0.935 | 2.439 | 0.015 | 0.050 |
| Comamonas | 1.728 | 2.313 | 0.872 | 2.653 | 0.008 | 0.030 |
| Clostridium_sensu_stricto_1 | 1.703 | 2.349 | 0.906 | 2.594 | 0.009 | 0.034 |
| Leeia | 1.681 | 2.362 | 0.695 | 3.400 | 0.001 | 0.006 |
| Ensifer | 5.369 | 2.485 | 0.397 | 6.266 | 0.000 | 0.000 |
| Ferruginibacter | 0.851 | 2.486 | 0.805 | 3.087 | 0.002 | 0.011 |
| Bacteroides | 1.600 | 2.519 | 0.762 | 3.307 | 0.001 | 0.006 |
| Lactobacillus | 2.695 | 2.547 | 0.749 | 3.400 | 0.001 | 0.006 |
| Rheinheimera | 1.499 | 2.596 | 0.844 | 3.074 | 0.002 | 0.011 |
| Ignatzschineria | 1.712 | 2.673 | 0.885 | 3.018 | 0.003 | 0.012 |
| Luteimonas | 0.983 | 2.683 | 0.740 | 3.627 | 0.000 | 0.003 |
| Undibacterium | 7.379 | 3.329 | 0.510 | 6.524 | 0.000 | 0.000 |
| Chitinimonas | 4.909 | 3.645 | 0.586 | 6.214 | 0.000 | 0.000 |
| Photobacterium | 6.744 | 3.655 | 0.715 | 5.110 | 0.000 | 0.000 |
Below, we plot the log2FoldChange (i.e. number of doublings) between the Unexposed and Exposed groups. More positive numbers indicate the Genus has a higher abundance in the Unexposed vs Exposed group, while negative numbers indicate the opposite.
ggplot(pcap.sig.dt, aes(x = reorder(Genus, log2FoldChange), y = log2FoldChange)) +
geom_col() +
coord_flip() +
labs(x = "Genus", y = "Log2 Fold Change (Unexposed to Exposed)")
Below you will run through a very similar analysis with the Bees data
set as the example above did with the Pcap data set. Notice, that just
as with the non-parametric tests, there will be some modifications to
the analysis since the number of groups you’ll be comparing is three
rather than two. Namely, there will be two tables to
results instead of one.
bees.deseq <- phyloseq_to_deseq2(bees.ps, ~ bee_type)
bees.dds <- DESeq(bees.deseq)
resultsNames(bees.dds) # the second and third names here indicate the comparisons we want to make below
## [1] "Intercept" "bee_type_H_vs_F" "bee_type_W_vs_F"
bees.HvF.res.dt <- as.data.table(
results(bees.dds, name = "bee_type_H_vs_F"),
keep.rownames = "ASV"
)
bees.HvF.sig.dt <- bees.HvF.res.dt[padj <= 0.05]
gt(bees.HvF.sig.dt[order(log2FoldChange)]) %>%
tab_header(title = "Bee type H vs F") %>%
fmt_number(columns = 2:ncol(bees.HvF.sig.dt), decimals = 3)
| Bee type H vs F | ||||||
| ASV | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| ASV58 | 5.449 | −23.054 | 3.577 | −6.445 | 0.000 | 0.000 |
| ASV50 | 40.149 | −9.571 | 2.086 | −4.588 | 0.000 | 0.000 |
| ASV63 | 16.777 | −7.479 | 1.806 | −4.141 | 0.000 | 0.000 |
| ASV48 | 60.603 | −4.753 | 1.702 | −2.793 | 0.005 | 0.016 |
| ASV25 | 186.727 | −3.522 | 1.022 | −3.446 | 0.001 | 0.002 |
| ASV14 | 536.603 | −2.910 | 1.015 | −2.868 | 0.004 | 0.013 |
| ASV16 | 536.316 | −2.558 | 1.094 | −2.339 | 0.019 | 0.049 |
| ASV13 | 537.183 | −2.517 | 0.375 | −6.712 | 0.000 | 0.000 |
| ASV42 | 91.221 | −2.225 | 0.482 | −4.616 | 0.000 | 0.000 |
| ASV12 | 742.996 | −2.091 | 0.384 | −5.450 | 0.000 | 0.000 |
| ASV11 | 892.166 | −2.025 | 0.804 | −2.520 | 0.012 | 0.034 |
| ASV6 | 1,385.724 | −1.902 | 0.232 | −8.212 | 0.000 | 0.000 |
| ASV5 | 3,030.315 | −1.090 | 0.370 | −2.950 | 0.003 | 0.010 |
| ASV2 | 7,650.634 | −1.007 | 0.213 | −4.723 | 0.000 | 0.000 |
| ASV36 | 92.213 | −0.672 | 0.194 | −3.474 | 0.001 | 0.002 |
| ASV4 | 3,346.276 | 0.701 | 0.200 | 3.508 | 0.000 | 0.002 |
| ASV22 | 232.225 | 0.762 | 0.307 | 2.477 | 0.013 | 0.037 |
| ASV7 | 902.376 | 1.601 | 0.256 | 6.264 | 0.000 | 0.000 |
| ASV31 | 104.500 | 1.668 | 0.445 | 3.746 | 0.000 | 0.001 |
| ASV23 | 224.797 | 1.855 | 0.476 | 3.897 | 0.000 | 0.000 |
| ASV51 | 28.195 | 3.452 | 1.096 | 3.148 | 0.002 | 0.006 |
| ASV35 | 88.694 | 5.262 | 1.169 | 4.503 | 0.000 | 0.000 |
| ASV66 | 10.507 | 7.125 | 1.987 | 3.586 | 0.000 | 0.001 |
| ASV52 | 25.951 | 7.153 | 1.609 | 4.447 | 0.000 | 0.000 |
| ASV65 | 10.417 | 7.154 | 2.905 | 2.463 | 0.014 | 0.037 |
| ASV41 | 70.089 | 9.905 | 0.941 | 10.531 | 0.000 | 0.000 |
| ASV40 | 9.440 | 22.508 | 3.516 | 6.401 | 0.000 | 0.000 |
bees.WvF.res.dt <- as.data.table(
results(bees.dds, name = "bee_type_W_vs_F"),
keep.rownames = "ASV"
)
bees.WvF.sig.dt <- bees.WvF.res.dt[padj <= 0.05]
gt(bees.WvF.sig.dt[order(log2FoldChange)]) %>%
tab_header(title = "Bee type W vs F") %>%
fmt_number(columns = 2:ncol(bees.WvF.sig.dt), decimals = 3)
| Bee type W vs F | ||||||
| ASV | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| ASV37 | 23.395 | −24.205 | 3.576 | −6.769 | 0.000 | 0.000 |
| ASV34 | 81.152 | −23.951 | 2.246 | −10.665 | 0.000 | 0.000 |
| ASV58 | 5.449 | −22.775 | 3.577 | −6.367 | 0.000 | 0.000 |
| ASV25 | 186.727 | −11.535 | 1.157 | −9.969 | 0.000 | 0.000 |
| ASV50 | 40.149 | −9.438 | 2.086 | −4.524 | 0.000 | 0.000 |
| ASV63 | 16.777 | −8.171 | 1.806 | −4.525 | 0.000 | 0.000 |
| ASV11 | 892.166 | −7.844 | 0.813 | −9.646 | 0.000 | 0.000 |
| ASV24 | 253.396 | −7.273 | 2.528 | −2.877 | 0.004 | 0.011 |
| ASV48 | 60.603 | −7.115 | 1.729 | −4.114 | 0.000 | 0.000 |
| ASV75 | 10.244 | −6.226 | 2.511 | −2.480 | 0.013 | 0.029 |
| ASV43 | 91.169 | −6.071 | 1.523 | −3.987 | 0.000 | 0.000 |
| ASV51 | 28.195 | −5.350 | 1.224 | −4.372 | 0.000 | 0.000 |
| ASV35 | 88.694 | −5.283 | 1.289 | −4.097 | 0.000 | 0.000 |
| ASV27 | 216.853 | −4.956 | 1.606 | −3.086 | 0.002 | 0.006 |
| ASV14 | 536.603 | −3.687 | 1.015 | −3.633 | 0.000 | 0.001 |
| ASV16 | 536.316 | −2.762 | 1.094 | −2.525 | 0.012 | 0.027 |
| ASV5 | 3,030.315 | −2.542 | 0.370 | −6.874 | 0.000 | 0.000 |
| ASV42 | 91.221 | −1.660 | 0.481 | −3.450 | 0.001 | 0.002 |
| ASV6 | 1,385.724 | −1.461 | 0.231 | −6.310 | 0.000 | 0.000 |
| ASV2 | 7,650.634 | −1.342 | 0.213 | −6.296 | 0.000 | 0.000 |
| ASV18 | 384.296 | −1.307 | 0.256 | −5.104 | 0.000 | 0.000 |
| ASV13 | 537.183 | −0.954 | 0.374 | −2.548 | 0.011 | 0.026 |
| ASV4 | 3,346.276 | 0.996 | 0.200 | 4.988 | 0.000 | 0.000 |
| ASV19 | 327.169 | 1.108 | 0.347 | 3.196 | 0.001 | 0.004 |
| ASV31 | 104.500 | 1.133 | 0.446 | 2.542 | 0.011 | 0.026 |
| ASV8 | 919.224 | 1.428 | 0.555 | 2.574 | 0.010 | 0.026 |
| ASV7 | 902.376 | 1.908 | 0.256 | 7.466 | 0.000 | 0.000 |
| ASV23 | 224.797 | 2.140 | 0.476 | 4.498 | 0.000 | 0.000 |
| ASV3 | 3,713.902 | 2.406 | 0.598 | 4.026 | 0.000 | 0.000 |
| ASV52 | 25.951 | 7.732 | 1.608 | 4.808 | 0.000 | 0.000 |
| ASV59 | 18.599 | 23.411 | 3.406 | 6.874 | 0.000 | 0.000 |
Because there are 3 groups we’re comparing, there will be two plots instead of one.
plot.HvF <- ggplot(bees.HvF.sig.dt, aes(x = reorder(ASV, log2FoldChange), y = log2FoldChange)) +
geom_col() +
coord_flip() +
labs(x = "ASV", y = "Log2 Fold Change", title = "Bee type H vs F")
plot.WvF <- ggplot(bees.WvF.sig.dt, aes(x = reorder(ASV, log2FoldChange), y = log2FoldChange)) +
geom_col() +
coord_flip() +
labs(x = "ASV", y = "Log2 Fold Change", title = "Bee type W vs F")
plot_grid(plot.HvF, plot.WvF, ncol = 1)
# Reach goal:
# Modify and combine the two DESeq2 results tables into a single data.table (see ?rbind) and plot the same graph in one call utilizing the facet_wrap() function with the ggplot()
Up until now, we have only considered determining significance between discrete groups of samples (e.g. exposed / unexposed). Let’s try modeling microbial abundance as a function of the parasitic burden (an example of a continuous covariate). You can do this using a compound poisson generalized linear model.
You will see that this implementation is a bit slow - even for this small dataset and so we will work on it below to speed it up.
cpglms.for <- list()
pcap.noNA.dt <- pcap.dt[complete.cases(pcap.dt)]
compute.time.for <- system.time(
for (genus in pcap.genera.ids) {
frm <- paste(genus, "~ Worms.histo")
mod <- try(cpglm(as.formula(frm), data = pcap.noNA.dt, link = "identity"), silent = T)
if (class(mod) == "try-error") { # if the model doesn't run successfully
mod <- NA
}
cpglms.for[[genus]] <- mod
}
)
paste("Elapsed time:", round(compute.time.for[3], 3), "seconds") %>%
cat.n()
## Elapsed time: 92.956 seconds
Lets see if we can get the loop to run faster using an
lapply. Notice the difference in the
system.time outputs. (Depending on various factors, the
lapply loop may not actually run faster for you, but
compare with others around you, and, on average, it should be a few
seconds faster or about equivalent).
compute.time.lapply <- system.time(
cpglms.lapply <- lapply(pcap.genera.ids, function(genus) {
frm <- paste(genus, "~ Worms.histo")
mod <- try(cpglm(as.formula(frm), data = pcap.noNA.dt, link = "identity"), silent = T)
if (class(mod) == "try-error") {
mod <- NA
}
return(mod)
})
)
cat(paste("Elapsed time:", round(compute.time.lapply[3], 3), "seconds"), sep = "\n")
## Elapsed time: 91.771 seconds
These data sets are rather small. With larger data sets even low-percentage reductions in compute time can make a big difference in terms of total time.
Now, lets convert the lapply loop above into a
foreach parallel loop using 2 cores to see if that runs
faster. Note the reduction in compute time printed by
system.time.
?foreach
nCores <- 2
cl <- makeCluster(nCores, type = "FORK")
registerDoParallel(cl, cores = nCores)
compute.time.par <- system.time(
cpglms.par <- foreach(
genus = pcap.genera.ids
) %dopar% {
frm <- paste(genus, "~ Worms.histo")
mod <- try(cpglm(as.formula(frm), data = pcap.noNA.dt, link = "identity"), silent = T)
if (class(mod) == "try-error") {
mod <- NA
}
return(mod)
}
)
stopCluster(cl)
paste("Elapsed time:", round(compute.time.par[3], 3), "seconds") %>%
cat.n()
## Elapsed time: 63.074 seconds
Remember since everyone is doing the lab at once, that we cannot use a large number of cores for this test. However, if you have access to a cluster with 100s of cores, this can speed up your analysis quite a bit. On my machine this run didn’t quite reduce the compute time by half, but was close.
Now that we made this a bit faster, we will now continue on our example of modeling taxon abundance as a function of a continuous variable.
As above, we will get all the p-values from the CPGLM models and adjust them for multiple tests. Additionally, since we’re already looping through all the models, we’ll also grab other values that will help us plot the results, i.e., estimated coefficients for intercepts and slopes.
cpglms.dt <- lapply(1:length(cpglms.lapply), function(i) {
mod <- cpglms.lapply[[i]]
if (class(mod)[1] == "cpglm") { # only continue if we have a model here instead of an NA
genus <- pcap.genera.ids[[i]]
sink("tmp.txt") # the summary function for the cplm-class objects prints default every model when saving them to a variable. This sink() function prevents writing all of them to the console and instead to the tmp.txt file
mod.sum <- summary(mod)
sink() # Ends the writing of output to the tmp.txt file
return(
data.table(
Genus = genus,
Intercept = mod.sum$coefficients[1, 1],
Slope = mod.sum$coefficients[2, 1],
P.val = mod.sum$coefficients[2, 4]
)
)
}
}) %>% rbindlist() # combine the list of data.tables into a single data.table
file.remove("tmp.txt") # just to keep things tidy
## [1] TRUE
# below is the data.table syntax for creating a new column based on another column
cpglms.dt[, Pval.adj := p.adjust(P.val, method = "bonferroni")] # or another correction of your choice
cpglms.sig.dt <- cpglms.dt[Pval.adj <= 0.05]
cpglms.sig.dt
## Genus Intercept Slope P.val Pval.adj
## <char> <num> <num> <num> <num>
## 1: Acinetobacter 1.057274e+02 -2.74894800 1.621950e-05 6.698654e-03
## 2: Hyphomicrobium 2.254789e-01 0.51633826 2.294996e-05 9.478334e-03
## 3: Filimonas 5.749191e-12 0.06467901 7.960534e-05 3.287700e-02
## 4: Plesiomonas 7.781889e+02 -21.19145174 3.976450e-12 1.642274e-09
## 5: Dechloromonas 5.202028e-01 -0.01445008 6.406180e-05 2.645752e-02
## 6: Sphingomonas 1.640056e-01 0.32551617 3.761001e-05 1.553294e-02
## 7: Sutterella 9.052547e-11 0.06403765 2.756049e-05 1.138248e-02
## 8: Microvirga 1.709563e-10 0.01967196 8.393415e-05 3.466481e-02
## 9: Defluviimonas 2.651962e+00 -0.07366561 3.698659e-05 1.527546e-02
## 10: Devosia 2.159936e-11 0.12663095 1.177475e-06 4.862972e-04
Now, we’ll plot the significant associations between taxon abundances
and worm counts (Worms.histo) as scatter plots. Positive
slopes indicate that an increase in worm counts is associated with a
greater abundance of a taxon, while negative slopes indicate that an
increase in worm counts is associated with a lesser abundance of a
taxon.
This will be a case of using two different data arrays within a
single ggplot call.
cpglms.plot.dt <- pcap.melt.dt[Genus %in% cpglms.sig.dt$Genus] # grabbing pcap.melt.dt from above because it's already in the format that we need for easy plotting in ggplot. We're just subsetting it here so it only contains the genera that had significant associations with worm counts.
cpglms.plot.dt <- cpglms.plot.dt[complete.cases(cpglms.plot.dt)]
ggplot(cpglms.plot.dt, aes(x = Worms.histo, y = Abundance)) + # we're using a log tranformation of the Abudance counts because that's what CPGLM does by default. You can change this in the cpglm()
geom_point() + # makes it a scatter plot
geom_abline(
data = cpglms.sig.dt, # Notice we're using a different data.table here than that above that contains the intercepts and slopes. Also note that for facet_wrap() to work below, the same facet variable has to be present in this second data.table, too
aes(intercept = Intercept, slope = Slope),
color = "red"
) +
facet_wrap(~ Genus, scales = "free_y")
While these associations have been deemed significant by the statistical test, how many are you convinced are strong, likely real, associations?