Objectives

Overview

Taxon-level analysis lets us identify microbial candidates that differ between sample groups. After obtaining an ASV (e.g. ACTT…GTTAA), ASVs are assigned taxonomic labels (e.g. Kingdom, Phylum, Class, Order, Family, Genus.. ) by determining the most likely origin of the sequence using a database of known sequence-to-taxon mappings.

We can use our ASV matrix and sample groupings to ask if there exist particular microbial taxa that differ between samples groups (e.g. case vs control) or along a continuous covariate of interest (e.g. soil pH). In the discrete case, we can test if taxon abundance differs between different groups. In the continuous case, we can model taxon abundance as a function of the continuous variable of interest to test for a statistically significant association.

0 - Set Up

First before we start on our objectives identified above, we have to set up a work environment.

Datasets

You will be working with two datasets today. The first one is the bees dataset that you have already seen throughout the lab. The bees data only has discrete variables (bee type and hive), and so we will consider a different dataset which has discrete and continuous variables (for objective #4) as well as is a little larger than the bees dataset (which will help us explore efficiency gains better in objective #5).

The ‘pcap’ study goals were to identify how parasitic exposure / success is linked to the gut microbiome. Zebrafish were exposed to a parasite (Pseudocapillaria tomentosa - or ‘pcap’ for short), and the gut microbiome was assessed post exposure. Each fish sample then is associated with both a discrete variable (i.e. if the fish was exposed or unexposed) and a continuous variable (i.e. the worm burden). If you want to read more about that paper see Gaulke et al 2019. 10.1186/s40168-019-0622-9.

Prepare working enviroment

Lets set up our work space first before we start the lab. This first section loads packages and points to the input files in the data directory. Only the working directory (setwd) and/or the inDir variable may need to be modified for your given specifics.

setwd(".") # may need to change for your own data later
inDir <- "Data" # may need to change for your own data later
inDir 
## [1] "Data"

Below there are some convenience functions that make extracting data from phyloseq objects as data.tables a bit easier. This section shouldn’t need to be modified.

# Returns metadata associated with a phyloseq object in a human readable format.
sample.data.table <- function(ps) {
  as(sample_data(ps), "data.frame") %>%
    as.data.table(keep.rownames = "Sample") %>% 
    setkey(Sample) %>%
    return()
}

# Returns ASV table associated with a phyloseq object in a human readable format.
asv.data.table <- function(ps) {
  if (taxa_are_rows(ps)) {
    otu.mat <- t(as(otu_table(ps), "matrix"))
  } else {
    otu.mat <- as(otu_table(ps), "matrix")
  }
  return(as.data.table(otu.mat, keep.rownames = "Sample") %>% setkey(Sample))
}

# A function for printing things 
cat.n <- function(x) cat(x, sep = "\n")

We import the data we’ll be using for the rest of the practical, and put it into a format that makes working with it easier.

# Load in dataset one - bees
# This is the dataset we have been working with throughout the lab.
bees.ps <- readRDS(file.path(inDir, "PS.rds"))
bees.data <- sample.data.table(bees.ps)
bees.asvs <- asv.data.table(bees.ps)
bees.asv.ids <- taxa_names(bees.ps)
bees.dt <- bees.data[bees.asvs]

# Take a look at our dataset to understand structure. 
## 42 Bee Samples
## 72 different taxa
bees.ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 73 taxa and 42 samples ]
## sample_data() Sample Data:       [ 42 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 73 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 73 tips and 72 internal nodes ]
# The covariates of interest for this data available include bee type - remember 
# there are three different bee types
table(bees.dt$bee_type)
## 
##  F  H  W 
## 14 14 14
# Load in dataset two - pcap
# This has more samples and taxa for our lab today than the bees data. 
pcap.ps <- readRDS(file.path(inDir, "pcap.rds"))
pcap.data <- sample.data.table(pcap.ps)
pcap.asvs <- asv.data.table(pcap.ps)
pcap.genera.ids <- taxa_names(pcap.ps)
pcap.dt <- pcap.data[pcap.asvs]

# Take a look at Dataset #2 (pcap dataset) to understand its structure. 
## 237 Fish Samples
## 566 different taxa
## A larger dataset from the bees data 
pcap.ps 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 566 taxa and 237 samples ]
## sample_data() Sample Data:       [ 237 samples by 18 sample variables ]
## The covariates of interest for this data will be fish that were exposed
## to a parasite ("Exposed") and fish that have not had the exposure ("Unexposed")
## This is simpler to consider since there are just two groups of interest 
## compared to the bees dataset. 
table(pcap.dt$Exposure) # discrete covariate we will use later
## 
##   Exposed Unexposed 
##       118       119
hist(pcap.dt$Worms.histo) # continuous covariate we will use later

Objective 1: Perform Differential Taxon Abundance Testing Across (Two) Groups

Let’s consider differential abundance testing across the pcap dataset first since it represents the most simple case of considering difference across two different sample groupings (“exposed” vs. “unexposed”).

Second, we will look at the bees dataset because it has more than two discrete sample groupings.

1A Question: Does taxon abundance significantly differ between (two) groups?

For this example we will serially conduct Wilcoxon tests on the Pcap data set. So, for our dataset of interest, the question becomes, does \(taxon_{i}\) differ between exposed and unexposed individuals? For each taxa:

\(H_0\): \(taxon_{i}\) is not differentially abundant between groups.

\(H_A\): \(taxon_{i}\) is differentially abundant between groups.

Method

The Wilcoxon test is the non-parametric (distribution-free) version of the Student’s t-test and will tell us if the abundances of each taxon differ significantly between our two groups: Unexposed vs Exposed. After running the tests, we correct for multiple tests by adjusting the p-values. There are two examples of methods for adjusting the p-values provided.

# For each taxa (in this case, we are using a genus label), perform wildcox
wilcox.tests <- lapply(pcap.genera.ids, function(genus) {
  frm <- paste(genus, "~ Exposure") # this combined with the as.formula() function below makes it possible to generate formulas in R programmatically
  return(
    wilcox.test(as.formula(frm), data = pcap.dt)
  )
})

# Return p values for each of the tests above.
pcap.raw.pvals <- sapply(wilcox.tests, function(mod) {return(mod$p.value)})

# Correction with method #1 - Bonferroni
pcap.bon.pvals <- p.adjust(pcap.raw.pvals, method = "bonferroni")

# Printing the number of associations with adjusted p-values using bonferroni 
# Method for correction:
paste("BON: Significant associations:", length(pcap.bon.pvals[pcap.bon.pvals <= 0.05])) %>%
  cat.n()
## BON: Significant associations: 17
# Printing the number of associations with adjusted p-values using FDR 
# Method for correction:
pcap.fdr.pvals <- p.adjust(pcap.raw.pvals, method = "fdr")
paste("FDR: Significant associations:", length(pcap.fdr.pvals[pcap.fdr.pvals <= 0.05])) %>%
  cat.n()
## FDR: Significant associations: 76

Question:

Which correction for multiple hypothesis testing was more conservative and why might you use one vs the other? Here is another way to visualize the p values for correction of bonferroni method vs. fdr.

pvals = as_tibble(as.data.frame(cbind("FDR" = pcap.fdr.pvals, "BON" = pcap.bon.pvals))) %>%
  mutate(bon_sig = case_when(BON < 0.05 ~ "p < 0.05",
                             TRUE ~ "p > 0.05")) %>%
  mutate(fdr_sig = case_when(FDR < 0.05 ~ "p < 0.05",
                             TRUE ~ "p > 0.05"))
ggplot(pvals) +
  geom_point(aes(x = BON, y = FDR, color = bon_sig, shape = fdr_sig))

1B Question: Does taxon abundance significantly differ between (multiple) groups?

Now that we have looked at the most simple case of considering two different groups, we can do the same thing to ask if taxon abundance differs across multiple groups with a couple of modifications. We will use the bees dataset for this

Method

We will conduct serial Kruskal-Wallis tests (kruskal.test) to determine if there are differences in abundances for each taxon between the three Bee Types. This is the non-parametric version of ANOVA and we’re using it here instead of the Wilcoxon test because there are more than two groups (F, H, and W). After running the tests, we will correct for multiple tests using p.adjust.

kw.tests <- lapply(bees.asv.ids, function(asv) {
  frm <- paste(asv, "~ bee_type")
  return(
    kruskal.test(as.formula(frm), data = bees.dt)
  )
})
bees.raw.pvals <- sapply(kw.tests, function(mod) {return(mod$p.value)})

bees.bon.pvals <- p.adjust(bees.raw.pvals, method = "bonferroni")
paste(
  "Significant associations:", 
  length(bees.bon.pvals[bees.bon.pvals <= 0.05])
) %>% 
  cat.n()
## Significant associations: 27
bees.fdr.pvals <- p.adjust(bees.raw.pvals, method = "fdr")
paste(
  "Significant associations:", 
  length(bees.fdr.pvals[bees.fdr.pvals <= 0.05])
) %>% 
  cat.n()
## Significant associations: 43

Objective 2: Visualize results of differential taxon abundance testing

Visualize Differential Abundance 2-sample test

The following chunk takes the raw data from the Pcap data set and plots the abundances of taxa (by Exposure) for only those taxa whose abundance was significantly different between Unexposed and Exposed individuals according to the Wilcoxon tests (with Bonferroni-adjusted p-values).

# "melting" an array takes a number of columns and reduces them to two columns: 
# one with the labels and another with the value
# this can be useful for a number of reasons, including taking advantage of 
# ggplot2's facet_wrap() and facet_grid() functions to display multiple plots

pcap.melt.dt <- melt(
  pcap.dt, 
  measure.vars = pcap.genera.ids, 
  variable.name = "Genus", 
  value.name = "Abundance"
)

sig.genera <- pcap.genera.ids[pcap.bon.pvals <= 0.05]
pcap.plot.dt <- pcap.melt.dt[Genus %in% sig.genera]
pcap.plot.dt[, Exposure := factor(Exposure, levels = c("Unexposed", "Exposed"))] 
# the default is for levels to be in alphanumeric order, this way the control 
# group, Unexposed, will be first along the x-axes

pcap.plot <- ggplot(pcap.plot.dt, aes(x = Exposure, y = Abundance)) + 
  geom_beeswarm(alpha = 0.6, cex = .3, size = 0.3) + 
  stat_summary(
    fun.data = "mean_cl_boot", # This summary statistical calculates the 95% C.I. of the mean via non-parametric methods
    color = "red", 
    geom = "errorbar", 
    width = 0.8
  ) + 
  facet_wrap(~ Genus, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) + 
  theme(strip.text = element_text(size = 7))
pcap.plot

Visualize Differential Abundance (multiple groups)

Below, you will make a similar visualization as in the example, but with the raw data from the bees data set and plots the abundances of taxa (by bee type) for only those taxa whose abundance was significantly different for at least one of the three groups, according to the Kruskal-Wallis tests (with adjusted p-values).

bees.melt.dt <- melt(
  bees.dt, 
  measure.vars = bees.asv.ids, 
  variable.name = "ASV", 
  value.name = "Abundance"
)
head(bees.melt.dt)
##    Sample bee_type   hive    ASV Abundance
##    <char>   <char> <char> <fctr>     <int>
## 1:     A1        F     DM  ASV11       490
## 2:     A2        H     DM  ASV11       124
## 3:     A3        W     DM  ASV11         0
## 4:     B1        F     GC  ASV11       799
## 5:     B2        H     GC  ASV11       142
## 6:     B3        W     GC  ASV11         0
sig.asvs <- bees.asv.ids[bees.bon.pvals <= 0.05] # or bees.fdr.pvals, Their choice
bees.plot.dt <- bees.melt.dt[ASV %in% sig.asvs]

ggplot(bees.plot.dt, aes(x = bee_type, y = Abundance)) + 
  geom_beeswarm(alpha = 0.6, size = 0.3, cex = 0.3) + 
  stat_summary(
    fun.data = "mean_cl_boot", 
    color = "red", 
    geom = "errorbar", 
    width = 0.8
  ) + 
  facet_wrap(~ ASV, scales = "free_y") + 
  theme(strip.text = element_text(size = 10))

Objective 3: Differential expression analysis based on the negative binomial (NB) Distribution

3A: Differenital expression based on NB with two groups (pcap dataset)

Rather than performing non-parametric tests, we can also attempt to model the underlying data distribution.

Historically, the poisson distribution was used to model counts to test for differential expression in count data. However, in a Poisson distribution, the mean and variance are equal. Consideration of biological count data has lead to the understanding that the Poisson distribution is too restrictive, and that it predicts smaller variance than what is seen commonly in biological data. Therefore, the resulting test does not control type-I error well (the probability of false discoveries).

To address this multiple methods have modeled count data as negative binomial, which allows for more flexible models in that mean and variance do not have to be equal. Methods such as edgeR and DESeq have been developed to allow for testing of differential count data. Read more about DeSeq here: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106 or here: https://chipster.csc.fi/manual/deseq2.html

DESeq is a method to infer differential gene expression based on a distribution which is more appropriate (but still with some caveats) for microbial count data - a negative binomial distribution, with variance and mean linked by local regression.

Below, we will run through an example of using DESeq2 with the Pcap data set. First we will convert our Pcap phyloseq object into a DESeqDataSet, which is a special type of object used to store the input values, intermediate calculations, and results of an analysis of differential expression with DESeq.

After we have our input in the correct format, we’ll run the DESeq function, which takes an input table of raw counts and then produces a table of significantly differentially expressed ASVs. The steps that DESeq performs are:

  1. Estimates size factors for each gene across all samples. Corrects for composition bias of different read depths.

  2. Obtains dispersion estimates for each ASV through a model fit procedure.

  3. Fits a negative binomial generalized linear regression model for each ASV and then uses Wald test for significance testing.

*Automatically detects outliers and removes these. Removes genes with low counts to improve multiple testing adjustment less harsh.

At the end of the chunk, we simply print the results table from running DESeq.

Note: The %>% operator used below is a “pipe” operator. It means “take the results of the function before it and send them to the function after it”.

pcap.ps@sam_data$Exposure <- factor(
  pcap.ps@sam_data$Exposure, 
  levels = c("Unexposed", "Exposed")
)
pcap.deseq <- phyloseq_to_deseq2(pcap.ps, ~ Exposure) # this function does all the heavy lifting for us to turn a phyloseq object into a DESeq2 object
pcap.dds <- DESeq(pcap.deseq)
pcap.res.dt <- results(pcap.dds) %>% as.data.table(keep.rownames = "Genus")
pcap.sig.dt <- pcap.res.dt[padj <= 0.05]

gt(pcap.sig.dt[order(log2FoldChange)]) %>% 
  fmt_number(columns = 2:ncol(pcap.sig.dt), decimals = 3)
Genus baseMean log2FoldChange lfcSE stat pvalue padj
Paraclostridium 8.783 −5.752 0.694 −8.292 0.000 0.000
Plesiomonas 1,438.125 −4.786 0.304 −15.762 0.000 0.000
Exiguobacterium 0.586 −2.952 1.024 −2.883 0.004 0.017
Chitinophaga 2.618 −1.972 0.652 −3.023 0.003 0.012
Fluviicola 4.962 −1.829 0.680 −2.690 0.007 0.029
Cetobacterium 1,778.887 −0.866 0.261 −3.318 0.001 0.006
NONE 196.546 0.836 0.185 4.509 0.000 0.000
Pelomonas 65.840 1.079 0.328 3.284 0.001 0.006
Gemmobacter 2.281 1.134 0.451 2.515 0.012 0.042
Mycoplasma 18.554 1.418 0.473 2.996 0.003 0.012
Phreatobacter 2.770 1.489 0.424 3.510 0.000 0.005
Anaerotruncus 3.378 1.720 0.638 2.695 0.007 0.029
Pseudomonas 133.346 1.829 0.274 6.666 0.000 0.000
Fusobacterium 5.669 1.960 0.635 3.085 0.002 0.011
Hyphomicrobium 1.092 1.976 0.601 3.288 0.001 0.006
Weissella 3.053 2.039 0.616 3.310 0.001 0.006
Peptostreptococcus 1.047 2.074 0.780 2.659 0.008 0.030
Lactococcus 1.054 2.280 0.935 2.439 0.015 0.050
Comamonas 1.728 2.313 0.872 2.653 0.008 0.030
Clostridium_sensu_stricto_1 1.703 2.349 0.906 2.594 0.009 0.034
Leeia 1.681 2.362 0.695 3.400 0.001 0.006
Ensifer 5.369 2.485 0.397 6.266 0.000 0.000
Ferruginibacter 0.851 2.486 0.805 3.087 0.002 0.011
Bacteroides 1.600 2.519 0.762 3.307 0.001 0.006
Lactobacillus 2.695 2.547 0.749 3.400 0.001 0.006
Rheinheimera 1.499 2.596 0.844 3.074 0.002 0.011
Ignatzschineria 1.712 2.673 0.885 3.018 0.003 0.012
Luteimonas 0.983 2.683 0.740 3.627 0.000 0.003
Undibacterium 7.379 3.329 0.510 6.524 0.000 0.000
Chitinimonas 4.909 3.645 0.586 6.214 0.000 0.000
Photobacterium 6.744 3.655 0.715 5.110 0.000 0.000

Below, we plot the log2FoldChange (i.e. number of doublings) between the Unexposed and Exposed groups. More positive numbers indicate the Genus has a higher abundance in the Unexposed vs Exposed group, while negative numbers indicate the opposite.

ggplot(pcap.sig.dt, aes(x = reorder(Genus, log2FoldChange), y = log2FoldChange)) + 
  geom_col() + 
  coord_flip() + 
  labs(x = "Genus", y = "Log2 Fold Change (Unexposed to Exposed)")

3B: Differenital expression based on NB with multiple groups (bees dataset)

Below you will run through a very similar analysis with the Bees data set as the example above did with the Pcap data set. Notice, that just as with the non-parametric tests, there will be some modifications to the analysis since the number of groups you’ll be comparing is three rather than two. Namely, there will be two tables to results instead of one.

bees.deseq <- phyloseq_to_deseq2(bees.ps, ~ bee_type)
bees.dds <- DESeq(bees.deseq)
resultsNames(bees.dds) # the second and third names here indicate the comparisons we want to make below
## [1] "Intercept"       "bee_type_H_vs_F" "bee_type_W_vs_F"
bees.HvF.res.dt <- as.data.table(
  results(bees.dds, name = "bee_type_H_vs_F"), 
  keep.rownames = "ASV"
)


bees.HvF.sig.dt <- bees.HvF.res.dt[padj <= 0.05]
gt(bees.HvF.sig.dt[order(log2FoldChange)]) %>% 
  tab_header(title = "Bee type H vs F") %>%
  fmt_number(columns = 2:ncol(bees.HvF.sig.dt), decimals = 3)
Bee type H vs F
ASV baseMean log2FoldChange lfcSE stat pvalue padj
ASV58 5.449 −23.054 3.577 −6.445 0.000 0.000
ASV50 40.149 −9.571 2.086 −4.588 0.000 0.000
ASV63 16.777 −7.479 1.806 −4.141 0.000 0.000
ASV48 60.603 −4.753 1.702 −2.793 0.005 0.016
ASV25 186.727 −3.522 1.022 −3.446 0.001 0.002
ASV14 536.603 −2.910 1.015 −2.868 0.004 0.013
ASV16 536.316 −2.558 1.094 −2.339 0.019 0.049
ASV13 537.183 −2.517 0.375 −6.712 0.000 0.000
ASV42 91.221 −2.225 0.482 −4.616 0.000 0.000
ASV12 742.996 −2.091 0.384 −5.450 0.000 0.000
ASV11 892.166 −2.025 0.804 −2.520 0.012 0.034
ASV6 1,385.724 −1.902 0.232 −8.212 0.000 0.000
ASV5 3,030.315 −1.090 0.370 −2.950 0.003 0.010
ASV2 7,650.634 −1.007 0.213 −4.723 0.000 0.000
ASV36 92.213 −0.672 0.194 −3.474 0.001 0.002
ASV4 3,346.276 0.701 0.200 3.508 0.000 0.002
ASV22 232.225 0.762 0.307 2.477 0.013 0.037
ASV7 902.376 1.601 0.256 6.264 0.000 0.000
ASV31 104.500 1.668 0.445 3.746 0.000 0.001
ASV23 224.797 1.855 0.476 3.897 0.000 0.000
ASV51 28.195 3.452 1.096 3.148 0.002 0.006
ASV35 88.694 5.262 1.169 4.503 0.000 0.000
ASV66 10.507 7.125 1.987 3.586 0.000 0.001
ASV52 25.951 7.153 1.609 4.447 0.000 0.000
ASV65 10.417 7.154 2.905 2.463 0.014 0.037
ASV41 70.089 9.905 0.941 10.531 0.000 0.000
ASV40 9.440 22.508 3.516 6.401 0.000 0.000
bees.WvF.res.dt <- as.data.table(
  results(bees.dds, name = "bee_type_W_vs_F"), 
  keep.rownames = "ASV"
)
bees.WvF.sig.dt <- bees.WvF.res.dt[padj <= 0.05]
gt(bees.WvF.sig.dt[order(log2FoldChange)]) %>% 
  tab_header(title = "Bee type W vs F") %>%
  fmt_number(columns = 2:ncol(bees.WvF.sig.dt), decimals = 3)
Bee type W vs F
ASV baseMean log2FoldChange lfcSE stat pvalue padj
ASV37 23.395 −24.205 3.576 −6.769 0.000 0.000
ASV34 81.152 −23.951 2.246 −10.665 0.000 0.000
ASV58 5.449 −22.775 3.577 −6.367 0.000 0.000
ASV25 186.727 −11.535 1.157 −9.969 0.000 0.000
ASV50 40.149 −9.438 2.086 −4.524 0.000 0.000
ASV63 16.777 −8.171 1.806 −4.525 0.000 0.000
ASV11 892.166 −7.844 0.813 −9.646 0.000 0.000
ASV24 253.396 −7.273 2.528 −2.877 0.004 0.011
ASV48 60.603 −7.115 1.729 −4.114 0.000 0.000
ASV75 10.244 −6.226 2.511 −2.480 0.013 0.029
ASV43 91.169 −6.071 1.523 −3.987 0.000 0.000
ASV51 28.195 −5.350 1.224 −4.372 0.000 0.000
ASV35 88.694 −5.283 1.289 −4.097 0.000 0.000
ASV27 216.853 −4.956 1.606 −3.086 0.002 0.006
ASV14 536.603 −3.687 1.015 −3.633 0.000 0.001
ASV16 536.316 −2.762 1.094 −2.525 0.012 0.027
ASV5 3,030.315 −2.542 0.370 −6.874 0.000 0.000
ASV42 91.221 −1.660 0.481 −3.450 0.001 0.002
ASV6 1,385.724 −1.461 0.231 −6.310 0.000 0.000
ASV2 7,650.634 −1.342 0.213 −6.296 0.000 0.000
ASV18 384.296 −1.307 0.256 −5.104 0.000 0.000
ASV13 537.183 −0.954 0.374 −2.548 0.011 0.026
ASV4 3,346.276 0.996 0.200 4.988 0.000 0.000
ASV19 327.169 1.108 0.347 3.196 0.001 0.004
ASV31 104.500 1.133 0.446 2.542 0.011 0.026
ASV8 919.224 1.428 0.555 2.574 0.010 0.026
ASV7 902.376 1.908 0.256 7.466 0.000 0.000
ASV23 224.797 2.140 0.476 4.498 0.000 0.000
ASV3 3,713.902 2.406 0.598 4.026 0.000 0.000
ASV52 25.951 7.732 1.608 4.808 0.000 0.000
ASV59 18.599 23.411 3.406 6.874 0.000 0.000

Because there are 3 groups we’re comparing, there will be two plots instead of one.

plot.HvF <- ggplot(bees.HvF.sig.dt, aes(x = reorder(ASV, log2FoldChange), y = log2FoldChange)) + 
  geom_col() + 
  coord_flip() + 
  labs(x = "ASV", y = "Log2 Fold Change", title = "Bee type H vs F")

plot.WvF <- ggplot(bees.WvF.sig.dt, aes(x = reorder(ASV, log2FoldChange), y = log2FoldChange)) + 
  geom_col() + 
  coord_flip() + 
  labs(x = "ASV", y = "Log2 Fold Change", title = "Bee type W vs F")

plot_grid(plot.HvF, plot.WvF, ncol = 1)

# Reach goal:
# Modify and combine the two DESeq2 results tables into a single data.table (see ?rbind) and plot the same graph in one call utilizing the facet_wrap() function with the ggplot()

Objective 4: Model abundances as a function of a continuous variable and (Objective 5) do so efficiently

Up until now, we have only considered determining significance between discrete groups of samples (e.g. exposed / unexposed). Let’s try modeling microbial abundance as a function of the parasitic burden (an example of a continuous covariate). You can do this using a compound poisson generalized linear model.

You will see that this implementation is a bit slow - even for this small dataset and so we will work on it below to speed it up.

Loop over each genus using a standard for loop.

cpglms.for <- list()
pcap.noNA.dt <- pcap.dt[complete.cases(pcap.dt)]

compute.time.for <- system.time(
  for (genus in pcap.genera.ids) {
    frm <- paste(genus, "~ Worms.histo")
    mod <- try(cpglm(as.formula(frm), data = pcap.noNA.dt, link = "identity"), silent = T)
    if (class(mod) == "try-error") { # if the model doesn't run successfully
mod <- NA
}
cpglms.for[[genus]] <- mod
}
)

paste("Elapsed time:", round(compute.time.for[3], 3), "seconds") %>%
  cat.n()
## Elapsed time: 92.956 seconds

Using lapply for increased efficiency

Lets see if we can get the loop to run faster using an lapply. Notice the difference in the system.time outputs. (Depending on various factors, the lapply loop may not actually run faster for you, but compare with others around you, and, on average, it should be a few seconds faster or about equivalent).

compute.time.lapply <- system.time(
  cpglms.lapply <- lapply(pcap.genera.ids, function(genus) {
    frm <- paste(genus, "~ Worms.histo")
    mod <- try(cpglm(as.formula(frm), data = pcap.noNA.dt, link = "identity"), silent = T)
    if (class(mod) == "try-error") {
      mod <- NA
    }
    return(mod)
  })
)

cat(paste("Elapsed time:", round(compute.time.lapply[3], 3), "seconds"), sep = "\n")
## Elapsed time: 91.771 seconds

These data sets are rather small. With larger data sets even low-percentage reductions in compute time can make a big difference in terms of total time.

Parallelization example

Now, lets convert the lapply loop above into a foreach parallel loop using 2 cores to see if that runs faster. Note the reduction in compute time printed by system.time.

?foreach

nCores <- 2
cl <- makeCluster(nCores, type = "FORK")
registerDoParallel(cl, cores = nCores)

compute.time.par <- system.time(
  cpglms.par <- foreach(
    genus = pcap.genera.ids
  ) %dopar% {
    frm <- paste(genus, "~ Worms.histo")
    mod <- try(cpglm(as.formula(frm), data = pcap.noNA.dt, link = "identity"), silent = T)
    if (class(mod) == "try-error") {
      mod <- NA
    }
    return(mod)
  }
)

stopCluster(cl)

paste("Elapsed time:", round(compute.time.par[3], 3), "seconds") %>%
  cat.n()
## Elapsed time: 63.074 seconds

Remember since everyone is doing the lab at once, that we cannot use a large number of cores for this test. However, if you have access to a cluster with 100s of cores, this can speed up your analysis quite a bit. On my machine this run didn’t quite reduce the compute time by half, but was close.

Correct for multiple hypothesis testing

Now that we made this a bit faster, we will now continue on our example of modeling taxon abundance as a function of a continuous variable.

As above, we will get all the p-values from the CPGLM models and adjust them for multiple tests. Additionally, since we’re already looping through all the models, we’ll also grab other values that will help us plot the results, i.e., estimated coefficients for intercepts and slopes.

cpglms.dt <- lapply(1:length(cpglms.lapply), function(i) {
  mod <- cpglms.lapply[[i]]
  if (class(mod)[1] == "cpglm") { # only continue if we have a model here instead of an NA
    genus <- pcap.genera.ids[[i]]
    sink("tmp.txt") # the summary function for the cplm-class objects prints default every model  when saving them to a variable. This sink() function prevents writing all of them to the console and instead to the tmp.txt file
    mod.sum <- summary(mod)
    sink() # Ends the writing of output to the tmp.txt file
    return(
      data.table(
        Genus = genus,
        Intercept = mod.sum$coefficients[1, 1],
        Slope = mod.sum$coefficients[2, 1],
        P.val = mod.sum$coefficients[2, 4]
      )
    )
  }
}) %>% rbindlist() # combine the list of data.tables into a single data.table

file.remove("tmp.txt") # just to keep things tidy
## [1] TRUE
# below is the data.table syntax for creating a new column based on another column
cpglms.dt[, Pval.adj := p.adjust(P.val, method = "bonferroni")] # or another correction of your choice
cpglms.sig.dt <- cpglms.dt[Pval.adj <= 0.05]
cpglms.sig.dt
##              Genus    Intercept        Slope        P.val     Pval.adj
##             <char>        <num>        <num>        <num>        <num>
##  1:  Acinetobacter 1.057274e+02  -2.74894800 1.621950e-05 6.698654e-03
##  2: Hyphomicrobium 2.254789e-01   0.51633826 2.294996e-05 9.478334e-03
##  3:      Filimonas 5.749191e-12   0.06467901 7.960534e-05 3.287700e-02
##  4:    Plesiomonas 7.781889e+02 -21.19145174 3.976450e-12 1.642274e-09
##  5:  Dechloromonas 5.202028e-01  -0.01445008 6.406180e-05 2.645752e-02
##  6:   Sphingomonas 1.640056e-01   0.32551617 3.761001e-05 1.553294e-02
##  7:     Sutterella 9.052547e-11   0.06403765 2.756049e-05 1.138248e-02
##  8:     Microvirga 1.709563e-10   0.01967196 8.393415e-05 3.466481e-02
##  9:  Defluviimonas 2.651962e+00  -0.07366561 3.698659e-05 1.527546e-02
## 10:        Devosia 2.159936e-11   0.12663095 1.177475e-06 4.862972e-04

Vizualize results

Now, we’ll plot the significant associations between taxon abundances and worm counts (Worms.histo) as scatter plots. Positive slopes indicate that an increase in worm counts is associated with a greater abundance of a taxon, while negative slopes indicate that an increase in worm counts is associated with a lesser abundance of a taxon.

This will be a case of using two different data arrays within a single ggplot call.

cpglms.plot.dt <- pcap.melt.dt[Genus %in% cpglms.sig.dt$Genus] # grabbing pcap.melt.dt from above because it's already in the format that we need for easy plotting in ggplot. We're just subsetting it here so it only contains the genera that had significant associations with worm counts.
cpglms.plot.dt <- cpglms.plot.dt[complete.cases(cpglms.plot.dt)] 

ggplot(cpglms.plot.dt, aes(x = Worms.histo, y = Abundance)) + # we're using a log tranformation of the Abudance counts because that's what CPGLM does by default. You can change this in the cpglm()
  geom_point() + # makes it a scatter plot 
  geom_abline(
    data = cpglms.sig.dt, # Notice we're using a different data.table here than that above that contains the intercepts and slopes. Also note that for facet_wrap() to work below, the same facet variable has to be present in this second data.table, too
aes(intercept = Intercept, slope = Slope),
color = "red"
) +
  facet_wrap(~ Genus, scales = "free_y")

While these associations have been deemed significant by the statistical test, how many are you convinced are strong, likely real, associations?