Microbiome Analysis (16s) with phyloseq and LDM

phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data (2013) PLoS ONE 8(4):e61217 http://dx.plos.org/10.1371/journal.pone.0061217

Yi-Juan Hu and Glen A. Satten (2019). LDM: Testing Hypotheses about the Microbiome using the Linear Decomposition Model. R package version 1.0. The original paper from Hu And Satten is available here https://www.biorxiv.org/content/10.1101/229831v2 and the authors’ Github repository is available here https://github.com/yijuanhu

Example data comes from the Global Patterns dataset is described in PNAS (Caporoso, 2011). This dataset compares the microbial communities of 25 environmental samples and three known “mock communities”-a total of 9 sample types- at 3.1 million reads/sample. It is natively available within phyloseq.

Briefly, phyloseq takes in data from data processing programs like QIIME, mothur, and Pyrotagger. There are many other ways to import data into phyloseq. We typically import the .qza files produced by QIIME2 but in this walkthrough we will be using a built-in dataset that you can use anytime so you can follow along with this walkthrough if desired. For more examples to import data using other programs, see the phyloseq vignette here https://bioconductor.org/packages/release/bioc/vignettes/phyloseq/inst/doc/phyloseq-basics.html

While QIIME2 offers richness estimates and other exploratory data analysis (ex: alpha and beta diversity metrices) we believe that phyloseq in combination with ggplot2 offers greater flexibility for generating customizable data visualizations.

Once exploratory data analysis in phyloseq is complete, we use the LDM package to perform statistical analyses. LDM takes in a table of operational taxonomic units (OTUs) or amplicon sequence variants (ASVs) along with a table of data about the samples (i.e. covariates) and uses a linear decomposition model to associate experimental conditions and covariates of interest with microbial abundance. Data about the samples typically includes sample names, some experimental condition of interest, and other variables as collected by the experimentors. LDM can accomodate both continuous and categorical data.

Definition of terms

Linear Decomposition Model

The LDM models microbial abundance in the form of counts transformed into relative abundances as an outcome of interest given experimental covariates of interest. LDM provides users with both global and local hypothesis tests of differential abundance given covariates of interest and microbial count data. LDM decomposes the model sum of squares into parts explained by each variable in the model. From these sub-models we can see the amount of variability that each variable is contributing to the overall variability explained by the model’s covariates of interest.

False Discovery Rate (FDR)

The FDR tells you how likely it is that all taxa identified as differentially abundant (DA) are false positives. A FDR of 5% means that among all taxa called DA, an average of 5% of those are truly not DA. The q-value is the local significance threshold adjusted for the fact that we have assessed muliple taxa. Accurate interpretation of unadjusted p-values assumes that each taxa is assessed for DA on its own. However, most, if not all 16s microbiome experiments assess multiple taxa for differential abundance at once. In order to account for the number of taxa we are testing, we must calculate and interpret the adjusted p-value for each taxa.

Our very first step is to load the packages we need from Bioconductor. Please see http://bioconductor.org/ for information about installation and use of Bioconductor and its packages.

Loading R packages

pacman::p_load("devtools","phyloseq","tidyverse","qiime2R", "ggplot2", 
              "reshape2", "GUniFrac", "vegan", "LDM")

Import and prepare data

We will be using the Global Patterns dataset for this walk through but typically we would use the .qza files we obtain from QIIME2 and the qza_to_phyloseq function in phyloseq to import the data as a phyloseq object.

In order to improve statistical power to detect differentially abundant taxa we typically remove taxa which have counts of 0 in at least one (or more) samples.

With this dataset we would also like to define a categorical variable for sample type which will tell us if the sample is from a human or not since this is a covariate of interest to us and we expect it to account for a large percentage of the variability between the samples. This variable will be project specific.

data ("GlobalPatterns")

GP <- prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)

human <- get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")

sample_data(GP)$human <- factor(human)

sample_data(GP)$human<-ifelse(human==TRUE, "Human", "Not-Human")

#create and label human variable
head(sample_data(GP))

##         X.SampleID  Primer Final_Barcode Barcode_truncated_plus_T
## CL3            CL3 ILBC_01        AACGCA                   TGCGTT
## CC1            CC1 ILBC_02        AACTCG                   CGAGTT
## SV1            SV1 ILBC_03        AACTGT                   ACAGTT
## M31Fcsw    M31Fcsw ILBC_04        AAGAGA                   TCTCTT
## M11Fcsw    M11Fcsw ILBC_05        AAGCTG                   CAGCTT
## M31Plmr    M31Plmr ILBC_07        AATCGT                   ACGATT
##         Barcode_full_length SampleType
## CL3             CTAGCGTGCGT       Soil
## CC1             CATCGACGAGT       Soil
## SV1             GTACGCACAGT       Soil
## M31Fcsw         TCGACATCTCT      Feces
## M11Fcsw         CGACTGCAGCT      Feces
## M31Plmr         CGAGTCACGAT       Skin
##                                        Description     human
## CL3       Calhoun South Carolina Pine soil, pH 4.9 Not-Human
## CC1       Cedar Creek Minnesota, grassland, pH 6.1 Not-Human
## SV1     Sevilleta new Mexico, desert scrub, pH 8.3 Not-Human
## M31Fcsw    M3, Day 1, fecal swab, whole body study     Human
## M11Fcsw   M1, Day 1, fecal swab, whole body study      Human
## M31Plmr    M3, Day 1, right palm, whole body study     Human

Exploratory Data Analysis with phyloseq

Richness estimation

Alpha Diversity

To compare richness estimates within samples we typically provide the Shannon metric but phyloseq also offers the Chao1, ACE, Simpson, or Inverse Simpson metrics. While this dataset does not include control samples we would suggest including Zymo controls, Positive Controls, and Negative Controls in your sequencing along with your experimenatl samples.

Here we have also specified that we would like the Classic ggplot2 theme,which gives a white background with a rainbow of colors for each box plot. This color scheme is customizable and we would be happy to use one to fit your project.

theme_set(theme_classic())

alpha_meas = c("Shannon")

plotGP <- plot_richness(GP, "human", "SampleType", measures=alpha_meas)
plotGP + geom_boxplot(data=plotGP$data, aes(human,value,color=NULL), alpha=0.1)+coord_cartesian(ylim = c(0, 7))

Each box groups the sample types by whether they come from a human or not. Each sample type is represented by a different color and we see that among the non-human samples, soil has the highest alpha diversity score and Freshwater samples have the lowest. Among the human samples, skin samples have the highest alpha diversity and tongue samples have the lowest alpha diversity.

Alpha diversity scores tell us how similar in microbial composition each individual sample is to other samples of the same sample type or other grouping variable of interest, or the within-sample variability. We would expect experimental samples to have scores like those we see here, between 2-7. This makes sense given the amount of biological diversity between individuals even of the same sample type. It seems reasonable that the microbial composition of skin cells would be more different from one skin cell to the next than the microbial composition of one tongue cell to the next.

For control samples, we would expect them to have much lower alpha diversity scores than the experimental samples, closer to zero, because we expect that all of the samples in each control type are extremely similar to each other and more similar to one another than to other sample types.

Correspondance Analysis

Beta Diversity

Similarly to a Principal Components Analysis a Correspondance Analysis tells us how much variability is being contributed to the data set by each axis (like a principal component). In the case of this data, we only want to lok at the top 200 most represented taxa in the top 5 Phyla so first we have to subset the data to reflect that. These are parameters which will vary based on your research question.

There are many metrics for assessing beta diversity and phyloseq includes options for Unifrac distance, Jaccard, Manhattan, Euclidean, or Caho metrics among others. Beta Diversity tells us how similar in microbial composition each sample is to other samples of different types, or between sample variability.

The amount of variability we would expect to see between sample types will vary based on your specific project. In general we do expect that control samples will be very different in their microbial composition than the experimental samples.

In order to see which CAs are contributing the most variability to the relationship between sampletype and microbiome composition, we first exmaine a Scree plot. The y-axis of a Scree plot shows us from 0.1 the eigenvalue of each CA. Multiplied by 100, this tells us how much variability is being contributed by each covariate of interest according to the Bray-Curtis distance.

Scree Plot

topsp <- names(sort(taxa_sums(GP), TRUE)[1:200])
GP    <- prune_taxa(topsp, GP)

top5ph <- sort(tapply(taxa_sums(GP), tax_table(GP)[, "Phylum"], sum), decreasing=TRUE)[1:5]
GP     <- subset_taxa(GP, Phylum %in% names(top5ph))

# Re-add human variable to sample data:
sample_data(GP)$human <- factor(human)
sample_data(GP)$human<-ifelse(human==TRUE, "Human", "Not-Human")

GPca  <- ordinate(GP, "CCA", "bray")
plot_scree(GPca, "Scree Plot for Global Patterns Correspondence Analysis")

From the Scree plot we see that the first 2 CAs are contributing the most variability to the relationship so let’s examine how these are influencing the distances between the samples.

plot_ordination(GP, GPca, "samples", color="SampleType") + 
  geom_point(size=3)

Looking at the first and second CAs we see that the Ocean and Freshwater (creek) samples and the Tongue and Skin samples appear to be more similar to one another than they are different. The Freshwater, Soil, and Sediment (estuary) samples are each more similar to other samples of the same type than samples of other types since they appear to cluster relatively closely by sample type. Finally we note that the Mock community samples and the Feces samples cluster very closely.

Looking at the top half of the graph compared to the bottom half of the graph it suggests that the most important covariate in explaining the differences between these samples is whether or not they come from human samples.This seems biologically reasonable since we would expect samples from humans would be more similar to one another than samples from waterways or sediment.

Phyla-specific abundance plot

Next, we want to see which particular phyla differ between human and non-human samples. We can do this with a bar plot with abundance graphed along the y-axis and whether the samples are human or not along the x-axis.

p1  <- 
  plot_ordination(GP, GPca, "species", color="Phylum")

mdf <- 
  melt(p1$data[, c("CA1", "CA2", "Phylum", "Family", "Genus")], 
            id=c("Phylum", "Family", "Genus") )
pbar <-
  plot_bar(GP, x="human", fill="SampleType", facet_grid= ~ Phylum)

#specify that we do not want black lines around each individual ASV
pbar + 
  geom_bar(aes(color=SampleType, fill=SampleType), stat="identity", position="stack")

Here we see that Cyanobacteria and Actinobacteria are most abundant in the non-human samples and the Bacteriodetes and Firmicutes are most abundant in the human samples.

There are many ways to visualize data in phyloseq, these are just a few examples of the typical products of our analytical pipeline at EICC. We would be happy to customize graphs according to your project-specific goals.

Statistical Analysis with LDM

We need to recover the data from the phyloseq object and transform the taxa table, asvs, and the metadata objects from phyloseq data objects to data frames.

taxatable <- as.data.frame(tax_table(GP)) #taxonomy

phytree <- phy_tree(GP) #phylogenetic tree

metadata <- as.data.frame(sample_data(GP)) #sample data

asvs <- as.data.frame(otu_table(GP)) #ASVs

dim(asvs) #182 x 26

## [1] 182  26

Since the ASVs have the taxa as rows and we need them as columns, we can use the t function to prepare the data in the way we need it and transpose the columns into rows and rows into columns. Note that we have 182 taxa and 26 samples.

Next, to remove any statistical noise we may still have to detect relationships between the covariates and microbial composition, we decide to keep only those taxa which appear in at least 2 samples. This is a parameter that will vary by project.

asvt <- as.data.frame (t(asvs))

otu_pres = which(colSums(asvt>0)>=1)

asvt = asvt[,otu_pres]

dim(asvt) #26 x 182

## [1]  26 182

Our exploratory data analysis has suggested that a sample being either from a human or not is an important determination of its microbial composition. Let’s test whether or not this relationship is statistically significant. In some cases this variable might be considered a confounder but in this case, since we want to examine the relationship between human and non-human samples and microbial composition, it will be the only covariate of interest. LDM can handle covariates both categorical and continuous and can control for confounders.

Correspondance Analysis

Since we are only interested in assessing the impact of whether or not a sample comes from a human, let’s look at a Correspondance Analysis plot with samples colored by this variable.

plot_ordination(GP, GPca, "samples", color="human") + geom_point(size=3)

By coloring the samples by whether the come from humans or not we see that they do appear to have some seperation based on this variable. We should investigate further to determine whether or not this apparent seperation results in statistically significantly different microbial communities between human and non-human samples.

To examine this relationship we first specify the data object for our model, called form by stating we would like to assess whether or not a sample being from a human significantly impacts its microbiome composition. Next, we fit this linear discriminant model using the Bray-Curtis distance. Recall that this distance is one of many metrics available to quantify the between-sample variability or how different each sample group is from one another.

Fit the LDM

form<- asvt ~human

fit <-ldm(formula=form, data=asvt,dist.method="bray", n.perm.max=0)

Scree Plot

Similarly as above, we will use a scree plot here to quantify how much variability in our model is explained by the human variable.

scree.VE.freq <- c(fit$VE.global.freq.submodels/fit$VE.df.submodels,
                   fit$VE.global.freq.residuals)

color <- c(rep("red", 1), rep("black", length(scree.VE.freq)-1))

plot(scree.VE.freq/sum(scree.VE.freq), main="frequency scale",
     xlab="Component", ylab="Proportion of Total Sum of Squares", col=color)

Each dot on this plot represents a variable in the model either one we have recorded in our dataset or so-called surrogate variables which we may not have measured. In this scree plot we see results which support our Correspondance Analysis above, that whether or not a sample is human (also represented by CA1) is contributing ~10% of the observed variability.

Global Hypothesis Testing

In order to determine whether a sample being from a human or non-human source is statistically significantly contributing to observed differences in microbial composition we first perform a test of the global hypothesis. We want to know, overall, are there differences between human and non-human samples with regard to the composition of the ASVs expressed in each group?

Since we are performing permuations, we set a seed to be able to reproduce our work each time we run the model. This is key to reproducibility and must be specified. You can use R’s built-in seed function or you can pick a number you like as long as you use the same one everytime you need a seed in this analysis and don’t mind sharing it.

(seed=sample.int(100000, size=1)) #36575, use this to reproduce these results exactly

## [1] 85756

Now we fit our model, specifying that we only want to do a global test with test.global=TRUE and test.otu=FALSE. We also specify that we want to use the Bray Curtis distance but LDM offers many options to customize this parameter.

fit2<-ldm(formula=form, data=asvt, dist.method="bray", 
          test.global=TRUE, test.otu=FALSE, seed=36575)

## permutations: 1 
## permutations: 1001 
## permutations: 2001 
## permutations: 3001 
## permutations: 4001

#unadjusted p-value for the global hypothesis test
fit2$p.global.omni

## [1] 0.00019996

ASV-specific Hypothesis Testing

Since this global hypothesis test has a global p-value of <0.05 (p=0.0001996) it appears to suggest that there are statistically significant differences in the microbiome compositions of human and non-human samples. However, since we tested 182 ASVs with each one representing an individual hypothesis we must correct for multiple testing. Here we use the Benjamini-Hochberg False Discovery Rate to control the FDR at 0.05.

Let’s look more closely at which ASVs are responsible for this. Note that in this model we fit we specify test.otu=TRUE and the FDR is controlled at 0.05. We also specify the same seed as above for reproducibility of results. The FDR specified will vary by project. More exploratory or pilot studies may wish to have a higher FDR where more specific studies may wish to have a smaller one.

fit3<-ldm(formula=form, data=asvt, dist.method="bray",
          test.global=TRUE, test.otu=TRUE, n.rej.stop=100, fdr.nominal=0.05, seed=36575)

## permutations: 1 
## permutations: 1001 
## permutations: 2001 
## permutations: 3001 
## permutations: 4001 
## permutations: 5001 
## number of OTUs do not meet early stopping criterion: 33 
## otu test stopped at permutation 5500

After fitting our model, we can find out how many and which ASVs and associated taxa are significiantly differently abundant between human and non-human samples. Here we create a dataframe of results ordered by smallest local FDR and examine the top ten.

#i would love to make this a tidy function but I'm not sure how
Kingdom <-taxatable$Kingdom
Phylum <-taxatable$Phylum
Class <-taxatable$Class
Order <-taxatable$Order
Genus <-taxatable$Genus
Species <-taxatable$Species
pvals <-fit3$p.global.omni
qvals <-t(fit3$q.otu.omni) #transpose 

#sort results
results<-data.frame(Kingdom,Phylum,Class,Order,Genus,Species,pvals,qvals)

resqOrdered <- results[order(results$qvals),]
head(resqOrdered, 10)

##         Kingdom         Phylum          Class             Order           Genus
## 471141 Bacteria Actinobacteria Actinobacteria   Actinomycetales          Rothia
## 326977 Bacteria Actinobacteria Actinobacteria Bifidobacteriales Bifidobacterium
## 469873 Bacteria Actinobacteria Actinobacteria Bifidobacteriales Bifidobacterium
## 535929 Bacteria  Cyanobacteria    Chloroplast       Cryptophyta            <NA>
## 561077 Bacteria  Cyanobacteria    Chloroplast     Stramenopiles            <NA>
## 542011 Bacteria  Cyanobacteria    Chloroplast     Stramenopiles            <NA>
## 317182 Bacteria  Cyanobacteria    Chloroplast     Stramenopiles            <NA>
## 549656 Bacteria  Cyanobacteria    Chloroplast     Stramenopiles            <NA>
## 327536 Bacteria  Cyanobacteria    Chloroplast     Stramenopiles            <NA>
## 228407 Bacteria  Cyanobacteria    Chloroplast      Haptophyceae            <NA>
##                                 Species      pvals      qvals
## 471141               Rothiamucilaginosa 0.00019996 0.02543348
## 326977      Bifidobacteriumadolescentis 0.00019996 0.02543348
## 469873 Bifidobacteriumpseudocatenulatum 0.00019996 0.02543348
## 535929                             <NA> 0.00019996 0.02543348
## 561077                             <NA> 0.00019996 0.02543348
## 542011                             <NA> 0.00019996 0.02543348
## 317182                             <NA> 0.00019996 0.02543348
## 549656                             <NA> 0.00019996 0.02543348
## 327536                             <NA> 0.00019996 0.02543348
## 228407                             <NA> 0.00019996 0.02543348

Our results appear to be consistent with what we would expect to be the most abundant taxa in non-human as compared to human samples; Chloroplast. The most significantly abundant bacteria may or may not be the most clinically relevant and this will vary by project. Note that p-values and q-values are from the global omnibus tests. p and q values based on frequency scale data or arcsin-root transformed frequency data are also available.

We look forward to working with you to customize this pipeline to your experimental design and parameters of interest. For a more specific application and interpretation of these or additional tools and visualizations for your own data, please find our contact information here https://www.cores.emory.edu/eicc/about/index.html

Information about our R session is available below

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] LDM_1.0            GUniFrac_1.1       matrixStats_0.55.0 ape_5.3           
##  [5] vegan_2.5-6        lattice_0.20-38    permute_0.9-5      reshape2_1.4.3    
##  [9] qiime2R_0.99.13    forcats_0.4.0      stringr_1.4.0      dplyr_0.8.3       
## [13] purrr_0.3.3        readr_1.3.1        tidyr_1.0.0        tibble_2.1.3      
## [17] ggplot2_3.2.1      tidyverse_1.3.0    phyloseq_1.28.0    devtools_2.2.1    
## [21] usethis_1.5.1     
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1    ellipsis_0.3.0      rprojroot_1.3-2    
##   [4] htmlTable_1.13.3    XVector_0.24.0      base64enc_0.1-3    
##   [7] fs_1.3.1            rstudioapi_0.10     remotes_2.1.0      
##  [10] fansi_0.4.0         lubridate_1.7.4     xml2_1.2.2         
##  [13] codetools_0.2-16    splines_3.6.1       knitr_1.26         
##  [16] pkgload_1.0.2       zeallot_0.1.0       ade4_1.7-13        
##  [19] Formula_1.2-3       jsonlite_1.6        broom_0.5.2        
##  [22] cluster_2.1.0       dbplyr_1.4.2        compiler_3.6.1     
##  [25] httr_1.4.1          backports_1.1.5     assertthat_0.2.1   
##  [28] Matrix_1.2-17       lazyeval_0.2.2      cli_2.0.0          
##  [31] acepack_1.4.1       htmltools_0.4.0     prettyunits_1.0.2  
##  [34] tools_3.6.1         igraph_1.2.4.2      gtable_0.3.0       
##  [37] glue_1.3.1          Rcpp_1.0.3          Biobase_2.44.0     
##  [40] cellranger_1.1.0    vctrs_0.2.0         Biostrings_2.52.0  
##  [43] multtest_2.40.0     nlme_3.1-142        iterators_1.0.12   
##  [46] xfun_0.11           ps_1.3.0            testthat_2.3.1     
##  [49] rvest_0.3.5         lifecycle_0.1.0     pacman_0.5.1       
##  [52] zlibbioc_1.30.0     MASS_7.3-51.4       scales_1.0.0       
##  [55] hms_0.5.2           parallel_3.6.1      biomformat_1.12.0  
##  [58] rhdf5_2.28.1        RColorBrewer_1.1-2  yaml_2.2.0         
##  [61] gridExtra_2.3       memoise_1.1.0       rpart_4.1-15       
##  [64] latticeExtra_0.6-28 stringi_1.4.3       S4Vectors_0.22.1   
##  [67] desc_1.2.0          foreach_1.4.7       checkmate_1.9.4    
##  [70] BiocGenerics_0.30.0 pkgbuild_1.0.6      rlang_0.4.1        
##  [73] pkgconfig_2.0.3     evaluate_0.14       Rhdf5lib_1.6.3     
##  [76] labeling_0.3        htmlwidgets_1.5.1   processx_3.4.1     
##  [79] tidyselect_0.2.5    plyr_1.8.4          magrittr_1.5       
##  [82] R6_2.4.1            IRanges_2.18.3      generics_0.0.2     
##  [85] Hmisc_4.3-0         DBI_1.0.0           foreign_0.8-72     
##  [88] pillar_1.4.2        haven_2.2.0         withr_2.1.2        
##  [91] mgcv_1.8-31         nnet_7.3-12         survival_3.1-7     
##  [94] modelr_0.1.5        crayon_1.3.4        rmarkdown_1.18     
##  [97] grid_3.6.1          readxl_1.3.1        data.table_1.12.6  
## [100] callr_3.4.0         reprex_0.3.0        digest_0.6.22      
## [103] stats4_3.6.1        munsell_0.5.0       sessioninfo_1.1.1

Microbiome Analysis (16s) with phyloseq and LDM

Jessica Randall

12/16/2019

Definition of terms

Linear Decomposition Model

False Discovery Rate (FDR)

Loading R packages

Import and prepare data

Exploratory Data Analysis with phyloseq

Richness estimation

Alpha Diversity

Correspondance Analysis

Beta Diversity

Scree Plot

Phyla-specific abundance plot

Statistical Analysis with LDM

Correspondance Analysis

Fit the LDM

Scree Plot

Global Hypothesis Testing

ASV-specific Hypothesis Testing