Characterization of the Populus trichocarpa wood mycobiome

Gillian E. Bergmann (gillian.bergmann@gmail.com)

This report provides the code used in data exploration and preliminary analysis. The original code is saved in the following scripts:

data_exploration.R
diversity.R
core_taxa.R

In these analyses, I first re-checked the quality of the MiSeq data to evaluate if it would be adequate for analysis. I then plotted the taxonomic composition of the wood mycobiome for all samples across the three disease categories (healthy, non-canker, canker), and calculated the Shannon diversity for all samples. I plotted NMDS ordinations of the Bray-Curtis dissimilarity between samples across disease categories, dominant source populations, and disease * population. I also ordinated the canker and non-canker samples separately. I then ran a one-way ANOVA of Shannon diversity by disease category, and a PerMANOVA of Bray-Curtis dissimilarity by disease, population and disease*population.

After testing if the variation between disease categories and populations was significant, I created vectors of OTU abundances and overlaid them on the NMDS plot of samples by disease category to identify if certain fungi were driving the variation in community composition. I then ran an indicator species analysis to identify fungi that were significantly associated with a single disease category or source population.

Finally, I used the abundance-occupancy distribution method developed by Shade and Stopnisek (2019) to identify if there was a core mycobiome across all trees, across healthy trees, and across diseased trees.

Loading the relevant packages and data

I first had to read in the following files: -a table of OTUs identified by sequencing depth for each sample -an updated metadata table with information on genotype source populations and sites -the phyloseq object of clustered OTUs that was produced from the bioinformatics process I then created several subset phyloseq objects for the different analyses. The first is a phyloseq object with a 2.5% OTU prevalence threshold (removes rare OTUs). The next ones are phyloseq objects of healthy trees only and diseased trees only. The last phyloseq objects a subset of samples from the four most common source populations sampled (Core, South, Columbia, BC), with two OTU prevalence thresholds: 0.412% (to remove singletons) and 2.5%.

# need to update this to just read in the correct subsetted phyloseq objects (skip subsetting in the Rmd itself)
library(tidyverse)
library(magrittr)
library(phyloseq)
library(vegan)
library(ggthemes)
library(dada2)
library(ggplot2)
library(ggpubr)
library(viridis)
library(indicspecies)
library(reshape2)
library(devtools)
install_github("pmartinezarbizu/pairwiseAdonis/pairwiseAdonis")
library(pairwiseAdonis)

depth <- read.csv("../output/analysis/June/depth.csv")
clustered.meta <- read.csv("../output/analysis/June/clustered.meta.csv", row.names = 1)
clustered.phy <- readRDS("../output/analysis/June/clustered.phy.rds") #this phyloseq object has the raw read counts, and singletons are removed
sample_data(clustered.phy) <- clustered.meta
clust.rich.phy <- readRDS("../output/analysis/June/clust.rich.phy.rds") #this phyloseq object has proportional read counts, and all OTUs are included 
clust.prop.phy <- clustered.phy %>% transform_sample_counts(function(x){x*min(sample_sums(.)/sum(x))}) # this relativizes the raw read counts into proportional abundance

minDepth <- 500
# this creates a phyloseq object with an OTU prevalence threshold of 2.5%, which will be used for NMDS ordination and vector analysis
clust2.5.phy <- (clust.prop.phy %>% filter_taxa(function(x) { sum(x>0) > 0.025*ntaxa(.) }, TRUE) %>%
                   prune_samples(sample_sums(.)>minDepth,.) %>% 
                   filter_taxa(function(x) {sum(x) > 0}, TRUE))

# creating separate phyloseq objects for healthy and diseased trees
healthy.phy <- subset_samples(clust.prop.phy, Disease=="Healthy") 
clust.dis.phy <- subset_samples(clust.prop.phy, Disease=="Canker"|Disease=="Non-canker")
# creating separate phyloseq objects for the four most common source populations
pop.phy <- subset_samples(clust.prop.phy, Pop=="Core"|Pop=="South"|Pop=="BC"|Pop=="Columbia")
pop2.5.phy <- subset_samples(clust2.5.phy, Pop=="Core"|Pop=="South"|Pop=="BC"|Pop=="Columbia")
pop2.5.meta <- sample_data(pop2.5.phy) %>% data.frame
core.s.phy <- subset_samples(clust2.5.phy, Pop=="Core"|Pop=="South")

Checking data quality

Here, I am plotting the number of OTUs identified against the sequencing depth for each sample. The trendline is based on the XX formula, which…

OTU.depth.curve <- ggplot(depth, aes(x=Depth, y=OTU_count)) +
  geom_point() +
  geom_smooth(method = "loess") # Posy had an issue with using this formula
plot(OTU.depth.curve)

## `geom_smooth()` using formula 'y ~ x'

Visualizing taxonomic composition and diversity of the community

I first wanted to look at the taxonomic composition of the samples sorted by disease category.

# setting the theme and correcting taxonomy formatting
theme_set(theme_bw())
clust.prop.phy@tax_table@.Data %<>% parse_taxonomy_greengenes()
pop.phy@tax_table@.Data %<>% parse_taxonomy_greengenes()

# Plotting distribution of orders by disease category 
dis.orders <- plot_bar(clust.prop.phy, x="Sample", fill = "Order") + 
  geom_bar(aes(fill=Order), stat="identity", position="stack", color="gray33") +
  facet_wrap(~Disease, nrow = 1, scales = "free_x", drop = TRUE)+ 
  theme(legend.position = "bottom", legend.key.size = unit(0.2, "cm"), axis.text.x = element_blank()) + 
  ylab("Proportional abundance") + ggtitle("Figure 1a. Taxonomic composition of wood samples by disease category") +
  scale_fill_viridis(begin = 0, end = 1, discrete = T, option = "C")
plot(dis.orders)

I also wanted to look at taxonomic composition in the four common source populations, which I did with this script.

theme_set(theme_bw())
pop.phy@tax_table@.Data %<>% parse_taxonomy_greengenes()

## Warning in parse_taxonomy_greengenes(.): No greengenes prefixes were found. 
## Consider using parse_taxonomy_default() instead if true for all OTUs. 
## Dummy ranks may be included among taxonomic ranks now.

pop.orders <- plot_bar(pop.phy, x="Sample", fill = "Order") + 
  geom_bar(aes(fill=Order), stat="identity", position="stack", color="gray33") +
  facet_wrap(~Pop, nrow = 1, scales = "free_x", drop = TRUE)+ 
  theme(legend.position = "bottom", legend.key.size = unit(0.3, "cm"), axis.text.x = element_blank()) + 
  ylab("Proportional abundance") + ggtitle("Figure 1b. Taxonomic composition of wood samples by common source populations") +
  scale_fill_viridis(begin = 0, end = 1, discrete = T, option = "C")
plot(pop.orders)

Since the Core and South populations have a lot of samples, I created a separate plot for them so that their taxonomic composition is easier to see

theme_set(theme_bw())
core.s.phy@tax_table@.Data %<>% parse_taxonomy_greengenes()
core.s.orders <- plot_bar(core.s.phy, x="Sample", fill = "Order") + 
  geom_bar(aes(fill=Order), stat="identity", position="stack", color="gray33") +
  facet_wrap(~Pop, nrow = 1, scales = "free_x", drop = TRUE)+ 
  theme(legend.position = "bottom", legend.key.size = unit(0.3, "cm"), axis.text.x = element_blank()) + 
  ylab("Proportional abundance") + ggtitle("Figure 1c. Taxonomic composition of wood samples in Core and South populations") +
  scale_fill_viridis(begin = 0, end = 1, discrete = T, option = "C")
plot(core.s.orders)

After viewing the taxonomic composition across samples, I wanted to calculate and vizualize the Shannon diversity for the samples. (Note: still have to add population to the rich phyloseq metadata)

# calculating shannon diversity for each sample
otu.table <- otu_table(clust.rich.phy)
rich.meta <- sample_data(clust.rich.phy) %>% data.frame
otu.shannon <- diversity(otu.table, index = "shannon", MARGIN = 1, base = exp(1))
otu.shannon.df <- as.data.frame(otu.shannon)
otu.shannon.df$Disease <- rich.meta$Disease
otu.shannon.df$Genotype <- rich.meta$Genotype
view(head(otu.shannon.df))
# plotting the distribution of shannon diversity values by disease category
otu.shannon.box <- ggplot(otu.shannon.df, aes(x=Disease, y=otu.shannon, fill = Disease)) +
  geom_boxplot() +
  ylab("Shannon diversity") +
  theme(legend.position = "none") +
  ggtitle("Figure 2. Shannon diversity by disease category")
plot(otu.shannon.box)

NMDS ordinations

After looking at the taxonomic composition of the samples, I wanted to see if there was variation in community composition between disease categories and common source populations. I did this using Non-Metric Dimensional Scaling (NMDS) ordination of the Bray-Curtis dissimilarity between samples. I started by ordinating all the samples with the 2.5% OTU prevalence cutoff. This ordination did not converge, and the resulting plot had multiple outliers. I removed these outlier samples from the dataset, and tried the ordination again.

# creating an NMDS ordination of samples with the 2.5% OTU prevalence cutoff
clust.bray <- clust2.5.phy %>% phyloseq::distance("bray") %>% sqrt
clust.nmds <- metaMDS(clust.bray, trymax = 200, parallel = 10, k=3)

print(clust.nmds)

## 
## Call:
## metaMDS(comm = clust.bray, k = 3, trymax = 200, parallel = 10) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     clust.bray 
## Distance: bray 
## 
## Dimensions: 3 
## Stress:     0.1270785 
## Stress type 1, weak ties
## No convergent solutions - best solution after 200 tries
## Scaling: centring, PC rotation 
## Species: scores missing

clust.nmds.dat <- scores(clust.nmds) %>% data.frame %>% rownames_to_column("sampID") %>%
  full_join(sample_data(clust2.5.phy) %>% data.frame %>% rownames_to_column("sampID"))

## Joining, by = "sampID"

clust.plot <- ggplot(clust.nmds.dat,aes(x=NMDS1,y=NMDS2,fill=Disease)) +
  geom_point(shape=21,size=3)+
  stat_ellipse(mapping = aes(color = Disease)) +
  scale_fill_viridis(discrete = T, option = "C") +
  scale_color_viridis(discrete = T, option = "C") +
  ggthemes::theme_few()
plot(clust.plot)

  ## outliers: myco.03.02.f, myco.06.04.d, myco.03.04.a, myco.07.02.c, myco.07.05.d, myco.04.02.e

# removing outliers
clust.in.phy <- subset_samples(clust2.5.phy, sample_names(clust2.5.phy) !="myco.03.02.f"| sample_names(clust2.5.phy) !="myco.06.04.d"|
                                 sample_names(clust2.5.phy) !="myco.07.02.c"| sample_names(clust2.5.phy) !="myco.07.05.d"|
                                 sample_names(clust2.5.phy) !="myco.04.02.e"| sample_names(clust2.5.phy) != "myco.03.02.f"|
                                 sample_names(clust2.5.phy) !="myco.06.04.d"|sample_names(clust2.5.phy) != "myco.03.04.a") 

# running an NMDS ordination for the data with outliers removed
clust.in.bray <- clust.in.phy %>% phyloseq::distance("bray") %>% sqrt
clust.in.nmds <- metaMDS(clust.in.bray, trymax = 200, parallel = 10)

print(clust.in.nmds)

## 
## Call:
## metaMDS(comm = clust.in.bray, trymax = 200, parallel = 10) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     clust.in.bray 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.1795004 
## Stress type 1, weak ties
## No convergent solutions - best solution after 200 tries
## Scaling: centring, PC rotation 
## Species: scores missing

clust.in.dat <- scores(clust.in.nmds) %>% data.frame %>% rownames_to_column("sampID") %>%
  full_join(sample_data(clust.in.phy) %>% data.frame %>% rownames_to_column("sampID"))

## Joining, by = "sampID"

inlier.plot <- ggplot(clust.in.dat,aes(x=NMDS1,y=NMDS2,fill=Disease)) +  
  geom_point(shape=21,size=3)+
  stat_ellipse(mapping = aes(color = Disease)) +
  scale_fill_viridis(discrete = T, option = "C") +
  scale_color_viridis(discrete = T, option = "C") +
  ggtitle("Figure 3a. Community composition by disease category") +
  ggthemes::theme_few()
plot(inlier.plot)

I also ordinated diseased tree samples separately, and I ordinated the samples by source population.

# NMDS ordination of wood samples from infected trees
  ## outliers to remove before running ordination: myco.04.02.e, myco.06.04.d
#clust.dis.phy <- subset_samples(clust.dis.phy, sample_names(clust.dis.phy) != "myco.06.04.d"|sample_names(clust.dis.phy) !="myco.04.02.e")
clust.dis.bray <- clust.dis.phy %>% phyloseq::distance("bray") %>% sqrt                 
clust.dis.nmds <- metaMDS(clust.dis.bray, trymax = 500, parallel = 10, k=3)

print(clust.dis.nmds)

## 
## Call:
## metaMDS(comm = clust.dis.bray, k = 3, trymax = 500, parallel = 10) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     clust.dis.bray 
## Distance: bray 
## 
## Dimensions: 3 
## Stress:     0.1152198 
## Stress type 1, weak ties
## No convergent solutions - best solution after 500 tries
## Scaling: centring, PC rotation 
## Species: scores missing

clust.dis.dat <-scores(clust.dis.nmds) %>% data.frame %>% rownames_to_column("sampID") %>%
  full_join(sample_data(clust.dis.phy) %>% data.frame %>% rownames_to_column("sampID"))

## Joining, by = "sampID"

diseased.plot <- ggplot(clust.dis.dat,aes(x=NMDS1,y=NMDS2,fill=Disease)) + 
  geom_point(shape=21,size=3)+
  stat_ellipse(mapping = aes(color = Disease)) +
  scale_fill_viridis(discrete = T, option = "C") +
  scale_color_viridis(discrete = T, option = "C") +
  ggtitle("Figure 3b. Community composition of wood from diseased trees") +
  ggthemes::theme_few()
plot(diseased.plot)

 ## outliers removed before ordination: myco.06.04.d, myco.07.05.d, myco.03.02.f
#pop2.5.phy %<>% subset_samples(pop2.5.phy, sample_names(pop2.5.phy) != "myco.06.04.d"|sample_names(pop2.5.phy) !="myco.07.05.d"|sample_names(pop2.5.phy) !="myco.03.02.f")
# ordinating by population
pop2.5.bray <- pop2.5.phy %>% phyloseq::distance("bray") %>% sqrt
pop2.5.nmds <- metaMDS(pop2.5.bray, trymax = 200, parallel = 10, k=3)

print(pop2.5.nmds)

## 
## Call:
## metaMDS(comm = pop2.5.bray, k = 3, trymax = 200, parallel = 10) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     pop2.5.bray 
## Distance: bray 
## 
## Dimensions: 3 
## Stress:     0.1246627 
## Stress type 1, weak ties
## No convergent solutions - best solution after 200 tries
## Scaling: centring, PC rotation 
## Species: scores missing

pop2.5.dat <- scores(pop2.5.nmds) %>% data.frame %>% rownames_to_column("sampID") %>%
  full_join(sample_data(pop2.5.phy) %>% data.frame %>% rownames_to_column("sampID"))

## Joining, by = "sampID"

ggplot(pop2.5.dat, aes(x=NMDS1, y=NMDS2, color=Pop)) + 
  geom_point(size=3)+
  scale_color_viridis(discrete = T, option = "C") +
  ggtitle("Figure 3c. Community composition of wood by source population") +
  ggthemes::theme_few()

Finally, I ordinated the samples from the four most common populations by population and disease category

pop2.5.bray <- pop2.5.phy %>% phyloseq::distance("bray") %>% sqrt
pop2.5.nmds <- metaMDS(pop2.5.bray, trymax = 200, parallel = 10, k=3)
pop2.5.dat <- scores(pop2.5.nmds) %>% data.frame %>% rownames_to_column("sampID") %>%
  full_join(sample_data(pop2.5.phy) %>% data.frame %>% rownames_to_column("sampID"))

## Joining, by = "sampID"

ggplot(pop2.5.dat, aes(x=NMDS1, y=NMDS2, shape=Disease, color=Pop)) + 
  geom_point(size=3)+
  scale_color_viridis(discrete = T, option = "C") +
  ggtitle("Figure 3d. Community composition of wood by disease and population") +
  ggthemes::theme_few()

Testing the significance of mycobiome variation

After visualizing the diversity and community composition of the samples, I tested if the differences in these metrics between disease categories and populations were significant. For shannon diversity, I used a one-way analysis of variance (ANOVA). I used a permutational analysis of variation (PerMANOVA) on Bray-Curtis dissimilarity

# ANOVA of shannon diversity by disease category
shannon.aov <- aov(otu.shannon ~ Disease, data = otu.shannon.df) 
summary(shannon.aov)

##              Df Sum Sq Mean Sq F value  Pr(>F)    
## Disease       2    8.5   4.251   10.77 3.1e-05 ***
## Residuals   287  113.3   0.395                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(shannon.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = otu.shannon ~ Disease, data = otu.shannon.df)
## 
## $Disease
##                           diff         lwr       upr     p adj
## Healthy-Canker      0.41603793  0.20430522 0.6277706 0.0000166
## Non-canker-Canker   0.33060821  0.05150602 0.6097104 0.0154478
## Non-canker-Healthy -0.08542971 -0.32680376 0.1559443 0.6823163

# ANOVA of shannon diversity by population - need to add
#shannon.pop.aov <- aov(otu.shannon ~ Population, data = otu.shannon.df)
#summary(shannon.aov)
# PerMANOVA of Bray-Curtis dissimilarity
pop2.5.dist <- pop2.5.phy %>% phyloseq::distance("bray")
adonis(pop2.5.dist~Disease*Pop, data = pop2.5.meta)

## 
## Call:
## adonis(formula = pop2.5.dist ~ Disease * Pop, data = pop2.5.meta) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##              Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Disease       2     5.172 2.58625  9.0767 0.09857  0.001 ***
## Pop           3     1.176 0.39205  1.3759 0.02241  0.098 .  
## Disease:Pop   6     2.249 0.37483  1.3155 0.04286  0.090 .  
## Residuals   154    43.879 0.28493         0.83616           
## Total       165    52.477                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Pairwise adonis tests of Bray-Curtis dissimilarity between disease categories
pairwise.adonis(pop2.5.dist, factors = sample_data(pop2.5.phy)$Disease)

##                   pairs Df SumsOfSqs   F.Model          R2 p.value p.adjusted
## 1     Canker vs Healthy  1 4.6718673 16.333596 0.107222546   0.001      0.003
## 2  Canker vs Non-canker  1 2.4443563  8.407382 0.110033633   0.001      0.003
## 3 Healthy vs Non-canker  1 0.3452548  1.172019 0.009515302   0.300      0.900
##   sig
## 1   *
## 2   *
## 3

These tests show that the variation in community composition and diversity is significantly associated with disease category, but not source population. Moreover, the association with disease category is so strong that the population*disease association is significant. Through the pairwise tests, we also see that this significant variation between disease categories is primarily due to differences between canker samples and the other sample types. This makes sense based on the plots

Vector analysis

After seeing that the variation in communities was significant between disease categories, I wanted to identify OTUs that were driving this variation using vector analysis. To do this, I first used envfit to calculate the correlation between variables and the NMDS plot. These variables included the OTUs, as well as metadata (plate number etc.).

clust.in.otu <- otu_table(clust.in.phy) %>% data.frame()
clust.in.meta <- sample_data(clust.in.phy) %>% data.frame()
clust.in.var <- as.data.frame(scores(clust.in.nmds, display = 'sites'))
clust.in.var <- clust.in.meta %>% cbind(clust.in.var) 
clust.in.var <- clust.in.otu %>% cbind(clust.in.var) 
clust.in.fit <- envfit(clust.in.nmds, clust.in.var, na.rm = TRUE)
#pval.adjust <- p.adjust(clust.fit$pvals, method = "fdr")
#clust.fit$vectors$pvals <- pval.adjust
clust.in.filt <- data.frame(r = clust.in.fit$vectors$r,pvals = clust.in.fit$vectors$pvals) %>%
  rownames_to_column(var = "var") %>%
  filter(r >= 0.2, pvals <= 0.05)

clust.var.scrs <- as.data.frame(scores(clust.in.fit, display = "vectors"))
clust.var.scrs <- cbind(clust.var.scrs, variables = rownames(clust.var.scrs))
clust.var.scrs %<>% filter(variables %in% clust.in.filt$var)

vectors <- ggplot(clust.in.dat) +
  geom_point(mapping = aes(x = NMDS1, y = NMDS2, color = Disease)) +
  coord_fixed() +
  geom_segment(data = clust.var.scrs, size = 2,
               aes(x = 0, xend = NMDS1*.8, y = 0, yend = NMDS2*.8),
               arrow = arrow(length = unit(0.5, "cm")), colour = "black") +
  geom_text(data = clust.var.scrs, aes(x = NMDS1, y = NMDS2, label = variables), size = 5, hjust = 1, vjust = 0)
plot(vectors)

Based on this initial analysis, I then re-did the fit for just the OTUs that were significantly correlated with the ordination.

clust.spp.var <- clust.in.otu %>% dplyr::select("S. musiva" = OTU.1, 
                                             "M. tassiana" = OTU.2, 
                                             "A. pullulans" = OTU.3, 
                                             "C. subcutanea" = OTU.8)
clust.spp.fit <- envfit(clust.in.nmds, clust.spp.var)
clust.spp.filt <- data.frame(r = clust.spp.fit$vectors$r,pvals = clust.spp.fit$vectors$pvals) %>%
  rownames_to_column(var = "var") %>%
  filter(r >= 0.2, pvals <= 0.05)

clust.spp.scrs <- as.data.frame(scores(clust.spp.fit, display = "vectors"))
clust.spp.scrs <- cbind(clust.spp.scrs, variables = rownames(clust.spp.scrs))
clust.spp.scrs %<>% filter(variables %in% clust.spp.filt$var)

otu.vectors <- ggplot(clust.in.dat) +
  geom_point(mapping = aes(x = NMDS1, y = NMDS2, color = Disease)) +
  scale_color_viridis(discrete = T, option = "C", "Disease") +
  coord_fixed() +
  geom_segment(data = clust.spp.scrs, size = 2,
               aes(x = 0, xend = NMDS1*.8, y = 0, yend = NMDS2*.8),
               arrow = arrow(length = unit(0.5, "cm")), colour = "black") +
  geom_text(data = clust.spp.scrs, aes(x = NMDS1, y = NMDS2, label = variables), 
            size = 3, hjust = 0.5, vjust = 0) +
  ggtitle("Figure 4. OTUs significantly correlated with community variation in wood") +
  ggthemes::theme_few()
plot(otu.vectors)

Identifying indicator species

Another way to look at differences in composition between sample groups is to identify species that are significantly associated with a single group (indicator species). I wanted to look for indicator species of individual disease categories. This analysis produced Table 1, which is listed below.

disease.cat <- c(rep(1,68), rep(2,171), rep(3,47)) # note: group 1 is canker, group 2 is healthy, group 3 is non-canker
clust.otu <- read.csv("../output/analysis/June/clust.otu.csv", as.is = T, row.names = 1) 
disease.indval <- multipatt(clust.otu, disease.cat, control = how(nperm = 999))
summary(disease.indval)

## 
##  Multilevel pattern analysis
##  ---------------------------
## 
##  Association function: IndVal.g
##  Significance level (alpha): 0.05
## 
##  Total number of species: 162
##  Selected number of species: 30 
##  Number of species associated to 1 group: 24 
##  Number of species associated to 2 groups: 6 
## 
##  List of species associated to each combination: 
## 
##  Group 1  #sps.  20 
##          stat p.value    
## OTU.1   0.745   0.001 ***
## OTU.22  0.503   0.001 ***
## OTU.5   0.444   0.001 ***
## OTU.12  0.409   0.007 ** 
## OTU.58  0.383   0.001 ***
## OTU.35  0.321   0.016 *  
## OTU.157 0.297   0.001 ***
## OTU.162 0.290   0.042 *  
## OTU.54  0.286   0.014 *  
## OTU.34  0.268   0.008 ** 
## OTU.101 0.256   0.023 *  
## OTU.183 0.239   0.037 *  
## OTU.355 0.238   0.012 *  
## OTU.45  0.237   0.042 *  
## OTU.510 0.210   0.024 *  
## OTU.254 0.210   0.016 *  
## OTU.247 0.210   0.021 *  
## OTU.418 0.209   0.029 *  
## OTU.18  0.206   0.031 *  
## OTU.284 0.205   0.035 *  
## 
##  Group 2  #sps.  1 
##         stat p.value  
## OTU.27 0.392   0.033 *
## 
##  Group 3  #sps.  3 
##          stat p.value   
## OTU.92  0.350   0.018 * 
## OTU.679 0.251   0.004 **
## OTU.457 0.205   0.030 * 
## 
##  Group 1+3  #sps.  1 
##        stat p.value    
## OTU.10 0.51   0.001 ***
## 
##  Group 2+3  #sps.  5 
##         stat p.value    
## OTU.23 0.625   0.001 ***
## OTU.65 0.427   0.044 *  
## OTU.57 0.425   0.016 *  
## OTU.79 0.405   0.023 *  
## OTU.70 0.371   0.019 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this analysis, we see that there are a number of indicator species of cankered wood (Group 1), including S. musiva (OTU.1). The other interesting thing to note is that when we look for indicators of multiple disease categories, there are no indicator species shared by cankered and healthy wood.

Identifying core taxa

After these analyses of the whole mycobiome, I became interested in determining if there was a core mycobiome. The core myobiome is a group of fungi that are consistently present across samples, sample populations, or time points (Shade and Handelsman 2012). The core can be identified by membership, abundance or both. For this project, I defined the core as taxa that are consistently present across all samples and disease categories, and that have stable relative abundance across these samples. This definition aligns with the abundance-occupancy distribution method for identifying core microbiomes proposed by Shade and Stopnisek (2019). To identify the core, I used code modified from a script provided in Shade and Stopnisek (2019). I first applied this script to all samples. Once the putative core taxa were identified, I created a subset phyloseq object of just these taxa for further exploration.

# Note: still need to upload these files to the lab computer
nReads <- 501 # input dataset needs to be rarified and the rarifaction depth included - achieved this with proportional abundance   
otu <- readRDS("../output/analysis/June/OTU.prop.rds")
meta <- readRDS("../output/analysis/June/OTU.meta.rds")

### Calculating occupancy and abundance ###
otu_PA <- 1*((otu>0)==1)                                               # presence-absence data
otu_occ <- rowSums(otu_PA)/ncol(otu_PA)                                # occupancy calculation
otu_rel <- apply(decostand(otu, method="total", MARGIN=2),1, mean)     # relative abundance  
occ_abun <- add_rownames(as.data.frame(cbind(otu_occ, otu_rel)),'otu') # combining occupancy and abundance 

### Making data frame of OTU indexes ###
PresenceSum <- data.frame(otu = as.factor(row.names(otu)), otu) %>% 
  gather(Samp_ID, abun, -otu) %>%
  left_join(meta, by = 'Samp_ID') %>%
  group_by(otu, Disease) %>%
  summarise(plot_freq=sum(abun>0)/length(abun),        # frequency of detection between time points
            coreSite=ifelse(plot_freq == 1, 1, 0), # 1 only if occupancy 1 with specific genotype, 0 if not
            detect=ifelse(plot_freq > 0, 1, 0)) %>%    # 1 if detected and 0 if not detected with specific genotype
  group_by(otu) %>%
  summarise(sumF=sum(plot_freq),
            sumG=sum(coreSite),
            nS=length(Disease)*2,
            Index=(sumF+sumG)/nS) # calculating weighting Index based on number of time points detected and 

### Adding OTU ranks by index ###
otu_ranked <- occ_abun %>%
  left_join(PresenceSum, by='otu') %>%
  transmute(otu=otu,
            rank=Index) %>%
  arrange(desc(rank))

BCaddition <- NULL

### Making ranked matrix ###
otu_start=otu_ranked$otu[1]
start_matrix <- as.matrix(otu[otu_start,]) # only one column in start_matrix
start_matrix <- t(start_matrix)
x <- apply(combn(ncol(start_matrix), 2), 2, function(x) sum(abs(start_matrix[,x[1]]- start_matrix[,x[2]]))/(2*nReads)) 
x_names <- apply(combn(ncol(start_matrix), 2), 2, function(x) paste(colnames(start_matrix)[x], collapse=' - '))
## warnings about uninitialized column 'fill'
df_s <- data.frame(x_names,x)
names(df_s)[2] <- 1 
BCaddition <- rbind(BCaddition,df_s)

# not sure what this for loop is doing
for(i in 2:162){
  otu_add=otu_ranked$otu[i]
  add_matrix <- as.matrix(otu[otu_add,])
  add_matrix <- t(add_matrix)
  start_matrix <- rbind(start_matrix, add_matrix)
  x <- apply(combn(ncol(start_matrix), 2), 2, function(x) sum(abs(start_matrix[,x[1]]-start_matrix[,x[2]]))/(2*nReads))
  x_names <- apply(combn(ncol(start_matrix), 2), 2, function(x) paste(colnames(start_matrix)[x], collapse=' - '))
  df_a <- data.frame(x_names,x)
  names(df_a)[2] <- i 
  BCaddition <- left_join(BCaddition, df_a, by=c('x_names'))  
}

x <-  apply(combn(ncol(otu), 2), 2, function(x) sum(abs(otu[,x[1]]-otu[,x[2]]))/(2*nReads))
x_names <- apply(combn(ncol(otu), 2), 2, function(x) paste(colnames(otu)[x], collapse=' - '))
df_full <- data.frame(x_names,x)
names(df_full)[2] <- length(rownames(otu))
BCfull <- left_join(BCaddition,df_full, by='x_names')

rownames(BCfull) <- BCfull$x_names
temp_BC <- BCfull
temp_BC$x_names <- NULL
temp_BC_matrix <- as.matrix(temp_BC)

BC_ranked <- data.frame(rank = as.factor(row.names(t(temp_BC_matrix))),t(temp_BC_matrix)) %>% 
  gather(comparison, BC, -rank) %>%
  group_by(rank) %>%
  summarise(MeanBC=mean(BC)) %>%
  arrange(-desc(MeanBC)) %>%
  mutate(proportionBC=MeanBC/max(MeanBC))
Increase=BC_ranked$MeanBC[-1]/BC_ranked$MeanBC[-length(BC_ranked$MeanBC)]
increaseDF <- data.frame(IncreaseBC=c(0,(Increase)), rank=factor(c(1:(length(Increase)+1))))
BC_ranked <- left_join(BC_ranked, increaseDF)
BC_ranked <- BC_ranked[-nrow(BC_ranked),]

fo_difference <- function(pos){
  left <- (BC_ranked[pos, 2] - BC_ranked[1, 2]) / pos
  right <- (BC_ranked[nrow(BC_ranked), 2] - BC_ranked[pos, 2]) / (nrow(BC_ranked) - pos)
  return(left - right)
}
BC_ranked$fo_diffs <- sapply(1:nrow(BC_ranked), fo_difference)

elbow <- which.max(BC_ranked$fo_diffs)

### Plotting ranked OTUs to find core taxa ###
ggplot(BC_ranked[1:162,], aes(x=factor(BC_ranked$rank[1:162], levels=BC_ranked$rank[1:162]))) +
  geom_point(aes(y=proportionBC)) +
  theme_classic() + theme(strip.background = element_blank(),axis.text.x = element_text(size=7, angle=45)) +
  geom_vline(xintercept=elbow, lty=3, col='red', cex=.5) +
  geom_vline(xintercept=last(as.numeric(BC_ranked$rank[(BC_ranked$IncreaseBC>=1.02)])), lty=3, col='blue', cex=.5) +
  labs(x='ranked OTUs',y='Bray-Curtis similarity') +
  annotate(geom="text", x=elbow+10, y=.15, label=paste("Elbow method"," (",elbow,")", sep=''), color="red")+    
  annotate(geom="text", x=last(as.numeric(BC_ranked$rank[(BC_ranked$IncreaseBC>=1.02)]))-4, y=.08, label=paste("Last 2% increase (",last(as.numeric(BC_ranked$rank[(BC_ranked$IncreaseBC>=1.02)])),')',sep=''), color="blue")

occ_abun$fill <- 'no'
occ_abun$fill[occ_abun$otu %in% otu_ranked$otu[1:last(as.numeric(BC_ranked$rank[(BC_ranked$IncreaseBC>=1.02)]))]] <- 'core'

# creating a phyloseq object of core taxa
clust.prop.tax <- tax_table(clust.prop.phy) %>% data.frame
clust.prop.tax <- occ_abun$fill %>% cbind(clust.prop.tax)
names(clust.prop.tax)[names(clust.prop.tax)=="."] <- "fill"
clust.prop.tax <- as.matrix(clust.prop.tax)
tax_table(clust.prop.phy) <- clust.prop.tax
core.tax.phy <- subset_taxa(clust.prop.phy, fill=='core')
core.tax.phy <- prune_samples(sample_sums(core.tax.phy)>0, core.tax.phy)

Variation in core mycobiome by disease and source population

After identifying the core mycobiome of my samples, I wanted to see if the core also varied in association with disease category. I used ordination and PerMANOVA to do this.

core.tax.bray <- core.tax.phy %>% phyloseq::distance("bray") %>% sqrt
core.tax.nmds <- metaMDS(core.tax.bray, trymax = 200, parallel = 10, k=3)

print(core.tax.nmds)

## 
## Call:
## metaMDS(comm = core.tax.bray, k = 3, trymax = 200, parallel = 10) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     core.tax.bray 
## Distance: bray 
## 
## Dimensions: 3 
## Stress:     0.1335893 
## Stress type 1, weak ties
## No convergent solutions - best solution after 200 tries
## Scaling: centring, PC rotation 
## Species: scores missing

core.tax.dat <- scores(core.tax.nmds) %>% data.frame %>% rownames_to_column("sampID") %>%
  full_join(sample_data(core.tax.phy) %>% data.frame %>% rownames_to_column("sampID"))

## Joining, by = "sampID"

ggplot(core.tax.dat,aes(x=NMDS1,y=NMDS2,fill=Disease))+ 
  geom_point(shape=21,size=3)+
  stat_ellipse(mapping = aes(color = Disease)) +
  scale_fill_viridis(discrete = T, option = "C") +
  scale_color_viridis(discrete = T, option = "C") +
  ggtitle("Figure 5. Core composition by disease category") +
  ggthemes::theme_few()

core.tax.dist <- core.tax.phy %>% phyloseq::distance("bray")
adonis(core.tax.dist~sample_data(core.tax.phy)$Disease)

## 
## Call:
## adonis(formula = core.tax.dist ~ sample_data(core.tax.phy)$Disease) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##                                    Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)
## sample_data(core.tax.phy)$Disease   2     7.534  3.7670  13.944 0.09058  0.001
## Residuals                         280    75.642  0.2702         0.90942       
## Total                             282    83.176                 1.00000       
##                                      
## sample_data(core.tax.phy)$Disease ***
## Residuals                            
## Total                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

core.pairs<-pairwise.adonis(core.tax.dist,factors=sample_data(core.tax.phy)$Disease)
print(core.pairs)

##                   pairs Df SumsOfSqs    F.Model          R2 p.value p.adjusted
## 1     Canker vs Healthy  1 7.0005346 26.0865288 0.100299423   0.001      0.003
## 2  Canker vs Non-canker  1 3.7648110 13.3696644 0.107499401   0.001      0.003
## 3 Healthy vs Non-canker  1 0.2513555  0.9442597 0.004372701   0.460      1.000
##   sig
## 1   *
## 2   *
## 3

This shows that disease category is also important in structuring the core mycobiome, particularly for the cankered samples.

References

Shade, A., & Handelsman, J. (2012). Beyond the Venn diagram: the hunt for a core microbiome. Environmental microbiology, 14(1), 4-12.

Shade, A., & Stopnisek, N. (2019). Abundance-occupancy distributions to prioritize plant core microbiome membership. Current opinion in microbiology, 49, 50-58.