Background: Dust entering the Sierras comes from both regional and global sources – and the contribution of each source changes over time. There is evidence that microbial communities can differ based on dust source and timing of dust deflation, and that this can have subsequent effects on endophytic associations (Caroline Frank’s work) and soil microbial communities at the deposition location. We therefore wanted to know:
Limitation: our time comparisons may be skewed because we did not sterilize collectors in between sampling events. Time points may appear more similar to each other than in reality. It’s also important to run on presence-absence data only; we don’t want to capture community differences that are driven by post-depositional divergence in relative abundances.
Split taxonomy column into multiple variables
# bacteria otu
B16S<-read.csv("data/filtered_table_w_metadata.csv", header = TRUE, stringsAsFactors = FALSE)
# split taxa groupings into new columns
source('dust_functions.R')
B16S<-split_taxa(B16S)
UniFrac is a measure of B-diversity that uses phylogenetic information to compare community samples. Use with multivariate methods like principal coordinates (PCoA), NMDS, etc to explain differences among communities. Measured as the amount of evolutionary history that is unique to each group - the fraction of branch length in a phylogenetic tree associated with ONLY the focal sample in pairwise comparison. With sequence data UniFrac values can be influenced by the number of sequences/sample and can use sequence jackknifing (Lozupone et al 2011. ISME).
Use to compare phylogenetic lineages between groups, to cluster samples.
unifrac<-read.table('data/unweighted_unifrac_dm.txt')
unifrac_wt<-read.table('data/weighted_unifrac_dm.txt')
Use heatmap to visualize sample dissimilarity based on UniFrac
otu.taxa<-read.csv('data/OTU_taxa_id.csv') # otu ids
otu.comm<-read.csv('data/OTU_community_clean.csv')
#otu.grp<-otu.comm%>%left_join(map6, by='SampleID')
## the samples in the unifrac dist and community need to be in the same order
# need to order community based on unifraq
colnames(unifrac) # use this order
## [1] "OS115" "AP114" "JJ415" "OT115" "JJ515" "OS615" "OJ615" "JT415"
## [9] "JP315" "OJ514" "OJ515" "OJ415" "AS114" "JS415" "JP414" "OT515"
## [17] "JJ114" "OT415" "OT414" "JP515" "JT115" "AJ414" "OP315" "OS214"
## [25] "JT314" "OP515" "AT414" "OS415" "OP615" "OP614" "JS115"
# make new df based on unifrac, matched to same order as unifrac
otu.ordered<-data.frame(SampleID = colnames(unifrac))%>%left_join(otu.comm, by='SampleID')
otu.ordered$SampleID
## [1] OS115 AP114 JJ415 OT115 JJ515 OS615 OJ615 JT415 JP315 OJ514 OJ515
## [12] OJ415 AS114 JS415 JP414 OT515 JJ114 OT415 OT414 JP515 JT115 AJ414
## [23] OP315 OS214 JT314 OP515 AT414 OS415 OP615 OP614 JS115
## 31 Levels: AJ414 AP114 AS114 AT414 JJ114 JJ415 JJ515 JP315 JP414 ... OT515
# this works but is inefficient - there are multiple copies of the otu table which is large, remember to remove this file after data wrangling to free up workspace on Rstudio (rm command works to remove files)
# alternatively - match the order, rewrite over same name
otu.comm<-otu.comm[match(colnames(unifrac), otu.comm$SampleID),]
# this says - for the community df, order the rows to match the order of the column names in the file with the unifrac dissimilarity matrix i.e., unifrac, that match to SampleID
rm(otu.ordered)
#combine with grouping variables - align mapping data with community data
otu.grp<-otu.comm%>%left_join(map6, by='SampleID')
#instead of making the df larger, we can reorder the mapping data to match with the community, like above
grps<-map6[match(colnames(unifrac), map6$SampleID),]
grps$SampleID
## [1] OS115 AP114 JJ415 OT115 JJ515 OS615 OJ615 JT415 JP315 OJ514 OJ515
## [12] OJ415 AS114 JS415 JP414 OT515 JJ114 OT415 OT414 JP515 JT115 AJ414
## [23] OP315 OS214 JT314 OP515 AT414 OS415 OP615 OP614 JS115
## 31 Levels: AJ414 AP114 AS114 AT414 JJ114 JJ415 JJ515 JP315 JP414 ... OT515
# now that all the data (community, distances, grouping variables) are ordered the same, we can use permanova etc
Permanova tests for differences in composition among groups
Reminder - permanova is always based on pairwise distances/dissimilarities.
Can either nest Month in Year with Year/Month Or, can restrict permutations within year, using strata=‘Year’
set.seed(304)
#unifrac distances
ad.uni<-adonis2(unifrac~Month+Elevation, data=grps, permutations=1000, strata='Year')
ad.uni
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 1000
##
## adonis2(formula = unifrac ~ Month + Elevation, data = grps, permutations = 1000, strata = "Year")
## Df SumOfSqs R2 F Pr(>F)
## Month 3 1.1164 0.13439 1.5389 0.002997 **
## Elevation 1 0.9037 0.10878 3.7372 0.000999 ***
## Residual 26 6.2873 0.75683
## Total 30 8.3074 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#unifrac distances with month nested in year
ad.uniNest<-adonis2(unifrac~Year/Month+Elevation, data=grps, permutations=1000)
ad.uniNest
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 1000
##
## adonis2(formula = unifrac ~ Year/Month + Elevation, data = grps, permutations = 1000)
## Df SumOfSqs R2 F Pr(>F)
## Year 1 0.5142 0.06190 2.1383 0.002997 **
## Elevation 1 0.8993 0.10825 3.7396 0.000999 ***
## Year:Month 3 0.8822 0.10620 1.2229 0.058941 .
## Residual 25 6.0117 0.72366
## Total 30 8.3074 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#unifrac distances weighted by relative abundance
ad.uniwt<-adonis2(unifrac_wt~Month+Year+Elevation, data=grps, permutations=1000)
ad.uniwt
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 1000
##
## adonis2(formula = unifrac_wt ~ Month + Year + Elevation, data = grps, permutations = 1000)
## Df SumOfSqs R2 F Pr(>F)
## Month 3 0.4880 0.14798 1.9345 0.020979 *
## Year 1 0.1288 0.03907 1.5320 0.144855
## Elevation 1 0.5787 0.17548 6.8818 0.000999 ***
## Residual 25 2.1022 0.63747
## Total 30 3.2976 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## are the results the same with other (non evolutionary) dissimiarlity indices?
dist.j<-vegdist(otu.comm[,-1], method='jaccard', na.rm=TRUE)
dist.bc<-vegdist(otu.comm[,-1], method='bray', na.rm=TRUE)
ad.bc<-adonis2(dist.bc~Month+Year+Elevation, data=grps, permutations=1000, strata='Year')
ad.bc
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 1000
##
## adonis2(formula = dist.bc ~ Month + Year + Elevation, data = grps, permutations = 1000, strata = "Year")
## Df SumOfSqs R2 F Pr(>F)
## Month 3 1.2059 0.13955 1.6232 0.003996 **
## Year 1 0.3061 0.03542 1.2359 0.250749
## Elevation 1 0.9385 0.10860 3.7897 0.000999 ***
## Residual 25 6.1913 0.71643
## Total 30 8.6418 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ad.j<-adonis2(dist.j~Month+Year+Elevation, data=grps, permutations=1000)
ad.j
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 1000
##
## adonis2(formula = dist.j ~ Month + Year + Elevation, data = grps, permutations = 1000)
## Df SumOfSqs R2 F Pr(>F)
## Month 3 1.3975 0.12712 1.3982 0.005994 **
## Year 1 0.3881 0.03530 1.1649 0.191808
## Elevation 1 0.8788 0.07994 2.6378 0.000999 ***
## Residual 25 8.3289 0.75764
## Total 30 10.9933 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Month and elevation are significant in all tests. Year is not but that is probably because month is nested within year.
Include Year in model appropriately
Or alternatively, if wanting to constrain for the variation by Year, constrain permutations within Year using ‘strata’
#### NMDS
The NMDS often pairs well with the PERMANOVA because both are more robust to properties of community data. To gain proper fit may need to try multiple dissimilarity methods or tranformations.
Plot ordination
ggplot(nmds.uni, aes(NMDS1, NMDS2))+
geom_point(aes(color=Elevation, shape=Month))
Plotting elevation, month, and site
ggplot(nmds.uni, aes(NMDS1, NMDS2))+
geom_point(aes(color=Elevation, shape=Site), size=3)+
scale_color_gradient2(low='turquoise', mid='yellow', high='red', midpoint=1500)
ggplot(nmds.uniwt, aes(NMDS1, NMDS2))+
geom_point(aes(color=Elevation, shape=Site), size=4)+
scale_color_gradient2(low='turquoise', mid='yellow', high='red', midpoint=1500)
Separates well by elevation/site, within each elevation the months are grouped together for the most part.
NMDS is an unconstrained ordination method to visualize multivariate data in fewer dimensions. Depending on the properties of your data and your questions, different methods of ordination may be appropriate. PCoA (or MDS) is a metric version of NMDS, meaning that PCoA is a Euclidean representation of dissimilarities. So, like NMDS PCoA uses dissimilarities, but it is euclidean rather than rank-order which is used to preserve distances hierarchically. NMDS and distance-based ordinations/tests are often easier to use with community data because 0’s and 1’s, rare species are common and do not configure in euclidean space.
In general, ordination & multivaraite methods are defined by (1) whether or not using dissimilarity, (2) whether or not euclidean, (3) test for differences vs explaining differences …
PCoA, like PCA, returns a set of orthogonal axes with associated eignevalues that measure their importance. PCoA and PCA should produce the same results if the PCA is calculated on a covariance matrix with scaling of 1.
#PCoA - unconstrained, euclidean
pcoa.uni<-capscale(unifrac~1,data=grps)
head(summary(pcoa.uni))
##
## Call:
## capscale(formula = unifrac ~ 1, data = grps)
##
## Partitioning of squared Unknown distance:
## Inertia Proportion
## Total 8.307 1
## Unconstrained 8.307 1
##
## Eigenvalues, and their contribution to the squared Unknown distance
##
## Importance of components:
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6
## Eigenvalue 1.3712 0.6172 0.48296 0.44120 0.37507 0.33195
## Proportion Explained 0.1651 0.0743 0.05814 0.05311 0.04515 0.03996
## Cumulative Proportion 0.1651 0.2394 0.29750 0.35061 0.39575 0.43571
## MDS7 MDS8 MDS9 MDS10 MDS11 MDS12
## Eigenvalue 0.33155 0.28128 0.26035 0.24962 0.23306 0.21707
## Proportion Explained 0.03991 0.03386 0.03134 0.03005 0.02805 0.02613
## Cumulative Proportion 0.47562 0.50948 0.54082 0.57087 0.59892 0.62505
## MDS13 MDS14 MDS15 MDS16 MDS17 MDS18
## Eigenvalue 0.21094 0.2044 0.19919 0.19491 0.19216 0.18599
## Proportion Explained 0.02539 0.0246 0.02398 0.02346 0.02313 0.02239
## Cumulative Proportion 0.65044 0.6750 0.69902 0.72248 0.74561 0.76800
## MDS19 MDS20 MDS21 MDS22 MDS23 MDS24
## Eigenvalue 0.18212 0.17951 0.17740 0.17325 0.16256 0.15928
## Proportion Explained 0.02192 0.02161 0.02135 0.02085 0.01957 0.01917
## Cumulative Proportion 0.78992 0.81153 0.83289 0.85374 0.87331 0.89248
## MDS25 MDS26 MDS27 MDS28 MDS29 MDS30
## Eigenvalue 0.15552 0.15339 0.15087 0.14823 0.14353 0.14165
## Proportion Explained 0.01872 0.01846 0.01816 0.01784 0.01728 0.01705
## Cumulative Proportion 0.91120 0.92967 0.94783 0.96567 0.98295 1.00000
##
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores: 3.973259
##
##
## Species scores
##
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6
## Dim1
## Dim2
## Dim3
## Dim4
## Dim5
## Dim6
## ....
##
##
## Site scores (weighted sums of species scores)
##
## MDS1 MDS2 MDS3 MDS4 MDS5 MDS6
## OS115 -0.7415 0.1656 0.04005 -0.4076 -1.28724 1.7182
## AP114 0.6906 0.6814 0.71376 -0.5067 -0.02969 -0.6876
## JJ415 0.9565 -0.3752 0.54479 0.7769 -0.34641 -0.2737
## OT115 -0.1568 -1.3120 -0.73738 -0.9420 -0.16122 -0.6079
## JJ515 1.1682 -0.2464 0.84516 0.7170 -0.55189 -0.2705
## OS615 -0.8800 -0.3348 0.22812 0.4617 -1.03511 0.1340
## ....
pcoa.df<-data.frame(scores(pcoa.uni, display='sites'))
pcoa.df# site scores correspond to the SampleID
## MDS1 MDS2
## OS115 -0.74150784 0.16559647
## AP114 0.69057917 0.68140546
## JJ415 0.95654989 -0.37523883
## OT115 -0.15679547 -1.31203771
## JJ515 1.16819586 -0.24636897
## OS615 -0.88000154 -0.33482448
## OJ615 0.74929599 -0.63299584
## JT415 0.72507563 0.29416305
## JP315 -0.67566941 0.31608512
## OJ514 0.87871285 -0.12487936
## OJ515 0.57853130 -1.12364192
## OJ415 0.48488862 -1.09386822
## AS114 0.02490994 0.91198409
## JS415 -0.98065675 0.39498848
## JP414 0.09809453 1.08984841
## OT515 -0.30571689 -1.45939703
## JJ114 1.05797322 -0.16478383
## OT415 -0.37346462 -1.42360670
## OT414 0.44985751 0.80011034
## JP515 -0.74251220 0.40269122
## JT115 0.80663646 0.47811206
## AJ414 -0.60026037 0.01492331
## OP315 -0.88894794 -0.30454442
## OS214 0.03104503 1.08472510
## JT314 0.71373399 0.72165153
## OP515 -0.79264025 -0.30799797
## AT414 0.72371431 0.46458444
## OS415 -0.70002762 0.25432307
## OP615 -0.77049900 -0.20989577
## OP614 -0.73749279 0.34824823
## JS115 -0.79160161 0.69064067
# bind with grouping data, in order
pcoa.df<-pcoa.df[match(grps$SampleID,rownames(pcoa.df)),]%>%cbind(grps)
Note: Can’t run any analyses that rely on relative abundance data (e.g., simper), Address using absence-presence data
This procedure identifies OTUs as indicator species independently from their abundance in the total data set. ####Steps (from https://www.nature.com/articles/ismej2015238):