library(phyloseq)
library(ggplot2)
library(dbplyr)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-4
library(scales)
library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(grid)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Read in the csv file with the soil data. Add a new variable that categorizes samples into "mound" and "depression" areas based on elevation. I tested different elevation cutoff points against the map created by the Corps that identified wetland and 'non-wetland' areas, and a cutoff of 375.3 m matched the map best (a sketch of that comparison follows the code below).
meta = read.csv("~/Documents/WWPS_soil_2018/soils_data_w2_parse.csv")
meta$SampleID <- paste0("WS-",meta$cluster,"-",meta$num)
row.names(meta) <- meta$SampleID
# Classify each sample by elevation, using the 375.3 m cutoff chosen above
meta$topo <- ifelse(meta$Ele <= 375.3, "depression", "mound")
write.csv(meta, "~/Documents/WWPS_soil_2018/soils_data_w2_topo.csv")
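The cutoff comparison itself isn't part of this script. A minimal sketch of how candidate cutoffs could be scored against the Corps map, assuming a hypothetical logical meta$wetland column (TRUE = mapped wetland) that is not in the real data:
# Hypothetical sketch: score each candidate cutoff by its agreement with the
# Corps wetland/non-wetland map (assumes a logical meta$wetland column)
cutoffs <- seq(375.0, 375.6, by = 0.1)
agreement <- sapply(cutoffs, function(cut) {
  predicted <- meta$Ele <= cut      # TRUE = predicted depression/wetland
  mean(predicted == meta$wetland)   # fraction of samples that agree with map
})
cutoffs[which.max(agreement)]       # best-matching cutoff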
Open the phyloseq object that contains OTUs for both fungi and bacteria. Merge the phyloseq data with the metadata.
ps <- readRDS("~/Documents/WWPS_soil_2018/output/WWP_Full_phyloseq_object.RDS")
# Add soil metadata and merge all into one phyloseq object
ps.m = merge_phyloseq(ps, sample_data(meta))
#Change the Kingdom to either Fungi (k__Fungi) or Bacteria before running
ps = subset_taxa(ps.m, Kingdom == "k__Fungi")
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3197 taxa and 146 samples ]
## sample_data() Sample Data: [ 146 samples by 39 sample variables ]
## tax_table() Taxonomy Table: [ 3197 taxa by 7 taxonomic ranks ]
Everything looks good. We have a phyloseq object containing our OTU/taxonomy/sequence data, now combined with our soil metadata. We have 146 samples instead of 150 because a few samples had only a handful of sequence counts, which turned out not to be fungal (we think this happened somewhere between DNA amplification and sequencing).
Explore the count and OTU distribution of the raw samples, before we normalize the data set.
# Rank OTUs and samples by total reads to get an overview of the raw counts
readsumsdf = data.frame(nreads = sort(taxa_sums(ps), TRUE),
                        sorted = 1:ntaxa(ps), type = "OTUs")
readsumsdf = rbind(readsumsdf,
                   data.frame(nreads = sort(sample_sums(ps), TRUE),
                              sorted = 1:nsamples(ps), type = "Samples"))
title = "Total number of reads"
p = ggplot(readsumsdf, aes(x = sorted, y = nreads)) + geom_bar(stat = "identity")
p + ggtitle(title) + scale_y_log10() + facet_wrap(~type, 1, scales = "free")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 8 rows containing missing values (geom_bar).
This looks pretty typical for an amplicon-based microbiome study. Thankfully, most samples have at least 100 reads. Regardless, we will use a technique to normalize the data to simulate an even “sequencing effort.”
First, I’ll remove the samples that have below 100 counts.
# The 100-read cutoff is arbitrary and can be changed
ps.sub = prune_samples(sample_sums(ps) > 100, ps)
ps.sub
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3197 taxa and 105 samples ]
## sample_data() Sample Data: [ 105 samples by 39 sample variables ]
## tax_table() Taxonomy Table: [ 3197 taxa by 7 taxonomic ranks ]
This leaves us with 105 samples that have more than 100 counts. Let’s look at the distribution of the counts again:
readsumsdf = data.frame(nreads = sort(taxa_sums(ps.sub), TRUE),
                        sorted = 1:ntaxa(ps.sub), type = "OTUs")
readsumsdf = rbind(readsumsdf,
                   data.frame(nreads = sort(sample_sums(ps.sub), TRUE),
                              sorted = 1:nsamples(ps.sub), type = "Samples"))
title = "Total number of reads"
p = ggplot(readsumsdf, aes(x = sorted, y = nreads)) + geom_bar(stat = "identity")
p + ggtitle(title) + scale_y_log10() + facet_wrap(~type, 1, scales = "free")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 3 rows containing missing values (geom_bar).
Sample counts are more even now, but we still need to normalize the data by randomly subsampling each library down to an even depth of 100 reads.
# Loose prune first; rarefy_even_depth() below drops anything under sample.size
ps.R <- prune_samples(sample_sums(ps) > 25, ps)
#The set.seed function makes the randomization repeatable
set.seed(100)
ps.R <- rarefy_even_depth(ps.R, sample.size = 100)
## You set `rngseed` to FALSE. Make sure you've set & recorded
## the random seed of your session for reproducibility.
## See `?set.seed`
## ...
## 4 samples removed because they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## WS-23-5 WS-30-1 WS-4-4 WS-5-1
## ...
## 2325 OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
Rarefying at this depth has caused us to lose over 2,000 OTUs from the complete data set. This is a good place to introduce an alternative method for normalizing sample depth.
Traditionally, samples would be rarefied by random sub-sampling down to the lowest acceptable common sample size. However, there are issues with this approach (see https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full#F2 and https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531), and the statistical modeling alternatives can be difficult to work with when computing diversity indices. Thus, I will also apply a simple transformation that converts OTU counts into relative abundances; we can compare both normalizations in the later analyses.
# Scale each sample so its counts sum to 100, comparable to our rarefied data set
ps.P <- transform_sample_counts(ps, function(x) 100 * x / sum(x))
Let’s now check both of our normalized data sets, to make sure the reads look the way they should.
par(mfrow = c(1, 2))
title = "Sum of reads for each sample, ps.R"
plot(sort(sample_sums(ps.R), TRUE), type = "h", main = title, ylab = "reads",
     ylim = c(0, 150))
title = "Sum of reads for each sample, ps.P"
plot(sort(sample_sums(ps.P), TRUE), type = "h", main = title, ylab = "reads",
     ylim = c(0, 150))
Everything has been normalized the way we want: every ps.R library sums to exactly 100 reads, and every ps.P library sums to 100 (percent).
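A quick programmatic check of the same thing:
# Every rarefied library should sum to exactly the rarefaction depth
all(sample_sums(ps.R) == 100)
# Every proportion library should sum to ~100 on the percent scale
summary(sample_sums(ps.P))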
One of the best and easiest ways to explore amplicon data is with unconstrained ordinations. The following runs a PCoA on the data using Bray-Curtis dissimilarity.
ps.pcoa <- ordinate(
  physeq = ps.R,
  method = "PCoA",
  distance = "bray"
)
plot_ordination(
  physeq = ps.R,
  ordination = ps.pcoa,
  color = "per_clay",
  shape = "topo",
  title = "PCoA of WWPS Fungal Communities"
) +
  #scale_color_gradient(low = "blue", high = "red") +
  # Layer a large colored point beneath a small grey point for an outlined look
  geom_point(aes(color = per_clay), alpha = 0.7, size = 4) +
  geom_point(colour = "grey90", size = 1.5)
Next, we can compare diversity between mounds and depressions by plotting alpha-diversity indices for each:
plot_richness(ps.R, x = "topo", color = "per_clay", title = "Diversity of ps.R") + geom_boxplot()
## Warning: Removed 525 rows containing missing values (geom_errorbar).
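If you want the numbers behind the plot, plot_richness() computes them with estimate_richness(); for example, to pull a few specific indices as a table:
# One row per sample, one column per requested index
rich <- estimate_richness(ps.R, measures = c("Observed", "Shannon", "Simpson"))
head(rich)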
It is often possible to 'explain' an ordination using the corresponding environmental measurements for each sample. We can explore these in the PCoA by changing the symbols according to the environmental variables, but we can also fit environmental vectors onto the ordination in the form of arrows.
To continue, I need to make the phyloseq object compatible with a package named "vegan", which has many functions for studying diversity and community structure. I'll use some code from https://jacobrprice.github.io/2017/08/26/phyloseq-to-vegan-and-back.html for this.
# convert the sample_data() within a phyloseq object to a vegan compatible data object
pssd2veg <- function(physeq) {
  sd <- sample_data(physeq)
  return(as(sd, "data.frame"))
}
# convert the otu_table() within a phyloseq object to a vegan compatible data object
psotu2veg <- function(physeq) {
  OTU <- otu_table(physeq)
  if (taxa_are_rows(OTU)) {
    OTU <- t(OTU)    # vegan expects samples as rows, taxa as columns
  }
  return(as(OTU, "matrix"))
}
ps.sd.veg <- pssd2veg(ps.R)
ps.otu.veg <- psotu2veg(ps.R)
# I reduced the sample data file further using ps$var = NULL to remove columns
psv = read.csv("~/Documents/WWPS_soil_2018/soils_data_reduced.csv")
Now we need to calculate the Bray-Curtis dissimilarity for the OTU table that we will use with the vegan package.
vare.dis <- vegdist(ps.otu.veg)  # vegdist() defaults to Bray-Curtis
vare.mds0 <- isoMDS(vare.dis)    # non-metric MDS from the MASS package
## initial value 41.807341
## iter 5 value 34.870064
## iter 10 value 29.920009
## iter 15 value 28.915330
## iter 20 value 27.776807
## iter 25 value 26.915619
## iter 30 value 26.560442
## final value 26.367949
## converged
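For intuition, the Bray-Curtis dissimilarity between two samples is the sum of their absolute count differences divided by the total counts in both samples. A toy check, separate from the pipeline:
# Toy counts for two samples across three OTUs (illustration only)
x <- c(5, 0, 3)
y <- c(2, 4, 3)
sum(abs(x - y)) / (sum(x) + sum(y))    # 7/17 = 0.4117647 by hand
vegdist(rbind(x, y), method = "bray")  # vegdist() returns the same value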
Calculate the NMDS for the data set. metaMDS() comes from vegan and runs monoMDS() from multiple random starts, with standardization and rotation handled automatically.
vare.mds <- metaMDS(ps.otu.veg, trace = FALSE)
vare.mds
##
## Call:
## metaMDS(comm = ps.otu.veg, trace = FALSE)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(ps.otu.veg))
## Distance: bray
##
## Dimensions: 2
## Stress: 0.2600346
## Stress type 1, weak ties
## No convergent solutions - best solution after 20 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(ps.otu.veg))'
ef <- envfit(vare.mds, psv, permu = 999)
ef
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## lon 0.08204 0.99663 0.0601 0.035 *
## lat 0.94537 0.32600 0.2519 0.001 ***
## pH 0.99828 0.05862 0.1877 0.001 ***
## EC 0.36483 -0.93107 0.0277 0.233
## GWC -0.84877 0.52877 0.0506 0.072 .
## per_clay 0.18327 -0.98306 0.0060 0.741
## per_sand 0.99484 -0.10148 0.5393 0.001 ***
## per_silt -0.97516 0.22150 0.2508 0.001 ***
## Al -0.75673 0.65373 0.0145 0.497
## As -0.97620 -0.21688 0.2638 0.001 ***
## B -0.48848 0.87258 0.0441 0.106
## Ca -0.62833 0.77795 0.1718 0.001 ***
## Cd 0.99240 -0.12306 0.0165 0.493
## Co -0.96765 0.25231 0.1209 0.002 **
## Cr -0.99599 -0.08950 0.2136 0.001 ***
## Cu 0.18453 0.98283 0.0097 0.628
## Fe -1.00000 -0.00229 0.5021 0.001 ***
## K -0.75389 -0.65700 0.0218 0.366
## Mg -0.92167 0.38797 0.3576 0.001 ***
## Mn 0.68461 -0.72891 0.0169 0.427
## Mo -0.40402 -0.91475 0.0101 0.615
## Na -0.99616 -0.08757 0.2655 0.001 ***
## Ni -0.67639 0.73654 0.1230 0.001 ***
## P 0.82192 -0.56960 0.3130 0.001 ***
## S -0.72225 -0.69164 0.0852 0.004 **
## Ti -0.96774 -0.25194 0.2023 0.001 ***
## Zn -0.84847 0.52924 0.0527 0.063 .
## per_som 0.41871 0.90812 0.0028 0.889
## PerN -0.00551 0.99998 0.0365 0.146
## PerC 0.54996 0.83519 0.0738 0.020 *
## CNRatio 0.96255 0.27111 0.0966 0.007 **
## Ele 0.93752 -0.34794 0.3767 0.001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
##
## ***FACTORS:
##
## Centroids:
## NMDS1 NMDS2
## XWS-1-1 0.0047 0.0339
## XWS-1-2 -0.0022 0.0320
## XWS-1-3 0.0226 0.0098
## XWS-1-4 -0.0261 0.0291
## XWS-1-5 -0.0158 0.0036
## XWS-10-4 -0.0379 0.0038
## XWS-10-5 -0.0475 -0.0030
## XWS-11-1 -0.0402 0.0132
## XWS-11-2 0.0245 0.0194
## XWS-11-3 0.0145 -0.0165
## XWS-11-4 0.0253 -0.0031
## XWS-11-5 -0.0115 0.0131
## XWS-12-2 0.0320 0.0065
## XWS-12-3 0.0355 -0.0136
## XWS-12-4 0.0315 -0.0281
## XWS-12-5 0.0357 0.0065
## XWS-13-1 0.0233 0.0068
## XWS-13-2 0.0302 0.0018
## XWS-13-3 0.0355 0.0161
## XWS-13-4 0.0110 0.0488
## XWS-14-2 0.0438 -0.0147
## XWS-14-3 0.0477 0.0094
## XWS-14-4 0.0535 0.0287
## XWS-14-5 0.0365 -0.0223
## XWS-15-3 0.0162 0.0222
## XWS-15-4 0.0224 -0.0272
## XWS-15-5 0.0154 0.0132
## XWS-16-1 0.0022 0.0371
## XWS-16-2 -0.0058 0.0200
## XWS-16-3 -0.0049 0.0067
## XWS-16-4 0.0382 0.0296
## XWS-16-5 -0.0139 0.0399
## XWS-17-1 -0.0402 0.0308
## XWS-17-2 0.0131 0.0042
## XWS-17-3 0.0414 -0.0121
## XWS-17-4 0.0201 0.0037
## XWS-17-5 0.0319 -0.0199
## XWS-18-1 0.0409 -0.0228
## XWS-18-2 0.0264 -0.0201
## XWS-18-3 0.0032 -0.0303
## XWS-18-4 0.0110 0.0222
## XWS-18-5 0.0061 -0.0248
## XWS-19-1 0.0134 0.0188
## XWS-19-2 -0.0203 -0.0334
## XWS-19-3 -0.0065 -0.0334
## XWS-19-4 0.0068 -0.0411
## XWS-19-5 -0.0175 -0.0273
## XWS-2-1 0.0257 -0.0109
## XWS-2-2 0.0229 -0.0231
## XWS-2-3 0.0338 -0.0333
## XWS-2-4 0.0399 0.0117
## XWS-2-5 -0.0298 0.0198
## XWS-20-1 0.0415 -0.0195
## XWS-20-2 0.0229 0.0145
## XWS-20-3 0.0399 -0.0090
## XWS-20-4 0.0237 -0.0346
## XWS-20-5 -0.0381 -0.0394
## XWS-21-1 -0.0037 -0.0077
## XWS-21-2 0.0011 -0.0084
## XWS-21-3 0.0015 -0.0473
## XWS-21-4 -0.0140 -0.0193
## XWS-21-5 0.0070 0.0518
## XWS-22-1 -0.0569 -0.0193
## XWS-22-2 -0.0346 -0.0224
## XWS-22-3 -0.0394 0.0074
## XWS-22-4 0.0072 -0.0194
## XWS-22-5 -0.0053 -0.0258
## XWS-23-1 -0.0027 -0.0195
## XWS-23-2 -0.0065 0.0104
## XWS-23-3 0.0002 0.0477
## XWS-23-4 -0.0263 -0.0168
## XWS-24-1 -0.0465 -0.0170
## XWS-24-2 0.0154 -0.0467
## XWS-24-3 -0.0443 -0.0104
## XWS-24-4 -0.0245 0.0361
## XWS-25-1 -0.0464 0.0208
## XWS-25-2 -0.0175 -0.0401
## XWS-25-3 -0.0362 -0.0146
## XWS-25-4 -0.0360 -0.0179
## XWS-25-5 -0.0406 -0.0096
## XWS-26-1 -0.0605 -0.0135
## XWS-26-2 -0.0474 0.0003
## XWS-26-3 0.0361 -0.0030
## XWS-29-5 0.0342 -0.0064
## XWS-3-1 -0.0112 -0.0017
## XWS-3-2 -0.0265 -0.0007
## XWS-3-3 0.0215 0.0252
## XWS-3-4 -0.0086 0.0185
## XWS-3-5 -0.0231 -0.0056
## XWS-30-2 -0.0412 -0.0303
## XWS-30-3 -0.0098 -0.0441
## XWS-30-4 -0.0179 0.0133
## XWS-30-5 -0.0174 0.0287
## XWS-4-1 -0.0209 0.0496
## XWS-4-3 -0.0141 -0.0102
## XWS-4-5 -0.0224 0.0216
## XWS-5-2 -0.0299 0.0130
## XWS-5-3 -0.0036 0.0228
## XWS-6-1 -0.0191 0.0235
## XWS-6-2 0.0176 0.0008
## XWS-6-3 -0.0218 -0.0137
## XWS-6-4 0.0148 0.0250
## XWS-6-5 0.0066 0.0399
## XWS-7-1 0.0137 0.0113
## XWS-7-2 -0.0087 0.0098
##
## Goodness of fit:
## r2 Pr(>r)
## X 1 1
## Permutation: free
## Number of permutations: 999
The first two columns give the direction cosines of the vectors, and r2 gives the squared correlation coefficient. The ***FACTORS section lists a centroid for every sample because the sample-ID column in psv was read in as a factor with one level per sample (the X entry in the goodness-of-fit table, with r2 = 1); it carries no information here, and dropping the ID column from psv before calling envfit() would remove it.
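To work with these numbers directly, the arrow coordinates and r2 values can be pulled straight from the envfit object:
# Arrow endpoints scaled by sqrt(r2), handy for custom plot overlays
head(scores(ef, display = "vectors"))
# Raw direction cosines and squared correlations, per variable
head(ef$vectors$arrows)
head(ef$vectors$r)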
Now to plot them as fitted vectors on the ordination map. I'll limit the plotting to the most significant variables with the p.max argument.
plot(vare.mds, display = "sites")
plot(ef, p.max = 0.001)
Next, fit surfaces of the environmental variables to the ordination. This helps when the response is not actually linear. If it is linear and vectors are appropriate, the fitted surface is a plane whose gradient is parallel to the arrow, and the fitted contours are equally spaced parallel lines perpendicular to the arrow.
ef <- envfit(vare.mds ~ Mg + P, psv)
plot(vare.mds, display = "sites")
plot(ef)
# Overlay GAM-fitted surfaces as contours: Mg in red (default) and P in green
tmp <- with(psv, ordisurf(vare.mds, Mg, add = TRUE))
with(psv, ordisurf(vare.mds, P, add = TRUE, col = "green4"))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
##
## Estimated degrees of freedom:
## 7.57 total = 8.57
##
## REML score: 54.82107