library(phyloseq)
library(ggplot2)
library(dbplyr)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-4
library(scales)
library(reshape2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(grid)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Read in the csv file with the soil data. Add a new variable that categorizes samples into "mound" and "depression" areas based on elevation. I tested different elevation cutoff points against the map created by the Corps that identified wetland and 'non-wetland' areas, and a cutoff of 375.3 m matched the map best (a sketch of that comparison follows the code below).
meta = read.csv("~/Documents/WWPS_soil_2018/soils_data_w2_parse.csv")
meta$SampleID <- paste0("WS-",meta$cluster,"-",meta$num)
row.names(meta) <- meta$SampleID
# Classify each sample by elevation, using the 375.3 m cutoff chosen above
meta$topo <- ifelse(meta$Ele <= 375.3, "depression", "mound")
write.csv(meta, "~/Documents/WWPS_soil_2018/soils_data_w2_topo.csv")
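The cutoff comparison itself isn't part of this script. A minimal sketch of how candidate cutoffs could be scored against the Corps map, assuming a hypothetical logical meta$wetland column (TRUE = mapped wetland) that is not in the real data:
# Hypothetical sketch: score each candidate cutoff by its agreement with the
# Corps wetland/non-wetland map (assumes a logical meta$wetland column)
cutoffs <- seq(375.0, 375.6, by = 0.1)
agreement <- sapply(cutoffs, function(cut) {
  predicted <- meta$Ele <= cut      # TRUE = predicted depression/wetland
  mean(predicted == meta$wetland)   # fraction of samples that agree with map
})
cutoffs[which.max(agreement)]       # best-matching cutoff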
Open the phyloseq object that contains OTUs for both fungi and bacteria. Merge the phyloseq data with the metadata.
ps <- readRDS("~/Documents/WWPS_soil_2018/output/WWP_Full_phyloseq_object.RDS")
# Add soil metadata and merge all into one phyloseq object
ps.m = merge_phyloseq(ps, sample_data(meta))
#Change the Kingdom to either Fungi (k__Fungi) or Bacteria before running
ps = subset_taxa(ps.m, Kingdom == "k__Fungi")
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3197 taxa and 146 samples ]
## sample_data() Sample Data: [ 146 samples by 39 sample variables ]
## tax_table() Taxonomy Table: [ 3197 taxa by 7 taxonomic ranks ]
Everything looks good. We have a phyloseq object containing our OTU/taxonomy/sequence data, now combined with our soil metadata. We have 146 samples instead of 150 because a few samples had only a handful of sequence counts, which turned out not to be fungal (we think this happened somewhere between DNA amplification and sequencing).
Explore the count and OTU distribution of the raw samples, before we normalize the data set.
# Rank OTUs and samples by total reads to get an overview of the raw counts
readsumsdf = data.frame(nreads = sort(taxa_sums(ps), TRUE),
                        sorted = 1:ntaxa(ps), type = "OTUs")
readsumsdf = rbind(readsumsdf,
                   data.frame(nreads = sort(sample_sums(ps), TRUE),
                              sorted = 1:nsamples(ps), type = "Samples"))
title = "Total number of reads"
p = ggplot(readsumsdf, aes(x = sorted, y = nreads)) + geom_bar(stat = "identity")
p + ggtitle(title) + scale_y_log10() + facet_wrap(~type, 1, scales = "free")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 8 rows containing missing values (geom_bar).
This looks pretty typical for an amplicon-based microbiome study. Thankfully, most samples have at least 100 reads. Regardless, we will use a technique to normalize the data to simulate an even “sequencing effort.”
First, I’ll remove the samples that have below 100 counts.
# The 100-read cutoff is arbitrary and can be changed
ps.sub = prune_samples(sample_sums(ps) > 100, ps)
ps.sub
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3197 taxa and 105 samples ]
## sample_data() Sample Data: [ 105 samples by 39 sample variables ]
## tax_table() Taxonomy Table: [ 3197 taxa by 7 taxonomic ranks ]
This leaves us with 105 samples that have more than 100 counts. Let’s look at the distribution of the counts again:
readsumsdf = data.frame(nreads = sort(taxa_sums(ps.sub), TRUE),
                        sorted = 1:ntaxa(ps.sub), type = "OTUs")
readsumsdf = rbind(readsumsdf,
                   data.frame(nreads = sort(sample_sums(ps.sub), TRUE),
                              sorted = 1:nsamples(ps.sub), type = "Samples"))
title = "Total number of reads"
p = ggplot(readsumsdf, aes(x = sorted, y = nreads)) + geom_bar(stat = "identity")
p + ggtitle(title) + scale_y_log10() + facet_wrap(~type, 1, scales = "free")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 3 rows containing missing values (geom_bar).
Sample counts are more even now, but we still need to normalize the data by randomly subsampling each library down to an even depth of 100 reads.
# Loose prune first; rarefy_even_depth() below drops anything under sample.size
ps.R <- prune_samples(sample_sums(ps) > 25, ps)
#The set.seed function makes the randomization repeatable
set.seed(100)
ps.R <- rarefy_even_depth(ps.R, sample.size = 100)
## You set `rngseed` to FALSE. Make sure you've set & recorded
## the random seed of your session for reproducibility.
## See `?set.seed`
## ...
## 4 samples removed because they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## WS-23-5 WS-30-1 WS-4-4 WS-5-1
## ...
## 2325 OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
Rarefying at this depth has caused us to lose over 2,000 OTUs from the complete data set. This is a good place to introduce an alternative method for normalizing sample depth.
Traditionally, samples would be rarefied by random sub-sampling down to the lowest acceptable common sample size. However, there are issues with this approach (see https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full#F2 and https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531), and the statistical modeling alternatives can be difficult to work with when computing diversity indices. Thus, I will also apply a simple transformation that converts OTU counts into relative abundances; we can compare both normalizations in the later analyses.
# Scale each sample so its counts sum to 100, comparable to our rarefied data set
ps.P <- transform_sample_counts(ps, function(x) 100 * x / sum(x))
Let’s now check both of our normalized data sets, to make sure the reads look the way they should.
par(mfrow = c(1, 2))
title = "Sum of reads for each sample, ps.R"
plot(sort(sample_sums(ps.R), TRUE), type = "h", main = title, ylab = "reads",
     ylim = c(0, 150))
title = "Sum of reads for each sample, ps.P"
plot(sort(sample_sums(ps.P), TRUE), type = "h", main = title, ylab = "reads",
     ylim = c(0, 150))
Everything has been normalized the way we want: every ps.R library sums to exactly 100 reads, and every ps.P library sums to 100 (percent).
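A quick programmatic check of the same thing:
# Every rarefied library should sum to exactly the rarefaction depth
all(sample_sums(ps.R) == 100)
# Every proportion library should sum to ~100 on the percent scale
summary(sample_sums(ps.P))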
One of the best and easiest ways to explore amplicon data is with unconstrained ordinations. The following runs a PCoA on the data using Bray-Curtis dissimilarity.
ps.pcoa <- ordinate(
  physeq = ps.R,
  method = "PCoA",
  distance = "bray"
)
plot_ordination(
  physeq = ps.R,
  ordination = ps.pcoa,
  color = "per_clay",
  shape = "topo",
  title = "PCoA of WWPS Fungal Communities"
) +
  #scale_color_gradient(low = "blue", high = "red") +
  # Layer a large colored point beneath a small grey point for an outlined look
  geom_point(aes(color = per_clay), alpha = 0.7, size = 4) +
  geom_point(colour = "grey90", size = 1.5)
Next, we can compare diversity between mounds and depressions by plotting alpha-diversity indices for each:
plot_richness(ps.R, x = "topo", color = "per_clay", title = "Diversity of ps.R") + geom_boxplot()
## Warning: Removed 525 rows containing missing values (geom_errorbar).
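If you want the numbers behind the plot, plot_richness() computes them with estimate_richness(); for example, to pull a few specific indices as a table:
# One row per sample, one column per requested index
rich <- estimate_richness(ps.R, measures = c("Observed", "Shannon", "Simpson"))
head(rich)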
It is often possible to 'explain' an ordination using the corresponding environmental measurements for each sample. We can explore these in the PCoA by changing the symbols according to the environmental variables, but we can also fit environmental vectors onto the ordination in the form of arrows.
To continue, I need to make the phyloseq object compatible with a package named "vegan", which has many functions for studying diversity and community structure. I'll use some code from https://jacobrprice.github.io/2017/08/26/phyloseq-to-vegan-and-back.html for this.
# convert the sample_data() within a phyloseq object to a vegan compatible data object
pssd2veg <- function(physeq) {
  sd <- sample_data(physeq)
  return(as(sd, "data.frame"))
}
# convert the otu_table() within a phyloseq object to a vegan compatible data object
psotu2veg <- function(physeq) {
  OTU <- otu_table(physeq)
  if (taxa_are_rows(OTU)) {
    OTU <- t(OTU)    # vegan expects samples as rows, taxa as columns
  }
  return(as(OTU, "matrix"))
}
ps.sd.veg <- pssd2veg(ps.R)
ps.otu.veg <- psotu2veg(ps.R)
# I reduced the sample data file further using ps$var = NULL to remove columns
psv = read.csv("~/Documents/WWPS_soil_2018/soils_data_reduced.csv")
Now we need to calculate the Bray-Curtis dissimilarity for the OTU table that we will use with the vegan package.
vare.dis <- vegdist(ps.otu.veg)  # vegdist() defaults to Bray-Curtis
vare.mds0 <- isoMDS(vare.dis)    # non-metric MDS from the MASS package
## initial value 41.807341
## iter 5 value 34.870064
## iter 10 value 29.920009
## iter 15 value 28.915330
## iter 20 value 27.776807
## iter 25 value 26.915619
## iter 30 value 26.560442
## final value 26.367949
## converged
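For intuition, the Bray-Curtis dissimilarity between two samples is the sum of their absolute count differences divided by the total counts in both samples. A toy check, separate from the pipeline:
# Toy counts for two samples across three OTUs (illustration only)
x <- c(5, 0, 3)
y <- c(2, 4, 3)
sum(abs(x - y)) / (sum(x) + sum(y))    # 7/17 = 0.4117647 by hand
vegdist(rbind(x, y), method = "bray")  # vegdist() returns the same value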
Calculate the NMDS for the data set. metaMDS() comes from vegan and runs monoMDS() from multiple random starts, with standardization and rotation handled automatically.
vare.mds <- metaMDS(ps.otu.veg, trace = FALSE)
vare.mds
##
## Call:
## metaMDS(comm = ps.otu.veg, trace = FALSE)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(ps.otu.veg))
## Distance: bray
##
## Dimensions: 2
## Stress: 0.2600346
## Stress type 1, weak ties
## No convergent solutions - best solution after 20 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(ps.otu.veg))'
ef <- envfit(vare.mds, psv, permu = 999)
ef
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## lon 0.08204 0.99663 0.0601 0.035 *
## lat 0.94537 0.32600 0.2519 0.001 ***
## pH 0.99828 0.05862 0.1877 0.001 ***
## EC 0.36483 -0.93107 0.0277 0.233
## GWC -0.84877 0.52877 0.0506 0.072 .
## per_clay 0.18327 -0.98306 0.0060 0.741
## per_sand 0.99484 -0.10148 0.5393 0.001 ***
## per_silt -0.97516 0.22150 0.2508 0.001 ***
## Al -0.75673 0.65373 0.0145 0.497
## As -0.97620 -0.21688 0.2638 0.001 ***
## B -0.48848 0.87258 0.0441 0.106
## Ca -0.62833 0.77795 0.1718 0.001 ***
## Cd 0.99240 -0.12306 0.0165 0.493
## Co -0.96765 0.25231 0.1209 0.002 **
## Cr -0.99599 -0.08950 0.2136 0.001 ***
## Cu 0.18453 0.98283 0.0097 0.628
## Fe -1.00000 -0.00229 0.5021 0.001 ***
## K -0.75389 -0.65700 0.0218 0.366
## Mg -0.92167 0.38797 0.3576 0.001 ***
## Mn 0.68461 -0.72891 0.0169 0.427
## Mo -0.40402 -0.91475 0.0101 0.615
## Na -0.99616 -0.08757 0.2655 0.001 ***
## Ni -0.67639 0.73654 0.1230 0.001 ***
## P 0.82192 -0.56960 0.3130 0.001 ***
## S -0.72225 -0.69164 0.0852 0.004 **
## Ti -0.96774 -0.25194 0.2023 0.001 ***
## Zn -0.84847 0.52924 0.0527 0.063 .
## per_som 0.41871 0.90812 0.0028 0.889
## PerN -0.00551 0.99998 0.0365 0.146
## PerC 0.54996 0.83519 0.0738 0.020 *
## CNRatio 0.96255 0.27111 0.0966 0.007 **
## Ele 0.93752 -0.34794 0.3767 0.001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
##
## ***FACTORS:
##
## Centroids:
## NMDS1 NMDS2
## XWS-1-1 0.0047 0.0339
## XWS-1-2 -0.0022 0.0320
## XWS-1-3 0.0226 0.0098
## XWS-1-4 -0.0261 0.0291
## XWS-1-5 -0.0158 0.0036
## XWS-10-4 -0.0379 0.0038
## XWS-10-5 -0.0475 -0.0030
## XWS-11-1 -0.0402 0.0132
## XWS-11-2 0.0245 0.0194
## XWS-11-3 0.0145 -0.0165
## XWS-11-4 0.0253 -0.0031
## XWS-11-5 -0.0115 0.0131
## XWS-12-2 0.0320 0.0065
## XWS-12-3 0.0355 -0.0136
## XWS-12-4 0.0315 -0.0281
## XWS-12-5 0.0357 0.0065
## XWS-13-1 0.0233 0.0068
## XWS-13-2 0.0302 0.0018
## XWS-13-3 0.0355 0.0161
## XWS-13-4 0.0110 0.0488
## XWS-14-2 0.0438 -0.0147
## XWS-14-3 0.0477 0.0094
## XWS-14-4 0.0535 0.0287
## XWS-14-5 0.0365 -0.0223
## XWS-15-3 0.0162 0.0222
## XWS-15-4 0.0224 -0.0272
## XWS-15-5 0.0154 0.0132
## XWS-16-1 0.0022 0.0371
## XWS-16-2 -0.0058 0.0200
## XWS-16-3 -0.0049 0.0067
## XWS-16-4 0.0382 0.0296
## XWS-16-5 -0.0139 0.0399
## XWS-17-1 -0.0402 0.0308
## XWS-17-2 0.0131 0.0042
## XWS-17-3 0.0414 -0.0121
## XWS-17-4 0.0201 0.0037
## XWS-17-5 0.0319 -0.0199
## XWS-18-1 0.0409 -0.0228
## XWS-18-2 0.0264 -0.0201
## XWS-18-3 0.0032 -0.0303
## XWS-18-4 0.0110 0.0222
## XWS-18-5 0.0061 -0.0248
## XWS-19-1 0.0134 0.0188
## XWS-19-2 -0.0203 -0.0334
## XWS-19-3 -0.0065 -0.0334
## XWS-19-4 0.0068 -0.0411
## XWS-19-5 -0.0175 -0.0273
## XWS-2-1 0.0257 -0.0109
## XWS-2-2 0.0229 -0.0231
## XWS-2-3 0.0338 -0.0333
## XWS-2-4 0.0399 0.0117
## XWS-2-5 -0.0298 0.0198
## XWS-20-1 0.0415 -0.0195
## XWS-20-2 0.0229 0.0145
## XWS-20-3 0.0399 -0.0090
## XWS-20-4 0.0237 -0.0346
## XWS-20-5 -0.0381 -0.0394
## XWS-21-1 -0.0037 -0.0077
## XWS-21-2 0.0011 -0.0084
## XWS-21-3 0.0015 -0.0473
## XWS-21-4 -0.0140 -0.0193
## XWS-21-5 0.0070 0.0518
## XWS-22-1 -0.0569 -0.0193
## XWS-22-2 -0.0346 -0.0224
## XWS-22-3 -0.0394 0.0074
## XWS-22-4 0.0072 -0.0194
## XWS-22-5 -0.0053 -0.0258
## XWS-23-1 -0.0027 -0.0195
## XWS-23-2 -0.0065 0.0104
## XWS-23-3 0.0002 0.0477
## XWS-23-4 -0.0263 -0.0168
## XWS-24-1 -0.0465 -0.0170
## XWS-24-2 0.0154 -0.0467
## XWS-24-3 -0.0443 -0.0104
## XWS-24-4 -0.0245 0.0361
## XWS-25-1 -0.0464 0.0208
## XWS-25-2 -0.0175 -0.0401
## XWS-25-3 -0.0362 -0.0146
## XWS-25-4 -0.0360 -0.0179
## XWS-25-5 -0.0406 -0.0096
## XWS-26-1 -0.0605 -0.0135
## XWS-26-2 -0.0474 0.0003
## XWS-26-3 0.0361 -0.0030
## XWS-29-5 0.0342 -0.0064
## XWS-3-1 -0.0112 -0.0017
## XWS-3-2 -0.0265 -0.0007
## XWS-3-3 0.0215 0.0252
## XWS-3-4 -0.0086 0.0185
## XWS-3-5 -0.0231 -0.0056
## XWS-30-2 -0.0412 -0.0303
## XWS-30-3 -0.0098 -0.0441
## XWS-30-4 -0.0179 0.0133
## XWS-30-5 -0.0174 0.0287
## XWS-4-1 -0.0209 0.0496
## XWS-4-3 -0.0141 -0.0102
## XWS-4-5 -0.0224 0.0216
## XWS-5-2 -0.0299 0.0130
## XWS-5-3 -0.0036 0.0228
## XWS-6-1 -0.0191 0.0235
## XWS-6-2 0.0176 0.0008
## XWS-6-3 -0.0218 -0.0137
## XWS-6-4 0.0148 0.0250
## XWS-6-5 0.0066 0.0399
## XWS-7-1 0.0137 0.0113
## XWS-7-2 -0.0087 0.0098
##
## Goodness of fit:
## r2 Pr(>r)
## X 1 1
## Permutation: free
## Number of permutations: 999
The first two columns give the direction cosines of the vectors, and r2 gives the squared correlation coefficient. The ***FACTORS section lists a centroid for every sample because the sample-ID column in psv was read in as a factor with one level per sample (the X entry in the goodness-of-fit table, with r2 = 1); it carries no information here, and dropping the ID column from psv before calling envfit() would remove it.
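To work with these numbers directly, the arrow coordinates and r2 values can be pulled straight from the envfit object:
# Arrow endpoints scaled by sqrt(r2), handy for custom plot overlays
head(scores(ef, display = "vectors"))
# Raw direction cosines and squared correlations, per variable
head(ef$vectors$arrows)
head(ef$vectors$r)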
Now to plot them as fitted vectors on the ordination map. I'll limit the plotting to the most significant variables with the p.max argument.
plot(vare.mds, display = "sites")
plot(ef, p.max = 0.001)
Next, fit surfaces of the environmental variables to the ordination. This helps when the response is not actually linear. If it is linear and vectors are appropriate, the fitted surface is a plane whose gradient is parallel to the arrow, and the fitted contours are equally spaced parallel lines perpendicular to the arrow.
ef <- envfit(vare.mds ~ Mg + P, psv)
plot(vare.mds, display = "sites")
plot(ef)
# Overlay GAM-fitted surfaces as contours: Mg in red (default) and P in green
tmp <- with(psv, ordisurf(vare.mds, Mg, add = TRUE))
with(psv, ordisurf(vare.mds, P, add = TRUE, col = "green4"))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
##
## Estimated degrees of freedom:
## 7.57 total = 8.57
##
## REML score: 54.82107