Trying to teach statistics is best done out of a shark cage, using a simulated dataset where “everything” is known about it. That way there are little surprises and you can focus on the statistics, not that “hum, what the hell happened here?”.
In ecology, non-metric multidimensional scaling is quite popular to seek (dis)similarities between variable (location, environmental measurement…) or samples (species, families, locations…) based on rank of (dis)similarity matrix in lower dimensions (take two, for example).
I will demonstrate how to generate species occurrence data for sites and analyse the dataset using NMDS. Let’s generate a low number of species to get the sense of how the data actually “feels”.
# load all the require packages
require(coenocliner)
set.seed(357)
# example is based on the example from ?coenocline
nsite <- 10 # number of sites
nspec <- 20 # number of species
# various parameters
x <- seq(from = 0, to = 10, length = nsite)
opt <- runif(n = nspec, min = 1, max = 8)
tol <- runif(n = nsite, min = 1, max = 5)
h <- runif(n = nsite, min = 1, max = 50)
y <- coenocline(x, responseModel = "gaussian",
params = cbind(opt = opt, tol = tol, h = h),
countModel = "poisson")
rownames(y) <- paste("site", 1:nsite, sep = "") # give columns pretty names
colnames(y) <- paste("spec", 1:nspec, sep = "")
y
## spec1 spec2 spec3 spec4 spec5 spec6 spec7 spec8 spec9 spec10 spec11
## site1 8 44 5 2 6 24 0 1 4 14 0
## site2 17 41 9 18 18 22 2 1 9 10 3
## site3 16 41 16 22 22 42 0 1 13 22 1
## site4 7 38 13 21 18 45 4 2 22 27 12
## site5 5 28 15 5 25 42 16 8 43 38 10
## site6 1 15 7 0 32 36 19 19 50 46 18
## site7 0 9 5 0 30 28 25 18 36 41 10
## site8 0 5 2 0 21 18 36 8 32 41 6
## site9 0 2 1 0 21 20 25 1 25 35 3
## site10 0 1 0 0 16 10 20 1 27 37 0
## spec12 spec13 spec14 spec15 spec16 spec17 spec18 spec19 spec20
## site1 13 1 3 12 47 14 0 6 22
## site2 26 7 7 26 34 20 0 14 34
## site3 28 5 18 25 39 28 0 26 42
## site4 31 13 18 33 49 24 1 27 40
## site5 56 12 5 20 33 24 3 36 41
## site6 41 13 3 21 26 20 8 48 24
## site7 41 8 0 33 18 19 11 48 22
## site8 21 4 0 21 15 6 13 23 22
## site9 11 6 0 16 8 1 5 30 19
## site10 5 2 0 7 2 0 2 20 20
Notice that for each species, we have a Poisson distribution along the sites. We can visualize this.
y.smooth <- coenocline(x, responseModel = "gaussian",
params = cbind(opt = opt, tol = tol, h = h),
countModel = "poisson",
expectation = TRUE)
matplot(x, y.smooth, type = "l", lty = "solid")
Run NMDS on this using the vegan package.
require(vegan)
mdl <- metaMDS(y, distance = "bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.002193
## Run 1 stress 0.01122
## Run 2 stress 0.002354
## ... procrustes: rmse 0.005973 max resid 0.008453
## Run 3 stress 0.006319
## Run 4 stress 0.006324
## Run 5 stress 0.00663
## Run 6 stress 0.01117
## Run 7 stress 0.0112
## Run 8 stress 0.003637
## Run 9 stress 0.002723
## Run 10 stress 0.002659
## ... procrustes: rmse 0.06029 max resid 0.1215
## Run 11 stress 0.005608
## Run 12 stress 0.002299
## ... procrustes: rmse 0.004794 max resid 0.007759
## *** Solution reached
par(mfrow = c(2, 1))
plot(mdl, type = "t")
stressplot(mdl)
mdl$stress
## [1] 0.002193
The above graph shows how species (columns) are similar to each other (but we don’t know by how much). For instance, species in columns 1, 4 and 14 are more similar to each other than to any other species. Same goes for species 7, 8 and 18. Compare the distributions of these six species. Notice that the first three have a peak around site2-site3, while the latter three peak later, at site7-site8.
y[, c(1,4,14, 7,8,18)]
## spec1 spec4 spec14 spec7 spec8 spec18
## site1 8 2 3 0 1 0
## site2 17 18 7 2 1 0
## site3 16 22 18 0 1 0
## site4 7 21 18 4 2 1
## site5 5 5 5 16 8 3
## site6 1 0 3 19 19 8
## site7 0 0 0 25 18 11
## site8 0 0 0 36 8 13
## site9 0 0 0 25 1 5
## site10 0 0 0 20 1 2
# visualize the 6 species
# what I deemed similar is colored using one color
matplot(y[, c(1,4,14, 7,8,18)], type = "l", lty = "solid", col = rep(c(1,2), each = 3))
Now that you’ve made sure NMDS “works”, you can try out your real messy data. Let the fun begin!