Trying to teach statistics is best done out of a shark cage, using a simulated dataset where “everything” is known about it. That way there are little surprises and you can focus on the statistics, not that “hum, what the hell happened here?”.

In ecology, non-metric multidimensional scaling is quite popular to seek (dis)similarities between variable (location, environmental measurement…) or samples (species, families, locations…) based on rank of (dis)similarity matrix in lower dimensions (take two, for example).

I will demonstrate how to generate species occurrence data for sites and analyse the dataset using NMDS. Let’s generate a low number of species to get the sense of how the data actually “feels”.

# load all the require packages
require(coenocliner)

set.seed(357)

# example is based on the example from ?coenocline
nsite <- 10 # number of sites
nspec <- 20 # number of species
# various parameters
x <- seq(from = 0, to = 10, length = nsite)
opt <- runif(n = nspec, min = 1, max = 8)
tol <- runif(n = nsite, min = 1, max = 5)
h <- runif(n = nsite, min = 1, max = 50)

y <- coenocline(x, responseModel = "gaussian", 
             params = cbind(opt = opt, tol = tol, h = h), 
             countModel = "poisson")

rownames(y) <- paste("site", 1:nsite, sep = "") # give columns pretty names
colnames(y) <- paste("spec", 1:nspec, sep = "")
y
##        spec1 spec2 spec3 spec4 spec5 spec6 spec7 spec8 spec9 spec10 spec11
## site1      8    44     5     2     6    24     0     1     4     14      0
## site2     17    41     9    18    18    22     2     1     9     10      3
## site3     16    41    16    22    22    42     0     1    13     22      1
## site4      7    38    13    21    18    45     4     2    22     27     12
## site5      5    28    15     5    25    42    16     8    43     38     10
## site6      1    15     7     0    32    36    19    19    50     46     18
## site7      0     9     5     0    30    28    25    18    36     41     10
## site8      0     5     2     0    21    18    36     8    32     41      6
## site9      0     2     1     0    21    20    25     1    25     35      3
## site10     0     1     0     0    16    10    20     1    27     37      0
##        spec12 spec13 spec14 spec15 spec16 spec17 spec18 spec19 spec20
## site1      13      1      3     12     47     14      0      6     22
## site2      26      7      7     26     34     20      0     14     34
## site3      28      5     18     25     39     28      0     26     42
## site4      31     13     18     33     49     24      1     27     40
## site5      56     12      5     20     33     24      3     36     41
## site6      41     13      3     21     26     20      8     48     24
## site7      41      8      0     33     18     19     11     48     22
## site8      21      4      0     21     15      6     13     23     22
## site9      11      6      0     16      8      1      5     30     19
## site10      5      2      0      7      2      0      2     20     20

Notice that for each species, we have a Poisson distribution along the sites. We can visualize this.

y.smooth <- coenocline(x, responseModel = "gaussian",
                params = cbind(opt = opt, tol = tol, h = h),
                countModel = "poisson",
                expectation = TRUE)

matplot(x, y.smooth, type = "l", lty = "solid")

plot of chunk unnamed-chunk-2

Run NMDS on this using the vegan package.

require(vegan)
mdl <- metaMDS(y, distance = "bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.002193 
## Run 1 stress 0.01122 
## Run 2 stress 0.002354 
## ... procrustes: rmse 0.005973  max resid 0.008453 
## Run 3 stress 0.006319 
## Run 4 stress 0.006324 
## Run 5 stress 0.00663 
## Run 6 stress 0.01117 
## Run 7 stress 0.0112 
## Run 8 stress 0.003637 
## Run 9 stress 0.002723 
## Run 10 stress 0.002659 
## ... procrustes: rmse 0.06029  max resid 0.1215 
## Run 11 stress 0.005608 
## Run 12 stress 0.002299 
## ... procrustes: rmse 0.004794  max resid 0.007759 
## *** Solution reached
par(mfrow = c(2, 1))
plot(mdl, type = "t")
stressplot(mdl)

plot of chunk unnamed-chunk-3

mdl$stress
## [1] 0.002193

The above graph shows how species (columns) are similar to each other (but we don’t know by how much). For instance, species in columns 1, 4 and 14 are more similar to each other than to any other species. Same goes for species 7, 8 and 18. Compare the distributions of these six species. Notice that the first three have a peak around site2-site3, while the latter three peak later, at site7-site8.

y[, c(1,4,14, 7,8,18)]
##        spec1 spec4 spec14 spec7 spec8 spec18
## site1      8     2      3     0     1      0
## site2     17    18      7     2     1      0
## site3     16    22     18     0     1      0
## site4      7    21     18     4     2      1
## site5      5     5      5    16     8      3
## site6      1     0      3    19    19      8
## site7      0     0      0    25    18     11
## site8      0     0      0    36     8     13
## site9      0     0      0    25     1      5
## site10     0     0      0    20     1      2
# visualize the 6 species
# what I deemed similar is colored using one color
matplot(y[, c(1,4,14, 7,8,18)], type = "l", lty = "solid", col = rep(c(1,2), each = 3))

plot of chunk unnamed-chunk-4

Now that you’ve made sure NMDS “works”, you can try out your real messy data. Let the fun begin!