library(sf)Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(terra)terra 1.9.11
library(mapview)This week we will continue our exploration of numerical techniques in community ecology. A very common approach to estimating the different responses of species in the same community is to use one of several techniques that come under Ordination Analysis.
The basic premise of ordination analysis is to take what is usually multidimensional environmental space and reduce this to just a few dominant explanatory variables, typically expressed by two or three axes on a resulting plot. Our job is then to assess the pattern of distribution of the different species in this reduced environmental space where the distance between species is thought to correlate with ecological similarity.
We will look at this process for a dataset representing community data on bumbleebees collected for Ben Wagstaffe’s dissertation which was published in 2025. We will carry out ordination analysis considering the community data only (unconstrained ordination), and then the community data with some features of the environment (constrained ordination). We will also consider spatial and non-spatial approaches.
To begin, set your working directory as usual and then install the packages for today’s work if you don’t have them already:
#Install packages
install.packages(c("vegan","devtools","sf","terra","dplyr","mapview"))
library(devtools)
devtools::install_version("vegetarian", version = "1.2")Load the packages we will use for spatial data.
library(sf)Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(terra)terra 1.9.11
library(mapview)Part One: Exploring and preparing the community and site data
Read in the site data containing some environmental variables
beeSites=st_read("bee_sites.shp")Reading layer `bee_sites' from data source
`/Users/mattdennis/Library/Mobile Documents/com~apple~CloudDocs/Documents/DocumentsMatt 2/bees/bee_sites.shp'
using driver `ESRI Shapefile'
Simple feature collection with 18 features and 8 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 519525.8 ymin: 173833.5 xmax: 537202.1 ymax: 184702
Projected CRS: OSGB36 / British National Grid
This data set includes columns identifying each site,the abundance and richness of bumblebees, the number of plant genera at each site (Genus), and the local contribution of each site to plant beta diversity in the study (LCBD).
To these we will add information on the structure of the vegetation and the amount of green space within 1000 metres of the site (we could of course try other distances, but we already did that in the original study).
First, compute the area of each site (assuming that this might influence bumblebee diversity - or at least to test that assumption) and plot them to understand the locations.
beeSites$Area=as.numeric(st_area(beeSites))
#plot the sites
mapview(beeSites)Now add the information on neighbouring greenspace.
Load in a raster of green space in the study area (from the OS Master Map Green Space Layer)
londonGS=rast("londonGS.tif")Now compute the green space cover within a 1000 metre buffer.
pcList=list()
for(i in 1:nrow(beeSites)){
buff.i=st_buffer(beeSites[i,],dist=1000)
buff.i=erase(vect(buff.i),vect(beeSites))
crop.i=crop(londonGS,buff.i,mask=T)
sum.i=sum(values(crop.i),na.rm=T)
pc.i=sum.i/expanse(buff.i)*100
pc.i
pcList[[i]]=pc.i}
beeSites$GS_1km=unlist(pcList)We can plot some summary statistics on key site data such as bumblebee abundance, richness and local contribution to plant beta diversity.
boxplot(Abundance~Type,data=beeSites,col=c("red","blue","green"))boxplot(LCBD~Type,data=beeSites,col=c("red","blue","green"))boxplot(Richness~Type,data=beeSites,col=c("red","blue","green"))Next we will add in information about site vegetation structure. This was measuerd in the field and is contained in the “comp.csv” file. These data consist of the site ID and one column each for the percentage of the site (averaged over all transects) covered by different height classes. These are “H20” (> 10 metres), “H10” (>5 & <=10 metres), “H5” (>1 & <=5 metres), “H1” (>20 cm & <1 metre), “H02” (>5cm & <=20cm), “H005” (<=5 cm)
compData=read.csv("comp.csv")
#inspect
head(compData) Site id H20 H10 H5 H1 H02 H005
1 Green Park 1 10 5 0 0 5 80
2 Kensington 2 4 0 0 0 4 92
3 Clapham 3 4 0 0 0 6 90
4 Peckham 4 4 0 0 0 11 85
5 Weavers 5 0 5 0 5 5 85
6 Victoria 18 15 0 0 0 0 85
We can use these data in their raw format (and we will later as predictor variables) but we might also be interested in computing a diversity index here to get an impression of how complex each site is in general with respect to the distribution of different vegetation layers (heights). To do this we will use Hill numbers as we did in the previous practical only in this case we apply the diversity index to vegetation classes (of height) rather than to species, but the principle works exactly the same. We will actually compute Hill ratios here which is the diversity index divided by the total number of classes to give a relative measure of the evenness in the distribution of vegetation layers between sites.
# Function to list over increasing values of q (Hill Number) and compute site alpha diversity for vegetation height (where increasing values of q emphasise the evenness between height classes).
#library for diversity indices
library(vegetarian)
qFun=function(x){
#init list of results for each site
qList=list()
#iterate of each site (rows in the data)
for( i in 1:nrow(compData)){
#x==0 this is species richness so just leave as is
if(x==0){
div.i=d(compData[i,3:8], lev = "alpha", q = x)
}else{
#compute diversity index for current Hill number
div.i=d(compData[i,3:8], lev = "alpha", q = x)
#compute Hill ratio
div.i=div.i/d(compData[i,3:8], lev = "alpha", q = 0)}
#add to list
qList[[i]]=as.numeric(div.i)
}
return(unlist(qList))
}
#set Hill numbers over which to iterate
qN=0:3
#run the function for each value of q
qDat=lapply(qN,qFun)
#collate and creat a data frame
qData=do.call("cbind",qDat)
qData=data.frame(qData)
#set column names and add the id column from the compData object
colnames(qData)=c("q0","q1","q2","q3")
qData$id=compData$idNow we can bind the site characteristics and the bee data together
#merge all data
beesSiteData=merge(beeSites,compData[,2:ncol(compData)],by="id")
beesSiteData=merge(beesSiteData,qData,by="id")Next we will read in and perform a Hellinger transformation on the species abundance data to help with model convergence when carrying out the ordination analysis.
#library for data transformations
library(vegan)Warning: package 'vegan' was built under R version 4.5.3
Loading required package: permute
#load in abundance data
dataAbundance=read.csv("bee_Table.csv",check.names = FALSE)
#Hellinger transform
spe_hellinger=decostand(dataAbundance[,4:ncol(dataAbundance)], method = "hellinger")
#bind abundance and transformed data
beesCompData=cbind(dataAbundance[,1:2],spe_hellinger)
#set column names
colnames(beesCompData)=c("Site","id",colnames(spe_hellinger))
#Create final data set
beesFinal=merge(beesCompData,beesSiteData,by="id")The data frame is a bit of a mess so we can tidy up the column names.
#set colnames
colnames(beesFinal)=c("id","Site","B.terrestris/lucorum","B.Pascuorum","B.Pratorum","B.Lapidarius",
"B.Hortorum","B.Hypnorum","Type","Abundance",
"Richness","LCBD","Genus","Site Area",
"GS1km","Vegetation>10m",
"Veg10m","Vegetation5m","Veg1m","Vegetation20cm",
"Vegetation5cm","Structure_q0","Structure_q1",
"Structure_q2","Structure q=3","geometry")Finally, make sure that the “Type” column is set as a factor (categorical variable) for use in plotting later on (currently it is a character string).
#Type as factor
beesFinal$Type=as.factor(beesFinal$Type)Part Two: Orindation
The first technique we will use is an unconstrained ordination approach (i.e. one that does not consider environmental variables but simply partitions samples based on community composition).
Here we use the metaMDS() function from the vegan package and pass the following arguments: 1. the community data, the number of dimensions to which we reduce the data,the number of maximum iterations to find a solution that best reflects the observed dissimilarity between points, and the dissimilarity index to use.
The performance of any solution to visualising the data points in a way that reflects their dissimilarity is captured by the model’s stress value. The stress value is simply a measure of the degree to which the low dimensional representation of the data (i.e. in a small number of dimensions) captures the variation in the original, high-dimensional data. Values for stress <0.2 are generally thought to suggest a reliable model. The metaMDS() function is quite verbose (it throws out a lot of information) and this information is basically telling us how well each iteration performs with respect to minimising the stress value. The “procrustes” reference and its rmse score is the technique used to compare one ordination (one attempt to find a good solution to achieving dimension reduction in the community data) to another. It is basically a way of seeing if the new best solution is significantly different to the previous one. If it isn’t, and the best solution repeats several times, then the algorithm stops. This process involves normalizing two ordinations to be the same size (on the same scale) and rotating them for ease of comparison.
Fun fact - “Procrustes” is a reference to a figure from Greek mythology that forced his victims to fit the size of an iron bed (by stretching or cutting off their limbs!)
#first ordination
bees_NMDS=metaMDS(beesFinal[,3:8],k=2,trymax=100,distance = "bray")Run 0 stress 0.121772
Run 1 stress 0.121772
... Procrustes: rmse 2.653695e-06 max resid 7.591709e-06
... Similar to previous best
Run 2 stress 0.121772
... Procrustes: rmse 2.972716e-06 max resid 6.98494e-06
... Similar to previous best
Run 3 stress 0.140969
Run 4 stress 0.121772
... Procrustes: rmse 7.373775e-07 max resid 2.23717e-06
... Similar to previous best
Run 5 stress 0.121772
... Procrustes: rmse 3.826445e-06 max resid 8.870516e-06
... Similar to previous best
Run 6 stress 0.121772
... Procrustes: rmse 2.114486e-06 max resid 6.028263e-06
... Similar to previous best
Run 7 stress 0.121772
... Procrustes: rmse 1.568469e-06 max resid 4.996627e-06
... Similar to previous best
Run 8 stress 0.121772
... Procrustes: rmse 2.811194e-06 max resid 6.845068e-06
... Similar to previous best
Run 9 stress 0.140969
Run 10 stress 0.140969
Run 11 stress 0.140969
Run 12 stress 0.140969
Run 13 stress 0.140969
Run 14 stress 0.2016947
Run 15 stress 0.2030891
Run 16 stress 0.121772
... New best solution
... Procrustes: rmse 5.935681e-07 max resid 1.000056e-06
... Similar to previous best
Run 17 stress 0.121772
... New best solution
... Procrustes: rmse 1.271187e-06 max resid 3.043148e-06
... Similar to previous best
Run 18 stress 0.140969
Run 19 stress 0.121772
... Procrustes: rmse 2.459612e-06 max resid 6.352932e-06
... Similar to previous best
Run 20 stress 0.2047755
*** Best solution repeated 2 times
We can plot the agreement between the ordination and the observed dissimilarity in the original data as a goodness-of-fit test.
#plot the modelled versus predicted dissimilarity
stressplot(bees_NMDS)We can extract some key information from the output for more versatile plotting. We’ll create a dataframe from the model’s site scores (the position in the ordination space for each site) and then plot each site by type (i.e. park, allotment, cemetery) using ggplot.
#extract NMDS scores (x and y coordinates) for sites
data.scores = as.data.frame(scores(bees_NMDS)$sites)
#add site names from the original dataframe
data.scores$Site = beesFinal$Site
#add site types from the original dataframe
data.scores$Type = beesFinal$TypeNext we’ll work on a plot for the results of the NMDS. There is an internal function for plotting in the vegan package but we’ll use ggplot for greater flexibility. Below we’ll create a convex hull (which is a kind of minumum bounding polygon) around sites of each type using the chull() function. Note also the use of piping below which is a quick way of passing the results of one function into another in quick succession. It is typically using when using packages from tidyVerse packages such as dplyr and is partcular good for data handling and manipulation.
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:terra':
intersect, union
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
#create convex hull
typeConvexhull =
data.scores %>%
group_by(Type) %>%
slice(chull(NMDS1, NMDS2))
#create ggplot of the nmds
nmdsPlot = ggplot(data.scores, aes(x = NMDS1, y = NMDS2))+
geom_point(size = 4, aes( shape = Type, colour = Type)) +
geom_polygon(data = typeConvexhull,
aes(x = NMDS1, y = NMDS2, fill = Type, group = Type),
alpha = 0.30) +
theme(axis.text.y = element_text(colour = "black", size = 12, face = "bold"),
axis.text.x = element_text(colour = "black", face = "bold", size = 12),
legend.text = element_text(size=12,face ="bold", colour ="black"),
legend.position = "right",
axis.title.y = element_text(face = "bold", size = 14),
panel.background = element_blank(), panel.border = element_rect(colour="black",fill=NA,linewidth = 1),
axis.title.x = element_text(face = "bold", size = 14, colour = "black"),
legend.title = element_text(size = 14, colour = "black", face = "bold"),
) +
labs(x = "NMDS1", colour = "Type", y = "NMDS2", shape = "Type")
#plot
nmdsPlotAs we can see from the plot there is some overlap between site types but there is also evidence that the bumblebee communities are also relatively distinct between types. Next we will look at some constrained ordination techniques that include environmental covariates in the generation of ordination plots.
First we will use the cca() function to run a canonical correspondence analysis. We pass two arguments to the function - the community data and the explanatory variables site area, neighbourhood green space, plant LCBD, genus richness, vegetation >10metres, vegetation 1 metre, Structure_q0 (structural diversity order 0) and structure_q3 (structural diversity order 3). Recall that CCA assumes a unimodal response by species to environmental gradients.
#buil CCA model
bees_cca = cca(beesFinal[,3:8] ~., data = beesFinal[,c(12,13,14,15,16,19,22,25)])
#get model sumamary
sumCca=summary(bees_cca)
#explanatory power of first axis
sumCca$cont$importance[2,"CCA1"][1] 0.2805095
#explanatory power of first axis
sumCca$cont$importance[2,"CCA2"][1] 0.1861079
Now we build the ordination plot.
#set colours to differentiate ste types
colvec = c("darkgreen", "orange", "red")
#plot the axes
plot(bees_cca, type = "n", scaling = 3)
#fit the model
fit = envfit(bees_cca,beesFinal[,c(12,13,14,15,16,19,22,25)], perm = 9999)
#and plot
plot(fit, add = TRUE, col = "black",cex=0.7)
#this next part plots the sites according to their position in the ordination space
points(bees_cca, display = "sites", col = colvec[beesFinal$Type],cex=1.8,
scaling = 3, pch = 21, bg = colvec[beesFinal$Type])
#add legend
with(bees_cca, legend("bottomright", legend = levels(as.factor(beesFinal$Type)), bty = "n",
col = colvec, pch = 21,cex=1.2, pt.bg = colvec))
#add the species points
points(bees_cca, display = "species", pch = 3, cex = 2, col = "black")
#add species names to the plot
text(bees_cca, display = "species", col = "blue", cex = 0.7,pos=1,offset=-1.5)Next,we use RDA analysis, assuming a linear response by species to environmental gradients.
#build the RDA model
my_rda = rda(beesFinal[,3:8] ~., data = beesFinal[,c(12,13,14,15,16,19,22,25)])
#summarise the model
sumRda=summary(my_rda)
#assess the explanatory power of the first two axes.
sumRda$cont$importance[2,"RDA1"][1] 0.322931
sumRda$cont$importance[2,"RDA2"][1] 0.1576901
#set up ordination space
plot(my_rda, type = "n", scaling = 3)
fit = envfit(my_rda,beesFinal[,c(12,13,14,15,16,19,22,25)], perm = 9999)
plot(fit, add = TRUE, col = "black",cex=0.7)
# add sites
points(my_rda, display = "sites", col = colvec[beesFinal$Type],cex=1.8,
scaling = 3, pch = 21, bg = colvec[beesFinal$Type])
#add legend
with(beesFinal, legend("bottomright", legend = levels(as.factor(beesFinal$Type)), bty = "n",
col = colvec, pch = 21,cex=1.2, pt.bg = colvec))
#add species
points(my_rda, display = "species", pch = 3, cex = 2, col = "black")
#add species names
text(my_rda, display = "species", col = "blue", cex = 0.7,pos=1,offset=-1.5)According to the model summary, the first two axes of the RDA result explain more variation than those of the CCA.
Finally, we can incorporate space into our analysis through the use of a spatial weights matrix.
#get coordinates of all site centroids
XY=st_coordinates(st_centroid(beesFinal$geometry))
#create distance matrix
dist_matrix = as.matrix(dist(XY[,c("X", "Y")]))
#use the pcnm() function in the vegan package to approximate the spatial structure in the data using a technique called Principal Coordinates on Neighborhood Matrices
#calculate Principal Coordinates on Neighborhood Matrices
pcnm_dist = pcnm(dist_matrix)With the spatial structure in the data approximated, we can add this into our model to “partial out” (remove the effect of) spatial configuration. In the model formula below we add the first 10 dimensions of the pcnm_dist scores to the analysis (note the term dimensions here that underlines the fact that the pcnm() function is also a dimension reduction technique that reduces the complex spatial structure in the data to just a few dimensions so that these can be used in the ordination). The number of dimensions used is arbitrary and we could experiment here.
#run the partial RDA
rda_partial = rda(beesFinal[,3:8] ~., data =beesFinal[,c(12,13,14,15,16,19,22,25)]+I(scores(pcnm_dist, choices = 1:10)))
#plot
plot(rda_partial, type = "n",scaling=3)
#fir the model
fit = envfit(rda_partial,beesFinal[,c(12,13,14,15,16,19,22,25)], perm = 9999)
#and plot
plot(fit, add = TRUE, col = "black",cex=0.7)
#add site points
with(beesFinal, points(rda_partial, display = "sites", col = colvec[beesFinal$Type],cex=1.8,
scaling = 3, pch = 21, bg = colvec[Type]))
#with legend
with(beesFinal, legend("bottomright", legend = levels(as.factor(beesFinal$Type)), bty = "n",
col = colvec, pch = 21,cex=1.2, pt.bg = colvec))
points(rda_partial, display = "species", pch = 3, cex = 2, col = "black")
text(rda_partial, display = "species", col = "blue", cex = 0.7,pos=1,offset=-1)sumRDA_P=summary(rda_partial)
sumRDA_P$cont$importance[2,"RDA1"][1] 0.319522
sumRDA_P$cont$importance[2,"RDA2"][1] 0.1162752
We can see from the summary results that, for once, inlcuding space into our analysis does not improve model performance (because the first two axes explain less variance than in the case of the non-spatial RDA). There are several reasons why this might be the case. First, site characteristics may override any influence of spatial configuration on bumblebee diversity (which is one of the conclusions of the Wagstaffe and Dennis 2025 paper). Secondly, urban environments are incredibly heterogeneous and a simple Euclidean distance measure will not capture the variability in landscape resistance between site. Another explanation might be that site centroids don’t give an appropriate approximation of the overall configuration.