library(terra) #for spatial data
library(sf) #data frames with geometry
library(spatstat) #for point process modelling and converting between raster and pixel image objects
GEOG71922: Week Three: Species Distribution Modelling Two
In Week Two we looked at the use of envelope models, generalised linear models, Maxent and point process models for estimating the distribution of a species (Sciurus vulgaris). According to the cross-validation approach taken (K-fold partitioning), we achieved a good level of prediction. However, as we saw in the lecture, ecological data tend to be spatially dependent, meaning that observations in close proximity will be more similar than those further apart. In terms of model validation this means that, with K-fold validation in which samples are selected at random, training and test samples may be quite similar for any given covariate. This can (and almost always does) lead to over-optimistic model validation statistics (e.g. AUC scores). A more reliable approach is to ensure that the folds used in cross-validation are spatially disjoint, i.e. that the geographic distance between training and testing data is maximised. This is particularly important for species distribution modelling, where our goal is to predict the distribution of our species of interest across space, so the ability of our model to work across different geographical areas is critical.
In this practical you will explore how spatial dependence affects model validation and the options available for overcoming such problems. We will use statistical models and machine learning algorithms on the data for Sciurus vulgaris (from Week One).
Using these data, you will then learn how, for machine learning algorithms, model hyperparameters can be tuned to improve performance. You can think of model hyperparameters as the “settings” for any given model. For example, the hyperparameter “mtry” is the number of explanatory variables sampled as candidates at each split of each tree in a random forest, and “min.node.size” is the minimum number of cases allowed in a terminal node of each decision tree. “Parameters”, on the other hand, are the values learned from the data during training (so “hyper” refers to the fact that we set these properties BEFORE the learning process begins).
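To make this concrete, here is a hedged sketch of how hyperparameters can be fixed by hand when creating a learner in the mlr package (you don't need to run this now - mlr is installed and loaded later in the practical, and the learner name and values here are purely illustrative; we build our real learners below):
library(mlr)
#fix three ranger (random forest) hyperparameters up-front, before any training happens
lrnExample = makeLearner("classif.ranger",
                         predict.type = "prob",
                         par.vals = list(mtry = 2, min.node.size = 10, num.trees = 500))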
In part two of the practical we will adopt a point process modeling approach to understanding and making predictions on these same data.
First create and set the working directory, and copy over the data in the “SDM2” folder of the course dropbox (link on the Week Three Canvas page).
setwd("./GEOG71922/WeekThree")
Install the packages for today’s work (we will call the libraries on first use):
#Install packages
install.packages(c("dismo","terra","spatstat","cowplot","ggplot2","precrec","glmnet","maxnet","ranger"))
#note: "ranger" is included because the random forest learner we use later depends on it
#We also need to install the "mlr" package for this practical
install.packages("mlr")
Load the libraries for terra, sf and spatstat that we’ll need straight away.
Before we get started with Part one, we will build a function at the top of our scripts that we will need later on to convert raster objects to pixel images. The spatstat package (that we will use later to do some point process modelling) uses pixel images rather than raster objects, which is super annoying, but the spatstat package is otherwise ace, so we have to play ball. What is even more annoying is that there are no off-the-shelf functions out there that will do the conversion from raster to pixel image, so we have to make one ourselves. Well, almost ourselves, I adapted the following from some wisdom here: https://gis.stackexchange.com/questions/115159/converting-raster-to-im-object-for-point-process-model-covariate-in-r.
If you are happy enough to simply know that this function converts raster objects to pixel images then just copy this over to your script, run it and leave it alone for the rest of the practical. If you want to know more about what’s going on, read the comments above each line of the function and check out the context from the stack exchange link above.
#Function to convert raster to images for spatstat. Takes one argument "im" that should be a raster object
raster.as.im = function(im) {
#get the resolution (cell size) of the raster
r = raster::res(im)[1]
#get the origin (bottom left corner of the raster/image)
orig = ext(im)[c(1,3)]
#set the x coordinates of the columns: a sequence starting at the origin and increasing by the cell size (here 100 metres) for every column
xx = orig[1] + seq(from=0,to=(ncol(im) - 1)*r,by=r)
#set the y coordinates of the rows in the same way
yy = orig[2] + seq(from=0,to=(nrow(im) - 1)*r,by=r)
#now build a single matrix with the cell values and dimensions we want - note that we reverse the rows in the matrix with nrow(im):1. This is because spatstat reads images in the reverse row order to how raster objects are usually stored (so the code below ensures our image is not upside-down when we come to analyse it)
mat=matrix(raster::values(im), ncol = ncol(im),
nrow = nrow(im), byrow = TRUE)[nrow(im):1, ]
return(spatstat.geom::im(mat,
xcol = xx, yrow = yy))
}
Part One: creating learners in the mlr (“Machine Learning in R”) package
The mlr package provides a suite of tools to explore, create and test modelling strategies that are, to some degree, compatible with a machine learning approach.
A key concept in mlr, and in machine learning in general, is the “Learner”. Learners are the objects that generate information (i.e. they “learn”) from the data provided. The data themselves are generally formulated as a “task”. So the basic logic is that we “learn” how to carry out our “task” and then use this experience for optimal prediction. If that all sounds a little inaccessible, it may help to simply think of Learners as models (as in classical statistics, e.g. regression models) and Tasks as data sets. Typically, creating a task involves choosing a data set and telling R which column holds the species observations.
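To make the Learner/Task idea concrete, here is a tiny, hedged example using R’s built-in iris data (purely illustrative - you don’t need to run it, and our real task is built from the squirrel data below):
library(mlr)
#a toy task: the data set is iris and the target column is Species
toyTask = makeClassifTask(data = iris, target = "Species")
#a toy learner: a simple classification tree (requires the rpart package, shipped with R)
toyLearner = makeLearner("classif.rpart")
#the learner "learns" the task, producing a fitted model we could use for prediction
toyModel = train(toyLearner, toyTask)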
Okay, let’s set up our first task. First we will get the same data as we used in Week One, with one addition (some elevation data).
Next we will load in the data we need for the analysis.
#load in the red squirrel data
sciurus=read.csv("Sciurus.csv")
#subset to cases with coordinate uncertainty <100 metres
sciurus=sciurus[sciurus$Coordinate.uncertainty_m<100,]
#make a data frame of coordinates for the spatial points layer
sciurus.latlong=data.frame(x=sciurus$Longitude,y=sciurus$Latitude)
#use the coordinates to create our spatial points (sf) object, setting the crs to WGS84 (epsg:4326)
sciurus.sp=st_as_sf(sciurus.latlong,coords=c("x","y"),crs="epsg:4326")
Now we load in a polygon of the study area (central Scotland) and clip the squirrel data and the land cover data to this spatial extent.
#First set the extent to the study area
scot=st_read('scotSamp.shp')
#load in the land cover map and then clip to the polygon
LCM=rast("LCMUK.tif")
#crop to the extent of the study area plus a little more (because we will lose a small amount of data in the next step)
LCM=crop(LCM,st_buffer(scot, dist= 1000))
Next we will do some data aggregation. This is not something I would normally do without good reason (as data aggregation always leads to some loss of information). However, in the case of this week’s practical I see two good reasons to increase the cell size of our covariate layers: 1. both machine learning approaches and point process modelling can be computationally intensive and 2. given that we identified a characteristic scale of 1800 metres (and we will summarise information at this scale), it seems reasonable to work with a resolution of about 100 metres, which should speed things up without a critical loss of precision.
We will use the aggregate() function to increase our LCM data from 25 to 100 metre resolution by passing “4” to the “fact” argument of the aggregate() function (this increases the cell size by a factor of four). We choose to take the modal value of the aggregated cells because our data are categorical. Note that by increasing the cell size we will lose some definition around the edges (meaning we effectively lose a small amount of area/data) - this is why we buffered the study area before this step (we will crop back to the original study area later on).
#aggregate LCM raster
LCM=aggregate(LCM$LCMUK_1,fact=4,fun="modal")
Make sure that the squirrel points are set to the same projection as the LCM layer.
#project squirrel data
sciurus.sp=st_transform(sciurus.sp,crs(LCM))
Now crop the squirrel data to the study area polygon. The sf package lets us do this with square brackets, just as if we were working with a data frame.
#now crop our points to the study area
sciurusFin=sciurus.sp[scot,]
#finally, mask the LCM to this boundary
LCM=crop(LCM,scot,mask=TRUE)
#inspect
plot(LCM)
plot(sciurusFin$geometry,add=T)
Now we can start to set up our covariates (environmental and land cover variables). We start by reclassifying the land cover raster as in Week One.
#access levels of the raster by treating them as categorical data ('factors' in R)
LCM=as.factor(LCM$LCMUK_1)
#create a vector object called reclass which is 1 for the broadleaf woodland class and 0 for all other classes
reclass = c(0,1,rep(0,20))
# combine with the LCM categories into a matrix of old and new values.
RCmatrix=cbind(levels(LCM)[[1]],reclass)
RCmatrix=RCmatrix[,2:3]
#apply function to make sure new columns are numeric (here the "2" specifies that we want to apply the as.numeric function to columns, where "1" would have specified rows)
RCmatrix=apply(RCmatrix,2,FUN=as.numeric)
#Use the classify() function to assign new values to LCM with our reclassification matrix
broadleaf=classify(LCM, RCmatrix)
Remember that we assumed the characteristic scale of the response of Sciurus vulgaris to broadleaf woodland to be 1800 metres? Well, we will now create a variable to reflect this by generating a raster layer in which the value at each cell gives the proportion of broadleaf woodland within an 1800 metre radius. This is conceptually the same as what we did in the Week One class, but here we want to create a raster (not just a data frame) to actually characterise the spatial distribution of this characteristic. We achieve this by building a spatial weights matrix and entering it into a focal analysis (a focal analysis is just an analysis in which you characterise the neighbourhood of a location, so when you see “focal” just think “neighbourhood”).
Next we create the spatial weights matrix (often this can be done by just passing a couple of numbers to the focalMat() function but I want you to understand how this works and conquer the world of spatial weights matrices!).
First we create an object to give us the number of rows and columns that will be needed in the matrix (you can think of the matrix as being like a bunch of grid cells with values - similar to a table in Excel). nPix below represents the number of pixels needed to span the 1800 metre radius (remember, a raster layer is really just a grid of numbers with added coordinates - no magic!).
#neighbourhood weights matrix to sum all available resources for each cell
#get number of pixels needed to cover the 1800 metre radius (round() just rounds this to the nearest integer - because you can't have 0.5 pixels)
nPix=round(1800/res(LCM)[1])
#next, you need to double this number (for the distance in two directions, i.e. nPix is the radius and we want the diameter) and add one (because a weights matrix always needs an odd number of rows and columns so that you have a central cell and an equal number of grid cells either side).
nPix=(nPix*2)+1
Next we build the actual matrix using the matrix() function. The first argument (1:nPix^2) fills the matrix with the numbers 1 to nPix^2, which act as temporary cell IDs. The second and third arguments state the number of rows and columns.
#build weights matrix
weightsMatrix=matrix(1:nPix^2,nrow=nPix,ncol=nPix)
We need to identify which cell in the matrix is the focal cell - this should now be obvious, it’s the middle one. So we get the row and column of this cell as x and y.
ceiling() is like round() but just ensures that we round up, not down.
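A quick illustrative comparison:
round(2.3)    #returns 2 (rounds to the nearest integer)
ceiling(2.3)  #returns 3 (always rounds up)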
#get focal cell
x=ceiling(ncol(weightsMatrix)/2)
y=ceiling(nrow(weightsMatrix)/2)
focalCell=weightsMatrix[x,y]
Use the which() function to get the position of the focal cell. The “==” symbol means “is equal to” and arr.ind=TRUE means that we want the function to return the index (row and column) of the focal cell. So the code below just means “return the row and column of the cell in weightsMatrix that matches the focal cell”.
indFocal=which(weightsMatrix==focalCell,arr.ind = TRUE)
Now that the matrix is constructed, we need to populate it with distances so that we have a “window” for analysis (where the limit of this window will be set to the 1800 metre scale determined in Week One).
#compute distances
distances=list()
for(i in 1:nPix^2){
ind.i=which(weightsMatrix==i,arr.ind=T)
diffX=abs(ind.i[1,1]-indFocal[1,1])*res(LCM)[1]
diffY=abs(ind.i[1,2]-indFocal[1,2])*res(LCM)[1]
dist.i=sqrt(diffX^2+diffY^2)
distances[[i]]=dist.i
}
The next step is to enter these values into the matrix, which is simply done using square brackets (square brackets are useful, right?).
#add distance values to the weights matrix
weightsMatrix[]=unlist(distances)
#set cells outside search radius to NA
weightsMatrix[weightsMatrix>1800]=NA
Now normalise all cell values so that they sum to one. Doing this means that when we use the focal() function later on, results will be returned as a proportion. This works because the weights matrix acts as a multiplier on the cell values it is applied to. So, when it is applied to a binary raster representing woodland as 1 or 0: if all cells are woodland (i.e. 1), multiplying the cell values by the weights and summing them returns 1 (because the weights sum to one); if half the cells are woodland, the result is 0.5.
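Here is a tiny, self-contained illustration of that logic (the 3 x 3 window and the values in it are made up for demonstration):
#toy illustration: a hypothetical 3 x 3 neighbourhood with nine equal weights of 1/9
w = matrix(1/9, nrow = 3, ncol = 3)
#a neighbourhood in which 5 of the 9 cells are woodland (1) and 4 are not (0)
vals = matrix(c(1,1,1,1,1,0,0,0,0), nrow = 3)
sum(w * vals)  #returns 0.556, i.e. the proportion of woodland in the neighbourhood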
#normalise the weights matrix by setting each non-NA cell to 1/(number of non-NA cells) so that the weights sum to one
weightsMatrix[!is.na(weightsMatrix)]=1/length(weightsMatrix[!is.na(weightsMatrix)])
#test to see for yourself
sum(weightsMatrix,na.rm=T)
[1] 1
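As a hedged aside, the focalMat() shortcut mentioned earlier can build a comparable circular weights matrix in one line (the object name below is just illustrative; cells at the edge of the circle may be weighted slightly differently to our hand-made version, so results would be very similar but not identical):
#one-line alternative: a circular weights matrix of radius 1800 m whose values sum to one
weightsShortcut = focalMat(broadleaf, 1800, type = "circle")
dim(weightsShortcut)  #should match the dimensions of our hand-built matrix
sum(weightsShortcut)  #should also be 1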
Now, use the focal() function, telling it to apply our weights matrix to the woodland layer
#sum neighbourhood values from all surrounding cells
lcm_wood_1800=focal(broadleaf,w=weightsMatrix,fun="sum")
In addition to broadleaf woodland, let’s also create a layer for the urban class. Some initial testing on characteristic scale suggests that 2300 metres is a reasonable estimate of the response of the species to urbanisation.
Here’s the proof (yes, I really checked):
We will use this to create a weights matrix in the same way as above for the urban land cover class (we will consider suburban and urban as one class).
#create a vector object called reclassUrban which is zero for all classes except the two urban classes in the LCM
reclassUrban = c(rep(0,19),1,1)
# combine with the LCM categories into a matrix of old and new values.
RCmatrixUrban= cbind(levels(LCM)[[1]],reclassUrban)
RCmatrixUrban=RCmatrixUrban[,2:3]
#apply function to make sure new columns are numeric (here the "2" specifies that we want to apply the as.numeric function to columns, where "1" would have specified rows)
RCmatrixUrban=apply(RCmatrixUrban,2,FUN=as.numeric)
#Use the classify() function to assign new values to LCM with our reclassification matrix
urban = classify(LCM, RCmatrixUrban)
#neighbourhood weights matrix to sum all available resources for each cell
#get number of pixels needed to cover the 2300 metre radius for the urban class (round() just rounds this to the nearest integer - because you can't have 0.5 pixels)
nPixUrban=round(2300/res(LCM)[1])
#next, you need to double this number (nPixUrban is the radius in cells and we want the diameter) and add one (because a weights matrix always needs an odd number of rows and columns so that you have a central cell and an equal number of grid cells either side).
nPixUrban=(nPixUrban*2)+1
#build weights matrix
weightsMatrixUrban=matrix(1:nPixUrban^2,nrow=nPixUrban,ncol=nPixUrban)
#get focal cell
x=ceiling(ncol(weightsMatrixUrban)/2)
y=ceiling(nrow(weightsMatrixUrban)/2)
focalCell=weightsMatrixUrban[x,y]
indFocal=which(weightsMatrixUrban==focalCell,arr.ind = TRUE)
#compute distances
distancesUrban=list()
for(i in 1:nPixUrban^2){
ind.i=which(weightsMatrixUrban==i,arr.ind=T)
diffX=abs(ind.i[1,1]-indFocal[1,1])*res(LCM)[1]
diffY=abs(ind.i[1,2]-indFocal[1,2])*res(LCM)[1]
dist.i=sqrt(diffX^2+diffY^2)
distancesUrban[[i]]=dist.i
}
#add distance values to the weights matrix
weightsMatrixUrban[]=unlist(distancesUrban)
#set cells outside search radius to NA
weightsMatrixUrban[weightsMatrixUrban>2300]=NA
#normalise the weights matrix by setting each non-NA cell to 1/(number of non-NA cells) so that the weights sum to one
weightsMatrixUrban[!is.na(weightsMatrixUrban)]=1/length(weightsMatrixUrban[!is.na(weightsMatrixUrban)])
#sum neighbourhood effects from all surrounding cells
lcm_urban_2300=focal(urban,w=weightsMatrixUrban,fun="sum")
In addition to the woodland and urban layers we will also consider elevation as an environmental predictor. Load this in now.
demScot=rast('demScotland.tif')
Ensure that the DEM (digital elevation model) has the same resolution and origin as the other data so that we can stack the layers together. We do this using the resample() function, which takes our DEM and converts its resolution, origin and extent to those of another raster - here, the broadleaf woodland layer we have just created.
demScot=terra::resample(demScot,lcm_wood_1800)
#stack the first layer
allEnv=c(lcm_wood_1800,lcm_urban_2300,demScot)
names(allEnv)=c("broadleaf","urban","elev")
Create background points in the same way as in Weeks One and Two:
set.seed(11)
#sample 2000 background points at random from the covariate stack
back = spatSample(allEnv,size=2000,as.points=TRUE,method="random",na.rm=TRUE)
back=back[!is.na(back$broadleaf),]
back=st_as_sf(back,crs="EPSG:27700")
Now carry out the extraction of data to points, again following the same scheme as in previous weeks.
# get environmental covariates at presence locations
eP=terra::extract(allEnv,sciurusFin)
#bind together the presence data using cbind() which binds together objects by column (i.e. with different columns but the same number of rows)
Pres.cov=st_as_sf(cbind(eP,sciurusFin))
Pres.cov$Pres=1
#Remove the first column which is just an ID field.
Pres.cov=Pres.cov[,-1]
#get coordinates for spatial cross-validation later
coordsPres=st_coordinates(Pres.cov)
#create the background covariate data as an sf object with Pres = 0 (back is already a spatial object)
Back.cov=st_as_sf(data.frame(back,Pres=0))
#get coordinates of background points for cross validation later
coordsBack=st_coordinates(back)
#combine
coords=data.frame(rbind(coordsPres,coordsBack))
#assign column names
colnames(coords)=c("x","y")
Now we create our task for use with the mlr (machine learning in R) package, passing the data frame, specifying the species presence/absence column (the “target” in mlr terminology) and telling R that 1 = present (i.e. a positive result).
#combine pres and background
all.cov=rbind(Pres.cov,Back.cov)
#add coordinates
all.cov=cbind(all.cov,coords)
#remove any NAs
all.cov=na.omit(all.cov)
Remove the geometry column - we will only work with the x and y data columns in the mlr package.
all.cov=st_drop_geometry(all.cov)
Now load the mlr library and arrange the data in the necessary format.
library(mlr)
Loading required package: ParamHelpers
Attaching package: 'mlr'
The following object is masked from 'package:terra':
resample
#For the makeClassifTask function to work, our target variable needs to be categorical (a "factor" in R) so let's tidy that up first
task=all.cov
task$Pres=as.factor(task$Pres)
task = makeClassifTask(data = task[,1:4], target = "Pres",
positive = "1", coordinates = task[,5:6])
First we take the example of a model that cannot be tuned, a regular binomial (logistic) regression, and then also build a learner for a tuneable model - a random forest.
################ Binomial (logistic regression)
#use the make learner function to build the model approach. Fix factors prediction is set to TRUE here because our outcome is a factor (i.e. categorical: 0-1)
lrnBinomial = makeLearner("classif.binomial",
predict.type = "prob",
fix.factors.prediction = TRUE)
##########Random Forest
lrnRF = makeLearner("classif.ranger",
predict.type = "prob",
fix.factors.prediction = TRUE)
Create a re-sampling scheme (performance level) for cross-validation.
The code below creates 5 folds in the data, repeats the whole k-fold cross-validation process 5 times (so 25 models in total) and aggregates (averages) the AUC statistic.
#set up resampling strategy (non-spatial cross-validation)
perf_levelCV = makeResampleDesc(method = "RepCV", predict = "test", folds = 5, reps = 5)
#set up resampling strategy for spatial cross-validation
perf_level_spCV = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 5) #sampling strategy to run five-fold re-sampling five times
Now we run the model, re-sampling the data (task) according to the scheme set up previously.
#Binomial conventional cross validation (K fold)
cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_levelCV,
measures = mlr::auc,
show.info = FALSE)
print(cvBinomial)
Resample Result
Task: task[, 1:4]
Learner: classif.binomial
Aggr perf: auc.test.mean=0.8768175
Runtime: 0.116713
We can plot the locations of training and testing points using the createSpatialResamplingPlots() function.
#create Spatial Resampling Plots
plots = createSpatialResamplingPlots(task,resample=cvBinomial,
crs=crs(allEnv),datum=crs(allEnv),color.test = "red",point.size = 1)
library(cowplot)
#use the cowplot function to plot all folds out in a grid
cowplot::plot_grid(plotlist = plots[["Plots"]], ncol = 3, nrow = 2,
labels = plots[["Labels"]])
Next, carry out a spatial cross-validation of the binomial model and plot the training and testing points.
##Binomial spatial cross validation
sp_cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_level_spCV,
measures = mlr::auc,
show.info = FALSE)
print(sp_cvBinomial)
Resample Result
Task: task[, 1:4]
Learner: classif.binomial
Aggr perf: auc.test.mean=0.8433564
Runtime: 0.251992
#make partition plots
plotsSP = createSpatialResamplingPlots(task,resample=sp_cvBinomial,
crs=crs(allEnv),datum=crs(allEnv),color.test = "red",point.size = 1)
#use the cowplot function to plot all folds out in a grid
cowplot::plot_grid(plotlist = plotsSP[["Plots"]], ncol = 3, nrow = 2,
labels = plotsSP[["Labels"]])
Let’s evaluate our Random Forest model in the same way.
###### Random Forest evaluation
##random sampling cross-validation
cvRF = resample(learner = lrnRF, task =task,
resampling = perf_levelCV,
measures = mlr::auc,
show.info = FALSE)
print(cvRF)
Resample Result
Task: task[, 1:4]
Learner: classif.ranger
Aggr perf: auc.test.mean=0.9426031
Runtime: 5.03165
And with spatial partitioning:
#spatial partitioning cross-validation
sp_cvRF = resample(learner = lrnRF, task =task,
resampling = perf_level_spCV,
measures = mlr::auc,
show.info = FALSE)
print(sp_cvRF)
Resample Result
Task: task[, 1:4]
Learner: classif.ranger
Aggr perf: auc.test.mean=0.8296579
Runtime: 4.99321
As you will see, the random forest clearly out-performs the binomial model (a glm) when testing and training data are drawn randomly, but performs considerably worse than the binomial model when the data are spatially partitioned. This reflects the fact that the random forest model is more complex than the relatively simple binomial model. Recall the discussion on parsimony and over-fitting from the lecture and the explainer on maximum entropy from this week’s materials.
Unlike the binomial model (which can’t actually “learn”), however, the random forest learner can be tuned. If we tune its hyperparameters using a spatial cross-validation for benchmarking then, in a sense, the model can “learn” which settings provide more robust predictions to new, unknown locations. Let’s see. First we use the getParamSet() function to see which parameters can be tuned.
getParamSet(lrnRF)
Type len Def
num.trees integer - 500
mtry integer - -
min.node.size integer - -
replace logical - TRUE
sample.fraction numeric - -
split.select.weights numericvector <NA> -
always.split.variables untyped - -
respect.unordered.factors discrete - ignore
importance discrete - none
write.forest logical - TRUE
scale.permutation.importance logical - FALSE
num.threads integer - -
save.memory logical - FALSE
verbose logical - TRUE
seed integer - -
splitrule discrete - gini
num.random.splits integer - 1
keep.inbag logical - FALSE
Constr Req Tunable Trafo
num.trees 1 to Inf - TRUE -
mtry 1 to Inf - TRUE -
min.node.size 1 to Inf - TRUE -
replace - - TRUE -
sample.fraction 0 to 1 - TRUE -
split.select.weights 0 to 1 - TRUE -
always.split.variables - - TRUE -
respect.unordered.factors ignore,order,partition - TRUE -
importance none,impurity,permutation - FALSE -
write.forest - - FALSE -
scale.permutation.importance - Y FALSE -
num.threads 1 to Inf - FALSE -
save.memory - - FALSE -
verbose - - FALSE -
seed -Inf to Inf - FALSE -
splitrule gini,extratrees - TRUE -
num.random.splits 1 to Inf Y TRUE -
keep.inbag - - FALSE -
We will choose to tune the model based on the “mtry”, “min.node.size” and “num.trees” parameters.
paramsRF = makeParamSet(
makeIntegerParam("mtry",lower = 1,upper = 3),
makeIntegerParam("min.node.size",lower = 1,upper = 20),
makeIntegerParam("num.trees",lower = 100,upper = 500)
)
Specify the resampling scheme…
# specifying random parameter value search
tune_level = makeResampleDesc(method = "SpCV", iters = 5)
ctrl = makeTuneControlRandom(maxit = 50)
…and tune.
tuned_RF = tuneParams(learner = lrnRF,
task = task,
resampling = tune_level,
measures = mlr::auc,
par.set = paramsRF,
control = ctrl,
show.info = FALSE)
print(tuned_RF)
Tune result:
Op. pars: mtry=1; min.node.size=18; num.trees=224
auc.test.mean=0.8434536
Our model tuning appears to have improved the random forest performance somewhat, but not to the extent that it now outperforms the simple binomial regression. With more variables and longer tuning runs this might still improve, but for now we have to conclude that the binomial regression is the way to go!
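If you want to see how the tuned settings behave over a full spatially cross-validated run, here is a hedged sketch (assuming the tuning step above has been run; the object names lrnRF_tuned and sp_cvRF_tuned are just illustrative):
#plug the tuned hyperparameter values back into the random forest learner
lrnRF_tuned = setHyperPars(lrnRF, par.vals = tuned_RF$x)
#repeat the spatial cross-validation with the tuned learner
sp_cvRF_tuned = resample(learner = lrnRF_tuned, task = task,
                         resampling = perf_level_spCV,
                         measures = mlr::auc,
                         show.info = FALSE)
print(sp_cvRF_tuned)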
You may have noticed that we haven’t looked at Maxent. This is because it is not available as a learner in the mlr package and, generally speaking, is not especially “tunable”. We might still want to know how Maxent performs under a spatial cross-validation procedure, though. We can do so, but we have to set up the spatial partitioning manually. We can create a grid to partition the data points (similar to what the mlr functions are doing above) using the st_make_grid() function from the sf package. Below we use this function: the first argument is our presence data (sciurusFin) and the second argument sets the size (x and y, in metres) of each grid cell. We select polygons as the format and squares as the shape (otherwise we get hexagons, which is fine, but squares simplify things for demonstration).
# make a grid for spatial blocking
area_grid = st_make_grid(sciurusFin, c(50000, 50000), what = "polygons", square = T)
What we want from this grid is for each square to be assigned to a different fold so that we can use them in the same way as we partitioned our data into folds last week. The only difference here is that the folds represent different regions of the study area rather than a random subset.
First we want to convert the grid to an sf object so we can do some data operations, namely assigning an ID to each square/region.
area_grid_sf=st_as_sf(area_grid)
#the original grid from st_make_grid() is contained in a list so we can use the position of each square of that list as its ID.
area_grid_sf$grid_id=1:length(lengths(area_grid))
Take a look:
# plot
plot(area_grid_sf$x)
#check all points in correct place
plot(sciurusFin$geometry,add=T)
Set the folds and create a spatial object from the entire dataset (all.cov).
# set folds
folds=area_grid_sf$grid_id
dataPoints=st_as_sf(all.cov,coords = c("x","y"))
st_crs(dataPoints)=crs(area_grid_sf)
First check the model performance based on a random partitioning of the training and testing data (as we did in Week Two/SDM One).
library(dismo)
Loading required package: raster
Loading required package: sp
Attaching package: 'raster'
The following object is masked from 'package:ParamHelpers':
getValues
The following object is masked from 'package:nlme':
getData
Attaching package: 'dismo'
The following object is masked from 'package:spatstat.model':
response
The following object is masked from 'package:spatstat.geom':
domain
library(maxnet)
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-10
#set number of folds to use
folds=5
#partition presence and background data and assign to folds using the kfold() function.
Pres.cov=all.cov[all.cov$Pres==1,]
Back.cov=all.cov[all.cov$Pres==0,]
kfold_pres = kfold(Pres.cov, folds)
kfold_back = kfold(Back.cov, folds)
eMax=list()
for (i in 1:folds) {
train = Pres.cov[kfold_pres!= i,]
test = Pres.cov[kfold_pres == i,]
backTrain=Back.cov[kfold_back!=i,]
backTest=Back.cov[kfold_back==i,]
dataTrain=rbind(train,backTrain)
dataTest=rbind(test,backTest)
maxnetMod=maxnet(dataTrain$Pres, dataTrain[,1:3])
eMax[[i]] = evaluate(p=dataTest[ which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),],maxnetMod)
}
#print the result
aucMax = sapply(eMax, function(x){slot(x, 'auc')} )
print(mean(aucMax))
[1] 0.8919949
Now, to carry out the spatial cross-validation, we follow the same procedure as last week when we carried out model evaluation but where our partitioning into folds is now based on spatial location.
#load precrec for more versatile model evaluation methods
library(precrec)
folds=area_grid_sf$grid_id
maxEvalList=list()
for (i in folds) {
#select fold - select all folds which are not 'i' to train the model
gridTrain=subset(area_grid_sf,area_grid_sf$grid_id!=i)
#spatially subset the training data
train=data.frame(st_drop_geometry(st_intersection(gridTrain, dataPoints)))
#spatially subset the test data
gridTest=subset(area_grid_sf,area_grid_sf$grid_id==i)
test=data.frame(st_drop_geometry(st_intersection(gridTest, dataPoints)))
#build model
maxnetMod=maxnet(train$Pres, train[2:4],
maxnet.formula(train$Pres,all.cov[,c("broadleaf","elev","urban")],
classes="lq"))
#make prediction
pred=predict(maxnetMod, test,type="cloglog")
#use precrec to get roc curves
precrec_proc=evalmod(scores = pred,labels = test$Pres,mode = "prcroc")
#get auc value
modauc=precrec::auc(precrec::evalmod(scores = pred,
labels = test$Pres))
#add to list
maxEvalList[[i]]=modauc$aucs[1]
#track
print(i)
}
Print the mean AUC for the Maxent spatial cross-validation:
mean(unlist(maxEvalList))
[1] 0.809083
So, we can see that Maxent also struggles to maintain its typically high performance under the scrutiny of a spatial approach to model validation. The humble binomial glm seems to win in this particular case.
Next, let’s look at the relationships between covariates and squirrel occurrence based on the binomial model (response plots).
First build the binomial glm.
#Specify the model
glm.sciurus=glm(Pres~broadleaf+urban+elev,binomial(link='logit'),
data=all.cov)
Let’s plot the response to broadleaf first. To do that, we create a new data frame in which all variables except broadleaf are held at their mean values and then we create a variable for broadleaf with continuous values ranging from zero to the maximum value in the dataset. Predicting to these new data with the glm.sciurus model above produces our response.
#build new data frame based on mean of elev and urban but varying values for broadleaf.
glmNew=data.frame(broadleaf=seq(0,max(all.cov$broadleaf),length=1000),
elev=mean(all.cov$elev),
urban=mean(all.cov$urban))
# use type = "response" for probability-scale predictions
preds = predict(glm.sciurus, newdata = glmNew, type = "response", se.fit = TRUE)
glmNew$fit = preds$fit
glmNew$se = preds$se.fit
head(glmNew)
broadleaf elev urban fit se
1 0.000000000 249.7494 0.08314011 0.01296624 0.003582006
2 0.000466274 249.7494 0.08314011 0.01309596 0.003602116
3 0.000932548 249.7494 0.08314011 0.01322696 0.003622277
4 0.001398822 249.7494 0.08314011 0.01335925 0.003642491
5 0.001865096 249.7494 0.08314011 0.01349284 0.003662756
6 0.002331370 249.7494 0.08314011 0.01362776 0.003683071
Now we can plot the unique response of Sciurus vulgaris occurrence to broadleaf woodland. For this we will use the ggplot2 package.
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.5.2
#use ggplot to plot the fitted values against the broadleaf variable. Use geom_ribbon() to add a 95% confidence interval.
ggplot(glmNew, aes(x = broadleaf, y = fit)) +
geom_ribbon(data = glmNew, aes(y = fit, ymin = fit - 1.96 * se, ymax = fit + 1.96 * se),
fill = "blue", alpha = 0.3) +
geom_line(data = glmNew, aes(y = fit))
Let’s do the same for urban land cover.
#build new data frame based on mean of elev and broadleaf but varying values for urban.
glmNewUrban=data.frame(urban=seq(0,max(all.cov$urban),length=1000),
elev=mean(all.cov$elev),
broadleaf=mean(all.cov$broadleaf))
# use type = "response" for probability-scale predictions
predUrban = predict(glm.sciurus, newdata = glmNewUrban, type = "response", se.fit = TRUE)
glmNewUrban$fit = predUrban$fit
glmNewUrban$se = predUrban$se.fit
Now plot:
ggplot(glmNewUrban, aes(x = urban, y = fit)) +
geom_ribbon(data = glmNewUrban, aes(y = fit, ymin = fit - 1.96 * se, ymax = fit + 1.96 * se),
fill = "blue", alpha = 0.3) +
geom_line(data = glmNewUrban, aes(y = fit))
Finally, look at elevation.
#build new data frame based on mean of broadleaf and urban but varying values for elev.
glmNewElev=data.frame(elev=seq(0,max(all.cov$elev),length=1000),
urban=mean(all.cov$urban),
broadleaf=mean(all.cov$broadleaf))
# use type = "response" for probability-scale predictions
predElev = predict(glm.sciurus, newdata = glmNewElev, type = "response", se.fit = TRUE)
glmNewElev$fit = predElev$fit
glmNewElev$se = predElev$se.fit
Response plot for elevation:
ggplot(glmNewElev, aes(x = elev, y = fit)) +
geom_ribbon(data = glmNewElev, aes(y = fit, ymin = fit - 1.96 * se, ymax = fit + 1.96 * se),
fill = "blue", alpha = 0.3) +
geom_line(data = glmNewElev, aes(y = fit))
We can see from these plots that the response of Sciurus vulgaris occurrence to each variable is non-linear. The responses to urban cover and elevation look roughly like second-order polynomials (a first-order polynomial is a straight line; a second-order polynomial is a line with one clear “bend” in it). The response to broadleaf is more complicated and we might describe it as a third-order polynomial (because there is one more “bend” in this plot than in the other two). We can reflect these relationships in our model formula if we like, using the poly() function. For example, to fit the complex, third-order response for broadleaf woodland:
#Specify the glm model using the appropriate polynomial term for broadleaf
glm.sciurus=glm(Pres~poly(broadleaf,3)+urban+elev,binomial(link='logit'),
data=all.cov)
We will employ such terms in the second part of this practical, which focuses on point processes.
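Before moving on, here is an optional, hedged check (the object names glm.linear and glm.poly are just illustrative): you can compare the original linear-terms model with one using the polynomial terms suggested by the response plots, using AIC, where a lower value indicates a better balance of fit and parsimony.
#compare a linear-terms glm with a polynomial-terms glm via AIC
glm.linear = glm(Pres ~ broadleaf + urban + elev, binomial(link = 'logit'), data = all.cov)
glm.poly = glm(Pres ~ poly(broadleaf, 3) + poly(urban, 2) + poly(elev, 2),
               binomial(link = 'logit'), data = all.cov)
AIC(glm.linear, glm.poly)  #lower AIC = better trade-off of fit and complexity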
Part Two. Point Process Modelling
A key difference between the modelling approaches we have used until now and the point pattern analysis we will do next is the format of the environmental covariates, which, in the spatstat package, take the form of im (pixel image) objects. So it is worth first converting our raster covariates to image objects. For this we use the custom function that we built at the start of the practical.
#load spatstat
library(spatstat)
#convert broadleaf object
broadleafIm=raster.as.im(raster(allEnv$broadleaf))
#convert urban object
urbanIm=raster.as.im(raster(allEnv$urban))
#convert elevation object
elevIm=raster.as.im(raster(allEnv$elev))
Modelling point processes in spatstat requires the setting of an analysis window (usually achieved by passing the study area extent to the as.owin() function).
#create the study window using the urbanIm image object as a template
window.poly=as.owin(urbanIm)
plot(window.poly) #inspect
#get coordinates from our sciurusFin object to create the point pattern
sciurusCoords=st_coordinates(sciurusFin)
#create point pattern ('ppp') object with the ppp() function
pppSciurus=ppp(sciurusCoords[,1],sciurusCoords[,2], window = window.poly)
Warning: 39 points were rejected as lying outside the specified window
Warning: data contain duplicated points
#inspect
plot(allEnv$broadleaf)
plot(pppSciurus,add=T)
Warning in plot.ppp(pppSciurus, add = T): 39 illegal points also plotted
#use as.ppp() to remove points outside the window
pppSciurus=as.ppp(pppSciurus)
Finally, it’s a good idea to rescale our data from metres to kilometres. Not only is this more intuitive when dealing with large distances, but it also helps with model convergence.
#rescale from m to km
pppSciurus=rescale(pppSciurus, 1000)
#rescale image objects to conform with the point pattern
broadleafIm=rescale(broadleafIm,1000)
elevIm=rescale(elevIm,1000)
urbanIm=rescale(urbanIm,1000)
Now that we have our data ready for some analysis, let’s start by applying the Ripley’s K test that we discussed in the lecture. Remember, this tests, over a range of radii, whether points are more clustered (high values of K) or more uniform (low values of K) than would be expected under complete spatial randomness.
K=Kest(pppSciurus,rmax=1800,correction = "border") #test for spatial dependence and use the "border" option to account for edge effects (points near the edge are ignored)
plot(K) #inspect the results
#check the "linearized" Ripley's K for comparison
L=Lest(pppSciurus,correction = "border")
plot(L)
#generate an envelope based on 39 random simulations of a homogeneous Poisson process (rank = 1 means that we use the minimum and maximum simulated values to create the envelope; global = TRUE ensures that the estimate is made for all points simultaneously rather than point-wise)
Kcsr=envelope(pppSciurus,Kest,nsim=39,rank=1,global =TRUE)
Generating 39 simulations of CSR ...
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39.
Done.
plot(Kcsr,shade=c("hi","lo"),legend=T) #check whether our data fall within this envelope
Next we need to set up a quadrature scheme for fitting point process models (this controls how finely the study window is discretised). The loop below tries a range of values for the number of quadrature points (nd) and prints the AIC of a simple fitted model for each, so that we can choose a value beyond which there is little further improvement.
ndTry=seq(100,1000,by=100)
for(i in ndTry){
Q.i=quadscheme(pppSciurus,method = "grid",nd=i)
fit.i=ppm(Q.i~broadleafIm+elevIm+urbanIm)
print(i)
print(AIC(fit.i))
}
[1] 100
[1] 2827.255
[1] 200
[1] 2803.668
[1] 300
[1] 2795.46
[1] 400
[1] 2791.52
[1] 500
[1] 2789.184
[1] 600
[1] 2788.11
[1] 700
[1] 2788.046
[1] 800
[1] 2786.238
[1] 900
[1] 2786.08
[1] 1000
[1] 2786.789
Q=quadscheme(pppSciurus,method = "grid",nd=900)
Plot increasing values of each environmental covariate against the intensity (~ density) of the point pattern:
plot(rhohat(pppSciurus,broadleafIm)) # broadleaf
plot(rhohat(pppSciurus,elevIm)) # elevation
plot(rhohat(pppSciurus,urbanIm)) # urban
The shapes of these plots are broadly in line with the response plots we produced earlier. The broadleaf response seems to be consistently more complex than those for elevation and urban, so we will fit a third-order polynomial trend to broadleaf and a second-order trend to the other two.
firstPPMod=ppm(Q~poly(broadleafIm,3)+poly(elevIm,2)+poly(urbanIm,2)+x+y) #point process model based on polynomial terms for the three covariates plus the projected coordinates
Now that we have our model we can run some diagnostic tests to see whether it is influenced by spatial dependence.
As we discovered in our earlier exploratory analysis, we have reason to believe that our data do not come from a distribution with complete spatial randomness and we may want to consider accounting for spatial dependence between points in our modeling.
This first ppm model is essentially a standard Poisson regression and does not account for interactions between points. It therefore assumes spatial randomness in the data, as discussed in the lecture. We can test whether our model conforms to spatial randomness (i.e. whether the data could have come from a random Poisson process) using the envelope() function, which can also accept fitted point process models as input.
firstModEnv=envelope(firstPPMod,Kest,nsim=39,rank=1,global =TRUE)
plot(firstModEnv)
As we can see, our initial model results exhibit spatial clustering (values of K outside and above the theoretical CSR envelope) for most radii.
We can attempt to account for this clustering inside our model using a Matern process, providing this option inside the kppm() function (for fitting clustered ppms).
maternMod=kppm(Q~broadleafIm+elevIm+urbanIm+x+y,"MatClust") # apply a Matern cluster process to the data
Again, as for our first model, we can perform a diagnostic test for the Matern process model using the envelope function. In this case, we have built our model on the assumption that our points exhibit clustering. In other words, they represent an inhomogeneous point process. We therefore want to create an envelope based on simulations of inhomogeneous (clustered) point patterns rather than CSR. This can be done by changing the second argument of the envelope() function from Kest (for Ripley’s K test of CSR) to Kinhom (the inhomogeneous K function). We can then assess the goodness-of-fit of our model by seeing whether our observed values of K conform to an inhomogeneous distribution (i.e. whether they lie within the simulation envelope). Note, the envelope run can take a few minutes, so please be patient.
matEnv=envelope(maternMod,Kinhom,nsim=39,rank=1,global=TRUE)
Plot the result…
plot(matEnv)
Our model fits nicely within the simulation envelope, suggesting that it fits well to an inhomogeneous point pattern and is therefore correctly specified - it is a good approximation of the behaviour of the data.
We can look at the model coefficients by calling coef(maternMod) to see the relationship between our predictor variables and the predicted intensity of points. Note that for any covariate fitted with poly() you get one coefficient per polynomial term, and together these describe the shape of the response: a positive first-order term combined with a negative second-order term, for example, describes an intensity that increases with the covariate before decreasing once a certain value is reached. Note that this coefficient-level information is not available in the machine learning approaches.
coef(maternMod)
(Intercept) broadleafIm elevIm urbanIm x
-12.062390683 13.099059532 -0.009123407 -5.564681764 -0.016806520
y
0.019879960
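As a hedged aside on interpretation: the fitted intensity is log-linear in the covariates, so exponentiating a coefficient gives the multiplicative change in predicted intensity for a one-unit increase in that covariate (holding the others constant).
#exponentiate the coefficients to read them as multiplicative effects on intensity
exp(coef(maternMod))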
The spatstat package contains a function for calculating the area under the curve for point process models. This differs slightly from binomial models (models with 2 outcomes) in that here it is the intensity rather than the success of correctly classifying a binary outcome that is being tested. The spatstat roc() and auc() functions offer an indication of the ability of the fitted model intensity to separate the spatial domain into areas of high and low density of points. The AUC here is the probability that a randomly-selected data point has higher predicted intensity than does a randomly-selected spatial location. So the test incorporates space into the assessment (i.e. by considering intensity across the entire study area rather than just our data points). We can plot the roc curve and return the auc as below:
plot(roc(maternMod))
Print the AUC for the model to the console:
auc.kppm(maternMod)
obs theo
0.8692987 0.8692824
The ROC curve and AUC statistic for our model are high, closely following the output expected for a well-specified inhomogeneous point process model (“ftheo” here) and sitting well above the null line (a 50:50 line representing no ability to discriminate high from low intensity - “fnull” here). So we can see that, with a well-specified model that incorporates spatial dependence, we have achieved a better representation of the data with the point process approach than with our machine learning techniques. In reality we could carry out much more extensive tuning of the machine learning models (increasing the number of iterations in the tuning step), but it is perhaps unlikely that an AUC score comparable to that of the Matern model would be achieved.
We can create a mapped output of our results using the predict function. Remember our point process model output is one of intensity (points per unit area) and has quite low values. We can log transform the values to better differentiate the distribution of values.
prPPMod=predict(maternMod)
plot(prPPMod)
plot(log(prPPMod))
If you want to convert the result to a raster, that’s easy.
plot(rast(prPPMod))
If you want to save the result to file as a GeoTIFF, you can use writeRaster():
ppmRast=rast(prPPMod)
writeRaster(ppmRast,"ppmSciurus.tif")
Note, the equivalent for vector objects (i.e. points, lines or polygons) is writeVector(object, filename) for spatVector objects made with the vect() function, or st_write(object, filename) for sf objects.
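For example, to save the cleaned squirrel points (a hedged sketch - the file name is just an illustration):
st_write(sciurusFin, "sciurusFin_cleaned.gpkg")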
You have reached the end of our exploration of species distribution modelling. There is no exercise this week. Instead, you can make a start on applying what you have learned to Assessment One of this course - good luck!