In Week 3 we looked at the use of envelope models, generalized linear models, maxent and point process models for estimating a species distribution (Bradypus variegatus). According to the cross-validation approach taken (K-fold partitioning), we achieved a good level of prediction. However, as we saw in the lecture, ecological data tend to be spatially dependent, meaning that observations in close proximity will be more similar than those at a distance. In terms of model validation this means that, in the case of K-fold validation in which samples are selected at random, training and test samples may be quite similar for any given covariate. This can (and almost always does) lead to over-optimistic model validation statistics (e.g. AUC scores). A more reliable approach is to ensure that the folds used in cross-validation are spatially disjoint, i.e. that the geographic distance between training and testing data is maximized. This is particularly important for species distribution modeling, where our goal is prediction across space for our species of interest, so the ability of our model to work across different geographical areas is critical.
In this practical you will explore how spatial dependence affects model validation and the options available for overcoming such problems. We will use statistical models and machine learning algorithms on data for Bradypus variegatus.
Using these data, you will then learn how, for machine learning algorithms, model hyperparameters can be tuned to improve performance. You can think of model hyperparameters as the “settings” for any given model. For example, the hyperparameter “mtry” is the number of explanatory variables randomly sampled as candidates at each split of each tree in a random forest algorithm, and “node size” denotes the minimum number of cases allowed in a terminal node of each decision tree. “Parameters”, on the other hand, are the decisions made by the learner during training (so “hyper” refers to the fact that we set these properties BEFORE the learning process begins).
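To make this concrete, here is a minimal sketch (using the built-in iris data and the ranger package, which we install later in this practical) of hyperparameters being fixed before training begins:
library(ranger)
#hyperparameters (num.trees, mtry, min.node.size) are set BEFORE training starts
rf<-ranger(Species~.,data=iris,num.trees=500,mtry=2,min.node.size=5)
#the "parameters" (the individual split decisions in each tree) are then learned from the data
rf$prediction.error #out-of-bag misclassification error of the fitted forest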
In part four of the practical we will adopt a point process modeling approach to understanding and making predictions on these same data.
Part One: creating learners in the MLR (“Machine Learning in R”) package
The MLR package provides a suite of tools to explore, create and test modelling strategies within a common machine learning framework.
A key concept in MLR, and in machine learning in general, is the “Learner”. Learners are the objects that generate information (i.e. they “learn”) about the data provided. The data themselves are generally formulated as “tasks”. So the basic logic is that we “learn” how to carry out our “task” and then use this experience for optimal prediction. If that all sounds a little inaccessible, it may help to simply think of Learners as models (as in classical statistics, e.g. regression models) and Tasks as data sets. Typically, creating a task involves choosing a data set and telling R which column holds the species observations.
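If it helps to see this in miniature, here is a toy sketch (assuming the mlr package, installed below, and the rpart package, which ships with R) pairing a learner with a task:
library(mlr)
toyTask<-makeClassifTask(data=iris,target="Species") #the "task": a data set plus a target column
toyLrn<-makeLearner("classif.rpart",predict.type="prob") #the "learner": a model recipe
toyMod<-train(toyLrn,toyTask) #the learner "learns" the task
head(predict(toyMod,task=toyTask)$data) #...and can then be used for prediction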
Okay, let’s set up our first task. First we will get the same data as we used last week. As you will only be working with data extracted from inside R packages or downloaded from online sources it is not critical where you set your working directory. However, if you intend to save any of your outputs from this practical then create a folder/directory for the work you are about to do. Remember, you can find the code for saving spatial outputs to file at the end of Practical One of this course.
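As a quick reminder (a sketch only - substitute your own file names, and note that the env and bradypus objects are created further down), the terra functions writeRaster() and writeVector() can be used to save your outputs:
library(terra) #terra provides the writeRaster()/writeVector() functions
writeRaster(env,"env_layers.tif",overwrite=TRUE) #save a (multi-layer) SpatRaster as a GeoTIFF
writeVector(bradypus,"bradypus_points.gpkg",overwrite=TRUE) #save a SpatVector as a GeoPackage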
#Code to tell R to only use the terra installation of PROJ (library for performing projections)
plib<-Sys.getenv("PROJ_LIB")
prj<-system.file("proj",package="terra")[1]
Sys.setenv("PROJ_LIB"=prj)
#Load packages
#Note, regarding the spatstat and maptools packages: if you receive a message stating that a binary version is available but the source version is more recent, make sure you select "yes" to compile the package from source.
install.packages(c("dismo","terra","spatstat","adehabitatHR","ranger","cowplot"))
# "maptools is a useful package for converting data formats when using Spatstat. Sadly it is deprecated on CRAN but we can load a stable version from here:
install.packages("maptools", repos = "https://packagemanager.posit.co/cran/2023-10-13")
#We also need to load the "MLR" package for this practical
install.packages("mlr")
#in addition we will use machine learning algorithms from the "kernlab" package
install.packages("kernlab") #install necessary package
Next we recreate the data from Week 3. The only difference in how we create the data for analysis is that this week we will trim down the study area. Both machine learning (due to the number of tests/iterations) and point process modelling (due to modelling complex patterns) can be computationally intensive. We will use the minimum convex polygon approach to select a convex hull around 80% of our Bradypus points and clip the study area to this polygon.
#access the dismo library
library(dismo)
## Loading required package: raster
## Loading required package: sp
files <- list.files(path=paste(system.file(package="dismo"),
'/ex', sep=''), pattern='grd', full.names=TRUE )
files
## [1] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio1.grd"
## [2] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio12.grd"
## [3] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio16.grd"
## [4] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio17.grd"
## [5] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio5.grd"
## [6] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio6.grd"
## [7] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio7.grd"
## [8] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio8.grd"
## [9] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/biome.grd"
library(terra)
## terra 1.7.72
env<-rast(files[1:8])
#use the names() function to label each covariate layer
names(env) <- c("MeanTemp.","AnnualPrecip.","Precip.WettestQuarter","PrecipDriestQuarter","Max.Temp.","Min.Temp.","Temp.Range","MeanTemp.WettestQuarter")
plot(env)
SAproj<-"epsg:31972" # we want projected coordinate in order to calculate distance accurately so we use the EPSG code for South America
env<-project(env,y=SAproj,res=10000) # project raster and set an appropriate resolution
#And now plot:
# first layer of the SpatRaster
plot(env, 1)
bradypus <- gbif("Bradypus","variegatus",sp=TRUE) #use gbif() to download the data
## 6342 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6342 records downloaded
#convert from an sp library (old) object to terra (newer/faster library) object
bradypus<-vect(bradypus)
#set the crs
crs(bradypus)<-"epsg:4326"
bradypus<-project(bradypus,crs(env))
plot(env$MeanTemp.)
plot(bradypus,add=TRUE,col ="red")
library(adehabitatHR) #library for mcp() function to get a minimum convex polygon around a certain number of points
## Loading required package: ade4
## Loading required package: adehabitatMA
## Registered S3 methods overwritten by 'adehabitatMA':
## method from
## print.SpatialPixelsDataFrame sp
## print.SpatialPixels sp
##
## Attaching package: 'adehabitatMA'
## The following object is masked from 'package:terra':
##
## buffer
## The following object is masked from 'package:raster':
##
## buffer
## Loading required package: adehabitatLT
## Loading required package: CircStats
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:terra':
##
## area
## The following objects are masked from 'package:raster':
##
## area, select
## Loading required package: boot
bradMCP<-mcp(as(bradypus,"Spatial"),percent=80) #create minimum convex polygon around the 80% of points closest to the centroid (central point) of the presence data
plot(env$MeanTemp.) #plot
plot(bradMCP,add=TRUE)
plot(bradypus,add=TRUE)
bradSP<-crop(bradypus,bradMCP) # crop points to the mcp
#inspect
plot(env$MeanTemp.)
plot(bradSP,add=TRUE)
plot(bradMCP,add=T)
###Crop the env object to this new polygon considering 80% of the point data
env<-crop(env,bradMCP)#crop
env<-mask(env,vect(bradMCP))#mask
plot(env)#plot
set.seed(11)
#create 2000 random background points by sampling cells of the 'env' object (xy=TRUE returns the cell coordinates and na.rm=TRUE avoids cells with missing covariate values)
back <- spatSample(env,size=2000,xy=TRUE,na.rm=TRUE)
head(back)
## x y MeanTemp. AnnualPrecip. Precip.WettestQuarter
## 1 596831.2 -73128.55 261.2404 2869.9595 950.9012
## 2 636831.2 1236871.45 227.4852 2018.1752 968.0748
## 3 1126831.2 1166871.45 236.1839 843.6674 334.4229
## 4 936831.2 1146871.45 260.1938 1045.0660 441.1959
## 5 1046831.2 946871.45 269.8413 1568.1650 733.9901
## 6 536831.2 866871.45 271.6064 3758.7358 1394.8872
## PrecipDriestQuarter Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
## 1 419.62311 316.8446 206.2404 110.8021 254.9481
## 2 95.62096 287.2628 162.4731 124.7897 225.8656
## 3 51.63866 298.5118 170.2902 128.1907 236.3653
## 4 56.49178 309.5564 204.3375 105.2292 261.7309
## 5 47.27510 338.3295 206.8261 131.3195 266.8843
## 6 268.92087 333.1964 217.2537 115.9427 267.6374
#background (pseudo-absence) values (drop cols 1 and 2, which are coordinates)
eA<-back[,3:ncol(back)]
#presence values
eP<-extract(env,bradSP)
#create data frames from values
Pres.cov<-data.frame(eP,Pres=1)
Back.cov<-data.frame(eA,Pres=0)
head(Back.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 261.2404 2869.9595 950.9012 419.62311 316.8446
## 2 227.4852 2018.1752 968.0748 95.62096 287.2628
## 3 236.1839 843.6674 334.4229 51.63866 298.5118
## 4 260.1938 1045.0660 441.1959 56.49178 309.5564
## 5 269.8413 1568.1650 733.9901 47.27510 338.3295
## 6 271.6064 3758.7358 1394.8872 268.92087 333.1964
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 206.2404 110.8021 254.9481 0
## 2 162.4731 124.7897 225.8656 0
## 3 170.2902 128.1907 236.3653 0
## 4 204.3375 105.2292 261.7309 0
## 5 206.8261 131.3195 266.8843 0
## 6 217.2537 115.9427 267.6374 0
head(Pres.cov)
## ID MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter
## 1 1 242.7934 860.167 531.1019 37.79567
## 2 2 243.5608 2908.904 1178.0919 202.05040
## 3 3 243.7697 3312.001 1443.1741 144.62831
## 4 4 231.2706 3314.167 1235.5300 302.25983
## 5 5 221.3931 1139.519 490.4107 94.57846
## 6 6 243.5608 2908.904 1178.0919 202.05040
## Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 296.5367 194.3747 102.1593 249.7906 1
## 2 310.3405 184.6754 125.2382 242.2694 1
## 3 315.8277 178.8665 136.1665 240.3386 1
## 4 292.5822 175.1340 117.3696 230.7110 1
## 5 283.7167 152.7786 131.2737 224.1626 1
## 6 310.3405 184.6754 125.2382 242.2694 1
#combine both data sets using the rbind() function (binds data sets with the same columns by stacking one on top of the other - i.e. joining rows together)
all.cov<-rbind(Pres.cov[,2:ncol(Pres.cov)],Back.cov)
#inspect
head(all.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 242.7934 860.167 531.1019 37.79567 296.5367
## 2 243.5608 2908.904 1178.0919 202.05040 310.3405
## 3 243.7697 3312.001 1443.1741 144.62831 315.8277
## 4 231.2706 3314.167 1235.5300 302.25983 292.5822
## 5 221.3931 1139.519 490.4107 94.57846 283.7167
## 6 243.5608 2908.904 1178.0919 202.05040 310.3405
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 194.3747 102.1593 249.7906 1
## 2 184.6754 125.2382 242.2694 1
## 3 178.8665 136.1665 240.3386 1
## 4 175.1340 117.3696 230.7110 1
## 5 152.7786 131.2737 224.1626 1
## 6 184.6754 125.2382 242.2694 1
Now we can use this data frame to create a task for our learners to carry out. As stated at the start of this practical, the spatial location of points is critical to our analysis and will be required at every step of model development, testing and tuning. So, first let’s create an object for the coordinates of our presence and background data for use in the analysis.
### create task for use with mlr classification algorithms
coordsBack<-back[,1:2]
coordsPres<-crds(bradSP)
coords<-rbind(coordsPres,coordsBack)
coords<-data.frame(coords)
colnames(coords)<-c("x","y")
all.cov<-cbind(all.cov,coords)
all.cov<-subset(all.cov,!is.na(all.cov$MeanTemp.))#remove any NA values
#Now we create our task, passing the data frame, specifying the species presence/absence column (the "target") and telling R that 1 = present (i.e. a positive result)
library(mlr)
## Loading required package: ParamHelpers
##
## Attaching package: 'ParamHelpers'
## The following object is masked from 'package:raster':
##
## getValues
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
##
## Attaching package: 'mlr'
## The following object is masked from 'package:adehabitatLT':
##
## subsample
## The following object is masked from 'package:terra':
##
## resample
## The following object is masked from 'package:raster':
##
## resample
#For the makeClassifTask function to work, our target variable needs to be categorical (a "factor" in R) so let's tidy that up first
all.cov$Pres<-factor(all.cov$Pres)
task = makeClassifTask(data = all.cov[,1:9], target = "Pres",
positive = "1", coordinates = all.cov[,10:11])
Now that we have the task set up, we need to choose some learners that will help us carry it out. If we already know which learner we prefer then we can use the makeLearner() function to specify our algorithm. If we don’t, the mlr package has a really useful function, listLearners(), to which you can pass your existing task and get a list of machine learning algorithms that apply.
The listLearners() function returns A LOT of information, most of which we do not need, so we will not call it in full here (see the sketch below for how you might). Instead we will go straight to creating our learners, which for today’s practical are: 1. classif.binomial (an implementation of logistic regression), 2. classif.ranger (a fast implementation of random forest) and 3. classif.ksvm (a support vector machine algorithm developed by the kernlab group). We looked at binomial regression and random forest previously. Support Vector Machines (SVMs) are classifiers that attempt to create the largest possible margin between samples (such as groups of presence/absence points), maximising the distance between the groups whilst reducing the number of misclassifications, thereby helping the classifier to make decisions about group membership. See here for an article giving a nice simple overview of how SVMs work: https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496
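If you do want to browse the candidates, a call along the following lines keeps the output manageable (a sketch; the data frame returned also contains many other columns):
lrns<-listLearners(task,warn.missing.packages=FALSE) #learners applicable to our task
head(lrns[,c("class","package")]) #just the algorithm names and their source packages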
The next step is to create our individual learners that will do the work of solving our task. Let’s start with the most familiar type - binomial regression:
################ Binomial (logistic regression)
lrnBinomial = makeLearner("classif.binomial",
predict.type = "prob",
fix.factors.prediction = TRUE)
In the above code we first tell the makeLearner() function the type of classifier we want to work with, then specify the type of prediction we want to make (probability, rather than predicted classes, i.e. 1 or 0) and set fix.factors.prediction = TRUE so that factor levels are handled consistently when the learner is later used for prediction.
Next we do the same for our Random Forest and Ksvm learners:
##########Random Forest
lrnRF = makeLearner("classif.ranger",
predict.type = "prob",
fix.factors.prediction = TRUE)
###### Support Vector Machine
# kernel = "rbfdot" = "Radial Basis kernel function" which is basically a Gaussian kernel i.e. one that simulates a normal distribution. The algorithm uses the kernel type to decide how to differentiate groups (you can think of this as a means to estimate or weight distances between group 1 - presence and group 0 - absence).
lrn_ksvm = makeLearner("classif.ksvm",
predict.type = "prob",
kernel = "rbfdot",
fix.factors.prediction = TRUE)
Part Two: model evaluation
Now that we have both our task and learners ready we can apply each learner to the task and evaluate their performance.
To do this we have to set up a sampling scheme for cross validation. This is conceptually identical to the k-fold validation we did in Week 3 but here we don’t have to go through the trouble of writing our own for loops - we can use the functionality within the mlr package to do this for us. First we will create a k-fold resampling strategy that DOES NOT take spatial location into account.
### Create resampling scheme (performance level) for cross-validation.
##This code creates 5 folds in the data, repeats the whole k-fold cross-validation process 5 times (so 25 models in total) and aggregates (averages) the AUC statistic.
perf_levelCV = makeResampleDesc(method = "RepCV", predict = "test", folds = 5, reps = 5)
In addition we can easily set up the same cross validation strategy but with spatially disjoint folds. All we have to do is change “RepCV” to “SpRepCV” like so:
perf_level_spCV = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 5) #sampling strategy to run five fold re-sampling five times
Now we can compare, for each of our learners, their performance based on conventional k-fold validation and one where spatial location is considered (i.e. when test data are made to be geographically far from training data).
#Binomial conventional cross validation (K fold)
cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.8301529
## [Resample] iter 2: 0.8490734
## [Resample] iter 3: 0.8229394
## [Resample] iter 4: 0.8116275
## [Resample] iter 5: 0.8265455
## [Resample] iter 6: 0.8207694
## [Resample] iter 7: 0.8384071
## [Resample] iter 8: 0.8065346
## [Resample] iter 9: 0.8449821
## [Resample] iter 10: 0.8291591
## [Resample] iter 11: 0.8208739
## [Resample] iter 12: 0.8305474
## [Resample] iter 13: 0.8325647
## [Resample] iter 14: 0.8254371
## [Resample] iter 15: 0.8282495
## [Resample] iter 16: 0.8329027
## [Resample] iter 17: 0.8190710
## [Resample] iter 18: 0.8298184
## [Resample] iter 19: 0.8339114
## [Resample] iter 20: 0.8205537
## [Resample] iter 21: 0.8139433
## [Resample] iter 22: 0.8322877
## [Resample] iter 23: 0.8207495
## [Resample] iter 24: 0.8316138
## [Resample] iter 25: 0.8413222
##
## Aggregated Result: auc.test.mean=0.8277615
##
We can access and plot the partitioning performed by the mlr package (here the folds were chosen at random, so training and test points are interspersed in space). This information is good to have if we want to replicate the validation process later on.
plots = createSpatialResamplingPlots(task,resample=cvBinomial,
crs=crs(env),datum=crs(env),color.test = "red",point.size = 1)
library(cowplot)
#use the cowplot function to plot all folds out in a grid
cowplot::plot_grid(plotlist = plots[["Plots"]], ncol = 3, nrow = 2,
labels = plots[["Labels"]])
##Binomial spatial cross validation
sp_cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.5028453
## [Resample] iter 2: 0.5711554
## [Resample] iter 3: 0.7205928
## [Resample] iter 4: 0.0788633
## [Resample] iter 5: 0.4091305
## [Resample] iter 6: 0.8325269
## [Resample] iter 7: 0.3851623
## [Resample] iter 8: 0.3286315
## [Resample] iter 9: 0.6020144
## [Resample] iter 10: 0.5646577
## [Resample] iter 11: 0.7205928
## [Resample] iter 12: 0.5028453
## [Resample] iter 13: 0.5711554
## [Resample] iter 14: 0.4091305
## [Resample] iter 15: 0.0788633
## [Resample] iter 16: 0.8329681
## [Resample] iter 17: 0.4258841
## [Resample] iter 18: 0.3756532
## [Resample] iter 19: 0.3473831
## [Resample] iter 20: 0.5646577
## [Resample] iter 21: 0.5646577
## [Resample] iter 22: 0.4258841
## [Resample] iter 23: 0.3473831
## [Resample] iter 24: 0.3756532
## [Resample] iter 25: 0.8329681
##
## Aggregated Result: auc.test.mean=0.4948504
##
#make partition plots
plotsSP = createSpatialResamplingPlots(task,resample=sp_cvBinomial,
crs=crs(env),datum=crs(env),color.test = "red",point.size = 1)
#use the cowplot function to plot all folds out in a grid
cowplot::plot_grid(plotlist = plotsSP[["Plots"]], ncol = 3, nrow = 2,
labels = plotsSP[["Labels"]])
You should see that the result (aggregated AUC) for the spatial cross-validation is noticeably lower than for the conventional cross-validation. This strongly suggests that the latter gives an over-optimistic assessment and that we should be cautious with non-spatial approaches to model evaluation.
Now let’s look at our Random Forest learner.
###### Random Forest evaluation
##random sampling cross-validation
cvRF = resample(learner = lrnRF, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.9813761
## [Resample] iter 2: 0.9787655
## [Resample] iter 3: 0.9785546
## [Resample] iter 4: 0.9850311
## [Resample] iter 5: 0.9815155
## [Resample] iter 6: 0.9821401
## [Resample] iter 7: 0.9829890
## [Resample] iter 8: 0.9719107
## [Resample] iter 9: 0.9807421
## [Resample] iter 10: 0.9819755
## [Resample] iter 11: 0.9823031
## [Resample] iter 12: 0.9802763
## [Resample] iter 13: 0.9795040
## [Resample] iter 14: 0.9797405
## [Resample] iter 15: 0.9787939
## [Resample] iter 16: 0.9805372
## [Resample] iter 17: 0.9861044
## [Resample] iter 18: 0.9770646
## [Resample] iter 19: 0.9772007
## [Resample] iter 20: 0.9783537
## [Resample] iter 21: 0.9795199
## [Resample] iter 22: 0.9733039
## [Resample] iter 23: 0.9857602
## [Resample] iter 24: 0.9813368
## [Resample] iter 25: 0.9818703
##
## Aggregated Result: auc.test.mean=0.9802668
##
#spatial partitioning cross-validation
sp_cvRF = resample(learner = lrnRF, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.4548440
## [Resample] iter 2: 0.6067284
## [Resample] iter 3: 0.5147399
## [Resample] iter 4: 0.6223749
## [Resample] iter 5: 0.4307723
## [Resample] iter 6: 0.6322529
## [Resample] iter 7: 0.5257214
## [Resample] iter 8: 0.6072639
## [Resample] iter 9: 0.5433091
## [Resample] iter 10: 0.5288575
## [Resample] iter 11: 0.5263679
## [Resample] iter 12: 0.4404957
## [Resample] iter 13: 0.5352940
## [Resample] iter 14: 0.5986165
## [Resample] iter 15: 0.5161097
## [Resample] iter 16: 0.5117341
## [Resample] iter 17: 0.5411851
## [Resample] iter 18: 0.4338626
## [Resample] iter 19: 0.6058483
## [Resample] iter 20: 0.5442464
## [Resample] iter 21: 0.5850094
## [Resample] iter 22: 0.5834310
## [Resample] iter 23: 0.2969607
## [Resample] iter 24: 0.5953777
## [Resample] iter 25: 0.5861225
##
## Aggregated Result: auc.test.mean=0.5347010
##
plotsSP = createSpatialResamplingPlots(task,resample=sp_cvRF,
crs=crs(env),datum=crs(env),color.test = "red",point.size = 1)
cowplot::plot_grid(plotlist = plotsSP[["Plots"]], ncol = 3, nrow = 2,
labels = plotsSP[["Labels"]])
You will notice that, although random forest out-performs the binomial regression learner when assessed from a conventional cross-validation point of view, this algorithm suffers heavily when spatial dependence is taken into account, and the binomial learner achieves a comparable AUC. However, random forest has a set of hyperparameters that can easily be tuned to enhance model performance, a functionality not present with binomial learners (which are based on statistical methods rather than on machine learning algorithms). We will see later in the practical whether model tuning changes the relative standing of these two learners. First though, let’s perform the same evaluation on our support vector machine learner for comparison.
########### Support vector machine evaluation
cvksvm = resample(learner = lrn_ksvm, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.9315302
## [Resample] iter 2: 0.9074731
## [Resample] iter 3: 0.9060738
## [Resample] iter 4: 0.9175863
## [Resample] iter 5: 0.9130538
## [Resample] iter 6: 0.9140758
## [Resample] iter 7: 0.9170467
## [Resample] iter 8: 0.9223604
## [Resample] iter 9: 0.9203246
## [Resample] iter 10: 0.9076729
## [Resample] iter 11: 0.9254400
## [Resample] iter 12: 0.9163057
## [Resample] iter 13: 0.9318815
## [Resample] iter 14: 0.9008502
## [Resample] iter 15: 0.9061469
## [Resample] iter 16: 0.9068799
## [Resample] iter 17: 0.9171021
## [Resample] iter 18: 0.9273444
## [Resample] iter 19: 0.9191297
## [Resample] iter 20: 0.9190030
## [Resample] iter 21: 0.9082492
## [Resample] iter 22: 0.9195705
## [Resample] iter 23: 0.9096856
## [Resample] iter 24: 0.9192094
## [Resample] iter 25: 0.9285573
##
## Aggregated Result: auc.test.mean=0.9165021
##
sp_cvksvm = resample(learner = lrn_ksvm, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.5720567
## [Resample] iter 2: 0.7259129
## [Resample] iter 3: 0.6189466
## [Resample] iter 4: 0.2537036
## [Resample] iter 5: 0.2253695
## [Resample] iter 6: 0.6132446
## [Resample] iter 7: 0.6146761
## [Resample] iter 8: 0.6062001
## [Resample] iter 9: 0.4357252
## [Resample] iter 10: 0.2872995
## [Resample] iter 11: 0.2068053
## [Resample] iter 12: 0.2567700
## [Resample] iter 13: 0.6099382
## [Resample] iter 14: 0.7250189
## [Resample] iter 15: 0.5764467
## [Resample] iter 16: 0.5438511
## [Resample] iter 17: 0.5204003
## [Resample] iter 18: 0.5338155
## [Resample] iter 19: 0.4288651
## [Resample] iter 20: 0.2415945
## [Resample] iter 21: 0.6135836
## [Resample] iter 22: 0.5763972
## [Resample] iter 23: 0.2118227
## [Resample] iter 24: 0.2504273
## [Resample] iter 25: 0.7165830
##
## Aggregated Result: auc.test.mean=0.4786182
##
So, all of our learners suffer considerably when switching from a conventional to a spatially-oriented method of performance evaluation, especially the support vector machine. Support vector machines are, however, readily improved through model tuning - a particular strength of this kind of learner. So the winner in this contest is far from decided - not before we have taken a closer look at model tuning options for these different learners.
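Before we move on, it can help to tabulate the aggregated AUC scores for a side-by-side comparison - each resample result stores its aggregated measure in the $aggr slot:
data.frame(learner=c("binomial","randomForest","ksvm"),
conventionalCV=c(cvBinomial$aggr,cvRF$aggr,cvksvm$aggr),
spatialCV=c(sp_cvBinomial$aggr,sp_cvRF$aggr,sp_cvksvm$aggr))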
Part Three: Model tuning
For the tuning of hyperparameters we follow a similar process to the one used when setting up the cross-validation resampling, this time separating the data into five spatially disjoint folds (without repeats, to keep run times manageable). We could (and should) try many more folds and repeats than this, but that would become increasingly time consuming.
We first need to look at each individual learner to check which hyperparameters are available for tuning. These can be accessed using the getParamSet() function. Let’s do this for the random forest learner first.
getParamSet(lrnRF)
## Type len Def
## num.trees integer - 500
## mtry integer - -
## min.node.size integer - -
## replace logical - TRUE
## sample.fraction numeric - -
## split.select.weights numericvector <NA> -
## always.split.variables untyped - -
## respect.unordered.factors discrete - ignore
## importance discrete - none
## write.forest logical - TRUE
## scale.permutation.importance logical - FALSE
## num.threads integer - -
## save.memory logical - FALSE
## verbose logical - TRUE
## seed integer - -
## splitrule discrete - gini
## num.random.splits integer - 1
## keep.inbag logical - FALSE
## Constr Req Tunable Trafo
## num.trees 1 to Inf - TRUE -
## mtry 1 to Inf - TRUE -
## min.node.size 1 to Inf - TRUE -
## replace - - TRUE -
## sample.fraction 0 to 1 - TRUE -
## split.select.weights 0 to 1 - TRUE -
## always.split.variables - - TRUE -
## respect.unordered.factors ignore,order,partition - TRUE -
## importance none,impurity,permutation - FALSE -
## write.forest - - FALSE -
## scale.permutation.importance - Y FALSE -
## num.threads 1 to Inf - FALSE -
## save.memory - - FALSE -
## verbose - - FALSE -
## seed -Inf to Inf - FALSE -
## splitrule gini,extratrees - TRUE -
## num.random.splits 1 to Inf Y TRUE -
## keep.inbag - - FALSE -
For random forests, the hyperparameters known to most strongly affect model performance are “mtry” and the minimum node size (“min.node.size” in ranger). For these hyperparameters we can create a range of values to pass to the tuneParams() function for tuning our model. We create these hyperparameter ranges using the makeParamSet() function. For mtry the range of options is bounded by the number of predictor variables in our data frame (eight), and a sensible range for node size starts from 1, with 10 being a common upper limit in classification problems (larger values reduce computing time but also precision).
paramsRF <- makeParamSet(
makeIntegerParam("mtry",lower = 1,upper = 8),
makeIntegerParam("min.node.size",lower = 1,upper = 10)
)
Now we have the parameter set ready to go we use the tuneParams() function to get the result of our model tuning. Before we do that though, let’s break down exactly what is happening in the model tuning process. The code below is basically a list of information given to the tuneParams() function. We provide the name of the learner “lrnRF”, the task for the learner, the spatially-oriented resampling method (we use the same one as in our model evaluation step above), the range of parameter values to select from “paramsRF”, the number of random combinations of parameter values to use (“ctrl”) and the measure we wish to return “auc”.
The objects we pass to the tuneParams() function will ensure that, for five spatially disjoint folds (subsets) of the data, 10 random combinations of hyperparameters will be used for model tuning, with the optimum values returned (i.e. those that maximize the AUC score). This means that a total of 50 models (5 folds x 10 hyperparameter combinations) will be tested, and each time the training and testing data will be spatially partitioned, such that our tuning is based on maximizing the ability of our model to predict to new spatial locations. In reality we might run many more models, but in the interests of time we will proceed with 10 combinations of hyperparameters for tuning.
Please be patient as the following model tuning step can take a while.
# specifying random parameter value search
tune_level = makeResampleDesc(method = "SpCV", iters = 5)
ctrl = makeTuneControlRandom(maxit = 10)
tuned_RF = tuneParams(learner = lrnRF,
task = task,
resampling = tune_level,
measures = mlr::auc,
par.set = paramsRF,
control = ctrl,
show.info = TRUE)
## [Tune] Started tuning learner classif.ranger for parameter set:
## Type len Def Constr Req Tunable Trafo
## mtry integer - - 1 to 8 - TRUE -
## min.node.size integer - - 1 to 10 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: mtry=1; min.node.size=1
## [Tune-y] 1: auc.test.mean=0.5508521; time: 0.2 min
## [Tune-x] 2: mtry=3; min.node.size=3
## [Tune-y] 2: auc.test.mean=0.5672918; time: 0.4 min
## [Tune-x] 3: mtry=4; min.node.size=9
## [Tune-y] 3: auc.test.mean=0.5851705; time: 0.5 min
## [Tune-x] 4: mtry=2; min.node.size=6
## [Tune-y] 4: auc.test.mean=0.5808736; time: 0.3 min
## [Tune-x] 5: mtry=5; min.node.size=4
## [Tune-y] 5: auc.test.mean=0.5682975; time: 0.6 min
## [Tune-x] 6: mtry=8; min.node.size=2
## [Tune-y] 6: auc.test.mean=0.5639593; time: 0.8 min
## [Tune-x] 7: mtry=8; min.node.size=2
## [Tune-y] 7: auc.test.mean=0.5570767; time: 0.8 min
## [Tune-x] 8: mtry=6; min.node.size=9
## [Tune-y] 8: auc.test.mean=0.5744483; time: 0.6 min
## [Tune-x] 9: mtry=2; min.node.size=9
## [Tune-y] 9: auc.test.mean=0.5678892; time: 0.3 min
## [Tune-x] 10: mtry=5; min.node.size=10
## [Tune-y] 10: auc.test.mean=0.5951370; time: 0.5 min
## [Tune] Result: mtry=5; min.node.size=10 : auc.test.mean=0.5951370
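The best combination here was mtry=5 with min.node.size=10. As a minimal sketch (mlr stores the optimal hyperparameters in the $x slot of the tuning result), these values can be applied back to the learner with setHyperPars():
lrnRF_tuned<-setHyperPars(lrnRF,par.vals=tuned_RF$x) #learner with the tuned values fixed
#re-running the spatial cross-validation with the tuned learner would then look like:
#resample(learner=lrnRF_tuned,task=task,resampling=perf_level_spCV,measures=mlr::auc)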
We can see that the random forest in this case has responded a little to the model tuning we have carried out. We can follow the same process for the support vector machine algorithm. Here we take some guidance from the literature to set the values for the algorithm parameter “C” (this determines the emphasis placed on finding a large margin around the classification boundary - i.e. the distance between the 1/0 groups - versus avoiding unwanted misclassifications) and, for “sigma” (the kernel shape determining the influence of cases close to the classification boundary), we automate the selection using a function from the kernlab library.
Once we have the optimal model parameters we simply need to create a ksvm model (this works in much the same way as the glm() models we have fitted previously). We pass the results from our model tuning step straight to the model formula (we can access the tuning results by position, as in the code below, or more readably by name with the ‘$’ symbol - see the sketch that follows it). This step can take a few minutes to run so please be patient.
# spatial partitioning
#support vector machine model tuning. Lower and upper limits are set to 0.1 and 10 for "C" and for "sigma" the range is set using the sigest() function in the kernlab library. This function looks at a potential model based on our data and returns an estimated range of sigma for good model performance.
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:CircStats':
##
## rvm
## The following object is masked from 'package:adehabitatMA':
##
## buffer
## The following objects are masked from 'package:terra':
##
## buffer, size
## The following objects are masked from 'package:raster':
##
## buffer, rotated
sigest(Pres~.,data=all.cov[,1:9])
## 90% 50% 10%
## 0.03047447 0.09297329 0.35495406
# specifying random parameter value search
paramsSVM = makeParamSet(
makeNumericParam("C", lower = 0.1, upper = 10),
makeNumericParam("sigma", lower = 0.03, upper = 0.355)
)
tuned_SVM = tuneParams(learner = lrn_ksvm,
task = task,
resampling = tune_level,
measures = mlr::auc,
par.set = paramsSVM,
control = ctrl,
show.info = TRUE)
## [Tune] Started tuning learner classif.ksvm for parameter set:
## Type len Def Constr Req Tunable Trafo
## C numeric - - 0.1 to 10 - TRUE -
## sigma numeric - - 0.03 to 0.355 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: C=3.27; sigma=0.0716
## [Tune-y] 1: auc.test.mean=0.4928474; time: 0.6 min
## [Tune-x] 2: C=9.6; sigma=0.284
## [Tune-y] 2: auc.test.mean=0.5455737; time: 0.4 min
## [Tune-x] 3: C=1.52; sigma=0.0785
## [Tune-y] 3: auc.test.mean=0.5074506; time: 0.6 min
## [Tune-x] 4: C=6.62; sigma=0.0748
## [Tune-y] 4: auc.test.mean=0.4844498; time: 0.5 min
## [Tune-x] 5: C=0.646; sigma=0.227
## [Tune-y] 5: auc.test.mean=0.5131040; time: 0.5 min
## [Tune-x] 6: C=0.831; sigma=0.145
## [Tune-y] 6: auc.test.mean=0.4883957; time: 0.5 min
## [Tune-x] 7: C=5.79; sigma=0.343
## [Tune-y] 7: auc.test.mean=0.5565103; time: 0.4 min
## [Tune-x] 8: C=5.76; sigma=0.0312
## [Tune-y] 8: auc.test.mean=0.5039993; time: 0.7 min
## [Tune-x] 9: C=1.11; sigma=0.0394
## [Tune-y] 9: auc.test.mean=0.5274719; time: 0.4 min
## [Tune-x] 10: C=9.02; sigma=0.33
## [Tune-y] 10: auc.test.mean=0.5695327; time: 0.2 min
## [Tune] Result: C=9.02; sigma=0.33 : auc.test.mean=0.5695327
#model with tuning information and set to allow probability predictions ("prob.model =TRUE"). "C-svc" is the default solver for predictions with binary outcomes in ksvm models and supports probability estimates so we use it here.
#build the model
#tuned_SVM[[3]][2] accesses the slot in the results for sigma
#tuned_SVM[[3]][1] accesses the slot in the results for C
modelSig<- as.numeric(tuned_SVM[[3]][2])
modelC<- as.numeric(tuned_SVM[[3]][1])
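Equivalently, and more readably, the same values can be accessed by name - the tuning result stores the optimal hyperparameters in its $x slot:
modelSig<-tuned_SVM$x$sigma
modelC<-tuned_SVM$x$C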
We already know how to make mapped predictions with a random forest model from Week 3. For an SVM we would proceed as follows:
#build the model
svmMOD<-ksvm(Pres~.,data=all.cov[,1:9], type="C-svc",
kernel=rbfdot,kpar=list(sigma=modelSig),
C=modelC,prob.model=TRUE)
predKSVM<-predict(env,svmMOD,type="prob",index=2,na.rm=TRUE)
plot(predKSVM)
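For comparison, a mapped prediction for the tuned random forest could be sketched as follows (a sketch only: it assumes the ranger package, uses the tuned values stored in tuned_RF$x, and takes the probability column named "1" as the presence probability):
library(ranger)
#fit a probability forest with the tuned hyperparameter values
rfMOD<-ranger(Pres~.,data=all.cov[,1:9],probability=TRUE,
mtry=tuned_RF$x$mtry,min.node.size=tuned_RF$x$min.node.size)
#terra's predict() passes blocks of cell values to 'fun'; we return the presence probability
predRF<-predict(env,rfMOD,na.rm=TRUE,
fun=function(model,data,...) predict(model,data=data)$predictions[,"1"])
plot(predRF)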
So, we can see that with only a small amount of model tuning (in reality we might run hundreds or even thousands of models given more time and computing power) we have been able to considerably improve the support vector machine algorithm. At this point both the random forest and the SVM are giving similar performance results, with the random forest slightly better, so we would favour that model. However, with more time we might continue the SVM tuning by specifying a greater number of models and parameter combinations, and would likely see further improvement.
Now we have made some headway in terms of improving our model prediction through tuning based on spatial partitioning of the data. However, our AUC statistic is still not ideal, and we have no real knowledge of the relationship between individual variables and species occurrence, much less of the nature of point interactions. Let’s take a look at the options provided by point pattern analysis through the spatstat package to see if we can remedy some of these issues.
Part Four: Point pattern analysis
A key difference between the modeling approaches we have used until now and the point pattern analysis we will do next is the format of the environmental covariates which, in the spatstat package that we will use to carry this out, take the form of im (image) objects. So it is worth first converting our raster covariates to image objects. For this we can use the as.im.RasterLayer() function from the “maptools” library:
# load maptools and convert all environmental covariates to im objects. Note that we need to convert to raster inside the function (the as.im.RasterLayer function only works with raster objects, not terra objects)
library(maptools)
## Please note that 'maptools' will be retired during October 2023,
## plan transition at your earliest convenience (see
## https://r-spatial.org/r/2023/05/15/evolution4.html and earlier blogs
## for guidance);some functionality will be moved to 'sp'.
## Checking rgeos availability: FALSE
##
## Attaching package: 'maptools'
## The following object is masked from 'package:sp':
##
## sp2Mondrian
library(raster)
tempIm<-as.im.RasterLayer(raster(env$MeanTemp.))
precipIm<-as.im.RasterLayer(raster(env$AnnualPrecip.))
maxIm<-as.im.RasterLayer(raster(env$Max.Temp.))
minIm<-as.im.RasterLayer(raster(env$Min.Temp.))
rangeIm<-as.im.RasterLayer(raster(env$Temp.Range))
precipDry<-as.im.RasterLayer(raster(env$PrecipDriestQuarter))
tempWettestIm<-as.im.RasterLayer(raster(env$MeanTemp.WettestQuarter))
Next we call the spatstat library and set up our analysis window (necessary for point pattern analysis) and create our point pattern object.
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: spatstat.geom
## spatstat.geom 3.2-8
##
## Attaching package: 'spatstat.geom'
## The following object is masked from 'package:MASS':
##
## area
## The following object is masked from 'package:ade4':
##
## disc
## The following objects are masked from 'package:terra':
##
## area, delaunay, is.empty, rescale, rotate, shift, where.max,
## where.min
## The following object is masked from 'package:dismo':
##
## domain
## The following objects are masked from 'package:raster':
##
## area, rotate, shift
## Loading required package: spatstat.random
## spatstat.random 3.2-2
## Loading required package: spatstat.explore
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:raster':
##
## getData
## spatstat.explore 3.2-6
##
## Attaching package: 'spatstat.explore'
## The following object is masked from 'package:boot':
##
## envelope
## Loading required package: spatstat.model
## Loading required package: rpart
## spatstat.model 3.2-10
##
## Attaching package: 'spatstat.model'
## The following object is masked from 'package:dismo':
##
## response
## Loading required package: spatstat.linnet
## spatstat.linnet 3.1-3
##
## spatstat 3.0-7
## For an introduction to spatstat, type 'beginner'
window.poly <- as.owin(tempIm)#create study window using the tempIm image object as a template
plot(window.poly)#inspect
bradCoords<-crds(bradSP) #get coordinates from our bradSP object to create the point pattern
pppBrad <- ppp(bradCoords[,1],bradCoords[,2], window = window.poly)#create point pattern 'ppp' object
## Warning: 72 points were rejected as lying outside the specified window
## Warning: data contain duplicated points
plot(pppBrad,add=T) #inspect
## Warning in plot.ppp(pppBrad, add = T): 72 illegal points also plotted
pppBrad<-as.ppp(pppBrad)#use as.ppp to remove points outside the window
#Finally, it's a good idea to rescale our data from metres to kilometres. Not only is this more intuitive when dealing with large distances, it also helps with model convergence.
pppBrad<- rescale(pppBrad, 1000)#rescale from m to km
#rescale image objects to conform with the point pattern
tempIm<-rescale(tempIm,1000)
precipIm<-rescale(precipIm,1000)
#plot the points against covariate
plot(tempIm)
plot(pppBrad,add=TRUE)
Now, as an exploratory step we can test our point pattern for evidence of spatial dependence using Ripley’s K and the more intuitive L function:
K<-Kest(pppBrad,correction = "border") #test for spatial dependence and use the "border" option to account for edge effects (points near the edge are ignored)
plot(K)#inspect the results
#check "linearized" Ripley's K for comparison
L<-Lest(pppBrad,correction = "border")
plot(L)
#generate an envelope based on 39 random simulations of a homogeneous Poisson process (rank = 1 means that we use the minimum and maximum simulated values to create the envelope; global=TRUE ensures that the estimate is made over all distances simultaneously rather than point-wise).
Kcsr<-envelope(pppBrad,Kest,nsim=39,rank=1,global =TRUE)
## Generating 39 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
## 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39.
##
## Done.
plot(Kcsr,shade=c("hi","lo"),legend=T) #check whether our data fall within this envelope.
According to the Kest, Lest and envelope tests our data deviate from complete spatial randomness. We should therefore consider exploring inhomogeneous processes when developing our point process models, which we will do next.
For setting up a point process model, in addition to the study window established previously, we need to create a suitable quadrature scheme. The quadrature points used in point process modelling are analogous to the background (pseudo-absence) points we have used in previous weeks. In the case of point process modeling we generally use many more points than typically used in presence-absence modeling and adopt a uniform grid of points rather than a random distribution. This uniform distribution of quadrature points is important for establishing the “intensity” measurement (predicted points per unit area) that is the outcome of point process modeling.
We use the quadscheme() function from spatstat to create our quadrature points (also referred to as “dummy” points in some of the spatstat functions). The resolution (number) of these points is determined by the ‘nd’ argument which sets the number of grid points in the horizontal and vertical directions.
Below we test several values for nd using a for loop and a simple model using annual temperature and precipitation as predictors, each time checking the model selection criterion AIC (Akaike’s Information Criterion). Lower AIC values suggest a better fit to the data and we use this as a basis to select the number of quadrature points.
ndTry<-seq(100,500,by=100)
for(i in ndTry){
Q.i<-quadscheme(pppBrad,method = "grid",nd=i)
fit.i<-ppm(Q.i~tempIm+precipIm)
print(i)
print(AIC(fit.i))
}
## [1] 100
## [1] 65011.07
## [1] 200
## [1] 65065.91
## [1] 300
## [1] 65039.24
## [1] 400
## [1] 65051.42
## [1] 500
## [1] 65048.94
# select best performing value for nd to set quadrature scheme
Q<-quadscheme(pppBrad,method = "grid",nd=100)
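It is worth a quick inspection of the scheme before modelling - printing a quadrature object reports the numbers of data and dummy points, and the scheme can also be plotted:
Q #reports the numbers of data points and dummy points in the scheme
plot(Q,main="Quadrature scheme") #data points plus the regular grid of dummy points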
Before we start our modelling in earnest, we can take advantage of some exploratory techniques in the spatstat library, in particular the rhohat() function, which allows us to plot values of individual covariates against predicted intensity values for our point pattern.
#plot increasing values for each environmental covariate against intensity (~ density) of the point pattern
plot(rhohat(pppBrad,tempIm)) # annual temperature
plot(rhohat(pppBrad,precipIm)) # annual precipitation
plot(rhohat(pppBrad,"x")) #x coordinate
plot(rhohat(pppBrad,"y")) #y coordinate
The rhohat() function suggests we have non-linear associations between our environmental covariates and the intensity of points. For temperature and precipitation in particular this seems to roughly follow a quadratic trend. We can use this insight when fitting our model: the poly() function lets us model these variables as quadratic curves. Below we fit quadratic terms for the tempIm and precipIm objects, plus linear terms for the x and y coordinates, inside our model formula to reflect these trends. The ppm() function is used to create a point process model. In practice we would experiment with many more combinations of predictor variables and test their suitability through comparison of the model AIC statistics. This can be quite time consuming (and tedious), so for the sake of this class we will just consider the key variables of annual temperature and precipitation.
firstPPMod<-ppm(Q~poly(precipIm,2)+poly(tempIm,2)+x+y) #point process model with quadratic terms for temperature and precipitation plus linear terms for the projected coordinates
AIC(firstPPMod) # get AIC
## [1] 53104.88
As we discovered in our earlier exploratory analysis, we have reason to believe that our data do not come from a distribution with complete spatial randomness and we may want to consider accounting for spatial dependence between points in our modeling.
This first ppm model is basically a standard Poisson regression model and does not account for point interactions. It therefore assumes spatial randomness in the data, as discussed in the lecture. We can test whether our model conforms to spatial randomness (i.e. whether the data could come from a random Poisson process) using the envelope() function, which can also accept fitted point process models as input.
firstModEnv<-envelope(firstPPMod,Kest,nsim=39,rank=1,global =TRUE)
## Generating 78 simulated realisations of fitted Poisson model (39 to estimate
## the mean and 39 to calculate envelopes) ...
##
## Done.
plot(firstModEnv)
As we can see, our initial model exhibits spatial clustering (observed values for K lie above and outside the theoretical CSR envelope) for most radii.
We can attempt to account for this clustering inside our model using a Matern cluster process, specified via the kppm() function (for fitting clustered point process models).
maternMod<-kppm(Q~poly(tempIm,2)+poly(precipIm,2)+x+y,"MatClust") # apply Matern cluster process to the data.
Again, as for our first model, we can perform a diagnostic test for our Matern process model using the envelope() function. In this case we have built our model on the assumption that our points exhibit clustering. In other words, they represent an inhomogeneous point process. We therefore want to create an envelope based on simulations of inhomogeneous (clustered) point patterns rather than CSR. This can be done by changing the second argument of the envelope() function from Kest (Ripley’s K test of CSR) to Kinhom (the inhomogeneous version of K). We can then assess the goodness-of-fit of our model by seeing whether our observed values for Ripley’s K conform to an inhomogeneous distribution (i.e. whether they lie within the simulation envelope).
matEnv<-envelope(maternMod,Kinhom,nsim=39,rank=1,global=TRUE)
## Generating 78 simulated realisations of fitted cluster model (39 to estimate
## the mean and 39 to calculate envelopes) ...
##
## Done.
plot(matEnv)
Our observed K function sits comfortably within the simulation envelope, suggesting that the model fits the inhomogeneous point pattern well and is therefore correctly specified - it is a good approximation of the behaviour of the data.
We can look at the model coefficients by calling coef(maternMod) to see the relationship between our predictor variables and the predicted intensity of points. Notice that variables fitted with poly() have two coefficients: the first describes the overall (linear) trend in intensity, the second the curvature of the relationship. For example, in the case of precipitation, the positive first coefficient and large negative second coefficient suggest that intensity (species density) increases with precipitation before decreasing once a certain value is reached. Note that this information is not directly available from the machine learning approaches.
coef(maternMod)
## (Intercept) poly(tempIm, 2)1 poly(tempIm, 2)2 poly(precipIm, 2)1
## -7.359279938 42.584186463 3.967892571 37.035099357
## poly(precipIm, 2)2 x y
## -54.599033672 -0.001726656 0.002457423
The spatstat package contains a function for calculating the area under the curve for point process models. This differs slightly from binomial models (models with 2 outcomes) in that here it is the intensity rather than the success of correctly classifying a binary outcome that is being tested. The spatstat roc() and auc() functions offer an indication of the ability of the fitted model intensity to separate the spatial domain into areas of high and low density of points. The AUC here is the probability that a randomly-selected data point has higher predicted intensity than does a randomly-selected spatial location. So the test incorporates space into the assessment (i.e. by considering intensity across the entire study area rather than just our data points). We can plot the roc curve and return the auc as below:
plot(roc(maternMod))
auc.kppm(maternMod)
## obs theo
## 0.8757314 0.8922534
The ROC curve and AUC statistic for our model are high, following closely the output that would be expected for a well-specified inhomogeneous point process model (“ftheo” here) and sitting well above the null line (a 50:50 line that represents no ability to discriminate high from low intensity - “fnull” here). So we can see that, with a well-specified model that incorporates spatial dependence, we have achieved a better representation of the data with the point process approach than with our machine learning techniques. In reality we could carry out much more extensive tuning of the machine learning models (increasing the number of iterations in the tuning step), but it is perhaps unlikely that an AUC score comparable to that of the Matern model would be achieved.
We can create a mapped output of our results using the predict() function. Remember, our point process model output is an intensity surface (points per unit area), so the values are quite low. We can log-transform the values to better differentiate their distribution.
prPPMod<-predict(maternMod)
plot(prPPMod)
#log transform the result to highlight differences in value distribution
plot(log(prPPMod))
Finally we can convert our output to raster (remember spatstat deals in image objects):
maternRas<-raster(prPPMod)
plot(maternRas)
You’ve reached the end of this week’s practical. You’ve come a long way on your species distribution modeling journey. For this week’s exercise:
generate a point process model for Solanum acaule using a subset of the data inside an 80% minimum convex polygon (as we did for the Bradypus data in this practical). Test the data for CSR and fit a Matern process to build your model. Plot the ROC curve for your final model as well as the raw and log-transformed prediction maps.
Then, explain the importance of species distribution modelling in the context of nature conservation. Specifically, summarize the opportunities and challenges of applying different modelling approaches to species presence-only data (700 words).