In Week 4 we looked at the use of envelope models, generalized linear models, maxent and point process models for estimating species distribution (Bradypus variegatus). According to the cross-validation approach we took (K-fold partitioning), we achieved a good level of prediction. However, as we saw in the lecture, ecological data tend to be spatially dependent, meaning that observations in close proximity will be more similar than those at a distance. In terms of model validation this means that, with K-fold validation in which samples are selected at random, training and test samples may be quite similar for any given covariate. This can (and almost always does) lead to over-optimistic model validation statistics (e.g. AUC scores). A more reliable approach is to ensure that the folds used in cross-validation are spatially disjoint, i.e. that the geographic distance between training and testing data is maximized. This is particularly important for species distribution modeling, where our goal is prediction across space for our species of interest, so the ability of our model to work across different geographical areas is critical.
In this practical you will explore how spatial dependence affects model validation and the options available for overcoming such problems. We will use statistical models and machine learning algorithms on data for Bradypus variegatus.
Using these data, you will then learn how, for machine learning algorithms, model hyperparameters can be tuned to improve performance. You can think of model hyperparameters as the “settings” for any given model. For example, the hyperparameter “mtry” is the number of explanatory variables sampled for each tree in a random forest algorithm and “nodesize” denotes the minimum number of cases allowed at the end of each decision tree. “Parameters”, on the other hand, are the decisions made by the learner during training (so “hyper” refers to the fact that we set these properties BEFORE the learning process begins).
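To make the distinction concrete, here is a minimal sketch on made-up data (assuming the randomForest package is installed; the data frame d is purely illustrative). The hyperparameters mtry and nodesize are fixed before training, while the actual tree splits (the parameters) are learned from the data:
library(randomForest)
#a tiny illustrative data set (hypothetical - not this week's sloth data)
d<-data.frame(Pres=factor(rep(0:1,each=50)),temp=c(rnorm(50,20),rnorm(50,25)))
#hyperparameters mtry and nodesize are chosen BEFORE training begins
rf<-randomForest(Pres~.,data=d,mtry=1,nodesize=5)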
In part four of the practical we will adopt a point process modeling approach to understanding and making predictions on these same data.
Part One: creating learners in the mlr (“Machine Learning in R”) package
The mlr package provides a suite of tools to explore, create and test modelling strategies within a common machine learning framework.
A key concept in mlr, and in machine learning in general, is the “learner”. Learners are the objects that generate information (i.e. they “learn”) from the data provided, and the data themselves are generally formulated as “tasks”. So the basic logic is that we “learn” how to carry out our “task” and then use this experience for optimal prediction. If that all sounds a little inaccessible, it may help to simply think of learners as models (as in classical statistics, e.g. regression models) and tasks as data sets. Typically, creating a task involves choosing a data set and telling R which column holds the species observations.
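In code, the basic mlr workflow looks like the following sketch, using a built-in data set rather than our species data (and assuming the mlr package, installed further below, is available):
library(mlr)
#a minimal task/learner sketch on a built-in data set (not our sloth data)
irisBin<-droplevels(iris[iris$Species!="virginica",]) #reduce iris to two classes
task0<-makeClassifTask(data=irisBin,target="Species") #the "task": data plus target column
lrn0<-makeLearner("classif.binomial",predict.type="prob") #the "learner": here, logistic regression
mod0<-train(lrn0,task0) #"learn" how to carry out the task
pred0<-predict(mod0,task=task0) #use that experience to predict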
Okay, let’s set up our first task. First we will get the same data as we used last week. If you already have the code from the Week 5 practical (SDM Part One) you can copy it to a new R script and run it to create the all.cov data frame. For completeness, I’ve reproduced the code below.
As you will only be working with data extracted from inside R packages or downloaded from online sources it is not critical where you set your working directory. However, if you intend to save any of your outputs from this practical then create a folder/directory for the work you are about to do. Remember, you can find the code for saving spatial outputs to file at the end of Practical One of this course.
#Code to tell R to only use the terra installation of PROJ (library for performing projections)
plib<-Sys.getenv("PROJ_LIB")
prj<-system.file("proj",package="terra")[1]
Sys.setenv("PROJ_LIB"=prj)
#Load packages
#Note, regarding the spatstat and maptools packages, if you receive a message stating that there is a binary and a more recent source version of the package available, make sure you select "yes" to compile the package from source.
install.packages(c("dismo","terra","spatstat","adehabitatHR"))
# "maptools is a useful package for converting data formats when using Spatstat. Sadly it is deprecated on CRAN but we can load a stable version from here:
install.packages("maptools", repos = "https://packagemanager.posit.co/cran/2023-10-13")
#We also need to install the "mlr" package for this practical
install.packages("mlr")
#in addition we will use machine learning algorithms from the "kernlab" package
install.packages("kernlab") #install necessary package
Next we recreate the data from Week 5. The only difference this week is that we will trim down the study area. Both machine learning (due to the number of tests/iterations) and point process modelling (due to modelling complex patterns) can be computationally intensive. We will use the minimum convex polygon approach to select a convex hull around 80% of our bradypus points and clip the study area to this polygon.
#access the dismo library
library(dismo)
## Warning: package 'dismo' was built under R version 4.3.2
## Loading required package: raster
## Warning: package 'raster' was built under R version 4.3.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.2
files <- list.files(path=paste(system.file(package="dismo"),
'/ex', sep=''), pattern='grd', full.names=TRUE )
files
## [1] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio1.grd"
## [2] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio12.grd"
## [3] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio16.grd"
## [4] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio17.grd"
## [5] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio5.grd"
## [6] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio6.grd"
## [7] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio7.grd"
## [8] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio8.grd"
## [9] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/biome.grd"
library(terra)
## Warning: package 'terra' was built under R version 4.3.2
## terra 1.7.72
env<-rast(files[1:8])
#use the names() function to label each covariate layer
names(env) <- c("MeanTemp.","AnnualPrecip.","Precip.WettestQuarter","PrecipDriestQuarter","Max.Temp.","Min.Temp.","Temp.Range","MeanTemp.WettestQuarter")
plot(env)
SAproj<-"epsg:31972" # we want projected coordinates in order to calculate distances accurately, so we use an EPSG code for a projected CRS covering our South American study area
env<-project(env,y=SAproj,res=10000) # project raster and set an appropriate resolution
#And now plot:
# first layer of the SpatRaster
plot(env, 1)
bradypus <- gbif("Bradypus","variegatus",sp=TRUE) #use gbif() to download the data
## 6342 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6342 records downloaded
#convert from an sp library (old) object to terra (newer/faster library) object
bradypus<-vect(bradypus)
#set the crs
crs(bradypus)<-"epsg:4326"
bradypus<-project(bradypus,crs(env))
plot(env$MeanTemp.)
plot(bradypus,add=TRUE,col ="red")
library(adehabitatHR) #library for mcp() function to get a minimum convex polygon around a certain number of points
## Warning: package 'adehabitatHR' was built under R version 4.3.2
## Loading required package: ade4
## Warning: package 'ade4' was built under R version 4.3.2
## Loading required package: adehabitatMA
## Warning: package 'adehabitatMA' was built under R version 4.3.2
## Registered S3 methods overwritten by 'adehabitatMA':
## method from
## print.SpatialPixelsDataFrame sp
## print.SpatialPixels sp
##
## Attaching package: 'adehabitatMA'
##
## The following object is masked from 'package:terra':
##
## buffer
##
## The following object is masked from 'package:raster':
##
## buffer
##
## Loading required package: adehabitatLT
## Warning: package 'adehabitatLT' was built under R version 4.3.2
## Loading required package: CircStats
## Warning: package 'CircStats' was built under R version 4.3.2
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:terra':
##
## area
##
## The following objects are masked from 'package:raster':
##
## area, select
##
## Loading required package: boot
bradMCP<-mcp(as(bradypus,"Spatial"),percent=80) #create minimum convex polygon around the 80% of points closest to the centroid (central point) of the presence data
plot(env$MeanTemp.) #plot
plot(bradMCP,add=TRUE)
plot(bradypus,add=TRUE)
bradSP<-crop(bradypus,bradMCP) # crop points to the mcp
#inspect
plot(env$MeanTemp.)
plot(bradSP,add=TRUE)
plot(bradMCP,add=T)
###Crop the env object to this new polygon considering 80% of the point data
env<-crop(env,bradMCP)#crop
env<-mask(env,vect(bradMCP))#mask
plot(env)#plot
set.seed(9)
#create random background points from cells of the 'env' object (spatSample() draws random cell locations, skipping NA cells outside the study area)
back <- spatSample(env,size=2000,xy=TRUE,na.rm=TRUE)
head(back)
## x y MeanTemp. AnnualPrecip. Precip.WettestQuarter
## 1 426831.2 196871.5 215.9144 2311.436 792.2646
## 2 986831.2 1096871.5 242.5988 1106.478 431.8414
## 3 1306831.2 276871.5 263.3966 3463.986 1301.7174
## 4 536831.2 226871.5 251.6353 2381.213 812.4107
## 5 616831.2 646871.5 203.7728 2181.190 742.7102
## 6 346831.2 466871.5 219.3495 2250.851 735.3143
## PrecipDriestQuarter Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
## 1 349.0491 273.5486 163.7739 109.8253 213.2734
## 2 105.7077 295.7694 182.0228 113.7466 242.0772
## 3 506.4210 333.8190 203.4195 129.8693 256.9033
## 4 318.2018 317.2062 192.9885 124.6468 249.7131
## 5 319.6212 261.4363 148.2243 112.7249 202.8445
## 6 390.4212 276.0004 165.4384 110.5620 216.3757
#background (pseudo-absence) values (drop cols 1 and 2, which are coordinates)
eA<-back[,3:ncol(back)]
#presence values
eP<-extract(env,bradSP)
#create data frames from values
Pres.cov<-data.frame(eP,Pres=1)
Back.cov<-data.frame(eA,Pres=0)
head(Back.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 215.9144 2311.436 792.2646 349.0491 273.5486
## 2 242.5988 1106.478 431.8414 105.7077 295.7694
## 3 263.3966 3463.986 1301.7174 506.4210 333.8190
## 4 251.6353 2381.213 812.4107 318.2018 317.2062
## 5 203.7728 2181.190 742.7102 319.6212 261.4363
## 6 219.3495 2250.851 735.3143 390.4212 276.0004
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 163.7739 109.8253 213.2734 0
## 2 182.0228 113.7466 242.0772 0
## 3 203.4195 129.8693 256.9033 0
## 4 192.9885 124.6468 249.7131 0
## 5 148.2243 112.7249 202.8445 0
## 6 165.4384 110.5620 216.3757 0
head(Pres.cov)
## ID MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter
## 1 1 242.7934 860.167 531.1019 37.79567
## 2 2 243.5608 2908.904 1178.0919 202.05040
## 3 3 243.7697 3312.001 1443.1741 144.62831
## 4 4 231.2706 3314.167 1235.5300 302.25983
## 5 5 221.3931 1139.519 490.4107 94.57846
## 6 6 243.5608 2908.904 1178.0919 202.05040
## Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 296.5367 194.3747 102.1593 249.7906 1
## 2 310.3405 184.6754 125.2382 242.2694 1
## 3 315.8277 178.8665 136.1665 240.3386 1
## 4 292.5822 175.1340 117.3696 230.7110 1
## 5 283.7167 152.7786 131.2737 224.1626 1
## 6 310.3405 184.6754 125.2382 242.2694 1
#combine both data sets using the rbind() function (binds datasets with the same columns by stacking one on top of the other - i.e. joining rows together)
all.cov<-rbind(Pres.cov[,2:ncol(Pres.cov)],Back.cov)
#inspect
head(all.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 242.7934 860.167 531.1019 37.79567 296.5367
## 2 243.5608 2908.904 1178.0919 202.05040 310.3405
## 3 243.7697 3312.001 1443.1741 144.62831 315.8277
## 4 231.2706 3314.167 1235.5300 302.25983 292.5822
## 5 221.3931 1139.519 490.4107 94.57846 283.7167
## 6 243.5608 2908.904 1178.0919 202.05040 310.3405
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 194.3747 102.1593 249.7906 1
## 2 184.6754 125.2382 242.2694 1
## 3 178.8665 136.1665 240.3386 1
## 4 175.1340 117.3696 230.7110 1
## 5 152.7786 131.2737 224.1626 1
## 6 184.6754 125.2382 242.2694 1
Now we can use this data frame to create a task for our learners to carry out. As stated at the start of this practical, the spatial location of points is critical to our analysis and will be required at every step of model development, testing and tuning. So, first let’s create an object for the coordinates of our presence and background data for use in the analysis.
### create task for use with mlr classification algorithms
coordsBack<-back[,1:2]
coordsPres<-crds(bradSP)
coords<-rbind(coordsPres,coordsBack)
coords<-data.frame(coords)
colnames(coords)<-c("x","y")
all.cov<-cbind(all.cov,coords)
all.cov<-subset(all.cov,!is.na(all.cov$MeanTemp.))#remove any NA values
#Now we create our task, passing the data frame, specifying the species presence/absence column (the "target") and telling R that 1 = present (i.e. a positive result)
library(mlr)
## Warning: package 'mlr' was built under R version 4.3.2
## Loading required package: ParamHelpers
## Warning: package 'ParamHelpers' was built under R version 4.3.2
##
## Attaching package: 'ParamHelpers'
## The following object is masked from 'package:raster':
##
## getValues
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
##
## Attaching package: 'mlr'
## The following object is masked from 'package:adehabitatLT':
##
## subsample
## The following object is masked from 'package:terra':
##
## resample
## The following object is masked from 'package:raster':
##
## resample
#For the makeClassifTask function to work, our target variable needs to be categorical (a "factor" in R) so let's tidy that up first
all.cov$Pres<-factor(all.cov$Pres)
task = makeClassifTask(data = all.cov[,1:9], target = "Pres",
positive = "1", coordinates = all.cov[,10:11])
Now that we have the task set up, we need to choose some learners that will help us carry out our task. If we already know which learner we prefer then we can use the makeLearner() function to specify our algorithm. If we don’t, the mlr package has a really useful function, listLearners(), to which you can pass your existing task to get a list of machine learning algorithms that apply.
The listLearners() function returns A LOT of information, most of which we do not need, so we will not explore it in full here (a quick sketch of how to peek at it is shown below). Instead we will go straight to creating our learners, which for today’s practical are: 1. classif.binomial (an implementation of logistic regression), 2. classif.randomForest and 3. classif.ksvm (a support vector machine algorithm developed by the Kernlab group). We looked at binomial regression and random forest previously. Support Vector Machines (SVMs) are classifiers that attempt to create the largest possible margin between groups of samples (such as presence/absence points), maximising the distance between the groups whilst reducing the number of misclassifications, thereby helping the classifier to make decisions about group membership. See here for an article giving a nice simple overview of how SVMs work: https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496
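If you are curious, here is a quick sketch of how you might peek at the available learners (listLearners() returns a data frame; we look at just two of its columns):
lrnsAvail<-listLearners(task,warn.missing.packages=FALSE) #list learners applicable to our task
head(lrnsAvail[,c("class","package")]) #peek at the first few algorithms and their host packages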
The next step is to create our individual learners that will do the work of solving our task. Let’s start with the most familiar type - binomial regression:
################ Binomial (logistic regression)
lrnBinomial = makeLearner("classif.binomial",
predict.type = "prob",
fix.factors.prediction = TRUE)
In the above code we first tell the makeLearner() function the type of classifier we want to work with and specify the type of prediction we want to make (probability). Setting fix.factors.prediction = TRUE ensures that factor levels in new data are made consistent with those used in training, which avoids errors when predicting later on.
Next we do the same for our Random Forest and Ksvm learners:
##########Random Forest
lrnRF = makeLearner("classif.randomForest",
predict.type = "prob")
###### Support Vector Machine
lrn_ksvm = makeLearner("classif.ksvm",
predict.type = "prob",
kernel = "rbfdot") # this argument specifies a "Radial Basis kernel function", essentially a Gaussian kernel (one with the shape of a normal distribution). The algorithm uses the kernel to decide how to separate the groups (you can think of it as a means to estimate or weight distances between group 1 - presence - and group 0 - absence).
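To demystify the RBF kernel a little, kernlab lets you evaluate it directly. A small sketch (sigma = 1 is an arbitrary value chosen purely for illustration):
library(kernlab)
rbf<-rbfdot(sigma=1) #create a Gaussian (radial basis) kernel function
rbf(c(0,0),c(1,1)) #similarity between two points: exp(-sigma*||x-y||^2), so it decays with squared distance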
Part Two: model evaluation
Now that we have both our task and learners ready we can apply each learner to the task and evaluate their performance.
To do this we have to set up a sampling scheme for cross validation. This is conceptually identical to the k-fold validation we did in Week 4 but here we don’t have to go through the trouble of writing our own for loops - we can use the functionality within the mlr package to do this for us. First we will create a k-fold resampling strategy that DOES NOT take spatial location into account.
### Create resampling scheme (performance level) for cross-validation.
##This code creates 5 folds in the data, repeats the whole k-fold cross-validation process 4 times (so 20 models in total) and aggregates (averages) the AUC statistic.
perf_levelCV = makeResampleDesc(method = "RepCV", folds = 5, reps = 4)
In addition we can easily set up the same cross validation strategy but with spatially disjoint folds. All we have to do is change “RepCV” to “SpRepCV” like so:
perf_level_spCV = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 4) #sampling strategy to run five-fold re-sampling four times, this time with spatially disjoint folds
Now we can compare, for each of our learners, their performance based on conventional k-fold validation and one where spatial location is considered (i.e. when test data are made to be geographically far from training data).
#Binomial conventional cross validation (K fold)
cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.8248523
## [Resample] iter 2: 0.8235809
## [Resample] iter 3: 0.8255488
## [Resample] iter 4: 0.8341577
## [Resample] iter 5: 0.8432031
## [Resample] iter 6: 0.8308924
## [Resample] iter 7: 0.8363761
## [Resample] iter 8: 0.8344599
## [Resample] iter 9: 0.8293505
## [Resample] iter 10: 0.8156815
## [Resample] iter 11: 0.8325463
## [Resample] iter 12: 0.8192725
## [Resample] iter 13: 0.8464813
## [Resample] iter 14: 0.8408374
## [Resample] iter 15: 0.8084390
## [Resample] iter 16: 0.8490445
## [Resample] iter 17: 0.8032318
## [Resample] iter 18: 0.8349095
## [Resample] iter 19: 0.8313072
## [Resample] iter 20: 0.8269237
##
## Aggregated Result: auc.test.mean=0.8295548
##
##Binomial spatial cross validation
sp_cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.6470515
## [Resample] iter 2: 0.7093827
## [Resample] iter 3: 0.4376015
## [Resample] iter 4: 0.5739824
## [Resample] iter 5: 0.3056895
## [Resample] iter 6: 0.7093827
## [Resample] iter 7: 0.5739824
## [Resample] iter 8: 0.3056895
## [Resample] iter 9: 0.4376015
## [Resample] iter 10: 0.6470515
## [Resample] iter 11: 0.4782728
## [Resample] iter 12: 0.5876556
## [Resample] iter 13: 0.0668723
## [Resample] iter 14: 0.4954371
## [Resample] iter 15: 0.7225883
## [Resample] iter 16: 0.6408477
## [Resample] iter 17: 0.4583903
## [Resample] iter 18: 0.5371334
## [Resample] iter 19: 0.4105826
## [Resample] iter 20: 0.8482288
##
## Aggregated Result: auc.test.mean=0.5296712
##
You should see that the result (aggregated AUC) for the spatial cross-validation is noticeably lower than for the conventional cross-validation. This strongly suggests that the latter is an over-optimistic assessment and that we should be cautious with non-spatial approaches to model evaluation.
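You can pull these aggregated scores straight from the resample result objects to compare them side by side:
cvBinomial$aggr #aggregated AUC from conventional (random) k-fold cross-validation
sp_cvBinomial$aggr #aggregated AUC from the spatially disjoint folds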
Now let’s look at our Random Forest learner.
###### Random Forest evaluation
##random sampling cross-validation
cvRF = resample(learner = lrnRF, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.9796747
## [Resample] iter 2: 0.9808870
## [Resample] iter 3: 0.9732387
## [Resample] iter 4: 0.9830299
## [Resample] iter 5: 0.9832725
## [Resample] iter 6: 0.9790994
## [Resample] iter 7: 0.9855364
## [Resample] iter 8: 0.9802687
## [Resample] iter 9: 0.9812615
## [Resample] iter 10: 0.9734789
## [Resample] iter 11: 0.9797515
## [Resample] iter 12: 0.9782136
## [Resample] iter 13: 0.9749778
## [Resample] iter 14: 0.9821473
## [Resample] iter 15: 0.9794942
## [Resample] iter 16: 0.9734883
## [Resample] iter 17: 0.9812067
## [Resample] iter 18: 0.9775149
## [Resample] iter 19: 0.9814442
## [Resample] iter 20: 0.9776476
##
## Aggregated Result: auc.test.mean=0.9792817
##
#spatial partitioning cross-validation
sp_cvRF = resample(learner = lrnRF, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.4820920
## [Resample] iter 2: 0.6300735
## [Resample] iter 3: 0.5406158
## [Resample] iter 4: 0.4912433
## [Resample] iter 5: 0.6556610
## [Resample] iter 6: 0.5910169
## [Resample] iter 7: 0.2445047
## [Resample] iter 8: 0.4217364
## [Resample] iter 9: 0.6674885
## [Resample] iter 10: 0.6514285
## [Resample] iter 11: 0.4316055
## [Resample] iter 12: 0.6755394
## [Resample] iter 13: 0.6329571
## [Resample] iter 14: 0.2714067
## [Resample] iter 15: 0.6319091
## [Resample] iter 16: 0.4248360
## [Resample] iter 17: 0.6231176
## [Resample] iter 18: 0.5233279
## [Resample] iter 19: 0.4942927
## [Resample] iter 20: 0.6616053
##
## Aggregated Result: auc.test.mean=0.5373229
##
You will notice that, although random forest out-performs the binomial regression learner when assessed from a conventional cross-validation point of view, the binomial learner achieves a higher AUC when spatial dependency is taken into account. However, random forest has a set of hyperparameters that can easily be tuned to enhance model performance, a functionality not present for binomial learners (which are based on statistical methods rather than on machine learning algorithms). We will see later in the practical whether model tuning changes the relative standing of these two learners. First though, let’s perform the same evaluation on our support vector machine learner for comparison.
########### Support vector machine evaluation
cvksvm = resample(learner = lrn_ksvm, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.9225526
## [Resample] iter 2: 0.9174637
## [Resample] iter 3: 0.9154988
## [Resample] iter 4: 0.9101491
## [Resample] iter 5: 0.9262508
## [Resample] iter 6: 0.9218152
## [Resample] iter 7: 0.9151019
## [Resample] iter 8: 0.9178854
## [Resample] iter 9: 0.9274458
## [Resample] iter 10: 0.9199808
## [Resample] iter 11: 0.9255461
## [Resample] iter 12: 0.9194485
## [Resample] iter 13: 0.9199696
## [Resample] iter 14: 0.9112554
## [Resample] iter 15: 0.9157295
## [Resample] iter 16: 0.9220167
## [Resample] iter 17: 0.9184347
## [Resample] iter 18: 0.9300501
## [Resample] iter 19: 0.9159154
## [Resample] iter 20: 0.9106382
##
## Aggregated Result: auc.test.mean=0.9191574
##
sp_cvksvm = resample(learner = lrn_ksvm, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.6411503
## [Resample] iter 2: 0.1249034
## [Resample] iter 3: 0.5107787
## [Resample] iter 4: 0.2936386
## [Resample] iter 5: 0.6042987
## [Resample] iter 6: 0.3724755
## [Resample] iter 7: 0.5985744
## [Resample] iter 8: 0.6384971
## [Resample] iter 9: 0.1474978
## [Resample] iter 10: 0.2986111
## [Resample] iter 11: 0.1355930
## [Resample] iter 12: 0.6382247
## [Resample] iter 13: 0.2987700
## [Resample] iter 14: 0.5137711
## [Resample] iter 15: 0.6148720
## [Resample] iter 16: 0.5368927
## [Resample] iter 17: 0.2988972
## [Resample] iter 18: 0.6391562
## [Resample] iter 19: 0.6057766
## [Resample] iter 20: 0.1586105
##
## Aggregated Result: auc.test.mean=0.4335495
##
So, both learners apparently suffer considerably when switching from a conventional to a spatially-oriented method of performance evaluation, especially in the case of the support vector machine. However, support vector machines are readily improved through model tuning - a particular strength of this kind of learner. So the winner in this contest is far from decided, at least not before we have taken a closer look at the model tuning options for these different learners.
Part Three: Model tuning
For tuning of hyperparameters we initially follow a similar process to that used when setting up cross-validation resampling. We use a resampling scheme in which the data are separated into five spatially disjoint folds, and we set the “ctrl” argument so that, for each fold, the tuning process generates random combinations of hyperparameter values. This is done via the makeTuneControlRandom() function with the maximum number of iterations (“maxit”) set to 5, i.e. five random combinations. We could (and should) try many more than this, but that would become increasingly time consuming.
#hyperparameter tuning
# spatial partitioning
tune_level = makeResampleDesc(method = "SpCV", iters = 5)
# specifying random parameter value search
ctrl = makeTuneControlRandom(maxit = 5L)
Once the data resampling has been set up, we need to look at each individual learner to check which hyperparameters are available for tuning. These can be accessed using the getParamSet() function. Let’s do this for the randomForest learner first.
getParamSet(lrnRF)
## Type len Def Constr Req Tunable Trafo
## ntree integer - 500 1 to Inf - TRUE -
## mtry integer - - 1 to Inf - TRUE -
## replace logical - TRUE - - TRUE -
## classwt numericvector <NA> - 0 to Inf - TRUE -
## cutoff numericvector <NA> - 0 to 1 - TRUE -
## strata untyped - - - - FALSE -
## sampsize integervector <NA> - 1 to Inf - TRUE -
## nodesize integer - 1 1 to Inf - TRUE -
## maxnodes integer - - 1 to Inf - TRUE -
## importance logical - FALSE - - TRUE -
## localImp logical - FALSE - - TRUE -
## proximity logical - FALSE - - FALSE -
## oob.prox logical - - - Y FALSE -
## norm.votes logical - TRUE - - FALSE -
## do.trace logical - FALSE - - FALSE -
## keep.forest logical - TRUE - - FALSE -
## keep.inbag logical - FALSE - - FALSE -
For random forests the hyperparameters known to most strongly affect model performance are “mtry” and “nodesize”. So, for these hyperparameters we can create a range of values to pass to the tuneParams() function for tuning our model. We create these hyperparameter ranges using the makeParamSet() function. For mtry the range of options is bounded by the number of predictor variables in our data frame (eight), and a sensible range for nodesize starts from 1, with 10 being a common upper limit in classification problems (larger values reduce computing time but also precision).
paramsRF <- makeParamSet(
makeIntegerParam("mtry",lower = 1,upper = 8),
makeIntegerParam("nodesize",lower = 1,upper = 10)
)
Now we have the parameter set ready to go we use the tuneParams() function to get the result of our model tuning. Before we do that though, let’s break down exactly what is happening in the model tuning process. The code below is basically a list of information given to the tuneParams() function. We provide the name of the learner “lrnRF”, the task for the learner, the spatially-oriented resampling method “tune_level”, the range of parameter values to select from “paramsRF”, the number of random combinations of parameter values to use (“ctrl”) and the measure we wish to return “auc”.
The objects we pass to the tuneParams() function will ensure that for five spatially disjoint folds (subsets) of the data, 5 random combinations of hyperparameters will be used for model tuning with the optimum values returned (i.e. those that maximize the auc score). This means that a total of 25 models will be tested and each time the training and testing data will be spatially partitioned such that our tuning is based on maximizing the ability of our model to predict to new spatial locations. In reality we might run many more models but in the interests of time we will proceed with 5 combinations for tuning.
Please be patient as the following model tuning step can take a while.
tuneRF = tuneParams (learner = lrnRF,
task=task,
resampling = tune_level,
par.set = paramsRF,
control = ctrl,
measures = mlr::auc,
show.info = TRUE)
## [Tune] Started tuning learner classif.randomForest for parameter set:
## Type len Def Constr Req Tunable Trafo
## mtry integer - - 1 to 8 - TRUE -
## nodesize integer - - 1 to 10 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: mtry=6; nodesize=1
## [Tune-y] 1: auc.test.mean=0.5361777; time: 0.2 min
## [Tune-x] 2: mtry=4; nodesize=4
## [Tune-y] 2: auc.test.mean=0.5083837; time: 0.2 min
## [Tune-x] 3: mtry=8; nodesize=7
## [Tune-y] 3: auc.test.mean=0.5492542; time: 0.3 min
## [Tune-x] 4: mtry=8; nodesize=6
## [Tune-y] 4: auc.test.mean=0.5423036; time: 0.3 min
## [Tune-x] 5: mtry=5; nodesize=1
## [Tune-y] 5: auc.test.mean=0.5230143; time: 0.2 min
## [Tune] Result: mtry=8; nodesize=7 : auc.test.mean=0.5492542
We can now re-formulate the random forest learner based on these results (note we can access the tuning results directly with the '$' symbol), for example using mlr's setHyperPars() function - a minimal sketch:
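lrnRF_tuned<-setHyperPars(lrnRF,par.vals=tuneRF$x) #tuneRF$x is a named list holding the winning mtry and nodesize values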
We can see that the random forest in this case has responded positively, if modestly, to the model tuning we have carried out. We can follow the same process for the support vector machine algorithm. Here we take some recommendations from the literature (Schratz et al., 2018) [https://arxiv.org/pdf/1803.11266.pdf] to set the ranges for the algorithm hyperparameters “C” (which determines the emphasis placed on finding a large margin around the classification boundary - i.e. the distance between the 1/0 groups - versus avoiding unwanted misclassifications) and “sigma” (the kernel shape, which determines the influence of cases close to the classification boundary).
Once we have the optimal model hyperparameters we simply need to create a ksvm model (this works in much the same way as the glm() models we have fitted previously). We pass the results from our model tuning step straight to the model formula (again, we can access the tuning results directly with the ‘$’ symbol). This step can take a few minutes to run so please be patient.
Note that in the code below the model is fitted on the environmental covariates only (the coordinate columns of all.cov are left out), so we can predict directly onto the raster stack of environmental layers.
# spatial partitioning
tune_level = makeResampleDesc(method = "SpCV", iters = 5)
# specifying random parameter value search
ctrl = makeTuneControlRandom(maxit = 5L)
#support vector machine model tuning. With the trafo (transformation) functions below, the search ranges become 2^-12 to 2^15 for "C" and 2^-15 to 2^6 for "sigma". See Schratz et al. (2018) above for more details on the choice of hyperparameter ranges.
paramsSVM = makeParamSet(
makeNumericParam("C", lower = -12, upper = 15, trafo = function(x) 2^x),
makeNumericParam("sigma", lower = -15, upper = 6, trafo = function(x) 2^x)
)
tuneSVM = tuneParams(learner = lrn_ksvm,
task=task,
resampling = tune_level,
par.set = paramsSVM,
control = ctrl,
measures = mlr::auc,
show.info = TRUE)
## [Tune] Started tuning learner classif.ksvm for parameter set:
## Type len Def Constr Req Tunable Trafo
## C numeric - - -12 to 15 - TRUE Y
## sigma numeric - - -15 to 6 - TRUE Y
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: C=0.138; sigma=47.1
## [Tune-y] 1: auc.test.mean=0.5660925; time: 0.3 min
## [Tune-x] 2: C=168; sigma=0.00092
## [Tune-y] 2: auc.test.mean=0.4580252; time: 0.4 min
## [Tune-x] 3: C=550; sigma=36.9
## [Tune-y] 3: auc.test.mean=0.5516869; time: 0.3 min
## [Tune-x] 4: C=0.0162; sigma=48.5
## [Tune-y] 4: auc.test.mean=0.5623144; time: 0.3 min
## [Tune-x] 5: C=0.00268; sigma=3.05
## [Tune-y] 5: auc.test.mean=0.5642732; time: 0.3 min
## [Tune] Result: C=0.138; sigma=47.1 : auc.test.mean=0.5660925
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:CircStats':
##
## rvm
## The following object is masked from 'package:adehabitatMA':
##
## buffer
## The following objects are masked from 'package:terra':
##
## buffer, size
## The following objects are masked from 'package:raster':
##
## buffer, rotated
#model with tuning information and set to allow probability predictions ("prob.model =TRUE"). "C-svc" is the default solver for predictions with binary outcomes in ksvm models and supports probability estimates so we use it here.
#build the model
svmMOD<-ksvm(Pres~.,data=all.cov[,1:9], type="C-svc",
kernel=rbfdot,kpar=list(sigma=tuneSVM$x$sigma),C=tuneSVM$x$C,prob.model=TRUE)
predKSVM<-predict(env,svmMOD,type="prob",index=2,na.rm=TRUE)
plot(predKSVM)
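If you want to keep this prediction surface, you can write it to file with terra (assuming you created a working directory at the start; the file name here is just an example):
writeRaster(predKSVM,"predKSVM.tif",overwrite=TRUE) #save the tuned SVM prediction to GeoTIFF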
So, we can see that with only a small amount of model tuning (in reality we might run hundreds or even thousands of models with more time and computing power) we have been able to improve the support vector machine algorithm considerably, and we would now choose this algorithm to predict our species distribution on the original raster stack object. Random forest is outperformed here at this level of model tuning. However, considering the number of combinations of hyperparameters possible we might in practice achieve better results by specifying a greater number of models and parameter combinations.
Now we have made some headway in terms of improving our model prediction through tuning based on spatial partitioning of the data. However, our AUC statistic is still not ideal and we have no real knowledge of the relationship between individual variables and species occurrence, much less the nature of point interactions. Let’s take a look at the options provided by point pattern analysis through the spatstat package to see if we can remedy some of these issues.
Part Four: Point pattern analysis
A key difference between the modeling approaches we have used until now and the point pattern analysis we will do next is the format of the environmental covariates which, in the spatstat package that we will use to carry this out, take the form of "im" (image) objects. So it is worth first converting our raster covariates to image objects. For this we can use the as.im.RasterLayer() function from the maptools library:
# load maptools and convert all environmental covariates to im objects. Note that we need to convert to raster inside the function (the as.im.RasterLayer() function only works with raster-package objects)
library(maptools)
## Please note that 'maptools' will be retired during October 2023,
## plan transition at your earliest convenience (see
## https://r-spatial.org/r/2023/05/15/evolution4.html and earlier blogs
## for guidance);some functionality will be moved to 'sp'.
## Checking rgeos availability: FALSE
##
## Attaching package: 'maptools'
## The following object is masked from 'package:sp':
##
## sp2Mondrian
tempIm<-as.im.RasterLayer(raster(env$MeanTemp.))
precipIm<-as.im.RasterLayer(raster(env$AnnualPrecip.))
maxIm<-as.im.RasterLayer(raster(env$Max.Temp.))
minIm<-as.im.RasterLayer(raster(env$Min.Temp.))
rangeIm<-as.im.RasterLayer(raster(env$Temp.Range))
precipDry<-as.im.RasterLayer(raster(env$PrecipDriestQuarter))
tempWettestIm<-as.im.RasterLayer(raster(env$MeanTemp.WettestQuarter))
Next we call the spatstat library and set up our analysis window (necessary for point pattern analysis) and create our point pattern object.
library(spatstat)
## Warning: package 'spatstat' was built under R version 4.3.2
## Loading required package: spatstat.data
## Warning: package 'spatstat.data' was built under R version 4.3.2
## Loading required package: spatstat.geom
## Warning: package 'spatstat.geom' was built under R version 4.3.2
## spatstat.geom 3.2-8
##
## Attaching package: 'spatstat.geom'
## The following object is masked from 'package:MASS':
##
## area
## The following object is masked from 'package:ade4':
##
## disc
## The following objects are masked from 'package:terra':
##
## area, delaunay, is.empty, rescale, rotate, shift, where.max,
## where.min
## The following object is masked from 'package:dismo':
##
## domain
## The following objects are masked from 'package:raster':
##
## area, rotate, shift
## Loading required package: spatstat.random
## Warning: package 'spatstat.random' was built under R version 4.3.2
## spatstat.random 3.2-2
## Loading required package: spatstat.explore
## Warning: package 'spatstat.explore' was built under R version 4.3.2
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:raster':
##
## getData
## spatstat.explore 3.2-6
##
## Attaching package: 'spatstat.explore'
## The following object is masked from 'package:boot':
##
## envelope
## Loading required package: spatstat.model
## Warning: package 'spatstat.model' was built under R version 4.3.2
## Loading required package: rpart
## spatstat.model 3.2-10
##
## Attaching package: 'spatstat.model'
## The following object is masked from 'package:dismo':
##
## response
## Loading required package: spatstat.linnet
## Warning: package 'spatstat.linnet' was built under R version 4.3.2
## spatstat.linnet 3.1-3
##
## spatstat 3.0-7
## For an introduction to spatstat, type 'beginner'
window.poly <- as.owin(tempIm)#create study window using the tempIm image object as a template
plot(window.poly)#inspect
bradCoords<-crds(bradSP) #get coordinates from our bradypus object to create the point pattern
pppBrad <- ppp(bradCoords[,1],bradCoords[,2], window = window.poly)#create point pattern 'ppp' object
## Warning: 72 points were rejected as lying outside the specified window
## Warning: data contain duplicated points
plot(pppBrad,add=T) #inspect
## Warning in plot.ppp(pppBrad, add = T): 72 illegal points also plotted
pppBrad<-as.ppp(pppBrad)#tidy up the object: the points flagged as lying outside the window were already dropped when the ppp object was created
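#Note: spatstat also warned that the data contain duplicated points. We keep them here, but if you wanted strictly unique locations you could (optionally) thin them with:
#pppBrad<-unique(pppBrad)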
#Finally, it's a good idea to rescale our data from metres to kilometres. Not only is this more intuitive when dealing with large distances but it also helps with model convergence.
pppBrad<- rescale(pppBrad, 1000)#rescale from m to km
#rescale image objects to conform with the point pattern
tempIm<-rescale(tempIm,1000)
precipIm<-rescale(precipIm,1000)
#plot the points against covariate
plot(tempIm)
plot(pppBrad,add=TRUE)
Now, as an exploratory step we can test our point pattern for evidence of spatial dependence using Ripley’s K and the more intuitive L function:
K<-Kest(pppBrad,correction = "border") #test for spatial dependence and use the "border" option to account for edge effects (points near the edge are ignored)
plot(K)#inspect the results
#check "linearized" Ripley's K for comparison
L<-Lest(pppBrad,correction = "border")
plot(L)
#generate envelope based on 39 random simulations of a homogeneous poisson process (rank = 1 means that we use minimum and maximum values to create the envelope, global=TRUE ensures that the estimate is made for all points simultaneously rather than point-wise).
Kcsr<-envelope(pppBrad,Kest,nsim=39,rank=1,global =TRUE)
## Generating 39 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
## 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39.
##
## Done.
plot(Kcsr,shade=c("hi","lo"),legend=T) #check whether our data fall within this envelope.
According to the Kest, Lest and envelope tests our data deviate from complete spatial randomness. We should therefore consider exploring inhomogeneous processes when developing our point process models, which we will do next.
For setting up a point process model, in addition to the study window established previously, we need to create a suitable quadrature scheme. The quadrature points used in point process modelling are analogous to the background (pseudo-absence) points we have used in previous weeks. In the case of point process modeling we generally use many more points than typically used in presence-absence modeling and adopt a uniform grid of points rather than a random distribution. This uniform distribution of quadrature points is important for establishing the “intensity” measurement (predicted points per unit area) that is the outcome of point process modeling.
We use the quadscheme() function from spatstat to create our quadrature points (also referred to as “dummy” points in some of the spatstat functions). The resolution (number) of these points is determined by the ‘nd’ argument which sets the number of grid points in the horizontal and vertical directions.
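To get a feel for what a quadrature scheme looks like before choosing nd, you can plot a deliberately coarse one (a quick sketch; nd = 50 is arbitrary and purely for display):
Qdemo<-quadscheme(pppBrad,method="grid",nd=50) #coarse scheme for visualisation only
plot(Qdemo,main="Quadrature scheme: data and dummy points") #data points plus the regular grid of dummy points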
Below we test several values for nd using a for loop and a simple model using annual temperature and precipitation as predictors, each time checking the model selection criterion AIC (Akaike’s Information Criterion). Lower AIC values suggest a better fit to the data and we use this as a basis to select the number of quadrature points.
ndTry<-seq(100,500,by=100)
for(i in ndTry){
Q.i<-quadscheme(pppBrad,method = "grid",nd=i)
fit.i<-ppm(Q.i~tempIm+precipIm)
print(i)
print(AIC(fit.i))
}
## [1] 100
## [1] 65011.07
## [1] 200
## [1] 65065.91
## [1] 300
## [1] 65039.24
## [1] 400
## [1] 65051.42
## [1] 500
## [1] 65048.94
# select a value for nd for the quadrature scheme (the AIC differences here are small, so we proceed with nd=400)
Q<-quadscheme(pppBrad,method = "grid",nd=400)
Before we start our modelling in earnest, we can take advantage of some exploratory techniques in the spatstat library, in particular the rhohat() function, which allows us to plot values for individual covariates against predicted intensity values for our point pattern.
#plot increasing values for each environmental covariate against intensity (~ density) of the point pattern
plot(rhohat(pppBrad,tempIm)) # annual temperature
plot(rhohat(pppBrad,precipIm)) # annual precipitation
plot(rhohat(pppBrad,"x")) #x coordinate
plot(rhohat(pppBrad,"y")) #y coordinate
The rhohat() function suggests we have non-linear associations between our environmental covariates and the intensity of points. For temperature and the y coordinate in particular, the association seems to roughly follow a quadratic trend. We can use this insight when fitting our model. For example, we can use the poly() function to model these variables as quadratic curves. Below we fit a quadratic function to the tempIm object and the y coordinate inside our model formula to reflect these trends. The ppm() function is used to create a point process model. In practice we would experiment with many more combinations of predictor variables and test their suitability through comparison of the model AIC statistics. This can be quite time consuming (and tedious), so for the sake of this class we will just consider the key variables of annual temperature and precipitation.
firstPPMod<-ppm(Q~precipIm+poly(tempIm,2)+x+poly(y,2)) #point process model based on polynomial terms for temperature and projected coordinates
AIC(firstPPMod) # get AIC
## [1] 53449.41
As we discovered in our earlier exploratory analysis, we have reason to believe that our data do not come from a distribution with complete spatial randomness and we may want to consider accounting for spatial dependence between points in our modeling.
This first ppm model is basically a standard poisson regression model and does not account for point interactions. It therefore assumes spatial randomness in the data, as discussed in the lecture. We can test whether our model conforms to spatial randomness (i.e. whether the data could have come from a random poisson distribution) using the envelope function, which can also accept fitted point process models as input.
firstModEnv<-envelope(firstPPMod,Kest,nsim=39,rank=1,global =TRUE)
## Generating 78 simulated realisations of fitted Poisson model (39 to estimate
## the mean and 39 to calculate envelopes) ...
## 1, 2, 3, ..., 78.
##
## Done.
plot(firstModEnv)
As we can see our initial model results exhibit spatial clustering (values for K outside and higher than the theoretical CSR envelope) for most radii values.
We can attempt to account for this clustering inside our model using a Matern process, providing this option inside the kppm() function (for fitting clustered ppms).
maternMod<-kppm(Q~poly(tempIm,2)+precipIm+x+poly(y,2),"MatClust") # apply matern cluster process to the data.
Again, as for our first model, we can perform a diagnostic test for our Matern process model using the envelope function. In this case we have built our model based on the assumption that our points exhibit clustering. In other words, they represent an inhomogeneous point process. We therefore want to create an envelope based on simulations of inhomogeneous (clustered) point patterns rather than CSR. This can be done by changing the second argument in the envelope() function from Kest (for Ripley’s K test of CSR) to Kinhom (to simulate inhomogeneous patterns). We can then assess the goodness-of-fit of our model by seeing whether our observed values for Ripley’s K conform with an inhomogeneous distribution (i.e. whether they lie within the simulation envelope).
matEnv<-envelope(maternMod,Kinhom,nsim=39,rank=1,global=TRUE)
## Generating 78 simulated realisations of fitted cluster model (39 to estimate
## the mean and 39 to calculate envelopes) ...
## 1, 2, 3, ..., 78.
##
## Done.
plot(matEnv)
Our observed K function sits nicely within the simulation envelope, suggesting that the model fits well to an inhomogeneous point pattern and is therefore correctly specified - it is a good approximation of the behaviour of the data.
We can look at the model coefficients by calling coef(maternMod) to see the relationship between our predictor variables and the predicted intensity of points. Notice that variables fitted with poly() have two coefficients: the first is the linear term and the second the quadratic term. For example, in the case of temperature the positive linear and negative quadratic coefficients indicate that intensity (species density) heightens with increasing temperature before decreasing once a certain temperature is reached. Note that this kind of information is not directly available from the machine learning approaches.
coef(maternMod)
## (Intercept) poly(tempIm, 2)1 poly(tempIm, 2)2 precipIm
## -6.654676e+00 1.633600e+02 -3.495824e+01 1.792565e-04
## x poly(y, 2)1 poly(y, 2)2
## -1.998196e-03 2.758743e+02 1.119729e+02
The spatstat package contains a function for calculating the area under the curve for point process models. This differs slightly from binomial models (models with 2 outcomes) in that here it is the intensity rather than the success of correctly classifying a binary outcome that is being tested. The spatstat roc() and auc() functions offer an indication of the ability of the fitted model intensity to separate the spatial domain into areas of high and low density of points. The AUC here is the probability that a randomly-selected data point has higher predicted intensity than does a randomly-selected spatial location. So the test incorporates space into the assessment (i.e. by considering intensity across the entire study area rather than just our data points). We can plot the roc curve and return the auc as below:
plot(roc(maternMod))
auc.kppm(maternMod)
## obs theo
## 0.8849202 0.8842009
The roc curve and auc statistic for our model are high, following closely the output that would be expected for a well specified inhomogeneous point process model (“ftheo” here) and well above the null line (a 50:50 line that represents no ability to discriminate high from low intensity - “fnull” here). So we can see that with a well specified model that incorporates spatial dependence we have achieved a better representation of the data with the point process approach compared to our machine learning techniques. In reality we could carry out much more extensive model tuning for our machine learning (increasing number of iterations in the tuning step) but it’s perhaps unlikely that an auc score comparable to that of the Matern model would be achieved.
We can create a mapped output of our results using the predict function. Remember our point process model output is one of intensity (points per unit area) and has quite low values. We can log transform the raster values to better differentiate the distribution of values.
prPPMod<-predict(maternMod)
plot(prPPMod)
#log transform the result to highlight differences in value distribution
plot(log(prPPMod))
Finally we can convert our output to raster (remember spatstat deals in image objects):
maternRas<-raster(prPPMod)
plot(maternRas)
You’ve reached the end of this week’s practical. You’ve come a long way on your species distribution modeling journey. For this week’s exercise, generate a point process model for Solanum acaule for the same extent as for this practical. Test the data for CSR and fit a Matern process to build your model. Plot the roc curve for your final model as well as the raw and log-transformed prediction map.
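To get you started, the occurrence data can be downloaded and projected in the same way as for Bradypus (a sketch; check the records that gbif() returns before modelling):
solanum<-gbif("Solanum","acaule",sp=TRUE) #download occurrence records with dismo's gbif()
solanum<-vect(solanum) #convert to a terra object
crs(solanum)<-"epsg:4326" #declare the geographic CRS
solanum<-project(solanum,crs(env)) #project to match the environmental layers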