In Week 3 we looked at the use of envelope models, generalized linear models, maxent and point process models for estimating a species distribution (Bradypus variegatus). According to the cross-validation approach taken (K-fold partitioning), we achieved a good level of prediction. However, as we saw in the lecture, ecological data tend to be spatially dependent, meaning that observations in close proximity will be more similar than those at a distance. In terms of model validation this means that, in the case of K-fold validation in which samples are selected at random, training and test samples may be quite similar for any given covariate. This can (and almost always does) lead to over-optimistic model validation statistics (e.g. AUC scores). A more reliable approach is to ensure that the folds used in cross-validation are spatially disjoint, i.e. that the geographic distance between training and testing data is maximized. This is particularly important for species distribution modeling, where our goal is prediction across space for our species of interest, so the ability of our model to work across different geographical areas is critical.
In this practical you will explore how spatial dependence affects model validation and the options available for overcoming such problems. We will use statistical models and machine learning algorithms on data for Bradypus variegatus.
Using these data, you will then learn how, for machine learning algorithms, model hyperparameters can be tuned to improve performance. You can think of model hyperparameters as the “settings” for any given model. For example, the hyperparameter “mtry” is the number of explanatory variables randomly sampled as candidates at each split of each tree in a random forest algorithm, and “node size” denotes the minimum number of cases allowed in a terminal node of each decision tree. “Parameters”, on the other hand, are the decisions made by the learner during training (so “hyper” refers to the fact that we set these properties BEFORE the learning process begins).
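To make this concrete, here is a minimal sketch (using the built-in iris data and the ranger package, which we install later in this practical) of hyperparameters being fixed before training begins:
library(ranger)
#hyperparameters (num.trees, mtry, min.node.size) are set BEFORE training starts
rf<-ranger(Species~.,data=iris,num.trees=500,mtry=2,min.node.size=5)
#the "parameters" (the individual split decisions in each tree) are then learned from the data
rf$prediction.error #out-of-bag misclassification error of the fitted forest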
In part four of the practical we will adopt a point process modeling approach to understanding and making predictions on these same data.
Part One: creating learners in the MLR (“Machine Learning in R”) package
The MLR package provides a suite of tools to explore, create and test modelling strategies within a common machine learning framework.
A key concept in MLR, and in machine learning in general, is the “Learner”. Learners are the objects that generate information (i.e. they “learn”) about the data provided. The data themselves are generally formulated as “tasks”. So the basic logic is that we “learn” how to carry out our “task” and then use this experience for optimal prediction. If that all sounds a little inaccessible, it may help to simply think of Learners as models (as in classical statistics, e.g. regression models) and Tasks as data sets. Typically, creating a task involves choosing a data set and telling R which column holds the species observations.
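If it helps to see this in miniature, here is a toy sketch (assuming the mlr package, installed below, and the rpart package, which ships with R) pairing a learner with a task:
library(mlr)
toyTask<-makeClassifTask(data=iris,target="Species") #the "task": a data set plus a target column
toyLrn<-makeLearner("classif.rpart",predict.type="prob") #the "learner": a model recipe
toyMod<-train(toyLrn,toyTask) #the learner "learns" the task
head(predict(toyMod,task=toyTask)$data) #...and can then be used for prediction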
Okay, let’s set up our first task. First we will get the same data as we used last week. As you will only be working with data extracted from inside R packages or downloaded from online sources it is not critical where you set your working directory. However, if you intend to save any of your outputs from this practical then create a folder/directory for the work you are about to do. Remember, you can find the code for saving spatial outputs to file at the end of Practical One of this course.
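As a quick reminder (a sketch only - substitute your own file names, and note that the env and bradypus objects are created further down), the terra functions writeRaster() and writeVector() can be used to save your outputs:
library(terra) #terra provides the writeRaster()/writeVector() functions
writeRaster(env,"env_layers.tif",overwrite=TRUE) #save a (multi-layer) SpatRaster as a GeoTIFF
writeVector(bradypus,"bradypus_points.gpkg",overwrite=TRUE) #save a SpatVector as a GeoPackage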
#Code to tell R to only use the terra installation of PROJ (library for performing projections)
plib<-Sys.getenv("PROJ_LIB")
prj<-system.file("proj",package="terra")[1]
Sys.setenv("PROJ_LIB"=prj)
#Load packages
#Note, regarding the spatstat and maptools packages: if you receive a message stating that a binary version is available but the source version is more recent, make sure you select "yes" to compile the package from source.
install.packages(c("dismo","terra","spatstat","adehabitatHR","ranger","cowplot"))
# "maptools is a useful package for converting data formats when using Spatstat. Sadly it is deprecated on CRAN but we can load a stable version from here:
install.packages("maptools", repos = "https://packagemanager.posit.co/cran/2023-10-13")
#We also need to load the "MLR" package for this practical
install.packages("mlr")
#in addition we will use machine learning algorithms from the "kernlab" package
install.packages("kernlab") #install necessary package
Next we recreate the data from Week 3. The only difference in how we create the data for analysis is that this week we will trim down the study area. Both machine learning (due to the number of tests/iterations) and point process modelling (due to modelling complex patterns) can be computationally intensive. We will use the minimum convex polygon approach to select a convex hull around 80% of our Bradypus points and clip the study area to this polygon.
#access the dismo library
library(dismo)
## Loading required package: raster
## Loading required package: sp
files <- list.files(path=paste(system.file(package="dismo"),
'/ex', sep=''), pattern='grd', full.names=TRUE )
files
## [1] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio1.grd"
## [2] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio12.grd"
## [3] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio16.grd"
## [4] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio17.grd"
## [5] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio5.grd"
## [6] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio6.grd"
## [7] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio7.grd"
## [8] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio8.grd"
## [9] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/biome.grd"
library(terra)
## terra 1.7.72
env<-rast(files[1:8])
#use the names() function to label each covariate layer
names(env) <- c("MeanTemp.","AnnualPrecip.","Precip.WettestQuarter","PrecipDriestQuarter","Max.Temp.","Min.Temp.","Temp.Range","MeanTemp.WettestQuarter")
plot(env)
SAproj<-"epsg:31972" # we want projected coordinate in order to calculate distance accurately so we use the EPSG code for South America
env<-project(env,y=SAproj,res=10000) # project raster and set an appropriate resolution
#And now plot:
# first layer of the SpatRaster
plot(env, 1)
bradypus <- gbif("Bradypus","variegatus",sp=TRUE) #use gbif() to download the data
## 6342 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6342 records downloaded
#convert from an sp library (old) object to terra (newer/faster library) object
bradypus<-vect(bradypus)
#set the crs
crs(bradypus)<-"epsg:4326"
bradypus<-project(bradypus,crs(env))
plot(env$MeanTemp.)
plot(bradypus,add=TRUE,col ="red")
library(adehabitatHR) #library for mcp() function to get a minimum convex polygon around a certain number of points
## Loading required package: ade4
## Loading required package: adehabitatMA
## Registered S3 methods overwritten by 'adehabitatMA':
## method from
## print.SpatialPixelsDataFrame sp
## print.SpatialPixels sp
##
## Attaching package: 'adehabitatMA'
## The following object is masked from 'package:terra':
##
## buffer
## The following object is masked from 'package:raster':
##
## buffer
## Loading required package: adehabitatLT
## Loading required package: CircStats
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:terra':
##
## area
## The following objects are masked from 'package:raster':
##
## area, select
## Loading required package: boot
bradMCP<-mcp(as(bradypus,"Spatial"),percent=80) #create minimum convex polygon around the 80% of points closest to the centroid (central point) of the presence data
plot(env$MeanTemp.) #plot
plot(bradMCP,add=TRUE)
plot(bradypus,add=TRUE)
bradSP<-crop(bradypus,bradMCP) # crop points to the mcp
#inspect
plot(env$MeanTemp.)
plot(bradSP,add=TRUE)
plot(bradMCP,add=T)
###Crop the env object to this new polygon considering 80% of the point data
env<-crop(env,bradMCP)#crop
env<-mask(env,vect(bradMCP))#mask
plot(env)#plot
set.seed(11)
#create 2000 random background points by sampling cells of the 'env' object (xy=TRUE returns the cell coordinates and na.rm=TRUE avoids cells with missing covariate values)
back <- spatSample(env,size=2000,xy=TRUE,na.rm=TRUE)
head(back)
## x y MeanTemp. AnnualPrecip. Precip.WettestQuarter
## 1 596831.2 -73128.55 261.2404 2869.9595 950.9012
## 2 636831.2 1236871.45 227.4852 2018.1752 968.0748
## 3 1126831.2 1166871.45 236.1839 843.6674 334.4229
## 4 936831.2 1146871.45 260.1938 1045.0660 441.1959
## 5 1046831.2 946871.45 269.8413 1568.1650 733.9901
## 6 536831.2 866871.45 271.6064 3758.7358 1394.8872
## PrecipDriestQuarter Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
## 1 419.62311 316.8446 206.2404 110.8021 254.9481
## 2 95.62096 287.2628 162.4731 124.7897 225.8656
## 3 51.63866 298.5118 170.2902 128.1907 236.3653
## 4 56.49178 309.5564 204.3375 105.2292 261.7309
## 5 47.27510 338.3295 206.8261 131.3195 266.8843
## 6 268.92087 333.1964 217.2537 115.9427 267.6374
#background (pseudo-absence) values (drop cols 1 and 2, which are coordinates)
eA<-back[,3:ncol(back)]
#presence values
eP<-extract(env,bradSP)
#create data frames from values
Pres.cov<-data.frame(eP,Pres=1)
Back.cov<-data.frame(eA,Pres=0)
head(Back.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 261.2404 2869.9595 950.9012 419.62311 316.8446
## 2 227.4852 2018.1752 968.0748 95.62096 287.2628
## 3 236.1839 843.6674 334.4229 51.63866 298.5118
## 4 260.1938 1045.0660 441.1959 56.49178 309.5564
## 5 269.8413 1568.1650 733.9901 47.27510 338.3295
## 6 271.6064 3758.7358 1394.8872 268.92087 333.1964
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 206.2404 110.8021 254.9481 0
## 2 162.4731 124.7897 225.8656 0
## 3 170.2902 128.1907 236.3653 0
## 4 204.3375 105.2292 261.7309 0
## 5 206.8261 131.3195 266.8843 0
## 6 217.2537 115.9427 267.6374 0
head(Pres.cov)
## ID MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter
## 1 1 242.7934 860.167 531.1019 37.79567
## 2 2 243.5608 2908.904 1178.0919 202.05040
## 3 3 243.7697 3312.001 1443.1741 144.62831
## 4 4 231.2706 3314.167 1235.5300 302.25983
## 5 5 221.3931 1139.519 490.4107 94.57846
## 6 6 243.5608 2908.904 1178.0919 202.05040
## Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 296.5367 194.3747 102.1593 249.7906 1
## 2 310.3405 184.6754 125.2382 242.2694 1
## 3 315.8277 178.8665 136.1665 240.3386 1
## 4 292.5822 175.1340 117.3696 230.7110 1
## 5 283.7167 152.7786 131.2737 224.1626 1
## 6 310.3405 184.6754 125.2382 242.2694 1
#combine both data sets using the rbind() function (binds data sets with the same columns by stacking one on top of the other - i.e. joining rows together)
all.cov<-rbind(Pres.cov[,2:ncol(Pres.cov)],Back.cov)
#inspect
head(all.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 242.7934 860.167 531.1019 37.79567 296.5367
## 2 243.5608 2908.904 1178.0919 202.05040 310.3405
## 3 243.7697 3312.001 1443.1741 144.62831 315.8277
## 4 231.2706 3314.167 1235.5300 302.25983 292.5822
## 5 221.3931 1139.519 490.4107 94.57846 283.7167
## 6 243.5608 2908.904 1178.0919 202.05040 310.3405
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 194.3747 102.1593 249.7906 1
## 2 184.6754 125.2382 242.2694 1
## 3 178.8665 136.1665 240.3386 1
## 4 175.1340 117.3696 230.7110 1
## 5 152.7786 131.2737 224.1626 1
## 6 184.6754 125.2382 242.2694 1
Now we can use this data frame to create a task for our learners to carry out. As stated at the start of this practical, the spatial location of points is critical to our analysis and will be required at every step of model development, testing and tuning. So, first let’s create an object for the coordinates of our presence and background data for use in the analysis.
### create task for use with mlr classification algorithms
coordsBack<-back[,1:2]
coordsPres<-crds(bradSP)
coords<-rbind(coordsPres,coordsBack)
coords<-data.frame(coords)
colnames(coords)<-c("x","y")
all.cov<-cbind(all.cov,coords)
all.cov<-subset(all.cov,!is.na(all.cov$MeanTemp.))#remove any NA values
#Now we create our task, passing the data frame, specifying the species presence/absence column (the "target") and telling R that 1 = present (i.e. a positive result)
library(mlr)
## Loading required package: ParamHelpers
##
## Attaching package: 'ParamHelpers'
## The following object is masked from 'package:raster':
##
## getValues
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
##
## Attaching package: 'mlr'
## The following object is masked from 'package:adehabitatLT':
##
## subsample
## The following object is masked from 'package:terra':
##
## resample
## The following object is masked from 'package:raster':
##
## resample
#For the makeClassifTask function to work, our target variable needs to be categorical (a "factor" in R) so let's tidy that up first
all.cov$Pres<-factor(all.cov$Pres)
task = makeClassifTask(data = all.cov[,1:9], target = "Pres",
positive = "1", coordinates = all.cov[,10:11])
Now that we have the task set up, we need to choose some learners that will help us carry it out. If we already know which learner we prefer then we can use the makeLearner() function to specify our algorithm. If we don’t, the mlr package has a really useful function, listLearners(), to which you can pass your existing task and get a list of machine learning algorithms that apply.
The listLearners() function returns A LOT of information, most of which we do not need, so we will not call it in full here (see the sketch below for how you might). Instead we will go straight to creating our learners, which for today’s practical are: 1. classif.binomial (an implementation of logistic regression), 2. classif.ranger (a fast implementation of random forest) and 3. classif.ksvm (a support vector machine algorithm developed by the kernlab group). We looked at binomial regression and random forest previously. Support Vector Machines (SVMs) are classifiers that attempt to create the largest possible margin between samples (such as groups of presence/absence points), maximising the distance between the groups whilst reducing the number of misclassifications, thereby helping the classifier to make decisions about group membership. See here for an article giving a nice simple overview of how SVMs work: https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496
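If you do want to browse the candidates, a call along the following lines keeps the output manageable (a sketch; the data frame returned also contains many other columns):
lrns<-listLearners(task,warn.missing.packages=FALSE) #learners applicable to our task
head(lrns[,c("class","package")]) #just the algorithm names and their source packages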
The next step is to create our individual learners that will do the work of solving our task. Let’s start with the most familiar type - binomial regression:
################ Binomial (logistic regression)
lrnBinomial = makeLearner("classif.binomial",
predict.type = "prob",
fix.factors.prediction = TRUE)
In the above code we first tell the makeLearner() function the type of classifier we want to work with, then specify the type of prediction we want to make (probability, rather than predicted classes, i.e. 1 or 0) and set fix.factors.prediction = TRUE so that factor levels are handled consistently when the learner is later used for prediction.
Next we do the same for our Random Forest and Ksvm learners:
##########Random Forest
lrnRF = makeLearner("classif.ranger",
predict.type = "prob",
fix.factors.prediction = TRUE)
###### Support Vector Machine
# kernel = "rbfdot" = "Radial Basis kernel function" which is basically a Gaussian kernel i.e. one that simulates a normal distribution. The algorithm uses the kernel type to decide how to differentiate groups (you can think of this as a means to estimate or weight distances between group 1 - presence and group 0 - absence).
lrn_ksvm = makeLearner("classif.ksvm",
predict.type = "prob",
kernel = "rbfdot",
fix.factors.prediction = TRUE)
Part Two: model evaluation
Now that we have both our task and learners ready we can apply each learner to the task and evaluate their performance.
To do this we have to set up a sampling scheme for cross validation. This is conceptually identical to the k-fold validation we did in Week 3 but here we don’t have to go through the trouble of writing our own for loops - we can use the functionality within the mlr package to do this for us. First we will create a k-fold resampling strategy that DOES NOT take spatial location into account.
### Create resampling scheme (performance level) for cross-validation.
##This code creates 5 folds in the data, repeats the whole k-fold cross-validation process 5 times (so 25 models in total) and aggregates (averages) the AUC statistic.
perf_levelCV = makeResampleDesc(method = "RepCV", predict = "test", folds = 5, reps = 5)
In addition we can easily set up the same cross validation strategy but with spatially disjoint folds. All we have to do is change “RepCV” to “SpRepCV” like so:
perf_level_spCV = makeResampleDesc(method = "SpRepCV", folds = 5, reps = 5) #sampling strategy to run five fold re-sampling five times
Now we can compare, for each of our learners, their performance based on conventional k-fold validation and one where spatial location is considered (i.e. when test data are made to be geographically far from training data).
#Binomial conventional cross validation (K fold)
cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.8301529
## [Resample] iter 2: 0.8490734
## [Resample] iter 3: 0.8229394
## [Resample] iter 4: 0.8116275
## [Resample] iter 5: 0.8265455
## [Resample] iter 6: 0.8207694
## [Resample] iter 7: 0.8384071
## [Resample] iter 8: 0.8065346
## [Resample] iter 9: 0.8449821
## [Resample] iter 10: 0.8291591
## [Resample] iter 11: 0.8208739
## [Resample] iter 12: 0.8305474
## [Resample] iter 13: 0.8325647
## [Resample] iter 14: 0.8254371
## [Resample] iter 15: 0.8282495
## [Resample] iter 16: 0.8329027
## [Resample] iter 17: 0.8190710
## [Resample] iter 18: 0.8298184
## [Resample] iter 19: 0.8339114
## [Resample] iter 20: 0.8205537
## [Resample] iter 21: 0.8139433
## [Resample] iter 22: 0.8322877
## [Resample] iter 23: 0.8207495
## [Resample] iter 24: 0.8316138
## [Resample] iter 25: 0.8413222
##
## Aggregated Result: auc.test.mean=0.8277615
##
We can access and plot the partitioning performed by the mlr package (here the folds were chosen at random, so training and test points are interspersed in space). This information is good to have if we want to replicate the validation process later on.
plots = createSpatialResamplingPlots(task,resample=cvBinomial,
crs=crs(env),datum=crs(env),color.test = "red",point.size = 1)
library(cowplot)
#use the cowplot function to plot all folds out in a grid
cowplot::plot_grid(plotlist = plots[["Plots"]], ncol = 3, nrow = 2,
labels = plots[["Labels"]])
##Binomial spatial cross validation
sp_cvBinomial = resample(learner = lrnBinomial, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.5028453
## [Resample] iter 2: 0.5711554
## [Resample] iter 3: 0.7205928
## [Resample] iter 4: 0.0788633
## [Resample] iter 5: 0.4091305
## [Resample] iter 6: 0.8325269
## [Resample] iter 7: 0.3851623
## [Resample] iter 8: 0.3286315
## [Resample] iter 9: 0.6020144
## [Resample] iter 10: 0.5646577
## [Resample] iter 11: 0.7205928
## [Resample] iter 12: 0.5028453
## [Resample] iter 13: 0.5711554
## [Resample] iter 14: 0.4091305
## [Resample] iter 15: 0.0788633
## [Resample] iter 16: 0.8329681
## [Resample] iter 17: 0.4258841
## [Resample] iter 18: 0.3756532
## [Resample] iter 19: 0.3473831
## [Resample] iter 20: 0.5646577
## [Resample] iter 21: 0.5646577
## [Resample] iter 22: 0.4258841
## [Resample] iter 23: 0.3473831
## [Resample] iter 24: 0.3756532
## [Resample] iter 25: 0.8329681
##
## Aggregated Result: auc.test.mean=0.4948504
##
#make partition plots
plotsSP = createSpatialResamplingPlots(task,resample=sp_cvBinomial,
crs=crs(env),datum=crs(env),color.test = "red",point.size = 1)
#use the cowplot function to plot all folds out in a grid
cowplot::plot_grid(plotlist = plotsSP[["Plots"]], ncol = 3, nrow = 2,
labels = plotsSP[["Labels"]])
You should see that the result (aggregated AUC) for the spatial cross-validation is noticeably lower than for the conventional cross-validation. This strongly suggests that the latter gives an over-optimistic assessment and that we should be cautious with non-spatial approaches to model evaluation.
Now let’s look at our Random Forest learner.
###### Random Forest evaluation
##random sampling cross-validation
cvRF = resample(learner = lrnRF, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.9813761
## [Resample] iter 2: 0.9787655
## [Resample] iter 3: 0.9785546
## [Resample] iter 4: 0.9850311
## [Resample] iter 5: 0.9815155
## [Resample] iter 6: 0.9821401
## [Resample] iter 7: 0.9829890
## [Resample] iter 8: 0.9719107
## [Resample] iter 9: 0.9807421
## [Resample] iter 10: 0.9819755
## [Resample] iter 11: 0.9823031
## [Resample] iter 12: 0.9802763
## [Resample] iter 13: 0.9795040
## [Resample] iter 14: 0.9797405
## [Resample] iter 15: 0.9787939
## [Resample] iter 16: 0.9805372
## [Resample] iter 17: 0.9861044
## [Resample] iter 18: 0.9770646
## [Resample] iter 19: 0.9772007
## [Resample] iter 20: 0.9783537
## [Resample] iter 21: 0.9795199
## [Resample] iter 22: 0.9733039
## [Resample] iter 23: 0.9857602
## [Resample] iter 24: 0.9813368
## [Resample] iter 25: 0.9818703
##
## Aggregated Result: auc.test.mean=0.9802668
##
#spatial partitioning cross-validation
sp_cvRF = resample(learner = lrnRF, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.4548440
## [Resample] iter 2: 0.6067284
## [Resample] iter 3: 0.5147399
## [Resample] iter 4: 0.6223749
## [Resample] iter 5: 0.4307723
## [Resample] iter 6: 0.6322529
## [Resample] iter 7: 0.5257214
## [Resample] iter 8: 0.6072639
## [Resample] iter 9: 0.5433091
## [Resample] iter 10: 0.5288575
## [Resample] iter 11: 0.5263679
## [Resample] iter 12: 0.4404957
## [Resample] iter 13: 0.5352940
## [Resample] iter 14: 0.5986165
## [Resample] iter 15: 0.5161097
## [Resample] iter 16: 0.5117341
## [Resample] iter 17: 0.5411851
## [Resample] iter 18: 0.4338626
## [Resample] iter 19: 0.6058483
## [Resample] iter 20: 0.5442464
## [Resample] iter 21: 0.5850094
## [Resample] iter 22: 0.5834310
## [Resample] iter 23: 0.2969607
## [Resample] iter 24: 0.5953777
## [Resample] iter 25: 0.5861225
##
## Aggregated Result: auc.test.mean=0.5347010
##
plotsSP = createSpatialResamplingPlots(task,resample=sp_cvRF,
crs=crs(env),datum=crs(env),color.test = "red",point.size = 1)
cowplot::plot_grid(plotlist = plotsSP[["Plots"]], ncol = 3, nrow = 2,
labels = plotsSP[["Labels"]])
You will notice that, although random forest out-performs the binomial regression learner when assessed from a conventional cross-validation point of view, this algorithm suffers heavily when spatial dependence is taken into account, and the binomial learner achieves a comparable AUC. However, random forest has a set of hyperparameters that can easily be tuned to enhance model performance, a functionality not present with binomial learners (which are based on statistical methods rather than on machine learning algorithms). We will see later in the practical whether model tuning changes the relative standing of these two learners. First though, let’s perform the same evaluation on our support vector machine learner for comparison.
########### Support vector machine evaluation
cvksvm = resample(learner = lrn_ksvm, task =task,
resampling = perf_levelCV,
measures = mlr::auc)
## Resampling: repeated cross-validation
## Measures: auc
## [Resample] iter 1: 0.9315302
## [Resample] iter 2: 0.9074731
## [Resample] iter 3: 0.9060738
## [Resample] iter 4: 0.9175863
## [Resample] iter 5: 0.9130538
## [Resample] iter 6: 0.9140758
## [Resample] iter 7: 0.9170467
## [Resample] iter 8: 0.9223604
## [Resample] iter 9: 0.9203246
## [Resample] iter 10: 0.9076729
## [Resample] iter 11: 0.9254400
## [Resample] iter 12: 0.9163057
## [Resample] iter 13: 0.9318815
## [Resample] iter 14: 0.9008502
## [Resample] iter 15: 0.9061469
## [Resample] iter 16: 0.9068799
## [Resample] iter 17: 0.9171021
## [Resample] iter 18: 0.9273444
## [Resample] iter 19: 0.9191297
## [Resample] iter 20: 0.9190030
## [Resample] iter 21: 0.9082492
## [Resample] iter 22: 0.9195705
## [Resample] iter 23: 0.9096856
## [Resample] iter 24: 0.9192094
## [Resample] iter 25: 0.9285573
##
## Aggregated Result: auc.test.mean=0.9165021
##
sp_cvksvm = resample(learner = lrn_ksvm, task =task,
resampling = perf_level_spCV,
measures = mlr::auc)
## Resampling: repeated spatial cross-validation
## Measures: auc
## [Resample] iter 1: 0.5720567
## [Resample] iter 2: 0.7259129
## [Resample] iter 3: 0.6189466
## [Resample] iter 4: 0.2537036
## [Resample] iter 5: 0.2253695
## [Resample] iter 6: 0.6132446
## [Resample] iter 7: 0.6146761
## [Resample] iter 8: 0.6062001
## [Resample] iter 9: 0.4357252
## [Resample] iter 10: 0.2872995
## [Resample] iter 11: 0.2068053
## [Resample] iter 12: 0.2567700
## [Resample] iter 13: 0.6099382
## [Resample] iter 14: 0.7250189
## [Resample] iter 15: 0.5764467
## [Resample] iter 16: 0.5438511
## [Resample] iter 17: 0.5204003
## [Resample] iter 18: 0.5338155
## [Resample] iter 19: 0.4288651
## [Resample] iter 20: 0.2415945
## [Resample] iter 21: 0.6135836
## [Resample] iter 22: 0.5763972
## [Resample] iter 23: 0.2118227
## [Resample] iter 24: 0.2504273
## [Resample] iter 25: 0.7165830
##
## Aggregated Result: auc.test.mean=0.4786182
##
So, all of our learners suffer considerably when switching from a conventional to a spatially-oriented method of performance evaluation, especially the support vector machine. Support vector machines are, however, readily improved through model tuning - a particular strength of this kind of learner. So the winner in this contest is far from decided - not before we have taken a closer look at model tuning options for these different learners.
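Before we move on, it can help to tabulate the aggregated AUC scores for a side-by-side comparison - each resample result stores its aggregated measure in the $aggr slot:
data.frame(learner=c("binomial","randomForest","ksvm"),
conventionalCV=c(cvBinomial$aggr,cvRF$aggr,cvksvm$aggr),
spatialCV=c(sp_cvBinomial$aggr,sp_cvRF$aggr,sp_cvksvm$aggr))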
Part Three: Model tuning
For the tuning of hyperparameters we follow a similar process to the one used when setting up the cross-validation resampling, this time separating the data into five spatially disjoint folds (without repeats, to keep run times manageable). We could (and should) try many more folds and repeats than this, but that would become increasingly time consuming.
We first need to look at each individual learner to check which hyperparameters are available for tuning. These can be accessed using the getParamSet() function. Let’s do this for the random forest learner first.
getParamSet(lrnRF)
## Type len Def
## num.trees integer - 500
## mtry integer - -
## min.node.size integer - -
## replace logical - TRUE
## sample.fraction numeric - -
## split.select.weights numericvector <NA> -
## always.split.variables untyped - -
## respect.unordered.factors discrete - ignore
## importance discrete - none
## write.forest logical - TRUE
## scale.permutation.importance logical - FALSE
## num.threads integer - -
## save.memory logical - FALSE
## verbose logical - TRUE
## seed integer - -
## splitrule discrete - gini
## num.random.splits integer - 1
## keep.inbag logical - FALSE
## Constr Req Tunable Trafo
## num.trees 1 to Inf - TRUE -
## mtry 1 to Inf - TRUE -
## min.node.size 1 to Inf - TRUE -
## replace - - TRUE -
## sample.fraction 0 to 1 - TRUE -
## split.select.weights 0 to 1 - TRUE -
## always.split.variables - - TRUE -
## respect.unordered.factors ignore,order,partition - TRUE -
## importance none,impurity,permutation - FALSE -
## write.forest - - FALSE -
## scale.permutation.importance - Y FALSE -
## num.threads 1 to Inf - FALSE -
## save.memory - - FALSE -
## verbose - - FALSE -
## seed -Inf to Inf - FALSE -
## splitrule gini,extratrees - TRUE -
## num.random.splits 1 to Inf Y TRUE -
## keep.inbag - - FALSE -
For random forests, the hyperparameters known to most strongly affect model performance are “mtry” and the minimum node size (“min.node.size” in ranger). For these hyperparameters we can create a range of values to pass to the tuneParams() function for tuning our model. We create these hyperparameter ranges using the makeParamSet() function. For mtry the range of options is bounded by the number of predictor variables in our data frame (eight), and a sensible range for node size starts from 1, with 10 being a common upper limit in classification problems (larger values reduce computing time but also precision).
paramsRF <- makeParamSet(
makeIntegerParam("mtry",lower = 1,upper = 8),
makeIntegerParam("min.node.size",lower = 1,upper = 10)
)
Now we have the parameter set ready to go we use the tuneParams() function to get the result of our model tuning. Before we do that though, let’s break down exactly what is happening in the model tuning process. The code below is basically a list of information given to the tuneParams() function. We provide the name of the learner “lrnRF”, the task for the learner, the spatially-oriented resampling method (we use the same one as in our model evaluation step above), the range of parameter values to select from “paramsRF”, the number of random combinations of parameter values to use (“ctrl”) and the measure we wish to return “auc”.
The objects we pass to the tuneParams() function will ensure that, for five spatially disjoint folds (subsets) of the data, 10 random combinations of hyperparameters will be used for model tuning, with the optimum values returned (i.e. those that maximize the AUC score). This means that a total of 50 models (5 folds x 10 hyperparameter combinations) will be tested, and each time the training and testing data will be spatially partitioned, such that our tuning is based on maximizing the ability of our model to predict to new spatial locations. In reality we might run many more models, but in the interests of time we will proceed with 10 combinations of hyperparameters for tuning.
Please be patient as the following model tuning step can take a while.
# specifying random parameter value search
tune_level = makeResampleDesc(method = "SpCV", iters = 5)
ctrl = makeTuneControlRandom(maxit = 10)
tuned_RF = tuneParams(learner = lrnRF,
task = task,
resampling = tune_level,
measures = mlr::auc,
par.set = paramsRF,
control = ctrl,
show.info = TRUE)
## [Tune] Started tuning learner classif.ranger for parameter set:
## Type len Def Constr Req Tunable Trafo
## mtry integer - - 1 to 8 - TRUE -
## min.node.size integer - - 1 to 10 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: mtry=1; min.node.size=1
## [Tune-y] 1: auc.test.mean=0.5508521; time: 0.2 min
## [Tune-x] 2: mtry=3; min.node.size=3
## [Tune-y] 2: auc.test.mean=0.5672918; time: 0.4 min
## [Tune-x] 3: mtry=4; min.node.size=9
## [Tune-y] 3: auc.test.mean=0.5851705; time: 0.5 min
## [Tune-x] 4: mtry=2; min.node.size=6
## [Tune-y] 4: auc.test.mean=0.5808736; time: 0.3 min
## [Tune-x] 5: mtry=5; min.node.size=4
## [Tune-y] 5: auc.test.mean=0.5682975; time: 0.6 min
## [Tune-x] 6: mtry=8; min.node.size=2
## [Tune-y] 6: auc.test.mean=0.5639593; time: 0.8 min
## [Tune-x] 7: mtry=8; min.node.size=2
## [Tune-y] 7: auc.test.mean=0.5570767; time: 0.8 min
## [Tune-x] 8: mtry=6; min.node.size=9
## [Tune-y] 8: auc.test.mean=0.5744483; time: 0.6 min
## [Tune-x] 9: mtry=2; min.node.size=9
## [Tune-y] 9: auc.test.mean=0.5678892; time: 0.3 min
## [Tune-x] 10: mtry=5; min.node.size=10
## [Tune-y] 10: auc.test.mean=0.5951370; time: 0.5 min
## [Tune] Result: mtry=5; min.node.size=10 : auc.test.mean=0.5951370
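The best combination here was mtry=5 with min.node.size=10. As a minimal sketch (mlr stores the optimal hyperparameters in the $x slot of the tuning result), these values can be applied back to the learner with setHyperPars():
lrnRF_tuned<-setHyperPars(lrnRF,par.vals=tuned_RF$x) #learner with the tuned values fixed
#re-running the spatial cross-validation with the tuned learner would then look like:
#resample(learner=lrnRF_tuned,task=task,resampling=perf_level_spCV,measures=mlr::auc)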
We can see that the random forest in this case has responded a little to the model tuning we have carried out. We can follow the same process for the support vector machine algorithm. Here we take some guidance from the literature to set the values for the algorithm parameter “C” (this determines the emphasis placed on finding a large margin around the classification boundary - i.e. the distance between the 1/0 groups - versus avoiding unwanted misclassifications) and, for “sigma” (the kernel shape determining the influence of cases close to the classification boundary), we automate the selection using a function from the kernlab library.
Once we have the optimal model parameters we simply need to create a ksvm model (this works in much the same way as the glm() models we have fitted previously). We pass the results from our model tuning step straight to the model formula (we can access the tuning results by position, as in the code below, or more readably by name with the ‘$’ symbol - see the sketch that follows it). This step can take a few minutes to run so please be patient.
# spatial partitioning
#support vector machine model tuning. Lower and upper limits are set to 0.1 and 10 for "C" and for "sigma" the range is set using the sigest() function in the kernlab library. This function looks at a potential model based on our data and returns an estimated range of sigma for good model performance.
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:CircStats':
##
## rvm
## The following object is masked from 'package:adehabitatMA':
##
## buffer
## The following objects are masked from 'package:terra':
##
## buffer, size
## The following objects are masked from 'package:raster':
##
## buffer, rotated
sigest(Pres~.,data=all.cov[,1:9])
## 90% 50% 10%
## 0.03047447 0.09297329 0.35495406
# specifying random parameter value search
paramsSVM = makeParamSet(
makeNumericParam("C", lower = 0.1, upper = 10),
makeNumericParam("sigma", lower = 0.03, upper = 0.355)
)
tuned_SVM = tuneParams(learner = lrn_ksvm,
task = task,
resampling = tune_level,
measures = mlr::auc,
par.set = paramsSVM,
control = ctrl,
show.info = TRUE)
## [Tune] Started tuning learner classif.ksvm for parameter set:
## Type len Def Constr Req Tunable Trafo
## C numeric - - 0.1 to 10 - TRUE -
## sigma numeric - - 0.03 to 0.355 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: C=3.27; sigma=0.0716
## [Tune-y] 1: auc.test.mean=0.4928474; time: 0.6 min
## [Tune-x] 2: C=9.6; sigma=0.284
## [Tune-y] 2: auc.test.mean=0.5455737; time: 0.4 min
## [Tune-x] 3: C=1.52; sigma=0.0785
## [Tune-y] 3: auc.test.mean=0.5074506; time: 0.6 min
## [Tune-x] 4: C=6.62; sigma=0.0748
## [Tune-y] 4: auc.test.mean=0.4844498; time: 0.5 min
## [Tune-x] 5: C=0.646; sigma=0.227
## [Tune-y] 5: auc.test.mean=0.5131040; time: 0.5 min
## [Tune-x] 6: C=0.831; sigma=0.145
## [Tune-y] 6: auc.test.mean=0.4883957; time: 0.5 min
## [Tune-x] 7: C=5.79; sigma=0.343
## [Tune-y] 7: auc.test.mean=0.5565103; time: 0.4 min
## [Tune-x] 8: C=5.76; sigma=0.0312
## [Tune-y] 8: auc.test.mean=0.5039993; time: 0.7 min
## [Tune-x] 9: C=1.11; sigma=0.0394
## [Tune-y] 9: auc.test.mean=0.5274719; time: 0.4 min
## [Tune-x] 10: C=9.02; sigma=0.33
## [Tune-y] 10: auc.test.mean=0.5695327; time: 0.2 min
## [Tune] Result: C=9.02; sigma=0.33 : auc.test.mean=0.5695327
#model with tuning information and set to allow probability predictions ("prob.model =TRUE"). "C-svc" is the default solver for predictions with binary outcomes in ksvm models and supports probability estimates so we use it here.
#build the model
#tuned_SVM[[3]][2] accesses the slot in the results for sigma
#tuned_SVM[[3]][1] accesses the slot in the results for C
modelSig<- as.numeric(tuned_SVM[[3]][2])
modelC<- as.numeric(tuned_SVM[[3]][1])
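Equivalently, and more readably, the same values can be accessed by name - the tuning result stores the optimal hyperparameters in its $x slot:
modelSig<-tuned_SVM$x$sigma
modelC<-tuned_SVM$x$C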
We already know how to make mapped predictions with a random forest model from Week 3. For an SVM we would proceed as follows:
#build the model
svmMOD<-ksvm(Pres~.,data=all.cov[,1:9], type="C-svc",
kernel=rbfdot,kpar=list(sigma=modelSig),
C=modelC,prob.model=TRUE)
predKSVM<-predict(env,svmMOD,type="prob",index=2,na.rm=TRUE)
plot(predKSVM)
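For comparison, a mapped prediction for the tuned random forest could be sketched as follows (a sketch only: it assumes the ranger package, uses the tuned values stored in tuned_RF$x, and takes the probability column named "1" as the presence probability):
library(ranger)
#fit a probability forest with the tuned hyperparameter values
rfMOD<-ranger(Pres~.,data=all.cov[,1:9],probability=TRUE,
mtry=tuned_RF$x$mtry,min.node.size=tuned_RF$x$min.node.size)
#terra's predict() passes blocks of cell values to 'fun'; we return the presence probability
predRF<-predict(env,rfMOD,na.rm=TRUE,
fun=function(model,data,...) predict(model,data=data)$predictions[,"1"])
plot(predRF)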
So, we can see that with only a small amount of model tuning (in reality we might run hundreds or even thousands of models given more time and computing power) we have been able to considerably improve the support vector machine algorithm. At this point both the random forest and the SVM are giving similar performance results, with the random forest slightly better, so we would favour that model. However, with more time we might continue the SVM tuning by specifying a greater number of models and parameter combinations, and would likely see further improvement.
Now we have made some headway in terms of improving our model prediction through tuning based on spatial partitioning of the data. However, our AUC statistic is still not ideal, and we have no real knowledge of the relationship between individual variables and species occurrence, much less of the nature of point interactions. Let’s take a look at the options provided by point pattern analysis through the spatstat package to see if we can remedy some of these issues.
Part Four: Point pattern analysis
A key difference between the modeling approaches we have used until now and the point pattern analysis we will do next is the format of the environmental covariates which, in the spatstat package that we will use to carry this out, take the form of im (image) objects. So it is worth first converting our raster covariates to image objects. For this we can use the as.im.RasterLayer() function from the “maptools” library:
# load maptools and convert all environmental covariates to im objects. Note that we need to convert to raster inside the function (the as.im.RasterLayer function only works with raster objects, not terra objects)
library(maptools)
## Please note that 'maptools' will be retired during October 2023,
## plan transition at your earliest convenience (see
## https://r-spatial.org/r/2023/05/15/evolution4.html and earlier blogs
## for guidance);some functionality will be moved to 'sp'.
## Checking rgeos availability: FALSE
##
## Attaching package: 'maptools'
## The following object is masked from 'package:sp':
##
## sp2Mondrian
library(raster)
tempIm<-as.im.RasterLayer(raster(env$MeanTemp.))
precipIm<-as.im.RasterLayer(raster(env$AnnualPrecip.))
maxIm<-as.im.RasterLayer(raster(env$Max.Temp.))
minIm<-as.im.RasterLayer(raster(env$Min.Temp.))
rangeIm<-as.im.RasterLayer(raster(env$Temp.Range))
precipDry<-as.im.RasterLayer(raster(env$PrecipDriestQuarter))
tempWettestIm<-as.im.RasterLayer(raster(env$MeanTemp.WettestQuarter))
Next we call the spatstat library and set up our analysis window (necessary for point pattern analysis) and create our point pattern object.
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: spatstat.geom
## spatstat.geom 3.2-8
##
## Attaching package: 'spatstat.geom'
## The following object is masked from 'package:MASS':
##
## area
## The following object is masked from 'package:ade4':
##
## disc
## The following objects are masked from 'package:terra':
##
## area, delaunay, is.empty, rescale, rotate, shift, where.max,
## where.min
## The following object is masked from 'package:dismo':
##
## domain
## The following objects are masked from 'package:raster':
##
## area, rotate, shift
## Loading required package: spatstat.random
## spatstat.random 3.2-2
## Loading required package: spatstat.explore
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:raster':
##
## getData
## spatstat.explore 3.2-6
##
## Attaching package: 'spatstat.explore'
## The following object is masked from 'package:boot':
##
## envelope
## Loading required package: spatstat.model
## Loading required package: rpart
## spatstat.model 3.2-10
##
## Attaching package: 'spatstat.model'
## The following object is masked from 'package:dismo':
##
## response
## Loading required package: spatstat.linnet
## spatstat.linnet 3.1-3
##
## spatstat 3.0-7
## For an introduction to spatstat, type 'beginner'
window.poly <- as.owin(tempIm)#create study window using the tempIm image object as a template
plot(window.poly)#inspect
bradCoords<-crds(bradSP) #get coordinates from our bradSP object to create the point pattern
pppBrad <- ppp(bradCoords[,1],bradCoords[,2], window = window.poly)#create point pattern 'ppp' object
## Warning: 72 points were rejected as lying outside the specified window
## Warning: data contain duplicated points
plot(pppBrad,add=T) #inspect
## Warning in plot.ppp(pppBrad, add = T): 72 illegal points also plotted
pppBrad<-as.ppp(pppBrad)#use as.ppp to remove points outside the window
#Finally, it's a good idea to rescale our data from metres to kilometres. Not only is this more intuitive when dealing with large distances, it also helps with model convergence.
pppBrad<- rescale(pppBrad, 1000)#rescale from m to km
#rescale image objects to conform with the point pattern
tempIm<-rescale(tempIm,1000)
precipIm<-rescale(precipIm,1000)
#plot the points against covariate
plot(tempIm)
plot(pppBrad,add=TRUE)
Now, as an exploratory step we can test our point pattern for evidence of spatial dependence using Ripley’s K and the more intuitive L function:
K<-Kest(pppBrad,correction = "border") #test for spatial dependence and use the "border" option to account for edge effects (points near the edge are ignored)
plot(K)#inspect the results
#check "linearized" Ripley's K for comparison
L<-Lest(pppBrad,correction = "border")
plot(L)
#generate an envelope based on 39 random simulations of a homogeneous Poisson process (rank = 1 means that we use the minimum and maximum simulated values to create the envelope; global=TRUE ensures that the estimate is made over all distances simultaneously rather than point-wise).
Kcsr<-envelope(pppBrad,Kest,nsim=39,rank=1,global =TRUE)
## Generating 39 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
## 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39.
##
## Done.
plot(Kcsr,shade=c("hi","lo"),legend=T) #check whether our data fall within this envelope.
According to the Kest, Lest and envelope tests our data deviate from complete spatial randomness. We should therefore consider exploring inhomogeneous processes when developing our point process models, which we will do next.
For setting up a point process model, in addition to the study window established previously, we need to create a suitable quadrature scheme. The quadrature points used in point process modelling are analogous to the background (pseudo-absence) points we have used in previous weeks. In the case of point process modeling we generally use many more points than typically used in presence-absence modeling and adopt a uniform grid of points rather than a random distribution. This uniform distribution of quadrature points is important for establishing the “intensity” measurement (predicted points per unit area) that is the outcome of point process modeling.
We use the quadscheme() function from spatstat to create our quadrature points (also referred to as “dummy” points in some of the spatstat functions). The resolution (number) of these points is determined by the ‘nd’ argument which sets the number of grid points in the horizontal and vertical directions.
Below we test several values for nd using a for loop and a simple model using annual temperature and precipitation as predictors, each time checking the model selection criterion AIC (Akaike’s Information Criterion). Lower AIC values suggest a better fit to the data and we use this as a basis to select the number of quadrature points.
ndTry<-seq(100,500,by=100)
for(i in ndTry){
Q.i<-quadscheme(pppBrad,method = "grid",nd=i)
fit.i<-ppm(Q.i~tempIm+precipIm)
print(i)
print(AIC(fit.i))
}
## [1] 100
## [1] 65011.07
## [1] 200
## [1] 65065.91
## [1] 300
## [1] 65039.24
## [1] 400
## [1] 65051.42
## [1] 500
## [1] 65048.94
# select best performing value for nd to set quadrature scheme
Q<-quadscheme(pppBrad,method = "grid",nd=100)
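It is worth a quick inspection of the scheme before modelling - printing a quadrature object reports the numbers of data and dummy points, and the scheme can also be plotted:
Q #reports the numbers of data points and dummy points in the scheme
plot(Q,main="Quadrature scheme") #data points plus the regular grid of dummy points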
Before we start our modelling in earnest, we can take advantage of some exploratory techniques in the spatstat library, in particular the rhohat() function, which allows us to plot values of individual covariates against predicted intensity values for our point pattern.
#plot increasing values for each environmental covariate against intensity (~ density) of the point pattern
plot(rhohat(pppBrad,tempIm)) # annual temperature
plot(rhohat(pppBrad,precipIm)) # annual precipitation
plot(rhohat(pppBrad,"x")) #x coordinate
plot(rhohat(pppBrad,"y")) #y coordinate
The rhohat() function suggests we have non-linear associations between our environmental covariates and the intensity of points. For temperature and precipitation in particular this seems to roughly follow a quadratic trend. We can use this insight when fitting our model: the poly() function lets us model these variables as quadratic curves. Below we fit quadratic terms for the tempIm and precipIm objects, plus linear terms for the x and y coordinates, inside our model formula to reflect these trends. The ppm() function is used to create a point process model. In practice we would experiment with many more combinations of predictor variables and test their suitability through comparison of the model AIC statistics. This can be quite time consuming (and tedious), so for the sake of this class we will just consider the key variables of annual temperature and precipitation.
firstPPMod<-ppm(Q~poly(precipIm,2)+poly(tempIm,2)+x+y) #point process model with quadratic terms for temperature and precipitation plus linear terms for the projected coordinates
AIC(firstPPMod) # get AIC
## [1] 53104.88
As we discovered in our earlier exploratory analysis, we have reason to believe that our data do not come from a distribution with complete spatial randomness and we may want to consider accounting for spatial dependence between points in our modeling.
This first ppm model is basically a standard Poisson regression model and does not account for point interactions. It therefore assumes spatial randomness in the data, as discussed in the lecture. We can test whether our model conforms to spatial randomness (i.e. whether the data could come from a random Poisson process) using the envelope() function, which can also accept fitted point process models as input.
firstModEnv<-envelope(firstPPMod,Kest,nsim=39,rank=1,global =TRUE)
## Generating 78 simulated realisations of fitted Poisson model (39 to estimate
## the mean and 39 to calculate envelopes) ...
##
## Done.
plot(firstModEnv)
As we can see, our initial model exhibits spatial clustering (observed values for K lie above and outside the theoretical CSR envelope) for most radii.
We can attempt to account for this clustering inside our model using a Matern cluster process, specified via the kppm() function (for fitting clustered point process models).
maternMod<-kppm(Q~poly(tempIm,2)+poly(precipIm,2)+x+y,"MatClust") # apply Matern cluster process to the data.
Again, as for our first model, we can perform a diagnostic test for our Matern process model using the envelope() function. In this case we have built our model on the assumption that our points exhibit clustering. In other words, they represent an inhomogeneous point process. We therefore want to create an envelope based on simulations of inhomogeneous (clustered) point patterns rather than CSR. This can be done by changing the second argument of the envelope() function from Kest (Ripley’s K test of CSR) to Kinhom (the inhomogeneous version of K). We can then assess the goodness-of-fit of our model by seeing whether our observed values for Ripley’s K conform to an inhomogeneous distribution (i.e. whether they lie within the simulation envelope).
matEnv<-envelope(maternMod,Kinhom,nsim=39,rank=1,global=TRUE)
## Generating 78 simulated realisations of fitted cluster model (39 to estimate
## the mean and 39 to calculate envelopes) ...
##
## Done.
plot(matEnv)
Our observed K function sits comfortably within the simulation envelope, suggesting that the model fits the inhomogeneous point pattern well and is therefore correctly specified - it is a good approximation of the behaviour of the data.
We can look at the model coefficients by calling coef(maternMod) to see the relationship between our predictor variables and the predicted intensity of points. Notice that variables fitted with poly() have two coefficients: the first describes the overall (linear) trend in intensity, the second the curvature of the relationship. For example, in the case of precipitation, the positive first coefficient and large negative second coefficient suggest that intensity (species density) increases with precipitation before decreasing once a certain value is reached. Note that this information is not directly available from the machine learning approaches.
coef(maternMod)
## (Intercept) poly(tempIm, 2)1 poly(tempIm, 2)2 poly(precipIm, 2)1
## -7.359279938 42.584186463 3.967892571 37.035099357
## poly(precipIm, 2)2 x y
## -54.599033672 -0.001726656 0.002457423
The spatstat package contains a function for calculating the area under the curve for point process models. This differs slightly from binomial models (models with 2 outcomes) in that here it is the intensity rather than the success of correctly classifying a binary outcome that is being tested. The spatstat roc() and auc() functions offer an indication of the ability of the fitted model intensity to separate the spatial domain into areas of high and low density of points. The AUC here is the probability that a randomly-selected data point has higher predicted intensity than does a randomly-selected spatial location. So the test incorporates space into the assessment (i.e. by considering intensity across the entire study area rather than just our data points). We can plot the roc curve and return the auc as below:
plot(roc(maternMod))
auc.kppm(maternMod)
## obs theo
## 0.8757314 0.8922534
The ROC curve and AUC statistic for our model are high, following closely the output that would be expected for a well-specified inhomogeneous point process model (“ftheo” here) and sitting well above the null line (a 50:50 line that represents no ability to discriminate high from low intensity - “fnull” here). So we can see that, with a well-specified model that incorporates spatial dependence, we have achieved a better representation of the data with the point process approach than with our machine learning techniques. In reality we could carry out much more extensive tuning of the machine learning models (increasing the number of iterations in the tuning step), but it is perhaps unlikely that an AUC score comparable to that of the Matern model would be achieved.
We can create a mapped output of our results using the predict() function. Remember, our point process model output is an intensity surface (points per unit area), so the values are quite low. We can log-transform the values to better differentiate their distribution.
prPPMod<-predict(maternMod)
plot(prPPMod)
#log transform the result to highlight differences in value distribution
plot(log(prPPMod))
Finally we can convert our output to raster (remember spatstat deals in image objects):
maternRas<-raster(prPPMod)
plot(maternRas)
You’ve reached the end of this week’s practical. You’ve come a long way on your species distribution modeling journey. For this week’s exercise:
generate a point process model for Solanum acaule using a subset of the data inside an 80% minimum convex polygon (as we did for the Bradypus data in this practical). Test the data for CSR and fit a Matern process to build your model. Plot the ROC curve for your final model as well as the raw and log-transformed prediction maps.
Then, explain the importance of species distribution modelling in the context of nature conservation. Specifically, summarize the opportunities and challenges of applying different modelling approaches to species presence-only data (700 words).