GEOG71922: Week Two: species distribution modelling

Author

MD 06/02/2026

In this week’s practical we will take our first step into species distribution modelling. We will see that there are multiple options for modelling the distribution of species and that all depend on point data and the extraction of environmental factors to those points.

First, let’s set the working directory and install the packages that we will use for today’s work:


setwd("./GEOG71922/WeekTwo")

Install the packages for today’s work (we will call the libraries on first use):


#Install packages

#dismo for downloading species data and evaluating models (also comes with the climatic data we will use for modelling)

#glmnet and maxnet for running Maxent models

#terra and sf you know (handling spatial data)

#randomForest - running random forest models

#rnaturalearth for calling world boundary and other miscellaneous geographical data

#dplyr for data manipulation

install.packages(c("dismo","glmnet","maxnet","terra","randomForest","sf","rnaturalearth","dplyr"))

Load the libraries for terra, sf and dismo:

library(terra)
Warning: package 'terra' was built under R version 4.5.2
terra 1.8.93
library(sf)
Warning: package 'sf' was built under R version 4.5.2
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(dismo)
Loading required package: raster
Loading required package: sp

We will be using raster layers from the Worldclim dataset of environmental variables that come ready-to-use once the dismo package is installed. Information on these layers can be found at http://worldclim.org/bioclim. Follow this link now to see how the different layer names are coded. The ‘biome’ layer included here comes from the ‘Terrestrial Biome’ dataset by the WWF. You can get more information on this layer here: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world

For now let’s load these layers, name them and create a raster stack object (a raster made of multiple bands) before plotting them along with some world map data to see where in the world we are this week. These files are inside the ‘ex’ folder of the dismo package and all have the extension ‘.grd’. We can access these using the list.files() function and concatenating the file paths using the paste() function. Type ‘?paste’ into the command line for more information on this function. This is how they appear when called on my own machine:

files <- list.files(path=paste(system.file(package="dismo"),
                               '/ex', sep=''), pattern='grd', full.names=TRUE )

files
[1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio1.grd" 
[2] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio12.grd"
[3] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio16.grd"
[4] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio17.grd"
[5] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio5.grd" 
[6] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio6.grd" 
[7] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio7.grd" 
[8] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio8.grd" 
[9] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/biome.grd"

Next we create a raster object from these files and plot them.

#use the rast() function to create a "stack" of all the raster layers in the list of files we just made
env<-rast(files[1:8])

#check out the layers
plot(env)

Use the names() function to assign names to the different environmental layers

names(env) <- c("MeanTemp.","AnnualPrecip.","Precip.WettestQuarter","PrecipDriestQuarter","Max.Temp.","Min.Temp.","Temp.Range","MeanTemp.WettestQuarter")


#plot again with new layer names
plot(env)

Next we retrieve some data on the Brown-throated Sloth (Bradypus variegatus) from the Global Biodiversity Information Facility (GBIF).

#use the gbif() function to search and download sloth data; sp=TRUE means return the data as spatial points.

bradypus <- gbif("Bradypus", "variegatus*",sp=TRUE)
10221 records found
0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6600-6900-7200-7500-7800-8100-8400-8700-9000-9300-9600-9900-10200-10221 records downloaded

Before moving on, it is worth knowing about the ‘mapview’ library, a handy tool that lets you work interactively with spatial data in R, panning around your data against base maps much as you would in QGIS or ArcGIS. For now, though, we need to get our downloaded records into a format that works with our raster data: the gbif() function returns an object from the older ‘sp’ library, so we convert it to a terra object with vect() and set its coordinate reference system (CRS).

#convert from an sp library (old) object to terra (newer/faster library) object

bradypus<-vect(bradypus)

#set the crs
crs(bradypus)<-"epsg:4326"

Now we’ll add some political boundaries for context. We’ll use the rnaturalearth package to load Natural Earth data and subset it for our region of interest.

#call natural earth library to load in some national boundaries for N and S America 
library(rnaturalearth)
Warning: package 'rnaturalearth' was built under R version 4.5.2
#create an object just with boundaries for North and South America
am<-rnaturalearth::ne_countries(continent = c("North America","South America"),returnclass = "sf") 

Now subset the data for the parts of South and North America that are relevant for our data.

#check all available regions in our initial subset
unique(am$subregion)
[1] "Northern America" "South America"    "Caribbean"        "Central America" 

Now subset the data to include only the regions we want. We use square brackets [] for subsetting as in Week One, together with the | symbol, which just means OR (note that & would be AND, which wouldn’t return what we want here).

#subset using square brackets and the | ("or") operator
SA<-am[am$subregion=="South America"|am$subregion=="Central America",]

Project the raster layers to the same CRS as the species data and check that everything lines up.

#check all plot together - but first project the env data to epsg:4326

env<-project(env,"epsg:4326",method="bilinear")

plot(env,1)

plot(SA$geometry,add=TRUE)

plot(bradypus, add=TRUE,col ="red")

Next, in order to build our dataset for modelling we need to generate some background (or “pseudo-absence”) points for our study area.

#create random points on cells of the 'env' object. This function also samples the raster values and returns them with the resulting object.

set.seed(11)

#sample background - one point for every cell (9775)
back <- spatSample(env,size=9775,as.points=TRUE,method="random",na.rm=TRUE) 

#inspect
head(back)
  MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
1       250          1997                   952                  48       340
2       225           830                   296                 151       290
3       250          1495                   576                 227       312
4       270          2900                  1381                  81       326
5       265          1647                   881                 155       327
6       132           172                    51                  35       274
  Min.Temp. Temp.Range MeanTemp.WettestQuarter
1       154        186                     248
2       176        114                     242
3       185        126                     250
4       231         95                     263
5       212        115                     258
6         7        267                      73
#and plot

plot(env,1)

#subset to SA object

#done quickest with sf 
library(sf)
#convert to sf object 
back<-st_as_sf(back)

#spatial subset using [,]
back<-back[SA,]

#check number of background points
nrow(back)
[1] 7074
#subset bradypus data to the same extent using the same method

bradypus=st_as_sf(bradypus)[SA,]

Now remove any duplicate geometries in the bradypus object using the distinct function from the dplyr package

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:raster':

    intersect, select, union
The following objects are masked from 'package:terra':

    intersect, union
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
#keep only unique coordinates
bradypus= distinct(bradypus,geometry, .keep_all = TRUE)

Plot both objects together (select geometry column or sf will try to plot one map for every variable in the data frame)

#background points
plot(back$geometry)

#presence points
plot(bradypus$geometry,col="red",add=T)

Next we use the extract() function to retrieve the values of the climatic variables at the presence point locations.

# get environmental covariates at presence locations
eP<-extract(env,bradypus)

#create data frames from values
Pres.cov<-data.frame(eP,Pres=1)

#Remove the first column which is just an ID field.
Pres.cov<-Pres.cov[,-1]

#check the data
summary(Pres.cov)
   MeanTemp.     AnnualPrecip.  Precip.WettestQuarter PrecipDriestQuarter
 Min.   :122.0   Min.   : 375   Min.   : 215          Min.   :   8       
 1st Qu.:235.0   1st Qu.:1912   1st Qu.: 812          1st Qu.: 143       
 Median :255.0   Median :2779   Median :1045          Median : 194       
 Mean   :247.6   Mean   :2685   Mean   :1033          Mean   : 256       
 3rd Qu.:261.0   3rd Qu.:3218   3rd Qu.:1215          3rd Qu.: 305       
 Max.   :284.0   Max.   :7682   Max.   :2458          Max.   :1496       
   Max.Temp.       Min.Temp.       Temp.Range    MeanTemp.WettestQuarter
 Min.   :179.0   Min.   : 49.0   Min.   : 78.0   Min.   :125.0          
 1st Qu.:303.0   1st Qu.:175.0   1st Qu.:108.0   1st Qu.:231.0          
 Median :313.0   Median :196.0   Median :116.0   Median :252.0          
 Mean   :309.9   Mean   :188.8   Mean   :120.9   Mean   :246.4          
 3rd Qu.:322.0   3rd Qu.:214.0   3rd Qu.:131.0   3rd Qu.:261.0          
 Max.   :359.0   Max.   :228.0   Max.   :219.0   Max.   :282.0          
      Pres  
 Min.   :1  
 1st Qu.:1  
 Median :1  
 Mean   :1  
 3rd Qu.:1  
 Max.   :1  

We have a few NAs which will cause issues later, so we remove them by selecting only rows where !is.na() is TRUE, meaning IS NOT NA. We just choose one variable here - “MeanTemp.” - to test for NA status.

#remove NAs subsetting with []
Pres.cov<-Pres.cov[!is.na(Pres.cov$MeanTemp.),]

#check for NAs using the summary function
summary(Pres.cov)
   MeanTemp.     AnnualPrecip.  Precip.WettestQuarter PrecipDriestQuarter
 Min.   :122.0   Min.   : 375   Min.   : 215          Min.   :   8       
 1st Qu.:235.0   1st Qu.:1912   1st Qu.: 812          1st Qu.: 143       
 Median :255.0   Median :2779   Median :1045          Median : 194       
 Mean   :247.6   Mean   :2685   Mean   :1033          Mean   : 256       
 3rd Qu.:261.0   3rd Qu.:3218   3rd Qu.:1215          3rd Qu.: 305       
 Max.   :284.0   Max.   :7682   Max.   :2458          Max.   :1496       
   Max.Temp.       Min.Temp.       Temp.Range    MeanTemp.WettestQuarter
 Min.   :179.0   Min.   : 49.0   Min.   : 78.0   Min.   :125.0          
 1st Qu.:303.0   1st Qu.:175.0   1st Qu.:108.0   1st Qu.:231.0          
 Median :313.0   Median :196.0   Median :116.0   Median :252.0          
 Mean   :309.9   Mean   :188.8   Mean   :120.9   Mean   :246.4          
 3rd Qu.:322.0   3rd Qu.:214.0   3rd Qu.:131.0   3rd Qu.:261.0          
 Max.   :359.0   Max.   :228.0   Max.   :219.0   Max.   :282.0          
      Pres  
 Min.   :1  
 1st Qu.:1  
 Median :1  
 Mean   :1  
 3rd Qu.:1  
 Max.   :1  

Now create a data frame for background values (currently an sf object but for modelling we don’t need the geometry so we drop that here)

#drop geometry column using st_drop_geometry()
Back.cov<-data.frame(st_drop_geometry(back),Pres=0)

#inspect
head(Back.cov)
   MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
1        250          1997                   952                  48       340
5        265          1647                   881                 155       327
6        132           172                    51                  35       274
8        244          1936                   937                  43       333
10       262          1986                   849                 185       326
11       243          2745                  1352                 112       321
   Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
1        154        186                     248    0
5        212        115                     258    0
6          7        267                      73    0
8        149        184                     243    0
10       201        124                     256    0
11       174        147                     238    0

Combine presence and absence data into one data frame using rbind(). This function combines a set of rows on top of another set of rows providing both sets have the same column names.

#combine
all.cov<-rbind(Pres.cov,Back.cov)

Now let’s create our first model. The bioclim() function performs a distribution model based on the envelope modelling approach and takes presence-only data. We therefore specify the model using only the Pres.cov data frame and, for now, we will use the covariates MeanTemp., Temp.Range and AnnualPrecip.

#bioclim model
bc_bradypus <- bioclim(Pres.cov[,c('MeanTemp.', 'Temp.Range', 'AnnualPrecip.')])

The bioclim() algorithm assesses the suitability of a location by comparing the values of environmental variables at that location to the distribution of values at known presence locations. In general, the closer to the 50th percentile (the median), the more suitable the location is.
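To make this percentile idea concrete, here is a minimal sketch (not dismo’s exact implementation) of how a single covariate value could be scored against the presence data; the env_score() helper below is purely illustrative.

#illustrative only: score a value by how close it sits to the median of the presence values
env_score <- function(x, presence_values){
  pct <- ecdf(presence_values)(x) #percentile of x among the presence values
  1 - 2*abs(pct - 0.5) #1 at the median, falling to 0 at (or beyond) the extremes
}

#e.g. score a hypothetical mean temperature of 250 (Worldclim temperatures are in tenths of a degree C)
env_score(250, Pres.cov$MeanTemp.)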

We can map our prediction using the predict() function, using our covariate raster stack as the object and the bc_bradypus as the model. Note that for a bioclim() model the predicted values range from 0 to 1. It is scaled such that a value of 1 would be locations that have the median value of all covariates considered, while a value of zero would reflect locations where at least one covariate is outside the range of environmental covariates at presence locations.

#predict
bioclim.map <- predict(env, bc_bradypus)

#plot
plot(bioclim.map, axes = F, box = F, main = "bioclim: MeanTemp.,Temp.Range,AnnualPrecip.")

Now we’ll repeat our bioclim model but using all predictors (i.e. all columns of covariates in Pres.cov)

#run bioclim with all variables
bc_bradypus2<-bioclim(Pres.cov[,1:8])

#inspect model details
bc_bradypus2
class    : Bioclim 

variables: MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter 


presence points: 6757 
   MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
1        277          1224                   521                  46       341
2        235          2779                  1191                 154       305
3        216          3285                  1218                 300       277
4        216          3285                  1218                 300       277
5        177          1910                   771                 208       253
6        216          3285                  1218                 300       277
7        216          3285                  1218                 300       277
8        267          2015                   628                 365       325
9        177          1910                   771                 208       253
10       177          1910                   771                 208       253
   Min.Temp. Temp.Range MeanTemp.WettestQuarter
1        212        130                     275
2        175        129                     231
3        160        117                     215
4        160        117                     215
5         90        162                     204
6        160        117                     215
7        160        117                     215
8        210        115                     265
9         90        162                     204
10        90        162                     204
  (... ...  ...)

Make a new prediction based on this model

bioclim.map2 <- predict(env, bc_bradypus2)

plot(bioclim.map2, axes = F, box = F, main = "bioclimAll")

We can see that our models are, as expected, very similar, though given the relatively higher predicted values and the smaller set of predictors, we might suspect that the first model (bc_bradypus) is over-predicting our species’ occurrence.

We can also use the response() function to give the response plots for each covariate.

#get response plots
response(bc_bradypus2)

The response plots show values for each covariate at which prediction is highest. For most of our covariates, the relationship appears to be non-linear in the presence-only bioclim model.

Next we’ll look at fitting some models that accept presence and (pseudo-)absence data. We will use the glm() function (glm stands for ‘generalized linear model’) and set the family of functions to ‘binomial’ as our outcome is binary. We set the first argument (the response variable) as ‘Pres’ and separate this from our predictor variables with the tilde symbol ‘~’. We can opt to include all variables (i.e. the remaining columns in the data frame) simply by placing a full stop ‘.’ after the tilde symbol.

#Specify model
glm.bradypus<-glm(Pres~.,binomial(link='logit'), data=all.cov)


#get summary of outputs

summary(glm.bradypus)

Call:
glm(formula = Pres ~ ., family = binomial(link = "logit"), data = all.cov)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)              5.5504504  0.3115500  17.816  < 2e-16 ***
MeanTemp.               -0.0085469  0.0080427  -1.063    0.288    
AnnualPrecip.            0.0044822  0.0001792  25.019  < 2e-16 ***
Precip.WettestQuarter   -0.0064592  0.0003806 -16.973  < 2e-16 ***
PrecipDriestQuarter     -0.0103928  0.0003996 -26.009  < 2e-16 ***
Max.Temp.                0.3265668  0.0459819   7.102 1.23e-12 ***
Min.Temp.               -0.3922343  0.0465899  -8.419  < 2e-16 ***
Temp.Range              -0.4131486  0.0462367  -8.936  < 2e-16 ***
MeanTemp.WettestQuarter  0.0736548  0.0039268  18.757  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19167  on 13830  degrees of freedom
Residual deviance: 10937  on 13822  degrees of freedom
AIC: 10955

Number of Fisher Scoring iterations: 7

The statistics provided by the summary() function for the glm model are the coefficient values estimated for each predictor variable and the AIC (Akaike Information Criterion) value. For the purpose of model comparison, a lower AIC denotes better model fit.
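As a quick illustration of how AIC comparison works in practice, we could fit a smaller model and compare the two fits (glm.reduced below is just a hypothetical example, not part of the main workflow):

#fit a reduced model with only two predictors
glm.reduced <- glm(Pres ~ AnnualPrecip. + Temp.Range, binomial(link='logit'), data=all.cov)

#compare AIC values - the model with the lower AIC is preferred
AIC(glm.bradypus, glm.reduced)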

We can map the probability predictions of our glm.bradypus output in the same way as we did for the bioclim model.

#Use the predict function to estimate probabilities for our study area based on the raster stack object 'env'. 
glm.map <- predict(env, glm.bradypus, type = "response")

#plot the map
plot(glm.map,main="glmBradypus")

Now we will look at the application of a Maxent algorithm to our data. Maxent is a popular species distribution modelling algorithm, partly due to the open source package that can be freely downloaded and used “out of the box”. It can also be used within the R environment where, if the program is saved inside the dismo package folder in the R directory, we can call maxent() from the dismo library in the same way as we have already called bioclim() and glm() from their respective libraries. However, this can be problematic if working on a computer without admin rights, and in any case there is a way to implement a Maxent-style model entirely within R: the maxnet package (built on top of glmnet) contains the maxnet() function that allows us to do this.

For maxnet, we provide our presence data followed by the columns of our data frame containing predictors and, in this case, we choose “lq” (meaning linear and quadratic) from the list of available feature classes for model estimation.

In truth, the maxnet() function approximates a Maxent model using an “infinitely weighted logistic regression” - an inhomogeneous point process (IPP) approach carried out with the glmnet package (see lecture material and more on this in Week 3).

Here we pass all variables in our all.cov data frame as predictors and use the maxnet.formula() function to give variable names as arguments. This is so that we can create response curves later on.

#load libraries
library(maxnet)

Attaching package: 'maxnet'
The following object is masked _by_ '.GlobalEnv':

    bradypus
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-10
# run a presence-only Maxent model selecting feature "linear" and "quadratic" using the "lq" argument.
maxnetMod <- maxnet(all.cov$Pres, all.cov[,1:8],maxnet.formula(all.cov$Pres,all.cov[,c("AnnualPrecip.", "Precip.WettestQuarter", "PrecipDriestQuarter", "Max.Temp.", "Temp.Range","MeanTemp.","MeanTemp.WettestQuarter")],classes="lq"))   

To transform the raw model output (an exponential model) onto a 0 to 1 probability scale for mapping, we have several options. The default option in the original Maxent software was type “logistic” for plotting response curves, though recent work comparing Maxent to an IPP formulation suggests that a cloglog (complementary log-log) transform is more suitable for mapping presence/absence predictions. Consequently, cloglog is now the default in the open source Maxent tool and we will plot it here.
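To see what the two transforms actually do, here is a small illustrative comparison of the logistic and cloglog functions applied to the same (arbitrary) link-scale value eta:

#compare the two transforms for an example link-scale value
eta <- 0
c(logistic = 1/(1 + exp(-eta)), #logistic transform
  cloglog = 1 - exp(-exp(eta))) #complementary log-log transform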

First we plot response curves showing the relationship between some of our variables and probability. We use the response.plot() function from the maxnet package to do this. The function takes six arguments: 1. the model, 2. the variable to be plotted, 3. the variable mean (note we take this from the model itself, but as it is stored in a list - which won’t do for plotting - we have to unlist this value), 4. the variable minimum value, 5. the variable maximum value and 6. the transformation applied to the raw data. You can plot the other variables simply by swapping out the variable name.

response.plot(maxnetMod,"AnnualPrecip.", mm=unlist(maxnetMod$samplemeans), #set model, variable and mean values
              min=maxnetMod$varmin["AnnualPrecip."], max=maxnetMod$varmax["AnnualPrecip."], #set min/max values
              type="cloglog") #choose transform type

Do the same for Temp.Range

response.plot(maxnetMod,"Temp.Range", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["Temp.Range"], 
              max=maxnetMod$varmax["Temp.Range"], 
              type="cloglog")

And the same for MeanTemp.

response.plot(maxnetMod,"MeanTemp.", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["MeanTemp."], 
              max=maxnetMod$varmax["MeanTemp."], 
              type="cloglog")

Now let’s plot the prediction map.

maxnet.cloglog.map <- predict(env, maxnetMod, clamp=F, type="cloglog",na.rm=TRUE)

plot(maxnet.cloglog.map, axes=F, box=F, main="Maxnet-clog")

Model Validation

An important part of species distribution modelling is model validation and this is a subject we will come back to repeatedly throughout this and next week’s practical. Although we briefly looked at the usefulness of the AIC statistic for comparing models, the AIC does not give us any idea about the predictive power of our model. To evaluate this, we need to use information provided in the confusion matrix, as introduced in the lecture.
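As a reminder of what a confusion matrix looks like, here is a minimal sketch built by hand from the glm fitted values (the 0.5 cut-off is arbitrary and used only for illustration):

#fitted probabilities and the 0/1 response actually used in the fit
pred <- fitted(glm.bradypus)
obs <- glm.bradypus$y

#cross-tabulate observed versus predicted classes at an arbitrary 0.5 threshold
table(observed = obs, predicted = as.numeric(pred > 0.5))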

The most readily available function (before creating our own later) for generating a confusion matrix and returning associated statistics is the evaluate() function that comes with the dismo package. We supply it three arguments: a data frame of presence points, a data frame of absence points and the model we wish to evaluate.

Let’s evaluate our presence-only model (bioclim) first.

#For evaluating the first bioclim model we only provide the MeanTemp., Temp.Range and AnnualPrecip. predictors, which are columns 1, 2 and 7 in the Pres.cov data frame



eBC <- evaluate(Pres.cov[,c(1,2,7)],Back.cov,bc_bradypus)

#For model two validation, provide all predictors
eBC2 <- evaluate(Pres.cov,Back.cov,bc_bradypus2)

#inspect both evaluation results

eBC
class          : ModelEvaluation 
n presences    : 6757 
n absences     : 7074 
AUC            : 0.7882081 
cor            : 0.380942 
max TPR+TNR at : 0.02846297 
eBC2
class          : ModelEvaluation 
n presences    : 6757 
n absences     : 7074 
AUC            : 0.8057731 
cor            : 0.3852813 
max TPR+TNR at : 0.02446712 

The main outputs of interest here are the AUC and max TPR+TNR statistics. Recall from the lecture that AUC stands for Area Under the Curve and is a test of model performance where higher values indicate greater accuracy in our predictions. This statistic is often used as a mark of model performance and we will use it to assess the models we run today. Remember also that the max TPR+TNR denotes the probability threshold at which our model maximizes the sum of the True Positive Rate and the True Negative Rate. It is generally accepted that this is an optimum value at which to set the threshold for binary classification (i.e. 1/0 or presence/absence) in mapping outputs. We will use these values later when symbolising our final mapped predictions. For now, we can safely assume that our second bioclim model, with a higher AUC statistic, is the better performing model.
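As an aside, if you want these cut-offs directly, dismo also provides a threshold() function that returns them from a ModelEvaluation object, for example:

#threshold at which the sum of sensitivity (TPR) and specificity (TNR) is highest
threshold(eBC2, stat='spec_sens')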

We can plot the ROC (Receiver Operating Characteristic) curve to visualize these results:

# plot ROC for eBC

plot(eBC,'ROC')

# plot ROC for eBC2

plot(eBC2,'ROC')

Now let’s evaluate our glm models.

#GLM evaluation

#Model with all predictors
eglmAll <- evaluate(all.cov[all.cov$Pres==1,],all.cov[all.cov$Pres==0,], glm.bradypus)


#inspect
eglmAll
class          : ModelEvaluation 
n presences    : 6757 
n absences     : 7074 
AUC            : 0.8956108 
cor            : 0.6049487 
max TPR+TNR at : -0.0329718 

Plot the ROC for the glm model

plot(eglmAll,'ROC')

Finally we apply the evaluate() function to our maxnet model.

maxeval<-evaluate(p=bradypus, a=back, maxnetMod, env)

maxeval
class          : ModelEvaluation 
n presences    : 6757 
n absences     : 7074 
AUC            : 0.9005798 
cor            : 0.611535 
max TPR+TNR at : -9.034249 

So far the evaluation results suggest good performance by all of our models (high AUC), particularly the glm approach. However, we have almost certainly over-estimated the quality of our models given that we have tested them with the same data that were used for training them. A more robust approach is to train the model on a subset of the data and test (evaluate) the model on the remaining cases. For example, we might use 70% of the data to train our model and the remaining 30% of cases to test our model’s ability to correctly classify them. Although more robust, this approach only gives us a single attempt at model validation on which to base our assessment. To overcome this, the k-fold approach is commonly used, where the entire dataset is divided into folds (groups), say five; one fold is held out for model testing and the remaining four are used for model training. This is repeated until every fold has been used for testing, and we base our assessment on the mean values across all iterations. The kfold() function in dismo provides us with a neat way of partitioning our data.
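For reference, a single 70/30 split could be made with something like the short sketch below (trainIndex, trainData and testData are illustrative names only); for our evaluation, though, we will stick with the k-fold approach that follows.

#a single 70/30 train/test split (for illustration only)
set.seed(11)
trainIndex <- sample(nrow(all.cov), size=round(0.7*nrow(all.cov)))
trainData <- all.cov[trainIndex,]
testData <- all.cov[-trainIndex,]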

#set number of folds to use
folds=5

#partition presence and absence data according to folds using the kfold() function.

kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)


#create an empty list to hold our results (remember there will be five sets)
bc_K<-list()
par(mfrow=c(2,3))
# for loop to iterate over folds
for (i in 1:folds) {
  train <- Pres.cov[kfold_pres!= i,]#for presence values, select all folds which are not 'i' to train the model
  test <- Pres.cov[kfold_pres == i,]#the remaining fold is used to test the model
  backTrain<-Back.cov[kfold_back!=i,]#now for background values, select training folds
  backTest<-Back.cov[kfold_back==i,]#use the remainder for model testing
  dataTrain<-rbind(train,backTrain)#bind presence and background training data together
  dataTest<-rbind(test,backTest)#bind test data together
  bc_bradypus_K<-bioclim(train[,1:8])#for bioclim we only need presence data to train the model!
  bc_K[[i]] <- evaluate(p=test[,1:8],a=backTest[,1:8], bc_bradypus_K)#use testing data (kfold==i) for model evaluation
  
  #check the AUC by plotting ROC values
  
  plot(bc_K[[i]],'ROC')
  
}


#inspect

bc_K
[[1]]
class          : ModelEvaluation 
n presences    : 1351 
n absences     : 1415 
AUC            : 0.8038163 
cor            : 0.3697648 
max TPR+TNR at : 0.024937 

[[2]]
class          : ModelEvaluation 
n presences    : 1352 
n absences     : 1415 
AUC            : 0.8146052 
cor            : 0.4100813 
max TPR+TNR at : 0.01519029 

[[3]]
class          : ModelEvaluation 
n presences    : 1351 
n absences     : 1414 
AUC            : 0.8032965 
cor            : 0.3844469 
max TPR+TNR at : 0.02579715 

[[4]]
class          : ModelEvaluation 
n presences    : 1352 
n absences     : 1415 
AUC            : 0.7964918 
cor            : 0.3691173 
max TPR+TNR at : 0.01562618 

[[5]]
class          : ModelEvaluation 
n presences    : 1351 
n absences     : 1415 
AUC            : 0.8078322 
cor            : 0.3913554 
max TPR+TNR at : 0.02588298 

Next we use the slot() function to access the elements in the list bc_K that are of use to us. In this case we want to return all AUC values from our 5 evaluation runs. We ‘loop’ through all elements in the list using the sapply() function (remember this is similar to a for loop but applies to elements in a list or a vector). Effectively the code below iterates over each element in the list ‘bc_K’ and applies a function that selects the ‘auc’ statistic from the corresponding slot in the model results. Slots are basically attributes of objects in R. Here bc_K is a list of objects of class ModelEvaluation, so we iterate over each object and extract the ‘auc’ slot. You can type ‘help(slot)’ into the console for more information on slots.
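For example, the AUC of the first fold can be pulled out directly in either of these two equivalent ways:

#access the 'auc' slot of the first ModelEvaluation object in the list
slot(bc_K[[1]], 'auc')

#the @ operator does the same thing
bc_K[[1]]@auc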

Get model evaluation statistics:

aucBIO <- sapply( bc_K, function(x){slot(x, 'auc')} )

#calculate the mean values for comparison with other models


mean(aucBIO)
[1] 0.8052084

In order to map the results of our model as predicted presence/absence we need to set a threshold, above which locations (cells on the map) are allocated 1 and, below which, 0. One way to determine this threshold is to find the model output value at which the sum of the TPR (true positive rate) and TNR (true negative rate) is maximized. Remember we need both of these to be high in order to ensure a successful model.

We use the sapply() function to iterate over each item in our bc_K list and select the ‘t’ value (this is the slot on bc_K which denotes a given threshold) for which the corresponding value of TPR+TNR is highest (hence we use ‘which.max’).

Opt_BIO<-sapply( bc_K, function(x){ x@t[which.max(x@TPR + x@TNR)] } )

Opt_BIO
[1] 0.02493700 0.01519029 0.02579715 0.01562618 0.02588298

We will use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them.

#take the mean of the Opt_BIO values for application in our predictions

Mean_OptBIO<- mean(Opt_BIO)



Mean_OptBIO
[1] 0.02148672

Now we just need to make a prediction based on our environmental raster stack and our bioclim model, setting 1 and 0 (presence/absence) according to the above threshold.

prBIO <- predict(env, bc_bradypus2,type = "response")
par(mfrow=c(1,2))
plot(prBIO, main='BIO_K')

plot(prBIO > Mean_OptBIO, main='presence/absence')

Now we’ll do exactly the same for our GLM model, only remember that this algorithm takes presence and absence data so we’ll modify our for loop slightly.

#create an empty list to hold our results (remember there will be five sets)
eGLM<-list()
par(mfrow=c(2,3))

for (i in 1:folds) {
  train <- Pres.cov[kfold_pres!= i,]
  test <- Pres.cov[kfold_pres == i,]
  backTrain<-Back.cov[kfold_back!=i,]
  backTest<-Back.cov[kfold_back==i,]
  dataTrain<-rbind(train,backTrain)
  dataTest<-rbind(test,backTest)
  glm_eval <- glm(Pres~.,binomial(link = "logit"), data=dataTrain)#this is our glm model trained on presence and absence points
  eGLM[[i]] <- evaluate(p=dataTest[ which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),], glm_eval)#use testing data (kfold==i) for model evaluation
  
  #check the AUC by plotting ROC values
  
  plot(eGLM[[i]],'ROC')
  
}


#inspect

eGLM
[[1]]
class          : ModelEvaluation 
n presences    : 1351 
n absences     : 1415 
AUC            : 0.8858644 
cor            : 0.5917575 
max TPR+TNR at : -0.0717664 

[[2]]
class          : ModelEvaluation 
n presences    : 1352 
n absences     : 1415 
AUC            : 0.9008186 
cor            : 0.6137022 
max TPR+TNR at : 0.00600667 

[[3]]
class          : ModelEvaluation 
n presences    : 1351 
n absences     : 1414 
AUC            : 0.8991449 
cor            : 0.6141121 
max TPR+TNR at : -0.02784938 

[[4]]
class          : ModelEvaluation 
n presences    : 1352 
n absences     : 1415 
AUC            : 0.8831481 
cor            : 0.5911513 
max TPR+TNR at : -0.00886923 

[[5]]
class          : ModelEvaluation 
n presences    : 1351 
n absences     : 1415 
AUC            : 0.9074673 
cor            : 0.6124596 
max TPR+TNR at : -0.01166423 

Get model evaluation statistics:

aucGLM <- sapply( eGLM, function(x){slot(x, 'auc')} )

#calculate the mean values for comparison with other models


mean(aucGLM)
[1] 0.8952887

Again we use the sapply() function to iterate over each item in our eGLM list and select the ‘t’ value (this is the slot on eGLM which denotes a given threshold) for which the corresponding value of TPR+TNR is highest (hence we use ‘which.max’).

Opt_GLM<-sapply(eGLM, function(x){ x@t[which.max(x@TPR + x@TNR)] } )

#inspect
Opt_GLM
[1] -0.07176640  0.00600667 -0.02784938 -0.00886923 -0.01166423

We can now use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them. As glm outputs are on the logit (log-odds) scale, we can back-transform these values to approximate probabilities using the plogis() function like so:

#take the mean to be applied to our predictions

Mean_OptGLM<- mean(Opt_GLM)

trGLM<-plogis(Mean_OptGLM)
#inspect
trGLM
[1] 0.4942931

We now have a value with which to set the threshold for our presence/absence maps.

We use the predict() function to estimate our output based on the ‘env’ raster stack and the information from our glm.bradypus model. For plotting purposes we then restrict the output to show only areas above our threshold (presence), leaving all other areas blank.

prGLM <- predict(env, glm.bradypus,type = "response")
par(mfrow=c(1,2))
plot(prGLM, main='GLM, regression')

plot(prGLM > trGLM, main='presence/absence')

Now let’s do the same for the Maxent model (we will leave out bioclim)

#Perform the same evaluation process on our maxent model 

par(mfrow=c(2,3))

#create list to hold results
eMAX<-list()

#set number of folds
folds=5

#assign data to folds
kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)


for (i in 1:folds) {
  train <- Pres.cov[kfold_pres!= i,]
  test <- Pres.cov[kfold_pres == i,]
  backTrain<-Back.cov[kfold_back!=i,]
  backTest<-Back.cov[kfold_back==i,]
  dataTrain<-rbind(train,backTrain)
  dataTest<-rbind(test,backTest)
  maxnet_eval <- maxnet(dataTrain$Pres, dataTrain[,1:8])
  eMAX[[i]] <- evaluate(p=dataTest[which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),],   maxnet_eval)
  plot(eMAX[[i]],'ROC')
  
}

#Again, we can access our AUC and MaxTPR+TNR statistics for this latest model.

aucMAX <- sapply(eMAX, function(x){slot(x, 'auc')} )

#calculate the mean values for comparison with other models

aucMAX
[1] 0.9194087 0.9210733 0.9128803 0.9270491 0.9301567

Return the mean auc value

mean(aucMAX)
[1] 0.9221136

Now obtain the threshold value

# retrieve optimal threshold
Opt_MAX<-sapply( eMAX, function(x){ x@t[which.max(x@TPR + x@TNR)] } )


#take the mean
Mean_OptMAX<-mean(Opt_MAX)


#predict the model to data

prMAX <- predict(env, maxnetMod,clamp=F,na.rm=TRUE)

#plot according to predicted values and to thresholds
par(mfrow=c(1,2))

Now plot according to the threshold

plot(prMAX, main='Maxent Prediction')

plot(prMAX > Mean_OptMAX, main='presence/absence')

Finally we will look at how to fit a random forest model to the data using the randomForest package.

library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine

The randomForest model takes two main tuning parameters: mtry and ntree. mtry refers to the number of predictor variables tried at each split and ntree specifies the number of trees grown; 500 is the default value often used and suffices for the size of our data. Model performance is thought to be more sensitive to mtry than to ntree, and mtry can be tuned with the tuneRF() function:

#tune mtry

tuneRF(x=all.cov[,1:8],y=as.factor(all.cov$Pres))
mtry = 2  OOB error = 5.23% 
Searching left ...
mtry = 1    OOB error = 5.3% 
-0.01243094 0.05 
Searching right ...
mtry = 4    OOB error = 5.31% 
-0.01519337 0.05 

      mtry   OOBError
1.OOB    1 0.05299689
2.OOB    2 0.05234618
4.OOB    4 0.05314149

The Out-of-Bag (OOB) error statistic is lowest with an mtry of 2 so we go with this for fitting our model:

rf.Bradypus<-randomForest(as.factor(Pres)~.,mtry=2,ntree=500,data=all.cov)

Although randomForest does not produce variable coefficients as in the case of glm(), we can use the varImpPlot() function to get a measure of variable importance (the contribution that each variable makes to the discrimination of classes in the final model). Higher values of the Gini index represent a greater probability that cases will be wrongly classified, so variable importance is judged by the decrease in the Gini index achieved by including each variable in the model. Note that the rank of importance does not closely match the strength of association implied by the coefficients in the glm model.

varImpPlot(rf.Bradypus)
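If you prefer the raw numbers to the plot, the importance() function from randomForest returns the same Gini decreases as a table:

#print variable importance values (MeanDecreaseGini)
importance(rf.Bradypus)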

To evaluate our random forest model we can still use the evaluate() function from the dismo package, but in this case we first make a prediction using the model output. This is because random forest, as used here, is a classifier, so we have to give it the response variable as a factor. However, factors don’t work well with ROC calculations, so the predict() function here allows us to first obtain numeric probabilities (note that calling predict() on a randomForest object without supplying newdata returns the out-of-bag predictions for the training data, which is why the evaluation below uses dataTrain) before passing these to the evaluate() function.

#create an empty list to hold our results (remember there will be five sets)
eRF<-list()
par(mfrow=c(2,3))

for (i in 1:folds) {
  train <- Pres.cov[kfold_pres!= i,]
  test <- Pres.cov[kfold_pres == i,]
  backTrain<-Back.cov[kfold_back!=i,]
  backTest<-Back.cov[kfold_back==i,]
  dataTrain<-rbind(train,backTrain)
  dataTest<-rbind(test,backTest)
  RF_eval <- randomForest(as.factor(Pres)~., data=dataTrain)#this is our RF model
  rf.pred <- predict(RF_eval, type="prob")[,2]#make prediction
  eRF[[i]]<-evaluate(p = rf.pred[which(dataTrain$Pres == "1")], #set p to be 1s from the dataTrain object 
         a = rf.pred[which(dataTrain$Pres == "0")])# set a to be 0s from the dataTrain object
 
  #check the AUC by plotting ROC values
  
  plot(eRF[[i]],'ROC')
  
}


#inspect

eRF
[[1]]
class          : ModelEvaluation 
n presences    : 5406 
n absences     : 5659 
AUC            : 0.9799243 
cor            : 0.9080899 
max TPR+TNR at : 0.6293082 

[[2]]
class          : ModelEvaluation 
n presences    : 5405 
n absences     : 5659 
AUC            : 0.9796176 
cor            : 0.9097067 
max TPR+TNR at : 0.7689563 

[[3]]
class          : ModelEvaluation 
n presences    : 5406 
n absences     : 5660 
AUC            : 0.9796032 
cor            : 0.9062663 
max TPR+TNR at : 0.6736968 

[[4]]
class          : ModelEvaluation 
n presences    : 5405 
n absences     : 5659 
AUC            : 0.9801226 
cor            : 0.9093308 
max TPR+TNR at : 0.6388455 

[[5]]
class          : ModelEvaluation 
n presences    : 5406 
n absences     : 5659 
AUC            : 0.9797942 
cor            : 0.9102759 
max TPR+TNR at : 0.7824234 

Retrieve the mean auc from these models:

#Get model evaluation statistics:

aucRF <- sapply( eRF, function(x){slot(x, 'auc')} )

#calculate the mean values for comparison with other models


mean(aucRF)
[1] 0.9798124

Get threshold:

#Get maxTPR+TNR for the Random Forest model

Opt_RF<-sapply( eRF, function(x){ x@t[which.max(x@TPR + x@TNR)] } )

#mean of all threshold values
Mean_OptRF<-mean(Opt_RF)

#print
Mean_OptRF
[1] 0.698646

We can predict and map our distribution estimate as for the other models, but note that here we have to add “index=2” as an argument in the predict() function. Remember that random forest is a classifier, which means it will produce outcomes of 0 and 1 automatically. The index=2 asks the predict function for the probability that a cell will equal 1 (which is the second column in the table that the predict function creates - therefore index=2).
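To see where these columns come from, you can call predict() with type="prob" on the data frame itself; the first column holds the probability of class 0 and the second (index=2) the probability of class 1:

#inspect the two probability columns returned by the classification random forest
head(predict(rf.Bradypus, newdata=all.cov, type="prob"))

Now make the raster prediction and map it as before.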

prRF <- predict(env, rf.Bradypus,type="prob",index=2)
par(mfrow=c(1,2))
plot(prRF, main='Random Forest Prediction')

plot(prRF > Mean_OptRF, main='presence/absence')

Now we have a clearer view of the quality of our models, with the random forest model performing better overall. Both the glm and maxnet approaches still demonstrated good predictive power even after the k-fold validation. Next week we will look at how this method of cross-validation may still be overly optimistic given that we have not accounted for spatial autocorrelation in our evaluation, and at methods for handling such problems.

For now, proceed to this week’s independent exercise.

Week Two Exercise

Use the gbif() function to download records for Solanum acaule and generate species distribution models using the maxnet() and glm() approaches. Use the same environmental data and approach to background sampling as used in this practical. Evaluate your models using a k-fold approach based on 5 folds. Select the best performing model and create 2 mapped outputs - 1. a probability plot of your predictions and 2. a presence/absence map based on a sensible threshold.