In this week’s practical we will take our first steps into species distribution modelling. We will see that there are several options for modelling the distribution of a species, and that all of them depend on point occurrence data and the extraction of environmental covariates at those points.
First, let’s set the working directory and install the packages that we will use for today’s work:
setwd("P:/GEOG71922/Week3")
#Code to tell R to only use the terra installation of PROJ (library for performing projections)
plib<-Sys.getenv("PROJ_LIB")
prj<-system.file("proj",package="terra")[1]
Sys.setenv("PROJ_LIB"=prj)
Install the packages for today’s work:
#Install packages
install.packages(c("dismo","glmnet","maxnet","terra","randomForest","sf","rnaturalearth"))
We will be using raster layers from the Worldclim dataset of environmental variables that come ready-to-use once the dismo package is installed. Information on these layers can be found at: http://worldclim.org/bioclim Follow this link now to see how the different layer names are coded. The ‘biome’ layer included here comes from the ‘Terrestrial Biome’ dataset by the WWF. You can get more information on this layer here: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world
For now let’s load these layers, name them and create a raster stack object (a raster made of multiple bands) before plotting them along with some world map data to see where in the world we are this week. These files are inside the ‘ex’ folder of the dismo package and all have the extension ‘.grd’. We can access them using the list.files() function and concatenating the file paths using the paste() function. Type ‘?paste’ into the console for more information on this function. This is how they appear when called on my own machine:
library(terra)
## Warning: package 'terra' was built under R version 4.3.2
## terra 1.7.72
library(dismo)
## Warning: package 'dismo' was built under R version 4.3.2
## Loading required package: raster
## Warning: package 'raster' was built under R version 4.3.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.2
files <- list.files(path=paste(system.file(package="dismo"),
'/ex', sep=''), pattern='grd', full.names=TRUE )
files
## [1] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio1.grd"
## [2] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio12.grd"
## [3] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio16.grd"
## [4] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio17.grd"
## [5] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio5.grd"
## [6] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio6.grd"
## [7] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio7.grd"
## [8] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/bio8.grd"
## [9] "C:/Users/mlibxmda/AppData/Local/R/win-library/4.3/dismo/ex/biome.grd"
Next we create a raster object from these files and plot them.
env<-rast(files[1:8])
plot(env)
Use the names() function to assign names to the different environmental layers
names(env) <- c("MeanTemp.","AnnualPrecip.","Precip.WettestQuarter","PrecipDriestQuarter","Max.Temp.","Min.Temp.","Temp.Range","MeanTemp.WettestQuarter")
plot(env)
Next we retrieve some data on the Brown-throated Sloth (Bradypus variegatus) from the Global Biodiversity Information Facility (GBIF).
#use the gbif() function to search for and download sloth records; sp=TRUE returns the data as spatial points.
bradypus <- gbif("Bradypus", "variegatus*",sp=TRUE)
## 6304 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6304 records downloaded
#convert from an sp (older) object to a terra (newer, faster) SpatVector object
bradypus<-vect(bradypus)
#set the crs
crs(bradypus)<-"epsg:4326"
Now we’ll add some political boundaries for context. We’ll use the rnaturalearth package to load Natural Earth data and subset it for our region of interest.
library(rnaturalearth)
## Warning: package 'rnaturalearth' was built under R version 4.3.2
#call natural earth library to load in some national boundaries for N and S America
am<-rnaturalearth::ne_countries(continent = c("North America","South America"),returnclass = "sf")
Now subset the world map for the parts of South and North America that are relevant for our data.
#check all available in our initial subset
unique(am$subregion)
## [1] "Northern America" "South America" "Caribbean" "Central America"
#subset using square brackets and the | ("or") operator
SA<-am[am$subregion=="South America"|am$subregion=="Central America",]
#check that everything plots together - but first project the env data to epsg:4326
env<-project(env,"epsg:4326",method="bilinear")
plot(env,1)
plot(SA$geometry,add=TRUE)
plot(bradypus, add=TRUE,col ="red")
Next, in order to build our dataset for modelling, we need to generate some background points (remember, these are not absences) for our study area.
#create random points on cells of the 'env' object. This function also samples the raster values and returns them with the resulting object.
set.seed(11)
back <- spatSample(env,size=9775,as.points=TRUE,method="random",na.rm=TRUE)
#inspect
head(back)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 250 1997 952 48 340
## 2 225 830 296 151 290
## 3 250 1495 576 227 312
## 4 270 2900 1381 81 326
## 5 265 1647 881 155 327
## 6 132 172 51 35 274
## Min.Temp. Temp.Range MeanTemp.WettestQuarter
## 1 154 186 248
## 2 176 114 242
## 3 185 126 250
## 4 231 95 263
## 5 212 115 258
## 6 7 267 73
plot(env,1)
temp<-env$MeanTemp.
#make sure crs matches other data
crs(back)
## [1] "GEOGCRS[\"WGS 84\",\n ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n MEMBER[\"World Geodetic System 1984 (Transit)\"],\n MEMBER[\"World Geodetic System 1984 (G730)\"],\n MEMBER[\"World Geodetic System 1984 (G873)\"],\n MEMBER[\"World Geodetic System 1984 (G1150)\"],\n MEMBER[\"World Geodetic System 1984 (G1674)\"],\n MEMBER[\"World Geodetic System 1984 (G1762)\"],\n MEMBER[\"World Geodetic System 1984 (G2139)\"],\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ENSEMBLEACCURACY[2.0]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"
crs(env)
## [1] "GEOGCRS[\"WGS 84\",\n ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n MEMBER[\"World Geodetic System 1984 (Transit)\"],\n MEMBER[\"World Geodetic System 1984 (G730)\"],\n MEMBER[\"World Geodetic System 1984 (G873)\"],\n MEMBER[\"World Geodetic System 1984 (G1150)\"],\n MEMBER[\"World Geodetic System 1984 (G1674)\"],\n MEMBER[\"World Geodetic System 1984 (G1762)\"],\n MEMBER[\"World Geodetic System 1984 (G2139)\"],\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ENSEMBLEACCURACY[2.0]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"
#subset to SA object
library(sf)
## Warning: package 'sf' was built under R version 4.3.2
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
#convert to sf object
back<-st_as_sf(back)
#spatial subset using [,]
back<-back[SA,]
#check number of background points
nrow(back)
## [1] 7074
#now let's plot them
plot(back$geometry)
plot(bradypus,add=TRUE,col="red")
# get environmental covariates at presence locations
eP<-extract(env,bradypus)
#create data frames from values
Pres.cov<-data.frame(eP,Pres=1)
#Remove the first column which is just an ID field.
Pres.cov<-Pres.cov[,-1]
#check the data
summary(Pres.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter
## Min. :122.0 Min. : 375 Min. : 215 Min. : 4.0
## 1st Qu.:240.0 1st Qu.:2064 1st Qu.: 842 1st Qu.: 138.0
## Median :259.0 Median :2791 Median :1077 Median : 204.0
## Mean :251.6 Mean :2846 Mean :1102 Mean : 254.3
## 3rd Qu.:264.0 3rd Qu.:3548 3rd Qu.:1459 3rd Qu.: 305.0
## Max. :284.0 Max. :7682 Max. :2458 Max. :1496.0
## NA's :2 NA's :2 NA's :2 NA's :2
## Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
## Min. :179.0 Min. : 49.0 Min. : 67.0 Min. :125.0
## 1st Qu.:303.0 1st Qu.:178.0 1st Qu.:107.0 1st Qu.:238.0
## Median :316.0 Median :199.0 Median :116.0 Median :254.0
## Mean :313.5 Mean :193.6 Mean :119.8 Mean :249.7
## 3rd Qu.:335.0 3rd Qu.:218.0 3rd Qu.:136.0 3rd Qu.:262.0
## Max. :359.0 Max. :228.0 Max. :219.0 Max. :282.0
## NA's :2 NA's :2 NA's :2 NA's :2
## Pres
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
#we have a few NAs which will cause issues later, so we remove them by keeping only rows where !is.na() is TRUE (i.e. the value IS NOT NA). We test just one variable here - "MeanTemp." - for NA status.
Pres.cov<-Pres.cov[!is.na(Pres.cov$MeanTemp.),]
head(Pres.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 242 782 497 31 296
## 2 255 2849 1136 204 321
## 3 264 3548 1552 138 337
## 4 216 3285 1218 300 277
## 5 230 1238 589 83 295
## 6 200 1278 634 93 279
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 193 103 249 1
## 2 196 125 256 1
## 3 198 138 260 1
## 4 160 117 215 1
## 5 160 135 232 1
## 6 96 184 225 1
#create data frame for background values (currently an sf object but for modelling we don't need the geometry so we drop that here)
Back.cov<-data.frame(st_drop_geometry(back),Pres=0)
head(Back.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 250 1997 952 48 340
## 5 265 1647 881 155 327
## 6 132 172 51 35 274
## 8 244 1936 937 43 333
## 10 262 1986 849 185 326
## 11 243 2745 1352 112 321
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 154 186 248 0
## 5 212 115 258 0
## 6 7 267 73 0
## 8 149 184 243 0
## 10 201 124 256 0
## 11 174 147 238 0
#combine
all.cov<-rbind(Pres.cov,Back.cov)
Now let’s create our first model. The bioclim() function fits a distribution model based on the envelope modelling approach and takes presence data only. We therefore specify the model using only the Pres.cov data frame and, for now, we will use the covariates MeanTemp., Temp.Range and AnnualPrecip.:
bc_bradypus <- bioclim(Pres.cov[,c('MeanTemp.', 'Temp.Range', 'AnnualPrecip.')])
The bioclim() algorithm assesses the suitability of a location by comparing the values of the environmental variables at that location to the distribution of values at known presence locations. In general, the closer a value lies to the 50th percentile (the median) of the presence distribution, the more suitable the location is.
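To see the percentile idea in miniature, here is a hedged one-variable sketch (not the exact dismo implementation): a candidate value is scored by how central it sits in the presence distribution, so the median scores 1 and values at or beyond the observed extremes score 0.
#illustrative envelope score for a single variable
env_score <- function(x, presence_values) {
  p <- ecdf(presence_values)(x)  #percentile of x among the presence values
  2 * pmin(p, 1 - p)             #1 at the median, falling to 0 at the extremes
}
#e.g. the suitability score for a MeanTemp. value of 250 given our presence data
env_score(250, Pres.cov$MeanTemp.)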
We can map our prediction using the predict() function, supplying our covariate raster stack as the object and bc_bradypus as the model. Note that for a bioclim() model the predicted values range from 0 to 1. The output is scaled such that a value of 1 corresponds to locations with the median value of every covariate considered, while a value of 0 reflects locations where at least one covariate falls outside the range observed at presence locations.
bioclim.map <- predict(env, bc_bradypus)#predict
#plot
plot(bioclim.map, axes = F, box = F, main = "bioclim: MeanTemp.,Temp.Range,AnnualPrecip.")
Now we’ll repeat our bioclim model but using all predictors (i.e. all columns of co-variates in Pres.cov)
bc_bradypus2<-bioclim(Pres.cov[,1:8])
bc_bradypus2
## class : Bioclim
##
## variables: MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
##
##
## presence points: 5618
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 242 782 497 31 296
## 2 255 2849 1136 204 321
## 3 264 3548 1552 138 337
## 4 216 3285 1218 300 277
## 5 230 1238 589 83 295
## 6 200 1278 634 93 279
## 7 255 2849 1136 204 321
## 8 230 1238 589 83 295
## 9 264 3548 1552 138 337
## 10 235 2779 1191 154 305
## Min.Temp. Temp.Range MeanTemp.WettestQuarter
## 1 193 103 249
## 2 196 125 256
## 3 198 138 260
## 4 160 117 215
## 5 160 135 232
## 6 96 184 225
## 7 196 125 256
## 8 160 135 232
## 9 198 138 260
## 10 175 129 231
## (... ... ...)
bioclim.map2 <- predict(env, bc_bradypus2)
plot(bioclim.map2, axes = F, box = F, main = "bioclimAll")
We can see that, as expected, our two models are very similar, though we might suspect that the first model (bc_bradypus), with its relatively higher predicted values and fewer predictors, may be over-predicting our species’ occurrence.
We can also use the response() function to give the response plots for each covariate.
response(bc_bradypus2)
The response plots show the values of each covariate at which the prediction is highest. It might help to think of these plots as being similar in principle to a histogram, with covariate values on the X axis and the proportion of presence points on the Y axis. For most of our covariates, the relationship appears to be non-linear in the presence-only bioclim model.
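If it helps, here is a quick, purely illustrative histogram of one covariate at the presence locations; the response curve above summarises much the same information in a smoothed form.
hist(Pres.cov$AnnualPrecip., breaks=30, main="AnnualPrecip. at presence points", xlab="AnnualPrecip.")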
Next we’ll look at fitting some models that accept presence and (pseudo-)absence data. We will use the glm() function (glm stands for ‘generalized linear model’) and set the family to ‘binomial’ (i.e. a logistic regression), as our outcome is binary. We set the first argument (the response variable) as ‘Pres’ and separate this from our predictor variables with the tilde symbol ‘~’. We can include all remaining variables (i.e. the remaining columns in the data frame) simply by placing a full stop ‘.’ after the tilde.
#Specify model
glm.bradypus<-glm(Pres~.,binomial(link='logit'), data=all.cov)
#get summary of outputs
summary(glm.bradypus)
##
## Call:
## glm(formula = Pres ~ ., family = binomial(link = "logit"), data = all.cov)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.5509606 0.3339521 16.622 <2e-16 ***
## MeanTemp. 0.0004438 0.0094719 0.047 0.963
## AnnualPrecip. 0.0045718 0.0001834 24.933 <2e-16 ***
## Precip.WettestQuarter -0.0064333 0.0003944 -16.310 <2e-16 ***
## PrecipDriestQuarter -0.0112260 0.0004116 -27.273 <2e-16 ***
## Max.Temp. 0.4377123 0.0520944 8.402 <2e-16 ***
## Min.Temp. -0.5046198 0.0527981 -9.558 <2e-16 ***
## Temp.Range -0.5257910 0.0524529 -10.024 <2e-16 ***
## MeanTemp.WettestQuarter 0.0647936 0.0045312 14.299 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17427.5 on 12691 degrees of freedom
## Residual deviance: 9225.1 on 12683 degrees of freedom
## AIC: 9243.1
##
## Number of Fisher Scoring iterations: 7
The statistics provided by the summary() function for the glm model include the coefficient estimates for each predictor variable and the AIC (Akaike Information Criterion). For the purpose of model comparison, a lower AIC denotes a better model fit.
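As an optional illustration of AIC-based comparison (not part of the main workflow), we could fit a reduced glm with only three predictors and compare the two AIC values; the model with the lower AIC is the better-fitting one.
#fit a smaller model for comparison (illustrative only)
glm.reduced <- glm(Pres ~ MeanTemp. + Temp.Range + AnnualPrecip., binomial(link='logit'), data=all.cov)
#AIC() on several models returns their degrees of freedom and AIC side by side
AIC(glm.bradypus, glm.reduced)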
We can map the probability predictions of our glm.bradypus output in the same way as we did for the bioclim model.
#Use the predict function to estimate probabilities for our study area based on the raster stack object 'env'.
glm.map <- predict(env, glm.bradypus, type = "response")
#plot the map
plot(glm.map,main="glmBradypus")
Now we will look at applying the Maxent algorithm to our data. Maxent is a popular species distribution modelling algorithm, partly because the software is open source and can be freely downloaded and used “out of the box”. It can also be used within the R environment: if the program is saved inside the dismo package folder of the R library, we can call maxent() from the dismo package in the same way as we have already called bioclim() and glm(). However, this can be problematic when working on a computer without admin rights, and in any case there is a way to implement a Maxent-style model without leaving R: the maxnet package (built on glmnet) contains the function maxnet() that allows us to do this.
For maxnet, we provide our vector of presence/background values (1/0) followed by the columns of our data frame containing the predictors and, in this case, we choose “lq” (meaning linear and quadratic) from the list of available feature classes for model estimation.
In truth, the maxnet() function approximates the Maxent model using an “infinitely weighted logistic regression” - an inhomogeneous point process (IPP) approach carried out with the glmnet package (see Fithian and Hastie, 2013). This is why we also installed glmnet at the start of the practical. As you can see from the topics covered in the lecture and the suggested reading for this week, there is a strong theoretical basis for using point process models in ecology. Here we pass all variables in our all.cov data frame as predictors and use the maxnet.formula() function to give the variable names as arguments, so that we can create response curves later on.
library(maxnet)
## Warning: package 'maxnet' was built under R version 4.3.2
##
## Attaching package: 'maxnet'
## The following object is masked _by_ '.GlobalEnv':
##
## bradypus
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## Loaded glmnet 4.1-8
head(all.cov[,1:8])
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 242 782 497 31 296
## 2 255 2849 1136 204 321
## 3 264 3548 1552 138 337
## 4 216 3285 1218 300 277
## 5 230 1238 589 83 295
## 6 200 1278 634 93 279
## Min.Temp. Temp.Range MeanTemp.WettestQuarter
## 1 193 103 249
## 2 196 125 256
## 3 198 138 260
## 4 160 117 215
## 5 160 135 232
## 6 96 184 225
maxnetMod <- maxnet(all.cov$Pres, all.cov[,1:8],maxnet.formula(all.cov$Pres,all.cov[,c("AnnualPrecip.", "Precip.WettestQuarter", "PrecipDriestQuarter", "Max.Temp.", "Temp.Range","MeanTemp.",
"MeanTemp.WettestQuarter")],classes="lq")) # runs a presence-only Maxent model selecting feature "linear" and "quadratic" using the "lq" argument.
To transform the raw model output values (from an exponential model) onto a 0 to 1 scale for mapping, we have several options. The default in the original Maxent software was the “logistic” transform for plotting response curves, though recent work comparing Maxent to an IPP formulation suggests that the cloglog (complementary log-log) transform is more suitable for mapping presence/absence predictions. Consequently, cloglog is now the default in the open source Maxent tool and we will use it here.
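As a quick, purely illustrative check of how much the choice of transform matters for our model, we can predict the fitted maxnet model onto the covariate columns of all.cov under both output types and compare the results; both are constrained to the 0-1 scale but can differ noticeably.
#predict under both transforms and summarise the difference
p_cloglog  <- predict(maxnetMod, all.cov[,1:8], type="cloglog")
p_logistic <- predict(maxnetMod, all.cov[,1:8], type="logistic")
summary(as.numeric(p_cloglog) - as.numeric(p_logistic))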
First we plot response curves showing the relationship between some of our variables and the predicted probability. We use the response.plot() function from the maxnet package to do this. The function takes six arguments: 1. the model, 2. the variable to be plotted, 3. the variable means (note that we take these from the model itself, but because they are stored in a list - which won’t do for plotting - we have to unlist them), 4. the variable minimum value, 5. the variable maximum value and 6. the transformation applied to the raw output. You can plot the other variables simply by swapping out the variable name.
response.plot(maxnetMod,"AnnualPrecip.", mm=unlist(maxnetMod$samplemeans), #set model, variable and mean values
min=maxnetMod$varmin["AnnualPrecip."], max=maxnetMod$varmax["AnnualPrecip."], #set min/max values
type="cloglog") #choose transform type
response.plot(maxnetMod,"Temp.Range", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["Temp.Range"],
max=maxnetMod$varmax["Temp.Range"],
type="cloglog")
response.plot(maxnetMod,"MeanTemp.", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["MeanTemp."],
max=maxnetMod$varmax["MeanTemp."],
type="cloglog")
Now let’s plot the prediction map.
maxnet.cloglog.map <- predict(env, maxnetMod, clamp=F, type="cloglog",na.rm=TRUE)
plot(maxnet.cloglog.map, axes=F, box=F, main="Maxnet-clog")
Model Validation
An important part of species distribution modelling is model validation and this is a subject we will come back to repeatedly throughout this and next week’s practical. Although we briefly looked at the usefulness of the AIC statistic for comparing models, the AIC does not give us any idea about the predictive power of our model. To evaluate this, we need to use information provided in the confusion matrix, as introduced in the lecture.
The most readily available function (before we create our own later) for generating a confusion matrix and returning associated statistics is the evaluate() function that comes with the dismo package. To use it, we supply three arguments: a data frame of presence points, a data frame of absence points and the model we wish to evaluate.
Let’s evaluate our presence-only model (bioclim) first.
#For evaluating the first bioclim model we only provide the MeanTemp., AnnualPrecip. and Temp.Range predictors, which are columns 1, 2 and 7 of the Pres.cov data frame
eBC <- evaluate(Pres.cov[,c(1,2,7)],Back.cov,bc_bradypus)
#For model two validation, provide all predictors
eBC2 <- evaluate(Pres.cov,Back.cov,bc_bradypus2)
#inspect both evaluation results
eBC
## class : ModelEvaluation
## n presences : 5618
## n absences : 7074
## AUC : 0.7755517
## cor : 0.3363476
## max TPR+TNR at : 0.0245173
eBC2
## class : ModelEvaluation
## n presences : 5618
## n absences : 7074
## AUC : 0.7881668
## cor : 0.3090843
## max TPR+TNR at : 0.01647672
The main outputs of interest here are the AUC and max TPR+TNR statistics. Recall from the lecture that AUC stands for Area Under the Curve and is a measure of model performance where higher values indicate greater accuracy in our predictions. This statistic is widely used as a mark of model performance and we will use it to assess the models we run today. Remember also that max TPR+TNR denotes the probability threshold at which our model maximizes the sum of the True Positive Rate and the True Negative Rate. It is generally accepted that this is an optimal value at which to set the threshold for binary classification (i.e. 1/0 or presence/absence) in mapped outputs. We will use these values later when symbolising our final mapped predictions. For now, we can reasonably conclude that our second bioclim model, with the higher AUC, is the better performing model.
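To make the confusion matrix idea concrete, here is a hedged sketch that applies the ‘max TPR+TNR’ threshold stored in eBC2 to the bioclim predictions at our presence and background points and tabulates predicted against observed classes.
#threshold at which TPR + TNR is maximised
thr <- eBC2@t[which.max(eBC2@TPR + eBC2@TNR)]
#predicted suitability at presence and background points
pred_pres <- predict(bc_bradypus2, Pres.cov[,1:8])
pred_back <- predict(bc_bradypus2, Back.cov[,1:8])
#build the confusion matrix: rows are observed classes, columns are predicted classes
observed  <- c(rep(1, length(pred_pres)), rep(0, length(pred_back)))
predicted <- as.numeric(c(pred_pres, pred_back) >= thr)
table(observed, predicted)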
We can plot the ROC (Receiver Operating Characteristic) curve to visualize these results:
plot(eBC,'ROC')
plot(eBC2,'ROC')
Now let’s evaluate our glm models.
#GLM evaluation
#Model with all predictors
eglmAll <- evaluate(all.cov[all.cov$Pres==1,],all.cov[all.cov$Pres==0,], glm.bradypus)
#inspect
eglmAll
## class : ModelEvaluation
## n presences : 5618
## n absences : 7074
## AUC : 0.9139241
## cor : 0.6258142
## max TPR+TNR at : -0.4906637
plot(eglmAll,'ROC')
Finally we apply the evaluate() function to our maxnet model.
maxeval<-evaluate(p=bradypus, a=back, maxnetMod, env)
maxeval
## class : ModelEvaluation
## n presences : 5618
## n absences : 7074
## AUC : 0.9071895
## cor : 0.6209768
## max TPR+TNR at : -9.174224
So far the evaluation results suggest good performance by all of our models (high AUC), particularly the glm approach. However, we have almost certainly over-estimated the quality of our models, because we tested them with the same data that were used to train them. A more robust approach is to train the model on a subset of the data and test (evaluate) it on the remaining cases. For example, we might use 70% of the data to train our model and the remaining 30% to test its ability to correctly classify those cases. Although more robust, this approach gives us only a single attempt at model validation on which to base our assessment. To overcome this, the k-fold approach is commonly used: the dataset is divided into folds (groups), say five, one fold is held out for testing and the remaining four are used for training. This is repeated until every fold has been used for testing, and we base our assessment on the mean values across all iterations. The kfold() function in dismo provides a neat way of partitioning our data.
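Before building the k-fold loop, here is a hedged sketch of the simpler single 70/30 split described above, using the glm model as an example (the fold-based version below is what we will actually rely on). The names glm_split and e_split are just illustrative.
set.seed(42)
#randomly select 70% of the presence and background rows for training
pres_train_id <- sample(nrow(Pres.cov), size=round(0.7*nrow(Pres.cov)))
back_train_id <- sample(nrow(Back.cov), size=round(0.7*nrow(Back.cov)))
#train on the 70%...
glm_split <- glm(Pres~., binomial(link='logit'),
                 data=rbind(Pres.cov[pres_train_id,], Back.cov[back_train_id,]))
#...and evaluate on the remaining 30%
e_split <- evaluate(p=Pres.cov[-pres_train_id,], a=Back.cov[-back_train_id,], glm_split)
e_split@auc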
#set number of folds to use
folds=5
#partition presence and background data into folds using the kfold() function.
kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)
#create an empty list to hold our results (remember there will be five sets)
bc_K<-list()
par(mfrow=c(2,3))
# for loop to iterate over folds
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]#for presence values, select all folds which are not 'i' to train the model
test <- Pres.cov[kfold_pres == i,]#the remaining fold is used to test the model
backTrain<-Back.cov[kfold_back!=i,]#now for background values, select training folds
backTest<-Back.cov[kfold_back==i,]#use the remainder for model testing
dataTrain<-rbind(train,backTrain)#bind presence and background training data together
dataTest<-rbind(test,backTest)#bind test data together
bc_bradypus_K<-bioclim(train[,1:8])#for bioclim we only need presence data to train the model!
bc_K[[i]] <- evaluate(p=test[,1:8],a=backTest[,1:8], bc_bradypus_K)#use testing data (kfold==i) for model evaluation
#check the AUC by plotting ROC values
plot(bc_K[[i]],'ROC')
}
#inspect
bc_K
## [[1]]
## class : ModelEvaluation
## n presences : 1124
## n absences : 1415
## AUC : 0.7892449
## cor : 0.3102134
## max TPR+TNR at : 0.01658892
##
## [[2]]
## class : ModelEvaluation
## n presences : 1123
## n absences : 1415
## AUC : 0.7946273
## cor : 0.327317
## max TPR+TNR at : 0.02780656
##
## [[3]]
## class : ModelEvaluation
## n presences : 1124
## n absences : 1414
## AUC : 0.7760338
## cor : 0.3022396
## max TPR+TNR at : 0.01747899
##
## [[4]]
## class : ModelEvaluation
## n presences : 1123
## n absences : 1415
## AUC : 0.7950379
## cor : 0.3052775
## max TPR+TNR at : 0.01524016
##
## [[5]]
## class : ModelEvaluation
## n presences : 1124
## n absences : 1415
## AUC : 0.7867409
## cor : 0.2963785
## max TPR+TNR at : 0.0304563
Next we use the slot() function to access the elements in the list bc_K that are of use to us. In this case we want to return all of the AUC values from our five evaluation runs. We ‘loop’ through all elements in the list using the sapply() function (remember this is similar to a for loop but applies a function to each element of a list or vector). Effectively, the code below iterates over each element in the list bc_K and applies a function that selects the ‘auc’ statistic from the corresponding slot in the model results. Slots are essentially attributes of objects in R: bc_K is a list of objects of class ModelEvaluation, so we iterate over each one and extract its ‘auc’ slot. You can type ‘help(slot)’ into the console for more information on slots.
Get model evaluation statistics:
aucBIO <- sapply( bc_K, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucBIO)
## [1] 0.788337
In order to map the results of our model we need to set a threshold above which locations (cells on the map) are allocated 1 and below which they are allocated 0. One way to determine this threshold is to find the model output value at which the sum of the TPR (true positive rate) and TNR (true negative rate) is maximized. Remember that we need both of these to be high for a successful model.
We use the sapply() function to iterate over each item in our bc_K list and select the ‘t’ value (the slot holding the candidate thresholds) at which TPR+TNR is highest (hence we use ‘which.max’).
Opt_BIO<-sapply( bc_K, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_BIO
## [1] 0.01658892 0.02780656 0.01747899 0.01524016 0.03045630
We will use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them.
#take the mean of the Opt_BIO values for application in our predictions
Mean_OptBIO<- mean(Opt_BIO)
Mean_OptBIO
## [1] 0.02151419
Now we just need to make a prediction based on our environmental raster stack and our bioclim model, setting 1 and 0 (presence/absence) according to the above threshold.
prBIO <- predict(env, bc_bradypus2,type = "response")
par(mfrow=c(1,2))
plot(prBIO, main='BIO_K')
plot(prBIO > Mean_OptBIO, main='presence/absence')
Now we’ll do exactly the same for our GLM model, only remember that this algorithm takes presence and absence data so we’ll modify our for loop slightly.
#create an empty list to hold our results (remember there will be five sets)
eGLM<-list()
par(mfrow=c(2,3))
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
glm_eval <- glm(Pres~.,binomial(link = "logit"), data=dataTrain)#this is our glm model trained on presence and absence points
eGLM[[i]] <- evaluate(p=dataTest[ which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),], glm_eval)#use testing data (kfold==i) for model evaluation
#check the AUC by plotting ROC values
plot(eGLM[[i]],'ROC')
}
#inspect
eGLM
## [[1]]
## class : ModelEvaluation
## n presences : 1124
## n absences : 1415
## AUC : 0.9135389
## cor : 0.6206082
## max TPR+TNR at : -0.4855559
##
## [[2]]
## class : ModelEvaluation
## n presences : 1123
## n absences : 1415
## AUC : 0.9120963
## cor : 0.6238017
## max TPR+TNR at : -0.5308654
##
## [[3]]
## class : ModelEvaluation
## n presences : 1124
## n absences : 1414
## AUC : 0.9125619
## cor : 0.6236072
## max TPR+TNR at : -0.513314
##
## [[4]]
## class : ModelEvaluation
## n presences : 1123
## n absences : 1415
## AUC : 0.9163161
## cor : 0.6343554
## max TPR+TNR at : -0.5351958
##
## [[5]]
## class : ModelEvaluation
## n presences : 1124
## n absences : 1415
## AUC : 0.9138221
## cor : 0.6258148
## max TPR+TNR at : -0.5143094
Get model evaluation statistics:
aucGLM <- sapply( eGLM, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucGLM)
## [1] 0.913667
Again we use the sapply() function to iterate over each item in our eGLM list and select the ‘t’ value (the slot on eGLM which holds the candidate thresholds) at which TPR+TNR is highest (hence we use ‘which.max’).
Opt_GLM<-sapply(eGLM, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_GLM
## [1] -0.4855559 -0.5308654 -0.5133140 -0.5351958 -0.5143094
We can now use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them. As the glm thresholds are on the logit (log-odds) scale, we can back-transform them to probabilities using the plogis() function like so:
#take the mean to be applied to our predictions
Mean_OptGLM<- mean(Opt_GLM)
trGLM<-plogis(Mean_OptGLM)
#inspect
trGLM
## [1] 0.3738236
We now have a value with which to set the threshold for our presence/absence maps.
We use the predict() function to estimate our output based on the ‘env’ raster stack and our glm.bradypus model. For plotting purposes we then apply our threshold so that only areas above it are shown as presence.
prGLM <- predict(env, glm.bradypus,type = "response")
par(mfrow=c(1,2))
plot(prGLM, main='GLM, regression')
plot(prGLM > trGLM, main='presence/absence')
#We can perform the same evaluation process on our maxnet model as well (we will not repeat it for bioclim)
par(mfrow=c(2,3))
#create list to hold results
eMAX<-list()
#set number of folds
folds=5
#assign data to folds
kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
maxnet_eval <- maxnet(dataTrain$Pres, dataTrain[,1:8])
eMAX[[i]] <- evaluate(p=dataTest[which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),], maxnet_eval)
plot(eMAX[[i]],'ROC')
}
#Again, we can access our AUC and MaxTPR+TNR statistics for this latest model.
aucMAX <- sapply(eMAX, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
aucMAX
## [1] 0.9308197 0.9197339 0.9363703 0.9131491 0.9380547
mean(aucMAX)
## [1] 0.9276255
# retrieve optimal threshold
Opt_MAX<-sapply( eMAX, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
#take the mean
Mean_OptMAX<-mean(Opt_MAX)
#predict the model to data
prMAX <- predict(env, maxnetMod,clamp=F,na.rm=TRUE)
#plot according to predicted values and to thresholds
par(mfrow=c(1,2))
plot(prMAX, main='Maxent Prediction')
plot(prMAX > Mean_OptMAX, main='presence/absence')
Finally we will look at how to fit a random forest model to the data using the randomForest package. The randomForest() function has two main tuning parameters: mtry and ntree. mtry is the number of predictor variables tried at each split of a tree, and ntree specifies the number of trees grown. The default of 500 trees is commonly used and will suffice for the size of our data. Model performance is generally more sensitive to mtry than to ntree, and mtry can be tuned with the tuneRF() function:
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
#tune mtry
tuneRF(x=all.cov[,1:8],y=as.factor(all.cov$Pres))
## mtry = 2 OOB error = 4.64%
## Searching left ...
## mtry = 1 OOB error = 4.39%
## 0.05432937 0.05
## Searching right ...
## mtry = 4 OOB error = 4.7%
## -0.07001795 0.05
## mtry OOBError
## 1.OOB 1 0.04388591
## 2.OOB 2 0.04640719
## 4.OOB 4 0.04695871
The out-of-bag (OOB) error is lowest with an mtry of 1, so we use this value when fitting our model:
rf.Bradypus<-randomForest(as.factor(Pres)~.,mtry=1,ntree=500,data=all.cov)
Although randomForest does not produce variable coefficients as glm() does, we can use the varImpPlot() function to get a measure of variable importance (the contribution that each variable makes to the discrimination of classes in the final model). Higher values of the Gini index indicate a greater probability that cases will be wrongly classified, so variable importance is judged by the decrease in the Gini index achieved by including each variable in the model. Note that the ranking of importance does not closely match the strength of association implied by the coefficients in the glm model.
varImpPlot(rf.Bradypus)
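If you prefer the same information as numbers rather than a plot, the importance() function from randomForest (shown here purely as an illustration) returns the mean decrease in Gini for each predictor; sorting it reproduces the ranking shown in the plot.
imp <- importance(rf.Bradypus)
#sort predictors by mean decrease in Gini, largest (most important) first
imp[order(imp[,"MeanDecreaseGini"], decreasing=TRUE), , drop=FALSE]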
To evaluate our random forest model we can still use the evaluate() function from the dismo package, but in this case we first make a prediction using the model output. This is because random forest, as used here, is a classifier, so we had to supply the response variable as a factor. Factors do not work well with ROC calculations, so we use predict() to first generate predictions (where all predicted and actual values are numeric rather than factors) before passing these to the evaluate() function.
#create an empty list to hold our results (remember there will be five sets)
eRF<-list()
par(mfrow=c(2,3))
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
RF_eval <- randomForest(as.factor(Pres)~., data=dataTrain)#this is our RF model
rf.pred <- predict(RF_eval, type="prob")[,2]#make prediction
eRF[[i]]<-evaluate(p = rf.pred[which(dataTrain$Pres == "1")], #set p to be 1s from the dataTrain object
a = rf.pred[which(dataTrain$Pres == "0")])# set a to be 0s from the dataTrain object
#check the AUC by plotting ROC values
plot(eRF[[i]],'ROC')
}
#inspect
eRF
## [[1]]
## class : ModelEvaluation
## n presences : 4494
## n absences : 5659
## AUC : 0.9817797
## cor : 0.9155063
## max TPR+TNR at : 0.6395502
##
## [[2]]
## class : ModelEvaluation
## n presences : 4495
## n absences : 5659
## AUC : 0.9828311
## cor : 0.9213618
## max TPR+TNR at : 0.6194318
##
## [[3]]
## class : ModelEvaluation
## n presences : 4494
## n absences : 5660
## AUC : 0.9821397
## cor : 0.9172202
## max TPR+TNR at : 0.3919786
##
## [[4]]
## class : ModelEvaluation
## n presences : 4495
## n absences : 5659
## AUC : 0.98257
## cor : 0.9164005
## max TPR+TNR at : 0.3808574
##
## [[5]]
## class : ModelEvaluation
## n presences : 4494
## n absences : 5659
## AUC : 0.9824052
## cor : 0.9166336
## max TPR+TNR at : 0.3668898
#Get model evaluation statistics:
aucRF <- sapply( eRF, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucRF)
## [1] 0.9823451
#Get maxTPR+TNR for the Random Forest model
Opt_RF<-sapply( eRF, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_RF
## [1] 0.6395502 0.6194318 0.3919786 0.3808574 0.3668898
Mean_OptRF<-mean(Opt_RF)
Mean_OptRF
## [1] 0.4797416
We can predict and map our distribution estimate as for the other models, but note that here we have to add index=2 as an argument to the predict() function. Remember that random forest is a classifier, which means it will produce class outcomes of 0 and 1 by default. Adding index=2 asks the predict() function to return the probability that a cell equals 1 (this is the second column of the probability table that predict() creates - hence index=2).
prRF <- predict(env, rf.Bradypus,type="prob",index=2)
par(mfrow=c(1,2))
plot(prRF, main='Random Forest Prediction')
plot(prRF > Mean_OptRF, main='presence/absence')
Now we have a clearer view of the quality of our models, with the random forest model performing best overall. Both the glm and maxnet approaches still demonstrated good predictive power even after k-fold validation. Next week we will look at why this method of cross-validation may still be overly optimistic, given that we have not accounted for spatial autocorrelation in our evaluation, and at methods for handling such problems.
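As an optional wrap-up, we can collect the mean cross-validated AUC values calculated above into a single data frame for a side-by-side comparison (your values will differ slightly from those shown in this practical depending on the random folds).
#compile mean AUC values from the four cross-validated models
data.frame(model=c("bioclim","glm","maxnet","randomForest"),
           meanAUC=c(mean(aucBIO), mean(aucGLM), mean(aucMAX), mean(aucRF)))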
For now, proceed to this week’s independent exercise.
Week 3 Exercise
Use the gbif() function to download records for Solanum acaule and generate species distribution models using the maxnet() and glm() approaches. Use the same environmental data and approach to background sampling as used in this practical. Evaluate your models using a k-fold approach based on 5 folds. Select the best performing model and create 2 mapped outputs - 1. a probability plot of your predictions and 2. a presence/absence map based on a sensible threshold.