In this week's practical we will take our first steps into species distribution modelling. We will see that there are multiple options for modelling the distribution of species and that all depend on point data and the extraction of environmental covariate values at those points.
First, let's set the working directory and install the packages that we will use for today's work (notice the use of the c() function to pass multiple package names at once):
setwd("P:/GEOG70922/Week4")
install.packages(c("dismo","maptools","glmnet","maxnet","raster","sp","randomForest"))
We will be using raster layers from the WorldClim dataset of environmental variables that come ready to use once the dismo package is installed. Information on these layers can be found at http://worldclim.org/bioclim; follow this link now to see how the different layer names are coded. The 'biome' layer included here comes from the 'Terrestrial Biome' dataset by the WWF. You can get more information on this layer here: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world
For now let's load these layers, name them and create a raster stack object (a raster made of multiple bands) before plotting them along with some world map data to see where in the world we are this week. These files are inside the 'ex' folder of the dismo package and all have the extension '.grd'. We can access them using the list.files() function, concatenating the file paths with the paste() function (a short illustration of paste() follows the file listing below). Type '?paste' into the command line for more information on this function. This is how they appear when called on my own machine:
library(dismo)
## Loading required package: raster
## Loading required package: sp
files <- list.files(path=paste(system.file(package="dismo"),
'/ex', sep=''), pattern='grd', full.names=TRUE )
files
## [1] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio1.grd"
## [2] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio12.grd"
## [3] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio16.grd"
## [4] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio17.grd"
## [5] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio5.grd"
## [6] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio6.grd"
## [7] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio7.grd"
## [8] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/bio8.grd"
## [9] "\\\\nask.man.ac.uk/home$/Documents/R/win-library/4.0/dismo/ex/biome.grd"
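As a quick aside, paste() simply glues character strings together, with sep controlling what goes between them (the path below is purely illustrative):

#Illustration only: paste() concatenates strings; sep='' joins them with nothing in between
paste("C:/some/folder", "/ex", sep='')
#returns "C:/some/folder/ex"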
Now let’s use the stack() function to create a multi-band raster and add names to our bands.
env<-stack(files[1:8])
#use the names() function to label each covariate layer
names(env) <- c("MeanTemp.","AnnualPrecip.","Precip.WettestQuarter","PrecipDriestQuarter","Max.Temp.","Min.Temp.","Temp.Range","MeanTemp.WettestQuarter")
plot(env)
library(maptools)
## Checking rgeos availability: TRUE
data(wrld_simpl)
#And now plot:
# first layer of the RasterStack
plot(env, 1)
# note the "add=TRUE" argument with plot
plot(wrld_simpl,add=TRUE)
Now we need some species data. Rather than providing our own data we will use the gbif() function to download species record data directly from the online portal of the Global Biodiversity Information Facility [www.gbif.org]. Of course it is useful and enjoyable to explore species records on the GBIF website, but if we are looking for data on a specific species the gbif() function provides us with instant access and even gives us the option of downloading the data as a SpatialPoints object directly into R.
Today we will practise our species modelling techniques on a well-known dataset for this purpose - point occurrence data for Bradypus variegatus, a mammal species native to Central and South America.
bradypus <- gbif("Bradypus", "variegatus*",sp=TRUE)
## 5281 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5281 records downloaded
Our bradypus points do not yet have a CRS assigned, but as they are in lat/lon format - the same as the wrld_simpl data - we can plot them over our WorldClim layers and the wrld_simpl data (all in the WGS84 CRS).
plot(env, 1)
plot(wrld_simpl,add=TRUE)
plot(bradypus,add=TRUE,col='red')
Now, we know that Bradypus variegatus is a New World species (we aren't expecting it to show up in Europe or Asia), so let's subset the wrld_simpl data to give only South and Central America.
First let's look at the attribute table of the wrld_simpl object:
View(wrld_simpl@data)
Looks like we want SUBREGIONS 5 and 13 from the data. We use the subset() function to select entries in the wrld_simpl data SUBREGION column that are 5 or 13. Note the use of the OR operator ‘|’ and the ‘==’ signifying ‘is equal to’ (as in the Python language).
SA<-subset(wrld_simpl,wrld_simpl$SUBREGION==5|wrld_simpl$SUBREGION==13)
plot(SA)
Next we'll set the Bradypus location data to the same CRS as the world map and subset our points to make sure they are all within the boundary of the SA object (i.e. to remove any erroneous records with incorrect coordinates). Here we take advantage of the spatial extension of the square bracket '[]' notation. We can very conveniently subset spatial data falling within another spatial data layer using '[]' without the need for an overlay function, as would be required in a desktop GIS (and in previous versions of R). This simple trick is a great strength of R for spatial data manipulation. Note that we renew the object 'bradypus' by re-creating it as a subset of the existing 'bradypus' object (i.e. 'bradypus <- bradypus[SA,]'). If we wanted to keep two versions of this object we could have given the subsetted data a different name (such as 'bradypus_sub'), leaving us with both 'bradypus' and 'bradypus_sub' objects. However, as we will only be working with this subsetted version, it makes sense just to keep the one object.
#for completeness set the bradypus crs to that of the SA object created above.
crs(bradypus)<-crs(SA)
#subset our points to only those inside SA
bradypus<-bradypus[SA, ]
plot(SA)
plot(bradypus, add=TRUE)
Now we have the presence data and environmental covariates necessary to fit a presence-only species distribution model. However, we will also carry out the analysis using algorithms that take presence and background points, so let's create some in a similar way to Week 2, using the randomPoints() function from the dismo package.
set.seed(21)
#create random points on cells of the 'env' object within the extent of bradypus and avoiding cells containing points from bradypus
back.xy <- randomPoints(env, n=5000,p= bradypus,ext = extent(bradypus))
#Create Spatial Points layer from back.xy
back<-SpatialPoints(back.xy,crs(bradypus))
#make sure crs matches other data
crs(back)<-crs(SA)
#subset to SA object
back<-back[SA,]
plot(SA)
plot(bradypus,add=TRUE)
plot(back,add=TRUE, col='red')
Now we’ll extract information from our ‘env’ raster stack to both the presence and background points, add a column denoting presence (‘1’) and absence (‘0’) and combine them into a data frame for use in further analysis.
#absence values
eA<-extract(env,back)
#presence values
eP<-extract(env,bradypus)
#create data frames from values
Pres.cov<-data.frame(eP,Pres=1)
Back.cov<-data.frame(eA,Pres=0)
#combine
all.cov<-rbind(Pres.cov,Back.cov)
head(all.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
## 1 267 1912 851 54 322
## 2 255 3056 1017 488 299
## 3 216 3285 1218 300 277
## 4 261 2791 1077 148 310
## 5 259 2876 1119 163 315
## 6 255 2539 1045 116 307
## Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 1 223 99 262 1
## 2 204 95 257 1
## 3 160 117 215 1
## 4 218 92 254 1
## 5 214 101 253 1
## 6 207 100 249 1
tail(all.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter
## 8764 253 2484 723 542
## 8765 207 1262 595 122
## 8766 187 2652 952 329
## 8767 244 1148 737 25
## 8768 270 1348 799 35
## 8769 250 1436 674 59
## Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
## 8764 315 198 116 253 0
## 8765 294 97 197 235 0
## 8766 255 125 130 184 0
## 8767 304 181 123 252 0
## 8768 341 204 137 263 0
## 8769 332 156 177 256 0
summary(all.cov)
## MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter
## Min. : 17.0 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.:235.0 1st Qu.:1496 1st Qu.: 662.0 1st Qu.: 68.0
## Median :256.0 Median :2226 Median : 889.0 Median : 148.0
## Mean :243.3 Mean :2275 Mean : 914.3 Mean : 198.7
## 3rd Qu.:264.0 3rd Qu.:2876 3rd Qu.:1136.0 3rd Qu.: 300.0
## Max. :288.0 Max. :7682 Max. :2458.0 Max. :1496.0
## Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
## Min. :100.0 Min. :-128.0 Min. : 77.0 Min. : 35.0
## 1st Qu.:305.0 1st Qu.: 158.0 1st Qu.:112.0 1st Qu.:238.0
## Median :319.0 Median : 190.0 Median :128.0 Median :254.0
## Mean :312.8 Mean : 173.9 Mean :138.8 Mean :245.1
## 3rd Qu.:336.0 3rd Qu.: 210.0 3rd Qu.:162.0 3rd Qu.:262.0
## Max. :363.0 Max. : 231.0 Max. :325.0 Max. :290.0
## Pres
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4875
## 3rd Qu.:1.0000
## Max. :1.0000
Now let's create our first model. The bioclim() function fits a distribution model based on the climate envelope approach and takes presence data only. We therefore specify the model using only the Pres.cov data frame and, for now, we will use the covariates MeanTemp., Temp.Range and AnnualPrecip.:
bc_bradypus <- bioclim(Pres.cov[,c('MeanTemp.', 'Temp.Range', 'AnnualPrecip.')])
The bioclim() algorithm assesses the suitability of a location by comparing the values of environmental variables at that location to the distribution of values at known presence locations. In general, the closer to the 50th percentile (the median), the more suitable the location is.
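To make the percentile idea concrete, here is a minimal sketch of a percentile-style score for a single covariate (a simplification for illustration only, not the exact dismo implementation):

#Illustration only: a simplified percentile-style suitability score for one covariate
suit_1d <- function(x, presence_values){
  p <- ecdf(presence_values)(x) #empirical percentile of x among the presence values
  score <- 1 - 2*abs(p - 0.5)   #values at the median (p = 0.5) score highest
  ifelse(x < min(presence_values) | x > max(presence_values), 0, score)
}
#e.g. score an annual precipitation of 2000 mm against the presence data
suit_1d(2000, Pres.cov$AnnualPrecip.)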
We can map our prediction using the predict() function, supplying our covariate raster stack as the object and bc_bradypus as the model. Note that for a bioclim() model the predicted values range from 0 to 1: a value of 1 corresponds to locations with the median value of every covariate considered, while a value of zero reflects locations where at least one covariate falls outside the range observed at the presence locations.
bioclim.map <- predict(env, bc_bradypus)
plot(bioclim.map, axes = F, box = F, main = "bioclim: MeanTemp.,Temp.Range,AnnualPrecip.")
Now we'll repeat our bioclim model but using all predictors (i.e. all covariate columns in Pres.cov).
bc_bradypus2<-bioclim(Pres.cov[,1:8])
bioclim.map2 <- predict(env, bc_bradypus2)
plot(bioclim.map2, axes = F, box = F, main = "bioclimAll")
We can see that, as expected, the two models are very similar, though given its relatively higher predicted values and smaller number of predictors we might suspect that the first model (bc_bradypus) is over-predicting our species' occurrence.
We can also use the response() function to give the response plots for each covariate.
response(bc_bradypus2)
The response plots show, for each covariate, the values at which the predicted suitability is highest. It might help to think of these plots as being similar in principle to a histogram, with covariate values on the X axis and the proportion of presence points on the Y axis. For most of our covariates, the relationship appears to be non-linear in the presence-only bioclim model.
Next we'll look at fitting some models that accept presence and (pseudo-)absence data. We will use the glm() function (glm stands for 'generalized linear model') and set the family of functions to 'binomial' (i.e. a logistic regression) as our outcome is binary. We set the first argument (the response variable) as 'Pres' and separate this from our predictor variables with the tilde symbol '~'. We can opt to include all variables (i.e. the remaining columns in the data frame) simply by placing a full stop '.' after the tilde symbol.
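For illustration only, the '.' shorthand is equivalent to writing every covariate out in full, so the model below is identical to the one we fit next:

#Illustration only: Pres~. expands to all remaining columns in all.cov
glm.full<-glm(Pres~MeanTemp.+AnnualPrecip.+Precip.WettestQuarter+PrecipDriestQuarter+
                Max.Temp.+Min.Temp.+Temp.Range+MeanTemp.WettestQuarter,
              binomial(link='logit'),data=all.cov)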
#Specify model
glm.bradypus<-glm(Pres~.,binomial(link='logit'), data=all.cov)
#get summary of outputs
summary(glm.bradypus)
##
## Call:
## glm(formula = Pres ~ ., family = binomial(link = "logit"), data = all.cov)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2670 -0.5965 -0.1083 0.6302 3.1373
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.6412845 0.4191230 18.232 < 2e-16 ***
## MeanTemp. -0.0612713 0.0121192 -5.056 4.29e-07 ***
## AnnualPrecip. 0.0048583 0.0002050 23.696 < 2e-16 ***
## Precip.WettestQuarter -0.0070055 0.0004459 -15.710 < 2e-16 ***
## PrecipDriestQuarter -0.0111721 0.0004628 -24.139 < 2e-16 ***
## Max.Temp. 0.5524442 0.0609333 9.066 < 2e-16 ***
## Min.Temp. -0.5830398 0.0616546 -9.457 < 2e-16 ***
## Temp.Range -0.6247278 0.0613048 -10.191 < 2e-16 ***
## MeanTemp.WettestQuarter 0.0817452 0.0056363 14.503 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12150.9 on 8768 degrees of freedom
## Residual deviance: 7082.1 on 8760 degrees of freedom
## AIC: 7100.1
##
## Number of Fisher Scoring iterations: 5
The key statistics provided by the summary() function for the glm model are the coefficient values estimated for each predictor variable and the AIC (Akaike Information Criterion). For the purpose of model comparison, a lower AIC denotes better model fit.
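As a quick illustration of using AIC for model comparison, we could fit a reduced model (the choice of covariates here is arbitrary) and compare the two with the AIC() function:

#Illustration only: compare the full glm with an arbitrary reduced model
glm.reduced<-glm(Pres~MeanTemp.+AnnualPrecip.+Temp.Range,binomial(link='logit'),data=all.cov)
AIC(glm.bradypus,glm.reduced)#the lower of the two AIC values indicates the better fit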
We can map the probability predictions of our glm.bradypus output in the same way as we did for the bioclim model.
#Use the predict function to estimate probabilities for our study area based on the raster stack object 'env'.
glm.map <- predict(env, glm.bradypus, type = "response")
#plot the map
plot(glm.map,main="glmBradypus")
glm.map
## class : RasterLayer
## dimensions : 192, 186, 35712 (nrow, ncol, ncell)
## resolution : 0.5, 0.5 (x, y)
## extent : -125, -32, -56, 40 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : memory
## names : layer
## values : 1.637726e-11, 0.9993546 (min, max)
Now we will look at the application of a Maxent algorithm to our data. Maxent is a popular species distribution modelling algorithm, partly due to the open source package that can be freely downloaded and used "out of the box". It can also be used within the R environment where, if the program is saved inside the dismo package folder in the R directory, we can call maxent() from the dismo library in the same way as we have already called bioclim() and glm() from their respective libraries. However, this can be problematic if working on a computer without admin rights, and in any case there is a way to implement a Maxent-style model entirely within R. The maxnet package (which builds on glmnet) contains a function maxnet() that allows us to do this.
For maxnet(), we provide a vector of presence/background values (1s and 0s) followed by the columns of our data frame containing the predictors and, in this case, we choose "lq" (meaning linear and quadratic) from the list of available feature classes for model estimation.
In truth, the maxnet() function approximates the Maxent formula using an "infinitely weighted logistic regression" - an inhomogeneous point process (IPP) approach carried out with the glmnet package (see Fithian and Hastie, 2013). This is why we also installed glmnet at the start of the practical. As you can see from the topics covered in the lecture and suggested reading for this week, there is a strong theoretical basis for using point process models in ecology. Here we pass all variables in our all.cov data frame as predictors and use the maxnet.formula() function to give variable names as arguments. This is so that we can create response curves later on.
library(maxnet)
##
## Attaching package: 'maxnet'
## The following object is masked _by_ '.GlobalEnv':
##
## bradypus
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-1
maxnetMod <- maxnet(all.cov$Pres, all.cov[,1:8],
                    maxnet.formula(all.cov$Pres, all.cov[,c("AnnualPrecip.", "Precip.WettestQuarter", "PrecipDriestQuarter", "Max.Temp.", "Temp.Range","MeanTemp.")],
                                   classes="lq")) # runs a presence-background Maxent-style model using the "linear" and "quadratic" feature classes (classes="lq")
To transform the raw model output values (an exponential model) to a 0 to 1 scale for mapping, we have several options. The default option in the original Maxent software was the "logistic" output for plotting response curves, though recent work comparing Maxent to an IPP formulation suggests that a cloglog (complementary log-log) transform is more suitable for mapping presence/absence predictions. Consequently, cloglog is now the default for the open source Maxent tool and we will plot it here.
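For reference, the cloglog transform maps a raw (link-scale) value onto the 0-1 interval as 1 - exp(-exp(x)). A quick sketch of its shape (ignoring the entropy offset that maxnet adds internally when making cloglog predictions):

#Sketch only: the shape of the complementary log-log transform
cloglog<-function(x) 1-exp(-exp(x))
curve(cloglog(x),from=-6,to=2,xlab="raw model output",ylab="cloglog scale")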
First we plot response curves showing the relationship between some of our variables and predicted suitability. We use the response.plot() function from the maxnet package to do this. The function takes six arguments: 1. the model, 2. the variable to be plotted, 3. the variable mean (note we take this from the model itself but, as it is stored in a list - which won't do for plotting - we have to unlist this value), 4. the variable minimum value, 5. the variable maximum value and 6. the transformation applied to the raw output. You can plot the other variables simply by swapping out the variable name.
response.plot(maxnetMod,"AnnualPrecip.", mm=unlist(maxnetMod$samplemeans), #set model, variable and mean values
min=maxnetMod$varmin["AnnualPrecip."], max=maxnetMod$varmax["AnnualPrecip."], #set min/max values
type="cloglog") #choose transform type
response.plot(maxnetMod,"Temp.Range", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["Temp.Range"],
max=maxnetMod$varmax["Temp.Range"],
type="cloglog")
response.plot(maxnetMod,"MeanTemp.", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["MeanTemp."],
max=maxnetMod$varmax["MeanTemp."],
type="cloglog")
Now let’s plot the prediction map.
maxnet.cloglog.map <- predict(env, maxnetMod, clamp=F, type="cloglog")
plot(maxnet.cloglog.map, axes=F, box=F, main="Maxnet-clog")
Model Validation
An important part of species distribution modelling is model validation and this is a subject we will come back to repeatedly throughout this and next week’s practical. Although we briefly looked at the usefulness of the AIC statistic for comparing models, the AIC does not give us any idea about the predictive power of our model. To evaluate this, we need to use information provided in the confusion matrix, as introduced in the lecture.
The most readily available function for generating a confusion matrix and returning associated statistics (before we create our own later) is evaluate(), which comes with the dismo package. To use it, we supply three arguments: a data frame of presence points, a data frame of absence points and the model we wish to evaluate.
Let’s evaluate our presence-only model (bioclim) first.
#For evaluating the first bioclim model we only provide the MeanTemp., AnnualPrecip. and Temp.Range predictors, which are columns 1, 2 and 7 in the Pres.cov data frame
eBC <- evaluate(Pres.cov[,c(1,2,7)],Back.cov,bc_bradypus)
#For model two validation, provide all predictors
eBC2 <- evaluate(Pres.cov,Back.cov,bc_bradypus2)
#inspect both evaluation results
eBC
## class : ModelEvaluation
## n presences : 4275
## n absences : 4494
## AUC : 0.6829342
## cor : 0.2068077
## max TPR+TNR at : 0.02996082
eBC2
## class : ModelEvaluation
## n presences : 4275
## n absences : 4494
## AUC : 0.7108813
## cor : 0.1795251
## max TPR+TNR at : 0.02305789
The main outputs of interest here are the AUC and max TPR+TNR statistics. Recall from the lecture that AUC stands for Area Under the Curve and is a test of model performance where higher values indicate greater accuracy in our predictions. This statistic is often used as a mark of model performance and we will use it to assess the models we run today. Remember also that the max TPR+TNR denotes the probability threshold at which our model maximizes the sum of the True Positive Rate and the True Negative Rate. It is generally accepted that this is an optimum value at which to set the threshold for binary classification (i.e. 1/0 or presence/absence) in mapping outputs. We will use these values later when symbolising our final mapped predictions. For now, we can safely assume that our second bioclim model, with a higher AUC statistic, is the better performing model.
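To see where these statistics come from, we can build a confusion matrix by hand for the second bioclim model at an arbitrary threshold (0.02 is used here purely for illustration):

#Illustration only: a hand-built confusion matrix at an arbitrary threshold of 0.02
p.scores<-predict(bc_bradypus2,Pres.cov[,1:8])#model scores at presence points
a.scores<-predict(bc_bradypus2,Back.cov[,1:8])#model scores at background points
obs<-c(rep(1,length(p.scores)),rep(0,length(a.scores)))
pred<-as.numeric(c(p.scores,a.scores)>0.02)
table(observed=obs,predicted=pred)#TPR and TNR are calculated from counts like these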
We can plot the ROC (Receiver Operating Characteristic) curve to visualize these results:
plot(eBC,'ROC')
plot(eBC2,'ROC')
Now let’s evaluate our glm models.
#GLM evaluation
#Model with all predictors
eglmAll <- evaluate(all.cov[all.cov$Pres==1,],all.cov[all.cov$Pres==0,], glm.bradypus)
#inspect
eglmAll
## class : ModelEvaluation
## n presences : 4275
## n absences : 4494
## AUC : 0.8954222
## cor : 0.6739881
## max TPR+TNR at : 0.406345
plot(eglmAll,'ROC')
Finally we apply the evaluate() function to our maxnet model.
maxeval<-evaluate(p=bradypus, a=back, maxnetMod, env)
maxeval
## class : ModelEvaluation
## n presences : 4275
## n absences : 4494
## AUC : 0.8811115
## cor : 0.5927064
## max TPR+TNR at : -8.862094
plot(maxeval,'ROC')
So far the evaluation results suggest good performance by all of our models (high AUC), particularly the glm approach. However, we have almost certainly over-estimated the quality of our models given that we have tested them with the same data that was used to train them. A more robust approach is to train the model on a subset of the data and test (evaluate) it on the remaining cases. For example, we might use 70% of the data to train our model and the remaining 30% of cases to test its ability to correctly classify those cases (a short sketch of such a split is shown below). Although more robust, this approach only gives us a single attempt at model validation on which to base our assessment. To overcome this, the k-fold approach is commonly used, where the entire dataset is divided into folds (groups), say five; one fold is held out for model testing and the remaining four are used for model training. This is repeated in an iterative process until all folds have been used for model testing. We can then base our assessment on the mean values derived from all iterations. The kfold() function in dismo provides us with a neat way of partitioning our data.
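As an aside, a single 70/30 split like the one described above could be made as follows (a sketch only; we will use the k-fold approach below):

#Illustration only: a one-off 70/30 train/test split (not used in what follows)
set.seed(1)
idx<-sample(nrow(all.cov),size=round(0.7*nrow(all.cov)))
train70<-all.cov[idx,]
test30<-all.cov[-idx,]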
set.seed(5)
#set number of folds to use
folds=5
#partition presence and absence data into folds using the kfold() function.
kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)
#create an empty list to hold our results (remember there will be five sets)
bc_K<-list()
par(mfrow=c(2,3))
# for loop to iterate over folds
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]#for presence values, select all folds which are not 'i' to train the model
test <- Pres.cov[kfold_pres == i,]#the remaining fold is used to test the model
backTrain<-Back.cov[kfold_back!=i,]#now for background values, select training folds
backTest<-Back.cov[kfold_back==i,]#use the remainder for model testing
dataTrain<-rbind(train,backTrain)#bind presence and background training data together
dataTest<-rbind(test,backTest)#bind test data together
bc_bradypus_K<-bioclim(train[,1:8])#for bioclim we only need presence data to train the model!
bc_K[[i]] <- evaluate(p=test[,1:8],a=backTest[,1:8], bc_bradypus_K)#use testing data (kfold==i) for model evaluation
#check the AUC by plotting ROC values
plot(bc_K[[i]],'ROC')
}
#inspect
bc_K
## [[1]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.7070019
## cor : 0.1694754
## max TPR+TNR at : 0.03177135
##
## [[2]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.7206376
## cor : 0.2035535
## max TPR+TNR at : 0.02153743
##
## [[3]]
## class : ModelEvaluation
## n presences : 855
## n absences : 898
## AUC : 0.721698
## cor : 0.198873
## max TPR+TNR at : 0.02387661
##
## [[4]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.694517
## cor : 0.1526417
## max TPR+TNR at : 0.0124731
##
## [[5]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.7032226
## cor : 0.1707208
## max TPR+TNR at : 0.03381813
Next we use the slot() function to access the elements in the list bc_K that are of use to us. In this case we want to return all AUC values from our 5 evaluation runs. We 'loop' through all elements in the list using the sapply() function (remember this is similar to a for loop but applies a function to each element of a list or vector). Effectively, the code below iterates over each element in the list bc_K and applies a function that selects the 'auc' statistic from the corresponding slot in the model results. Slots are basically attributes of objects in R. Here bc_K is a list of objects of class ModelEvaluation, so we iterate over each one and extract its 'auc' slot. You can type 'help(slot)' into the console for more information on slots.
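For example, the following two lines read the same 'auc' slot from the first evaluation object:

#slot() and the '@' operator are equivalent ways of reading an S4 slot
slot(bc_K[[1]],'auc')
bc_K[[1]]@auc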
Get model evaluation statistics:
aucBIO <- sapply( bc_K, function(x){slot(x, 'auc')} )
#calculate the mean value for comparison with other models
mean(aucBIO)
## [1] 0.7094154
In order to map the results of our model we need to set a threshold over which locations (cells on the map) are allocated 1 and, beneath which, 0. One way to determine this threshold is to find the model output value at which the sum of the TPR (true positive rate) and TNR (true negative rate) is maximized. Remember we need both of these to be high in order to ensure a successful model.
We use the sapply() function to iterate over each item in our bc_K list and select the 't' value (the slot of each ModelEvaluation object that holds the candidate thresholds) for which the corresponding value of TPR+TNR is highest (hence we use 'which.max').
Opt_BIO<-sapply( bc_K, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_BIO
## [1] 0.03177135 0.02153743 0.02387661 0.01247310 0.03381813
We will use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them.
#take the mean of the OptBIO list for application in our predictions
Mean_OptBIO<- mean(Opt_BIO)
Mean_OptBIO
## [1] 0.02469532
Now we just need to make a prediction based on our environmental raster stack and our bioclim model, setting 1 and 0 (presence/absence) according to the above threshold.
prBIO <- predict(env, bc_bradypus2,type = "response")
par(mfrow=c(1,2))
plot(prBIO, main='BIO_K')
plot(wrld_simpl, add=TRUE, border='dark grey')
plot(prBIO > Mean_OptBIO, main='presence/absence')
plot(wrld_simpl,add=TRUE, border ='dark grey')
Now we’ll do exactly the same for our GLM model, only remember that this algorithm takes presence and absence data so we’ll modify our for loop slightly.
#create an empty list to hold our results (remember there will be five sets)
eGLM<-list()
par(mfrow=c(2,3))
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
glm_eval <- glm(Pres~.,binomial(link = "logit"), data=dataTrain)#this is our glm model trained on presence and absence points
eGLM[[i]] <- evaluate(p=dataTest[ which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),], glm_eval)#use testing data (kfold==i) for model evaluation
#check the AUC by plotting ROC values
plot(eGLM[[i]],'ROC')
}
#inspect
eGLM
## [[1]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.8880784
## cor : 0.6614947
## max TPR+TNR at : 0.4168784
##
## [[2]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.902241
## cor : 0.6854362
## max TPR+TNR at : -0.7104107
##
## [[3]]
## class : ModelEvaluation
## n presences : 855
## n absences : 898
## AUC : 0.8963519
## cor : 0.6773006
## max TPR+TNR at : -0.6960228
##
## [[4]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.8910602
## cor : 0.6691408
## max TPR+TNR at : 0.495993
##
## [[5]]
## class : ModelEvaluation
## n presences : 855
## n absences : 899
## AUC : 0.8956631
## cor : 0.6721622
## max TPR+TNR at : -0.5998866
Get model evaluation statistics:
aucGLM <- sapply( eGLM, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucGLM)
## [1] 0.8946789
Again we use the sapply() function to iterate over each item in our eGLM list and select the 't' value (the slot of each ModelEvaluation object that holds the candidate thresholds) for which the corresponding value of TPR+TNR is highest (hence we use 'which.max').
Opt_GLM<-sapply( eGLM, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_GLM
## [1] 0.4168784 -0.7104107 -0.6960228 0.4959930 -0.5998866
We can now use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them. As the glm thresholds are on the logit (log-odds) scale, we can back-transform these values to approximate probabilities using the plogis() function like so:
#take the mean to be applied to our predictions
Mean_OptGLM<- mean(Opt_GLM)
trGLM<-plogis(Mean_OptGLM)
trGLM
## [1] 0.4455444
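As a quick check of what plogis() has done (it is the inverse logit, 1/(1 + exp(-x))):

#plogis() is the inverse logit transform
plogis(Mean_OptGLM)
1/(1+exp(-Mean_OptGLM))#same value as trGLM above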
We now have a value with which to set the threshold for our presence/absence maps.
We use the predict() function to estimate our output based on the 'env' raster stack and our glm.bradypus model. For plotting purposes we then classify the output into areas above our threshold (presence) and areas below it (absence).
prGLM <- predict(env, glm.bradypus,type = "response")
par(mfrow=c(1,2))
plot(prGLM, main='GLM, regression')
plot(wrld_simpl, add=TRUE, border='dark grey')
plot(prGLM > trGLM, main='presence/absence')
plot(wrld_simpl,add=TRUE, border ='dark grey')
We can perform the same evaluation process on our maxnet model (we have already carried out the k-fold validation for the lesser-performing bioclim model above).
par(mfrow=c(2,3))
eMAX<-list()
folds=5
kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
maxnet_eval <- maxnet(dataTrain$Pres, dataTrain[,1:8])
eMAX[[i]] <- evaluate(p=dataTest[which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),], maxnet_eval)
plot(eMAX[[i]],'ROC')
}
Again, we can access our AUC and MaxTPR+TNR statistics for this latest model (for reference, the equivalent AUC evaluation for the same analysis using the stand-alone Maxent package was 0.901).
aucMAX <- sapply( eMAX, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
aucMAX
## [1] 0.9295930 0.9353252 0.9313159 0.9210845 0.9243370
mean(aucMAX)
## [1] 0.9283311
#Get maxTPR+TNR for the maxnet model
Opt_MAX<-sapply( eMAX, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_MAX
## [1] -8.925871 -8.930440 -8.918266 -8.853102 -8.936865
Mean_OptMAX<-mean(Opt_MAX)
Mean_OptMAX
## [1] -8.912909
The threshold values for the maxnet model relate to the raw outputs (prior to transformation). We can achieve our presence/absence map simply by applying our threshold to these raw values. To do so, we map the raw results of the model (i.e. this time we do not opt for a 'cloglog' or other transform) and then classify cells as presence or absence according to the threshold.
prMAX <- predict(env, maxnetMod)
par(mfrow=c(1,2))
plot(prMAX, main='Maxent Prediction')
plot(wrld_simpl, add=TRUE, border='dark grey')
plot(prMAX > Mean_OptMAX, main='presence/absence')
plot(wrld_simpl,add=TRUE, border ='dark grey')
Finally we will look at how to fit a random forest model to the data using the randomForest package. The randomForest model has two main tuning parameters: mtry and ntree. mtry is the number of predictor variables tried at each split and ntree specifies the number of trees grown; 500 is the default value for ntree and will suffice for the size of our data. Model performance is thought to be more sensitive to mtry than ntree, and mtry can be tuned with the tuneRF function:
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
tuneRF(x=all.cov[,1:8],y=as.factor(all.cov$Pres))
## mtry = 2 OOB error = 2.16%
## Searching left ...
## mtry = 1 OOB error = 1.81%
## 0.1587302 0.05
## Searching right ...
## mtry = 4 OOB error = 2.33%
## -0.2830189 0.05
## mtry OOBError
## 1.OOB 1 0.01813206
## 2.OOB 2 0.02155320
## 4.OOB 4 0.02326377
The out-of-bag (OOB) error statistic is lowest with an mtry of 1 so we go with this for fitting our model:
rf.Bradypus<-randomForest(as.factor(Pres)~.,mtry=1,ntree=500,data=all.cov)
Although randomForest does not produce variable coefficients as in the case of glm(), we can use the varImpPlot() function to get a measure of variable importance (the contribution that each variable makes to the discrimination of classes in the final model). Higher values of the Gini index represent greater impurity (a greater probability that cases will be wrongly classified), so variable importance is judged by the decrease in the Gini index achieved by including that variable in the model. Note that the rank of importance does not match closely the strength of association implied by the coefficients in the glm model.
varImpPlot(rf.Bradypus)
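If you want the numbers behind the plot, the importance() function returns them as a table (MeanDecreaseGini for a classification forest like this one):

#variable importance scores as a table rather than a plot
importance(rf.Bradypus)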
To evaluate our random forest model we can still use the evaluate() function from the dismo package, but in this case we first make a prediction using the model output. This is because random forest, as used here, is a classifier and so we have to give it the response variable as a factor; factors, however, don't work well with ROC calculations, so we use predict() first to obtain numeric probabilities before passing these to the evaluate() function. Note that calling predict() on a randomForest object without supplying new data returns the out-of-bag predictions for the training cases, which is why the evaluation inside the loop below is based on dataTrain.
#create an empty list to hold our results (remember there will be five sets)
eRF<-list()
par(mfrow=c(2,3))
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
RF_eval <- randomForest(as.factor(Pres)~., data=dataTrain)#this is our RF model
rf.pred <- predict(RF_eval, type="prob")[,2]#make prediction
eRF[[i]]<-evaluate(p = rf.pred[which(dataTrain$Pres == "1")], #set p to be 1s from the dataTrain object
a = rf.pred[which(dataTrain$Pres == "0")])# set a to be 0s from the dataTrain object
#check the AUC by plotting ROC values
plot(eRF[[i]],'ROC')
}
#inspect
eRF
## [[1]]
## class : ModelEvaluation
## n presences : 3420
## n absences : 3595
## AUC : 0.992676
## cor : 0.9559282
## max TPR+TNR at : 0.5977516
##
## [[2]]
## class : ModelEvaluation
## n presences : 3420
## n absences : 3595
## AUC : 0.9925513
## cor : 0.95577
## max TPR+TNR at : 0.5992547
##
## [[3]]
## class : ModelEvaluation
## n presences : 3420
## n absences : 3596
## AUC : 0.9936993
## cor : 0.9595103
## max TPR+TNR at : 0.621003
##
## [[4]]
## class : ModelEvaluation
## n presences : 3420
## n absences : 3595
## AUC : 0.9917957
## cor : 0.9545886
## max TPR+TNR at : 0.6169161
##
## [[5]]
## class : ModelEvaluation
## n presences : 3420
## n absences : 3595
## AUC : 0.9920844
## cor : 0.9554746
## max TPR+TNR at : 0.5918168
Get model evaluation statistics:
aucRF <- sapply( eRF, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucRF)
## [1] 0.9925613
#Get maxTPR+TNR for the Random Forest model
Opt_RF<-sapply( eRF, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_RF
## [1] 0.5977516 0.5992547 0.6210030 0.6169161 0.5918168
Mean_OptRF<-mean(Opt_RF)
Mean_OptRF
## [1] 0.6053485
We can predict and map our distribution estimate as for the other models, but note here that we have to add "index=2" as an argument in the predict() function. Remember that random forest is a classifier, which means it will produce outcomes of 0 and 1 automatically. Setting index=2 asks the predict function to return the probability that a cell equals 1 (which is the second column in the table that the predict function creates - hence index=2).
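To see this for yourself, running the prediction on the data frame shows the two class-probability columns; column 2 holds the probability of presence:

#Illustration: type="prob" returns one probability column per class
head(predict(rf.Bradypus,all.cov,type="prob"))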
prRF <- predict(env, rf.Bradypus,type="prob",index=2)
par(mfrow=c(1,2))
plot(prRF, main='Random Forest Prediction')
plot(wrld_simpl, add=TRUE, border='dark grey')
plot(prRF > Mean_OptRF, main='presence/absence')
plot(wrld_simpl,add=TRUE, border ='dark grey')
Now we have a clearer view of the quality of our models, with the random forest model performing better overall. Both the glm and maxnet approaches still demonstrated good predictive power even after the k-fold validation. Next week we will look at how this method of cross-validation may still be overly optimistic given that we have not accounted for spatial autocorrelation in our evaluation, and at methods for handling such problems.
For now, proceed to this week’s independent exercises.
Week 4 Exercise
Use the gbif() function to download records for Solanum acaule and generate species distribution models using the maxnet() and glm() approaches. Use the same environmental data and approach to background sampling as used in this practical. Evaluate your models using a k-fold approach based on 5 folds. Select the best performing model and create 2 mapped outputs - 1. a probability plot of your predictions and 2. a presence/absence map based on a sensible threshold.
Explain the importance of species distribution modelling in the context of nature conservation. Summarize the opportunities and challenges of applying species distribution models to species presence data (300 words).