In this week’s practical we will take our first step into species distribution modelling. We will see that there are multiple options for modelling the distribution of a species and that all depend on point data and the extraction of environmental factors at those points.
First, let’s set the working directory and install the packages that we will use for today’s work:
setwd("./GEOG71922/WeekTwo")Install the packages for today’s work (we will call the libraries on first use):
#Install packages
#dismo for downloading species data and evaluating models (also comes with the climatic data we will use for modelling)
#glmnet and maxnet for running Maxent models
#terra and sf you know (handling spatial data)
#randomForest - running random forest models
#rnaturalearth for calling world boundary and other miscellaneous geographical data
#dplyr for data manipulation
install.packages(c("dismo","glmnet","maxnet","terra","randomForest","sf","rnaturalearth","dplyr"))
Load the libraries for terra, sf and dismo:
library(terra)
Warning: package 'terra' was built under R version 4.5.2
terra 1.8.93
library(sf)
Warning: package 'sf' was built under R version 4.5.2
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(dismo)
Loading required package: raster
Loading required package: sp
We will be using raster layers from the Worldclim dataset of environmental variables that come ready-to-use once the dismo package is installed. Information on these layers can be found at http://worldclim.org/bioclim. Follow this link now to see how the different layer names are coded. The ‘biome’ layer included here comes from the ‘Terrestrial Biome’ dataset by the WWF. You can get more information on this layer here: https://www.worldwildlife.org/publications/terrestrial-ecoregions-of-the-world
For now let’s load these layers, name them and create a raster stack object (a raster made of multiple bands) before plotting them along with some world map data to see where in the world we are this week. These files are inside the ‘ex’ folder of the dismo package and all have the extension ‘.grd’. We can access these using the list.files() function and concatenating the file paths using the paste() function. Type ‘?paste’ into the command line for more information on this function. This is how they appear when called on my own machine:
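As a quick aside (a throwaway illustration, not part of the workflow), paste() simply glues strings together using the separator you supply:
#sep="" means no character is placed between the pieces; this returns "bio1.grd"
paste("bio", 1, ".grd", sep="")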
files <- list.files(path=paste(system.file(package="dismo"),
'/ex', sep=''), pattern='grd', full.names=TRUE )
files
[1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio1.grd"
[2] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio12.grd"
[3] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio16.grd"
[4] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio17.grd"
[5] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio5.grd"
[6] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio6.grd"
[7] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio7.grd"
[8] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/bio8.grd"
[9] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dismo/ex/biome.grd"
Next we create a raster object from these files and plot them.
#use the rast() function to create a "stack" of all the raster layers in the list of files we just made
env<-rast(files[1:8])
#check out the layers
plot(env)
Use the names() function to assign names to the different environmental layers:
names(env) <- c("MeanTemp.","AnnualPrecip.","Precip.WettestQuarter","PrecipDriestQuarter","Max.Temp.","Min.Temp.","Temp.Range","MeanTemp.WettestQuarter")
#plot again with new layer names
plot(env)
Next we retrieve some data on the Brown-throated Sloth (Bradypus variegatus) from the Global Biodiversity Information Facility (GBIF).
#use gbif() function to search and download sloth data sp=TRUE means return the data as points.
bradypus <- gbif("Bradypus", "variegatus*",sp=TRUE)
10221 records found
10221 records downloaded
Before going further, it is worth knowing about a handy tool for working interactively with spatial data in R - the ‘mapview’ library. It allows us to look at our data and pan around their extent against base maps, similar to how you would in QGIS or ArcGIS, and is a useful way of checking where in the world your records fall. For now, we will convert our sloth occurrences to a terra object and set their coordinate reference system before adding some geographical context.
#convert from an sp library (old) object to terra (newer/faster library) object
bradypus<-vect(bradypus)
#set the crs
crs(bradypus)<-"epsg:4326"
Now we’ll add some political boundaries for context. We’ll use a package to load Natural Earth data and subset it to our region of interest.
#call natural earth library to load in some national boundaries for N and S America
library(rnaturalearth)
Warning: package 'rnaturalearth' was built under R version 4.5.2
#create an object just with boundaries for North and South America
am<-rnaturalearth::ne_countries(continent = c("North America","South America"),returnclass = "sf")
Now subset the data for the parts of South and North America that are relevant for our data.
#check all available regions in our initial subset
unique(am$subregion)
[1] "Northern America" "South America"    "Caribbean"        "Central America"
Now subset the data to include only the regions we want. We use square brackets [] for subsetting, as in Week One, together with the | symbol, which simply means OR (note that & would mean AND, which would not return what we want here).
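As a quick sanity check (illustration only, not needed for the workflow), you can count how many rows each operator would return:
#a row's subregion cannot be two values at once, so & matches nothing
sum(am$subregion=="South America" & am$subregion=="Central America")
#| matches rows in either subregion, which is what we want
sum(am$subregion=="South America" | am$subregion=="Central America")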
#subset using square brackets and | ("or") operands
SA<-am[am$subregion=="South America"|am$subregion=="Central America",]
Project the raster object to the same coordinate system as the species data and check that everything lines up.
#check all plot together - but first project the env data to epsg:4326
env<-project(env,"epsg:4326",method="bilinear")
plot(env,1)
plot(SA$geometry,add=TRUE)
plot(bradypus, add=TRUE,col ="red")
Next, in order to build our dataset for modelling we need to generate some background (or “pseudo-absence”) points for our study area.
#create random points on cells of the 'env' object. This function also samples the raster values and returns them with the resulting object.
set.seed(11)
#sample background - one point for every cell (9775)
back <- spatSample(env,size=9775,as.points=TRUE,method="random",na.rm=TRUE)
#inspect
head(back)
  MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
1 250 1997 952 48 340
2 225 830 296 151 290
3 250 1495 576 227 312
4 270 2900 1381 81 326
5 265 1647 881 155 327
6 132 172 51 35 274
Min.Temp. Temp.Range MeanTemp.WettestQuarter
1 154 186 248
2 176 114 242
3 185 126 250
4 231 95 263
5 212 115 258
6 7 267 73
#and plot
plot(env,1)
#subset to SA object
#done quickest with sf
library(sf)
#convert to sf object
back<-st_as_sf(back)
#spatial subset using [,]
back<-back[SA,]
#check number of background points
nrow(back)
[1] 7074
#subset bradypus data to the same extent using the same method
bradypus=st_as_sf(bradypus)[SA,]
Now remove any duplicate geometries in the bradypus object using the distinct() function from the dplyr package.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:raster':
intersect, select, union
The following objects are masked from 'package:terra':
intersect, union
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
#keep only unique coordinates
bradypus= distinct(bradypus,geometry, .keep_all = TRUE)
Plot both objects together (select the geometry column or sf will try to plot one map for every variable in the data frame).
#background points
plot(back$geometry)
#presence points
plot(bradypus$geometry,col="red",add=T)
Next we use the extract() function to retrieve information on the point locations with respect to the climatic conditions.
# get environmental covariates at presence locations
eP<-extract(env,bradypus)
#create data frames from values
Pres.cov<-data.frame(eP,Pres=1)
#Remove the first column which is just an ID field.
Pres.cov<-Pres.cov[,-1]
#check the data
summary(Pres.cov)
 MeanTemp.      AnnualPrecip.  Precip.WettestQuarter PrecipDriestQuarter
Min. :122.0 Min. : 375 Min. : 215 Min. : 8
1st Qu.:235.0 1st Qu.:1912 1st Qu.: 812 1st Qu.: 143
Median :255.0 Median :2779 Median :1045 Median : 194
Mean :247.6 Mean :2685 Mean :1033 Mean : 256
3rd Qu.:261.0 3rd Qu.:3218 3rd Qu.:1215 3rd Qu.: 305
Max. :284.0 Max. :7682 Max. :2458 Max. :1496
Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
Min. :179.0 Min. : 49.0 Min. : 78.0 Min. :125.0
1st Qu.:303.0 1st Qu.:175.0 1st Qu.:108.0 1st Qu.:231.0
Median :313.0 Median :196.0 Median :116.0 Median :252.0
Mean :309.9 Mean :188.8 Mean :120.9 Mean :246.4
3rd Qu.:322.0 3rd Qu.:214.0 3rd Qu.:131.0 3rd Qu.:261.0
Max. :359.0 Max. :228.0 Max. :219.0 Max. :282.0
Pres
Min. :1
1st Qu.:1
Median :1
Mean :1
3rd Qu.:1
Max. :1
We have a few NAs which will cause issues later, so we remove them by selecting only rows for which !is.na() is TRUE, meaning IS NOT NA. We just choose one variable here - “MeanTemp.” - to test for NA status.
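An alternative worth knowing about (a sketch, equivalent here to the subset used below) is complete.cases(), which checks every column at once rather than a single variable:
#an equivalent one-liner (not run here): keep only rows with no NA in any column
#Pres.cov<-Pres.cov[complete.cases(Pres.cov),]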
#remove NAs subsetting with []
Pres.cov<-Pres.cov[!is.na(Pres.cov$MeanTemp.),]
#check for NAs using the summary function
summary(Pres.cov)
 MeanTemp.      AnnualPrecip.  Precip.WettestQuarter PrecipDriestQuarter
Min. :122.0 Min. : 375 Min. : 215 Min. : 8
1st Qu.:235.0 1st Qu.:1912 1st Qu.: 812 1st Qu.: 143
Median :255.0 Median :2779 Median :1045 Median : 194
Mean :247.6 Mean :2685 Mean :1033 Mean : 256
3rd Qu.:261.0 3rd Qu.:3218 3rd Qu.:1215 3rd Qu.: 305
Max. :284.0 Max. :7682 Max. :2458 Max. :1496
Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
Min. :179.0 Min. : 49.0 Min. : 78.0 Min. :125.0
1st Qu.:303.0 1st Qu.:175.0 1st Qu.:108.0 1st Qu.:231.0
Median :313.0 Median :196.0 Median :116.0 Median :252.0
Mean :309.9 Mean :188.8 Mean :120.9 Mean :246.4
3rd Qu.:322.0 3rd Qu.:214.0 3rd Qu.:131.0 3rd Qu.:261.0
Max. :359.0 Max. :228.0 Max. :219.0 Max. :282.0
Pres
Min. :1
1st Qu.:1
Median :1
Mean :1
3rd Qu.:1
Max. :1
Now create a data frame for background values (currently an sf object but for modelling we don’t need the geometry so we drop that here)
#drop geometry column using st_drop_geometry()
Back.cov<-data.frame(st_drop_geometry(back),Pres=0)
#inspect
head(Back.cov)
  MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
1 250 1997 952 48 340
5 265 1647 881 155 327
6 132 172 51 35 274
8 244 1936 937 43 333
10 262 1986 849 185 326
11 243 2745 1352 112 321
Min.Temp. Temp.Range MeanTemp.WettestQuarter Pres
1 154 186 248 0
5 212 115 258 0
6 7 267 73 0
8 149 184 243 0
10 201 124 256 0
11 174 147 238 0
Combine presence and absence data into one data frame using rbind(). This function combines a set of rows on top of another set of rows providing both sets have the same column names.
#combine
all.cov<-rbind(Pres.cov,Back.cov)
Now let’s create our first model. The bioclim() function fits a distribution model based on the envelope modelling approach and takes presence-only data. We therefore specify the model using only the Pres.cov data frame and, for now, we will use the covariates MeanTemp., Temp.Range and AnnualPrecip.
#bioclim model
bc_bradypus <- bioclim(Pres.cov[,c('MeanTemp.', 'Temp.Range', 'AnnualPrecip.')])
The bioclim() algorithm assesses the suitability of a location by comparing the values of environmental variables at that location to the distribution of values at known presence locations. In general, the closer to the 50th percentile (the median), the more suitable the location is.
We can map our prediction using the predict() function, using our covariate raster stack as the object and the bc_bradypus as the model. Note that for a bioclim() model the predicted values range from 0 to 1. It is scaled such that a value of 1 would be locations that have the median value of all covariates considered, while a value of zero would reflect locations where at least one covariate is outside the range of environmental covariates at presence locations.
#predict
bioclim.map <- predict(env, bc_bradypus)
#plot
plot(bioclim.map, axes = F, box = F, main = "bioclim: MeanTemp.,Temp.Range,AnnualPrecip.")
Now we’ll repeat our bioclim model but using all predictors (i.e. all columns of co-variates in Pres.cov).
#run bioclim with all variables
bc_bradypus2<-bioclim(Pres.cov[,1:8])
#inspect model details
bc_bradypus2
class    : Bioclim
variables: MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp. Min.Temp. Temp.Range MeanTemp.WettestQuarter
presence points: 6757
MeanTemp. AnnualPrecip. Precip.WettestQuarter PrecipDriestQuarter Max.Temp.
1 277 1224 521 46 341
2 235 2779 1191 154 305
3 216 3285 1218 300 277
4 216 3285 1218 300 277
5 177 1910 771 208 253
6 216 3285 1218 300 277
7 216 3285 1218 300 277
8 267 2015 628 365 325
9 177 1910 771 208 253
10 177 1910 771 208 253
Min.Temp. Temp.Range MeanTemp.WettestQuarter
1 212 130 275
2 175 129 231
3 160 117 215
4 160 117 215
5 90 162 204
6 160 117 215
7 160 117 215
8 210 115 265
9 90 162 204
10 90 162 204
(... ... ...)
Make a new prediction based on this model
bioclim.map2 <- predict(env, bc_bradypus2)
plot(bioclim.map2, axes = F, box = F, main = "bioclimAll")
We can see that our models are, as expected, very similar, though we might suspect that, given the relatively higher values and fewer predictors, the first model (bc_bradypus) may be over-predicting our species’ occurrence.
We can also use the response() function to give the response plots for each covariate.
#get response plots
response(bc_bradypus2)
The response plots show the values of each covariate at which the prediction is highest. For most of our covariates, the relationship appears to be non-linear in the presence-only bioclim model.
Next we’ll look at fitting some models that accept presence and (pseudo-)absence data. We will use the glm() function (glm stands for ‘generalized linear model’) and set the family of functions to ‘binomial’ as our outcome is binary. We set the first argument (the response variable) as ‘Pres’ and separate this from our predictor variables with the tilde symbol ‘~’. We can opt to include all variables (i.e. the remaining columns in the data frame) simply by placing a full stop ‘.’ after the tilde.
#Specify model
glm.bradypus<-glm(Pres~.,binomial(link='logit'), data=all.cov)
#get summary of outputs
summary(glm.bradypus)
Call:
glm(formula = Pres ~ ., family = binomial(link = "logit"), data = all.cov)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.5504504 0.3115500 17.816 < 2e-16 ***
MeanTemp. -0.0085469 0.0080427 -1.063 0.288
AnnualPrecip. 0.0044822 0.0001792 25.019 < 2e-16 ***
Precip.WettestQuarter -0.0064592 0.0003806 -16.973 < 2e-16 ***
PrecipDriestQuarter -0.0103928 0.0003996 -26.009 < 2e-16 ***
Max.Temp. 0.3265668 0.0459819 7.102 1.23e-12 ***
Min.Temp. -0.3922343 0.0465899 -8.419 < 2e-16 ***
Temp.Range -0.4131486 0.0462367 -8.936 < 2e-16 ***
MeanTemp.WettestQuarter 0.0736548 0.0039268 18.757 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 19167 on 13830 degrees of freedom
Residual deviance: 10937 on 13822 degrees of freedom
AIC: 10955
Number of Fisher Scoring iterations: 7
The statistics provided by the summary() function for the glm model are the coefficient values estimated for each predictor variable and the AIC (Akaike Information Criterion) value. For the purpose of model comparison, a lower AIC denotes better model fit.
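As a quick illustration of using AIC for comparison (a sketch only; the reduced set of predictors is chosen here purely as an example), we could fit a smaller model and compare the two AIC values directly:
#hypothetical reduced model with just three predictors
glm.reduced<-glm(Pres~MeanTemp.+AnnualPrecip.+Temp.Range,binomial(link='logit'),data=all.cov)
#lower AIC indicates the better-fitting model
AIC(glm.reduced,glm.bradypus)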
We can map the probability predictions of our glm.bradypus output in the same way as we did for the bioclim model.
#Use the predict function to estimate probabilities for our study area based on the raster stack object 'env'.
glm.map <- predict(env, glm.bradypus, type = "response")
#plot the map
plot(glm.map,main="glmBradypus")
Now we will look at the application of a Maxent algorithm to our data. Maxent is a popular species distribution modelling algorithm, partly due to the open-source software that can be freely downloaded and used “out of the box”. It can also be used within the R environment where, if the programme is saved inside the dismo package folder in the R directory, we can call maxent() from the dismo library in the same way as we have already called bioclim() and glm() from their respective libraries. However, this can be problematic if working on a computer without admin rights, and in any case there is a way to implement a Maxent model without outsourcing the algorithm from within R. The maxnet package (built on glmnet) contains a function maxnet() that allows us to do this.
For maxnet, we provide our presence data followed by the columns of our data frame containing predictors and, in this case, we choose “lq” (meaning linear and quadratic) from the list of available feature classes for model estimation.
In truth, the maxnet() function approximates the Maxent formulation using an “infinitely weighted logistic regression” - an inhomogeneous point process (IPP) approach carried out with the glmnet package (see the lecture material and more on this in Week 3).
Here we pass all variables in our all.cov data frame as predictors and use the maxnet.formula() function to give variable names as arguments. This is so that we can create response curves later on.
#load libraries
library(maxnet)
Attaching package: 'maxnet'
The following object is masked _by_ '.GlobalEnv':
bradypus
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-10
# run a presence-only Maxent model selecting feature "linear" and "quadratic" using the "lq" argument.
maxnetMod <- maxnet(all.cov$Pres, all.cov[,1:8],maxnet.formula(all.cov$Pres,all.cov[,c("AnnualPrecip.", "Precip.WettestQuarter", "PrecipDriestQuarter", "Max.Temp.", "Temp.Range","MeanTemp.","MeanTemp.WettestQuarter")],classes="lq"))
To transform the raw model output values (an exponential model) to a 0 to 1 probability scale for mapping, we have several options. The default option in the original Maxent software was type “logistic” for plotting response curves, though recent work comparing Maxent to an IPP formulation suggests that a cloglog (complementary log-log) transform is more suitable for mapping presence/absence predictions. Consequently, cloglog is now the default for the open-source Maxent tool and we will plot it here.
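If you would like to see how the two transforms differ in practice, here is a throwaway comparison (illustration only) on a few rows of our covariate data:
#compare the logistic and cloglog outputs for the first five rows of the covariate data
predict(maxnetMod,all.cov[1:5,1:8],type="logistic")
predict(maxnetMod,all.cov[1:5,1:8],type="cloglog")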
First we plot response curves showing the relationship between some of our variables and probability. We use the response.plot() function from the maxnet package to do this. The function takes six arguments: 1. the model, 2. the variable to be plotted, 3. the variable mean (note we take this from the model itself, but as it is stored in a list - which won’t do for plotting - we have to unlist this value), 4. the variable minimum value, 5. the variable maximum value and 6. the transformation applied to the raw data. You can plot the other variables too simply by swapping out the variable name.
response.plot(maxnetMod,"AnnualPrecip.", mm=unlist(maxnetMod$samplemeans), #set model, variable and mean values
min=maxnetMod$varmin["AnnualPrecip."], max=maxnetMod$varmax["AnnualPrecip."], #set min/max values
type="cloglog") #choose transform typeDo the same for Temp.range
response.plot(maxnetMod,"Temp.Range", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["Temp.Range"],
max=maxnetMod$varmax["Temp.Range"],
type="cloglog")MeanTemp.
response.plot(maxnetMod,"MeanTemp.", mm=unlist(maxnetMod$samplemeans),min=maxnetMod$varmin["MeanTemp."],
max=maxnetMod$varmax["MeanTemp."],
type="cloglog")Now let’s plot the prediction map.
maxnet.cloglog.map <- predict(env, maxnetMod, clamp=F, type="cloglog",na.rm=TRUE)
plot(maxnet.cloglog.map, axes=F, box=F, main="Maxnet-clog")
Model Validation
An important part of species distribution modelling is model validation and this is a subject we will come back to repeatedly throughout this and next week’s practical. Although we briefly looked at the usefulness of the AIC statistic for comparing models, the AIC does not give us any idea about the predictive power of our model. To evaluate this, we need to use information provided in the confusion matrix, as introduced in the lecture.
The most readily available function (before creating our own later) for generating a confusion matrix and returning associated statistics is the evaluate() function that comes with the dismo package. To use this function, we supply three arguments to the evaluate() function: a data frame of presence points, a data frame of absence points and the model we wish to evaluate.
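To see what evaluate() is doing under the hood, here is a minimal sketch (using the glm model and an arbitrary 0.5 threshold purely for illustration) of building a confusion matrix by hand:
#predicted probabilities for every presence/background record
pred.prob<-predict(glm.bradypus,all.cov,type="response")
#classify as presence (1) above the illustrative 0.5 threshold, otherwise absence (0)
pred.class<-ifelse(pred.prob>=0.5,1,0)
#cross-tabulate observed vs predicted: the four cells form the confusion matrix
table(observed=all.cov$Pres,predicted=pred.class)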
Let’s evaluate our presence-only model (bioclim) first.
#For evaluating the first bioclim model we only provide the MeanTemp., Temp.Range and AnnualPrecip. predictors, which are columns 1, 2 and 7 in the Pres.cov data frame
eBC <- evaluate(Pres.cov[,c(1,2,7)],Back.cov,bc_bradypus)
#For model two validation, provide all predictors
eBC2 <- evaluate(Pres.cov,Back.cov,bc_bradypus2)
#inspect both evaluation results
eBC
class          : ModelEvaluation
n presences : 6757
n absences : 7074
AUC : 0.7882081
cor : 0.380942
max TPR+TNR at : 0.02846297
eBC2
class          : ModelEvaluation
n presences : 6757
n absences : 7074
AUC : 0.8057731
cor : 0.3852813
max TPR+TNR at : 0.02446712
The main outputs of interest here are the AUC and max TPR+TNR statistics. Recall from the lecture that AUC stands for Area Under the Curve and is a test of model performance where higher values indicate greater accuracy in our predictions. This statistic is often used as a mark of model performance and we will use it to assess the models we run today. Remember also that max TPR+TNR denotes the probability threshold at which our model maximizes the True Positive Rate plus the True Negative Rate. It is generally accepted that this is an optimal value at which to set the threshold for binary classification (i.e. 1/0 or presence/absence) in mapped outputs. We will use these values later when symbolising our final mapped predictions. For now, we can safely assume that our second bioclim model, with a higher AUC statistic, is the better performing model.
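Incidentally, dismo also has a threshold() function that will return this cut-off (and several other commonly used ones) directly from a ModelEvaluation object; a quick sketch:
#'spec_sens' returns the threshold at which TPR+TNR (sensitivity plus specificity) is maximized
threshold(eBC2,stat='spec_sens')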
We can plot the ROC (Receiver Operating Curve) to visualize these results:
# plot ROC for eBC
plot(eBC,'ROC')
# plot ROC for eBC2
plot(eBC2,'ROC')
Now let’s evaluate our glm model.
#GLM evaluation
#Model with all predictors
eglmAll <- evaluate(all.cov[all.cov$Pres==1,],all.cov[all.cov$Pres==0,], glm.bradypus)
#inspect
eglmAll
class          : ModelEvaluation
n presences : 6757
n absences : 7074
AUC : 0.8956108
cor : 0.6049487
max TPR+TNR at : -0.0329718
Plot the ROC for the glm model
plot(eglmAll,'ROC')
Finally we apply the evaluate() function to our maxnet model.
maxeval<-evaluate(p=bradypus, a=back, maxnetMod, env)
maxeval
class          : ModelEvaluation
n presences : 6757
n absences : 7074
AUC : 0.9005798
cor : 0.611535
max TPR+TNR at : -9.034249
So far the evaluation results suggest good performance by all of our models (high AUC), particularly the glm approach. However, we have almost certainly over-estimated the quality of our models given that we have tested them with the same data that was used for training them. A more robust approach is to train the model on a subset of the data and test (evaluate) the model on the remaining cases. For example, we might use 70% of the data to train our model and use the remaining 30% of cases to test our model’s ability to correctly classify those cases. Although more robust, this approach only gives us a single attempt at model validation on which to base our assessment. To overcome this, the k-fold approach is commonly used, where the entire dataset is divided into folds (groups), say five, one fold is chosen for model testing and the remaining four are used for model training. This is then repeated in an iterative process until all folds have been used for model testing. We can then base our assessment on the mean values derived from all iterations. The kfold() function in dismo provides a neat way of partitioning our data.
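Before building the k-fold loop, here is a minimal sketch of the single 70/30 split described above (the proportion and seed are illustrative only, and this object is not used again below):
#illustrative single train/test split
set.seed(42)
train.idx<-sample(nrow(all.cov),size=round(0.7*nrow(all.cov)))
glm.70<-glm(Pres~.,binomial(link='logit'),data=all.cov[train.idx,])
test.30<-all.cov[-train.idx,]
#evaluate on the held-out 30%
evaluate(p=test.30[test.30$Pres==1,],a=test.30[test.30$Pres==0,],glm.70)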
#set number of folds to use
folds=5
#partition presence and absence data according to folds using the kfold() function.
kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)
#create an empty list to hold our results (remember there will be five sets)
bc_K<-list()
par(mfrow=c(2,3))
# for loop to iterate over folds
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]#for presence values, select all folds which are not 'i' to train the model
test <- Pres.cov[kfold_pres == i,]#the remaining fold is used to test the model
backTrain<-Back.cov[kfold_back!=i,]#now for background values, select training folds
backTest<-Back.cov[kfold_back==i,]#use the remainder for model testing
dataTrain<-rbind(train,backTrain)#bind presence and background training data together
dataTest<-rbind(test,backTest)#bind test data together
bc_bradypus_K<-bioclim(train[,1:8])#for bioclim we only need presence data to train the model!
bc_K[[i]] <- evaluate(p=test[,1:8],a=backTest[,1:8], bc_bradypus_K)#use testing data (kfold==i) for model evaluation
#check the AUC by plotting ROC values
plot(bc_K[[i]],'ROC')
}
#inspect
bc_K
[[1]]
class : ModelEvaluation
n presences : 1351
n absences : 1415
AUC : 0.8038163
cor : 0.3697648
max TPR+TNR at : 0.024937
[[2]]
class : ModelEvaluation
n presences : 1352
n absences : 1415
AUC : 0.8146052
cor : 0.4100813
max TPR+TNR at : 0.01519029
[[3]]
class : ModelEvaluation
n presences : 1351
n absences : 1414
AUC : 0.8032965
cor : 0.3844469
max TPR+TNR at : 0.02579715
[[4]]
class : ModelEvaluation
n presences : 1352
n absences : 1415
AUC : 0.7964918
cor : 0.3691173
max TPR+TNR at : 0.01562618
[[5]]
class : ModelEvaluation
n presences : 1351
n absences : 1415
AUC : 0.8078322
cor : 0.3913554
max TPR+TNR at : 0.02588298
Next we use the slot() function to access the elements in the list bc_K that are of use to us. In this case we want to return all auc values from our 5 evaluation runs. We ‘loop’ through all elements in the list using the sapply() function (remember this is similar to a for loop but applies to elements in a list or a vector). Effectively the code below iterates over each element in the list ‘bc_K’ and applies a function that selects the ‘auc’ statistic from the corresponding slot in the model results. Slots are basically attributes of objects in R. Here bc_K is a list of objects of class ModelEvaluation, so we iterate over each object and extract the ‘auc’ slot. You can type ‘help(slot)’ into the console for more information on slots.
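If slots are new to you, a quick way to see them (illustration only) is:
#list the slots available in the first evaluation object, then pull one out directly
slotNames(bc_K[[1]])
slot(bc_K[[1]],'auc')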
Get model evaluation statistics:
aucBIO <- sapply( bc_K, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucBIO)
[1] 0.8052084
In order to map the results of our model as predicted presence-non presence we need to set a threshold over which locations (cells on the map) are allocated 1 and, beneath which, 0. One way to determine this threshold is to calculate the threshold (the model output) at which the TPR (true positive rate) and TNR (true negative rate) are maximized. Remember we need both of these to be high in order to ensure a successful model.
We use the sapply() function to iterate over each item in our bc_K list and select the ‘t’ value (this is the slot in bc_K which holds the candidate thresholds) for which the corresponding value of TPR+TNR is highest (hence we use ‘which.max’).
Opt_BIO<-sapply( bc_K, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
Opt_BIO
[1] 0.02493700 0.01519029 0.02579715 0.01562618 0.02588298
We will use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them.
#take the mean of the OptBIO list for application in our predictions
Mean_OptBIO<- mean(Opt_BIO)
Mean_OptBIO
[1] 0.02148672
Now we just need to make a prediction based on our environmental raster stack and our bioclim model, setting 1 and 0 (presence/absence) according to the above threshold.
prBIO <- predict(env, bc_bradypus2,type = "response")
par(mfrow=c(1,2))
plot(prBIO, main='BIO_K')
plot(prBIO > Mean_OptBIO, main='presence/absence')
Now we’ll do exactly the same for our GLM model, only remember that this algorithm takes presence and absence data, so we’ll modify our for loop slightly.
#create an empty list to hold our results (remember there will be five sets)
eGLM<-list()
par(mfrow=c(2,3))
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
glm_eval <- glm(Pres~.,binomial(link = "logit"), data=dataTrain)#this is our glm model trained on presence and absence points
eGLM[[i]] <- evaluate(p=dataTest[ which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),], glm_eval)#use testing data (kfold==i) for model evaluation
#check the AUC by plotting ROC values
plot(eGLM[[i]],'ROC')
}
#inspect
eGLM
[[1]]
class : ModelEvaluation
n presences : 1351
n absences : 1415
AUC : 0.8858644
cor : 0.5917575
max TPR+TNR at : -0.0717664
[[2]]
class : ModelEvaluation
n presences : 1352
n absences : 1415
AUC : 0.9008186
cor : 0.6137022
max TPR+TNR at : 0.00600667
[[3]]
class : ModelEvaluation
n presences : 1351
n absences : 1414
AUC : 0.8991449
cor : 0.6141121
max TPR+TNR at : -0.02784938
[[4]]
class : ModelEvaluation
n presences : 1352
n absences : 1415
AUC : 0.8831481
cor : 0.5911513
max TPR+TNR at : -0.00886923
[[5]]
class : ModelEvaluation
n presences : 1351
n absences : 1415
AUC : 0.9074673
cor : 0.6124596
max TPR+TNR at : -0.01166423
Get model evaluation statistics:
aucGLM <- sapply( eGLM, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucGLM)
[1] 0.8952887
Again we use the sapply() function to iterate over each item in our eGLM list and select the ‘t’ value (this is the slot in eGLM which holds the candidate thresholds) for which the corresponding value of TPR+TNR is highest (hence we use ‘which.max’).
Opt_GLM<-sapply(eGLM, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
#inspect
Opt_GLM
[1] -0.07176640  0.00600667 -0.02784938 -0.00886923 -0.01166423
We can now use these threshold values to make final predictions for areas of 1 and 0 (presence/absence) and map them. As glm outputs are on the logit (log-odds) scale, we can back-transform these values to approximate probabilities using the plogis() function like so:
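As a quick reminder of what plogis() does (illustration only), it is the inverse of the logit link, mapping values on the log-odds scale back onto the 0-1 probability scale:
#a log-odds value of 0 corresponds to a probability of 0.5
plogis(c(-2,0,2))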
#take the mean to be applied to our predictions
Mean_OptGLM<- mean(Opt_GLM)
trGLM<-plogis(Mean_OptGLM)
#inspect
trGLM
[1] 0.4942931
We now have a value with which to set the threshold for our presence/absence maps.
We use the predict() function to estimate our output based on the ‘env’ raster stack and the information from our glm_eval model. For plotting purposes we then restrict the output to show only areas above our threshold (presence), leaving all other areas blank.
prGLM <- predict(env, glm.bradypus,type = "response")
par(mfrow=c(1,2))
plot(prGLM, main='GLM, regression')
plot(prGLM > trGLM, main='presence/absence')
Now let’s do the same for the Maxent model (we will leave out bioclim).
#Perform the same evaluation process on our maxent model
par(mfrow=c(2,3))
#create list to hold results
eMAX<-list()
#set number of folds
folds=5
#assign data to folds
kfold_pres <- kfold(Pres.cov, folds)
kfold_back <- kfold(Back.cov, folds)
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
maxnet_eval <- maxnet(dataTrain$Pres, dataTrain[,1:8])
eMAX[[i]] <- evaluate(p=dataTest[which(dataTest$Pres==1),],a=dataTest[which(dataTest$Pres==0),], maxnet_eval)
plot(eMAX[[i]],'ROC')
}
#Again, we can access our AUC and MaxTPR+TNR statistics for this latest model.
aucMAX <- sapply(eMAX, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
aucMAX
[1] 0.9194087 0.9210733 0.9128803 0.9270491 0.9301567
Return the mean auc value
mean(aucMAX)
[1] 0.9221136
Now obtain the threshold value
# retrieve optimal threshold
Opt_MAX<-sapply( eMAX, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
#take the mean
Mean_OptMAX<-mean(Opt_MAX)
#predict the model to data
prMAX <- predict(env, maxnetMod,clamp=F,na.rm=TRUE)
#plot according to predicted values and to thresholds
par(mfrow=c(1,2))
Now plot the predicted values and then according to the threshold.
plot(prMAX, main='Maxent Prediction')
plot(prMAX > Mean_OptMAX, main='presence/absence')
Finally we will look at how to fit a random forest model to the data using the randomForest package.
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The randomForest model takes two arguments - parameters that can be tuned - mtry and ntree. mtry refers to the number of predictor variables tried at each split of a tree and ntree specifies the number of trees grown. 500 is the default value often used and will suffice for the size of our data. Model performance is thought to be more sensitive to mtry than ntree, and mtry can be tuned with the tuneRF() function:
#tune mtry
tuneRF(x=all.cov[,1:8],y=as.factor(all.cov$Pres))
mtry = 2  OOB error = 5.23%
Searching left ...
mtry = 1 OOB error = 5.3%
-0.01243094 0.05
Searching right ...
mtry = 4 OOB error = 5.31%
-0.01519337 0.05
mtry OOBError
1.OOB 1 0.05299689
2.OOB 2 0.05234618
4.OOB 4 0.05314149
The out-of-bag (OOB) error statistic is lowest with an mtry of 2, so we go with this for fitting our model:
rf.Bradypus<-randomForest(as.factor(Pres)~.,mtry=2,ntree=500,data=all.cov)
Although randomForest does not produce variable coefficients as in the case of glm(), we can use the varImpPlot() function to get a measure of variable importance (the contribution that each variable makes to the discrimination of classes in the final model). Higher values of the Gini index represent a greater probability that cases will be wrongly classified, so variable importance is judged by the decrease in the Gini index achieved through a variable’s inclusion in the model. Note that the rank of importance does not closely match the strength of association implied by the coefficients in the glm model.
varImpPlot(rf.Bradypus)
To evaluate our random forest model we can still use the evaluate() function from the dismo package, but in this case we first make a prediction using the model output. This is because random forest, as used here, is a classifier, so we have to give it the response variable as a factor. However, factors don’t work well with ROC calculations, so the predict() function here allows us to first make the predictions (i.e. where all predicted and actual values are returned as numerics rather than factors) before passing them to the evaluate() function.
#create an empty list to hold our results (remember there will be five sets)
eRF<-list()
par(mfrow=c(2,3))
for (i in 1:folds) {
train <- Pres.cov[kfold_pres!= i,]
test <- Pres.cov[kfold_pres == i,]
backTrain<-Back.cov[kfold_back!=i,]
backTest<-Back.cov[kfold_back==i,]
dataTrain<-rbind(train,backTrain)
dataTest<-rbind(test,backTest)
RF_eval <- randomForest(as.factor(Pres)~., data=dataTrain)#this is our RF model
rf.pred <- predict(RF_eval, type="prob")[,2]#make prediction
eRF[[i]]<-evaluate(p = rf.pred[which(dataTrain$Pres == "1")], #set p to be 1s from the dataTrain object
a = rf.pred[which(dataTrain$Pres == "0")])# set a to be 0s from the dataTrain object
#check the AUC by plotting ROC values
plot(eRF[[i]],'ROC')
}
#inspect
eRF
[[1]]
class : ModelEvaluation
n presences : 5406
n absences : 5659
AUC : 0.9799243
cor : 0.9080899
max TPR+TNR at : 0.6293082
[[2]]
class : ModelEvaluation
n presences : 5405
n absences : 5659
AUC : 0.9796176
cor : 0.9097067
max TPR+TNR at : 0.7689563
[[3]]
class : ModelEvaluation
n presences : 5406
n absences : 5660
AUC : 0.9796032
cor : 0.9062663
max TPR+TNR at : 0.6736968
[[4]]
class : ModelEvaluation
n presences : 5405
n absences : 5659
AUC : 0.9801226
cor : 0.9093308
max TPR+TNR at : 0.6388455
[[5]]
class : ModelEvaluation
n presences : 5406
n absences : 5659
AUC : 0.9797942
cor : 0.9102759
max TPR+TNR at : 0.7824234
Retrieve the mean auc from these models:
#Get model evaluation statistics:
aucRF <- sapply( eRF, function(x){slot(x, 'auc')} )
#calculate the mean values for comparison with other models
mean(aucRF)
[1] 0.9798124
Get threshold:
#Get maxTPR+TNR for the Random Forest model
Opt_RF<-sapply( eRF, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
#mean of all threshold values
Mean_OptRF<-mean(Opt_RF)
#print
Mean_OptRF
[1] 0.698646
We can predict and map our distribution estimate as for the other models, but note here that we have to add “index=2” as an argument in the predict() function. Remember that random forest is a classifier, which means it will produce outcomes of 0 and 1 automatically. The index=2 argument asks the predict function to return the probability that a cell will equal 1 (this is the second column of the table that the predict function creates - hence index=2).
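You can see why index=2 is needed by looking at the classifier’s probability output directly (a quick check, not required for the workflow):
#one column per class; the second column holds the probability that Pres equals 1
head(predict(rf.Bradypus,type="prob"))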
prRF <- predict(env, rf.Bradypus,type="prob",index=2)
par(mfrow=c(1,2))
plot(prRF, main='Random Forest Prediction')
plot(prRF > Mean_OptRF, main='presence/absence')
Now we have a clearer view of the quality of our models, with the random forest model performing better overall. Both the glm and maxnet approaches still demonstrated good predictive power even after the k-fold validation. Next week we will look at how this method of cross-validation may still be overly optimistic given that we have not accounted for spatial autocorrelation in our evaluation, and at methods for handling such problems.
For now, proceed to this week’s independent exercise.
Week 3 Exercise
Use the gbif() function to download records for Solanum acaule and generate species distribution models using the maxnet() and glm() approaches. Use the same environmental data and approach to background sampling as used in this practical. Evaluate your models using a k-fold approach based on 5 folds. Select the best performing model and create 2 mapped outputs - 1. a probability plot of your predictions and 2. a presence/absence map based on a sensible threshold.