Tutorial for SDMPlay

Species distribution models (SDMs) have been developed for several years to adress conservation issues, assess the direct impact of human activities on ecosystems and predict the potential distribution shifts of invasive species (see Elith et al. 2006, Elith & Leathwick 2009, Pearson 2007 for reviews). SDM relate species occurrences with environmental information and can extrapolate species distribution on their entire occupied space. Applying SDM on limited occurrence datasets can therefore bring complementary information in non-visited areas. However, users must be aware of potential bias and limitations while using such poor and historical datasets (Araujo & Guisan 2006, Robinson 2011, Proosdij et al. 2016) that require corrections (Phillips et al. 2009, Barbet-Massin et al. 2012).

SDMPlay is a pedagogic package that will allow you to compute species distribution models (SDM) with two popular machine learning approaches, BRT (Boosted Regression Trees) and MaxEnt (Maximum Entropy). It contains occurrences of marine species and environmental descriptors datasets as examples for a first use of SDMs. You can also upload your own dataset. Basic approaches for model calibration and execution are provided. Classic tools to evaluate model performance are supplied (Area Under the Curve, omission rate and confusion matrix, standard deviation projection) and are completed with tools to perform null models (Raes & ter Steege 2007, Proosdij et al. 2016).

The biological dataset includes original occurrences of two echinoid species (sea urchins) present on the Kerguelen Plateau. The environmental dataset includes 15 environmental descriptors, displayed in a raster format, on the extent of the Kerguelen Plateau, for different time periods.

This package focusses on datasets containing presence-only data. Functions must be adapted whether dealing with presence-absence or abundance data. Presence-only methods imply using background data to be selected in the study area (Pearce & Boyce 2006) in order to calibrate the model. Several background sampling methods exist (Phillips et al. 2009), and its choice depends on the presence-only sampling pattern and the scientific questions. This package focusses only on a random sampling of background data. You can refere to other packages such as biomod2 if you want to use another sampling strategy (e.g. relative envelope, distance to disk, independent strategy).

Data overview

In the package, you can download occurrence data of two echinoid species, Brisaster antarcticus and Ctenocidaris nutrix, distributed on the Kerguelen Plateau. The complete dataset of Kerguelen echinoid species is available in Guillaumot et al. (2016).

library(SDMPlay)
data("ctenocidaris.nutrix")
head(ctenocidaris.nutrix)

##    id     scientific.name scientific.name.authorship
## 56  1 Ctenocidaris_nutrix             (Thomson 1876)
## 57  2 Ctenocidaris_nutrix             (Thomson 1876)
## 58  3 Ctenocidaris_nutrix             (Thomson 1876)
## 59  4 Ctenocidaris_nutrix             (Thomson 1876)
## 60  5 Ctenocidaris_nutrix             (Thomson 1876)
## 61  6 Ctenocidaris_nutrix             (Thomson 1876)
##                          genus                        family
## 56 Ctenocidaris Mortensen 1910 Ctenocidarinae Mortensen 1928
## 57 Ctenocidaris Mortensen 1910 Ctenocidarinae Mortensen 1928
## 58 Ctenocidaris Mortensen 1910 Ctenocidarinae Mortensen 1928
## 59 Ctenocidaris Mortensen 1910 Ctenocidarinae Mortensen 1928
## 60 Ctenocidaris Mortensen 1910 Ctenocidarinae Mortensen 1928
## 61 Ctenocidaris Mortensen 1910 Ctenocidarinae Mortensen 1928
##    order.and.higher.taxonomic.rank decimal.Longitude decimal.Latitude
## 56            Cidaroida Claus 1880          67.13167        -48.98500
## 57            Cidaroida Claus 1880          67.33167        -49.44167
## 58            Cidaroida Claus 1880          67.51167        -49.00500
## 59            Cidaroida Claus 1880          67.54167        -48.11667
## 60            Cidaroida Claus 1880          67.88500        -49.46667
## 61            Cidaroida Claus 1880          68.05833        -49.06667
##    depth year       campaign             reference          vessel
## 56   315 1975 MD04 (BENTHOS) De Ridder et al. 1992 Marion Dufresne
## 57   301 1975 MD04 (BENTHOS) De Ridder et al. 1992 Marion Dufresne
## 58   206 1975 MD04 (BENTHOS) De Ridder et al. 1992 Marion Dufresne
## 59   365 1975 MD04 (BENTHOS) De Ridder et al. 1992 Marion Dufresne
## 60   191 1975 MD04 (BENTHOS) De Ridder et al. 1992 Marion Dufresne
## 61   178 1975 MD04 (BENTHOS) De Ridder et al. 1992 Marion Dufresne

This package also contains stack of raster layers, corresponding to environmental descriptors in the region of the Kerguelen Plateau, for three time periods [1965-1974], [2005-2012], and for the climatic scenario A1B (IPCC,4th report 2007) for 2200.

Load the raster stacks

data("predictors1965_1974")
data("predictors2005_2012")
data("predictors2200A1B")

Plot the layers and explore their properties

plot(subset(predictors2005_2012, c(1:3)))

predictors2005_2012

## class       : RasterStack 
## dimensions  : 100, 179, 17900, 15  (nrow, ncol, ncell, nlayers)
## resolution  : 0.1, 0.1  (x, y)
## extent      : 63, 80.9, -56, -46  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
## names       :         depth, seasurfac//_2005_2012, seasurfac//_2005_2012, seafloor_//_2005_2012, seafloor_//_2005_2012, seasurfac//_2005_2012, seasurfac//_2005_2012, seafloor_//_2005_2012, seafloor_//_2005_2012, chlorophy//_2005_2012, geomorphology,     sediments,         slope, seafloor_//_2005_2012,     roughness 
## min values  : -4.977000e+03,          3.263100e-01,         -4.002820e+00,         -2.985100e-01,         -3.097540e+00,          3.362941e+01,         -6.510162e-02,          3.364861e+01,         -1.297913e-01,          1.358515e-01,  1.000000e+00,  2.000000e+00,  4.822916e-05,          4.008017e+00,  7.000000e+00 
## max values  :    -1.0000000,             9.0129099,            -1.2873000,             4.8823700,             0.9008052,            33.9671898,             0.1001129,            34.8030014,             0.1012001,             2.7323778,    19.0000000,    13.0000000,     0.1546878,             7.6222906,  3417.0000000

names(predictors2005_2012)

##  [1] "depth"                                     
##  [2] "seasurface_temperature_mean_2005_2012"     
##  [3] "seasurface_temperature_amplitude_2005_2012"
##  [4] "seafloor_temperature_mean_2005_2012"       
##  [5] "seafloor_temperature_amplitude_2005_2012"  
##  [6] "seasurface_salinity_mean_2005_2012"        
##  [7] "seasurface_salinity_amplitude_2005_2012"   
##  [8] "seafloor_salinity_mean_2005_2012"          
##  [9] "seafloor_salinity_amplitude_2005_2012"     
## [10] "chlorophyla_summer_mean_2005_2012"         
## [11] "geomorphology"                             
## [12] "sediments"                                 
## [13] "slope"                                     
## [14] "seafloor_oxygen_mean_2005_2012"            
## [15] "roughness"

Prepare your model inputs

The first step after checking your data is to adapt your dataset for modelling. Model algorithms require a table containing the environmental values associated with each occurrence data.

ID*	Longitude	Latitude	depth	…	temperature
1	63.33	-48.26	-480	…	1.4
1	64.13	-48.57	-104	…	1.2
…	…	…	…	…	…
0	67.32	-47.23	-1013	…	2.5
0	67.90	-55.45	-98	…	4.3

*ID corresponds to presence data (ID=1), or absence/background data (ID=0).

This table will be afterwards included within the SDM algorithm (BRT, MaxEnt) and will infere presence probabilities on the area that you will define for extrapolation. The package will guide you to build this dataframe and save it as a SDMtab object that will be loaded in the following functions of the package.

Extract latitude and longitude values

ctenocidaris.nutrix.occ <- ctenocidaris.nutrix[,c(7,8)]
head(ctenocidaris.nutrix.occ)

##    decimal.Longitude decimal.Latitude
## 56          67.13167        -48.98500
## 57          67.33167        -49.44167
## 58          67.51167        -49.00500
## 59          67.54167        -48.11667
## 60          67.88500        -49.46667
## 61          68.05833        -49.06667

Create your SDMtab dataframe

SDMtable_ctenocidaris <- SDMtab(xydata=ctenocidaris.nutrix.occ, 
       predictors=predictors2005_2012,
       unique.data=FALSE,
       same=TRUE)

unique.data indicates that the function will look for presence-only data that fall on a same grid-cell pixel. When unique.data= TRUE, duplicates will be removed from the xydata variable. same and background.nb functions refer to the background data sampling. background.nb, indicates the specific number of background data to sample, while same is a shortcut that enable to sample as many background data as the number of presence-only data available. You can refer to Barbet-Massin et al. (2012) to choose the most appropriate number of background data to sample for your case study.

We can display the beginning and the end of the first columns of this new SDMtab object:

head(SDMtable_ctenocidaris[,c(1:5)])

##   id longitude latitude depth seasurface_temperature_mean_2005_2012
## 1  1     67.15   -48.95  -653                               4.24109
## 2  1     67.35   -49.45  -204                               3.89770
## 3  1     67.55   -49.05  -168                               4.06841
## 4  1     67.55   -48.15  -355                               4.65109
## 5  1     67.85   -49.45  -136                               3.86259
## 6  1     68.05   -49.05  -155                               4.03309

tail(SDMtable_ctenocidaris[,c(1:5)])

##     id longitude latitude depth seasurface_temperature_mean_2005_2012
## 245  0     66.35   -54.35 -4579                               2.33730
## 246  0     73.85   -46.95 -3389                               6.31981
## 247  0     64.85   -50.25  -479                               3.63045
## 248  0     80.05   -46.35 -3571                               7.90619
## 249  0     75.45   -54.45 -1415                               1.97339
## 250  0     76.75   -53.65 -1512                               1.68900

The dataframe combines environmental values of the 125 presence-only data available (ID=1) and environmental values associated with 125 background data randomly sampled in the area (ID=0).

You can display the sampled data on a map:

# nice colors and map
bluepalette<-colorRampPalette(c("blue4","blue","dodgerblue", "deepskyblue","lightskyblue"))(800) 
map <- read.csv(system.file("extdata","worldmap-0.01.csv",package="SDMPlay"), header=T)

# Isolate depth layer from the environmental stack
depth <- subset(predictors2005_2012,1)

# Extract background coordinates
background.occ <- subset(SDMtable_ctenocidaris,SDMtable_ctenocidaris$id==0)[,c(2,3)]
plot(depth, col=bluepalette, cex=0.8,legend.width=0.5, legend.shrink=0.4,
     legend.args=list(text='Depth (m)', side=3, font=2, cex=0.8))
points(map, type="l")
points(ctenocidaris.nutrix.occ, pch= 20, col="black")
points(background.occ, pch= 20, col="red")
legend("bottomleft", pch=20, col=c("black", "red"), legend=c("presence-only data","background data"), cex=0.6)

You can assess the quality of your dataset with the SDMdata.quality function. This function estimates the percentage of presence-only data that fall on grid-cell pixels containing non-informative values (N/A).

SDMdata.quality(SDMtable_ctenocidaris)

##                                            NA.percent (%)
## depth                                                 1.2
## seasurface_temperature_mean_2005_2012                 4.8
## seasurface_temperature_amplitude_2005_2012            4.8
## seafloor_temperature_mean_2005_2012                  31.6
## seafloor_temperature_amplitude_2005_2012             31.6
## seasurface_salinity_mean_2005_2012                    4.8
## seasurface_salinity_amplitude_2005_2012               1.6
## seafloor_salinity_mean_2005_2012                     31.6
## seafloor_salinity_amplitude_2005_2012                31.6
## chlorophyla_summer_mean_2005_2012                     5.2
## geomorphology                                         3.6
## sediments                                             1.2
## slope                                                 0.0
## seafloor_oxygen_mean_2005_2012                       40.4
## roughness                                             3.2

A last calibration step that you can perform before modelling is delineating the modelled area. The delim.area function can be used to restrict in geography and depth the environmental descriptors RasterStack. This step can be really important to enhance modelling performances by limiting the extent where the model extrapolation occurs.

par(mar=c(0,0,0,0))
predictors2005_2012_1500m <- delim.area(predictors2005_2012, longmin=62, longmax=80,latmin=-55 , latmax=-45, interval=c(0,-1500))
plot(subset(predictors2005_2012_1500m,1), col=bluepalette,legend.width=0.5, legend.shrink=0.25,
     legend.args=list(text='Depth (m)', side=3, font=2, cex=0.8))
points(map, type="l")

You can focus your background sampling on this restrained environment. Run again the SDMtab code with these changes.

SDMtable_ctenocidaris_1500 <- SDMtab(xydata=ctenocidaris.nutrix.occ, 
       predictors=predictors2005_2012_1500m,
       unique.data=FALSE,
       same=TRUE)

You can observe the changes:

Perform models

Once you have built your SDMtab dataframe, you will be easily able to perform models using the compute.maxent and compute.brt functions.

compute.brt(x, proj.predictors, tc = 2, lr = 0.001, bf = 0.75,
           n.trees = 50, step.size = n.trees)
compute.maxent(x, proj.predictors)

The fonctions require two main parameters, x which correspond to the SDMtab object previously created and proj.predictors, the RasterStack containing the environmental descriptors on which you want to project your model. The other arguments aim at calibrating the model. You can refere to Elith et al. (2008) and Elith et al. (2011) to choose the parameters according to your dataset.

Example for BRT

Extrapolate species distribution on the Kerguelen Plateau, for [2005-2012].

Cteno_model_2005_2012 <- SDMPlay:::compute.brt(x=SDMtable_ctenocidaris_1500, proj.predictors=predictors2005_2012_1500m, tc = 2, lr = 0.001, bf = 0.75, n.trees = 50)

While the function is uploading, you can observe the gbm function, called by SDMPlay that calculates the regression trees until reaching the best estimation. T can help you to refine your calibration.

Afterwards,different outputs can be produced.

Distribution probability maps

palettecolor <- colorRampPalette(c("deepskyblue", "darkseagreen","lightgreen","green","yellow","gold","orange", "red","firebrick"))( 100 ) # display nice colors
plot(Cteno_model_2005_2012$raster.prediction,col=palettecolor, main="Projection for [2005-2012]",
     cex.axis= 0.7, 
     legend.width=0.5, legend.shrink=0.25,
     legend.args=list(text='Probability', side=3, font=2, cex=0.8))
points(map, type="l")

Contribution of the different environmental descriptors

contributions <- Cteno_model_2005_2012$response$contributions
b <- barplot(contributions[,2], ylab="contribution (%)")
text(b-0.5, par("usr")[3] - 0.025, srt = 45, adj = 1, labels=contributions[,1],cex=0.5,xpd=T)

Plot gbm response variables

library(dismo)
gbm.plot(Cteno_model_2005_2012$response,n.plots=12,cex.axis=0.8, smooth=TRUE)

Get the interactions between variables and plot them

interactions <- gbm.interactions(Cteno_model_2005_2012$response)
head(interactions$rank.list[,c(2,4,5)])

##                                   var1.names
## 1      seasurface_temperature_mean_2005_2012
## 2             seafloor_oxygen_mean_2005_2012
## 3    seasurface_salinity_amplitude_2005_2012
## 4 seasurface_temperature_amplitude_2005_2012
## 5         seasurface_salinity_mean_2005_2012
## 6         seasurface_salinity_mean_2005_2012
##                                   var2.names int.size
## 1                                      depth     4.74
## 2                                      depth     3.48
## 3                                      depth     2.90
## 4      seasurface_temperature_mean_2005_2012     2.58
## 5        seafloor_temperature_mean_2005_2012     1.73
## 6 seasurface_temperature_amplitude_2005_2012     0.97

gbm.perspec(Cteno_model_2005_2012$response,interactions$rank.list[1,1],interactions$rank.list[1,3], cex.lab=0.7, cex.axis=0.7,par(mar=c(0,0,0,0)))

Project on other time periods

If you want to project your model on another time period and infere your species distribution for other environmental conditions, you just need to change the proj.predictors in compute.brt. The fonction will do the relationship between the environmental descriptors used for modelling and projecting. You must ensure that the extent, order and names of your raster layers are similar.

Go further

SDMPlay provides extra fonctions to go further in your modelling work. You can perform null models with null.model, evaluate modelling performance and define probability threshold with SDMeval. Further reading and examples are provided within the functions, don’t hesitate to explore them.

References

Araujo, M. B., & Guisan, A. (2006). Five (or so) challenges for species distribution modelling. Journal of biogeography, 33(10), 1677-1688.

Barbet‐Massin, M., Jiguet, F., Albert, C. H., & Thuiller, W. (2012). Selecting pseudo‐absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3(2), 327-338.

Elith, J.,P Anderson, R., Dudík, M., Ferrier, S., Guisan, A., J Hijmans, R., Huettmann, F., … & A Loiselle, B. (2006). Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29(2), 129-151.

Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802-813.

Elith, J., & Leathwick, J. R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annual review of ecology, evolution, and systematics, 40, 677-697.

Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., & Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and distributions, 17(1), 43-57.

Guillaumot, C., Martin, A., Fabri-Ruiz, S., Eléaume, M., & Saucède, T. (2016). Echinoids of the Kerguelen Plateau–occurrence data and environmental setting for past, present, and future species distribution modelling. ZooKeys, (630), 1.

Pearce, J. L., & Boyce, M. S. (2006). Modelling distribution and abundance with presence‐only data. Journal of applied ecology, 43(3), 405-412.

Pearson, R. G. (2007). Species’ distribution modeling for conservation educators and practitioners. Synthesis. American Museum of Natural History, 50.

Phillips, S. J., Dudík, M., Elith, J., Graham, C. H., Lehmann, A., Leathwick, J., & Ferrier, S. (2009). Sample selection bias and presence‐only distribution models: implications for background and pseudo‐absence data. Ecological applications, 19(1), 181-197.

Proosdij, A. S., Sosef, M. S., Wieringa, J. J., & Raes, N. (2016). Minimum required number of specimen records to develop accurate species distribution models. Ecography, 39(6), 542-552.

Raes, N., & ter Steege, H. (2007). A null‐model for significance testing of presence‐only species distribution models. Ecography, 30(5), 727-736.

Robinson, L. M., Elith, J., Hobday, A. J., Pearson, R. G., Kendall, B. E., Possingham, H. P., & Richardson, A. J. (2011). Pushing the limits in marine species distribution modelling: lessons from the land present challenges and opportunities. Global Ecology and Biogeography, 20(6), 789-802.