This week, we’ll be preparing the data to run Species Distribution Models in R. Next week, we’ll run them. Remember that the aim of SDM is to estimate the probability that a species occurs at a given site (in a given grid cell) based in that site’s (grid cell’s) environmental similarity with the environmental conditions at known species occurrence locations.
Create a Lab3 folder nested within the folder you use for this class. Do not use spaces in the name of the folder (i.e., not “Lab 3”)
If you do not already have the latest version of R and RStudio,
download and install them using this link: https://posit.co/download/rstudio-desktop/#download. R
is the programming language we will use, and RStudio is a program that
makes interfacing with R more seamless.
2.1 Download R. Under “1: Install R,” click “Download and Install R.”
Click on the link for your operating system, and run the installer as
you would when installing any software program.
2.2 Download RStudio. Under “2: Install RStudio,” click the download
button if the right version for your operating system appears there. If
not, scroll down for the correct version. Run the installer as you would
when installing any software program.
3.1 Open RStudio
3.2 Create an R Script file. Click File -> NewFile -> R
Script
3.3 Save the R Script to your Lab3 folder. Click File > Save
As
We need to set a working directory, which is the default location of
any files you read into R or save out of R. We’ll do this by specifying
the path name to the Lab3 folder in your new R Script.
4.1. Copy and past the text in the gray rectangle below into your R
Script
4.2. MODIFY it to
your own path name
4.3. Run the line of code by selecting anywhere in that line and
clicking the Run icon. The shortcut is Control
+ Enter
4.4. If it ran correctly, you should see the line of code appear in the
Console with no errors
4.5. Going forward in this tutorial, anything in a gray box is code that
you should copy and paste into your script, modify as appropriate, and
run. 4.6 Save your R Script freuently.
setwd("/Users/megancattau/Dropbox/0_BoiseState/Courses/2024_spr_Ecol_Dist/Labs/Lab3_SDM")
5.1. Install the packages
If you do not already have these packages installed, run this line:
install.packages(c( “dismo”,“raster”, “sf”, “maps”, “rgbif”,
“usethis”))
5.2. Load the packages
library(dismo)
library(raster)
library(sf)
library(rgbif)
library(maps)
Data Preparation will constitute the remainder of today’s lab. This
is often the most time consuming part of a species distribution modeling
project. You need occurrence records that document presence (and perhaps
absence) of the species of interest. You also need to have accurate and
relevant environmental data (predictor variables) at a sufficiently fine
spatial resolution.
To run SDMs, we will need known species “presence” data with at least
2 columns that hold the coordinates of the locations where a species was
observed. Note: The columns should be organized so that longitude is the
first and latitude the second column (think x and y axes in a plot;
longitude is x, latitude is y); they often are in the reverse order,
which can cause confusion.
We will download species occurrence data from the Global Biodiversity
Inventory Facility (GBIF) (http: //www.gbif.org/), but you could import
any data to do this. See the rgbif tutorial if you would like more
information: https://docs.ropensci.org/rgbif/
Register for an account at https://www.gbif.org/
Note: do not sign up for an account with your google, github, etc.
account. Instead, use your email.
Some rgbif features require GBIF login details, and we’ll allow rgbif to access to your GBIF credentials by saving them in your .Renviron file.
usethis::edit_r_environ()
Note that you are not typing this into oyour RScript. Rather, into
the .Renviron file that appears when you run the above.
MODIFY the code below
to your own credentials
GBIF_USER="mcattau"
GBIF_PWD="GBIFpasswordforclass"
GBIF_EMAIL="megan.cattau@gmail.com"
Optional: Some more options and information on getting occurrence data from gbif here: https://docs.ropensci.org/rgbif/articles/getting_occurrence_data.html
When you run this for your own species, you will REPLACE ‘Pongo pygmaeus’ below with your species
spp_taxonKey <- name_backbone("Pongo pygmaeus")$usageKey
Note: the default option below removes geospatial issues, keeps only records with coordinates, remove absent records, removes fossils and living specimens, and retrieves all records for the spp.
gbif_download <- occ_download(
pred_default(),pred("taxonKey", value=spp_taxonKey),
format = "SIMPLE_CSV")
Sys.sleep(300)
When you run this for your own species, you will REPLACE ‘0010432-240202131308920’ below with the numbers from your order found when you run “gbif_download” You can also check by looking for a downloaded zip file in your folder once this is complete
gbif_download
occ_download_wait('0010432-240202131308920')
presence <- occ_download_get(gbif_download) %>%
occ_download_import()
See the bottom of the printout below for the citation. This is what you would use if used this download in a research paper. Please also see GBIF’s citation guidelines when using GBIF mediated data.
gbif_download
How many rows and colums does it have?
dim(presence)
What are the names of the columns (i.e., attributes)?
names(presence)
We just need the latitude and longitude columns When you run this for your own species, if lat and long are not found in columns 22 and 23, you will REPLACE ‘22, 23’ below with the correct column numbers.
presence<-presence[,c(22,23)]
Make a map of the occurrences. It is important to make such maps to assure that the points are, at least roughly, in the right location. First look at the range of values where your species occurrence points can be found and restrict the map to that area. Note that if you wanted to specify a range directly, you could use this instead using your own coordinates: map(xlim=c(-130,120), ylim=c(-10,60))
range(presence$decimalLongitude)
range(presence$decimalLatitude)
map(xlim=c(range(presence$decimalLongitude)), ylim=c(range(presence$decimalLatitude)))
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)
If R is new for you and you are choosing to run this for your own species, this section is OPTIONAL
There are many steps that people often do to ‘clean’ their data, including cross-checking, georeferencing, and dealing with sampling bias. You are not required to do these steps for this lab, but it would be good for you to know that these are common pre-processing steps. Cross-checking: It is important to cross-check coordinates by visual and other means. One approach is to compare the country (and lower level administrative subdivisions) of the site as specified by the records, with the country implied by the coordinates. You could do this in R or ArcGIS. Sampling bias: Sampling bias is frequently present in occurrence records. One can attempt to remove some of the bias by subsampling records. However, subsampling reduces the number of records, and it cannot correct the data for areas that have not been sampled at all. It also suffers from the problem that locally dense records might in fact be a true reflection of the relative suitable of habitat. As in many steps in SDM, you need to understand something about your data and species to implement them well. In a real research project you would want to spend much more time on this first data-cleaning and completion step, partly with R, but also with other programs.
For this lab, we’ll just make sure records are generally in the right place. Pongo pygmaeus is a species that occurs on the islands of Sumatra and Kalimantan in Indonesia Do you see any errors on the map?
Taking a liberal approach, you could correct any obvious errors (a
record appears to be missing a minus sign, or it is plausable that the
lat and long are switched). Taking a conservative approach, you could
exclude just any records that are clearly not within range.
Option 1: you could use political boundaries for mapping. If you use this option when you run this for your own species, you will REPLACE ‘Indonesia’ below with the boundary of your choice
map (regions ='Indonesia')
map.axes()
Option 2: You could specify the coordinates directly. If you use this option when you run this for your own species, you will REPLACE the coordinates below with the relevant coordinates for your species
zoom_x<-c(90,120)
zoom_y<-c(-10,10)
map(xlim=zoom_x, ylim=zoom_y)
map.axes()
When you run this for your own species, you will REPLACE the coordinates below with the relevant coordinates for your species. We’ll also check this new dataset to ensure we’ve removed everything we need to, once at the full extent and once Zooming in
presence <- presence[(presence$decimalLongitude > 90 & presence$decimalLongitude < 120 ) & (presence$decimalLatitude > -10 & presence$decimalLatitude < 10) , ]
map()
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)
map(xlim=zoom_x, ylim=zoom_y)
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)
You can see there are still two misplaced points, and we’ll inspect the latitude and longitude columns to find them. We create a unique identifier for each row (FID), inspect each row that seems to be out of bounds and note the FID of those rows, exclude those errant points from the data, and, finally, keep just the latitude and longitude columns for the final dataset.
When you run this for your own species, you will REPLACE the coordinates below with the relevant coordinates for your species
# Create a unique identifier for each row
presence$FID<-1:nrow(presence)
# inspect each row that seems to be out of bounds and note the FID of those rows
presence[(presence$decimalLatitude < 0 & presence$decimalLongitude < 100),]
presence[(presence$decimalLatitude>0 & presence$decimalLongitude < 105 & presence$decimalLongitude>102),]
# Exclude those errant points from the data FID 562-567
presence<-presence[-c(562:567),]
# Check it
map(xlim=zoom_x, ylim=zoom_y)
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)
# Keep just the latitude and longitude columns
presence<-presence[,c(1,2)]
Note: In the future, you could also do all of the above in QGIS or some other program if you prefer
You now have a dataset of cleaned species occurrence points! Think: Do you think the observations are a reasonable representation of the distribution (and ecological niche) of the species?
In species distribution modeling, predictor variables are typically organized as raster type files. Each predictor should be a ’raster’ representing a variable of interest. Variables can include climatic, soil, terrain, vegetation, land use, and other variables. These data are typically stored in files in some kind of GIS format. Almost all relevant formats can be used (including ESRI grid, geoTiff, netCDF, IDRISI). Avoid ASCII files if you can, as they tend to considerably slow down processing speed. For any particular study the layers should all have the same spatial extent, resolution, origin, and projection. If necessary, use functions like crop, extend, aggregate, resample, and projectRaster from the ’raster’ package to prepare your predictor variable data. See the help files and the vignette of the raster package for more info on how to do this. The set of predictor variables (rasters) can be used to make a ’RasterStack’, which is a collection of ’RasterLayer’ objects (see the raster package for more info).
We’ll use WorldClim data. The Worldclim data represent ’bioclimatic variables’ from the WorldClim database (http://www.worldclim.org, Hijmans et al., 2004). See https://worldclim.org/data/index.html for more information.
We’ll use the getData function from the raster package to download Worldclim data, we’ll explore it, and we’ll plot one of the variables. You can use ‘?getData’ to see what other data are available in the raster package.
worldcl<-getData(name='worldclim', var='bio', res=2.5)
?getData
names(worldcl)
plot(worldcl[[1]])
Remember to think critically abut what environental data you select. You’ll want to do a little research on your species of interest in order to select variables that are biologically meaningful for your model next week. Predictor variable selection can be important, particularly if the objective of a study is explanation. In all cases it is important to use variables that are relevant to the ecology of the species (rather than with the first dataset that can be found on the web!). In some cases it can be useful to develop new, more ecologically relevant, predictor variables from existing data. For example, one could use land cover data and the focal function in the raster package to create a new variable that indicates how much forest area is available within x km of a grid cell, for a species that might have a home range of x
So we have our species occurrence data and our predictor variables
(climate in this case). Some SDM models only use ’presence’ data. Other
methods also use ’absence’ data or ’background’ data. If you have a
large dataset with presence/absence from a well designed survey, you
should use a method that can use these data (i.e. do not use a modeling
method that only considers presence data). If you only have presence
data, you can still use a method that needs absence data, by
substituting absence data with background data. Background data are not
attempting to guess at absence locations, but rather to characterize
environments in the study region (with or without the species present).
Background data establishes the environmental domain of the study,
whilst presence data should establish under which conditions a species
is more likely to be present than on average. Survey-absence data has
value. In conjunction with presence records, it establishes where
surveys have been done, and the prevalence of the species given the
survey effort. That information is lacking for presence-only data, a
fact that can cause substantial difficulties for modeling presence-only
data well. However, absence data can also be biased and incomplete, as
discussed in lecture (i.e., “false absences”).
There are several approaches we could use to sample ’pseudo-absence’
points
You can use a ’mask’ to exclude area from a raster with no data (NA) and the randomPoints function to pull background points from within that mask (all land in the world, in our case).
mask <- worldcl[[1]]
# set seed to assure that the examples will always have the same random sample
set.seed(1963)
# select 500 random points
bg <- randomPoints(mask, 500 )
map()
points(bg)
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)
You can use an ’extent’ to further restrict the area from which
random locations are drawn. Let’s use the extent of our species
points.
The map here allows us to visualize the presences and absences
together.
### NOTE THAT THIS
MAP AND THE NEXT FIGURE BELOW (FOR THE ORANGUTAN OR YOUR OWN SPECIES)
ARE ALL YOU WILL TURN IN FOR THIS LAB 3a. If you use the orangutan,
MODIFY the color
“orange” below with a different color.
e <- extent(c(range(presence$decimalLongitude), range(presence$decimalLatitude)))
bg2 <- randomPoints(mask, 500, ext=e)
map(xlim=c(range(presence$decimalLongitude)), ylim=c(range(presence$decimalLatitude)))
points(bg2)
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)
We will not go over this option in class, but just note that you could instead take the background points from within a radius of the presence points.
Create a dataframe to run the models on. We now have a set of predictor variables (rasters) and occurrence points / absence. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the dismo package). This is a very straightforward thing to do using the ’extract’ function from the raster package.
presvals <- extract(worldcl, presence)
absvals <- extract(worldcl, bg2)
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))
write.csv(sdmdata, "sdmdata.csv")
### NOTE THAT THIS
FIGURE BELOW AND THE MAP ABOVE (FOR THE ORANGUTAN OR YOUR OWN SPECIES)
ARE ALL YOU WILL TURN IN FOR THIS LAB 3a
pairs(sdmdata[,2:5], cex=0.1, fig=TRUE)
Next week, we will do Part II: Fitting a model, making a prediction,
and evaluating the result. We will have a report due for Lab 3a and 3b
in combination. The report will be based just on the data of the species
of your choosing rather than on the orangutan data we worked with in
class.