UPDATE If R is new to you, you can run through the code using the orangutan example and submit those figures (altering the color) rather than rerunning it using your own species. In that case, do ‘clean’ the data.

A. Introduction

This week, we’ll be preparing the data to run Species Distribution Models in R. Next week, we’ll run them. Remember that the aim of SDM is to estimate the probability that a species occurs at a given site (in a given grid cell) based in that site’s (grid cell’s) environmental similarity with the environmental conditions at known species occurrence locations.

B. Set-up

1. Create a Lab3 folder

Create a Lab3 folder nested within the folder you use for this class. Do not use spaces in the name of the folder (i.e., not “Lab 3”)

2. Install R and RStudio

If you do not already have the latest version of R and RStudio, download and install them using this link: https://posit.co/download/rstudio-desktop/#download. R is the programming language we will use, and RStudio is a program that makes interfacing with R more seamless.
2.1 Download R. Under “1: Install R,” click “Download and Install R.” Click on the link for your operating system, and run the installer as you would when installing any software program.
2.2 Download RStudio. Under “2: Install RStudio,” click the download button if the right version for your operating system appears there. If not, scroll down for the correct version. Run the installer as you would when installing any software program.

3. Setting up your R Script

3.1 Open RStudio
3.2 Create an R Script file. Click File -> NewFile -> R Script
3.3 Save the R Script to your Lab3 folder. Click File > Save As

4. Set your working Directory and start coding

We need to set a working directory, which is the default location of any files you read into R or save out of R. We’ll do this by specifying the path name to the Lab3 folder in your new R Script.

4.1. Copy and past the text in the gray rectangle below into your R Script
4.2. MODIFY it to your own path name
4.3. Run the line of code by selecting anywhere in that line and clicking the Run icon. The shortcut is Control + Enter
4.4. If it ran correctly, you should see the line of code appear in the Console with no errors
4.5. Going forward in this tutorial, anything in a gray box is code that you should copy and paste into your script, modify as appropriate, and run. 4.6 Save your R Script freuently.

setwd("/Users/megancattau/Dropbox/0_BoiseState/Courses/2024_spr_Ecol_Dist/Labs/Lab3_SDM")

5. Install, load, and call the packages below and their dependencies.

5.1. Install the packages
If you do not already have these packages installed, run this line: install.packages(c( “dismo”,“raster”, “sf”, “maps”, “rgbif”, “usethis”))

5.2. Load the packages

library(dismo)
library(raster)
library(sf)
library(rgbif)
library(maps)

C. Data Preparation

Data Preparation will constitute the remainder of today’s lab. This is often the most time consuming part of a species distribution modeling project. You need occurrence records that document presence (and perhaps absence) of the species of interest. You also need to have accurate and relevant environmental data (predictor variables) at a sufficiently fine spatial resolution.

We will run through this code together using the orangutan as an example, and then you will run the code for a species of your choice. There will only be two figures to turn in for this week’s lab (using ‘your’ species), specified balow. Note that if R is new to you, you can just use the orangutan example rather than running the code for your own species

1. Species Occurrence Data, “Presence” Data

To run SDMs, we will need known species “presence” data with at least 2 columns that hold the coordinates of the locations where a species was observed. Note: The columns should be organized so that longitude is the first and latitude the second column (think x and y axes in a plot; longitude is x, latitude is y); they often are in the reverse order, which can cause confusion.

We will download species occurrence data from the Global Biodiversity Inventory Facility (GBIF) (http: //www.gbif.org/), but you could import any data to do this. See the rgbif tutorial if you would like more information: https://docs.ropensci.org/rgbif/

1.1 Register for a GBIF account

Register for an account at https://www.gbif.org/
Note: do not sign up for an account with your google, github, etc. account. Instead, use your email.

1.2 Give R access to your GBIF credentials

Some rgbif features require GBIF login details, and we’ll allow rgbif to access to your GBIF credentials by saving them in your .Renviron file.

1.2.1 Open R Environment

usethis::edit_r_environ()

1.2.2 Edit your .Renviron file.

Note that you are not typing this into oyour RScript. Rather, into the .Renviron file that appears when you run the above.
MODIFY the code below to your own credentials

GBIF_USER="mcattau"
GBIF_PWD="GBIFpasswordforclass"
GBIF_EMAIL="megan.cattau@gmail.com"

1.2.3 Restart R

1.3 Download and preprocess the data

Optional: Some more options and information on getting occurrence data from gbif here: https://docs.ropensci.org/rgbif/articles/getting_occurrence_data.html

1.3.1 Find the taxon key for a species, using Pongo pygmaeus as an example

When you run this for your own species, you will REPLACE ‘Pongo pygmaeus’ below with your species

spp_taxonKey <- name_backbone("Pongo pygmaeus")$usageKey

1.3.2 Download occurrence records to your working directory.

Note: the default option below removes geospatial issues, keeps only records with coordinates, remove absent records, removes fossils and living specimens, and retrieves all records for the spp.

gbif_download <- occ_download(
   pred_default(),pred("taxonKey", value=spp_taxonKey),
   format = "SIMPLE_CSV")
Sys.sleep(300)

1.3.3 Check the status of your download

When you run this for your own species, you will REPLACE ‘0010432-240202131308920’ below with the numbers from your order found when you run “gbif_download” You can also check by looking for a downloaded zip file in your folder once this is complete

gbif_download
occ_download_wait('0010432-240202131308920')

1.3.4 Import the data into R

presence <- occ_download_get(gbif_download) %>%
   occ_download_import()

1.3.5 Get the citation

See the bottom of the printout below for the citation. This is what you would use if used this download in a research paper. Please also see GBIF’s citation guidelines when using GBIF mediated data.

gbif_download

1.3.6 Explore the data

How many rows and colums does it have?

dim(presence)

What are the names of the columns (i.e., attributes)?

names(presence)

1.3.7 Subset the data

We just need the latitude and longitude columns When you run this for your own species, if lat and long are not found in columns 22 and 23, you will REPLACE ‘22, 23’ below with the correct column numbers.

presence<-presence[,c(22,23)]

1.4 Visualize the data

Make a map of the occurrences. It is important to make such maps to assure that the points are, at least roughly, in the right location. First look at the range of values where your species occurrence points can be found and restrict the map to that area. Note that if you wanted to specify a range directly, you could use this instead using your own coordinates: map(xlim=c(-130,120), ylim=c(-10,60))

range(presence$decimalLongitude)
range(presence$decimalLatitude)
map(xlim=c(range(presence$decimalLongitude)), ylim=c(range(presence$decimalLatitude)))
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)

1.5 Clean the data

If R is new for you and you are choosing to run this for your own species, this section is OPTIONAL

There are many steps that people often do to ‘clean’ their data, including cross-checking, georeferencing, and dealing with sampling bias. You are not required to do these steps for this lab, but it would be good for you to know that these are common pre-processing steps. Cross-checking: It is important to cross-check coordinates by visual and other means. One approach is to compare the country (and lower level administrative subdivisions) of the site as specified by the records, with the country implied by the coordinates. You could do this in R or ArcGIS. Sampling bias: Sampling bias is frequently present in occurrence records. One can attempt to remove some of the bias by subsampling records. However, subsampling reduces the number of records, and it cannot correct the data for areas that have not been sampled at all. It also suffers from the problem that locally dense records might in fact be a true reflection of the relative suitable of habitat. As in many steps in SDM, you need to understand something about your data and species to implement them well. In a real research project you would want to spend much more time on this first data-cleaning and completion step, partly with R, but also with other programs.

For this lab, we’ll just make sure records are generally in the right place. Pongo pygmaeus is a species that occurs on the islands of Sumatra and Kalimantan in Indonesia Do you see any errors on the map?

Taking a liberal approach, you could correct any obvious errors (a record appears to be missing a minus sign, or it is plausable that the lat and long are switched). Taking a conservative approach, you could exclude just any records that are clearly not within range.

1.5.1 Zoom in a little to help visualize - you can specify a boundary or coordinates

Option 1: you could use political boundaries for mapping. If you use this option when you run this for your own species, you will REPLACE ‘Indonesia’ below with the boundary of your choice

map (regions ='Indonesia')
map.axes()

Option 2: You could specify the coordinates directly. If you use this option when you run this for your own species, you will REPLACE the coordinates below with the relevant coordinates for your species

zoom_x<-c(90,120)
zoom_y<-c(-10,10)
map(xlim=zoom_x, ylim=zoom_y)
map.axes()

1.5.2 Get rid of points outside of the range where you would expect to see the species.

When you run this for your own species, you will REPLACE the coordinates below with the relevant coordinates for your species. We’ll also check this new dataset to ensure we’ve removed everything we need to, once at the full extent and once Zooming in

presence <- presence[(presence$decimalLongitude > 90 & presence$decimalLongitude < 120 ) & (presence$decimalLatitude > -10 & presence$decimalLatitude < 10) , ]

map()
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)

map(xlim=zoom_x, ylim=zoom_y)
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)

1.5.3 Further refine.

You can see there are still two misplaced points, and we’ll inspect the latitude and longitude columns to find them. We create a unique identifier for each row (FID), inspect each row that seems to be out of bounds and note the FID of those rows, exclude those errant points from the data, and, finally, keep just the latitude and longitude columns for the final dataset.

When you run this for your own species, you will REPLACE the coordinates below with the relevant coordinates for your species

# Create a unique identifier for each row
presence$FID<-1:nrow(presence) 

# inspect each row that seems to be out of bounds and note the FID of those rows
presence[(presence$decimalLatitude < 0 & presence$decimalLongitude < 100),]
presence[(presence$decimalLatitude>0 & presence$decimalLongitude < 105 & presence$decimalLongitude>102),]

# Exclude those errant points from the data FID 562-567
presence<-presence[-c(562:567),]

# Check it
map(xlim=zoom_x, ylim=zoom_y)
map.axes()
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)

# Keep just the latitude and longitude columns
presence<-presence[,c(1,2)]

Note: In the future, you could also do all of the above in QGIS or some other program if you prefer

You now have a dataset of cleaned species occurrence points! Think: Do you think the observations are a reasonable representation of the distribution (and ecological niche) of the species?

2. Environmental Data

In species distribution modeling, predictor variables are typically organized as raster type files. Each predictor should be a ’raster’ representing a variable of interest. Variables can include climatic, soil, terrain, vegetation, land use, and other variables. These data are typically stored in files in some kind of GIS format. Almost all relevant formats can be used (including ESRI grid, geoTiff, netCDF, IDRISI). Avoid ASCII files if you can, as they tend to considerably slow down processing speed. For any particular study the layers should all have the same spatial extent, resolution, origin, and projection. If necessary, use functions like crop, extend, aggregate, resample, and projectRaster from the ’raster’ package to prepare your predictor variable data. See the help files and the vignette of the raster package for more info on how to do this. The set of predictor variables (rasters) can be used to make a ’RasterStack’, which is a collection of ’RasterLayer’ objects (see the raster package for more info).

2.1 Get Environmental data

We’ll use WorldClim data. The Worldclim data represent ’bioclimatic variables’ from the WorldClim database (http://www.worldclim.org, Hijmans et al., 2004). See https://worldclim.org/data/index.html for more information.

We’ll use the getData function from the raster package to download Worldclim data, we’ll explore it, and we’ll plot one of the variables. You can use ‘?getData’ to see what other data are available in the raster package.

worldcl<-getData(name='worldclim', var='bio', res=2.5)
?getData
names(worldcl)
plot(worldcl[[1]])

Remember to think critically abut what environental data you select. You’ll want to do a little research on your species of interest in order to select variables that are biologically meaningful for your model next week. Predictor variable selection can be important, particularly if the objective of a study is explanation. In all cases it is important to use variables that are relevant to the ecology of the species (rather than with the first dataset that can be found on the web!). In some cases it can be useful to develop new, more ecologically relevant, predictor variables from existing data. For example, one could use land cover data and the focal function in the raster package to create a new variable that indicates how much forest area is available within x km of a grid cell, for a species that might have a home range of x

3. Absence / Background Data

So we have our species occurrence data and our predictor variables (climate in this case). Some SDM models only use ’presence’ data. Other methods also use ’absence’ data or ’background’ data. If you have a large dataset with presence/absence from a well designed survey, you should use a method that can use these data (i.e. do not use a modeling method that only considers presence data). If you only have presence data, you can still use a method that needs absence data, by substituting absence data with background data. Background data are not attempting to guess at absence locations, but rather to characterize environments in the study region (with or without the species present). Background data establishes the environmental domain of the study, whilst presence data should establish under which conditions a species is more likely to be present than on average. Survey-absence data has value. In conjunction with presence records, it establishes where surveys have been done, and the prevalence of the species given the survey effort. That information is lacking for presence-only data, a fact that can cause substantial difficulties for modeling presence-only data well. However, absence data can also be biased and incomplete, as discussed in lecture (i.e., “false absences”).

There are several approaches we could use to sample ’pseudo-absence’ points

3.1 Option 1: Background points from a mask

You can use a ’mask’ to exclude area from a raster with no data (NA) and the randomPoints function to pull background points from within that mask (all land in the world, in our case).

mask <- worldcl[[1]]
# set seed to assure that the examples will always have the same random sample
set.seed(1963)
# select 500 random points 
bg <- randomPoints(mask, 500 )
map()
points(bg)
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)

3.2 Option 2: Background points from a restricted mask

You can use an ’extent’ to further restrict the area from which random locations are drawn. Let’s use the extent of our species points.
The map here allows us to visualize the presences and absences together.

### NOTE THAT THIS MAP AND THE NEXT FIGURE BELOW (FOR THE ORANGUTAN OR YOUR OWN SPECIES) ARE ALL YOU WILL TURN IN FOR THIS LAB 3a. If you use the orangutan, MODIFY the color “orange” below with a different color.

e <- extent(c(range(presence$decimalLongitude), range(presence$decimalLatitude)))
bg2 <- randomPoints(mask, 500, ext=e) 
map(xlim=c(range(presence$decimalLongitude)), ylim=c(range(presence$decimalLatitude)))
points(bg2)
points(presence$decimalLongitude, presence$decimalLatitude, col="orange", pch=20, cex=0.75)

We will not go over this option in class, but just note that you could instead take the background points from within a radius of the presence points.

4. Dataframe

Create a dataframe to run the models on. We now have a set of predictor variables (rasters) and occurrence points / absence. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the dismo package). This is a very straightforward thing to do using the ’extract’ function from the raster package.

4.1 Extract values from your predictor variables at the location of your occurrence points and your absence / background points and save this to your working directory. CHECK your working directory to make sure there is a file called sdmdata.csv there.

presvals <- extract(worldcl, presence)
absvals <- extract(worldcl, bg2)
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))
write.csv(sdmdata, "sdmdata.csv")

4.2 Assess if there is colinearity among predictor variables

### NOTE THAT THIS FIGURE BELOW AND THE MAP ABOVE (FOR THE ORANGUTAN OR YOUR OWN SPECIES) ARE ALL YOU WILL TURN IN FOR THIS LAB 3a

pairs(sdmdata[,2:5], cex=0.1, fig=TRUE)

REPORT

As noted, all you need to turn in for this lab 3a are the two figures specified above for your own species.

Next week’s Lab

Next week, we will do Part II: Fitting a model, making a prediction, and evaluating the result. We will have a report due for Lab 3a and 3b in combination. The report will be based just on the data of the species of your choosing rather than on the orangutan data we worked with in class.

Lab3a

Megan Cattau, Modified from Robert J. Hijmans and Jane Elith, 2016

2024