if (!"occAssess" %in% installed.packages()) devtools::install_github("https://github.com/robboyd/occAssess")
library(occAssess)occAssess enables quick and easy screening of species occurrence data for common forms of bias and uncertainty.
The package comprises a number of discrete functions, each of which is designed to assess a common form of bias or uncertainty, or indicate poor coverage. Generally speaking, users must simply pass their occurrence data to the functions, and specify time periods into which the resulting metrics will be split. Ouputs are provided in list format, with one element containing a ggplot2 object, and a second containing the data that underpins that plot.
In this document I demonstrate the utility of occAssess by applying it to data on pollinator occurrences in South America over the period 1950 to 2019. The data were extracted from GBIF using the rgbif function occ_data. I come at the data from the perspective of a modeller interested in estimating how the species’ distributions have changed over time.
occAssess requires a data.frame with the following fields: species (species name), x (x coordinae of record), y (y coordinate), year, spatialUncertainty (how much uncertainty is associated with the x and y coordinates; units do not matter but must be consistent) and identifier. The identifier is used to split the data into groups; for example, it could represent taxonomic groups or specific datasets. Where you have no information for a field, its value should be set to NA.
spDat <- read.csv("C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/GBIF/20.01.21/allDat.csv")
str(spDat)
#> 'data.frame': 120357 obs. of 6 variables:
#> $ species : Factor w/ 452 levels "Allograpta annulipes",..: 123 NA NA 123 NA 127 123 NA NA 127 ...
#> $ x : num -58.4 -58.4 -58.6 -58.4 -56.7 ...
#> $ y : num -34.7 -34.6 -34.5 -34.7 -36.7 ...
#> $ year : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
#> $ spatialUncertainty: num 526 4 31 102 410 410 410 410 NA 410 ...
#> $ identifier : Factor w/ 2 levels "Phyllostomidae",..: 2 2 2 2 2 2 2 2 2 2 ...In addition to your occurrence data, you must also provide a number of periods into which your data will be split (thi can be one period covering e.g. your whole study extent). In this example I will split the data into (roughly) decades over the period 1950 to 2019. Periods are defined as follows:
All of the functions in occAssess require two common arguments: dat and periods (outlined above). I will run through each function in the following, indicating where additional arguments are required. Generally, the functions in occAssess return a list with two elements: one being a ggplot2 object, with a separate panel for each level of identifier; and a second with the data underpinning the plot.
The first function I will introduce is the simplest: assessRecordNumber. This function simply plots out the number of records per year in your dataset.
nRec <- assessRecordNumber(dat = spDat,
periods = periods)
str(nRec$data)
#> 'data.frame': 139 obs. of 4 variables:
#> $ val : int 236 352 220 167 32 330 3 362 6 233 ...
#> $ year : int 1950 1950 1951 1951 1952 1952 1953 1953 1954 1954 ...
#> $ group : Factor w/ 2 levels "Phyllostomidae",..: 2 1 2 1 2 1 2 1 2 1 ...
#> $ Period: Factor w/ 7 levels "p7","p6","p5",..: 7 7 7 7 7 7 7 7 7 7 ...
nRec$plotThis function enables researchers to quickly establish how the number of records has changed over time.
In addition to the number of records, you may wish to know how the number of species (taxonomic coverage) in your dataset changes over time. For this you can use the function assessSpeciesNumber:
nSpec <- assessSpeciesNumber(dat = spDat,
periods = periods)
str(nSpec$data)
#> 'data.frame': 139 obs. of 4 variables:
#> $ val : int 19 25 26 18 10 24 3 24 3 29 ...
#> $ year : int 1950 1950 1951 1951 1952 1952 1953 1953 1954 1954 ...
#> $ group : Factor w/ 2 levels "Phyllostomidae",..: 2 1 2 1 2 1 2 1 2 1 ...
#> $ Period: Factor w/ 7 levels "p7","p6","p5",..: 7 7 7 7 7 7 7 7 7 7 ...
nSpec$plotIt has been speculated that apparent changes in taxonomic coverage could, in fact, reflect a change in taxonomic expertise over time. For example, if fewer individuals have the skill to identify certain species, then it may not appear in your dataset in the later periods. The function assessSpeciesID treats the proportion of species identified to species level as a proxy for taxonomic expertise:
propID <- assessSpeciesID(dat = spDat,
periods = periods,
type = "proportion")
str(propID$data)
#> 'data.frame': 139 obs. of 4 variables:
#> $ prop : num 0.911 0.997 0.95 1 0.969 ...
#> $ year : int 1950 1950 1951 1951 1952 1952 1953 1953 1954 1954 ...
#> $ group : Factor w/ 2 levels "Phyllostomidae",..: 2 1 2 1 2 1 2 1 2 1 ...
#> $ Period: Factor w/ 7 levels "p7","p6","p5",..: 7 7 7 7 7 7 7 7 7 7 ...
propID$plotThe argument “type” can take the values proportion (proportion of records identified to species level) or count (number of records identified to species level).
A number of studies have defined taxonomic bias in a dataset as the degree of proportionality between species’ range sizes (usually proxied by the number of grid cells on which it has been recorded) and the total number of records. One can regress the number of records on range size, and the residuals give an index of how over-or undersampled a species is given its prevalence. The function assessSpeciesBias conducts these analyses for each time period, and uses the r2 value from the linear regressions as an index proportionality between range sizes and number of records. Higher values indicate that species’ are sampled in proportion to their range sizes whereas lower values indicate that some species are over- or undersampled.
taxBias <- assessSpeciesBias(dat = spDat,
periods = periods,
res = 0.5)
#> Warning in assessSpeciesBias(dat = spDat, periods = periods, res = 0.5):
#> Removing 9626 records because they are do not identified to species level.
str(taxBias$data)
#> 'data.frame': 14 obs. of 3 variables:
#> $ period: Factor w/ 7 levels "p1","p2","p3",..: 1 2 3 4 5 6 7 1 2 3 ...
#> $ id : Factor w/ 2 levels "Syrphidae","Phyllostomidae": 1 1 1 1 1 1 1 2 2 2 ...
#> $ index : num 0.625 0.777 0.659 0.413 0.223 ...
taxBias$plotThe function assessSpatialCov grids your data at a specified spatial resolution then maps it in geographic space:
assessSpatialCov(dat = spDat,
periods = periods,
res = 0.5,
logCount = TRUE,
countries = c("Brazil", "Argentina", "Chile", "Colombia", "Ecuador", "Bolivia", "Uruguay",
"Venezuela", "Guayana", "Paraguay", "Peru", "Suriname"))As you can see there are three new arguments to be specified. res is the spatial resolution at which you would like to map the data (units depend on you coordinate reference system, e.g. m if easting and northing, and decimal degress in lon/ lat); logCount indicates whether or not you would like to log10 transform the counts for visual purposes; and countries defines the countries covered by your data. Countries must be specified in order to plot their boundaries.
Point occurrence data often comes with assocatiated spatial uncertainty (i.e. how uncertain the coordinates are in x and y dimensions). For example, GBIF data comes with a field called coorinateUncertaintyInMeters. The function assessSptialUncertainty can be used to visualize spatial uncertainty in your dataset:
assessSpatialUncertainty(dat = spDat,
periods = periods)
#> Warning: Ignoring unknown aesthetics: fill
#> $data
#> [,1]
#> [1,] 0.4927808
#> [2,] 0.2744779
#>
#> $plot
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 86654 rows containing non-finite values (stat_bin).
#> Warning: Removed 28 rows containing missing values (geom_bar).Even if your data has good spatial coverage, it may be biased; that is to say, it may deviate from a random distribution in space. The function assessSpatialBias provides an index of how far your data deviates from a random distribution. To do this is simulates an equal number of points to your data randomly across your study region. Then, for each time period, it calculates the average nearest neighbour distance across your data points and divides it by the average nearest neighbour distance from the random sample. If the index is lower than one then your data is more clustered than the random sample, and if it is above one it is more dispersed. To delineate your study area, you must provide a mask layer. The mask is a raster object which is has numeric values within your study area, and is NA outside of your study area. Here, I’ll use some climate data from South America as a mask layer:
mask <- raster::raster("C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/occAssessData/climSA.asc")
mask
#> class : RasterLayer
#> dimensions : 410, 279, 114390 (nrow, ncol, ncell)
#> resolution : 0.1666667, 0.1666667 (x, y)
#> extent : -81.33333, -34.83333, -55.83333, 12.5 (xmin, xmax, ymin, ymax)
#> crs : NA
#> source : C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/occAssessData/climSA.asc
#> names : climSA
spatBias <- assessSpatialBias(dat = spDat,
periods = periods,
mask = mask,
nSamps = 10,
degrade = TRUE)
str(spatBias$data)
#> 'data.frame': 14 obs. of 5 variables:
#> $ mean : num 0.283 0.255 0.475 0.523 0.465 ...
#> $ upper : num 0.306 0.268 0.511 0.628 0.49 ...
#> $ lower : num 0.266 0.243 0.434 0.445 0.442 ...
#> $ Period : Factor w/ 7 levels "p1","p2","p3",..: 1 2 3 4 5 6 7 1 2 3 ...
#> $ identifier: Factor w/ 2 levels "Syrphidae","Phyllostomidae": 1 1 1 1 1 1 1 2 2 2 ...
spatBias$plotThe argument nSamps indicates how many random distributions should be drawn, and the argument degrade = TRUE indicates that any duplicated coordinates within a time period and for a given level of identifier are removed. The shaded regions on the plot indicate the 5th and 95th percentiles of the nearest neighbour index calculated over nSamps random samples.
Spatial bias in your dataset does not necessarily tell you anything about environmental bias. The function assessEnvBias assess the degree to which your data are biased across time periods in environmental space. To do this we first need to get some climate data. I will use the standard suite of 19 bioclimatic variables from worldclim. It is possible to get this data through R using the raster package, but here I will use my local version for speed:
## How to get the data using raster::getData()
#clim <- raster::getData("worldclim",var="bio",res=10)
clim <- raster::stack(list.files("C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/bio/bio/",
full.names = T))
spDat <- cbind(spDat, raster::extract(clim, spDat[, c("x", "y")]))assessEnvBias conducts a principal component analysis on your environmental data, then maps your occurrence data in environmental space:
envBias <- assessEnvBias(dat = spDat,
periods = periods,
nEnvVar = 19,
frame = TRUE,
frame.type = "norm")
#> Warning: package 'ggfortify' was built under R version 3.6.3
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: `select_()` is deprecated as of dplyr 0.7.0.
#> Please use `select()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
str(envBias$pca)
#> List of 5
#> $ sdev : num [1:19] 1122.4 292.7 163 120.2 84.8 ...
#> $ rotation: num [1:19, 1:19] 0.001196 0.000839 0.001576 0.887799 0.092089 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:19] "wc2.1_10m_bio_1" "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12" ...
#> .. ..$ : chr [1:19] "PC1" "PC2" "PC3" "PC4" ...
#> $ center : Named num [1:19] 23.5 24.6 22.4 2050.2 300.2 ...
#> ..- attr(*, "names")= chr [1:19] "wc2.1_10m_bio_1" "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12" ...
#> $ scale : logi FALSE
#> $ x : num [1:120264, 1:19] -1159 -1130 -1168 -1159 -1304 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:120264] "1" "2" "3" "4" ...
#> .. ..$ : chr [1:19] "PC1" "PC2" "PC3" "PC4" ...
#> - attr(*, "class")= chr "prcomp"
envBias$plot It is also possible to modify the appearance of envBias$plot using additional arguments that can be passed to ggfortify::autoplot. For example, you can include elipses, use different principal components, include vaiable vectors, etc.