if (!"occAssess" %in% installed.packages()) devtools::install_github("https://github.com/robboyd/occAssess")
library(occAssess)This vignette provides a worked example for the functionality of occAssess. For a full tutorial see the main vignette supplied in Boyd et al. (2021).
In this worked example, we simulated data that approximates random distributions in space and time for fourty species. The species were simulated with varying prevalence (number of records varies by a factor of 5 with a mean of ~1000). The data were generated at random locations in the UK, and uniformly over the period 2001 to 2010. Each simulated data point was randomly assigned to one of two surveys which are specified in the “identifier” field. The data can be accessed within occAssess as follows:
data(random40Species)
spDat <- random40Species
str(spDat)
#> 'data.frame': 40000 obs. of 25 variables:
#> $ species : Factor w/ 40 levels "species1","species10",..: 14 19 8 28 5 33 12 27 36 12 ...
#> $ x : num 213500 289500 369500 223500 374500 ...
#> $ y : num 797500 745500 537500 591500 520500 ...
#> $ year : int 2009 2004 2006 2010 2009 2004 2004 2004 2010 2001 ...
#> $ spatialUncertainty: num 9802 3506 15310 9810 7189 ...
#> $ identifier : Factor w/ 2 levels "survey1","survey2": 2 2 1 1 1 1 1 1 1 1 ...
#> $ clim1 : num 7.01 6.65 6.76 8.05 7.3 ...
#> $ clim2 : num 12.2 12.2 12.4 13 12.9 ...
#> $ clim3 : num 2.43 1.82 1.82 3.63 2.32 ...
#> $ clim4 : num 2402 1300 1464 1364 1368 ...
#> $ clim5 : num 296 167 164 152 149 ...
#> $ clim6 : num 105.6 67.9 84.6 70.7 80.4 ...
#> $ clim7 : num 36.1 29 24.6 26.2 23.6 ...
#> $ clim8 : num 862 440 483 446 445 ...
#> $ clim9 : num 341 219 259 229 245 ...
#> $ clim10 : num 399 240 292 280 282 ...
#> $ clim11 : num 780 412 440 392 400 ...
#> $ clim12 : num 6.39 6.79 6.87 6.2 6.92 ...
#> $ clim13 : num 38 37.5 37.1 37.5 37.1 ...
#> $ clim14 : num 406 430 435 392 437 ...
#> $ clim15 : num 16.5 16.9 17.5 17.2 18.1 ...
#> $ clim16 : num -0.322 -1.163 -1.018 0.666 -0.573 ...
#> $ clim17 : num 16.8 18.1 18.5 16.5 18.6 ...
#> $ clim18 : num 3.11 2.55 2.6 6.3 3.12 ...
#> $ clim19 : num 8.48 8.2 8.21 9.29 8.75 ...In this example, we will specify five periods over 2001 to 2010
All of the functions in occAssess require two common arguments: dat and periods (outlined above). I will run through each function in the following, indicating where additional arguments are required. Generally, the functions in occAssess return a list with two elements: one being a ggplot2 object, with a separate panel for each level of identifier; and a second with the data underpinning the plot.
The first function I will introduce is the simplest: assessRecordNumber. This function simply plots out the number of records per year in your dataset.
nRec <- assessRecordNumber(dat = spDat,
periods = periods)
str(nRec$data)
#> 'data.frame': 10 obs. of 3 variables:
#> $ val : int 3968 4093 4029 4019 4020 3960 4043 4029 3967 3872
#> $ group : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#> $ Period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5
nRec$plotThis function enables researchers to quickly establish how the number of records has changed over time.
In addition to the number of records, you may wish to know how the number of species (taxonomic coverage) in your dataset changes over time. For this you can use the function assessSpeciesNumber:
nSpec <- assessSpeciesNumber(dat = spDat,
periods = periods)
str(nSpec$data)
#> 'data.frame': 10 obs. of 3 variables:
#> $ val : int 40 40 40 40 40 40 40 40 40 40
#> $ group : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#> $ Period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5
nSpec$plotIt has been speculated that apparent changes in taxonomic coverage could, in fact, reflect a change in taxonomic expertise over time. For example, if fewer individuals have the skill to identify certain species, then it may not appear in your dataset in the later periods. The function assessSpeciesID treats the proportion of species identified to species level as a proxy for taxonomic expertise:
propID <- assessSpeciesID(dat = spDat,
periods = periods,
type = "proportion")
str(propID$data)
#> 'data.frame': 10 obs. of 3 variables:
#> $ prop : num 1 1 1 1 1 1 1 1 1 1
#> $ group : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#> $ Period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5
propID$plotThe argument “type” can take the values proportion (proportion of records identified to species level) or count (number of records identified to species level).
A number of studies have defined taxonomic bias in a dataset as the degree of proportionality between species’ range sizes (usually proxied by the number of grid cells on which it has been recorded) and the total number of records. One can regress the number of records on range size, and the residuals give an index of how over-or undersampled a species is given its prevalence. The function assessRarityBias conducts these analyses for each time period, and uses the r2 value from the linear regressions as an index proportionality between range sizes and number of records. Higher values indicate that species’ are sampled in proportion to their range sizes whereas lower values indicate that some species are over- or undersampled.
taxBias <- assessRarityBias(dat = spDat,
periods = periods,
res = 20000)
#> Warning in assessRarityBias(dat = spDat, periods = periods, res = 20000): There
#> are less than 30 species in some time periods which is a low sample size for the
#> regression. View results with caution.
str(taxBias$data)
#> 'data.frame': 10 obs. of 3 variables:
#> $ period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5
#> $ id : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#> $ index : num 0.947 0.92 0.944 0.948 0.922 ...
taxBias$plotNote the warning message which tells us that there are low numbers of species in some periods (not surprising as the data only contain five species). This represents a small sample size for the regression of range size on number of records so the results should be viewed with caution.
The function assessSpatialCov grids your data at a specified spatial resolution then maps it in geographic space. In this example, I provide a shapefile with the boundaries of the UK. If I was working on the WGS84 coordinate reference system (here I am using OSGB 36) this would not be necessary; I could instead use the countries argument and simply specify “UK”.
library(BRCmap) ## a colleague's package that is not publically available. Users will have to provide their own shapefile
#> Loading required package: maptools
#> Warning: package 'maptools' was built under R version 3.6.3
#> Loading required package: sp
#> Warning: package 'sp' was built under R version 3.6.3
#> Checking rgeos availability: TRUE
#> Loading required package: rgeos
#> Warning: package 'rgeos' was built under R version 3.6.3
#> rgeos version: 0.5-5, (SVN revision 640)
#> GEOS runtime version: 3.8.0-CAPI-1.13.1
#> Linking to sp version: 1.4-2
#> Polygon checking: TRUE
#> Loading required package: rgdal
#> Warning: package 'rgdal' was built under R version 3.6.3
#> rgdal: version: 1.5-16, (SVN revision 1050)
#> Geospatial Data Abstraction Library extensions to R successfully loaded
#> Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
#> Path to GDAL shared files: C:/Users/Rob.Lenovo-PC/Documents/R/win-library/3.6/rgdal/gdal
#> GDAL binary built with GEOS: TRUE
#> Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
#> Path to PROJ shared files: C:/Users/Rob.Lenovo-PC/Documents/R/win-library/3.6/rgdal/proj
#> Linking to sp version:1.4-2
#> To mute warnings of possible GDAL/OSR exportToProj4() degradation,
#> use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
data(UK)
shp <- UK$britain
assessSpatialCov(dat = spDat,
res = 10000,
logCount = TRUE,
countries = NULL,
shp <- shp)As you can see there are three new arguments to be specified. res is the spatial resolution at which you would like to map the data (units depend on you coordinate reference system, e.g. m if easting and northing, and decimal degress in lon/ lat); logCount indicates whether or not you would like to log10 transform the counts for visual purposes; and countries defines the countries covered by your data. Countries must be specified in order to plot their boundaries.
Point occurrence data often comes with assocatiated spatial uncertainty (i.e. how uncertain the coordinates are in x and y dimensions). For example, GBIF data comes with a field called coorinateUncertaintyInMeters. The function assessSptialUncertainty can be used to visualize spatial uncertainty in your dataset as a kernel density plot:
assessSpatialUncertainty(dat = spDat,
periods = periods)
#> $data
#> [,1]
#> [1,] 1
#> [2,] 1
#>
#> $plot
#> Warning: Removed 9957 rows containing non-finite values (stat_density).Even if your data has good spatial coverage, it may be biased; that is to say, it may deviate from a random distribution in space. The function assessSpatialBias provides an index of how far your data deviates from a random distribution. To do this is simulates an equal number of points to your data randomly across your study region. Then, for each time period, it calculates the average nearest neighbour distance across your data points and divides it by the average nearest neighbour distance from the random sample. If the index is lower than one then your data is more clustered than the random sample, and if it is above one it is more dispersed. To delineate your study area, you must provide a mask layer. The mask is a raster object which is has numeric values within your study area, and is NA outside of your study area. Here, I’ll use some species distribution model outputs for the UK as a mask layer:
mask <- raster::raster("C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/Mammals.asc")
mask
#> class : RasterLayer
#> dimensions : 1250, 700, 875000 (nrow, ncol, ncell)
#> resolution : 1000, 1000 (x, y)
#> extent : 0, 7e+05, 0, 1250000 (xmin, xmax, ymin, ymax)
#> crs : NA
#> source : C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/Mammals.asc
#> names : Mammals
spatBias <- assessSpatialBias(dat = spDat,
periods = periods,
mask = mask,
nSamps = 10,
degrade = TRUE)
str(spatBias$data)
#> 'data.frame': 10 obs. of 5 variables:
#> $ mean : num 1.012 0.998 1.002 0.999 0.99 ...
#> $ upper : num 1.03 1.01 1.01 1.01 1 ...
#> $ lower : num 1.003 0.989 0.989 0.99 0.979 ...
#> $ Period : Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5
#> $ identifier: Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
spatBias$plotThe argument nSamps indicates how many random distributions should be drawn, and the argument degrade = TRUE indicates that any duplicated coordinates within a time period and for a given level of identifier are removed. The shaded regions on the plot indicate the 5th and 95th percentiles of the nearest neighbour index calculated over nSamps random samples.
Spatial bias in your dataset does not necessarily tell you anything about environmental bias. The function assessEnvBias assess the degree to which your data are biased across time periods in environmental space. To do this we first need to get some climate data. I will use the standard suite of 19 bioclimatic variables from worldclim. It is possible to get this data through R using the raster package, but here I will use my local version for speed:
## How to get the data using raster::getData()
#clim <- raster::getData("worldclim",var="bio",res=10)assessEnvBias conducts a principal component analysis on your environmental data, then maps your occurrence data in environmental space:
envBias <- assessEnvBias(dat = spDat,
periods = periods,
nEnvVar = 19,
frame = TRUE,
frame.type = "norm")
#> Warning: package 'ggfortify' was built under R version 3.6.3
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: `select_()` was deprecated in dplyr 0.7.0.
#> Please use `select()` instead.
str(envBias$pca)
#> List of 5
#> $ sdev : num [1:19] 527.48 29.91 21.71 10.05 6.58 ...
#> $ rotation: num [1:19, 1:19] 0.001397 0.002023 0.000811 -0.874051 -0.112379 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:19] "clim1" "clim2" "clim3" "clim4" ...
#> .. ..$ : chr [1:19] "PC1" "PC2" "PC3" "PC4" ...
#> $ center : Named num [1:19] 8.69 14.16 3.84 1091 122.44 ...
#> ..- attr(*, "names")= chr [1:19] "clim1" "clim2" "clim3" "clim4" ...
#> $ scale : logi FALSE
#> $ x : num [1:40000, 1:19] -159.1 -305.2 333.9 55.2 39.9 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:40000] "10" "21" "30" "34" ...
#> .. ..$ : chr [1:19] "PC1" "PC2" "PC3" "PC4" ...
#> - attr(*, "class")= chr "prcomp"
envBias$plot It is also possible to modify the appearance of envBias$plot using additional arguments that can be passed to ggfortify::autoplot. For example, you can include elipses, use different principal components, include vaiable vectors, etc.