Worked example 3: What does a near-random dataset look like?

Functions

All of the functions in occAssess require two common arguments: dat and periods (outlined above). I will run through each function in the following, indicating where additional arguments are required. Generally, the functions in occAssess return a list with two elements: one being a ggplot2 object, with a separate panel for each level of identifier; and a second with the data underpinning the plot.

assessRecordNumber

The first function I will introduce is the simplest: assessRecordNumber. This function plots the number of records in each time period in your dataset, separately for each level of identifier.


nRec <- assessRecordNumber(dat = spDat,
                           periods = periods)

str(nRec$data)
#> 'data.frame':    10 obs. of  3 variables:
#>  $ val   : int  3968 4093 4029 4019 4020 3960 4043 4029 3967 3872
#>  $ group : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#>  $ Period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5

nRec$plot

This function enables researchers to quickly establish how the number of records has changed over time.

assessSpeciesNumber

In addition to the number of records, you may wish to know how the number of species (taxonomic coverage) in your dataset changes over time. For this you can use the function assessSpeciesNumber:


nSpec <- assessSpeciesNumber(dat = spDat,
                             periods = periods)

str(nSpec$data)
#> 'data.frame':    10 obs. of  3 variables:
#>  $ val   : int  40 40 40 40 40 40 40 40 40 40
#>  $ group : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#>  $ Period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5

nSpec$plot
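
The two summaries above can be approximated by hand, which is a useful sanity check on the package output. The following is a minimal sketch in base R, not the package's implementation. The toy data, the column names (species, year, identifier) and the representation of periods as a list of year vectors are all assumptions made for illustration.

```r
# Toy occurrence data; column names and the period format are assumptions
set.seed(1)
toy <- data.frame(
  species    = sample(paste0("sp", 1:5), 200, replace = TRUE),
  year       = sample(1950:1999, 200, replace = TRUE),
  identifier = sample(c("survey1", "survey2"), 200, replace = TRUE)
)
toyPeriods <- list(1950:1959, 1960:1969, 1970:1979, 1980:1989, 1990:1999)

# Map each year to the index of the period that contains it
toy$Period <- vapply(
  toy$year,
  function(y) which(vapply(toyPeriods, function(p) y %in% p, logical(1))),
  integer(1)
)

# Number of records per identifier and period
nRecToy <- aggregate(species ~ identifier + Period, data = toy, FUN = length)

# Number of distinct species per identifier and period
nSpecToy <- aggregate(species ~ identifier + Period, data = toy,
                      FUN = function(s) length(unique(s)))
```

In both results the summarised value ends up in the species column: a record count in nRecToy and a species count in nSpecToy.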

assessSpeciesID

It has been speculated that apparent changes in taxonomic coverage could, in fact, reflect a change in taxonomic expertise over time. For example, if fewer individuals have the skill to identify certain species, then those species may not appear in your dataset in the later periods. The function assessSpeciesID treats the proportion of records identified to species level as a proxy for taxonomic expertise:


propID <- assessSpeciesID(dat = spDat,
                          periods = periods,
                          type = "proportion")

str(propID$data)
#> 'data.frame':    10 obs. of  3 variables:
#>  $ prop  : num  1 1 1 1 1 1 1 1 1 1
#>  $ group : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#>  $ Period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5

propID$plot

The argument “type” can take the values proportion (proportion of records identified to species level) or count (number of records identified to species level).
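
To make the difference between the two values of type concrete, here is a small base R sketch. It assumes that a record not identified to species level has NA in its species column; that encoding, the toy data and the object names are illustrative assumptions rather than a description of the package internals.

```r
# Toy data: NA in `species` marks a record not identified to species level
# (this encoding is an assumption made for illustration)
toyID <- data.frame(
  species = c("sp1", NA, "sp2", "sp1", NA, NA, "sp2", "sp1"),
  Period  = c("p1", "p1", "p1", "p1", "p2", "p2", "p2", "p2")
)

identified <- !is.na(toyID$species)

# Analogue of type = "count": records identified to species level per period
countToy <- tapply(identified, toyID$Period, sum)

# Analogue of type = "proportion": the same count divided by all records
propToy <- tapply(identified, toyID$Period, mean)
```

Here p1 has three of four records identified and p2 has two of four, so the proportion falls between periods even though the number of records per period is constant.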

assessRarityBias

A number of studies have defined taxonomic bias in a dataset as the degree of proportionality between species' range sizes (usually proxied by the number of grid cells in which each species has been recorded) and the total number of records. One can regress the number of records on range size, and the residuals give an index of how over- or undersampled a species is given its prevalence. The function assessRarityBias conducts these analyses for each time period, and uses the R² value from the linear regressions as an index of proportionality between range sizes and number of records. Higher values indicate that species are sampled in proportion to their range sizes, whereas lower values indicate that some species are over- or undersampled.


taxBias <- assessRarityBias(dat = spDat,
                            periods = periods,
                            res = 20000)
#> Warning in assessRarityBias(dat = spDat, periods = periods, res = 20000): There
#> are less than 30 species in some time periods which is a low sample size for the
#> regression. View results with caution.

str(taxBias$data)
#> 'data.frame':    10 obs. of  3 variables:
#>  $ period: Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5
#>  $ id    : Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2
#>  $ index : num  0.947 0.92 0.944 0.948 0.922 ...

taxBias$plot

Note the warning message, which tells us that there are low numbers of species in some periods (not surprising as the data only contain five species). This represents a small sample size for the regression of number of records on range size, so the results should be viewed with caution.
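
The regression behind the rarity bias index can be sketched for a single time period in a few lines of base R. The data below are simulated and the object names are hypothetical; this illustrates the idea described above and is not the code used inside assessRarityBias.

```r
set.seed(42)
nSpecies <- 40

# Hypothetical range sizes: number of grid cells occupied by each species
rangeSize <- sample(5:200, nSpecies)

# Hypothetical record counts, roughly proportional to range size plus noise
nRecords <- round(10 * rangeSize + rnorm(nSpecies, sd = 150))
nRecords[nRecords < 1] <- 1

# Regress number of records on range size
fit <- lm(nRecords ~ rangeSize)

# R-squared: the index of proportionality for this period
r2 <- summary(fit)$r.squared

# Residuals: positive values suggest oversampling relative to range size,
# negative values suggest undersampling
resid <- residuals(fit)
mostOversampled <- order(resid, decreasing = TRUE)[1:3]
```

Repeating this for each period and each level of identifier would give one index value per combination, which is the shape of the data frame returned above.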

assessSpatialCov

The function assessSpatialCov grids your data at a specified spatial resolution, then maps it in geographic space. In this example, I provide a shapefile with the boundaries of the UK. If I were working in the WGS84 coordinate reference system (here I am using OSGB 36) this would not be necessary; I could instead use the countries argument and simply specify "UK".


library(BRCmap) ## a colleague's package that is not publicly available. Users will have to provide their own shapefile

data(UK)

shp <- UK$britain

assessSpatialCov(dat = spDat,
                 res = 10000,
                 logCount = TRUE,
                 countries = NULL,
                 shp = shp)

As you can see, there are three new arguments to be specified. res is the spatial resolution at which you would like to map the data (units depend on your coordinate reference system, e.g. metres for easting and northing, and decimal degrees for lon/lat); logCount indicates whether or not you would like to log10-transform the counts for visual purposes; and countries defines the countries covered by your data. Countries must be specified in order to plot their boundaries unless, as here, you supply your own shapefile through shp.
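
To illustrate what gridding at a given res and log10-transforming the counts involves, here is a minimal base R sketch. The coordinates are simulated eastings and northings in metres, and the approach (snapping points to the lower-left corner of a square cell) is an assumption for illustration, not necessarily how assessSpatialCov rasterises the data.

```r
set.seed(7)

# Hypothetical easting/northing coordinates in metres
x <- runif(500, 0, 100000)
y <- runif(500, 0, 100000)
gridRes <- 10000  # cell size in the units of the coordinates

# Snap each point to the lower-left corner of its grid cell
cellX <- floor(x / gridRes) * gridRes
cellY <- floor(y / gridRes) * gridRes

# Count records per occupied cell
cellCounts <- aggregate(n ~ cellX + cellY,
                        data = data.frame(cellX, cellY, n = 1),
                        FUN = sum)

# Analogue of logCount = TRUE: log10-transform counts for display
cellCounts$logN <- log10(cellCounts$n)
```

Only occupied cells appear in cellCounts, so the log transform never receives a zero.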

assessSpatialUncertainty

Point occurrence data often comes with associated spatial uncertainty (i.e. how uncertain the coordinates are in x and y dimensions). For example, GBIF data comes with a field called coordinateUncertaintyInMeters. The function assessSpatialUncertainty can be used to visualize spatial uncertainty in your dataset as a kernel density plot:


assessSpatialUncertainty(dat = spDat,
                         periods = periods)
#> $data
#>      [,1]
#> [1,]    1
#> [2,]    1
#> 
#> $plot
#> Warning: Removed 9957 rows containing non-finite values (stat_density).
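
A warning of this kind from ggplot2 typically means that many records have no finite uncertainty value (for example NA) and were dropped before the density was estimated. The following base R sketch, using simulated values and hypothetical object names, shows how one might count the missing values and estimate the density from the remainder.

```r
set.seed(3)

# Hypothetical uncertainty values in metres; most records have none (NA)
spatUnc <- c(runif(43, 10, 1000), rep(NA, 957))

# How many records carry no uncertainty information?
nMissing <- sum(is.na(spatUnc))

# Kernel density estimate from the records that do have a value
dens <- density(spatUnc, na.rm = TRUE)
```

Calling plot(dens) would draw the estimated curve. When nMissing is a large share of the data, the density describes only a small and possibly unrepresentative subset of records.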

assessSpatialBias

Even if your data has good spatial coverage, it may be biased; that is to say, it may deviate from a random distribution in space. The function assessSpatialBias provides an index of how far your data deviates from a random distribution. To do this, it simulates a number of points equal to the number of records in your data, placed randomly across your study region. Then, for each time period, it calculates the average nearest neighbour distance across your data points and divides it by the average nearest neighbour distance from the random sample. If the index is lower than one then your data is more clustered than the random sample, and if it is above one it is more dispersed. To delineate your study area, you must provide a mask layer. The mask is a raster object which has numeric values within your study area, and is NA outside of your study area. Here, I'll use some species distribution model outputs for the UK as a mask layer:


mask <- raster::raster("C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/Mammals.asc")

mask
#> class      : RasterLayer 
#> dimensions : 1250, 700, 875000  (nrow, ncol, ncell)
#> resolution : 1000, 1000  (x, y)
#> extent     : 0, 7e+05, 0, 1250000  (xmin, xmax, ymin, ymax)
#> crs        : NA 
#> source     : C:/Users/Rob.Lenovo-PC/Documents/surpass/Data/Mammals.asc 
#> names      : Mammals


spatBias <- assessSpatialBias(dat = spDat,
                              periods = periods,
                              mask = mask,
                              nSamps = 10,
                              degrade = TRUE)

str(spatBias$data)
#> 'data.frame':    10 obs. of  5 variables:
#>  $ mean      : num  1.012 0.998 1.002 0.999 0.99 ...
#>  $ upper     : num  1.03 1.01 1.01 1.01 1 ...
#>  $ lower     : num  1.003 0.989 0.989 0.99 0.979 ...
#>  $ Period    : Factor w/ 5 levels "p1","p2","p3",..: 1 2 3 4 5 1 2 3 4 5
#>  $ identifier: Factor w/ 2 levels "survey1","survey2": 1 1 1 1 1 2 2 2 2 2

spatBias$plot

The argument nSamps indicates how many random distributions should be drawn, and the argument degrade = TRUE indicates that any duplicated coordinates within a time period and for a given level of identifier are removed. The shaded regions on the plot indicate the 5th and 95th percentiles of the nearest neighbour index calculated over nSamps random samples.
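
The nearest neighbour index described above can be sketched in base R. This example is a simplification under stated assumptions: the study region is a 100 x 100 square rather than a raster mask, the observed points are simulated as three clusters, and the helper meanNN is hypothetical rather than part of the package.

```r
set.seed(11)

# Mean distance from each point to its nearest neighbour
meanNN <- function(xy) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf  # ignore the zero distance from a point to itself
  mean(apply(d, 1, min))
}

n <- 200

# Clustered "observed" points around three hypothetical centres
centres <- matrix(c(20, 20, 50, 80, 80, 30), ncol = 2, byrow = TRUE)
idx <- sample(1:3, n, replace = TRUE)
obs <- centres[idx, ] + matrix(rnorm(2 * n, sd = 3), ncol = 2)

# Draw nSamps random point sets of the same size and compute the index each time
nSamps <- 10
nni <- replicate(nSamps, {
  rand <- cbind(runif(n, 0, 100), runif(n, 0, 100))
  meanNN(obs) / meanNN(rand)
})

# Summary comparable to the mean, lower and upper columns shown above
nniSummary <- c(mean  = mean(nni),
                lower = unname(quantile(nni, 0.05)),
                upper = unname(quantile(nni, 0.95)))
```

Because the observed points are strongly clustered, every value of nni falls well below one. For data close to random, as in the example dataset above, the index sits near one.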

assessEnvBias

Spatial bias in your dataset does not necessarily tell you anything about environmental bias. The function assessEnvBias assesses the degree to which your data are biased across time periods in environmental space. To do this we first need to get some climate data. I will use the standard suite of 19 bioclimatic variables from WorldClim. It is possible to get this data through R using the raster package, but here I will use my local version for speed:

## How to get the data using raster::getData()

#clim <- raster::getData("worldclim",var="bio",res=10)

assessEnvBias conducts a principal component analysis on your environmental data, then maps your occurrence data in environmental space:


envBias <- assessEnvBias(dat = spDat,
                         periods = periods,
                         nEnvVar = 19,
                         frame = TRUE,
                         frame.type = "norm")
#> Warning: package 'ggfortify' was built under R version 3.6.3
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: `select_()` was deprecated in dplyr 0.7.0.
#> Please use `select()` instead.

str(envBias$pca)
#> List of 5
#>  $ sdev    : num [1:19] 527.48 29.91 21.71 10.05 6.58 ...
#>  $ rotation: num [1:19, 1:19] 0.001397 0.002023 0.000811 -0.874051 -0.112379 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:19] "clim1" "clim2" "clim3" "clim4" ...
#>   .. ..$ : chr [1:19] "PC1" "PC2" "PC3" "PC4" ...
#>  $ center  : Named num [1:19] 8.69 14.16 3.84 1091 122.44 ...
#>   ..- attr(*, "names")= chr [1:19] "clim1" "clim2" "clim3" "clim4" ...
#>  $ scale   : logi FALSE
#>  $ x       : num [1:40000, 1:19] -159.1 -305.2 333.9 55.2 39.9 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:40000] "10" "21" "30" "34" ...
#>   .. ..$ : chr [1:19] "PC1" "PC2" "PC3" "PC4" ...
#>  - attr(*, "class")= chr "prcomp"

envBias$plot

It is also possible to modify the appearance of envBias$plot using additional arguments that can be passed to ggfortify::autoplot. For example, you can include ellipses, use different principal components, include variable vectors, etc.
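
The principal component analysis step can be illustrated with stats::prcomp on simulated data. The str() output above reports scale as FALSE, so this sketch also leaves the variables unscaled; the three variables and their values are hypothetical stand-ins for the 19 bioclimatic layers.

```r
set.seed(5)

# Hypothetical environmental variables on very different scales
env <- data.frame(
  temp   = rnorm(300, mean = 9,    sd = 2),
  precip = rnorm(300, mean = 1100, sd = 400),
  season = rnorm(300, mean = 120,  sd = 25)
)

# Centred but unscaled PCA, matching scale = FALSE in the output above
pcaToy <- prcomp(env, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
varExplained <- pcaToy$sdev^2 / sum(pcaToy$sdev^2)

# Scores on the first two components: the axes of the environmental-space plot
scores <- pcaToy$x[, 1:2]
```

With unscaled inputs the first component is dominated by the variable with the largest variance (precip here), which is worth bearing in mind when interpreting how occurrence records are spread along PC1.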