This document supports the risk-of-bias workshop of the SBDI conference, January 2024. The heuristics here are designed to aid analysts in thinking about the potential for bias in their dataset relative to the inferential goal of estimating temporal trends in occupancy. In the absence of a designed probability sample of reasonable size, it will generally be impossible to be certain about how biased one’s dataset is relative to a research question; that does not mean, however, that researchers should not thoroughly and thoughtfully investigate the potential for bias, especially when expert knowledge of a dataset is available.
What follows is not a formal ROBITT assessment, nor should it be considered an exhaustive or “correct” overview of potential biases. It is merely a compendium of potentially useful views that may aid a ROBITT assessment; it was produced to support the workshop discussion.
Published on RPubs here. Code is here.
#### Prepare some supporting information for SBDI workshop ####
## Use Swedish Longhorn Beetle data
#rm(list=ls())
if (!"occAssess" %in% installed.packages())
devtools::install_github("https://github.com/robboyd/occAssess")
library(occAssess)
library(ggplot2)
library(maps)
library(here)
library(raster)
library(sp)
library(sf)
library(fasterize)
knitr::opts_chunk$set(fig.width=unit(12.8,"cm"), fig.height=unit(8,"cm"))
We use GBIF data here for convenience, although nothing in the general process that follows is particularly specific to GBIF. Other checks could be carried out that are more specifically designed for errors in GBIF data, e.g. those in the CoordinateCleaner package; note that these are not typically focused on risk of bias, although such data errors can obviously increase the risk of erroneous or nonsensical findings. The importance of such basic checks is likely to be a function of how well one knows one’s data.
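For reference, a rough (not run) sketch of what a basic CoordinateCleaner pass might look like; the exact argument names and available tests should be checked against the installed version of the package.
## Not run: basic coordinate-error flagging with CoordinateCleaner (sketch only;
## check arguments against the installed package version)
#if (!"CoordinateCleaner" %in% installed.packages()) install.packages("CoordinateCleaner")
#flags <- CoordinateCleaner::clean_coordinates(
#  x = dat,
#  lon = "decimalLongitude",
#  lat = "decimalLatitude",
#  species = "species",
#  tests = c("equal", "zeros", "institutions", "centroids"),
#  value = "flagged" # logical vector: TRUE = record passes all selected tests
#)
#datChecked <- dat[flags, ]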
Reporting data cleaning steps is, however, part of the pre-assessment in ROBITT (steps 2.3 and 2.4, covering data provenance and processing respectively). Fairly obviously, failing to understand the basic structure of one’s dataset (varying spatio-temporal resolutions, time-varying taxonomic treatments, or other more specific problems such as records georeferenced to zoo or museum locations) can result in various errors downstream.
## Use GBIF.org (17 January 2024) GBIF Occurrence Download https://doi.org/10.15468/dl.x7uv2e
## Cerambycidae in SWE
# tab-delimited CSV
#dat <- read.delim(file = "data/0058696-231120084113126/0058696-231120084113126.csv",
# header = T, stringsAsFactors = F, sep = "\t")
#save(dat, file = "data/0058696-231120084113126.rdata")
load(file = here("data/0058696-231120084113126.rdata"))
head(dat, n = 5)
unique(dat$countryCode)
## [1] "SE" "FI" "NO"
Note that this download includes data for Sweden, Finland, and Norway (presumably boundary-straddling records?), and has no NAs in the countryCode field. This probably means that any Swedish data lacking an entry in the countryCode field have been silently ignored: countryCode is not a mandatory GBIF field, only “strongly recommended” (https://www.gbif.org/data-quality-requirements-occurrences). A shapefile-based boundary would probably return different results (the impact will obviously vary by country and query).
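As a rough illustration, one could clip records to a Sweden polygon rather than relying on countryCode; the (not run) sketch below uses the coarse boundary from the maps package, so a proper national boundary shapefile would be preferable.
## Not run: clip to a Sweden polygon instead of relying on countryCode (sketch only)
#mapSweden <- sf::st_as_sf(maps::map("world", regions = "Sweden", plot = FALSE, fill = TRUE))
#datXY <- dat[!is.na(dat$decimalLongitude) & !is.na(dat$decimalLatitude), ]
#pts <- sf::st_as_sf(datXY, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
#inSweden <- lengths(sf::st_intersects(pts, mapSweden)) > 0
#table(inSweden, datXY$countryCode)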
A quick look at the composition of this (meta-)dataset.
sort(round(table(dat$datasetKey) / nrow(dat), digits = 3), decreasing = T)
##
## 38b4c89f-584c-41bb-bd8f-cd1def33e92f 9940af5a-3271-4e6a-ad71-ced986b9a9a5
## 0.950 0.032
## 50c9509d-22c7-4a22-a47d-8c48425ef4a7 2a1034af-7400-4e8c-b21b-3837926a9682
## 0.012 0.002
## 8a863029-f435-446a-821e-275f4f641165 fea587f0-540e-11de-bdf7-b8a03c50a862
## 0.001 0.001
## 040c5662-da76-4782-a48e-cdea1892d14c 04930329-93ce-4f80-8d67-423a526a736b
## 0.000 0.000
## 0f188bfc-2537-4863-90b7-3b5f856dc87e 1af83152-24f7-4df7-afbc-b213b62175bb
## 0.000 0.000
## 1bec5de3-758c-4ed2-ab13-1597601ad07a 1c6657e6-8a4f-4731-9e8d-8af0f3cc14c6
## 0.000 0.000
## 2e4cc37b-302e-4f1b-bbbb-1f674ff90e14 38c1351d-9cfe-42c0-97da-02d2c8be141c
## 0.000 0.000
## 3b301884-51b9-443f-b63c-47feeccfb89f 43607f5f-3404-40bf-b812-1cabf3c66f2c
## 0.000 0.000
## 4508bed1-d45d-4a2d-9790-34cc99ee8cce 6ac3f774-d9fb-4796-b3e9-92bf6c81c084
## 0.000 0.000
## 956bc674-6022-4c52-833e-c5fb39dc837a 95ed873a-f762-11e1-a439-00145eb45e9a
## 0.000 0.000
## a2aab040-0680-11df-823c-b8a03c50a862 a307e4d7-1de2-4adc-95d5-a0a8d5f57236
## 0.000 0.000
## b124e1e0-4755-430f-9eab-894f25a9b59c d8cd16ba-bb74-4420-821e-083f2bac17c2
## 0.000 0.000
## df12ca07-f133-4550-ab3b-fde13f0e76ba fb7078b9-63e4-41c1-8496-10f67ea1e1b9
## 0.000 0.000
The vast majority of records (~95%) are from dataset 38b4c89f-584c-41bb-bd8f-cd1def33e92f (https://www.gbif.org/dataset/38b4c89f-584c-41bb-bd8f-cd1def33e92f) – Artportalen, of course. However, “[t]his subdataset of specifically expert validated collection data is not possible to filter out from the GBIF data, the GBIF data include all sorts of other data – from museum collections to field records by opportunistic citizen scientists. But I think it would be possible to filter this subdataset from the Artportalen webpage where it is possible to select the specific project” (D. Arlt, Jan. 2024, pers. comm.).
As is often the case with “biological records”/species occurrence datasets, just because something is a single dataset in a repository does not mean that it is a homogeneous entity in terms of data collection! An obvious point perhaps, but one that can be overlooked by the uninitiated.
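A quick (not run) way to see that heterogeneity directly is to tabulate the basis of record and collection codes within the dominant Artportalen dataset:
## Not run: peek inside the dominant dataset; even a single datasetKey mixes
## basis-of-record types and collections
#apKey <- "38b4c89f-584c-41bb-bd8f-cd1def33e92f"
#table(dat$basisOfRecord[dat$datasetKey == apKey])
#table(dat$collectionCode[dat$datasetKey == apKey], useNA = "ifany")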
str(dat) # Just for our reference
## 'data.frame': 209475 obs. of 50 variables:
## $ gbifID : num 3.90e+09 3.86e+09 4.41e+09 1.46e+09 1.43e+09 ...
## $ datasetKey : chr "8a863029-f435-446a-821e-275f4f641165" "38b4c89f-584c-41bb-bd8f-cd1def33e92f" "38b4c89f-584c-41bb-bd8f-cd1def33e92f" "38b4c89f-584c-41bb-bd8f-cd1def33e92f" ...
## $ occurrenceID : chr "https://observation.org/observation/250179236" "urn:lsid:artportalen.se:sighting:101472866" "urn:lsid:artportalen.se:sighting:111954675" "urn:lsid:artportalen.se:sighting:15336329" ...
## $ kingdom : chr "Animalia" "Animalia" "Animalia" "Animalia" ...
## $ phylum : chr "Arthropoda" "Arthropoda" "Arthropoda" "Arthropoda" ...
## $ class : chr "Insecta" "Insecta" "Insecta" "Insecta" ...
## $ order : chr "Coleoptera" "Coleoptera" "Coleoptera" "Coleoptera" ...
## $ family : chr "Cerambycidae" "Cerambycidae" "Cerambycidae" "Cerambycidae" ...
## $ genus : chr "Leptura" "Judolia" "Stictoleptura" "Stenurella" ...
## $ species : chr "Leptura quadrifasciata" "Judolia sexmaculata" "Stictoleptura rubra" "Stenurella melanura" ...
## $ infraspecificEpithet : chr "" "" "" "" ...
## $ taxonRank : chr "SPECIES" "SPECIES" "SPECIES" "SPECIES" ...
## $ scientificName : chr "Leptura quadrifasciata Linnaeus, 1758" "Judolia sexmaculata (Linnaeus, 1758)" "Stictoleptura rubra (Linnaeus, 1758)" "Stenurella melanura (Linnaeus, 1758)" ...
## $ verbatimScientificName : chr "Leptura quadrifasciata" "Judolia sexmaculata" "Stictoleptura rubra" "Stenurella melanura" ...
## $ verbatimScientificNameAuthorship: chr "" "(Linnaeus, 1758)" "(Linnaeus, 1758)" "(Linnaeus, 1758)" ...
## $ countryCode : chr "SE" "SE" "SE" "SE" ...
## $ locality : chr "Sweden - Årjäng" "Prästfäbodtjärnen, Vb" "Nidingens Fågelstation, Hl" "Stigen, Dls" ...
## $ stateProvince : chr "Sweden - Värmland" "Västerbotten" "Halland" "Dalsland" ...
## $ occurrenceStatus : chr "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...
## $ individualCount : int 1 1 1 3 25 1 1 1 2 1 ...
## $ publishingOrgKey : chr "c8d737e0-2ff8-42e8-b8fc-6b805d26fc5f" "b8323864-602a-4a7d-9127-bb903054e97d" "b8323864-602a-4a7d-9127-bb903054e97d" "b8323864-602a-4a7d-9127-bb903054e97d" ...
## $ decimalLatitude : num 59.6 64.8 57.3 58.6 57.6 ...
## $ decimalLongitude : num 12.1 20.9 11.9 12.1 11.9 ...
## $ coordinateUncertaintyInMeters : num NA 200 1000 5000 250 ...
## $ coordinatePrecision : num NA NA NA NA NA NA NA NA NA NA ...
## $ elevation : num NA NA NA NA NA NA NA NA NA NA ...
## $ elevationAccuracy : num NA NA NA NA NA NA NA NA NA NA ...
## $ depth : logi NA NA NA NA NA NA ...
## $ depthAccuracy : logi NA NA NA NA NA NA ...
## $ eventDate : chr "2022-07-24T00:00:00" "2022-06-06T00:00:00" "2023-09-07T00:00:00" "1936-07-11T00:00:00" ...
## $ day : int 24 6 7 11 12 15 20 17 8 31 ...
## $ month : int 7 6 9 7 10 6 6 7 5 7 ...
## $ year : int 2022 2022 2023 1936 2016 2019 2020 2020 2021 2022 ...
## $ taxonKey : int 4458782 4292671 8966182 8583922 8544630 6267761 9128025 8966182 1152050 8349338 ...
## $ speciesKey : int 4458782 4292671 8966182 8583922 8544630 6267761 9128025 8966182 5006461 8349338 ...
## $ basisOfRecord : chr "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "HUMAN_OBSERVATION" "PRESERVED_SPECIMEN" ...
## $ institutionCode : chr "" "SLU Artdatabanken" "SLU Artdatabanken" "SLU Artdatabanken" ...
## $ collectionCode : chr "Observations" "Artportalen" "Artportalen" "Artportalen" ...
## $ catalogNumber : chr "OBS.250179236" "101472866" "111954675" "15336329" ...
## $ recordNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ identifiedBy : chr "" "" "" "" ...
## $ dateIdentified : chr "" "" "" "" ...
## $ license : chr "CC_BY_NC_4_0" "CC0_1_0" "CC0_1_0" "CC0_1_0" ...
## $ rightsHolder : chr "Stichting Observation International" "Erik Normark" "Nidingens fågelstation" "SLU Artdatabanken" ...
## $ recordedBy : chr "User 64780" "Erik Normark" "Dennis Kraft" "Via Marit Persson" ...
## $ typeStatus : logi NA NA NA NA NA NA ...
## $ establishmentMeans : logi NA NA NA NA NA NA ...
## $ lastInterpreted : chr "2024-01-12T14:15:33.857Z" "2024-01-13T13:23:06.356Z" "2024-01-13T13:18:56.131Z" "2024-01-13T13:13:39.689Z" ...
## $ mediaType : chr "" "" "" "" ...
## $ issue : chr "OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT;COORDINATE_ROUNDED" "TAXON_MATCH_TAXON_ID_IGNORED" "TAXON_MATCH_TAXON_ID_IGNORED" "TAXON_MATCH_TAXON_ID_IGNORED;INSTITUTION_MATCH_NONE" ...
periods <-
list(
1900:1909, 1910:1919, 1920:1929, 1930:1939,
1940:1949, 1950:1959, 1960:1969, 1970:1979,
1980:1989, 1990:1999, 2000:2009, 2010:2019
)
# Check for any vague dates
grep("/", dat$eventDate) # see https://dwc.tdwg.org/terms/#event
## integer(0)
The above is just a semi-arbitrary set of time periods that we will use to assess the Cerambycidae data for temporal patterns that could indicate time-varying biases in the available data.
The check for forward slashes in the dat$eventDate field
was just to see whether there were any records explicitly declared as
“vague” (i.e. covering a span of some unit, such as multiple years).
Apparently there are none, or at least none declared as such to GBIF.
This may or may not be correct, depending on how the data were moved to
GBIF, but for this exercise, we are just going to assume the year
assignments in the data are meaningful and correct.
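A quick (not run) consistency check of that assumption is to compare the GBIF-interpreted year field against the start of eventDate:
## Not run: sanity check that the interpreted year matches the start of eventDate
#yrFromDate <- as.integer(substr(dat$eventDate, 1, 4))
#table(dat$year == yrFromDate, useNA = "ifany")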
## Assess Record Number
nRec <- assessRecordNumber(
dat = dat,
periods = periods,
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
spatialUncertainty = "coordinateUncertaintyInMeters",
# Not filtering, so don't really need this
identifier = "genus",
# group by genus
normalize = FALSE
)
str(nRec$data)
## 'data.frame': 960 obs. of 3 variables:
## $ val : num 42 26 93 138 287 259 198 224 194 484 ...
## $ group : chr "Tetropium" "Tetropium" "Tetropium" "Tetropium" ...
## $ Period: int 1 2 3 4 5 6 7 8 9 10 ...
# view plot of increasing number of records with time
nRec$plot # Plot not clear in HTML knit, move legend to underneath plot for presentation
#nRec$plot + ggplot2::theme(legend.position = 'bottom') # original better with knitr options set
# Unsurprisingly, records increase strongly over time, which presumably reflects large increases in recording effort
# note also that there are some records apparently without a genus-level determination
nrow(dat[dat$genus == "", ]) # 56
## [1] 56
head(dat[dat$genus == "", ]$issue) # apparently due to taxon matching issues with GBIF
## [1] "TAXON_MATCH_TAXON_ID_IGNORED;INSTITUTION_MATCH_NONE"
## [2] "TAXON_MATCH_TAXON_ID_IGNORED"
## [3] "TAXON_MATCH_TAXON_ID_IGNORED"
## [4] "TAXON_MATCH_TAXON_ID_IGNORED;TAXON_MATCH_HIGHERRANK"
## [5] "TAXON_MATCH_TAXON_ID_IGNORED"
## [6] "TAXON_MATCH_TAXON_ID_IGNORED"
Increasing effort in itself does not necessarily bias estimates, although even small probability samples can have large error due to sampling variance. (One can imagine a time series of small random samples with so much variance that any estimate of a linear trend would be too uncertain to be useful.) Very often, however, increasing effort is associated with data-collection patterns that do bias estimates: for example, common species may be under-recorded in historic, museum-based collections, but over-represented in more recent datasets that include more contributions from volunteer naturalists.
Changing effort is often behind decisions to look at relative change (e.g. changing proportions of records, or site/time-period occupancies at some scale), rather than absolute measures.
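One simple step in that direction is to re-run assessRecordNumber() with normalize = TRUE (not run here), which is intended to rescale the per-group counts so that groups with very different record totals can be compared on one plot:
## Not run: relative rather than absolute record numbers per genus
#nRecNorm <- assessRecordNumber(
#  dat = dat, periods = periods,
#  species = "species", x = "decimalLongitude", y = "decimalLatitude",
#  year = "year", spatialUncertainty = "coordinateUncertaintyInMeters",
#  identifier = "genus",
#  normalize = TRUE
#)
#nRecNorm$plot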
## Assess Species Number
nSpec <- assessSpeciesNumber(
dat = dat,
periods = periods,
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
spatialUncertainty = "coordinateUncertaintyInMeters",
# Not filtering, so don't really need this
identifier = "family",
# just look at Cerambycidae overall
normalize = FALSE
)
str(nSpec$data)
## 'data.frame': 12 obs. of 3 variables:
## $ val : int 97 80 97 97 106 108 106 108 108 112 ...
## $ group : chr "Cerambycidae" "Cerambycidae" "Cerambycidae" "Cerambycidae" ...
## $ Period: int 1 2 3 4 5 6 7 8 9 10 ...
nSpec$plot
Again, assessing the number of species (or other taxonomic units) recorded in one’s dataset over time may indicate time-varying biases in the attention paid to different taxa. Of course, such patterns may also be genuine (e.g. new non-natives, or taxonomic splits and merges), again highlighting where expert knowledge will be required to help interpret any pattern. Likewise, a lack of pattern does not necessarily indicate a lack of bias!
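As a crude starting point for that kind of interpretation, one could (not run) list the species that first appear in the data only in recent decades and pass them to an expert:
## Not run: first year each species appears in the data; late first appearances could be
## genuine arrivals, taxonomic changes, or simply changed recording
#firstYear <- tapply(dat$year, dat$species, min, na.rm = TRUE)
#sort(firstYear[firstYear >= 2000])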
## Assess taxonomic resolution
# although relevant column needs NA rather than empty string
# really should double-check for issues with infraspecifics here
dat$species <- ifelse(dat$species == "", NA, dat$species)
propID <- assessSpeciesID(
dat = dat,
periods = periods,
type = "proportion",
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
spatialUncertainty = "coordinateUncertaintyInMeters",
identifier = "family"
)
str(propID$data)
## 'data.frame': 12 obs. of 3 variables:
## $ prop : num 1 1 1 1 1 ...
## $ group : chr "Cerambycidae" "Cerambycidae" "Cerambycidae" "Cerambycidae" ...
## $ Period: int 1 2 3 4 5 6 7 8 9 10 ...
propID$plot # the dataset is more or less entirely composed of species-level records (not always the case! see e.g. Boyd et al. 2022)
The usefulness of assessSpeciesID() will vary depending on the dataset, but, for those working at very large scales with datasets that they did not collect (e.g. GBIF data at the continental scale), it can be useful in highlighting where datasets include large numbers of records that are only broadly resolved taxonomically. This could, for example, indicate the participation of “para-taxonomists” or citizen scientists in particular time periods.
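A cruder complement (not run) is to tabulate the GBIF taxonRank field by decade directly:
## Not run: taxonomic resolution by decade, straight from the taxonRank field
#table(dat$taxonRank,
#      cut(dat$year, breaks = seq(1900, 2030, by = 10), right = FALSE),
#      useNA = "ifany")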
## Assess rarity bias
taxBias <- assessRarityBias(
dat = dat,
periods = periods,
res = 0.5,
prevPerPeriod = FALSE,
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
spatialUncertainty = "coordinateUncertaintyInMeters",
identifier = "family"
)
#> Warning in assessRarityBias(dat = spDat, periods = periods, res = 0.5, prevPerPeriod = FALSE, : Removing 4843 records
#> because they are do not identified to species level.
str(taxBias$data)
## 'data.frame': 12 obs. of 3 variables:
## $ period: int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : chr "Cerambycidae" "Cerambycidae" "Cerambycidae" "Cerambycidae" ...
## $ index : num 0.58 0.585 0.564 0.532 0.61 ...
taxBias$plot + ggplot2::ylim(c(0, 1))
The assessRarityBias() function is a little more subtle than most in the occAssess package, as it presents a time series of results derived from running time-period-specific models. This also means that, if bias in the coverage of species relative to their true frequency is suspected, analysts should delve deeper into any patterns hinted at here.
Within each specified time period, the function runs a regression of the number of records per species against a proxy for species’ range size. The residuals from such a regression provide an index of how over- or under-sampled a taxon might be given its prevalence (conditional on assumptions). assessRarityBias() reports the R-squared of each time-period regression as an index of proportionality between the number of records and the range-size proxy. High values can indicate that species are sampled roughly in proportion to their range sizes, whereas lower values suggest that some species are over- or under-sampled (i.e. a lower R-squared implies that range size has less predictive value for the number of records).
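To make the idea concrete, here is a rough (not run) sketch of that kind of regression for a single period, using the number of occupied 0.5-degree cells as the range-size proxy; the package internals may differ in detail (e.g. transformation of counts), so this is illustrative only.
## Not run: illustrative per-period regression of records vs a range-size proxy
#sub <- dat[dat$year %in% 2010:2019 & !is.na(dat$species), ]
#cellID <- paste(floor(sub$decimalLongitude / 0.5), floor(sub$decimalLatitude / 0.5))
#nRecords <- tapply(sub$species, sub$species, length) # records per species
#nCells <- tapply(cellID, sub$species, function(x) length(unique(x))) # occupied cells per species
#fit <- lm(nRecords ~ nCells)
#summary(fit)$r.squared # the kind of proportionality index plotted above
#residuals(fit)         # + = apparently over-sampled, - = under-sampled, given the proxy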
## Assess spatial coverage
mapsDensity <- assessSpatialCov(
dat = dat,
periods = periods,
res = 0.5,
# Note that I haven't assessed the relationship of coordinate uncertainty to the mapped resolution in lat/long
logCount = T,
#countries = c("Sweden", "Finland", "Norway"),
countries = c("Sweden"),
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
# Note that I haven't assessed the relationship of coordinate uncertainty to the mapped resolution in lat/long
spatialUncertainty = "coordinateUncertaintyInMeters",
identifier = "family",
output = "density"
)
# Adjust max/min Lat/Long
mapsDensity[[1]] + ggplot2::xlim(c(10, 24)) + ggplot2::ylim(c(55, 70))
## Well, I've just learnt that most of Sweden's population probably lives on the coast :)
## e.g. https://ec.europa.eu/eurostat/statistics-explained/index.php?title=File:Share-population-living-50-km-from-coastline-NUTS3-2001.png
mapPeriods <- assessSpatialCov(
dat = dat,
periods = periods,
res = 0.5,
# Note that I haven't assessed the relationship of coordinate uncertainty to the mapped resolution in lat/long
logCount = F,
#countries = c("Sweden", "Finland", "Norway"),
countries = c("Sweden"),
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
# Note that I haven't assessed the relationship of coordinate uncertainty to the mapped resolution in lat/long
spatialUncertainty = "coordinateUncertaintyInMeters",
identifier = "family",
output = "nPeriods"
)
# Adjust max/min Lat/Long
mapPeriods[[1]] + ggplot2::xlim(c(10, 24)) + ggplot2::ylim(c(55, 70))
mapOverlap <- assessSpatialCov(
dat = dat,
periods = periods,
res = 0.5,
# Note that I haven't assessed the relationship of coordinate uncertainty to the mapped resolution in lat/long
logCount = F,
#countries = c("Sweden", "Finland", "Norway"),
countries = c("Sweden"),
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
# Note that I haven't assessed the relationship of coordinate uncertainty to the mapped resolution in lat/long
spatialUncertainty = "coordinateUncertaintyInMeters",
identifier = "family",
output = "overlap",
minPeriods = 10,
returnRaster = F
)
# Adjust max/min Lat/Long
mapOverlap[[1]] + ggplot2::xlim(c(10, 24)) + ggplot2::ylim(c(55, 70))
# Colours need fixing...
#mapOverlap$rasters$Cerambycidae
Here, three different outputs from assessSpatialCov() are shown. Although only the first is explicitly sub-divided by time period, the others are also intended to provide information about aspects of temporal coverage. Thought should obviously be given to the resolution at which such exercises are most informative for one’s study area; local grids can be used if lat/long is unhelpful.
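For example (not run), records could be assigned to a projected national grid such as SWEREF 99 TM (EPSG:3006) rather than a lat/long grid:
## Not run: assign records to a 25 km SWEREF 99 TM grid instead of lat/long cells
#datXY <- dat[!is.na(dat$decimalLongitude) & !is.na(dat$decimalLatitude), ]
#pts <- sf::st_as_sf(datXY, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
#ptsTM <- sf::st_transform(pts, 3006) # SWEREF 99 TM
#xy <- sf::st_coordinates(ptsTM)
#cell25km <- paste(floor(xy[, 1] / 25000), floor(xy[, 2] / 25000))
#length(unique(cell25km)) # number of occupied 25 km cells overall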
## Assess spatial bias
# need raster first
map_obj <- maps::map("world", exact = FALSE, plot = FALSE, fill = TRUE)
mapWorld <- sf::st_as_sf(map_obj)
mapSweden <- mapWorld[mapWorld$ID=="Sweden",]
# Note that there will be sensitivity to raster resolution, but across reasonable values
# this typically only changes absolute values of the NNI, rather than the pattern across time
#r <- raster::raster(nrows = 120, ncols = 112, xmn = 10, xmx = 24, ymn = 55, ymx = 70)
r <- raster::raster(nrows = 60, ncols = 56, xmn = 10, xmx = 24, ymn = 55, ymx = 70)
raster::NAvalue(r) <- -1
swedenRas <- fasterize::fasterize(mapSweden, r)
plot(swedenRas)
# run function
spatBias <- assessSpatialBias(dat = dat,
periods = periods,
mask = swedenRas,
nSamps = 10,
degrade = T,
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
spatialUncertainty = "coordinateUncertaintyInMeters",
#maxSpatUncertainty = 6000,
identifier = "family")
spatBias$plot
Under an assumption of simple random sampling (SRS) within a domain, the average nearest neighbour index (NNI) can be used to assess how far the empirical spatial pattern departs from what would be expected under random sampling. This function provides a (bootstrapped) comparison between the data and the pattern expected under SRS (at the resolution of the raster). Changes in the average NNI between time periods may indicate patterns that deserve further inspection.
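To illustrate the logic (not run), here is a very crude version of the comparison for a single period, working in raw degrees and without the bootstrapping that assessSpatialBias() does:
## Not run: crude nearest-neighbour comparison, observed vs random points from the mask
#obs <- na.omit(dat[dat$year %in% 2010:2019, c("decimalLongitude", "decimalLatitude")])
#set.seed(1)
#obs <- obs[sample(nrow(obs), 500), ] # subsample to keep the distance matrix small
#rnd <- raster::sampleRandom(swedenRas, size = 500, xy = TRUE)[, c("x", "y")]
#meanNN <- function(xy) { # mean nearest-neighbour distance (here, crudely, in degrees)
#  d <- as.matrix(dist(xy)); diag(d) <- NA
#  mean(apply(d, 1, min, na.rm = TRUE))
#}
#meanNN(obs) / meanNN(rnd) # values < 1 suggest clustering relative to random sampling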
## Assess changing environmental bias
## How to get the data using raster::getData()
#clim <- raster::getData("worldclim",var="bio",res=10)
#raster::stackSave(clim, "climStack")
clim <- raster::stackOpen(here("climStack"))
# delineate Sweden in the climate data
mapSweden # convert to an sp-acceptable format first
mapSwedenSp <- sf::as_Spatial(mapSweden)
shp <- sp::spTransform(mapSwedenSp, raster::crs(clim))
clim <- raster::crop(clim, raster::extent(shp))
clim <- raster::mask(clim, shp)
## extract climate data at coordinates of occurrence data
envDat <- raster::extract(clim, dat[, c("decimalLongitude", "decimalLatitude")])
## extract background environmental data
backgroundEnvDat <- raster::sampleRandom(clim,
size = raster::ncell(clim)/2,
xy = F)
## run assess environmental bias
envBias <- assessEnvBias(dat = dat,
species = "species",
x = "decimalLongitude",
y = "decimalLatitude",
year = "year",
spatialUncertainty = "coordinateUncertaintyInMeters",
identifier = "family",
envDat = envDat,
backgroundEnvDat = backgroundEnvDat,
xPC = 1,
yPC = 2,
                         periods = periods) # xPC and yPC indicate which principal components to set as the x and y axes, respectively
envBias$plot
In a similar fashion to assessing departures from simple random sampling in geographic space, we can also look at the changing coverage of “environmental space” as codified by any available rasters, as well as any differences between sampled periods and the full “background”. Here, readily available climate data from WorldClim are used, but any relevant environmental data could be substituted. Again, this could be improved in various ways, for example by using methods that do not assume multivariate normality, or by incorporating time-varying environmental data where available.
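For intuition (not run), the “environmental space” here is essentially a PCA of the background climate values, with the occurrence records projected into it; assessEnvBias() then summarises this per period:
## Not run: sketch of the environmental-space idea behind assessEnvBias()
#pca <- prcomp(na.omit(backgroundEnvDat), scale. = TRUE)
#recScores <- predict(pca, newdata = na.omit(envDat))
#plot(pca$x[, 1:2], pch = ".", col = "grey") # background environmental space
#points(recScores[, 1:2], pch = ".", col = "red") # where the records fall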
That is the end of this supporting material. See here for the paper describing occAssess, the package of potential-bias “heuristics” used here.