This is a quick guide to obtaining data for research from GBIF. It guides you through how to get occurrence data for a particular species.
You need to have the rgbif package loaded. If you
haven’t already installed it, you need to do this first, like this…
install.packages("rgbif")
Then load it like this…
library(rgbif)
GBIF provides two methods for retrieving occurrence data:
occ_download(): Offers access to an unlimited number of
records, making it ideal for rigorous research and citation
purposes.
occ_search(): Capped at 100,000 records, this method is
primarily suited for preliminary tests.
The occ_search() function, along with its counterpart
occ_data(), is not recommended for comprehensive research
efforts. Despite the convenience of not requiring a username or
password, and avoiding the wait time associated with downloads,
occ_search() falls short for in-depth research projects.
For any substantial research undertaking, the use of
occ_download() is strongly advised.
We will therefore focus on occ_download.
The function occ_download() requires your login
credentials. You will first need to create an account at GBIF.
Make a note of your username, email and password.
The simplest way to provide these details to the
occ_download() function is to include them in the
function arguments.
To do that, I first create these credentials as objects in R.
user <- "owenjones"
pwd <- "my secure password 1234"
email <- "jones@biology.sdu.dk"
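Typing a password directly into a script is risky if the script is ever shared. As a minimal sketch of an alternative, assuming you have stored your details in the environment variables GBIF_USER, GBIF_PWD and GBIF_EMAIL (for example in your .Renviron file; these are the names described in the rgbif documentation), you could read them in instead. The helper name read_gbif_credentials is hypothetical and is not part of rgbif.

```r
# Hypothetical helper: read GBIF credentials from environment variables
# rather than typing them into the script.
read_gbif_credentials <- function() {
  creds <- c(
    user  = Sys.getenv("GBIF_USER"),
    pwd   = Sys.getenv("GBIF_PWD"),
    email = Sys.getenv("GBIF_EMAIL")
  )
  # Sys.getenv() returns "" for unset variables, so flag any that are empty
  missing <- names(creds)[!nzchar(creds)]
  if (length(missing) > 0) {
    warning("Not set: ", paste(missing, collapse = ", "))
  }
  as.list(creds)
}

# These replace the hard-coded objects of the same names
creds <- read_gbif_credentials()
user  <- creds$user
pwd   <- creds$pwd
email <- creds$email
```

The objects user, pwd and email can then be used in exactly the same way as the hard-coded versions.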
Then I can pass these objects to occ_download() to request data.
But first you should know how to query the database.
Once a download has been requested, it will appear on your downloads user page https://www.gbif.org/user/download.
You can filter your download request based on taxon, geographic
location and other factors. This is handled with a set of functions that
all have the prefix pred, which stands for
predicate.
It is likely that you will want to download data for a particular
species, so let’s look at that first. To cope with synonyms, GBIF uses a
fixed backbone taxonomy where species are given a numerical ID -
synonyms will have the same ID. To find out the ID number for a
particular species you can use the function name_backbone.
The function produces a data frame with a lot of taxonomic information,
but what you need in this context is the usageKey.
For example, the European Cuckoo (Cuculus canorus)…
x <- name_backbone("Cuculus canorus")
x$usageKey
## [1] 5231918
We can now use that key to set the pred for the taxon
like this.
pred("taxonKey", x$usageKey)
We might also want to add other specifications for our search:
pred("country", "DK") # Is recorded in Denmark (DK)
pred("hasGeospatialIssue", FALSE) # Has no problems with geospatial coordinates
pred("hasCoordinate", TRUE) # Has a coordinate included
pred("occurrenceStatus", "PRESENT") # Is recorded as a presence
pred_gte("year", 1900) # Is recorded from 1900 onwards
There are other pred commands for building more
sophisticated queries: https://docs.ropensci.org/rgbif/articles/getting_occurrence_data.html
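As a rough illustration of what some of these other predicates look like, the sketch below uses pred_in(), pred_and(), pred_lte(), pred_or(), pred_lt() and pred_isnull() from rgbif. The particular countries, years and uncertainty threshold are arbitrary values chosen for illustration, not recommendations.

```r
library(rgbif)

# Match any of several countries (Denmark, Sweden, Norway)
nordic <- pred_in("country", c("DK", "SE", "NO"))

# Restrict to a range of years by combining two predicates
year_range <- pred_and(pred_gte("year", 1900), pred_lte("year", 2000))

# Keep records whose coordinate uncertainty is small or not reported
uncertainty <- pred_or(
  pred_lt("coordinateUncertaintyInMeters", 10000),
  pred_isnull("coordinateUncertaintyInMeters")
)
```

Objects built this way can be passed to occ_download() in the same way as the simple pred() calls.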
We can put all these together into a single command like this.
(gbif_download <- occ_download(
  pred("taxonKey", x$usageKey),
  pred("country", "DK"),
  pred("hasGeospatialIssue", FALSE),
  pred("hasCoordinate", TRUE),
  pred("occurrenceStatus", "PRESENT"),
  pred_gte("year", 1900),
  user = user, pwd = pwd, email = email,
  format = "SIMPLE_CSV"
))
When you run this command it will produce some output to the screen,
and save the information to an object called
gbif_download.
It will take a few minutes for GBIF to prepare the data, so first you should wait until the download is ready, like this:
occ_download_wait(gbif_download)
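If you would rather not keep your R session waiting, one option is to look up the status of the request once and come back later. The sketch below wraps rgbif's occ_download_meta() for this purpose. The helper name download_status is hypothetical, and the status values in the comment are examples rather than a complete list.

```r
library(rgbif)

# Hypothetical helper: look up the current status of a download request.
# `download` is the object returned by occ_download(), or a download key.
download_status <- function(download) {
  meta <- occ_download_meta(download)
  meta$status  # e.g. "PREPARING", "RUNNING" or "SUCCEEDED"
}
```

Calling download_status(gbif_download) contacts GBIF, so it needs an internet connection and a request that has already been submitted.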
You will get a message when it is done. Then you can get the data
using a combination of occ_download_get and
occ_download_import:
myData <- occ_download_get(gbif_download) |>
  occ_download_import()
Now your data are stored as myData, which you can
examine using summary(myData).
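As a quick sanity check beyond summary(), you might count how many records there are per year. The sketch below uses only base R and assumes the data contain the columns year, decimalLatitude and decimalLongitude, as is the case for the SIMPLE_CSV format. The helper name records_per_year is hypothetical.

```r
# Hypothetical helper: keep records with a complete year and coordinates,
# then count how many records there are in each year.
records_per_year <- function(occ) {
  cols <- c("year", "decimalLatitude", "decimalLongitude")
  stopifnot(all(cols %in% names(occ)))
  complete <- occ[stats::complete.cases(occ[, cols]), ]
  table(complete$year)
}
```

With the data from above, records_per_year(myData) returns a table of counts that can be passed straight to plot() or barplot().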