Getting GBIF data

Set up

Before starting, make sure that you have set your working directory (folder). You will be creating and importing files in/from this directory.

Packages

We need to use several packages to do this: - the tidyverse packages for data manipulation. - the taxize package for obtaining taxonomic information. - the rgbif package for obtaining the GBIF data.

You might need to install them first using install.packages().

library(tidyverse)
library(taxize)
library(rgbif)

Get taxonomic information

GBIF (and other taxonomic databases) use numeric taxon keys (also known as serial numbers or taxonomic serial numbers (TSN)) to identify taxa. For example, the GBIF taxon key for Lepidoptera is 797. The taxon keys used by other databases will be different!

x <- rgbif::name_backbone("Lepidoptera")
x

## # A tibble: 1 x 16
##   usageKey scientificName canonicalName rank  status   confidence matchType
## *    <int> <chr>          <chr>         <chr> <chr>         <int> <chr>    
## 1      797 Lepidoptera    Lepidoptera   ORDER ACCEPTED         94 EXACT    
## # … with 9 more variables: kingdom <chr>, phylum <chr>, order <chr>,
## #   kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
## #   synonym <lgl>, class <chr>

The workhorse of all things taxonomic in R is taxize. This package can interrogate many taxonomic databases including GBIF, ITIS and so on…

We can use the function gbif_downstream to get all “downstream” taxa, if you know its taxonomic serial number (what GBIF calls the “usagekey”). We already got the usage key for the whole of Lepidoptera, so we can ask for the families downstream. If you read the help file for the function it tells you that the default limit for the number of entries to return is 100 and therefore, because there are probably lots more families (we are not sure yet of the exact number) we should increase this to a large number, like 10000.

lepiFamilies <- taxize::gbif_downstream(x$usageKey,downto = "family",limit = 10000)
dim(lepiFamilies)

## [1] 223   4

head(lepiFamilies)

##                   name   rank     key     name_type
## 1 Acanthopteroctetidae family    8661 canonicalname
## 2            Acraeidae family 4707618 canonicalname
## 3         Acrolepiidae family    8662 canonicalname
## 4         Acrolophidae family    4537 canonicalname
## 5             Adelidae family    4538 canonicalname
## 6            Aganaidae family 3256897 canonicalname

You can see that you get a data frame that includes columns of name, taxonomic rank and key. There are 223 rows of data (i.e. 223 families).

We COULD start downloading data on a family-by-family basis, but I know that some families will have way more than 100000 observations, which exceeds the GBIF download limits. Therefore, I think it is better to do it on a genus-by-genus basis. So, we should obtain the genus list.

lepiGenera <- taxize::gbif_downstream(x$usageKey,downto = "genus",limit = 10000)
dim(lepiGenera)

## [1] 17686     4

head(lepiGenera)

##        name  rank     key     name_type
## 1    Abella genus 3257976 canonicalname
## 2   Ablabia genus 4704064 canonicalname
## 3 Abraxides genus 4689181 canonicalname
## 4    Abrisa genus 4703330 canonicalname
## 5  Absyrtes genus 3256535 canonicalname
## 6  Acacerus genus 4688691 canonicalname

This search found 17686. Thats a lot, but remember that many (most) of these genera will not be found in Denmark.

So… what we need to do now is, for each of the 17686, download the GBIF observations for each one, to a CSV file and save it into a folder on our own computer.

To illustrate this I will pick one genus.

If we know a particular genus name (e.g. my favourite genus of hawkmoth is Deilephila) we can filter the dsTaxa data frame to pull out the tsn for that genus.

focalGenus <- lepiGenera %>% 
  filter(name == "Deilephila") %>% 
  pull(key)

focalGenus

## [1] 1864355

The taxon key for Deilephila is 1864355

In the next section we’ll search GBIF for occurrence data for a genus.

## Querying GBIF

We want to get observation data of occurrences within the country of Denmark, that have precise geolocation information (i.e. latitude and longitude.)

lepi_dk_test <- occ_data(taxonKey = focalGenus,limit = 10, hasCoordinate = TRUE,country = "DK")

We get a message when we run that command. It tells us that it has found 5354 records (remember that we put a limit of 10 for our test run) and it reminds us what “arguments” we used in our search. The object has two parts meta and data. The meta part tells us about the contents of the data and data contains the actual data. You can see what the data include with names.

dim(lepi_dk_test)

## NULL

names(lepi_dk_test)

## [1] "meta" "data"

For now lets download all the data for this genus, note that I am using the count from the test-run as the limit to ensure we get all the data.

lepi_dk_all <- occ_data(taxonKey = focalGenus,limit = lepi_dk_test$meta$count, hasCoordinate = TRUE,country = "DK")

We can save the data part of this object (lepi_dk_all) out as a CSV file. I suggest that you first make a folder called “GBIF_raw_data” to put these files in.

write.csv(lepi_dk_all$data,file = "GBIF_raw_data/Deilephila.csv",row.names = FALSE)

Looking at the data

You can read in the data (e.g. if you are in a fresh R session) like this:

moth <- read.csv("GBIF_raw_data/Deilephila.csv")

We can do some simple plots to see the geographic distribution:

ggplot(moth,aes(x = decimalLongitude,y = decimalLatitude)) + 
  geom_point() + 
  ggtitle("Deilephila")

We can also ask what species we have:

table(moth$scientificName)

## 
##   Deilephila elpenor (Linnaeus, 1758) Deilephila porcellus (Linnaeus, 1758) 
##                                  3690                                  1664

There are two species - the “large elephant hawk moth” and “small elephant hawk moth”.

What about years of data?

table(moth$year)

## 
## 1856 1859 1863 1870 1892 1894 1895 1897 1901 1905 1943 1949 1952 1954 1956 1958 
##    1    1    1    1    2    1    2    1    1    1    1    1    1    1    1    1 
## 1962 1963 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 
##    1    1    2    4    7   17   20   31   49   51   57   52   23   12    6   14 
## 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 
##    5    4   10   20   33   37   49   14   12   10   24   16   19   27   39   11 
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 
##   56   65   85   89   31   47   17   49   64   45   68  136  133  128  165  255 
## 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 
##  294  256  278  396  357  349  487  379  344   86

We could make a graph of this:

tempTable <- table(moth$year)
tempDF <- data.frame(year = as.numeric(names(table(moth$year))),count = as.vector(table(moth$year)))
tempDF

##    year count
## 1  1856     1
## 2  1859     1
## 3  1863     1
## 4  1870     1
## 5  1892     2
## 6  1894     1
## 7  1895     2
## 8  1897     1
## 9  1901     1
## 10 1905     1
## 11 1943     1
## 12 1949     1
## 13 1952     1
## 14 1954     1
## 15 1956     1
## 16 1958     1
## 17 1962     1
## 18 1963     1
## 19 1965     2
## 20 1966     4
## 21 1967     7
## 22 1968    17
## 23 1969    20
## 24 1970    31
## 25 1971    49
## 26 1972    51
## 27 1973    57
## 28 1974    52
## 29 1975    23
## 30 1976    12
## 31 1977     6
## 32 1978    14
## 33 1979     5
## 34 1980     4
## 35 1981    10
## 36 1982    20
## 37 1983    33
## 38 1984    37
## 39 1985    49
## 40 1986    14
## 41 1987    12
## 42 1988    10
## 43 1989    24
## 44 1990    16
## 45 1991    19
## 46 1992    27
## 47 1993    39
## 48 1994    11
## 49 1995    56
## 50 1996    65
## 51 1997    85
## 52 1998    89
## 53 1999    31
## 54 2000    47
## 55 2001    17
## 56 2002    49
## 57 2003    64
## 58 2004    45
## 59 2005    68
## 60 2006   136
## 61 2007   133
## 62 2008   128
## 63 2009   165
## 64 2010   255
## 65 2011   294
## 66 2012   256
## 67 2013   278
## 68 2014   396
## 69 2015   357
## 70 2016   349
## 71 2017   487
## 72 2018   379
## 73 2019   344
## 74 2020    86

ggplot(tempDF, aes(x = year,y = count)) + 
  geom_line()

Before proceeding too far, here is where we need to remember that it might not be a good idea to blindly use ALL the data. There could be some serious biases in it and we should explore and think about the variables in the dataset to consider whether we should include the observations or not. For now we will ignore this issue.

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

We want to create a date column from the year month and day columns in the data. This is easy with the function ymd, after you paste together the year, month and day columns into a particular format (YYYY-MM-DD). After that we can get the day of the year with the yday function.

moth <- moth %>% 
  mutate(date = lubridate::ymd(paste(year, month, day,sep = "-"))) %>%
  mutate(yday = lubridate::yday(date))

Lets see that distribution for each species

ggplot(moth,aes(x = yday, fill = scientificName)) + 
  geom_density(alpha = 0.5)

The geom_density_ridges graphical geom is also a nice option for plotting when you have many categories. For example, here I plot the distribution through time of observations the genus between 2000 and 2010.

library(ggridges)
ggplot(moth %>% filter(year %in% 2000:2010),aes(x = yday, y = as.factor(year), fill = as.factor(year))) + 
  geom_density_ridges()

## Picking joint bandwidth of 7.53

We can revisit the details of these analyses at a later date. Think about how you would extract measures of phenology from these data.

Getting more data

Remember that we had >17000 genera. It would be incredibly time consuming to download those one at a time, We need to use some programming to do it efficiently. Now that we have done it for one species it is “simply” a matter of scaling up, and putting those commands to download data and save it a CSV file in a loop.

The steps within such a loop are:

Get the name of a genus from the list.
Check how many observations there are in DK that have lat/long information. 3a. If there are observations, then download the entire available data and save it out as a CSV 3b, If there are no observations, do nothing.
Go to step 1. (the end of the loop)

Here is how to do it for the first 1000 genera (most don’t occur in Europe). Take some time to study this code to see what is happening.

ngenera <- 1000 #Increase this to the full amount when you are satisfied that it works. 

for(i in 1:ngenera) {

  cat(paste(i, "  ", lepiGenera$name[i],"\n"))
  #Get the taxon key
  focalGenus <- lepiGenera %>%
    filter(name == lepiGenera$name[i]) %>%
    pull(key)
  
  focalGenus
  
  #Test run with 10 observations to see what data is there
  lepi_dk_test <-
    occ_data(taxonKey = focalGenus[1], limit = 10, hasCoordinate = TRUE, country = "DK")
  
  #Check if there is data? If >0 download it all.
  #If the count is 0, then this bit is not run
  if (lepi_dk_test$meta$count > 0) {
    lepi_dk_all <- occ_data(taxonKey = focalGenus, 
                            limit = lepi_dk_test$meta$count, hasCoordinate = TRUE, country = "DK")
    
    #File path/name to use (genus name .csv)
    filePath <- paste0("GBIF_raw_data/", lepiGenera$name[i], ".csv")
    
    #Write the file to the GBIF_raw_data directory
    write.csv(lepi_dk_all$data, file = filePath, row.names = FALSE)
  }
}

If this code works you should have a bunch of data files, all of the same format, in your directory. Your job now is to write a loop to import the files one by one and join them together to make a single file.