Before starting, make sure that you have set your working directory (folder). You will be creating and importing files in/from this directory.
We need to use several packages to do this: - the tidyverse packages for data manipulation. - the taxize package for obtaining taxonomic information. - the rgbif package for obtaining the GBIF data.
You might need to install them first using install.packages().
library(tidyverse)
library(taxize)
library(rgbif)
GBIF (and other taxonomic databases) use numeric taxon keys (also known as serial numbers or taxonomic serial numbers (TSN)) to identify taxa. For example, the GBIF taxon key for Lepidoptera is 797. The taxon keys used by other databases will be different!
x <- rgbif::name_backbone("Lepidoptera")
x
## # A tibble: 1 x 16
## usageKey scientificName canonicalName rank status confidence matchType
## * <int> <chr> <chr> <chr> <chr> <int> <chr>
## 1 797 Lepidoptera Lepidoptera ORDER ACCEPTED 94 EXACT
## # … with 9 more variables: kingdom <chr>, phylum <chr>, order <chr>,
## # kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
## # synonym <lgl>, class <chr>
The workhorse of all things taxonomic in R is taxize. This package can interrogate many taxonomic databases including GBIF, ITIS and so on…
We can use the function gbif_downstream to get all “downstream” taxa, if you know its taxonomic serial number (what GBIF calls the “usagekey”). We already got the usage key for the whole of Lepidoptera, so we can ask for the families downstream. If you read the help file for the function it tells you that the default limit for the number of entries to return is 100 and therefore, because there are probably lots more families (we are not sure yet of the exact number) we should increase this to a large number, like 10000.
lepiFamilies <- taxize::gbif_downstream(x$usageKey,downto = "family",limit = 10000)
dim(lepiFamilies)
## [1] 223 4
head(lepiFamilies)
## name rank key name_type
## 1 Acanthopteroctetidae family 8661 canonicalname
## 2 Acraeidae family 4707618 canonicalname
## 3 Acrolepiidae family 8662 canonicalname
## 4 Acrolophidae family 4537 canonicalname
## 5 Adelidae family 4538 canonicalname
## 6 Aganaidae family 3256897 canonicalname
You can see that you get a data frame that includes columns of name, taxonomic rank and key. There are 223 rows of data (i.e. 223 families).
We COULD start downloading data on a family-by-family basis, but I know that some families will have way more than 100000 observations, which exceeds the GBIF download limits. Therefore, I think it is better to do it on a genus-by-genus basis. So, we should obtain the genus list.
lepiGenera <- taxize::gbif_downstream(x$usageKey,downto = "genus",limit = 10000)
dim(lepiGenera)
## [1] 17686 4
head(lepiGenera)
## name rank key name_type
## 1 Abella genus 3257976 canonicalname
## 2 Ablabia genus 4704064 canonicalname
## 3 Abraxides genus 4689181 canonicalname
## 4 Abrisa genus 4703330 canonicalname
## 5 Absyrtes genus 3256535 canonicalname
## 6 Acacerus genus 4688691 canonicalname
This search found 17686. Thats a lot, but remember that many (most) of these genera will not be found in Denmark.
So… what we need to do now is, for each of the 17686, download the GBIF observations for each one, to a CSV file and save it into a folder on our own computer.
To illustrate this I will pick one genus.
If we know a particular genus name (e.g. my favourite genus of hawkmoth is Deilephila) we can filter the dsTaxa data frame to pull out the tsn for that genus.
focalGenus <- lepiGenera %>%
filter(name == "Deilephila") %>%
pull(key)
focalGenus
## [1] 1864355
The taxon key for Deilephila is 1864355
In the next section we’ll search GBIF for occurrence data for a genus.
## Querying GBIF
We want to get observation data of occurrences within the country of Denmark, that have precise geolocation information (i.e. latitude and longitude.)
lepi_dk_test <- occ_data(taxonKey = focalGenus,limit = 10, hasCoordinate = TRUE,country = "DK")
We get a message when we run that command. It tells us that it has found 5354 records (remember that we put a limit of 10 for our test run) and it reminds us what “arguments” we used in our search. The object has two parts meta and data. The meta part tells us about the contents of the data and data contains the actual data. You can see what the data include with names.
dim(lepi_dk_test)
## NULL
names(lepi_dk_test)
## [1] "meta" "data"
For now lets download all the data for this genus, note that I am using the count from the test-run as the limit to ensure we get all the data.
lepi_dk_all <- occ_data(taxonKey = focalGenus,limit = lepi_dk_test$meta$count, hasCoordinate = TRUE,country = "DK")
We can save the data part of this object (lepi_dk_all) out as a CSV file. I suggest that you first make a folder called “GBIF_raw_data” to put these files in.
write.csv(lepi_dk_all$data,file = "GBIF_raw_data/Deilephila.csv",row.names = FALSE)
You can read in the data (e.g. if you are in a fresh R session) like this:
moth <- read.csv("GBIF_raw_data/Deilephila.csv")
We can do some simple plots to see the geographic distribution:
ggplot(moth,aes(x = decimalLongitude,y = decimalLatitude)) +
geom_point() +
ggtitle("Deilephila")
We can also ask what species we have:
table(moth$scientificName)
##
## Deilephila elpenor (Linnaeus, 1758) Deilephila porcellus (Linnaeus, 1758)
## 3690 1664
There are two species - the “large elephant hawk moth” and “small elephant hawk moth”.
What about years of data?
table(moth$year)
##
## 1856 1859 1863 1870 1892 1894 1895 1897 1901 1905 1943 1949 1952 1954 1956 1958
## 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1
## 1962 1963 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978
## 1 1 2 4 7 17 20 31 49 51 57 52 23 12 6 14
## 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## 5 4 10 20 33 37 49 14 12 10 24 16 19 27 39 11
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## 56 65 85 89 31 47 17 49 64 45 68 136 133 128 165 255
## 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## 294 256 278 396 357 349 487 379 344 86
We could make a graph of this:
tempTable <- table(moth$year)
tempDF <- data.frame(year = as.numeric(names(table(moth$year))),count = as.vector(table(moth$year)))
tempDF
## year count
## 1 1856 1
## 2 1859 1
## 3 1863 1
## 4 1870 1
## 5 1892 2
## 6 1894 1
## 7 1895 2
## 8 1897 1
## 9 1901 1
## 10 1905 1
## 11 1943 1
## 12 1949 1
## 13 1952 1
## 14 1954 1
## 15 1956 1
## 16 1958 1
## 17 1962 1
## 18 1963 1
## 19 1965 2
## 20 1966 4
## 21 1967 7
## 22 1968 17
## 23 1969 20
## 24 1970 31
## 25 1971 49
## 26 1972 51
## 27 1973 57
## 28 1974 52
## 29 1975 23
## 30 1976 12
## 31 1977 6
## 32 1978 14
## 33 1979 5
## 34 1980 4
## 35 1981 10
## 36 1982 20
## 37 1983 33
## 38 1984 37
## 39 1985 49
## 40 1986 14
## 41 1987 12
## 42 1988 10
## 43 1989 24
## 44 1990 16
## 45 1991 19
## 46 1992 27
## 47 1993 39
## 48 1994 11
## 49 1995 56
## 50 1996 65
## 51 1997 85
## 52 1998 89
## 53 1999 31
## 54 2000 47
## 55 2001 17
## 56 2002 49
## 57 2003 64
## 58 2004 45
## 59 2005 68
## 60 2006 136
## 61 2007 133
## 62 2008 128
## 63 2009 165
## 64 2010 255
## 65 2011 294
## 66 2012 256
## 67 2013 278
## 68 2014 396
## 69 2015 357
## 70 2016 349
## 71 2017 487
## 72 2018 379
## 73 2019 344
## 74 2020 86
ggplot(tempDF, aes(x = year,y = count)) +
geom_line()
Before proceeding too far, here is where we need to remember that it might not be a good idea to blindly use ALL the data. There could be some serious biases in it and we should explore and think about the variables in the dataset to consider whether we should include the observations or not. For now we will ignore this issue.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
We want to create a date column from the year month and day columns in the data. This is easy with the function ymd, after you paste together the year, month and day columns into a particular format (YYYY-MM-DD). After that we can get the day of the year with the yday function.
moth <- moth %>%
mutate(date = lubridate::ymd(paste(year, month, day,sep = "-"))) %>%
mutate(yday = lubridate::yday(date))
Lets see that distribution for each species
ggplot(moth,aes(x = yday, fill = scientificName)) +
geom_density(alpha = 0.5)
The geom_density_ridges graphical geom is also a nice option for plotting when you have many categories. For example, here I plot the distribution through time of observations the genus between 2000 and 2010.
library(ggridges)
ggplot(moth %>% filter(year %in% 2000:2010),aes(x = yday, y = as.factor(year), fill = as.factor(year))) +
geom_density_ridges()
## Picking joint bandwidth of 7.53
We can revisit the details of these analyses at a later date. Think about how you would extract measures of phenology from these data.
Remember that we had >17000 genera. It would be incredibly time consuming to download those one at a time, We need to use some programming to do it efficiently. Now that we have done it for one species it is “simply” a matter of scaling up, and putting those commands to download data and save it a CSV file in a loop.
The steps within such a loop are:
Here is how to do it for the first 1000 genera (most don’t occur in Europe). Take some time to study this code to see what is happening.
ngenera <- 1000 #Increase this to the full amount when you are satisfied that it works.
for(i in 1:ngenera) {
cat(paste(i, " ", lepiGenera$name[i],"\n"))
#Get the taxon key
focalGenus <- lepiGenera %>%
filter(name == lepiGenera$name[i]) %>%
pull(key)
focalGenus
#Test run with 10 observations to see what data is there
lepi_dk_test <-
occ_data(taxonKey = focalGenus[1], limit = 10, hasCoordinate = TRUE, country = "DK")
#Check if there is data? If >0 download it all.
#If the count is 0, then this bit is not run
if (lepi_dk_test$meta$count > 0) {
lepi_dk_all <- occ_data(taxonKey = focalGenus,
limit = lepi_dk_test$meta$count, hasCoordinate = TRUE, country = "DK")
#File path/name to use (genus name .csv)
filePath <- paste0("GBIF_raw_data/", lepiGenera$name[i], ".csv")
#Write the file to the GBIF_raw_data directory
write.csv(lepi_dk_all$data, file = filePath, row.names = FALSE)
}
}
If this code works you should have a bunch of data files, all of the same format, in your directory. Your job now is to write a loop to import the files one by one and join them together to make a single file.