Before starting, make sure that you have set your working directory (folder). You will be creating files in, and importing files from, this directory.

Packages

We need several packages to do this:

- the tidyverse packages for data manipulation
- the taxize package for obtaining taxonomic information
- the rgbif package for obtaining the GBIF data

You might need to install them first using install.packages().
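
For example, to install all three in one go (you only need to do this once per computer):

install.packages(c("tidyverse", "taxize", "rgbif"))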

library(tidyverse)
library(taxize)
library(rgbif)

Get taxonomic information

GBIF (like other taxonomic databases) uses numeric identifiers to refer to taxa. In GBIF these are called taxon keys; other databases use similar identifiers, such as the taxonomic serial numbers (TSNs) used by ITIS. For example, the GBIF taxon key for the genus Vanessa is 1898261. The keys used by other databases will be different!

x <- rgbif::name_backbone("Vanessa")
x
## # A tibble: 1 x 20
##   usageKey scientificName       canonicalName rank  status  confidence matchType
## *    <int> <chr>                <chr>         <chr> <chr>        <int> <chr>    
## 1  1898261 Vanessa Fabricius, … Vanessa       GENUS ACCEPT…         92 EXACT    
## # … with 13 more variables: kingdom <chr>, phylum <chr>, order <chr>,
## #   family <chr>, genus <chr>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   synonym <lgl>, class <chr>
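
Rather than copying the key by hand, you can pull it straight out of this result (name_backbone() returns it in the usageKey column):

vanessa_key <- x$usageKey
vanessa_key
## [1] 1898261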

We can now use this taxon key to get the occurrence data from GBIF. First, I do a test run to make sure that everything is working.

vanessa_dk_test <- occ_data(taxonKey = 1898261, limit = 10, hasCoordinate = TRUE, country = "DK")

Take a look at the output from this to check that it has worked.

vanessa_dk_test
## Records found [52660] 
## Records returned [10] 
## Args [hasCoordinate=TRUE, limit=10, offset=0, taxonKey=1898261, country=DK] 
## # A tibble: 10 x 78
##    key    scientificName   decimalLatitude decimalLongitude issues datasetKey   
##    <chr>  <chr>                      <dbl>            <dbl> <chr>  <chr>        
##  1 30937… Vanessa atalant…            56.0            12.3  osiic  95db4db8-f76…
##  2 30939… Vanessa atalant…            55.2            12.1  osiic  95db4db8-f76…
##  3 30707… Vanessa atalant…            55.7            11.4  cdrou… 50c9509d-22c…
##  4 31191… Vanessa atalant…            55.2            12.1  osiic  95db4db8-f76…
##  5 31191… Vanessa atalant…            54.6            11.5  osiic  95db4db8-f76…
##  6 31257… Vanessa atalant…            55.7            12.5  osiic  95db4db8-f76…
##  7 31257… Vanessa atalant…            56.1            10.2  osiic  95db4db8-f76…
##  8 31257… Vanessa atalant…            55.6             8.08 osiic  95db4db8-f76…
##  9 31125… Vanessa atalant…            55.6            12.6  cdrou… 50c9509d-22c…
## 10 31282… Vanessa atalant…            56.4            10.8  osiic  95db4db8-f76…
## # … with 72 more variables: publishingOrgKey <chr>, installationKey <chr>,
## #   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
## #   lastParsed <chr>, crawlId <int>, hostingOrganizationKey <chr>,
## #   basisOfRecord <chr>, individualCount <int>, occurrenceStatus <chr>,
## #   taxonKey <int>, kingdomKey <int>, phylumKey <int>, classKey <int>,
## #   orderKey <int>, familyKey <int>, genusKey <int>, speciesKey <int>,
## #   acceptedTaxonKey <int>, acceptedScientificName <chr>, kingdom <chr>,
## #   phylum <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
## #   genericName <chr>, specificEpithet <chr>, taxonRank <chr>,
## #   taxonomicStatus <chr>, iucnRedListCategory <chr>,
## #   coordinateUncertaintyInMeters <dbl>, year <int>, month <int>, day <int>,
## #   eventDate <chr>, lastInterpreted <chr>, references <chr>, license <chr>,
## #   isInCluster <lgl>, geodeticDatum <chr>, class <chr>, countryCode <chr>,
## #   country <chr>, locationAccordingTo <chr>, identifier <chr>,
## #   recordedBy <chr>, vernacularName <chr>, locationID <chr>, locality <chr>,
## #   gbifID <chr>, occurrenceID <chr>, behavior <chr>, taxonID <chr>,
## #   identifiedBy <chr>, dateIdentified <chr>, stateProvince <chr>,
## #   modified <chr>, rightsHolder <chr>, http://unknown.org/nick <chr>,
## #   verbatimEventDate <chr>, datasetName <chr>, verbatimLocality <chr>,
## #   collectionCode <chr>, catalogNumber <chr>,
## #   http://unknown.org/occurrenceDetails <chr>, institutionCode <chr>,
## #   rights <chr>, eventTime <chr>, identificationID <chr>, name <chr>

Looks OK.

Since that worked well, we can now try to download ALL available data on this genus. It will take several minutes to download.

vanessa_dk <- occ_data(taxonKey = 1898261, limit = 100000, hasCoordinate = TRUE, country = "DK")

The returned object vanessa_dk has two parts: meta and data. The meta part tells us about the contents of the download, and data contains the actual records. You can see what the object includes with names(). Warning: there are lots of columns in the data!

names(vanessa_dk)
## [1] "meta" "data"

Here are the first 10 column names.

names(vanessa_dk$data)[1:10]
##  [1] "key"               "scientificName"    "decimalLatitude"  
##  [4] "decimalLongitude"  "issues"            "datasetKey"       
##  [7] "publishingOrgKey"  "installationKey"   "publishingCountry"
## [10] "protocol"

We can also see how many rows of data there are:

nrow(vanessa_dk$data)
## [1] 52660
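
The meta part records, among other things, the total number of matching records held by GBIF (assuming the usual occ_data() return structure, this is stored in meta$count); it should agree with the "Records found" figure printed above:

vanessa_dk$meta$count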

Selecting and filtering the data

We don’t need all of this data. There are dozens of columns in the download, most of which are not necessary for the analysis we have in mind. We only want information on the observed species, where and when it was observed, and (possibly) its life stage (adult/larva etc.).

This is how we can select those columns.

vanessa2 <- vanessa_dk$data %>%
  select(species, decimalLatitude, decimalLongitude, year, month, day, lifeStage)
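
A quick way to check that only the intended columns were kept is glimpse() (from the tidyverse), which prints the column names, types and first few values:

glimpse(vanessa2)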

We can plot the geographic distribution like this:

ggplot(vanessa2, aes(x = decimalLongitude, y = decimalLatitude)) +
  geom_point()

There is a single outlier point up in Greenland. We can remove this using filter. We can also filter the data by species so that we only look at V. atalanta (and not the closely related V. cardui).
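
If you want to see which species the genus-level download actually contains before filtering, a quick tally with dplyr's count() works (just a check, not needed for the analysis):

vanessa_dk$data %>% count(species, sort = TRUE)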

vanessa2 <- vanessa_dk$data %>%
  select(species, decimalLatitude, decimalLongitude, year, month, day, lifeStage) %>%
  filter(decimalLatitude < 75) %>%
  filter(species == "Vanessa atalanta")

And replot…

ggplot(vanessa2, aes(x = decimalLongitude, y = decimalLatitude)) +
  geom_point()

This looks good.

Phenology

To look at phenology we need to turn the date information into a day of the year (doy). To do that we can use the function ymd in the lubridate package (lubridate::ymd) to produce a correctly-formatted date. It is useful to first filter the data to remove any observations with missing (NA) observation dates. You will get a message saying that a certain number of records “failed to parse”. This means that lubridate::ymd could not build a date for those records because one or more of the year/month/day values is missing. It is a very small fraction of the total observations, so we can forget about them and simply filter the data to remove those records.

After getting a correctly-formatted date, we can use the lubridate::yday function to express the date as a day-of-the-year, which we can put in a column called doy.
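
As a quick illustration of what these two functions do (the date here is just an arbitrary example):

lubridate::ymd("2015-7-14")  # parses to the Date 2015-07-14
lubridate::yday(lubridate::ymd("2015-7-14"))  # day of the year: 195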

vanessa2 <- vanessa2 %>%
  # build a date from the year, month and day columns
  mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) %>%
  # convert the date to a day of the year
  mutate(doy = lubridate::yday(date)) %>% 
  # drop the records whose dates failed to parse
  filter(!is.na(doy))
## Warning: 328 failed to parse.

The geom_density_ridges geom (from the ggridges package) is a nice option for plotting distributions when you have many categories. For example, here I plot the distribution of observations of the genus through the year, for each year between 2000 and 2010.

library(ggridges)

# Here I filter the data to the years of interest
# and make sure that year is coded as a categorical variable (factor).
temp_plotata <- vanessa2 %>%
  filter(year %in% 2000:2010) %>%
  mutate(year = as.factor(year))

ggplot(temp_plotata, aes(x = doy, y = year, fill = year)) +
  geom_density_ridges() + 
  ggtitle("Frequency distribution of observations through years")
## Picking joint bandwidth of 9.16

We will now focus on a single year, for demonstration purposes.

phenoExample <- vanessa2 %>%
  filter(year == "2015")

We can look at distribution of observations using geom_density.

ggplot(phenoExample, aes(x = doy)) +
  geom_density()

How could we characterise this distribution for this year?

Another way to look at this type of data is with a cumulative distribution. We first need to arrange the data in order of observation date (doy), and then use mutate to create a new column with a running count of the observations.

phenoExample <- phenoExample %>%
  arrange(doy) %>%           # sort by day of year
  mutate(count = seq(n()))   # running count: 1, 2, ..., n

ggplot(phenoExample, aes(x = doy, y = count)) +
  geom_line() +
  ggtitle("Cumulative count of observations in 2015")

We can use summarise and quantile to get the quantiles for this year:

phenoExample %>% 
  summarise(q10 = quantile(doy, 0.1), q50 = quantile(doy, 0.5))
## # A tibble: 1 x 2
##     q10   q50
##   <dbl> <dbl>
## 1   166   242

For all years we can elegantly do the same thing by using group_by to tell R that we want to make these calculations for each year in the dataset.

phenoSummary <- vanessa2 %>% 
  group_by(year) %>% 
  summarise(q50 = quantile(doy, 0.5))

Let’s take a look at the last few rows:

tail(phenoSummary)
## # A tibble: 6 x 2
##    year   q50
##   <int> <dbl>
## 1  2016   245
## 2  2017   223
## 3  2018   209
## 4  2019   225
## 5  2020   222
## 6  2021   206

We could graph this easily, but at the moment you should take this plot with a pinch of salt because we have not yet considered data quality/sample size issues.

ggplot(phenoSummary, aes(x = year, y = q50)) +
  geom_point()
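
One quick check along those lines is to count how many observations each year contributes; years with very few records will give unreliable quantile estimates (a sketch using dplyr's count()):

vanessa2 %>% count(year)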