Phenology of Vanessa atalanta

Before starting, make sure that you have set your working directory (folder). You will be creating and importing files in/from this directory.

Packages

We need to use several packages to do this: - the tidyverse packages for data manipulation. - the taxize package for obtaining taxonomic information. - the rgbif package for obtaining the GBIF data.

You might need to install them first using install.packages().

library(tidyverse)
library(taxize)
library(rgbif)

vanessa_dk <- read.csv("allData.csv") %>% 
  filter(species == "Vanessa atalanta")

names(vanessa_dk)

##  [1] "family"           "genus"            "scientificName"   "species"         
##  [5] "year"             "month"            "day"              "lifeStage"       
##  [9] "decimalLatitude"  "decimalLongitude"

Selecting and filtering the data.

We don’t need all of this data. There are >100 columns of data, most of which are not necessary for the analysis we have in mind. We only want information on the observed species, where and when it was observed, and (possibly) its life stage (adult/larva etc.).

This is how we can select those columns.

vanessa2 <- vanessa_dk %>%
  dplyr::select(species, decimalLatitude, decimalLongitude, year, month, day)

we can plot the geographic distribution like this:

ggplot(vanessa2, aes(x = decimalLongitude, y = decimalLatitude)) +
  geom_point()

There is a single outlier point there in Greenland. We can remove this using filter. We can also filter the data by species so that we only look at V. atalanta (and not its sister species V. cardui).

vanessa2 <- vanessa_dk %>%
  dplyr::select(species, decimalLatitude, decimalLongitude, year, month, day, lifeStage) %>%
  dplyr::filter(decimalLatitude < 75) %>%
  dplyr::filter(species == "Vanessa atalanta")

And replot…

ggplot(vanessa2, aes(x = decimalLongitude, y = decimalLatitude)) +
  geom_point()

This looks good.

Phenology

To look at phenology we need to turn the data information into a day of the year (doy). To do that we can use the function ymd in the lubridate package (lubridate::ymd) to produce a correctly-formatted date. It is useful to first filter the data again to remove any observations with missing (NA) observation dates. You will get a message saying that a certain number “failed to parse”. This means that for some records the lubridate::ymd function could not work because one or more of the year/month/day data is missing. It is a very small fraction of the total observations so we can forget about them and simply filter the data to remove those records.

After getting a correctly-formatted date, we can use the lubridate::yday function to express the date as a day-of-the-year, which we can put in a column called doy.

vanessa2 <- vanessa2 %>%
  mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) %>%
  mutate(doy = lubridate::yday(date)) %>% 
  filter(!is.na(doy))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `date = lubridate::ymd(paste(year, month, day, sep = "-"))`.
## Caused by warning:
## !  328 failed to parse.

The geom_density_ridges graphical geom (from the ggridges package) is also a nice option for plotting when you have many categories. For example, here I plot the distribution through time of observations the genus between 2000 and 2010.

library(ggridges)

#I here filter the data to years of interest
#and make sure that year is coded as a categorical variable.
temp_plotata <- vanessa2 %>%
  filter(year %in% 2000:2024) %>%
  mutate(year = as.factor(year))

ggplot(temp_plotata, aes(x = doy, y = year, fill = year)) +
  geom_density_ridges() + 
  ggtitle("Frequency distribution of observations through years")

## Picking joint bandwidth of 8.81

We will now focus on a single year, for demonstration purposes.

phenoExample <- vanessa2 %>%
  filter(year == "2015")

We can look at distribution of observations using geom_density.

ggplot(phenoExample, aes(x = doy)) +
  geom_density()

How could we charactarise this distribution, for this year?

day of first observation
day that density peaks
day that density reaches some other value? e,g, 25% of peak?
“spread” of the observatins (e.g. variance)
…

Another way to look at this type of data is with a cumulative distribution. We can use mutate to create a new column where we count the observations. We need to first arrange the data in order of observation date.

phenoExample <- phenoExample %>%
  arrange(doy) %>%
  mutate(count = seq(n()))

ggplot(phenoExample, aes(x = doy, y = count)) +
  geom_line() +
  ggtitle("Cumulative count of observations in 2015")

A<- ggplot(phenoExample, aes(x = doy)) +
  geom_histogram() + 
    ggtitle("Density of observations in 2015")

B<- ggplot(phenoExample, aes(x = doy, y = count)) +
  geom_line() +
  ggtitle("Cumulative count of observations in 2015")
library(patchwork)
A/B

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can use summarise and quantile to get the quantiles for this year:

phenoExample %>% 
  summarise(q10 = quantile(doy,.1),q50 = quantile(doy,.5))

##   q10 q50
## 1 166 241

For all years we can elegantly do the same thing by using group_by to tell R that we want do make these calculations for each year in the dataset.

phenoSummary <- vanessa2 %>% 
  group_by(year) %>% 
  summarise(q50 = quantile(doy,.5))

Let’s take a look at the last few rows:

tail(phenoSummary)

## # A tibble: 6 × 2
##    year   q50
##   <int> <dbl>
## 1  2015   241
## 2  2016   245
## 3  2017   223
## 4  2018   208
## 5  2019   224
## 6  2020   222

We could graph this easily, but at the moment you should take this plot with a pinch of salt because we have not yet considered data quality/sample size issues.

ggplot(phenoSummary %>% filter(year>1999),aes(x = year, y = q50))+geom_point() + geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

Phenology of Vanessa atalanta

Owen Jones

07 October, 2024

Packages

Selecting and filtering the data.

Phenology