Before starting, make sure that you have set your working directory (folder). You will be creating and importing files in/from this directory.
We need to use several packages to do this: - the
tidyverse packages for data manipulation. - the
taxize package for obtaining taxonomic information. - the
rgbif package for obtaining the GBIF data.
You might need to install them first using
install.packages().
library(tidyverse)
library(taxize)
library(rgbif)
vanessa_dk <- read.csv("allData.csv") %>%
filter(species == "Vanessa atalanta")
names(vanessa_dk)
## [1] "family" "genus" "scientificName" "species"
## [5] "year" "month" "day" "lifeStage"
## [9] "decimalLatitude" "decimalLongitude"
We don’t need all of this data. There are >100 columns of data, most of which are not necessary for the analysis we have in mind. We only want information on the observed species, where and when it was observed, and (possibly) its life stage (adult/larva etc.).
This is how we can select those columns.
vanessa2 <- vanessa_dk %>%
dplyr::select(species, decimalLatitude, decimalLongitude, year, month, day)
we can plot the geographic distribution like this:
ggplot(vanessa2, aes(x = decimalLongitude, y = decimalLatitude)) +
geom_point()
There is a single outlier point there in Greenland. We can remove this using filter. We can also filter the data by species so that we only look at V. atalanta (and not its sister species V. cardui).
vanessa2 <- vanessa_dk %>%
dplyr::select(species, decimalLatitude, decimalLongitude, year, month, day, lifeStage) %>%
dplyr::filter(decimalLatitude < 75) %>%
dplyr::filter(species == "Vanessa atalanta")
And replot…
ggplot(vanessa2, aes(x = decimalLongitude, y = decimalLatitude)) +
geom_point()
This looks good.
To look at phenology we need to turn the data information into a day
of the year (doy). To do that we can use the function
ymd in the lubridate package
(lubridate::ymd) to produce a correctly-formatted date. It
is useful to first filter the data again to remove any observations with
missing (NA) observation dates. You will get a message
saying that a certain number “failed to parse”. This means
that for some records the lubridate::ymd function could not
work because one or more of the year/month/day data is missing. It is a
very small fraction of the total observations so we can forget about
them and simply filter the data to remove those records.
After getting a correctly-formatted date, we can use the
lubridate::yday function to express the date as a
day-of-the-year, which we can put in a column called
doy.
vanessa2 <- vanessa2 %>%
mutate(date = lubridate::ymd(paste(year, month, day, sep = "-"))) %>%
mutate(doy = lubridate::yday(date)) %>%
filter(!is.na(doy))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `date = lubridate::ymd(paste(year, month, day, sep = "-"))`.
## Caused by warning:
## ! 328 failed to parse.
The geom_density_ridges graphical geom (from the
ggridges package) is also a nice option for plotting when
you have many categories. For example, here I plot the distribution
through time of observations the genus between 2000 and 2010.
library(ggridges)
#I here filter the data to years of interest
#and make sure that year is coded as a categorical variable.
temp_plotata <- vanessa2 %>%
filter(year %in% 2000:2024) %>%
mutate(year = as.factor(year))
ggplot(temp_plotata, aes(x = doy, y = year, fill = year)) +
geom_density_ridges() +
ggtitle("Frequency distribution of observations through years")
## Picking joint bandwidth of 8.81
We will now focus on a single year, for demonstration purposes.
phenoExample <- vanessa2 %>%
filter(year == "2015")
We can look at distribution of observations using
geom_density.
ggplot(phenoExample, aes(x = doy)) +
geom_density()
How could we charactarise this distribution, for this year?
Another way to look at this type of data is with a cumulative
distribution. We can use mutate to create a new column
where we count the observations. We need to first arrange
the data in order of observation date.
phenoExample <- phenoExample %>%
arrange(doy) %>%
mutate(count = seq(n()))
ggplot(phenoExample, aes(x = doy, y = count)) +
geom_line() +
ggtitle("Cumulative count of observations in 2015")
A<- ggplot(phenoExample, aes(x = doy)) +
geom_histogram() +
ggtitle("Density of observations in 2015")
B<- ggplot(phenoExample, aes(x = doy, y = count)) +
geom_line() +
ggtitle("Cumulative count of observations in 2015")
library(patchwork)
A/B
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can use summarise and quantile to get
the quantiles for this year:
phenoExample %>%
summarise(q10 = quantile(doy,.1),q50 = quantile(doy,.5))
## q10 q50
## 1 166 241
For all years we can elegantly do the same thing by using
group_by to tell R that we want do make these calculations
for each year in the dataset.
phenoSummary <- vanessa2 %>%
group_by(year) %>%
summarise(q50 = quantile(doy,.5))
Let’s take a look at the last few rows:
tail(phenoSummary)
## # A tibble: 6 × 2
## year q50
## <int> <dbl>
## 1 2015 241
## 2 2016 245
## 3 2017 223
## 4 2018 208
## 5 2019 224
## 6 2020 222
We could graph this easily, but at the moment you should take this plot with a pinch of salt because we have not yet considered data quality/sample size issues.
ggplot(phenoSummary %>% filter(year>1999),aes(x = year, y = q50))+geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'