R Exercise Introduction

Since our lab is so data and code driven, we have historically asked final candidates to complete an R, Python, or mapping exercise in order to assess their aptitude for these skills. The following exercise has been adapted from an undergraduate homework assignment from the MSU FW Limnology course from this year. The exercise includes some pre-run code to familiarize you with the topic and coding concepts involved followed by a prompt for you to complete. You will then submit as a knitted HTML file the final output. You are free to use any resources, including adapting online code you find, as long as you do not directly receive help from another person. You may want to check that both your R version and RStudio versions are up to date prior to beginning.

The exercise will first go through the steps of downloading species occurrence data for one species from GBIF, the Global Biodiversity Information Facility, which includes species records from museums, community science platforms, and other datasets. The data will be accessed using an API (Application Programming Interface), allowing for download directly into the R session. The code will then present a few example maps and mapping packages, which you are free to adapt or modify as you see fit. You should also feel free to complete the exercise using different methods that you are more comfortable with, but this will not be treated as a criterion for assessment. The exercise will be assessed for (1) completeness, (2) understandable annotation of code which is important for sharing in our lab, and (3) the quality of the final figures or maps.

Coding Example Start

Install required packages:

library(rgbif) #Package for accessing GBIF API
library(ggplot2) #Plotting library
library(maps) #Basemap library
library(hexbin) #Hexbin for plotting
library(ggthemes) #Themes for plotting
library(gganimate) #Animation library for ggplot
library(gifski) #GIF-making library
library(dplyr) #Installed to use filter() function
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

GBIF API Data Pull

We will use the occ_data() function from the package rgbif to access the GBIF API for data. You can see the many toggles available to us for pulling data and that they are unique to this API. For these examples, we will map data for the scientificName of Hyla versicolor, or the gray treefrog. For this example pull, I will also specify stateProvince of Michigan and, importantly, specify hasCoordinate = TRUE since we only want occurrence records that have coordinates that we can map.

Since we get much more information than we need, we also have to do some cleaning of the data pull to get something simpler to work with. First, we want to (1) take the data portion of the pull and turn it into a dataframes, (2) remove occurrence records without year reported, and (3) specify what columns we want to keep out of the 100+.

IMPORTANTLY, we then save the data using write.csv() so that we have a local file for these data. We then don’t have to pull data again from the API which can be time consuming and so that our data do not change. After writing this once, we can # out the API pull code so that it won’t run each time we generate our report.

# 
# hyla_data <- occ_data(scientificName = c("Hyla versicolor"), stateProvince = "Michigan", hasCoordinate = TRUE, limit = 20000)
# 
# hyla_data <- as.data.frame(hyla_data$data)
# 
# hyla_data <- hyla_data[!is.na(hyla_data$year), ]
# 
# columns_to_keep <- c("key", "scientificName", "decimalLatitude", "decimalLongitude", "occurrenceStatus", "basisOfRecord", "gbifRegion", "stateProvince", "year", "datasetName")
# 
# hyla_data <- hyla_data[, columns_to_keep]
# 
# write.csv(hyla_data, file = "hyla_data.csv")

Next we can read back in our saved data to work with:

hyla_data <- read.csv(file = "hyla_data.csv")

Creating basemaps

The packages we brought in above contain basemaps that we can use to plot our species occurrence data on top of. Here, we are saying we want a US county map that I have centered around Michigan with the x limit for our longitude range and y limit for our latitude range around the state. We also specify the colors we want for our map.

michigan <- ggplot() +
  borders("county", colour = "gray85", fill = "gray80", xlim = c(-89, -82), ylim = c(41.5, 48)) +
  theme_map()
michigan

We can also create world and U.S. basemaps in the same way. You can use these basemaps for your portion of the exercise.

world <- ggplot() +
  borders("world", colour = "gray85", fill = "gray80") +
  theme_map() 
world

usa <- ggplot() +
  borders("usa", colour = "gray85", fill = "gray80") +
  theme_map() 
usa

Plotting Occurrence Data

We can now plot our point data (a vector type) for the occurrence of Hyla versicolor in GBIF across Michigan. Just as we used ‘+’ to layer maps in the previous exercise, we can layer points using geom_point() on top of our basemap, specifying the data source hyla_data and that our aes() or aesthetic uses the column decimalLongitude for the x data and the column decimalLatitude for the y data. We can also specify the size of our points using size = .

We can layer on still more commands by specifying the title we want and the text size of that title.

michigan +
    geom_point(data = hyla_data, aes(decimalLongitude, decimalLatitude), size = 5) +
    labs(title = 'GBIF Michigan Hyla versicolor') + 
  theme(plot.title = element_text(size=28))

Although our data here are accurate, you can see a common issue with mapping where you have many overlapping points that make it difficult to see the actual density of the data. We call this overplotting. Next, we will go over three (of many) strategies to better visualize overlapping data.

2D Density Plot

One is to bin the data and instead plot the density as opposed to each individual point. Here, we do this with geom_bin2d(). Note that here instead of point size we specify the number of bins = . If you increase the number of bins, then you will get smaller bins and vice versa. If you use this method then you can adjust the bin size for one that makes sense for your occurrence data for visualization.

We also specify the color ramp, again using the color-blind friendly palette viridis.

michigan +
    geom_bin2d(data = hyla_data, aes(decimalLongitude, decimalLatitude), bins = 50) +
  scale_fill_continuous(type = "viridis") +
    labs(title = 'GBIF Michigan Hyla versicolor') +
  theme(plot.title = element_text(size=28))

Hexbin Plot

A nearly identical method that many prefer visually is the hexbin plot, here using geom_hex(). For this, we will set a smaller number of bins to show how decreasing this number leads to larger areas to calculate point density.

michigan +
    geom_hex(data = hyla_data, aes(decimalLongitude, decimalLatitude), bins = 30) +
  scale_fill_continuous(type = "viridis") +
    labs(title = 'GBIF Michigan Hyla versicolor') +
  theme(plot.title = element_text(size=28))

Animated Occurrence Plot

Finally, we can also plot occurrences through times using an animation. NOTE: Making an animation will be optional here but can be a useful skill to learn if you are up for a challenge. You can actually animate nearly any kind of plot, not just maps!

We will again use geom_point() for plotting points, but we will change the alpha = to a value below 1, which decreases point opacity.

We specify that we want transitions between frames, transition_reveal() of our animation to be based on year observed. We also specify that each point is unique by specifying group = the unique identifier named key in our data. Finally, we indicate that we want the title for each frame of our animation to be our year. This whole object called ani_hyla is our set of frames that we want to animate.

Next, we use the animate() function to animate our above plot, specifying the pixel height and pixel width of the animation. You may need to change these proportions depending on what makes sense for your map. Finally, we specify that we want to render it as a gif using the gifski_renderer().

ani_hyla <- michigan +
  geom_point(data = hyla_data,
             aes(decimalLongitude, decimalLatitude, group = key), color = 'dark green', size = 30, alpha = 0.4) +
  transition_reveal(year) +
  labs(title = '{(frame_along)}') +
  theme(plot.title = element_text(size=1000))

#needed to change animation rendering size to be able to fit into the html
animate(ani_hyla, height = 100, width = 100, res = 25, renderer = gifski_renderer())

Coding Exercise Start

For this exercise, you will create your own maps to show the invasion of the non-native spiny water flea in the US:

Bythotrephes longimanus Spiny waterflea

In the RMarkdown file, we have provided three code chunks for you to work in, but feel free to add and annotate as many as you need to complete the exercise.

GBIF API Call

Bring in the occurrence data for spiny water flea from the GBIF API using the occ_data function. Specify that the records have coordinates for all global records. This may take a few minutes to bring into your session, which is another reason why we want to only do this once!

Clean data as in the tutorial by converting to a data frame, removing NA for year, and selecting the same columns as above

Save your data locally using write.csv() and # out your code once you have your local data so that it doesn’t run again.

#   *Bring in the data!*
# 
# # import GBIF data for spiny water flea using **occ_data**
#  spiny_data <- occ_data(scientificName = c("Bythotrephes longimanus"),
#        hasCoordinate = TRUE, limit = 20000)
# 
# #(1) take the data portion of the pull and turn it into a dataframe
# spiny_data <- as.data.frame(spiny_data$data)
# 
# #(2) remove occurrence records without year reported
# spiny_data <- spiny_data[!is.na(spiny_data$year), ]
# 
# #(3) specify what columns we want to keep
# keep_columns <- c("key", "scientificName", "decimalLatitude", "decimalLongitude", "occurrenceStatus", "basisOfRecord", "gbifRegion", "stateProvince", "year", "datasetName")
# 
# spiny_data <- spiny_data[, keep_columns]
# 
# #save data locally
# write.csv(spiny_data, file = "spiny_data.csv")

#read back the data
spiny_data <- read.csv(file = "spiny_data.csv")

Pre-Invasion Distribution

Make a map of the pre-invasion distribution of the species based on GBIF data by filtering the data frame to the years before it arrived in the U.S. or North America:

#   *Create a map of the pre-invasion distribution of Bythotrephes longimanus*

#create global basemap
spiny_world <- ggplot() +
  borders("world", colour = "gray85", fill = "gray80") +
  theme_map() 
spiny_world

#filter to only include data from before 1982, 
#the year Bythotrephes longimanus was first reported in North America 
#source: (https://nas.er.usgs.gov/queries/factsheet.aspx?SpeciesID=162)  
inv_date <- 1982 
pre_invasion <- filter(spiny_data, year<inv_date)

#create Hexbin plot showing global distribution of 
#Bythotrephes longimanus before invasion of North America
spiny_world +
    geom_hex(data = pre_invasion, aes(decimalLongitude, decimalLatitude), bins = 75) +
  scale_fill_continuous(type = "viridis") +
    labs(title = 'GBIF Bythotrephes longimanus Pre-Invasion Distribution') +
  theme(plot.title = element_text(size=15))

# *Create a close-up map for greater visibility and detail*

#basemap of the european region containing the datapoints
europe <- ggplot() +
  borders("world", colour = "gray85", fill = "gray80", xlim = c(-5, 26), ylim = c(45, 75)) +
  theme_map()
europe

#hexbin plot in new region
europe +
    geom_hex(data = pre_invasion, aes(decimalLongitude, decimalLatitude), bins = 50) +
  scale_fill_continuous(type = "viridis") +
    labs(title = 'GBIF Bythotrephes longimanus Pre-Invasion Distribution') +
  theme(plot.title = element_text(size=15))

U.S. Invasion Through Time

Create a plot, series of plots, or animation that shows the spread of the invasive species through time for the U.S. or North America using a visualization method that deals with overplotting:

#   *Create an animation and hexbin plot showing the spread of*
#   *Bythotrephes longimanus through time in North America*

#exclude data from outside North America
  #(so that we can zoom in on the invasion)
invasion <- filter(spiny_data, gbifRegion == "NORTH_AMERICA")

#create a basemap of USA and Canada 
  #*only used USA and Canada for map clarity because data does not extend farther south
  #*added coordinates (from Michigan basemap) because map was off-center 
us_canada <- ggplot() +
  borders("world", region = c("Canada", "USA"), colour = "gray85", fill = "gray80", xlim = c(-89, -100), ylim = c(41.5, 48))+
  theme_map()
us_canada

#animate!
  #change color to red for dramatic effect
ani_spiny <- us_canada +
  geom_point(data = invasion,
             aes(decimalLongitude, decimalLatitude, group = key), color = 'dark red', size = 30, alpha = 0.4) +
  transition_reveal(year) +
  labs(title = '{(frame_along)}') +
  theme(plot.title = element_text(size=1000))

animate(ani_spiny, height = 100, width = 100, res = 25, renderer = gifski_renderer())

#create a hexbin plot 
  #(to see the data all at once)
us_canada +
    geom_hex(data = invasion, aes(decimalLongitude, decimalLatitude), bins = 75) +
  scale_fill_continuous(type = "viridis") +
    labs(title = 'GBIF North American Bythotrephes longimanus Invasion') +
  theme(plot.title = element_text(size=15))

What was the most common database source for your chosen species US occurrence data? Please find this by using R code:

# #   *Find out which database source is the most common*
# 
# #re-import data into new dataframe because spiny_data excludes elements without year
# #and we now want ALL data that includes datasetName
# 
# #import data for spiny water flea using **occ_data**
#  spiny_all <- occ_data(scientificName = c("Bythotrephes longimanus"),
#           limit = 20000)
# 
# #(1) take the data portion of the pull and turn it into a dataframe
# spiny_all <- as.data.frame(spiny_all$data)
# 
# #(2) remove occurrence records without datasetName
# spiny_all <- spiny_all[!is.na(spiny_all$datasetName), ]
# 
# #(3) specify what columns we want to keep
# keep <- c("key", "scientificName", "decimalLatitude", "decimalLongitude", "occurrenceStatus", "basisOfRecord", "gbifRegion", "stateProvince", "year", "datasetName")
# spiny_all <- spiny_all[, keep]
# 
# #save data locally
# write.csv(spiny_all, file = "spiny_all.csv")

#read back saved data
spiny_all <- read.csv(file = "spiny_all.csv")


#now that we have our new dataframe...

#use unlist() to transform datasetName column into a vector
  #to use in our function
bor_vec <- unlist(spiny_all$datasetName)

#define a function to determine the statistical mode of bor_vec
mode_func <- function(x) {
  ux <- unique(x) 
  ux[which.max(tabulate(match(x, ux)))] 
  
}
#function elements explanation:
  #match() returns for each element in x its position in ux
  #tabulate() counts the number of occurrences of each integer returned by match()
  #which.max() returns the position of the integer with the greatest number of occurerences
  #indexing this function into ux returns whichever value occurs most frequently in the list 

#use mode_func() to find the most common database source
#print result within a message
result <- paste("The most common database source is...", mode_func(bor_vec), "!") 
cat(result) 
## The most common database source is... NatureServe Network Species Occurrence Data !