1. Tracking a changing climate

The climate is changing around the world. The impacts of climate change are felt in many areas, and they are particularly noticeable in their effects on birds. Many bird species are shifting their ranges northwards, where they can, to stay within climatic conditions that suit them.

Our analysis will use data from the UK Met Office together with records from the Global Biodiversity Information Facility (GBIF) to build our very own species distribution model using machine learning. This model will be able to predict where our bird species of interest is likely to occur in the future - information that is invaluable to conservation organizations working on the ground to preserve these species and save them from extinction!

In this notebook, we will model the Scottish crossbill (Loxia scotica). The Scottish crossbill is a small bird that inhabits the cool Scottish forests and feeds on pine seeds. Only ~20,000 individuals of this species are alive today. The code and the data sources in this project can be reapplied to any other species we may be interested in studying.

We will start by importing the climate data from a local rds file.

# Load in the tidyverse, raster, and sf packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(raster)
## Loading required package: sp
## 
## Attaching package: 'raster'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE
# Read the climate data from an rds file
climate  <- read_rds("~/AllEY/Data Science/Impact of Climate Change on Birds/climate_raster.rds")

# Have a look at the variables in the climate data
colnames(climate)
## [1] "decade"  "rasters"
# Convert to SpatialPixelsDataFrame for plotting
climate_df <- mutate(
  .data = climate, 
  rasters = map(
    .x = rasters, 
    ~ as_tibble(as(.x, "SpatialPixelsDataFrame")))) %>%
  unnest(cols = c(rasters))
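
As a quick sanity check, we could peek at the structure of the unnested data frame (a minimal sketch; glimpse() comes with the tidyverse we have already loaded):

# Inspect the structure of the unnested climate data
glimpse(climate_df)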

2. Mapping a changing climate

We have loaded the pre-processed climate data and converted it to a SpatialPixelsDataFrame. The unnested data frame now contains all the information we need: the decade, the grid-cell coordinates (x, y), and the climate variables (minimum.temperature, maximum.temperature, rainfall, wind.speed, snow.lying, and air.frost).

An excellent first step in any analysis is visualizing the data. It confirms that the data import worked, and it helps us develop intuition about the patterns in our dataset. Since we are dealing with spatial data, let us create maps! We will start with two maps: one of the climatic conditions in 1970 and one of the climatic conditions in 2010. Our climate data has several variables, so let us pick minimum.temperature for now.

library(ggthemes)

# Filter the data to plot
ggp_temperature <- climate_df %>%
  filter(decade %in% c(1970, 2010)) %>% 
  # Create the plot
  ggplot(aes(x = x, y = y)) + geom_tile(aes(fill = minimum.temperature)) +
  # Style the plot with options ideal for maps
  theme_map() + coord_equal() +
  facet_grid(~ decade) + scale_fill_distiller(palette = "Spectral") + 
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Minimum of Average Monthly Temperature (Celsius)", caption = 'Source: MetOffice UK')

# Display the map
ggp_temperature

3. Fieldwork in the digital age – download the data

Now we need to obtain species occurrence records. This used to be the main challenge in biogeography. Natural historians, such as Charles Darwin and Alexander von Humboldt, traveled around the globe for years on rustic sail ships collecting animal and plant specimens to understand the natural world. Today, we stand on the shoulders of giants. Getting data is fast and easy thanks to two organizations: the Global Biodiversity Information Facility (GBIF), which aggregates species occurrence records from around the world, and rOpenSci, whose rgbif package lets us query the GBIF database directly from R.

#install.packages("rgbif")
library(rgbif)

myData<-occ_data(taxonKey=10709636)
datasetCounts<-myData$data %>% count(datasetKey, sort=TRUE) 
write.table(datasetCounts, "~/AllEY/Data Science/Impact of Climate Change on Birds/occ_search.R",col.names=FALSE, row.names=FALSE,sep=",")
#source("~/AllEY/Data Science/Impact of Climate Change on Birds/occ_search.R")

# Call the API to get the occurrence records of this species
gbif_response <- occ_search(
  scientificName = "Loxia scotica", country = "GB",
  hasCoordinate = TRUE, hasGeospatialIssue = FALSE, limit = 2000)

# Inspect the class and names of gbif_response
class(gbif_response)
## [1] "gbif"
names(gbif_response)
## [1] "meta"      "hierarchy" "data"      "media"     "facets"
# Print the first six lines of the data element in gbif_response
head(gbif_response[["data"]])
## # A tibble: 6 x 130
##   key     scien~1 decim~2 decim~3 issues datas~4 publi~5 insta~6 publi~7 proto~8
##   <chr>   <chr>     <dbl>   <dbl> <chr>  <chr>   <chr>   <chr>   <chr>   <chr>  
## 1 376974~ Loxia ~    57.2   -3.62 ""     b10478~ 1f00d7~ 85ccfd~ NL      DWC_AR~
## 2 376974~ Loxia ~    57.2   -3.62 ""     b10478~ 1f00d7~ 85ccfd~ NL      DWC_AR~
## 3 376974~ Loxia ~    57.3   -3.58 ""     b10478~ 1f00d7~ 85ccfd~ NL      DWC_AR~
## 4 376975~ Loxia ~    57.3   -3.58 ""     b10478~ 1f00d7~ 85ccfd~ NL      DWC_AR~
## 5 376975~ Loxia ~    57.2   -3.62 ""     b10478~ 1f00d7~ 85ccfd~ NL      DWC_AR~
## 6 378496~ Loxia ~    57.0   -3.42 "cdro~ 50c950~ 28eb1a~ 997448~ US      DWC_AR~
## # ... with 120 more variables: lastCrawled <chr>, lastParsed <chr>,
## #   crawlId <int>, hostingOrganizationKey <chr>, basisOfRecord <chr>,
## #   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
## #   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
## #   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>,
## #   kingdom <chr>, phylum <chr>, order <chr>, family <chr>, genus <chr>,
## #   species <chr>, genericName <chr>, specificEpithet <chr>, ...

4. Sorting out the bad eggs – data cleaning

GBIF and rOpenSci just saved us years of roaming around the highlands with a pair of binoculars, camping in mud, rain, and snow, and chasing crossbills through the forest! Nevertheless, it is still up to us to make sense of the data. In particular, data collected at this large scale can have issues. Luckily, GBIF provides some useful metadata on each record.

Here are some criteria we can use: the record should have no data-quality issues flagged by GBIF, it should carry an open (Creative Commons) license, and it should fall within the decades for which we have climate data (the 1970s through the 2010s).

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:raster':
## 
##     intersect, union
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
birds_dated <- mutate(
  .data = gbif_response$data,
  # Create a new column specifying the decade of observation
  decade = ymd_hms(eventDate) %>% round_date("10y") %>% year())

birds_cleaned <- birds_dated %>%
  filter(
    issues == "" &
    str_detect(license, "http://creativecommons.org/") &
    # No records before 1970s decade or after 2010s decade
    decade >= 1970 & decade <= 2010
  ) %>%
  transmute(decade = decade, x = decimalLongitude, y = decimalLatitude) %>%
  arrange(decade)
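
To see how strict the cleaning was, a quick comparison of record counts before and after filtering can help (a minimal sketch):

# Compare the number of records before and after cleaning
nrow(birds_dated)
nrow(birds_cleaned)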

5. Nesting the data

We have cleaned the data, but there is a problem. We want to know the climatic conditions at the location of the bird observation at the time of the observation. This is tricky because we have climate data from multiple decades. How do we match each bird observation to the correct climate raster cell?

We will use a nifty trick: we can nest() data in a list column. The result is a data frame where the grouping columns do not change, and a list column of aggregated data from each group is added. List columns can hold many different kinds of variables such as vectors, data frames, and even objects. For example, the climate data that we imported earlier was already nested by decade and had a list column (rasters) that contained a rasterStack object for each decade.
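
To build intuition for what nest() does, here is a toy example on made-up data (the values are hypothetical and not part of our dataset):

# Toy example: nesting collapses each group into a one-row entry with a list column
tibble(g = c("a", "a", "b"), value = 1:3) %>%
  group_by(g) %>%
  nest()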

# "Nest" the bird data
birds_nested <- birds_cleaned %>% 
  group_by(decade) %>% 
  nest(.key = "presences")
## Warning: `.key` is deprecated
head(birds_nested)
## # A tibble: 5 x 2
## # Groups:   decade [5]
##   decade presences         
##    <dbl> <list>            
## 1   1970 <tibble [8 x 2]>  
## 2   1980 <tibble [13 x 2]> 
## 3   1990 <tibble [36 x 2]> 
## 4   2000 <tibble [407 x 2]>
## 5   2010 <tibble [471 x 2]>
# Calculate the total number of records per decade
birds_counted <- birds_nested %>%
  mutate(n = map_dbl(.x = presences, .f = nrow))

head(birds_counted)
## # A tibble: 5 x 3
## # Groups:   decade [5]
##   decade presences              n
##    <dbl> <list>             <dbl>
## 1   1970 <tibble [8 x 2]>       8
## 2   1980 <tibble [13 x 2]>     13
## 3   1990 <tibble [36 x 2]>     36
## 4   2000 <tibble [407 x 2]>   407
## 5   2010 <tibble [471 x 2]>   471

6. Making things spatial - projecting our observations

Excellent! Both of our datasets are nested by decade now. We have one more step before we can extract the climatic conditions at the bird locations. The locations in birds_counted are latitude and longitude coordinates, but R does not yet know that they represent positions in space. We need to convert them to spatial objects and project them.

Projections are necessary because maps are 2-dimensional, but the earth is 3-dimensional. There is no entirely accurate way to represent the surface of a 3D sphere in 2D. Projections are sets of conventions to help us with this issue. GBIF hosts data from around the world and uses a global projection (WGS84). The Met Office is a UK organization and provides data in the British National Grid projection (BNG).

To project spatial data, use Coordinate Reference System (CRS) strings.

# Define geographical projections
proj_latlon <- st_crs("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
proj_ukgrid <- st_crs("+init=epsg:27700")
## Warning in CPL_crs_from_input(x): GDAL Message 1: +init=epsg:XXXX syntax is
## deprecated. It might return a CRS with a non-EPSG compliant axis order.
# Convert records to spatial points and project them
birds_presences <- mutate(birds_counted,
  presences = map(presences, ~ .x %>%
    # Specify the current projection
    st_as_sf(coords = c("x", "y"), crs = proj_latlon) %>%
    # Transform to new projection
    st_transform(crs = proj_ukgrid)))
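
To convince ourselves that the transformation behaves as expected, we can project a single point by hand; the coordinates below are taken from the GBIF output shown earlier (a minimal sketch):

# Project one longitude/latitude point to the British National Grid
pt_latlon <- st_sfc(st_point(c(-3.62, 57.2)), crs = proj_latlon)
st_transform(pt_latlon, crs = proj_ukgrid)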

7. Extract exactly what we want

Now we are ready to combine the two datasets and extract the climatic conditions at each location for the given decade. This is where the nested structure comes in handy! We join the data frames by their grouping column and can rest assured that the data in the list columns are matched up correctly. This lets us operate on the list-column variables element-wise using the map() family of functions.
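
As a toy illustration of element-wise mapping (made-up numbers, purely for intuition), map2() walks two lists in parallel and applies a function to each pair of elements:

# Toy example: multiply each element of the first list by the matching scalar in the second
map2(list(1:2, 3:4), list(10, 100), ~ .x * .y)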

# Combine the bird data and the climate data in one data frame
birds_climate <- full_join(birds_presences, climate, by = "decade")

presence_data <- map2_df(
  .x = birds_climate[["rasters"]],
  .y = birds_climate[["presences"]],
  # extract the raster values at presence locations
  ~ raster::extract(x=.x, y=.y) %>% 
    as_tibble() %>% 
    mutate(observation = "presence"))

8. Pseudo-absences

To train a classification model, the algorithm needs examples of two classes: presences and absences. Our presences are the observations from GBIF. Absences are a lot harder to come by.

The difficulty stems from an information asymmetry between presences and absences. A bird observation tells us with certainty that the species occurred at that location, but to be certain that it does not occur somewhere, we would have to monitor that site continuously.

One way to deal with this problem is to generate “pseudo-absences”: a random sample of locations drawn from the entire study area. We assume the species does not occur at these random locations, and we hope that the average true probability of occurrence there is low enough to give our algorithm something to learn.

# Define helper function for creating pseudo-absence data
create_pseudo_absences <- function(rasters, n, ...) {
  # Sample 5 random raster cells for every presence record in the decade
  set.seed(12345)
  sampleRandom(rasters, size = n * 5, sp = TRUE) %>%
    raster::extract(rasters, .) %>%
    as_tibble() %>%
    mutate(observation = "pseudo_absence")
}

# Create pseudo-absence proportional to the total number of records per decade
pseudo_absence_data <- pmap_df(.l = birds_climate, .f = create_pseudo_absences)

# Combine the two datasets
model_data <- bind_rows(presence_data, pseudo_absence_data) %>%
  mutate(observation = factor(observation)) %>% na.omit()
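
Before modelling, it is worth checking the class balance in the combined dataset (a minimal sketch; with 5 pseudo-absences drawn per presence we expect roughly a 1:5 ratio, minus any rows dropped by na.omit()):

# Check how many presences and pseudo-absences we end up with
count(model_data, observation)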

9. Making models - with caret

We are ready to train our model. We will use glmnet, which fits a generalized linear model (glm) with elastic net regularization (net); with a binomial family, this amounts to a regularized logistic regression. Our algorithm has several "hyperparameters": settings that control how the algorithm learns from the data, rather than values estimated from the data themselves. They influence the performance of the model and often interact with one another, so it is difficult to know the right settings a priori.

To figure out a good set of hyperparameters, we need to try several combinations and see which ones work best. caret makes this easy. All we need to do is define a "tuning grid" with sets of possible values for each hyperparameter, and then use cross-validation to evaluate how well the different combinations do at building the predictive model.

# Load caret and set a reproducible seed
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(12345)

# Create a tuning grid with sets of hyperparameters to try
tuneGrid <- expand.grid(alpha = c(0, 0.5, 1), lambda = c(.003, .01, .03, .06))

# Create settings for model training
trControl <- trainControl(method = 'repeatedcv', number = 5, repeats = 1,
  classProbs = TRUE, verboseIter = FALSE, summaryFunction = twoClassSummary)

# Fit a statistical model to the data and plot
model_fit <- train(
  observation ~ ., data = model_data,
  method = "glmnet", family = "binomial", metric = "ROC",
  tuneGrid = tuneGrid, trControl = trControl)

plot(model_fit)
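
caret stores the hyperparameter combination that won the cross-validation in the fitted object; inspecting it shows which alpha and lambda values were selected:

# Inspect the hyperparameter combination selected by cross-validation
model_fit$bestTune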

10. Prediction probabilities

We have built our first species distribution model! Next, we will use it to predict the probability of occurrence for our little crossbill across the UK! We will make a prediction for each decade and each cell of the grid. Since we fit a logistic regression model, we can ask for class probabilities rather than hard class labels. In our case, the predicted probability of the "presence" class becomes the probability of occurrence for our species.

# Use our model to make a prediction
climate_df[["prediction"]] <- predict(
    object = model_fit,
    newdata = climate_df,
    type = "prob")[["presence"]]

head(climate_df)
## # A tibble: 6 x 10
##   decade minimum~1 maxim~2 rainf~3 wind.~4 snow.~5 air.f~6      x      y predi~7
##    <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1   1970      5.18    9.76   102.     14.7   0.889    2.54 227500 972500  0.0274
## 2   1970      4.42    8.40   119.     18.3   1.77     3.76 232500 972500  0.0323
## 3   1970      5.20   10.9     74.0    12.2   1.24     3.14 307500 972500  0.0768
## 4   1970      4.77   10.6     71.9    11.5   1.57     4.13 322500 972500  0.159 
## 5   1970      5.49   10.5     92.3    11.8   1.11     2.88 222500 967500  0.0464
## 6   1970      4.46    9.32   114.     15.1   1.48     4.42 227500 967500  0.0429
## # ... with abbreviated variable names 1: minimum.temperature,
## #   2: maximum.temperature, 3: rainfall, 4: wind.speed, 5: snow.lying,
## #   6: air.frost, 7: prediction

11. A map says more than a thousand words

We have our predictions, but they are not in a digestible format. It is tough to figure out what is going on from that large table of numbers, and it would be even more challenging to convince a politician, the general public, or a local decision maker with it.

It would be great to visualize the predictions so we can see the patterns and how they change over time. A picture says more than a thousand words. And what says even more than a picture (at least if you are a geographer)? A colored map! Let us create another map showing how the predicted habitat suitability for the Scottish crossbill changes across the UK from the 1970s to the 2010s.

# Load the viridis package
library(viridis)
## Loading required package: viridisLite
# Create the plot
ggp_changemap <- ggplot(data = climate_df, aes(x = x, y = y, fill = prediction)) +
  geom_tile() +
  # Style the plot with the appropriate settings for a map
  theme_map() + coord_equal() +
  scale_fill_viridis(option = "A") + theme(legend.position = "bottom") +
  # Add faceting by decade
  facet_grid(~ decade) +
  labs(title = 'Habitat Suitability', subtitle = 'by decade',
       caption = 'Source:\nGBIF data and\nMetOffice UK climate data',
       fill = 'Habitat Suitability [0 low - high 1]')

# Display the plot
ggp_changemap
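
If we want to share the figure outside of R, we can save it to disk with ggsave(); the file name and dimensions below are just placeholders:

# Save the faceted map to a file for sharing (hypothetical file name and size)
ggsave("habitat_suitability_by_decade.png", plot = ggp_changemap, width = 12, height = 5)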