Introduction

At the light of the actual global warming situation, predicting potential future species distribution is a major challenge. Spatial distribution models (SDM) have often been used, especially for marine species ((e.g. Cheung et al., 2009). These models aim to predict and map changes in species distribution areas in response to environmental changes (i.e. climate change). Such models can use occurrence or abundance data, linked with layers of useful spatial data (temperature, depth, salinity, …). This approach aims to define the affinity for each site depending on the species studied. Based on the results, it is possible to project a species distribution depending on different climate change scenarios

Here we chose to modelize the distribution of the Atlantic cod Gadus morhua from the Northern coast of Spain to the North of Great Britain and including Scandinavian seas. In order to do so, we will use three types of models : generalized linear (GLM), additive generalized (GAM) and random forest (RF), that we will compare based on their predictive power. Then we will use those models to predict Atlantic cod distribution in a situation of global warming. Given that the IPCC predicts an increase in ocean salinity and temperature in the coming years, we expect Gadus morhua populations to decline in the southern part of our area study.

Methods

Data selection and treatment

Data of presence and pseudo-absence of the Atlantic cod have been obtained by using the first part of a script available on : https://github.com/TarekHattab/SDM. Because this part implies downloading large amount of data and manipulating equally large objects, the dataset will be directly downloaded

The Atlantic cod biotope expands from the pelagic zone to 600 meters deep (https://doris.ffessm.fr/ref/specie/867), which means that it can be found almost everywhere expect near coasts and seafloors. For these marine layers, we focused on measurements of salinity and temperature.

Modeling and validation

We chose to use three types of models : generalized linear (GLM), additive generalized (GAM) and random forest (RF). We will compare those models through their ability to be right when they predict an occurrence, but also using ROC curves, which is a performance measurement of a model based on its capacity to categorize inputs in two groups (0 or 1) based here on salinity and temperatures values. ROC curves represent the rate of true positives (called sensitivity) depending on the false positives rate (called specificity).

Loading the data

Occurrence data result from searching 5 global biogeographic database :
OBIS (Ocean Biogeographic Information System) http://www.iobis.org/
GBIF (Global Biodiversity Information Facility) http://www.gbif.org/
iNaturalist (A Community for Naturalists) http://www.inaturalist.org
VertNet (vertebrate biodiversity networks) http://vertnet.org/
Ecoengine (UC Berkeley’s Natural History Data) https://ecoengine.berkeley.edu/

Using data found in five different sources allows us to consider that every possible zone of G. morhua occurrence is equally represented in terms of resources used to look for data.

Gadus morhua can normally be found from the North of Spain to Iceland. Plotting the occurrence map allows us to see that the observations are representative of the G. morhua distribution mentionned before (Fig.1).

Figure 1 : Observations of Gadus morhua

Data loading

data <- read.csv("https://raw.githubusercontent.com/martin-faucher/projet-MODE-MLB/main/data_OCC.csv")
data <- as.data.frame(data) 
names(data)[1] <- "type" # we change the name of the first column

Package loading

library(mgcv)
library(randomForest)
library(raster)
library(ggplot2)
library(adabag)
library(pROC)
library(gridExtra)

Data exploration

Data used here include 1000 records of pseudo-absence and 739 records of presence.

We display the repartition of temperature and salinity based on the presence of Gadus morhua.

data_hist <- data

data_hist$type <- as.factor(data_hist$type) # transformation of "type" variable into a factor

# Plotting of presence data according to salinity
sal <- ggplot(data_hist, aes(y=Salinity, fill=type)) +
  geom_boxplot()+
  theme_bw()+
  theme(axis.text.x = element_blank(),axis.ticks = element_blank())+
  ggtitle("Figure 2b: Pseudo-absence (0) or \n presence (1) of Gadus morhua \n as a function of salinity")

# Plotting of presence data according to temperature
temp <- ggplot(data_hist, aes(y=Temperature, fill=type)) +
  geom_boxplot()+
  theme_bw()+
  theme(axis.text.x = element_blank(),axis.ticks = element_blank())+
  ggtitle("Figure 2a: Pseudo-absence (0) or \n presence (1) of Gadus morhua \n as a function of temperature")

grid.arrange(sal, temp, nrow=1)

Gadus morhua occurence appears to be correlated with ocean temperature (Fig. 2a) : occurences tend to be more frequent in waters with temperatures between 5 and 10 °C. On the contrary, salinity does not seem to be impactful to determine G. morhua presence (Fig. 2b).

Data splitting

Here, we create a function to create two sets of data, the first set of data will be used to train the models and the second one will be used to validate them.

train_test_split <- function (X, test_size=0.25){
  # First we create two dataset to store separately occurences and pseudo-absences
  # to keep the same proportions in our subsets of data
  data_1 <- X[X$type==1,]
  data_0 <- X[X$type==0,]
  
  # We sample data randomly in both data sets
  samp_1_train <- sample(1:nrow(data_1), round(test_size*nrow(data_1)))
  samp_0_train <- sample(1:nrow(data_0), round(test_size*nrow(data_0)))
  
  # We bind occurrences and pseudo-absence data sets
  data_train <- rbind(data_1[-samp_1_train,], data_0[-samp_0_train,])
  data_test <- rbind(data_1[samp_1_train,], data_0[samp_0_train,])
  
  # data_list contains both data for training and for testing
  data_list <- list(data_train, data_test)
  names(data_list) <- c("train","test")
  
  return(data_list)
}

Modelling

Generalized linear model

B = 10 # B = number of repetitions for the cross-validation
moy_succ <-  c()

for (b in 1:B){
  X <- train_test_split(data)
  train = X$train
  test = X$test
  
  mod_glm <- glm(type~Temperature+Salinity, 
             family="binomial", data=train)
  
  y_pred = predict(mod_glm,test, type = "response")
  test <- cbind(test,y_pred)
  pres_test <- which(test[,1] == 1)
  moy_succ <- c(moy_succ,mean(test[pres_test,4]))
}

Here, the mean probability for the GLM to predict an occurrence when given data linked with an occurrence is 0.5822176.

GAM

B = 10 # B = number of repetitions for the cross-validation
moy_succ <-  c()

for (b in 1:B){
  X <- train_test_split(data)
  train = X$train
  test = X$test
  
  mod_GAM <- gam(type ~ Salinity + Temperature, 
                 family = binomial("logit"), data = train, method = "REML")
  
  y_pred = predict(mod_GAM,test, type = "response")
  test <- cbind(test,y_pred)
  pres_test <- which(test[,1] == 1)
  moy_succ <- c(moy_succ,mean(test[pres_test,4]))
}

Here, the mean probability for the GAM to predict an occurrence when given data linked with an occurrence is 0.579138

Random forest

mse_rf = NULL ; B = 10 # B = number of repetitions for the cross-validation

moy_succ_rf <-  c()

for (b in 1:B){
  X <- train_test_split(data)
  train = X$train
  test = X$test

  RF = randomForest(type~.,data=train,mtry=20,control=rpart.control(maxdepth=5, minsplit=15), do.trace = F)
  # maxdepth = depth of each tree
  # minsplit = mininum amout of observations in a branch before branching
  y_pred = predict(RF,test, type = "response")
  test <- cbind(test,y_pred)
  pres_test <- which(test[,1] == 1)
  moy_succ_rf <- c(moy_succ_rf,mean(test[pres_test,4]))
}
m_rf <- mean(moy_succ_rf)

Here, the mean probability for the RF to predict an occurrence when given data linked with an occurrence is 0.9602202.

par(mfrow=c(1,3))
plot(roc_glm, main="Figure 3a : ROC curve GLM", xlab = "Specificity (True Positive Rate)", ylab="Sensitivity (False Positive Rate)") ; plot(roc_gam, main="Figure 3b : ROC curve GAM", xlab = "Specificity (True Positive Rate)", ylab="Sensitivity (False Positive Rate)") ; plot(roc_rf, main="Figure 3c : ROC curve RF", xlab = "Specificity (True Positive Rate)", ylab="Sensitivity (False Positive Rate)")

Area under curve is higher for the RF model than for the other two.

Based on the validation methods we used, using a Random Forest approach seems to be the most powerful to predict Gadus morhua occurrence using temperature and salinity data (Fig.3c).

In the second part of this report, we modelize the Gadus morhua distribution area after changes in salinity and temperature caused by global warming.

Actual distribution according to our three models

Now we model occurrence of Gadus morhua for the data collected the year where occurrences were observed, 2012, using respectively the three models.

download.file("https://pcsbox.univ-littoral.fr/d/b75dc393597748659c0f/files/?p=%2FEnvironmental%20data%2FClimatic_data%2FWOD%2FLocal%2FTemperature_2005_2012_local.RData&dl=1", destfile = "Temperature_2005_2012_local.RData", mode = "wb")
download.file("https://pcsbox.univ-littoral.fr/d/b75dc393597748659c0f/files/?p=%2FEnvironmental%20data%2FClimatic_data%2FWOD%2FLocal%2FSalinity_2005_2012_local.RData&dl=1", destfile = "Salinity_2005_2012_local.RData", mode = "wb")

TemperatureL <- brick(get(load("Temperature_2005_2012_local.RData")))
SalinityL <- brick(get(load("Salinity_2005_2012_local.RData")))

Temp_pred <- TemperatureL
Sal_pred <- SalinityL

layer_sp <- 2 # Selecting the layer number 2

temp_pred_sp <- Temp_pred[[layer_sp]]
sal_pred_sp <- Sal_pred[[layer_sp]]

env_pred <- brick(temp_pred_sp, sal_pred_sp)

samp_env <- sampleRandom(env_pred, ncell(env_pred), xy = TRUE, sp=TRUE, 
                         na.rm = TRUE, ext=extent(env_pred))

pred_df <- as.data.frame(samp_env)[c(1:4)]
names(pred_df)[3:4] <- c("Temperature", "Salinity")


# Add predictions ---------------------------------------------------------

# Random Forest
pred_df$pred_rf <- predict(RF,pred_df, type = "response")

# GAM
pred_df$pred_gam <-  predict(mod_GAM,pred_df, type = "response")

#GLM
pred_df$pred_glm <-  predict(mod_glm,pred_df, type = "response")

# Plot predictions --------------------------------------------------------

# Random forest
plot_rf_act <- ggplot(pred_df,aes(x,y))+
  geom_tile(aes(fill=pred_rf))+
  ggtitle("Figure 4a : Gadus morhua distribution \n for the year 2012 \n Random forest predictions")+
  coord_equal()

# GAM
plot_gam_act <- ggplot(pred_df,aes(x,y))+
  geom_tile(aes(fill=pred_gam))+
  ggtitle("Figure 4b : Gadus morhua distribution \n for the year 2012 \n GAM predictions ")+
  coord_equal()

# GLM
plot_glm_act <- ggplot(pred_df,aes(x,y))+
  geom_tile(aes(fill=pred_glm))+
  ggtitle(" Figure 4c : Gadus morhua distribution \n for the year 2012 \n GLM predictions ")+
  coord_equal()

plot_rf_act

plot_gam_act

plot_glm_act

Figures 4a to 4c map the probability prediction of Gadus morhua occurrence (varying from 0 in dark blue to 1 in light blue) according to our three models. We see that our models produce maps similar to the real occurrence map (Figure 1), and that the most precise is the one obtained thanks to the RF model (Fig. 4a).

Gadus morhua abundance distribution predictions in a context of global warming

For projections, the script allows to use climate projections for 2041-2050 and 2091-2100 under the RCP 2.6 (strong mitigation) scenario of the IPCC (Intergovernmental Panel on Climate Change).

# Import local scale climate maps (future)
download.file("https://pcsbox.univ-littoral.fr/d/b75dc393597748659c0f/files/?p=%2FEnvironmental%20data%2FClimatic_data%2FCMIP5%2FDelta_Temperature_GFDL-ESM2G_RCP2.6.RData&dl=1", destfile = "Delta_Temperature_GFDL-ESM2G_RCP2.6.RData", mode = "wb")
download.file("https://pcsbox.univ-littoral.fr/d/b75dc393597748659c0f/files/?p=%2FEnvironmental%20data%2FClimatic_data%2FCMIP5%2FDelta_Salinity_GFDL-ESM2G_RCP2.6.RData&dl=1", destfile = "Delta_Salinity_GFDL-ESM2G_RCP2.6.RData", mode = "wb")

# Import local scale climate maps (GFDL-ESM2G_RCP2.6)

load("Delta_Temperature_GFDL-ESM2G_RCP2.6.RData")
Temp_pred <- TemperatureL + Delta ; rm(Delta)
load("Delta_Salinity_GFDL-ESM2G_RCP2.6.RData")
Sal_pred <- SalinityL + Delta ; rm(Delta)

# Select concerned layers, here, the intermediate one
layer_sp <- 2

temp_pred_sp <- Temp_pred[[layer_sp]]
sal_pred_sp <- Sal_pred[[layer_sp]]

# stack the different rasters
env_pred <- brick(temp_pred_sp, sal_pred_sp)

# Sample points
samp_env <- sampleRandom(env_pred, ncell(env_pred), xy = TRUE, sp=TRUE, 
                     na.rm = TRUE, ext=extent(env_pred))

# table with all locations and environmental variables
pred_df <- as.data.frame(samp_env)[c(1:4)]
names(pred_df)[3:4] <- c("Temperature", "Salinity")

Adding predictions

# Random Forest
pred_df$pred_rf <- predict(RF,pred_df, type = "response")

# GAM
pred_df$pred_gam <-  predict(mod_GAM,pred_df, type = "response")

#GLM
pred_df$pred_glm <-  predict(mod_glm,pred_df, type = "response")

Mapping predictions

# Random forest
plot_rf <- ggplot(pred_df,aes(x,y))+
  geom_tile(aes(fill=pred_rf))+
  ggtitle("Figure 5a : Gadus morhua distribution \n random forest predictions 2041-2050")+
  coord_equal()
#plot_rf

# GAM
plot_gam <- ggplot(pred_df,aes(x,y))+
  geom_tile(aes(fill=pred_gam))+
  ggtitle("Figure 5b : Gadus morhua distribution \n GAM predictions 2041-2050")+
  coord_equal()
#plot_gam

# GLM
plot_glm <- ggplot(pred_df,aes(x,y))+
  geom_tile(aes(fill=pred_glm))+
  ggtitle("Figure 5c :Gadus morhua distribution \n GLM predictions 2041-2050")+
  coord_equal()
#plot_glm
grid.arrange(plot_rf_act,plot_rf, nrow=1)

Figure 5a maps the probability prediction of Gadus morhua occurrence (varying from 0 in dark blue to 1 in light blue) according to random forest model, and this for the period of 2041-2050. The random forest model predicts a decrease in the presence of G. morhua in the southern part of the study area i.e. from its limit of presence in 2012 to the south of Ireland and England and bordering the entire French coast (area with y coordinates ranging from 45 to 52) (Fig.4a et 5a). In the more northerly areas, the model predicts that the distribution predicted for 2012 will be maintained.

Conclusion

In this work, we tried to predict the Atlantic cod distribution area under a situation of climate change, represented by raising salinity and temperature values. To do so, we used three modeling methods : generalized linear, generalized additive and random forest. The first part of our work consisted of comparing the three models and choosing the most adequate. We determined that the random forest approach provided the best results. The fact that Random Forest was better than the other two methods may have different explanations. For example, Random Forest does not assume linearity or monotony in the data unlike GLM. Appropriate features must be selected before fitting the GLM and GAM unlike RF which select features in its decision trees. And RF has an effective method for estimating missing data and maintains accuracy when large proportion of the data are missing.

Then we modified temperature and salinity data to form a data set simulating a possible future situation according to climate change and warming of the oceans. The model predicts a decrease of the abundance of G. morhua in the area with y coordinates ranging from 45 to 52, including for example the waters between France and England. Thus, between 2041 and 2050, G. morhua populations are expected to decline and concentrate in the northern part of the study area in response to an increase in marine temperature and salinity. So, in accordance with our expectations RF predicted a decrease in cod occurrence in the southernmost waters of the area.

Random Forest may have difficulty classifying the data if the values of the explanatory variables are outside of the training data, which was the case since we simulated an increase in temperature and salinity data. This problem does not exist for the regression as it is developed on an arithmetic function. However we postulate that the difference in prediction efficiency was so different, that the choice of RF was not that risky.

Atlantic cod is probably the most important commercial fishery species in the world. Cod has become considerably scarcer in European waters due to overfishing in recent years, leading to the implementation of protection plans. In a context of global changes and an ever-increasing human population, therefore with growing food needs, predicting the distribution of edible fish stocks is a major challenge.

Bibliographie

LESSARD Marie-Pierre, CHAMBERLAND Suzie, MASSON Stéphane, FEY Laurent in : DORIS, 09/11/2020 : Gadus morhua (Linnaeus, 1758), https://doris.ffessm.fr/ref/specie/867

Cheung, W.W.L., Lam, V.W.Y., Sarmiento, J.L., Kearney, K., Watson et R., Pauly, D., 2009. Projecting global marine biodiversity impacts under climate change scenarios. Fish and Fisheries, 10: 235-251.

Ben Rais Lasram, F., Hattab, T., Noguès, Q., Beaugrand, G., Dauvin, J., Halouani, G., Le Loc’h, F., Niquil, N. et Leroy, B., 2020. An open-source framework to model present and future marine species distributions at local scale. Ecological Informatics, 59: 101130.

Species Distributions Modelling : Assessment of 3 different approachs

P. Cuchot, P. Fournier, M. Faucher - M2 MODE

03/01/2021