PREAMBLE

This document provides the R code and data processing steps used to build and validate a predictive model of arthropod density in dry lawns of the Crau plain. It ensures transparency and reproducibility for the analyses presented in the associated manuscript. The objective of this study is to develop a simplified detection tool that can be used by managers. For justification of the protocol, refer to the manuscript in the materials and methods section.

SUMMARY

In the following script, I will first load my data in part I. In this part, I will also transform my data to obtain a new CSV file, on which all manipulations will take place. In the following script, I will first load my data in part I. In this part, I will also transform my data to obtain a new CSV file, on which all manipulations will be performed. This part will also check whether the data follows a Poisson distribution in order to apply it to the GLMs.Once the conditions have been verified, I will explore which variables can be exploited within these models, taking into account that the variables are not correlated with each other, and verify the collinearity of these variables in part 3. In Part 4, I explore which variables are relevant to the model, choosing the three parameters that have the greatest influence on arthropods. In this section, Figure 4 is created. I use a random forest to explore this aspect. A fifth section focuses on the reliability and robustness of a model based on temperature, wind and humidity. I evaluate the RMSE, the R² of the linear regression, McFadden’s pseudo R² of the GLMs, as well as their AIC. A graph to visualise the comparison is developed in this section (Figure 5). In the last part (part 6), I calculate the model intercepts and parametric coefficients in order to obtain the equation for each model (although the models have been tested beforehand). A Chi2 test and the calculation of the coefficient of adjustment are performed to assess the reliability of the model in relation to the observations.

I) CREATION OF A CSV FILE CONTAINING THE AVERAGE FOR EACH PARAMETER PER SESSION

The data was compiled arthropod by arthropod. To study the influence of parameters on all arthropods, the author works with an average per session. To do this, a new CSV file must be created containing these averages by parameter and by session, explaining the total number of arthropods per session.The original CSV file named ‘Données reg linéaire crau.csv’ was loaded using ‘File command’ then ‘Set as working directory’. The command below loads the data and creates a new CSV file for further processing.

data <- read.csv2("Données reg linéaire crau.csv",
                  header = TRUE,
                  fileEncoding = "UTF-8", # ou "Windows-1252"
                  stringsAsFactors = FALSE)

# Vérifier les noms de colonnes
names(data)

## [1] "Ordre.Température.Humidité.Vent.Lumière.PPM.2.5.Quantité.CO2.Moyenne.strat.herbacée.Jour.Lieu.Heure.médiane.Session"

library(here)

## here() starts at C:/Users/Scarp/Documents/Recherche entomologie/Crau/Script for CRAU

# Ensure that the file is in the correct locatio
if(!file.exists(here("resume_par_session.csv"))) {
  stop("resume_par_session.csv n'existe pas !")
}
# Read the data set
dataM <- read.csv(here("resume_par_session.csv"), sep = ";", header = TRUE, fileEncoding = "Windows-1252")
head(dataM)

##     Session       Date             Lieu Temperature Humidite       Vent
## 1 Session_1 21/06/2024 Etang des Aulnes    32.58333 43.03810 0.01190476
## 2 Session_2 23/06/2024 Etang des Aulnes    27.66250 41.83750 2.57500000
## 3 Session_3 25/06/2024 Etang des Aulnes    34.42000 38.21600 0.03800000
## 4 Session_4 02/07/2024 Etang des Aulnes    29.66667 32.97500 3.74166667
## 5 Session_5 04/07/2024        Entressen    29.28947 34.83684 1.57894737
## 6 Session_6 10/07/2024        Entressen    32.57333 33.91333 0.52000000
##    Lumiere  PPM.2.5 QuantiteCO2 Hauteur_strate_herbacee Individus
## 1 19826.19 1435.571    405.2619                      23        42
## 2 20000.00 1333.250    398.3750                      23        24
## 3 19996.00  407.060    415.1200                      29        50
## 4 20000.00 1300.667    412.0833                      25        12
## 5 20000.00 1421.842    399.3158                      15        19
## 6 20000.00  860.100    418.9000                      15        30

For practical reasons, the corresponding date and session were entered manually. Lieu = Location Temperature = Temperature Humidite = Hygrometry Vent = Wind Lumiere = Brightness PPM.2.5 = Particles per million measuring 2.5 micrometres per cubic metre of air QuantiteCO2 = Amount of CO2 per cubic metre of air Hauteur_strate_herbcee = Herbaceous layer height Individus = Individuals (Number of arthropods, response variable)

1.1 Check whether the response variable follows a Poisson distribution

Need to see if the response variable follows a Poisson distribution for GLMs.

dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE, fileEncoding = "Windows-1252")
y <- dataM$Individus
mean_y <- mean(y)
var_y <- var(y)

pois_mod <- glm(Individus ~ 1, data = dataM, family = poisson)

chi2 <- sum(residuals(pois_mod, type = "pearson")^2)
df <- nrow(dataM) - 1  # df = n - nombre de paramètres
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

cat("Chi² =", chi2, "| df =", df, "| p-value =", p_value, "\n")

## Chi² = 56.65957 | df = 19 | p-value = 1.290306e-05

The variable has overdispersed data.

II) HOMOSCEDASTICITY, INDEPENDENCE AND NORMALITY TEST

In this section, the author tests whether the explanatory variables follow a normal distribution, whether the conditions of homoscedasticity are met, and whether the data are independent of each other. The code below summarises the data and analyses whether each variable (response and explanatory) follows a Poisson distribution.In the second part, I will verify the conditions of the various models explored to ensure that the desired models can be used.

dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE, fileEncoding = "Windows-1252")
head(dataM)

##     Session       Date             Lieu Temperature Humidite       Vent
## 1 Session_1 21/06/2024 Etang des Aulnes    32.58333 43.03810 0.01190476
## 2 Session_2 23/06/2024 Etang des Aulnes    27.66250 41.83750 2.57500000
## 3 Session_3 25/06/2024 Etang des Aulnes    34.42000 38.21600 0.03800000
## 4 Session_4 02/07/2024 Etang des Aulnes    29.66667 32.97500 3.74166667
## 5 Session_5 04/07/2024        Entressen    29.28947 34.83684 1.57894737
## 6 Session_6 10/07/2024        Entressen    32.57333 33.91333 0.52000000
##    Lumiere  PPM.2.5 QuantiteCO2 Hauteur_strate_herbacee Individus
## 1 19826.19 1435.571    405.2619                      23        42
## 2 20000.00 1333.250    398.3750                      23        24
## 3 19996.00  407.060    415.1200                      29        50
## 4 20000.00 1300.667    412.0833                      25        12
## 5 20000.00 1421.842    399.3158                      15        19
## 6 20000.00  860.100    418.9000                      15        30

summary(dataM)

##    Session              Date               Lieu            Temperature   
##  Length:20          Length:20          Length:20          Min.   :27.21  
##  Class :character   Class :character   Class :character   1st Qu.:28.66  
##  Mode  :character   Mode  :character   Mode  :character   Median :31.55  
##                                                           Mean   :31.17  
##                                                           3rd Qu.:32.72  
##                                                           Max.   :36.77  
##                                                                          
##     Humidite          Vent            Lumiere         PPM.2.5       
##  Min.   :32.83   Min.   :0.01191   Min.   : 9238   Min.   :  23.89  
##  1st Qu.:35.83   1st Qu.:0.23051   1st Qu.:19101   1st Qu.: 236.86  
##  Median :41.69   Median :0.54000   Median :19998   Median : 902.01  
##  Mean   :43.59   Mean   :1.02982   Mean   :18211   Mean   : 796.02  
##  3rd Qu.:49.57   3rd Qu.:1.66774   3rd Qu.:20000   3rd Qu.:1325.10  
##  Max.   :64.88   Max.   :3.74167   Max.   :20000   Max.   :1462.92  
##                                                    NA's   :2        
##   QuantiteCO2    Hauteur_strate_herbacee   Individus    
##  Min.   :389.6   Min.   : 8.50           Min.   :12.00  
##  1st Qu.:400.8   1st Qu.:14.62           1st Qu.:24.00  
##  Median :408.3   Median :17.50           Median :29.00  
##  Mean   :410.4   Mean   :18.40           Mean   :30.55  
##  3rd Qu.:418.0   3rd Qu.:23.00           3rd Qu.:37.50  
##  Max.   :443.0   Max.   :36.50           Max.   :50.00  
##  NA's   :2

packages <- c("car", "lmtest", "broom", "ggplot2", "gridExtra", "grid", "ggpubr")
install_if_needed <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) install.packages(pkg, dependencies = TRUE)
  library(pkg, character.only = TRUE)
}
invisible(lapply(packages, install_if_needed))

## Le chargement a nécessité le package : car

## Le chargement a nécessité le package : carData

## Le chargement a nécessité le package : lmtest

## Le chargement a nécessité le package : zoo

## 
## Attachement du package : 'zoo'

## Les objets suivants sont masqués depuis 'package:base':
## 
##     as.Date, as.Date.numeric

## Le chargement a nécessité le package : broom

## Le chargement a nécessité le package : ggplot2

## Le chargement a nécessité le package : gridExtra

## Le chargement a nécessité le package : grid

## Le chargement a nécessité le package : ggpubr

dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE, fileEncoding = "Windows-1252")

# French to English name correspondence
nom_variables <- c(
  "Temperature" = "Temperature",
  "Humidite" = "Hygrometry",
  "Vent" = "Wind",
  "Lumiere" = "Brightness",
  "QuantiteCO2" = "Amount of CO2",
  "PPM.2.5" = "PPM2.5",
  "Hauteur_strate_herbacee" = "Height of the herbaceous layer"
)
vars <- names(nom_variables)

# Initialisation of the final array
resultats <- data.frame(
  Variable = character(),
  Moyenne = numeric(),
  Variance = numeric(),
  Suit_Poisson = character(),
  Homoscédasticité_BP = character(),
  Indépendance_DW = character(),
  Normalité_Shap = character(),
  stringsAsFactors = FALSE
)

analyser_variable <- function(var, cible = "Individus", nom_affiche) {
  form <- as.formula(paste(cible, "~", var))
  mod <- lm(form, data = dataM)
  
  moy <- mean(dataM[[var]], na.rm = TRUE)
  var_ <- var(dataM[[var]], na.rm = TRUE)
  poisson <- if (all(dataM[[var]] == floor(dataM[[var]]))) "Discrète" else "Non discrète"
  
  bp <- bptest(mod)
  shap <- shapiro.test(residuals(mod))
  dw <- tryCatch({ dwtest(mod) }, error = function(e) NULL)
  
  bp_val <- paste0(round(bp$statistic, 2), " (p=", signif(bp$p.value, 3), ")")
  shap_val <- paste0(round(shap$statistic, 2), " (p=", signif(shap$p.value, 3), ")")
  if (!is.null(dw)) {
    dw_val <- paste0(round(dw$statistic[1], 2), " (p=", signif(dw$p.value, 3), ")")
  } else {
    dw_val <- "NA"
  }
  
  return(data.frame(
    Variable = nom_affiche,
    Moyenne = round(moy, 2),
    Variance = round(var_, 2),
    Suit_Poisson = poisson,
    Homoscédasticité_BP = bp_val,
    Indépendance_DW = dw_val,
    Normalité_Shap = shap_val,
    stringsAsFactors = FALSE
  ))
}

# Analysis of all explanatory variables
for (var in vars) {
  ligne <- analyser_variable(var, cible = "Individus", nom_affiche = nom_variables[[var]])
  resultats <- rbind(resultats, ligne)
}

# Add the line ‘Arthropods’ (analysis of the variable ‘Individuals’)
ligne_arthropodes <- analyser_variable("Individus", cible = "Temperature", nom_affiche = "Arthropodes")
resultats <- rbind(resultats, ligne_arthropodes)

print(resultats, row.names = FALSE)

##                        Variable  Moyenne    Variance Suit_Poisson
##                     Temperature    31.17        7.88 Non discrète
##                      Hygrometry    43.59       91.03 Non discrète
##                            Wind     1.03        1.10 Non discrète
##                      Brightness 18210.64 12653474.21 Non discrète
##                   Amount of CO2   410.41      187.29 Non discrète
##                          PPM2.5   796.02   321630.57 Non discrète
##  Height of the herbaceous layer    18.40       46.52 Non discrète
##                     Arthropodes    30.55       91.10     Discrète
##  Homoscédasticité_BP Indépendance_DW Normalité_Shap
##       0.19 (p=0.667)  1.69 (p=0.266) 0.98 (p=0.856)
##       1.27 (p=0.259)  2.29 (p=0.718) 0.98 (p=0.956)
##       1.08 (p=0.298)  2.19 (p=0.681) 0.93 (p=0.153)
##       1.28 (p=0.258)  2.27 (p=0.702) 0.99 (p=0.993)
##          0 (p=0.996)  2.26 (p=0.705) 0.96 (p=0.623)
##       0.04 (p=0.848)  2.22 (p=0.665) 0.95 (p=0.492)
##       6.04 (p=0.014)  2.37 (p=0.766) 0.97 (p=0.824)
##       0.04 (p=0.836)  2.09 (p=0.601) 0.97 (p=0.739)

“Discrète” corresponds to the fact that the variable follows a Poisson distribution. Multiple linear regression models as well as GLM/GLMM models are possible. All variables follow a normal distribution, with homoscedasticity and independence of data respected. Only the variable ‘Height of the herbaceous layer’ does not respect homoscedasticity. This variable cannot currently be included in a prediction model.

III) CORRELATION TEST BETWEN VARIABLES - SPEARMAN

In order to distinguish the variables that can be integrated into the same model, it is necessary to ensure that they are not correlated with each other. Here, all variables outside the intervals 0.7 and -0.7 are considered correlated (see materials and methods).

library(corrplot)

## corrplot 0.95 loaded

# Explanatory variables
vars <- c("Temperature", "Humidite", "Vent", "Lumiere", "QuantiteCO2", "PPM.2.5", "Hauteur_strate_herbacee")
data_corr <- dataM[, vars]

# Rename columns for display in English
colnames(data_corr) <- c(
  "Temperature",
  "Hygrometry",
  "Wind",
  "Brightness",
  "Amount of CO2",
  "PPM 2.5",
  "Height of the herbaceous layer"
)

# Spearman's correlation matrix
corr_matrix <- cor(data_corr, method = "spearman", use = "pairwise.complete.obs")
corrplot(corr_matrix, 
         method = "color", 
         type = "upper", 
         order = "hclust", 
         tl.col = "black", 
         tl.cex = 0.9,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("blue", "white", "red"))(200),
         title = "Spearman test",
         mar = c(0,0,2,0))

We can see that the variables humidity, temperature and wind are not correlated with each other and can be included in the same model.

3.1 Variance inflation factor

The author also verifies that, in the event of correlation, the other variables can be used in a model at this stage. To do this, the Variance Inflation Factor must be checked to ensure the presence or absence of collinearity.

library(car)
dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE, fileEncoding = "Windows-1252")
names(dataM) <- gsub("Humidite", "Hygrometry", names(dataM))
names(dataM) <- gsub("Vent", "Wind", names(dataM))
names(dataM) <- gsub("Lumiere", "Brightness", names(dataM))
names(dataM) <- gsub("QuantiteCO2", "CO2", names(dataM))
names(dataM) <- gsub("PPM.2.5", "PPM2.5", names(dataM))
names(dataM) <- gsub("Hauteur_strate_herbacee", "HerbaceousHeight", names(dataM))

vars <- c("Temperature", "Hygrometry", "Wind", "Brightness", "CO2", "PPM2.5", "HerbaceousHeight")

# Multiple linear regression with all variables
lm_vif <- lm(Individus ~ ., data = dataM[, c("Individus", vars)])

# VIF calcul
vif_values <- vif(lm_vif)
print(vif_values)

##      Temperature       Hygrometry             Wind       Brightness 
##         2.655027         3.995101         2.628069         2.756228 
##              CO2           PPM2.5 HerbaceousHeight 
##         1.747724         4.273446         1.218991

All variables have a VIF below , indicating low to moderate collinearity. Variables can coexist in the same prediction model without seriously affecting the model.

IV) IMPLEMENTATION OF RANDOM FOREST

Here, the objective is to see which three variables have the greatest influence in explaining the quantity of arthropods over 4m², and using these three variables (fewer if not sufficiently explanatory to maintain maximum robustness), see the RMSE and R²/pseudo R² of the model including additive temperature, the wind:hygrometry interaction and additive hygrometry.

4.1 Random forest

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attachement du package : 'randomForest'

## L'objet suivant est masqué depuis 'package:gridExtra':
## 
##     combine

## L'objet suivant est masqué depuis 'package:ggplot2':
## 
##     margin

library(ggplot2)
dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE,
                  fileEncoding = "Windows-1252", check.names = FALSE)

# translate variables into English
rename_map <- c(
  "Temperature"             = "Temperature",
  "Humidite"                = "Hygrometry",        
  "Vent"                    = "Wind",
  "Lumiere"                 = "Brightness",         
  "PPM.2.5"                 = "PPM2.5",
  "QuantiteCO2"             = "Amount_of_CO2",
  "Hauteur_strate_herbacee" = "Herbaceous_layer_height"
)

for (fr in names(rename_map)) {
  if (fr %in% names(dataM)) {
    names(dataM)[names(dataM) == fr] <- rename_map[fr]
  }
}

print(names(dataM))

##  [1] "Session"                 "Date"                   
##  [3] "Lieu"                    "Temperature"            
##  [5] "Hygrometry"              "Wind"                   
##  [7] "Brightness"              "PPM2.5"                 
##  [9] "Amount_of_CO2"           "Herbaceous_layer_height"
## [11] "Individus"

vars <- c("Temperature", "Hygrometry", "Wind", "Brightness",
          "Amount_of_CO2", "PPM2.5", "Herbaceous_layer_height")

# Delete NoData (Here, only two sessions are missing co2 and PM2.5 readings.)
dataM <- na.omit(dataM[, c("Individus", vars)])

# Building the formula for Random Forest
form <- as.formula(paste("Individus ~", paste(vars, collapse = " + ")))

# Random Forest
set.seed(42)
rf_mod <- randomForest(form, data = dataM, importance = TRUE)

# Importance of variables
imp <- as.data.frame(importance(rf_mod))
imp$Variable <- rownames(imp)

# Change names for plot
pretty_names <- c(
  Temperature = "Temperature",
  Hygrometry = "Hygrometry",
  Wind = "Wind",
  Brightness = "Brightness",
  Amount_of_CO2 = "Amount of CO2",
  `PPM2.5` = "PPM2.5",
  Herbaceous_layer_height = "Herbaceous layer height"
)
imp$Variable <- pretty_names[imp$Variable]

# Create plot
ggplot(imp, aes(x = reorder(Variable, IncNodePurity), y = IncNodePurity)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Importance des variables (Random Forest)",
       x = "Variables explicatives",
       y = "Importance (IncNodePurity)") +
  theme_minimal(base_size = 14)

print(importance(rf_mod))

##                            %IncMSE IncNodePurity
## Temperature             14.6637863     522.27825
## Hygrometry               5.8130073     228.45946
## Wind                     4.6982677     267.30700
## Brightness               1.4041055      86.03827
## Amount_of_CO2           -0.6239191     124.33879
## PPM2.5                  -0.6819170     108.64711
## Herbaceous_layer_height -3.7997241     144.23141

# Recovering the importance of the previous model
imp <- as.data.frame(importance(rf_mod))
imp$Variable <- rownames(imp)

# Change names for plot
pretty_names <- c(
  Temperature = "Temperature",
  Hygrometry = "Hygrometry",
  Wind = "Wind",
  Brightness = "Brightness",
  Amount_of_CO2 = "Amount of CO2",
  `PPM2.5` = "PPM2.5",
  Herbaceous_layer_height = "Herbaceous layer height"
)
imp$Variable <- pretty_names[imp$Variable]

# Plot %IncMSE
ggplot(imp, aes(x = reorder(Variable, `%IncMSE`), y = `%IncMSE`)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Importance des variables (%IncMSE) - Random Forest",
       x = "Variables explicatives",
       y = "Importance (%IncMSE)") +
  theme_minimal(base_size = 14)

# Value of IncMSE per variable
library(randomForest)
imp <- importance(rf_mod)

incmse_table <- data.frame(
  Variable = rownames(imp),
  IncMSE = imp[, "%IncMSE"]
)

print(incmse_table)

##                                        Variable     IncMSE
## Temperature                         Temperature 14.6637863
## Hygrometry                           Hygrometry  5.8130073
## Wind                                       Wind  4.6982677
## Brightness                           Brightness  1.4041055
## Amount_of_CO2                     Amount_of_CO2 -0.6239191
## PPM2.5                                   PPM2.5 -0.6819170
## Herbaceous_layer_height Herbaceous_layer_height -3.7997241

IncNodePurity was established out of curiosity and is not part of the materials and methods. It is used to observe a structuring trend within the Random Forest method.On the other hand, IncMSE tends to explain that the three variables that best explain variations in arthropods are primarily temperature, followed by humidity and finally wind.

V) CALCULATION OF RMSE AND MCFADDEN PSEUDO R² OR R² AJUSTED OF PREFERRED MODELS, USING ONLY HYGROMETRY, TEMPERATURE AND WIND DATA

This section is intended to evaluate the quality and robustness of the model by studying various aspects, such as the coefficient of determination ajusted for each model (using McFadden’s method) and the RMSE per model with the three variables that best explain the variability of arthropods.

library(MASS)    
library(lme4)

## Le chargement a nécessité le package : Matrix

library(MuMIn)

## 
## Attachement du package : 'MuMIn'

## L'objet suivant est masqué depuis 'package:randomForest':
## 
##     importance

library(tidyr)

## 
## Attachement du package : 'tidyr'

## Les objets suivants sont masqués depuis 'package:Matrix':
## 
##     expand, pack, unpack

library(ggplot2)

dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE,
                  fileEncoding = "Windows-1252", check.names = FALSE)

# Rename columns in English following a bug
rename_map <- c(
  "Individus" = "Individuals",
  "Humidite" = "Hygrometry",
  "Vent" = "Wind",
  "Temperature" = "Temperature",
  "Lieu" = "Location",
  "Session" = "Session"
)
for (fr in names(rename_map)) {
  if (fr %in% names(dataM)) {
    names(dataM)[names(dataM) == fr] <- rename_map[fr]
  }
}

# Select variables
dataM_model <- na.omit(dataM[, c("Individuals", "Temperature", "Hygrometry", "Wind")])
dataM_glmm  <- na.omit(dataM[, c("Individuals", "Temperature", "Hygrometry", "Wind", "Location", "Session")])

# Define the RMSE function
calc_rmse <- function(model, data) {
  y <- model.response(model.frame(model))
  pred <- predict(model, newdata = data, type = "response")
  sqrt(mean((y - pred)^2, na.rm = TRUE))
}

# McFadden's pseudo-R² function for GLM
calc_pseudoR2 <- function(model) {
  ll_full <- as.numeric(logLik(model))
  ll_null <- as.numeric(logLik(update(model, . ~ 1)))
  1 - (ll_full / ll_null)
}

# Models with interaction Wind:Hygrometry
lm_mod    <- lm(Individuals ~ Temperature + Wind * Hygrometry, data = dataM_model)
glm_pois  <- glm(Individuals ~ Temperature + Wind * Hygrometry, data = dataM_model, family = poisson)
glm_quasi <- glm(Individuals ~ Temperature + Wind * Hygrometry, data = dataM_model, family = quasipoisson)
glm_nb    <- glm.nb(Individuals ~ Temperature + Wind * Hygrometry, data = dataM_model)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : nombre limite d'iterations atteint
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : nombre limite d'iterations atteint

glmm_mod  <- glmer(Individuals ~ Temperature + Wind * Hygrometry + (1|Location) + (1|Session),
                   data = dataM_glmm, family = poisson)

## boundary (singular) fit: see help('isSingular')

# RMSE calculation
rmse_values <- c(
  calc_rmse(lm_mod, dataM_model),
  calc_rmse(glm_pois, dataM_model),
  calc_rmse(glm_quasi, dataM_model),
  calc_rmse(glm_nb, dataM_model),
  calc_rmse(glmm_mod, dataM_glmm)
)

# R² / pseudo-R² calculation
glmm_r2 <- r.squaredGLMM(glmm_mod)

## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.

## Warning: the null model is correct only if all variables used by the original
## model remain unchanged.

## boundary (singular) fit: see help('isSingular')

r2_values <- c(
  summary(lm_mod)$adj.r.squared,   
  calc_pseudoR2(glm_pois),
  NA,  # quasi-Poisson pseudo-R² not directly computable
  calc_pseudoR2(glm_nb),
  glmm_r2["delta", "R2m"]
)

r2_type <- c(
  "Adjusted R²",
  "Pseudo-R² McFadden",
  "Pseudo-R² McFadden",
  "Pseudo-R² McFadden",
  "Marginal R² (Nakagawa)"
)

# Results
results <- data.frame(
  Model = c("LM", "GLM Poisson", "GLM quasi-Poisson", "GLM NegBin", "GLMM"),
  RMSE = rmse_values,
  R2_or_pseudoR2 = r2_values,
  R2_type = r2_type
)
print(results)

##               Model     RMSE R2_or_pseudoR2                R2_type
## 1                LM 4.366144      0.7210001            Adjusted R²
## 2       GLM Poisson 4.532579      0.2839780     Pseudo-R² McFadden
## 3 GLM quasi-Poisson 4.532579             NA     Pseudo-R² McFadden
## 4        GLM NegBin 4.532581      0.2038348     Pseudo-R² McFadden
## 5              GLMM 4.532579      0.7321930 Marginal R² (Nakagawa)

The quasi-fish GLM model displays NA because it is impossible to calculate the pseudo R² using McFadden’s technique, due to the absence of a conventional log-likelihood.

The author also explores the RMSE and pseudo R² of the quasi-Poisson GLM based on a calculation method derived from deviance.

# Function for pseudo-R² based on explained deviance
calc_pseudoR2_quasi <- function(model) {
  dev_full <- deviance(model)               # residual deviance of the fitted model
  dev_null <- model$null.deviance           # null deviance (intercept only)
  1 - (dev_full / dev_null)                 # proportion of deviance explained
}

# RMSE for quasi-Poisson (already works with your calc_rmse)
rmse_quasi <- calc_rmse(glm_quasi, dataM_model)

# Pseudo-R² for quasi-Poisson
r2_quasi <- calc_pseudoR2_quasi(glm_quasi)

cat("GLM quasi-Poisson:\n")

## GLM quasi-Poisson:

cat("  RMSE =", round(rmse_quasi, 3), "\n")

##   RMSE = 4.533

cat("  Pseudo-R² (explained deviance) =", round(r2_quasi, 3), "\n")

##   Pseudo-R² (explained deviance) = 0.793

This model has an RMSE identical to the other GLMs. However, its detection power appears to be greater than the other models, but the technique used does not allow for a good comparison.

This model is robust to overdispersion issues (variance > mean) that would bias a classic Poisson model. However, with high overdispersion, this model may seem more suited to a GLMM depending on the random effects of the latter. Here, the random effects would be the location of the sites and the session number.

dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE,
                  fileEncoding = "Windows-1252", check.names = FALSE)

rename_map <- c(
  "Individus"   = "Individuals",
  "Humidite"    = "Hygrometry",
  "Vent"        = "Wind",
  "Temperature" = "Temperature",
  "Lieu"        = "Location",
  "Session"     = "Session"
)

for (fr in names(rename_map)) {
  if (fr %in% names(dataM)) {
    names(dataM)[names(dataM) == fr] <- rename_map[fr]
  }
}

data <- na.omit(dataM[, c("Individuals", "Temperature", "Wind", "Hygrometry", "Location", "Session")])


library(lme4)
glmm_full <- glmer(Individuals ~ Temperature + Wind * Hygrometry +
                     (1|Location) + (1|Session),
                   data = data, family = poisson,
                   control = glmerControl(optimizer = "bobyqa"))

## boundary (singular) fit: see help('isSingular')

# Simplified model without random effects (location and session)
glmm_reduced <- glm(Individuals ~ Temperature + Wind * Hygrometry,
                    data = data, family = quasipoisson)

# Comparison via AIC (note: not exactly the same likelihood, just indicative)
AIC_full <- AIC(glmm_full)
AIC_reduced <- AIC(glmm_reduced)

cat("AIC GLMM full:", AIC_full, "\n")

## AIC GLMM full: 130.2464

cat("AIC GLM quasi-Poisson:", AIC_reduced, "\n")

## AIC GLM quasi-Poisson: NA

# Alternative option: likelihood ratio test (for true GLMM models)
# Here you can remove one random effect at a time to test its impact.
glmm_noLocation <- update(glmm_full, . ~ . - (1|Location))

## boundary (singular) fit: see help('isSingular')

anova(glmm_full, glmm_noLocation)

## Data: data
## Models:
## glmm_noLocation: Individuals ~ Temperature + Wind + Hygrometry + (1 | Session) + Wind:Hygrometry
## glmm_full: Individuals ~ Temperature + Wind * Hygrometry + (1 | Location) + (1 | Session)
##                 npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## glmm_noLocation    6 128.25 134.22 -58.123   116.25                    
## glmm_full          7 130.25 137.22 -58.123   116.25     0  1          1

glmm_noSession <- update(glmm_full, . ~ . - (1|Session))

## boundary (singular) fit: see help('isSingular')

anova(glmm_full, glmm_noSession)

## Data: data
## Models:
## glmm_noSession: Individuals ~ Temperature + Wind + Hygrometry + (1 | Location) + Wind:Hygrometry
## glmm_full: Individuals ~ Temperature + Wind * Hygrometry + (1 | Location) + (1 | Session)
##                npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## glmm_noSession    6 128.25 134.22 -58.123   116.25                    
## glmm_full         7 130.25 137.22 -58.123   116.25     0  1          1

The mention boundary (singular) fit means that one or more of your random effects have an estimated variance very close to zero. This means that the random effect of the location or session does not provide significant information that could improve the model. It is therefore preferable not to include them so as not to unnecessarily complicate the model. In the absence of random effects, this is equivalent to using a quasi-Poisson GLM model. However, McFadden’s pseudo R² compares the complete model to a null model and is more an indicator of relative quality than predictive accuracy. This is why the RMSE value takes precedence over the pseudo R² value.

library(ggplot2)
library(dplyr)

## 
## Attachement du package : 'dplyr'

## L'objet suivant est masqué depuis 'package:MASS':
## 
##     select

## L'objet suivant est masqué depuis 'package:randomForest':
## 
##     combine

## L'objet suivant est masqué depuis 'package:gridExtra':
## 
##     combine

## L'objet suivant est masqué depuis 'package:car':
## 
##     recode

## Les objets suivants sont masqués depuis 'package:stats':
## 
##     filter, lag

## Les objets suivants sont masqués depuis 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

# Transformer les données en format long
plot_data <- results %>%
  pivot_longer(cols = c("RMSE", "R2_or_pseudoR2"),
               names_to = "Metric",
               values_to = "Value")

# Forcer l'ordre des couleurs
plot_data$Metric <- factor(plot_data$Metric, levels = c("R2_or_pseudoR2", "RMSE"))

# Créer le barplot
ggplot(plot_data, aes(x = Model, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("R2_or_pseudoR2" = "green3", "RMSE" = "steelblue"),
                    labels = c("R² / pseudo-R²", "RMSE")) +
  labs(title = "Comparison of Models (Temperature + Wind:Hygrometry)",
       x = "Model",
       y = "Value",
       fill = "Metric") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_col()`).

VI) EQUATION FOR EACH MODEL

Now, it is necessary to see the coefficient of each parameter and the intercept in order to design the prediction equation according to each model.

library(MASS)    
library(lme4)    
library(mgcv)

## Le chargement a nécessité le package : nlme

## 
## Attachement du package : 'nlme'

## L'objet suivant est masqué depuis 'package:dplyr':
## 
##     collapse

## L'objet suivant est masqué depuis 'package:lme4':
## 
##     lmList

## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.

library(tidyr)
library(ggplot2)


#Read data
dataM <- read.csv("resume_par_session.csv", sep = ";", header = TRUE, fileEncoding = "Windows-1252")
dataM <- na.omit(dataM[, c("Individus", "Temperature", "Humidite", "Vent", "Lieu", "Session")])


#LMR
lm_mod <- lm(Individus ~ Temperature + Humidite * Vent, data = dataM)
print(coef(lm_mod))

##   (Intercept)   Temperature      Humidite          Vent Humidite:Vent 
##  -47.47695318    2.62818052   -0.02916673  -20.59062901    0.46511605

# GLM Poisson
glm_pois <- glm(Individus ~ Temperature + Humidite * Vent, data = dataM, family = poisson)
print(coef(glm_pois))

##   (Intercept)   Temperature      Humidite          Vent Humidite:Vent 
##   1.036894786   0.083416292  -0.003560927  -0.940039008   0.021667781

# GLM quasi-Poisson
glm_quasi <- glm(Individus ~ Temperature + Humidite * Vent, data = dataM, family = quasipoisson)
print(coef(glm_quasi))

##   (Intercept)   Temperature      Humidite          Vent Humidite:Vent 
##   1.036894786   0.083416292  -0.003560927  -0.940039008   0.021667781

# GLM Negative Binomial
glm_nb <- glm.nb(Individus ~ Temperature + Humidite * Vent, data = dataM)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : nombre limite d'iterations atteint
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : nombre limite d'iterations atteint

print(coef(glm_nb))

##   (Intercept)   Temperature      Humidite          Vent Humidite:Vent 
##   1.036887912   0.083416508  -0.003560927  -0.940040434   0.021667819

# GAM (optional)
gam_mod <- gam(Individus ~ s(Temperature, k = 3) + s(Humidite, k = 3) + s(Vent, k = 3) +
                 ti(Humidite, Vent, k = 3),
               data = dataM, family = poisson, select = TRUE)
print(summary(gam_mod)$s.table)

##                            edf Ref.df       Chi.sq      p-value
## s(Temperature)    9.596442e-01      2 2.027002e+01 2.382724e-06
## s(Humidite)       1.518961e+00      2 1.243376e+01 2.738260e-04
## s(Vent)           3.039806e-06      2 3.610160e-06 2.147996e-01
## ti(Humidite,Vent) 7.185827e-01      4 3.478641e+00 2.430495e-02

# GLMM
glmm_mod <- glmer(Individus ~ Temperature + Humidite * Vent + (1 | Lieu) + (1 | Session),
                  data = dataM, family = poisson)

## boundary (singular) fit: see help('isSingular')

print(fixef(glmm_mod))

##   (Intercept)   Temperature      Humidite          Vent Humidite:Vent 
##   1.036894786   0.083416292  -0.003560927  -0.940039008   0.021667781

6.1 Chi2 test and adjustment coefficient

# Function to calculate χ² goodness-of-fit
# Function to calculate χ² goodness-of-fit with ĉ
calc_chi2 <- function(obs, pred, n_params) {
  chi2 <- sum((obs - pred)^2 / pred)
  df <- length(obs) - n_params
  p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
  c_hat <- chi2 / df
  return(list(chi2 = chi2, df = df, p_value = p_value, c_hat = c_hat))
}



# Extract coefficients and predictions
coef_lm <- coef(lm_mod)
manual_pred_lm <- coef_lm[1] + coef_lm["Temperature"] * dataM$Temperature +
  coef_lm["Humidite"] * dataM$Humidite + coef_lm["Vent"] * dataM$Vent +
  coef_lm["Humidite:Vent"] * dataM$Humidite * dataM$Vent

obs <- dataM$Individus

# Example for LM
pred_lm <- predict(lm_mod, type = "response")
n_params_lm <- length(coef(lm_mod))
chi_lm <- calc_chi2(obs, pred_lm, n_params_lm)

# Example for GLM Poisson
pred_glm <- predict(glm_pois, type = "response")
n_params_glm <- length(coef(glm_pois))
chi_glm <- calc_chi2(obs, pred_glm, n_params_glm)

# Exemple for GLM Quasi-Poisson
pred_quasi <- predict(glm_quasi, type = "response")
n_params_quasi <- length(coef(glm_quasi))
chi_quasi <- calc_chi2(obs, pred_quasi, n_params_quasi)

# Example for GLM NegBin
pred_nb <- predict(glm_nb, type = "response")
n_params_nb <- length(coef(glm_nb))
chi_nb <- calc_chi2(obs, pred_nb, n_params_nb)

# Example for GLMM (remember to use data without NA, and check that glmm_mod is defined)
pred_glmm <- predict(glmm_mod, type = "response")
n_params_glmm <- length(fixef(glmm_mod))
chi_glmm <- calc_chi2(obs, pred_glmm, n_params_glmm)


cat("\nLM Chi² =", chi_lm$chi2, "| df =", chi_lm$df, "| p-value =", chi_lm$p_value, "| ĉ =", chi_lm$c_hat, "\n")

## 
## LM Chi² = 11.75934 | df = 15 | p-value = 0.6971461 | ĉ = 0.7839562

cat("\nGLM Poisson Chi² =", chi_glm$chi2, "| df =", chi_glm$df, "| p-value =", chi_glm$p_value, "| ĉ =", chi_glm$c_hat, "\n")

## 
## GLM Poisson Chi² = 11.58152 | df = 15 | p-value = 0.7103887 | ĉ = 0.7721012

cat("\nGLM Quasi-Poisson Chi² =", chi_quasi$chi2, "| df =", chi_quasi$df, "| p-value =", chi_quasi$p_value, "| ĉ =", chi_quasi$c_hat, "\n")

## 
## GLM Quasi-Poisson Chi² = 11.58152 | df = 15 | p-value = 0.7103887 | ĉ = 0.7721012

cat("\nGLM NegBin Chi² =", chi_nb$chi2, "| df =", chi_nb$df, "| p-value =", chi_nb$p_value, "| ĉ =", chi_nb$c_hat, "\n")

## 
## GLM NegBin Chi² = 11.58152 | df = 15 | p-value = 0.7103888 | ĉ = 0.772101

cat("\nGLMM Chi² =", chi_glmm$chi2, "| df =", chi_glmm$df, "| p-value =", chi_glmm$p_value, "| ĉ =", chi_glmm$c_hat, "\n")

## 
## GLMM Chi² = 11.58152 | df = 15 | p-value = 0.7103887 | ĉ = 0.7721012

Multiple linear regression has a higher chi-square value but a more consistent dispersion (close to 1) than GLMs. However, the differences remain very minimal.To conclude on the most reliable and robust model, linear regression is preferred. Indeed, this model, based on an equation that includes temperature as an additive factor and the interaction between humidity and wind, has a lower RMSE than GLMs. Despite a higher chi2, this model manages to explain 72% of the variation in arthropods, with a relatively simplified equation. This model is therefore retained in the preliminary development of a tool for detecting arthropod abundance on the Crau plain, although the quasi-fish GLM presents an interesting alternative equation.

To optimise this study, an application to calculate this quantity of arthropods is currently being developed. Its principle is based on Rshiny, and a script is available below. The aim will be to develop this application so that it can be downloaded, provided that this manuscript is recognised as viable by peers. (french version)

library(shiny)

# Coefficients du modèle linéaire multiple
coef_intercept   <- -47.47695318
coef_temp        <- 2.62818052
coef_humidite    <- -0.02916673
coef_vent        <- -20.59062901
coef_humid_vent  <- 0.46511605

predict_individus <- function(temp, humidite, vent) {
  individus <- coef_intercept +
    coef_temp * temp +
    coef_humidite * humidite +
    coef_vent * vent +
    coef_humid_vent * (humidite * vent)
  return(round(individus, 2))
}

ui <- fluidPage(
  titlePanel("Prédiction du nombre d'individus selon les variables microclimatiques"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("temp", "Température (°C):", min=27, max=37, value=32, step=0.1),
      sliderInput("humidite", "Humidité (%):", min=32, max=65, value=48, step=0.1),
      sliderInput("vent", "Vent (m/s):", min=0, max=4, value=2, step=0.1)
    ),
    mainPanel(
      h3("Nombre prédit d'individus :"),
      verbatimTextOutput("result")
    )
  )
)

server <- function(input, output) {
  output$result <- renderText({
    nb <- predict_individus(input$temp, input$humidite, input$vent)
    paste(nb)
  })
}

shinyApp(ui = ui, server = server)

## PhantomJS not found. You can install it with webshot::install_phantomjs(). If it is installed, please make sure the phantomjs executable can be found via the PATH variable.

Shiny applications not supported in static R Markdown documents

Preliminary development of an arthropod detection tool on the Crau plain

SCARPARI Thomas

2025-09-29