Description

This script illustrates a comprehensive approach to evaluating hydrological data (here, EgaEnEstellaQts, a series of streamflow measurements on the Ega River at the Estella station in Spain) through different methods. First, it applies a ROC (Receiver Operating Characteristic) analysis to test the discriminative capacity of the predictions. It then implements performance measures (goodness-of-fit) to compare observations and simulations, supported by graphical visualizations such as adjustment curves, scatter plots, and time series. Finally, the code provides descriptive statistics and an analysis of the data distribution. The purpose of this script is to combine statistical and graphical tools in order to assess the quality of the simulations, detect potential biases, and provide a comprehensive view of model performance when applied to environmental or hydrological datasets.

1. Loading Required Packages

We begin by loading the necessary R packages for ROC analysis, visualization, and goodness-of-fit evaluation.

library(pROC)
library(ggplot2)
library(hydroGOF)
library(psych)
par(col = "#FF0000", col.axis = "#FF0000", col.lab = "#FF0000")

2. Data Loading and Cleaning

The EgaEnEstellaQts dataset is loaded, and we check for missing values to ensure data integrity.

data(EgaEnEstellaQts)
obs <- EgaEnEstellaQts

# Check for missing values
any(is.na(obs))

## [1] TRUE

obs <- na.omit(obs)  # Uncomment if missing values need to be removed

3. ROC Analysis

ROC analysis is performed to evaluate the performance of a simulated predictive model. A binary classification variable is created using the median as a threshold, and the ROC curve is plotted with the Area Under the Curve (AUC) metric.

set.seed(123)
predictions <- rnorm(length(obs), mean = mean(obs), sd = sd(obs))

# Create binary classification variable
binary_obs <- as.factor(ifelse(obs > median(obs), 1, 0))
print(table(binary_obs))

## binary_obs
##    0    1 
## 1825 1825

# Vérifier si nous avons au moins deux classes
seuil <- quantile(obs, 0.4)  # 40ème percentile au lieu de la médiane
binary_obs <- as.factor(ifelse(obs > seuil, 1, 0))

# Compute ROC curve
roc_obj <- roc(binary_obs, predictions)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc_value <- auc(roc_obj)
  
# Plot ROC curve
plot(roc_obj, 
     main = "ROC Curve for EgaEnEstellaQts",
     col = "#FF0000",
     lwd = 2,
     print.auc = TRUE,
     auc.polygon = TRUE,
     auc.polygon.col = "#FFB3B3",
     max.auc.polygon = TRUE)

plot(roc_obj, 
     main = "ROC Curve for Extreme Events (95th percentile)",
     col = "#FF0000",
     lwd = 2,
     print.auc = TRUE,
     auc.polygon = TRUE,
     auc.polygon.col = "#FFB3B3",
     max.auc.polygon = TRUE)

4. Goodness-of-Fit Analysis

To assess the agreement between observed and simulated values, we compute goodness-of-fit metrics. Simulated data is generated by adding random noise to a subset of the observations.

sim <- obs
n <- min(2000, length(obs))
sim[1:n] <- obs[1:n] + rnorm(n, mean = 10)

# Compute goodness-of-fit measures
gof_measures <- gof(sim = sim, obs = obs)
print(gof_measures)

##          [,1]
## ME       5.47
## MAE      5.47
## MSE     55.19
## RMSE     7.43
## NRMSE % 37.10
## PBIAS % 34.60
## RSR      0.37
## rSD      1.04
## NSE      0.86
## mNSE     0.57
## rNSE    -0.56
## d        0.97
## md       0.78
## rd       0.63
## cp       0.42
## r        0.97
## R2       0.94
## bR2      0.83
## KGE      0.65
## VE       0.65

# Visualize goodness-of-fit
ggof(sim = sim, obs = obs, ftype = "dm", FUN = mean) +
  theme_minimal() +
  scale_color_manual(values = c("#FF0000", "#CC0000")) +
  scale_fill_manual(values = c("#FFB3B3", "#FFD1D1"))

## NULL

5. Scatter Plot of Observations vs. Simulations

A scatter plot is created to visually compare observed and simulated values, with a 1:1 line for reference.

plot(obs, sim, 
     main = "Observations vs. Simulations", 
     xlab = "Observations", 
     ylab = "Simulations", 
     col = "#FF0000")
abline(0, 1, col = "#FF0000")  # 1:1 line in red

6. Descriptive Statistics

Descriptive statistics are calculated to summarize the central tendency, dispersion, and shape of the dataset.

describe(obs)

##    vars    n  mean    sd median trimmed  mad min max range skew kurtosis   se
## X1    1 3650 15.81 20.02    8.7   11.55 7.78 1.1 240 238.9 3.58    19.24 0.33

7. Time Series Visualization

Assuming the data is temporal, a time series plot is generated to explore trends and patterns.

plot(obs, type = "l", lwd = 2, 
     main = "Time Series of EgaEnEstellaQts", 
     col = "#FF0000")

8. Distribution Analysis

A histogram is plotted to examine the distribution of the EgaEnEstellaQts data.

hist(obs, 
     main = "Distribution of EgaEnEstellaQts", 
     xlab = "Values", 
     col = "#FFB3B3", 
     border = "#FF0000")

References

Contact

Abdi-Basid ADAN
abdi-basid@outlook.com

Combined Statistical Approaches for the Analysis of EgaEnEstellaQts Streamflow: From ROC Curves to Goodness-of-Fit Measures

Abdi-Basid ADAN