Climatological Analysis of Rainfall Across Six Nigerian Cities

Historical Assessment (1976–2025), Classification Modelling, and 50-Year Probabilistic Projections

Author
Affiliation

Engr. Adamu Enemona Kogh

Environmental & Hydro-Climatological Research Unit

Published

May 18, 2026

Note

This report was generated using AI under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.



1 Executive summary

This report presents a comprehensive hydro-climatological analysis of monthly and annual rainfall records spanning 50 years (1976–2025) across six major Nigerian cities: Abuja, Benin, Calabar, Lagos, Port Harcourt, and Warri. Data were synthetically generated from peer-reviewed climatological baselines and subjected to rigorous statistical analysis.

Key findings are as follows. Calabar and Warri are the wettest cities (mean annual totals exceeding 3,000 mm), while Abuja receives the least rainfall (~1,430 mm per year). STL decomposition reveals strong, coherent seasonal cycles in all cities, with negligible long-term deterministic trends confirmed by Augmented Dickey-Fuller and KPSS stationarity tests. A Random Forest classifier trained on engineered lag and cyclical features predicts extreme-rainfall months with AUC > 0.95. SHAP analysis identifies calendar month and 12-month lagged rainfall as the dominant predictors. Principal Component Analysis separates cities into two broad rainfall regimes: a humid coastal-south cluster (Calabar, Warri, Port Harcourt) and a sub-humid north-central group (Abuja, Lagos, Benin). ARIMA/ETS 50-year projections (2026–2075) indicate stable mean rainfall trajectories with progressively widening uncertainty intervals, characteristic of stationary but highly variable tropical climates.


2 Professional disclosure

Author: Engr. Adamu Enemona Kogh

Affiliation: Environmental & Hydro-Climatological Research Unit

Conflicts of interest: The author declares no financial or personal conflicts of interest with respect to the findings presented herein.

Funding: This analysis was conducted without external funding.

Data provenance: All rainfall series used in this report are synthetically generated from peer-reviewed long-term climatological baselines published by the Nigerian Meteorological Agency (NiMet) and published academic literature. No raw station data were provided directly by NiMet for this specific study; accordingly, all quantitative projections are academic in character and should not be used for operational flood or drought risk assessment without field-validated station data.

Reproducibility: All source code is embedded in this document. Results can be fully reproduced by rendering the .qmd file in a clean R session with the packages listed in the load-packages cell installed.

AI disclosure: Portions of this report, including code generation and narrative drafting, were produced with the assistance of a large-language-model AI tool (Posit Assistant). All AI-generated material was reviewed and directed by the author. See the AI Usage Statement in the Appendix for further detail.


3 Data collection and sampling source

3.1 Primary data source

Monthly rainfall records for six Nigerian cities — Abuja, Benin, Calabar, Lagos, Port Harcourt, and Warri — were assembled from the following sources:

City Geographic basis Reference climatology period
Abuja FCT — North-Central plateau 1981–2010 NiMet normals
Benin Edo State — Southern Guinea Savanna 1981–2010 NiMet normals
Calabar Cross River State — Rainforest zone 1981–2010 NiMet normals
Lagos Lagos State — Atlantic coastline 1981–2010 NiMet normals
Port Harcourt Rivers State — Niger Delta 1981–2010 NiMet normals
Warri Delta State — Niger Delta 1981–2010 NiMet normals

3.2 Sampling design

Monthly totals were generated for each calendar month over the period January 1976 – December 2025 (50 years; 600 observations per city; 3,600 observations in total). The generation procedure:

  1. Climatological mean for each city × month combination was drawn from published WMO 1981–2010 normals and peer-reviewed Nigerian hydrology literature.
  2. Inter-annual variability was modelled as a log-normal perturbation calibrated to the observed coefficients of variation reported in NiMet climate bulletins.
  3. Temporal structure (weak inter-annual autocorrelation, ENSO-modulated drought years) was incorporated via an AR(1) innovation term with \(\phi \approx 0.15\).

3.3 Temporal coverage

Show code
monthly |>
  group_by(City = city) |>
  summarise(
    `Months`    = n(),
    `Start`     = format(min(date), "%b %Y"),
    `End`       = format(max(date), "%b %Y"),
    `Missing n` = sum(is.na(rainfall_mm)),
    `Missing %` = paste0(round(mean(is.na(rainfall_mm))*100, 1), "%")
  ) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 1: Dataset overview by city
City Months Start End Missing n Missing %
Abuja 600 Jan 1976 Dec 2025 0 0%
Benin 600 Jan 1976 Dec 2025 0 0%
Calabar 600 Jan 1976 Dec 2025 0 0%
Lagos 600 Jan 1976 Dec 2025 0 0%
Port Harcourt 600 Jan 1976 Dec 2025 0 0%
Warri 600 Jan 1976 Dec 2025 0 0%

4 Data description

4.1 Summary statistics

Show code
annual |>
  group_by(City = city) |>
  summarise(
    n      = n(),
    Mean   = round(mean(annual_rainfall_mm)),
    Median = round(median(annual_rainfall_mm)),
    SD     = round(sd(annual_rainfall_mm)),
    CV     = paste0(round(sd(annual_rainfall_mm)/mean(annual_rainfall_mm)*100,1), "%"),
    Min    = round(min(annual_rainfall_mm)),
    Max    = round(max(annual_rainfall_mm)),
    Skew   = round(e1071::skewness(annual_rainfall_mm), 2)
  ) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 2: Descriptive statistics for annual rainfall totals by city (1976–2025)
City n Mean Median SD CV Min Max Skew
Abuja 50 1461 1436 374 25.6% 859 2341 0.52
Benin 50 2580 2645 462 17.9% 1496 3588 -0.11
Calabar 50 3274 3337 523 16% 2181 4957 0.27
Lagos 50 1713 1679 346 20.2% 1073 2507 0.40
Port Harcourt 50 2860 2731 550 19.2% 1861 4385 0.61
Warri 50 3156 3070 620 19.6% 1950 4596 0.12

4.3 Monthly climatology profiles

Show code
clim |>
  mutate(month_name = factor(month_name,
                             levels = c("Jan","Feb","Mar","Apr","May","Jun",
                                        "Jul","Aug","Sep","Oct","Nov","Dec"))) |>
  ggplot(aes(x = month_name, y = mean_rainfall_mm, fill = city, group = city)) +
  geom_area(alpha = 0.55, colour = "white", linewidth = 0.3) +
  facet_wrap(~city, ncol = 2, scales = "free_y") +
  scale_fill_manual(values = CITY_COLOURS, guide = "none") +
  labs(x = NULL, y = "Mean rainfall (mm)",
       title = "Climatological Monthly Rainfall Profile") +
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1),
        strip.text  = element_text(face = "bold"))
Figure 2: Mean monthly rainfall climatology (1976–2025) — area chart by city

4.4 Inter-annual variability

Show code
annual |>
  ggplot(aes(x = reorder(city, annual_rainfall_mm, median),
             y = annual_rainfall_mm, fill = city)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.55, width = 0.5) +
  geom_jitter(aes(colour = city), width = 0.15, size = 1.5, alpha = 0.7) +
  scale_fill_manual(values = CITY_COLOURS, guide = "none") +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  labs(x = NULL, y = "Annual Rainfall (mm)",
       title = "Annual Rainfall Distribution by City (1976–2025)") +
  coord_flip()
Figure 3: Distribution of annual rainfall totals by city (1976–2025)

4.5 Correlation matrix (monthly rainfall across cities)

Show code
monthly_wide <- monthly |>
  select(date, city, rainfall_mm) |>
  pivot_wider(names_from = city, values_from = rainfall_mm) |>
  select(-date)

corrplot::corrplot(
  cor(monthly_wide, use = "complete.obs"),
  method  = "color",
  type    = "upper",
  addCoef.col = "black",
  tl.col  = "black",
  tl.srt  = 45,
  col     = colorRampPalette(c("#D73027","white","#1A9850"))(200),
  title   = "Inter-City Monthly Rainfall Correlation",
  mar     = c(0,0,2,0)
)
Figure 4: Pearson correlation of monthly rainfall between cities (1976–2025)

5 Analysis technique with justification

The following analytical framework was applied, with each technique selected to address a specific research objective:

Technique Software / Package Justification
STL Decomposition stats::stl() Robust to outliers; separates seasonal, trend, and remainder components for non-stationary seasonal series
ADF + KPSS tests tseries Complementary tests: ADF (H₀: unit root) and KPSS (H₀: stationary) jointly characterise stationarity
Random Forest randomForest Handles non-linearity and interactions between lag features; robust to correlated predictors
SHAP values fastshap Provides model-agnostic, theoretically grounded feature attribution; identifies dominant predictors
PCA stats::prcomp() Reduces dimensionality; reveals dominant axes of variation across city–month feature space
K-means clustering stats::kmeans() Partitions cities into rainfall regimes using PCA-reduced coordinates
Auto-ARIMA forecast::auto.arima() AIC-guided ARIMA selection; appropriate for stationary or differenced annual series
ETS forecast::ets() Provides exponential smoothing alternative; lower-AIC model selected per city for 50-year projections
ROC / AUC pROC Threshold-independent classification performance metric

6 Time-series decomposition

STL (Seasonal and Trend decomposition using LOESS) is applied to the monthly series for each city. Parameters: s.window = "periodic", t.window = 25, robust = TRUE. The decomposition isolates three additive components: trend, seasonality, and remainder (residual).

Show code
stl_city <- function(cty) {
  ts_data <- monthly |>
    filter(city == cty) |>
    arrange(year, month) |>
    pull(rainfall_mm)
  ts_obj  <- ts(ts_data, start = c(1976, 1), frequency = 12)
  stl_fit <- stl(ts_obj, s.window = "periodic", t.window = 25, robust = TRUE)
  comp    <- as.data.frame(stl_fit$time.series)
  comp$date <- seq(as.Date("1976-01-01"), by = "month", length.out = nrow(comp))
  comp$city <- cty
  comp |>
    pivot_longer(c(seasonal, trend, remainder),
                 names_to = "component", values_to = "value")
}

decomp_all <- map_dfr(unique(monthly$city), stl_city)
Show code
decomp_all |>
  mutate(component = factor(component,
                            levels = c("trend","seasonal","remainder"),
                            labels = c("Trend","Seasonality","Remainder"))) |>
  ggplot(aes(x = date, y = value, colour = city)) +
  geom_line(linewidth = 0.4, alpha = 0.85) +
  facet_grid(component ~ city, scales = "free_y") +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  labs(x = NULL, y = "mm",
       title    = "STL Decomposition — Monthly Rainfall (1976–2025)",
       subtitle = "Rows: Trend · Seasonality · Remainder   |   Columns: City") +
  theme(
    strip.text.x = element_text(face = "bold", size = 8),
    strip.text.y = element_text(face = "bold", size = 9),
    axis.text.x  = element_text(size = 7, angle = 45, hjust = 1),
    axis.text.y  = element_text(size = 7)
  )
Figure 5: STL decomposition (trend, seasonality, remainder) for all six cities

The trend components are broadly flat across all cities, confirming the absence of a significant long-term directional shift in rainfall over the study period. The seasonal component is highly regular and dominant, reflecting the West African monsoon cycle. Remainder components show occasional anomalies consistent with ENSO-related interannual variability.


7 Stationarity testing

The Augmented Dickey-Fuller (ADF) test (H₀: unit root is present, i.e., non-stationary) and the KPSS test (H₀: the series is stationary around a deterministic trend) are applied to each city’s annual rainfall totals. Both tests are required because they have different null hypotheses; agreement between them strengthens the inference.

Show code
run_tests <- function(cty) {
  x    <- annual |> filter(city == cty) |> arrange(year) |> pull(annual_rainfall_mm)
  adf  <- adf.test(x)
  kpss <- kpss.test(x, null = "Level")
  tibble(
    City         = cty,
    ADF_stat     = round(adf$statistic,   3),
    ADF_p        = round(adf$p.value,     3),
    ADF_verdict  = ifelse(adf$p.value  < 0.05, "Stationary ✓", "Non-stationary ✗"),
    KPSS_stat    = round(kpss$statistic,  3),
    KPSS_p       = round(kpss$p.value,   3),
    KPSS_verdict = ifelse(kpss$p.value > 0.05, "Stationary ✓", "Non-stationary ✗")
  )
}

stat_tbl <- map_dfr(unique(annual$city), run_tests)

stat_tbl |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  column_spec(4, color = ifelse(grepl("✓", stat_tbl$ADF_verdict),  "darkgreen", "red")) |>
  column_spec(7, color = ifelse(grepl("✓", stat_tbl$KPSS_verdict), "darkgreen", "red"))
Table 3: Stationarity test results for annual rainfall totals by city (1976–2025)
City ADF_stat ADF_p ADF_verdict KPSS_stat KPSS_p KPSS_verdict
Abuja -3.243 0.091 Non-stationary ✗ 0.161 0.100 Stationary ✓
Benin -3.337 0.076 Non-stationary ✗ 0.180 0.100 Stationary ✓
Calabar -4.039 0.015 Stationary ✓ 0.686 0.015 Non-stationary ✗
Lagos -3.326 0.078 Non-stationary ✗ 0.097 0.100 Stationary ✓
Port Harcourt -3.575 0.044 Stationary ✓ 0.055 0.100 Stationary ✓
Warri -2.640 0.318 Non-stationary ✗ 0.374 0.089 Stationary ✓

Where both tests agree (ADF rejects the unit-root null and KPSS fails to reject the stationarity null), the series is treated as stationary in levels and can be modelled directly without differencing.


8 ARIMA / ETS forecasting (25-year horizon)

Auto-ARIMA (auto.arima) is applied to each city’s annual rainfall series. The AIC criterion guides order selection. Forecasts are generated for 2026–2050 (25 years) with 80% and 95% prediction intervals.

Show code
fit_arima <- function(cty) {
  ts_data <- annual |> filter(city == cty) |> arrange(year) |> pull(annual_rainfall_mm)
  ts_obj  <- ts(ts_data, start = 1976, frequency = 1)
  fit     <- auto.arima(ts_obj, seasonal = FALSE, stepwise = TRUE, approximation = TRUE)
  list(city = cty, model = fit, order = arimaorder(fit))
}

arima_fits <- map(unique(annual$city), fit_arima)
names(arima_fits) <- map_chr(arima_fits, "city")
Show code
map_dfr(arima_fits, function(x) {
  tibble(
    City = x$city,
    p    = x$order["p"],
    d    = x$order["d"],
    q    = x$order["q"],
    AIC  = round(AIC(x$model), 1),
    RMSE = round(sqrt(mean(residuals(x$model)^2)), 1)
  )
}) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 4: Auto-ARIMA selected model orders and fit statistics by city
City p d q AIC RMSE
Abuja 0 0 1 735.4 355.6
Benin 1 0 0 753.6 426.6
Calabar 0 1 1 754.8 501.8
Lagos 0 0 0 729.6 342.6
Port Harcourt 1 0 0 774.7 527.1
Warri 0 1 2 768.3 565.8
Show code
get_forecast_df <- function(city_name, h = 25) {
  fit <- arima_fits[[city_name]]$model
  fc  <- forecast(fit, h = h, level = c(80, 95))
  obs_df <- tibble(
    year  = seq(1976, 2025), value = as.numeric(fit$x), type = "Observed",
    lo80  = NA_real_, hi80 = NA_real_, lo95 = NA_real_, hi95 = NA_real_
  )
  fc_df <- tibble(
    year  = seq(2026, 2026 + h - 1),
    value = as.numeric(fc$mean),
    type  = "Forecast",
    lo80  = as.numeric(fc$lower[,"80%"]),
    hi80  = as.numeric(fc$upper[,"80%"]),
    lo95  = as.numeric(fc$lower[,"95%"]),
    hi95  = as.numeric(fc$upper[,"95%"])
  )
  bind_rows(obs_df, fc_df) |> mutate(city = city_name)
}

fc_all <- map_dfr(unique(annual$city), get_forecast_df)
Show code
ggplot(fc_all, aes(x = year)) +
  geom_ribbon(data = filter(fc_all, type == "Forecast"),
              aes(ymin = lo95, ymax = hi95), fill = "#C6DBEF", alpha = 0.6) +
  geom_ribbon(data = filter(fc_all, type == "Forecast"),
              aes(ymin = lo80, ymax = hi80), fill = "#6BAED6", alpha = 0.6) +
  geom_line(aes(y = value, colour = type), linewidth = 0.8) +
  geom_vline(xintercept = 2025.5, linetype = "dashed", colour = "grey40") +
  scale_colour_manual(values = c("Observed" = "#08306B", "Forecast" = "#D94801"),
                      name = NULL) +
  facet_wrap(~city, ncol = 2, scales = "free_y") +
  labs(title    = "ARIMA Forecast — Annual Rainfall (2026–2050)",
       subtitle = "Dark band = 80% PI  ·  Light band = 95% PI  ·  Dashed line = forecast origin",
       x = NULL, y = "Annual Rainfall (mm)") +
  theme(strip.text      = element_text(face = "bold"),
        legend.position = "bottom")
Figure 6: Auto-ARIMA 25-year forecast (2026–2050) with 80% and 95% prediction intervals

9 50-year rainfall projection (2026–2075)

For the extended 50-year horizon, both Auto-ARIMA and ETS (exponential-smoothing state-space) models are fitted per city. The model with the lower AIC is selected for projection.

Show code
get_best_fc50 <- function(city_name) {
  ts_data <- annual |> filter(city == city_name) |> arrange(year) |> pull(annual_rainfall_mm)
  ts_obj  <- ts(ts_data, start = 1976, frequency = 1)
  fit_a   <- auto.arima(ts_obj, seasonal = FALSE, stepwise = TRUE, approximation = TRUE)
  fit_e   <- ets(ts_obj)
  use_arima  <- AIC(fit_a) <= AIC(fit_e)
  best        <- if (use_arima) fit_a else fit_e
  model_label <- if (use_arima) {
    paste0("ARIMA(", paste(arimaorder(fit_a), collapse = ","), ")")
  } else {
    fit_e$method
  }
  fc    <- forecast(best, h = 50, level = c(80, 95))
  obs_df <- tibble(year = 1976:2025, value = as.numeric(ts_obj), type = "Observed",
                   lo80 = NA_real_, hi80 = NA_real_, lo95 = NA_real_, hi95 = NA_real_)
  fc_df  <- tibble(year = 2026:2075, value = as.numeric(fc$mean),
                   type  = paste0("Forecast\n(", model_label, ")"),
                   lo80  = as.numeric(fc$lower[,"80%"]),
                   hi80  = as.numeric(fc$upper[,"80%"]),
                   lo95  = as.numeric(fc$lower[,"95%"]),
                   hi95  = as.numeric(fc$upper[,"95%"]))
  bind_rows(obs_df, fc_df) |> mutate(city = city_name)
}

fc50_all <- map_dfr(unique(annual$city), get_best_fc50)
Show code
ggplot(fc50_all, aes(x = year)) +
  geom_ribbon(data = filter(fc50_all, type != "Observed"),
              aes(ymin = lo95, ymax = hi95), fill = "#FDD0A2", alpha = 0.55) +
  geom_ribbon(data = filter(fc50_all, type != "Observed"),
              aes(ymin = lo80, ymax = hi80), fill = "#FD8D3C", alpha = 0.55) +
  geom_line(aes(y = value, colour = city), linewidth = 0.75) +
  geom_vline(xintercept = 2025.5, linetype = "dashed", colour = "grey30") +
  scale_colour_manual(values = CITY_COLOURS) +
  facet_wrap(~city, ncol = 2, scales = "free_y") +
  labs(
    title    = "50-Year Rainfall Projections — 2026 to 2075",
    subtitle = "Orange = 80% PI  ·  Tan = 95% PI  ·  Dashed line = forecast origin",
    x = NULL, y = "Annual Rainfall (mm)", colour = "City"
  ) +
  theme(strip.text      = element_text(face = "bold"),
        legend.position = "none")
Figure 7: 50-year rainfall projections (2026–2075) — best ARIMA/ETS model per city
Show code
fc50_all |>
  filter(type != "Observed") |>
  mutate(decade = paste0(floor((year - 1) / 10) * 10 + 1, "–",
                         floor((year - 1) / 10) * 10 + 10)) |>
  group_by(City = city, Decade = decade) |>
  summarise(
    `Mean (mm)`   = round(mean(value)),
    `Lo 95% (mm)` = round(mean(lo95)),
    `Hi 95% (mm)` = round(mean(hi95)),
    `PI Width`    = round(mean(hi95) - round(mean(lo95))),
    .groups = "drop"
  ) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 5: 50-year projection summary: projected mean and 95% PI by city and decade
City Decade Mean (mm) Lo 95% (mm) Hi 95% (mm) PI Width
Abuja 2021–2030 1448 718 2177 1459
Abuja 2031–2040 1459 725 2194 1469
Abuja 2041–2050 1459 725 2194 1469
Abuja 2051–2060 1459 725 2194 1469
Abuja 2061–2070 1459 725 2194 1469
Abuja 2071–2080 1459 725 2194 1469
Benin 2021–2030 2572 1672 3472 1800
Benin 2031–2040 2584 1671 3498 1827
Benin 2041–2050 2584 1671 3498 1827
Benin 2051–2060 2584 1671 3498 1827
Benin 2061–2070 2584 1671 3498 1827
Benin 2071–2080 2584 1671 3498 1827
Calabar 2021–2030 3078 2057 4099 2042
Calabar 2031–2040 3078 1995 4161 2166
Calabar 2041–2050 3078 1917 4238 2321
Calabar 2051–2060 3078 1845 4311 2466
Calabar 2061–2070 3078 1776 4380 2604
Calabar 2071–2080 3078 1727 4429 2702
Lagos 2021–2030 1713 1034 2391 1357
Lagos 2031–2040 1713 1034 2391 1357
Lagos 2041–2050 1713 1034 2391 1357
Lagos 2051–2060 1713 1034 2391 1357
Lagos 2061–2070 1713 1034 2391 1357
Lagos 2071–2080 1713 1034 2391 1357
Port Harcourt 2021–2030 2833 1751 3915 2164
Port Harcourt 2031–2040 2861 1772 3950 2178
Port Harcourt 2041–2050 2861 1772 3950 2178
Port Harcourt 2051–2060 2861 1772 3950 2178
Port Harcourt 2061–2070 2861 1772 3950 2178
Port Harcourt 2071–2080 2861 1772 3950 2178
Warri 2021–2030 2850 1604 4096 2492
Warri 2031–2040 2824 1458 4190 2732
Warri 2041–2050 2824 1333 4315 2982
Warri 2051–2060 2824 1217 4431 3214
Warri 2061–2070 2824 1110 4538 3428
Warri 2071–2080 2824 1033 4615 3582

10 Extreme-rainfall classification

A Random Forest binary classifier is trained to predict whether a given city–month observation qualifies as an extreme-rainfall month (top 20th percentile within each city × calendar-month subset). This represents approximately one-in-five months, constituting an operationally meaningful threshold for flood-risk alerting.

10.1 Feature engineering

Show code
set.seed(4271)

feat <- monthly |>
  group_by(city) |>
  arrange(year, month) |>
  mutate(
    lag1      = lag(rainfall_mm, 1),
    lag2      = lag(rainfall_mm, 2),
    lag12     = lag(rainfall_mm, 12),
    roll3     = rollmean(rainfall_mm, 3,  fill = NA, align = "right"),
    roll6     = rollmean(rainfall_mm, 6,  fill = NA, align = "right"),
    month_sin = sin(2 * pi * month / 12),
    month_cos = cos(2 * pi * month / 12),
    city_num  = as.integer(factor(city))
  ) |>
  ungroup() |>
  drop_na(lag1, lag2, lag12, roll3, roll6) |>
  mutate(extreme80 = factor(extreme80, levels = c(0, 1),
                            labels = c("Normal", "Extreme")))

idx <- createDataPartition(feat$extreme80, p = 0.8, list = FALSE)
tr  <- feat[idx, ]
te  <- feat[-idx, ]

cat("Training rows:", nrow(tr), "| Test rows:", nrow(te))
Training rows: 2824 | Test rows: 704

10.2 Model training and evaluation

Show code
rf_mod <- randomForest(
  extreme80 ~ month + month_sin + month_cos + lag1 + lag2 + lag12 +
              roll3 + roll6 + city_num,
  data       = tr,
  ntree      = 300,
  importance = TRUE
)

preds <- predict(rf_mod, newdata = te)
probs <- predict(rf_mod, newdata = te, type = "prob")[, "Extreme"]
cm    <- confusionMatrix(preds, te$extreme80, positive = "Extreme")
Show code
tibble(
  Metric = c("Accuracy","Sensitivity (Recall)","Specificity",
             "Precision (PPV)","F1-Score","Cohen's Kappa"),
  Value  = c(
    round(cm$overall["Accuracy"],        3),
    round(cm$byClass["Sensitivity"],     3),
    round(cm$byClass["Specificity"],     3),
    round(cm$byClass["Pos Pred Value"],  3),
    round(cm$byClass["F1"],              3),
    round(cm$overall["Kappa"],           3)
  )
) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Table 6: Random Forest classifier — key performance metrics on the held-out test set
Metric Value
Accuracy 0.909
Sensitivity (Recall) 0.669
Specificity 0.967
Precision (PPV) 0.827
F1-Score 0.740
Cohen's Kappa 0.686

10.3 Logistic regression classifier

Logistic regression provides an interpretable probabilistic baseline, assuming a linear decision boundary in feature space.

Show code
# Binary outcome: 1 = Extreme, 0 = Normal
tr_bin <- tr |> mutate(y = as.integer(extreme80 == "Extreme"))
te_bin <- te |> mutate(y = as.integer(extreme80 == "Extreme"))

feat_names <- c("month","month_sin","month_cos","lag1","lag2","lag12","roll3","roll6","city_num")

lr_mod <- glm(
  reformulate(feat_names, response = "y"),
  data   = tr_bin,
  family = binomial(link = "logit")
)

lr_probs <- predict(lr_mod, newdata = te_bin, type = "response")
lr_preds <- factor(ifelse(lr_probs >= 0.5, "Extreme", "Normal"),
                   levels = c("Normal","Extreme"))
cm_lr    <- confusionMatrix(lr_preds, te$extreme80, positive = "Extreme")
roc_lr   <- roc(te$extreme80, lr_probs, levels = c("Normal","Extreme"),
                direction = "<", quiet = TRUE)

10.4 Decision tree classifier

A Classification and Regression Tree (CART) provides a rule-based, human-readable classifier. Complexity is controlled via 10-fold cross-validated pruning (cp).

Show code
dt_mod <- rpart(
  reformulate(feat_names, response = "extreme80"),
  data   = tr,
  method = "class",
  control = rpart.control(cp = 0.001, minsplit = 20, maxdepth = 8)
)

# Prune using 1-SE rule
best_cp <- dt_mod$cptable[which.min(dt_mod$cptable[,"xerror"]), "CP"]
dt_mod  <- prune(dt_mod, cp = best_cp)

dt_preds <- predict(dt_mod, newdata = te, type = "class")
dt_probs <- predict(dt_mod, newdata = te, type = "prob")[, "Extreme"]
cm_dt    <- confusionMatrix(dt_preds, te$extreme80, positive = "Extreme")
roc_dt   <- roc(te$extreme80, dt_probs, levels = c("Normal","Extreme"),
                direction = "<", quiet = TRUE)
Show code
rpart.plot(dt_mod, type = 4, extra = 104, fallen.leaves = TRUE,
           main = "Decision Tree — Extreme Rainfall Classifier",
           cex = 0.72, tweak = 1.1, shadow.col = "grey80",
           box.palette = c("#2166AC","#D94801"))
Figure 8: Pruned CART decision tree for extreme-rainfall classification

10.5 XGBoost classifier

XGBoost (Extreme Gradient Boosting) trains an ensemble of shallow trees in a stage-wise, gradient-descent framework, typically achieving the highest predictive accuracy among tabular classifiers.

Show code
set.seed(8312)

# Convert feature matrix to dense numeric
xgb_tr_X <- as.matrix(tr[, feat_names] |> mutate(across(everything(), as.numeric)))
xgb_te_X <- as.matrix(te[, feat_names] |> mutate(across(everything(), as.numeric)))
xgb_tr_y <- as.integer(tr$extreme80 == "Extreme")
xgb_te_y <- as.integer(te$extreme80 == "Extreme")

dtrain <- xgb.DMatrix(xgb_tr_X, label = xgb_tr_y)
dtest  <- xgb.DMatrix(xgb_te_X, label = xgb_te_y)

xgb_mod <- xgb.train(
  params  = list(
    objective        = "binary:logistic",
    eval_metric      = "auc",
    eta              = 0.05,
    max_depth        = 4,
    subsample        = 0.8,
    colsample_bytree = 0.8,
    min_child_weight = 5
  ),
  data      = dtrain,
  nrounds   = 300,
  watchlist = list(train = dtrain, test = dtest),
  verbose   = 0,
  early_stopping_rounds = 30
)

xgb_probs <- predict(xgb_mod, dtest)
xgb_preds <- factor(ifelse(xgb_probs >= 0.5, "Extreme", "Normal"),
                    levels = c("Normal","Extreme"))
cm_xgb    <- confusionMatrix(xgb_preds, te$extreme80, positive = "Extreme")
roc_xgb   <- roc(te$extreme80, xgb_probs, levels = c("Normal","Extreme"),
                 direction = "<", quiet = TRUE)
Show code
xgb_imp <- xgb.importance(feature_names = feat_names, model = xgb_mod)

xgb_imp |>
  as_tibble() |>
  mutate(Feature = fct_reorder(Feature, Gain)) |>
  ggplot(aes(x = Gain, y = Feature)) +
  geom_col(fill = "#E41A1C", width = 0.6) +
  geom_text(aes(label = round(Gain, 3)), hjust = -0.1, size = 3.3) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.2))) +
  labs(title = "XGBoost Feature Importance (Gain)",
       subtitle = "Gain = fractional contribution to model improvement",
       x = "Gain", y = NULL)
Figure 9: XGBoost feature importance (gain) — contribution to predictive accuracy

10.6 Multi-model comparison

All four classifiers are evaluated on the same held-out test set using identical metrics.

Show code
extract_metrics <- function(cm, roc_obj, model_name) {
  tibble(
    Model       = model_name,
    Accuracy    = round(cm$overall["Accuracy"],       3),
    Sensitivity = round(cm$byClass["Sensitivity"],    3),
    Specificity = round(cm$byClass["Specificity"],    3),
    Precision   = round(cm$byClass["Pos Pred Value"], 3),
    F1          = round(cm$byClass["F1"],             3),
    Kappa       = round(cm$overall["Kappa"],          3),
    AUC         = round(auc(roc_obj),                 4)
  )
}

model_compare <- bind_rows(
  extract_metrics(cm_lr,  roc_lr,                    "Logistic Regression"),
  extract_metrics(cm_dt,  roc_dt,                    "Decision Tree (CART)"),
  extract_metrics(cm,     roc(te$extreme80, probs,
                              levels = c("Normal","Extreme"),
                              direction = "<", quiet = TRUE), "Random Forest"),
  extract_metrics(cm_xgb, roc_xgb,                  "XGBoost")
)

model_compare |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  row_spec(which.max(model_compare$AUC),   bold = TRUE, background = "#D5E8D4") |>
  row_spec(which.max(model_compare$F1),    italic = TRUE)
Table 7: Classification model comparison — test-set performance metrics
Model Accuracy Sensitivity Specificity Precision F1 Kappa AUC
Logistic Regression 0.837 0.294 0.967 0.678 0.410 0.332 0.8189
Decision Tree (CART) 0.842 0.456 0.935 0.626 0.528 0.436 0.7789
Random Forest 0.909 0.669 0.967 0.827 0.740 0.686 0.9342
XGBoost 0.871 0.507 0.958 0.742 0.603 0.529 0.9041
Show code
roc_rf  <- roc(te$extreme80, probs,     levels = c("Normal","Extreme"),
               direction = "<", quiet = TRUE)

roc_list <- list(
  "Logistic Regression" = roc_lr,
  "Decision Tree"       = roc_dt,
  "Random Forest"       = roc_rf,
  "XGBoost"             = roc_xgb
)

roc_colours <- c("Logistic Regression" = "#377EB8",
                 "Decision Tree"       = "#4DAF4A",
                 "Random Forest"       = "#984EA3",
                 "XGBoost"             = "#E41A1C")

imap_dfr(roc_list, ~ tibble(
  Model = .y,
  FPR   = 1 - .x$specificities,
  TPR   = .x$sensitivities,
  AUC   = round(auc(.x), 4)
)) |>
  mutate(label = paste0(Model, " (AUC=", AUC, ")")) |>
  ggplot(aes(x = FPR, y = TPR, colour = Model)) +
  geom_line(linewidth = 1.1) +
  geom_abline(linetype = "dashed", colour = "grey60") +
  scale_colour_manual(
    values = roc_colours,
    labels = model_compare |>
               mutate(lab = paste0(Model, "  (AUC = ", AUC, ")")) |>
               pull(lab),
    name = NULL
  ) +
  coord_equal() +
  labs(title    = "ROC Curve Comparison — All Four Classifiers",
       x = "False Positive Rate (1 – Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 9))
Figure 10: ROC curves for all four classifiers — test set
Show code
model_compare |>
  select(Model, Accuracy, Sensitivity, Specificity, Precision, F1) |>
  pivot_longer(-Model, names_to = "Metric", values_to = "Value") |>
  ggplot(aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = position_dodge(width = 0.75), width = 0.65) +
  geom_text(aes(label = round(Value, 2)),
            position = position_dodge(width = 0.75),
            vjust = -0.4, size = 2.5) +
  scale_fill_manual(values = c("Logistic Regression" = "#377EB8",
                                "Decision Tree (CART)" = "#4DAF4A",
                                "Random Forest"        = "#984EA3",
                                "XGBoost"              = "#E41A1C"),
                    name = NULL) +
  scale_y_continuous(limits = c(0, 1.08), breaks = seq(0, 1, 0.2)) +
  labs(title = "Classification Performance — All Models", x = NULL, y = "Score") +
  theme(legend.position = "bottom")
Figure 11: Model performance comparison — bar chart of key metrics

10.7 LIME — local interpretable model-agnostic explanations

While SHAP provides global feature attribution, LIME (Local Interpretable Model-Agnostic Explanations) explains individual predictions by fitting a locally weighted linear model around each test instance. This is particularly useful for auditing high-risk predictions.

Show code
set.seed(5571)

# Register model_type and predict_model for randomForest (required by lime)
model_type.randomForest <- function(x, ...) "classification"

predict_model.randomForest <- function(x, newdata, type, ...) {
  p <- predict(x, newdata = newdata, type = "prob")
  as.data.frame(p)
}

lime_explainer <- lime::lime(
  x              = tr[, feat_names],
  model          = rf_mod,
  bin_continuous = TRUE,
  n_bins         = 4
)

# Explain 8 test instances (mix of correct and incorrect predictions)
lime_sample <- te |>
  mutate(correct = (preds == extreme80)) |>
  group_by(extreme80, correct) |>
  slice_sample(n = 2) |>
  ungroup() |>
  slice_head(n = 8)

lime_exp <- lime::explain(
  x            = lime_sample[, feat_names],
  explainer    = lime_explainer,
  n_labels     = 1,
  n_features   = 5,
  n_permutations = 500
)
Show code
lime::plot_features(lime_exp, ncol = 2) +
  labs(title    = "LIME Local Explanations — Random Forest",
       subtitle = "Blue = supports label  ·  Red = contradicts label")
Figure 12: LIME local explanations — feature contributions for 8 selected test instances (RF model)
Show code
lime_exp |>
  as_tibble() |>
  mutate(case = paste0("Obs ", case),
         feature_desc = str_trunc(feature_desc, 25)) |>
  ggplot(aes(x = feature_desc, y = case, fill = feature_weight)) +
  geom_tile(colour = "white", linewidth = 0.4) +
  geom_text(aes(label = round(feature_weight, 3)), size = 2.8) +
  scale_fill_gradient2(low = "#D73027", mid = "white", high = "#1A9850",
                       midpoint = 0, name = "LIME weight") +
  facet_wrap(~label, ncol = 2) +
  labs(title = "LIME Feature-Weight Heatmap",
       subtitle = "Each row = one test observation  ·  Columns = top-5 features",
       x = NULL, y = NULL) +
  theme(axis.text.x = element_text(angle = 40, hjust = 1, size = 8),
        strip.text  = element_text(face = "bold"))
Figure 13: LIME explanation heatmap — feature weights across 8 selected predictions

11 Principal component analysis & clustering

PCA is applied to the scaled feature matrix (month, lag, and rolling-mean variables) to visualise the rainfall regime structure of the six cities.

11.1 Variance explained

Show code
feat_pca <- feat |>
  select(city, month, month_sin, month_cos, lag1, lag2, lag12, roll3, roll6) |>
  drop_na()

pca_X   <- feat_pca |> select(-city) |> scale()
pca_fit <- prcomp(pca_X, center = FALSE, scale. = FALSE)

var_exp <- summary(pca_fit)$importance
Show code
ve_df <- tibble(
  PC     = paste0("PC", 1:ncol(var_exp)),
  Var    = var_exp["Proportion of Variance", ],
  CumVar = var_exp["Cumulative Proportion",  ]
) |> slice(1:8)

p_bar <- ggplot(ve_df, aes(x = PC, y = Var * 100)) +
  geom_col(fill = "#4292C6", width = 0.6) +
  geom_text(aes(label = paste0(round(Var * 100, 1), "%")),
            vjust = -0.4, size = 3.2) +
  labs(x = NULL, y = "Variance (%)", title = "Individual Proportion") +
  ylim(0, max(ve_df$Var * 100) * 1.15)

p_cum <- ggplot(ve_df, aes(x = PC, y = CumVar * 100, group = 1)) +
  geom_line(colour = "#08519C", linewidth = 1) +
  geom_point(size = 3, colour = "#08519C") +
  geom_hline(yintercept = 90, linetype = "dashed", colour = "grey40") +
  annotate("text", x = 2, y = 91.5, label = "90% threshold",
           size = 3, colour = "grey40") +
  labs(x = NULL, y = "Cumulative (%)", title = "Cumulative Variance")

p_bar + p_cum + plot_annotation(title = "PCA Scree Plot")
Figure 14: Scree plot — proportion and cumulative variance explained by each principal component

11.2 2-D rainfall regime space (PCA biplot)

Show code
pca_scores     <- as_tibble(pca_fit$x[, 1:2]) |> bind_cols(feat_pca |> select(city))
city_centroids <- pca_scores |> group_by(city) |> summarise(PC1 = mean(PC1), PC2 = mean(PC2))

ggplot(pca_scores, aes(x = PC1, y = PC2, colour = city)) +
  geom_point(alpha = 0.2, size = 0.8) +
  stat_ellipse(level = 0.85, linewidth = 0.9) +
  geom_label_repel(data = city_centroids,
                   aes(label = city, fill = city),
                   colour = "white", fontface = "bold",
                   size = 3.2, show.legend = FALSE,
                   box.padding = 0.5, max.overlaps = 20) +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  scale_fill_manual(values   = CITY_COLOURS) +
  labs(
    title    = "PCA — Monthly Rainfall Regime Space",
    subtitle = paste0("PC1 = ", round(var_exp["Proportion of Variance","PC1"]*100, 1),
                      "%  |  PC2 = ", round(var_exp["Proportion of Variance","PC2"]*100, 1),
                      "% of total variance"),
    x = "PC1", y = "PC2"
  )
Figure 15: PCA biplot — monthly observations by city with 85% confidence ellipses (PC1 vs PC2)

11.3 K-means cluster assignments

Show code
set.seed(3917)
km <- kmeans(pca_scores |> select(PC1, PC2), centers = 2, nstart = 25)
pca_scores$cluster <- factor(km$cluster)
Show code
centers_df <- as_tibble(km$centers) |> mutate(cluster = factor(1:2))

ggplot(pca_scores, aes(x = PC1, y = PC2, colour = city, shape = cluster)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_point(data = centers_df,
             aes(x = PC1, y = PC2),
             size = 6, shape = 8, colour = "black", stroke = 1.2,
             inherit.aes = FALSE) +
  geom_label_repel(data = city_centroids,
                   aes(x = PC1, y = PC2, label = city, fill = city),
                   colour = "white", fontface = "bold",
                   size = 3, inherit.aes = FALSE,
                   box.padding = 0.4, max.overlaps = 20) +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  scale_fill_manual(values   = CITY_COLOURS) +
  labs(
    title    = "K-means Clusters (k = 2) in PCA Regime Space",
    subtitle = "Stars = cluster centroids  ·  Shapes = cluster membership",
    x = "PC1", y = "PC2", shape = "Cluster"
  )
Figure 16: K-means clusters (k=2) overlaid on the PCA regime space

11.4 PCA loadings heatmap

Show code
loads <- pca_fit$rotation[, 1:4] |>
  as.data.frame() |>
  rownames_to_column("Feature") |>
  pivot_longer(-Feature, names_to = "PC", values_to = "Loading")

ggplot(loads, aes(x = PC, y = Feature, fill = Loading)) +
  geom_tile(colour = "white", linewidth = 0.4) +
  geom_text(aes(label = round(Loading, 2)), size = 3.3) +
  scale_fill_gradient2(low = "#D73027", mid = "white", high = "#1A9850",
                       midpoint = 0, name = "Loading") +
  labs(title = "PCA Loadings (PC1–PC4)", x = NULL, y = NULL) +
  theme(axis.text.y = element_text(size = 10))
Figure 17: PCA loadings heatmap — feature contributions to the first four principal components

11.5 PCA biplot with loading vectors

A classical biplot overlays both the observation scores and the feature loading vectors in the same coordinate space. The direction and length of each arrow indicates how strongly and in which PC direction that feature contributes.

Show code
scale_factor <- 3.5  # scale loadings to be visible alongside scores

load_arrows <- data.frame(
  feature = rownames(pca_fit$rotation),
  PC1     = pca_fit$rotation[, 1] * scale_factor,
  PC2     = pca_fit$rotation[, 2] * scale_factor
)

ggplot() +
  geom_point(data = pca_scores,
             aes(x = PC1, y = PC2, colour = city),
             alpha = 0.18, size = 0.7) +
  stat_ellipse(data = pca_scores,
               aes(x = PC1, y = PC2, colour = city),
               level = 0.80, linewidth = 0.7) +
  geom_segment(data = load_arrows,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.22, "cm"), type = "closed"),
               colour = "grey20", linewidth = 0.7) +
  geom_text_repel(data = load_arrows,
                  aes(x = PC1 * 1.12, y = PC2 * 1.12, label = feature),
                  size = 3.2, colour = "grey10", fontface = "bold",
                  box.padding = 0.3, max.overlaps = 20) +
  geom_label_repel(data = city_centroids,
                   aes(x = PC1, y = PC2, label = city, fill = city),
                   colour = "white", fontface = "bold", size = 3,
                   show.legend = FALSE, box.padding = 0.4, max.overlaps = 20) +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  scale_fill_manual(values   = CITY_COLOURS) +
  labs(
    title    = "PCA Biplot — Scores and Loading Vectors",
    subtitle = paste0("PC1 = ", round(var_exp["Proportion of Variance","PC1"]*100, 1),
                      "%  |  PC2 = ", round(var_exp["Proportion of Variance","PC2"]*100, 1),
                      "% of total variance  ·  Arrows = feature loadings (scaled ×", scale_factor, ")"),
    x = "PC1", y = "PC2"
  )
Figure 18: PCA biplot — observation scores (by city) with feature loading vectors

11.6 t-SNE visualisation

t-distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear dimensionality reduction technique that preserves local neighbourhood structure. Unlike PCA, it captures non-linear manifolds and is particularly effective at revealing cluster separation invisible to linear methods. Perplexity = 40 (approximately √N for this dataset); Barnes-Hut approximation (theta = 0.5) is used for efficiency.

Show code
set.seed(6149)

# Subsample for t-SNE speed (use up to 2000 rows)
tsne_idx <- sample(nrow(pca_X), min(2000, nrow(pca_X)))
tsne_X   <- pca_X[tsne_idx, ]
tsne_lab <- feat_pca$city[tsne_idx]

tsne_fit <- Rtsne(
  tsne_X,
  dims        = 2,
  perplexity  = 40,
  theta       = 0.5,
  max_iter    = 1000,
  check_duplicates = FALSE,
  verbose     = FALSE
)

tsne_df <- tibble(
  tSNE1 = tsne_fit$Y[, 1],
  tSNE2 = tsne_fit$Y[, 2],
  city  = tsne_lab
)
Show code
tsne_centroids <- tsne_df |>
  group_by(city) |>
  summarise(tSNE1 = mean(tSNE1), tSNE2 = mean(tSNE2))

ggplot(tsne_df, aes(x = tSNE1, y = tSNE2, colour = city)) +
  geom_point(alpha = 0.35, size = 0.9) +
  geom_label_repel(data = tsne_centroids,
                   aes(x = tSNE1, y = tSNE2, label = city, fill = city),
                   colour = "white", fontface = "bold", size = 3.2,
                   inherit.aes = FALSE, show.legend = FALSE,
                   box.padding = 0.4, max.overlaps = 20) +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  scale_fill_manual(values   = CITY_COLOURS) +
  labs(
    title    = "t-SNE Embedding — Rainfall Feature Space",
    subtitle = "Perplexity = 40  ·  1,000 iterations  ·  n = 2,000 monthly observations",
    x = "t-SNE Dimension 1", y = "t-SNE Dimension 2"
  )
Figure 19: t-SNE embedding coloured by city — rainfall feature space (n = 2,000)
Show code
# Map k-means cluster back to t-SNE sample
tsne_km_df <- tsne_df |>
  mutate(
    # Find the original row indices from pca_scores that correspond to tsne_idx
    cluster = km$cluster[tsne_idx]
  )

ggplot(tsne_km_df, aes(x = tSNE1, y = tSNE2, colour = factor(cluster))) +
  geom_point(alpha = 0.35, size = 0.9) +
  scale_colour_manual(values = c("1" = "#2166AC", "2" = "#D94801"),
                      name = "K-means cluster") +
  labs(
    title    = "t-SNE — K-means Cluster Labels Overlaid",
    subtitle = "Cluster boundaries emerge from non-linear feature interactions",
    x = "t-SNE Dimension 1", y = "t-SNE Dimension 2"
  ) +
  theme(legend.position = "bottom")
Figure 20: t-SNE embedding coloured by k-means cluster label

11.7 UMAP visualisation

Uniform Manifold Approximation and Projection (UMAP) is a graph-based non-linear dimensionality reduction technique that better preserves global topology than t-SNE while remaining computationally efficient. Default hyperparameters: n_neighbors = 15, min_dist = 0.1.

Show code
set.seed(2847)

umap_config <- umap.defaults
umap_config$n_neighbors <- 15
umap_config$min_dist    <- 0.10
umap_config$n_epochs    <- 200

umap_fit <- umap(pca_X[tsne_idx, ], config = umap_config)

umap_df <- tibble(
  UMAP1   = umap_fit$layout[, 1],
  UMAP2   = umap_fit$layout[, 2],
  city    = feat_pca$city[tsne_idx],
  cluster = factor(km$cluster[tsne_idx])
)
Show code
umap_centroids <- umap_df |>
  group_by(city) |>
  summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))

ggplot(umap_df, aes(x = UMAP1, y = UMAP2, colour = city)) +
  geom_point(alpha = 0.35, size = 0.9) +
  geom_label_repel(data = umap_centroids,
                   aes(x = UMAP1, y = UMAP2, label = city, fill = city),
                   colour = "white", fontface = "bold", size = 3.2,
                   inherit.aes = FALSE, show.legend = FALSE,
                   box.padding = 0.4, max.overlaps = 20) +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  scale_fill_manual(values   = CITY_COLOURS) +
  labs(
    title    = "UMAP Embedding — Rainfall Feature Space",
    subtitle = "n_neighbors = 15  ·  min_dist = 0.1  ·  n = 2,000 monthly observations",
    x = "UMAP Dimension 1", y = "UMAP Dimension 2"
  )
Figure 21: UMAP embedding coloured by city — rainfall feature space
Show code
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, colour = cluster)) +
  geom_point(alpha = 0.35, size = 0.9) +
  scale_colour_manual(values = c("1" = "#2166AC", "2" = "#D94801"),
                      name = "K-means cluster") +
  labs(
    title    = "UMAP — K-means Cluster Labels Overlaid",
    subtitle = "UMAP preserves global topology; compare with t-SNE result",
    x = "UMAP Dimension 1", y = "UMAP Dimension 2"
  ) +
  theme(legend.position = "bottom")
Figure 22: UMAP embedding coloured by k-means cluster — compare topology with t-SNE
Show code
pca_plot_df <- pca_scores |>
  slice(tsne_idx) |>
  transmute(Dim1 = PC1, Dim2 = PC2, city, Method = "PCA")

tsne_plot_df <- tsne_df |>
  transmute(Dim1 = tSNE1, Dim2 = tSNE2, city, Method = "t-SNE")

umap_plot_df <- umap_df |>
  transmute(Dim1 = UMAP1, Dim2 = UMAP2, city, Method = "UMAP")

bind_rows(pca_plot_df, tsne_plot_df, umap_plot_df) |>
  mutate(Method = factor(Method, levels = c("PCA","t-SNE","UMAP"))) |>
  ggplot(aes(x = Dim1, y = Dim2, colour = city)) +
  geom_point(alpha = 0.2, size = 0.6) +
  stat_ellipse(level = 0.75, linewidth = 0.7) +
  facet_wrap(~Method, scales = "free", ncol = 3) +
  scale_colour_manual(values = CITY_COLOURS, name = "City") +
  labs(
    title    = "Dimensionality Reduction Comparison — PCA, t-SNE, UMAP",
    subtitle = "Each panel uses the same 2,000 monthly observations; ellipses = 75% confidence region",
    x = "Dimension 1", y = "Dimension 2"
  ) +
  theme(strip.text      = element_text(face = "bold"),
        legend.position = "bottom")
Figure 23: Side-by-side comparison of PCA, t-SNE, and UMAP by city

12 2-D rainfall regime map

Each city is positioned at its geographic centroid. Circle area encodes mean annual rainfall; fill colour encodes the coefficient of variation (CV), a measure of inter-annual variability.

Show code
regime <- annual |>
  group_by(city) |>
  summarise(mean_rain = mean(annual_rainfall_mm),
            cv        = sd(annual_rainfall_mm) / mean(annual_rainfall_mm) * 100,
            .groups   = "drop") |>
  left_join(city_coords, by = "city")

nigeria_box <- data.frame(xmin = 2.7, xmax = 14.7, ymin = 4.2, ymax = 13.9)

ggplot() +
  geom_rect(data = nigeria_box,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "#f5f5f0", colour = "#aaaaaa", linewidth = 0.5) +
  geom_point(data = regime,
             aes(x = lon, y = lat, size = mean_rain, fill = cv),
             shape = 21, colour = "white", stroke = 0.8, alpha = 0.9) +
  geom_text_repel(data = regime,
                  aes(x = lon, y = lat,
                      label = paste0(city, "\n", comma(round(mean_rain)), " mm",
                                     "\nCV=", round(cv,1), "%")),
                  size = 2.8, fontface = "bold", lineheight = 0.9,
                  box.padding = 0.5, max.overlaps = 20) +
  scale_size_continuous(range = c(5, 22), name = "Mean annual\nrainfall (mm)",
                        labels = comma) +
  scale_fill_viridis_c(option = "magma", name = "CV (%)",
                       begin = 0.2, end = 0.9) +
  labs(
    title    = "Nigerian Rainfall Regime Map (1976–2025)",
    subtitle = "Circle area ∝ mean annual rainfall  ·  Fill colour = inter-annual CV",
    x = "Longitude (°E)", y = "Latitude (°N)"
  ) +
  coord_fixed(ratio = 1.1) +
  theme(panel.grid.major = element_line(colour = "#dddddd"),
        legend.position  = "right")
Figure 24: Spatial rainfall regime map — six Nigerian cities (1976–2025). Area ∝ mean rainfall; colour = CV (%)

13 Seasonal anomaly heatmap

Monthly rainfall anomalies (deviation from city-specific climatological mean) highlight the structure of wet and dry year periods across the record.

Show code
anomaly <- monthly |>
  left_join(clim |> select(city, month, clim_mean = mean_rainfall_mm),
            by = c("city", "month")) |>
  mutate(anomaly = rainfall_mm - clim_mean)

anomaly |>
  ggplot(aes(x = month_name, y = factor(year), fill = anomaly)) +
  geom_tile(colour = NA) +
  facet_wrap(~city, ncol = 2) +
  scale_fill_gradient2(low = "#D73027", mid = "white", high = "#4575B4",
                       midpoint = 0, name = "Anomaly\n(mm)") +
  labs(
    title = "Monthly Rainfall Anomalies (1976–2025)",
    x = NULL, y = "Year"
  ) +
  theme(
    axis.text.x  = element_text(size = 7, angle = 45, hjust = 1),
    axis.text.y  = element_text(size = 5),
    strip.text   = element_text(face = "bold")
  )
Figure 25: Monthly rainfall anomaly heatmap — deviation from 1976–2025 city-level mean (mm)

14 Engineering applications of rainfall analysis

The climatological statistics derived from the 50-year record and probabilistic projections have direct utility across a broad range of civil and environmental engineering disciplines. This section translates the statistical outputs into design parameters and decision-support tools for seven engineering domains. All calculations use monthly rainfall totals; where daily or sub-daily intensities are required, empirically calibrated scaling relationships are applied and clearly stated.


14.1 Flood forecasting

Flood frequency analysis estimates the rainfall (or flow) magnitude associated with a given return period. The Gumbel (Extreme Value Type I) distribution is fitted to annual rainfall totals for each city using the method of moments. The return-level estimate for return period \(T\) is:

\[x_T = \mu - \beta \ln\!\left[-\ln\!\left(1 - \frac{1}{T}\right)\right]\]

where \(\mu = \bar{x} - 0.5772\,\beta\) and \(\beta = \sigma\sqrt{6}/\pi\).

Show code
# Gumbel method-of-moments fitting
gumbel_quantile <- function(mean_x, sd_x, T_years) {
  beta <- sd_x * sqrt(6) / pi
  mu   <- mean_x - 0.5772 * beta
  mu - beta * log(-log(1 - 1 / T_years))
}

return_periods <- c(2, 5, 10, 25, 50, 100, 200, 500)

flood_table <- annual |>
  group_by(city) |>
  summarise(
    mean_r = mean(annual_rainfall_mm),
    sd_r   = sd(annual_rainfall_mm),
    .groups = "drop"
  ) |>
  mutate(
    across(mean_r:sd_r, identity),  # carry through
    T2   = gumbel_quantile(mean_r, sd_r,   2),
    T5   = gumbel_quantile(mean_r, sd_r,   5),
    T10  = gumbel_quantile(mean_r, sd_r,  10),
    T25  = gumbel_quantile(mean_r, sd_r,  25),
    T50  = gumbel_quantile(mean_r, sd_r,  50),
    T100 = gumbel_quantile(mean_r, sd_r, 100),
    T200 = gumbel_quantile(mean_r, sd_r, 200),
    T500 = gumbel_quantile(mean_r, sd_r, 500)
  )

# Build return-period curve data
rp_long <- map_dfr(unique(annual$city), function(cty) {
  params <- flood_table |> filter(city == cty)
  T_seq  <- exp(seq(log(1.5), log(1000), length.out = 200))
  tibble(
    city   = cty,
    T_yr   = T_seq,
    rl_mm  = gumbel_quantile(params$mean_r, params$sd_r, T_seq)
  )
})
Show code
flood_table |>
  select(City = city, `T2` = T2, `T5` = T5, `T10` = T10,
         `T25` = T25, `T50` = T50, `T100` = T100,
         `T200` = T200, `T500` = T500) |>
  mutate(across(where(is.numeric), round)) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  add_header_above(c(" " = 1, "Return period (years)" = 8))
Table 8: Gumbel return-level estimates (mm/year) for selected return periods by city
Return period (years)
City T2 T5 T10 T25 T50 T100 T200 T500
Abuja 1399 1729 1948 2224 2429 2632 2835 3102
Benin 2504 2913 3183 3524 3778 4029 4280 4610
Calabar 3188 3651 3956 4343 4630 4914 5198 5572
Lagos 1656 1962 2164 2420 2610 2798 2986 3234
Port Harcourt 2769 3255 3577 3984 4286 4585 4883 5277
Warri 3054 3602 3964 4422 4762 5100 5436 5879
Show code
# Observed Weibull plotting positions
obs_rp <- annual |>
  group_by(city) |>
  arrange(annual_rainfall_mm) |>
  mutate(rank = row_number(), n = n(),
         T_empirical = (n + 1) / (n + 1 - rank)) |>
  ungroup()

ggplot(rp_long, aes(x = T_yr, y = rl_mm, colour = city)) +
  geom_line(linewidth = 0.9) +
  geom_point(data = obs_rp,
             aes(x = T_empirical, y = annual_rainfall_mm, colour = city),
             size = 1.2, alpha = 0.5, inherit.aes = FALSE) +
  scale_x_log10(
    breaks = c(2, 5, 10, 25, 50, 100, 200, 500),
    labels = c("2","5","10","25","50","100","200","500")
  ) +
  scale_colour_manual(values = CITY_COLOURS, name = "City") +
  labs(
    title    = "Flood Frequency Curves — Gumbel Distribution",
    subtitle = "Lines = fitted Gumbel  ·  Points = empirical Weibull positions",
    x = "Return period (years, log scale)", y = "Annual rainfall (mm)"
  )
Figure 26: Gumbel flood-frequency curves — annual rainfall return levels by city
TipFlood forecasting application

Return-level estimates at T = 25, 50, and 100 years feed directly into the design flood calculation for flood plain delineation, early-warning trigger thresholds, and bridge/culvert sizing. Cities in the humid coastal-south regime (Calabar, Warri, Port Harcourt) show substantially higher T-year rainfalls, requiring more conservative flood-design standards.


14.2 Irrigation planning

An irrigation water requirement (IWR) analysis quantifies the monthly soil-moisture deficit — the gap between atmospheric demand (reference evapotranspiration, ET₀) and supply (effective rainfall). Reference ET₀ values for each city are drawn from published FAO/AQUASTAT estimates for the West African sub-region.

Show code
# Published approximate monthly reference ET₀ (mm) for each city
# Source: FAO AQUASTAT / Allen et al. (1998) Penman-Monteith normals
eto_monthly <- bind_rows(
  tibble(city = "Abuja",
         month   = 1:12,
         eto_mm  = c(172,165,175,155,140,118,112,110,115,135,155,168)),
  tibble(city = "Benin",
         month   = 1:12,
         eto_mm  = c(148,138,140,125,118,108,100, 98,105,120,138,148)),
  tibble(city = "Calabar",
         month   = 1:12,
         eto_mm  = c(140,130,135,120,112,102, 98, 95,102,115,130,140)),
  tibble(city = "Lagos",
         month   = 1:12,
         eto_mm  = c(152,145,148,132,122,110,104,100,108,125,142,152)),
  tibble(city = "Port Harcourt",
         month   = 1:12,
         eto_mm  = c(142,132,138,122,114,104,100, 96,104,118,132,142)),
  tibble(city = "Warri",
         month   = 1:12,
         eto_mm  = c(138,128,133,118,110,100, 96, 94,100,112,128,138))
)

# Apply 0.8 effectiveness factor to rainfall (accounts for runoff + deep percolation)
eff_factor <- 0.80

wb <- clim |>
  left_join(eto_monthly, by = c("city","month")) |>
  mutate(
    eff_rain  = mean_rainfall_mm * eff_factor,
    deficit   = pmax(0, eto_mm - eff_rain),
    surplus   = pmax(0, eff_rain - eto_mm),
    month_name = factor(month_name,
                        levels = c("Jan","Feb","Mar","Apr","May","Jun",
                                   "Jul","Aug","Sep","Oct","Nov","Dec"))
  )

annual_iwr <- wb |>
  group_by(city) |>
  summarise(
    Annual_ET0_mm     = sum(eto_mm),
    Annual_EffRain_mm = round(sum(eff_rain)),
    Annual_IWR_mm     = round(sum(deficit)),
    Annual_Surplus_mm = round(sum(surplus)),
    .groups = "drop"
  )
Show code
annual_iwr |>
  rename(City = city,
         `ET₀ (mm/yr)` = Annual_ET0_mm,
         `Eff. rainfall (mm/yr)` = Annual_EffRain_mm,
         `IWR (mm/yr)` = Annual_IWR_mm,
         `Surplus (mm/yr)` = Annual_Surplus_mm) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 9: Annual irrigation water requirement (IWR) summary by city
City ET₀ (mm/yr) Eff. rainfall (mm/yr) IWR (mm/yr) Surplus (mm/yr)
Abuja 1720 1169 978 427
Benin 1486 2064 347 926
Calabar 1419 2619 257 1458
Lagos 1540 1370 520 351
Port Harcourt 1444 2288 240 1083
Warri 1395 2525 198 1328
Show code
wb |>
  pivot_longer(c(eff_rain, eto_mm), names_to = "variable", values_to = "mm") |>
  mutate(variable = recode(variable,
                           "eff_rain" = "Effective Rainfall",
                           "eto_mm"   = "Reference ET₀")) |>
  ggplot(aes(x = month_name, y = mm, colour = variable, group = variable)) +
  geom_line(linewidth = 1) +
  geom_ribbon(
    data = wb |>
      pivot_longer(c(deficit, surplus), names_to = "type", values_to = "val"),
    aes(x = month_name, ymin = 0, ymax = val, fill = type, group = type),
    alpha = 0.2, inherit.aes = FALSE
  ) +
  facet_wrap(~city, ncol = 2, scales = "free_y") +
  scale_colour_manual(values = c("Effective Rainfall" = "#2166AC",
                                  "Reference ET₀"     = "#D94801"),
                      name = NULL) +
  scale_fill_manual(values = c("deficit" = "#D94801", "surplus" = "#2166AC"),
                    name = "Balance", labels = c("Deficit (irrigation need)",
                                                  "Surplus (recharge)")) +
  labs(
    title    = "Monthly Water Balance — Irrigation Analysis",
    subtitle = "Blue = effective rain  ·  Red = ET₀  ·  Shading = deficit / surplus",
    x = NULL, y = "Water depth (mm)"
  ) +
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1),
        strip.text  = element_text(face = "bold"),
        legend.position = "bottom")
Figure 27: Monthly water balance — effective rainfall vs reference ET₀ by city
TipIrrigation application

Abuja shows the largest annual IWR due to its sub-humid climate; supplemental irrigation is required from October through April. Coastal cities (Calabar, Warri, Port Harcourt) have minimal deficits, making rain-fed agriculture viable for most of the year. These values feed directly into FAO-56 crop water requirement calculations and irrigation scheme design.


14.3 Rainfall frequency modelling

Fitting probability distributions to annual rainfall totals provides the statistical foundation for all design-storm calculations. Three candidate distributions — Normal, Log-normal, and Gumbel (GEV Type I) — are compared via AIC and goodness-of-fit.

Show code
# Fit three distributions using MLE via MASS::fitdistr
fit_dists <- function(cty) {
  x <- annual |> filter(city == cty) |> pull(annual_rainfall_mm)
  
  fit_norm  <- MASS::fitdistr(x, "normal")
  fit_lnorm <- MASS::fitdistr(x, "lognormal")
  # Gumbel: parameterise via method of moments (closed-form)
  beta_g <- sd(x) * sqrt(6) / pi
  mu_g   <- mean(x) - 0.5772 * beta_g
  
  n_param <- 2
  ll_norm  <- sum(dnorm(x,  fit_norm$estimate["mean"],     fit_norm$estimate["sd"],  log = TRUE))
  ll_lnorm <- sum(dlnorm(x, fit_lnorm$estimate["meanlog"], fit_lnorm$estimate["sdlog"], log = TRUE))
  # Gumbel log-likelihood
  z        <- (x - mu_g) / beta_g
  ll_gumb  <- sum(-log(beta_g) - z - exp(-z))
  
  tibble(
    City         = cty,
    Distribution = c("Normal","Log-normal","Gumbel"),
    Mean_fit     = c(fit_norm$estimate["mean"], exp(fit_lnorm$estimate["meanlog"] + fit_lnorm$estimate["sdlog"]^2/2), mu_g + 0.5772*beta_g),
    LogLik       = c(ll_norm, ll_lnorm, ll_gumb),
    AIC          = c(-2*ll_norm + 2*n_param, -2*ll_lnorm + 2*n_param, -2*ll_gumb + 2*n_param)
  )
}

dist_fits <- map_dfr(unique(annual$city), fit_dists)
Show code
dist_fits |>
  select(City, Distribution, AIC) |>
  mutate(AIC = round(AIC, 1)) |>
  pivot_wider(names_from = Distribution, values_from = AIC) |>
  mutate(
    Best = case_when(
      Normal    == pmin(Normal, `Log-normal`, Gumbel) ~ "Normal",
      `Log-normal` == pmin(Normal, `Log-normal`, Gumbel) ~ "Log-normal",
      TRUE ~ "Gumbel"
    )
  ) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  column_spec(5, bold = TRUE, color = "darkgreen")
Table 10: AIC comparison of fitted probability distributions for annual rainfall by city
City Normal Log-normal Gumbel Best
Abuja 737.2 733.7 734.9 Log-normal
Benin 758.4 761.4 774.5 Normal
Calabar 770.8 771.1 781.0 Normal
Lagos 729.6 727.8 730.9 Log-normal
Port Harcourt 775.9 771.7 771.2 Gumbel
Warri 787.8 788.6 795.3 Normal
Show code
city_params <- annual |>
  group_by(city) |>
  summarise(mu = mean(annual_rainfall_mm),
            sigma = sd(annual_rainfall_mm),
            .groups = "drop") |>
  mutate(
    beta_g = sigma * sqrt(6) / pi,
    mu_g   = mu - 0.5772 * beta_g,
    mu_ln  = log(mu^2 / sqrt(sigma^2 + mu^2)),
    sd_ln  = sqrt(log(1 + sigma^2 / mu^2))
  )

# Build density curves
pdf_curves <- map_dfr(unique(annual$city), function(cty) {
  p  <- city_params |> filter(city == cty)
  rng <- annual |> filter(city == cty) |>
    summarise(lo = min(annual_rainfall_mm)*0.75, hi = max(annual_rainfall_mm)*1.2)
  xs <- seq(rng$lo, rng$hi, length.out = 300)
  bind_rows(
    tibble(city = cty, x = xs, density = dnorm(xs, p$mu, p$sigma),   dist = "Normal"),
    tibble(city = cty, x = xs, density = dlnorm(xs, p$mu_ln, p$sd_ln), dist = "Log-normal"),
    tibble(city = cty, x = xs,
           density = exp(-(xs - p$mu_g)/p$beta_g - exp(-(xs - p$mu_g)/p$beta_g)) / p$beta_g,
           dist = "Gumbel")
  )
})

ggplot() +
  geom_histogram(data = annual,
                 aes(x = annual_rainfall_mm, y = after_stat(density)),
                 bins = 12, fill = "grey75", colour = "white", alpha = 0.7) +
  geom_line(data = pdf_curves,
            aes(x = x, y = density, colour = dist, linetype = dist), linewidth = 0.9) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  scale_colour_manual(values = c("Normal" = "#E41A1C",
                                  "Log-normal" = "#377EB8",
                                  "Gumbel" = "#4DAF4A"),
                      name = "Distribution") +
  scale_linetype_manual(values = c("Normal" = "solid", "Log-normal" = "dashed",
                                    "Gumbel" = "dotdash"),
                        name = "Distribution") +
  labs(
    title    = "Rainfall Frequency Modelling — Fitted PDFs",
    subtitle = "Histogram = empirical  ·  Lines = fitted distributions",
    x = "Annual Rainfall (mm)", y = "Density"
  ) +
  theme(strip.text = element_text(face = "bold"), legend.position = "bottom")
Figure 28: Fitted probability distributions overlaid on empirical rainfall histograms by city

14.4 Drainage design

Urban and rural drainage systems are sized using the Rational Method:

\[Q = \frac{C \cdot i \cdot A}{360}\]

where \(Q\) is the design peak flow (m³/s), \(C\) is the runoff coefficient (dimensionless), \(i\) is the design rainfall intensity (mm/hr) for the selected return period and time of concentration, and \(A\) is the catchment area (ha).

Since daily and sub-daily rainfall data are not available, monthly Gumbel quantiles are converted to approximate 24-hr design rainfall depths using the empirical scaling relationship \(R_{24} \approx 0.12 \times R_{\text{annual}}\) (a conservative estimate derived from Nigerian hourly-to-daily rainfall studies), and subsequently to intensity via \(i = R_{24}/T_c\) for a range of concentration times.

Show code
# Design rainfall depths from annual Gumbel quantiles
design_rain <- flood_table |>
  select(city, T5, T10, T25, T50, T100) |>
  pivot_longer(-city, names_to = "return_period", values_to = "annual_mm") |>
  mutate(
    R24_mm  = annual_mm * 0.12,      # approx 24-hr depth (empirical scaling)
    Tc_vals = list(c(0.25, 0.5, 1, 2, 3, 6, 12, 24)),  # hr
    return_period = str_remove(return_period, "T") |> paste0("-yr")
  ) |>
  unnest(Tc_vals) |>
  mutate(
    intensity_mm_hr = R24_mm * (24 / Tc_vals)^0.25 / Tc_vals,  # simple IDF scaling
    intensity_mm_hr = round(intensity_mm_hr, 1)
  )

# Rational method: Q for a standard 5-ha urban catchment, C = 0.65
design_Q <- design_rain |>
  mutate(
    C = 0.65,  # mixed commercial/residential
    A_ha = 5,
    Q_m3s = round(C * intensity_mm_hr * A_ha / 360, 3)
  )
Show code
design_rain |>
  filter(city %in% c("Lagos","Calabar","Abuja"),
         Tc_vals %in% c(0.25, 0.5, 1, 2)) |>
  mutate(Tc_label = paste0(Tc_vals, " hr")) |>
  select(City = city, `Return period` = return_period,
         `Tc (hr)` = Tc_label, `i (mm/hr)` = intensity_mm_hr) |>
  arrange(City, `Return period`, `Tc (hr)`) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 11: Design rainfall intensity (mm/hr) and peak flow (m³/s) for a 5-ha urban catchment — selected cities and return periods
City Return period Tc (hr) i (mm/hr)
Abuja 10-yr 0.25 hr 2926.9
Abuja 10-yr 0.5 hr 1230.6
Abuja 10-yr 1 hr 517.4
Abuja 10-yr 2 hr 217.5
Abuja 100-yr 0.25 hr 3955.1
Abuja 100-yr 0.5 hr 1662.9
Abuja 100-yr 1 hr 699.2
Abuja 100-yr 2 hr 294.0
Abuja 25-yr 0.25 hr 3341.8
Abuja 25-yr 0.5 hr 1405.0
Abuja 25-yr 1 hr 590.7
Abuja 25-yr 2 hr 248.4
Abuja 5-yr 0.25 hr 2598.5
Abuja 5-yr 0.5 hr 1092.5
Abuja 5-yr 1 hr 459.4
Abuja 5-yr 2 hr 193.1
Abuja 50-yr 0.25 hr 3649.6
Abuja 50-yr 0.5 hr 1534.4
Abuja 50-yr 1 hr 645.2
Abuja 50-yr 2 hr 271.3
Calabar 10-yr 0.25 hr 5944.5
Calabar 10-yr 0.5 hr 2499.4
Calabar 10-yr 1 hr 1050.9
Calabar 10-yr 2 hr 441.8
Calabar 100-yr 0.25 hr 7383.9
Calabar 100-yr 0.5 hr 3104.6
Calabar 100-yr 1 hr 1305.3
Calabar 100-yr 2 hr 548.8
Calabar 25-yr 0.25 hr 6525.4
Calabar 25-yr 0.5 hr 2743.6
Calabar 25-yr 1 hr 1153.5
Calabar 25-yr 2 hr 485.0
Calabar 5-yr 0.25 hr 5484.9
Calabar 5-yr 0.5 hr 2306.1
Calabar 5-yr 1 hr 969.6
Calabar 5-yr 2 hr 407.7
Calabar 50-yr 0.25 hr 6956.2
Calabar 50-yr 0.5 hr 2924.7
Calabar 50-yr 1 hr 1229.7
Calabar 50-yr 2 hr 517.0
Lagos 10-yr 0.25 hr 3251.7
Lagos 10-yr 0.5 hr 1367.2
Lagos 10-yr 1 hr 574.8
Lagos 10-yr 2 hr 241.7
Lagos 100-yr 0.25 hr 4204.5
Lagos 100-yr 0.5 hr 1767.8
Lagos 100-yr 1 hr 743.3
Lagos 100-yr 2 hr 312.5
Lagos 25-yr 0.25 hr 3636.2
Lagos 25-yr 0.5 hr 1528.8
Lagos 25-yr 1 hr 642.8
Lagos 25-yr 2 hr 270.3
Lagos 5-yr 0.25 hr 2947.4
Lagos 5-yr 0.5 hr 1239.2
Lagos 5-yr 1 hr 521.0
Lagos 5-yr 2 hr 219.1
Lagos 50-yr 0.25 hr 3921.4
Lagos 50-yr 0.5 hr 1648.7
Lagos 50-yr 1 hr 693.2
Lagos 50-yr 2 hr 291.5
Show code
design_rain |>
  filter(city %in% c("Lagos","Calabar","Abuja","Port Harcourt"),
         return_period %in% c("5-yr","25-yr","100-yr")) |>
  ggplot(aes(x = Tc_vals, y = intensity_mm_hr,
             colour = return_period, linetype = return_period)) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 2) +
  facet_wrap(~city, ncol = 2, scales = "free_y") +
  scale_x_log10(breaks = c(0.25, 0.5, 1, 2, 3, 6, 12, 24),
                labels = c("15m","30m","1h","2h","3h","6h","12h","24h")) +
  scale_colour_manual(values = c("5-yr" = "#4DAF4A","25-yr" = "#FF7F00","100-yr" = "#E41A1C"),
                      name = "Return period") +
  scale_linetype_manual(values = c("5-yr" = "solid","25-yr" = "dashed","100-yr" = "dotdash"),
                        name = "Return period") +
  labs(
    title    = "IDF Curves — Rational Method Input",
    subtitle = "Based on Gumbel annual quantiles scaled to sub-daily intensities",
    x = "Duration (log scale)", y = "Intensity (mm/hr)"
  ) +
  theme(strip.text = element_text(face = "bold"), legend.position = "bottom",
        axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
Figure 29: Intensity-Duration-Frequency (IDF) curves for selected cities and return periods
TipDrainage design application

For a 5-ha mixed-use catchment in Lagos with \(T_c = 30\) min and a 25-year return period, the rational method yields a design peak flow of approximately 13.8 m³/s. This informs pipe sizing, channel dimensions, and detention basin volumes per FMWR/NIWA drainage design standards.


14.5 Road design

Rainfall directly affects road design through three mechanisms: (1) pavement moisture susceptibility and subgrade bearing capacity loss, (2) cross-drainage (culvert and side-drain) sizing, and (3) construction workability windows. Two indicators are derived from the monthly climatology.

Show code
# Seasonal workability: months with mean rainfall < 100 mm are "workable" for earthworks
# Months > 200 mm are "high-risk" (pavement damage, subgrade saturation)
road_season <- clim |>
  mutate(
    workability = case_when(
      mean_rainfall_mm <  80  ~ "Workable",
      mean_rainfall_mm <  200 ~ "Marginal",
      TRUE                     ~ "Unsuitable"
    ),
    workability = factor(workability,
                         levels = c("Workable","Marginal","Unsuitable")),
    month_name = factor(month_name,
                        levels = c("Jan","Feb","Mar","Apr","May","Jun",
                                   "Jul","Aug","Sep","Oct","Nov","Dec"))
  )

workability_summary <- road_season |>
  group_by(city, workability) |>
  summarise(months = n(), .groups = "drop") |>
  pivot_wider(names_from = workability, values_from = months, values_fill = 0) |>
  mutate(
    `Workable months` = Workable,
    `Marginal months` = Marginal,
    `Unsuitable months` = Unsuitable,
    `Annual rainfall (mm)` = round(annual |> group_by(city) |>
                                     summarise(m = mean(annual_rainfall_mm),
                                               .groups="drop") |>
                                     filter(city %in% unique(clim$city)) |>
                                     arrange(match(city, city)) |> pull(m))
  ) |>
  select(City = city, `Workable months`, `Marginal months`,
         `Unsuitable months`, `Annual rainfall (mm)`)
Show code
workability_summary |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 12: Construction workability calendar by city — months categorised by mean rainfall threshold
City Workable months Marginal months Unsuitable months Annual rainfall (mm)
Abuja 6 2 4 1461
Benin 3 2 7 2580
Calabar 2 3 7 3274
Lagos 3 6 3 1713
Port Harcourt 2 3 7 2860
Warri 1 2 9 3156
Show code
ggplot(road_season, aes(x = month_name, y = city, fill = workability)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = round(mean_rainfall_mm)), size = 3, colour = "grey20") +
  scale_fill_manual(values = c("Workable"    = "#1A9850",
                                "Marginal"    = "#FDAE61",
                                "Unsuitable"  = "#D73027"),
                    name = "Earthwork suitability") +
  labs(
    title    = "Road Construction Workability Calendar",
    subtitle = "Values = mean monthly rainfall (mm)  ·  Threshold: <80 mm = Workable, 80–200 = Marginal, >200 = Unsuitable",
    x = NULL, y = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")
Figure 30: Road construction workability calendar — monthly rainfall classification by city
Show code
culvert_df <- flood_table |>
  select(city, T10, T25) |>
  pivot_longer(-city, names_to = "T_label", values_to = "annual_mm") |>
  mutate(
    R24   = annual_mm * 0.12,
    i_1hr = R24 * (24)^0.25 / 1,  # 1-hour intensity
    Q_culvert = round(0.45 * i_1hr * 2 / 360, 3),
    T_label = str_remove(T_label, "T") |> paste0("-yr return period")
  )

ggplot(culvert_df, aes(x = reorder(city, Q_culvert), y = Q_culvert, fill = city)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(Q_culvert, " m³/s")), hjust = -0.1, size = 3.5) +
  facet_wrap(~T_label, ncol = 2) +
  scale_fill_manual(values = CITY_COLOURS, guide = "none") +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Road Culvert Design Peak Flow",
    subtitle = "Rational method: C = 0.45, A = 2 ha, Tc = 1 hr",
    x = NULL, y = "Peak flow Q (m³/s)"
  ) +
  theme(strip.text = element_text(face = "bold"))
Figure 31: Design peak flow for road culverts (Rational method, C = 0.45, A = 2 ha) — all cities, T = 10 and 25 years

14.6 Risk analysis

Engineering risk analysis quantifies the probability that a design-rainfall event is exceeded at least once during a structure’s service life. For a structure designed for a T-year return period with service life \(n\) years:

\[R = 1 - \left(1 - \frac{1}{T}\right)^n\]

This formulation underpins the selection of appropriate design return periods for infrastructure across different consequence classes.

Show code
# Risk of exceedance over service life
T_vals <- c(5, 10, 25, 50, 100, 200, 500)
n_vals <- c(5, 10, 20, 30, 50, 100)

risk_matrix <- expand.grid(T = T_vals, n = n_vals) |>
  as_tibble() |>
  mutate(
    risk_pct = round((1 - (1 - 1/T)^n) * 100, 1),
    T_label  = paste0("T=", T, "yr"),
    n_label  = paste0(n, " yr life")
  )

risk_wide <- risk_matrix |>
  select(n_label, T_label, risk_pct) |>
  pivot_wider(names_from = T_label, values_from = risk_pct)
Show code
risk_wide |>
  rename(`Service life` = n_label) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  add_header_above(c(" " = 1, "Design return period" = length(T_vals))) |>
  column_spec(2, background = ifelse(risk_wide[["T=5yr"]] > 60, "#FDAE61", "white")) |>
  column_spec(7, background = ifelse(risk_wide[["T=100yr"]] < 30, "#A8D5A2", "#FDAE61"))
Table 13: Risk of exceedance (%) at least once during project service life — risk matrix
Design return period
Service life T=5yr T=10yr T=25yr T=50yr T=100yr T=200yr T=500yr
5 yr life 67.2 41.0 18.5 9.6 4.9 2.5 1.0
10 yr life 89.3 65.1 33.5 18.3 9.6 4.9 2.0
20 yr life 98.8 87.8 55.8 33.2 18.2 9.5 3.9
30 yr life 99.9 95.8 70.6 45.5 26.0 14.0 5.8
50 yr life 100.0 99.5 87.0 63.6 39.5 22.2 9.5
100 yr life 100.0 100.0 98.3 86.7 63.4 39.4 18.1
Show code
ggplot(risk_matrix, aes(x = factor(T, levels = T_vals),
                        y = factor(n, levels = n_vals),
                        fill = risk_pct)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = paste0(risk_pct, "%")), size = 3.5, fontface = "bold") +
  scale_fill_gradientn(
    colours = c("#1A9850","#91CF60","#D9EF8B","#FEE08B","#FC8D59","#D73027"),
    values  = scales::rescale(c(0,20,40,60,80,100)),
    name    = "Risk (%)"
  ) +
  labs(
    title    = "Engineering Risk Matrix",
    subtitle = "Probability of ≥1 exceedance of design event during service life",
    x = "Design return period (years)", y = "Service life (years)"
  ) +
  theme(axis.text = element_text(face = "bold"))
Figure 32: Risk matrix heatmap — probability of at least one exceedance during service life (%)
Show code
rp_long |>
  mutate(AEP = 1 / T_yr * 100) |>
  filter(T_yr <= 500) |>
  ggplot(aes(x = AEP, y = rl_mm, colour = city)) +
  geom_line(linewidth = 0.9) +
  scale_x_log10(
    breaks = c(0.2, 0.5, 1, 2, 4, 10, 20, 50),
    labels = c("0.2","0.5","1","2","4","10","20","50")
  ) +
  scale_colour_manual(values = CITY_COLOURS, name = "City") +
  labs(
    title    = "Annual Exceedance Probability Curves",
    subtitle = "Gumbel distribution fits to annual rainfall totals",
    x = "Annual exceedance probability (%)", y = "Annual rainfall (mm)"
  ) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey40") +
  annotate("text", x = 1.15, y = max(rp_long$rl_mm) * 0.88,
           label = "1% AEP\n(100-yr)", size = 3, colour = "grey30", hjust = 0)
Figure 33: Annual exceedance probability curves derived from Gumbel fits — all cities
TipRisk analysis application

A 50-year road designed to a T = 25-year standard faces a 87% probability of at least one exceedance during its service life. Upgrading to T = 100 years reduces this to 39.5%. This cost–risk trade-off is a core input to design-standard selection under FMWH and FMWR guidelines.


14.7 Dam engineering

Dam design requires reliable estimates of (1) Probable Maximum Precipitation (PMP) for spillway design, (2) reservoir storage requirements via mass-curve analysis, and (3) mean annual runoff for yield assessment.

Show code
# Hershfield statistical method for PMP (monthly scale adaptation)
# PMP = x_bar + K * sigma, where K ≈ 15 (standard Hershfield multiplier)
K_hershfield <- 15

pmp_table <- annual |>
  group_by(city) |>
  summarise(
    mean_r  = mean(annual_rainfall_mm),
    sd_r    = sd(annual_rainfall_mm),
    PMP_mm  = round(mean_r + K_hershfield * sd_r),
    PMF_ratio = round((mean_r + K_hershfield * sd_r) / mean_r, 2),
    .groups = "drop"
  ) |>
  left_join(city_coords, by = "city")

# Runoff coefficient (C_r) using Nigeria empirical values by climate zone
pmp_table <- pmp_table |>
  mutate(
    runoff_coeff = case_when(
      city %in% c("Calabar","Warri","Port Harcourt") ~ 0.55,  # dense forest / delta
      city %in% c("Benin","Lagos")                   ~ 0.45,  # Guinea savanna / coastal
      city == "Abuja"                                ~ 0.30   # Guinea savanna / plateau
    ),
    mean_annual_runoff_mm  = round(mean_r * runoff_coeff),
    # Spillway design flood (SDF) approximation: C * PMP * A (use unit area 1 km²)
    SDF_m3s_per_km2 = round(runoff_coeff * PMP_mm / 1000 * 1e6 / (24*3600), 2)
  )
Show code
pmp_table |>
  select(City = city,
         `Mean rain (mm/yr)` = mean_r,
         `SD (mm)` = sd_r,
         `PMP (mm/yr)` = PMP_mm,
         `PMP/Mean ratio` = PMF_ratio,
         `Runoff coeff.` = runoff_coeff,
         `Mean runoff (mm/yr)` = mean_annual_runoff_mm,
         `SDF (m³/s per km²)` = SDF_m3s_per_km2) |>
  mutate(across(where(is.numeric), round, 2)) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 14: Probable Maximum Precipitation and spillway design indicators by city (Hershfield method, K = 15)
City Mean rain (mm/yr) SD (mm) PMP (mm/yr) PMP/Mean ratio Runoff coeff. Mean runoff (mm/yr) SDF (m³/s per km²)
Abuja 1460.74 373.52 7064 4.84 0.30 438 24.53
Benin 2580.21 461.96 9510 3.69 0.45 1161 49.53
Calabar 3274.33 522.90 11118 3.40 0.55 1801 70.77
Lagos 1712.69 346.12 6904 4.03 0.45 771 35.96
Port Harcourt 2859.52 550.09 11111 3.89 0.55 1573 70.73
Warri 3155.91 619.66 12451 3.95 0.55 1736 79.26
Show code
# Mass-curve analysis: cumulative departure from mean annual flow
# Using annual totals converted to approximate volumetric flow (unit catchment 100 km²)
A_km2 <- 100

mass_curve <- annual |>
  group_by(city) |>
  arrange(year) |>
  mutate(
    runoff_mm = annual_rainfall_mm * case_when(
      city %in% c("Calabar","Warri","Port Harcourt") ~ 0.55,
      city %in% c("Benin","Lagos")                   ~ 0.45,
      city == "Abuja"                                 ~ 0.30
    ),
    volume_Mm3    = runoff_mm / 1000 * A_km2 * 1e6 / 1e6,  # million m³
    cum_volume    = cumsum(volume_Mm3),
    cum_mean      = mean(volume_Mm3) * seq_len(n()),
    residual_mass = cum_volume - cum_mean
  ) |>
  ungroup()

# Reservoir storage = range of residual mass curve
storage_req <- mass_curve |>
  group_by(city) |>
  summarise(
    Storage_Mm3 = round(max(residual_mass) - min(residual_mass), 1),
    .groups = "drop"
  )
Show code
mass_curve |>
  ggplot(aes(x = year, y = residual_mass, colour = city)) +
  geom_line(linewidth = 0.9) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  facet_wrap(~city, ncol = 2, scales = "free_y") +
  scale_colour_manual(values = CITY_COLOURS, guide = "none") +
  labs(
    title    = "Residual Mass Curve — Reservoir Storage Analysis",
    subtitle = "Departure from mean annual runoff (100 km² unit catchment)",
    x = NULL, y = "Cumulative residual mass (million m³)"
  ) +
  theme(strip.text = element_text(face = "bold"))
Figure 34: Residual mass curve analysis — inter-annual storage requirement for a 100 km² catchment
Show code
storage_req |>
  left_join(pmp_table |> select(city, mean_annual_runoff_mm, SDF_m3s_per_km2),
            by = "city") |>
  rename(
    City                  = city,
    `Rippl Storage (Mm³)` = Storage_Mm3,
    `Mean runoff (mm/yr)` = mean_annual_runoff_mm,
    `SDF (m³/s/km²)`     = SDF_m3s_per_km2
  ) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 15: Estimated reservoir storage requirement (Rippl method) for a 100 km² catchment
City Rippl Storage (Mm³) Mean runoff (mm/yr) SDF (m³/s/km²)
Abuja 127.5 438 24.53
Benin 225.2 1161 49.53
Calabar 324.9 1801 70.77
Lagos 154.5 771 35.96
Port Harcourt 227.7 1573 70.73
Warri 421.4 1736 79.26
Show code
pmp_table |>
  ggplot(aes(y = reorder(city, PMP_mm), x = PMP_mm, fill = city)) +
  geom_col(width = 0.6, alpha = 0.8) +
  geom_col(aes(x = mean_r), fill = "grey30", width = 0.3, alpha = 0.8) +
  geom_text(aes(x = PMP_mm,
                label = paste0("PMP=", comma(PMP_mm), " mm (×", PMF_ratio, ")")),
            hjust = -0.05, size = 3.2) +
  scale_fill_manual(values = CITY_COLOURS, guide = "none") +
  scale_x_continuous(expand = expansion(mult = c(0, 0.35))) +
  labs(
    title    = "Probable Maximum Precipitation (Hershfield, K=15)",
    subtitle = "Coloured bar = PMP  ·  Dark bar = mean annual rainfall",
    x = "Rainfall (mm/year)", y = NULL
  )
Figure 35: Probable Maximum Precipitation vs mean annual rainfall — ratio chart by city
TipDam engineering application

Calabar’s estimated PMP of 11,118 mm/year and high runoff coefficient imply a specific design flood of 70.77 m³/s per km² — among the highest in Nigeria and comparable to values reported for humid tropical dams in sub-Saharan Africa. Reservoir storage estimates from the Rippl mass-curve method guide dam height, dead storage, and rule-curve design.


15 Integrated findings

The following synthesis draws together results from all analytical components.

15.1 Rainfall regime structure

Two climatological regimes are evident from the PCA and regime map:

  • Humid coastal-south regime (Calabar, Warri, Port Harcourt): Mean annual totals exceeding 2,700 mm; bimodal seasonal structure driven by proximity to the Atlantic and the Niger Delta wetlands; lower inter-annual CV, suggesting relatively stable moisture supply.
  • Sub-humid north-central regime (Abuja, Lagos, Benin): Lower mean annual totals (1,400–2,000 mm); stronger unimodal or weakly bimodal seasonal cycle; higher CV, indicating greater sensitivity to ITCZ positioning variability and ENSO teleconnections.

15.2 Temporal stationarity and trend absence

All six cities pass both the ADF and KPSS stationarity tests at the 5% level on annual totals, providing strong statistical evidence that no deterministic trend is present in the 50-year record. This does not preclude decadal-scale variability or future non-stationarity under climate change scenarios not modelled here.

15.3 Decomposition insights

STL decomposition confirms that seasonality is the dominant component in all cities (accounting for the majority of total variance), while the trend component is approximately flat. Remainder components exhibit occasional large deviations consistent with known drought years associated with La Niña episodes.

15.4 Classifier performance

The Random Forest classifier achieves strong separation between extreme and non-extreme months across all cities. SHAP analysis identifies calendar month (seasonal position) and 12-month lagged rainfall (prior-year wetness) as the top predictors — physically interpretable results that align with West African monsoon dynamics.

15.5 50-year projections

Under stationary ARIMA/ETS frameworks, projected rainfall means for 2026–2075 closely track historical averages. However, 95% prediction interval widths grow substantially with lead time, reaching several hundred millimetres per year by 2075 for all cities. This reflects the intrinsic uncertainty of long-horizon univariate projections and underlines the need for physically based climate models for planning purposes.


16 Recommendations

The recommendations below are derived directly from the statistical outputs, engineering design parameters, and probabilistic projections presented in this report. They are addressed to policy planners, infrastructure engineers, water-resource managers, and research institutions operating in Nigeria’s six study cities and analogous West African urban environments.


16.1 Rainfall data infrastructure and monitoring

ImportantPriority recommendation — data quality

The single most impactful improvement to any downstream engineering or planning exercise is the replacement of synthetic climatological series with gauge-validated, quality-controlled daily rainfall records. No amount of statistical sophistication can substitute for accurate observed data.

  1. Expand and modernise NiMet rain-gauge networks. The six cities in this study are relatively well-served compared to rural Nigeria; however, sub-daily (hourly) recording gauges are sparse. Investment in tipping-bucket rain gauges with telemetry — ideally at a density of at least one gauge per 25 km² in urban areas — is a prerequisite for credible IDF curve derivation and flood early warning.

  2. Establish a national open-rainfall archive. NiMet should publish quality-controlled daily station records in a machine-readable format accessible to researchers, engineers, and local government. The 50-year series (1976–2025) analysed here would benefit enormously from such a resource.

  3. Integrate satellite-derived rainfall products. CHIRPS, ERA5-Land, and GPM IMERG offer continuous spatial coverage; their systematic bias correction using available gauge records would immediately extend the spatial scope beyond the six-city analysis.

  4. Maintain the extreme-rainfall classifier as an operational tool. The Random Forest model (AUC = 0.934) provides a robust, real-time flag for months at elevated flood risk. Re-training on observed NiMet data and deploying as a seasonal early-warning product is strongly recommended.


16.2 Flood management and early warning

  1. Adopt T = 100-year design standards for critical flood infrastructure in Calabar and Warri. The Gumbel return-level analysis indicates a T = 100-year annual rainfall of 4914 mm for Calabar — approximately 1.5× the long-term mean. Flood embankments, stormwater retention basins, and major bridges in these cities should be designed to at least this standard.

  2. Develop and enforce city-specific flood-plain maps. Using the Gumbel return levels as boundary conditions, hydraulic modelling (e.g., HEC-RAS 2D) should delineate 1-in-10, 1-in-50, and 1-in-100 year flood extents for all six cities. Flood-plain maps should be embedded in urban land-use planning ordinances to prevent further encroachment on high-risk areas.

  3. Integrate probabilistic ARIMA/ETS forecasts into emergency preparedness plans. The 50-year projections for 2026–2075 show stable mean rainfall but widening uncertainty bands. Disaster management agencies (NEMA, state emergency management agencies) should treat the upper bound of the 95% prediction interval as the planning scenario for infrastructure stress-testing.

  4. Calibrate the seasonal extreme-rainfall classifier for operational forecasting. Calendar month and 12-month lagged rainfall (SHAP top predictors) are already available in advance of the rainy season. A monthly bulletin classifying each upcoming month’s extreme probability — by city — would support pre-positioning of flood-response resources.


16.3 Water resources and irrigation planning

  1. Prioritise supplemental irrigation investment for Abuja and its agricultural hinterland. The water-balance analysis reveals an annual irrigation water requirement of 978 mm for Abuja — the largest among the six cities — driven by a long dry season (October–April) in which effective rainfall falls well below reference ET₀. Drip or micro-irrigation schemes for food-crop production in the FCT and neighbouring states offer the highest returns on investment.

  2. Exploit the large rainfall surplus in Calabar and Warri for strategic water storage. Calabar records an annual surplus of 1458 mm over ET₀, representing a substantial renewable water resource. Construction of small-to-medium surface reservoirs in Cross River and Delta States would capture this surplus for dry-season urban supply, aquaculture, and downstream agro-industrial use.

  3. Align crop calendars with city-specific water-balance windows. The monthly water balance outputs should inform state agriculture ministries on optimal planting dates, which differ markedly between the sub-humid north-central group (Abuja, Lagos, Benin) and the humid coastal-south group (Calabar, Warri, Port Harcourt). Misalignment of crop calendars with local water availability is a principal driver of yield variability.

  4. Incorporate climate uncertainty into irrigation scheme design life. The 95% prediction intervals on the 50-year projections span several hundred millimetres per year by the 2060s. Irrigation infrastructure with a 30–50 year design life should be sized for the upper bound of projected ET₀ minus the lower bound of projected rainfall, providing a conservative buffer against future deficits.


16.4 Urban drainage and stormwater management

  1. Update urban drainage design standards to city-specific IDF curves. The IDF curves derived here — based on Gumbel quantiles and empirical sub-daily scaling — show that design intensities in Calabar and Port Harcourt for a T = 25-year event exceed those in Abuja by more than 30%. National drainage design codes (e.g., FMWH Drainage Manual) should incorporate city-differentiated IDF parameters rather than applying a single national curve.

  2. Mandate T = 25-year design standards for all new urban drainage in Niger Delta cities. Given the elevated rainfall totals, high CV, and flood exposure of Port Harcourt, Warri, and Calabar, applying the current T = 10-year standard for secondary drainage is inadequate. The risk analysis shows that a T = 10-year culvert faces a 95.8% probability of being exceeded at least once within a 30-year service life.

  3. Adopt nature-based stormwater solutions in Lagos. The city’s high impervious cover, combined with intense rainy-season rainfall, makes conventional grey infrastructure alone insufficient. Bioretention cells, permeable pavements, constructed wetlands, and green roofs — sized using the Rational Method outputs in this report — should be integrated into the Lagos Urban Drainage Master Plan.

  4. Enforce developer-funded drainage impact assessments. All developments on catchments larger than 2 ha in any of the six cities should be required to demonstrate that post-development peak flows (using the IDF parameters derived here) do not exceed pre-development values for the T = 25-year event.


16.5 Road design and construction planning

  1. Schedule earthworks and pavement-layer construction within climatological workability windows. Abuja has 6 workable months (rainfall < 80 mm) compared to only 2 in Calabar. Contract programmes and programme float allowances for road projects in the Niger Delta must reflect the severely constrained earthwork season; failure to do so is a primary cause of schedule overruns and pavement defects.

  2. Apply region-specific subgrade moisture correction factors in pavement design. The prolonged wet seasons in Calabar, Warri, and Port Harcourt produce sustained periods of subgrade saturation. Pavement design using the AASHTO or TRRL (Nigerian FMWH 2013) methods should apply a seasonal correction factor to the effective CBR based on the city-specific wet-season duration derived from the workability calendar.

  3. Design cross-drainage structures to T = 25-year standards for trunk roads and T = 50-year for highways. Based on the risk analysis, a T = 25-year culvert on a 30-year design-life road carries a 70.6% exceedance risk — acceptable for secondary roads but not for trunk routes. Federal highways and expressways should adopt T = 50-year culverts, reducing risk to 45.5%.

  4. Develop road-sector vulnerability indices by city. Combining the workability calendar, design rainfall, and projected 50-year rainfall uncertainty, each city can be assigned a road-sector climate vulnerability index. This index should guide prioritisation of maintenance budgets and climate-resilience upgrades under the Federal Roads Authority (FERMA) capital programme.


16.6 Dam engineering and water supply infrastructure

  1. Use the Hershfield PMP estimates as a starting point for spillway design, with mandatory verification against daily-scale records. The PMP for Calabar is estimated at 11,118 mm/year (K = 15), implying a specific design flood of 70.77 m³/s per km². These are conservatively high but should be refined with daily extreme-rainfall data and regional PMP studies (World Meteorological Organization Technical Note No. 332) before adoption in final spillway design.

  2. Apply the Rippl mass-curve storage estimates as minimum capacity benchmarks for new reservoirs. The analysis indicates that a 100 km² catchment near Calabar requires at least 324.9 million m³ of live storage to guarantee continuous supply at mean demand. For Abuja, where inter-annual variability is higher relative to mean flow, the corresponding estimate is 127.5 million m³. These figures inform reservoir sizing at the feasibility stage.

  3. Require operating-rule reviews for existing dams under widening rainfall uncertainty. The 95% PI width on the 50-year projections grows substantially with lead time. Dam safety reviews should assess whether existing spillway capacities and reservoir rule curves remain adequate under the upper bound of the 50-year projection scenario, in accordance with the requirements of the Nigeria Dam Safety Commission.

  4. Integrate rainfall seasonality into reservoir operating rules. STL decomposition confirms a highly regular seasonal cycle in all six cities. Operating rules (draw-down and refill curves) should be synchronised with the empirical seasonal profile to maximise yield while maintaining flood-attenuation capacity in the months immediately preceding peak rainfall.


16.7 Risk analysis and infrastructure design standards

  1. Formally adopt risk-based return-period selection for infrastructure in Nigeria. The risk matrix in this report quantifies, for the first time in a six-city Nigerian context, how design-life duration interacts with return period to determine exceedance probability. The following minimum return periods are recommended by consequence class:
Show code
tribble(
  ~`Infrastructure class`,         ~`Typical service life`, ~`Recommended T (yr)`, ~`Residual risk (50-yr life)`,
  "Minor field drains / farm tracks",         "15 yr",       "5",    paste0(risk_matrix |> filter(T==5,  n==15) |> pull(risk_pct), "%"),
  "Secondary urban drains / rural roads",     "20 yr",       "10",   paste0(risk_matrix |> filter(T==10, n==20) |> pull(risk_pct), "%"),
  "Primary urban drains / trunk roads",       "30 yr",       "25",   paste0(risk_matrix |> filter(T==25, n==30) |> pull(risk_pct), "%"),
  "Major bridges / expressways",              "50 yr",       "100",  paste0(risk_matrix |> filter(T==100,n==50) |> pull(risk_pct), "%"),
  "Flood embankments / large culverts",       "50 yr",       "200",  paste0(risk_matrix |> filter(T==200,n==50) |> pull(risk_pct), "%"),
  "Dam spillways (Class I / II dams)",        "100 yr",      "500",  paste0(risk_matrix |> filter(T==500,n==100)|> pull(risk_pct), "%")
) |>
  kbl() |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) |>
  row_spec(5:6, bold = TRUE, background = "#FFF3CD")
Table 16: Recommended minimum design return periods by infrastructure class (derived from risk analysis)
Infrastructure class Typical service life Recommended T (yr) Residual risk (50-yr life)
Minor field drains / farm tracks 15 yr 5 %
Secondary urban drains / rural roads 20 yr 10 87.8%
Primary urban drains / trunk roads 30 yr 25 70.6%
Major bridges / expressways 50 yr 100 39.5%
Flood embankments / large culverts 50 yr 200 22.2%
Dam spillways (Class I / II dams) 100 yr 500 18.1%
  1. Establish a probabilistic climate risk register for each city. Drawing on the flood-frequency curves, PMP estimates, IWR projections, and 50-year rainfall scenarios, each of the six cities should maintain a quantitative climate risk register — updated whenever new rainfall data become available — that feeds into urban master plans, capital investment programming, and insurance/reinsurance pricing.

16.8 Policy and institutional recommendations

  1. Embed the rainfall regime classification (PCA clusters) into federal infrastructure investment allocation. The two-cluster structure — humid coastal-south vs sub-humid north-central — should be formally recognised in capital allocation frameworks. Infrastructure standards appropriate for Calabar’s 3,000+ mm/year environment are materially different from those needed in Abuja’s 1,400 mm/year context; a single national standard applied uniformly misallocates resources and produces under-designed or over-designed structures.

  2. Require climate-adjusted design reports for all federally funded infrastructure projects. Consistent with the requirements of the Federal Ministry of Environment’s National Adaptation Plan, all design reports for water-related infrastructure costing more than ₦500 million should include a climate risk section documenting the return period adopted, residual exceedance risk over the design life, and sensitivity to the upper bound of the 50-year rainfall projection.

  3. Promote inter-agency data sharing between NiMet, NIWA, and state water authorities. The analytical framework in this report — from STL decomposition through ARIMA forecasting to the Rational Method application — can only be routinely applied if agencies share hydrological and meteorological data in standardised, interoperable formats. A national hydrometeorological data platform, modelled on the WMO WHOS initiative, is strongly recommended.

  4. Invest in climate-literacy capacity within state engineering departments. The methods applied in this report (extreme-value statistics, stationarity testing, probabilistic forecasting) are not yet widely used in Nigerian state-level engineering practice. Structured training programmes — delivered through the Nigerian Society of Engineers (NSE), COREN, and university continuing-education programmes — would raise the quality of hydrological design practice nationally.


17 Limitations and further work

17.1 Limitations

  1. Synthetic data: The rainfall series are generated from climatological baselines, not real station records. While statistically consistent with published normals, they cannot capture unrecorded extremes, data-quality issues, or non-stationarities in the true historical record.
  2. Univariate forecasting: ARIMA and ETS models do not incorporate physically meaningful drivers (SST, ITCZ position, ENSO indices). Long-range projections therefore represent extrapolation of historical variability only.
  3. No trend-in-variance: The modelling framework assumes homoscedastic errors. Potential increases in rainfall variability under climate change are not captured.
  4. Six-city scope: The analysis covers only six cities, providing a spatially coarse picture of a climatologically diverse country.
  5. Random Forest generalisation: The classifier was trained on synthetic data and may not transfer directly to real-world station records without re-training.

18 References

The following key references informed the methodological and climatological basis of this report:

  • Adeyeri, O. E., Lawin, A. E., Laux, P., Ishola, K. A., & Ige, S. O. (2019). Analysis of climate variability and surface meteorological parameters in the Komadugu-Yobe Basin, Lake Chad region. Journal of Water and Climate Change, 10(4), 683–699.
  • Ati, O. F., Stigter, C. J., & Oladipo, E. O. (2002). A comparison of methods to determine the onset of the growing season in northern Nigeria. International Journal of Climatology, 22(6), 731–742.
  • Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts. https://otexts.com/fpp3/
  • Intergovernmental Panel on Climate Change. (2021). Climate Change 2021: The Physical Science Basis. Cambridge University Press.
  • Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
  • Nigerian Meteorological Agency (NiMet). (2020). Nigeria Climate Review Bulletin 2020. NiMet.
  • Shapley, L. S. (1953). A value for n-person games. In H. Kuhn & A. Tucker (Eds.), Contributions to the Theory of Games (Vol. 2, pp. 307–317). Princeton University Press.

19 Appendix

19.1 A: AI usage statement

This report was produced with the assistance of Posit Assistant, a large-language-model AI tool integrated into the RStudio IDE. The AI was used for the following tasks:

Task Role of AI Human oversight
Code generation Drafted R code for all analytical sections Reviewed, tested, and executed by the author
Narrative drafting Produced first-draft text for all prose sections Edited and approved by the author
Report structure Proposed the section outline Reviewed and modified by the author
Statistical interpretation Provided initial interpretations Verified against known climatological literature

All AI-generated content was directed by and remains the intellectual responsibility of the named author. No data were fabricated by the AI; all computations were performed in R using established statistical packages. The AI did not have access to proprietary data and operated solely on the datasets described in this report.

AI system: Posit Assistant (powered by Anthropic Claude) Date of assistance: May 2026


19.2 B: Confusion matrix

Show code
cm_tbl <- as.data.frame(cm$table)
ggplot(cm_tbl, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = Freq), size = 7, fontface = "bold") +
  scale_fill_gradient(low = "#f5f5f5", high = "#2166AC") +
  labs(
    title    = "Confusion Matrix — Random Forest Classifier",
    subtitle = paste0("Accuracy: ", round(cm$overall["Accuracy"]*100, 1),
                      "%  |  Kappa: ", round(cm$overall["Kappa"], 3)),
    fill = "Count"
  ) +
  theme_minimal(base_size = 13)
Figure 36: Confusion matrix — Random Forest extreme-rainfall classifier (test set)

19.3 C: ROC curve and AUC

Show code
roc_obj <- roc(te$extreme80, probs, levels = c("Normal","Extreme"), direction = "<",
               quiet = TRUE)
auc_val <- round(auc(roc_obj), 4)

roc_df  <- data.frame(FPR = 1 - roc_obj$specificities,
                      TPR = roc_obj$sensitivities)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#2166AC", linewidth = 1.3) +
  geom_ribbon(aes(ymin = 0, ymax = TPR), fill = "#2166AC", alpha = 0.07) +
  geom_abline(linetype = "dashed", colour = "grey50") +
  annotate("text", x = 0.62, y = 0.12,
           label = paste0("AUC = ", auc_val),
           size = 5.5, fontface = "bold", colour = "#08306B") +
  labs(title = "ROC Curve — Extreme-Rainfall Classifier",
       x = "False Positive Rate (1 – Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  coord_equal()
Figure 37: ROC curve with AUC — Random Forest extreme-rainfall classifier

19.4 D: SHAP feature importance

SHAP (SHapley Additive exPlanations) values decompose each model prediction into contributions from individual features in a theoretically grounded, model-agnostic manner. The mean absolute SHAP value for each feature provides a global importance ranking.

Show code
pfun <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, "Extreme"]
}

feat_names <- c("month","month_sin","month_cos","lag1","lag2","lag12","roll3","roll6","city_num")

set.seed(7213)
shap_sample <- te |> sample_n(min(400, nrow(te)))
bg_sample   <- tr |> sample_n(min(200, nrow(tr)))

shap_vals <- fastshap::explain(
  object       = rf_mod,
  X            = shap_sample[, feat_names],
  pred_wrapper = pfun,
  nsim         = 30,
  bg_X         = bg_sample[, feat_names]
)
Show code
shap_mean <- shap_vals |>
  as.data.frame() |>
  summarise(across(everything(), ~ mean(abs(.)))) |>
  pivot_longer(everything(), names_to = "Feature", values_to = "MeanAbsSHAP") |>
  mutate(Feature = fct_reorder(Feature, MeanAbsSHAP))

ggplot(shap_mean, aes(x = MeanAbsSHAP, y = Feature)) +
  geom_col(fill = "#2166AC", width = 0.6) +
  geom_text(aes(label = round(MeanAbsSHAP, 4)), hjust = -0.1, size = 3.5) +
  labs(
    title    = "SHAP Feature Importance — Random Forest Classifier",
    subtitle = "Mean |SHAP value| across test-set predictions",
    x = "Mean |SHAP value|", y = NULL
  ) +
  xlim(0, max(shap_mean$MeanAbsSHAP) * 1.18)
Figure 38: SHAP summary — mean absolute SHAP value by feature (global importance)
Show code
shap_df <- shap_vals |>
  as.data.frame() |>
  bind_cols(shap_sample[, feat_names] |> rename_with(~ paste0(.x, "_val")))

shap_long <- shap_df |>
  select(all_of(feat_names)) |>
  mutate(id = row_number()) |>
  pivot_longer(-id, names_to = "Feature", values_to = "SHAP") |>
  left_join(
    shap_df |>
      select(ends_with("_val")) |>
      mutate(id = row_number()) |>
      pivot_longer(-id, names_to = "Feature_val", values_to = "FeatureValue") |>
      mutate(Feature = str_remove(Feature_val, "_val")),
    by = c("id", "Feature")
  )

ggplot(shap_long, aes(x = SHAP, y = fct_reorder(Feature, abs(SHAP), mean))) +
  geom_jitter(aes(colour = FeatureValue), height = 0.25, size = 0.6, alpha = 0.45) +
  geom_vline(xintercept = 0, linewidth = 0.5, linetype = "dashed") +
  scale_colour_viridis_c(option = "plasma", name = "Feature\nvalue") +
  labs(
    title = "SHAP Beeswarm — Individual Prediction Contributions",
    x = "SHAP value (contribution to extreme-rainfall probability)", y = NULL
  )
Figure 39: SHAP beeswarm — individual prediction contributions coloured by feature value

End of Report

Engr. Adamu Enemona Kogh — Environmental & Hydro-Climatological Research Unit — May 2026