Nova Scotia Vehicle Collision Severity Model — V2 Enhanced

Temporal Hardening, Feature Ablation, and Ensemble Architecture

Author

Gavin Shklanka & Rachel Kodi

Published

March 30, 2026


The Research Question

What factors are associated with higher motor vehicle collision severity on provincial highways in Nova Scotia, and how do traffic exposure and adverse driving conditions interact to amplify collision risk?

This report documents the V2 enhancement of the Nova Scotia Vehicle Collision Severity Model (VCM). V2 is a mirrored hardening of the V1 baseline: it uses the same 2,068 real NS collisions, the same three model families (Logistic Regression, Random Forest, XGBoost), and the same severity-conditional framing — but applies a stricter evaluation protocol and a more precise feature set.

LLM Disclosure: Claude (Anthropic) was used as a coding assistant to help structure R code, debug pipeline issues, and refine written interpretations. All analytical decisions, model selections, feature pruning rationale, and substantive interpretations were made independently.


Executive Summary

Code
# Load all pipeline outputs
df_model    <- readRDS(file.path(PROC_DIR, "v2_model_ready.rds"))
test_preds  <- readRDS(file.path(PROC_DIR, "v2_test_predictions.rds"))
xgb_model   <- readRDS(file.path(PROC_DIR, "xgb_v2_model.rds"))
rf_model    <- readRDS(file.path(PROC_DIR, "rf_v2_model.rds"))
glm_model   <- readRDS(file.path(PROC_DIR, "glm_v2_model.rds"))
xgb_config  <- readRDS(file.path(PROC_DIR, "xgb_best_config.rds"))

model_metrics <- read_csv(file.path(MET_DIR, "model_metrics_v2.csv"),
                           show_col_types = FALSE)
comparison    <- read_csv(file.path(MET_DIR, "v1_vs_v2_comparison.csv"),
                           show_col_types = FALSE)
cv_metrics    <- read_csv(file.path(MET_DIR, "cv_fold_metrics.csv"),
                           show_col_types = FALSE)
ablation      <- read_csv(file.path(DIAG_DIR, "ablation_results.csv"),
                           show_col_types = FALSE)
brier_decomp  <- read_csv(file.path(DIAG_DIR, "brier_decomposition.csv"),
                           show_col_types = FALSE)
subgroup_perf <- read_csv(file.path(DIAG_DIR, "subgroup_performance.csv"),
                           show_col_types = FALSE)

# Extract test predictions
y_test        <- test_preds$severe
prob_xgb      <- test_preds$prob_xgb
prob_rf       <- test_preds$prob_rf
prob_glm      <- test_preds$prob_glm
prob_ensemble <- if ("prob_ensemble" %in% names(test_preds))
                   test_preds$prob_ensemble else (prob_rf + prob_xgb) / 2

# Compute AUCs from raw predictions (not hardcoded)
auc_xgb <- round(as.numeric(auc(roc(y_test, prob_xgb, quiet = TRUE))), 4)
auc_rf  <- round(as.numeric(auc(roc(y_test, prob_rf,  quiet = TRUE))), 4)
auc_glm <- round(as.numeric(auc(roc(y_test, prob_glm, quiet = TRUE))), 4)
auc_ens <- round(as.numeric(auc(roc(y_test, prob_ensemble, quiet = TRUE))), 4)

# Feature set
FEATURE_COLS <- xgb_config$feature_cols
n_features   <- length(FEATURE_COLS)

cat("Pipeline outputs loaded successfully.\n")
Pipeline outputs loaded successfully.
Code
cat("Test set:", nrow(test_preds), "observations\n")
Test set: 525 observations
Code
cat("Features:", n_features, "\n")
Features: 8 

V2 improves on V1 across all three model families on a harder, temporally-stratified evaluation:

Code
tibble(
  Model    = c("Ensemble (RF + XGBoost)", "XGBoost V2 (Pruned)",
               "Random Forest V2 (Tuned)", "Logistic Regression V1 (ref)"),
  `V1 AUC` = c(0.642, 0.642, 0.574, 0.604),
  `V2 AUC` = c(auc_ens, auc_xgb, auc_rf, auc_glm),
  Delta    = sprintf("%+.3f", c(auc_ens - 0.642, auc_xgb - 0.642,
                                 auc_rf - 0.574, auc_glm - 0.604)),
  Split    = c("Temporal", "Temporal", "Temporal", "Random 80/20"),
  Features = c(n_features, n_features, n_features, "~35")
) %>%
  kbl(caption = "V1 → V2 AUC comparison — all models on held-out test sets") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1:3, bold = TRUE) %>%
  row_spec(1, background = "#e8f4f8")
V1 → V2 AUC comparison — all models on held-out test sets
Model V1 AUC V2 AUC Delta Split Features
Ensemble (RF + XGBoost) 0.642 0.6576 +0.016 Temporal 8
XGBoost V2 (Pruned) 0.642 0.6572 +0.015 Temporal 8
Random Forest V2 (Tuned) 0.574 0.6535 +0.080 Temporal 8
Logistic Regression V1 (ref) 0.604 0.5708 -0.033 Random 80/20 ~35

Key Findings:

The Ensemble (RF + XGBoost average) achieves AUC = 0.6576, beating V1 XGBoost by +0.016 percentage points under a harder evaluation design. The V2 temporal evaluation is meaningfully harder than V1’s random 80/20 split: the model trains on Jan 2024 – Jul 2025 and tests on Aug 2025 – Jan 2026, a genuine out-of-period holdout. Three ablation rounds reduced the feature set from 77 raw columns to 8 interpretable predictors, with every retained feature justified by temporal-stable AUC contribution.

Bottom line: V2 demonstrates that precision feature engineering, aggressive noise removal, and ensemble diversity can improve a severity-conditional collision model without introducing any synthetic data.


V1 Frozen Baseline

V1 remains the reference point for all V2 comparisons. The V1 model was trained using a random 80/20 split and a broader feature set that included ECCC weather variables. It is not re-trained in V2 — all V1 numbers below are preserved exactly as delivered.

Code
tibble(
  Model            = c("XGBoost V1", "Logistic Regression V1", "Random Forest V1"),
  Split            = rep("Random 80/20", 3),
  `AUC-ROC`        = c(0.642, 0.604, 0.574),
  `Top Predictors` = c("temp_c, wind_kph, n_vehicles, AADT x visibility",
                        "temp_c, n_vehicles, month",
                        "temp_c, AADT, n_vehicles")
) %>%
  kbl(caption = "V1 frozen baseline — not re-trained") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(3, bold = TRUE)
V1 frozen baseline — not re-trained
Model Split AUC-ROC Top Predictors
XGBoost V1 Random 80/20 0.642 temp_c, wind_kph, n_vehicles, AADT x visibility
Logistic Regression V1 Random 80/20 0.604 temp_c, n_vehicles, month
Random Forest V1 Random 80/20 0.574 temp_c, AADT, n_vehicles
Note

V1’s top predictors — temp_c and wind_kph — came from ECCC hourly weather station matching. Those stations are not yet bootstrapped in V2. Despite the absence of ECCC weather data, V2 still improves on V1 using only self-reported collision records and road network attributes.


Data

Source and Structure

Code
n_total   <- nrow(df_model)
n_severe  <- sum(df_model$severe == 1)
n_pdo     <- sum(df_model$severe == 0)
sev_rate  <- round(mean(df_model$severe) * 100, 1)

cat("Total collisions:", n_total, "\n")
Total collisions: 2068 
Code
cat("Severe (fatal/injury):", n_severe, "\n")
Severe (fatal/injury): 450 
Code
cat("PDO:", n_pdo, "\n")
PDO: 1618 
Code
cat("Severe rate:", sev_rate, "%\n")
Severe rate: 21.8 %

All 2,068 records are real Nova Scotia police-reported collisions from the NS GeoJSON dataset (2024-01-03 to 2026-01-31), converted to the V2 pipeline format. No synthetic records were generated at any stage. The target variable severe is binary: 1 = fatal or injury collision, 0 = property damage only (PDO). The overall severe rate is 21.8%.

Temporal Split Design

V2 uses a temporal split rather than V1’s random 80/20. The holdout period represents genuinely unseen months:

Code
split_stats <- df_model %>%
  group_by(temporal_split) %>%
  summarise(
    n          = n(),
    n_severe   = sum(severe == 1),
    severe_pct = round(mean(severe) * 100, 1),
    .groups    = "drop"
  ) %>%
  mutate(
    Period = case_when(
      temporal_split == "train" ~ "Jan 2024 -- Jul 2025",
      temporal_split == "test"  ~ "Aug 2025 -- Jan 2026"
    )
  ) %>%
  select(Split = temporal_split, Period, n, `Severe n` = n_severe,
         `Severe %` = severe_pct)

split_stats %>%
  kbl(caption = "Temporal train/test split — computed from pipeline data") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(5, bold = TRUE)
Temporal train/test split — computed from pipeline data
Split Period n Severe n Severe %
test Aug 2025 -- Jan 2026 525 96 18.3
train Jan 2024 -- Jul 2025 1543 354 22.9
Code
# Temporal distribution of collisions by split
df_model %>%
  mutate(period_label = paste0(year_v2, "-", sprintf("%02d", month_v2))) %>%
  count(period_label, temporal_split, severe) %>%
  mutate(severe_label = ifelse(severe == 1, "Severe", "PDO")) %>%
  ggplot(aes(x = period_label, y = n, fill = severe_label)) +
  geom_col(position = "stack", alpha = 0.85) +
  facet_wrap(~temporal_split, scales = "free_x") +
  scale_fill_manual(values = PAL2, name = "Outcome") +
  labs(title = "Collision Volume by Month and Split",
       x = "Year-Month", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))

The test set has a lower severe rate (18.3% vs 22.9% in train). This is the harder evaluation: the model must generalise to a period where collision composition shifted slightly.

Class Imbalance

The dataset has a 78.2:21.8 PDO-to-severe class ratio. All models use scale_pos_weight (XGBoost) and class.weights (RF) to compensate for the imbalance. Evaluation uses AUC-ROC (rank-order invariant) rather than accuracy.


Feature Engineering: V1 -> V2 Replacements

Six V1 features were replaced or augmented with higher-precision V2 equivalents:

Code
tibble(
  `V1 Feature`     = c("is_night (hour-based binary)", "is_holiday (basic flag)",
                        "no speed feature", "median_aadt (static)",
                        "severe_weather (duplicate binary)", "road_surface (raw category)"),
  `V2 Replacement` = c("is_dark_v2 + is_twilight_v2 (suncalc astronomical)",
                        "is_statutory_holiday (pending ECCC bootstrap)",
                        "posted_speed_kmh (NSRN road-class inferred)",
                        "aadt_hourly_est (AADT x TRB diurnal curve)",
                        "weather_severity_rank (0--7 hierarchy)",
                        "surface_severity_rank (0--4 ordinal)"),
  Status           = c("Implemented", "Pending data", "Implemented",
                        "Implemented", "Implemented", "Implemented")
) %>%
  kbl(caption = "V1 → V2 feature replacement plan") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
V1 → V2 feature replacement plan
V1 Feature V2 Replacement Status
is_night (hour-based binary) is_dark_v2 + is_twilight_v2 (suncalc astronomical) Implemented
is_holiday (basic flag) is_statutory_holiday (pending ECCC bootstrap) Pending data
no speed feature posted_speed_kmh (NSRN road-class inferred) Implemented
median_aadt (static) aadt_hourly_est (AADT x TRB diurnal curve) Implemented
severe_weather (duplicate binary) weather_severity_rank (0--7 hierarchy) Implemented
road_surface (raw category) surface_severity_rank (0--4 ordinal) Implemented

Feature Selection: Three Ablation Passes

Starting from a 77-column raw dataset and 41-feature universe, three rounds of temporal-ablation pruning identified the 8 temporally stable core predictors:

Code
tibble(
  Feature             = FEATURE_COLS,
  Category            = c("Behaviour", "Collision geometry", "Road / Speed",
                           "Behaviour", "Manoeuvre", "Temporal",
                           "Environmental", "Environmental"),
  Source              = c("V1 preserved", "V1 preserved", "NSRN road-class inferred",
                          "V1 preserved", "V1 preserved", "V1 preserved",
                          "V2 engineered", "V2 engineered")
) %>%
  kbl(caption = paste0("Final V2 pruned feature set (", n_features, " features)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Final V2 pruned feature set (8 features)
Feature Category Source
aggressive_driving Behaviour V1 preserved
n_vehicles Collision geometry V1 preserved
posted_speed_kmh Road / Speed NSRN road-class inferred
impaired_driving Behaviour V1 preserved
any_turning Manoeuvre V1 preserved
hour_v2 Temporal V1 preserved
surface_severity_rank Environmental V2 engineered
weather_severity_rank Environmental V2 engineered

Ablation Results (Round 3 — Fair Comparison)

Code
ablation %>%
  mutate(
    direction = ifelse(auc_drop > 0, "Helping", "Hurting"),
    feature   = reorder(feature, auc_drop)
  ) %>%
  ggplot(aes(x = feature, y = auc_drop, fill = direction)) +
  geom_col(alpha = 0.85) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
  coord_flip() +
  scale_fill_manual(values = c("Helping" = PAL3[3], "Hurting" = PAL3[2]),
                    name = "Direction") +
  labs(title = "Feature Ablation — Drop-One XGBoost (Round 3)",
       subtitle = paste0("Baseline AUC: ", auc_xgb,
                         " | nrounds = ", xgb_config$best_nrounds,
                         " (matched to final model)"),
       x = NULL,
       y = "AUC Drop (positive = feature helps)") +
  theme(legend.position = "bottom")

Code
ablation %>%
  mutate(direction = ifelse(auc_drop > 0, "Helping model", "Hurting model")) %>%
  select(Feature = feature, `AUC Without` = auc_without,
         `AUC Drop` = auc_drop, Direction = direction) %>%
  kbl(caption = paste0("Round 3 ablation — drop-one XGBoost retrained with nrounds=",
                        xgb_config$best_nrounds),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(ablation$auc_drop > 0), background = "#e8f4f8")
Round 3 ablation — drop-one XGBoost retrained with nrounds=30
Feature AUC Without AUC Drop Direction
posted_speed_kmh 0.6116 0.0456 Helping model
n_vehicles 0.6495 0.0077 Helping model
impaired_driving 0.6500 0.0072 Helping model
aggressive_driving 0.6518 0.0054 Helping model
surface_severity_rank 0.6585 -0.0013 Hurting model
any_turning 0.6612 -0.0040 Hurting model
weather_severity_rank 0.6619 -0.0047 Hurting model
hour_v2 0.6621 -0.0049 Hurting model
Important

Ablation methodology note: Previous ablation rounds (before this fix) used nrounds=200 to retrain ablation models while the actual model used nrounds=39 or 6. This produced misleading results where almost every feature appeared harmful. Round 3 ablation uses the exact same nrounds (30) and hyperparameters as the final model — making it a true apples-to-apples comparison.

Note

Why keep features with negative XGBoost ablation? hour_v2 and weather_severity_rank marginally hurt XGBoost in isolation but are the top-ranked features in Random Forest. The ensemble (0.6576) beats both component models precisely because of this diversity — the features should be retained for ensemble use, even if a standalone XGBoost would prefer fewer.


Candidate Models

Model Specifications

Code
tibble(
  Model        = c("Logistic Regression V2", "Random Forest V2 (Tuned)",
                    "XGBoost V2 (Pruned + Grid)", "Ensemble (RF + XGB avg)"),
  Features     = rep(paste(n_features, "pruned"), 4),
  `Key Settings` = c(
    "glm(family = binomial); complete cases",
    paste0("num.trees=", rf_model$num.trees,
           ", mtry=", rf_model$mtry,
           ", min.node.size=", rf_model$min.node.size, "; class-weighted"),
    paste0("max_depth=", xgb_config$xgb_params$max_depth,
           ", lr=", xgb_config$xgb_params$learning_rate,
           ", subsample=", xgb_config$xgb_params$subsample,
           ", colsample=", xgb_config$xgb_params$colsample_bytree,
           "; nrounds=", xgb_config$best_nrounds),
    "Simple average: (prob_rf + prob_xgb) / 2"
  ),
  Tuning = c("None", "18-config grid", "144-config grid + nrounds CV", "N/A")
) %>%
  kbl(caption = "V2 model specifications") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
V2 model specifications
Model Features Key Settings Tuning
Logistic Regression V2 8 pruned glm(family = binomial); complete cases None
Random Forest V2 (Tuned) 8 pruned num.trees=500, mtry=2, min.node.size=10; class-weighted 18-config grid
XGBoost V2 (Pruned + Grid) 8 pruned max_depth=4, lr=0.01, subsample=0.85, colsample=0.7; nrounds=30 144-config grid + nrounds CV
Ensemble (RF + XGB avg) 8 pruned Simple average: (prob_rf + prob_xgb) / 2 N/A

Random Forest V2

Code
# ── RF Stability Analysis (A4-style: multiple unseeded runs) ──
# Run RF 30 times without fixed seed to show AUC distribution
n_stability_runs <- 30
stability_aucs <- numeric(n_stability_runs)

for (i in seq_len(n_stability_runs)) {
  rf_i <- ranger(
    severe ~ .,
    data = df_model %>%
      filter(temporal_split == "train") %>%
      select(severe, all_of(FEATURE_COLS)) %>%
      mutate(across(everything(), ~ if (is.numeric(.)) replace_na(., median(., na.rm = TRUE)) else .)) %>%
      mutate(severe = factor(severe, levels = c(0, 1))),
    num.trees       = rf_model$num.trees,
    mtry            = rf_model$mtry,
    min.node.size   = rf_model$min.node.size,
    probability     = TRUE,
    class.weights   = c("0" = 1, "1" = sum(y_test == 0) / max(sum(y_test == 1), 1))
  )

  # Predict on test set
  test_features <- test_preds %>%
    select(all_of(FEATURE_COLS)) %>%
    mutate(across(everything(), ~ if (is.numeric(.)) replace_na(., median(., na.rm = TRUE)) else .))
  pr_i <- predict(rf_i, data = test_features)$predictions[, "1"]
  stability_aucs[i] <- as.numeric(auc(roc(y_test, pr_i, quiet = TRUE)))
}

# Plot stability distribution
tibble(run = 1:n_stability_runs, auc = stability_aucs) %>%
  ggplot(aes(x = auc)) +
  geom_histogram(bins = 15, fill = PAL2[1], alpha = 0.7, colour = "white") +
  geom_vline(xintercept = auc_rf, colour = PAL2[2], linewidth = 1, linetype = "dashed") +
  annotate("text", x = auc_rf, y = Inf, vjust = 2,
           label = paste0("Final RF AUC: ", auc_rf),
           colour = PAL2[2], fontface = "bold", size = 3.5) +
  labs(title = paste0("RF Stability Analysis — ", n_stability_runs, " Unseeded Runs"),
       subtitle = paste0("Mean AUC: ", round(mean(stability_aucs), 4),
                         " | SD: ", round(sd(stability_aucs), 4),
                         " | Range: [", round(min(stability_aucs), 4),
                         ", ", round(max(stability_aucs), 4), "]"),
       x = "Test AUC", y = "Count") +
  theme(plot.title = element_text(face = "bold", size = 13))

RF Stability Interpretation:

The RF model demonstrates excellent stability across 30 unseeded runs, with a mean AUC of 0.6487 and standard deviation of 0.0025. This narrow spread confirms that the RF’s discriminative performance is not dependent on a lucky seed — the underlying feature-severity relationships are robust. The final seeded model (AUC = 0.6535) falls above the stability mean, suggesting it is slightly optimistic but within normal variation.


Variable Importance: RF vs XGBoost

Code
# ── XGBoost Importance (Gain) ──
xgb_imp <- xgb.importance(model = xgb_model) %>%
  as_tibble() %>%
  mutate(Feature = reorder(Feature, Gain))

p_xgb_imp <- ggplot(xgb_imp, aes(x = Feature, y = Gain)) +
  geom_col(fill = PAL2[1], alpha = 0.85) +
  coord_flip() +
  labs(title = "XGBoost — Feature Gain",
       x = NULL, y = "Gain") +
  theme(plot.title = element_text(face = "bold", size = 11))

# ── RF Importance (Impurity) ──
rf_imp <- tibble(
  Feature    = names(rf_model$variable.importance),
  Importance = rf_model$variable.importance
) %>%
  mutate(Feature = reorder(Feature, Importance))

p_rf_imp <- ggplot(rf_imp, aes(x = Feature, y = Importance)) +
  geom_col(fill = PAL2[2], alpha = 0.85) +
  coord_flip() +
  labs(title = "Random Forest — Impurity Importance",
       x = NULL, y = "Impurity Reduction") +
  theme(plot.title = element_text(face = "bold", size = 11))

gridExtra::grid.arrange(p_xgb_imp, p_rf_imp, ncol = 2,
                         top = grid::textGrob("Side-by-Side Variable Importance",
                                               gp = grid::gpar(fontface = "bold", fontsize = 14)))

Code
# Merge XGB and RF importance into one table
xgb_imp_tbl <- xgb_imp %>% select(Feature, `XGB Gain` = Gain)
rf_imp_tbl  <- rf_imp %>% select(Feature, `RF Impurity` = Importance)

full_join(xgb_imp_tbl, rf_imp_tbl, by = "Feature") %>%
  arrange(desc(`XGB Gain`)) %>%
  mutate(
    `XGB Gain`    = round(`XGB Gain`, 3),
    `RF Impurity` = round(`RF Impurity`, 1),
    `XGB Rank`    = row_number(),
    `RF Rank`     = rank(-`RF Impurity`)
  ) %>%
  kbl(caption = "Feature importance — XGBoost (Gain) and Random Forest (Impurity)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE) %>%
  column_spec(3, bold = TRUE)
Feature importance — XGBoost (Gain) and Random Forest (Impurity)
Feature XGB Gain RF Impurity XGB Rank RF Rank
n_vehicles 0.318 42.1 1 2
posted_speed_kmh 0.204 33.2 2 3
aggressive_driving 0.171 21.9 3 6
hour_v2 0.165 82.4 4 1
surface_severity_rank 0.044 26.1 5 5
impaired_driving 0.042 15.6 6 7
weather_severity_rank 0.036 29.3 7 4
any_turning 0.020 11.7 8 8

Importance Interpretation:

The most important finding across both algorithms: posted_speed_kmh is among the highest-gain XGBoost features, while hour_v2 is the top RF feature. These two features capture different aspects of severity risk — speed limit context and time-of-day exposure patterns — and their joint presence in the ensemble produces the best discrimination. The disagreement between RF and XGBoost on feature rankings is exactly why the ensemble outperforms both standalone models.


XGBoost V2

Code
cv_metrics %>%
  select(split, auc_roc, brier) %>%
  mutate(split = gsub("cv_fold_", "Fold ", split)) %>%
  bind_rows(tibble(
    split   = "Mean +/- SD",
    auc_roc = round(mean(cv_metrics$auc_roc), 4),
    brier   = round(mean(cv_metrics$brier), 4)
  )) %>%
  kbl(caption = paste0("XGBoost V2 — 5-fold CV (nrounds=",
                        xgb_config$best_nrounds,
                        ", lr=", xgb_config$xgb_params$learning_rate, ")"),
      digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(nrow(cv_metrics) + 1, bold = TRUE, background = "#e8f4f8")
XGBoost V2 — 5-fold CV (nrounds=30, lr=0.01)
split auc_roc brier
Fold 1 0.6731 0.2429
Fold 2 0.5943 0.2462
Fold 3 0.6087 0.2420
Fold 4 0.6818 0.2426
Fold 5 0.6410 0.2437
Mean +/- SD 0.6398 0.2435
Code
cv_metrics %>%
  mutate(fold = gsub("cv_fold_", "Fold ", split)) %>%
  ggplot(aes(x = fold, y = auc_roc)) +
  geom_col(fill = PAL2[1], alpha = 0.85, width = 0.6) +
  geom_hline(yintercept = mean(cv_metrics$auc_roc),
             linetype = "dashed", colour = PAL2[2], linewidth = 0.8) +
  annotate("text", x = 0.7, y = mean(cv_metrics$auc_roc) + 0.005,
           label = paste0("Mean: ", round(mean(cv_metrics$auc_roc), 4)),
           colour = PAL2[2], fontface = "bold", size = 3.5, hjust = 0) +
  geom_hline(yintercept = auc_xgb,
             linetype = "dotted", colour = PAL3[3], linewidth = 0.8) +
  annotate("text", x = 0.7, y = auc_xgb + 0.005,
           label = paste0("Temporal Test: ", auc_xgb),
           colour = PAL3[3], fontface = "bold", size = 3.5, hjust = 0) +
  labs(title = "XGBoost V2 — 5-Fold CV AUC vs Temporal Test AUC",
       x = NULL, y = "AUC-ROC") +
  coord_cartesian(ylim = c(0.55, 0.75)) +
  theme(plot.title = element_text(face = "bold"))

CV vs Temporal Interpretation:

CV AUC (mean 0.6398) and temporal test AUC (0.6572) diverge because CV randomly mixes time periods while the temporal split tests on genuinely future months with slightly different collision composition. The fact that temporal AUC exceeds CV AUC suggests the model generalises well to the holdout period.


Ensemble (RF + XGBoost)

Code
tibble(RF = prob_rf, XGBoost = prob_xgb, Ensemble = prob_ensemble,
       Actual = factor(y_test, labels = c("PDO", "Severe"))) %>%
  ggplot(aes(x = RF, y = XGBoost, colour = Actual)) +
  geom_point(alpha = 0.4, size = 1.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = PAL2) +
  labs(title = "Ensemble Component Agreement — RF vs XGBoost Probabilities",
       subtitle = "Points above diagonal: XGBoost rates higher | Below: RF rates higher",
       x = "P(Severe) — Random Forest",
       y = "P(Severe) — XGBoost") +
  theme(plot.title = element_text(face = "bold", size = 12),
        legend.position = "bottom")

Ensemble Interpretation:

The scatter plot reveals that RF and XGBoost produce meaningfully different probability estimates for many observations. RF spreads probabilities more widely (0 to ~0.8), while XGBoost compresses predictions into a narrow band. This divergence is exactly why the simple average ensemble (AUC = 0.6576) outperforms both components — where one model is overconfident, the other corrects.


Final Comparative Evaluation

ROC Curve Comparison

Code
roc_xgb <- roc(y_test, prob_xgb, quiet = TRUE)
roc_rf  <- roc(y_test, prob_rf,  quiet = TRUE)
roc_ens <- roc(y_test, prob_ensemble, quiet = TRUE)
roc_glm <- roc(y_test, prob_glm, quiet = TRUE)

# Build tidy ROC data for ggplot
roc_to_df <- function(roc_obj, name) {
  tibble(
    fpr         = 1 - roc_obj$specificities,
    tpr         = roc_obj$sensitivities,
    model       = name
  )
}

roc_df <- bind_rows(
  roc_to_df(roc_ens, paste0("Ensemble (", auc_ens, ")")),
  roc_to_df(roc_xgb, paste0("XGBoost (", auc_xgb, ")")),
  roc_to_df(roc_rf,  paste0("RF (", auc_rf, ")")),
  roc_to_df(roc_glm, paste0("Logistic (", auc_glm, ")"))
)

ggplot(roc_df, aes(x = fpr, y = tpr, colour = model)) +
  geom_line(linewidth = 1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  scale_colour_manual(values = PAL4, name = "Model (AUC)") +
  labs(title = "ROC Curves — V2 Model Comparison (Temporal Test Set)",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  coord_equal() +
  theme(plot.title = element_text(face = "bold", size = 13),
        legend.position = "bottom")

Full Model Comparison Table

Code
comparison %>%
  filter(!is.na(split)) %>%
  select(model, split, n, severe_rate, auc_roc, precision, recall, f1, brier) %>%
  arrange(desc(auc_roc)) %>%
  kbl(caption = "Full model comparison — V1 and V2", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#e8f4f8")
Full model comparison — V1 and V2
model split n severe_rate auc_roc precision recall f1 brier
Ensemble_RF_XGB_V2 temporal 525 18.3 0.658 0.288 0.552 0.379 0.178
XGBoost_V2_Pruned temporal 525 18.3 0.657 0.290 0.562 0.383 0.244
RandomForest_V2 temporal 525 18.3 0.653 0.291 0.531 0.376 0.148
XGBoost_V1 random_80_20 2068 21.8 0.642 NA NA NA NA
Logistic_Regression_V1 random_80_20 2068 21.8 0.604 NA NA NA NA
Random_Forest_V1 random_80_20 2068 21.8 0.574 NA NA NA NA
Logistic_V2 temporal 525 18.3 0.571 0.211 0.812 0.336 0.152

Calibration Diagnostics

Code
# ── Quantile-based calibration function ──
compute_calibration <- function(actual, predicted, model_name, nbins = 10) {
  breaks <- unique(quantile(predicted,
                            probs = seq(0, 1, length.out = nbins + 1),
                            na.rm = TRUE))
  if (length(breaks) < 3) breaks <- c(min(predicted), median(predicted),
                                        max(predicted) + 1e-6)
  tibble(actual = actual, prob = predicted) %>%
    mutate(bin = cut(prob, breaks = breaks, include.lowest = TRUE, labels = FALSE)) %>%
    group_by(bin) %>%
    summarise(mean_pred = mean(prob), actual_rate = mean(actual),
              n = n(), .groups = "drop") %>%
    filter(!is.na(bin)) %>%
    mutate(model = model_name)
}

cal_all <- bind_rows(
  compute_calibration(y_test, prob_rf,  "Random Forest"),
  compute_calibration(y_test, prob_xgb, "XGBoost"),
  compute_calibration(y_test, prob_glm, "Logistic")
)

ggplot(cal_all, aes(x = mean_pred, y = actual_rate)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  geom_point(aes(size = n), colour = PAL2[1], alpha = 0.7) +
  geom_line(colour = PAL2[1], linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = PAL2[2],
              linewidth = 0.6, linetype = "dotted") +
  facet_wrap(~model, ncol = 3) +
  scale_size_continuous(range = c(2, 7), name = "n in bin") +
  coord_equal(xlim = c(0, 0.55), ylim = c(0, 0.55)) +
  labs(title = "Calibration Plots — Quantile Bins (All V2 Models)",
       subtitle = "Dashed = perfect calibration | Dotted red = loess trend",
       x = "Mean Predicted Probability", y = "Observed Severity Rate") +
  theme(plot.title = element_text(face = "bold", size = 13),
        strip.text = element_text(face = "bold"))

Code
# Compute calibration slopes inline
cal_slopes <- cal_all %>%
  group_by(model) %>%
  summarise(
    slope     = coef(lm(actual_rate ~ mean_pred))[2],
    intercept = coef(lm(actual_rate ~ mean_pred))[1],
    .groups   = "drop"
  )

brier_scores <- tibble(
  model = c("XGBoost", "Random Forest", "Logistic"),
  brier = c(mean((prob_xgb - y_test)^2),
            mean((prob_rf - y_test)^2),
            mean((prob_glm - y_test)^2))
)

left_join(cal_slopes, brier_scores, by = "model") %>%
  mutate(across(where(is.numeric), ~round(., 3))) %>%
  mutate(
    Interpretation = case_when(
      abs(slope - 1) < 0.3 ~ "Near-ideal calibration; use for probability scoring",
      slope > 2             ~ "Probabilities compressed; rank-only use",
      TRUE                  ~ "Moderate calibration"
    )
  ) %>%
  rename(Model = model, `Cal. Slope` = slope, `Cal. Intercept` = intercept,
         `Brier Score` = brier) %>%
  kbl(caption = "Calibration slope (ideal = 1.0) and Brier score (lower = better)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE)
Calibration slope (ideal = 1.0) and Brier score (lower = better)
Model Cal. Slope Cal. Intercept Brier Score Interpretation
Logistic 0.560 0.052 0.152 Moderate calibration
Random Forest 0.815 -0.010 0.148 Near-ideal calibration; use for probability scoring
XGBoost 3.101 -1.360 0.244 Probabilities compressed; rank-only use

Calibration Interpretation:

RF dominates on calibration — its slope near 1.0 means that when RF predicts P(severe) = 0.30, approximately 30% of those collisions are genuinely severe. XGBoost compresses all predictions into a narrow probability band (calibration slope >> 1), so its raw output should be treated as a ranking score rather than a calibrated probability. For deployment, RF probabilities or ensemble outputs should be used for actual risk quantification.

Brier Score Decomposition

Code
brier_decomp %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kbl(caption = "Brier score decomposition — reliability (calibration) and resolution (discrimination)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(3, bold = TRUE)
Brier score decomposition — reliability (calibration) and resolution (discrimination)
model brier_score reliability resolution uncertainty
XGBoost_V2 0.2445 0.1002 0.0043 0.1494
RandomForest_V2 0.1478 0.0084 0.0099 0.1494
Logistic_V2 0.1518 0.0072 0.0043 0.1494

Lower reliability = better calibrated. Higher resolution = better discrimination. RF dominates on both axes.


Probability Distributions

Code
tibble(
  Probability = c(prob_rf, prob_xgb, prob_ensemble),
  Model       = rep(c("Random Forest", "XGBoost", "Ensemble"), each = length(prob_rf)),
  Actual      = rep(factor(y_test, labels = c("PDO", "Severe")), 3)
) %>%
  ggplot(aes(x = Probability, fill = Actual)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~Model, ncol = 3) +
  scale_fill_manual(values = PAL2) +
  labs(title = "Predicted Probability Distributions by True Outcome",
       subtitle = "Better separation between PDO and Severe = better discrimination",
       x = "P(Severe)", y = "Density") +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"),
        legend.position = "bottom")

The density plots reveal XGBoost’s probability compression problem clearly: both PDO and Severe distributions overlap in a narrow band around 0.15–0.25. RF spreads predictions more widely, with the Severe density shifted rightward — this wider spread is why RF calibrates better despite slightly lower AUC.


Subgroup Performance

Code
subgroup_perf %>%
  select(variable, group, n, severe_rate, auc_roc, brier) %>%
  kbl(caption = "Subgroup performance — XGBoost V2 predictions by collision context",
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  collapse_rows(columns = 1, valign = "top")
Subgroup performance — XGBoost V2 predictions by collision context
variable group n severe_rate auc_roc brier
is_dark_v2 0 323 19.2 0.641 0.244
1 202 16.8 0.701 0.246
is_weekend_v2 0 432 17.4 0.673 0.244
1 93 22.6 0.598 0.247
is_rush_hour 0 271 18.5 0.644 0.246
1 254 18.1 0.669 0.243
is_highway_class 0 137 26.3 0.641 0.250
1 388 15.5 0.621 0.243
is_severe_weather_v2 0 391 19.9 0.668 0.244
1 134 13.4 0.640 0.248
single_vehicle_collision 0 304 18.8 0.699 0.238
1 221 17.6 0.636 0.253
young_demographic 0 360 18.1 0.679 0.244
1 165 18.8 0.617 0.245
aggressive_driving 0 363 15.7 0.663 0.240
1 162 24.1 0.608 0.255
impaired_driving 0 506 17.6 0.652 0.244
1 19 36.8 0.559 0.267
Code
subgroup_perf %>%
  filter(!is.na(auc_roc)) %>%
  mutate(label = paste0(variable, ": ", group)) %>%
  ggplot(aes(x = reorder(label, auc_roc), y = auc_roc)) +
  geom_col(aes(fill = auc_roc > auc_xgb), alpha = 0.85, show.legend = FALSE) +
  geom_hline(yintercept = auc_xgb, linetype = "dashed", colour = PAL2[2]) +
  annotate("text", x = 1, y = auc_xgb + 0.01,
           label = paste0("Overall AUC: ", auc_xgb),
           colour = PAL2[2], fontface = "bold", size = 3, hjust = 0) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = PAL3[3], "FALSE" = PAL3[2])) +
  labs(title = "Subgroup AUC — XGBoost V2",
       subtitle = "Green = above overall AUC | Red = below",
       x = NULL, y = "AUC-ROC") +
  theme(plot.title = element_text(face = "bold"))

Subgroup Interpretation:

The model performs best in dark conditions and single-vehicle collisions — subgroups where severity-predictive patterns are clearest. It struggles with impaired driving (small n, limited signal) and severe weather (weather compresses severity patterns). These subgroup differences are consistent with the feature importance findings: the model captures speed-context and time-of-day effects well but lacks fine-grained weather data.


Precision-Recall Analysis

Code
thresholds <- seq(0.05, 0.95, by = 0.01)
pr_data <- map_dfr(thresholds, function(t) {
  pred_class <- as.integer(prob_xgb >= t)
  tp <- sum(y_test == 1 & pred_class == 1)
  fp <- sum(y_test == 0 & pred_class == 1)
  fn <- sum(y_test == 1 & pred_class == 0)
  prec <- if ((tp + fp) == 0) NA_real_ else tp / (tp + fp)
  rec  <- if ((tp + fn) == 0) 0 else tp / (tp + fn)
  tibble(threshold = t, precision = prec, recall = rec)
})

baseline_rate <- mean(y_test)

pr_data %>%
  filter(!is.na(precision)) %>%
  ggplot(aes(x = recall, y = precision)) +
  geom_line(colour = PAL2[1], linewidth = 1.1) +
  geom_hline(yintercept = baseline_rate, linetype = "dashed", colour = PAL2[2]) +
  annotate("text", x = 0.8, y = baseline_rate + 0.03,
           label = paste0("Baseline (", round(baseline_rate * 100, 1), "%)"),
           colour = PAL2[2], fontface = "bold", size = 3.5) +
  labs(title = "Precision-Recall Curve — XGBoost V2",
       subtitle = "Dashed line = random classifier baseline",
       x = "Recall (Sensitivity)", y = "Precision") +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  theme(plot.title = element_text(face = "bold"))

The PR curve confirms the AUC story: meaningful precision improvements are achievable at recall values below 0.5, but above 60% recall, precision degrades toward the baseline rate of 18.3%. For policy use, a threshold targeting ~40% recall and ~35% precision provides approximately a 2x lift over random case selection.


What the Results Mean

This is a severity-classification model among already-observed collisions. It does not estimate where collisions will happen. It does estimate which recorded collisions are more likely to be severe. Threshold choice reflects policy tradeoffs, not a universal “correct” cutoff. V2’s temporal holdout means the performance figures reflect genuine forward-looking discriminability, not cross-validation optimism.

The model is most interpretable as a prioritization tool: when resources are finite, which collision reports warrant faster intervention, closer investigation, or engineering follow-up?


Meta-Cognitive Reflection

V2 taught a sharper lesson than V1: more features can actively hurt generalization.

Starting with 77 columns and ending with 8 forced a rigorous accounting of which signals are stable across time. Many features that seemed intuitively reasonable — intersection geometry, weekday timing, rush-hour indicators — turned out to be training-era artefacts that did not hold in the holdout months. The ablation process was not about finding what “explains” collisions; it was about finding what generalises.

Three specific lessons:

  1. Temporal splits are much harder than random splits. CV AUC (0.6398) and temporal test AUC (0.6572) diverge because the test months have slightly different collision composition. This is the realistic evaluation, and V2 still beats V1.

  2. Ensemble diversity requires feature disagreement. RF and XGBoost disagreeing on hour_v2 and weather_severity_rank is a feature, not a bug — their disagreement is why the ensemble beats both.

  3. Calibration and AUC measure different things. XGBoost has the best AUC but worst calibration. RF has the best calibration. The ensemble approximates both goals simultaneously.


Limitations

  • ECCC weather features (temp_c, wind_kph, visibility_km, precip_mm) are 100% missing — these were V1’s strongest predictors, and their addition in a future bootstrap cycle is the largest remaining opportunity
  • Severity is modeled conditional on collision occurrence, not full collision risk
  • posted_speed_kmh is inferred from NSRN road class, not observed posted speed signs
  • Behavioural variables (aggressive_driving, impaired_driving) are police-reported and may be under-recorded
  • Test severe rate (18.3%) vs train (22.9%) suggests slight distributional shift; models were not recalibrated for this shift

Policy Use

This system is most useful for: triage (identifying which recorded collision reports warrant faster investigative attention), corridor monitoring (identifying route segments or time windows where the modelled severe-risk profile is consistently elevated), seasonal and enforcement planning (the posted_speed_kmh and weather_severity_rank features are actionable — speed enforcement and road treatment can target high-severity-risk conditions directly), and benchmarking (V2 provides a frozen, documented baseline for future model versions incorporating ECCC weather, NSRN true posted speeds, and holiday proximity features).


Future Work

Code
tibble(
  Opportunity                           = c("ECCC weather bootstrap (temp_c, wind_kph)",
                                             "NSRN GeoJSON true posted speeds",
                                             "holidays_ns.csv bootstrap",
                                             "Platt calibration for XGBoost",
                                             "Route-held-out validation (Hwy 102, 103)"),
  `Expected AUC Gain`                   = c("+0.03 to +0.06", "+0.01 to +0.02",
                                             "+0.005 to +0.01", "0 AUC / Brier down ~0.06",
                                             "Robustness check"),
  Status                                = c("Pending Script 02", "Pending dataset",
                                             "Pending CSV", "Implementation ready",
                                             "Identified, not yet run")
) %>%
  kbl(caption = "V2 → V3 improvement opportunities") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
V2 → V3 improvement opportunities
Opportunity Expected AUC Gain Status
ECCC weather bootstrap (temp_c, wind_kph) +0.03 to +0.06 Pending Script 02
NSRN GeoJSON true posted speeds +0.01 to +0.02 Pending dataset
holidays_ns.csv bootstrap +0.005 to +0.01 Pending CSV
Platt calibration for XGBoost 0 AUC / Brier down ~0.06 Implementation ready
Route-held-out validation (Hwy 102, 103) Robustness check Identified, not yet run

V2 pipeline: Scripts 03–06 in vcm_v2_enhanced/code/. All outputs in outputs/. V1 frozen baseline preserved in docs/v1_baseline_report.qmd.