---
title: "Nova Scotia Road Safety Intelligence System"
subtitle: "Provincial Corridor Model Validation"
author: "Gavin Shklanka & Rachel Kodi"
date: today
format:
  html:
    embed-resources: true
    toc: true
    toc-depth: 3
    theme: cosmo
    code-fold: true
    code-summary: "See 4 Urself"
    code-tools: true
    df-print: paged
execute:
  echo: true
  warning: false
  message: false
---
The Research Question
> **What factors are associated with higher motor vehicle collision severity on provincial highways in Nova Scotia, and how do traffic exposure and adverse weather conditions interact to amplify collision risk?**
This report evaluates an enhanced **severity-conditional-on-collision** modeling pipeline. The goal is not to predict whether a collision will occur, but rather to assess which recorded collisions are more likely to be severe.
Executive Summary
This project develops a machine learning-based road safety intelligence system for provincial-corridor collisions in Nova Scotia.
Key findings:
* XGBoost produced the strongest discrimination (**AUC = 0.642**)
* Logistic Regression provided a transparent baseline (**AUC = 0.604**)
* Random Forest underperformed relative to expectations (**AUC = 0.574**)
* Weather and exposure variables, especially **temperature**, **wind speed**, and **traffic volume**, were the strongest predictors
* Overall performance remained moderate because severe and non-severe collisions overlap heavily in feature space
**Bottom line:** this system is best interpreted as a **risk prioritization tool**, not a deterministic prediction engine.
Modeling Roadmap
The modeling process followed four steps:
1. Examine the data structure before fitting models
2. Train three candidate models of increasing flexibility
3. Compare performance on a held-out test set
4. Translate results into plain-language policy meaning
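Step 3 depends on how the held-out set is carved out. The sketch below shows one possible approach, a stratified 80/20 split in base R. The data frame, the `severe` column name, the 80/20 ratio, and the choice to stratify are all assumptions made for illustration; only the class counts (1240 and 387) come from this report.

```r
# Hypothetical stand-in data: 1,627 collisions with the report's class counts.
set.seed(42)
collisions <- data.frame(
  id     = seq_len(1627),
  severe = factor(rep(c("No", "Yes"), times = c(1240, 387)), levels = c("No", "Yes"))
)

# Sample 80% of the row indices within each class so that both splits
# keep a similar share of severe collisions.
train_idx <- unlist(lapply(
  split(seq_len(nrow(collisions)), collisions$severe),
  function(idx) sample(idx, size = floor(0.8 * length(idx)))
))

train <- collisions[train_idx, ]
test  <- collisions[-train_idx, ]

# Severe share in each split (both should sit near 0.238).
round(c(train = mean(train$severe == "Yes"), test = mean(test$severe == "Yes")), 3)
```

Stratifying mainly matters when the minority class is small; with 387 severe cases a plain random split would usually land close to the same shares.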
::: {.callout-note}
These diagnostics matter because if severe and non-severe collisions overlap heavily, even strong models will only achieve moderate discrimination.
:::
Pre-Model Diagnostics
Class Imbalance
::: {.cell}
```{.r .cell-code}
class_tbl <- tibble(
  Class = c("No", "Yes"),
  Count = c(1240, 387),
  Share = c(0.762, 0.238)
)

class_tbl %>%
  mutate(Share = scales::percent(Share, accuracy = 0.1)) %>%
  kbl(caption = "Collision severity distribution — provincial corridor subset") %>%
  kable_styling(full_width = FALSE)
```
::: {.cell-output-display}
`````{=html}
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Collision severity distribution — provincial corridor subset</caption>
<thead>
<tr>
<th style="text-align:left;"> Class </th>
<th style="text-align:right;"> Count </th>
<th style="text-align:left;"> Share </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> No </td>
<td style="text-align:right;"> 1240 </td>
<td style="text-align:left;"> 76.2% </td>
</tr>
<tr>
<td style="text-align:left;"> Yes </td>
<td style="text-align:right;"> 387 </td>
<td style="text-align:left;"> 23.8% </td>
</tr>
</tbody>
</table>
`````
:::
:::
knitr::include_graphics("001_4_Section_4_Exploratory_Diagnostics_figure.png")

The dataset is imbalanced: about 24% of collisions are severe. That makes accuracy a weak performance measure, so the evaluation focuses on ROC/AUC.
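To make the point about accuracy concrete, the short calculation below uses the class counts from the table above. The "always predict non-severe" rule is a hypothetical baseline introduced here for illustration, not one of the report's models.

```r
# Class counts taken from the severity table in this report.
counts <- c(No = 1240, Yes = 387)

# A rule that labels every collision as non-severe is correct whenever the truth is "No".
majority_accuracy <- unname(counts["No"] / sum(counts))

# The same rule never flags a severe collision, so its sensitivity (recall) is zero.
majority_sensitivity <- 0 / unname(counts["Yes"])

round(c(accuracy = majority_accuracy, sensitivity = majority_sensitivity), 3)
```

The rule reaches roughly 76% accuracy while identifying no severe collisions at all, which is why a ranking metric such as AUC is more informative here.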
knitr::include_graphics("002_4_Section_4_Exploratory_Diagnostics_figure.png")

The density plots show substantial overlap between severe and non-severe collisions. In practical terms, this means the classes are not cleanly separable using a simple rule.
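The link between overlap and achievable AUC can be illustrated with a small simulation. This is a sketch under stated assumptions: a single hypothetical feature that is normally distributed with unit variance in both classes, with the severe class shifted by `d = 0.5` standard deviations. The value of `d` and the helper `auc_rank()` are invented for illustration and are not estimated from the project data.

```r
# Rank-based AUC: the probability that a randomly chosen severe case scores
# higher than a randomly chosen non-severe case (ties count as one half).
auc_rank <- function(score, is_severe) {
  r  <- rank(score)
  n1 <- sum(is_severe)
  n0 <- sum(!is_severe)
  (sum(r[is_severe]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n0 <- 1240   # non-severe count from the report
n1 <- 387    # severe count from the report
d  <- 0.5    # assumed gap between class means, in standard deviations

# One hypothetical feature, shifted upward by d for severe cases.
x         <- c(rnorm(n0, mean = 0), rnorm(n1, mean = d))
is_severe <- rep(c(FALSE, TRUE), times = c(n0, n1))

# For two equal-variance normal classes the best possible AUC is pnorm(d / sqrt(2)).
c(simulated = auc_rank(x, is_severe), theoretical = pnorm(d / sqrt(2)))
```

Under these assumptions the theoretical ceiling is about 0.64: when the class distributions overlap this much, no classifier using that feature alone can do better, however flexible it is.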
knitr::include_graphics("003_4_Section_4_Exploratory_Diagnostics_figure.png")

Key structural observations:
This supports the use of tree-based models, which can handle correlated predictors and nonlinear interactions more flexibly than a linear model.
Logistic Regression was used as the baseline because it is interpretable and provides a clean benchmark for binary classification.
logistic_tbl <- tibble(
  Model = "Logistic Regression",
  AUC = 0.604,
  Interpretation = "Transparent baseline with modest discrimination"
)

logistic_tbl %>%
  kbl(caption = "Logistic Regression summary") %>%
  kable_styling(full_width = FALSE)

| Model | AUC | Interpretation |
|---|---|---|
| Logistic Regression | 0.604 | Transparent baseline with modest discrimination |
Interpretation: This model detects some signal, but the relationship between predictors and severity is not cleanly linear. It performs better than random guessing, but not strongly enough to serve as a standalone operational model.
Plain-language takeaway: This is the “basic benchmark” model. It gives a sensible starting point, but it does not capture enough complexity to separate severe from non-severe collisions well.
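For readers who want to see what such a baseline looks like in code, here is a minimal sketch using base R's `glm()` on simulated data. It is not the project's training pipeline: the predictor names (`temp_c`, `wind_kph`, `aadt`), the effect sizes, and the 80/20 split are assumptions chosen only so the example runs end to end.

```r
set.seed(2024)
n <- 1627

# Simulated predictors with hypothetical names; the effect sizes below are invented.
sim <- data.frame(
  temp_c   = rnorm(n, mean = 5, sd = 10),
  wind_kph = rexp(n, rate = 1 / 15),
  aadt     = rlnorm(n, meanlog = 9, sdlog = 0.5)
)
log_odds   <- -1.3 - 0.04 * sim$temp_c + 0.03 * sim$wind_kph
sim$severe <- rbinom(n, size = 1, prob = plogis(log_odds))

# Random 80/20 train/test split.
train_rows <- sample(n, size = floor(0.8 * n))
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Logistic regression baseline.
fit <- glm(severe ~ temp_c + wind_kph + log(aadt), data = train, family = binomial())

# Predicted probability of a severe outcome for the held-out rows.
p_test <- predict(fit, newdata = test, type = "response")

# Held-out AUC from the Mann-Whitney U statistic: U / (n_severe * n_non_severe).
u   <- wilcox.test(p_test[test$severe == 1], p_test[test$severe == 0], exact = FALSE)$statistic
auc <- unname(u) / (sum(test$severe == 1) * sum(test$severe == 0))
auc
```

Calling `summary(fit)` shows the fitted coefficients, which is the transparency that makes logistic regression a natural benchmark.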
Random Forest was used to test whether nonlinear decision rules and interaction effects would improve performance beyond the linear baseline.
rf_tbl <- tibble(
  Model = "Random Forest",
  AUC = 0.574,
  Interpretation = "Flexible nonlinear model, but weaker held-out discrimination"
)

rf_tbl %>%
  kbl(caption = "Random Forest summary") %>%
  kable_styling(full_width = FALSE)

| Model | AUC | Interpretation |
|---|---|---|
| Random Forest | 0.574 | Flexible nonlinear model, but weaker held-out discrimination |
Interpretation: Although Random Forest can model nonlinear effects, its held-out performance was weaker than Logistic Regression in this version of the dataset.
Plain-language takeaway: Adding flexibility alone did not guarantee better results. A more complicated model is not always a better model.
XGBoost was used as the most advanced candidate model because boosting can focus iteratively on harder-to-classify cases and capture more complex structure.
xgb_tbl <- tibble(
  Model = "XGBoost",
  AUC = 0.642,
  Interpretation = "Best-performing model on held-out discrimination"
)

xgb_tbl %>%
  kbl(caption = "XGBoost summary") %>%
  kable_styling(full_width = FALSE)

| Model | AUC | Interpretation |
|---|---|---|
| XGBoost | 0.642 | Best-performing model on held-out discrimination |
Interpretation: XGBoost achieved the strongest ranking performance of the three models, suggesting that severe collision risk is influenced by nonlinear combinations of weather, traffic exposure, and collision context.
Plain-language takeaway: This was the strongest model, but it is still not “predicting the future perfectly.” It is better understood as a tool for flagging higher-risk cases.
metrics_df <- tibble(
  Model = c("Logistic Regression", "Random Forest", "XGBoost"),
  `AUC-ROC` = c(0.604, 0.574, 0.642),
  Conclusion = c(
    "Transparent baseline",
    "Flexible but weaker generalization",
    "Best overall discrimination"
  )
)

metrics_df %>%
  arrange(desc(`AUC-ROC`)) %>%
  kbl(caption = "Model comparison — held-out test set") %>%
  kable_styling(full_width = FALSE)

| Model | AUC-ROC | Conclusion |
|---|---|---|
| XGBoost | 0.642 | Best overall discrimination |
| Logistic Regression | 0.604 | Transparent baseline |
| Random Forest | 0.574 | Flexible but weaker generalization |
knitr::include_graphics("004_6_3_6_3_XGBoost_figure.png")

XGBoost leads the final comparison, followed by Logistic Regression, then Random Forest.
Overall interpretation: The results suggest that severe collision prediction is feasible at a modest discrimination level. The main practical value of the system is in risk ranking and corridor monitoring, not exact event prediction.
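Risk ranking can be useful even when discrimination is modest. The sketch below flags the top 20% of collisions by predicted score and compares the severe share among flagged cases with the overall base rate. The scores are simulated from a beta distribution and the 20% threshold is arbitrary; both are assumptions for illustration, not outputs of the project's models.

```r
set.seed(7)
n <- 1000

# Hypothetical model output: predicted severity probabilities for 1,000 collisions.
# Outcomes are drawn so that each score equals the true severity probability,
# i.e. an idealized, perfectly calibrated model.
score  <- rbeta(n, shape1 = 2, shape2 = 6)
severe <- rbinom(n, size = 1, prob = score)

# Flag the top 20% of collisions by score.
cutoff  <- quantile(score, probs = 0.80)
flagged <- score >= cutoff

base_rate    <- mean(severe)            # severe share overall
flagged_rate <- mean(severe[flagged])   # severe share among flagged cases
lift         <- flagged_rate / base_rate

round(c(base_rate = base_rate, flagged_rate = flagged_rate, lift = lift), 2)
```

A lift above 1 means the flagged group contains a higher share of severe collisions than the dataset as a whole, which is the sense in which the system prioritizes rather than predicts.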
knitr::include_graphics("005_6_3_6_3_XGBoost_figure.png")

Across Random Forest and XGBoost, the strongest variables were:

* `temp_c`
* `wind_kph`
* `n_vehicles`

Conclusion: Severe collision risk in this subset appears to be driven more by environmental and exposure conditions than by isolated behavioral indicators.
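One common way to obtain a variable ranking like this is gain-based importance from a boosted model. The following is a hedged sketch, not the report's actual code: it assumes the `xgboost` R package is installed (the block does nothing otherwise), simulates a feature matrix that reuses the report's variable names, and uses invented hyperparameters (`max_depth = 3`, `eta = 0.1`, 50 rounds) and an invented interaction signal.

```r
if (requireNamespace("xgboost", quietly = TRUE)) {
  set.seed(11)
  n <- 1627

  # Simulated feature matrix; column names mirror the report's variables, values are invented.
  X <- cbind(
    temp_c     = rnorm(n, mean = 5, sd = 10),
    wind_kph   = rexp(n, rate = 1 / 15),
    n_vehicles = rpois(n, lambda = 1) + 1
  )

  # Invented nonlinear signal: cold and windy conditions together raise severity odds.
  cold_and_windy <- X[, "temp_c"] < 0 & X[, "wind_kph"] > 20
  y <- rbinom(n, size = 1, prob = plogis(-1.5 + 1.2 * cold_and_windy))

  # Train on a random 80% of rows.
  train_rows <- sample(n, size = floor(0.8 * n))
  dtrain <- xgboost::xgb.DMatrix(data = X[train_rows, ], label = y[train_rows])

  bst <- xgboost::xgb.train(
    params  = list(objective = "binary:logistic", max_depth = 3, eta = 0.1),
    data    = dtrain,
    nrounds = 50
  )

  # Gain-based importance: each feature's share of the total loss reduction.
  importance <- as.data.frame(xgboost::xgb.importance(model = bst))
  print(importance[, c("Feature", "Gain")])
}
```

Gain-based rankings describe what the fitted model relied on, not causal effects, so they are best read alongside the density plots and the limitations of this report.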
This is a severity classification model among already-observed collisions.
That means:
At a simple level, this project showed that serious collisions are hard to predict cleanly because many things are happening at once.
Even after adding weather, road context, traffic exposure, and crash-structure features, the severe and non-severe cases still overlap a lot. That means the models can find patterns, but the patterns are not strong enough to create near-perfect separation.
What this taught us is:
In plain terms, the system is best thought of as a way to say:
“These conditions look more dangerous than average, so they deserve more attention.”
That is a realistic and defensible use of machine learning in a public-safety setting.
* **Severity conditional on collision occurrence:** The model classifies severity among observed collisions; it is not a full occurrence-risk model.
* **Route-level exposure approximation:** AADT and truck-share were joined at the route level, not exact segment-hour resolution.
* **Approximate weather assignment:** Weather was assigned using nearest-station and same-hour matching.
* **Random split validation:** The evaluation used a random train/test split rather than a temporal or corridor holdout.
* **Moderate predictive ceiling:** Collision severity contains substantial randomness and unobserved context.
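As an illustration of the alternative mentioned under random split validation, a temporal holdout could look like the sketch below. The `collision_date` column name, the 2019 to 2023 date range, and the simulated records are assumptions; the report does not state which date field the real data carries.

```r
set.seed(3)
n <- 1627

# Hypothetical collision records; `collision_date` is an assumed column name.
all_days   <- seq(as.Date("2019-01-01"), as.Date("2023-12-31"), by = "day")
collisions <- data.frame(
  collision_date = sample(all_days, size = n, replace = TRUE),
  severe         = rbinom(n, size = 1, prob = 0.238)
)

# Temporal holdout: train on the earliest 80% of records, test on the most recent 20%.
collisions <- collisions[order(collisions$collision_date), ]
n_train    <- floor(0.8 * n)
train      <- collisions[seq_len(n_train), ]
test       <- collisions[(n_train + 1):n, ]

# Check that no training record is dated after any test record.
max(train$collision_date) <= min(test$collision_date)
```

Evaluating on later records mimics how the system would be used in practice, on collisions that occur after the model was trained, and tends to give a more conservative performance estimate than a random split.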
This system is most useful for:
It should not be interpreted as a deterministic crash prediction tool.
Readers can expand the code throughout this report using the “See 4 Urself” toggles. A lightweight example of the modeling workflow is shown below.
# Example reporting object used in this presentation-style .qmd
tibble(
  Model = c("Logistic Regression", "Random Forest", "XGBoost"),
  AUC = c(0.604, 0.574, 0.642)
)

Claude and ChatGPT were used to help structure the R/Quarto workflow, improve report organization, and refine interpretive phrasing. All final analytical claims, metrics, and project-specific outputs were reviewed by the authors.