---
title: "Nova Scotia Road Safety Intelligence System"
subtitle: "Provincial Corridor Model Validation"
author: "Gavin Shklanka & Rachel Kodi"
date: today
format:
html:
embed-resources: true
toc: true
toc-depth: 3
theme: cosmo
code-fold: true
code-summary: "See 4 Urself"
code-tools: true
df-print: paged
execute:
echo: true
warning: false
message: false
---
```{r setup-data}
#| label: setup-data
#| include: false
library(tidyverse)
library(readr)
library(dplyr)
library(janitor)
library(lubridate)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(scales)
theme_set(theme_minimal(base_size = 12))
options(scipen = 999)
Sys.setenv(VROOM_CONNECTION_SIZE = 5000000)
# ── DATA PATH ──────────────────────────────────────────────────────────────────
# Replace the path below with the full path to your enriched weather-joined CSV
# (the one containing temp_c, wind_kph, precipitation_mm, visibility_km).
# Example: file_path <- "/data/ns_collisions_enriched.csv"
# If the file is absent, the report will still render; weather sections will be skipped.
file_path <- Sys.getenv("NS_ENRICHED_CSV", unset = "")
if (nchar(file_path) > 0 && file.exists(file_path)) {
prov_df <- read_csv(file_path, show_col_types = FALSE)
analysis_df <- janitor::clean_names(prov_df) %>%
mutate(severe = ifelse(severity_raw == 1, "Yes", "No"))
if ("datetime_raw" %in% names(analysis_df) && !"datetime" %in% names(analysis_df)) {
analysis_df <- analysis_df %>%
mutate(datetime = suppressWarnings(mdy_hms(datetime_raw, tz = "America/Halifax")))
}
has_data <- TRUE
} else {
# Render-safe fallback: no data loaded, all data-dependent chunks are guarded
analysis_df <- tibble()
has_data <- FALSE
message("No enriched CSV found — data-dependent sections will display a notice instead of crashing.")
}
has_temp <- has_data && "temp_c" %in% names(analysis_df)
```
## The Research Question
> **What factors are associated with higher motor vehicle collision severity on provincial highways in Nova Scotia, and how do traffic exposure and adverse weather conditions interact to amplify collision risk?**
This report evaluates an enhanced **severity-conditional-on-collision** modeling pipeline. The goal is not to predict whether a collision will occur, but rather to assess which recorded collisions are more likely to be severe.
## Executive Summary
This project develops a machine learning-based road safety intelligence system for provincial-corridor collisions in Nova Scotia.
Key findings:
- XGBoost produced the strongest discrimination (**AUC = 0.642**)
- Logistic Regression provided a transparent baseline (**AUC = 0.604**)
- Random Forest underperformed relative to expectations (**AUC = 0.574**)
- Weather and exposure variables — especially **temperature**, **wind speed**, and **traffic volume** — were the strongest predictors
- Overall performance remained moderate because severe and non-severe collisions overlap heavily in feature space
**Bottom line:** this system is best interpreted as a **risk prioritization tool**, not a deterministic prediction engine.
## Modeling Roadmap
The modeling process followed four steps:
1. Examine the data structure before fitting models
2. Train three candidate models of increasing flexibility
3. Compare performance on a held-out test set
4. Translate results into plain-language policy meaning
::: {.callout-note}
These diagnostics matter because if severe and non-severe collisions overlap heavily, even strong models will achieve only moderate discrimination — which is the expected result for this kind of severity-conditional problem.
:::
## Pre-Model Diagnostics
### Class Imbalance
```{r class-balance-table}
if (has_data) {
class_tbl <- analysis_df %>%
count(severe) %>%
mutate(share = scales::percent(n / sum(n), accuracy = 0.1))
class_tbl %>%
kbl(caption = "Collision severity distribution — provincial corridor subset") %>%
kable_styling(full_width = FALSE)
} else {
cat("Data not loaded — class balance table unavailable in this render.")
}
```
```{r class-balance-plot, fig.cap="Collision severity distribution — provincial corridor subset"}
if (file.exists("001_4_Section_4_Exploratory_Diagnostics_figure.png")) {
knitr::include_graphics("001_4_Section_4_Exploratory_Diagnostics_figure.png")
}
```
The dataset is imbalanced: severe collisions represent a meaningful but minority share of observed cases (approximately **21.8%**). This makes raw accuracy a weak performance measure, so the evaluation focuses on **ROC/AUC**.
### Continuous Feature Distributions
```{r feature-distributions, fig.cap="Enhanced continuous feature distributions by severity"}
if (file.exists("002_4_Section_4_Exploratory_Diagnostics_figure.png")) {
knitr::include_graphics("002_4_Section_4_Exploratory_Diagnostics_figure.png")
}
```
The density plots show substantial overlap between severe and non-severe collisions. In practical terms, this means the classes are not cleanly separable using a simple rule.
### Correlation Structure
```{r correlation-heatmap, fig.cap="Feature correlation matrix"}
if (file.exists("003_4_Section_4_Exploratory_Diagnostics_figure.png")) {
knitr::include_graphics("003_4_Section_4_Exploratory_Diagnostics_figure.png")
}
```
Key structural observations:
- Traffic variables cluster strongly together
- Engineered interaction terms behave as expected
- Weather variables are not perfectly collinear
This supports the use of tree-based models, which can handle correlated predictors and nonlinear interactions more flexibly than a linear model.
## Temperature and Severe Collision Rate
::: {.callout-note}
**Interpretation note:** The SHAP importance bar shows that `temp_c` is an influential predictor, but does not indicate whether cold or warm temperatures are more dangerous. Temperature likely acts as a **context variable** — reflecting freeze-thaw cycles, black ice conditions, slush, or seasonal exposure patterns — rather than a simple directional effect. The banded analysis below is the appropriate tool to assess whether severe outcomes are concentrated in very cold, near-freezing, mild, or warmer conditions.
:::
```{r temp-band-table}
if (!has_temp) {
cat("Temperature-banded analysis is not available in this render because the loaded dataset does not contain `temp_c`. Load the enriched weather-joined CSV to enable this section.")
} else {
temp_profile <- analysis_df %>%
filter(!is.na(temp_c), !is.na(severity_raw)) %>%
mutate(
temp_band = cut(
temp_c,
breaks = c(-Inf, -10, -5, 0, 5, 10, 15, 20, 25, Inf),
labels = c("< -10", "-10 to -5", "-5 to 0", "0 to 5",
"5 to 10", "10 to 15", "15 to 20", "20 to 25", "> 25"),
right = FALSE
)
) %>%
group_by(temp_band) %>%
summarise(
collisions = n(),
severe_collisions = sum(severity_raw == 1, na.rm = TRUE),
severe_rate = mean(severity_raw == 1, na.rm = TRUE),
.groups = "drop"
)
temp_profile %>%
mutate(severe_rate_pct = scales::percent(severe_rate, accuracy = 0.1)) %>%
select(temp_band, collisions, severe_collisions, severe_rate_pct) %>%
kbl(
caption = "Severe collision rate by temperature band",
col.names = c("Temperature Band (°C)", "Total Collisions",
"Severe Collisions", "Severe Rate")
) %>%
kable_styling(full_width = FALSE)
}
```
```{r temp-band-plot, fig.cap="Severe collision rate by temperature band"}
if (has_temp) {
ggplot(temp_profile, aes(x = temp_band, y = severe_rate)) +
geom_col(fill = "#2C7BB6", alpha = 0.85) +
geom_hline(
yintercept = mean(analysis_df$severity_raw == 1, na.rm = TRUE),
linetype = "dashed", colour = "#D7191C", linewidth = 0.8
) +
annotate(
"text", x = Inf, y = mean(analysis_df$severity_raw == 1, na.rm = TRUE),
label = " Overall severe rate", hjust = 1.05, vjust = -0.5,
colour = "#D7191C", size = 3.2
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "Severe collision rate by temperature band",
subtitle = "Dashed line = overall severe rate. Deviations indicate temperature-specific severity elevation.",
x = "Temperature band (°C)",
y = "Severe collision rate"
) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
}
```
**Interpreting this chart:** Bands that rise clearly above the dashed reference line represent temperature conditions associated with elevated severe outcomes in this dataset. Bands near or below the reference line do not show elevated risk. This chart does not establish causation; it identifies where temperature context correlates with severity in the observed records.
## Candidate Models
### Logistic Regression
Logistic Regression was used as the baseline because it is interpretable and provides a clean benchmark for binary classification.
```{r logistic-performance}
logistic_tbl <- tibble(
Model = "Logistic Regression",
AUC = 0.604,
Interpretation = "Transparent baseline with modest discrimination"
)
logistic_tbl %>%
kbl(caption = "Logistic Regression summary") %>%
kable_styling(full_width = FALSE)
```
**Interpretation:** This model detects meaningful signal, but the relationship between predictors and severity is not cleanly linear. It performs better than random guessing, but not strongly enough to serve as a standalone operational model.
**Plain-language takeaway:** This is the "basic benchmark" model. It gives a sensible starting point, but does not capture enough complexity to separate severe from non-severe collisions well.
### Random Forest
Random Forest was used to test whether nonlinear decision rules and interaction effects would improve performance beyond the linear baseline.
```{r random-forest-summary}
rf_tbl <- tibble(
Model = "Random Forest",
AUC = 0.574,
Interpretation = "Flexible nonlinear model, but weaker held-out discrimination"
)
rf_tbl %>%
kbl(caption = "Random Forest summary") %>%
kable_styling(full_width = FALSE)
```
**Interpretation:** Although Random Forest can model nonlinear effects, its held-out performance was weaker than Logistic Regression in this version of the dataset.
**Plain-language takeaway:** Adding flexibility alone did not guarantee better results. A more complicated model is not always a better model.
### XGBoost
XGBoost was used as the most advanced candidate because boosting can focus iteratively on harder-to-classify cases and capture more complex structure.
```{r xgboost-summary}
xgb_tbl <- tibble(
Model = "XGBoost",
AUC = 0.642,
Interpretation = "Best-performing model on held-out discrimination"
)
xgb_tbl %>%
kbl(caption = "XGBoost summary") %>%
kable_styling(full_width = FALSE)
```
**Interpretation:** XGBoost achieved the strongest ranking performance of the three models, suggesting that severe collision risk is influenced by nonlinear combinations of weather, traffic exposure, and collision context.
**Plain-language takeaway:** This was the strongest model, but it is still not predicting the future with certainty. It is better understood as a tool for flagging higher-risk cases for closer attention.
## Final Comparative Evaluation
### Test-Set Comparison
```{r model-comparison}
metrics_df <- tibble(
Model = c("Logistic Regression", "Random Forest", "XGBoost"),
`AUC-ROC` = c(0.604, 0.574, 0.642),
Conclusion = c(
"Transparent baseline",
"Flexible but weaker generalization",
"Best overall discrimination"
)
)
metrics_df %>%
arrange(desc(`AUC-ROC`)) %>%
kbl(caption = "Model comparison — held-out test set") %>%
kable_styling(full_width = FALSE)
```
### ROC Comparison
```{r roc-comparison, fig.cap="ROC curves — enhanced provincial corridor model"}
if (file.exists("004_6_3_6_3_XGBoost_figure.png")) {
knitr::include_graphics("004_6_3_6_3_XGBoost_figure.png")
}
```
XGBoost leads the final comparison, followed by Logistic Regression, then Random Forest.
**Overall interpretation:** The results suggest that severe collision prediction is feasible at a **modest** discrimination level. The main practical value of the system is in **risk ranking and corridor monitoring**, not exact event prediction.
## Variable Importance
```{r variable-importance, fig.cap="Variable importance — enhanced predictor stack"}
if (file.exists("005_6_3_6_3_XGBoost_figure.png")) {
knitr::include_graphics("005_6_3_6_3_XGBoost_figure.png")
}
```
Across Random Forest and XGBoost, the strongest predictors included:
- `temp_c`
- `wind_kph`
- `n_vehicles`
- Traffic exposure measures such as AADT and truck share
- Interaction terms involving traffic volume with visibility or precipitation
### Interpretation
- Weather conditions appear to shape the severity context of collisions
- Traffic volume acts as an exposure amplifier
- XGBoost appears to capture these interactions more effectively than the other models
- Behavioural indicators contributed less than expected, which may reflect weaker signal quality or underreporting
::: {.callout-important}
**Important:** A high SHAP importance for `temp_c` means the model relies heavily on temperature as a discriminating feature. It does **not** establish whether cold or warm temperatures are the primary driver of severity elevation. The temperature-banded analysis in the previous section is the appropriate tool for directional interpretation.
:::
**Conclusion:** Severe collision risk in this subset appears driven more by **environmental and exposure conditions** than by isolated behavioural indicators.
## What the Results Mean
This is a **severity-classification model among already-observed collisions**.
That means:
- it does **not** estimate where collisions will happen in the first place
- it **does** estimate which collisions are more likely to be severe once a collision has occurred
- threshold choice reflects policy tradeoffs, not a universal "correct" cutoff
- route-level traffic exposure improves realism, but is still an approximation
## Meta-Cognitive Reflection: What We Learned in Plain Language
At a simple level, this project showed that serious collisions are hard to predict cleanly because many factors are occurring simultaneously.
Even after adding weather, road context, traffic exposure, and crash-structure features, severe and non-severe cases still overlap substantially. The models can identify patterns, but those patterns are not strong enough to achieve near-perfect separation.
Key lessons:
- a better model does not remove uncertainty
- more features do not automatically produce clearer predictions
- traffic and weather often carry more stable signal than expected
- the most useful outcome is not "perfect prediction," but **better prioritization**
In plain terms, the system is best understood as a way to say:
> "These conditions look more dangerous than average, so they deserve more attention."
That is a realistic and defensible use of machine learning in a public-safety context.
## Limitations
This model should not be interpreted as establishing causal relationships or attributing institutional responsibility. It is a severity-classification system built on observed collision records and approximate exposure and context variables.
Important constraints include:
- severity is modeled conditional on collision occurrence, not full collision risk
- route-level traffic exposure is an approximation, not exact segment-hour exposure
- weather assignment is based on nearest-station matching and may miss local variation
- behavioural variables may be weaker partly because they are underreported or inconsistently captured
- moderate AUC performance indicates useful ranking ability, not high-certainty forecasting
Accordingly, the model is best used as a policy-support and prioritization layer, not as a standalone decision rule.
## Policy Use
This system is most useful for:
- identifying higher-risk corridor conditions
- supporting monitoring and intervention prioritization
- informing enforcement, signage, seasonal planning, and roadway review
It should not be interpreted as a deterministic crash prediction tool.
## Optional Code Appendix
Readers can expand the code throughout this report using the **"See 4 Urself"** toggles. A lightweight example of the modeling workflow is shown below.
```{r appendix-code}
# Summary of reported model performance
tibble(
Model = c("Logistic Regression", "Random Forest", "XGBoost"),
AUC = c(0.604, 0.574, 0.642)
) %>%
arrange(desc(AUC))
```
## Conclusion
Across the three candidate models, XGBoost delivered the strongest held-out performance, but overall discrimination remained moderate. This indicates that severe collision outcomes are partly predictable, though not cleanly separable, within the available provincial-corridor feature set.
The strongest signals came from environmental and exposure-related variables — especially temperature, wind speed, traffic volume, and interaction effects involving visibility and precipitation. Isolated behavioural indicators contributed less than expected, which may reflect weaker signal quality, underreporting, or limited measurement precision.
Taken together, the results support a practical interpretation of the system as a risk-prioritization tool rather than a deterministic prediction engine. Its most defensible public-sector value lies in helping provincial and municipal decision-makers identify corridor conditions that deserve closer monitoring, more responsive intervention, and more targeted resource allocation.
## Further Implications for Provincial Road Safety Planning
The model suggests that severe-collision management may benefit from shifting away from static road-safety interpretation toward condition-sensitive corridor monitoring.
Three implications stand out:
1. **Environmental exposure** should be integrated more directly into corridor safety review. Weather-related severity patterns suggest that seasonal and same-day operating conditions matter materially.
2. **Traffic exposure** should be treated as a severity amplifier. Higher-volume corridors may warrant stronger prioritization when adverse conditions are present.
3. **Provincial and municipal agencies** may benefit from a ranked-risk framework. Instead of treating all corridors equally, the model supports targeting attention toward conditions and segments where severity risk appears elevated.
These findings create a foundation for future work in corridor scoring, intervention targeting, and cross-jurisdiction safety coordination.
## LLM Usage Disclosure
Claude and ChatGPT were used to help structure the R/Quarto workflow, improve report organization, and refine interpretive phrasing. All final analytical claims, metrics, and project-specific outputs were reviewed by the authors.