Stable-CCA Crash Typology Discovery Using
clustrd
1 Purpose
This notebook implements a Stability-Regularized Cluster Correspondence Analysis (Stable-CCA) framework for crash typology discovery. Each row of the data represents one crash record, and each crash is described using categorical roadway, traffic control, environmental, human factor, and severity attributes.
The framework uses clustrd::clusmca() to estimate
Cluster Correspondence Analysis and adds a bootstrap-based stability
layer. The purpose is to move beyond a one-time exploratory CCA and
develop a reproducibility-aware crash typology method.
2 Research Questions
2.1 RQ1: Stable crash typology discovery
Can Cluster Correspondence Analysis identify stable and reproducible crash typologies from categorical crash records?
2.2 RQ2: Engineering interpretability
Do stable crash typologies reveal distinct roadway, traffic control, human factor, and severity mechanisms that can support engineering countermeasure selection?
2.3 RQ3: Countermeasure readiness
Which stable crash typologies are most actionable from a Safe System and roadway safety engineering perspective?
3 Methodological Contribution
This study introduces Stable-CCA, a bootstrap-regularized extension of Cluster Correspondence Analysis for crash data. The method:
- Treats each crash record as a categorical profile.
- Applies Cluster Correspondence Analysis using
clustrd::clusmca(). - Repeats CCA clustering over bootstrap samples.
- Aligns bootstrap cluster labels to the reference solution.
- Computes crash-level, cluster-level, and model-level stability.
- Selects the number of clusters using a composite stability score.
- Profiles stable crash typologies and computes a countermeasure readiness score.
4 Packages
5 Data
The following sample dataset is based on the crash-record structure
provided by the user. In the full study, replace this section with a
file import, such as read.csv() or
readr::read_csv().
crash_df <- tribble(
~Wthr, ~Lght, ~Rd_l, ~Trf_, ~Int_, ~FHE_, ~Oth_, ~Rd_c, ~PSL, ~Sesn, ~Cnt_, ~Ethn, ~Gndr, ~Hlmt, ~Dr__, ~Pp_G, ~Svrt,
"Clear", "Daylight", "Straight, level", "Signal light", "Int related", "Omv vehicle going straight", "Other", "City street", "30-40 mph", "Winter", "Other", "Hispanic", "Male", "Not worn", "Other", "250,000 pop. And over", "BC",
"Clear", "Dark, not lighted", "Straight, level", "Stop sign", "Int related", "Omv vehicle turning left", "Other", "City street", "45-60 mph", "Winter", "Pedestrian ftyrow to vehicle", "Hispanic", "Male", "Not worn", "Unlicensed", "100,000 - 249,999 pop.", "BC",
"Clear", "Daylight", "Straight, level", "Marked lanes", "Non Int", "Omv vehicle going straight", "Other", "Other", "30-40 mph", "Winter", "Other", "Asian", "Male", "Not worn", "Other", "Other", "BC",
"Cloudy", "Daylight", "Straight, level", "Stop sign", "Int", "Omv vehicle turning left", "Other", "City street", "30-40 mph", "Winter", "Other", "White", "Male", "Not worn", "Other", "Other", "BC",
"Clear", "Daylight", "Straight, level", "Signal light", "Int related", "Omv vehicle going straight", "Inattentive driving", "City street", "30-40 mph", "Winter", "Other", "White", "Female", "Not worn", "Other", "Other", "BC",
"Clear", "Daylight", "Straight, level", "Stop sign", "Int", "Omv vehicle going straight", "Other", "City street", "30-40 mph", "Winter", "Other", "Hispanic", "Female", "Not worn", "Other", "Other", "BC",
"Cloudy", "Daylight", "Straight, level", "Stop sign", "Int", "Omv vehicle going straight", "Inattentive driving", "City street", "30-40 mph", "Winter", "Driver inattention", "White", "Male", "Not worn", "Other", "50,000 - 99,999 pop", "O",
"Clear", "Daylight", "Straight, level", "Other", "Driveway accs", "Omv vehicle turning left", "veh leaving driveway", "City street", "30-40 mph", "Winter", "Other", "White", "Male", "Not worn", "Other", "Other", "BC",
"Clear", "Daylight", "Straight, level", "Signal light", "Non Int", "Omv vehicle going straight", "Other", "Other", "45-60 mph", "Winter", "Other", "Hispanic", "Male", "Not worn", "Other", "250,000 pop. And over", "BC",
"Cloudy", "Daylight", "Straight, level", "Marked lanes", "Int related", "Omv vehicle going straight", "Other", "City street", "45-60 mph", "Winter", "Not applicable", "White", "Male", "Not worn", "Other", "Other", "BC",
"Clear", "Daylight", "Straight, level", "None", "Non Int", "Omv vehicle going straight", "Other", "City street", "30-40 mph", "Winter", "Other", "White", "Female", "Not worn", "Unlicensed", "Other", "BC"
)
crash_df <- crash_df %>%
mutate(Crash_ID = paste0("C", row_number())) %>%
relocate(Crash_ID)
vars_for_cca <- c(
"Wthr", "Lght", "Rd_l", "Trf_", "Int_", "FHE_", "Oth_",
"Rd_c", "PSL", "Sesn", "Cnt_", "Ethn", "Gndr", "Hlmt",
"Dr__", "Pp_G", "Svrt"
)
crash_cat <- crash_df %>%
mutate(across(all_of(vars_for_cca), ~ as.factor(.x)))
knitr::kable(crash_cat, caption = "Sample crash data. Each row represents one crash record.")| Crash_ID | Wthr | Lght | Rd_l | Trf_ | Int_ | FHE_ | Oth_ | Rd_c | PSL | Sesn | Cnt_ | Ethn | Gndr | Hlmt | Dr__ | Pp_G | Svrt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C1 | Clear | Daylight | Straight, level | Signal light | Int related | Omv vehicle going straight | Other | City street | 30-40 mph | Winter | Other | Hispanic | Male | Not worn | Other | 250,000 pop. And over | BC |
| C2 | Clear | Dark, not lighted | Straight, level | Stop sign | Int related | Omv vehicle turning left | Other | City street | 45-60 mph | Winter | Pedestrian ftyrow to vehicle | Hispanic | Male | Not worn | Unlicensed | 100,000 - 249,999 pop. | BC |
| C3 | Clear | Daylight | Straight, level | Marked lanes | Non Int | Omv vehicle going straight | Other | Other | 30-40 mph | Winter | Other | Asian | Male | Not worn | Other | Other | BC |
| C4 | Cloudy | Daylight | Straight, level | Stop sign | Int | Omv vehicle turning left | Other | City street | 30-40 mph | Winter | Other | White | Male | Not worn | Other | Other | BC |
| C5 | Clear | Daylight | Straight, level | Signal light | Int related | Omv vehicle going straight | Inattentive driving | City street | 30-40 mph | Winter | Other | White | Female | Not worn | Other | Other | BC |
| C6 | Clear | Daylight | Straight, level | Stop sign | Int | Omv vehicle going straight | Other | City street | 30-40 mph | Winter | Other | Hispanic | Female | Not worn | Other | Other | BC |
| C7 | Cloudy | Daylight | Straight, level | Stop sign | Int | Omv vehicle going straight | Inattentive driving | City street | 30-40 mph | Winter | Driver inattention | White | Male | Not worn | Other | 50,000 - 99,999 pop | O |
| C8 | Clear | Daylight | Straight, level | Other | Driveway accs | Omv vehicle turning left | veh leaving driveway | City street | 30-40 mph | Winter | Other | White | Male | Not worn | Other | Other | BC |
| C9 | Clear | Daylight | Straight, level | Signal light | Non Int | Omv vehicle going straight | Other | Other | 45-60 mph | Winter | Other | Hispanic | Male | Not worn | Other | 250,000 pop. And over | BC |
| C10 | Cloudy | Daylight | Straight, level | Marked lanes | Int related | Omv vehicle going straight | Other | City street | 45-60 mph | Winter | Not applicable | White | Male | Not worn | Other | Other | BC |
| C11 | Clear | Daylight | Straight, level | None | Non Int | Omv vehicle going straight | Other | City street | 30-40 mph | Winter | Other | White | Female | Not worn | Unlicensed | Other | BC |
6 Descriptive Summary
summary_table <- crash_cat %>%
select(all_of(vars_for_cca)) %>%
summarise(across(everything(), ~ n_distinct(.x))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Number_of_Categories")
knitr::kable(summary_table, caption = "Number of categories by crash variable.")| Variable | Number_of_Categories |
|---|---|
| Wthr | 2 |
| Lght | 2 |
| Rd_l | 1 |
| Trf_ | 5 |
| Int_ | 4 |
| FHE_ | 2 |
| Oth_ | 3 |
| Rd_c | 2 |
| PSL | 2 |
| Sesn | 1 |
| Cnt_ | 4 |
| Ethn | 3 |
| Gndr | 2 |
| Hlmt | 1 |
| Dr__ | 2 |
| Pp_G | 4 |
| Svrt | 2 |
7 Core Stable-CCA Functions
7.1 Function 1: Run
clustrd CCA
run_clustrd_cca <- function(data,
vars,
nclus = 2,
ndim = 2,
method = "clusCA",
nstart = 100,
seed = 123) {
x <- data %>%
select(all_of(vars)) %>%
mutate(across(everything(), as.factor))
fit <- clustrd::clusmca(
data = x,
nclus = nclus,
ndim = ndim,
method = method,
nstart = nstart,
seed = seed
)
return(fit)
}7.2 Function 2: Align bootstrap cluster labels
Cluster labels can switch across bootstrap runs. This function aligns the bootstrap solution to the reference solution by maximizing label overlap.
align_clusters <- function(base_labels, boot_labels, common_index) {
base_common <- base_labels[common_index]
tab <- table(base_common, boot_labels)
cost <- max(tab) - tab
assignment <- clue::solve_LSAP(cost)
map <- setNames(seq_along(assignment), assignment)
aligned <- map[as.character(boot_labels)]
return(as.integer(aligned))
}7.3 Function 3: Stability-Regularized CCA
stable_clustrd_cca <- function(data,
vars,
id_col = "Crash_ID",
nclus = 2,
ndim = 2,
method = "clusCA",
n_boot = 50,
sample_frac = 0.80,
nstart = 100,
seed = 123) {
set.seed(seed)
n <- nrow(data)
ids <- data[[id_col]]
base_fit <- run_clustrd_cca(
data = data,
vars = vars,
nclus = nclus,
ndim = ndim,
method = method,
nstart = nstart,
seed = seed
)
base_labels <- base_fit$cluster
assignment_mat <- matrix(NA, nrow = n, ncol = n_boot)
rownames(assignment_mat) <- ids
ari_values <- numeric(n_boot)
for (b in seq_len(n_boot)) {
boot_index <- sample(
seq_len(n),
size = floor(sample_frac * n),
replace = TRUE
)
unique_boot_index <- unique(boot_index)
if (length(unique_boot_index) <= nclus) next
boot_data <- data[unique_boot_index, ]
boot_fit <- tryCatch(
run_clustrd_cca(
data = boot_data,
vars = vars,
nclus = nclus,
ndim = ndim,
method = method,
nstart = nstart,
seed = seed + b
),
error = function(e) NULL
)
if (is.null(boot_fit)) next
boot_labels_raw <- boot_fit$cluster
aligned_labels <- align_clusters(
base_labels = base_labels,
boot_labels = boot_labels_raw,
common_index = unique_boot_index
)
assignment_mat[unique_boot_index, b] <- aligned_labels
ari_values[b] <- mclust::adjustedRandIndex(
base_labels[unique_boot_index],
aligned_labels
)
}
crash_stability <- purrr::map_dbl(seq_len(n), function(i) {
observed <- assignment_mat[i, ]
observed <- observed[!is.na(observed)]
if (length(observed) == 0) return(NA_real_)
mean(observed == base_labels[i])
})
cluster_stability <- tibble(
Crash_ID = ids,
Cluster = base_labels,
Crash_Stability = crash_stability
) %>%
group_by(Cluster) %>%
summarise(
N_Crashes = n(),
Mean_Stability = mean(Crash_Stability, na.rm = TRUE),
Median_Stability = median(Crash_Stability, na.rm = TRUE),
Min_Stability = min(Crash_Stability, na.rm = TRUE),
.groups = "drop"
)
out <- list(
base_fit = base_fit,
base_cluster = base_labels,
assignment_matrix = assignment_mat,
crash_stability = crash_stability,
cluster_stability = cluster_stability,
mean_ari = mean(ari_values, na.rm = TRUE),
overall_stability = mean(crash_stability, na.rm = TRUE),
ari_values = ari_values
)
return(out)
}8 Run Stable-CCA
For the small sample, the analysis uses two clusters and two CCA dimensions. For a full crash dataset, use more bootstrap replications and test multiple cluster solutions.
stable_fit <- stable_clustrd_cca(
data = crash_cat,
vars = vars_for_cca,
id_col = "Crash_ID",
nclus = 2,
ndim = 2,
method = "clusCA",
n_boot = 50,
sample_frac = 0.80,
nstart = 100,
seed = 123
)## | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20%
crash_results <- crash_cat %>%
mutate(
Stable_CCA_Cluster = stable_fit$base_cluster,
Crash_Stability = stable_fit$crash_stability,
Stability_Class = case_when(
Crash_Stability >= 0.80 ~ "Highly stable",
Crash_Stability >= 0.60 ~ "Moderately stable",
Crash_Stability >= 0.40 ~ "Weakly stable",
TRUE ~ "Unstable"
)
)
knitr::kable(crash_results, caption = "Crash-level Stable-CCA results.")| Crash_ID | Wthr | Lght | Rd_l | Trf_ | Int_ | FHE_ | Oth_ | Rd_c | PSL | Sesn | Cnt_ | Ethn | Gndr | Hlmt | Dr__ | Pp_G | Svrt | Stable_CCA_Cluster | Crash_Stability | Stability_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C1 | Clear | Daylight | Straight, level | Signal light | Int related | Omv vehicle going straight | Other | City street | 30-40 mph | Winter | Other | Hispanic | Male | Not worn | Other | 250,000 pop. And over | BC | 1 | 1.0 | Highly stable |
| C2 | Clear | Dark, not lighted | Straight, level | Stop sign | Int related | Omv vehicle turning left | Other | City street | 45-60 mph | Winter | Pedestrian ftyrow to vehicle | Hispanic | Male | Not worn | Unlicensed | 100,000 - 249,999 pop. | BC | 2 | 1.0 | Highly stable |
| C3 | Clear | Daylight | Straight, level | Marked lanes | Non Int | Omv vehicle going straight | Other | Other | 30-40 mph | Winter | Other | Asian | Male | Not worn | Other | Other | BC | 1 | 1.0 | Highly stable |
| C4 | Cloudy | Daylight | Straight, level | Stop sign | Int | Omv vehicle turning left | Other | City street | 30-40 mph | Winter | Other | White | Male | Not worn | Other | Other | BC | 1 | 1.0 | Highly stable |
| C5 | Clear | Daylight | Straight, level | Signal light | Int related | Omv vehicle going straight | Inattentive driving | City street | 30-40 mph | Winter | Other | White | Female | Not worn | Other | Other | BC | 1 | 1.0 | Highly stable |
| C6 | Clear | Daylight | Straight, level | Stop sign | Int | Omv vehicle going straight | Other | City street | 30-40 mph | Winter | Other | Hispanic | Female | Not worn | Other | Other | BC | 1 | 1.0 | Highly stable |
| C7 | Cloudy | Daylight | Straight, level | Stop sign | Int | Omv vehicle going straight | Inattentive driving | City street | 30-40 mph | Winter | Driver inattention | White | Male | Not worn | Other | 50,000 - 99,999 pop | O | 1 | 1.0 | Highly stable |
| C8 | Clear | Daylight | Straight, level | Other | Driveway accs | Omv vehicle turning left | veh leaving driveway | City street | 30-40 mph | Winter | Other | White | Male | Not worn | Other | Other | BC | 2 | 0.5 | Weakly stable |
| C9 | Clear | Daylight | Straight, level | Signal light | Non Int | Omv vehicle going straight | Other | Other | 45-60 mph | Winter | Other | Hispanic | Male | Not worn | Other | 250,000 pop. And over | BC | 1 | 1.0 | Highly stable |
| C10 | Cloudy | Daylight | Straight, level | Marked lanes | Int related | Omv vehicle going straight | Other | City street | 45-60 mph | Winter | Not applicable | White | Male | Not worn | Other | Other | BC | 1 | 1.0 | Highly stable |
| C11 | Clear | Daylight | Straight, level | None | Non Int | Omv vehicle going straight | Other | City street | 30-40 mph | Winter | Other | White | Female | Not worn | Unlicensed | Other | BC | 1 | 1.0 | Highly stable |
9 Stability Results
| Cluster | N_Crashes | Mean_Stability | Median_Stability | Min_Stability |
|---|---|---|---|---|
| 1 | 9 | 1.00 | 1.00 | 1.0 |
| 2 | 2 | 0.75 | 0.75 | 0.5 |
## Overall stability: 0.955
## Mean adjusted Rand index: 0.068
tibble(ARI = stable_fit$ari_values) %>%
filter(!is.na(ARI)) %>%
ggplot(aes(x = ARI)) +
geom_histogram(bins = 10, color = "white") +
labs(
title = "Bootstrap Stability Distribution",
subtitle = "Adjusted Rand Index between reference and bootstrap-aligned CCA clusters",
x = "Adjusted Rand Index",
y = "Bootstrap Count"
) +
theme_minimal()10 Cluster Selection
For the small sample, only k = 2:3 is evaluated. For a
full dataset, use a wider range such as k = 2:8.
evaluate_clustrd_k <- function(data,
vars,
k_values = 2:3,
ndim = 2,
method = "clusCA",
n_boot = 30,
sample_frac = 0.80,
nstart = 100,
seed = 123) {
purrr::map_dfr(k_values, function(k) {
message("Evaluating k = ", k)
fit_k <- stable_clustrd_cca(
data = data,
vars = vars,
id_col = "Crash_ID",
nclus = k,
ndim = ndim,
method = method,
n_boot = n_boot,
sample_frac = sample_frac,
nstart = nstart,
seed = seed + k
)
min_cluster_stability <- min(
fit_k$cluster_stability$Mean_Stability,
na.rm = TRUE
)
tibble(
K = k,
Overall_Stability = fit_k$overall_stability,
Mean_ARI = fit_k$mean_ari,
Min_Cluster_Stability = min_cluster_stability,
Stable_CCA_Score =
0.50 * Mean_ARI +
0.30 * Overall_Stability +
0.20 * Min_Cluster_Stability
)
}) %>%
arrange(desc(Stable_CCA_Score))
}k_results <- evaluate_clustrd_k(
data = crash_cat,
vars = vars_for_cca,
k_values = 2:3,
ndim = 2,
method = "clusCA",
n_boot = 30,
sample_frac = 0.80,
nstart = 100,
seed = 123
)## | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | | | 0% | |======= | 10%
## | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30%
knitr::kable(k_results, digits = 3, caption = "Candidate cluster solutions ranked by Stable-CCA score.")| K | Overall_Stability | Mean_ARI | Min_Cluster_Stability | Stable_CCA_Score |
|---|---|---|---|---|
| 3 | 0.863 | 0.389 | 0.703 | 0.594 |
| 2 | 1.000 | 0.100 | 1.000 | 0.550 |
k_results %>%
pivot_longer(
cols = c(Overall_Stability, Mean_ARI, Min_Cluster_Stability, Stable_CCA_Score),
names_to = "Metric",
values_to = "Value"
) %>%
ggplot(aes(x = K, y = Value, group = Metric, linetype = Metric, shape = Metric)) +
geom_line() +
geom_point(size = 3) +
scale_x_continuous(breaks = k_results$K) +
labs(
title = "Stable-CCA Cluster Selection Metrics",
x = "Number of Clusters",
y = "Metric Value"
) +
theme_minimal()11 Cluster Profiling
describe_clusters <- function(data, cluster_col, vars) {
purrr::map_dfr(vars, function(v) {
data %>%
group_by(
Cluster = .data[[cluster_col]],
Category = .data[[v]]
) %>%
summarise(
n = n(),
.groups = "drop_last"
) %>%
mutate(
pct_within_cluster = 100 * n / sum(n)
) %>%
ungroup() %>%
mutate(
Variable = v
) %>%
select(Cluster, Variable, Category, n, pct_within_cluster)
})
}cluster_profile <- describe_clusters(
data = crash_results,
cluster_col = "Stable_CCA_Cluster",
vars = vars_for_cca
)
top_cluster_profile <- cluster_profile %>%
group_by(Cluster, Variable) %>%
slice_max(pct_within_cluster, n = 2, with_ties = FALSE) %>%
arrange(Cluster, Variable, desc(pct_within_cluster))
knitr::kable(top_cluster_profile, digits = 1, caption = "Top categorical features within each Stable-CCA cluster.")| Cluster | Variable | Category | n | pct_within_cluster |
|---|---|---|---|---|
| 1 | Cnt_ | Other | 7 | 77.8 |
| 1 | Cnt_ | Driver inattention | 1 | 11.1 |
| 1 | Dr__ | Other | 8 | 88.9 |
| 1 | Dr__ | Unlicensed | 1 | 11.1 |
| 1 | Ethn | White | 5 | 55.6 |
| 1 | Ethn | Hispanic | 3 | 33.3 |
| 1 | FHE_ | Omv vehicle going straight | 8 | 88.9 |
| 1 | FHE_ | Omv vehicle turning left | 1 | 11.1 |
| 1 | Gndr | Male | 6 | 66.7 |
| 1 | Gndr | Female | 3 | 33.3 |
| 1 | Hlmt | Not worn | 9 | 100.0 |
| 1 | Int_ | Int | 3 | 33.3 |
| 1 | Int_ | Int related | 3 | 33.3 |
| 1 | Lght | Daylight | 9 | 100.0 |
| 1 | Oth_ | Other | 7 | 77.8 |
| 1 | Oth_ | Inattentive driving | 2 | 22.2 |
| 1 | PSL | 30-40 mph | 7 | 77.8 |
| 1 | PSL | 45-60 mph | 2 | 22.2 |
| 1 | Pp_G | Other | 6 | 66.7 |
| 1 | Pp_G | 250,000 pop. And over | 2 | 22.2 |
| 1 | Rd_c | City street | 7 | 77.8 |
| 1 | Rd_c | Other | 2 | 22.2 |
| 1 | Rd_l | Straight, level | 9 | 100.0 |
| 1 | Sesn | Winter | 9 | 100.0 |
| 1 | Svrt | BC | 8 | 88.9 |
| 1 | Svrt | O | 1 | 11.1 |
| 1 | Trf_ | Signal light | 3 | 33.3 |
| 1 | Trf_ | Stop sign | 3 | 33.3 |
| 1 | Wthr | Clear | 6 | 66.7 |
| 1 | Wthr | Cloudy | 3 | 33.3 |
| 2 | Cnt_ | Other | 1 | 50.0 |
| 2 | Cnt_ | Pedestrian ftyrow to vehicle | 1 | 50.0 |
| 2 | Dr__ | Other | 1 | 50.0 |
| 2 | Dr__ | Unlicensed | 1 | 50.0 |
| 2 | Ethn | Hispanic | 1 | 50.0 |
| 2 | Ethn | White | 1 | 50.0 |
| 2 | FHE_ | Omv vehicle turning left | 2 | 100.0 |
| 2 | Gndr | Male | 2 | 100.0 |
| 2 | Hlmt | Not worn | 2 | 100.0 |
| 2 | Int_ | Driveway accs | 1 | 50.0 |
| 2 | Int_ | Int related | 1 | 50.0 |
| 2 | Lght | Dark, not lighted | 1 | 50.0 |
| 2 | Lght | Daylight | 1 | 50.0 |
| 2 | Oth_ | Other | 1 | 50.0 |
| 2 | Oth_ | veh leaving driveway | 1 | 50.0 |
| 2 | PSL | 30-40 mph | 1 | 50.0 |
| 2 | PSL | 45-60 mph | 1 | 50.0 |
| 2 | Pp_G | 100,000 - 249,999 pop. | 1 | 50.0 |
| 2 | Pp_G | Other | 1 | 50.0 |
| 2 | Rd_c | City street | 2 | 100.0 |
| 2 | Rd_l | Straight, level | 2 | 100.0 |
| 2 | Sesn | Winter | 2 | 100.0 |
| 2 | Svrt | BC | 2 | 100.0 |
| 2 | Trf_ | Other | 1 | 50.0 |
| 2 | Trf_ | Stop sign | 1 | 50.0 |
| 2 | Wthr | Clear | 2 | 100.0 |
top_cluster_profile %>%
mutate(Label = paste0(Variable, ": ", Category)) %>%
group_by(Cluster) %>%
slice_max(pct_within_cluster, n = 10, with_ties = FALSE) %>%
ungroup() %>%
ggplot(aes(x = reorder(Label, pct_within_cluster), y = pct_within_cluster)) +
geom_col() +
coord_flip() +
facet_wrap(~ Cluster, scales = "free_y") +
labs(
title = "Dominant Crash Characteristics by Stable-CCA Cluster",
x = "Crash Characteristic",
y = "Percent Within Cluster"
) +
theme_minimal()12 Countermeasure Readiness Score
This section computes a countermeasure readiness score using actionable variables. The score combines cluster stability and the dominance of actionable engineering features.
actionable_vars <- c(
"Lght", "Rd_l", "Trf_", "Int_", "FHE_", "Rd_c", "PSL", "Oth_"
)
countermeasure_relevance <- cluster_profile %>%
filter(Variable %in% actionable_vars) %>%
group_by(Cluster) %>%
summarise(
Actionable_Dominance = mean(pct_within_cluster[pct_within_cluster >= 50], na.rm = TRUE),
N_Actionable_Dominant_Items = sum(pct_within_cluster >= 50, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
Actionable_Dominance = ifelse(is.nan(Actionable_Dominance), 0, Actionable_Dominance)
) %>%
left_join(stable_fit$cluster_stability, by = "Cluster") %>%
mutate(
Countermeasure_Readiness_Score =
0.60 * Mean_Stability +
0.40 * scales::rescale(Actionable_Dominance, to = c(0, 1))
) %>%
arrange(desc(Countermeasure_Readiness_Score))
knitr::kable(countermeasure_relevance, digits = 3, caption = "Countermeasure readiness by Stable-CCA cluster.")| Cluster | Actionable_Dominance | N_Actionable_Dominant_Items | N_Crashes | Mean_Stability | Median_Stability | Min_Stability | Countermeasure_Readiness_Score |
|---|---|---|---|---|---|---|---|
| 1 | 87.037 | 6 | 9 | 1.00 | 1.00 | 1.0 | 1.00 |
| 2 | 61.538 | 13 | 2 | 0.75 | 0.75 | 0.5 | 0.45 |
countermeasure_relevance %>%
mutate(Cluster = as.factor(Cluster)) %>%
ggplot(aes(x = Cluster, y = Countermeasure_Readiness_Score)) +
geom_col() +
labs(
title = "Countermeasure Readiness Score by Stable-CCA Cluster",
x = "Stable-CCA Cluster",
y = "Countermeasure Readiness Score"
) +
theme_minimal()13 CCA Visualization
coords <- as.data.frame(stable_fit$base_fit$obscoord)
names(coords)[1:2] <- c("Dim1", "Dim2")
coords_plot <- coords %>%
mutate(
Crash_ID = crash_results$Crash_ID,
Cluster = as.factor(crash_results$Stable_CCA_Cluster),
Stability = crash_results$Crash_Stability,
Severity = crash_results$Svrt
)
ggplot(coords_plot, aes(x = Dim1, y = Dim2, color = Cluster)) +
geom_point(aes(size = Stability, shape = Severity), alpha = 0.80) +
geom_text(aes(label = Crash_ID), vjust = -0.7, size = 3) +
labs(
title = "Stability-Regularized Cluster Correspondence Analysis",
subtitle = "Implemented with clustrd::clusmca; each point is one crash record",
x = "CCA Dimension 1",
y = "CCA Dimension 2",
color = "Stable CCA Cluster",
size = "Crash Stability",
shape = "Severity"
) +
theme_minimal()14 Interpretation Template
Use the following structure when interpreting the clusters in a manuscript.
14.1 Cluster 1 interpretation
Cluster 1 is characterized by the dominant categories shown in the cluster profile table. The stability value indicates how reproducible the typology is under bootstrap resampling. If the cluster has high stability and high countermeasure readiness, it can be interpreted as an actionable crash typology for engineering treatment selection.
14.2 Cluster 2 interpretation
Cluster 2 should be interpreted by identifying its dominant roadway, traffic control, environmental, and user-related variables. The countermeasure readiness score indicates whether the cluster is strongly associated with modifiable engineering features such as lighting, traffic control, speed limit, roadway class, or intersection context.
15 Suggested Manuscript Method Text
Each crash record was represented as a multivariate categorical
profile consisting of roadway, traffic control, environmental,
user-related, and severity attributes. Cluster Correspondence Analysis
was implemented using the clustrd package through the
clusmca() function. To address the sensitivity of one-time
exploratory clustering, this study introduced a bootstrap-based
stability regularization procedure. For each candidate number of
clusters, CCA was repeatedly estimated on bootstrap samples, and the
resulting cluster labels were aligned to the reference solution using a
maximum-overlap assignment procedure. Crash-level stability was defined
as the proportion of bootstrap replications in which a crash retained
its original cluster assignment. Cluster-level stability was computed as
the average stability of crashes within each typology, and model-level
stability was summarized using the adjusted Rand index. The final
cluster solution was selected using a composite score that combined
overall stability, adjusted Rand index, and minimum cluster
stability.
16 Scaling to Full Crash Data
For a full crash dataset, replace the sample data with the following structure:
crash_df <- readr::read_csv("your_crash_file.csv")
crash_df <- crash_df %>%
mutate(Crash_ID = as.character(Crash_ID))
vars_for_cca <- c(
"Wthr", "Lght", "Rd_l", "Trf_", "Int_", "FHE_", "Oth_",
"Rd_c", "PSL", "Sesn", "Cnt_", "Ethn", "Gndr", "Hlmt",
"Dr__", "Pp_G", "Svrt"
)
crash_cat <- crash_df %>%
mutate(across(all_of(vars_for_cca), ~ as.factor(replace_na(as.character(.x), "Unknown"))))
stable_fit <- stable_clustrd_cca(
data = crash_cat,
vars = vars_for_cca,
id_col = "Crash_ID",
nclus = 4,
ndim = 3,
method = "clusCA",
n_boot = 200,
sample_frac = 0.80,
nstart = 200,
seed = 123
)
k_results <- evaluate_clustrd_k(
data = crash_cat,
vars = vars_for_cca,
k_values = 2:8,
ndim = 3,
method = "clusCA",
n_boot = 100,
sample_frac = 0.80,
nstart = 200,
seed = 123
)17 Export Results
write_csv(crash_results, "stable_cca_crash_level_results.csv")
write_csv(stable_fit$cluster_stability, "stable_cca_cluster_stability.csv")
write_csv(cluster_profile, "stable_cca_cluster_profile.csv")
write_csv(countermeasure_relevance, "stable_cca_countermeasure_readiness.csv")
write_csv(k_results, "stable_cca_cluster_selection.csv")18 Conclusion
This R Markdown file provides a complete workflow for
Stability-Regularized Cluster Correspondence Analysis using
clustrd. The approach supports crash typology discovery
while addressing a key limitation of traditional CCA: the lack of
reproducibility testing. By adding crash-level and cluster-level
stability metrics, the framework helps identify which crash typologies
are robust enough to support safety diagnosis and countermeasure
selection.