Traffic accidents remain one of the most important urban safety problems because they directly affect human health, mobility, infrastructure planning, and the allocation of public resources. For a city, the total number of accidents gives only a partial picture of the problem. A more useful question is where risk tends to concentrate and which parts of the transport system require closer attention. Accident risk is rarely distributed evenly across the road network. Some segments may become repeatedly dangerous because of their function in the city, their position in the network, their road class, speed conditions, or their connection to busy intersections and corridors. For this reason, modelling accident risk at the level of road segments can support a practical understanding of urban safety that is closer to the scale at which many road-safety interventions are planned.
Recent studies increasingly approach this problem through graph-based road safety modelling. Nippani et al. (2023) formulate accident prediction as an edge-level task on road networks, where historical accident records are aggregated into monthly road-level labels and used to predict accident counts or accident occurrence in subsequent time periods; their approach therefore combines road-network structure with an explicit temporal setup based on chronological training, validation, and testing. Huang and Hooi (2022) propose the TRAVEL model, which uses road geometry and graph relations to estimate accident vulnerability. Huang et al. (2023) introduce the TAP repository, emphasizing the need for graph-structured accident datasets that preserve road topology and geospatial features. Gao et al. (2024) develop a more advanced road-level framework for spatio-temporal crash prediction and uncertainty-aware risk estimation.
The main aim of the project is to test the effectiveness of GNN (graph neural network) approach on data for traffic accidents occurred in Warsaw in 2024. We did not try to replicate geo-temporal approach used by Nippani et al., instead we use time component for train and test split only. In addition to this, it was necessary to clean the original dataset (collected from “sewik” traffic accidents database) and cut a part of the original road network, so the current project should rather be considered a proof of concept than a complete road risk predicting framework.
Although there exists quite large number of GNN architectures, we applied GraphSAGE model.GraphSAGE is a graph neural network model designed to learn representations of nodes by using both their own features and information from their neighbours. Instead of treating each road intersection or road segment as an isolated observation, the model passes information through the graph structure. In this project, this means that the representation of a node in the Warsaw road network is influenced by nearby connected nodes. GraphSAGE works by aggregating information from a node’s local neighbourhood and combining it with the node’s own attributes to create a learned embedding. These embeddings can then be used for prediction tasks. In the present case, the embeddings of the two nodes forming a road segment were combined with edge-level road attributes to estimate accident occurrence and accident count. This makes the model suitable for spatial road-risk analysis, because it allows the prediction for one segment to depend partly on its position in the wider road network.
Since the target variable is defined for road segments, the final prediction is made at the edge level. After GraphSAGE creates embeddings for the nodes of the road network, each edge is represented by the embeddings of its two endpoint nodes. These two node embeddings describe the local graph context around both ends of the road segment. The model then combines them with edge-specific features, such as road type, length, speed limit, one-way status, and bridge information. This combined edge representation is passed through a prediction layer. For the binary task, the output is transformed into the estimated probability that at least one accident occurs on the segment. For the regression task, the output represents the expected accident count for the segment. In this way, the model predicts road-segment risk using both the segment’s own attributes and the structure of the surrounding road network.
Let us start with loading all the necessary libraries:
Then, read the accidents data:
maz <- read.csv("Mazowieckie_accidents.csv")
wawa <- maz %>%
filter(POWIAT == "POWIAT WARSZAWA") %>%
mutate(
accident_date = lubridate::ymd(DATA_ZDARZ), # obtain month for future split
accident_month = lubridate::floor_date(accident_date, "month")
)
wawa %>%
summarise(
n_accidents = n(),
missing_dates = sum(is.na(accident_date)),
min_date = min(accident_date, na.rm = TRUE),
max_date = max(accident_date, na.rm = TRUE),
n_months = n_distinct(accident_month)
)
## n_accidents missing_dates min_date max_date n_months
## 1 25101 0 2024-01-01 2024-12-31 12
As we can clearly see, the overall number of accidents occurred in Warsaw within a year is quite huge; however, not all of them will be used for final modeling procedure.After that, we need to read the file with the powiats borders and extract the borders for Warsaw.
The next step is to load the roads geometries. OpensStreetMaps API did not allow us to extract all roads, so we used “osmextract” package. Initially, we load all Line geometries and then filter out only the main road types. We need to admit the fact that not all roads were used for future analysis. Utilizing all of them would seriously overcomplicate the analysis.
roads_osm_lines <- osmextract::oe_get(
place = wawa_spat,
provider = "geofabrik",
layer = "lines",
extra_tags = c("maxspeed", "oneway", "bridge", "ref"),
force_vectortranslate = TRUE,
quiet = FALSE
)
## 0...10...20...30...40...50...60...70...80...90...100 - done.
## Reading layer `lines' from data source
## `C:\Users\yana1\AppData\Roaming\R\data\R\osmextract\geofabrik_mazowieckie-latest.gpkg'
## using driver `GPKG'
## Simple feature collection with 1347853 features and 14 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: 18.23717 ymin: 50.88302 xmax: 23.89536 ymax: 53.80929
## Geodetic CRS: WGS 84
# Add missing columns if they are absent
needed_cols <- c("osm_id", "name", "ref", "highway", "oneway", "bridge", "maxspeed")
# define helper functions to transform the roads features
for (col in needed_cols) {
if (!col %in% names(roads_osm_lines)) {
roads_osm_lines[[col]] <- NA_character_
}
}
to_binary_osm <- function(x) {
z <- stringr::str_to_lower(as.character(x))
as.integer(z %in% c("yes", "true", "1", "-1", "reverse"))
}
to_integer_speed <- function(x) {
as.integer(readr::parse_number(as.character(x)))
}
roads_spat <- roads_osm_lines %>%
dplyr::transmute(
osm_id = suppressWarnings(as.numeric(osm_id)),
name = as.character(name),
ref = as.character(ref),
type = as.character(highway),
oneway = to_binary_osm(oneway),
bridge = to_binary_osm(bridge),
maxspeed = to_integer_speed(maxspeed),
geometry = geometry
) %>%
sf::st_transform(target_crs) %>%
sf::st_cast("MULTILINESTRING", warn = FALSE)
road_types_keep <- c( # filter the main road types
"motorway", "motorway_link",
"trunk", "trunk_link",
"primary", "primary_link",
"secondary", "secondary_link",
"tertiary", "tertiary_link",
"residential"
)
roads_spat <- roads_spat %>%
dplyr::filter(type %in% road_types_keep)
roads_spat %>%
sf::st_drop_geometry() %>%
dplyr::count(type, sort = TRUE)
## type n
## 1 residential 84082
## 2 tertiary 28462
## 3 secondary 15636
## 4 primary 9213
## 5 motorway 2856
## 6 motorway_link 1676
## 7 primary_link 924
## 8 secondary_link 561
## 9 trunk 413
## 10 tertiary_link 329
## 11 trunk_link 58
roads_spat %>%
sf::st_drop_geometry() %>%
dplyr::summarise(
n_roads = dplyr::n(),
non_missing_maxspeed = sum(!is.na(maxspeed)),
missing_maxspeed = sum(is.na(maxspeed)),
missing_maxspeed_share = mean(is.na(maxspeed))
)
## n_roads non_missing_maxspeed missing_maxspeed missing_maxspeed_share
## 1 144210 50893 93317 0.647091
Then, we leave only the road segments lying inside of Warsaw.
roads_wawa <- roads_spat %>%
st_transform(st_crs(wawa_spat)) %>%
st_filter(wawa_spat, .predicate = st_intersects)
roads_wawa_clipped <- roads_wawa %>%
st_intersection(wawa_spat) %>%
st_collection_extract("LINESTRING") %>%
st_cast("LINESTRING", warn = FALSE) %>%
mutate(
road_id = row_number(),
length_m = as.numeric(st_length(geometry))
)
roads_wawa_clipped <- roads_wawa_clipped %>%
filter(type %in% road_types_keep)
ggplot() +
geom_sf(data = wawa_spat, fill = "grey97", color = "grey25", linewidth = 0.7) +
geom_sf(data = roads_wawa_clipped, color = "grey20", linewidth = 0.08, alpha = 0.65) +
coord_sf(expand = FALSE) +
theme_void() +
labs(
title = "Road Network of Warsaw"
) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5),
plot.caption = element_text(size = 8, hjust = 1),
plot.margin = margin(10, 10, 10, 10))
We also need to make sure all the coordinates are stored in correct format. Here it can be seen that they are presented in mixed formats.
head(wawa[, c("WSP_GPS_X", "WSP_GPS_Y")])
## WSP_GPS_X WSP_GPS_Y
## 1 21*15'060 52*12'250
## 2 21*02'168 52*21'365
## 3 20*58'488 52*16'126
## 4 21.069849 52.230339
## 5 21.070175 52.23019
## 6 20.903177 52.239088
Let us define the proper transforming function and apply it.
# build function to take coords to same format
convert_wsp_coord <- function(x) {
x <- str_replace_all(as.character(x), ",", ".")
dms <- str_match(x, "^(\\d+)\\*(\\d+)'(\\d+)$")
from_dms <- as.numeric(dms[, 2]) +
as.numeric(dms[, 3]) / 60 +
(as.numeric(dms[, 4]) / 10) / 3600
from_decimal <- suppressWarnings(as.numeric(x))
ifelse(str_detect(x, "\\*"), from_dms, from_decimal)
}
# apply
wawa_sf <- wawa %>% mutate(
lon = convert_wsp_coord(WSP_GPS_X),
lat = convert_wsp_coord(WSP_GPS_Y)
) %>% st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
st_transform(target_crs)
After this, we can start preparing the dataset in the form of graph. First of all, we kept only the columns, which later will be used for modeling - road type, oneway (dummy), bridge (dummy), speed limit and geometry itself. Missing speed limits were imputed using median from the same road type. Additionally, we added dummy identifying the roads for which speed limit was missing initially.
roads_for_graph <- roads_wawa_clipped %>%
select(road_id, osm_id, type, oneway, bridge, maxspeed, length_m, geometry) %>%
mutate(
type = replace_na(as.character(type), "unknown"),
oneway = as.integer(oneway),
bridge = as.integer(bridge),
maxspeed_numeric = readr::parse_number(as.character(maxspeed)),
maxspeed_missing = as.integer(is.na(maxspeed_numeric))
) %>%
group_by(type) %>%
mutate(
type_median_maxspeed = median(maxspeed_numeric, na.rm = TRUE),
maxspeed_imputed = if_else(is.na(maxspeed_numeric), type_median_maxspeed, maxspeed_numeric)
) %>%
ungroup() %>%
select(road_id, osm_id, type, oneway, bridge, maxspeed_imputed, maxspeed_missing, geometry)
# sanity check
roads_for_graph %>%
st_drop_geometry() %>%
filter(is.na(maxspeed_imputed)) %>%
count(type, sort = TRUE)
## # A tibble: 0 × 2
## # ℹ 2 variables: type <chr>, n <int>
From the already prepared tibble, we create a network object, where each road component becomes and edge, and each crossing becomes a node. Although some researchers use directed graphs (Nippani et al.), we utilized undirected one due to the fact that we have no information about actual direction of each road segment.
road_net <- roads_for_graph %>%
as_sfnetwork(directed = FALSE) %>%
convert(to_spatial_subdivision)
Then, we assign node_id’s and edge_id’s and add natural log of segments length. Rod id should not be used at this point due to the fact that roads were divided into smaller segments.
road_net <- road_net %>%
activate(nodes) %>%
select(-any_of(".tidygraph_node_index")) %>%
mutate(node_id = row_number()) %>%
activate(edges) %>%
select(-any_of(".tidygraph_edge_index")) %>%
mutate(
graph_edge_id = row_number(),
graph_length_m = as.numeric(st_length(geometry)),
graph_log_length_m = log1p(graph_length_m)
)
After that, we extract edge layer, which later will be used for accidents assignments.
graph_edges <- road_net %>%
activate(edges) %>%
st_as_sf()
Finally, we need to verify the graph structure, check for missing length values and make sure unique identifiers are consistent.
graph_edges %>%
st_drop_geometry() %>%
summarise(
n_edges = n(),
unique_graph_edge_ids = n_distinct(graph_edge_id),
min_length_m = min(graph_length_m, na.rm = TRUE),
median_length_m = median(graph_length_m, na.rm = TRUE),
max_length_m = max(graph_length_m, na.rm = TRUE),
zero_length_edges = sum(graph_length_m == 0, na.rm = TRUE),
missing_length = sum(is.na(graph_length_m))
)
## # A tibble: 1 × 7
## n_edges unique_graph_edge_ids min_length_m median_length_m max_length_m
## <int> <int> <dbl> <dbl> <dbl>
## 1 33530 33530 0.0244 69.2 2340.
## # ℹ 2 more variables: zero_length_edges <int>, missing_length <int>
As graph structure looks consistent, we can proceed with accidents. Unfortunately the database we used does not provide fully consistent data; some of accidents might be clipped to places far away from roads. In addition to this, we removed part of the road types from the analysis. Consequently, we need check how many accidents lie outside the road network and clip only those lying not farther than 5 meters away.
graph_nodes <- road_net %>%
activate(nodes) %>%
st_as_sf()
nearest_edge_idx <- st_nearest_feature(wawa_sf, graph_edges)
nearest_node_idx <- st_nearest_feature(wawa_sf, graph_nodes)
dist_to_edge_m <- as.numeric(st_distance(
wawa_sf,
graph_edges[nearest_edge_idx, ],
by_element = TRUE
))
dist_to_node_m <- as.numeric(st_distance(
wawa_sf,
graph_nodes[nearest_node_idx, ],
by_element = TRUE
))
accident_location_check <- tibble(
total_accidents = nrow(wawa_sf),
kept_within_5m = sum(dist_to_edge_m <= 5),
deleted_outside_5m = sum(dist_to_edge_m > 5),
deleted_outside_5m_share = deleted_outside_5m / total_accidents,
near_node_5m = sum(dist_to_node_m <= 5),
edge_only_5m = sum(dist_to_edge_m <= 5 & dist_to_node_m > 5),
node_or_intersection_5m = sum(dist_to_node_m <= 5)
)
accident_location_check
## # A tibble: 1 × 7
## total_accidents kept_within_5m deleted_outside_5m deleted_outside_5m_share
## <int> <int> <int> <dbl>
## 1 25101 12982 12119 0.483
## # ℹ 3 more variables: near_node_5m <int>, edge_only_5m <int>,
## # node_or_intersection_5m <int>
In case of accidents lying near edges of the graph (road segments), we just assign them to the nearest edge. However, in case of accidents lying near nodes (road intersections), we still need to clip them to the nearest edge. By deleting them we would omit quite big number of accidents, and keeping them on nodes does not make any practical sense.
wawa_assigned_5m <- wawa_sf %>%
mutate(
graph_edge_id = graph_edges$graph_edge_id[nearest_edge_idx],
nearest_node_id = graph_nodes$node_id[nearest_node_idx],
dist_to_edge_m = dist_to_edge_m,
dist_to_node_m = dist_to_node_m,
assignment_type = case_when(
dist_to_edge_m <= 5 & dist_to_node_m <= 5 ~ "intersection_assigned_to_nearest_edge",
dist_to_edge_m <= 5 & dist_to_node_m > 5 ~ "direct_edge_assignment",
TRUE ~ "outside_graph"
)
) %>%
filter(assignment_type != "outside_graph")
wawa_assigned_5m %>%
st_drop_geometry() %>%
count(assignment_type)
## assignment_type n
## 1 direct_edge_assignment 10826
## 2 intersection_assigned_to_nearest_edge 2156
Then, we create a dataset with accidents count per each edge_id and per each month.
accidents_by_graph_edge_month <- wawa_assigned_5m %>%
st_drop_geometry() %>%
count(graph_edge_id, accident_month, name = "accident_count")
accidents_by_graph_edge_month %>%
summarise(
n_edge_months_with_accidents = n(),
total_assigned_accidents = sum(accident_count),
min_month = min(accident_month),
max_month = max(accident_month),
n_months = n_distinct(accident_month)
)
## n_edge_months_with_accidents total_assigned_accidents min_month max_month
## 1 12044 12982 2024-01-01 2024-12-01
## n_months
## 1 12
Then, we can create statik edge dataset having only features we need for further analysis.
edge_static_dataset <- graph_edges %>%
select(
graph_edge_id, road_id, osm_id, type, oneway, bridge,
graph_length_m, graph_log_length_m,
maxspeed_imputed, maxspeed_missing,
geometry
)
edge_static_dataset %>%
st_drop_geometry() %>%
summarise(
n_edges = n(),
unique_graph_edge_ids = n_distinct(graph_edge_id),
missing_type = sum(is.na(type)),
missing_maxspeed = sum(is.na(maxspeed_imputed))
)
## # A tibble: 1 × 4
## n_edges unique_graph_edge_ids missing_type missing_maxspeed
## <int> <int> <int> <int>
## 1 33530 33530 0 0
Then, we construct proper montly accidents panel with count and binary target.
all_months <- seq.Date(
from = min(wawa_sf$accident_month, na.rm = TRUE),
to = max(wawa_sf$accident_month, na.rm = TRUE),
by = "month"
)
edge_month_panel <- tidyr::expand_grid(
graph_edge_id = edge_static_dataset$graph_edge_id,
accident_month = all_months
) %>%
left_join(
accidents_by_graph_edge_month,
by = c("graph_edge_id", "accident_month")
) %>%
mutate(
accident_count = replace_na(accident_count, 0L),
accident_binary = as.integer(accident_count > 0)
)
edge_month_panel %>%
summarise(
n_rows = n(),
n_edges = n_distinct(graph_edge_id),
n_months = n_distinct(accident_month),
total_accidents = sum(accident_count),
positive_edge_months = sum(accident_binary),
positive_share = mean(accident_binary),
missing_targets = sum(is.na(accident_count))
)
## # A tibble: 1 × 7
## n_rows n_edges n_months total_accidents positive_edge_months positive_share
## <int> <int> <int> <int> <int> <dbl>
## 1 402360 33530 12 12982 12044 0.0299
## # ℹ 1 more variable: missing_targets <int>
At this point, we can depict the road network with montly accidents counts per each edge.
monthly_graph_dataset <- edge_static_dataset %>%
select(graph_edge_id, geometry) %>%
left_join(edge_month_panel, by = "graph_edge_id") %>%
mutate(
accident_month_label = format(accident_month, "%Y-%m")
)
month_chunks <- split(
all_months,
ceiling(seq_along(all_months) / 2)
)
common_max_accidents <- max(monthly_graph_dataset$accident_count, na.rm = TRUE)
plot_month_chunk <- function(month_vector) {
month_labels <- format(month_vector, "%Y-%m")
chunk_all_edges <- monthly_graph_dataset %>%
filter(accident_month %in% month_vector)
chunk_positive_edges <- chunk_all_edges %>%
filter(accident_count > 0)
ggplot() +
geom_sf(
data = chunk_all_edges,
color = "grey85",
linewidth = 0.12,
alpha = 0.65
) +
geom_sf(
data = chunk_positive_edges,
aes(color = accident_count, linewidth = accident_count),
alpha = 0.95
) +
geom_sf(
data = wawa_spat,
fill = NA,
color = "black",
linewidth = 0.45
) +
facet_wrap(~ accident_month_label, ncol = 2) +
scale_color_gradient(
low = "#FDB863",
high = "#B30000",
limits = c(1, common_max_accidents),
name = "Accidents"
) +
scale_linewidth_continuous(
range = c(0.35, 2.2),
guide = "none"
) +
coord_sf(expand = FALSE) +
theme_void() +
labs(
title = "Monthly Accident Counts on the Warsaw Road Graph",
subtitle = paste(min(month_labels), "to", max(month_labels)),
) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5),
plot.caption = element_text(size = 8, hjust = 1),
strip.text = element_text(size = 10, face = "bold"),
legend.position = "right",
panel.spacing = unit(1, "lines"),
plot.margin = margin(10, 10, 10, 10)
)
}
monthly_maps <- lapply(month_chunks, plot_month_chunk)
monthly_maps[[1]]
monthly_maps[[2]]
monthly_maps[[3]]
monthly_maps[[4]]
monthly_maps[[5]]
monthly_maps[[6]]
After that, let us prepare the nodes feature table.
road_net <- road_net %>%
activate(nodes) %>%
mutate(
node_degree = centrality_degree(mode = "all"),
node_component = group_components()
)
graph_nodes <- road_net %>%
activate(nodes) %>%
st_as_sf() %>%
mutate(
node_x = st_coordinates(geometry)[, 1],
node_y = st_coordinates(geometry)[, 2]
) %>%
st_drop_geometry() %>%
select(node_id, node_degree, node_component, node_x, node_y)
graph_nodes %>%
summarise(
n_nodes = n(),
unique_node_ids = n_distinct(node_id),
n_components = n_distinct(node_component),
min_degree = min(node_degree),
median_degree = median(node_degree),
max_degree = max(node_degree)
)
## # A tibble: 1 × 6
## n_nodes unique_node_ids n_components min_degree median_degree max_degree
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 27052 27052 176 1 2 6
And then we can prepare the tables for modeling. First, the edge table combining graph structure and tabular features.
graph_model_edges <- road_net %>%
activate(edges) %>%
as_tibble() %>%
select(graph_edge_id, from, to) %>%
left_join(
edge_static_dataset %>%
st_drop_geometry() %>%
select(
graph_edge_id, road_id, osm_id, type, oneway, bridge,
graph_length_m, graph_log_length_m,
maxspeed_imputed, maxspeed_missing
),
by = "graph_edge_id"
)
graph_model_edges %>%
summarise(
n_edges = n(),
unique_graph_edge_ids = n_distinct(graph_edge_id),
missing_type = sum(is.na(type)),
missing_maxspeed = sum(is.na(maxspeed_imputed))
)
## Simple feature collection with 1 feature and 4 fields
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: 626541.6 ymin: 472233.7 xmax: 654472.2 ymax: 502031
## Projected CRS: ETRF2000-PL / CS92
## # A tibble: 1 × 5
## n_edges unique_graph_edge_ids missing_type missing_maxspeed
## <int> <int> <int> <int>
## 1 33530 33530 0 0
## # ℹ 1 more variable: geometry <MULTILINESTRING [m]>
Then, we prapare other tables to be exported to Python.
# export node table
node_table <- graph_nodes %>%
select(node_id, node_degree, node_x, node_y, node_component) %>%
arrange(node_id)
#-------------------------------------------------------------------------------
# export static edge table
edge_table_raw <- graph_model_edges %>%
mutate(
from_py = from - 1L,
to_py = to - 1L,
type = factor(type)
) %>%
select(
graph_edge_id, from, to, from_py, to_py,
type, oneway, bridge,
graph_length_m, graph_log_length_m,
maxspeed_imputed, maxspeed_missing
)
edge_type_dummies <- model.matrix(~ type - 1, data = edge_table_raw) %>%
as.data.frame() %>%
janitor::clean_names()
edge_table <- bind_cols(
edge_table_raw %>% select(-type),
edge_type_dummies
) %>%
arrange(graph_edge_id)
#-------------------------------------------------------------------------------
# export monthly target table
edge_month_table <- edge_month_panel %>%
mutate(
month_id = match(accident_month, all_months) - 1L
) %>%
arrange(accident_month, graph_edge_id) %>%
select(
graph_edge_id,
accident_month,
month_id,
accident_count,
accident_binary
)
# final checks before writing files
node_table %>%
summarise(
n_nodes = n(),
unique_node_ids = n_distinct(node_id),
min_node_id = min(node_id),
max_node_id = max(node_id)
)
## # A tibble: 1 × 4
## n_nodes unique_node_ids min_node_id max_node_id
## <int> <int> <int> <int>
## 1 27052 27052 1 27052
edge_table %>%
summarise(
n_edges = n(),
unique_graph_edge_ids = n_distinct(graph_edge_id),
min_from_py = min(from_py),
max_from_py = max(from_py),
min_to_py = min(to_py),
max_to_py = max(to_py)
)
## Simple feature collection with 1 feature and 6 fields
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: 626541.6 ymin: 472233.7 xmax: 654472.2 ymax: 502031
## Projected CRS: ETRF2000-PL / CS92
## # A tibble: 1 × 7
## n_edges unique_graph_edge_ids min_from_py max_from_py min_to_py max_to_py
## <int> <int> <int> <int> <int> <int>
## 1 33530 33530 0 27050 1 27051
## # ℹ 1 more variable: geometry <MULTILINESTRING [m]>
edge_month_table %>%
summarise(
n_rows = n(),
n_edges = n_distinct(graph_edge_id),
n_months = n_distinct(accident_month),
total_accidents = sum(accident_count),
positive_edge_months = sum(accident_binary),
positive_share = mean(accident_binary),
missing_targets = sum(is.na(accident_count))
)
## # A tibble: 1 × 7
## n_rows n_edges n_months total_accidents positive_edge_months positive_share
## <int> <int> <int> <int> <int> <dbl>
## 1 402360 33530 12 12982 12044 0.0299
## # ℹ 1 more variable: missing_targets <int>
tibble(
n_nodes = nrow(node_table),
min_node_id = min(node_table$node_id),
max_node_id = max(node_table$node_id),
n_edges = nrow(edge_table),
n_edge_month_rows = nrow(edge_month_table),
min_index_used = min(c(edge_table$from_py, edge_table$to_py)),
max_index_used = max(c(edge_table$from_py, edge_table$to_py)),
any_negative = any(c(edge_table$from_py, edge_table$to_py) < 0),
any_too_large = any(c(edge_table$from_py, edge_table$to_py) >= nrow(node_table)),
total_accidents = sum(edge_month_table$accident_count),
missing_targets = sum(is.na(edge_month_table$accident_count))
)
## # A tibble: 1 × 11
## n_nodes min_node_id max_node_id n_edges n_edge_month_rows min_index_used
## <int> <int> <int> <int> <int> <int>
## 1 27052 1 27052 33530 402360 0
## # ℹ 5 more variables: max_index_used <int>, any_negative <lgl>,
## # any_too_large <lgl>, total_accidents <int>, missing_targets <int>
#-------------------------------------------------------------------------------
# write files for Python / PyTorch Geometric
write_csv(node_table, "node_table.csv")
write_csv(edge_table, "edge_table.csv")
write_csv(edge_month_table, "edge_month_table.csv")
Due to the fact that R ecosystem does not provide any frameworks for GNN’s, we need to transfer modeling to Python with “reticulate” package. First, we need to configure the environment. I used anaconda environment due to the fact that it provides maximum stability while working with various libraries. However, it is also possible to create virtual environment and load all libraries there.
Sys.setenv(RETICULATE_PYTHON = "C:/Users/yana1/anaconda3/envs/gnn_pyg_r/python.exe") #replace with your local environment
library(reticulate)
py_config()
## python: C:/Users/yana1/anaconda3/envs/gnn_pyg_r/python.exe
## libpython: C:/Users/yana1/anaconda3/envs/gnn_pyg_r/python311.dll
## pythonhome: C:/Users/yana1/anaconda3/envs/gnn_pyg_r
## version: 3.11.15 | packaged by Anaconda, Inc. | (main, Mar 11 2026, 17:12:15) [MSC v.1942 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:/Users/yana1/anaconda3/envs/gnn_pyg_r/Lib/site-packages/numpy
## numpy_version: 2.4.4
##
## NOTE: Python version was forced by RETICULATE_PYTHON
After the environment was properly set, we can start with importing all necessary Python libraries and loading data (all the codes in this section were previosuly run and debugged in Python IDE, so there is no need in small chunks).
py_run_string(r"(
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import (
roc_auc_score,
balanced_accuracy_score,
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix
)
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv
torch.manual_seed(42)
np.random.seed(42)
node_table = pd.read_csv('node_table.csv')
edge_table = pd.read_csv('edge_table.csv')
edge_month_table = pd.read_csv('edge_month_table.csv')
edge_month_table['accident_month'] = pd.to_datetime(edge_month_table['accident_month'])
print('node_table:', node_table.shape)
print('edge_table:', edge_table.shape)
print('edge_month_table:', edge_month_table.shape)
print('\nnode_table columns:')
print(node_table.columns.tolist())
print('\nedge_table columns:')
print(edge_table.columns.tolist())
print('\nedge_month_table columns:')
print(edge_month_table.columns.tolist())
)")
## node_table: (27052, 5)
## edge_table: (33530, 23)
## edge_month_table: (402360, 5)
##
## node_table columns:
## ['node_id', 'node_degree', 'node_x', 'node_y', 'node_component']
##
## edge_table columns:
## ['graph_edge_id', 'from', 'to', 'from_py', 'to_py', 'oneway', 'bridge', 'graph_length_m', 'graph_log_length_m', 'maxspeed_imputed', 'maxspeed_missing', 'geometry', 'typemotorway', 'typemotorway_link', 'typeprimary', 'typeprimary_link', 'typeresidential', 'typesecondary', 'typesecondary_link', 'typetertiary', 'typetertiary_link', 'typetrunk', 'typetrunk_link']
##
## edge_month_table columns:
## ['graph_edge_id', 'accident_month', 'month_id', 'accident_count', 'accident_binary']
Then, we need to run sanity checks, construct proper dataset for binary classification model and divide the data into train, validation and test subsets chronologically.
py_run_string(r"(
# --------------------------------------------------
# Basic checks
# --------------------------------------------------
assert edge_table['graph_edge_id'].is_unique, 'graph_edge_id is not unique in edge_table'
assert node_table['node_id'].is_unique, 'node_id is not unique in node_table'
edge_ids = edge_table['graph_edge_id'].values
n_edges = len(edge_ids)
months = sorted(edge_month_table['accident_month'].unique())
n_months = len(months)
print('\nMonths:')
for i, m in enumerate(months):
print(i, pd.Timestamp(m).strftime('%Y-%m'))
expected_rows = n_edges * n_months
assert len(edge_month_table) == expected_rows, (
f'Expected {expected_rows} edge-month rows, got {len(edge_month_table)}'
)
print('\nNumber of edges:', n_edges)
print('Number of months:', n_months)
print('Expected rows:', expected_rows)
# --------------------------------------------------
# Monthly binary label matrix: months × edges
# --------------------------------------------------
label_matrix = (
edge_month_table
.pivot(index='accident_month', columns='graph_edge_id', values='accident_binary')
.reindex(index=months, columns=edge_ids)
)
missing_labels = label_matrix.isna().sum().sum()
print('\nMissing labels in matrix:', missing_labels)
assert missing_labels == 0, 'There are missing labels after pivoting'
Y = torch.tensor(label_matrix.to_numpy(dtype=np.float32), dtype=torch.float32)
print('Y shape:', Y.shape)
print('Overall positive share:', float(Y.mean().item()))
# --------------------------------------------------
#chronological split
# train: Jan-Aug
# valid: Sep-Oct
# test : Nov-Dec
# --------------------------------------------------
train_month_idx = list(range(0, 8))
valid_month_idx = list(range(8, 10))
test_month_idx = list(range(10, 12))
print('\nSplit:')
print('Train:', [pd.Timestamp(months[i]).strftime('%Y-%m') for i in train_month_idx])
print('Valid:', [pd.Timestamp(months[i]).strftime('%Y-%m') for i in valid_month_idx])
print('Test :', [pd.Timestamp(months[i]).strftime('%Y-%m') for i in test_month_idx])
print('\nPositive shares:')
print('Train:', float(Y[train_month_idx].mean().item()))
print('Valid:', float(Y[valid_month_idx].mean().item()))
print('Test :', float(Y[test_month_idx].mean().item()))
)")
##
## Months:
## 0 2024-01
## 1 2024-02
## 2 2024-03
## 3 2024-04
## 4 2024-05
## 5 2024-06
## 6 2024-07
## 7 2024-08
## 8 2024-09
## 9 2024-10
## 10 2024-11
## 11 2024-12
##
## Number of edges: 33530
## Number of months: 12
## Expected rows: 402360
##
## Missing labels in matrix: 0
## Y shape: torch.Size([12, 33530])
## Overall positive share: 0.029933393001556396
##
## Split:
## Train: ['2024-01', '2024-02', '2024-03', '2024-04', '2024-05', '2024-06', '2024-07', '2024-08']
## Valid: ['2024-09', '2024-10']
## Test : ['2024-11', '2024-12']
##
## Positive shares:
## Train: 0.029115717858076096
## Valid: 0.031225768849253654
## Test : 0.03191171959042549
Then, we need to define preprocessing steps and create PyTorch Geometry object.
py_run_string(r"(
# --------------------------------------------------
# Node features
# --------------------------------------------------
# node_component is not used as a numeric feature because it is an arbitrary ID.
node_feature_cols = ['node_degree', 'node_x', 'node_y']
node_features_df = node_table[node_feature_cols].apply(pd.to_numeric, errors='coerce')
print('\nNode feature dtypes:')
print(node_features_df.dtypes)
print('\nMissing node features after numeric conversion:')
print(node_features_df.isna().sum())
node_features_df = node_features_df.fillna(node_features_df.median()).fillna(0)
x = torch.tensor(
node_features_df.to_numpy(dtype=np.float32),
dtype=torch.float32
)
x_mean = x.mean(dim=0, keepdim=True)
x_std = x.std(dim=0, keepdim=True)
x_std[x_std == 0] = 1.0
x = (x - x_mean) / x_std
# --------------------------------------------------
# Edge index
# --------------------------------------------------
edge_index_original = torch.tensor(
edge_table[['from_py', 'to_py']].to_numpy(dtype=np.int64).T,
dtype=torch.long
)
# Message passing is undirected.
edge_index_message = torch.cat(
[edge_index_original, edge_index_original.flip(0)],
dim=1
)
# --------------------------------------------------
# Edge features
# --------------------------------------------------
continuous_edge_cols = [
'graph_log_length_m',
'maxspeed_imputed'
]
binary_edge_cols = [
'oneway',
'bridge',
'maxspeed_missing'
]
type_dummy_cols = [
col for col in edge_table.columns
if col.startswith('type')
]
continuous_df = edge_table[continuous_edge_cols].apply(pd.to_numeric, errors='coerce')
binary_df = edge_table[binary_edge_cols + type_dummy_cols].apply(pd.to_numeric, errors='coerce')
print('\nContinuous edge feature dtypes:')
print(continuous_df.dtypes)
print('\nBinary/dummy edge feature dtypes:')
print(binary_df.dtypes)
print('\nMissing continuous edge features:')
print(continuous_df.isna().sum())
print('\nMissing binary/dummy edge features:')
print(binary_df.isna().sum())
continuous_df = continuous_df.fillna(continuous_df.median()).fillna(0)
binary_df = binary_df.fillna(0)
continuous_tensor = torch.tensor(
continuous_df.to_numpy(dtype=np.float32),
dtype=torch.float32
)
continuous_mean = continuous_tensor.mean(dim=0, keepdim=True)
continuous_std = continuous_tensor.std(dim=0, keepdim=True)
continuous_std[continuous_std == 0] = 1.0
continuous_tensor = (continuous_tensor - continuous_mean) / continuous_std
binary_tensor = torch.tensor(
binary_df.to_numpy(dtype=np.float32),
dtype=torch.float32
)
edge_attr = torch.cat(
[continuous_tensor, binary_tensor],
dim=1
)
edge_feature_cols = continuous_edge_cols + binary_edge_cols + type_dummy_cols
print('\nEdge feature columns:')
print(edge_feature_cols)
print('Number of edge features:', len(edge_feature_cols))
print('edge_attr:', edge_attr.shape)
# --------------------------------------------------
# PyG data object
# --------------------------------------------------
data = Data(
x=x,
edge_index=edge_index_message,
original_edge_index=edge_index_original,
edge_attr=edge_attr
)
print('\nData object:')
print(data)
print('x:', data.x.shape)
print('message edge_index:', data.edge_index.shape)
print('original_edge_index:', data.original_edge_index.shape)
print('edge_attr:', data.edge_attr.shape)
assert data.edge_attr.shape[0] == n_edges, 'edge_attr rows do not match number of edges'
assert data.original_edge_index.shape[1] == n_edges, 'original_edge_index columns do not match number of edges'
assert Y.shape[1] == n_edges, 'Y columns do not match number of edges'
assert int(data.edge_index.max()) < data.x.shape[0], 'edge_index contains node index larger than number of nodes'
assert int(data.edge_index.min()) >= 0, 'edge_index contains negative node index'
)")
##
## Node feature dtypes:
## node_degree int64
## node_x float64
## node_y float64
## dtype: object
##
## Missing node features after numeric conversion:
## node_degree 0
## node_x 0
## node_y 0
## dtype: int64
##
## Continuous edge feature dtypes:
## graph_log_length_m float64
## maxspeed_imputed int64
## dtype: object
##
## Binary/dummy edge feature dtypes:
## oneway int64
## bridge int64
## maxspeed_missing int64
## typemotorway int64
## typemotorway_link int64
## typeprimary int64
## typeprimary_link int64
## typeresidential int64
## typesecondary int64
## typesecondary_link int64
## typetertiary int64
## typetertiary_link int64
## typetrunk int64
## typetrunk_link int64
## dtype: object
##
## Missing continuous edge features:
## graph_log_length_m 0
## maxspeed_imputed 0
## dtype: int64
##
## Missing binary/dummy edge features:
## oneway 0
## bridge 0
## maxspeed_missing 0
## typemotorway 0
## typemotorway_link 0
## typeprimary 0
## typeprimary_link 0
## typeresidential 0
## typesecondary 0
## typesecondary_link 0
## typetertiary 0
## typetertiary_link 0
## typetrunk 0
## typetrunk_link 0
## dtype: int64
##
## Edge feature columns:
## ['graph_log_length_m', 'maxspeed_imputed', 'oneway', 'bridge', 'maxspeed_missing', 'typemotorway', 'typemotorway_link', 'typeprimary', 'typeprimary_link', 'typeresidential', 'typesecondary', 'typesecondary_link', 'typetertiary', 'typetertiary_link', 'typetrunk', 'typetrunk_link']
## Number of edge features: 16
## edge_attr: torch.Size([33530, 16])
##
## Data object:
## Data(x=[27052, 3], edge_index=[2, 67060], edge_attr=[33530, 16], original_edge_index=[2, 33530])
## x: torch.Size([27052, 3])
## message edge_index: torch.Size([2, 67060])
## original_edge_index: torch.Size([2, 33530])
## edge_attr: torch.Size([33530, 16])
The next step is to define the model architecture, loss and some helper functions.
# PART 4: Model definition, loss, and optimizer
py_run_string(r"(
class EdgeGraphSAGEBinary(nn.Module):
def __init__(self, node_in_dim, edge_in_dim, hidden_dim=64, dropout=0.30):
super().__init__()
self.conv1 = SAGEConv(node_in_dim, hidden_dim)
self.conv2 = SAGEConv(hidden_dim, hidden_dim)
self.edge_mlp = nn.Sequential(
nn.Linear(hidden_dim * 2 + edge_in_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim // 2, 1)
)
self.dropout = dropout
def encode_nodes(self, x, edge_index):
h = self.conv1(x, edge_index)
h = F.relu(h)
h = F.dropout(h, p=self.dropout, training=self.training)
h = self.conv2(h, edge_index)
h = F.relu(h)
return h
def decode_edges(self, node_emb, original_edge_index, edge_attr):
src = original_edge_index[0]
dst = original_edge_index[1]
h_src = node_emb[src]
h_dst = node_emb[dst]
edge_input = torch.cat([h_src, h_dst, edge_attr], dim=1)
logits = self.edge_mlp(edge_input).squeeze(-1)
return logits
def forward(self, x, edge_index, original_edge_index, edge_attr):
node_emb = self.encode_nodes(x, edge_index)
logits = self.decode_edges(node_emb, original_edge_index, edge_attr)
return logits
model = EdgeGraphSAGEBinary(
node_in_dim=data.x.shape[1],
edge_in_dim=data.edge_attr.shape[1],
hidden_dim=64,
dropout=0.30
)
Y_train = Y[train_month_idx]
n_pos = float(Y_train.sum().item())
n_total = float(Y_train.numel())
n_neg = n_total - n_pos
pos_weight = torch.tensor([n_neg / max(n_pos, 1.0)], dtype=torch.float32)
print('\nTraining labels:')
print('positive labels:', int(n_pos))
print('negative labels:', int(n_neg))
print('positive share:', n_pos / n_total)
print('pos_weight:', float(pos_weight.item()))
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
print('\nModel initialized.')
print(model)
)")
##
## Training labels:
## positive labels: 7810
## negative labels: 260430
## positive share: 0.029115717268118103
## pos_weight: 33.34571075439453
##
## Model initialized.
## EdgeGraphSAGEBinary(
## (conv1): SAGEConv(3, 64, aggr=mean)
## (conv2): SAGEConv(64, 64, aggr=mean)
## (edge_mlp): Sequential(
## (0): Linear(in_features=144, out_features=64, bias=True)
## (1): ReLU()
## (2): Dropout(p=0.3, inplace=False)
## (3): Linear(in_features=64, out_features=32, bias=True)
## (4): ReLU()
## (5): Dropout(p=0.3, inplace=False)
## (6): Linear(in_features=32, out_features=1, bias=True)
## )
## )
#-------------------------------------------------------------------------------
# Evaluation helpers
py_run_string(r"(
def collect_predictions(month_indices):
model.eval()
with torch.no_grad():
logits = model(
data.x,
data.edge_index,
data.original_edge_index,
data.edge_attr
)
probs = torch.sigmoid(logits)
y_true = Y[month_indices].reshape(-1).cpu().numpy()
y_prob = probs.repeat(len(month_indices)).cpu().numpy()
return y_true, y_prob
def evaluate_split(month_indices, threshold=0.5, split_name='split'):
y_true, y_prob = collect_predictions(month_indices)
y_pred = (y_prob >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
metrics = {
'split': split_name,
'threshold': threshold,
'auc': roc_auc_score(y_true, y_prob),
'accuracy': accuracy_score(y_true, y_pred),
'balanced_accuracy': balanced_accuracy_score(y_true, y_pred),
'precision': precision_score(y_true, y_pred, zero_division=0),
'recall': recall_score(y_true, y_pred, zero_division=0),
'specificity': tn / (tn + fp) if (tn + fp) > 0 else np.nan,
'f1': f1_score(y_true, y_pred, zero_division=0),
'true_positive_share': float(np.mean(y_true)),
'pred_positive_share': float(np.mean(y_pred)),
'mean_probability': float(np.mean(y_prob)),
'tp': int(tp),
'tn': int(tn),
'fp': int(fp),
'fn': int(fn)
}
return metrics
def find_best_threshold_on_validation():
y_true, y_prob = collect_predictions(valid_month_idx)
thresholds = np.linspace(0.01, 0.99, 99)
best_threshold = 0.5
best_score = -1
for threshold in thresholds:
y_pred = (y_prob >= threshold).astype(int)
score = balanced_accuracy_score(y_true, y_pred)
if score > best_score:
best_score = score
best_threshold = threshold
return float(best_threshold), float(best_score)
print('Evaluation helpers are ready.')
# quick pre-training check
pre_train_valid = evaluate_split(valid_month_idx, threshold=0.5, split_name='valid_pre_training')
print('\nPre-training validation metrics:')
for key, value in pre_train_valid.items():
print(f'{key}: {value}')
)")
## Evaluation helpers are ready.
##
## Pre-training validation metrics:
## split: valid_pre_training
## threshold: 0.5
## auc: 0.47507729485772315
## accuracy: 0.031225767968983
## balanced_accuracy: 0.5
## precision: 0.031225767968983
## recall: 1.0
## specificity: 0.0
## f1: 0.06056048818578824
## true_positive_share: 0.031225768849253654
## pred_positive_share: 1.0
## mean_probability: 0.5178737044334412
## tp: 2094
## tn: 0
## fp: 64966
## fn: 0
Finally, we can train the model and predict on test set.
# prepare a nice-looking table
library(knitr)
binary_table <- bind_rows(
as_tibble(py_to_r(py$final_train)),
as_tibble(py_to_r(py$final_valid)),
as_tibble(py_to_r(py$final_test))
) %>%
transmute(
Split = recode(split, train = "Train", valid = "Validation", test = "Test"),
AUROC = round(auc, 3),
Accuracy = round(accuracy, 3),
`Balanced accuracy` = round(balanced_accuracy, 3),
Precision = round(precision, 3),
Recall = round(recall, 3),
Specificity = round(specificity, 3),
F1 = round(f1, 3),
`True positive share` = round(true_positive_share, 3),
`Predicted positive share` = round(pred_positive_share, 3),
Threshold = round(threshold, 2)
)
kable(
binary_table,
caption = "Binary GraphSAGE classification metrics",
align = "c"
)
| Split | AUROC | Accuracy | Balanced accuracy | Precision | Recall | Specificity | F1 | True positive share | Predicted positive share | Threshold |
|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0.795 | 0.721 | 0.723 | 0.072 | 0.726 | 0.721 | 0.132 | 0.029 | 0.292 | 0.54 |
| Validation | 0.787 | 0.721 | 0.715 | 0.076 | 0.709 | 0.721 | 0.137 | 0.031 | 0.292 | 0.54 |
| Test | 0.788 | 0.721 | 0.715 | 0.077 | 0.708 | 0.722 | 0.140 | 0.032 | 0.292 | 0.54 |
The binary model shows stable behaviour across the training, validation, and test sets. The AUROC remains at a moderate level in all three splits, which means that the model has some ability to rank road segments from lower to higher accident risk. The small difference between training and later periods suggests that the model generalises reasonably well across months and does not show a strong sign of overfitting. Accuracy and specificity are also relatively stable, although they should be interpreted carefully because accident-positive observations are rare. Balanced accuracy gives a more useful picture here and indicates that the model performs better than a naive classifier, while still leaving clear room for improvement. The most visible pattern is the difference between recall and precision. Recall is high, so the model catches a large share of accident-positive cases. Precision and F1 remain low, which means that many segments classified as risky do not actually have accidents in the observed month. This follows from the strong imbalance in the data and from the model’s tendency to flag a relatively large share of edges as risky. Overall, the binary model is useful as a broad risk-screening tool: it highlights potentially problematic road segments, although its positive predictions should be interpreted with caution.
Let us also plot the training loop.
binary_history <- as.data.frame(py_to_r(py$binary_history_df))
binary_loss_long <- binary_history %>%
select(epoch, train_loss, valid_loss) %>%
pivot_longer(
cols = c(train_loss, valid_loss),
names_to = "split",
values_to = "loss"
) %>%
mutate(
split = recode(
split,
train_loss = "Train",
valid_loss = "Validation"
)
)
ggplot(binary_loss_long, aes(x = epoch, y = loss, color = split)) +
geom_line(linewidth = 1.1) +
labs(
title = "Binary GraphSAGE Training History",
subtitle = "Train and validation loss across epochs",
x = "Epoch",
y = "Weighted binary cross-entropy loss",
color = "Split"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "bottom"
)
During the first 50 epochs, the model reached a relatively stable solution. However, after that the loss was quite stable suggesting that training for 150 more epochs did not result in severe overfitting.
Due to the fact that data was already loaded and preprocessed, we can omit some steps. The only modification we need to apply is change in target - we need to replace binary output matrix with counts of accidents.
# REGRESSION PART 1: Count target matrix
py_run_string(r"(
# --------------------------------------------------
# Monthly count matrix: months × edges
# --------------------------------------------------
count_matrix = (
edge_month_table
.pivot(index='accident_month', columns='graph_edge_id', values='accident_count')
.reindex(index=months, columns=edge_ids)
)
missing_counts = count_matrix.isna().sum().sum()
print('Missing count labels:', missing_counts)
assert missing_counts == 0, 'There are missing count labels after pivoting'
Y_count = torch.tensor(
count_matrix.to_numpy(dtype=np.float32),
dtype=torch.float32
)
print('Y_count shape:', Y_count.shape)
print('Total accidents:', float(Y_count.sum().item()))
print('Mean count overall:', float(Y_count.mean().item()))
print('Max count:', float(Y_count.max().item()))
print('\\nSplit mean counts:')
print('Train:', float(Y_count[train_month_idx].mean().item()))
print('Valid:', float(Y_count[valid_month_idx].mean().item()))
print('Test :', float(Y_count[test_month_idx].mean().item()))
print('\\nSplit total accidents:')
print('Train:', float(Y_count[train_month_idx].sum().item()))
print('Valid:', float(Y_count[valid_month_idx].sum().item()))
print('Test :', float(Y_count[test_month_idx].sum().item()))
)")
## Missing count labels: 0
## Y_count shape: torch.Size([12, 33530])
## Total accidents: 12982.0
## Mean count overall: 0.03226463869214058
## Max count: 5.0
## \nSplit mean counts:
## Train: 0.031222039833664894
## Valid: 0.034088876098394394
## Test : 0.034610796719789505
## \nSplit total accidents:
## Train: 8375.0
## Valid: 2286.0
## Test : 2321.0
Then, we proceed with model architecture definition and evaluation helper fucntions.
py_run_string(r"(
class EdgeGraphSAGECount(nn.Module):
def __init__(self, node_in_dim, edge_in_dim, hidden_dim=64, dropout=0.30):
super().__init__()
self.conv1 = SAGEConv(node_in_dim, hidden_dim)
self.conv2 = SAGEConv(hidden_dim, hidden_dim)
self.edge_mlp = nn.Sequential(
nn.Linear(hidden_dim * 2 + edge_in_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim // 2, 1)
)
self.dropout = dropout
def encode_nodes(self, x, edge_index):
h = self.conv1(x, edge_index)
h = F.relu(h)
h = F.dropout(h, p=self.dropout, training=self.training)
h = self.conv2(h, edge_index)
h = F.relu(h)
return h
def decode_edges(self, node_emb, original_edge_index, edge_attr):
src = original_edge_index[0]
dst = original_edge_index[1]
h_src = node_emb[src]
h_dst = node_emb[dst]
edge_input = torch.cat([h_src, h_dst, edge_attr], dim=1)
# Output is log(lambda), where lambda is expected accident count.
log_rate = self.edge_mlp(edge_input).squeeze(-1)
return log_rate
def forward(self, x, edge_index, original_edge_index, edge_attr):
node_emb = self.encode_nodes(x, edge_index)
log_rate = self.decode_edges(node_emb, original_edge_index, edge_attr)
return log_rate
reg_model = EdgeGraphSAGECount(
node_in_dim=data.x.shape[1],
edge_in_dim=data.edge_attr.shape[1],
hidden_dim=64,
dropout=0.30
)
Y_count_train = Y_count[train_month_idx]
train_mean_count = float(Y_count_train.mean().item())
# Initialize final bias near the average training count.
# This helps avoid unstable starting predictions.
with torch.no_grad():
reg_model.edge_mlp[-1].bias.fill_(np.log(train_mean_count + 1e-6))
reg_criterion = nn.PoissonNLLLoss(log_input=True, reduction='mean')
reg_optimizer = torch.optim.Adam(reg_model.parameters(), lr=0.001, weight_decay=1e-4)
print('Count regression model initialized.')
print('Training mean count:', train_mean_count)
print(reg_model)
)")
## Count regression model initialized.
## Training mean count: 0.031222039833664894
## EdgeGraphSAGECount(
## (conv1): SAGEConv(3, 64, aggr=mean)
## (conv2): SAGEConv(64, 64, aggr=mean)
## (edge_mlp): Sequential(
## (0): Linear(in_features=144, out_features=64, bias=True)
## (1): ReLU()
## (2): Dropout(p=0.3, inplace=False)
## (3): Linear(in_features=64, out_features=32, bias=True)
## (4): ReLU()
## (5): Dropout(p=0.3, inplace=False)
## (6): Linear(in_features=32, out_features=1, bias=True)
## )
## )
#-------------------------------------------------------------------------------
# REGRESSION PART 3: Evaluation helpers
py_run_string(r"(
def collect_count_predictions(month_indices):
reg_model.eval()
with torch.no_grad():
log_rate = reg_model(
data.x,
data.edge_index,
data.original_edge_index,
data.edge_attr
)
# Convert log-rate to expected count.
pred_count = torch.exp(log_rate)
y_true = Y_count[month_indices].reshape(-1).cpu().numpy()
y_pred = pred_count.repeat(len(month_indices)).cpu().numpy()
return y_true, y_pred
def evaluate_count_split(month_indices, split_name='split'):
y_true, y_pred = collect_count_predictions(month_indices)
error = y_pred - y_true
abs_error = np.abs(error)
positive_mask = y_true > 0
mae = float(np.mean(abs_error))
rmse = float(np.sqrt(np.mean(error ** 2)))
zero_baseline_mae = float(np.mean(np.abs(y_true)))
if positive_mask.sum() > 0:
mae_positive = float(np.mean(abs_error[positive_mask]))
mean_pred_positive_actual = float(np.mean(y_pred[positive_mask]))
mean_actual_positive = float(np.mean(y_true[positive_mask]))
else:
mae_positive = np.nan
mean_pred_positive_actual = np.nan
mean_actual_positive = np.nan
metrics = {
'split': split_name,
'mae': mae,
'rmse': rmse,
'zero_baseline_mae': zero_baseline_mae,
'mae_minus_zero_baseline': mae - zero_baseline_mae,
'mean_actual_count': float(np.mean(y_true)),
'mean_predicted_count': float(np.mean(y_pred)),
'total_actual_count': float(np.sum(y_true)),
'total_predicted_count': float(np.sum(y_pred)),
'positive_share': float(np.mean(y_true > 0)),
'mae_positive_actual_only': mae_positive,
'mean_actual_count_positive_only': mean_actual_positive,
'mean_predicted_count_on_positive_actual': mean_pred_positive_actual,
'max_actual_count': float(np.max(y_true)),
'max_predicted_count': float(np.max(y_pred))
}
return metrics
print('Regression evaluation helpers are ready.')
pre_train_reg_valid = evaluate_count_split(valid_month_idx, split_name='valid_pre_training')
print('\\nPre-training validation regression metrics:')
for key, value in pre_train_reg_valid.items():
print(f'{key}: {value}')
)")
## Regression evaluation helpers are ready.
## \nPre-training validation regression metrics:
## split: valid_pre_training
## mae: 0.06238536536693573
## rmse: 0.1985711306333542
## zero_baseline_mae: 0.034088876098394394
## mae_minus_zero_baseline: 0.028296489268541336
## mean_actual_count: 0.034088876098394394
## mean_predicted_count: 0.030194653198122978
## total_actual_count: 2286.0
## total_predicted_count: 2024.8533935546875
## positive_share: 0.031225767968983
## mae_positive_actual_only: 1.0612964630126953
## mean_actual_count_positive_only: 1.0916905403137207
## mean_predicted_count_on_positive_actual: 0.03039413131773472
## max_actual_count: 5.0
## max_predicted_count: 0.03197809308767319
And finally, the training loop and model evaluation.
# Regression table
regression_table <- bind_rows(
as_tibble(py_to_r(py$final_reg_train)),
as_tibble(py_to_r(py$final_reg_valid)),
as_tibble(py_to_r(py$final_reg_test))
) %>%
transmute(
Split = recode(split, train = "Train", valid = "Validation", test = "Test"),
MAE = round(mae, 3),
RMSE = round(rmse, 3),
`Zero baseline MAE` = round(zero_baseline_mae, 3),
`MAE - zero baseline` = round(mae_minus_zero_baseline, 3),
`Mean actual count` = round(mean_actual_count, 3),
`Mean predicted count` = round(mean_predicted_count, 3),
`Total actual count` = round(total_actual_count, 0),
`Total predicted count` = round(total_predicted_count, 0),
`MAE on positives` = round(mae_positive_actual_only, 3)
)
kable(
regression_table,
caption = "GraphSAGE count regression metrics",
align = "c"
)
| Split | MAE | RMSE | Zero baseline MAE | MAE - zero baseline | Mean actual count | Mean predicted count | Total actual count | Total predicted count | MAE on positives |
|---|---|---|---|---|---|---|---|---|---|
| Train | 0.057 | 0.182 | 0.031 | 0.026 | 0.031 | 0.031 | 8375 | 8186 | 0.992 |
| Validation | 0.060 | 0.193 | 0.034 | 0.026 | 0.034 | 0.031 | 2286 | 2046 | 1.013 |
| Test | 0.060 | 0.194 | 0.035 | 0.025 | 0.035 | 0.031 | 2321 | 2046 | 1.006 |
The count regression model gives very similar results across the training, validation, and test sets, so there is no clear sign that the model behaves much worse on later months. The average errors remain low, but this should be interpreted carefully because most edge-month observations have zero accidents. This is visible in the comparison with the zero baseline: the model has a higher MAE than simply predicting zero everywhere, which means that the regression task is strongly affected by the sparsity of accident counts. The model predicts the overall mean count quite closely, but it underestimates the total number of accidents in the validation and test periods. The weakest point is visible for road segments where accidents actually occurred: the error on positive observations remains high, suggesting that the model struggles to assign sufficiently large predicted counts to accident-positive edges. Overall, the count regression model captures the general low-frequency structure of the data, although it is less useful for estimating the actual number of accidents on specific risky segments.
reg_history <- as.data.frame(py_to_r(py$reg_history_df))
reg_loss_long <- reg_history %>%
select(epoch, train_loss, valid_loss) %>%
pivot_longer(
cols = c(train_loss, valid_loss),
names_to = "split",
values_to = "loss"
) %>%
mutate(
split = recode(
split,
train_loss = "Train",
valid_loss = "Validation"
)
)
ggplot(reg_loss_long, aes(x = epoch, y = loss, color = split)) +
geom_line(linewidth = 1.1) +
labs(
title = "Count GraphSAGE Training History",
subtitle = "Train and validation loss across epochs",
x = "Epoch",
y = "Poisson negative log-likelihood loss",
color = "Split"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "bottom"
)
In case of regression, the model behaves quite the same as classification in terms of training. After 50 epochs, the loss declines significantly and stays on stable level.
This project showed that road accident modelling becomes more meaningful when accidents are represented on the road network rather than treated only as isolated point events. After cleaning the road data, assigning accidents to road segments, and building a monthly edge-level panel, the Warsaw road network could be used as a spatial structure for accident-risk modelling. The graph-based approach was therefore useful mainly as a way of translating accident records into a road-segment risk map. This is the strongest part of the project: it connects accident locations with the geometry, attributes, and topology of the actual transport network.
The binary GraphSAGE model produced the most useful results. Its performance was stable across training, validation, and test periods, which suggests that the model captured some persistent spatial patterns rather than fitting only one part of the data. The model had moderate ranking ability and relatively high recall, so it was able to identify many road segments that later appeared as accident-positive. At the same time, precision remained low, meaning that many segments flagged as risky did not actually have accidents in a given month. This result is expected in such a sparse problem, where accident-positive edge-months are rare. For this reason, the binary model is best understood as a broad screening tool: it can highlight areas of elevated risk, while the predicted positive class should be interpreted cautiously.
The count regression model was less convincing. Although its errors were stable across time periods and the average predicted count was close to the overall low accident frequency, the model did not improve over a simple zero baseline in terms of MAE. This means that, under the current data structure, predicting exact monthly accident counts is too difficult. The model also underestimated total accident counts in later periods and performed poorly on observations where accidents actually occurred. These results suggest that the count task requires richer temporal and exposure-related information, such as traffic volume, weather, or lagged accident history.
Overall, the project should be treated as a successful spatial risk-mapping exercise rather than a complete accident forecasting system. The model demonstrates that road-network structure and basic road attributes contain useful information for identifying persistently risky segments in Warsaw. However, the current setup is based mostly on static spatial features, so it cannot fully explain month-to-month fluctuations in accidents. Future improvements should focus on adding dynamic predictors, especially traffic intensity, weather, seasonal variables, and accident lags. A comparison with simpler non-graph models would also help determine how much value is added specifically by the graph structure. Even with these limitations, the project provides a coherent spatial machine learning workflow and shows that graph-based modelling is a reasonable direction for urban road-safety analysis.
Gao, X., Jiang, X., Haworth, J., Zhuang, D., Wang, S., Chen, H., & Law, S. (2024). Uncertainty-aware probabilistic graph neural networks for road-level traffic crash prediction. Accident Analysis & Prevention, 208, 107801. https://doi.org/10.1016/j.aap.2024.107801
Huang, B., & Hooi, B. (2022). Traffic accident prediction using graph neural networks: New datasets and the TRAVEL model. Workshop on Graph Learning Benchmarks, The Web Conference 2022. https://graph-learning-benchmarks.github.io/assets/papers/glb2022/Traffic_Accident_Prediction_using_Graph_Neural_Networks_New_Datasets_and_the_TRAVEL_Model.pdf
Huang, B., Hooi, B., & Shu, K. (2023). TAP: A comprehensive data repository for traffic accident prediction in road networks. Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems. https://doi.org/10.1145/3589132.3625655
Nippani, A., Li, D., Ju, H., Koutsopoulos, H. N., & Zhang, H. R. (2023). Graph neural networks for road safety modeling: Datasets and evaluations for accident analysis. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2311.00164
Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30.