1 Introduction

Traffic accidents remain one of the most important urban safety problems because they directly affect human health, mobility, infrastructure planning, and the allocation of public resources. For a city, the total number of accidents gives only a partial picture of the problem. A more useful question is where risk tends to concentrate and which parts of the transport system require closer attention. Accident risk is rarely distributed evenly across the road network. Some segments may become repeatedly dangerous because of their function in the city, their position in the network, their road class, speed conditions, or their connection to busy intersections and corridors. For this reason, modelling accident risk at the level of road segments can support a practical understanding of urban safety that is closer to the scale at which many road-safety interventions are planned.

Recent studies increasingly approach this problem through graph-based road safety modelling. Nippani et al. (2023) formulate accident prediction as an edge-level task on road networks, where historical accident records are aggregated into monthly road-level labels and used to predict accident counts or accident occurrence in subsequent time periods; their approach therefore combines road-network structure with an explicit temporal setup based on chronological training, validation, and testing. Huang and Hooi (2022) propose the TRAVEL model, which uses road geometry and graph relations to estimate accident vulnerability. Huang et al. (2023) introduce the TAP repository, emphasizing the need for graph-structured accident datasets that preserve road topology and geospatial features. Gao et al. (2024) develop a more advanced road-level framework for spatio-temporal crash prediction and uncertainty-aware risk estimation.

The main aim of the project is to test the effectiveness of GNN (graph neural network) approach on data for traffic accidents occurred in Warsaw in 2024. We did not try to replicate geo-temporal approach used by Nippani et al., instead we use time component for train and test split only. In addition to this, it was necessary to clean the original dataset (collected from “sewik” traffic accidents database) and cut a part of the original road network, so the current project should rather be considered a proof of concept than a complete road risk predicting framework.

Although there exists quite large number of GNN architectures, we applied GraphSAGE model.GraphSAGE is a graph neural network model designed to learn representations of nodes by using both their own features and information from their neighbours. Instead of treating each road intersection or road segment as an isolated observation, the model passes information through the graph structure. In this project, this means that the representation of a node in the Warsaw road network is influenced by nearby connected nodes. GraphSAGE works by aggregating information from a node’s local neighbourhood and combining it with the node’s own attributes to create a learned embedding. These embeddings can then be used for prediction tasks. In the present case, the embeddings of the two nodes forming a road segment were combined with edge-level road attributes to estimate accident occurrence and accident count. This makes the model suitable for spatial road-risk analysis, because it allows the prediction for one segment to depend partly on its position in the wider road network.

Since the target variable is defined for road segments, the final prediction is made at the edge level. After GraphSAGE creates embeddings for the nodes of the road network, each edge is represented by the embeddings of its two endpoint nodes. These two node embeddings describe the local graph context around both ends of the road segment. The model then combines them with edge-specific features, such as road type, length, speed limit, one-way status, and bridge information. This combined edge representation is passed through a prediction layer. For the binary task, the output is transformed into the estimated probability that at least one accident occurs on the segment. For the regression task, the output represents the expected accident count for the segment. In this way, the model predicts road-segment risk using both the segment’s own attributes and the structure of the surrounding road network.

2 Data preprocessing and exploration

Let us start with loading all the necessary libraries:

Then, read the accidents data:

maz <- read.csv("Mazowieckie_accidents.csv")

wawa <- maz %>%
  filter(POWIAT == "POWIAT WARSZAWA") %>%
  mutate(
    accident_date = lubridate::ymd(DATA_ZDARZ),  # obtain month for future split
    accident_month = lubridate::floor_date(accident_date, "month")
  )

wawa %>%
  summarise(
    n_accidents = n(),
    missing_dates = sum(is.na(accident_date)),
    min_date = min(accident_date, na.rm = TRUE),
    max_date = max(accident_date, na.rm = TRUE),
    n_months = n_distinct(accident_month)
  )

##   n_accidents missing_dates   min_date   max_date n_months
## 1       25101             0 2024-01-01 2024-12-31       12

As we can clearly see, the overall number of accidents occurred in Warsaw within a year is quite huge; however, not all of them will be used for final modeling procedure.After that, we need to read the file with the powiats borders and extract the borders for Warsaw.

The next step is to load the roads geometries. OpensStreetMaps API did not allow us to extract all roads, so we used “osmextract” package. Initially, we load all Line geometries and then filter out only the main road types. We need to admit the fact that not all roads were used for future analysis. Utilizing all of them would seriously overcomplicate the analysis.

roads_osm_lines <- osmextract::oe_get(
  place = wawa_spat,
  provider = "geofabrik",
  layer = "lines",
  extra_tags = c("maxspeed", "oneway", "bridge", "ref"),
  force_vectortranslate = TRUE,
  quiet = FALSE
)

## 0...10...20...30...40...50...60...70...80...90...100 - done.

## Reading layer `lines' from data source 
##   `C:\Users\yana1\AppData\Roaming\R\data\R\osmextract\geofabrik_mazowieckie-latest.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 1347853 features and 14 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 18.23717 ymin: 50.88302 xmax: 23.89536 ymax: 53.80929
## Geodetic CRS:  WGS 84

# Add missing columns if they are absent
needed_cols <- c("osm_id", "name", "ref", "highway", "oneway", "bridge", "maxspeed")


# define helper functions to transform the roads features
for (col in needed_cols) {
  if (!col %in% names(roads_osm_lines)) {
    roads_osm_lines[[col]] <- NA_character_
  }
}

to_binary_osm <- function(x) {
  z <- stringr::str_to_lower(as.character(x))
  as.integer(z %in% c("yes", "true", "1", "-1", "reverse"))
}

to_integer_speed <- function(x) {
  as.integer(readr::parse_number(as.character(x)))
}

roads_spat <- roads_osm_lines %>%
  dplyr::transmute(
    osm_id = suppressWarnings(as.numeric(osm_id)),
    name = as.character(name),
    ref = as.character(ref),
    type = as.character(highway),
    oneway = to_binary_osm(oneway),
    bridge = to_binary_osm(bridge),
    maxspeed = to_integer_speed(maxspeed),
    geometry = geometry
  ) %>%
  sf::st_transform(target_crs) %>%
  sf::st_cast("MULTILINESTRING", warn = FALSE)


road_types_keep <- c(            # filter the main road types
  "motorway", "motorway_link",
  "trunk", "trunk_link",
  "primary", "primary_link",
  "secondary", "secondary_link",    
  "tertiary", "tertiary_link",
  "residential"
)

roads_spat <- roads_spat %>%
  dplyr::filter(type %in% road_types_keep)


roads_spat %>%
  sf::st_drop_geometry() %>%
  dplyr::count(type, sort = TRUE)

##              type     n
## 1     residential 84082
## 2        tertiary 28462
## 3       secondary 15636
## 4         primary  9213
## 5        motorway  2856
## 6   motorway_link  1676
## 7    primary_link   924
## 8  secondary_link   561
## 9           trunk   413
## 10  tertiary_link   329
## 11     trunk_link    58

roads_spat %>%
  sf::st_drop_geometry() %>%
  dplyr::summarise(
    n_roads = dplyr::n(),
    non_missing_maxspeed = sum(!is.na(maxspeed)),
    missing_maxspeed = sum(is.na(maxspeed)),
    missing_maxspeed_share = mean(is.na(maxspeed))
  )

##   n_roads non_missing_maxspeed missing_maxspeed missing_maxspeed_share
## 1  144210                50893            93317               0.647091

Then, we leave only the road segments lying inside of Warsaw.

roads_wawa <- roads_spat %>%
  st_transform(st_crs(wawa_spat)) %>%
  st_filter(wawa_spat, .predicate = st_intersects)

roads_wawa_clipped <- roads_wawa %>%
  st_intersection(wawa_spat) %>%
  st_collection_extract("LINESTRING") %>%
  st_cast("LINESTRING", warn = FALSE) %>%
  mutate(
    road_id = row_number(),
    length_m = as.numeric(st_length(geometry))
  )


roads_wawa_clipped <- roads_wawa_clipped %>%
  filter(type %in% road_types_keep)


ggplot() +
  geom_sf(data = wawa_spat, fill = "grey97", color = "grey25", linewidth = 0.7) +
  geom_sf(data = roads_wawa_clipped, color = "grey20", linewidth = 0.08, alpha = 0.65) +
  coord_sf(expand = FALSE) +
  theme_void() +
  labs(
    title = "Road Network of Warsaw"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    plot.caption = element_text(size = 8, hjust = 1),
    plot.margin = margin(10, 10, 10, 10))

We also need to make sure all the coordinates are stored in correct format. Here it can be seen that they are presented in mixed formats.

head(wawa[, c("WSP_GPS_X", "WSP_GPS_Y")])

##   WSP_GPS_X WSP_GPS_Y
## 1 21*15'060 52*12'250
## 2 21*02'168 52*21'365
## 3 20*58'488 52*16'126
## 4 21.069849 52.230339
## 5 21.070175  52.23019
## 6 20.903177 52.239088

Let us define the proper transforming function and apply it.

# build function to take coords to same format
convert_wsp_coord <- function(x) {
  x <- str_replace_all(as.character(x), ",", ".")
  dms <- str_match(x, "^(\\d+)\\*(\\d+)'(\\d+)$")
  from_dms <- as.numeric(dms[, 2]) +
    as.numeric(dms[, 3]) / 60 +
    (as.numeric(dms[, 4]) / 10) / 3600
  from_decimal <- suppressWarnings(as.numeric(x))
  ifelse(str_detect(x, "\\*"), from_dms, from_decimal)
}


# apply
wawa_sf <- wawa %>% mutate(
  lon = convert_wsp_coord(WSP_GPS_X),
  lat = convert_wsp_coord(WSP_GPS_Y)
    ) %>% st_as_sf(coords = c("lon", "lat"), crs = 4326, remove = FALSE) %>%
     st_transform(target_crs)

After this, we can start preparing the dataset in the form of graph. First of all, we kept only the columns, which later will be used for modeling - road type, oneway (dummy), bridge (dummy), speed limit and geometry itself. Missing speed limits were imputed using median from the same road type. Additionally, we added dummy identifying the roads for which speed limit was missing initially.

roads_for_graph <- roads_wawa_clipped %>%
  select(road_id, osm_id, type, oneway, bridge, maxspeed, length_m, geometry) %>%
  mutate(
    type = replace_na(as.character(type), "unknown"),
    oneway = as.integer(oneway),
    bridge = as.integer(bridge),
    maxspeed_numeric = readr::parse_number(as.character(maxspeed)),
    maxspeed_missing = as.integer(is.na(maxspeed_numeric))
  ) %>%
  group_by(type) %>%
  mutate(
    type_median_maxspeed = median(maxspeed_numeric, na.rm = TRUE),
    maxspeed_imputed = if_else(is.na(maxspeed_numeric), type_median_maxspeed, maxspeed_numeric)
  ) %>%
  ungroup() %>%
  select(road_id, osm_id, type, oneway, bridge, maxspeed_imputed, maxspeed_missing, geometry)

# sanity check
roads_for_graph %>%
  st_drop_geometry() %>%
  filter(is.na(maxspeed_imputed)) %>%
  count(type, sort = TRUE)

## # A tibble: 0 × 2
## # ℹ 2 variables: type <chr>, n <int>

From the already prepared tibble, we create a network object, where each road component becomes and edge, and each crossing becomes a node. Although some researchers use directed graphs (Nippani et al.), we utilized undirected one due to the fact that we have no information about actual direction of each road segment.

road_net <- roads_for_graph %>%
  as_sfnetwork(directed = FALSE) %>%
  convert(to_spatial_subdivision)

Then, we assign node_id’s and edge_id’s and add natural log of segments length. Rod id should not be used at this point due to the fact that roads were divided into smaller segments.

road_net <- road_net %>%
  activate(nodes) %>%
  select(-any_of(".tidygraph_node_index")) %>%
  mutate(node_id = row_number()) %>%
  activate(edges) %>%
  select(-any_of(".tidygraph_edge_index")) %>%
  mutate(
    graph_edge_id = row_number(),
    graph_length_m = as.numeric(st_length(geometry)),
    graph_log_length_m = log1p(graph_length_m)
  )

After that, we extract edge layer, which later will be used for accidents assignments.

graph_edges <- road_net %>%
  activate(edges) %>%
  st_as_sf()

Finally, we need to verify the graph structure, check for missing length values and make sure unique identifiers are consistent.

graph_edges %>%
  st_drop_geometry() %>%
  summarise(
    n_edges = n(),
    unique_graph_edge_ids = n_distinct(graph_edge_id),
    min_length_m = min(graph_length_m, na.rm = TRUE),
    median_length_m = median(graph_length_m, na.rm = TRUE),
    max_length_m = max(graph_length_m, na.rm = TRUE),
    zero_length_edges = sum(graph_length_m == 0, na.rm = TRUE),
    missing_length = sum(is.na(graph_length_m))
  )

## # A tibble: 1 × 7
##   n_edges unique_graph_edge_ids min_length_m median_length_m max_length_m
##     <int>                 <int>        <dbl>           <dbl>        <dbl>
## 1   33530                 33530       0.0244            69.2        2340.
## # ℹ 2 more variables: zero_length_edges <int>, missing_length <int>

As graph structure looks consistent, we can proceed with accidents. Unfortunately the database we used does not provide fully consistent data; some of accidents might be clipped to places far away from roads. In addition to this, we removed part of the road types from the analysis. Consequently, we need check how many accidents lie outside the road network and clip only those lying not farther than 5 meters away.

graph_nodes <- road_net %>%
  activate(nodes) %>%
  st_as_sf()

nearest_edge_idx <- st_nearest_feature(wawa_sf, graph_edges)
nearest_node_idx <- st_nearest_feature(wawa_sf, graph_nodes)

dist_to_edge_m <- as.numeric(st_distance(
  wawa_sf,
  graph_edges[nearest_edge_idx, ],
  by_element = TRUE
))

dist_to_node_m <- as.numeric(st_distance(
  wawa_sf,
  graph_nodes[nearest_node_idx, ],
  by_element = TRUE
))

accident_location_check <- tibble(
  total_accidents = nrow(wawa_sf),
  kept_within_5m = sum(dist_to_edge_m <= 5),
  deleted_outside_5m = sum(dist_to_edge_m > 5),
  deleted_outside_5m_share = deleted_outside_5m / total_accidents,
  near_node_5m = sum(dist_to_node_m <= 5),
  edge_only_5m = sum(dist_to_edge_m <= 5 & dist_to_node_m > 5),
  node_or_intersection_5m = sum(dist_to_node_m <= 5)
)

accident_location_check

## # A tibble: 1 × 7
##   total_accidents kept_within_5m deleted_outside_5m deleted_outside_5m_share
##             <int>          <int>              <int>                    <dbl>
## 1           25101          12982              12119                    0.483
## # ℹ 3 more variables: near_node_5m <int>, edge_only_5m <int>,
## #   node_or_intersection_5m <int>

In case of accidents lying near edges of the graph (road segments), we just assign them to the nearest edge. However, in case of accidents lying near nodes (road intersections), we still need to clip them to the nearest edge. By deleting them we would omit quite big number of accidents, and keeping them on nodes does not make any practical sense.

wawa_assigned_5m <- wawa_sf %>%
  mutate(
    graph_edge_id = graph_edges$graph_edge_id[nearest_edge_idx],
    nearest_node_id = graph_nodes$node_id[nearest_node_idx],
    dist_to_edge_m = dist_to_edge_m,
    dist_to_node_m = dist_to_node_m,
    assignment_type = case_when(
      dist_to_edge_m <= 5 & dist_to_node_m <= 5 ~ "intersection_assigned_to_nearest_edge",
      dist_to_edge_m <= 5 & dist_to_node_m > 5 ~ "direct_edge_assignment",
      TRUE ~ "outside_graph"
    )
  ) %>%
  filter(assignment_type != "outside_graph")

wawa_assigned_5m %>%
  st_drop_geometry() %>%
  count(assignment_type)

##                         assignment_type     n
## 1                direct_edge_assignment 10826
## 2 intersection_assigned_to_nearest_edge  2156

Then, we create a dataset with accidents count per each edge_id and per each month.

accidents_by_graph_edge_month <- wawa_assigned_5m %>%
  st_drop_geometry() %>%
  count(graph_edge_id, accident_month, name = "accident_count")

accidents_by_graph_edge_month %>%
  summarise(
    n_edge_months_with_accidents = n(),
    total_assigned_accidents = sum(accident_count),
    min_month = min(accident_month),
    max_month = max(accident_month),
    n_months = n_distinct(accident_month)
  )

##   n_edge_months_with_accidents total_assigned_accidents  min_month  max_month
## 1                        12044                    12982 2024-01-01 2024-12-01
##   n_months
## 1       12

Then, we can create statik edge dataset having only features we need for further analysis.

edge_static_dataset <- graph_edges %>%
  select(
    graph_edge_id, road_id, osm_id, type, oneway, bridge,
    graph_length_m, graph_log_length_m,
    maxspeed_imputed, maxspeed_missing,
    geometry
  )

edge_static_dataset %>%
  st_drop_geometry() %>%
  summarise(
    n_edges = n(),
    unique_graph_edge_ids = n_distinct(graph_edge_id),
    missing_type = sum(is.na(type)),
    missing_maxspeed = sum(is.na(maxspeed_imputed))
  )

## # A tibble: 1 × 4
##   n_edges unique_graph_edge_ids missing_type missing_maxspeed
##     <int>                 <int>        <int>            <int>
## 1   33530                 33530            0                0

Then, we construct proper montly accidents panel with count and binary target.

all_months <- seq.Date(
  from = min(wawa_sf$accident_month, na.rm = TRUE),
  to = max(wawa_sf$accident_month, na.rm = TRUE),
  by = "month"
)

edge_month_panel <- tidyr::expand_grid(
  graph_edge_id = edge_static_dataset$graph_edge_id,
  accident_month = all_months
) %>%
  left_join(
    accidents_by_graph_edge_month,
    by = c("graph_edge_id", "accident_month")
  ) %>%
  mutate(
    accident_count = replace_na(accident_count, 0L),
    accident_binary = as.integer(accident_count > 0)
  )

edge_month_panel %>%
  summarise(
    n_rows = n(),
    n_edges = n_distinct(graph_edge_id),
    n_months = n_distinct(accident_month),
    total_accidents = sum(accident_count),
    positive_edge_months = sum(accident_binary),
    positive_share = mean(accident_binary),
    missing_targets = sum(is.na(accident_count))
  )

## # A tibble: 1 × 7
##   n_rows n_edges n_months total_accidents positive_edge_months positive_share
##    <int>   <int>    <int>           <int>                <int>          <dbl>
## 1 402360   33530       12           12982                12044         0.0299
## # ℹ 1 more variable: missing_targets <int>

At this point, we can depict the road network with montly accidents counts per each edge.

monthly_graph_dataset <- edge_static_dataset %>%
  select(graph_edge_id, geometry) %>%
  left_join(edge_month_panel, by = "graph_edge_id") %>%
  mutate(
    accident_month_label = format(accident_month, "%Y-%m")
  )

month_chunks <- split(
  all_months,
  ceiling(seq_along(all_months) / 2)
)

common_max_accidents <- max(monthly_graph_dataset$accident_count, na.rm = TRUE)

plot_month_chunk <- function(month_vector) {
  
  month_labels <- format(month_vector, "%Y-%m")
  
  chunk_all_edges <- monthly_graph_dataset %>%
    filter(accident_month %in% month_vector)
  
  chunk_positive_edges <- chunk_all_edges %>%
    filter(accident_count > 0)
  
  ggplot() +
    geom_sf(
      data = chunk_all_edges,
      color = "grey85",
      linewidth = 0.12,
      alpha = 0.65
    ) +
    geom_sf(
      data = chunk_positive_edges,
      aes(color = accident_count, linewidth = accident_count),
      alpha = 0.95
    ) +
    geom_sf(
      data = wawa_spat,
      fill = NA,
      color = "black",
      linewidth = 0.45
    ) +
    facet_wrap(~ accident_month_label, ncol = 2) +
    scale_color_gradient(
      low = "#FDB863",
      high = "#B30000",
      limits = c(1, common_max_accidents),
      name = "Accidents"
    ) +
    scale_linewidth_continuous(
      range = c(0.35, 2.2),
      guide = "none"
    ) +
    coord_sf(expand = FALSE) +
    theme_void() +
    labs(
      title = "Monthly Accident Counts on the Warsaw Road Graph",
      subtitle = paste(min(month_labels), "to", max(month_labels)),
    ) +
    theme(
      plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
      plot.subtitle = element_text(size = 11, hjust = 0.5),
      plot.caption = element_text(size = 8, hjust = 1),
      strip.text = element_text(size = 10, face = "bold"),
      legend.position = "right",
      panel.spacing = unit(1, "lines"),
      plot.margin = margin(10, 10, 10, 10)
    )
}

monthly_maps <- lapply(month_chunks, plot_month_chunk)

monthly_maps[[1]]

monthly_maps[[2]]

monthly_maps[[3]]

monthly_maps[[4]]

monthly_maps[[5]]

monthly_maps[[6]]

After that, let us prepare the nodes feature table.

road_net <- road_net %>%
  activate(nodes) %>%
  mutate(
    node_degree = centrality_degree(mode = "all"),
    node_component = group_components()
  )

graph_nodes <- road_net %>%
  activate(nodes) %>%
  st_as_sf() %>%
  mutate(
    node_x = st_coordinates(geometry)[, 1],
    node_y = st_coordinates(geometry)[, 2]
  ) %>%
  st_drop_geometry() %>%
  select(node_id, node_degree, node_component, node_x, node_y)

graph_nodes %>%
  summarise(
    n_nodes = n(),
    unique_node_ids = n_distinct(node_id),
    n_components = n_distinct(node_component),
    min_degree = min(node_degree),
    median_degree = median(node_degree),
    max_degree = max(node_degree)
  )

## # A tibble: 1 × 6
##   n_nodes unique_node_ids n_components min_degree median_degree max_degree
##     <int>           <int>        <int>      <dbl>         <dbl>      <dbl>
## 1   27052           27052          176          1             2          6

And then we can prepare the tables for modeling. First, the edge table combining graph structure and tabular features.

graph_model_edges <- road_net %>%
  activate(edges) %>%
  as_tibble() %>%
  select(graph_edge_id, from, to) %>%
  left_join(
    edge_static_dataset %>%
      st_drop_geometry() %>%
      select(
        graph_edge_id, road_id, osm_id, type, oneway, bridge,
        graph_length_m, graph_log_length_m,
        maxspeed_imputed, maxspeed_missing
      ),
    by = "graph_edge_id"
  )

graph_model_edges %>%
  summarise(
    n_edges = n(),
    unique_graph_edge_ids = n_distinct(graph_edge_id),
    missing_type = sum(is.na(type)),
    missing_maxspeed = sum(is.na(maxspeed_imputed))
  )

## Simple feature collection with 1 feature and 4 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 626541.6 ymin: 472233.7 xmax: 654472.2 ymax: 502031
## Projected CRS: ETRF2000-PL / CS92
## # A tibble: 1 × 5
##   n_edges unique_graph_edge_ids missing_type missing_maxspeed
##     <int>                 <int>        <int>            <int>
## 1   33530                 33530            0                0
## # ℹ 1 more variable: geometry <MULTILINESTRING [m]>

Then, we prapare other tables to be exported to Python.

# export node table

node_table <- graph_nodes %>%
  select(node_id, node_degree, node_x, node_y, node_component) %>%
  arrange(node_id)


#-------------------------------------------------------------------------------
# export static edge table

edge_table_raw <- graph_model_edges %>%
  mutate(
    from_py = from - 1L,
    to_py = to - 1L,
    type = factor(type)
  ) %>%
  select(
    graph_edge_id, from, to, from_py, to_py,
    type, oneway, bridge,
    graph_length_m, graph_log_length_m,
    maxspeed_imputed, maxspeed_missing
  )

edge_type_dummies <- model.matrix(~ type - 1, data = edge_table_raw) %>%
  as.data.frame() %>%
  janitor::clean_names()

edge_table <- bind_cols(
  edge_table_raw %>% select(-type),
  edge_type_dummies
) %>%
  arrange(graph_edge_id)


#-------------------------------------------------------------------------------
# export monthly target table

edge_month_table <- edge_month_panel %>%
  mutate(
    month_id = match(accident_month, all_months) - 1L
  ) %>%
  arrange(accident_month, graph_edge_id) %>%
  select(
    graph_edge_id,
    accident_month,
    month_id,
    accident_count,
    accident_binary
  )

# final checks before writing files

node_table %>%
  summarise(
    n_nodes = n(),
    unique_node_ids = n_distinct(node_id),
    min_node_id = min(node_id),
    max_node_id = max(node_id)
  )

## # A tibble: 1 × 4
##   n_nodes unique_node_ids min_node_id max_node_id
##     <int>           <int>       <int>       <int>
## 1   27052           27052           1       27052

edge_table %>%
  summarise(
    n_edges = n(),
    unique_graph_edge_ids = n_distinct(graph_edge_id),
    min_from_py = min(from_py),
    max_from_py = max(from_py),
    min_to_py = min(to_py),
    max_to_py = max(to_py)
  )

## Simple feature collection with 1 feature and 6 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: 626541.6 ymin: 472233.7 xmax: 654472.2 ymax: 502031
## Projected CRS: ETRF2000-PL / CS92
## # A tibble: 1 × 7
##   n_edges unique_graph_edge_ids min_from_py max_from_py min_to_py max_to_py
##     <int>                 <int>       <int>       <int>     <int>     <int>
## 1   33530                 33530           0       27050         1     27051
## # ℹ 1 more variable: geometry <MULTILINESTRING [m]>

edge_month_table %>%
  summarise(
    n_rows = n(),
    n_edges = n_distinct(graph_edge_id),
    n_months = n_distinct(accident_month),
    total_accidents = sum(accident_count),
    positive_edge_months = sum(accident_binary),
    positive_share = mean(accident_binary),
    missing_targets = sum(is.na(accident_count))
  )

## # A tibble: 1 × 7
##   n_rows n_edges n_months total_accidents positive_edge_months positive_share
##    <int>   <int>    <int>           <int>                <int>          <dbl>
## 1 402360   33530       12           12982                12044         0.0299
## # ℹ 1 more variable: missing_targets <int>

tibble(
  n_nodes = nrow(node_table),
  min_node_id = min(node_table$node_id),
  max_node_id = max(node_table$node_id),
  n_edges = nrow(edge_table),
  n_edge_month_rows = nrow(edge_month_table),
  min_index_used = min(c(edge_table$from_py, edge_table$to_py)),
  max_index_used = max(c(edge_table$from_py, edge_table$to_py)),
  any_negative = any(c(edge_table$from_py, edge_table$to_py) < 0),
  any_too_large = any(c(edge_table$from_py, edge_table$to_py) >= nrow(node_table)),
  total_accidents = sum(edge_month_table$accident_count),
  missing_targets = sum(is.na(edge_month_table$accident_count))
)

## # A tibble: 1 × 11
##   n_nodes min_node_id max_node_id n_edges n_edge_month_rows min_index_used
##     <int>       <int>       <int>   <int>             <int>          <int>
## 1   27052           1       27052   33530            402360              0
## # ℹ 5 more variables: max_index_used <int>, any_negative <lgl>,
## #   any_too_large <lgl>, total_accidents <int>, missing_targets <int>

#-------------------------------------------------------------------------------
# write files for Python / PyTorch Geometric

write_csv(node_table, "node_table.csv")
write_csv(edge_table, "edge_table.csv")
write_csv(edge_month_table, "edge_month_table.csv")

3 Modeling

3.1 BInary classification

Due to the fact that R ecosystem does not provide any frameworks for GNN’s, we need to transfer modeling to Python with “reticulate” package. First, we need to configure the environment. I used anaconda environment due to the fact that it provides maximum stability while working with various libraries. However, it is also possible to create virtual environment and load all libraries there.

Sys.setenv(RETICULATE_PYTHON = "C:/Users/yana1/anaconda3/envs/gnn_pyg_r/python.exe") #replace with your local environment 

library(reticulate)

py_config()

## python:         C:/Users/yana1/anaconda3/envs/gnn_pyg_r/python.exe
## libpython:      C:/Users/yana1/anaconda3/envs/gnn_pyg_r/python311.dll
## pythonhome:     C:/Users/yana1/anaconda3/envs/gnn_pyg_r
## version:        3.11.15 | packaged by Anaconda, Inc. | (main, Mar 11 2026, 17:12:15) [MSC v.1942 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Users/yana1/anaconda3/envs/gnn_pyg_r/Lib/site-packages/numpy
## numpy_version:  2.4.4
## 
## NOTE: Python version was forced by RETICULATE_PYTHON

After the environment was properly set, we can start with importing all necessary Python libraries and loading data (all the codes in this section were previosuly run and debugged in Python IDE, so there is no need in small chunks).

py_run_string(r"(
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from sklearn.metrics import (
    roc_auc_score,
    balanced_accuracy_score,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)

from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

torch.manual_seed(42)
np.random.seed(42)

node_table = pd.read_csv('node_table.csv')
edge_table = pd.read_csv('edge_table.csv')
edge_month_table = pd.read_csv('edge_month_table.csv')

edge_month_table['accident_month'] = pd.to_datetime(edge_month_table['accident_month'])

print('node_table:', node_table.shape)
print('edge_table:', edge_table.shape)
print('edge_month_table:', edge_month_table.shape)

print('\nnode_table columns:')
print(node_table.columns.tolist())

print('\nedge_table columns:')
print(edge_table.columns.tolist())

print('\nedge_month_table columns:')
print(edge_month_table.columns.tolist())
)")

## node_table: (27052, 5)
## edge_table: (33530, 23)
## edge_month_table: (402360, 5)
## 
## node_table columns:
## ['node_id', 'node_degree', 'node_x', 'node_y', 'node_component']
## 
## edge_table columns:
## ['graph_edge_id', 'from', 'to', 'from_py', 'to_py', 'oneway', 'bridge', 'graph_length_m', 'graph_log_length_m', 'maxspeed_imputed', 'maxspeed_missing', 'geometry', 'typemotorway', 'typemotorway_link', 'typeprimary', 'typeprimary_link', 'typeresidential', 'typesecondary', 'typesecondary_link', 'typetertiary', 'typetertiary_link', 'typetrunk', 'typetrunk_link']
## 
## edge_month_table columns:
## ['graph_edge_id', 'accident_month', 'month_id', 'accident_count', 'accident_binary']

Then, we need to run sanity checks, construct proper dataset for binary classification model and divide the data into train, validation and test subsets chronologically.

py_run_string(r"(
# --------------------------------------------------
# Basic checks
# --------------------------------------------------

assert edge_table['graph_edge_id'].is_unique, 'graph_edge_id is not unique in edge_table'
assert node_table['node_id'].is_unique, 'node_id is not unique in node_table'

edge_ids = edge_table['graph_edge_id'].values
n_edges = len(edge_ids)

months = sorted(edge_month_table['accident_month'].unique())
n_months = len(months)

print('\nMonths:')
for i, m in enumerate(months):
    print(i, pd.Timestamp(m).strftime('%Y-%m'))

expected_rows = n_edges * n_months
assert len(edge_month_table) == expected_rows, (
    f'Expected {expected_rows} edge-month rows, got {len(edge_month_table)}'
)

print('\nNumber of edges:', n_edges)
print('Number of months:', n_months)
print('Expected rows:', expected_rows)


# --------------------------------------------------
# Monthly binary label matrix: months × edges
# --------------------------------------------------

label_matrix = (
    edge_month_table
    .pivot(index='accident_month', columns='graph_edge_id', values='accident_binary')
    .reindex(index=months, columns=edge_ids)
)

missing_labels = label_matrix.isna().sum().sum()
print('\nMissing labels in matrix:', missing_labels)
assert missing_labels == 0, 'There are missing labels after pivoting'

Y = torch.tensor(label_matrix.to_numpy(dtype=np.float32), dtype=torch.float32)

print('Y shape:', Y.shape)
print('Overall positive share:', float(Y.mean().item()))


# --------------------------------------------------
#chronological split
# train: Jan-Aug
# valid: Sep-Oct
# test : Nov-Dec
# --------------------------------------------------

train_month_idx = list(range(0, 8))
valid_month_idx = list(range(8, 10))
test_month_idx = list(range(10, 12))

print('\nSplit:')
print('Train:', [pd.Timestamp(months[i]).strftime('%Y-%m') for i in train_month_idx])
print('Valid:', [pd.Timestamp(months[i]).strftime('%Y-%m') for i in valid_month_idx])
print('Test :', [pd.Timestamp(months[i]).strftime('%Y-%m') for i in test_month_idx])

print('\nPositive shares:')
print('Train:', float(Y[train_month_idx].mean().item()))
print('Valid:', float(Y[valid_month_idx].mean().item()))
print('Test :', float(Y[test_month_idx].mean().item()))
)")

## 
## Months:
## 0 2024-01
## 1 2024-02
## 2 2024-03
## 3 2024-04
## 4 2024-05
## 5 2024-06
## 6 2024-07
## 7 2024-08
## 8 2024-09
## 9 2024-10
## 10 2024-11
## 11 2024-12
## 
## Number of edges: 33530
## Number of months: 12
## Expected rows: 402360
## 
## Missing labels in matrix: 0
## Y shape: torch.Size([12, 33530])
## Overall positive share: 0.029933393001556396
## 
## Split:
## Train: ['2024-01', '2024-02', '2024-03', '2024-04', '2024-05', '2024-06', '2024-07', '2024-08']
## Valid: ['2024-09', '2024-10']
## Test : ['2024-11', '2024-12']
## 
## Positive shares:
## Train: 0.029115717858076096
## Valid: 0.031225768849253654
## Test : 0.03191171959042549

Then, we need to define preprocessing steps and create PyTorch Geometry object.

py_run_string(r"(
# --------------------------------------------------
# Node features
# --------------------------------------------------
# node_component is not used as a numeric feature because it is an arbitrary ID.

node_feature_cols = ['node_degree', 'node_x', 'node_y']

node_features_df = node_table[node_feature_cols].apply(pd.to_numeric, errors='coerce')

print('\nNode feature dtypes:')
print(node_features_df.dtypes)

print('\nMissing node features after numeric conversion:')
print(node_features_df.isna().sum())

node_features_df = node_features_df.fillna(node_features_df.median()).fillna(0)

x = torch.tensor(
    node_features_df.to_numpy(dtype=np.float32),
    dtype=torch.float32
)

x_mean = x.mean(dim=0, keepdim=True)
x_std = x.std(dim=0, keepdim=True)
x_std[x_std == 0] = 1.0
x = (x - x_mean) / x_std


# --------------------------------------------------
# Edge index
# --------------------------------------------------

edge_index_original = torch.tensor(
    edge_table[['from_py', 'to_py']].to_numpy(dtype=np.int64).T,
    dtype=torch.long
)

# Message passing is undirected.
edge_index_message = torch.cat(
    [edge_index_original, edge_index_original.flip(0)],
    dim=1
)


# --------------------------------------------------
# Edge features
# --------------------------------------------------

continuous_edge_cols = [
    'graph_log_length_m',
    'maxspeed_imputed'
]

binary_edge_cols = [
    'oneway',
    'bridge',
    'maxspeed_missing'
]

type_dummy_cols = [
    col for col in edge_table.columns
    if col.startswith('type')
]

continuous_df = edge_table[continuous_edge_cols].apply(pd.to_numeric, errors='coerce')
binary_df = edge_table[binary_edge_cols + type_dummy_cols].apply(pd.to_numeric, errors='coerce')

print('\nContinuous edge feature dtypes:')
print(continuous_df.dtypes)

print('\nBinary/dummy edge feature dtypes:')
print(binary_df.dtypes)

print('\nMissing continuous edge features:')
print(continuous_df.isna().sum())

print('\nMissing binary/dummy edge features:')
print(binary_df.isna().sum())

continuous_df = continuous_df.fillna(continuous_df.median()).fillna(0)
binary_df = binary_df.fillna(0)

continuous_tensor = torch.tensor(
    continuous_df.to_numpy(dtype=np.float32),
    dtype=torch.float32
)

continuous_mean = continuous_tensor.mean(dim=0, keepdim=True)
continuous_std = continuous_tensor.std(dim=0, keepdim=True)
continuous_std[continuous_std == 0] = 1.0
continuous_tensor = (continuous_tensor - continuous_mean) / continuous_std

binary_tensor = torch.tensor(
    binary_df.to_numpy(dtype=np.float32),
    dtype=torch.float32
)

edge_attr = torch.cat(
    [continuous_tensor, binary_tensor],
    dim=1
)

edge_feature_cols = continuous_edge_cols + binary_edge_cols + type_dummy_cols

print('\nEdge feature columns:')
print(edge_feature_cols)
print('Number of edge features:', len(edge_feature_cols))
print('edge_attr:', edge_attr.shape)


# --------------------------------------------------
# PyG data object
# --------------------------------------------------

data = Data(
    x=x,
    edge_index=edge_index_message,
    original_edge_index=edge_index_original,
    edge_attr=edge_attr
)

print('\nData object:')
print(data)
print('x:', data.x.shape)
print('message edge_index:', data.edge_index.shape)
print('original_edge_index:', data.original_edge_index.shape)
print('edge_attr:', data.edge_attr.shape)

assert data.edge_attr.shape[0] == n_edges, 'edge_attr rows do not match number of edges'
assert data.original_edge_index.shape[1] == n_edges, 'original_edge_index columns do not match number of edges'
assert Y.shape[1] == n_edges, 'Y columns do not match number of edges'
assert int(data.edge_index.max()) < data.x.shape[0], 'edge_index contains node index larger than number of nodes'
assert int(data.edge_index.min()) >= 0, 'edge_index contains negative node index'
)")

## 
## Node feature dtypes:
## node_degree      int64
## node_x         float64
## node_y         float64
## dtype: object
## 
## Missing node features after numeric conversion:
## node_degree    0
## node_x         0
## node_y         0
## dtype: int64
## 
## Continuous edge feature dtypes:
## graph_log_length_m    float64
## maxspeed_imputed        int64
## dtype: object
## 
## Binary/dummy edge feature dtypes:
## oneway                int64
## bridge                int64
## maxspeed_missing      int64
## typemotorway          int64
## typemotorway_link     int64
## typeprimary           int64
## typeprimary_link      int64
## typeresidential       int64
## typesecondary         int64
## typesecondary_link    int64
## typetertiary          int64
## typetertiary_link     int64
## typetrunk             int64
## typetrunk_link        int64
## dtype: object
## 
## Missing continuous edge features:
## graph_log_length_m    0
## maxspeed_imputed      0
## dtype: int64
## 
## Missing binary/dummy edge features:
## oneway                0
## bridge                0
## maxspeed_missing      0
## typemotorway          0
## typemotorway_link     0
## typeprimary           0
## typeprimary_link      0
## typeresidential       0
## typesecondary         0
## typesecondary_link    0
## typetertiary          0
## typetertiary_link     0
## typetrunk             0
## typetrunk_link        0
## dtype: int64
## 
## Edge feature columns:
## ['graph_log_length_m', 'maxspeed_imputed', 'oneway', 'bridge', 'maxspeed_missing', 'typemotorway', 'typemotorway_link', 'typeprimary', 'typeprimary_link', 'typeresidential', 'typesecondary', 'typesecondary_link', 'typetertiary', 'typetertiary_link', 'typetrunk', 'typetrunk_link']
## Number of edge features: 16
## edge_attr: torch.Size([33530, 16])
## 
## Data object:
## Data(x=[27052, 3], edge_index=[2, 67060], edge_attr=[33530, 16], original_edge_index=[2, 33530])
## x: torch.Size([27052, 3])
## message edge_index: torch.Size([2, 67060])
## original_edge_index: torch.Size([2, 33530])
## edge_attr: torch.Size([33530, 16])

The next step is to define the model architecture, loss and some helper functions.

# PART 4: Model definition, loss, and optimizer

py_run_string(r"(
class EdgeGraphSAGEBinary(nn.Module):
    def __init__(self, node_in_dim, edge_in_dim, hidden_dim=64, dropout=0.30):
        super().__init__()
        
        self.conv1 = SAGEConv(node_in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        
        self.edge_mlp = nn.Sequential(
            nn.Linear(hidden_dim * 2 + edge_in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1)
        )
        
        self.dropout = dropout
    
    def encode_nodes(self, x, edge_index):
        h = self.conv1(x, edge_index)
        h = F.relu(h)
        h = F.dropout(h, p=self.dropout, training=self.training)
        
        h = self.conv2(h, edge_index)
        h = F.relu(h)
        return h
    
    def decode_edges(self, node_emb, original_edge_index, edge_attr):
        src = original_edge_index[0]
        dst = original_edge_index[1]
        
        h_src = node_emb[src]
        h_dst = node_emb[dst]
        
        edge_input = torch.cat([h_src, h_dst, edge_attr], dim=1)
        logits = self.edge_mlp(edge_input).squeeze(-1)
        return logits
    
    def forward(self, x, edge_index, original_edge_index, edge_attr):
        node_emb = self.encode_nodes(x, edge_index)
        logits = self.decode_edges(node_emb, original_edge_index, edge_attr)
        return logits


model = EdgeGraphSAGEBinary(
    node_in_dim=data.x.shape[1],
    edge_in_dim=data.edge_attr.shape[1],
    hidden_dim=64,
    dropout=0.30
)

Y_train = Y[train_month_idx]

n_pos = float(Y_train.sum().item())
n_total = float(Y_train.numel())
n_neg = n_total - n_pos

pos_weight = torch.tensor([n_neg / max(n_pos, 1.0)], dtype=torch.float32)

print('\nTraining labels:')
print('positive labels:', int(n_pos))
print('negative labels:', int(n_neg))
print('positive share:', n_pos / n_total)
print('pos_weight:', float(pos_weight.item()))

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

print('\nModel initialized.')
print(model)
)")

## 
## Training labels:
## positive labels: 7810
## negative labels: 260430
## positive share: 0.029115717268118103
## pos_weight: 33.34571075439453
## 
## Model initialized.
## EdgeGraphSAGEBinary(
##   (conv1): SAGEConv(3, 64, aggr=mean)
##   (conv2): SAGEConv(64, 64, aggr=mean)
##   (edge_mlp): Sequential(
##     (0): Linear(in_features=144, out_features=64, bias=True)
##     (1): ReLU()
##     (2): Dropout(p=0.3, inplace=False)
##     (3): Linear(in_features=64, out_features=32, bias=True)
##     (4): ReLU()
##     (5): Dropout(p=0.3, inplace=False)
##     (6): Linear(in_features=32, out_features=1, bias=True)
##   )
## )

#-------------------------------------------------------------------------------
# Evaluation helpers

py_run_string(r"(
def collect_predictions(month_indices):
    model.eval()
    
    with torch.no_grad():
        logits = model(
            data.x,
            data.edge_index,
            data.original_edge_index,
            data.edge_attr
        )
        
        probs = torch.sigmoid(logits)
        
        y_true = Y[month_indices].reshape(-1).cpu().numpy()
        y_prob = probs.repeat(len(month_indices)).cpu().numpy()
    
    return y_true, y_prob


def evaluate_split(month_indices, threshold=0.5, split_name='split'):
    y_true, y_prob = collect_predictions(month_indices)
    y_pred = (y_prob >= threshold).astype(int)
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    
    metrics = {
        'split': split_name,
        'threshold': threshold,
        'auc': roc_auc_score(y_true, y_prob),
        'accuracy': accuracy_score(y_true, y_pred),
        'balanced_accuracy': balanced_accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'specificity': tn / (tn + fp) if (tn + fp) > 0 else np.nan,
        'f1': f1_score(y_true, y_pred, zero_division=0),
        'true_positive_share': float(np.mean(y_true)),
        'pred_positive_share': float(np.mean(y_pred)),
        'mean_probability': float(np.mean(y_prob)),
        'tp': int(tp),
        'tn': int(tn),
        'fp': int(fp),
        'fn': int(fn)
    }
    
    return metrics


def find_best_threshold_on_validation():
    y_true, y_prob = collect_predictions(valid_month_idx)
    
    thresholds = np.linspace(0.01, 0.99, 99)
    best_threshold = 0.5
    best_score = -1
    
    for threshold in thresholds:
        y_pred = (y_prob >= threshold).astype(int)
        score = balanced_accuracy_score(y_true, y_pred)
        
        if score > best_score:
            best_score = score
            best_threshold = threshold
    
    return float(best_threshold), float(best_score)


print('Evaluation helpers are ready.')

# quick pre-training check
pre_train_valid = evaluate_split(valid_month_idx, threshold=0.5, split_name='valid_pre_training')
print('\nPre-training validation metrics:')
for key, value in pre_train_valid.items():
    print(f'{key}: {value}')
)")

## Evaluation helpers are ready.
## 
## Pre-training validation metrics:
## split: valid_pre_training
## threshold: 0.5
## auc: 0.47507729485772315
## accuracy: 0.031225767968983
## balanced_accuracy: 0.5
## precision: 0.031225767968983
## recall: 1.0
## specificity: 0.0
## f1: 0.06056048818578824
## true_positive_share: 0.031225768849253654
## pred_positive_share: 1.0
## mean_probability: 0.5178737044334412
## tp: 2094
## tn: 0
## fp: 64966
## fn: 0

Finally, we can train the model and predict on test set.

# prepare a nice-looking table

library(knitr)

binary_table <- bind_rows(
  as_tibble(py_to_r(py$final_train)),
  as_tibble(py_to_r(py$final_valid)),
  as_tibble(py_to_r(py$final_test))
) %>%
  transmute(
    Split = recode(split, train = "Train", valid = "Validation", test = "Test"),
    AUROC = round(auc, 3),
    Accuracy = round(accuracy, 3),
    `Balanced accuracy` = round(balanced_accuracy, 3),
    Precision = round(precision, 3),
    Recall = round(recall, 3),
    Specificity = round(specificity, 3),
    F1 = round(f1, 3),
    `True positive share` = round(true_positive_share, 3),
    `Predicted positive share` = round(pred_positive_share, 3),
    Threshold = round(threshold, 2)
  )

kable(
  binary_table,
  caption = "Binary GraphSAGE classification metrics",
  align = "c"
)

Binary GraphSAGE classification metrics
Split	AUROC	Accuracy	Balanced accuracy	Precision	Recall	Specificity	F1	True positive share	Predicted positive share	Threshold
Train	0.795	0.721	0.723	0.072	0.726	0.721	0.132	0.029	0.292	0.54
Validation	0.787	0.721	0.715	0.076	0.709	0.721	0.137	0.031	0.292	0.54
Test	0.788	0.721	0.715	0.077	0.708	0.722	0.140	0.032	0.292	0.54

The binary model shows stable behaviour across the training, validation, and test sets. The AUROC remains at a moderate level in all three splits, which means that the model has some ability to rank road segments from lower to higher accident risk. The small difference between training and later periods suggests that the model generalises reasonably well across months and does not show a strong sign of overfitting. Accuracy and specificity are also relatively stable, although they should be interpreted carefully because accident-positive observations are rare. Balanced accuracy gives a more useful picture here and indicates that the model performs better than a naive classifier, while still leaving clear room for improvement. The most visible pattern is the difference between recall and precision. Recall is high, so the model catches a large share of accident-positive cases. Precision and F1 remain low, which means that many segments classified as risky do not actually have accidents in the observed month. This follows from the strong imbalance in the data and from the model’s tendency to flag a relatively large share of edges as risky. Overall, the binary model is useful as a broad risk-screening tool: it highlights potentially problematic road segments, although its positive predictions should be interpreted with caution.

Let us also plot the training loop.

binary_history <- as.data.frame(py_to_r(py$binary_history_df))

binary_loss_long <- binary_history %>%
  select(epoch, train_loss, valid_loss) %>%
  pivot_longer(
    cols = c(train_loss, valid_loss),
    names_to = "split",
    values_to = "loss"
  ) %>%
  mutate(
    split = recode(
      split,
      train_loss = "Train",
      valid_loss = "Validation"
    )
  )

ggplot(binary_loss_long, aes(x = epoch, y = loss, color = split)) +
  geom_line(linewidth = 1.1) +
  labs(
    title = "Binary GraphSAGE Training History",
    subtitle = "Train and validation loss across epochs",
    x = "Epoch",
    y = "Weighted binary cross-entropy loss",
    color = "Split"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

During the first 50 epochs, the model reached a relatively stable solution. However, after that the loss was quite stable suggesting that training for 150 more epochs did not result in severe overfitting.

3.2 Regression

Due to the fact that data was already loaded and preprocessed, we can omit some steps. The only modification we need to apply is change in target - we need to replace binary output matrix with counts of accidents.

# REGRESSION PART 1: Count target matrix

py_run_string(r"(
# --------------------------------------------------
# Monthly count matrix: months × edges
# --------------------------------------------------

count_matrix = (
    edge_month_table
    .pivot(index='accident_month', columns='graph_edge_id', values='accident_count')
    .reindex(index=months, columns=edge_ids)
)

missing_counts = count_matrix.isna().sum().sum()
print('Missing count labels:', missing_counts)
assert missing_counts == 0, 'There are missing count labels after pivoting'

Y_count = torch.tensor(
    count_matrix.to_numpy(dtype=np.float32),
    dtype=torch.float32
)

print('Y_count shape:', Y_count.shape)
print('Total accidents:', float(Y_count.sum().item()))
print('Mean count overall:', float(Y_count.mean().item()))
print('Max count:', float(Y_count.max().item()))

print('\\nSplit mean counts:')
print('Train:', float(Y_count[train_month_idx].mean().item()))
print('Valid:', float(Y_count[valid_month_idx].mean().item()))
print('Test :', float(Y_count[test_month_idx].mean().item()))

print('\\nSplit total accidents:')
print('Train:', float(Y_count[train_month_idx].sum().item()))
print('Valid:', float(Y_count[valid_month_idx].sum().item()))
print('Test :', float(Y_count[test_month_idx].sum().item()))
)")

## Missing count labels: 0
## Y_count shape: torch.Size([12, 33530])
## Total accidents: 12982.0
## Mean count overall: 0.03226463869214058
## Max count: 5.0
## \nSplit mean counts:
## Train: 0.031222039833664894
## Valid: 0.034088876098394394
## Test : 0.034610796719789505
## \nSplit total accidents:
## Train: 8375.0
## Valid: 2286.0
## Test : 2321.0

Then, we proceed with model architecture definition and evaluation helper fucntions.

py_run_string(r"(
class EdgeGraphSAGECount(nn.Module):
    def __init__(self, node_in_dim, edge_in_dim, hidden_dim=64, dropout=0.30):
        super().__init__()
        
        self.conv1 = SAGEConv(node_in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        
        self.edge_mlp = nn.Sequential(
            nn.Linear(hidden_dim * 2 + edge_in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1)
        )
        
        self.dropout = dropout
    
    def encode_nodes(self, x, edge_index):
        h = self.conv1(x, edge_index)
        h = F.relu(h)
        h = F.dropout(h, p=self.dropout, training=self.training)
        
        h = self.conv2(h, edge_index)
        h = F.relu(h)
        return h
    
    def decode_edges(self, node_emb, original_edge_index, edge_attr):
        src = original_edge_index[0]
        dst = original_edge_index[1]
        
        h_src = node_emb[src]
        h_dst = node_emb[dst]
        
        edge_input = torch.cat([h_src, h_dst, edge_attr], dim=1)
        
        # Output is log(lambda), where lambda is expected accident count.
        log_rate = self.edge_mlp(edge_input).squeeze(-1)
        return log_rate
    
    def forward(self, x, edge_index, original_edge_index, edge_attr):
        node_emb = self.encode_nodes(x, edge_index)
        log_rate = self.decode_edges(node_emb, original_edge_index, edge_attr)
        return log_rate


reg_model = EdgeGraphSAGECount(
    node_in_dim=data.x.shape[1],
    edge_in_dim=data.edge_attr.shape[1],
    hidden_dim=64,
    dropout=0.30
)

Y_count_train = Y_count[train_month_idx]

train_mean_count = float(Y_count_train.mean().item())

# Initialize final bias near the average training count.
# This helps avoid unstable starting predictions.
with torch.no_grad():
    reg_model.edge_mlp[-1].bias.fill_(np.log(train_mean_count + 1e-6))

reg_criterion = nn.PoissonNLLLoss(log_input=True, reduction='mean')
reg_optimizer = torch.optim.Adam(reg_model.parameters(), lr=0.001, weight_decay=1e-4)

print('Count regression model initialized.')
print('Training mean count:', train_mean_count)
print(reg_model)
)")

## Count regression model initialized.
## Training mean count: 0.031222039833664894
## EdgeGraphSAGECount(
##   (conv1): SAGEConv(3, 64, aggr=mean)
##   (conv2): SAGEConv(64, 64, aggr=mean)
##   (edge_mlp): Sequential(
##     (0): Linear(in_features=144, out_features=64, bias=True)
##     (1): ReLU()
##     (2): Dropout(p=0.3, inplace=False)
##     (3): Linear(in_features=64, out_features=32, bias=True)
##     (4): ReLU()
##     (5): Dropout(p=0.3, inplace=False)
##     (6): Linear(in_features=32, out_features=1, bias=True)
##   )
## )

#-------------------------------------------------------------------------------
# REGRESSION PART 3: Evaluation helpers

py_run_string(r"(
def collect_count_predictions(month_indices):
    reg_model.eval()
    
    with torch.no_grad():
        log_rate = reg_model(
            data.x,
            data.edge_index,
            data.original_edge_index,
            data.edge_attr
        )
        
        # Convert log-rate to expected count.
        pred_count = torch.exp(log_rate)
        
        y_true = Y_count[month_indices].reshape(-1).cpu().numpy()
        y_pred = pred_count.repeat(len(month_indices)).cpu().numpy()
    
    return y_true, y_pred


def evaluate_count_split(month_indices, split_name='split'):
    y_true, y_pred = collect_count_predictions(month_indices)
    
    error = y_pred - y_true
    abs_error = np.abs(error)
    
    positive_mask = y_true > 0
    
    mae = float(np.mean(abs_error))
    rmse = float(np.sqrt(np.mean(error ** 2)))
    
    zero_baseline_mae = float(np.mean(np.abs(y_true)))
    
    if positive_mask.sum() > 0:
        mae_positive = float(np.mean(abs_error[positive_mask]))
        mean_pred_positive_actual = float(np.mean(y_pred[positive_mask]))
        mean_actual_positive = float(np.mean(y_true[positive_mask]))
    else:
        mae_positive = np.nan
        mean_pred_positive_actual = np.nan
        mean_actual_positive = np.nan
    
    metrics = {
        'split': split_name,
        'mae': mae,
        'rmse': rmse,
        'zero_baseline_mae': zero_baseline_mae,
        'mae_minus_zero_baseline': mae - zero_baseline_mae,
        'mean_actual_count': float(np.mean(y_true)),
        'mean_predicted_count': float(np.mean(y_pred)),
        'total_actual_count': float(np.sum(y_true)),
        'total_predicted_count': float(np.sum(y_pred)),
        'positive_share': float(np.mean(y_true > 0)),
        'mae_positive_actual_only': mae_positive,
        'mean_actual_count_positive_only': mean_actual_positive,
        'mean_predicted_count_on_positive_actual': mean_pred_positive_actual,
        'max_actual_count': float(np.max(y_true)),
        'max_predicted_count': float(np.max(y_pred))
    }
    
    return metrics


print('Regression evaluation helpers are ready.')

pre_train_reg_valid = evaluate_count_split(valid_month_idx, split_name='valid_pre_training')

print('\\nPre-training validation regression metrics:')
for key, value in pre_train_reg_valid.items():
    print(f'{key}: {value}')
)")

## Regression evaluation helpers are ready.
## \nPre-training validation regression metrics:
## split: valid_pre_training
## mae: 0.06238536536693573
## rmse: 0.1985711306333542
## zero_baseline_mae: 0.034088876098394394
## mae_minus_zero_baseline: 0.028296489268541336
## mean_actual_count: 0.034088876098394394
## mean_predicted_count: 0.030194653198122978
## total_actual_count: 2286.0
## total_predicted_count: 2024.8533935546875
## positive_share: 0.031225767968983
## mae_positive_actual_only: 1.0612964630126953
## mean_actual_count_positive_only: 1.0916905403137207
## mean_predicted_count_on_positive_actual: 0.03039413131773472
## max_actual_count: 5.0
## max_predicted_count: 0.03197809308767319

And finally, the training loop and model evaluation.

# Regression table
regression_table <- bind_rows(
  as_tibble(py_to_r(py$final_reg_train)),
  as_tibble(py_to_r(py$final_reg_valid)),
  as_tibble(py_to_r(py$final_reg_test))
) %>%
  transmute(
    Split = recode(split, train = "Train", valid = "Validation", test = "Test"),
    MAE = round(mae, 3),
    RMSE = round(rmse, 3),
    `Zero baseline MAE` = round(zero_baseline_mae, 3),
    `MAE - zero baseline` = round(mae_minus_zero_baseline, 3),
    `Mean actual count` = round(mean_actual_count, 3),
    `Mean predicted count` = round(mean_predicted_count, 3),
    `Total actual count` = round(total_actual_count, 0),
    `Total predicted count` = round(total_predicted_count, 0),
    `MAE on positives` = round(mae_positive_actual_only, 3)
  )

kable(
  regression_table,
  caption = "GraphSAGE count regression metrics",
  align = "c"
)

GraphSAGE count regression metrics
Split	MAE	RMSE	Zero baseline MAE	MAE - zero baseline	Mean actual count	Mean predicted count	Total actual count	Total predicted count	MAE on positives
Train	0.057	0.182	0.031	0.026	0.031	0.031	8375	8186	0.992
Validation	0.060	0.193	0.034	0.026	0.034	0.031	2286	2046	1.013
Test	0.060	0.194	0.035	0.025	0.035	0.031	2321	2046	1.006

The count regression model gives very similar results across the training, validation, and test sets, so there is no clear sign that the model behaves much worse on later months. The average errors remain low, but this should be interpreted carefully because most edge-month observations have zero accidents. This is visible in the comparison with the zero baseline: the model has a higher MAE than simply predicting zero everywhere, which means that the regression task is strongly affected by the sparsity of accident counts. The model predicts the overall mean count quite closely, but it underestimates the total number of accidents in the validation and test periods. The weakest point is visible for road segments where accidents actually occurred: the error on positive observations remains high, suggesting that the model struggles to assign sufficiently large predicted counts to accident-positive edges. Overall, the count regression model captures the general low-frequency structure of the data, although it is less useful for estimating the actual number of accidents on specific risky segments.

reg_history <- as.data.frame(py_to_r(py$reg_history_df))


reg_loss_long <- reg_history %>%
  select(epoch, train_loss, valid_loss) %>%
  pivot_longer(
    cols = c(train_loss, valid_loss),
    names_to = "split",
    values_to = "loss"
  ) %>%
  mutate(
    split = recode(
      split,
      train_loss = "Train",
      valid_loss = "Validation"
    )
  )

ggplot(reg_loss_long, aes(x = epoch, y = loss, color = split)) +
  geom_line(linewidth = 1.1) +
  labs(
    title = "Count GraphSAGE Training History",
    subtitle = "Train and validation loss across epochs",
    x = "Epoch",
    y = "Poisson negative log-likelihood loss",
    color = "Split"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

In case of regression, the model behaves quite the same as classification in terms of training. After 50 epochs, the loss declines significantly and stays on stable level.

4 Conclusion

This project showed that road accident modelling becomes more meaningful when accidents are represented on the road network rather than treated only as isolated point events. After cleaning the road data, assigning accidents to road segments, and building a monthly edge-level panel, the Warsaw road network could be used as a spatial structure for accident-risk modelling. The graph-based approach was therefore useful mainly as a way of translating accident records into a road-segment risk map. This is the strongest part of the project: it connects accident locations with the geometry, attributes, and topology of the actual transport network.

The binary GraphSAGE model produced the most useful results. Its performance was stable across training, validation, and test periods, which suggests that the model captured some persistent spatial patterns rather than fitting only one part of the data. The model had moderate ranking ability and relatively high recall, so it was able to identify many road segments that later appeared as accident-positive. At the same time, precision remained low, meaning that many segments flagged as risky did not actually have accidents in a given month. This result is expected in such a sparse problem, where accident-positive edge-months are rare. For this reason, the binary model is best understood as a broad screening tool: it can highlight areas of elevated risk, while the predicted positive class should be interpreted cautiously.

The count regression model was less convincing. Although its errors were stable across time periods and the average predicted count was close to the overall low accident frequency, the model did not improve over a simple zero baseline in terms of MAE. This means that, under the current data structure, predicting exact monthly accident counts is too difficult. The model also underestimated total accident counts in later periods and performed poorly on observations where accidents actually occurred. These results suggest that the count task requires richer temporal and exposure-related information, such as traffic volume, weather, or lagged accident history.

Overall, the project should be treated as a successful spatial risk-mapping exercise rather than a complete accident forecasting system. The model demonstrates that road-network structure and basic road attributes contain useful information for identifying persistently risky segments in Warsaw. However, the current setup is based mostly on static spatial features, so it cannot fully explain month-to-month fluctuations in accidents. Future improvements should focus on adding dynamic predictors, especially traffic intensity, weather, seasonal variables, and accident lags. A comparison with simpler non-graph models would also help determine how much value is added specifically by the graph structure. Even with these limitations, the project provides a coherent spatial machine learning workflow and shows that graph-based modelling is a reasonable direction for urban road-safety analysis.

5 References

Gao, X., Jiang, X., Haworth, J., Zhuang, D., Wang, S., Chen, H., & Law, S. (2024). Uncertainty-aware probabilistic graph neural networks for road-level traffic crash prediction. Accident Analysis & Prevention, 208, 107801. https://doi.org/10.1016/j.aap.2024.107801

Huang, B., & Hooi, B. (2022). Traffic accident prediction using graph neural networks: New datasets and the TRAVEL model. Workshop on Graph Learning Benchmarks, The Web Conference 2022. https://graph-learning-benchmarks.github.io/assets/papers/glb2022/Traffic_Accident_Prediction_using_Graph_Neural_Networks_New_Datasets_and_the_TRAVEL_Model.pdf

Huang, B., Hooi, B., & Shu, K. (2023). TAP: A comprehensive data repository for traffic accident prediction in road networks. Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems. https://doi.org/10.1145/3589132.3625655

Nippani, A., Li, D., Ju, H., Koutsopoulos, H. N., & Zhang, H. R. (2023). Graph neural networks for road safety modeling: Datasets and evaluations for accident analysis. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2311.00164

Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30.

Predicting traffic accidents with GraphSAGE

Kostiantyn Okhrimenko

2026-06-03