Transferable Bike Network Planning: Reinforcement Learning for
Pareto-Optimal Lane Reallocation
I’ve always hated travelling to and from from airport to my place in
this mumbai traffic and it would for sure beat my favourite playlist.
Since I don’t engage in any deleterious habits, my traffic survival kit
with all amazing snacks for the ride and petrol bill collectively a
decent down payment on a house in Andheri. A poor joke, I know. And of
course there’s always some new flyover or metro pillar coming up
everywhere yet not a single sign of traffic easing. That said, I’m
fairly certain most people would be furrowing their brows in total,
grudging recognition. Thousands of dollars are spent yearly on expanding
car lanes throughout the cities worldwide. Yet cities everywhere are
more congested than ever.
Recent events such as Kuala Lumpur’s “Bangun KL” campaign tried to
solve gridlock by asking people to start their day earlier, offering
discounted coffee as an incentive. However, most commuters’ schedules
are already fixed which includes parents dropping children at school by
6:30 am, teachers, civil servants, and shift workers who cannot shift
their hours. Meanwhile a study by Institute for Transportation and
Development Policy India reveals Mumbai as the most congested city in
the world. The vehicle numbers have tripled in 15 years while 90% of
Mumbaikars are stuck in peak‑hour traffic where 55% of traffic crawls
below 20 km/h. The city loses ₹3,600 crores ($485 million) annually in
productivity and fuel loss, 40 to 50% people lose personal time and 750
deaths per year are attributed to pollution[1]. A striking finding from
the study is that just 19% of people (travelling by private cars,
motorised two-wheelers, taxis and auto rickshaw) consume 80% of road
space meaning a small minority of private vehicles create the majority
of congestion. The usual “fixes” (widening roads or shifting schedules)
ignore induced demand.

Induced demand or induced traffic states that expanding highway
capacity make vehicle use higher which in the long run do not provide
lasting traffic congestion relief [2]. An example is the Katy Freeway in
Texas which connects Houston’s suburbs to its downtown. Over the past
two decades, this freeway has undergone several expansions totaling
nearly $3 billion, culminating in a massive 23 lanes. While the road’s
vehicle capacity has certainly grown the travel times have only
worsened. As a result, many consider the project a complete waste of
public funds since its original aim was to reduce congestion. On the
Katy Freeway, hundreds of homes and businesses were torn down to
accommodate the widening project, forcing people, jobs, and commerce to
relocate farther from the city center. That displacement ultimately
increased the number of freeway users. These changes fueled additional
demand: people now live farther from their workplaces and must commute
in. Walking is impossible due to the distance, and public transit is
either unreliable or nonexistent. Consequently, residents have no choice
but to drive using a freeway they never needed before [3].
Instead of endlessly widening roads for cars which only induces more
demand, one of the best ways cities can solve congestion is by investing
in a mix of transport modes i.e. trains, buses, walking, and cycling.
Mumbai, for example, is building 101.43 km (63.03 mi) of metro network
on top of its existing suburban train netwrok which spreads over 450
kilometres. But metros and trains alone cannot solve the last‑mile
problem where people still need a reliable way to reach stations from
their homes or offices, often short distances that are too far to walk
but too close to justify driving. Cycling fills that gap perfectly as it
is space‑efficient. A bike lane of the same width can move 7,000–8,000
people per hour while a single car lane can move roughly 1,000–2,000
people. Bicycles also take up far less parking and road space, reduce
pollution, and improve public health. This project investigates whether
reinforcement learning (RL) can overcome these limitations. Traditional
optimization methods, such as linear programming, can answer these
questions but are computationally expensive and cannot transfer
solutions from one city to another. By training RL agents on urban
street networks, we aim to produce bike lane allocations that are not
only efficient but also transferable across districts. In the following
section, we describe the training process, the environment setup, the
state and action spaces, and the reward function designed to balance
bike and car travel times.
Data Used
The data was acquired from
The raw data consists of Street network graph which includes
Nodes (intersections), edges (streets) with attributes (length,
gradient, speed limit, number of lanes). This data is obtained from
OpenStreetMap (OSM).
Raw data also includes travel demand data which is represented as
Origin-Destination (OD) matrix. This data was captured from bike sharing
data, census and travel surveys.
Lastly the raw data is enriched with factors such as elevation
raster, parking data and transit routes which helps in improving travel
time calculations or constraints.
Using snman library the raw OSM data is converted into an undirected
graph. Each edge stores: length(d(e)), gradient(δ(e)), speed limit
(θ(e)), capacity in lane count or width (Λ(e)). The undirected graph is
then converted into an directed auxiliary graph where each undirected
edge is replaced with two reciprocal directed edges and the gradient is
reversed for the opposite direction. Next travel times are computed for
each directed edge. From real life travel demand data the nodes which
have higher demand are selected and optional auxiliary OD pairs (chain
of all nodes) are added to enforce strong connectivity of the car
network, these have zero weight in the objective. There are two district
networks which have been choosen , one of which is Affoltern(Zurich) and
another one is Brichplatz (Zurich).
library(sf) # for reading spatial data (GeoPackage)
library(tidyverse) # for data manipulation and plotting
library(igraph) # for network analysis (optional)
library(sf)
library(dplyr)
library(purrr)
library(patchwork)
library(ggplot2)
library(gt)
library(scales)
library(plotly)
library(tidyr)
library(hrbrthemes)
instances <- c("affoltern", "birchplatz")
summary_table <- map_dfr(instances, function(inst) {
edges <- st_read(paste0("data/", inst, "/edges_all_attributes.gpkg"), quiet = TRUE)
nodes <- st_read(paste0("data/", inst, "/nodes_all_attributes.gpkg"), quiet = TRUE)
od <- read.csv(paste0("data/", inst, "/od_matrix.csv"))
tibble(
Instance = inst,
Nodes = nrow(nodes),
Edges = nrow(edges),
Total_Length_km = sum(edges$length, na.rm = TRUE) / 1000,
Edges_with_BikeLane = sum(grepl("P", edges$ln_desc, fixed = TRUE)),
Pct_BikeLane = round(mean(grepl("P", edges$ln_desc, fixed = TRUE)) * 100, 1),
OD_Pairs = nrow(od),
Total_Trips = sum(od$trips, na.rm = TRUE),
Avg_Trips_per_OD = round(mean(od$trips), 1)
)
})
print(summary_table)
snman[4] processing outputs three files, two gpkg files, one of which
has nodes and other edges. Last file is the od matrix. Affoltern network
has 231 node while birchplatz has 314. Similarly birchplatz (431) has
more edges than affoltern network (290). We can anticipate a longer
training time for the birchplatz network. Both of the initial networks
has 1 protected bike lane.
instance_affoltern <- "affoltern"
edges_affoltern <- st_read(paste0("data/", instance_affoltern, "/edges_all_attributes.gpkg"))
Reading layer `street_graph_edges' from data source `C:\Users\user\Documents\bike_lane_research\outputs_bikr_rl\data\affoltern\edges_all_attributes.gpkg' using driver `GPKG'
Simple feature collection with 290 features and 62 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 8.491903 ymin: 47.41154 xmax: 8.519319 ymax: 47.42838
Geodetic CRS: WGS 84
nodes_affoltern <- st_read(paste0("data/", instance_affoltern, "/nodes_all_attributes.gpkg"))
Reading layer `street_graph_nodes' from data source `C:\Users\user\Documents\bike_lane_research\outputs_bikr_rl\data\affoltern\nodes_all_attributes.gpkg' using driver `GPKG'
Simple feature collection with 231 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 8.491903 ymin: 47.41154 xmax: 8.519112 ymax: 47.42838
Geodetic CRS: WGS 84
od_affoltern <- read_csv(paste0("data/", instance_affoltern, "/od_matrix.csv"))
instance_birchplatz <- "birchplatz"
edges_birchplatz <- st_read(paste0("data/", instance_birchplatz , "/edges_all_attributes.gpkg"))
Reading layer `street_graph_edges' from data source `C:\Users\user\Documents\bike_lane_research\outputs_bikr_rl\data\birchplatz\edges_all_attributes.gpkg' using driver `GPKG'
Simple feature collection with 431 features and 64 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 8.518686 ymin: 47.39858 xmax: 8.550276 ymax: 47.4148
Geodetic CRS: WGS 84
nodes_birchplatz <- st_read(paste0("data/", instance_birchplatz , "/nodes_all_attributes.gpkg"))
Reading layer `street_graph_nodes' from data source `C:\Users\user\Documents\bike_lane_research\outputs_bikr_rl\data\birchplatz\nodes_all_attributes.gpkg' using driver `GPKG'
Simple feature collection with 314 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 8.518686 ymin: 47.39879 xmax: 8.549898 ymax: 47.4143
Geodetic CRS: WGS 84
od_birchplatz <- read_csv(paste0("data/", instance_birchplatz , "/od_matrix.csv"))
# Basic info
#glimpse(edges_birchplatz)
# Edge length distribution Affoltern
p1 <- ggplot(edges_affoltern, aes(x = length)) +
geom_histogram(bins = 50, fill = "darkblue") +
labs(title = "Edge length distribution Affoltern", x = "Length (meters)")
# Edge length distribution Birchplatz
p2 <- ggplot(edges_birchplatz, aes(x = length)) +
geom_histogram(bins = 50, fill = "steelblue") +
labs(title = "Edge length distribution Birchplatz", x = "Length (meters)")
patchwork_1 <- (p1 + p2)
patchwork_1 + plot_annotation(
title = 'Figure 1: Edge distribution across two quaters in zurich-Switzerland',
)

# Gradient (if column exists)
if("grade" %in% colnames(edges_affoltern)) {
p3 <- ggplot(edges_affoltern, aes(x = grade)) +
geom_histogram(bins = 50, fill = "lightgreen") +
labs(title = "Grade distribution Affoltern", x = "Grade (%)")
}
# Gradient (if column exists)
if("grade" %in% colnames(edges_birchplatz)) {
p4 <- ggplot(edges_birchplatz, aes(x = grade)) +
geom_histogram(bins = 50, fill = "darkgreen") +
labs(title = "Grade distribution BirchPlatz", x = "Grade (%)")
}
patchwork_2 <- (p3 + p4)
patchwork_2 + plot_annotation(
title = 'Figure 2: Gradient distribution of streets across two quaters in zurich-Switzerland',
)

NA
NA
p1 <- ggplot() +
geom_sf(data = edges_affoltern, color = "grey", linewidth = 0.3) +
geom_sf(data = nodes_affoltern, size = 0.5, color = "red") +
labs(title = paste("Figure 3: Street Network of", instance_affoltern)) +
theme_void()
p2 <- ggplot() +
geom_sf(data = edges_birchplatz, color = "grey", linewidth = 0.3) +
geom_sf(data = nodes_birchplatz, size = 0.5, color = "purple") +
labs(title = paste("Figure 4: Street Network of", instance_birchplatz)) +
theme_void()
p1+p2

Baseline Method
The baseline method used in this project is a linear programming
solution developed by [5] which is a mathematical recipe that balances
car and bike travel times. For every street segment, the optimization
decides how much space to give to cars and how much to bikes, with bike
lanes needing only half the width of car lanes. It considers where
people actually want to travel (the origin‑destination pairs) and
ensures that both car and bike networks stay connected. The solver first
finds a “fractional” solution (e.g 0.7 of a lane for bikes), a rounding
algorithm then picks edges with the highest fractional bike capacity and
fixes them as whole bike lanes, re‑solving the LP every k steps. This
makes sure that cars still have a way to get everywhere. By repeating
the whole process with different weights on car vs. bike importance, the
planner obtains a full set of Pareto‑optimal networks, each showing how
much bike improvement costs in terms of car travel time.
PPO method
The Reinforcement learning agent environment in this project is named
as BikeNetworkEnv and it defines state space, action space, reward
function, and transition dynamics.
State Space
The state is a flattened vector of features for every lane edge in
the graph. For each directed lane edge the agent sees three features.
The features name with their description and value type are given in the
below table-:
Table 3: Features of the State Vector
| distance |
Length of the edge (km) |
float |
| gradient |
Slope in percent (positive uphill, negative
downhill) |
float |
| is_bike |
1 if the lane is already a bike lane (lanetype ==
“P”), else 0 |
bool |
The full observation is a 1D array of length (3 ×
number_of_directed_lane_edges). So if there are 50 lane edges the vector
has 150 features. This state representation is used because the agent
need to verify the physical properties of each edge to determine whether
it will be beneficial to convert it into bike lane. The is_bike feature
is there so that the agent avoids redundant actions.
Action Space
Each action corresponds to selecting one specific directed lane edge
and attempting to convert it from a car lane (M>) to a bike lane
(P).
Reward
The reward is designed to encourage the agent to reduce bike travel
time while keeping car travel time low, and to avoid disconnecting the
car network.
When the RL agent takes an action, below points are considered for
the reward.
If 1. The edge is already a bike lane then -5 penalty is received.
Else 1. The edge is first temporarily converted 2. Check if the
remaining car network is strongly connected. If 1. not connected then
the conversion is rejected and -20 penalty is given. Else 1. If
connected then the conversion is accepted, the new bike and car travel
times are computed and the reward equation (i) is calculated.

Reward = 100 / (bike_time + ε) - 0.7 × car_time (i)
(100 / bike_time ) gives high reward when bike time is low. (-0.7 *
car_time) penalizes increases in car travel. The values 100 and 0.7 were
chosen empirically and can be further tuned using the hyper parameter
tuning.
Initial State and Episode Termination
At the start of each episode, the graph reset to the initial car‑only
configuration (all lanes M>). The agent then sequentially converts
edges. An episode ends after a fixed number of steps (max_steps).
Model Training
The agent is trained as follows-:
- The lane graph and the OD matrix are loaded for specific
instances(cities/districts).
- The BikeNetworkEnv is loaded with lane graph and OD matrix.
- PPO algorithm is used for training. PPO is used because it works
well with continuous as well as discrete action spaces. Furthermore, It
is less sensitive to hyperparameters than vanilla policy gradient
methods.
- The model is trained for 10000 timesteps.
- the trained model is saved to a zip file.
How does PPO work in this case?
The agent runs many episodes and in each episode it takes multiple
steps defined by the hyperparameter (max_steps). At each step, it:
Observes the current state (edge distances, gradients, bike
flags).
Samples an action according to the policy’s probabilities (or
chooses the action with highest probability during inference).
Executes the action, receives a reward, and moves to the next
state.
Stores (state, action, reward, next_state, done) in a
buffer.
After the agent collects a batch of experiences, PPO updates the
policy by:
Computing the advantage which determines how much better an
action was than the average expect turn.
Clipping the policy update to avoid taking too large a
step.
The objective is to increase the probability of actnthat led to
higher rewards but only within a trust region.
Inference After Training
Once trained, the agent is used to rank edges inead of solving an LP.
The current graph (with some edges already converted to bike lanes) is
used to build the observation vector in exactly the same way as in the
environment. Since, the RL agent is trained on a specific graph ( e.g
affoltern) which has a fixed number of edges. The graphs determines the
observation space. When later the trained agent is loaded to a different
city or district, the new graph may have a different number of lane
edges. Thus, the observation vector length would differ from what the
neural network expects. To overcome, this mismatch a padding/truncation
function is added.
If the new graph has fewer edges than the training graph we pad the
observation vector with zeros to reach the expected length. The network
sees zeros for those “missing” edges, which essentially means they have
minimal influence.
If the new graph has more edges than the training graph we truncate
or ignore the extra edges. This means the agent cannot act on those
extra edges at all, this is one of the severe limitation of the model
which can be taken into account for future work.
Despite the limitations the RL agent can be trained once on a
representative graph which can then be applied in inference on any other
district or part of the city. This is a huge practical advantage if many
places need to be optimised for bike lanes.
Two PPO agents were trained, one on Affoltern map and another on
Birchplatz network. The training insights are as follows-:



These insights were created using wandb, the original wandb report includes
interactive graph for further data.
S2V_DQN
The S2V‑DQN agent was inspired by the work of Dai et al. (NIPS 2017),
who combined graph embedding (Structure2Vec) with deep Q‑learning to
learn greedy heuristics for combinatorial optimization problems over
graphs.
State Representation
In s2v_dqn agent the state is a line graph of the original lane
graph. State consists of
Nodes – Each node (which corresponds to a directed lane edge in
original lane graph). each node has 3 features including distance,
gradient, is_bike. 2 . An adjacency matrix that tells the neural network
how edges are connected.
Valid mask – binary mask indicating which edges can be legally
converted without disconnecting the car network.
Training
The S2V‑DQN agent represents the current state of the lane network as
a line graph where each node corresponds to a directed lane edge and
edges in this line graph connect two lane edges if they share a common
intersection. Each node is associated with a feature vector of length
three including the edge’s length, its gradient, and a binary indicator
of whether it is already a bike lane. The agent then applies a
message‑passing neural network (Structure2Vec) that iteratively updates
node embeddings by aggregating information from neighbouring nodes.
After n iterations, each node’s embedding encodes not only its own
properties but also the structure and features of the local
neighbourhood. A global pooling operation (mean) over all node
embeddings produces a single vector that summarises the entire network
state.
For each lane edge, the agent concatenates its individual embedding
with the global graph summary and passes the result through a small
multi‑layer perceptron (two hidden layers) to compute a scalar Q‑value.
This Q‑value estimates the expected long‑term reward if that edge is
converted from a car lane to a bike lane given the current network
state. Edges that are critical for car connectivity like bridges or
arterial roads that if removed would disconnect the graph, receive low
Q‑values while edges whose conversion would significantly reduce bike
travel time without harming car travel time receive high Q‑values. The
message‑passing mechanism allows the agent to reason about network‑wide
consequences. During inference Q‑value is used to rank edges during
inference a higher Q means that the lane is a better candidate for
conversion.
Unlike on policy method PPO, s2v_dqn is a Off-policy method. It
stores past experiences in a replay buffer and samples randomly to
update the Q‑network.Although, it is generally more sample‑efficient
than on‑policy methods, it can be unstable if hyperparameters are not
tuned. Handling variable graph representation.
We used padding/truncation in PPO agent to make it accomodable for a
variety of city networks. However, this is not required by s2v_dqn as
the graph representation is inherently variable‑size which means that
the model processes a set of nodes and edges via message passing, which
works for any number of edges. PyTorch Geometric’s Batch class collates
multiple graphs of different sizes into a single batch. Despite its
advantages, one main limitation of s2v_dqn is that it requires careful
reward scaling and target network updates.
An S2V_DQN agent was trained on Birchplatz network. The training
insights are as follows-:


Above insights were created using wandb, the original wandb report includes
interactive graph for further data.
Results
library(dplyr)
library(purrr)
library(tidyr)
# Root directory (where all experiment folders live)
root_dir <- "C:/Users/user/Documents/bike_lane_research/outputs_bikr_rl/s2v_dqn"
# Find all CSV files matching the pattern
file_list <- list.files(
path = root_dir,
pattern = "real_pareto_optimize_od_1.0_50\\.csv$",
recursive = TRUE,
full.names = TRUE
)
# Function to extract both experiment and location
read_and_label <- function(path) {
df <- read.csv(path)
# Get relative path from root_dir
rel_path <- gsub(normalizePath(root_dir, winslash = "/"), "", normalizePath(path, winslash = "/"))
# Example: "/bike_lane_agent_affoltern_20260405_1523/affoltern/real_pareto_optimize_od_1.0_50.csv"
# Split by "/"
parts <- strsplit(rel_path, "/")[[1]]
# parts[1] is empty (starts with /), parts[2] = experiment folder, parts[3] = location (affoltern/birchplatz)
experiment <- parts[2]
location <- parts[3]
df$experiment <- experiment
df$location <- location
return(df)
}
# Combine all data
combined_df <- map_df(file_list, read_and_label)
# Reshape to long format (bike/car categories)
df_long <- combined_df %>%
pivot_longer(
cols = c(bike_edges, car_edges, bike_time, car_time),
names_to = c("mode", ".value"),
names_pattern = "(bike|car)_(.*)"
)
# Now columns: ..., mode, edges, time, experiment, location
df <- df_long
df <- df %>%
group_by(bike_edges_added, experiment, location, mode) %>%
mutate(duplicate_id = row_number()) %>%
ungroup()
df_wide <- df %>%
pivot_wider(
id_cols = c(bike_edges_added, experiment, location, duplicate_id),
names_from = mode,
values_from = time
) %>%
rename(bike_time = bike, car_time = car) %>%
select(-duplicate_id) # remove the helper column
Comparison of Final Perceived Bike and Car Travel
Times
library(gt)
summary_table <- df_wide %>%
arrange(location, experiment, bike_edges_added) %>% # ensure correct order
group_by(location, experiment) %>%
summarise(
init_bike_time = first(bike_time),
init_car_time = first(car_time),
final_bike_time = last(bike_time),
final_car_time = last(car_time),
num_points = n()
) %>%
ungroup() %>%
# Round numeric columns for readability
mutate(across(where(is.numeric), ~ round(.x, 3)))
# Create gt table
gt(summary_table) %>%
tab_header(
title = "Table 4: Initial and final travel times and number of Pareto points",
) %>%
cols_label(
location = "Location",
experiment = "Algorithm",
init_bike_time = "Initial bike time",
init_car_time = "Initial car time",
final_bike_time = "Final bike time (seconds)",
final_car_time = "Final car time",
num_points = "Bike Edges"
) %>%
fmt_number(columns = c(init_bike_time, init_car_time, final_bike_time, final_car_time),
decimals = 2) %>%
tab_options(
table.font.size = 12,
heading.title.font.size = 14,
heading.subtitle.font.size = 12
)
| Table 4: Initial and final travel times and number of Pareto points |
| Location |
Algorithm |
Initial bike time |
Initial car time |
Final bike time (seconds) |
Final car time |
Bike Edges |
| affoltern |
Linear_Programming |
7.19 |
1.83 |
3.87 |
2.60 |
130 |
| affoltern |
PPO_Trained_On_Affoltern |
7.19 |
1.83 |
4.49 |
3.84 |
83 |
| affoltern |
PPO_Trained_On_Birchplatz |
7.19 |
1.83 |
4.50 |
3.62 |
87 |
| affoltern |
S2V_DQN_Trained_On_Birchplatz |
7.19 |
1.83 |
4.44 |
3.83 |
86 |
| birchplatz |
Linear_Programming |
6.27 |
1.58 |
3.22 |
2.51 |
267 |
| birchplatz |
PPO_Trained_On_Affoltern |
6.27 |
1.58 |
3.92 |
3.13 |
154 |
| birchplatz |
PPO_Trained_On_Birchplatz |
6.27 |
1.58 |
3.79 |
3.41 |
176 |
| birchplatz |
S2V_DQN_Trained_On_Birchplatz |
6.27 |
1.58 |
3.72 |
3.27 |
174 |
Table 4 reveals a consistent trade‑off between optimality and
resource efficiency. Linear Programming (LP) achieves the lowest final
bike travel time in both districts (3.87 s in Affoltern, 3.22 s in
Birchplatz) but it does so by reallocating substantially more bike lanes
(130 and 267, respectively). In contrast, all RL agents (PPO and
S2V‑DQN) produce slightly higher final bike times (3.72–4.50 s) while
using roughly 40% fewer bike lanes in Affoltern (83–87 vs. 130) and
35–40% fewer in Birchplatz (154–176 vs. 267).
# Filter for bike mode and affoltern location
df_bike_affoltern <- df %>%
filter(mode == "bike", location == "affoltern")
p <- df_bike_affoltern %>%
ggplot(aes(x = edges, y = time, fill = experiment)) +
geom_area(alpha = 0.5) +
facet_wrap(~ experiment) + # one subplot per experiment
labs(x = "Edges", y = "Time", title = "Figure 5: Change in Perceived Bike Time on Affoltern as The Algorithms Progress Add Bike Lanes Over Time") +
theme_ipsum()
# Make interactive
p1 <- ggplotly(p)
p1
NA
NA
Plot shows that
# Filter for bike mode and birchplatz location
df_bike_affoltern <- df %>%
filter(mode == "bike", location == "birchplatz")
p <- df_bike_affoltern %>%
ggplot(aes(x = edges, y = time, fill = experiment)) +
geom_area(alpha = 0.5) +
facet_wrap(~ experiment) + # one subplot per experiment
labs(x = "Edges", y = "Time", title = "Figure 6: Change in Perceived Bike Time on Birchplatz as The Algorithms Progress Add Bike Lanes Over Time") +
theme_ipsum()
# Make interactive
p2 <- ggplotly(p)
p2
NA
The table shows that Linear Programming (LP) not only achieves the
lowest final bike travel times but also limits the increase in car
travel times more effectively than the RL agents. Starting from initial
car times of 1.83 s (Affoltern) and 1.58 s (Birchplatz) LP increases car
times to 2.60 s (+42%) and 2.51 s (+59%) respectively. In contrast, all
RL agents result in substantially higher car times. For example for
Affoltern, the best RL agent (PPO trained on Affoltern) final car time
reaches 3.84 s (+110%) and for Birchplatz, S2V‑DQN reaches 3.27 s
(+107%) while PPO trained on Birchplatz reaches 3.41 s (+116%). This is
despite RL agents using far fewer bike lanes (e.g., 83 vs. 130 in
Affoltern). The finding suggests that LP is more car‑aware where it
carefully selects lanes that improve bikeability while minimizing
disruption to car traffic. RL agents, even when trained with a reward
that includes car travel time appear to allocate bike lanes in a way
that inadvertently increases car travel time more than LP does.
One way to improve the RL agent can be increasing the training budget
and extensive hyperparameter tuning as they had relatively short
training budgets (e.g., 635–2489 seconds). Another, way is more thorough
reward engineering.
# Filter for bike mode and affoltern location
df_bike_affoltern <- df %>%
filter(mode == "car", location == "affoltern")
p <- df_bike_affoltern %>%
ggplot(aes(x = edges, y = time, fill = experiment)) +
geom_area(alpha = 0.5) +
facet_wrap(~ experiment) + # one subplot per experiment
labs(x = "Edges", y = "Time", title = "Figure 7: Change in Perceived Car Time on Affoltern as The Algorithms Progress Add Bike Lanes Over Time") +
theme_ipsum()
# Make interactive
p1 <- ggplotly(p)
p1
# Filter for bike mode and birchplatz location
df_bike_affoltern <- df %>%
filter(mode == "car", location == "birchplatz")
p <- df_bike_affoltern %>%
ggplot(aes(x = edges, y = time, fill = experiment)) +
geom_area(alpha = 0.5) +
facet_wrap(~ experiment) + # one subplot per experiment
labs(x = "Edges", y = "Time", title = "Figure 8: Change in Perceived Car Time on Birchplatz as The Algorithms Add Bike Lanes Over Time") +
theme_ipsum()
# Make interactive
p2 <- ggplotly(p)
p2
NA
Comparison of Training and Pareto Computation Time
Table 5 : Summary of training time (seconds) and Pareto computation
time (seconds)
| Affoltern_PPO |
635 |
10.85 |
35.50 |
| Birchplatz_PPO |
2489 |
13.37 |
37.12 |
| LP_Affoltern |
|
117.63 |
|
| LP_Birchplatz |
|
|
3911.71 |
| Birchplatz_s2vdqn |
11421 |
12.85 |
73.85 |
The results in the table above highlight a clear trade‑off between
optimality and transferability. The Linear Programming (LP) approach
achieves the lowest perceived bike travel time in both districts (3.87 s
for Affoltern, 3.22 s for Birchplatz), improving upon the initial times
by about 46% and 49% respectively. However, LP is not transferable which
is why each new district requires a full, time‑intensive re‑optimization
from scratch (as reflected in the table). In contrast, reinforcement
learning agents (PPO, S2V‑DQN) can be trained once on one district and
then applied to another with negligible inference cost. For example, PPO
trained on Birchplatz achieves a final bike time of 4.50 s on Affoltern,
only 0.63 s higher than LP while using significantly fewer bike edges
(87 vs. 130). Similarly, on Birchplatz itself, the best RL agent
(S2V‑DQN) reaches 3.72 s, 0.50 s above LP but with about 100 fewer bike
lanes. The disadvantage of RL is its worse final bike time (less
aggressive network reallocation) and the upfront training cost (up to
11421 s for S2V‑DQN). The advantage is clear where after training an RL
agent can instantly produce a viable Pareto‑optimal network for any new
city, whereas LP must be re‑solved each time a prohibitive overhead for
large‑scale or iterative urban planning. Thus, LP remains the gold
standard for static, per‑city optimization, while RL offers a
transferable alternative when computational budgets or repeated
applications are a concern.
library(dplyr)
library(ggplot2)
library(magick)
library(patchwork)
base_dir <- "C:/Users/user/Documents/bike_lane_research/outputs_bikr_rl/s2v_dqn"
sub_dir_1 <- "PPO_Trained_On_Affoltern/graphs"
sub_dir_2 <- "PPO_Trained_On_Birchplatz/graphs"
sub_dir_3 <- "Linear_Programming/graphs"
sub_dir_4 <- "S2V_DQN_Trained_On_Birchplatz/graphs"
# Read images and add titles
plt1 <- image_read(file.path(base_dir, sub_dir_1, "affoltern_bike_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Affoltern)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt2 <- image_read(file.path(base_dir, sub_dir_2, "affoltern_bike_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt3 <- image_read(file.path(base_dir, sub_dir_3, "affoltern_bike_network.png")) %>%
image_ggplot() +
ggtitle("Linear Programming") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt4 <- image_read(file.path(base_dir, sub_dir_4, "affoltern_bike_network.png")) %>%
image_ggplot() +
ggtitle("S2V-DQN (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
# Arrange with patchwork and add overall caption
(plt1 | plt2) / (plt3 | plt4) +
plot_annotation(
title = "Figure 9: Affoltern Bike Networks by Algorithm",
theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
)

The final bike network for Affoltern reveals that the linear
programming method produces the most elaborate and well‑connected
network. The networks generated by PPO(Affoltern), PPO(Birchplatz) and
S2V‑DQN(Birchplatz) exhibit similar connectivity and complexity, all
ranking lower than the LP results.
plt1 <- image_read(file.path(base_dir, sub_dir_1, "birchplatz_bike_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Affoltern)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt2 <- image_read(file.path(base_dir, sub_dir_2, "birchplatz_bike_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt3 <- image_read(file.path(base_dir, sub_dir_3, "birchplatz_bike_network.png")) %>%
image_ggplot() +
ggtitle("Linear Programming") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt4 <- image_read(file.path(base_dir, sub_dir_4, "birchplatz_bike_network.png")) %>%
image_ggplot() +
ggtitle("S2V-DQN (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
(plt1 | plt2) / (plt3 | plt4) +
plot_annotation(
title = "Figure 10:Birchplatz Bike Networks by Algorithm",
theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
)

Similar to Affoltern network, the final bike network for Birchplatz
reveals that the linear programming method produces the most elaborate
and well‑connected network, followed by S2V‑DQN .The networks generated
by PPO trained on Affoltern and PPO trained on Birchplatz exhibit
similar connectivity and complexity, both ranking lower than the LP and
S2V‑DQN results.
plt1 <- image_read(file.path(base_dir, sub_dir_1, "affoltern_car_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Affoltern)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt2 <- image_read(file.path(base_dir, sub_dir_2, "affoltern_car_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt3 <- image_read(file.path(base_dir, sub_dir_3, "affoltern_car_network.png")) %>%
image_ggplot() +
ggtitle("Linear Programming") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt4 <- image_read(file.path(base_dir, sub_dir_4, "affoltern_car_network.png")) %>%
image_ggplot() +
ggtitle("S2V-DQN (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
(plt1 | plt2) / (plt3 | plt4) +
plot_annotation(
title = "Figure 11: Affoltern Car Networks by Algorithm",
theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
)

plt1 <- image_read(file.path(base_dir, sub_dir_1, "birchplatz_car_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Affoltern)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt2 <- image_read(file.path(base_dir, sub_dir_2, "birchplatz_car_network.png")) %>%
image_ggplot() +
ggtitle("PPO (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt3 <- image_read(file.path(base_dir, sub_dir_3, "birchplatz_car_network.png")) %>%
image_ggplot() +
ggtitle("Linear Programming") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
plt4 <- image_read(file.path(base_dir, sub_dir_4, "birchplatz_car_network.png")) %>%
image_ggplot() +
ggtitle("S2V-DQN (Trained on Birchplatz)") +
theme(plot.title = element_text(size = 10, hjust = 0.5))
(plt1 | plt2) / (plt3 | plt4) +
plot_annotation(
title = "Figure 12: Birchplatz Car Networks by Algorithm",
theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
)

REFERENCES
[1] “Taming Traffic,” Institute for Transportation and Development
Policy - Promoting sustainable and equitable transportation worldwide,
Mar. 22, 2021. https://itdp.org/publication/taming-traffic
[2] K. Hymel, “If you build it, they will drive: Measuring induced
demand for vehicle travel in urban areas,” Transport Policy,
vol. 76, pp. 57–66, Apr. 2019, doi: https://doi.org/10.1016/j.tranpol.2018.12.006.
[3] “Road widening and induced demand explained | TomTom Newsroom,”
TomTom, 2024. https://www.tomtom.com/newsroom/explainers-and-insights/induced-demand-explained/
[4] Ballo, L., M. Raubal and K.W. Axhausen (2024) Designing an E-Bike
City: An automated process for network-wide multimodal road space
reallocation, Journal of Cycling and Micromobility
Research, 2, 100048.
[5] Wiedemann, N., Nöbel, C., Martin, H., Ballo, L., & Raubal, M.
(2024). Bike network planning in limited urban space. arXiv preprint
arXiv:2405.01770.
---
title: ""
output: html_notebook
editor_options: 
  markdown: 
    wrap: sentence
---

## Transferable Bike Network Planning: Reinforcement Learning for Pareto-Optimal Lane Reallocation

I've always hated travelling to and from from airport to my place in this mumbai traffic and it would for sure beat my favourite playlist.
Since I don't engage in any deleterious habits, my traffic survival kit with all amazing snacks for the ride and petrol bill collectively a decent down payment on a house in Andheri.
A poor joke, I know.
And of course there's always some new flyover or metro pillar coming up everywhere yet not a single sign of traffic easing.
That said, I'm fairly certain most people would be furrowing their brows in total, grudging recognition.
Thousands of dollars are spent yearly on expanding car lanes throughout the cities worldwide.
Yet cities everywhere are more congested than ever.

Recent events such as Kuala Lumpur’s “Bangun KL” campaign tried to solve gridlock by asking people to start their day earlier, offering discounted coffee as an incentive.
However, most commuters’ schedules are already fixed which includes parents dropping children at school by 6:30 am, teachers, civil servants, and shift workers who cannot shift their hours.
Meanwhile a study by Institute for Transportation and Development Policy India reveals Mumbai as the most congested city in the world.
The vehicle numbers have tripled in 15 years while 90% of Mumbaikars are stuck in peak‑hour traffic where 55% of traffic crawls below 20 km/h.
The city loses ₹3,600 crores (\$485 million) annually in productivity and fuel loss, 40 to 50% people lose personal time and 750 deaths per year are attributed to pollution[1].
A striking finding from the study is that just 19% of people (travelling by private cars, motorised two-wheelers, taxis and auto rickshaw) consume 80% of road space meaning a small minority of private vehicles create the majority of congestion.
The usual “fixes” (widening roads or shifting schedules) ignore induced demand.

![](images/clipboard-489765278.png)

Induced demand or induced traffic states that expanding highway capacity make vehicle use higher which in the long run do not provide lasting traffic congestion relief [2].
An example is the Katy Freeway in Texas which connects Houston’s suburbs to its downtown.
Over the past two decades, this freeway has undergone several expansions totaling nearly \$3 billion, culminating in a massive 23 lanes.
While the road’s vehicle capacity has certainly grown the travel times have only worsened.
As a result, many consider the project a complete waste of public funds since its original aim was to reduce congestion.
On the Katy Freeway, hundreds of homes and businesses were torn down to accommodate the widening project, forcing people, jobs, and commerce to relocate farther from the city center.
That displacement ultimately increased the number of freeway users.
These changes fueled additional demand: people now live farther from their workplaces and must commute in.
Walking is impossible due to the distance, and public transit is either unreliable or nonexistent.
Consequently, residents have no choice but to drive using a freeway they never needed before [3].

Instead of endlessly widening roads for cars which only induces more demand, one of the best ways cities can solve congestion is by investing in a mix of transport modes i.e. trains, buses, walking, and cycling.
Mumbai, for example, is building 101.43 km (63.03 mi) of metro network on top of its existing suburban train netwrok which spreads over 450 kilometres.
But metros and trains alone cannot solve the last‑mile problem where people still need a reliable way to reach stations from their homes or offices, often short distances that are too far to walk but too close to justify driving.
Cycling fills that gap perfectly as it is space‑efficient.
A bike lane of the same width can move 7,000–8,000 people per hour while a single car lane can move roughly 1,000–2,000 people.
Bicycles also take up far less parking and road space, reduce pollution, and improve public health.
This project investigates whether reinforcement learning (RL) can overcome these limitations.
Traditional optimization methods, such as linear programming, can answer these questions but are computationally expensive and cannot transfer solutions from one city to another.
By training RL agents on urban street networks, we aim to produce bike lane allocations that are not only efficient but also transferable across districts.
In the following section, we describe the training process, the environment setup, the state and action spaces, and the reward function designed to balance bike and car travel times.

### Data Used

The data was acquired from[^1]

[^1]: <https://wandb.ai/hsra/Bike_Lane_Project/reports/PPO-Agents-for-Bike-Lane-Optimization--VmlldzoxNjYyODE1Ng>

1.  The raw data consists of Street network graph which includes Nodes (intersections), edges (streets) with attributes (length, gradient, speed limit, number of lanes).
    This data is obtained from OpenStreetMap (OSM).

2.  Raw data also includes travel demand data which is represented as Origin-Destination (OD) matrix.
    This data was captured from bike sharing data, census and travel surveys.

3.  Lastly the raw data is enriched with factors such as elevation raster, parking data and transit routes which helps in improving travel time calculations or constraints.

Using snman library the raw OSM data is converted into an undirected graph.
Each edge stores: length(d(e)), gradient(δ(e)), speed limit (θ(e)), capacity in lane count or width (Λ(e)).
The undirected graph is then converted into an directed auxiliary graph where each undirected edge is replaced with two reciprocal directed edges and the gradient is reversed for the opposite direction.
Next travel times are computed for each directed edge.
From real life travel demand data the nodes which have higher demand are selected and optional auxiliary OD pairs (chain of all nodes) are added to enforce strong connectivity of the car network, these have zero weight in the objective.
There are two district networks which have been choosen , one of which is Affoltern(Zurich) and another one is Brichplatz (Zurich).

```{r}
library(sf)        # for reading spatial data (GeoPackage)
library(tidyverse) # for data manipulation and plotting
library(igraph)    # for network analysis (optional)
library(sf)
library(dplyr)
library(purrr)
library(patchwork)
library(ggplot2)
library(gt)
library(scales)
library(plotly)
library(tidyr)
library(hrbrthemes)
```

```{r}


instances <- c("affoltern", "birchplatz")

summary_table <- map_dfr(instances, function(inst) {
  edges <- st_read(paste0("data/", inst, "/edges_all_attributes.gpkg"), quiet = TRUE)
  nodes <- st_read(paste0("data/", inst, "/nodes_all_attributes.gpkg"), quiet = TRUE)
  od <- read.csv(paste0("data/", inst, "/od_matrix.csv"))
  
  tibble(
    Instance = inst,
    Nodes = nrow(nodes),
    Edges = nrow(edges),
    Total_Length_km = sum(edges$length, na.rm = TRUE) / 1000,
    Edges_with_BikeLane = sum(grepl("P", edges$ln_desc, fixed = TRUE)),
    Pct_BikeLane = round(mean(grepl("P", edges$ln_desc, fixed = TRUE)) * 100, 1),
    OD_Pairs = nrow(od),
    Total_Trips = sum(od$trips, na.rm = TRUE),
    Avg_Trips_per_OD = round(mean(od$trips), 1)
  )
})

print(summary_table)
```

snman[4] processing outputs three files, two gpkg files, one of which has nodes and other edges.
Last file is the od matrix.
Affoltern network has 231 node while birchplatz has 314.
Similarly birchplatz (431) has more edges than affoltern network (290).
We can anticipate a longer training time for the birchplatz network.
Both of the initial networks has 1 protected bike lane.

```{r}
instance_affoltern <- "affoltern"
edges_affoltern <- st_read(paste0("data/", instance_affoltern, "/edges_all_attributes.gpkg"))
nodes_affoltern <- st_read(paste0("data/", instance_affoltern, "/nodes_all_attributes.gpkg"))
od_affoltern <- read_csv(paste0("data/", instance_affoltern, "/od_matrix.csv"))

instance_birchplatz <- "birchplatz"
edges_birchplatz <- st_read(paste0("data/", instance_birchplatz , "/edges_all_attributes.gpkg"))
nodes_birchplatz <- st_read(paste0("data/", instance_birchplatz , "/nodes_all_attributes.gpkg"))
od_birchplatz <- read_csv(paste0("data/", instance_birchplatz , "/od_matrix.csv"))

```

```{r}
# Basic info
#glimpse(edges_birchplatz)

# Edge length distribution Affoltern
p1 <- ggplot(edges_affoltern, aes(x = length)) +
  geom_histogram(bins = 50, fill = "darkblue") +
  labs(title = "Edge length distribution Affoltern", x = "Length (meters)")

# Edge length distribution Birchplatz
p2 <- ggplot(edges_birchplatz, aes(x = length)) +
  geom_histogram(bins = 50, fill = "steelblue") +
  labs(title = "Edge length distribution Birchplatz", x = "Length (meters)")

patchwork_1 <- (p1 + p2)
patchwork_1 + plot_annotation(
  title = 'Figure 1: Edge distribution across two quaters in zurich-Switzerland',
)



# Gradient (if column exists)
if("grade" %in% colnames(edges_affoltern)) {
  p3 <- ggplot(edges_affoltern, aes(x = grade)) +
    geom_histogram(bins = 50, fill = "lightgreen") +
    labs(title = "Grade distribution Affoltern", x = "Grade (%)")
}

# Gradient (if column exists)
if("grade" %in% colnames(edges_birchplatz)) {
  p4 <- ggplot(edges_birchplatz, aes(x = grade)) +
    geom_histogram(bins = 50, fill = "darkgreen") +
    labs(title = "Grade distribution BirchPlatz", x = "Grade (%)")
}

patchwork_2 <- (p3 + p4)
patchwork_2 + plot_annotation(
  title = 'Figure 2: Gradient distribution of streets across two quaters in zurich-Switzerland',
)


```

```{r}
p1 <- ggplot() +
  geom_sf(data = edges_affoltern, color = "grey", linewidth = 0.3) +
  geom_sf(data = nodes_affoltern, size = 0.5, color = "red") +
  labs(title = paste("Figure 3: Street Network of", instance_affoltern)) +
  theme_void()

p2 <- ggplot() +
  geom_sf(data = edges_birchplatz, color = "grey", linewidth = 0.3) +
  geom_sf(data = nodes_birchplatz, size = 0.5, color = "purple") +
  labs(title = paste("Figure 4: Street Network of", instance_birchplatz)) +
  theme_void()
p1+p2
```

### Baseline Method

The baseline method used in this project is a linear programming solution developed by [5] which is a mathematical recipe that balances car and bike travel times.
For every street segment, the optimization decides how much space to give to cars and how much to bikes, with bike lanes needing only half the width of car lanes.
It considers where people actually want to travel (the origin‑destination pairs) and ensures that both car and bike networks stay connected.
The solver first finds a “fractional” solution (e.g 0.7 of a lane for bikes), a rounding algorithm then picks edges with the highest fractional bike capacity and fixes them as whole bike lanes, re‑solving the LP every k steps.
This makes sure that cars still have a way to get everywhere.
By repeating the whole process with different weights on car vs. bike importance, the planner obtains a full set of Pareto‑optimal networks, each showing how much bike improvement costs in terms of car travel time.

### PPO method

The Reinforcement learning agent environment in this project is named as BikeNetworkEnv and it defines state space, action space, reward function, and transition dynamics.

#### *State Space*

The state is a flattened vector of features for every lane edge in the graph.
For each directed lane edge the agent sees three features.
The features name with their description and value type are given in the below table-:

Table 3: Features of the State Vector

| Feature | Discription | Type |
|:----------------------:|:----------------------:|:----------------------:|
| distance | Length of the edge (km) | float |
| gradient | Slope in percent (positive uphill, negative downhill) | float |
| is_bike | 1 if the lane is already a bike lane (lanetype == "P"), else 0 | bool |

The full observation is a 1D array of length (3 × number_of_directed_lane_edges).
So if there are 50 lane edges the vector has 150 features.
This state representation is used because the agent need to verify the physical properties of each edge to determine whether it will be beneficial to convert it into bike lane.
The is_bike feature is there so that the agent avoids redundant actions.

#### *Action Space*

Each action corresponds to selecting one specific directed lane edge and attempting to convert it from a car lane (M\>) to a bike lane (P).

#### *Reward*

The reward is designed to encourage the agent to reduce bike travel time while keeping car travel time low, and to avoid disconnecting the car network.

When the RL agent takes an action, below points are considered for the reward.

If 1.
The edge is already a bike lane then -5 penalty is received.
Else 1.
The edge is first temporarily converted 2.
Check if the remaining car network is strongly connected.
If 1.
not connected then the conversion is rejected and -20 penalty is given.
Else 1.
If connected then the conversion is accepted, the new bike and car travel times are computed and the reward equation (i) is calculated.

![](images/clipboard-3888357034.png)

> Reward = 100 / (bike_time + ε) - 0.7 × car_time (i)

(100 / bike_time ) gives high reward when bike time is low.
(-0.7 \* car_time) penalizes increases in car travel.
The values 100 and 0.7 were chosen empirically and can be further tuned using the hyper parameter tuning.

#### Initial State and Episode Termination

At the start of each episode, the graph reset to the initial car‑only configuration (all lanes M\>).
The agent then sequentially converts edges.
An episode ends after a fixed number of steps (max_steps).

#### *Model Training*

The agent is trained as follows-:

1.  The lane graph and the OD matrix are loaded for specific instances(cities/districts).
2.  The BikeNetworkEnv is loaded with lane graph and OD matrix.
3.  PPO algorithm is used for training. PPO is used because it works well with continuous as well as discrete action spaces. Furthermore, It is less sensitive to hyperparameters than vanilla policy gradient methods.
4.  The model is trained for 10000 timesteps.
5.  the trained model is saved to a zip file.

#### *How does PPO work in this case?*

The agent runs many episodes and in each episode it takes multiple steps defined by the hyperparameter (max_steps).
At each step, it:

1.  Observes the current state (edge distances, gradients, bike flags).

2.  Samples an action according to the policy’s probabilities (or chooses the action with highest probability during inference).

3.  Executes the action, receives a reward, and moves to the next state.

4.  Stores (state, action, reward, next_state, done) in a buffer.

After the agent collects a batch of experiences, PPO updates the policy by:

1.  Computing the advantage which determines how much better an action was than the average expect turn.

2.  Clipping the policy update to avoid taking too large a step.

The objective is to increase the probability of actnthat led to higher rewards but only within a trust region.

#### *Inference After Training*

Once trained, the agent is used to rank edges inead of solving an LP.
The current graph (with some edges already converted to bike lanes) is used to build the observation vector in exactly the same way as in the environment.
Since, the RL agent is trained on a specific graph ( e.g affoltern) which has a fixed number of edges.
The graphs determines the observation space.
When later the trained agent is loaded to a different city or district, the new graph may have a different number of lane edges.
Thus, the observation vector length would differ from what the neural network expects.
To overcome, this mismatch a padding/truncation function is added.

If the new graph has fewer edges than the training graph we pad the observation vector with zeros to reach the expected length.
The network sees zeros for those “missing” edges, which essentially means they have minimal influence.

If the new graph has more edges than the training graph we truncate or ignore the extra edges.
This means the agent cannot act on those extra edges at all, this is one of the severe limitation of the model which can be taken into account for future work.

Despite the limitations the RL agent can be trained once on a representative graph which can then be applied in inference on any other district or part of the city.
This is a huge practical advantage if many places need to be optimised for bike lanes.

Two PPO agents were trained, one on Affoltern map and another on Birchplatz network.
The training insights are as follows-:

![](images/clipboard-1513399619.png)

![](images/clipboard-440975827.png)

![](images/clipboard-2349061818.png)

These insights were created using wandb, the original wandb report[^2] includes interactive graph for further data.

[^2]: <https://wandb.ai/hsra/Bike_Lane_Project/reports/PPO-Agents-for-Bike-Lane-Optimization--VmlldzoxNjYyODE1Ng>

### S2V_DQN

The S2V‑DQN agent was inspired by the work of Dai et al. (NIPS 2017), who combined graph embedding (Structure2Vec) with deep Q‑learning to learn greedy heuristics for combinatorial optimization problems over graphs.

#### *State Representation*

In s2v_dqn agent the state is a line graph of the original lane graph.
State consists of

1.  Nodes – Each node (which corresponds to a directed lane edge in original lane graph).
    each node has 3 features including distance, gradient, is_bike.
    2 .
    An adjacency matrix that tells the neural network how edges are connected.

2.  Valid mask – binary mask indicating which edges can be legally converted without disconnecting the car network.

#### *Training*

The S2V‑DQN agent represents the current state of the lane network as a line graph where each node corresponds to a directed lane edge and edges in this line graph connect two lane edges if they share a common intersection.
Each node is associated with a feature vector of length three including the edge’s length, its gradient, and a binary indicator of whether it is already a bike lane.
The agent then applies a message‑passing neural network (Structure2Vec) that iteratively updates node embeddings by aggregating information from neighbouring nodes.
After n iterations, each node’s embedding encodes not only its own properties but also the structure and features of the local neighbourhood.
A global pooling operation (mean) over all node embeddings produces a single vector that summarises the entire network state.

For each lane edge, the agent concatenates its individual embedding with the global graph summary and passes the result through a small multi‑layer perceptron (two hidden layers) to compute a scalar Q‑value.
This Q‑value estimates the expected long‑term reward if that edge is converted from a car lane to a bike lane given the current network state.
Edges that are critical for car connectivity like bridges or arterial roads that if removed would disconnect the graph, receive low Q‑values while edges whose conversion would significantly reduce bike travel time without harming car travel time receive high Q‑values.
The message‑passing mechanism allows the agent to reason about network‑wide consequences.
During inference Q‑value is used to rank edges during inference a higher Q means that the lane is a better candidate for conversion.

Unlike on policy method PPO, s2v_dqn is a Off-policy method.
It stores past experiences in a replay buffer and samples randomly to update the Q‑network.Although, it is generally more sample‑efficient than on‑policy methods, it can be unstable if hyperparameters are not tuned.
Handling variable graph representation.

We used padding/truncation in PPO agent to make it accomodable for a variety of city networks.
However, this is not required by s2v_dqn as the graph representation is inherently variable‑size which means that the model processes a set of nodes and edges via message passing, which works for any number of edges.
PyTorch Geometric’s Batch class collates multiple graphs of different sizes into a single batch.
Despite its advantages, one main limitation of s2v_dqn is that it requires careful reward scaling and target network updates.

An S2V_DQN agent was trained on Birchplatz network.
The training insights are as follows-:

![](images/clipboard-2490974748.png)

![](images/clipboard-1523775282.png)

Above insights were created using wandb, the original wandb report[^3] includes interactive graph for further data.

[^3]: <https://wandb.ai/hsra/S2V_DQN/reports/S2V_DQN--VmlldzoxNjYyOTkwNw>

### Results

```{r}
library(dplyr)
library(purrr)
library(tidyr)

# Root directory (where all experiment folders live)
root_dir <- "C:/Users/user/Documents/bike_lane_research/outputs_bikr_rl/s2v_dqn"

# Find all CSV files matching the pattern
file_list <- list.files(
  path = root_dir,
  pattern = "real_pareto_optimize_od_1.0_50\\.csv$",
  recursive = TRUE,
  full.names = TRUE
)

# Function to extract both experiment and location
read_and_label <- function(path) {
  df <- read.csv(path)
  
  # Get relative path from root_dir
  rel_path <- gsub(normalizePath(root_dir, winslash = "/"), "", normalizePath(path, winslash = "/"))
  # Example: "/bike_lane_agent_affoltern_20260405_1523/affoltern/real_pareto_optimize_od_1.0_50.csv"
  
  # Split by "/"
  parts <- strsplit(rel_path, "/")[[1]]
  # parts[1] is empty (starts with /), parts[2] = experiment folder, parts[3] = location (affoltern/birchplatz)
  experiment <- parts[2]
  location <- parts[3]
  
  df$experiment <- experiment
  df$location <- location
  return(df)
}

# Combine all data
combined_df <- map_df(file_list, read_and_label)

# Reshape to long format (bike/car categories)
df_long <- combined_df %>%
  pivot_longer(
    cols = c(bike_edges, car_edges, bike_time, car_time),
    names_to = c("mode", ".value"),
    names_pattern = "(bike|car)_(.*)"
  )
# Now columns: ..., mode, edges, time, experiment, location

df <- df_long

df <- df %>%
  group_by(bike_edges_added, experiment, location, mode) %>%
  mutate(duplicate_id = row_number()) %>%
  ungroup()

df_wide <- df %>%
  pivot_wider(
    id_cols = c(bike_edges_added, experiment, location, duplicate_id),
    names_from = mode,
    values_from = time
  ) %>%
  rename(bike_time = bike, car_time = car) %>%
  select(-duplicate_id)   # remove the helper column



```

#### *Comparison of Final Perceived Bike and Car Travel Times*

```{r}
library(gt)

summary_table <- df_wide %>%
  arrange(location, experiment, bike_edges_added) %>%   # ensure correct order
  group_by(location, experiment) %>%
  summarise(
    init_bike_time = first(bike_time),
    init_car_time = first(car_time),
    final_bike_time = last(bike_time),
    final_car_time = last(car_time),
    num_points = n()
  ) %>%
  ungroup() %>%
  # Round numeric columns for readability
  mutate(across(where(is.numeric), ~ round(.x, 3)))

# Create gt table
gt(summary_table) %>%
  tab_header(
    title = "Table 4: Initial and final travel times and number of Pareto points",
  ) %>%
  cols_label(
    location = "Location",
    experiment = "Algorithm",
    init_bike_time = "Initial bike time",
    init_car_time = "Initial car time",
    final_bike_time = "Final bike time (seconds)",
    final_car_time = "Final car time",
    num_points = "Bike Edges"
  ) %>%
  fmt_number(columns = c(init_bike_time, init_car_time, final_bike_time, final_car_time),
             decimals = 2) %>%
  tab_options(
    table.font.size = 12,
    heading.title.font.size = 14,
    heading.subtitle.font.size = 12
  )
```

Table 4 reveals a consistent trade‑off between optimality and resource efficiency.
Linear Programming (LP) achieves the lowest final bike travel time in both districts (3.87 s in Affoltern, 3.22 s in Birchplatz) but it does so by reallocating substantially more bike lanes (130 and 267, respectively).
In contrast, all RL agents (PPO and S2V‑DQN) produce slightly higher final bike times (3.72–4.50 s) while using roughly 40% fewer bike lanes in Affoltern (83–87 vs. 130) and 35–40% fewer in Birchplatz (154–176 vs. 267).

```{r}


# Filter for bike mode and affoltern location
df_bike_affoltern <- df %>%
  filter(mode == "bike", location == "affoltern")

p <- df_bike_affoltern %>%
  ggplot(aes(x = edges, y = time, fill = experiment)) +
  geom_area(alpha = 0.5) +
  facet_wrap(~ experiment) +          # one subplot per experiment
  labs(x = "Edges", y = "Time", title = "Figure 5: Change in Perceived Bike Time on Affoltern as The Algorithms Progress Add Bike Lanes Over Time") +
  theme_ipsum()

# Make interactive
p1 <- ggplotly(p)
p1


```

Plot shows that

```{r}
# Filter for bike mode and birchplatz location
df_bike_affoltern <- df %>%
  filter(mode == "bike", location == "birchplatz")

p <- df_bike_affoltern %>%
  ggplot(aes(x = edges, y = time, fill = experiment)) +
  geom_area(alpha = 0.5) +
  facet_wrap(~ experiment) +          # one subplot per experiment
  labs(x = "Edges", y = "Time", title = "Figure 6: Change in Perceived Bike Time on Birchplatz as The Algorithms Progress Add Bike Lanes Over Time") +
  theme_ipsum()

# Make interactive
p2 <- ggplotly(p)
p2

```

The table shows that Linear Programming (LP) not only achieves the lowest final bike travel times but also limits the increase in car travel times more effectively than the RL agents.
Starting from initial car times of 1.83 s (Affoltern) and 1.58 s (Birchplatz) LP increases car times to 2.60 s (+42%) and 2.51 s (+59%) respectively.
In contrast, all RL agents result in substantially higher car times.
For example for Affoltern, the best RL agent (PPO trained on Affoltern) final car time reaches 3.84 s (+110%) and for Birchplatz, S2V‑DQN reaches 3.27 s (+107%) while PPO trained on Birchplatz reaches 3.41 s (+116%).
This is despite RL agents using far fewer bike lanes (e.g., 83 vs. 130 in Affoltern).
The finding suggests that LP is more car‑aware where it carefully selects lanes that improve bikeability while minimizing disruption to car traffic.
RL agents, even when trained with a reward that includes car travel time appear to allocate bike lanes in a way that inadvertently increases car travel time more than LP does.

One way to improve the RL agent can be increasing the training budget and extensive hyperparameter tuning as they had relatively short training budgets (e.g., 635–2489 seconds).
Another, way is more thorough reward engineering.

```{r}

# Filter for bike mode and affoltern location
df_bike_affoltern <- df %>%
  filter(mode == "car", location == "affoltern")

p <- df_bike_affoltern %>%
  ggplot(aes(x = edges, y = time, fill = experiment)) +
  geom_area(alpha = 0.5) +
  facet_wrap(~ experiment) +          # one subplot per experiment
   labs(x = "Edges", y = "Time", title = "Figure 7: Change in Perceived Car Time on Affoltern as The Algorithms Progress Add Bike Lanes Over Time") +
  theme_ipsum()

# Make interactive
p1 <- ggplotly(p)
p1

# Filter for bike mode and birchplatz location
df_bike_affoltern <- df %>%
  filter(mode == "car", location == "birchplatz")

p <- df_bike_affoltern %>%
  ggplot(aes(x = edges, y = time, fill = experiment)) +
  geom_area(alpha = 0.5) +
  facet_wrap(~ experiment) +          # one subplot per experiment
  labs(x = "Edges", y = "Time", title = "Figure 8: Change in Perceived Car Time on Birchplatz as The Algorithms Add Bike Lanes Over Time") +
  theme_ipsum()

# Make interactive
p2 <- ggplotly(p)
p2

```

#### *Comparison of Training and Pareto Computation Time*

Table 5 : Summary of training time (seconds) and Pareto computation time (seconds)

| Method | Training time (s) | Affoltern (Pareto computation time, s) | Birchplatz (Pareto computation time, s) |
|------------------|------------------|------------------|------------------|
| Affoltern_PPO | 635 | 10.85 | 35.50 |
| Birchplatz_PPO | 2489 | 13.37 | 37.12 |
| LP_Affoltern |  | 117.63 |  |
| LP_Birchplatz |  |  | 3911.71 |
| Birchplatz_s2vdqn | 11421 | 12.85 | 73.85 |

The results in the table above highlight a clear trade‑off between optimality and transferability.
The Linear Programming (LP) approach achieves the lowest perceived bike travel time in both districts (3.87 s for Affoltern, 3.22 s for Birchplatz), improving upon the initial times by about 46% and 49% respectively.
However, LP is not transferable which is why each new district requires a full, time‑intensive re‑optimization from scratch (as reflected in the table).
In contrast, reinforcement learning agents (PPO, S2V‑DQN) can be trained once on one district and then applied to another with negligible inference cost.
For example, PPO trained on Birchplatz achieves a final bike time of 4.50 s on Affoltern, only 0.63 s higher than LP while using significantly fewer bike edges (87 vs. 130).
Similarly, on Birchplatz itself, the best RL agent (S2V‑DQN) reaches 3.72 s, 0.50 s above LP but with about 100 fewer bike lanes.
The disadvantage of RL is its worse final bike time (less aggressive network reallocation) and the upfront training cost (up to 11421 s for S2V‑DQN).
The advantage is clear where after training an RL agent can instantly produce a viable Pareto‑optimal network for any new city, whereas LP must be re‑solved each time a prohibitive overhead for large‑scale or iterative urban planning.
Thus, LP remains the gold standard for static, per‑city optimization, while RL offers a transferable alternative when computational budgets or repeated applications are a concern.

```{r}
library(dplyr)
library(ggplot2) 
library(magick)
library(patchwork)

base_dir <- "C:/Users/user/Documents/bike_lane_research/outputs_bikr_rl/s2v_dqn"
sub_dir_1 <- "PPO_Trained_On_Affoltern/graphs"
sub_dir_2 <- "PPO_Trained_On_Birchplatz/graphs"
sub_dir_3 <- "Linear_Programming/graphs"
sub_dir_4 <- "S2V_DQN_Trained_On_Birchplatz/graphs"

# Read images and add titles
plt1 <- image_read(file.path(base_dir, sub_dir_1, "affoltern_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Affoltern)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt2 <- image_read(file.path(base_dir, sub_dir_2, "affoltern_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt3 <- image_read(file.path(base_dir, sub_dir_3, "affoltern_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("Linear Programming") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt4 <- image_read(file.path(base_dir, sub_dir_4, "affoltern_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("S2V-DQN (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

# Arrange with patchwork and add overall caption
(plt1 | plt2) / (plt3 | plt4) +
  plot_annotation(
    title = "Figure 9: Affoltern Bike Networks by Algorithm",
    theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
  )

```

The final bike network for Affoltern reveals that the linear programming method produces the most elaborate and well‑connected network.
The networks generated by PPO(Affoltern), PPO(Birchplatz) and S2V‑DQN(Birchplatz) exhibit similar connectivity and complexity, all ranking lower than the LP results.

```{r}

plt1 <- image_read(file.path(base_dir, sub_dir_1, "birchplatz_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Affoltern)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt2 <- image_read(file.path(base_dir, sub_dir_2, "birchplatz_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt3 <- image_read(file.path(base_dir, sub_dir_3, "birchplatz_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("Linear Programming") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt4 <- image_read(file.path(base_dir, sub_dir_4, "birchplatz_bike_network.png")) %>%
  image_ggplot() +
  ggtitle("S2V-DQN (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))


(plt1 | plt2) / (plt3 | plt4) +
  plot_annotation(
    title = "Figure 10:Birchplatz Bike Networks by Algorithm",
    theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
  )

```

Similar to Affoltern network, the final bike network for Birchplatz reveals that the linear programming method produces the most elaborate and well‑connected network, followed by S2V‑DQN .The networks generated by PPO trained on Affoltern and PPO trained on Birchplatz exhibit similar connectivity and complexity, both ranking lower than the LP and S2V‑DQN results.

```{r}

plt1 <- image_read(file.path(base_dir, sub_dir_1, "affoltern_car_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Affoltern)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt2 <- image_read(file.path(base_dir, sub_dir_2, "affoltern_car_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt3 <- image_read(file.path(base_dir, sub_dir_3, "affoltern_car_network.png")) %>%
  image_ggplot() +
  ggtitle("Linear Programming") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt4 <- image_read(file.path(base_dir, sub_dir_4, "affoltern_car_network.png")) %>%
  image_ggplot() +
  ggtitle("S2V-DQN (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))


(plt1 | plt2) / (plt3 | plt4) +
  plot_annotation(
    title = "Figure 11: Affoltern Car Networks by Algorithm",
    theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
  )
```

```{r}

plt1 <- image_read(file.path(base_dir, sub_dir_1, "birchplatz_car_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Affoltern)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt2 <- image_read(file.path(base_dir, sub_dir_2, "birchplatz_car_network.png")) %>%
  image_ggplot() +
  ggtitle("PPO (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt3 <- image_read(file.path(base_dir, sub_dir_3, "birchplatz_car_network.png")) %>%
  image_ggplot() +
  ggtitle("Linear Programming") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))

plt4 <- image_read(file.path(base_dir, sub_dir_4, "birchplatz_car_network.png")) %>%
  image_ggplot() +
  ggtitle("S2V-DQN (Trained on Birchplatz)") +
  theme(plot.title = element_text(size = 10, hjust = 0.5))


(plt1 | plt2) / (plt3 | plt4) +
  plot_annotation(
    title = "Figure 12: Birchplatz Car Networks by Algorithm",
    theme = theme(plot.title = element_text(size = 14, hjust = 0.5))
  )
```

#### ***REFERENCES***

[1] “Taming Traffic,” Institute for Transportation and Development Policy - Promoting sustainable and equitable transportation worldwide, Mar. 22, 2021.
<https://itdp.org/publication/taming-traffic>

[2] K.
Hymel, “If you build it, they will drive: Measuring induced demand for vehicle travel in urban areas,” *Transport Policy*, vol.
76, pp. 57–66, Apr. 2019, doi: <https://doi.org/10.1016/j.tranpol.2018.12.006.>

[3] “Road widening and induced demand explained \| TomTom Newsroom,” TomTom, 2024.
<https://www.tomtom.com/newsroom/explainers-and-insights/induced-demand-explained/>

[4] Ballo, L., M. Raubal and K.W.
Axhausen (2024)
[Designing an E-Bike City: An automated process for network-wide multimodal road space reallocation](https://doi.org/10.1016/j.jcmr.2024.100048),
*Journal of Cycling and Micromobility Research*, **2**, 100048.

[5] Wiedemann, N., Nöbel, C., Martin, H., Ballo, L., & Raubal, M.
(2024).
Bike network planning in limited urban space.
arXiv preprint arXiv:2405.01770.
