Spatial clustering of POI data for Warsaw

1 Introduction

The inspiration for this project comes from the paper “DBSCAN Spatial Clustering Analysis of Urban ‘Production–Living–Ecological’ Space Based on POI Data”, published in 2022 in the International Journal of Environmental Research and Public Health by Xiaoqiang Tu, Chun Fu, An Huang, Hailian Chen, and Xing Ding.

The authors demonstrate that publicly available Points of Interest (POI) data, combined with denisty based clustering, can effectively identify spatial pattern within the city of Wuhan. Their study, conducted for central Wuhan, shows that these patterns can be used to assess real-time urban structure.

Based on this approach, this paper aims to replicate this approach and methodology for Warsaw. The analysis uses OpenStreetMap POI data within within the administrative boundaries of Warsaw and local spatial planning documents (Plan Ogólny Warszawy from 2023, ArcGis). The goal is to explore whether similar clustering techniques can reveal meaningful insights into the spatial organization and development patterns of the capital city of Poland.

2 Data

2.1 POI data

Initially, the data regarding Warsaw POI was gathered from OpenStreetMap. Data was downloaded and organized into predefined thematic groups. Then each of the groups was assigned to one of the “Production-Living-Ecological” areas, following the split presented in the inspiration paper.

It is worth mention that in PLE categories mapping, the simplified approach was applied: each POI is assigned only to one category while in reality same building can serve two different functions - for example residental and commercial. Additionally, the entire analysis relies on data available in OpenStreetMap (OSM), which is an open-source dataset, so the data may be incomplete, inconsistent, or lacking detailed attributes in some cases.

warsaw_boundary <- st_read(file.path(data_dir, "warsaw_boundary.gpkg"), quiet = TRUE)
warsaw_poi <- st_read(file.path(data_dir, "warsaw_poi.gpkg"), quiet = TRUE)

warsaw_districts <- read_csv(file.path(data_dir, "dzielnice_warszawy.csv"), show_col_types = FALSE) %>% clean_names()

glimpse(warsaw_boundary)

## Rows: 1
## Columns: 17
## $ bbox_west    <dbl> 20.85169
## $ bbox_south   <dbl> 52.09785
## $ bbox_east    <dbl> 21.27115
## $ bbox_north   <dbl> 52.36815
## $ place_id     <dbl> 415345216
## $ osm_type     <chr> "relation"
## $ osm_id       <dbl> 2907540
## $ lat          <dbl> 52.23337
## $ lon          <dbl> 21.07115
## $ class        <chr> "boundary"
## $ type         <chr> "administrative"
## $ place_rank   <dbl> 14
## $ importance   <dbl> 0.7999388
## $ addresstype  <chr> "administrative"
## $ name         <chr> "Warsaw"
## $ display_name <chr> "Warsaw, Masovian Voivodeship, Poland"
## $ geom         <POLYGON [°]> POLYGON ((20.85169 52.20098...

glimpse(warsaw_poi)

## Rows: 220,657
## Columns: 13
## $ osm_id              <chr> "way/1394406679", "way/955875853", "way/955875854"…
## $ name                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ target              <chr> "living", "living", "living", "living", "living", …
## $ criterion           <chr> "habitable", "habitable", "habitable", "habitable"…
## $ element             <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, …
## $ element_name        <chr> "Residential buildings", "Residential buildings", …
## $ studium_category    <chr> "MN", "MN", "MN", "MN", "MN", "MW", "MN", "MN", "M…
## $ studium_description <chr> "Zabudowa mieszkaniowa jednorodzinna", "Zabudowa m…
## $ lon                 <dbl> 21.04293, 21.07462, 21.09006, 21.07428, 21.08139, …
## $ lat                 <dbl> 52.33617, 52.29085, 52.28717, 52.28899, 52.28795, …
## $ x_m                 <dbl> 639146.4, 641448.5, 642512.9, 641431.2, 641919.2, …
## $ y_m                 <dbl> 498653.3, 493675.7, 493297.4, 493468.1, 493366.1, …
## $ geom                <POINT [°]> POINT (21.04293 52.33617), POINT (21.07462 5…

glimpse(warsaw_districts)

## Rows: 18
## Columns: 3
## $ x1       <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
## $ name     <chr> "Ochota", "Praga-Północ", "Wilanów", "Włochy", "Wola", "Biela…
## $ geometry <chr> "POLYGON ((634957.971896 482766.148259, 634959.77002 482765.4…

2.1.1 Data transformation into sf object

Firstly district was transformed into sf object (two remaining ones were alredy sf) and transformed into CRS for Poland - EPSG2180. Then as the overview the POI data was visualised and their distribution in Warsaw area was investigated.

warsaw_districts_sf <- st_as_sf(warsaw_districts, wkt = "geometry", crs = 2180)
warsaw_boundary_2180 <- st_transform(warsaw_boundary, 2180)

warsaw_poi_2180 <- st_as_sf(st_drop_geometry(warsaw_poi), coords = c("x_m", "y_m"), crs = 2180)

2.1.2 PLE classification

The visualisation below presents the distribution of POIs across Warsaw in the three main layers:

ecological - shows the highest coverage, especially along the Vistula river, which is almost not visible.
living - is more clustered, slightly higher density can be observed in central and inner districts.
production - shows lowest coverage, points tend to form visible clusters, especially outside the very center.

This shows clear differences between layers in Warsaw between those three categories. Additionally the differences are visible in the number of POIs in each category - living and ecological have more POIs than production areas. These differences suggest that DBSCAN clustering should reveal distinct cluster structures for each layer.

ggplot() +
  geom_sf(
    data = warsaw_boundary_2180,
    fill = NA,
    color = "black",
    linewidth = 0.4
  ) +
  geom_sf(
    data = warsaw_poi_2180,
    aes(color = target),
    size = 0.15,
    alpha = 0.4,
    show.legend = FALSE
  ) +
  scale_color_viridis_d(option = "D", end = 0.9) +
  facet_wrap(~ target, ncol = 3) +
  theme_minimal() +
  labs(
    title = "POI: PLE target layer classification"
  ) +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    strip.text = element_text(size = 12),
    plot.title = element_text(hjust = 0.5)
  )

ple_counts <- warsaw_poi_2180 %>%
  st_drop_geometry() %>%
  count(target, sort = TRUE)

p_bar_ple <- ggplot(
  ple_counts,
  aes(x = fct_reorder(target, n), y = n, fill = target)
) +
  geom_col(width = 0.6) +
  scale_fill_viridis_d(option = "D", end = 0.9) +
  theme_minimal() +
  labs(
    title = "Number of POI by PLE target layer",
    x = "PLE target",
    y = "Number of POI"
  ) +
  theme(
    legend.position = "none",
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

print(p_bar_ple)

3 Clustering - DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points based on their local spatial density. It uses two parameters: eps, which defines the neighbourhood radius, and minPts, the minimum number of points required to form a cluster. Points located in dense areas are assigned to clusters, while isolated points are treated as noise.

The method is particularly suitable for geospatial data because it operates directly on point coordinates and does not require specifying the number of clusters in advance. This is important in this study, as the number and shape of clusters in POI data are not known beforehand and may vary across different functional layers.

3.1 Do POI data cluster? Clark-Evans test

Before applying DBSCAN, it is worth verifying statistically that the POI points are actually clustered in space — and not randomly or uniformly distributed. The Clark-Evans test was used to check if clustering of the gathered data is possible.

#layers
poi_living <- warsaw_poi_2180 %>% filter(target == "living")
poi_production <- warsaw_poi_2180 %>% filter(target == "production")
poi_ecological <- warsaw_poi_2180 %>% filter(target == "ecological")

#coordinates for POI's - needed for clustering 
coords_living <- st_coordinates(poi_living)
coords_production <- st_coordinates(poi_production)
coords_ecological <- st_coordinates(poi_ecological)

#Clark-Evans test
run_ce(poi_living,     warsaw_owin, "Living")

## Living        R = 0.5995  p-value = 0.00e+00

run_ce(poi_production, warsaw_owin, "Production")

## Production    R = 0.4872  p-value = 0.00e+00

run_ce(poi_ecological, warsaw_owin, "Ecological")

## Ecological    R = 0.7767  p-value = 0.00e+00

The Clark–Evans test confirms that all three PLE layers are spatially clustered rather than randomly distributed, with R values below 1 and significant p-values. The Production layer is the most concentrated (R = 0.49), followed by the Living layer (R = 0.60), while the Ecological layer (R = 0.78) is the most dispersed. However, even the Ecological layer differs significantly from Complete Spatial Randomness.

These results confirm that DBSCAN is an appropriate method for this dataset. At the same time, they suggest that the three layers have different spatial densities, which means that they are expected to form clusters of different sizes and compactness.

3.2 Epsilon and minimum

Firstly, data was prepared for clustering and suitable eps parameters were applied separately for each layer, as their spatial densities differ.

The original study selected eps and MinPts “after several experimental iterations”. To make parameter selection similar, so this experiment follow two data-driven rules:

MinPts is set proportionally to the size of each target layer, using the same MinPts/n ratio reported for Wuhan:

Living - 0,039% of the POIs points: 65
Production - 0.2% of points: 100
Ecological - 1.7% of point: 50

Eps is chosen from the k-NN distance plot at the point where the curve forms the “elbow”:

par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))

# Living
dist_living <- kNNdist(coords_living, k = 65)
plot(sort(dist_living), type = "l", col = "#440154", lwd = 2,
     xlab = "Points sorted by distance", ylab = "65-NN distance")
abline(h = 350, col = "#21918c", lty = 2)
title("Living")

# Production
dist_production <- kNNdist(coords_production, k = 100)
plot(sort(dist_production), type = "l", col = "#440154", lwd = 2,
     xlab = "Points sorted by distance", ylab = "100-NN distance")
abline(h = 550, col = "#21918c", lty = 2)
title("Production")

# Ecological
dist_ecological <- kNNdist(coords_ecological, k = 50)
plot(sort(dist_ecological), type = "l", col = "#440154", lwd = 2,
     xlab = "Points sorted by distance", ylab = "50-NN distance")
abline(h = 950, col = "#21918c", lty = 2)
title("Ecological")

Based on the obtained curves the radius was marked as follows:

350 m for the living layer
550 m for the production layer
950 m for the ecological layer

This is what was expected - parks, allotments or gardens are spatially rarer and more dispersed than residental or production points.

3.3 DBSCAN for each layer

For each layer dbscan was conducted separately, according to the parameters inspired by paper about Wuhan and the k-distance plot.

db_living <- dbscan(coords_living, eps = 350, minPts = 65)
db_production <- dbscan(coords_production, eps = 500, minPts = 100)
db_ecological <- dbscan(coords_ecological, eps = 950, minPts = 50)

poi_living <- warsaw_poi_2180 %>%
  filter(target == "living") %>%
  mutate(cluster = db_living$cluster)

poi_production <- warsaw_poi_2180 %>%
  filter(target == "production") %>%
  mutate(cluster = db_production$cluster)

poi_ecological <- warsaw_poi_2180 %>%
  filter(target == "ecological") %>%
  mutate(cluster = db_ecological$cluster)

poi_living_plot <- poi_living %>%
  mutate(cluster_type = ifelse(cluster == 0, "Noise", "Cluster"))

poi_production_plot <- poi_production %>%
  mutate(cluster_type = ifelse(cluster == 0, "Noise", "Cluster"))

poi_ecological_plot <- poi_ecological %>%
  mutate(cluster_type = ifelse(cluster == 0, "Noise", "Cluster"))

poi_dbscan_plot <- bind_rows(
  poi_living_plot,
  poi_production_plot,
  poi_ecological_plot
)

p_dbscan_color <- ggplot() +
  geom_sf(
    data = warsaw_boundary_2180,
    fill = NA,
    color = "black",
    linewidth = 0.35
  ) +
  # noise
  geom_sf(
    data = poi_dbscan_plot %>% filter(cluster == 0),
    color = "grey85",
    size = 0.12,
    alpha = 0.3
  ) +
  # klastry
  geom_sf(
    data = poi_dbscan_plot %>% filter(cluster != 0),
    aes(color = factor(cluster)),
    size = 0.18,
    alpha = 0.7
  ) +
  scale_color_viridis_d(option = "D", end = 0.95) +
  facet_wrap(~ target, ncol = 3) +
  theme_minimal() +
  labs(
    title = "DBSCAN clustering results for POI data"
  ) +
  theme(
    legend.position = "none",
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    strip.text = element_text(size = 12),
    plot.title = element_text(hjust = 0.5)
  )

print(p_dbscan_color)

dbscan_summary <- bind_rows(
  poi_living %>% st_drop_geometry() %>% mutate(layer = "living"),
  poi_production %>% st_drop_geometry() %>% mutate(layer = "production"),
  poi_ecological %>% st_drop_geometry() %>% mutate(layer = "ecological")
) %>%
  group_by(layer) %>%
  summarise(
    n_points = n(),
    n_noise = sum(cluster == 0),
    pct_noise = round(100 * mean(cluster == 0), 2),
    n_clusters = n_distinct(cluster[cluster != 0]),
    .groups = "drop"
  )

dbscan_summary

top_clusters <- bind_rows(
  poi_living      %>% st_drop_geometry() %>% mutate(layer = "living"),
  poi_production  %>% st_drop_geometry() %>% mutate(layer = "production"),
  poi_ecological  %>% st_drop_geometry() %>% mutate(layer = "ecological")
) %>%
  filter(cluster != 0) %>%
  count(layer, cluster, name = "n") %>%
  group_by(layer) %>%
  mutate(share_of_layer = round(100 * n / sum(n), 1)) %>%
  slice_max(n, n = 5) %>%
  arrange(layer, desc(n)) %>%
  ungroup()

top_clusters

3.4 Cluster quality assessment

Despite the parameter values being inspired from the two data-driven rules applied in the original study: proportional MinPts and the k-NN elbow for eps, the resulting quality is far from the expected.

A single cluster dominates each layer: for the ecological layer the largest cluster contains more than 90 % of all non-noise points, effectively merging Kampinos, Łazienki, Pole Mokotowskie and Kabaty into one “mega-blob” linked through continuous strips of street-level greenery, trees, and small squares. The living layer shows a two-blob structure (the left- and right-bank halves of Warsaw fused through commercial corridors), and the production layer is dominated by one elongated cluster stretching along the Służewiec — Śródmieście — Wola axis.
Cluster counts are either too small or spatially meaningless: with the original parameters, DBSCAN produces very few large clusters (4–13 per layer) that cannot be reasonably interpreted as separate functional zones of the city.

It seems that exactly the same approach isn’t working for Warsaw, as is working for Wuhan. It can be easily explained: Wuhan is a polycentric Chinese mega-city with large, monofunctional districts and a lower POI density than Warsaw. Additionaly Polish capital city has typical structure for European city, not large Chinese mega-city.

To solve this problem new section was added with tuning those two parameters to eliminate mega clusters and add better spatial separation.

3.5 Empirical tuning of parameters

To obtain interpretable clusters a small loop with grid of parameters was added. The sweep was described by two empirical criteria motivated by the diagnosis above. First, no single cluster should absorb the majority of the layer, the share of the largest cluster (top1) should stay below 50%. Second, the noise fraction should remain in a reasonable range (15–30%), similarly as it was in original study.

dbscan_diag <- function(coords, eps, minPts, label = "") {
  db <- dbscan(coords, eps = eps, minPts = minPts)
  cl <- db$cluster
  n_total  <- length(cl)
  n_noise  <- sum(cl == 0)
  n_clust  <- length(unique(cl[cl != 0]))
  share1   <- if (n_clust == 0) NA else
               round(100 * max(table(cl[cl != 0])) / sum(cl != 0), 1)
  cat(sprintf("%-11s eps=%-4d minPts=%-3d -> %3d clusters, %5.1f%% noise, top1=%5.1f%%\n",
              label, eps, minPts, n_clust,
              100 * n_noise / n_total, share1))
  invisible(db)
}

for (e in c(150, 200, 250, 300, 350, 400, 500)) {
  dbscan_diag(coords_living, eps = e, minPts = 65, label = "living")
}

## living      eps=150  minPts=65  -> 218 clusters,  40.0% noise, top1=  5.4%
## living      eps=200  minPts=65  -> 137 clusters,  15.4% noise, top1= 18.5%
## living      eps=250  minPts=65  ->  81 clusters,   5.2% noise, top1= 21.5%
## living      eps=300  minPts=65  ->  39 clusters,   2.0% noise, top1= 36.6%
## living      eps=350  minPts=65  ->  21 clusters,   1.0% noise, top1= 43.7%
## living      eps=400  minPts=65  ->  17 clusters,   0.5% noise, top1= 46.8%
## living      eps=500  minPts=65  ->   6 clusters,   0.3% noise, top1= 94.2%

for (mp in c(65, 100, 150, 200, 250, 300)) {
  dbscan_diag(coords_living, eps = 250, minPts = mp, label = "living")
}

## living      eps=250  minPts=65  ->  81 clusters,   5.2% noise, top1= 21.5%
## living      eps=250  minPts=100 -> 104 clusters,  15.6% noise, top1= 18.7%
## living      eps=250  minPts=150 -> 104 clusters,  36.0% noise, top1=  6.6%
## living      eps=250  minPts=200 ->  81 clusters,  55.7% noise, top1=  6.2%
## living      eps=250  minPts=250 ->  54 clusters,  75.0% noise, top1=  8.4%
## living      eps=250  minPts=300 ->  24 clusters,  88.7% noise, top1= 12.5%

for (mp in c(65, 100, 150, 200)) {
  dbscan_diag(coords_living, eps = 300, minPts = mp, label = "living")
}

## living      eps=300  minPts=65  ->  39 clusters,   2.0% noise, top1= 36.6%
## living      eps=300  minPts=100 ->  57 clusters,   6.8% noise, top1= 21.9%
## living      eps=300  minPts=150 ->  85 clusters,  17.8% noise, top1= 19.3%
## living      eps=300  minPts=200 ->  87 clusters,  33.4% noise, top1= 16.4%

for (mp in c(100, 150, 200)) {
  dbscan_diag(coords_living, eps = 350, minPts = mp, label = "living")
}

## living      eps=350  minPts=100 ->  34 clusters,   2.7% noise, top1= 36.2%
## living      eps=350  minPts=150 ->  57 clusters,   8.3% noise, top1= 22.3%
## living      eps=350  minPts=200 ->  66 clusters,  17.9% noise, top1= 20.5%

for (e in c(200, 300, 350, 400, 500, 600)) {
  dbscan_diag(coords_production, eps = e, minPts = 100, label = "production")
}

## production  eps=200  minPts=100 ->  20 clusters,  90.2% noise, top1= 30.5%
## production  eps=300  minPts=100 ->  48 clusters,  51.7% noise, top1= 38.1%
## production  eps=350  minPts=100 ->  41 clusters,  35.2% noise, top1= 40.5%
## production  eps=400  minPts=100 ->  25 clusters,  23.9% noise, top1= 54.4%
## production  eps=500  minPts=100 ->  13 clusters,  13.0% noise, top1= 88.5%
## production  eps=600  minPts=100 ->   8 clusters,   7.8% noise, top1= 97.2%

for (mp in c(50, 70, 80, 100, 120, 150)) {
  dbscan_diag(coords_production, eps = 350, minPts = mp, label = "production")
}

## production  eps=350  minPts=50  ->  36 clusters,  11.7% noise, top1= 64.6%
## production  eps=350  minPts=70  ->  27 clusters,  20.0% noise, top1= 65.2%
## production  eps=350  minPts=80  ->  31 clusters,  24.3% noise, top1= 46.1%
## production  eps=350  minPts=100 ->  41 clusters,  35.2% noise, top1= 40.5%
## production  eps=350  minPts=120 ->  33 clusters,  46.1% noise, top1= 41.9%
## production  eps=350  minPts=150 ->  28 clusters,  60.7% noise, top1= 45.5%

for (e in c(150, 200, 250, 350, 450, 550, 750, 950)) {
  dbscan_diag(coords_ecological, eps = e, minPts = 30, label = "ecological")
}

## ecological  eps=150  minPts=30  ->  27 clusters,  89.5% noise, top1= 14.6%
## ecological  eps=200  minPts=30  ->  28 clusters,  80.8% noise, top1= 36.5%
## ecological  eps=250  minPts=30  ->  33 clusters,  72.8% noise, top1= 39.0%
## ecological  eps=350  minPts=30  ->  52 clusters,  54.8% noise, top1= 54.3%
## ecological  eps=450  minPts=30  ->  62 clusters,  30.0% noise, top1= 53.8%
## ecological  eps=550  minPts=30  ->  35 clusters,  10.5% noise, top1= 72.9%
## ecological  eps=750  minPts=30  ->   4 clusters,   1.0% noise, top1= 98.5%
## ecological  eps=950  minPts=30  ->   2 clusters,   0.1% noise, top1= 99.0%

for (mp in c(15, 20, 25, 30, 40, 60)) {
  dbscan_diag(coords_ecological, eps = 250, minPts = mp, label = "ecological")
}

## ecological  eps=250  minPts=15  -> 140 clusters,  45.6% noise, top1= 40.5%
## ecological  eps=250  minPts=20  ->  80 clusters,  60.0% noise, top1= 31.0%
## ecological  eps=250  minPts=25  ->  51 clusters,  67.5% noise, top1= 35.4%
## ecological  eps=250  minPts=30  ->  33 clusters,  72.8% noise, top1= 39.0%
## ecological  eps=250  minPts=40  ->  24 clusters,  79.0% noise, top1= 40.0%
## ecological  eps=250  minPts=60  ->  12 clusters,  87.8% noise, top1= 27.0%

Based on the loops the following parameters were chosen:

Living layer — eps = 350m, minPts = 200. At this setting the algorithm returns 66 clusters with 17.9% noise and top1 = 20.5%.
Production layer — eps = 350m, minPts = 80. This configuration produces 31 clusters with 24.3% noise and top1 = 46.1%. The high top1 does not go away when parameters are adjusted, across the whole grid the largest cluster stabilizes around 40–50% of the clustered production POIs. This is a structural property of Warsaw: production-type activity (offices, retail, services) concentrates along a single corridor running from Służewiec through Mokotów and Śródmieście to Wola.
Ecological layer — eps = 250m, minPts = 60. This is the most restrictive configuration, forming 12 clusters with ~88% noise and top1 = 27.0%. The ecological layer is a small set of spatially isolated green sites: most Warsaw parks are individual objects separated by buildings, which lead to high noise in this layer. Only the largest green complexes will survive as clusters.

db_living      <- dbscan(coords_living,      eps = 350, minPts = 200)
db_production  <- dbscan(coords_production,  eps = 350, minPts =  80)
db_ecological  <- dbscan(coords_ecological,  eps = 250, minPts =  60)

poi_living <- warsaw_poi_2180 %>% filter(target == "living") %>% mutate(cluster = db_living$cluster)

poi_production <- warsaw_poi_2180 %>% filter(target == "production") %>% mutate(cluster = db_production$cluster)

poi_ecological <- warsaw_poi_2180 %>% filter(target == "ecological") %>% mutate(cluster = db_ecological$cluster)

make_dbscan_plot <- function(poi_sf, title_text) {
  
  poi_sf <- poi_sf %>%
    mutate(cluster_f = ifelse(cluster == 0, NA, as.character(cluster)))
  
  n_clust <- poi_sf %>%
    st_drop_geometry() %>%
    filter(cluster != 0) %>%
    summarise(n = n_distinct(cluster)) %>%
    pull(n)
  
  clust_cols <- viridis::viridis(
    n = max(1, n_clust),
    option = "turbo"
  )
  
  ggplot() +
    geom_sf(
      data = warsaw_districts_sf,
      fill = NA,
      color = "black",
      linewidth = 0.2
    ) +
    geom_sf(
      data = warsaw_boundary_2180,
      fill = NA,
      color = "black",
      linewidth = 0.4
    ) +
    # noise
    geom_sf(
      data = poi_sf %>% filter(cluster == 0),
      color = "grey85",
      size = 0.10,
      alpha = 0.25
    ) +
    # klastry
    geom_sf(
      data = poi_sf %>% filter(cluster != 0),
      aes(color = factor(cluster)),
      size = 0.16,
      alpha = 0.8
    ) +
    scale_color_manual(values = clust_cols) +
    theme_minimal() +
    labs(title = title_text) +
    theme(
      legend.position = "none",
      axis.text = element_blank(),
      axis.title = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank(),
      plot.title = element_text(hjust = 0.5)
    )
}

p_living <- make_dbscan_plot(poi_living, "DBSCAN clusters — Living layer")
p_production <- make_dbscan_plot(poi_production, "DBSCAN clusters — Production layer")
p_ecological <- make_dbscan_plot(poi_ecological, "DBSCAN clusters — Ecological layer")

print(p_living)

print(p_production)

print(p_ecological)

The living layer results in 66 clusters, 17.9% noise, and a largest cluster share of 20.5%, showing a clear polycentric structure. Instead of one dominant centre, residential areas form multiple clusters aligned with Warsaw districts. The main concentrations appear on both sides of the Vistula, with additional smaller clusters distributed across the north, west, and southern districts. That shows that Warsaw’s residential structure is decentralized.

The production layer produces 31 clusters, 24.3% noise, and a largest cluster share of 46.1%, indicating a more monocentric pattern. One dominant cluster stretches from Służewiec through Mokotów and the city center to Wola and further towards Praga-Południe. Smaller clusters appear in historically industrial areas such as Ursus, Włochy, and Targówek. Compared to the living layer, production activity is much more concentrated.

The ecological layer shows 12 clusters, 88% noise, and a largest cluster share of 27%. The high level of noise reflects the dispersed nature of green POIs. Only the largest and most developed parks form clusters, mainly along the central green corridor near the Vistula (Łazienki, Pole Mokotowskie).

4 Disctricts and major layer assigment

In the next step, the dominant PLE function was assigned to each district of Warsaw. This was done to move from point-based clustering results to a more interpretable district-level classification used to the comparison with territorial plans. Because the ecological layer produced relatively few stable clusters and a high share of noise points, it was expected that the final district classification would mainly be dominated by two functions: Living and Production.

The assignment was based on the area of DBSCAN cluster polygons within each district. Noise points were excluded from this step, since they do not belong to any DBSCAN cluster. Then, the polygons were intersected with Warsaw district boundaries and the area of each PLE layer inside every district was calculated. Finally, each district was assigned the function with the largest total clustered area.

make_cluster_polys <- function(pts_sf, cluster_vec, concavity = 2, min_pts = 5) {
  pts_sf$cluster <- cluster_vec
  pts_sf <- pts_sf[pts_sf$cluster != 0, ]

  polys <- split(pts_sf, pts_sf$cluster) |>
    lapply(function(g) {
      if (nrow(g) < min_pts) return(NULL)
      hull <- concaveman(st_coordinates(g), concavity = concavity)
      st_sf(cluster  = g$cluster[1],
            n_points = nrow(g),
            geometry = st_sfc(st_polygon(list(hull)), crs = st_crs(pts_sf)))
    }) |>
    Filter(Negate(is.null), x = _) |>
    do.call(rbind, args = _)

  polys$area_km2 <- as.numeric(st_area(polys)) / 1e6
  polys$density  <- polys$n_points / polys$area_km2
  polys
}

polys_L <- make_cluster_polys(poi_living, db_living$cluster)
polys_P <- make_cluster_polys(poi_production, db_production$cluster)
polys_E <- make_cluster_polys(poi_ecological, db_ecological$cluster)

assign_district <- function(polys, warsaw_districts_sf) {
  inter <- st_intersection(polys, warsaw_districts_sf) |>
    mutate(part_area = as.numeric(st_area(geometry)))

  labels <- inter |>
    st_drop_geometry() |>
    group_by(cluster) |>
    mutate(share = part_area / sum(part_area)) |>
    arrange(desc(share), .by_group = TRUE) |>
    summarise(district = if (first(share) >= 0.85) first(name)
                         else if (n() == 1) first(name)
                         else paste(name[1], name[2], sep = "/"),
              .groups = "drop")

  polys |> left_join(labels, by = "cluster")
}

polys_L_named <- assign_district(polys_L, warsaw_districts_sf)
polys_P_named <- assign_district(polys_P, warsaw_districts_sf)
polys_E_named <- assign_district(polys_E, warsaw_districts_sf)

cluster_table <- function(polys, layer_name, top_n = 10) {
  polys |>
    st_drop_geometry() |>
    arrange(desc(n_points)) |>
    mutate(rank      = row_number(),
           share_pct = round(100 * n_points / sum(n_points), 1),
           area_km2  = round(area_km2, 2),
           density   = round(density, 0),
           layer     = layer_name) |>
    select(layer, rank, cluster, district, n_points, share_pct, area_km2, density) |>
    head(top_n)
}

top_L <- cluster_table(polys_L_named, "Living",     top_n = 10)
top_P <- cluster_table(polys_P_named, "Production", top_n = 10)
top_E <- cluster_table(polys_E_named, "Ecological", top_n = 12)  # wszystkie 12

top_all <- bind_rows(top_L, top_P, top_E)
top_all

U_L <- st_union(polys_L_named) |> st_make_valid()
U_P <- st_union(polys_P_named) |> st_make_valid()
U_E <- st_union(polys_E_named) |> st_make_valid()


area_in_district <- function(districts, U, label) {
  int <- st_intersection(districts, U)
  if (nrow(int) == 0)
    return(tibble(name = character(), area = numeric(), layer = character()))
  int$area <- as.numeric(st_area(int))
  int |>
    st_drop_geometry() |>
    group_by(name) |>
    summarise(area = sum(area), .groups = "drop") |>
    mutate(layer = label)
}

areas_long <- bind_rows(
  area_in_district(warsaw_districts_sf, U_L, "Living"),
  area_in_district(warsaw_districts_sf, U_P, "Production"),
  area_in_district(warsaw_districts_sf, U_E, "Ecological")
)

areas_wide <- areas_long |>
  pivot_wider(names_from = layer, values_from = area, values_fill = 0) |>
  mutate(dominant = case_when(
    Production >= Living & Production >= Ecological ~ "Production",
    Living     >= Production & Living     >= Ecological ~ "Living",
    Ecological >= Production & Ecological >= Living    ~ "Ecological",
    TRUE ~ NA_character_
  ))

districts_map <- warsaw_districts_sf |>
  left_join(areas_wide, by = "name") |>
  mutate(dominant = ifelse(is.na(dominant), "None", dominant))

ggplot(districts_map) +
  geom_sf(aes(fill = dominant), color = "white", linewidth = 0.4) +
  geom_sf_text(aes(label = name), size = 2, color = "white") +
  scale_fill_manual(
    values = c(
      "Living"     = "#21918c",
      "Production" = "#443983",
      "Ecological" = "#fde725",
      "None"       = "grey85"
    ),
    name = "Dominant PLE function"
  ) +
  labs(title = "Dominant PLE function per district") +
  theme_void()

The map above presents the dominant PLE (Production–Living–Ecological) function assigned to each district based on DBSCAN clustering results. Central districts, including Śródmieście, Wola, Ochota, Mokotów and Żoliborz, are predominantly classified as production-oriented. This aligns with the concentration of offices, services, and commercial activity in the city centre.

In contrast, peripheral districts such as Białołęka, Wawer, Ursynów, Wilanów, and Wesoła are dominated by the living function, reflecting their primarily residential character. What’s interesting: several districts located between the core and the outskirts (Praga-Południe, Targówek) are still classified as production-dominant, suggesting that inner-city industrial corridor is expanding.

Next we need to campare the results obtained in this study with the ith local spatial planning documents (Plan Ogólny Warszawy from 2023).

4.1 Comparison to General Warsaw Territorial Plan

#territorial plans
pog <- st_read("strefy_pog_wwa/strefy_pog.shp")

## Reading layer `strefy_pog' from data source 
##   `/Users/aleksandraengel/Library/Mobile Documents/com~apple~CloudDocs/uczelnia/2nd year/summer semester/Spatial ML/spatial final project/strefy_pog_wwa/strefy_pog.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 16014 features and 11 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 7489861 ymin: 5773790 xmax: 7518549 ymax: 5803862
## Projected CRS: ETRF2000-PL / CS2000/21

pog_2180 <- st_transform(pog, 2180)

#mapping layers
pog_ple <- pog_2180 %>%
  mutate(ple = case_when(
    symbol %in% c("SW","SJ")              ~ "Living",
    symbol %in% c("SU","SH","SP","SK","SI") ~ "Production",
    symbol %in% c("SO","SN","SC")          ~ "Ecological",
    TRUE                                    ~ NA_character_
  ))

#aggregation per districts
plan_per_district <- st_intersection(pog_ple, warsaw_districts_sf) %>%
  mutate(area_m2 = as.numeric(st_area(.))) %>%
  st_drop_geometry() %>%
  group_by(name, ple) %>%
  summarise(plan_area_km2 = sum(area_m2)/1e6, .groups="drop") %>%
  group_by(name) %>%
  mutate(plan_share = plan_area_km2 / sum(plan_area_km2)) %>%
  ungroup()

#my results aggregated per districts
dbscan_per_district <- bind_rows(
  st_intersection(polys_L_named, warsaw_districts_sf) %>%
    mutate(ple="Living",  area=as.numeric(st_area(.))),
  st_intersection(polys_P_named, warsaw_districts_sf) %>%
    mutate(ple="Production", area=as.numeric(st_area(.))),
  st_intersection(polys_E_named, warsaw_districts_sf) %>%
    mutate(ple="Ecological", area=as.numeric(st_area(.)))
) %>% st_drop_geometry() %>%
  group_by(name, ple) %>%
  summarise(dbscan_area_km2 = sum(area)/1e6, .groups="drop") %>%
  group_by(name) %>%
  mutate(dbscan_share = dbscan_area_km2 / sum(dbscan_area_km2)) %>%
  ungroup()

comparison <- plan_per_district %>%full_join(dbscan_per_district, by = c("name", "ple"))

#join for comparison table with missing data imputation: NA -> 0
comparison_clean <- comparison %>%
  complete(name, ple, fill = list(plan_area_km2 = 0, dbscan_area_km2 = 0)) %>%
  group_by(name) %>%
  mutate(
    plan_share   = plan_area_km2   / sum(plan_area_km2, na.rm = TRUE),
    dbscan_share = dbscan_area_km2 / sum(dbscan_area_km2, na.rm = TRUE),
    delta        = dbscan_share - plan_share,
    pp           = round(delta * 100, 1)
  ) %>%
  ungroup()

options(scipen = 999)
comparison_clean

ggplot(comparison_clean, aes(x = ple, y = reorder(name, pp), fill = pp)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%+.1f", pp)), size = 3) +
  scale_fill_gradient2(
    low = viridis::viridis(3, option = "D")[1],
    mid = "white",
    high = viridis::viridis(3, option = "D")[3],
    midpoint = 0,
    name = "Percentage point difference"
  ) +
  labs(
    title = "DBSCAN POI clustering vs Plan Ogólny Warszawy 2023",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(panel.grid = element_blank())

The heatmap above presents the percentage point difference between the share of each PLE category in DBSCAN clusters and the share assigned by Plan Ogólny per district. Positive values (yellow) mean that DBSCAN detects a given function as more dominant than the plan allocates, while negative values (purple) mean the opposite.

Three clear patterns emerge. The ecological layer is systematically underrepresented across all 18 districts, with deltas ranging from −7 p.p. in Ursus to −55 p.p. in Wesoła. This differences are a direct consequence of how the data is defined: the ecological layer in DBSCAN includes only point-based green POIs (parks, gardens, allotments, cemeteries), while the plan counts entire natural landscapes as a single zone category.

The living layer is strongly overrepresented in peripheral districts — Wesoła, Wawer, Rembertów, Wilanów, Włochy and Białołęka. Because PLE shares sum to 100% per district, the missing ecological share is redistributed across the remaining two layers, which can cause this outcome.

The production layer is overrepresented in central and left-bank districts — Bielany, Bemowo, Wola, Żoliborz, Mokotów and Praga-Północ. This reveals another difference between the two data sources: the plan regulates land use (a residential building with a ground-floor shop is formally classified as SW — a mixed residential zone), while POIs count the intensity of individual activities (the same building generates ten production points — shop, restaurant, hairdresser, clinic).

What’s interesting is that two districts stand out from these patterns. Włochy shows the opposite direction: the plan allocates large economic zones (Okęcie airport, logistics, warehouses), but DBSCAN does not detect them as production because such facilities generate very few POIs per unit of area. This is a case where the plan predicts a function that POIs cannot confirm, simply because the function itself is low-POI by nature.

5 Conclusions

This project replicated the methodology of Tu et al. (2022) for Warsaw, applying DBSCAN density-based clustering to POI data extracted from OpenStreetMap and classified into three layers: Production, Living and Ecological. The aim was to explore whether this technique, developed and validated for Wuhan, can meaningfully capture the spatial structure of a European capital, and how its results compare with the official spatial planning document (Plan Ogólny Warszawy 2023).

The first finding is methodological. The parameter configuration proposed in the original study, MinPts proportional to layer size, eps chosen from the k-NN elbow does not translate directly to Warsaw. This reflects a real structural difference of both cities.

The second finding concerns the spatial patterns themselves. The living layer produces a polycentric structure with 66 clusters distributed across both sides of the Vistula, confirming that Warsaw’s residential function is decentralised. The production layer produces a more monocentric pattern, with one dominant corridor stretching from Służewiec through Mokotów and the city centre to Wola — the main service and office axis of the city. The ecological layer produces only 12 clusters with around 88% noise, which reflects the dispersed character of Warsaw’s green infrastructure: only the largest and most developed green complexes (Łazienki, Pole Mokotowskie etc.) survive as clusters.

The third finding comes from the comparison with the Plan Ogólny 2023. Several differences can be observed: the ecological layer is underrepresented across all districts, the living layer is more visible in peripheral areas, and the production layer is more concentrated in central and left-bank districts.These differences should not be treated as errors, but rather as a result of what each dataset represents. The spatial plan describes the intended or planned land use, while POI data reflects the actual intensity of human activity in space.

This experiment shows that DBSCAN clustering on POI data can be a useful diagnostic tool for Warsaw: it identifies the city’s polycentric residential structure, the dominant central service corridor, and the small number of major urban green complexes. At the same time the method has visible limitations. It is sensitive to parameter choices, it cannot detect low-POI functions such as logistics or airports, and it systematically underestimates natural landscapes that are not represented as individual points in OpenStreetMap.

For further analysis, the study could be extended by integrating additional spatial data sources, including land-use layers from OpenStreetMap and official Polish datasets such as the Topographic Objects Database (BDOT10k), which contains detailed information on building functions and classifications.

Sources:

materials provided during classes
OpenStreetMap (OSM) data
ArcGisExpert, Plan Ogólny Warszawy, shapefiles
text editing, debugging and corrections were assisted by ChatGPT