Introduction

The primary objective of this project is to identify distinct forest complexes within the Ursynów district of Warsaw using spatial clustering algorithms (DBSCAN, HDBSCAN). The study aims to separate continuous forest patches, such as the Kabaty Forest, from scattered vegetation and urban noise.

The analysis is based on high-resolution raster data obtained from the Copernicus Land Monitoring Service (Tree Cover Density 2022). This product provides tree cover density values ranging from 0 to 100% at a spatial resolution of 10 meters, enabling precise detection of vegetation boundaries.

Data source: https://land.copernicus.eu/en/products/high-resolution-layer-forests-and-tree-cover/tree-cover-density-2022-raster-10-m-100-m-europe-yearly#download.

Data preparation

library(terra)
library(dbscan)
library(tmap)
library(tmaptools)
library(OpenStreetMap)
library(sf)
library(factoextra)
library(cluster)
library(clustertend)
library(fpc)

The source data is distributed in \(1^\circ \times 1^\circ\) tiles. Since the Warsaw metropolitan area straddles the boundary between the 20th and 21st meridians, it would be necessary to mosaic adjacent tiles before cropping the dataset to the specific region. However, since only the Ursynów district and its surroundings was analyzed, this step could be skipped, but it was included in case one would want to analyze another part of Warsaw.

A map projection that converts the coordinates into meters was used to simplify the analysis by enabling to define parameters (\(\epsilon\)) directly in meters rather than radians.

files <- list.files(path = "data", pattern = ".tif$", full.names = TRUE)
map_full <- mosaic(sprc(files))
## |---------|---------|---------|---------|=========================================                                          
#warsaw <- vect(ext(20.30, 21.40, 52.00, 52.45), crs = "EPSG:4326")
warsaw <- vect(ext(21.02, 21.12, 52.10, 52.16), crs = "EPSG:4326")
warsaw_projected <- project(warsaw, crs(map_full))
map_warsaw <- crop(map_full, warsaw_projected)
plot(map_warsaw, main = "Warsaw - Ursynów")

To distinguish forest clusters from general vegetation, a density cutoff point must be established. While standard definitions often accept values as low as 10-30% (for example databank), this study implements a stricter threshold of 70%. This approach aims to filter out transitional vegetation and isolate high-density forest structures.

points <- as.data.frame(map_warsaw, xy = TRUE, na.rm = TRUE)
colnames(points)[3] <- "density"
forest_df <- points[points$density > 70, ]
head(forest_df)
##          x       y density
## 7  5071875 3285795      74
## 8  5071885 3285795      77
## 9  5071895 3285795      72
## 10 5071905 3285795      75
## 11 5071915 3285795      76
## 12 5071925 3285795      76
cords<- forest_df[, c("x", "y")]

Clustering

Before applying any clustering algorithms, it is crucial to verify whether the dataset contains meaningful clusters or is merely a random distribution of points. For this purpose, the Hopkins statistic (\(H\)) was calculated. This measure quantifies the probability that a given dataset is generated by a uniform data distribution.

\(H \approx 0.5\): Indicates random data (no meaningful clusters).

\(H \to 1.0\): Indicates a high clustering tendency.

set.seed(12345)
forest_sample <- cords[sample(nrow(cords), 1000), ]
h <- hopkins(forest_sample, n = 999)
1 - h$H
## [1] 0.7645541
get_clust_tendency(forest_sample, n = 999, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.7628387
## 
## $plot

The calculated statistic is about 0.75 (well above the 0.5), we can reject the null hypothesis of uniform distribution and proceed with clustering.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise points into dense regions separated by regions of lower density and it relies on two key parameters:

  • Epsilon (\(\epsilon\)): The maximum distance between two samples for one to be considered as in the neighborhood of the other.

  • MinPts: The number of samples in a neighborhood for a point to be considered as a core point.

set.seed(123)
db <- dbscan::dbscan(cords, eps = 62, minPts = 80)
cat("Number of db clusters: ",max(db$cluster))
## Number of db clusters:  53
forest_df$cluster <- as.factor(db$cluster)
forest_sf <- st_as_sf(forest_df, 
                      coords = c("x", "y"), 
                      crs = crs(map_warsaw))

map_all <- tm_shape(forest_sf) +
  tm_dots(col = "cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.05, 
          title = "Cluster ID") +
  tm_layout(main.title = "Raw output", 
            legend.show = FALSE)

forest_clean <- forest_sf[forest_sf$cluster != "0", ]

map_clean <- tm_shape(forest_clean) +
  tm_dots(col = "cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.05) +
  tm_layout(main.title = "Noise-free output", 
            legend.show = FALSE)

tmap_mode("plot") #"view" for interactive map
tmap_arrange(map_all, map_clean, ncol = 2)

tm_shape(map_warsaw) +
  tm_raster(palette = "-Greys", 
            title = "Tree Density", 
            style = "cont",
            legend.show = FALSE,   
            alpha = 0.4) +         
tm_shape(forest_clean) +
  tm_dots(col = "cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.1,              
          title = "Forest Clusters") +
  tm_layout(main.title = "Identified Forests vs real data", 
            )

basemap <- read_osm(forest_clean, type = "osm", zoom = NULL, mergeTiles = TRUE)
final_map <- tm_shape(basemap) +
  tm_rgb() + 
  tm_shape(forest_clean) +
  tm_dots(col = "cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.08,
          alpha = 0.7,
          title = "Forest Clusters") +
  
  tm_layout(main.title = "Identified Forests vs Urban Infrastructure",
            main.title.position = "center",
            main.title.size = 1.1,
            legend.bg.color = "white",
            legend.bg.alpha = 0.8,
            frame = TRUE)
final_map

Silhouette and Calinski-Harabasz

set.seed(123)
valid_points_idx <- which(db$cluster != 0) 
idx_sample <- sample(valid_points_idx, 3000)
d_matrix <- dist(cords[idx_sample, ])
sil <- silhouette(db$cluster[idx_sample], d_matrix)
plot(sil, border = NA, main = "Silhouette Plot")

set.seed(123)
ch <- round(calinhara(d_matrix, db$cluster[idx_sample]),digits=4)
cat("Calinski-Harabasz:", ch)
## Calinski-Harabasz: 111.4849

It is important to acknowledge that classical validation indices (for example Silhouette) are biased towards spherical/convex clusters. In the context of geospatial data, where objects often exhibit elongated and irregular structures, lower validation scores are a natural occurrence. The result do not imply a deficiency in the clustering model. Rather, it highlight the non-convex nature of the data, for which DBSCAN was specifically chosen due to its ability to handle clusters of arbitrary shapes.

The negative Silhouette index value for the largest cluster (Kabaty Forest) results from the close proximity of other forest complexes (for example Natolin park or many smaller ones). Although these objects are physically separated by road infrastructure (as correctly detected by DBSCAN), their small Euclidean distance causes the Silhouette measure to interpret them as a single object, penalizing the division. However, even slightly increase in epsilon will result in various forest complexes merging (for example Kabaty Forest and Powsin area). Increasing the required points within the area defined by \(\epsilon\) is also not a solution, as it would lose important clusters, such as the forest complex near the Vistula River. Therefore, this is a compromise I had to accept.

HDBSCAN

While DBSCAN is effective, it assumes a constant density across all clusters (a single global \(\epsilon\)). This can be problematic in heterogeneous environments where dense nature reserves coexist with sparser urban parks. Hierarchical Density-Based Spatial Clustering of Applications with Noise addresses this limitation by clustering over a range of \(\epsilon\) values and extracts the most stable clusters from the hierarchy tree, allowing for varying densities.

However, HDBSCAN is significantly more memory-intensive than standard DBSCAN, as it requires constructing a Minimum Spanning Tree for the entire dataset. With the original high-resolution data (10m pixels), the memory requirement exceeded by far hardware capabilities of the workstation used. Therefore, the data was aggregated by a factor of 4 (reducing resolution from 10m to approx. 40m). This downsampling enabled the successful execution of the algorithm at the cost of losing some boundary details.

map_l <- aggregate(map_warsaw, fact = 4, fun = mean)
points <- as.data.frame(map_l, xy = TRUE, na.rm = TRUE)
colnames(points)[3] <- "density"
forest_df <- points[points$density > 70, ]
cords<- forest_df[, c("x", "y")]
set.seed(123)
hdb <- dbscan::hdbscan(cords, minPts = 20)
cat("Number of hdb clusters: ",max(hdb$cluster))
## Number of hdb clusters:  11
forest_df$hdb_cluster <- as.factor(hdb$cluster)
forest_sf <- st_as_sf(forest_df, 
                      coords = c("x", "y"), 
                      crs = crs(map_warsaw))

map_all <- tm_shape(forest_sf) +
  tm_dots(col = "hdb_cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.05, 
          title = "Cluster ID") +
  tm_layout(main.title = "Raw output", 
            legend.show = FALSE)

forest_clean <- forest_sf[forest_sf$hdb_cluster != "0", ]

map_clean <- tm_shape(forest_clean) +
  tm_dots(col = "hdb_cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.05) +
  tm_layout(main.title = "Noise-free output", 
            legend.show = FALSE)

tmap_mode("plot") #view for interactive map
tmap_arrange(map_all, map_clean, ncol = 2)

tm_shape(map_warsaw) +
  tm_raster(palette = "-Greys", 
            title = "Tree Density", 
            style = "cont",
            legend.show = FALSE,   
            alpha = 0.4) +         
tm_shape(forest_clean) +
  tm_dots(col = "hdb_cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.1,              
          title = "Forest Clusters") +
  tm_layout(main.title = "HDBSCAN Clustering Results vs real data", 
            )

basemap <- read_osm(forest_clean, type = "osm", zoom = NULL, mergeTiles = TRUE)
final_map <- tm_shape(basemap) +
  tm_rgb() + 
  tm_shape(forest_clean) +
  tm_dots(col = "hdb_cluster", 
          style = "cat", 
          palette = "Set1", 
          size = 0.18,
          alpha = 0.9,
          title = "Forest Clusters") +
  
  tm_layout(main.title = "HDB Identified Forests vs Urban Infrastructure",
            main.title.position = "center",
            main.title.size = 1.1,
            legend.bg.color = "white",
            legend.bg.alpha = 0.8,
            frame = TRUE)
final_map

The visual output of HDBSCAN presents a much more generalized view of the forest structure. The algorithm identified a smaller number of clusters (11 compared to 53 in DBSCAN). The boundaries of the forests are noticeably thicker due to the reduced resolution of the data. However, the major ecological units like Kabaty Forest, Powsin area, Natolin Park, and the forest patch near Vistula river were correctly identified and separated.

Silhouette and Calinski-Harabasz

set.seed(123)
valid_points_idx <- which(hdb$cluster != 0) 
idx_sample <- sample(valid_points_idx, 3000)
d_matrix <- dist(cords[idx_sample, ])
sil <- silhouette(hdb$cluster[idx_sample], d_matrix)
plot(sil, border = NA, main = "Silhouette Plot")

set.seed(123)
ch <- round(calinhara(d_matrix, hdb$cluster[idx_sample]),digits=4)
cat("Calinski-Harabasz:", ch)
## Calinski-Harabasz: 509.9422

The validation metrics for HDBSCAN show a significant improvement over the standard DBSCAN run. Silhouette width is not huge, but positive (0.2) and Calinski-Harabasz is also noticeably higher. It was expected due to lower amount of clusters in close neighborhood of Kabaty Forest and lowering the resolution the data.

Conclusion

The study revealed a fundamental trade-off between spatial precision and algorithmic robustness. While DBSCAN applied to high-resolution data excelled at delineating exact forest boundaries, its reliance on a global density threshold proved to be a significant limitation, necessitating manual tuning to prevent the artificial merging of adjacent ecosystems. In contrast, HDBSCAN demonstrated superior stability by automatically adapting to local density variations. Ultimately, while DBSCAN remains preferable for precise, local-scale mapping where boundaries must be exact, HDBSCAN offers a more robust solution for regional analysis.