1 Abstract

Housing markets in large metropolitan areas exhibit strong spatial heterogeneity, with property values influenced not only by structural dwelling characteristics but also by accessibility to urban infrastructure and geographic location. Understanding these spatial relationships is essential for the development of accurate real estate valuation models and spatial decision-support systems.

This study analyses apartment prices in Warsaw, Poland, using a combination of spatial data processing, exploratory spatial analysis, supervised machine learning, and geostatistical interpolation. The analysis is based on apartment listing data obtained from the Kaggle dataset Apartment Prices in Poland, restricted to properties located within the Warsaw metropolitan area. After integrating and cleaning observations collected across multiple time periods, the dataset was transformed into a spatial object and enriched with geospatial information derived from OpenStreetMap (OSM).

The methodological contribution of the project consists of constructing a spatial data integration and modelling pipeline that combines housing-market information with urban infrastructure layers. Accessibility variables were generated by calculating distances to subway stations, tram networks, and public parks. Exploratory spatial analysis was conducted through mapping, district-level aggregation, correlation analysis, and unsupervised k-means clustering in order to identify potential spatial submarkets within the city.

The second part of the project applied supervised Spatial Machine Learning to predict apartment prices per square metre. A Spatial Random Forest model was trained using both structural variables and spatial predictors, including projected coordinates, accessibility measures, and spatial-market clusters. The model achieved strong predictive performance, with an R-squared value of 0.835 on the test set. Variable-importance analysis showed that spatial-market segmentation, building year, geographic coordinates, and subway accessibility were among the most important predictors.

Finally, ordinary kriging was used to transform point-level apartment listings into a continuous spatial price surface for Warsaw. The results reveal clear geographic patterns in the housing market, with higher predicted prices concentrated mainly in central and well-connected areas, and lower values more visible toward peripheral zones. Overall, the project confirms that Warsaw apartment prices are shaped by both structural property characteristics and spatial context, and that combining GIS methods, machine learning, and geostatistical interpolation provides a useful framework for analysing urban housing markets.

2 Introduction

Urban housing markets are highly heterogenous, with property valuation heavily influenced by spatial factors. Traditional pricing models often assume spatial independence, thereby ignoring the localized impacts of neighborhood amenities and urban infrastructure. The objective of this study is to systematically capture these spatial structures within the Warsaw apartment market.

In this paper, we propose a spatial data integration pipeline that transforms raw, point-level apartment listings into a robust spatial dataset. We enrich the baseline property attributes (such as floor area and room count) with geospatial accessibility metrics computed against OpenStreetMap (OSM) data. This allows us to quantify the proximity to key urban amenities, including subway stations, tram networks, and public parks.

The empirical analysis is based on a dataset of apartment listings in Poland (sourced from Kaggle), strictly filtered to the Warsaw metropolitan area. By constructing a localized distance-decay profile for each apartment and mapping regional price distributions, we establish a comprehensive baseline for advanced Spatial Machine Learning techniques.

The primary research question addressed in this first stage is whether spatial accessibility and urban infrastructure can be systematically integrated into a housing dataset to capture meaningful geographic variation in apartment prices across Warsaw.

3 Review of the Dataset

The listing data come from the Kaggle dataset Apartment Prices in Poland. In this study, we focus on the local sale files apartments_pl_YYYY_MM.csv from the data directory, restricting the analysis strictly to Warsaw (city == “Warszawa”).

The primary columns of interest that form the structural and spatial baseline of our analysis include:

  • id: Unique identifier for each listing
  • price: Total offer price in PLN
  • squareMeters: Total floor area of the apartment
  • rooms / floor / floorCount: Building and structural attributes
  • latitude / longitude: Geographical coordinates in WGS84 format

3.1 Data Importing and Cleaning

The initial data processing phase involves sequentially merging the monthly historical files. To prevent the artificial inflation of spatial density caused by prolonged market exposure of single properties, the pipeline isolates the most recent occurrence of each unique listing ID.

data_files <- list.files(
  "data",
  pattern = "^apartments_pl_[0-9]{4}_[0-9]{2}[.]csv$",
  full.names = TRUE
)

stopifnot(length(data_files) > 0)

read_apartment_file <- function(path) {
  file_month <- basename(path)
  file_month <- sub("^apartments_pl_", "", file_month)
  file_month <- sub("[.]csv$", "", file_month)
  file_month <- paste0(gsub("_", "-", file_month), "-01")

  readr::read_csv(path, show_col_types = FALSE) |>
    mutate(
      source_file = basename(path),
      month_date = as.Date(file_month)
    )
}

apartments_raw <- purrr::map_dfr(data_files, read_apartment_file)

warsaw_raw <- apartments_raw |>
  filter(tolower(city) == "warszawa")

tibble(
  stage = c("All sale listings in local files", "Warsaw sale listings"),
  n_records = c(nrow(apartments_raw), nrow(warsaw_raw))
) |>
  knitr::kable(format.args = list(big.mark = " "), caption = "Table 1: Data Import Summary")
Table 1: Data Import Summary
stage n_records
All sale listings in local files 195 568
Warsaw sale listings 59 246

Table 1 outlines the initial volume of data retrieved from the source files. The filtering process successfully isolates the subset of observations pertaining exclusively to the Warsaw housing market, providing a substantial baseline sample for subsequent spatial transformations.

To ensure the statistical validity of the spatial models, conservative quality filters are strictly applied. This involves defining a bounding box to exclude erroneously geocoded coordinates, limiting floor areas to a realistic spectrum (15–250 m²), and symmetrically trimming the extreme 1% tails of the price-per-square-meter distribution to mitigate the distortive effects of outliers and data entry errors.

warsaw_candidates <- warsaw_raw |>
  mutate(
    price_pln = as.numeric(price),
    area_m2 = as.numeric(squareMeters),
    rooms = as.numeric(rooms),
    floor = as.numeric(floor),
    floor_count = as.numeric(floorCount),
    build_year = as.numeric(buildYear),
    latitude = as.numeric(latitude),
    longitude = as.numeric(longitude),
    price_per_m2_pln = price_pln / area_m2
  ) |>
  filter(
    !is.na(id),
    !is.na(price_pln),
    !is.na(area_m2),
    !is.na(latitude),
    !is.na(longitude),
    price_pln > 0,
    area_m2 > 0,
    dplyr::between(latitude, 52.05, 52.40),
    dplyr::between(longitude, 20.75, 21.35),
    dplyr::between(area_m2, 15, 250),
    dplyr::between(price_pln, 100000, 10000000)
  )

price_m2_limits <- quantile(
  warsaw_candidates$price_per_m2_pln,
  probs = c(0.01, 0.99),
  na.rm = TRUE
)

warsaw_clean <- warsaw_candidates |>
  filter(
    dplyr::between(
      price_per_m2_pln,
      unname(price_m2_limits[1]),
      unname(price_m2_limits[2])
    )
  ) |>
  arrange(id, desc(month_date)) |>
  distinct(id, .keep_all = TRUE) |>
  mutate(
    target_price_pln = price_pln,
    target_price_per_m2_pln = price_per_m2_pln
  )

cleaning_summary <- tibble(
  stage = c(
    "Warsaw listings before cleaning",
    "After quality filters",
    "After removing repeated listing ids"
  ),
  n_records = c(
    nrow(warsaw_raw),
    nrow(warsaw_candidates),
    nrow(warsaw_clean)
  )
)

cleaning_summary |>
  knitr::kable(format.args = list(big.mark = " "), caption = "Table 2: Data Cleaning Stages")
Table 2: Data Cleaning Stages
stage n_records
Warsaw listings before cleaning 59 246
After quality filters 59 246
After removing repeated listing ids 31 341

As demonstrated in Table 2, the data preprocessing pipeline methodically reduced the dataset to its most robust form. The attrition rate indicates that while the original dataset contained a moderate volume of noise and duplicated records across temporal periods, the final consolidated dataset retains a highly representative and statistically sound sample size of unique market offerings.

warsaw_clean |>
  summarise(
    n_listings = n(),
    first_month = min(month_date),
    last_month = max(month_date),
    median_price_pln = median(price_pln),
    median_price_per_m2_pln = median(price_per_m2_pln),
    median_area_m2 = median(area_m2)
  ) |>
  knitr::kable(
    digits = 0,
    format.args = list(big.mark = " "),
    caption = "Table 3: Descriptive Statistics of Cleaned Dataset"
  )
Table 3: Descriptive Statistics of Cleaned Dataset
n_listings first_month last_month median_price_pln median_price_per_m2_pln median_area_m2
31 341 2023-08-01 2024-06-01 850 000 16 772 53

Table 3 summarizes the principal characteristics of the final analytical dataset. Considerable variation is observed across both structural apartment attributes and housing prices, indicating a heterogeneous market environment. The broad range of property sizes and price levels suggests that multiple housing submarkets may coexist within Warsaw, thereby motivating further spatial and cluster-based analyses.

4 Geospatial Processing & Infrastructure Integration

Accurate modeling of spatial dynamics necessitates a rigorous mathematical foundation for geography. Raw geographic coordinates must be appropriately projected to allow for valid Euclidean distance computations, and the dataset must be merged with contextual urban infrastructure.

4.1 Spatial CRS Projection

The original geographical coordinates (longitude and latitude) are provided in the unprojected WGS84 system (EPSG:4326). To facilitate accurate metric-based distance calculations, the coordinate reference system (CRS) is mathematically transformed to EPSG:2180 (ETRS89 / Poland CS92), a Cartesian projection standard specifically optimized for the territory of Poland.

warsaw_sf <- warsaw_clean |>
  st_as_sf(
    coords = c("longitude", "latitude"),
    crs = crs_geo,
    remove = FALSE
  ) |>
  st_transform(crs_pl)

warsaw_study_area <- warsaw_sf |>
  st_bbox() |>
  st_as_sfc() |>
  st_buffer(1500)

warsaw_bbox_geo <- warsaw_study_area |>
  st_transform(crs_geo) |>
  st_bbox()
ggplot(warsaw_sf) +
  geom_sf(aes(color = price_per_m2_pln), alpha = 0.35, size = 0.35) +
  scale_color_viridis_c(
    option = "C",
    labels = scales::label_number(big.mark = " ", suffix = " PLN"),
    name = "Price per m²"
  ) +
  coord_sf(datum = NA) +
  labs(
    title = "Figure 1. Spatial distribution of apartment listings in Warsaw",
    subtitle = "Cleaned sale listings; colour indicates the offer price per square metre",
    x = NULL,
    y = NULL
  )

Figure 1 reveals a pronounced spatial gradient in apartment prices across Warsaw. Higher unit prices appear concentrated in central districts and locations with strong transport accessibility, whereas peripheral areas generally exhibit lower price levels. This pattern is consistent with urban economic theory, which predicts that accessibility and proximity to employment centers increase residential property values. The observed clustering further justifies the inclusion of explicit spatial variables in subsequent analyses.

4.2 OpenStreetMap Integration

A major objective of this study is to quantify the relationship between housing prices and urban accessibility. To achieve this, spatial infrastructure layers were obtained from OpenStreetMap and integrated with apartment locations. Three categories of amenities were selected because of their relevance to residential attractiveness and daily mobility:

  1. subway stations and subway entrances,
  2. tram lines,
  3. parks.

In addition, district boundaries were incorporated to support spatial aggregation and descriptive analyses. These geographic layers provide the contextual information necessary to characterize the local environment surrounding each apartment listing.

osm_map <- ggplot()

if (!is.null(parks)) {
  osm_map <- osm_map +
    geom_sf(data = parks, fill = "#A7C957", color = NA, alpha = 0.35)
}

if (!is.null(tram_lines)) {
  osm_map <- osm_map +
    geom_sf(data = tram_lines, color = "#E76F51", linewidth = 0.45, alpha = 0.75)
}

if (!is.null(warsaw_districts)) {
  osm_map <- osm_map +
    geom_sf(data = warsaw_districts, fill = NA, color = "grey35", linewidth = 0.35)
}

osm_map +
  geom_sf(data = warsaw_sf, color = "grey20", alpha = 0.10, size = 0.20) +
  geom_sf(data = subway_stations, color = "#1D4ED8", size = 1.25) +
  coord_sf(datum = NA) +
  labs(
    title = "Figure 2. Apartment listings and OpenStreetMap infrastructure",
    subtitle = "Green: parks; orange: tram lines; blue: subway stations and entrances",
    x = NULL,
    y = NULL
  )

Figure 2 confirms that the downloaded infrastructure layers adequately cover the study area and align with the spatial distribution of apartment listings. The dense concentration of transport infrastructure in central districts contrasts with sparser networks in peripheral areas, illustrating the unequal spatial distribution of urban amenities across Warsaw. This pattern provides an initial indication that accessibility may contribute to observed differences in housing values.

4.3 Distance-Based Spatial Features

To translate visual geography into empirical, machine-readable variables, we compute the nearest Euclidean distance from each property to the closest subway station, tram line, and public park. Because the underlying coordinate system (EPSG:2180) is metric, the resulting measurements are strictly defined in meters. While Euclidean measurements represent linear proximity rather than exact network travel times, they serve as highly reliable and objective proxies for localized urban accessibility.

nearest_distance_m <- function(from_sf, to_sf) {
  if (is.null(to_sf) || nrow(to_sf) == 0) {
    return(rep(NA_real_, nrow(from_sf)))
  }

  nearest_id <- st_nearest_feature(from_sf, to_sf)

  as.numeric(
    st_distance(
      st_geometry(from_sf),
      st_geometry(to_sf)[nearest_id],
      by_element = TRUE
    )
  )
}

warsaw_enriched <- warsaw_sf |>
  mutate(
    dist_subway_m = nearest_distance_m(warsaw_sf, subway_stations),
    dist_tram_m = nearest_distance_m(warsaw_sf, tram_lines),
    dist_park_m = nearest_distance_m(warsaw_sf, parks)
  )

if (!is.null(warsaw_districts) && nrow(warsaw_districts) > 0) {
  warsaw_enriched <- warsaw_enriched |>
    st_join(warsaw_districts |> select(district), join = st_within, left = TRUE)
} else {
  warsaw_enriched$district <- NA_character_
}

warsaw_enriched |>
  st_drop_geometry() |>
  summarise(
    median_dist_subway_m = median(dist_subway_m, na.rm = TRUE),
    median_dist_tram_m = median(dist_tram_m, na.rm = TRUE),
    median_dist_park_m = median(dist_park_m, na.rm = TRUE),
    p25_dist_subway_m = quantile(dist_subway_m, 0.25, na.rm = TRUE),
    p75_dist_subway_m = quantile(dist_subway_m, 0.75, na.rm = TRUE)
  ) |>
  knitr::kable(digits = 0, format.args = list(big.mark = " "), caption = "Table 4: Summary of Proximity Metrics to Urban Infrastructure")
Table 4: Summary of Proximity Metrics to Urban Infrastructure
median_dist_subway_m median_dist_tram_m median_dist_park_m p25_dist_subway_m p75_dist_subway_m
760 434 290 408 1 738

Table 4 aggregates the calculated proximity metrics across the entire portfolio of analyzed listings. The median distances confirm that surface-level infrastructure (trams and parks) is widely distributed and highly accessible to the average listing, whereas the subway network—characterized by larger median distances and an expanded interquartile range (IQR)—represents a scarcer premium amenity.

distance_long <- warsaw_enriched |>
  st_drop_geometry() |>
  select(dist_subway_m, dist_tram_m, dist_park_m) |>
  tidyr::pivot_longer(
    everything(),
    names_to = "feature",
    values_to = "distance_m"
  ) |>
  mutate(
    feature = recode(
      feature,
      dist_subway_m = "Subway",
      dist_tram_m = "Tram",
      dist_park_m = "Park"
    )
  )

ggplot(distance_long, aes(x = distance_m / 1000)) +
  geom_histogram(bins = 35, fill = "#457B9D", color = "white") +
  facet_wrap(~feature, scales = "free_y") +
  scale_x_continuous(labels = scales::label_number(decimal.mark = ".")) +
  labs(
    title = "Figure 3. Distance from listings to selected urban infrastructure",
    subtitle = "Distribution of distances to subway stations, tram lines, and parks",
    x = "Distance [km]",
    y = "Number of listings"
  )

Figure 3 illustrates the accessibility profile of apartment listings across Warsaw. The distributions reveal that most properties are located relatively close to tram infrastructure and public parks, whereas distances to subway stations vary substantially. This reflects the limited geographic extent of the metro network compared with the more extensive tram system. The observed variability suggests that accessibility effects are unlikely to be uniform across the city and may contribute to localized differences in housing prices.

4.4 District-Level Price Distribution

While point-level models are superior for predictive accuracy, spatially aggregating data to official administrative boundaries provides essential macro-level context. A district-level examination identifies broader jurisdictional discrepancies in valuation and bridges the gap between individual property prices and macro-economic urban trends.

district_price_stats <- warsaw_enriched |>
  st_drop_geometry() |>
  filter(!is.na(district)) |>
  group_by(district) |>
  summarise(
    n_listings = n(),
    median_price_pln = median(price_pln, na.rm = TRUE),
    median_price_per_m2_pln = median(price_per_m2_pln, na.rm = TRUE),
    median_area_m2 = median(area_m2, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(median_price_per_m2_pln))

if (nrow(district_price_stats) > 0) {
  district_price_stats |>
  knitr::kable(digits = 0, format.args = list(big.mark = " "), caption = "Table 5: District-Level Housing Market Statistics")
} else {
  message("Listings could not be assigned to districts from the OSM boundary layer.")
}
Table 5: District-Level Housing Market Statistics
district n_listings median_price_pln median_price_per_m2_pln median_area_m2
Śródmieście 2 625 1 015 000 21 307 50
WilanĂłw 1 324 1 199 500 19 131 64
Żoliborz 1 054 949 000 19 000 51
Wola 3 442 850 000 18 699 48
Ochota 1 580 795 000 17 778 47
MokotĂłw 4 871 900 000 17 778 55
Praga-Północ 970 769 000 16 537 48
UrsynĂłw 2 202 951 500 16 329 63
Włochy 922 800 000 15 979 53
Praga-Południe 3 045 799 000 15 965 52
Bielany 1 890 765 000 15 784 50
Bemowo 1 801 860 000 15 504 60
Ursus 1 036 754 000 14 443 54
TargĂłwek 1 464 719 450 14 103 53
Wawer 561 846 000 13 649 60
Białołęka 2 226 715 000 13 350 54
RembertĂłw 176 694 500 12 869 55
Wesoła 134 770 000 12 721 60

The table reveals a clear price hierarchy between Warsaw’s districts. Śródmieście (City Center) leads the way with a median unit price of PLN 21,307/m², confirming the city center’s dominant position as a premium location, despite the relatively small number of listings (2,625) and modestly sized apartments (median 50 m²). Wilanów (PLN 19,131/m²) and Żoliborz (PLN 19,000/m²) are close behind, maintaining high prices despite their distance from the city center thanks to their prestigious development and good transport accessibility.

The middle of the rankings is made up of the western-central districts of Wola, Ochota, and Mokotów, with prices ranging from PLN 17,000 to PLN 18,700/m². Despite its historically working-class character, Wola has seen its price rise thanks to intensive revitalization and the expansion of the M2 metro line.

The lower part of the table is dominated by the peripheral districts of Białołęka (PLN 13,350/m²), Rembertów (PLN 12,869/m²), and Wesoła (PLN 12,721/m²). These districts are characterized by poorer transport accessibility and a predominance of single-family housing. Wesoła also has the lowest number of listings (134), which limits the representativeness of the statistics for this district.

The disparity between the total price and the price per square meter is noteworthy – Wilanów has the highest median absolute price (PLN 1,199,500) for a square footage of 64 square meters, indicating a market for large, luxury apartments, different in profile from Śródmieście, where the higher cost per square meter applies to smaller units.

if (!is.null(warsaw_districts) && nrow(district_price_stats) > 0) {
  districts_prices <- warsaw_districts |>
    left_join(district_price_stats, by = "district")

  ggplot(districts_prices) +
    geom_sf(aes(fill = median_price_per_m2_pln), color = "white", linewidth = 0.35) +
    geom_sf(data = warsaw_enriched, color = "black", alpha = 0.05, size = 0.15) +
    scale_fill_viridis_c(
      option = "C",
      labels = scales::label_number(big.mark = " ", suffix = " PLN"),
      name = "Median price per m²"
    ) +
    coord_sf(datum = NA) +
      labs(
        title = "Figure 4. Median apartment price per square metre by Warsaw district",
        subtitle = "District-level aggregation based on cleaned apartment listings",
        x = NULL,
        y = NULL
      )
}

The choropleth map spatially confirms the pattern visible in the table. The color gradient creates a distinct concentric pattern – the darkest (most expensive) colors are concentrated in the center and along the north-south axis, which coincides with the M1 metro line. Śródmieście, Żoliborz, and Wola form a coherent road corridor on the left bank of the Vistula.

Wilanów in the south stands out as an expensive enclave detached from the center, which is visible on the map as a dark spot surrounded by lighter districts – the effect of premium housing developments around Lake Wilanów. The Praga district (Praga-Północ, Praga-Południe, Targówek) is clearly lighter, confirming the persistent premium for locations on the left bank of the Vistula. The eastern and southeastern outskirts (Wawer, Wesoła, Rembertów, Białołęka) form uniformly bright areas, signaling the lowest prices in the city.

The map visually justifies the inclusion of spatial variables in the model - district affiliation alone carries significant information about price, and the pattern is too regular to be random.

4.5 Feature Correlation Analysis

Prior to executing spatial clustering or predictive modeling, it is imperative to examine the linear associations between our engineered features and target economic variables. A robust correlation analysis validates the hypothesized distance-decay functions and pre-emptively diagnoses potential multicollinearity among predictor variables.

To perform this analysis, we compute the Pearson correlation coefficient (\(r\)) across the main numerical variables and visualize the matrix using the corrplot package.

correlation_data <- warsaw_enriched |>
  st_drop_geometry() |>
  select(
    price_per_m2_pln,
    price_pln,
    area_m2,
    rooms,
    floor,
    floor_count,
    build_year,
    dist_subway_m,
    dist_tram_m,
    dist_park_m
  ) |>
  na.omit()

correlation_matrix <- cor(
  correlation_data,
  method = "pearson"
)

corrplot::corrplot(
  correlation_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  number.cex = 0.6,
  tl.col = "black",
  tl.cex = 0.8,
  diag = FALSE,
  title = "Figure 5. Pearson Correlation Matrix of Structural and Spatial Variables",
  mar = c(0, 0, 2, 0)
)

Figure 5 provides a visual overview of the relationships among housing characteristics, accessibility measures, and apartment prices. The matrix confirms that no single variable fully explains price variation, indicating that housing values emerge from the interaction of multiple structural and spatial factors. This observation further motivates the use of multivariate modelling approaches in later stages of the analysis.

4.6 Methodology Overview: K-Means in Spatial Econometrics

Before formally delineating submarkets within the Warsaw housing market, it is essential to establish the theoretical framework of our unsupervised machine learning approach. Real estate markets do not distribute homogeneously; rather, they fragment into localized economic zones—urban submarkets.

To algorithmically identify these zones independent of arbitrary administrative borders, we implement a spatially-weighted unsupervised partitioning algorithm.

4.6.1 The Mathematical Framework of K-Means

K-means is a partitioning clustering algorithm that divides a dataset of \(n\) observations into \(k\) distinct, non-overlapping groups. The mathematical objective of the algorithm is to minimize the Within-Cluster Sum of Squares (WSS). In other words, it minimizes the variance within each cluster, ensuring that observations assigned to the same group are as physically and economically homogeneous as possible:

\[\text{WSS} = \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2\]

Where:

  • \(k\) represents the pre-specified number of clusters,
  • \(S_i\) denotes the set of observations belonging to the \(i\)-th cluster,
  • \(\mu_i\) is the centroid (mean vector) of all points in cluster \(S_i\),
  • \(\|x - \mu_i\|^2\) represents the squared Euclidean distance between an observation and its respective cluster center.

4.6.2 Spatial-Price Integration Strategy

In traditional data science applications, clustering is usually applied strictly to structural property characteristics (e.g., total area, number of rooms). However, in spatial econometrics, geography cannot be discarded. In this project, we feed the algorithm a combination of geographic and economic data, specifically three scaled variables simultaneously:

  1. Projected X Coordinate (x_2180) – East-West geographic position.
  2. Projected Y Coordinate (y_2180) – North-South geographic position.
  3. Price per Square Meter (price_per_m2_pln) – Unit economic value.

By integrating both coordinates and price into the distance-metric calculation, the K-means algorithm serves a dual purpose. It does not simply cluster apartments that are close to each other, nor does it just group properties with similar prices regardless of their location. Instead, it minimizes the combined variance of location and price.

Consequently, the resulting clusters represent data-driven spatial submarkets — continuous or semi-continuous urban territories where properties share both geographical proximity and a comparable price profile. This unsupervised segmentation provides a robust baseline for identifying regional patterns before expanding into supervised spatial machine learning models.

4.7 Unsupervised Spatial Analysis: k-Means

As a first unsupervised spatial analysis, we run k-means clustering on three variables: projected x and y coordinates and the price per square metre. This is not a predictive model. Its role is to identify rough price-location groups, or submarkets, that are spatially close and price-similar. We use five clusters to keep the map interpretable.

xy <- st_coordinates(warsaw_enriched)

cluster_input <- warsaw_enriched |>
  st_drop_geometry() |>
  mutate(
    x_2180 = xy[, 1],
    y_2180 = xy[, 2]
  ) |>
  select(x_2180, y_2180, price_per_m2_pln)

k_clusters <- 5
kmeans_fit <- kmeans(scale(cluster_input), centers = k_clusters, nstart = 50)

warsaw_clustered <- warsaw_enriched |>
  mutate(
    x_2180 = xy[, 1],
    y_2180 = xy[, 2],
    cluster_raw = kmeans_fit$cluster
  )

cluster_lookup <- warsaw_clustered |>
  st_drop_geometry() |>
  group_by(cluster_raw) |>
  summarise(
    n_listings = n(),
    median_price_per_m2_pln = median(price_per_m2_pln, na.rm = TRUE),
    median_dist_subway_m = median(dist_subway_m, na.rm = TRUE),
    median_dist_park_m = median(dist_park_m, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(median_price_per_m2_pln) |>
  mutate(
    cluster_rank = row_number(),
    cluster_label = paste0(
      "Cluster ", cluster_rank, " (",
      scales::number(median_price_per_m2_pln, accuracy = 1, big.mark = " "),
      " PLN/m²)"
    )
  )

warsaw_clustered <- warsaw_clustered |>
  left_join(
    cluster_lookup |> select(cluster_raw, cluster_rank, cluster_label),
    by = "cluster_raw"
  ) |>
  mutate(cluster_label = factor(cluster_label, levels = cluster_lookup$cluster_label))

cluster_lookup |>
  select(
    cluster_rank, n_listings, median_price_per_m2_pln,
    median_dist_subway_m, median_dist_park_m
  ) |>
  knitr::kable(digits = 0, format.args = list(big.mark = " "), caption = "Table 6: Spatial-Price Cluster Profiles")
Table 6: Spatial-Price Cluster Profiles
cluster_rank n_listings median_price_per_m2_pln median_dist_subway_m median_dist_park_m
1 3 828 13 500 2 831 561
2 5 132 14 896 765 329
3 8 640 16 084 882 274
4 7 059 17 069 755 285
5 6 682 22 073 517 204

Table 6 enumerates the statistical profiles of the five generated clusters, arranged sequentially by median price. A clear operational pattern emerges: clusters commanding the highest valuations per square meter concurrently exhibit the lowest median distances to subway infrastructure. This powerful unsupervised result reinforces the primacy of transit accessibility in structuring real estate economics.

cluster_map <- ggplot()

if (!is.null(warsaw_districts)) {
  cluster_map <- cluster_map +
    geom_sf(data = warsaw_districts, fill = NA, color = "grey70", linewidth = 0.25)
}

cluster_map +
  geom_sf(data = warsaw_clustered, aes(color = cluster_label), alpha = 0.45, size = 0.35) +
  scale_color_brewer(palette = "Set2", name = "Cluster") +
  coord_sf(datum = NA) +
  labs(
    title = "Figure 6. K-means spatial-price clusters of Warsaw apartment listings",
    subtitle = "Clusters obtained from projected coordinates and price per square metre",
    x = NULL,
    y = NULL
  )+
  theme(legend.position = "right")

Figure 6 reveals the presence of several spatially coherent housing submarkets within Warsaw. Higher-price clusters are concentrated in central and highly accessible locations, whereas lower-price clusters are more prevalent in peripheral districts. Notably, cluster boundaries do not perfectly coincide with administrative borders, suggesting that housing-market segmentation is driven more strongly by accessibility patterns and urban structure than by formal district divisions. This finding highlights the importance of incorporating spatial information into housing-market analyses.

4.8 Construction of the Final Analytical Dataset

To support subsequent supervised modelling and geostatistical analysis, a final analytical dataset was constructed. The dataset integrates structural apartment characteristics, projected spatial coordinates, and accessibility indicators derived from OpenStreetMap infrastructure layers. Particular attention was given to producing a standardized and model-ready representation suitable for both machine-learning and spatial interpolation methods.

The recommended target for the final raster price surface is target_price_per_m2_pln, because kriging a price-per-square-metre surface is more interpretable than kriging total apartment prices. The total price target target_price_pln is retained for an alternative model in which floor area is one of the strongest expected predictors.

model_feature_columns <- c(
  "area_m2", "rooms", "floor", "floor_count", "build_year",
  "x_2180", "y_2180", "dist_subway_m", "dist_tram_m", "dist_park_m"
)

model_ready_sf <- warsaw_clustered |>
  select(
    id,
    source_file,
    month_date,
    city,
    type,
    ownership,
    buildingMaterial,
    condition,
    district,
    target_price_pln,
    target_price_per_m2_pln,
    all_of(model_feature_columns),
    cluster_rank,
    cluster_label,
    geometry
  ) |>
  filter(
    !is.na(target_price_per_m2_pln),
    !is.na(area_m2),
    !is.na(x_2180),
    !is.na(y_2180),
    !is.na(dist_subway_m),
    !is.na(dist_tram_m),
    !is.na(dist_park_m)
  )

model_ready_table <- model_ready_sf |>
  st_drop_geometry()

model_handoff_summary <- tibble(
  item = c(
    "Recommended supervised ML target",
    "Alternative supervised ML target",
    "Core numeric predictors",
    "Spatial coordinates for Spatial RF and kriging",
    "Kriging geometry CRS",
    "Number of model-ready listings"
  ),
  value = c(
    "target_price_per_m2_pln",
    "target_price_pln",
    paste(model_feature_columns, collapse = ", "),
    "x_2180, y_2180",
    "EPSG:2180",
    format(nrow(model_ready_sf), big.mark = " ")
  )
)

model_handoff_summary |>
  knitr::kable(caption = "Table 7: Handoff Dataset Structure Overview")
Table 7: Handoff Dataset Structure Overview
item value
Recommended supervised ML target target_price_per_m2_pln
Alternative supervised ML target target_price_pln
Core numeric predictors area_m2, rooms, floor, floor_count, build_year, x_2180, y_2180, dist_subway_m, dist_tram_m, dist_park_m
Spatial coordinates for Spatial RF and kriging x_2180, y_2180
Kriging geometry CRS EPSG:2180
Number of model-ready listings 31 341
data_dictionary <- tibble(
  column = c(
    "target_price_per_m2_pln",
    "target_price_pln",
    "area_m2",
    "rooms",
    "floor",
    "floor_count",
    "build_year",
    "x_2180",
    "y_2180",
    "dist_subway_m",
    "dist_tram_m",
    "dist_park_m",
    "district",
    "cluster_rank",
    "geometry"
  ),
  role_in_part_ii = c(
    "Main target for Spatial Random Forest and price-per-square-metre kriging",
    "Optional target for a total-price model",
    "Core structural predictor",
    "Core structural predictor",
    "Optional structural predictor",
    "Optional structural predictor",
    "Optional structural predictor; may need missing-value handling",
    "Projected x coordinate for spatial modelling",
    "Projected y coordinate for spatial modelling",
    "Distance feature for subway accessibility",
    "Distance feature for tram accessibility",
    "Distance feature for green-space accessibility",
    "Descriptive grouping variable",
    "Unsupervised spatial-price submarket label",
    "Point geometry in EPSG:2180"
  )
)

data_dictionary |>
  knitr::kable(caption = "Table 8: Data Dictionary for Part II Spatial Modeling")
Table 8: Data Dictionary for Part II Spatial Modeling
column role_in_part_ii
target_price_per_m2_pln Main target for Spatial Random Forest and price-per-square-metre kriging
target_price_pln Optional target for a total-price model
area_m2 Core structural predictor
rooms Core structural predictor
floor Optional structural predictor
floor_count Optional structural predictor
build_year Optional structural predictor; may need missing-value handling
x_2180 Projected x coordinate for spatial modelling
y_2180 Projected y coordinate for spatial modelling
dist_subway_m Distance feature for subway accessibility
dist_tram_m Distance feature for tram accessibility
dist_park_m Distance feature for green-space accessibility
district Descriptive grouping variable
cluster_rank Unsupervised spatial-price submarket label
geometry Point geometry in EPSG:2180

As detailed in Table 7 and Table 8, the final compiled matrix contains everything required to deploy complex algorithms. The recommended methodological path for the upcoming phase is to fit a supervised Spatial Random Forest to isolate the systematic valuation derived from structural attributes and proximity metrics, and subsequently apply kriging interpolation to the model residuals to map any latent, unobserved spatial autocorrelation.

model_ready_sf <- readRDS("outputs/warsaw_housing_model_ready_sf.rds")
warsaw_boundary_2180 <- readRDS("outputs/warsaw_boundary_2180.rds")

rf_data <- model_ready_sf |>
  st_drop_geometry() |>
  select(
    target_price_per_m2_pln,
    area_m2, rooms, floor, floor_count, build_year,
    x_2180, y_2180,
    dist_subway_m, dist_tram_m, dist_park_m
  ) |>
  tidyr::drop_na()

rf_formula <- target_price_per_m2_pln ~
  area_m2 + rooms + floor + floor_count + build_year +
  x_2180 + y_2180 + dist_subway_m + dist_tram_m + dist_park_m

4.9 Summary of Part I

This part of the project:

  1. merged local Kaggle sale files for Polish apartment listings,
  2. filtered the data to Warsaw and removed repeated listings across months,
  3. converted listing coordinates to an sf point object in EPSG:2180,
  4. downloaded subway, tram, park, and district layers from OpenStreetMap,
  5. computed distance-based spatial features for every listing,
  6. produced descriptive maps and a simple k-means spatial-price clustering,
  7. exported a model-ready dataset for supervised ML, variable importance, and kriging.

The main technical result is a spatially enriched housing dataset. Each apartment has a clean price-per-square-metre target, projected coordinates, structural features, distance-to-infrastructure variables, district assignment, and a rough cluster label. This is sufficient for the next project stage: fitting a Spatial Random Forest, comparing variable importance between area and accessibility, and generating a smooth Warsaw price surface through kriging.

Three limitations should be kept in mind. First, the dataset contains offer prices rather than final transaction prices. Second, accessibility is measured with Euclidean distances rather than network travel times. Third, OSM layers depend on current tagging quality; therefore, the downloaded layers are cached locally after the first successful render.

5 Spatial Modelling of Apartment Prices in Warsaw

This stage extends the exploratory spatial analysis into supervised spatial modelling and geostatistical interpolation. The objective is to evaluate how well apartment prices per square metre can be predicted using a combination of structural characteristics, accessibility measures, and geographic coordinates. The analysis also investigates which variables contribute most strongly to predictive accuracy and how the predicted price structure is distributed across the urban space of Warsaw.

library(sf)
library(dplyr)
library(ggplot2)
library(ranger)
library(gstat)
library(stars)
library(knitr)
library(scales)

candidate_paths <- c(
  "outputs/warsaw_housing_model_ready_table.csv",
  "warsaw_housing_model_ready_table.csv",
  "warsaw_housing_model_ready_table(1).csv"
)

data_path <- candidate_paths[file.exists(candidate_paths)][1]
if (is.na(data_path)) {
  stop("The model-ready CSV file was not found. Put warsaw_housing_model_ready_table.csv in the project folder or in outputs/.")
}

housing_raw <- read.csv(data_path)

5.1 Methodology explanation

5.1.1 Spatial Random Forest

Random Forest is an ensemble learning method based on a large number of decision trees. Each tree is trained on a bootstrap sample of the data, and only a random subset of predictors is considered at each split. In regression problems, the final prediction is calculated as the average prediction across all trees. This approach is useful for housing-market analysis because it can capture non-linear relationships and interactions between variables without requiring a predefined functional form.

In this project, the method is applied as a spatial Random Forest because the predictor set contains not only apartment characteristics, but also explicitly spatial information. Projected coordinates (x_2180, y_2180) describe the position of each listing within Warsaw, while distance variables measure accessibility to selected elements of urban infrastructure. Including these predictors allows the model to account for the fact that real-estate prices are strongly influenced by location and by the surrounding urban environment.

The dependent variable is apartment price per square metre. This target was selected because it standardizes apartment values across dwellings of different sizes. Total price is strongly affected by apartment area, while price per square metre allows a more direct comparison of location quality, building characteristics, and accessibility. As a result, the model focuses on explaining differences in unit prices rather than simply identifying larger apartments as more expensive.

5.1.2 Variable importance

Since Random Forest models do not provide coefficients in the same way as linear regression, variable importance is used to interpret the model. In this analysis, permutation importance is applied. The method evaluates the contribution of each predictor by randomly shuffling its values and measuring how much the model accuracy decreases. Variables that cause a larger decrease in accuracy are interpreted as more important for prediction.

This makes it possible to compare the importance of traditional apartment characteristics, such as floor area or building year, with spatial predictors, such as coordinates, distances to public transport, distance to parks, and spatial-market cluster membership.

5.1.3 Residual analysis

In addition to standard model metrics, residual analysis is used to evaluate the quality of predictions. A residual is the difference between the actual observed price and the predicted price. If residuals are centred around zero and do not show a clear systematic pattern, the model can be interpreted as reasonably balanced. If residuals increase strongly for certain price ranges, this may indicate that the model has more difficulty predicting specific types of apartments, such as luxury listings or unusual properties.

5.1.4 Ordinary kriging

Kriging is a geostatistical interpolation method used to estimate values in locations where direct observations are not available. The method is based on spatial autocorrelation, which means that observations located close to one another are expected to be more similar than observations located far apart. In this project, ordinary kriging is used to transform observed point listings into a continuous price surface for Warsaw.

The kriging result should not be treated as an exact valuation of every apartment in the city. Instead, it provides a smoothed spatial representation of the market and helps identify broader zones of higher and lower apartment prices.

5.2 Data used for modelling

The modelling stage is based on a prepared dataset containing 31,341 apartment observations. All records were retained after the final preparation step, which means that the model was fitted on the full model-ready table. The target variable is apartment price per square metre, expressed in PLN, because it allows apartments of different sizes to be compared more directly than total listing price.

base_predictors <- c(
  "area_m2", "rooms", "floor", "floor_count", "build_year",
  "dist_subway_m", "dist_tram_m", "dist_park_m",
  "x_2180", "y_2180"
)

if ("cluster_rank" %in% names(housing_raw)) {
  base_predictors <- c(base_predictors, "cluster_rank")
}

required_columns <- c("target_price_per_m2_pln", base_predictors)
missing_columns <- setdiff(required_columns, names(housing_raw))
if (length(missing_columns) > 0) {
  stop(paste("Missing required columns:", paste(missing_columns, collapse = ", ")))
}

housing_model <- housing_raw %>%
  select(all_of(required_columns)) %>%
  filter(!is.na(target_price_per_m2_pln))

for (v in base_predictors) {
  if (any(is.na(housing_model[[v]]))) {
    housing_model[[v]][is.na(housing_model[[v]])] <- median(housing_model[[v]], na.rm = TRUE)
  }
}

housing_sf <- st_as_sf(
  housing_model,
  coords = c("x_2180", "y_2180"),
  crs = 2180,
  remove = FALSE
)

data_summary <- data.frame(
  Measure = c(
    "Input observations",
    "Model observations after cleaning",
    "Median price per m2",
    "Mean price per m2",
    "Median apartment area",
    "Median distance to subway",
    "Median distance to tram",
    "Median distance to park"
  ),
  Value = c(
    nrow(housing_raw),
    nrow(housing_model),
    round(median(housing_model$target_price_per_m2_pln), 0),
    round(mean(housing_model$target_price_per_m2_pln), 0),
    round(median(housing_model$area_m2), 1),
    round(median(housing_model$dist_subway_m), 0),
    round(median(housing_model$dist_tram_m), 0),
    round(median(housing_model$dist_park_m), 0)
  )
)

kable(data_summary, caption = "Model-ready dataset used in the modelling stage")
Model-ready dataset used in the modelling stage
Measure Value
Input observations 31341
Model observations after cleaning 31341
Median price per m2 16772
Mean price per m2 17166
Median apartment area 53
Median distance to subway 760
Median distance to tram 434
Median distance to park 290

The descriptive statistics show that the analysed market is characterized by high unit prices and moderate apartment sizes. The median price per square metre equals 16,772 PLN, while the mean value is slightly higher, at 17,166 PLN. This difference suggests a moderate influence of more expensive listings on the average price level. The median apartment area equals 53 m², indicating that the dataset is largely composed of medium-sized urban flats.

The accessibility measures also provide useful information about the spatial context of the listings. In the updated dataset, the median distance to subway stations is 760 metres, the median distance to tram infrastructure is 434 metres, and the median distance to parks is 290 metres. These values are much lower than in the previous version, which suggests that the updated accessibility measures represent closer urban amenities around the listings. The variables remain important because they quantify the relationship between apartment prices and access to transport and green areas. The use of the EPSG:2180 coordinate system ensures that all distance measures are expressed in metres and can be used consistently in spatial modelling.

5.3 Spatial Random Forest model

The Spatial Random Forest model was estimated using target_price_per_m2_pln as the dependent variable. The explanatory variables include apartment characteristics, building characteristics, projected coordinates, accessibility measures, and the spatial-market cluster variable. This specification combines traditional housing-market predictors with spatial information, making it possible to model both structural and geographic determinants of apartment prices.

To evaluate model performance, the dataset was divided into a training set and a test set. The training set contained 80% of the observations and was used to fit the model, while the remaining 20% was used for out-of-sample evaluation.

set.seed(123)

train_id <- sample(seq_len(nrow(housing_model)), size = floor(0.80 * nrow(housing_model)))
train_data <- housing_model[train_id, ]
test_data <- housing_model[-train_id, ]

rf_formula <- as.formula(
  paste("target_price_per_m2_pln ~", paste(base_predictors, collapse = " + "))
)

rf_model <- ranger(
  formula = rf_formula,
  data = train_data,
  num.trees = 500,
  mtry = floor(sqrt(length(base_predictors))),
  min.node.size = 5,
  importance = "permutation",
  seed = 123
)

rf_predictions <- predict(rf_model, data = test_data)$predictions

performance <- data.frame(
  Metric = c("RMSE", "MAE", "R-squared"),
  Value = c(
    sqrt(mean((test_data$target_price_per_m2_pln - rf_predictions)^2)),
    mean(abs(test_data$target_price_per_m2_pln - rf_predictions)),
    cor(test_data$target_price_per_m2_pln, rf_predictions)^2
  )
)

performance$Value <- round(performance$Value, 3)
kable(performance, caption = "Test-set performance of the Spatial Random Forest")
Test-set performance of the Spatial Random Forest
Metric Value
RMSE 1576.213
MAE 1153.276
R-squared 0.835
prediction_results <- data.frame(
  Actual = test_data$target_price_per_m2_pln,
  Predicted = rf_predictions
)

ggplot(prediction_results, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.18, size = 1.2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", linewidth = 1, color = "red") +
  theme_minimal() +
  labs(
    title = "Spatial Random Forest: actual vs predicted prices",
    subtitle = "Test-set predictions for apartment price per square metre",
    x = "Actual price per m2 (PLN)",
    y = "Predicted price per m2 (PLN)"
  )

The test-set results indicate strong predictive performance. The RMSE equals 1,576.213 PLN/m², while the MAE equals 1,153.276 PLN/m². The R-squared value is 0.835, which means that the model explains approximately 83.5% of the variation in apartment prices per square metre in the test sample. Compared with the previous version, the overall predictive quality is practically unchanged, although the error measures are slightly higher.

The actual-versus-predicted plot confirms that the model captures the main price structure of the Warsaw housing market. Most observations are located close to the diagonal reference line, which indicates relatively accurate predictions. Larger deviations are visible mainly among the most expensive apartments. This is expected, because premium listings may contain unique characteristics that are not fully represented by the available variables.

prediction_results <- prediction_results %>%
  mutate(Residual = Actual - Predicted)

ggplot(prediction_results, aes(x = Predicted, y = Residual)) +
  geom_point(alpha = 0.18, size = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Spatial Random Forest: residuals",
    subtitle = "Difference between actual and predicted price per square metre",
    x = "Predicted price per m2 (PLN)",
    y = "Residual: actual - predicted (PLN/m2)"
  )

The residual plot provides an additional assessment of model quality. Most residuals are concentrated around the horizontal zero line, which suggests that the model is generally well balanced and does not show a strong overall tendency to overestimate or underestimate apartment prices. The largest concentration of observations appears between approximately 13,000 and 17,000 PLN/m², where the residuals are relatively dense and mostly close to zero. This indicates that the model performs particularly well for the most common segment of the Warsaw apartment market.

At the same time, the spread of residuals becomes wider for higher predicted prices, especially above approximately 20,000 PLN/m². This means that more expensive apartments are harder to predict precisely. In this range, both positive and negative residuals are visible, so the model sometimes underestimates and sometimes overestimates high-price listings. This pattern is reasonable because premium apartments may depend on additional qualitative factors that are not included in the dataset, such as renovation standard, view, building prestige, interior finish, noise level, or exact neighbourhood reputation.

importance_df <- data.frame(
  Variable = names(rf_model$variable.importance),
  Importance = as.numeric(rf_model$variable.importance)
) %>%
  arrange(Importance) %>%
  mutate(Variable = factor(Variable, levels = Variable))

kable(
  importance_df %>% arrange(desc(Importance)),
  digits = 2,
  caption = "Permutation variable importance from the Spatial Random Forest"
)
Permutation variable importance from the Spatial Random Forest
Variable Importance
cluster_rank 16466391.7
build_year 3750145.1
y_2180 3019591.7
x_2180 2496383.6
dist_subway_m 2427069.8
area_m2 2359513.3
dist_tram_m 1512473.9
dist_park_m 990228.9
floor_count 965522.3
rooms 912803.0
floor 507358.7
ggplot(importance_df, aes(x = Variable, y = Importance)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Spatial Random Forest: variable importance",
    subtitle = "Higher values mean a stronger contribution to predictive accuracy",
    x = "Predictor",
    y = "Permutation importance"
  )

importance_df %>%
  arrange(desc(Importance)) %>%
  slice_head(n = 5) %>%
  kable(
    digits = 2,
    caption = "Five most important predictors in the Spatial Random Forest"
  )
Five most important predictors in the Spatial Random Forest
Variable Importance
cluster_rank 16466392
build_year 3750145
y_2180 3019592
x_2180 2496384
dist_subway_m 2427070

The variable-importance ranking shows that cluster_rank is the strongest predictor in the model. This suggests that the previously identified spatial-market clusters contain substantial information about price differences across Warsaw. The second most important variable is build_year, which indicates that the age or modernity of buildings is also a major determinant of apartment prices per square metre.

Spatial predictors also play an important role. Both projected coordinates, especially y_2180 and x_2180, are highly ranked, which confirms that the model captures broader geographic price gradients across Warsaw. Among the accessibility variables, dist_subway_m is the strongest and appears in the top five predictors. dist_tram_m and dist_park_m remain useful, but they are less influential in the updated ranking than the coordinate variables and subway accessibility. The importance of area_m2 is also visible, although it ranks below the main spatial predictors. This is consistent with the fact that the target variable is price per square metre, which reduces the direct effect of apartment size compared with a model of total price.

The five most important predictors provide a concise interpretation of the model. The dominance of cluster_rank indicates that Warsaw can be divided into local housing submarkets with different price levels. The high importance of build_year suggests that newer or more modern buildings are priced differently from older ones. The presence of both y_2180 and x_2180 among the top predictors confirms that the model strongly depends on the geographic position of each apartment. The inclusion of dist_subway_m in the top five additionally shows that subway accessibility remains one of the most important amenity-related variables.

5.4 Kriging price surface

The final stage of the analysis applies ordinary kriging to create a continuous surface of apartment prices per square metre. Since kriging is computationally intensive for large datasets, a random sample of observations is used. This approach keeps the calculation manageable while preserving the general spatial distribution of listings across Warsaw.

set.seed(42)

kriging_sample_size <- min(2500, nrow(housing_sf))
kriging_sample <- housing_sf %>%
  slice_sample(n = kriging_sample_size) %>%
  distinct(x_2180, y_2180, .keep_all = TRUE)

bbox <- st_bbox(housing_sf)
price_grid <- st_as_stars(bbox, dx = 500, dy = 500)
price_grid_sf <- st_as_sf(price_grid)

emp_variogram <- variogram(target_price_per_m2_pln ~ 1, kriging_sample)
fit_variogram <- fit.variogram(
  emp_variogram,
  vgm(model = "Sph")
)

kriging_result <- krige(
  formula = target_price_per_m2_pln ~ 1,
  locations = kriging_sample,
  newdata = price_grid_sf,
  model = fit_variogram
)
## [using ordinary kriging]
ggplot() +
  geom_sf(data = kriging_result, aes(fill = var1.pred), color = NA) +
  scale_fill_viridis_c(
    name = "Predicted\nPLN/m2",
    labels = label_number(big.mark = " ")
  ) +
  geom_sf(data = kriging_sample, size = 0.1, alpha = 0.25) +
  theme_minimal() +
  labs(
    title = "Kriging price surface for Warsaw",
    subtitle = "Interpolated apartment price per square metre based on observed listings",
    x = "X coordinate (EPSG:2180)",
    y = "Y coordinate (EPSG:2180)"
  )

The kriging surface reveals a clear spatial structure of apartment prices in Warsaw. Higher predicted values are concentrated mainly in the central part of the city, while lower values are more visible in peripheral areas. This pattern is consistent with the general structure of urban housing markets, where centrality, accessibility, and concentration of services tend to increase property values.

The map should be interpreted as a smoothed spatial representation rather than a precise valuation tool. Its main purpose is to visualize the broader geographic pattern of prices and to complement the point-level predictions from the Random Forest model.

5.5 Limitations

Several limitations should be noted. First, the model is based on offer prices rather than final transaction prices, so the results reflect asking prices on the market. Second, some qualitative apartment characteristics are not available in the dataset, such as renovation standard, view, noise level, exposure to sunlight, or exact building condition. These missing variables may explain part of the prediction error, especially for premium apartments.

Third, the kriging map is an interpolation of observed listings and should not be interpreted as an exact valuation tool for individual properties. It is useful for identifying general spatial patterns, but local estimates may be less reliable in areas with fewer observations. Finally, the Random Forest model provides strong predictive performance, but it remains a data-driven method and does not prove causal relationships between variables and prices.

5.6 Summary

The supervised modelling stage confirms that apartment prices in Warsaw are strongly shaped by both structural and spatial factors. The Spatial Random Forest achieved a high level of predictive accuracy, with an R-squared value of 0.835 on the test set. This indicates that the selected variables explain most of the observed variation in apartment prices per square metre.

The interpretation of the model shows that spatial-market segmentation, building year, geographic coordinates, and subway accessibility are among the most important predictors. This supports the assumption that housing prices cannot be fully explained by apartment characteristics alone. Location and the surrounding urban environment are essential components of price formation, although in the updated model parks and tram accessibility are less influential than the broader coordinate-based location variables.

The residual analysis shows that the model performs well for most observations, but prediction errors increase for the most expensive apartments. This suggests that premium properties may require additional qualitative variables for more precise valuation. The kriging interpolation further demonstrates that prices are not randomly distributed across space. Instead, the estimated surface shows higher values in central areas and lower values toward the outskirts. Together, the Random Forest model and kriging map provide a consistent picture of the Warsaw housing market as a spatially differentiated system influenced by accessibility, urban structure, and property characteristics.

6 Final Conclusion

The project demonstrates that apartment prices in Warsaw are strongly shaped by spatial structure and urban accessibility. Starting from raw apartment listing data, the analysis created a complete spatial workflow: the data were cleaned, transformed into a spatial object, enriched with OpenStreetMap infrastructure layers, explored through maps and clustering, and then used for supervised spatial modelling and interpolation.

The exploratory part of the project showed that apartment prices are not evenly distributed across Warsaw. Clear differences are visible between central, well-connected areas and more peripheral parts of the city. The integration of distance variables made it possible to quantify access to subway stations, tram infrastructure, and parks, while the clustering stage helped identify local housing submarkets with different price characteristics. These steps confirmed that location is not only a background feature of the dataset, but one of the key dimensions of the housing market.

The supervised modelling stage strengthened this conclusion. The Spatial Random Forest achieved strong predictive performance, with an R-squared value of 0.835 on the test set. This indicates that the selected structural and spatial variables explain a large part of the variation in apartment prices per square metre. The variable-importance results showed that the spatial-market cluster variable was the strongest predictor, followed by building year, projected coordinates, and subway accessibility. This means that both the physical characteristics of apartments and their position within the city are important for explaining price differences.

The residual analysis showed that the model performs well for the main part of the market, especially for apartments within the most common price range. Larger errors appear mainly among the most expensive listings, which is reasonable because premium apartments may depend on qualitative characteristics that are not included in the dataset, such as renovation standard, view, building prestige, or exact neighbourhood reputation. This suggests that the model is useful for identifying general pricing patterns, but it should not be treated as a perfect valuation tool for every individual property.

Finally, the kriging interpolation transformed point-level apartment listings into a continuous spatial price surface. The resulting map confirmed the presence of higher price zones in central Warsaw and lower predicted values in more peripheral areas. This visual output complements the Random Forest model by showing the broader geographic structure of the market.

Overall, the project confirms that combining GIS methods, OpenStreetMap-based accessibility measures, unsupervised clustering, Spatial Random Forest modelling, and kriging provides a valuable framework for analysing urban housing prices. The results show that Warsaw’s apartment market is spatially differentiated and that location, accessibility, building characteristics, and local market segmentation jointly influence apartment prices per square metre. Future improvements could include transaction prices instead of offer prices, additional qualitative apartment features, and more detailed neighbourhood-level variables.