Abstract

This study examines whether access to urban green space improves the prediction of apartment asking prices in Warsaw. Apartment listings from the June 2024 edition of the Apartment Prices in Poland dataset were combined with green areas, public transport stops, and points of interest obtained from OpenStreetMap. The final analytical sample contains 6,822 listings and 24 variables, including structural apartment characteristics, geographic coordinates, distance-based measures, and buffer-based measures of the surrounding environment.

Linear regression, Random Forest, and XGBoost models were compared under a conventional random split, a spatial train-test split, and five-fold spatial cross-validation. The best overall specification was a relatively parsimonious Random Forest using apartment and general location characteristics. Its random-split RMSE was 2,143 PLN per square metre and its R-squared was 0.657. Under spatial cross-validation, RMSE increased to 2,621 PLN per square metre and R-squared decreased to 0.459. This difference shows that a random split gives an optimistic impression of predictive performance when nearby observations occur in both samples.

Green-space variables did not improve out-of-sample accuracy once location was included. Their addition increased random-split RMSE by approximately 1.6% and spatial-split RMSE by approximately 4.3%. This does not imply that green space has no value for residents or no relationship with prices. It means that the available measures of proximity and green coverage added little independent predictive information beyond coordinates and distance to the city centre. Residual Moran’s I was close to zero and statistically insignificant, suggesting that the final spatial model captured most of the broad spatial structure present in the data. The results support Random Forest as the main model, while XGBoost remains a useful but weaker benchmark in this application.

1 Introduction

Apartment prices are inherently spatial. Two dwellings with similar floor area, room count, and construction year may have different prices because they occupy different parts of the city. Access to the centre, public transport, services, and environmental amenities can all contribute to this difference. Urban green space is especially interesting because it may represent both an amenity and a broader indicator of neighbourhood quality.

The purpose of this project is to test whether green-space accessibility contains useful predictive information about apartment asking prices per square metre in Warsaw. The project is designed as a predictive spatial analysis rather than a causal valuation study. Consequently, the central question is not whether constructing a park would cause a specific price increase. Instead, the analysis asks whether observed green-space measures improve predictions for listings that were not used during model training.

The study also addresses a practical methodological issue. Standard random validation may place geographically close listings in both the training and test sets. A model can then benefit from local similarity and appear more transferable than it really is. For this reason, conventional validation is supplemented with spatially separated evaluation.

2 Research Questions and Hypotheses

The analysis is guided by four research questions:

  1. How accurately can apartment asking prices per square metre be predicted from structural characteristics alone?
  2. How much predictive value is added by general location, green-space accessibility, transport access, and local services?
  3. Do nonlinear tree-based models outperform a linear benchmark?
  4. How much does estimated performance change when spatial separation is introduced into validation?

The corresponding hypotheses are:

  • H1: General location variables substantially improve prediction relative to apartment characteristics alone.
  • H2: Green-space variables provide additional predictive value after general location is controlled for.
  • H3: Random Forest and XGBoost outperform linear regression because price formation contains nonlinearities and interactions.
  • H4: Spatial validation produces weaker results than a random split because it reduces information leakage between nearby listings.

3 Data

3.1 Apartment listings

The apartment data come from the Apartment Prices in Poland dataset, published on Kaggle by Krzysztof Jamroz. According to the dataset documentation, it contains monthly snapshots of apartment sale and rental offers from major Polish cities. The observations originate from local real-estate listing websites and were additionally enriched by the author with OpenStreetMap-based distances to selected points of interest.

The source folder used in this project contains eleven monthly CSV files covering August 2023 to June 2024. These snapshots contain 195,568 records for all available Polish cities and 59,246 records identified as Warsaw listings. The empirical analysis uses only the most recent snapshot, June 2024. Treating the latest month as a cross-section avoids counting very similar listings repeatedly across months and keeps the spatial unit of analysis clear.

The June 2024 file contains 6,962 Warsaw records before final modeling filters. After validation of coordinates, removal of incomplete or implausible observations, and construction of all required variables, 6,822 listings remain. The analytical sample therefore retains approximately 98.0% of the Warsaw records from the selected month.

The dependent variable is price_m2, measured in PLN per square metre. It is calculated from listing price and apartment floor area. As the data describe advertised properties, the dependent variable should be interpreted as an asking price rather than a completed transaction price.

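# Packages used by the excerpts in this section.
library(dplyr)
library(stringr)
library(sf)

# Read the June 2024 Kaggle sale-offer file used in the analysis.
# normalise_flat_columns() and safe_numeric() are project helpers not shown in this excerpt.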
dataset_file <- file.path("data", "apartments_pl_2024_06.csv")

flats_raw <- readr::read_csv(dataset_file, show_col_types = FALSE) |>
  mutate(source_file = basename(dataset_file)) |>
  normalise_flat_columns()

# Retain the Warsaw observations from this monthly snapshot.
flats_clean <- flats_raw |>
  filter(
    str_to_lower(city) %in% c("warszawa", "warsaw")
  ) |>
  mutate(
    price = safe_numeric(price),
    area_m2 = safe_numeric(area_m2),
    rooms = safe_numeric(rooms),
    lon = safe_numeric(lon),
    lat = safe_numeric(lat),
    price_m2 = price / area_m2
  ) |>
  filter(
    is.finite(price_m2),
    area_m2 >= 10,
    area_m2 <= 300,
    rooms >= 1,
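    # na.rm = TRUE: all filter conditions are evaluated on the unfiltered rows,
    # so quantile() would otherwise fail if any price_m2 is missing.
    between(price_m2,
            quantile(price_m2, 0.01, na.rm = TRUE),
            quantile(price_m2, 0.99, na.rm = TRUE))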
  ) |>
  mutate(offer_id = row_number())

flats_sf <- st_as_sf(
  flats_clean,
  coords = c("lon", "lat"),
  crs = 4326,
  remove = FALSE
) |>
  st_transform(2180)

3.1.1 Dataset license and permitted use

The dataset is distributed under the Apache License 2.0. This permissive license allows the material to be used, reproduced, modified, and redistributed, including in commercial applications. When the dataset or a modified version is redistributed, the license information and existing attribution notices must be retained, recipients must receive a copy of the license, and modified files should be clearly identified as modified. If the original distribution includes a NOTICE file, its relevant attribution notices must also be preserved. The license does not grant rights to use the author’s trademarks and provides the material on an “as is” basis, without warranties.

For this project, the practical requirement is met by identifying the dataset title, author, source URL, and license in both the report and the final R script. The analytical tables, engineered variables, models, and figures are transformations produced for this study; they should continue to carry the source attribution when shared together with the underlying data.

3.2 OpenStreetMap data

OpenStreetMap data are downloaded through the osmdata package and stored in a local cache. Three groups of spatial objects are used:

  • green areas, such as parks, forests, grass areas, and recreation grounds;
  • public transport stops and stations;
  • selected points of interest representing access to local services.

The source coordinates are stored in WGS 84 (EPSG:4326). Spatial distances, buffers, and areas are calculated in the Polish metric coordinate reference system EPSG:2180. This transformation is necessary because calculations performed directly on longitude and latitude would not yield reliable distances in metres or areas in square metres.
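
To see why degrees are not a usable distance unit at Warsaw's latitude, the short sketch below compares the ground length of a 0.01-degree step in longitude and in latitude. It relies on a spherical approximation; the Earth radius and the sample coordinates are illustrative assumptions, not values taken from the project.

```r
# Great-circle (haversine) distance in metres on a sphere.
# radius_m is an assumed mean Earth radius.
haversine_m <- function(lon1, lat1, lon2, lat2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(sqrt(a))
}

# Illustrative point near central Warsaw (approximate coordinates).
lon0 <- 21.01
lat0 <- 52.23

# Ground distance covered by the same 0.01-degree step in each direction.
step_east_m  <- haversine_m(lon0, lat0, lon0 + 0.01, lat0)
step_north_m <- haversine_m(lon0, lat0, lon0, lat0 + 0.01)

round(c(east = step_east_m, north = step_north_m))
```

Under these assumptions the step is roughly 680 m east-west but about 1,110 m north-south, so a buffer drawn in degrees would be an ellipse on the ground. Working in EPSG:2180 avoids that distortion.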

3.3 Descriptive statistics

| Variable | Mean | Median | First quartile | Third quartile |
|---|---|---|---|---|
| Asking price per m2 (PLN) | 18,453 | 17,917 | 15,625 | 20,689 |
| Floor area (m2) | 57.10 | 52.36 | 41.60 | 67.00 |
| Number of rooms | 2.59 | 2.00 | 2.00 | 3.00 |
| Construction year | 1989 | 1995 | 1970 | 2012 |
| Distance to city centre (km) | 5.94 | 5.62 | 3.77 | 8.13 |
| Distance to nearest green area (m) | 24.05 | 13.28 | 5.73 | 32.91 |
| Green share within 500 m | 26.38% | 23.61% | 16.27% | 33.94% |
| Green share within 1,000 m | 29.63% | 27.68% | 21.76% | 35.69% |
| Distance to nearest transport stop (m) | 152.84 | 123.31 | 67.09 | 195.34 |
| Distance to nearest POI (m) | 92.34 | 58.23 | 31.39 | 107.92 |

The median asking price is lower than the mean, which indicates a moderately right-skewed price distribution. The typical listing has two rooms, an area slightly above 52 m2, and is located about 5.6 km from the defined city centre. The very short median distance to green space reflects the broad OpenStreetMap definition used in the project: it includes many types of mapped green polygons, not only large public parks.

The raw correlation between asking price and distance to the city centre is -0.472. In contrast, the correlations between price and distance to green space, green share within 500 m, and green share within 1,000 m are -0.082, -0.069, and -0.015, respectively. These bivariate relationships already suggest that centrality is a stronger market signal than the selected green-space measures. However, correlations alone cannot account for nonlinear effects, interactions, or confounding between location variables, so model-based comparisons remain necessary.

Figure 1. Spatial distribution of apartment asking prices and mapped green areas in Warsaw.

4 Spatial Feature Engineering

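Five feature sets are constructed. Apart from the baseline, each extends an earlier set with a conceptually distinct group of predictors; the final set combines all of them.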

| Feature set | Main variables | Purpose |
|---|---|---|
| Baseline | area, rooms, floor, floor count, construction year | Describe the apartment and building |
| Location | baseline, longitude, latitude, distance to centre | Capture broad urban price gradients |
| Green | location, distance to green, green counts, green area and share in 500 m and 1,000 m buffers | Measure green-space accessibility and exposure |
| Accessibility | location, distance and counts of stops and POIs | Represent transport and service accessibility |
| Spatial | all variables above | Test the complete information set |

Both 500 m and 1,000 m buffers are used because they represent different spatial scales. The smaller buffer approximates the immediate walking environment, while the larger buffer describes a wider neighbourhood. Counts and area shares are included together because the number of polygons does not necessarily represent their total surface. One large park and several small green fragments may have similar counts but very different environmental meaning.
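
The difference between counts and area shares can be made concrete with a small numerical sketch. The polygon areas below are hypothetical and are assumed to be already clipped to the buffer; only the share formula (green area divided by the area of the circular buffer) follows the project's definition.

```r
# Share of a circular buffer covered by green polygons,
# given their (already clipped) areas in square metres.
green_share <- function(green_areas_m2, radius_m) {
  sum(green_areas_m2) / (pi * radius_m^2)
}

# Two hypothetical neighbourhoods with the same polygon count in a 500 m buffer.
one_big_park    <- c(250000, 800, 600)  # one large park plus two small lawns
small_fragments <- c(1200, 800, 600)    # three small green fragments

comparison <- c(
  count_a = length(one_big_park),
  count_b = length(small_fragments),
  share_a = green_share(one_big_park, 500),
  share_b = green_share(small_fragments, 500)
)
comparison
```

Both neighbourhoods contain three polygons, yet the first has roughly a third of its buffer covered by green space while the second has well under one percent. A model given only the counts could not separate them.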

# Download green polygons from OpenStreetMap.
parks <- opq(bbox) |>
  add_osm_feature(
    key = "leisure",
    value = c("park", "garden", "recreation_ground")
  ) |>
  osmdata_sf()

green_polygons <- safe_select_osm(parks, "polygons") |>
  st_make_valid() |>
  st_transform(2180) |>
  filter(!st_is_empty(geometry)) |>
  mutate(green_id = row_number())

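# Construct distance-, count-, and area-based spatial predictors.
# nearest_distance_m(), count_in_buffer(), and area_in_buffer_m2() are project
# helpers not shown in this excerpt; city_center, stops, and poi_points are
# assumed to be defined elsewhere in the project script.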
flats_features <- flats_sf |>
  mutate(
    dist_center_km = as.numeric(st_distance(geometry, city_center)) / 1000,
    dist_green_m = nearest_distance_m(flats_sf, green_polygons),
    dist_stop_m = nearest_distance_m(flats_sf, stops),
    dist_poi_m = nearest_distance_m(flats_sf, poi_points),
    n_green_500 = count_in_buffer(flats_sf, green_polygons, 500),
    n_green_1000 = count_in_buffer(flats_sf, green_polygons, 1000),
    green_area_500_m2 = area_in_buffer_m2(flats_sf, green_polygons, 500),
    green_area_1000_m2 = area_in_buffer_m2(flats_sf, green_polygons, 1000),
    green_share_500 = green_area_500_m2 / (pi * 500^2),
    green_share_1000 = green_area_1000_m2 / (pi * 1000^2)
  )
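
The helper functions called above are not shown in the report. As a rough illustration of what a nearest-distance feature and a buffer-count feature compute, the following base-R sketch handles point targets only (such as stops) and uses plain coordinate matrices instead of sf objects. The function names and coordinates are hypothetical; this is not the project's implementation.

```r
# Distance from each origin to its nearest target point, in metres.
# Both inputs are two-column matrices of projected x/y coordinates.
nearest_point_distance_m <- function(origins_xy, targets_xy) {
  apply(origins_xy, 1, function(p) {
    min(sqrt((targets_xy[, 1] - p[1])^2 + (targets_xy[, 2] - p[2])^2))
  })
}

# Number of target points inside a circular buffer around each origin.
count_points_in_buffer <- function(origins_xy, targets_xy, radius_m) {
  apply(origins_xy, 1, function(p) {
    d <- sqrt((targets_xy[, 1] - p[1])^2 + (targets_xy[, 2] - p[2])^2)
    sum(d <= radius_m)
  })
}

# Hypothetical projected coordinates in metres.
flats_xy <- rbind(c(0, 0), c(1000, 0))
stops_xy <- rbind(c(300, 400), c(1000, 100), c(1200, 0))

dist_stop_m <- nearest_point_distance_m(flats_xy, stops_xy)
n_stops_500 <- count_points_in_buffer(flats_xy, stops_xy, 500)
```

Polygon targets such as green areas additionally require distance to the polygon boundary and clipped areas, which is why the project relies on sf operations in EPSG:2180 rather than on point arithmetic of this kind.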

5 Modeling Strategy

5.1 Algorithms

Three model families are estimated:

  1. Linear regression provides a transparent benchmark and tests whether additive linear relationships are sufficient.
  2. Random Forest captures nonlinear relationships and interactions without requiring them to be specified in advance.
  3. XGBoost provides a strong boosting benchmark with explicit control over learning rate, tree depth, row and column subsampling, and regularization.

Random Forest tuning evaluates mtry, minimum node size, sampling fraction, and both variance and extremely randomized tree split rules. The best extended location model uses mtry = 4, min.node.size = 3, sample.fraction = 1, and splitrule = "extratrees".
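
The mtry candidates depend on the number of predictors. As a quick check, the sketch below reapplies the candidate rule used in the project's make_rf_grid() function to the 8 location predictors and the 21 predictors of the complete spatial set. The name mtry_candidates is hypothetical and introduced only for this illustration.

```r
# Candidate mtry values: sqrt(p), p/3, p/2 and p,
# clamped to the range [1, p] and de-duplicated.
mtry_candidates <- function(p) {
  unique(pmax(1, pmin(p, c(floor(sqrt(p)), floor(p / 3), floor(p / 2), p))))
}

mtry_location <- mtry_candidates(8)   # 8 location predictors
mtry_spatial  <- mtry_candidates(21)  # 21 predictors in the full spatial set

# Grid size = mtry values x 5 node sizes x 3 sampling fractions x 2 split rules.
n_configs <- c(
  location = length(mtry_location) * 5 * 3 * 2,
  spatial  = length(mtry_spatial) * 5 * 3 * 2
)
n_configs
```

For 8 predictors the rule yields 2, 4, and 8, so the reported optimum mtry = 4 is the middle candidate rather than a boundary value of the grid.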

baseline_features <- c(
  "area_m2", "rooms", "floor", "floor_count", "build_year"
)

location_features <- c(
  baseline_features, "dist_center_km", "lon", "lat"
)

spatial_features <- c(
  location_features,
  "dist_green_m", "green_area_500_m2", "green_area_1000_m2",
  "green_share_500", "green_share_1000", "n_green_500", "n_green_1000",
  "dist_stop_m", "n_stops_500", "n_stops_1000",
  "dist_poi_m", "n_poi_500", "n_poi_1000"
)

set.seed(2026)
random_split <- initial_split(model_df, prop = 0.80)
train_random <- training(random_split)
test_random <- testing(random_split)

location_formula <- as.formula(
  paste("price_m2 ~", paste(location_features, collapse = " + "))
)

rf_location <- ranger(
  location_formula,
  data = train_random,
  num.trees = 700,
  importance = "permutation",
  seed = 2026
)

The extended XGBoost search evaluates 80 sampled configurations for the original target and 80 for its logarithm. The best validation configuration uses the original price_m2 target, eta = 0.08, max_depth = 6, min_child_weight = 1, subsample = 0.75, colsample_bytree = 0.90, gamma = 0, lambda = 1, alpha = 0, and 300 boosting rounds.

# Random Forest grids used for the location and complete spatial models.
make_rf_grid <- function(features, feature_set) {
  p <- length(features)
  mtry_values <- unique(pmax(1, pmin(
    p,
    c(floor(sqrt(p)), floor(p / 3), floor(p / 2), p)
  )))

  tidyr::crossing(
    feature_set = feature_set,
    mtry = mtry_values,
    min.node.size = c(3, 5, 10, 20, 40),
    sample.fraction = c(0.65, 0.80, 1.00),
    splitrule = c("variance", "extratrees")
  )
}

rf_grid <- bind_rows(
  make_rf_grid(location_features, "location"),
  make_rf_grid(spatial_features, "spatial")
)

# XGBoost also evaluates regularization and two target scales.
xgb_grid <- tidyr::crossing(
  eta = c(0.03, 0.05, 0.08, 0.10),
  max_depth = c(3, 4, 5, 6),
  min_child_weight = c(1, 3, 6),
  subsample = c(0.75, 0.90),
  colsample_bytree = c(0.75, 0.90),
  gamma = c(0, 1),
  lambda = c(1, 5),
  alpha = c(0, 0.5),
  nrounds = c(150, 300, 500)
) |>
  slice_sample(n = 80)

xgb_model <- xgb.train(
  params = list(
    objective = "reg:squarederror",
    eval_metric = "rmse",
    eta = 0.08,
    max_depth = 6,
    min_child_weight = 1,
    subsample = 0.75,
    colsample_bytree = 0.90,
    gamma = 0,
    lambda = 1,
    alpha = 0
  ),
  data = dtrain,
  nrounds = 300,
  verbose = 0
)

5.2 Evaluation metrics

Model performance is evaluated using:

  • RMSE, which penalizes large errors more strongly;
  • MAE, which represents the typical absolute prediction error more directly;
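  • R-squared, which measures the share of observed variation aligned with model predictions; the diagnostic code computes it with rsq_vec, which (assuming this is the yardstick function) is the squared correlation between observed and predicted values.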

RMSE and MAE are expressed in PLN per square metre. Lower values indicate better predictions, whereas higher R-squared is preferred.
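
The three metrics can be written in a few lines of base R. The sketch below is an illustration on made-up prices, not the project's evaluation code; in particular, defining R-squared as a squared correlation is an assumption (1 - SSE/SST is the other common choice).

```r
# Root mean squared error: squares residuals, so large misses dominate.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Mean absolute error: average size of a miss, in the units of the target.
mae <- function(observed, predicted) mean(abs(observed - predicted))

# R-squared as the squared correlation of observed and predicted values
# (assumed definition for this sketch).
rsq <- function(observed, predicted) cor(observed, predicted)^2

# Hypothetical asking prices in PLN per m2; the last prediction is a large miss.
observed  <- c(15000, 17000, 18000, 20000, 26000)
predicted <- c(15500, 16500, 18500, 19500, 21000)

metrics <- c(
  rmse = rmse(observed, predicted),
  mae  = mae(observed, predicted),
  rsq  = rsq(observed, predicted)
)
metrics
```

Four of the five errors are 500 PLN/m2 and one is 5,000 PLN/m2. MAE is 1,400 PLN/m2, while RMSE is about 2,280 PLN/m2 because the single large miss is squared before averaging.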

5.3 Validation design

Three validation schemes are used:

  • a conventional random train-test split;
  • a spatial split in which complete parts of the city are held out;
  • five-fold spatial cross-validation based on spatial grid cells.

The random split measures performance for new listings drawn from approximately the same spatial distribution as the training data. Spatial validation is more demanding: it asks the model to predict in geographically separated areas. The latter is therefore more informative about geographic transferability.
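
The key property of block-based validation is that whole cells, not individual listings, are assigned to folds. The base-R sketch below demonstrates this on synthetic coordinates; the extent, the grid arithmetic, and the seed are illustrative assumptions and do not reproduce the sf-based grid used in the project.

```r
set.seed(1)

# Synthetic projected coordinates in metres (hypothetical, not project data).
n <- 200
x <- runif(n, 0, 10000)
y <- runif(n, 0, 10000)

# Assign each point to a cell of a 5 x 5 grid covering the extent.
cell_size <- 2000
col_id  <- pmin(floor(x / cell_size), 4)
row_id  <- pmin(floor(y / cell_size), 4)
cell_id <- row_id * 5 + col_id + 1

# Assign whole cells, not individual points, to five folds.
cells     <- sort(unique(cell_id))
cell_fold <- sample(rep(1:5, length.out = length(cells)))
fold      <- cell_fold[match(cell_id, cells)]

# Each cell must appear in exactly one fold.
folds_per_cell <- tapply(fold, cell_id, function(f) length(unique(f)))
all(folds_per_cell == 1)
```

Because every point inherits the fold of its cell, a held-out fold never shares a cell with the training data, which is what limits leakage between neighbouring listings.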

# Build spatial blocks and hold out complete grid cells.
split_grid_raw <- st_make_grid(
  flats_features,
  n = c(5, 5),
  square = TRUE
)

split_grid <- st_sf(
  grid_id = seq_along(split_grid_raw),
  geometry = split_grid_raw
)

grid_membership <- st_join(
  flats_features |> select(offer_id),
  split_grid,
  join = st_within
) |>
  st_drop_geometry()

set.seed(2026)
available_grid_ids <- unique(na.omit(grid_membership$grid_id))
test_grid_ids <- sample(
  available_grid_ids,
  size = ceiling(length(available_grid_ids) * 0.20)
)

spatial_test_ids <- grid_membership |>
  filter(grid_id %in% test_grid_ids) |>
  pull(offer_id)

train_spatial <- model_df |> filter(!offer_id %in% spatial_test_ids)
test_spatial <- model_df |> filter(offer_id %in% spatial_test_ids)

# Five-fold CV assigns complete spatial cells rather than individual listings.
set.seed(2026)
grid_folds <- tibble(
  grid_id = sort(available_grid_ids),
  fold = sample(rep(1:5, length.out = length(available_grid_ids)))
)

Figure 2. Model performance under random and spatial train-test splits.

6 Results

6.1 Random train-test split

| Model | RMSE | MAE | R-squared |
|---|---|---|---|
| LM baseline | 3,597 | 2,862 | 0.035 |
| LM location | 3,061 | 2,395 | 0.299 |
| LM spatial | 2,983 | 2,302 | 0.335 |
| RF baseline | 2,826 | 2,102 | 0.402 |
| RF location | 2,143 | 1,559 | 0.657 |
| RF location + green | 2,178 | 1,609 | 0.649 |
| RF spatial tuned | 2,167 | 1,598 | 0.648 |
| XGBoost spatial tuned | 2,195 | 1,651 | 0.639 |

The results strongly support H1 and H3. Adding general location to the baseline Random Forest reduces RMSE from 2,826 to 2,143 PLN/m2, an improvement of approximately 24.2%. Relative to linear regression with the same location concept, Random Forest reduces RMSE by about 30.0%. This indicates that the relationship between apartment attributes, location, and price is not adequately represented by a purely additive linear specification.
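
The percentage improvements quoted here follow directly from the RMSE values in the random-split table. A minimal check, with pct_change as a hypothetical helper name:

```r
# Relative change in percent from a reference value to a new value.
pct_change <- function(reference, new) 100 * (new - reference) / reference

# RMSE values (PLN/m2) copied from the random-split table.
rmse_rf_baseline <- 2826
rmse_rf_location <- 2143
rmse_lm_location <- 3061

# Effect of adding location to the Random Forest.
gain_from_location <- round(pct_change(rmse_rf_baseline, rmse_rf_location), 1)

# Random Forest versus linear regression on the same location features.
gain_over_lm <- round(pct_change(rmse_lm_location, rmse_rf_location), 1)

c(location = gain_from_location, vs_lm = gain_over_lm)
```

Negative values indicate a reduction in RMSE: about 24.2% from adding location and about 30.0% from replacing the linear model with a Random Forest.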

The location Random Forest is the best model under the random split. Adding green variables increases RMSE by 35 PLN/m2 and MAE by 50 PLN/m2. The complete tuned spatial model also remains slightly weaker. Consequently, H2 is not supported by the predictive comparison.

Figure 3. Observed and predicted asking prices under random validation.

6.2 Spatial train-test split

| Model | RMSE | MAE | R-squared |
|---|---|---|---|
| RF baseline | 3,985 | 3,081 | 0.198 |
| RF location | 3,226 | 2,491 | 0.401 |
| RF location + green | 3,366 | 2,626 | 0.343 |
| RF location + accessibility | 3,283 | 2,557 | 0.373 |
| RF spatial | 3,332 | 2,610 | 0.349 |
| RF spatial tuned | 3,220 | 2,515 | 0.374 |
| XGBoost spatial tuned | 3,507 | 2,725 | 0.290 |

The tuned spatial Random Forest obtains the lowest RMSE, but its advantage over the simpler location model is only about 6 PLN/m2, or 0.2%. The location model has a lower MAE and higher R-squared. The small RMSE difference is not sufficient to justify the much larger feature set, especially because the location model is more stable in spatial cross-validation.

Adding green variables to the location model increases spatial-split RMSE by 140 PLN/m2, or approximately 4.3%, and lowers R-squared from 0.401 to 0.343. The weaker result under geographic separation suggests that the green variables partly describe local patterns that do not transfer consistently to held-out parts of Warsaw.

Figure 4. Spatial allocation of training and test observations.

6.3 Five-fold spatial cross-validation

| Model | Mean RMSE | SD RMSE | Mean MAE | Mean R-squared |
|---|---|---|---|---|
| RF baseline | 3,226 | 374 | 2,495 | 0.242 |
| RF location | 2,621 | 368 | 2,030 | 0.459 |
| RF location + green | 2,793 | 426 | 2,189 | 0.407 |
| RF location + accessibility | 2,758 | 424 | 2,156 | 0.399 |
| RF spatial | 2,834 | 439 | 2,218 | 0.378 |
| XGBoost spatial tuned | 2,710 | 483 | 2,082 | 0.412 |

Spatial cross-validation confirms that the location Random Forest is the most reliable specification. Its RMSE is approximately 22.3% higher than its random-split RMSE, while R-squared falls from 0.657 to 0.459. This supports H4 and illustrates why spatial validation is not merely an optional diagnostic. A random split answers an easier question because the training sample often contains geographically close analogues of test listings.

The fold-level standard deviations are also informative. The location model has an RMSE standard deviation of 368 PLN/m2, whereas XGBoost reaches 483 PLN/m2. XGBoost is therefore not only less accurate on average but also less stable across held-out regions.

Figure 5. Mean performance and variability across spatial folds.

6.4 Extended tuning

The extended Random Forest search slightly improves MAE for the location model, from 1,559 to 1,548 PLN/m2. RMSE is effectively unchanged at approximately 2,144 PLN/m2. The extended model raises R-squared, but it also produces a larger train-test performance gap. The gain is therefore too small to establish it as a clearly better final model.

Extended XGBoost tuning reduces MAE from 1,651 to 1,618 PLN/m2 and increases R-squared from 0.639 to 0.650. At the same time, RMSE increases from 2,195 to 2,213 PLN/m2. The tuning process therefore improves typical errors but does not reduce the larger mistakes emphasized by RMSE. Even after tuning, XGBoost does not surpass the location Random Forest.

This is a useful result rather than a failed benchmark. A more complex boosting model is not automatically superior when the sample is moderate in size, predictors overlap strongly, and a Random Forest already captures the main nonlinear location structure.

Figure 6. Comparison of the principal Random Forest and XGBoost specifications.

7 Interpretation of Spatial Effects

7.1 Variable importance

The extended location Random Forest ranks distance to the city centre as the most important variable, followed by construction year, latitude, longitude, floor area, and number of rooms. In the full spatial model, distance to the centre remains dominant. Public transport counts receive some importance, while individual green-space variables appear lower in the ranking.

This ordering is consistent with the descriptive correlations and model comparisons. Warsaw’s broad price gradient is primarily represented by centrality and coordinates. Construction year may proxy building standard, neighbourhood development period, and housing stock quality. Latitude and longitude capture spatial patterns that are not fully summarized by a single radial distance from the centre.

Variable importance should not be read as a causal ranking. Correlated predictors can divide or exchange importance, and coordinates can absorb information associated with many unobserved neighbourhood characteristics. The results show which variables help prediction, not which urban interventions would change prices.

variable_importance <- tibble(
  variable = names(rf_spatial_tuned$model$variable.importance),
  importance = as.numeric(rf_spatial_tuned$model$variable.importance)
) |>
  arrange(desc(importance))

partial_dependence <- bind_rows(
  manual_partial_dependence(
    rf_spatial_tuned$model,
    train_random,
    "dist_green_m"
  ),
  manual_partial_dependence(
    rf_spatial_tuned$model,
    train_random,
    "green_share_500"
  )
)

7.2 Green-space contribution

The nearest green area is very close for most observations, with a median distance of only 13.3 m. This limited variation may make the variable less discriminating. In addition, the OpenStreetMap definition combines green spaces of different type, size, accessibility, and quality. A landscaped public park and a small mapped grass polygon can both reduce the measured distance, even though their likely housing-market relevance differs.

The partial dependence profile for distance to green space is comparatively flat. Predicted prices change only modestly over most of the observed range, with no strong monotonic premium visible after other predictors are included. Together with the lack of improvement in test metrics, this suggests that the current green measures do not provide a robust independent price signal.
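
A partial dependence profile fixes one predictor at a grid value for every row, predicts, and averages the predictions. The sketch below shows that procedure on synthetic data with a linear model standing in for the Random Forest. The data, the model, and the function name are hypothetical; the project's manual_partial_dependence() helper is not shown in the report and may differ.

```r
# Manual partial dependence: mean prediction when one feature is set
# to each grid value for all rows. Works with models whose predict()
# method returns a numeric vector (e.g. lm).
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified))
  })
}

set.seed(7)
# Synthetic listings: price depends on centrality, not on green distance.
toy <- data.frame(
  dist_center_km = runif(300, 0, 12),
  dist_green_m   = runif(300, 0, 200)
)
toy$price_m2 <- 24000 - 900 * toy$dist_center_km + rnorm(300, sd = 1500)

fit <- lm(price_m2 ~ dist_center_km + dist_green_m, data = toy)

pd_center <- partial_dependence(fit, toy, "dist_center_km", c(1, 6, 11))
pd_green  <- partial_dependence(fit, toy, "dist_green_m", c(10, 100, 190))
```

In this toy setting the profile for distance to the centre falls steeply while the profile for distance to green space stays nearly flat, which is the qualitative pattern described above for the real model.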

Figure 7. Partial dependence profiles for selected spatial predictors.

The evidence therefore does not support H2 in its present operational form. A narrower definition based on large public parks, entrances, walking-network distance, vegetation quality, or remotely sensed canopy cover might produce a different result. The conclusion applies to the variables used here, not to the broader value of urban greenery.

8 Overfitting Diagnostics

The separate overfitting analysis compares in-sample and held-out performance for three Random Forest specifications.

| Model | Train RMSE | Test RMSE | Train R-squared | Test R-squared |
|---|---|---|---|---|
| RF location | 1,029 | 2,075 | 0.936 | 0.695 |
| RF spatial tuned | 886 | 2,096 | 0.953 | 0.687 |
| RF location extended | 858 | 2,144 | 0.957 | 0.672 |

All three models fit the training sample much better than the test sample. The gap is smallest for the standard location Random Forest and largest for the extended location model. Extended tuning reduces training RMSE by 171 PLN/m2 relative to the standard location model but increases test RMSE by 69 PLN/m2. This is a classic sign that the additional flexibility is learning sample-specific detail rather than a more transferable price function.
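
The size of the generalisation gap can be read off the overfitting table with a few lines of arithmetic. The sketch below only restates the reported RMSE values; the column names are chosen for this illustration.

```r
# Train and test RMSE (PLN/m2) copied from the overfitting table.
overfit <- data.frame(
  model      = c("RF location", "RF spatial tuned", "RF location extended"),
  train_rmse = c(1029, 886, 858),
  test_rmse  = c(2075, 2096, 2144)
)

# Absolute gap and test-to-train ratio for each specification.
overfit$gap   <- overfit$test_rmse - overfit$train_rmse
overfit$ratio <- round(overfit$test_rmse / overfit$train_rmse, 2)

overfit[order(overfit$gap), ]
```

The gap widens from 1,046 PLN/m2 for the standard location model to 1,210 PLN/m2 for the tuned spatial model and 1,286 PLN/m2 for the extended location model, while test RMSE does not improve.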

train_estimate <- predict(
  rf_location_extended,
  data = train_random
)$predictions

test_estimate <- predict(
  rf_location_extended,
  data = test_random
)$predictions

overfitting_metrics <- bind_rows(
  tibble(
    sample = "train",
    rmse = rmse_vec(train_random$price_m2, train_estimate),
    mae = mae_vec(train_random$price_m2, train_estimate),
    rsq = rsq_vec(train_random$price_m2, train_estimate)
  ),
  tibble(
    sample = "test",
    rmse = rmse_vec(test_random$price_m2, test_estimate),
    mae = mae_vec(test_random$price_m2, test_estimate),
    rsq = rsq_vec(test_random$price_m2, test_estimate)
  )
)

The absolute test values in this diagnostic run differ slightly from the main comparison because the models are refitted specifically for the train-versus-test exercise. The conclusion is nevertheless consistent: the simpler location Random Forest generalizes at least as well as the more aggressively tuned alternatives.

Figure 8. Train-test performance gaps for the evaluated Random Forest models.

9 Residual Spatial Autocorrelation

Moran’s I for the final residuals equals 0.003, compared with an expectation of approximately -0.00015 under spatial randomness. The p-value is 0.288, so there is no basis for rejecting the null hypothesis of no residual spatial autocorrelation at conventional significance levels.
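
The expected value quoted here is -1 / (n - 1) for n = 6,822 listings. The base-R sketch below computes it and, as a reading aid, evaluates global Moran's I on a six-site toy example with row-standardised weights. The toy values and the morans_i function are illustrative assumptions; the project itself uses spdep.

```r
# Expected Moran's I under spatial randomness: -1 / (n - 1).
n_listings <- 6822
expected_i <- -1 / (n_listings - 1)

# Global Moran's I for values z and a weights matrix w with zero diagonal.
morans_i <- function(z, w) {
  z_centered <- z - mean(z)
  n <- length(z)
  (n / sum(w)) * sum(w * outer(z_centered, z_centered)) / sum(z_centered^2)
}

# Toy example: six sites on a line, neighbours are adjacent sites,
# weights row-standardised (the "W" style).
adjacency <- matrix(0, 6, 6)
for (i in 1:5) {
  adjacency[i, i + 1] <- 1
  adjacency[i + 1, i] <- 1
}
w <- adjacency / rowSums(adjacency)

clustered   <- c(1, 1, 1, 5, 5, 5)  # similar values sit next to each other
alternating <- c(1, 5, 1, 5, 1, 5)  # neighbours always differ

toy_i <- c(
  expected    = expected_i,
  clustered   = morans_i(clustered, w),
  alternating = morans_i(alternating, w)
)
toy_i
```

The clustered pattern gives I = 2/3 and the alternating pattern gives I = -1, whereas the residual value of 0.003 sits very close to the expectation of about -0.00015. This is why the residuals are described as showing no global spatial pattern.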

# Test whether final residuals retain a global spatial pattern.
coordinates <- st_coordinates(flats_residuals)
knn <- knearneigh(coordinates, k = 8)
neighbours <- knn2nb(knn)
spatial_weights <- nb2listw(neighbours, style = "W", zero.policy = TRUE)

moran_result <- moran.test(
  flats_residuals$residual_rf_spatial,
  spatial_weights,
  zero.policy = TRUE
)

This is an encouraging diagnostic. Although prediction errors remain, they do not form a strong global spatial pattern under the selected neighbourhood structure. Coordinates, distance to the centre, and the remaining spatial predictors appear to capture most of the broad spatial dependence. The result does not prove that every local pattern has disappeared, but it reduces concern that the model systematically overlooks a city-wide spatial process.

Figure 9. Moran’s I diagnostic for model residuals.

10 Hypothesis Assessment

| Hypothesis | Decision | Evidence |
|---|---|---|
| H1: general location improves prediction | Supported | RF RMSE falls from 2,826 to 2,143 PLN/m2 after adding location |
| H2: green space adds predictive value beyond location | Not supported | Green variables increase RMSE under random and spatial validation |
| H3: nonlinear tree models outperform linear regression | Supported | RF location RMSE is about 30% lower than LM location RMSE |
| H4: spatial validation is more demanding | Supported | RF location RMSE rises from 2,143 to 2,621 PLN/m2 in spatial CV |

11 Limitations

Several limitations should be considered when interpreting the results.

First, advertised prices are not transaction prices. Sellers may set asking prices strategically, and the final sale price may differ. Second, the analysis uses one monthly cross-section. This avoids repeated listings but does not describe temporal market change. Third, duplicate or nearly duplicate advertisements may remain within a month if identifiers or property descriptions differ.

Fourth, OpenStreetMap completeness depends on contributor activity and tag consistency. The broad green-space definition measures mapped land use rather than perceived quality, safety, public access, vegetation condition, or park facilities. Euclidean distance also differs from actual walking distance along the street network.

Fifth, coordinates are powerful predictors but are difficult to interpret economically. They act as proxies for unobserved neighbourhood characteristics, including prestige, school access, employment centres, noise, and urban form. Finally, machine-learning importance and partial dependence describe predictive associations. They do not identify causal effects of green-space provision.

12 Conclusions

The project shows that location is the central component of apartment price prediction in Warsaw. Structural apartment characteristics provide a useful base, but coordinates and distance to the centre account for a large additional share of price variation. Random Forest captures these relationships much more effectively than a linear model.

The main research result is that the selected green-space variables do not improve out-of-sample prediction after general location is included. Their raw relationships with price are weak, their partial dependence profiles are relatively flat, and models containing them perform worse under spatial validation. This finding should be stated directly rather than hidden: the project tested a plausible hypothesis and found that its predictive support is limited under the adopted measurement strategy.

The preferred final specification is the standard Random Forest location model. It offers the best balance of random-split accuracy, spatial cross-validation performance, stability, and resistance to overfitting. The tuned full spatial Random Forest and XGBoost remain valuable benchmarks, but their additional complexity is not rewarded by better geographic generalization.

From a methodological perspective, the strongest lesson is the importance of spatial validation. A random split suggests an R-squared of 0.657, whereas five-fold spatial cross-validation gives 0.459 for the same location Random Forest specification. Reporting only the random result would materially overstate performance in geographically unseen areas. Combining predictive models with spatial validation and residual Moran diagnostics produces a more credible and complete assessment.

References and Data Sources

  • Jamroz, K., Apartment Prices in Poland. Kaggle dataset, monthly apartment sale and rental listings for major Polish cities, August 2023-June 2024. Available at: https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland. Licensed under the Apache License 2.0.
  • OpenStreetMap contributors, spatial data on green areas, public transport, and points of interest.
  • R Core Team, R: A Language and Environment for Statistical Computing.
  • Wright, M. N. and Ziegler, A., ranger: a fast implementation of Random Forests for high-dimensional data.
  • Chen, T. and Guestrin, C., XGBoost: A Scalable Tree Boosting System.