This study examines whether access to urban green space improves the prediction of apartment asking prices in Warsaw. Apartment listings from the June 2024 edition of the Apartment Prices in Poland dataset were combined with green areas, public transport stops, and points of interest obtained from OpenStreetMap. The final analytical sample contains 6,822 listings and 24 variables, including structural apartment characteristics, geographic coordinates, distance-based measures, and buffer-based measures of the surrounding environment.
Linear regression, Random Forest, and XGBoost models were compared under a conventional random split, a spatial train-test split, and five-fold spatial cross-validation. The best overall specification was a relatively parsimonious Random Forest using apartment and general location characteristics. Its random-split RMSE was 2,143 PLN per square metre and its R-squared was 0.657. Under spatial cross-validation, RMSE increased to 2,621 PLN and R-squared decreased to 0.459. This difference demonstrates that a random split gives an optimistic impression of predictive performance when nearby observations occur in both samples.
Green-space variables did not improve out-of-sample accuracy once location was included. Their addition increased random-split RMSE by approximately 1.6% and spatial-split RMSE by approximately 4.3%. This does not imply that green space has no value for residents or no relationship with prices. It means that the available measures of proximity and green coverage added little independent predictive information beyond coordinates and distance to the city centre. Residual Moran’s I was close to zero and statistically insignificant, suggesting that the final spatial model captured most of the broad spatial structure present in the data. The results support Random Forest as the main model, while XGBoost remains a useful but weaker benchmark in this application.
Apartment prices are inherently spatial. Two dwellings with similar floor area, room count, and construction year may have different prices because they occupy different parts of the city. Access to the centre, public transport, services, and environmental amenities can all contribute to this difference. Urban green space is especially interesting because it may represent both an amenity and a broader indicator of neighbourhood quality.
The purpose of this project is to test whether green-space accessibility contains useful predictive information about apartment asking prices per square metre in Warsaw. The project is designed as a predictive spatial analysis rather than a causal valuation study. Consequently, the central question is not whether constructing a park would cause a specific price increase. Instead, the analysis asks whether observed green-space measures improve predictions for listings that were not used during model training.
The study also addresses a practical methodological issue. Standard random validation may place geographically close listings in both the training and test sets. A model can then benefit from local similarity and appear more transferable than it really is. For this reason, conventional validation is supplemented with spatially separated evaluation.
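The analysis is guided by four research questions:

1. Does general location information improve the prediction of asking prices beyond structural apartment characteristics?
2. Do green-space variables add predictive value once general location is included?
3. Do nonlinear tree-based models outperform linear regression?
4. Is prediction under spatial validation more demanding than under a conventional random split?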
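The corresponding hypotheses are:

- H1: general location improves prediction.
- H2: green space adds predictive value beyond location.
- H3: nonlinear tree models outperform linear regression.
- H4: spatial validation is more demanding than a random split.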
The apartment data come from the Apartment Prices in Poland dataset, published on Kaggle by Krzysztof Jamroz. According to the dataset documentation, it contains monthly snapshots of apartment sale and rental offers from major Polish cities. The observations originate from local real-estate listing websites and were additionally enriched by the author with OpenStreetMap-based distances to selected points of interest.
The source folder used in this project contains eleven monthly CSV files covering August 2023 to June 2024. These snapshots contain 195,568 records for all available Polish cities and 59,246 records identified as Warsaw listings. The empirical analysis uses only the most recent snapshot, June 2024. Treating the latest month as a cross-section avoids counting very similar listings repeatedly across months and keeps the spatial unit of analysis clear.
The June 2024 file contains 6,962 Warsaw records before final modeling filters. After validation of coordinates, removal of incomplete or implausible observations, and construction of all required variables, 6,822 listings remain. The analytical sample therefore retains approximately 98.0% of the Warsaw records from the selected month.
The dependent variable is price_m2, measured in PLN per
square metre. It is calculated from listing price and apartment floor
area. As the data describe advertised properties, the dependent variable
should be interpreted as an asking price rather than a completed
transaction price.
```r
library(dplyr)
library(stringr)
library(sf)

# Read the June 2024 Kaggle sale-offer file used in the analysis.
# normalise_flat_columns() and safe_numeric() are project helpers not shown here.
dataset_file <- file.path("data", "apartments_pl_2024_06.csv")
flats_raw <- readr::read_csv(dataset_file, show_col_types = FALSE) |>
  mutate(source_file = basename(dataset_file)) |>
  normalise_flat_columns()

# Retain the Warsaw observations from this monthly snapshot.
flats_clean <- flats_raw |>
  filter(
    str_to_lower(city) %in% c("warszawa", "warsaw")
  ) |>
  mutate(
    price = safe_numeric(price),
    area_m2 = safe_numeric(area_m2),
    rooms = safe_numeric(rooms),
    lon = safe_numeric(lon),
    lat = safe_numeric(lat),
    price_m2 = price / area_m2
  ) |>
  # Drop implausible records and trim the 1% tails of the price distribution.
  # na.rm = TRUE keeps quantile() from failing if any price_m2 is missing.
  filter(
    is.finite(price_m2),
    area_m2 >= 10,
    area_m2 <= 300,
    rooms >= 1,
    between(
      price_m2,
      quantile(price_m2, 0.01, na.rm = TRUE),
      quantile(price_m2, 0.99, na.rm = TRUE)
    )
  ) |>
  mutate(offer_id = row_number())

# Convert to an sf object and project to the Polish metric CRS.
flats_sf <- st_as_sf(
  flats_clean,
  coords = c("lon", "lat"),
  crs = 4326,
  remove = FALSE
) |>
  st_transform(2180)
```

The dataset is distributed under the Apache License
2.0. This permissive license allows the material to be used,
reproduced, modified, and redistributed, including in commercial
applications. When the dataset or a modified version is redistributed,
the license information and existing attribution notices must be
retained, recipients must receive a copy of the license, and modified
files must carry prominent notices stating that they were changed. If the original
distribution includes a NOTICE file, its relevant
attribution notices must also be preserved. The license does not grant
rights to use the author’s trademarks and provides the material on an
“as is” basis, without warranties.
For this project, the practical requirement is met by identifying the dataset title, author, source URL, and license in both the report and the final R script. The analytical tables, engineered variables, models, and figures are transformations produced for this study; they should continue to carry the source attribution when shared together with the underlying data.
OpenStreetMap data are downloaded through the osmdata
package and stored in a local cache. Three groups of spatial objects are
used:

- green areas, stored as polygons;
- public transport stops;
- points of interest (POIs).
The source coordinates are stored in WGS 84 (EPSG:4326).
Spatial distances, buffers, and areas are calculated in the Polish
metric coordinate reference system EPSG:2180. This
transformation is necessary because calculations performed directly on
longitude and latitude would not yield reliable distances in metres or
areas in square metres.
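The distortion can be illustrated with a short base-R calculation. The sketch below uses the haversine formula with a spherical Earth radius of 6,371 km and an illustrative reference point near central Warsaw; both are assumptions made for the example and are not taken from the project script.

```r
# Haversine great-circle distance in metres between two lon/lat points (degrees).
haversine_m <- function(lon1, lat1, lon2, lat2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(sqrt(a))
}

# Illustrative reference point near central Warsaw (assumed for the example).
lon0 <- 21.01
lat0 <- 52.23

# Ground distance covered by the same 0.01-degree step in each direction.
step_east_m  <- haversine_m(lon0, lat0, lon0 + 0.01, lat0)
step_north_m <- haversine_m(lon0, lat0, lon0, lat0 + 0.01)

# East-west steps are shorter by a factor close to cos(latitude).
east_north_ratio <- step_east_m / step_north_m
```

At Warsaw's latitude the ratio is close to cos(52.23°), roughly 0.6, so a circular buffer drawn in raw degrees would be strongly compressed in the east-west direction. Projecting to EPSG:2180 removes this problem before distances and areas are computed.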
| Variable | Mean | Median | First quartile | Third quartile |
|---|---|---|---|---|
| Asking price per m2 (PLN) | 18,453 | 17,917 | 15,625 | 20,689 |
| Floor area (m2) | 57.10 | 52.36 | 41.60 | 67.00 |
| Number of rooms | 2.59 | 2.00 | 2.00 | 3.00 |
| Construction year | 1989 | 1995 | 1970 | 2012 |
| Distance to city centre (km) | 5.94 | 5.62 | 3.77 | 8.13 |
| Distance to nearest green area (m) | 24.05 | 13.28 | 5.73 | 32.91 |
| Green share within 500 m | 26.38% | 23.61% | 16.27% | 33.94% |
| Green share within 1,000 m | 29.63% | 27.68% | 21.76% | 35.69% |
| Distance to nearest transport stop (m) | 152.84 | 123.31 | 67.09 | 195.34 |
| Distance to nearest POI (m) | 92.34 | 58.23 | 31.39 | 107.92 |
The median asking price is lower than the mean, which indicates a moderately right-skewed price distribution. The typical listing has two rooms, an area slightly above 52 m2, and is located about 5.6 km from the defined city centre. The very short median distance to green space reflects the broad OpenStreetMap definition used in the project: it includes many types of mapped green polygons, not only large public parks.
The raw correlation between asking price and distance to the city centre is -0.472. In contrast, the correlations between price and distance to green space, green share within 500 m, and green share within 1,000 m are -0.082, -0.069, and -0.015, respectively. These bivariate relationships already suggest that centrality is a stronger market signal than the selected green-space measures. However, correlations alone cannot account for nonlinear effects, interactions, or confounding between location variables, so model-based comparisons remain necessary.
Figure 1. Spatial distribution of apartment asking prices and mapped green areas in Warsaw.
Five conceptually distinct groups of predictors are constructed.
| Feature set | Main variables | Purpose |
|---|---|---|
| Baseline | area, rooms, floor, floor count, construction year | Describe the apartment and building |
| Location | baseline, longitude, latitude, distance to centre | Capture broad urban price gradients |
| Green | location, distance to green, green counts, green area and share in 500 m and 1,000 m buffers | Measure green-space accessibility and exposure |
| Accessibility | location, distance and counts of stops and POIs | Represent transport and service accessibility |
| Spatial | all variables above | Test the complete information set |
Both 500 m and 1,000 m buffers are used because they represent different spatial scales. The smaller buffer approximates the immediate walking environment, while the larger buffer describes a wider neighbourhood. Counts and area shares are included together because the number of polygons does not necessarily represent their total surface. One large park and several small green fragments may have similar counts but very different environmental meaning.
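A small worked example shows why counts and shares carry different information. The two surroundings below are hypothetical; only the share formula, green area divided by the area of the circular buffer, follows the definition used for green_share_500 and green_share_1000.

```r
# Share of a circular buffer covered by green area.
green_share <- function(green_area_m2, radius_m) {
  green_area_m2 / (pi * radius_m^2)
}

# Two hypothetical surroundings with the same polygon count within 500 m.
surroundings <- data.frame(
  case = c("one large park plus two lawns", "three small lawns"),
  n_green_500 = c(3, 3),
  green_area_500_m2 = c(200000, 6000)
)

surroundings$green_share_500 <- green_share(surroundings$green_area_500_m2, 500)
surroundings
```

Both rows have three green polygons, yet the first covers about a quarter of the buffer and the second less than one percent. A model that saw only the counts could not separate the two cases.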
```r
library(osmdata)

# Download green polygons from OpenStreetMap.
# bbox, safe_select_osm(), and the buffer helpers are defined elsewhere in the project.
parks <- opq(bbox) |>
  add_osm_feature(
    key = "leisure",
    value = c("park", "garden", "recreation_ground")
  ) |>
  osmdata_sf()

green_polygons <- safe_select_osm(parks, "polygons") |>
  st_make_valid() |>
  st_transform(2180) |>
  filter(!st_is_empty(geometry)) |>
  mutate(green_id = row_number())

# Construct distance-, count-, and area-based spatial predictors.
flats_features <- flats_sf |>
  mutate(
    dist_center_km = as.numeric(st_distance(geometry, city_center)) / 1000,
    dist_green_m = nearest_distance_m(flats_sf, green_polygons),
    dist_stop_m = nearest_distance_m(flats_sf, stops),
    dist_poi_m = nearest_distance_m(flats_sf, poi_points),
    n_green_500 = count_in_buffer(flats_sf, green_polygons, 500),
    n_green_1000 = count_in_buffer(flats_sf, green_polygons, 1000),
    green_area_500_m2 = area_in_buffer_m2(flats_sf, green_polygons, 500),
    green_area_1000_m2 = area_in_buffer_m2(flats_sf, green_polygons, 1000),
    green_share_500 = green_area_500_m2 / (pi * 500^2),
    green_share_1000 = green_area_1000_m2 / (pi * 1000^2)
  )
```

Three model families are estimated:

- linear regression (LM);
- Random Forest (RF);
- XGBoost.
Random Forest tuning evaluates mtry, minimum node size,
sampling fraction, and both variance and extremely randomized tree split
rules. The best extended location model uses mtry = 4,
min.node.size = 3, sample.fraction = 1, and
splitrule = "extratrees".
```r
library(rsample)
library(ranger)

# Feature sets: each set extends the previous one.
baseline_features <- c(
  "area_m2", "rooms", "floor", "floor_count", "build_year"
)
location_features <- c(
  baseline_features, "dist_center_km", "lon", "lat"
)
spatial_features <- c(
  location_features,
  "dist_green_m", "green_area_500_m2", "green_area_1000_m2",
  "green_share_500", "green_share_1000", "n_green_500", "n_green_1000",
  "dist_stop_m", "n_stops_500", "n_stops_1000",
  "dist_poi_m", "n_poi_500", "n_poi_1000"
)

# Conventional random split: 80% training, 20% test.
set.seed(2026)
random_split <- initial_split(model_df, prop = 0.80)
train_random <- training(random_split)
test_random <- testing(random_split)

# Standard location Random Forest with permutation importance.
location_formula <- as.formula(
  paste("price_m2 ~", paste(location_features, collapse = " + "))
)
rf_location <- ranger(
  location_formula,
  data = train_random,
  num.trees = 700,
  importance = "permutation",
  seed = 2026
)
```

The extended XGBoost search evaluates 80 sampled configurations for
the original target and 80 for its logarithm. The best validation
configuration uses the original price_m2 target,
eta = 0.08, max_depth = 6,
min_child_weight = 1, subsample = 0.75,
colsample_bytree = 0.90, gamma = 0,
lambda = 1, alpha = 0, and 300 boosting
rounds.
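The size of the two search spaces can be checked with a short calculation. The sketch below mirrors the mtry rule and the candidate values that appear in the grid-construction code of this report; the helper names mtry_candidates and rf_grid_size are introduced here for illustration only.

```r
# Candidate mtry values for p predictors, following the rule used in the RF grid.
mtry_candidates <- function(p) {
  unique(pmax(1, pmin(p, c(floor(sqrt(p)), floor(p / 3), floor(p / 2), p))))
}

# Number of Random Forest configurations for a feature set of size p:
# mtry candidates x 5 node sizes x 3 sampling fractions x 2 split rules.
rf_grid_size <- function(p) length(mtry_candidates(p)) * 5 * 3 * 2

rf_location_size <- rf_grid_size(8)   # 8 predictors in the location set
rf_spatial_size  <- rf_grid_size(21)  # 21 predictors in the full spatial set

# Full XGBoost grid implied by the candidate values, before sampling 80 rows.
xgb_levels <- c(
  eta = 4, max_depth = 4, min_child_weight = 3, subsample = 2,
  colsample_bytree = 2, gamma = 2, lambda = 2, alpha = 2, nrounds = 3
)
xgb_full_size <- prod(xgb_levels)
xgb_sampled_share <- 80 / xgb_full_size
```

Under these counts the Random Forest search is exhaustive over 90 and 120 configurations, whereas the 80 sampled XGBoost configurations cover under 2% of the 4,608 possible combinations. This difference in coverage is worth keeping in mind when the two tuned models are compared.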
```r
library(xgboost)

# Random Forest grids used for the location and complete spatial models.
make_rf_grid <- function(features, feature_set) {
  p <- length(features)
  # Candidate mtry values, clamped to the range [1, p].
  mtry_values <- unique(pmax(1, pmin(
    p,
    c(floor(sqrt(p)), floor(p / 3), floor(p / 2), p)
  )))
  tidyr::crossing(
    feature_set = feature_set,
    mtry = mtry_values,
    min.node.size = c(3, 5, 10, 20, 40),
    sample.fraction = c(0.65, 0.80, 1.00),
    splitrule = c("variance", "extratrees")
  )
}

rf_grid <- bind_rows(
  make_rf_grid(location_features, "location"),
  make_rf_grid(spatial_features, "spatial")
)

# XGBoost also evaluates regularization and two target scales.
xgb_grid <- tidyr::crossing(
  eta = c(0.03, 0.05, 0.08, 0.10),
  max_depth = c(3, 4, 5, 6),
  min_child_weight = c(1, 3, 6),
  subsample = c(0.75, 0.90),
  colsample_bytree = c(0.75, 0.90),
  gamma = c(0, 1),
  lambda = c(1, 5),
  alpha = c(0, 0.5),
  nrounds = c(150, 300, 500)
) |>
  slice_sample(n = 80)

# Refit the best validation configuration; dtrain is the xgb.DMatrix built elsewhere.
xgb_model <- xgb.train(
  params = list(
    objective = "reg:squarederror",
    eval_metric = "rmse",
    eta = 0.08,
    max_depth = 6,
    min_child_weight = 1,
    subsample = 0.75,
    colsample_bytree = 0.90,
    gamma = 0,
    lambda = 1,
    alpha = 0
  ),
  data = dtrain,
  nrounds = 300,
  verbose = 0
)
```

Model performance is evaluated using:

- root mean squared error (RMSE);
- mean absolute error (MAE);
- R-squared.
RMSE and MAE are expressed in PLN per square metre. Lower values indicate better predictions, whereas higher R-squared is preferred.
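For reference, the three metrics can be written out in a few lines of base R. This is a minimal sketch on hypothetical prices, not the project code, which calls rmse_vec, mae_vec, and rsq_vec. R-squared is defined here as the squared correlation between observed and predicted values, which is understood to be the definition behind yardstick's rsq_vec.

```r
# Root mean squared error: penalises large errors more heavily.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Mean absolute error: the average size of an error.
mae <- function(truth, estimate) mean(abs(truth - estimate))

# R-squared as the squared correlation between observed and predicted values.
rsq <- function(truth, estimate) cor(truth, estimate)^2

# Hypothetical asking prices and predictions in PLN per square metre.
observed  <- c(15000, 17500, 18000, 21000, 24000)
predicted <- c(15500, 17000, 19000, 20000, 21000)

metrics <- c(
  rmse = rmse(observed, predicted),
  mae  = mae(observed, predicted),
  rsq  = rsq(observed, predicted)
)
metrics
```

In this example a single error of 3,000 PLN/m2 pulls RMSE well above MAE. The same mechanism explains why a model can improve on MAE while RMSE stays flat or worsens.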
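Three validation schemes are used:

- a conventional random train-test split (80% training, 20% test);
- a spatial train-test split that holds out about 20% of the cells of a 5 x 5 grid;
- five-fold spatial cross-validation, in which complete grid cells are assigned to folds.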
The random split measures performance for new listings drawn from approximately the same spatial distribution as the training data. Spatial validation is more demanding: it asks the model to predict in geographically separated areas. The latter is therefore more informative about geographic transferability.
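The difference between the two settings can be illustrated on synthetic points. The sketch below does not use the Warsaw data: it draws random locations in a unit square, applies a random split and a block split on a 5 x 5 grid, and measures how far each test point lies from its nearest training point. All names and values are assumptions made for the example.

```r
set.seed(1)

# Synthetic listing locations in a unit square (not the Warsaw data).
n <- 2000
pts <- data.frame(x = runif(n), y = runif(n))

# Assign each point to a cell of a 5 x 5 grid.
cell_x <- pmin(floor(pts$x * 5), 4)
cell_y <- pmin(floor(pts$y * 5), 4)
pts$cell <- cell_y * 5 + cell_x + 1

# Random split: hold out 20% of individual points.
random_test <- seq_len(n) %in% sample(n, size = 0.2 * n)

# Spatial split: hold out 5 complete grid cells (20% of 25 cells).
test_cells <- sample(sort(unique(pts$cell)), size = 5)
spatial_test <- pts$cell %in% test_cells

# Distance from each test point to its nearest training point.
nearest_train_dist <- function(is_test) {
  test <- as.matrix(pts[is_test, c("x", "y")])
  train <- as.matrix(pts[!is_test, c("x", "y")])
  apply(test, 1, function(p) {
    sqrt(min((train[, 1] - p[1])^2 + (train[, 2] - p[2])^2))
  })
}

median_random  <- median(nearest_train_dist(random_test))
median_spatial <- median(nearest_train_dist(spatial_test))
```

Under the random split almost every test point has a training point close by. Under the block split the typical distance is larger, because the nearest training points lie outside the held-out cell. This is the mechanism that makes spatial validation the harder test.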
```r
# Build spatial blocks and hold out complete grid cells.
split_grid_raw <- st_make_grid(
  flats_features,
  n = c(5, 5),
  square = TRUE
)
split_grid <- st_sf(
  grid_id = seq_along(split_grid_raw),
  geometry = split_grid_raw
)

# Assign each listing to the grid cell that contains it.
grid_membership <- st_join(
  flats_features |> select(offer_id),
  split_grid,
  join = st_within
) |>
  st_drop_geometry()

# Hold out about 20% of the occupied cells as the spatial test set.
set.seed(2026)
available_grid_ids <- unique(na.omit(grid_membership$grid_id))
test_grid_ids <- sample(
  available_grid_ids,
  size = ceiling(length(available_grid_ids) * 0.20)
)
spatial_test_ids <- grid_membership |>
  filter(grid_id %in% test_grid_ids) |>
  pull(offer_id)

train_spatial <- model_df |> filter(!offer_id %in% spatial_test_ids)
test_spatial <- model_df |> filter(offer_id %in% spatial_test_ids)

# Five-fold CV assigns complete spatial cells rather than individual listings.
set.seed(2026)
grid_folds <- tibble(
  grid_id = sort(available_grid_ids),
  fold = sample(rep(1:5, length.out = length(available_grid_ids)))
)
```

Figure 2. Model performance under random and spatial train-test splits.
| Model | RMSE | MAE | R-squared |
|---|---|---|---|
| LM baseline | 3,597 | 2,862 | 0.035 |
| LM location | 3,061 | 2,395 | 0.299 |
| LM spatial | 2,983 | 2,302 | 0.335 |
| RF baseline | 2,826 | 2,102 | 0.402 |
| RF location | 2,143 | 1,559 | 0.657 |
| RF location + green | 2,178 | 1,609 | 0.649 |
| RF spatial tuned | 2,167 | 1,598 | 0.648 |
| XGBoost spatial tuned | 2,195 | 1,651 | 0.639 |
The results strongly support H1 and H3. Adding general location to the baseline Random Forest reduces RMSE from 2,826 to 2,143 PLN/m2, an improvement of approximately 24.2%. Relative to linear regression with the same location feature set (LM location, RMSE 3,061 PLN/m2), Random Forest reduces RMSE by about 30.0%. This indicates that the relationship between apartment attributes, location, and price is not adequately represented by a purely additive linear specification.
The location Random Forest is the best model under the random split. Adding green variables increases RMSE by 35 PLN/m2 and MAE by 50 PLN/m2. The complete tuned spatial model also remains slightly weaker. Consequently, H2 is not supported by the predictive comparison.
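The percentages quoted for the random split follow directly from the table. The short check below recomputes them from the reported RMSE values; the helper pct_change is introduced here for illustration.

```r
# Relative change in an error metric, in percent (negative = improvement).
pct_change <- function(from, to) 100 * (to - from) / from

# Random-split RMSE values (PLN per square metre) taken from the results table.
rmse_random <- c(
  lm_location = 3061,
  rf_baseline = 2826,
  rf_location = 2143,
  rf_location_green = 2178
)

changes <- c(
  location_vs_baseline = pct_change(
    rmse_random[["rf_baseline"]], rmse_random[["rf_location"]]
  ),
  rf_vs_lm = pct_change(
    rmse_random[["lm_location"]], rmse_random[["rf_location"]]
  ),
  green_vs_location = pct_change(
    rmse_random[["rf_location"]], rmse_random[["rf_location_green"]]
  )
)
round(changes, 1)
```

The first two values are negative because they describe error reductions. The third is positive, reflecting the small loss of accuracy after the green variables are added.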
Figure 3. Observed and predicted asking prices under random validation.
| Model | RMSE | MAE | R-squared |
|---|---|---|---|
| RF baseline | 3,985 | 3,081 | 0.198 |
| RF location | 3,226 | 2,491 | 0.401 |
| RF location + green | 3,366 | 2,626 | 0.343 |
| RF location + accessibility | 3,283 | 2,557 | 0.373 |
| RF spatial | 3,332 | 2,610 | 0.349 |
| RF spatial tuned | 3,220 | 2,515 | 0.374 |
| XGBoost spatial tuned | 3,507 | 2,725 | 0.290 |
The tuned spatial Random Forest obtains the lowest RMSE, but its advantage over the simpler location model is only about 6 PLN/m2, or 0.2%. The location model has a lower MAE and higher R-squared. The small RMSE difference is not sufficient to justify the much larger feature set, especially because the location model is more stable in spatial cross-validation.
Adding green variables to the location model increases spatial-split RMSE by 140 PLN/m2, or approximately 4.3%, and lowers R-squared from 0.401 to 0.343. The weaker result under geographic separation suggests that the green variables partly describe local patterns that do not transfer consistently to held-out parts of Warsaw.
Figure 4. Spatial allocation of training and test observations.
| Model | Mean RMSE | SD RMSE | Mean MAE | Mean R-squared |
|---|---|---|---|---|
| RF baseline | 3,226 | 374 | 2,495 | 0.242 |
| RF location | 2,621 | 368 | 2,030 | 0.459 |
| RF location + green | 2,793 | 426 | 2,189 | 0.407 |
| RF location + accessibility | 2,758 | 424 | 2,156 | 0.399 |
| RF spatial | 2,834 | 439 | 2,218 | 0.378 |
| XGBoost spatial tuned | 2,710 | 483 | 2,082 | 0.412 |
Spatial cross-validation confirms that the location Random Forest is the most reliable specification. Its RMSE is approximately 22.3% higher than its random-split RMSE, while R-squared falls from 0.657 to 0.459. This supports H4 and illustrates why spatial validation is not merely an optional diagnostic. A random split answers an easier question because the training sample often contains geographically close analogues of test listings.
The fold-level standard deviations are also informative. The location model has an RMSE standard deviation of 368 PLN/m2, whereas XGBoost reaches 483 PLN/m2. XGBoost is therefore not only less accurate on average but also less stable across held-out regions.
Figure 5. Mean performance and variability across spatial folds.
The extended Random Forest search slightly improves MAE for the location model, from 1,559 to 1,548 PLN/m2. RMSE is effectively unchanged at approximately 2,144 PLN/m2. The extended model raises R-squared, but it also produces a larger train-test performance gap. The gain is therefore too small to establish it as a clearly better final model.
Extended XGBoost tuning reduces MAE from 1,651 to 1,618 PLN/m2 and increases R-squared from 0.639 to 0.650. At the same time, RMSE increases from 2,195 to 2,213 PLN/m2. The tuning process therefore reduces typical errors but not the larger mistakes that RMSE penalizes more heavily. The tuned XGBoost still does not surpass the location Random Forest.
This is a useful result rather than a failed benchmark. A more complex boosting model is not automatically superior when the sample is moderate in size, predictors overlap strongly, and a Random Forest already captures the main nonlinear location structure.
Figure 6. Comparison of the principal Random Forest and XGBoost specifications.
The extended location Random Forest ranks distance to the city centre as the most important variable, followed by construction year, latitude, longitude, floor area, and number of rooms. In the full spatial model, distance to the centre remains dominant. Public transport counts receive some importance, while individual green-space variables appear lower in the ranking.
This ordering is consistent with the descriptive correlations and model comparisons. Warsaw’s broad price gradient is primarily represented by centrality and coordinates. Construction year may proxy building standard, neighbourhood development period, and housing stock quality. Latitude and longitude capture spatial patterns that are not fully summarized by a single radial distance from the centre.
Variable importance should not be read as a causal ranking. Correlated predictors can divide or exchange importance, and coordinates can absorb information associated with many unobserved neighbourhood characteristics. The results show which variables help prediction, not which urban interventions would change prices.
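The proxy problem can be reproduced on synthetic data. The sketch below uses a linear model and a hand-written permutation importance rather than the ranger implementation used in the project. The variable x2 is constructed as a noisy copy of x1 and has no direct effect on the outcome; all names and values are assumptions made for the example.

```r
set.seed(42)

# Synthetic data: x2 is a noisy copy of x1, and only x1 and x3 drive the outcome.
n <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
y <- 2 * x1 + 0.5 * x3 + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2, x3)

# Permutation importance: increase in RMSE after shuffling one predictor.
perm_importance <- function(model, data, variable, n_repeats = 20) {
  rmse <- function(d) sqrt(mean((d$y - predict(model, newdata = d))^2))
  base_rmse <- rmse(data)
  shuffled_rmse <- replicate(n_repeats, {
    d <- data
    d[[variable]] <- sample(d[[variable]])
    rmse(d)
  })
  mean(shuffled_rmse) - base_rmse
}

# The true driver x1 is withheld, so the proxy x2 absorbs its importance.
proxy_model <- lm(y ~ x2 + x3, data = dat)
importance_proxy <- sapply(
  c("x2", "x3"),
  function(v) perm_importance(proxy_model, dat, v)
)
importance_proxy
```

Here x2 ranks first although changing it would not change y. Coordinates in the apartment models can play the same role for neighbourhood characteristics that were never measured.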
```r
# Permutation importance from the tuned spatial Random Forest.
variable_importance <- tibble(
  variable = names(rf_spatial_tuned$model$variable.importance),
  importance = as.numeric(rf_spatial_tuned$model$variable.importance)
) |>
  arrange(desc(importance))

# Partial dependence profiles for two green-space predictors.
partial_dependence <- bind_rows(
  manual_partial_dependence(
    rf_spatial_tuned$model,
    train_random,
    "dist_green_m"
  ),
  manual_partial_dependence(
    rf_spatial_tuned$model,
    train_random,
    "green_share_500"
  )
)
```

The nearest green area is very close for most observations, with a median distance of only 13.3 m. This limited variation may make the variable less discriminating. In addition, the OpenStreetMap definition combines green spaces of different type, size, accessibility, and quality. A landscaped public park and a small mapped grass polygon can both reduce the measured distance, even though their likely housing-market relevance differs.
The partial dependence profile for distance to green space is comparatively flat. Predicted prices change only modestly over most of the observed range, with no strong monotonic premium visible after other predictors are included. Together with the lack of improvement in test metrics, this suggests that the current green measures do not provide a robust independent price signal.
Figure 7. Partial dependence profiles for selected spatial predictors.
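The logic of a partial dependence profile can be sketched in base R. The project helper manual_partial_dependence is not shown in this report, so the function pd_profile below is a hypothetical stand-in. It is applied to synthetic data in which price depends on distance to the centre but not on distance to green space.

```r
set.seed(7)

# Synthetic data with a centrality effect and no green-distance effect.
n <- 400
dat <- data.frame(
  dist_center_km = runif(n, 0, 15),
  dist_green_m = runif(n, 0, 200)
)
dat$price_m2 <- 22000 - 500 * dat$dist_center_km + rnorm(n, sd = 1500)

model <- lm(price_m2 ~ dist_center_km + dist_green_m, data = dat)

# Hypothetical helper: average prediction with one variable fixed at each grid value.
pd_profile <- function(model, data, variable, grid_size = 20) {
  grid <- seq(min(data[[variable]]), max(data[[variable]]), length.out = grid_size)
  avg_pred <- vapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified))
  }, numeric(1))
  data.frame(variable = variable, value = grid, mean_prediction = avg_pred)
}

pd_center <- pd_profile(model, dat, "dist_center_km")
pd_green <- pd_profile(model, dat, "dist_green_m")

# Vertical range of each profile: a flat profile has a small range.
range_center <- diff(range(pd_center$mean_prediction))
range_green <- diff(range(pd_green$mean_prediction))
```

The profile for distance to the centre spans several thousand PLN/m2, while the green-distance profile stays nearly flat. A flat profile of this kind is what the report observes for dist_green_m, although there it comes from a Random Forest rather than a linear model.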
The evidence therefore rejects H2 in its present operational form. A narrower definition based on large public parks, entrances, walking-network distance, vegetation quality, or remotely sensed canopy cover might produce a different result. The conclusion applies to the variables used here, not to the broader value of urban greenery.
The separate overfitting analysis compares in-sample and held-out performance for three Random Forest specifications.
| Model | Train RMSE | Test RMSE | Train R-squared | Test R-squared |
|---|---|---|---|---|
| RF location | 1,029 | 2,075 | 0.936 | 0.695 |
| RF spatial tuned | 886 | 2,096 | 0.953 | 0.687 |
| RF location extended | 858 | 2,144 | 0.957 | 0.672 |
All three models fit the training sample much better than the test sample. The gap is smallest for the standard location Random Forest and largest for the extended location model. Extended tuning reduces training RMSE by 171 PLN/m2 relative to the standard location model but increases test RMSE by 69 PLN/m2. This is a classic sign that the additional flexibility is learning sample-specific detail rather than a more transferable price function.
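The gaps can be recomputed from the table in a few lines. The ratio column is an additional indicator introduced here for illustration; it is not reported elsewhere in the study.

```r
# Train and test RMSE (PLN per square metre) from the overfitting table.
overfit <- data.frame(
  model = c("RF location", "RF spatial tuned", "RF location extended"),
  train_rmse = c(1029, 886, 858),
  test_rmse = c(2075, 2096, 2144)
)

# Absolute gap and test-to-train ratio as simple overfitting indicators.
overfit$gap <- overfit$test_rmse - overfit$train_rmse
overfit$ratio <- round(overfit$test_rmse / overfit$train_rmse, 2)

# Order from the smallest to the largest gap.
overfit[order(overfit$gap), ]
```

The ordering is the same under both indicators: the standard location model has the smallest gap and the extended location model the largest.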
```r
library(yardstick)

# Predictions of the extended location model on the training and test samples.
train_estimate <- predict(
  rf_location_extended,
  data = train_random
)$predictions
test_estimate <- predict(
  rf_location_extended,
  data = test_random
)$predictions

# Compare in-sample and held-out metrics side by side.
overfitting_metrics <- bind_rows(
  tibble(
    sample = "train",
    rmse = rmse_vec(train_random$price_m2, train_estimate),
    mae = mae_vec(train_random$price_m2, train_estimate),
    rsq = rsq_vec(train_random$price_m2, train_estimate)
  ),
  tibble(
    sample = "test",
    rmse = rmse_vec(test_random$price_m2, test_estimate),
    mae = mae_vec(test_random$price_m2, test_estimate),
    rsq = rsq_vec(test_random$price_m2, test_estimate)
  )
)
```

The absolute test values in this diagnostic run differ slightly from the main comparison because the models are refitted specifically for the train-versus-test exercise. The conclusion is nevertheless consistent: the simpler location Random Forest generalizes at least as well as the more aggressively tuned alternatives.
Figure 8. Train-test performance gaps for the evaluated Random Forest models.
Moran’s I for the final residuals equals 0.003, compared with an expectation of approximately -0.00015 under spatial randomness. The p-value is 0.288, so there is no basis for rejecting the null hypothesis of no residual spatial autocorrelation at conventional significance levels.
```r
library(spdep)

# Test whether final residuals retain a global spatial pattern.
coordinates <- st_coordinates(flats_residuals)

# Row-standardised weights based on the 8 nearest neighbours of each listing.
knn <- knearneigh(coordinates, k = 8)
neighbours <- knn2nb(knn)
spatial_weights <- nb2listw(neighbours, style = "W", zero.policy = TRUE)

moran_result <- moran.test(
  flats_residuals$residual_rf_spatial,
  spatial_weights,
  zero.policy = TRUE
)
```

This is an encouraging diagnostic. Although prediction errors remain, they do not form a strong global spatial pattern under the selected neighbourhood structure. Coordinates, distance to the centre, and the remaining spatial predictors appear to capture most of the broad spatial dependence. The result does not prove that every local pattern has disappeared, but it reduces concern that the model systematically overlooks a city-wide spatial process.
Figure 9. Moran’s I diagnostic for model residuals.
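The statistic itself is simple to compute by hand. The sketch below is a base-R illustration on synthetic residuals with no spatial structure; it reproduces only the Moran's I value and its expectation under randomness, not the significance test that spdep's moran.test provides. The helper names knn_weights and morans_i are introduced for the example.

```r
set.seed(3)

# Synthetic residuals at random locations (no spatial structure by construction).
n <- 300
coords <- cbind(x = runif(n), y = runif(n))
residual <- rnorm(n)

# Row-standardised k-nearest-neighbour weights matrix.
knn_weights <- function(coords, k = 8) {
  d <- as.matrix(dist(coords))
  diag(d) <- Inf
  w <- matrix(0, nrow(d), ncol(d))
  for (i in seq_len(nrow(d))) {
    w[i, order(d[i, ])[seq_len(k)]] <- 1 / k
  }
  w
}

# Global Moran's I: (n / S0) * (z' W z) / (z' z), with z the centred values.
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

w <- knn_weights(coords)
i_observed <- morans_i(residual, w)
i_expected <- -1 / (n - 1)

# Expectation under randomness for the study sample of 6,822 listings.
i_expected_study <- -1 / (6822 - 1)
```

The last line gives approximately -0.00015, the expectation quoted for the final residuals. An observed value of 0.003 is therefore only marginally above what spatial randomness would produce.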
| Hypothesis | Decision | Evidence |
|---|---|---|
| H1: general location improves prediction | Supported | RF RMSE falls from 2,826 to 2,143 PLN/m2 after adding location |
| H2: green space adds predictive value beyond location | Not supported | Green variables increase RMSE under random and spatial validation |
| H3: nonlinear tree models outperform linear regression | Supported | RF location RMSE is about 30% lower than LM location RMSE |
| H4: spatial validation is more demanding | Supported | RF location RMSE rises from 2,143 to 2,621 PLN/m2 in spatial CV |
Several limitations should be considered when interpreting the results.
First, advertised prices are not transaction prices. Sellers may set asking prices strategically, and the final sale price may differ. Second, the analysis uses one monthly cross-section. This avoids repeated listings but does not describe temporal market change. Third, duplicate or nearly duplicate advertisements may remain within a month if identifiers or property descriptions differ.
Fourth, OpenStreetMap completeness depends on contributor activity and tag consistency. The broad green-space definition measures mapped land use rather than perceived quality, safety, public access, vegetation condition, or park facilities. Euclidean distance also differs from actual walking distance along the street network.
Fifth, coordinates are powerful predictors but are difficult to interpret economically. They act as proxies for unobserved neighbourhood characteristics, including prestige, school access, employment centres, noise, and urban form. Finally, machine-learning importance and partial dependence describe predictive associations. They do not identify causal effects of green-space provision.
The project shows that location is the central component of apartment price prediction in Warsaw. Structural apartment characteristics provide a useful base, but coordinates and distance to the centre account for a large additional share of price variation. Random Forest captures these relationships much more effectively than a linear model.
The main research result is that the selected green-space variables do not improve out-of-sample prediction after general location is included. Their raw relationships with price are weak, their partial dependence profiles are relatively flat, and models containing them perform worse under spatial validation. This finding should be stated directly rather than hidden: the project tested a plausible hypothesis and found that its predictive support is limited under the adopted measurement strategy.
The preferred final specification is the standard Random Forest location model. It offers the best balance of random-split accuracy, spatial cross-validation performance, stability, and resistance to overfitting. The tuned full spatial Random Forest and XGBoost remain valuable benchmarks, but their additional complexity is not rewarded by better geographic generalization.
From a methodological perspective, the strongest lesson is the importance of spatial validation. A random split suggests an R-squared of 0.657, whereas five-fold spatial cross-validation gives 0.459 for the same model family. Reporting only the random result would materially overstate performance in geographically unseen areas. Combining predictive models with spatial validation and residual Moran diagnostics produces a more credible and complete assessment.
ranger: a fast
implementation of Random Forests for high-dimensional data.