1 Introduction

Land use classification is a fundamental problem in spatial analysis, with practical applications in urban planning, environmental monitoring, and regional policy. Accurate land use maps allow planners to track urban expansion, assess green infrastructure, and identify areas under pressure from population growth. At the same time, the complexity of land cover patterns in and around cities makes classification a non-trivial task: land use categories are often spatially mixed, boundaries between classes are gradual rather than sharp, and the signal carried by any single predictor is rarely sufficient on its own.

Machine learning methods offer a promising approach to this problem. Unlike parametric statistical models, algorithms such as Random Forest (RF) and Artificial Neural Networks (ANN) can capture non-linear and interaction effects among predictors without requiring explicit model specification. Both methods have been applied to land use and land cover classification tasks in urban contexts, typically using remotely sensed imagery as input (Maxwell et al., 2018). This project instead relies on socioeconomic and infrastructure features derived from publicly available datasets: road networks from OpenStreetMap, points of interest, population density from the Polish statistical office, and distance to city centres. This choice reflects a scenario where satellite imagery is unavailable or too costly, and only ground-level spatial data can be used.

The study area covers three major Polish cities, Warsaw, Krakow, and Wroclaw, analysed on a 1 km2 regular grid. Ground truth labels are derived from the CORINE Land Cover 2018 dataset using a dominant-class assignment rule. Two models are trained and compared: a tuned Random Forest and a cross-validated Artificial Neural Network. Model performance is evaluated using overall accuracy, Cohen’s Kappa, per-class F1 scores, and Moran’s I applied to the spatial distribution of prediction errors. The spatial error analysis is included because classification errors in geographic data are rarely independent: if a model fails in one location, neighbouring cells are often affected by the same structural problem, and identifying such clustering provides diagnostic information beyond what aggregate accuracy metrics reveal.


2 Data Preparation

2.1 Load CORINE Land Cover and Clip to Three Cities

The CORINE Land Cover 2018 dataset was downloaded from the Copernicus Land Monitoring Service as a pan-European GeoPackage. Due to the large file size of the full dataset, it was pre-clipped to the Polish territory using ogr2ogr with a bounding box in the ETRS89-LAEA coordinate system (EPSG:3035). The resulting file contains 185,396 polygons. Administrative boundaries at the city level were obtained from GADM version 4.1 and saved locally to avoid dependence on remote URLs during document rendering.

# Load the pre-clipped CORINE layer for Poland
clc <- st_read(
  "/Users/emma/Documents/Project/Results/clc2018_poland.gpkg",
  layer = "U2018_CLC2018_V2020_20u1",
  quiet = TRUE
)
clc <- st_make_valid(clc)
cat("CORINE features loaded:", nrow(clc), "\n")
## CORINE features loaded: 185396
# Load locally saved GADM boundaries
poland <- readRDS("/Users/emma/Documents/Project/data/gadm_poland_0.rds") %>%
  st_transform(st_crs(clc)) %>%
  st_make_valid()

cities <- readRDS("/Users/emma/Documents/Project/data/gadm_poland_2.rds") %>%
  st_transform(st_crs(clc))

# Clip CORINE to Poland extent
clc_poland <- st_intersection(clc, poland)
cat("CORINE features after Poland clip:", nrow(clc_poland), "\n")
## CORINE features after Poland clip: 148200
# Polish diacritics are required to match the GADM NAME_2 field
three_cities <- cities %>%
  filter(NAME_2 %in% c("Warszawa", "Kraków", "Wrocław"))

# Clip CORINE to the three city boundaries
clc_cities <- st_intersection(clc_poland, three_cities)
cat("CORINE features per city:\n")
## CORINE features per city:
print(table(clc_cities$NAME_2))
## 
##   Kraków Warszawa  Wrocław 
##      854      349      500
ggplot(clc_poland) +
  geom_sf(aes(fill = Code_18), color = NA, show.legend = FALSE) +
  labs(
    title   = "CORINE Land Cover -- Poland 2018",
    caption = "Source: Copernicus Land Monitoring Service, EEA (2020)"
  ) +
  theme_void() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 13))
Figure 1. CORINE Land Cover 2018, Poland. Land cover codes shown before reclassification. The map confirms that agricultural and natural land dominate the Polish territory.

Figure 1. CORINE Land Cover 2018, Poland. Land cover codes shown before reclassification. The map confirms that agricultural and natural land dominate the Polish territory.

plot(clc_cities["Code_18"], main = "CORINE Land Cover -- Warsaw, Krakow, Wroclaw")
Figure 2. CORINE Land Cover clipped to Warsaw, Krakow, and Wroclaw. The three cities show distinct land cover compositions.

Figure 2. CORINE Land Cover clipped to Warsaw, Krakow, and Wroclaw. The three cities show distinct land cover compositions.

2.2 Reclassify CORINE: 44 Codes to 4 Classes

The CORINE dataset uses 44 land cover codes at the third hierarchical level. For the purposes of this study, these are aggregated into four categories: Urban, Industrial, Agricultural, and Natural. The grouping is designed to reflect substantively meaningful distinctions in land use type while keeping the number of classes tractable for a dataset of approximately 3,000 grid cells. Two small transitional categories (Transitional, codes 131-133, and Green urban areas, codes 141-142) are absorbed into Industrial and Natural respectively during dataset assembly, as their cell counts are too small for reliable class-level model training.

# Lookup table mapping 44 CORINE codes to five initial land use classes
reclass_table <- data.frame(
  Code_18 = c(
    "111", "112",
    "121", "122", "123", "124",
    "131", "132", "133",
    "141", "142",
    "211", "212", "213", "221", "222",
    "223", "231", "241", "242", "243", "244",
    "311", "312", "313", "321", "322",
    "323", "324", "331", "332", "333", "334", "335",
    "411", "412", "421", "422", "423",
    "511", "512", "521", "522", "523"
  ),
  LU_class = c(
    rep("Urban",        2),   # Continuous and discontinuous urban fabric
    rep("Industrial",   4),   # Industrial, commercial, transport areas
    rep("Transitional", 3),   # Mine dumps and construction sites
    rep("Green",        2),   # Urban green areas and sport facilities
    rep("Agricultural", 11),  # Arable land, crops, pastures, mixed farmland
    rep("Natural",      12),  # Forests and semi-natural vegetation
    rep("Natural",      10)   # Wetlands and water bodies
  )
)

# Single join -- applied once to avoid column duplication
clc_cities <- clc_cities %>%
  left_join(reclass_table, by = "Code_18")

cat("Land use class distribution (raw CORINE polygons):\n")
## Land use class distribution (raw CORINE polygons):
print(table(clc_cities$LU_class))
## 
## Agricultural        Green   Industrial      Natural Transitional        Urban 
##          735           76           89          438           17          348

2.3 Create 1 km2 Grid

A regular square grid with 1 km2 cells is constructed over the union of the three city boundaries. This resolution is chosen as a balance between spatial detail and the granularity of the available predictor data: the GUS population data is at the county level, and OSM road lengths are more stable estimators at 1 km2 than at finer resolutions. Grid cells that do not intersect any city boundary are removed.

# Construct a 1000 m square grid clipped to the three city boundaries
grid_1km <- st_make_grid(
    three_cities,
    cellsize = 1000,
    square   = TRUE,
    what     = "polygons"
  ) %>%
  st_as_sf() %>%
  st_filter(three_cities) %>%
  mutate(cell_id = row_number())

cat("Total 1 km2 grid cells:", nrow(grid_1km), "\n")
## Total 1 km2 grid cells: 3312

2.4 Assign Dominant Land Use to Each Cell (Y Variable)

Each grid cell is assigned the land use class that occupies the largest area within it. Where multiple CORINE polygons overlap a cell, the one with the greatest intersection area is selected. This dominant-class rule simplifies the classification problem to a single label per cell while preserving the primary land use character of each location.

# Compute intersection areas between CORINE polygons and the grid
clc_grid <- st_intersection(clc_cities, grid_1km) %>%
  mutate(area_m2 = as.numeric(st_area(.)))

# Assign the dominant land use label to each cell
y_variable <- clc_grid %>%
  st_drop_geometry() %>%
  group_by(cell_id) %>%
  slice_max(area_m2, n = 1, with_ties = FALSE) %>%
  select(cell_id, LU_class, Code_18)

cat("Grid cells with land use assignment:", nrow(y_variable), "\n")
## Grid cells with land use assignment: 3312
cat("\nDominant land use per cell (before merging small classes):\n")
## 
## Dominant land use per cell (before merging small classes):
print(table(y_variable$LU_class))
## 
## Agricultural        Green   Industrial      Natural Transitional        Urban 
##         2171           48           91          528            6          468

The dominant land use distribution confirms that Agricultural land covers the majority of grid cells across the three cities, followed by Urban and Natural. The Transitional (6 cells) and Green (48 cells) categories are too sparse for reliable modelling and will be merged in Section 2.8.

2.5 Load Population Grid (GUS)

Population density data at the county (powiat) level for 2020 and 2021 are obtained from the Central Statistical Office of Poland (GUS), downloaded as a shapefile (tnvc23015.shp). The dataset covers 380 administrative units and reports population density in inhabitants per km2. Since the GUS units do not align with the 1 km2 grid, a spatial join is performed and values are averaged where a cell intersects multiple administrative boundaries.

# Load and reproject GUS population data to match the grid CRS
gus_grid <- st_read("/Users/emma/Documents/Project/query/tnvc23015.shp", quiet = TRUE) %>%
  st_transform(st_crs(grid_1km)) %>%
  mutate(
    gestosc_20 = as.numeric(gestosc_20),
    gestosc_21 = as.numeric(gestosc_21)
  )

# Spatial join with aggregation to avoid row duplication
pop_per_cell <- st_join(grid_1km, gus_grid, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(cell_id) %>%
  summarise(
    gestosc_20 = mean(gestosc_20, na.rm = TRUE),
    gestosc_21 = mean(gestosc_21, na.rm = TRUE),
    .groups = "drop"
  )

cat("Grid cells with population data:",
    sum(!is.na(pop_per_cell$gestosc_20)), "/", nrow(pop_per_cell), "\n")
## Grid cells with population data: 3312 / 3312

2.6 Load OSM Features (Roads and POIs)

Road networks and points of interest (POIs) are downloaded using the osmextract package, which retrieves pre-built regional extracts from the Geofabrik server. Road features are filtered to the highway tag. POI features are extracted from the other_tags field using a regular expression that matches amenity-tagged nodes. Both road and POI data are downloaded at the voivodeship level, which extends beyond the city administrative boundaries and is clipped to the grid in Section 2.7.

# Polish names with diacritics are required for correct Geofabrik region matching
warsaw_roads  <- oe_get("Warszawa", layer = "lines",
                        query = "SELECT * FROM lines WHERE highway IS NOT NULL",
                        quiet = TRUE)
krakow_roads  <- oe_get("Kraków",   layer = "lines",
                        query = "SELECT * FROM lines WHERE highway IS NOT NULL",
                        quiet = TRUE)
wroclaw_roads <- oe_get("Wrocław",  layer = "lines",
                        query = "SELECT * FROM lines WHERE highway IS NOT NULL",
                        quiet = TRUE)

warsaw_pois_raw  <- oe_get("Warszawa", layer = "points", quiet = TRUE)
krakow_pois_raw  <- oe_get("Kraków",   layer = "points", quiet = TRUE)
wroclaw_pois_raw <- oe_get("Wrocław",  layer = "points", quiet = TRUE)
# Extract amenity POIs from the other_tags field using regex
extract_amenity <- function(pois_raw) {
  pois_raw %>%
    mutate(amenity = str_extract(other_tags, '(?<="amenity"=>")[^"]+')) %>%
    filter(!is.na(amenity)) %>%
    select(osm_id, name, amenity, geometry)
}

target_crs <- st_crs(grid_1km)

all_roads <- bind_rows(warsaw_roads, krakow_roads, wroclaw_roads) %>%
  st_transform(target_crs)

all_pois <- bind_rows(
  extract_amenity(warsaw_pois_raw),
  extract_amenity(krakow_pois_raw),
  extract_amenity(wroclaw_pois_raw)
) %>%
  st_transform(target_crs)

cat("Total road segments:", nrow(all_roads), "\n")
## Total road segments: 2097896
cat("Total amenity POIs: ", nrow(all_pois),  "\n")
## Total amenity POIs:  214641

2.7 Compute Spatial Features per Grid Cell

Four groups of spatial features are computed for each cell: POI count, total road length, road length disaggregated by highway classification (residential, primary, secondary, tertiary, trunk, motorway), and the Euclidean distance from the cell centroid to the nearest city centre. Together, these 11 variables capture different dimensions of urban infrastructure intensity and accessibility.

# POI count per cell
pois_per_cell <- st_join(all_pois, grid_1km) %>%
  st_drop_geometry() %>%
  filter(!is.na(cell_id)) %>%
  group_by(cell_id) %>%
  summarise(poi_count = n())

# Clip roads to grid cells for road length computation
roads_clipped <- st_intersection(all_roads, grid_1km)

# Total road length per cell in metres
roads_per_cell <- roads_clipped %>%
  mutate(road_length = as.numeric(st_length(.))) %>%
  st_drop_geometry() %>%
  group_by(cell_id) %>%
  summarise(total_road_length = sum(road_length, na.rm = TRUE))

# Road length disaggregated by highway type
road_types_per_cell <- roads_clipped %>%
  mutate(road_length = as.numeric(st_length(.))) %>%
  st_drop_geometry() %>%
  filter(highway %in% c("primary", "secondary", "tertiary",
                        "residential", "motorway", "trunk")) %>%
  group_by(cell_id, highway) %>%
  summarise(length = sum(road_length, na.rm = TRUE), .groups = "drop") %>%
  tidyr::pivot_wider(
    names_from   = highway,
    values_from  = length,
    values_fill  = 0,
    names_prefix = "road_"
  )

# Euclidean distance to the nearest city centre
city_centres <- data.frame(
  city    = c("Warszawa", "Kraków", "Wrocław"),
  lon_wgs = c(21.0122, 19.9450, 17.0385),
  lat_wgs = c(52.2297, 50.0647, 51.1079)
) %>%
  st_as_sf(coords = c("lon_wgs", "lat_wgs"), crs = 4326) %>%
  st_transform(st_crs(grid_1km))

grid_centroids <- st_centroid(grid_1km)
dist_matrix    <- st_distance(grid_centroids, city_centres)

grid_1km <- grid_1km %>%
  mutate(
    dist_to_centre = apply(dist_matrix, 1, min),
    city           = city_centres$city[apply(dist_matrix, 1, which.min)]
  )

2.8 Assemble and Balance the Final Dataset

All feature tables are joined to the grid via cell_id. Missing values in road and POI columns arise in cells where no matching features were found; these are replaced with zero. The two small classes (Transitional and Green) are merged into Industrial and Natural respectively.

After merging, the class distribution is heavily skewed toward Agricultural. To prevent this imbalance from dominating model training, a downsampling strategy is applied: each class is capped at three times the size of the smallest class. This multiplier retains more data than equal-size sampling while substantially reducing the imbalance ratio.

# Join all feature tables to the grid
final_data <- grid_1km %>%
  st_drop_geometry() %>%
  left_join(y_variable,          by = "cell_id") %>%
  left_join(pop_per_cell,        by = "cell_id") %>%
  left_join(pois_per_cell,       by = "cell_id") %>%
  left_join(roads_per_cell,      by = "cell_id") %>%
  left_join(road_types_per_cell, by = "cell_id") %>%
  mutate(
    gestosc_20        = replace_na(gestosc_20, 0),
    gestosc_21        = replace_na(gestosc_21, 0),
    poi_count         = replace_na(poi_count, 0),
    total_road_length = replace_na(total_road_length, 0),
    across(starts_with("road_"), ~replace_na(., 0)),
    # Merge minority classes into the nearest substantive category
    LU_class = case_when(
      LU_class == "Transitional" ~ "Industrial",
      LU_class == "Green"        ~ "Natural",
      TRUE                       ~ LU_class
    )
  ) %>%
  filter(!is.na(LU_class))

cat("Class distribution before balancing:\n")
## Class distribution before balancing:
print(table(final_data$LU_class))
## 
## Agricultural   Industrial      Natural        Urban 
##         2171           97          576          468
# Cap each class at 3x the smallest class size
set.seed(42)
max_allowed <- as.integer(min(table(final_data$LU_class)) * 3)

final_balanced <- final_data %>%
  group_by(LU_class) %>%
  slice_sample(n = max_allowed) %>%
  ungroup()

cat("\nClass distribution after balancing (cap =", max_allowed, "per class):\n")
## 
## Class distribution after balancing (cap = 291 per class):
print(table(final_balanced$LU_class))
## 
## Agricultural   Industrial      Natural        Urban 
##          291           97          291          291
saveRDS(final_balanced, "data_prepared_balanced.rds")

The balanced dataset contains 970 observations across four classes. The Industrial class is the smallest (n = 97), which sets the cap for the other classes and is likely to limit per-class model performance for that category.


3 Modelling

3.1 Train and Test Split

The balanced dataset is split 70/30 into training and test sets using stratified random sampling via createDataPartition. Stratification ensures that class proportions are preserved in both partitions, which is important given the residual imbalance between Industrial and the other three classes. The same 11 spatial features serve as predictors for both models.

# Define the 11-variable predictor set for both models
features <- c(
  "dist_to_centre",    # Distance to nearest city centre (metres)
  "poi_count",         # Count of amenity POIs in cell
  "total_road_length", # Total road length in cell (metres)
  "road_residential",  # Residential road length
  "road_primary",      # Primary road length
  "road_secondary",    # Secondary road length
  "road_tertiary",     # Tertiary road length
  "road_trunk",        # Trunk road length
  "road_motorway",     # Motorway length
  "gestosc_20",        # Population density 2020 (GUS)
  "gestosc_21"         # Population density 2021 (GUS)
)

model_data <- final_balanced %>%
  select(all_of(features), LU_class, city, cell_id) %>%
  mutate(LU_class = as.factor(LU_class))

set.seed(42)
train_idx  <- createDataPartition(model_data$LU_class, p = 0.7, list = FALSE)
train_data <- model_data[train_idx, ]
test_data  <- model_data[-train_idx, ]

cat("Training observations:", nrow(train_data),
    "| Test observations:", nrow(test_data), "\n")
## Training observations: 680 | Test observations: 290

3.2 Random Forest

Random Forest builds an ensemble of decision trees, each trained on a bootstrap sample of the data, and aggregates predictions by majority vote. It is robust to outliers, handles non-linear feature interactions, and does not require feature scaling. Variable importance is measured as the mean decrease in classification accuracy when a predictor is permuted, averaged across all trees.

The hyperparameter mtry, which controls the number of candidate features considered at each split, is tuned by minimising out-of-bag (OOB) classification error using tuneRF. The forest uses 500 trees for all configurations.

set.seed(42)
tune_result <- tuneRF(
  x          = train_data %>% select(all_of(features)),
  y          = train_data$LU_class,
  ntreeTry   = 500,
  stepFactor = 1.5,
  improve    = 0.01,
  trace      = TRUE,
  plot       = TRUE
)
## mtry = 3  OOB error = 33.82% 
## Searching left ...
## mtry = 2     OOB error = 32.94% 
## 0.02608696 0.01 
## Searching right ...
## mtry = 4     OOB error = 32.94% 
## 0 0.01
Figure 3. OOB error as a function of mtry. The optimal value is selected at the minimum of the curve.

Figure 3. OOB error as a function of mtry. The optimal value is selected at the minimum of the curve.

best_mtry <- tune_result[which.min(tune_result[, "OOBError"]), "mtry"]
cat("Optimal mtry:", best_mtry, "\n")
## Optimal mtry: 2
# Train the final RF model with the optimal mtry
set.seed(42)
rf_tuned <- randomForest(
  LU_class ~ . - city - cell_id,
  data        = train_data,
  ntree       = 500,
  mtry        = best_mtry,
  importance  = TRUE,
  keep.forest = TRUE
)

rf_pred <- predict(rf_tuned, newdata = test_data)
conf_rf <- confusionMatrix(rf_pred, test_data$LU_class)

cat("RF Accuracy:", round(conf_rf$overall["Accuracy"], 3), "\n")
## RF Accuracy: 0.617
cat("RF Kappa:   ", round(conf_rf$overall["Kappa"],    3), "\n")
## RF Kappa:    0.463
saveRDS(rf_tuned, "rf_model_tuned.rds")

3.3 Artificial Neural Network

The ANN is implemented as a single-hidden-layer feedforward network using the nnet method in the caret package. This architecture is appropriate for a four-class tabular classification problem of moderate size. Features are scaled to the [0,1] range before training to ensure that gradient-based weight updates are not dominated by variables with large absolute values. Any zero-variance column that produces NaN after range scaling is replaced with zero to avoid numerical errors in the weight matrix.

The number of hidden nodes and the L2 weight decay regularisation parameter are selected jointly by 5-fold cross-validation over a grid of candidate values (size: 5, 10, 15, 20; decay: 0.001, 0.01, 0.1).

# Scale features to [0,1] using training set statistics
preproc      <- preProcess(train_data %>% select(all_of(features)),
                           method = "range")
train_scaled <- predict(preproc, train_data)
test_scaled  <- predict(preproc, test_data)

# Replace NaN values from zero-variance columns with zero
train_scaled <- train_scaled %>%
  mutate(across(all_of(features), ~ifelse(is.nan(.), 0, .)))
test_scaled <- test_scaled %>%
  mutate(across(all_of(features), ~ifelse(is.nan(.), 0, .)))

# Explicit formula avoids issues with non-feature columns remaining in the data
ann_formula <- as.formula(
  paste("LU_class ~", paste(features, collapse = " + "))
)

set.seed(42)
ann_tuned <- train(
  ann_formula,
  data      = train_scaled,
  method    = "nnet",
  tuneGrid  = expand.grid(
    size  = c(5, 10, 15, 20),
    decay = c(0.001, 0.01, 0.1)
  ),
  trControl = trainControl(method = "cv", number = 5, verboseIter = FALSE),
  maxit     = 500,
  trace     = FALSE
)

cat("Optimal ANN configuration:\n")
## Optimal ANN configuration:
print(ann_tuned$bestTune)
##   size decay
## 2    5  0.01
ann_pred <- predict(ann_tuned, newdata = test_scaled)
conf_ann <- confusionMatrix(ann_pred, test_scaled$LU_class)

cat("ANN Accuracy:", round(conf_ann$overall["Accuracy"], 3), "\n")
## ANN Accuracy: 0.576
cat("ANN Kappa:   ", round(conf_ann$overall["Kappa"],    3), "\n")
## ANN Kappa:    0.405
saveRDS(ann_tuned, "ann_model_tuned.rds")

4 Results

4.1 Performance Comparison

# Build the performance comparison table from both confusion matrices
comparison <- data.frame(
  Metric = c(
    "Overall Accuracy", "Kappa",
    "F1 - Agricultural", "F1 - Industrial",
    "F1 - Natural",      "F1 - Urban"
  ),
  RF = c(
    conf_rf$overall["Accuracy"],
    conf_rf$overall["Kappa"],
    conf_rf$byClass["Class: Agricultural", "F1"],
    conf_rf$byClass["Class: Industrial",   "F1"],
    conf_rf$byClass["Class: Natural",      "F1"],
    conf_rf$byClass["Class: Urban",        "F1"]
  ),
  ANN = c(
    conf_ann$overall["Accuracy"],
    conf_ann$overall["Kappa"],
    conf_ann$byClass["Class: Agricultural", "F1"],
    conf_ann$byClass["Class: Industrial",   "F1"],
    conf_ann$byClass["Class: Natural",      "F1"],
    conf_ann$byClass["Class: Urban",        "F1"]
  )
) %>%
  mutate(across(c(RF, ANN), ~round(., 3)))

knitr::kable(
  comparison,
  caption = paste0("Table 1. RF vs ANN performance on the test set ",
                   "(n = ", nrow(test_data), ")."),
  align = c("l", "r", "r")
)
Table 1. RF vs ANN performance on the test set (n = 290).
Metric RF ANN
Overall Accuracy 0.617 0.576
Kappa 0.463 0.405
F1 - Agricultural 0.609 0.557
F1 - Industrial 0.375 0.333
F1 - Natural 0.506 0.500
F1 - Urban 0.773 0.724
write.csv(comparison, "model_comparison.csv", row.names = FALSE)

Table 1 presents the performance comparison on the held-out test set. RF achieves the higher overall accuracy (0.617 vs 0.576) and the higher Kappa coefficient (0.463 vs 0.405). Both models perform best on the Urban class (RF F1 = 0.773; ANN F1 = 0.724), where the concentration of roads and amenities in city centres creates a strong and relatively unambiguous signal in the feature space. The Industrial class produces the lowest F1 scores for both models (RF: 0.375; ANN: 0.333). This result is expected for two reasons: the Industrial training sample is the smallest of the four classes, and Industrial zones in Polish cities frequently border or intermix with Urban areas, making the two classes difficult to distinguish from road and POI features alone.

The Kappa coefficients place both models in the moderate agreement range according to the scale proposed by Landis and Koch (1977), which defines values between 0.41 and 0.60 as moderate and 0.61 to 0.80 as substantial. These values reflect the inherent difficulty of classifying mixed land use at 1 km2 resolution using only socioeconomic and infrastructure features without spectral data.

4.2 Variable Importance

importance_df <- as.data.frame(importance(rf_tuned)) %>%
  tibble::rownames_to_column("variable") %>%
  arrange(desc(MeanDecreaseAccuracy)) %>%
  mutate(
    variable = recode(variable,
      dist_to_centre    = "Distance to centre",
      poi_count         = "POI count",
      total_road_length = "Total road length",
      road_residential  = "Residential roads",
      road_primary      = "Primary roads",
      road_secondary    = "Secondary roads",
      road_tertiary     = "Tertiary roads",
      road_trunk        = "Trunk roads",
      road_motorway     = "Motorways",
      gestosc_20        = "Pop. density 2020",
      gestosc_21        = "Pop. density 2021"
    ),
    is_top_3 = row_number() <= 3
  )

ggplot(importance_df,
       aes(x = reorder(variable, MeanDecreaseAccuracy),
           y = MeanDecreaseAccuracy,
           fill = is_top_3)) +
  geom_col() +
  scale_fill_manual(
    values = c("FALSE" = "#2C7BB6", "TRUE" = "#D7191C"),
    guide  = "none"
  ) +
  coord_flip() +
  labs(
    title    = "Random Forest -- Variable Importance",
    subtitle = "Top 3 predictors highlighted in red",
    x        = NULL,
    y        = "Mean Decrease in Accuracy",
    caption  = paste0("ntree = 500, mtry = ", best_mtry)
  ) +
  theme_minimal() +
  theme(
    plot.title    = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, color = "grey40")
  )
Figure 4. RF variable importance ranked by mean decrease in accuracy. The top three predictors are highlighted in red.

Figure 4. RF variable importance ranked by mean decrease in accuracy. The top three predictors are highlighted in red.

ggsave("plot_variable_importance.png", width = 8, height = 5, dpi = 300)

Figure 4 shows the mean decrease in accuracy for each predictor in the RF model. The top three variables are road-based: Residential roads (importance = 46), Total road length (43.07), and POI count (28.05). These results suggest that the density and type of road infrastructure are the strongest indicators of urban land use at 1 km2 resolution, which is consistent with the intuition that residential and commercial areas concentrate road networks while agricultural and natural land does not.

Population density ranks 4th and 6th respectively. The relatively modest contribution of the GUS population variable is most likely a consequence of its spatial resolution: the GUS data are reported at the county level, which is far coarser than the 1 km2 grid. A single density value is shared across all grid cells within the same county, reducing its discriminatory power for within-county classification. A finer-grained population dataset would likely improve its ranking.

4.3 Population Density by Land Use Class

final_balanced %>%
  mutate(gestosc_20 = as.numeric(gestosc_20)) %>%
  filter(!is.na(gestosc_20), gestosc_20 > 0) %>%
  ggplot(aes(x = LU_class, y = gestosc_20, fill = LU_class)) +
  geom_boxplot(outlier.alpha = 0.3) +
  scale_y_log10() +
  scale_fill_manual(values = class_colors) +
  labs(
    title    = "Population Density by Land Use Class",
    subtitle = "GUS 2020 data -- log scale",
    x        = NULL,
    y        = "Population density (inhabitants per km2, log scale)",
    caption  = "Source: Central Statistical Office of Poland (GUS), 2020"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title      = element_text(face = "bold", hjust = 0.5),
    plot.subtitle   = element_text(hjust = 0.5, color = "grey40")
  )
Figure 5. Population density (GUS 2020, log scale) by land use class. Urban and Industrial cells have substantially higher population densities than Agricultural and Natural cells.

Figure 5. Population density (GUS 2020, log scale) by land use class. Urban and Industrial cells have substantially higher population densities than Agricultural and Natural cells.

ggsave("plot_pop_boxplot.png", width = 7, height = 5, dpi = 300)

Figure 5 shows that Urban and Industrial cells have substantially higher median population densities (roughly 2,000 inhabitants per km2) than Agricultural cells (roughly 200), with Natural cells displaying the widest spread. This confirms that population density carries useful information for distinguishing land use classes, even though its county-level resolution limits its contribution in the RF model. The large overlap in the Natural distribution partly explains why the Agricultural-Natural boundary is the primary source of classification error across both models.

4.4 Model Performance Comparison Chart

comparison %>%
  tidyr::pivot_longer(c(RF, ANN), names_to = "Model", values_to = "Value") %>%
  ggplot(aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = "dodge", width = 0.6) +
  scale_fill_manual(values = c("RF" = "#2C7BB6", "ANN" = "#D7191C")) +
  scale_y_continuous(
    limits = c(0, 1),
    labels = scales::number_format(accuracy = 0.1)
  ) +
  coord_flip() +
  labs(
    title    = "RF vs ANN -- Performance Comparison",
    subtitle = "Land use classification, 1 km2 grid, Warsaw / Krakow / Wroclaw",
    x        = NULL,
    y        = "Score",
    caption  = paste0("Test set n = ", nrow(test_data),
                      "; CORINE 2018 ground truth")
  ) +
  theme_minimal() +
  theme(
    plot.title      = element_text(face = "bold", hjust = 0.5),
    plot.subtitle   = element_text(hjust = 0.5, color = "grey40"),
    legend.position = "bottom"
  )
Figure 6. RF vs ANN performance comparison across all six evaluation metrics.

Figure 6. RF vs ANN performance comparison across all six evaluation metrics.

ggsave("plot_model_comparison.png", width = 8, height = 5, dpi = 300)

4.5 Land Use Composition by City

final_balanced %>%
  count(city, LU_class) %>%
  ggplot(aes(x = city, y = n, fill = LU_class)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = class_colors, name = "Land Use") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title   = "Land Use Composition by City",
    x       = NULL,
    y       = "Share of grid cells",
    caption = paste0("CORINE 2018, balanced sample (n = ",
                     nrow(final_balanced), ")")
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
Figure 7. Land use composition by city in the balanced dataset. Warsaw is dominated by Urban cells; Krakow and Wroclaw show more diverse compositions.

Figure 7. Land use composition by city in the balanced dataset. Warsaw is dominated by Urban cells; Krakow and Wroclaw show more diverse compositions.

ggsave("plot_city_composition.png", width = 7, height = 5, dpi = 300)

Figure 7 shows that the three cities have distinct land use profiles. Warsaw is the most urbanised, with more than half of its balanced grid cells classified as Urban, reflecting its role as the national capital and the largest city in Poland. Krakow has a more balanced mix of Urban, Natural, and Agricultural land, consistent with its more compact historic core and proximity to the Malopolska uplands. Wroclaw has the highest share of Natural land among the three cities, which is consistent with the Oder floodplain and the city’s extensive riverside green areas. These compositional differences motivate the per-city error analysis in Section 5.2, which tests whether classification performance is consistent across cities.


5 Spatial Error Analysis

5.1 Prediction Maps

# Build the prediction and error table for the test set
test_cells <- test_data %>%
  mutate(
    rf_pred     = as.character(predict(rf_tuned, newdata = test_data)),
    ann_pred    = as.character(predict(ann_tuned, newdata = test_scaled)),
    actual      = as.character(LU_class),
    rf_correct  = rf_pred  == actual,
    ann_correct = ann_pred == actual,
    error_type  = case_when(
       rf_correct &  ann_correct ~ "Both correct",
       rf_correct & !ann_correct ~ "Only RF correct",
      !rf_correct &  ann_correct ~ "Only ANN correct",
      !rf_correct & !ann_correct ~ "Both wrong"
    )
  ) %>%
  select(cell_id, actual, rf_pred, ann_pred,
         rf_correct, ann_correct, error_type)

# Join predictions back to the spatial grid
grid_results <- grid_1km %>%
  inner_join(test_cells, by = "cell_id")

map_actual <- ggplot(grid_results) +
  geom_sf(aes(fill = actual), color = NA) +
  scale_fill_manual(values = class_colors) +
  labs(title = "Actual (CORINE 2018)") +
  theme_void() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold", size = 11))

map_rf <- ggplot(grid_results) +
  geom_sf(aes(fill = rf_pred), color = NA) +
  scale_fill_manual(values = class_colors) +
  labs(title = "RF Predicted") +
  theme_void() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold", size = 11))

map_ann <- ggplot(grid_results) +
  geom_sf(aes(fill = ann_pred), color = NA) +
  scale_fill_manual(values = class_colors) +
  labs(title = "ANN Predicted") +
  theme_void() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold", size = 11))

map_errors <- ggplot(grid_results) +
  geom_sf(aes(fill = error_type), color = NA) +
  scale_fill_manual(values = error_colors) +
  labs(title = "Spatial Error Analysis") +
  theme_void() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold", size = 11))

combined_maps <- (map_actual | map_rf) / (map_ann | map_errors) +
  plot_annotation(
    title    = "Land Use Classification -- Warsaw, Krakow, Wroclaw",
    subtitle = "CORINE 2018 vs RF and ANN predictions, 1 km2 grid (test set)",
    theme = theme(
      plot.title    = element_text(hjust = 0.5, face = "bold", size = 13),
      plot.subtitle = element_text(hjust = 0.5, size = 9)
    )
  ) &
  theme(legend.position = "none")

combined_maps
Figure 8. Predicted vs actual land use and spatial error distribution (test set). Colour key -- land use maps: red = Urban, light green = Agricultural, dark green = Natural, brown = Industrial. Error map: green = both models correct, red = both models wrong, blue = only RF correct, orange = only ANN correct.

Figure 8. Predicted vs actual land use and spatial error distribution (test set). Colour key – land use maps: red = Urban, light green = Agricultural, dark green = Natural, brown = Industrial. Error map: green = both models correct, red = both models wrong, blue = only RF correct, orange = only ANN correct.

ggsave("maps_combined.png", width = 12, height = 8, dpi = 300)

Figure 8 shows that both models capture the broad urban core of each city reasonably well but diverge in peri-urban areas. Cells where both models fail are concentrated at the urban fringe, where agricultural, industrial, and residential land uses are spatially intermixed at a scale finer than the 1 km2 grid cell. This transition zone problem is structural: the available features cannot resolve land use boundaries that exist within a single cell.

5.2 Per-City Error Patterns

for (city_name in c("Warszawa", "Kraków", "Wrocław")) {
  p <- grid_results %>%
    filter(city == city_name) %>%
    ggplot() +
    geom_sf(aes(fill = error_type), color = NA) +
    scale_fill_manual(values = error_colors, name = "Prediction") +
    labs(
      title   = paste("Error Pattern --", city_name),
      caption = "Test set only (30% of balanced sample)"
    ) +
    theme_void() +
    theme(
      plot.title      = element_text(hjust = 0.5, face = "bold"),
      legend.position = "bottom"
    )
  print(p)
}
Figure 9. Prediction error patterns by city (test set). Green cells indicate correct predictions by both models; red cells indicate failure by both.

Figure 9. Prediction error patterns by city (test set). Green cells indicate correct predictions by both models; red cells indicate failure by both.

Figure 9. Prediction error patterns by city (test set). Green cells indicate correct predictions by both models; red cells indicate failure by both.

Figure 9. Prediction error patterns by city (test set). Green cells indicate correct predictions by both models; red cells indicate failure by both.

Figure 9. Prediction error patterns by city (test set). Green cells indicate correct predictions by both models; red cells indicate failure by both.

Figure 9. Prediction error patterns by city (test set). Green cells indicate correct predictions by both models; red cells indicate failure by both.

Figure 9 disaggregates the error map by city. In Warsaw, most misclassifications occur in the northern and western periphery, where agricultural and transitional land uses are intermixed with suburban development. In Krakow, errors are more dispersed across the city boundary, which is consistent with the more heterogeneous land cover of the city and its surroundings. Wroclaw shows a cluster of errors in its western industrial zone, where both models struggle to distinguish Industrial from adjacent Agricultural land. The city-specific patterns suggest that a single pooled model captures Warsaw’s urban core well but is less adapted to the peri-urban conditions of Krakow and Wroclaw.

5.3 Moran’s I: Spatial Clustering of Prediction Errors

If prediction errors were spatially random, the probability of a cell being misclassified would be independent of its neighbours. In practice, this is unlikely in land use classification because the features used as predictors are themselves spatially autocorrelated: road density and POI counts are high in the same locations across many neighbouring cells. To test whether errors are spatially clustered, Moran’s I is applied to a binary error indicator (1 = misclassified, 0 = correct) using a k = 4 nearest-neighbour spatial weight matrix.

# Add binary error indicators to the spatial results grid
grid_results <- grid_results %>%
  mutate(
    rf_error_bin  = as.integer(!rf_correct),
    ann_error_bin = as.integer(!ann_correct)
  )

# Build k = 4 nearest-neighbour weight matrix from cell centroids
coords     <- st_centroid(grid_results)
neighbours <- knn2nb(knearneigh(coords, k = 4))
weights    <- nb2listw(neighbours, style = "W")

# Moran's I test for RF and ANN error vectors
moran_rf  <- moran.test(grid_results$rf_error_bin,  weights)
moran_ann <- moran.test(grid_results$ann_error_bin, weights)

cat("Moran's I -- RF errors :",
    round(moran_rf$estimate["Moran I statistic"],  3),
    " | p =", round(moran_rf$p.value,  4), "\n")
## Moran's I -- RF errors : 0.069  | p = 0.0308
cat("Moran's I -- ANN errors:",
    round(moran_ann$estimate["Moran I statistic"], 3),
    " | p =", round(moran_ann$p.value, 4), "\n")
## Moran's I -- ANN errors: 0.126  | p = 4e-04

RF errors are statistically significant (p < 0.05) (Moran’s I = 0.069, p = 0.0308). ANN errors are statistically significant (p < 0.05) (Moran’s I = 0.126, p = 4^{-4}). A positive and significant Moran’s I indicates that misclassified cells tend to cluster in space rather than appearing randomly, which is consistent with the transition zone pattern observed in the per-city maps.

5.4 Misclassification Flow (Chord Diagram)

chord_colors <- c(
  "Agricultural" = "#a8d08d",
  "Industrial"   = "#c55a11",
  "Natural"      = "#375623",
  "Urban"        = "#ff0000"
)

# Confusion table as input to the chord diagram
trans_mat <- table(
  Predicted = as.character(rf_pred),
  Actual    = as.character(test_data$LU_class)
)

circos.clear()
chordDiagram(
  trans_mat,
  grid.col          = chord_colors,
  annotationTrack   = "grid",
  preAllocateTracks = list(track.height = 0.1)
)

# Add class labels around the outer ring
circos.trackPlotRegion(track.index = 1, panel.fun = function(x, y) {
  xlim   <- get.cell.meta.data("xlim")
  ylim   <- get.cell.meta.data("ylim")
  sector <- get.cell.meta.data("sector.index")
  circos.text(
    mean(xlim), ylim[1] + 0.5, sector,
    facing     = "clockwise",
    niceFacing = TRUE,
    adj        = c(0, 0.5),
    cex        = 0.85
  )
}, bg.border = NA)

title("RF: Misclassification Flow by Land Use Class", cex.main = 1.1)
Figure 10. Chord diagram of RF misclassifications on the test set. Arc width is proportional to cell count. Within-sector arcs represent correct predictions; cross-sector arcs represent misclassifications.

Figure 10. Chord diagram of RF misclassifications on the test set. Arc width is proportional to cell count. Within-sector arcs represent correct predictions; cross-sector arcs represent misclassifications.

circos.clear()

Figure 10 presents the RF confusion matrix as a chord diagram. The width of each arc is proportional to the number of test cells in that predicted-actual combination. Wide within-sector arcs indicate correct predictions; wide cross-sector arcs identify the dominant confusion pairs. The diagram makes it visually clear which class pairs drive the overall error rate and in which direction confusion tends to occur, complementing the row-level F1 scores in Table 1.


6 Summary and Conclusions

Table 2. Final model performance summary (test set).
Metric RF ANN
Overall Accuracy 0.617 0.576
Kappa 0.463 0.405
F1 - Agricultural 0.609 0.557
F1 - Industrial 0.375 0.333
F1 - Natural 0.506 0.500
F1 - Urban 0.773 0.724
Table 3. Moran’s I for spatial clustering of prediction errors.
Model Moran’s I p-value Interpretation
RF 0.069 0.031 statistically significant (p < 0.05)
ANN 0.126 0.000 statistically significant (p < 0.05)
Table 4. Top 3 RF predictors by mean decrease in accuracy.
Predictor Mean Decrease Accuracy
Residential roads 46.00
Total road length 43.07
POI count 28.05

This project classified land use across three Polish cities at 1 km2 resolution using Random Forest and an Artificial Neural Network, with infrastructure and population features as predictors and CORINE 2018 as ground truth. The main findings are as follows.

Model performance. RF achieves higher overall accuracy (0.617 vs 0.576) and a higher Kappa coefficient (0.463 vs 0.405). Both models classify Urban cells most reliably (RF F1 = 0.773; ANN F1 = 0.724), where road density and POI counts provide a strong and spatially concentrated signal. Industrial cells are the hardest to classify correctly (RF F1 = 0.375; ANN F1 = 0.333), a result that reflects both the small training sample and the spatial overlap between Industrial and Urban land use in Polish cities.

Variable importance. The three most predictive features in the RF model are Residential roads (importance = 46), Total road length (43.07), and POI count (28.05). Population density ranks 4th, well below the road-based features. This ranking reflects a limitation of the GUS data: county-level density values cannot capture fine-grained variation within a city, whereas road network features vary meaningfully at the 1 km2 scale.

Spatial error structure. RF errors are statistically significant (p < 0.05) (Moran’s I = 0.069, p = 0.0308). ANN errors are statistically significant (p < 0.05) (Moran’s I = 0.126, p = 4^{-4}). Errors from both models concentrate at urban-rural transition zones, where land use categories are spatially mixed within single grid cells and the available feature set cannot resolve class boundaries at that scale.

Limitations. The CORINE 2018 dataset has a minimum mapping unit of 25 ha, which may not accurately capture fine-grained urban patterns at 1 km2 resolution, particularly in areas of dispersed settlement (Bielecka and Ciolkosz, 2020). The absence of spectral or satellite imagery as input means the feature set covers only the socioeconomic and infrastructure dimensions of land use. The downsampling strategy reduces the effective training set size, which constrains model capacity for the Industrial class. Finally, the three-city sample limits the generalisability of the findings to other urban contexts in Poland or beyond.

Future directions. The predictive performance of both models could be improved by incorporating finer-grained population data (for example, the Eurostat GEOSTAT 1 km2 population grid) and by testing spatially explicit model variants such as spatial random forest, which accounts for residual spatial autocorrelation in the prediction errors identified here. Extending the analysis to additional Polish cities would allow for a more robust evaluation of whether the infrastructure-based feature set generalises across different urban morphologies.


7 References

Bielecka, E. and Ciolkosz, A. (2020). The problem of mismatch between the CORINE Land Cover data classification and the development of settlement in Poland. Remote Sensing, 12(14), 2253. https://doi.org/10.3390/rs12142253

European Environment Agency (2020). CORINE Land Cover 2018. Copernicus Land Monitoring Service. https://land.copernicus.eu/pan-european/corine-land-cover/clc2018

Kopczewska, K. (2025). Modelling Spatial Density in 2D, 3D, 4D: Spatial Statistics, Spatial Econometrics and Spatial Machine Learning with Applications in R. Routledge.

Kopczewska, K. (2024). Land use changes vs. population changes. RPubs. https://rpubs.com/Kathy_Kopczewska/land_use_changes

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.

Maxwell, A. E., Warner, T. A. and Fang, F. (2018). Implementation of machine-learning classification in remote sensing: an applied review. International Journal of Remote Sensing, 39(9), 2784-2817. https://doi.org/10.1080/01431161.2018.1433343