1. Introduction

Nitrogen dioxide (\(NO_2\)) is a critical urban air pollutant characterized by high spatial variability and significant health impacts. In the context of Poland, high traffic density in metropolitan areas and industrial activities contribute to complex spatial patterns of pollution that vary significantly across the country.

Monitoring \(NO_2\) levels effectively is constrained by the trade-off between the high accuracy of ground-based monitoring stations and the continuous spatial coverage provided by remote sensing platforms. The national monitoring network (GIOŚ) provides precise point measurements, but its sparse distribution leaves large areas without direct observation. Conversely, the Copernicus Sentinel-5P TROPOMI mission provides a comprehensive view of the nitrogen dioxide tropospheric column, yet these satellite retrievals require spatial downscaling and integration with local features to accurately reflect ground-level concentrations.

This research aims to address these challenges by applying a Spatial Machine Learning framework. The study focuses on three main objectives: 1. Predicting ground-level \(NO_2\) at a 5 km resolution for the year 2024, utilizing TROPOMI data integrated with spatial features from OpenStreetMap (OSM). 2. Comparing model performance between supervised learning algorithms (Random Forest, Artificial Neural Networks) and traditional geostatistical interpolation techniques (Ordinary and Universal Kriging). 3. Analyzing spatial structures and identifying pollution hotspots through unsupervised clustering and local indicators of spatial association (LISA).

By bridging satellite observations with ground-truth measurements and localized spatial proxies, this project provides a detailed mapping of air quality exposure across the Polish territory.

2. Data Description and Preprocessing

The methodology relies on a multi-source data integration approach, where all observations are harmonized into a continuous spatial grid covering the entire territory of Poland.

1. TROPOMI NO₂ (Sentinel-5P satellite) Monthly mean tropospheric NO₂ column density for 2024 (12 GeoTIFF files). Generated via Google Earth Engine (code.earthengine.google.com) using the COPERNICUS/S5P/OFFL/L3_NO2 collection. Monthly composites were computed by averaging daily observations, then exported as GeoTIFFs cropped to Poland’s bounding box at ~5km resolution.

2. GIOŚ ground station measurements Hourly NO₂, SO₂, CO, PM10, PM2.5, O₃ measurements for 2024 from ~250 automatic monitoring stations. However, many stations only measure particulate matter (PM) or lack continuous data. After filtering for active \(NO_2\) sensors and enforcing a strict 75% minimum temporal coverage threshold for 2024, the dataset was reduced to 134 valid stations. It is important to note that aggregating this data to an annual mean collapses all seasonal variations (such as massive pollution spikes during the winter heating season), providing a general baseline rather than peak-exposure modeling. Downloaded as bulk xlsx files from the GIOŚ archive (powietrze.gios.gov.pl/pjp/archives → “Wyniki pomiarów 2024”). Station metadata (codes, coordinates) from the accompanying “Metadane oraz kody stacji i stanowisk pomiarowych” xlsx file from the same page.

3. OpenStreetMap infrastructure Road network (motorways, trunk, primary, secondary), railways, and industrial land use polygons. Downloaded as GeoPackage files from Geofabrik (download.geofabrik.de/europe/poland.html) — 16 separate voivodeship .gpkg files, merged in R using sf::st_read() and filtered by feature class.

4. GUS BDL population data Population density at gmina level for 2024. Downloaded from Bank Danych Lokalnych (bdl.stat.gov.pl) → LUDNOŚĆ → STAN LUDNOŚCI → Gęstość zaludnienia. Exported as CSV.

5. Administrative boundaries (GADM) Polish administrative boundaries (gminy/powiaty). Downloaded as GeoPackage from gadm.org (gadm41_POL.gpkg).

2.1. Spatial Reference Framework

To ensure geometric accuracy and minimize distortion for the Polish territory, all spatial data were projected into the EPSG:2180 coordinate system. We constructed a regular square grid with a spatial resolution of 5 km x 5 km. This resolution was selected to balance the granularity of OpenStreetMap features with the native resolution of the Sentinel-5P TROPOMI sensor (~5.5 km).

2.2. Variable Integration

The modeling dataset incorporates the following feature sets:

  • Satellite Observations (Remote Sensing): Monthly tropospheric \(NO_2\) column densities from the Sentinel-5P TROPOMI mission. These values were averaged to produce an annual mean for each grid cell, serving as a primary indicator of atmospheric pollution levels.
  • Anthropogenic Proxies (OSM): Local emission sources were captured through vector data from OpenStreetMap. We calculated the total density (km per cell) of major roads (motorways, primary, and secondary) and railway lines. Additionally, the percentage of industrial land use per cell was computed to account for stationary emission sources.
  • Geographic Coordinates: The centroid coordinates (\(X, Y\)) of each cell were included to allow models to account for broad-scale spatial gradients (e.g., the south-north pollution gradient in Poland).

2.3. Data Summary

The final analysis grid consists of 12,855 cells located within the administrative borders of Poland. The table below presents the descriptive statistics for the numerical features used in the supervised learning models.

Descriptive Statistics of Spatial Features
Feature n Mean SD Min Max
industrial_pct 12855 0.470512 1.583459 0.000000 43.218896
motorway_km 12855 0.840494 2.763462 0.000000 22.584651
railway_km 12855 3.491971 10.265218 0.000000 221.961775
total_road_km 12855 4.690317 5.728191 0.000000 83.960786
tropomi_no2_mean 12855 0.000037 0.000009 0.000015 0.000084
x_coord 12855 528724.672028 173995.862221 172077.453047 862077.453047
y_coord 12855 471472.512076 154662.423362 140043.107175 775043.107175

As observed in the summary statistics, the distribution of anthropogenic proxies tends to be right-skewed. While basic road infrastructure is present across nearly all grid cells, high-capacity emission sources such as motorways and industrial zones are heavily concentrated in a small fraction of the territory. Pollution is not evenly dispersed but rather tightly bound to localized human activity. Consequently, the standard deviation for these infrastructure variables often exceeds their mean, underscoring the necessity for non-linear, tree-based machine learning algorithms capable of handling extreme local spikes without assuming a normal distribution.

The observed small values in tropomi_no2_mean are due to the fact, that TROPOMI measures tropospheric column density — the total number of \(NO_2\) molecules in a vertical pillar of air stretching from the ground to the top of the troposphere. The standard unit exported by Google Earth Engine for Sentinel-5P is moles per square meter (\(mol/m^2\)). Actual values in this dataset are on the order of \(10^{-5} mol/m^2\), which simply display as very small values when rounded. Highly polluted urban areas might reach \(10^{-4} mol/m^2\), which is entirely consistent with expected satellite observations.

3. Exploratory Data Analysis (EDA)

Before proceeding with predictive modeling, we conducted an exploratory analysis to understand the spatial distribution of our observations and the covariates.

3.1. Spatial Distribution of Monitoring Stations

The ground-truth data consists of 134 monitoring stations from the GIOŚ network. As shown in the map below, while the network covers the entire country, there is a higher density of stations in heavily urbanized and industrialized regions, particularly in the southern part of Poland. This spatial bias is common in environmental monitoring and must be considered during model validation.

Location of GIOŚ monitoring stations across Poland.

Location of GIOŚ monitoring stations across Poland.

3.2. Tropospheric NO2 (TROPOMI)

The annual mean tropospheric \(NO_2\) column density from Sentinel-5P provides a continuous baseline for the model. The satellite observations clearly identify regional pollution hotspots in the Upper Silesian Industrial Region, Kraków, and the Warsaw metropolitan area. These regional patterns correlate with the broad geographic distribution of anthropogenic activity.

3.3. Feature Correlations

To understand the relationships and potential redundancy among our spatial predictors, we examined the Pearson correlation coefficients. As expected, total road density and motorway length exhibit positive correlations with industrial areas, reflecting the co-location of infrastructure and economic activity.

Pearson correlation matrix of spatial covariates.

Pearson correlation matrix of spatial covariates.

3.4. Spatial Autocorrelation (Moran’s I)

A fundamental assumption of spatial machine learning is that observations are not independent. To quantify this, we calculated the Global Moran’s I statistic for the TROPOMI \(NO_2\) observations using a k-nearest neighbors (k=8) spatial weights matrix.

## Global Moran's I: 0.9796
## P-value: < 0.000000000000000222

The highly positive Moran’s I value and the statistically significant p-value (\(p < 0.001\)) strongly reject the null hypothesis of spatial randomness. This confirms a high degree of spatial clustering in \(NO_2\) levels, justifying the use of spatial coordinates and neighborhood-aware algorithms in our subsequent modeling phases.

4. Predictive Modeling: Supervised Machine Learning

To generate a continuous map of ground-level NO₂ from our 134 point measurements, we formulated the problem as a supervised spatial regression task. We evaluated two distinct algorithms: an Artificial Neural Network (ANN) and a Random Forest (RF) ensemble.

The dataset was partitioned into an 80% training set (110 samples) and a 20% hold-out test set (24 samples). Crucially, to directly address the challenge of spatial data leakage, we implemented two different validation strategies for the Random Forest model:

  1. RF Standard: A traditional 5-fold cross-validation with random splits, which serves as a baseline but potentially leaks spatial autocorrelation between training and validation folds.
  2. RF Spatial CV: A rigorous Spatial Block Cross-Validation, where training stations were clustered into 5 distinct geographic zones using a k-means algorithm based on coordinates, ensuring that the model is validated on entirely unseen regions of Poland.

With the validation strategies defined, we established the specific model architectures to ensure full reproducibility. The ANN was implemented as a single-hidden-layer feed-forward network using the BFGS optimization algorithm, with hyperparameter tuning evaluating network sizes from 4 to 24 neurons and weight decay penalties (0.1 to 0.001). Both Random Forest models were stabilized using 1000 trees (num.trees = 1000) and tuned across mtry and terminal node sizes using the ‘extratrees’ and ‘variance’ split rules. ### 4.1. Model Performance and Comparison As detailed in the table below, the Random Forest model significantly outperformed the Artificial Neural Network.

Performance Metrics on the 20% Hold-Out Test Set
model n R2 RMSE MAE
RF_Standard 24 0.614 6.466 4.038
RF_Spatial_CV 24 0.518 6.996 4.450
ANN 24 0.449 7.088 5.732

The empirical comparison reveals a clear case of spatial optimism bias. The standard Random Forest model yields an optimistic \(R^2 \approx 0.614\), as it benefits from the local spatial autocorrelation inherent in random splits. When the spatial data leakage is blocked via Spatial Cross-Validation, the performance drops to \(R^2 \approx 0.518\). This more conservative metric represents the model’s true ability to generalize and predict NO₂ profiles in entirely unmonitored regions.

It is important to note that with only 24 samples in the hold-out test set, these metrics possess wide confidence intervals and should be interpreted with caution. The ANN achieved a lower fit (\(R^2 \approx 0.449\)), demonstrating that 110 training samples are structurally insufficient for stable weight optimization in a neural network. Consequently, the Random Forest was selected as our primary predictive engine.

4.2. Feature Importance

One of the primary advantages of the Random Forest algorithm is its interpretability through Gini impurity-based feature importance.

The variable importance chart reveals crucial insights into the drivers of local pollution: 1. Total Road Density is the overwhelmingly dominant predictor, suggesting that urban traffic is the primary source of \(NO_2\) at the ground level. 2. Conversely, Motorway Length provided minimal to zero predictive value. This may reflect the network’s urban bias - GIOŚ stations are predominantly located within populated centers, providing few training examples near rural expressways. Alternatively, at a coarse 5 km grid resolution, emissions from narrow motorway corridors may disperse and become overshadowed by broader urban or industrial background levels. 3. Industrial Percentage and TROPOMI NO2 serve as the strongest secondary predictors, confirming the necessity of integrating both bottom-up (infrastructure) and top-down (satellite) data sources.

4.3. Predicted Ground-Level NO2 Surface

Using the tuned Random Forest model, we projected ground-level \(NO_2\) for the entire territory of Poland. The resulting map successfully downscales the coarse satellite data, capturing sharp local gradients along major urban corridors and industrial zones that TROPOMI alone cannot resolve.

Predicted ground-level NO2 concentrations using the Random Forest model.

Predicted ground-level NO2 concentrations using the Random Forest model.

5. Unsupervised Spatial Structure

Beyond predictive modeling, we applied unsupervised learning techniques to identify latent spatial structures and regional clusters of \(NO_2\) pollution. These methods allow for the identification of areas sharing similar pollution profiles and infrastructure characteristics without relying on pre-defined labels.

5.1. Density-Based Clustering (DBSCAN)

We utilized the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to isolate the core metropolitan and industrial hotspots. To focus exclusively on the most severe pollution zones, the algorithm was applied to cells exceeding the 90th percentile of annual \(NO_2\) concentrations. The optimal algorithm parameters were defined as: search radius (\(\epsilon\)) of 8000 meters and a minimum points threshold (\(minPts\)) of 4.

The model identified 7 distinct clusters, primarily concentrated in the Silesian Voivodeship and around major urban centers such as Warsaw, Wrocław, and Poznań. Unlike grid-wide statistics, DBSCAN effectively isolates these geographic anomalies, highlighting the localized nature of extreme \(NO_2\) exposure.

DBSCAN clusters identifying the top 10% NO2 hotspots in Poland.

DBSCAN clusters identifying the top 10% NO2 hotspots in Poland.

5.2. Local Indicators of Spatial Association (LISA)

To rigorously quantify the spatial structure of pollution, we calculated the Local Moran’s I (LISA) statistic. We utilized a strict statistical significance threshold of \(p < 0.01\) to identify robust regional clusters.

The LISA analysis reveals a powerful spatial contagion effect:

  • High-High Clusters: A dominant “High-High” cluster (highly polluted cells surrounded by similar neighbors) covers a vast area of southern and central Poland. This confirms that NO2 levels in these regions are driven by large-scale, contiguous industrial and urban activities.
  • Low-Low Clusters: Statistically significant coldspots are observed in the northern and northeastern regions, corresponding to areas with lower population density and higher forest cover.
  • Spatial Outliers: Rare “High-Low” or “Low-High” outliers point to isolated industrial facilities or protected green zones within otherwise polluted urban corridors. They are not present in the picture below.

The presence of these significant clusters mathematically justifies the necessity of spatial regression techniques over traditional global models.

LISA map confirming statistically significant NO2 clusters (p < 0.01).

LISA map confirming statistically significant NO2 clusters (p < 0.01).

5.3. Aspatial Typology (PAM)

To contrast the spatial methods, we applied Partitioning Around Medoids (PAM) to the infrastructure variables (road density, industrial percentage) without enforcing geographic contiguity.

The optimal number of clusters was determined to be \(k=5\) using silhouette width analysis. However, as seen in the map below, the resulting typology is highly fragmented geographically. This fragmentation highlights a critical finding: air pollution profiles cannot be defined solely by universal, country-wide infrastructure rules. A high-traffic cell in Warsaw and a high-traffic cell along a rural highway corridor might be placed in the same PAM cluster, but they experience entirely different regional pollution baselines. This strongly reinforces the need for spatially aware algorithms like LISA.

Aspatial PAM clustering demonstrating severe geographic fragmentation.

Aspatial PAM clustering demonstrating severe geographic fragmentation.

6. Geostatistical Modeling: The Limits of Kriging

To provide a baseline comparison against our Machine Learning approach, we implemented two traditional geostatistical methods: Ordinary Kriging (OK) and Universal Kriging (UK, incorporating geographic coordinates as a trend).

6.1. Variogram Analysis and Spatial Range

The foundation of Kriging is the empirical semivariogram, which models how spatial autocorrelation decays with distance. The fitted model (Ste) estimated a spatial Range of \(\approx\) 1.8 km. This extremely short range must be interpreted with extreme caution. Because \(NO_2\) is a highly localized pollutant, its spatial dependence drops quickly. However, our empirical variogram is built upon only 134 sparse points spaced tens to hundreds of kilometers apart. Consequently, the model lacks the dense, short-distance point pairs necessary to reliably estimate the true micro-scale range. While the exact 1.8 km figure is statistically unreliable due to this sparsity, it provides strong supporting evidence for why Kriging struggles in this study: the monitoring network density is vastly insufficient to capture the continuous variance required for geostatistical interpolation across a 5 km grid.

Empirical semivariogram and fitted model showing a very short spatial range.

Empirical semivariogram and fitted model showing a very short spatial range.

6.2. Prediction Accuracy and Variance

Given the variogram results, both Kriging models failed to capture the spatial variability of \(NO_2\). Ordinary Kriging achieved an \(R^2\) of 0.000, effectively predicting the global mean for the majority of the study area. Universal Kriging, despite incorporating geographic trends, showed only a marginal improvement with an \(R^2\) of 0.049.

This statistical failure is visually confirmed by examining the Ordinary Kriging Variance (Uncertainty) map. The standard error is exceptionally high across most of the Polish territory, dropping only in the immediate, pixel-level vicinity of the monitoring stations. This effect around stations is a symptom of a high nugget-to-sill ratio, indicating that the model cannot reliably interpolate values beyond the immediate proximity of a sensor.

Ordinary Kriging Variance Map. Darker colors indicate higher uncertainty/standard error.

Ordinary Kriging Variance Map. Darker colors indicate higher uncertainty/standard error.

7. Discussion

This study aimed to bridge the gap between high-precision but sparse ground measurements and continuous but coarse satellite observations. The results provide strong evidence for the necessity of spatial machine learning in environmental modeling.

7.1. The Advantage of Machine Learning over Geostatistics

The contrast in performance between the Random Forest model (standard \(R^2 \approx 0.614\) / spatial \(R^2 \approx 0.518\)) and the Kriging models (\(R^2 \le 0.049\)) highlights a fundamental characteristic of nitrogen dioxide: it is an inherently local pollutant.

Traditional geostatistical methods like Kriging assume a smooth spatial gradient. However, our empirical variogram suggested a highly localized spatial range (estimated at 1.8 km, though constrained by data sparsity). Because the typical distance between GIOŚ monitoring stations vastly exceeds this localized range, Kriging suffers from an extreme “nugget effect,” rendering it mathematically blind to the spaces between stations.

In contrast, the Random Forest algorithm does not rely purely on geographic distance. By utilizing high-resolution OpenStreetMap data, the RF model effectively “learns” the physical rules of pollution (e.g., that high road density produces high \(NO_2\)). This allows the ML model to accurately predict pollution spikes in unmonitored areas based on their infrastructure profile, succeeding exactly where Kriging fails.

7.2. Spatial vs. Aspatial Clustering: The Role of Context

The unsupervised analysis provided a unique perspective on the spatial structure of Polish air quality. The contrast between the LISA results and the PAM typology is particularly telling.

The LISA analysis (Local Moran’s I) identified a large, contiguous “High-High” cluster in the southern-central part of the country. This indicates that pollution is not just a collection of isolated points, but a regional phenomenon where highly polluted cells reinforce the levels of their neighbors. This regional co-variation suggests that regional policy interventions are required, as local efforts in one city may be offset by high emissions in neighboring districts.

In contrast, the PAM clustering produced a fragmented, “salt-and-pepper” distribution across the map. While PAM correctly grouped cells with similar road and industrial densities, it failed to account for their geographic location. A high-infrastructure cell in the relatively clean northern region was often clustered with a similar cell in the heavily polluted south. The fragmentation of the PAM map demonstrates that infrastructure alone is not destiny; the regional geographic context—captured effectively by the spatial weights in LISA - is the true driver of the observed \(NO_2\) patterns.

7.3 Methodological Limitations

While Machine Learning significantly outperformed geostatistics, this approach has limitations. First, our initial Random Forest evaluation relied on standard cross-validation, which inadvertently leaks spatial autocorrelation. By implementing Spatial Block Cross-Validation, we quantified this optimism bias, observing a drop in \(R^2\) from 0.614 to 0.518. This demonstrates the necessity of rigorous spatial validation when modeling geographic data.

Secondly, the dataset’s sparsity (only 134 valid stations, leaving a test set of 24) results in wide confidence intervals for all reported performance metrics. For proper training (especially for ANN) more data would be required.

Finally, while Kriging intrinsically maps its own prediction variance (uncertainty), standard Random Forest models do not. Integrating methods such as Quantile Regression Forests would be necessary to generate reliable confidence intervals for the predicted \(NO_2\) surface.

8. Conclusions

The integration of Sentinel-5P TROPOMI data with high-resolution spatial proxies from OpenStreetMap provides a robust framework for mapping ground-level \(NO_2\) concentrations in Poland. Our analysis leads to three primary conclusions:

  1. Random Forest is superior for local pollution modeling: Achieving an optimistic standard \(R^2\) of 0.614 and a rigorous out-of-sample spatial \(R^2\) of 0.518, the RF model proved capable of downscaling satellite data to a 5 km resolution, effectively capturing urban-scale gradients that traditional geostatistics cannot resolve.
  2. The “Spatial Range” limitation: The very short spatial dependence of \(NO_2\) (1.8 km) explains the failure of Kriging models. This finding emphasizes that without an extremely dense (and expensive) physical monitoring network, machine learning is the only viable path for nationwide air quality mapping.
  3. Infrastructure as a proxy for exposure: The high importance of total road density as a predictor highlights that while satellite data provides the “atmospheric baseline,” local infrastructure data is the “key” that unlocks the prediction of human exposure at the ground level.

Future research should focus on incorporating temporal dynamics (seasonal variations) and higher-resolution meteorological data to further refine the predictive accuracy of the spatial machine learning models.