Introduction

Air pollution in urban areas exhibits strong spatial patterns due to localized emission sources and atmospheric transport processes. In particular, nitrogen oxides (NOx) generated by road traffic are known to cluster spatially along major transport corridors and densely populated areas. The research question of this project is: to what extent do non-traffic NOx sources (industrial, residential, background and others) explain the spatial variation of traffic-related NOx concentrations in Warsaw, and is this relationship spatially dependent?

Data description

Data were obtained from zenodo and below I attached the link to them:

https://zenodo.org/records/10987360

## Reading layer `Warsaw_Source_Contribution_2019_NOx_100m' from data source 
##   `C:\Users\Lukasz\Desktop\Masters thesis\masters thesis\data\Warsaw_Source_Contribution_2019_NOx_PM10_PM25_100m\Warsaw_Source_Contribution_2019_NOx_100m.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 147771 features and 5 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 20.77529 ymin: 52.05278 xmax: 21.29156 ymax: 52.43163
## Geodetic CRS:  WGS 84

The analysis is based on EPISODE model output providing annual average NOx concentrations on a 100 m grid for the Warsaw metropolitan area. The following variables are used:

  • NOx_TRA – traffic-related NOx concentration (dependent variable)

  • NOx_IND – industrial contribution

  • NOx_RES – residential contribution

  • NOx_OTH – other sources

  • NOx_BGC – background contribution

All variables are expressed in µg/m³.

summary(nox_data)
##     NOx_BGC          NOx_IND           NOx_OTH           NOx_RES       
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 5.192   1st Qu.: 0.5655   1st Qu.: 0.1758   1st Qu.: 0.5197  
##  Median : 6.251   Median : 1.2995   Median : 0.2734   Median : 0.6932  
##  Mean   : 6.353   Mean   : 1.9941   Mean   : 0.3448   Mean   : 0.7245  
##  3rd Qu.: 7.409   3rd Qu.: 2.4323   3rd Qu.: 0.4147   3rd Qu.: 0.8892  
##  Max.   :10.237   Max.   :89.6058   Max.   :20.4633   Max.   :80.4736  
##     NOx_TRA                 geometry     
##  Min.   :  0.000   POLYGON      :147771  
##  1st Qu.:  6.586   epsg:4326    :     0  
##  Median : 11.207   +proj=long...:     0  
##  Mean   : 15.207                         
##  3rd Qu.: 19.654                         
##  Max.   :108.149

Basic data cleaning and variable selection was performed:

nox_data <- nox_data %>%
  dplyr::select(NOx_TRA, NOx_IND, NOx_RES, NOx_OTH, NOx_BGC, geometry) %>%
  filter(
    is.finite(NOx_TRA),
    is.finite(NOx_IND),
    is.finite(NOx_RES),
    is.finite(NOx_OTH),
    is.finite(NOx_BGC)
  )

The data were transformed to the (EPSG:2180) coordinate system to ensure that spatial relationships are expressed in meters rather than degrees

nox_data <- st_transform(nox_data, 2180)   

The dataset contains over 140,000 grid cells. To ensure computational feasibility while preserving spatial structure, a random subsample of 7,000 observations is used. The sampling is purely random, meaning that each grid cell has the same probability of being selected. As a result, the subsample preserves the overall spatial distribution of NOx concentrations across Warsaw and does not systematically exclude any specific area. Even after subsampling, each observation retains a sufficient number of neighbors within a reasonable distance which was set to 1.5 km, allowing spatial dependence to be detected and modeled reliably

set.seed(123)
nox_sample <- nox_data %>%
  slice_sample(n = 7000)

Centroids are used to define spatial relationships between grid cells.The neighborhood list is converted into a row-standardized spatial weights matrix. Row standardization ensures that the influence of neighboring observations is comparable across locations with different numbers of neighbors.

coords <- st_coordinates(st_centroid(nox_sample))
nb <- dnearneigh(coords, 0, 1500)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

Visualisation

Before proceeding with the modelling part of the project, visualization of different NOx sources was performed, to better understand which areas of Warsaw are more prone to different sources of pollution.

bb_warsaw <- getbb("Warszawa, Polska")



# Big roads
big_streets <- opq(bb_warsaw) %>%
  add_osm_feature(
    key = "highway",
    value = c("motorway", "primary", "motorway_link", "primary_link")
  ) %>%
  osmdata_sf()
# Medium roads
med_streets <- opq(bb_warsaw) %>%
  add_osm_feature(
    key = "highway",
    value = c("secondary", "tertiary", "secondary_link", "tertiary_link")
  ) %>%
  osmdata_sf()

# Rivers
river <- opq(bb_warsaw) %>%
  add_osm_feature(key = "waterway", value = "river") %>%
  osmdata_sf()

# Railways
railway <- opq(bb_warsaw) %>%
  add_osm_feature(key = "railway", value = "rail") %>%
  osmdata_sf()



nox_ll <- st_transform(nox_data, 4326)

# Use centroids (standard in spatial EDA)
nox_pts <- st_centroid(nox_ll)



filter_top10 <- function(sf_obj, var) {
  thr <- quantile(sf_obj[[var]], 0.9, na.rm = TRUE)
  sf_obj %>%
    dplyr::filter(.data[[var]] >= thr)
}



plot_nox_top10 <- function(var, title_suffix) {
  
  nox_top <- filter_top10(nox_pts, var)
  
  ggplot() +
    geom_sf(
      data = railway$osm_lines,
      color = "#ffbe7f",
      size = 0.3,
      linetype = "dotdash",
      alpha = 0.6
    ) +
    geom_sf(
      data = med_streets$osm_lines,
      color = "#ffbe7f",
      size = 0.4,
      alpha = 0.5
    ) +
    geom_sf(
      data = big_streets$osm_lines,
      color = "#ffbe7f",
      size = 0.7,
      alpha = 0.7
    ) +
    geom_sf(
      data = nox_top,
      aes(color = .data[[var]]),
      size = 0.9,
      alpha = 0.9
    ) +
    scale_color_viridis_c(
      option = "magma",
      name = paste0(var)
    ) +
    coord_sf(
      xlim = bb_warsaw[c("xmin", "xmax")],
      ylim = bb_warsaw[c("ymin", "ymax")]
    ) +
    labs(
      title = "Warsaw, Poland",
      subtitle = paste("Top 10% highest values of", title_suffix),
      color = var
    ) +
    theme_minimal()
}


p_TRA_top <- plot_nox_top10("NOx_TRA", "traffic-related NOx")
p_IND_top <- plot_nox_top10("NOx_IND", "industrial NOx")
p_RES_top <- plot_nox_top10("NOx_RES", "residential NOx")
p_OTH_top <- plot_nox_top10("NOx_OTH", "other sources NOx")

To better understand the spatial heterogeneity of NOx sources in Warsaw, additional maps were constructed showing only the top 10% highest values for each NOx component (traffic, industrial, residential, and other sources). This approach allows the analysis to focus on pollution hotspots around the city, which are most relevant for identifying dominant emission sources in different regions.

Figure 1. Map of traffic related NOx pollution

The map of the highest 10% of traffic-related NOx concentrations shows a dense, spatial pattern closely aligned with Warsaw’s major road network. High concentrations are observed along primary roads and ring roads that circle around the city (especially in the northern part), near major transport corridors and intersections and especially in the central areas of the city.

Figure 2. Map of industry related NOx pollution

The industrial NOx hotspots appear highly clustered in a 2 compact zones, rather than being spread across the city. One in the far north of Warsaw, and one towards the south-east. Unlike traffic NOx, industrial NOx does not follow the road network. Instead, emissions are spatially concentrated in two hot spot areas and relatively isolated.

Figure 3. Map of residential areas related NOx pollution

Residential NOx hotspots consist of moderately clustered patches that are located outside the very city center and in areas with dense housing but lower traffic intensity. Suggesting that the highest amounts of NOx that comes from residential areas is located in the suburbs and outskirts of Warsaw.

Figure 4. Map of NOx pollution from other sources

NOx from other sources shows irregular but clearly clustered hotspots, often overlapping partially with industrial zones and residential areas. This can be possibly caused by some construction activities or commercial sources of NOx pollution.

The hotspot visualizations strongly reinforce the hypothesis that the clustering of high NOx values confirms strong positive spatial autocorrelation. Traffic NOx exhibits clear spatial spillovers while industrial and residential NOx are locally clustered. Overall, the maps provide general visual confirmation that NOx concentrations in Warsaw are not spatially random, and that spatial econometric models should be used in other to further investigate this problem.

Modelling

OLS

A standard linear regression model is estimated as a benchmark.

ols_model <- lm(
  NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC,
  data = nox_sample
)

summary(ols_model)
## 
## Call:
## lm(formula = NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC, 
##     data = nox_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.849  -5.469  -3.033   1.305  81.832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.14216    0.88056  -9.247   <2e-16 ***
## NOx_IND      0.91770    0.04062  22.590   <2e-16 ***
## NOx_RES     14.81309    0.47618  31.108   <2e-16 ***
## NOx_OTH      7.49155    0.40293  18.593   <2e-16 ***
## NOx_BGC      1.30625    0.10643  12.274   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.55 on 6995 degrees of freedom
## Multiple R-squared:  0.295,  Adjusted R-squared:  0.2946 
## F-statistic: 731.7 on 4 and 6995 DF,  p-value: < 2.2e-16

To test whether OLS residuals exhibit spatial dependence, Moran’s I statistic is computed. The Moran’s I test is based on the following hypotheses:

  • Null hypothesis (H₀): The residuals of the OLS model are spatially uncorrelated.

  • Alternative hypothesis (H₁): The residuals of the OLS model exhibit positive spatial autocorrelation, meaning that similar residual values cluster in space.

moran.test(residuals(ols_model), lw, zero.policy = TRUE)
## 
##  Moran I test under randomisation
## 
## data:  residuals(ols_model)  
## weights: lw    
## 
## Moran I statistic standard deviate = 236.12, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      7.127492e-01     -1.428776e-04      9.115479e-06

This result demonstrates that the OLS residuals are not spatially random. Instead, nearby locations tend to exhibit similar unexplained deviations from the model.Therefore, Moran’s I test strongly rejects the null hypothesis of spatially uncorrelated residuals, indicating the presence of spatial dependence and justifying the use of spatial econometric models such as Spatial Lag Model (SAR) or the Spatial Error Model (SEM), which explicitly account for spatial dependence and will be utilized in the latter part of this project.

lm_tests <- lm.RStests(
  ols_model, lw,
  test = c("RSlag", "RSerr", "adjRSlag", "adjRSerr"),
  zero.policy = TRUE
)

lm_tests
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC,
## data = nox_sample)
## test weights: lw
## 
## RSlag = 40941, df = 1, p-value < 2.2e-16
## 
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC,
## data = nox_sample)
## test weights: lw
## 
## RSerr = 55425, df = 1, p-value < 2.2e-16
## 
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC,
## data = nox_sample)
## test weights: lw
## 
## adjRSlag = 587.56, df = 1, p-value < 2.2e-16
## 
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC,
## data = nox_sample)
## test weights: lw
## 
## adjRSerr = 15073, df = 1, p-value < 2.2e-16

Both standard and robust LM diagnostics strongly reject the null hypothesis of no spatial dependence, with the robust LM-error test dominating. This indicates that a Spatial Error Model is the most appropriate specification. However, both SEM and SAR will be used and their performances will be compared.

SEM

sem_model <- errorsarlm(
  NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC,
  data = nox_sample,
  listw = lw,
  zero.policy = TRUE
)

summary(sem_model)
## 
## Call:errorsarlm(formula = NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + 
##     NOx_BGC, data = nox_sample, listw = lw, zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -18.92521  -1.99463  -0.63015   0.33797  65.28164 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  5.0240755  4.3079079  1.1662  0.24352
## NOx_IND      0.0087437  0.0347746  0.2514  0.80148
## NOx_RES      4.8232687  0.6025300  8.0050 1.11e-15
## NOx_OTH     -0.7642236  0.3974016 -1.9231  0.05447
## NOx_BGC      1.0559205  0.4783987  2.2072  0.02730
## 
## Lambda: 0.97918, LR test value: 10209, p-value: < 2.22e-16
## Asymptotic standard error: 0.0043366
##     z-value: 225.8, p-value: < 2.22e-16
## Wald statistic: 50984, p-value: < 2.22e-16
## 
## Log likelihood: -21320.14 for error model
## ML residual variance (sigma squared): 24.119, (sigma: 4.9111)
## Number of observations: 7000 
## Number of parameters estimated: 7 
## AIC: 42654, (AIC for lm: 52862)

Compared to the OLS model, the estimated coefficients in the SEM are changed in both magnitude and statistical significance.

Residential and Background NOx pollution are statistically significant at 5% level and have a positive effect of traffic NOx, while Other sources are marginally insignificant and industrial emission is statistically insignificant. Moreover the AIC value reduced greatly in comparison to OLS (from 52862 to 42654) suggesting a much better model fit.

SAR

sar_model <- lagsarlm(
  NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC,
  data = nox_sample,
  listw = lw,
  zero.policy = TRUE
)

summary(sar_model)
## 
## Call:lagsarlm(formula = NOx_TRA ~ NOx_IND + NOx_RES + NOx_OTH + NOx_BGC, 
##     data = nox_sample, listw = lw, zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -19.28994  -2.02825  -0.58314   0.42109  65.36862 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -1.372372   0.415153 -3.3057 0.0009474
## NOx_IND      0.036564   0.019157  1.9087 0.0563018
## NOx_RES      1.506723   0.245802  6.1298 8.797e-10
## NOx_OTH     -0.320468   0.188519 -1.6999 0.0891450
## NOx_BGC      0.124257   0.050354  2.4677 0.0135988
## 
## Rho: 0.96928, LR test value: 10185, p-value: < 2.22e-16
## Asymptotic standard error: 0.0051504
##     z-value: 188.19, p-value: < 2.22e-16
## Wald statistic: 35417, p-value: < 2.22e-16
## 
## Log likelihood: -21332.34 for lag model
## ML residual variance (sigma squared): 24.293, (sigma: 4.9288)
## Number of observations: 7000 
## Number of parameters estimated: 7 
## AIC: 42679, (AIC for lm: 52862)
## LM test for residual autocorrelation
## test value: 14.499, p-value: 0.00014022

For Spatial Lag Model, residential NOx emission remains positive and highly statistically significant, while industrial emission becomes marginally insignificant at 5% level. However, other sources have a significant effect according to SAR model. The AIC is also marginally higher than in the case of SEM with a value of 42679. Moreover, SAR model allows for spatial spillovers in the dependent variable, meaning that changes in NOx emissions in one grid cell can influence traffic-related NOx values in neighboring locations. To properly interpret these effects, the estimated coefficients are decomposed into direct, indirect, and total impacts.

imp <- impacts(
  sar_model,
  listw = lw,
  R = 1000   
)
imp
## Impact measures (lag, exact):
##                    Direct   Indirect      Total
## NOx_IND dy/dx  0.04276159   1.147406   1.190168
## NOx_RES dy/dx  1.76209136  47.281569  49.043660
## NOx_OTH dy/dx -0.37478280 -10.056413 -10.431196
## NOx_BGC dy/dx  0.14531712   3.899243   4.044560

By looking at the results we can see that that residential NOx es are the primary driver of traffic-related NOx concentrations in Warsaw, with effects dominated by strong spatial spillovers rather than purely local impacts.

AIC comparison between all 3 tested models:

AIC(ols_model, sar_model, sem_model)
##           df      AIC
## ols_model  6 52861.66
## sar_model  7 42678.69
## sem_model  7 42654.28
screenreg(
  list(ols_model, sar_model, sem_model),
  custom.model.names = c("OLS", "SAR", "SEM"),
  digits = 3
)
## 
## =================================================================
##                      OLS           SAR             SEM           
## -----------------------------------------------------------------
## (Intercept)            -8.142 ***      -1.372 ***       5.024    
##                        (0.881)         (0.415)         (4.308)   
## NOx_IND                 0.918 ***       0.037           0.009    
##                        (0.041)         (0.019)         (0.035)   
## NOx_RES                14.813 ***       1.507 ***       4.823 ***
##                        (0.476)         (0.246)         (0.603)   
## NOx_OTH                 7.492 ***      -0.320          -0.764    
##                        (0.403)         (0.189)         (0.397)   
## NOx_BGC                 1.306 ***       0.124 *         1.056 *  
##                        (0.106)         (0.050)         (0.478)   
## rho                                     0.969 ***                
##                                        (0.005)                   
## lambda                                                  0.979 ***
##                                                        (0.004)   
## -----------------------------------------------------------------
## R^2                     0.295                                    
## Adj. R^2                0.295                                    
## Num. obs.            7000            7000            7000        
## Parameters                              7               7        
## Log Likelihood                     -21332.343      -21320.140    
## AIC (Linear model)                  52861.663       52861.663    
## AIC (Spatial model)                 42678.686       42654.279    
## LR test: statistic                  10184.977       10209.383    
## LR test: p-value                        0.000           0.000    
## =================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

By looking at the table above we can conclude that SEM and SAR are vastly superior based on their lower AIC value in comparison to regular OLS. Moreover, the SAR model exhibited high spatial autoregressive coefficient, indicating that traffic related NOx concentrations in one area are strongly influenced by concentrations in neighboring areas. This is further reinforced by the results of impacts in the SAR model that indicate that residential NOx emissions are the dominant contributor to traffic related NOx concentrations, with effects caused mainly by large indirect impacts rather than local effects alone.

Coclusion

The objective of this study was to investigate the spatial structure of NOx concentrations caused by traffic in Warsaw and to assess whether spatial econometric approaches should be applied when investigating that.The initial hypothesis stated that traffic related NOx concentrations exhibit significant spatial dependence.

Overall, the empirical results strongly support the initial hypothesis. Traffic related NOx concentrations in Warsaw display evident spatial dependence, and standard linear regression models are not able to fully capture such dependence. Spatial econometric models not only improve statistical performance but also provide a more realistic interpretation of NOx pollution by accounting for spatial spillovers.