Introduction

Air pollution is a major environmental issue affecting human health worldwide. Urban air quality is influenced by geographical location, population density, industrial activity and topographical characteristics. This project investigates whether these factors can predict Air Quality Index (AQI) values across cities.

Three supervised learning algorithms (Decision Tree, Random Forest and Artificial Neural Networks) were applied and compared. Additionally, unsupervised learning using K-Means clustering was used to identify groups of cities with similar pollution profiles.

Two research questions were investigated:

  1. Does mountain information improve AQI prediction?

  2. Can latitude and longitude alone predict AQI?

Dataset Description

The original dataset was acquired from:

https://www.kaggle.com/datasets/hasibalmuzdadid/global-air-pollution-dataset

The dataset was later enriched with Wikidata attributes and with statistics on industrial buildings, such as factories, for every city where Overpass could access the OpenStreetMap (OSM) data. As a result, the number of cities dropped from roughly 23,000 to 3,242, because either Wikidata or OSM data were unavailable for the remaining cities.

The processing script is city_enrichment_pipeline.R. It takes about 2 days to process all the data, so rerunning it is not recommended.

Lastly, almost all major mountain ranges were added as polygons in order to address the first research question. The result is a shapefile that contains basic information about each city, such as population, longitude/latitude, area and population density, as well as information about the city's industrial activity and its closest mountain range.

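Spatial information includes latitude and longitude (wd_lat, wd_lon), elevation (elev_m) and the mountain-related columns (in_mtn, mtn_name, mtn_rank, mtn_dstkm, mtn_50 to mtn_500, mtn_n100, mtn_n250 and mtn_prx100).

Additional variables include population (pop), area (area_km2), population density (popden_km) and the industrial counts (ind_bld, ind_site, ind_oth, ind_all).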

Load Packages

The following libraries were used for spatial data processing, data manipulation, machine learning, visualization and clustering analysis.

library(sf)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(ranger)
library(nnet)
library(cluster)
library(factoextra)

Load Spatial Data

The original dataset is stored as a shapefile. Since most machine learning algorithms cannot directly process spatial geometries, the geometry column was removed while preserving the associated spatial attributes.

## Reading layer `cities_wikidata_osm_mountains' from data source 
##   `C:\Users\adilk\Desktop\uni_work\Spatial ML\project\outputs\city_mountain_spatial_layers\shapefiles\cities_wikidata_osm_mountains\cities_wikidata_osm_mountains.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3242 features and 44 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -149.5689 ymin: -54.80722 xmax: 175.65 ymax: 66.50279
## Geodetic CRS:  WGS 84
## Rows: 3,242
## Columns: 44
## $ row_id     <chr> "4", "8", "12", "28", "32", "34", "40", "46", "56", "59", "…
## $ country    <chr> "Poland", "Belgium", "Netherlands", "Romania", "Indonesia",…
## $ city       <chr> "Przasnysz", "Puurs", "Raalte", "Poiana Mare", "Pontianak",…
## $ aqi_val    <dbl> 34, 64, 41, 62, 44, 30, 32, 36, 30, 37, 44, 49, 47, 54, 60,…
## $ aqi_cat    <chr> "Good", "Moderate", "Good", "Moderate", "Good", "Good", "Go…
## $ co_aqi     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 0,…
## $ co_cat     <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "Go…
## $ o3_aqi     <dbl> 34, 29, 24, 37, 15, 30, 7, 25, 17, 32, 28, 25, 42, 20, 44, …
## $ o3_cat     <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "Go…
## $ no2_aqi    <dbl> 0, 7, 6, 1, 0, 1, 2, 3, 0, 2, 3, 3, 0, 0, 3, 0, 9, 0, 1, 0,…
## $ no2_cat    <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "Go…
## $ pm25_aqi   <dbl> 20, 64, 41, 62, 44, 15, 32, 36, 30, 37, 44, 49, 47, 54, 60,…
## $ pm25_cat   <chr> "Good", "Moderate", "Good", "Moderate", "Good", "Good", "Go…
## $ wd_qid     <chr> "Q672964", "Q908546", "Q18012068", "Q16426501", "Q14168", "…
## $ wd_label   <chr> "Przasnysz", "Puurs", "Raalte", "Poiana Mare", "Pontianak",…
## $ wd_cntry   <chr> "Poland", "Belgium", "Netherlands", "Romania", "Indonesia",…
## $ wd_inst    <chr> "urban municipality of Poland", "municipality section", "vi…
## $ wd_lat     <dbl> 53.01666700, 51.07610000, 52.38277778, 43.91265100, -0.0833…
## $ wd_lon     <dbl> 20.883333, 4.280300, 6.282778, 23.093976, 109.366667, 21.79…
## $ pop        <dbl> 16662, 17452, NA, 9047, 680880, 83010, 152549, 44315, 13652…
## $ area_km2   <dbl> 25.160, 33.410, NA, 163.000, 107.800, 905.840, 543.068, 75.…
## $ elev_m     <dbl> NA, 5.0000, NA, NA, 4.0000, NA, 832.0000, NA, 49.0000, NA, …
## $ popden_km  <dbl> 662.24165, 522.35858, NA, 55.50307, 6316.14100, 91.63870, 2…
## $ osm_rel    <chr> "2713266", "3896697", "1423766", "9695252", "10617336", "44…
## $ osm_stat   <chr> "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok",…
## $ ind_bld    <dbl> 98, 5, 231, 10, 59, 226, 35, 125, 61, 25, 45, 194, 321, 0, …
## $ ind_site   <dbl> 75, 10, 9, 8, 45, 135, 18, 51, 20, 2, 19, 43, 188, 2, 124, …
## $ ind_oth    <dbl> 1, 0, 0, 1, 4, 4, 5, 9, 23, 0, 0, 6, 13, 0, 34, 6, 4, 0, 1,…
## $ ind_all    <dbl> 172, 15, 240, 18, 104, 361, 53, 176, 81, 27, 64, 236, 509, …
## $ ind_bldpm  <dbl> 3.895072e-06, 1.496558e-07, NA, 6.134969e-08, 5.473098e-07,…
## $ ind_allpm  <dbl> 6.836248e-06, 4.489674e-07, NA, 1.104294e-07, 9.647495e-07,…
## $ in_mtn     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ in_mtn_nm  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mtn_name   <chr> "CARPATHIAN MOUNTAINS", "ALPS", "ALPS", "Balkan Mts.", "CHA…
## $ mtn_reg    <chr> "Europe", "Europe", "Europe", "Europe", "Asia", "Europe", "…
## $ mtn_rank   <dbl> 3, 1, 1, 4, 2, 3, 4, 1, 2, 1, 1, 1, 3, 2, 3, 4, 4, 2, 3, 2,…
## $ mtn_dstkm  <dbl> 338.916, 502.753, 569.713, 30.395, 1217.821, 485.255, 3.401…
## $ mtn_50     <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ mtn_100    <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ mtn_250    <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ mtn_500    <int> 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ mtn_n100   <dbl> 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ mtn_n250   <dbl> 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0,…
## $ mtn_prx100 <dbl> 3.373708e-02, 6.554998e-03, 3.355594e-03, 7.379003e-01, 5.1…
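
The geometry-dropping step described above can be sketched on a small in-memory layer. This is a minimal illustration, not the original chunk: the `toy` table and its values are hypothetical, and only the column names `wd_lat` and `wd_lon` are taken from the output above. Because the coordinates are already stored as attribute columns, removing the geometry loses no spatial information.

```r
library(sf)

# Hypothetical three-city table standing in for the real shapefile attributes.
toy <- data.frame(
  city    = c("A", "B", "C"),
  aqi_val = c(34, 64, 41),
  wd_lat  = c(53.02, 51.08, 52.38),
  wd_lon  = c(20.88, 4.28, 6.28)
)

# Build a POINT layer in WGS 84; remove = FALSE keeps wd_lon/wd_lat as columns.
toy_sf <- st_as_sf(toy, coords = c("wd_lon", "wd_lat"), crs = 4326, remove = FALSE)

# Drop the geometry column so the result is a plain data frame for modelling.
toy_tbl <- st_drop_geometry(toy_sf)

str(toy_tbl)
```

For the real data, the same call would be applied to the object returned by `st_read()`.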

Spatial Distribution of Cities

The map illustrates the geographical distribution of the cities included in the dataset. The observations are globally distributed, providing diverse environmental conditions for analysis.

Select Model Variables

This section includes dataset merging, feature engineering and preprocessing steps necessary to prepare the data for machine learning analysis.

The variables were grouped into four categories:

Geographical variables

Demographic variables

Industrial variables

Mountain-related variables

##   aqi_val      wd_lat     wd_lon    pop area_km2 elev_m  popden_km ind_bld
## 1      34 53.01666700  20.883333  16662    25.16     NA  662.24165      98
## 2      64 51.07610000   4.280300  17452    33.41      5  522.35858       5
## 3      41 52.38277778   6.282778     NA       NA     NA         NA     231
## 4      62 43.91265100  23.093976   9047   163.00     NA   55.50307      10
## 5      44 -0.08333333 109.366667 680880   107.80      4 6316.14100      59
## 6      30 61.48666667  21.797500  83010   905.84     NA   91.63870     226
##   ind_site ind_all in_mtn mtn_rank mtn_dstkm   mtn_prx100
## 1       75     172      0        3   338.916 3.373708e-02
## 2       10      15      0        1   502.753 6.554998e-03
## 3        9     240      0        1   569.713 3.355594e-03
## 4        8      18      0        4    30.395 7.379003e-01
## 5       45     104      0        2  1217.821 5.141286e-06
## 6      135     361      0        3   485.255 7.808443e-03
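
One way to express the four variable groups in code is as named character vectors. The column names below come from the output above, but the assignment of each column to a group is an assumption (for example, `elev_m` is placed with the geographical variables here), and the `toy` data frame is a hypothetical stand-in for the real table.

```r
# Assumed grouping of the 13 predictor columns; the target is aqi_val.
geo_vars  <- c("wd_lat", "wd_lon", "elev_m")
demo_vars <- c("pop", "area_km2", "popden_km")
ind_vars  <- c("ind_bld", "ind_site", "ind_all")
mtn_vars  <- c("in_mtn", "mtn_rank", "mtn_dstkm", "mtn_prx100")

model_vars <- c("aqi_val", geo_vars, demo_vars, ind_vars, mtn_vars)

# Hypothetical table with the model columns plus one column to be dropped.
toy <- as.data.frame(matrix(0, nrow = 2, ncol = length(model_vars)))
names(toy) <- model_vars
toy$city <- c("A", "B")

# Keep only the modelling columns.
model_df <- toy[, model_vars]
names(model_df)
```

Keeping the groups as separate vectors makes it straightforward to build the "with mountains" and "without mountains" predictor sets used later.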

Check Missing Values

Missing values were replaced using median imputation instead of deleting observations. This approach preserves the number of cities available for analysis and reduces information loss.

##    aqi_val     wd_lat     wd_lon        pop   area_km2     elev_m  popden_km 
##          0          0          0        210        408        915        452 
##    ind_bld   ind_site    ind_all     in_mtn   mtn_rank  mtn_dstkm mtn_prx100 
##          0          0          0          0          0          0          0

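Missing values occurred in four variables: pop (210), area_km2 (408), elev_m (915) and popden_km (452). For elev_m this is about 28% of the 3242 cities, so imputed elevations should be interpreted with caution. After median imputation, no missing values remain and all 3242 cities are retained: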

##    aqi_val     wd_lat     wd_lon        pop   area_km2     elev_m  popden_km 
##          0          0          0          0          0          0          0 
##    ind_bld   ind_site    ind_all     in_mtn   mtn_rank  mtn_dstkm mtn_prx100 
##          0          0          0          0          0          0          0
## [1] 3242
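
Median imputation of this kind can be sketched in base R as follows. The helper name `impute_median` and the `toy` values are hypothetical; the original chunk is not shown and may have used a different implementation.

```r
# Replace NA values in every numeric column with that column's median.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Hypothetical example with missing population and elevation values.
toy <- data.frame(
  pop    = c(16662, 17452, NA, 9047),
  elev_m = c(NA, 5, NA, 4)
)

colSums(is.na(toy))          # missing values per column before imputation
toy_imputed <- impute_median(toy)
colSums(is.na(toy_imputed))  # all zero after imputation
```

A stricter variant would compute the medians on the training rows only and reuse them for the test rows, so that no information from the test set enters the preprocessing.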

AQI Distribution

AQI values are not uniformly distributed. Most cities exhibit moderate pollution levels, while a smaller number of cities have extremely high AQI values.

Several outliers are visible, indicating cities with exceptionally poor air quality.

Train/Test Split

The dataset was divided into training (80%) and testing (20%) subsets. The training set was used to build the models, while the testing set was used to evaluate predictive performance.
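
An 80/20 split can be sketched with base R index sampling. The seed value is an assumption, and the original analysis may instead have used a helper such as caret's `createDataPartition()`.

```r
set.seed(42)  # assumed seed, only for reproducibility of this sketch

n <- 3242     # number of cities in the dataset

# Randomly pick 80% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

c(train = length(train_idx), test = length(test_idx))
```

With 3242 rows this yields 2593 training and 649 testing observations; the subsets themselves would be obtained as `df[train_idx, ]` and `df[test_idx, ]`.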

SUPERVISED LEARNING

Does mountain information improve AQI prediction?

To evaluate the contribution of topographical information, each model was trained twice: once without mountain-related variables and once with mountain-related variables included.

Three supervised learning algorithms were used:

Decision Tree

Random Forest

Artificial Neural Network

##           Model Mountains     RMSE        R2
## 1 Decision Tree        No 29.60311 0.3446656
## 2 Decision Tree       Yes 29.21333 0.3629131
## 3 Random Forest        No 28.26983 0.4395175
## 4 Random Forest       Yes 27.93833 0.4583826
## 5           ANN        No 34.17773 0.1259718
## 6           ANN       Yes 32.88579 0.1879666
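
The "train twice" comparison can be sketched with a decision tree on synthetic data. Everything about the data here is hypothetical, including the formula used to generate `aqi_val`. R² is computed as 1 - SSE/SST, which is one common definition; the table above may have been produced with a different one, such as the squared correlation that caret reports.

```r
library(rpart)

# Evaluation metrics.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Synthetic cities (hypothetical): AQI depends on latitude and mountain distance.
set.seed(1)
n <- 500
toy <- data.frame(
  wd_lat    = runif(n, -50, 65),
  popden_km = rexp(n, 1 / 1000),
  mtn_dstkm = rexp(n, 1 / 300)
)
toy$aqi_val <- 60 + 0.4 * abs(toy$wd_lat - 25) +
  20 * exp(-toy$mtn_dstkm / 100) + rnorm(n, sd = 10)

train <- toy[1:400, ]
test  <- toy[401:500, ]

# Same algorithm, fitted without and with the mountain variable.
fit_base <- rpart(aqi_val ~ wd_lat + popden_km, data = train)
fit_mtn  <- rpart(aqi_val ~ wd_lat + popden_km + mtn_dstkm, data = train)

pred_base <- predict(fit_base, test)
pred_mtn  <- predict(fit_mtn, test)

results <- data.frame(
  Model     = "Decision Tree",
  Mountains = c("No", "Yes"),
  RMSE      = c(rmse(test$aqi_val, pred_base), rmse(test$aqi_val, pred_mtn)),
  R2        = c(r2(test$aqi_val, pred_base), r2(test$aqi_val, pred_mtn))
)
results
```

The same pattern extends to the other two algorithms by swapping the fitting function while keeping the two formulas and the test set fixed.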

Effect of Mountain Variables

Adding mountain variables improved predictive performance for all three algorithms, although the gains were modest. R² rose from 0.345 to 0.363 for the Decision Tree, from 0.440 to 0.458 for Random Forest and from 0.126 to 0.188 for the Artificial Neural Network, while RMSE fell by roughly 0.3 to 1.3 AQI points. Random Forest remained the best-performing model (R² = 0.458), followed by Decision Tree (R² = 0.363) and the Artificial Neural Network (R² = 0.188).

These findings suggest that topographical information contributes additional explanatory power when predicting urban air quality.

Variable Importance

Latitude was the most influential variable, followed by mountain proximity variables. This indicates that both geographical location and topography play an important role in explaining AQI differences across cities.
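
As an illustration of how such a ranking can be obtained, the sketch below reads the importance scores that rpart stores on a fitted tree. The data are synthetic and hypothetical, and the ranking reported above may have come from a different model (for example the Random Forest), so this shows the general idea rather than the original computation.

```r
library(rpart)

# Synthetic data (hypothetical): AQI driven mostly by latitude,
# weakly by mountain distance, and not at all by the noise column.
set.seed(2)
n <- 400
toy <- data.frame(
  wd_lat    = runif(n, -50, 65),
  mtn_dstkm = rexp(n, 1 / 300),
  noise     = rnorm(n)
)
toy$aqi_val <- 40 + 0.8 * abs(toy$wd_lat) +
  15 * exp(-toy$mtn_dstkm / 100) + rnorm(n, sd = 5)

fit <- rpart(aqi_val ~ ., data = toy)

# Named numeric vector of importance scores, sorted in decreasing order.
imp <- fit$variable.importance

# Express each score as a percentage of the total.
round(100 * imp / sum(imp), 1)
```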

Can latitude and longitude alone predict AQI?

This experiment investigates whether geographical location by itself can explain air quality patterns. Models were trained using only latitude and longitude and then compared to models using all available variables.

##           Model     Variables     RMSE        R2
## 1 Decision Tree  Lat/Lon only 28.37967 0.4051640
## 2 Decision Tree All variables 29.21333 0.3629131
## 3 Random Forest  Lat/Lon only 25.60413 0.5088631
## 4 Random Forest All variables 27.93833 0.4583826
## 5           ANN  Lat/Lon only 32.88257 0.1879975
## 6           ANN All variables 32.05728 0.2297159

Latitude and Longitude Analysis

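This experiment shows that simple spatial information is highly informative for AQI prediction. With latitude and longitude alone, the Decision Tree (R² = 0.405 vs. 0.363) and Random Forest (R² = 0.509 vs. 0.458) outperformed their all-variable counterparts, whereas the Artificial Neural Network did slightly better with all variables (R² = 0.230 vs. 0.188). One possible explanation is that geographical coordinates implicitly capture regional climate, urbanization patterns and large-scale environmental differences between cities, so the remaining variables add little for the tree-based models. Nevertheless, additional variables may still provide complementary information for more complex models such as Artificial Neural Networks. Note that the all-variables ANN result differs from the value reported in the mountain experiment (R² = 0.188), presumably because the network was refit with different random initial weights.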

UNSUPERVISED LEARNING

K-MEANS CLUSTERING FOR AQI

K-Means clustering was applied to identify groups of cities with similar pollution, population, industrial and mountain characteristics.

Variables used for clustering:

AQI

Population

Population density

Industrial activity

Distance to mountains

Choosing the Number of Clusters

The Elbow Method was used to determine an appropriate number of clusters by examining the reduction in within-cluster variation.
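
The Elbow Method can be sketched in base R by running K-Means for a range of k and recording the total within-cluster sum of squares. The data below are synthetic and hypothetical; the original plot may have been produced with a helper from factoextra instead.

```r
set.seed(3)

# Hypothetical data: three well-separated groups in two features.
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Standardize so that each feature contributes equally to the distances.
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..8.
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which the curve flattens.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The chosen k is the point where adding another cluster no longer reduces the within-cluster variation substantially.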

Cluster Visualization

The K-Means algorithm grouped cities into four distinct clusters based on AQI, population, population density, industrial activity and mountain proximity. The visualization uses Principal Component Analysis (PCA) to reduce the multidimensional data into two dimensions. Dim1 explains 31.9% of the total variance, while Dim2 explains 20.4%. Although some overlap exists, several groups are clearly separated, indicating that cities with similar environmental and demographic characteristics tend to cluster together.

## Cities per cluster:
## 
##    1    2    3    4 
## 2450   20  736   26
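
The clustering, the cluster sizes and the share of variance explained by the first two principal components can be sketched in base R. The data are synthetic and hypothetical, the mapping of the five clustering variables to the columns `aqi_val`, `pop`, `popden_km`, `ind_all` and `mtn_dstkm` is an assumption, and so is the standardization step (the source does not say whether the variables were scaled, but K-Means is sensitive to scale).

```r
set.seed(4)
n <- 200

# Hypothetical city table with the five assumed clustering columns.
toy <- data.frame(
  aqi_val   = rnorm(n, mean = 60, sd = 25),
  pop       = rexp(n, 1 / 50000),
  popden_km = rexp(n, 1 / 1000),
  ind_all   = rpois(n, 50),
  mtn_dstkm = rexp(n, 1 / 300)
)

# Standardize, then run K-Means with four clusters and several random starts.
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 4, nstart = 25)

# Cities per cluster.
table(km$cluster)

# Share of variance explained by each principal component (Dim1, Dim2, ...).
pca <- prcomp(toy_scaled)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained[1:2], 1)

# Cluster summaries on the original scale, useful for interpretation.
cluster_means <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
cluster_means
```

Comparing the rows of `cluster_means` is what supports statements such as "this cluster contains large, densely populated cities".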

Cities per Cluster

Cluster Interpretation

The distribution of cities across clusters is highly unbalanced. Cluster 1 contains the majority of cities (2450), followed by Cluster 3 (736 cities), while Clusters 2 and 4 contain only a small number of observations (20 and 26 cities, respectively). This suggests that most cities share relatively similar pollution and demographic characteristics, whereas a few cities represent extreme or uncommon profiles.

By examining the cluster summaries, Cluster 2 appears to contain highly populated and densely urbanized cities with elevated AQI values. Cluster 1 consists primarily of cities with relatively low AQI values, moderate industrial activity and shorter distances to mountain regions. Cluster 3 contains cleaner cities that are generally farther away from mountains. Cluster 4 represents a small group of cities characterized by extremely high industrial activity and larger populations.

Overall, the clustering analysis indicates that urban air pollution patterns are not homogeneous across the world and that cities can be grouped into distinct environmental profiles. Population, industrial activity and spatial characteristics appear to contribute to these differences.

Conclusion

This study showed that spatial information carries useful signal for predicting urban air quality, although even the best model explained only about half of the variance in AQI (Random Forest with latitude and longitude only, R² = 0.509). Random Forest achieved the best predictive performance among the supervised learning algorithms. Mountain-related variables improved every model relative to the same model without them, suggesting that topographical information contributes some explanatory power, while latitude and longitude alone matched or exceeded the full variable set for the tree-based models. Additionally, K-Means clustering revealed distinct groups of cities with different pollution, demographic and industrial characteristics. Overall, combining spatial, demographic, industrial and topographical variables provides a useful framework for understanding urban air pollution patterns.