Air pollution is a major environmental issue affecting human health worldwide. Urban air quality is influenced by geographical location, population density, industrial activity and topographical characteristics. This project investigates whether these factors can predict Air Quality Index (AQI) values across cities.
Three supervised learning algorithms (Decision Tree, Random Forest and Artificial Neural Networks) were applied and compared. Additionally, unsupervised learning using K-Means clustering was used to identify groups of cities with similar pollution profiles.
Two research questions were investigated:
Does mountain information improve AQI prediction?
Can latitude and longitude alone predict AQI?
We also apply an unsupervised learning method (K-Means clustering) to identify groups of cities with similar characteristics.
The original dataset was acquired from:
https://www.kaggle.com/datasets/hasibalmuzdadid/global-air-pollution-dataset
The dataset was then enriched with WikiData attributes and with counts of industrial buildings, such as factories, for every city whose OpenStreetMap (OSM) data could be retrieved through Overpass. Because either WikiData or OSM data was unavailable for many cities, the number of cities dropped from about 23k to 3242.
The processing script is city_enrichment_pipeline.R. It takes about 2 days to process all the data, so re-running it is not recommended.
Lastly, almost all major mountain ranges were added as polygons to address the mountain-related research question. The result is a shapefile that contains basic information about each city (population, longitude/latitude, area, population density) as well as information about its industrial activity and the closest mountain ranges.
Spatial information includes:
Additional variables include:
The following libraries were used for spatial data processing, data manipulation, machine learning, visualization and clustering analysis.
library(sf)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(ranger)
library(nnet)
library(cluster)
library(factoextra)
The original dataset is stored as a shapefile. Since most machine learning algorithms cannot directly process spatial geometries, the geometry column was removed while preserving the associated spatial attributes.
## Reading layer `cities_wikidata_osm_mountains' from data source
## `C:\Users\adilk\Desktop\uni_work\Spatial ML\project\outputs\city_mountain_spatial_layers\shapefiles\cities_wikidata_osm_mountains\cities_wikidata_osm_mountains.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 3242 features and 44 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -149.5689 ymin: -54.80722 xmax: 175.65 ymax: 66.50279
## Geodetic CRS: WGS 84
## Rows: 3,242
## Columns: 44
## $ row_id <chr> "4", "8", "12", "28", "32", "34", "40", "46", "56", "59", "…
## $ country <chr> "Poland", "Belgium", "Netherlands", "Romania", "Indonesia",…
## $ city <chr> "Przasnysz", "Puurs", "Raalte", "Poiana Mare", "Pontianak",…
## $ aqi_val <dbl> 34, 64, 41, 62, 44, 30, 32, 36, 30, 37, 44, 49, 47, 54, 60,…
## $ aqi_cat <chr> "Good", "Moderate", "Good", "Moderate", "Good", "Good", "Go…
## $ co_aqi <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 0,…
## $ co_cat <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "Go…
## $ o3_aqi <dbl> 34, 29, 24, 37, 15, 30, 7, 25, 17, 32, 28, 25, 42, 20, 44, …
## $ o3_cat <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "Go…
## $ no2_aqi <dbl> 0, 7, 6, 1, 0, 1, 2, 3, 0, 2, 3, 3, 0, 0, 3, 0, 9, 0, 1, 0,…
## $ no2_cat <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "Go…
## $ pm25_aqi <dbl> 20, 64, 41, 62, 44, 15, 32, 36, 30, 37, 44, 49, 47, 54, 60,…
## $ pm25_cat <chr> "Good", "Moderate", "Good", "Moderate", "Good", "Good", "Go…
## $ wd_qid <chr> "Q672964", "Q908546", "Q18012068", "Q16426501", "Q14168", "…
## $ wd_label <chr> "Przasnysz", "Puurs", "Raalte", "Poiana Mare", "Pontianak",…
## $ wd_cntry <chr> "Poland", "Belgium", "Netherlands", "Romania", "Indonesia",…
## $ wd_inst <chr> "urban municipality of Poland", "municipality section", "vi…
## $ wd_lat <dbl> 53.01666700, 51.07610000, 52.38277778, 43.91265100, -0.0833…
## $ wd_lon <dbl> 20.883333, 4.280300, 6.282778, 23.093976, 109.366667, 21.79…
## $ pop <dbl> 16662, 17452, NA, 9047, 680880, 83010, 152549, 44315, 13652…
## $ area_km2 <dbl> 25.160, 33.410, NA, 163.000, 107.800, 905.840, 543.068, 75.…
## $ elev_m <dbl> NA, 5.0000, NA, NA, 4.0000, NA, 832.0000, NA, 49.0000, NA, …
## $ popden_km <dbl> 662.24165, 522.35858, NA, 55.50307, 6316.14100, 91.63870, 2…
## $ osm_rel <chr> "2713266", "3896697", "1423766", "9695252", "10617336", "44…
## $ osm_stat <chr> "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok", "ok",…
## $ ind_bld <dbl> 98, 5, 231, 10, 59, 226, 35, 125, 61, 25, 45, 194, 321, 0, …
## $ ind_site <dbl> 75, 10, 9, 8, 45, 135, 18, 51, 20, 2, 19, 43, 188, 2, 124, …
## $ ind_oth <dbl> 1, 0, 0, 1, 4, 4, 5, 9, 23, 0, 0, 6, 13, 0, 34, 6, 4, 0, 1,…
## $ ind_all <dbl> 172, 15, 240, 18, 104, 361, 53, 176, 81, 27, 64, 236, 509, …
## $ ind_bldpm <dbl> 3.895072e-06, 1.496558e-07, NA, 6.134969e-08, 5.473098e-07,…
## $ ind_allpm <dbl> 6.836248e-06, 4.489674e-07, NA, 1.104294e-07, 9.647495e-07,…
## $ in_mtn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ in_mtn_nm <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mtn_name <chr> "CARPATHIAN MOUNTAINS", "ALPS", "ALPS", "Balkan Mts.", "CHA…
## $ mtn_reg <chr> "Europe", "Europe", "Europe", "Europe", "Asia", "Europe", "…
## $ mtn_rank <dbl> 3, 1, 1, 4, 2, 3, 4, 1, 2, 1, 1, 1, 3, 2, 3, 4, 4, 2, 3, 2,…
## $ mtn_dstkm <dbl> 338.916, 502.753, 569.713, 30.395, 1217.821, 485.255, 3.401…
## $ mtn_50 <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ mtn_100 <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ mtn_250 <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ mtn_500 <int> 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ mtn_n100 <dbl> 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ mtn_n250 <dbl> 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0,…
## $ mtn_prx100 <dbl> 3.373708e-02, 6.554998e-03, 3.355594e-03, 7.379003e-01, 5.1…
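The geometry-removal step described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the toy data frame `cities_df` and the object names are hypothetical, and a small point layer is built in memory so the example runs without the shapefile.

```r
library(sf)

# Hypothetical toy data standing in for the real shapefile attributes
cities_df <- data.frame(
  city    = c("A", "B", "C"),
  aqi_val = c(34, 64, 41),
  wd_lat  = c(53.0, 51.1, 52.4),
  wd_lon  = c(20.9, 4.3, 6.3)
)

# Build a POINT layer in WGS 84; remove = FALSE keeps the coordinate columns.
# With the real data, st_read() on the .shp file would produce this object.
cities_sf <- st_as_sf(cities_df, coords = c("wd_lon", "wd_lat"),
                      crs = 4326, remove = FALSE)

# Drop the geometry column; the attributes remain as a plain data.frame
cities_tab <- st_drop_geometry(cities_sf)

str(cities_tab)
```

Because latitude and longitude are stored as ordinary columns (wd_lat, wd_lon), the location information survives the removal of the geometry column.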
The map illustrates the geographical distribution of the cities included in the dataset. The observations are globally distributed, providing diverse environmental conditions for analysis.
This section includes dataset merging, feature engineering and preprocessing steps necessary to prepare the data for machine learning analysis.
The variables were grouped into four categories:
Geographical variables
Demographic variables
Industrial variables
Mountain-related variables
## aqi_val wd_lat wd_lon pop area_km2 elev_m popden_km ind_bld
## 1 34 53.01666700 20.883333 16662 25.16 NA 662.24165 98
## 2 64 51.07610000 4.280300 17452 33.41 5 522.35858 5
## 3 41 52.38277778 6.282778 NA NA NA NA 231
## 4 62 43.91265100 23.093976 9047 163.00 NA 55.50307 10
## 5 44 -0.08333333 109.366667 680880 107.80 4 6316.14100 59
## 6 30 61.48666667 21.797500 83010 905.84 NA 91.63870 226
## ind_site ind_all in_mtn mtn_rank mtn_dstkm mtn_prx100
## 1 75 172 0 3 338.916 3.373708e-02
## 2 10 15 0 1 502.753 6.554998e-03
## 3 9 240 0 1 569.713 3.355594e-03
## 4 8 18 0 4 30.395 7.379003e-01
## 5 45 104 0 2 1217.821 5.141286e-06
## 6 135 361 0 3 485.255 7.808443e-03
Missing values were replaced using median imputation instead of deleting observations. This approach preserves the number of cities available for analysis and reduces information loss.
## aqi_val wd_lat wd_lon pop area_km2 elev_m popden_km
## 0 0 0 210 408 915 452
## ind_bld ind_site ind_all in_mtn mtn_rank mtn_dstkm mtn_prx100
## 0 0 0 0 0 0 0
Missing values were confined to four variables: pop (210), area_km2 (408), popden_km (452) and elev_m (915, roughly 28% of the 3242 cities). After median imputation no missing values remain and all 3242 cities are retained.
## aqi_val wd_lat wd_lon pop area_km2 elev_m popden_km
## 0 0 0 0 0 0 0
## ind_bld ind_site ind_all in_mtn mtn_rank mtn_dstkm mtn_prx100
## 0 0 0 0 0 0 0
## [1] 3242
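Median imputation of this kind can be written in a few lines of base R. The sketch below uses a hypothetical toy data frame and a hypothetical helper, `impute_median()`; it is one possible implementation, not necessarily the one used for the report.

```r
# Hypothetical toy data with gaps in some of the same columns as the real dataset
cities <- data.frame(
  aqi_val = c(34, 64, 41, 62, 44),
  pop     = c(16662, 17452, NA, 9047, 680880),
  elev_m  = c(NA, 5, NA, NA, 4)
)

# Replace NAs in every numeric column with that column's median
impute_median <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  })
  df
}

colSums(is.na(cities))        # missing values per column before imputation
cities_imp <- impute_median(cities)
colSums(is.na(cities_imp))    # all zeros afterwards
```

Note that this sketch computes the medians on the full data. A stricter setup would compute them on the training subset only and reuse those values for the test subset.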
The AQI distribution is right-skewed: most cities exhibit moderate pollution levels, while a small number of cities have extremely high AQI values.
Several outliers are visible, indicating cities with exceptionally poor air quality.
The dataset was divided into training (80%) and testing (20%) subsets. The training set was used to build the models, while the testing set was used to evaluate predictive performance.
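An 80/20 split like the one described can be sketched in base R as follows. The seed, the toy data and the object names are assumptions made for illustration; the report does not state how the split was drawn.

```r
set.seed(123)  # hypothetical seed, used only to make the sketch reproducible

# Hypothetical toy data: 100 cities with one predictor
cities <- data.frame(
  aqi_val = runif(100, 10, 200),
  wd_lat  = runif(100, -55, 67)
)

n <- nrow(cities)

# Draw 80% of the row indices at random for training
train_idx <- sample(seq_len(n), size = round(0.8 * n))

train_set <- cities[train_idx, ]
test_set  <- cities[-train_idx, ]   # the remaining 20%

c(train = nrow(train_set), test = nrow(test_set))
```

The caret package, which the report loads, also offers createDataPartition() for splits that are stratified on the outcome; plain random sampling is used here only to keep the sketch dependency-free.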
To evaluate the contribution of topographical information, each model was trained twice: once without mountain-related variables and once with mountain-related variables included.
Three supervised learning algorithms were used:
Decision Tree
Random Forest
Artificial Neural Network
## Model Mountains RMSE R2
## 1 Decision Tree No 29.60311 0.3446656
## 2 Decision Tree Yes 29.21333 0.3629131
## 3 Random Forest No 28.26983 0.4395175
## 4 Random Forest Yes 27.93833 0.4583826
## 5 ANN No 34.17773 0.1259718
## 6 ANN Yes 32.88579 0.1879666
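The RMSE and R² columns in the table above can be computed from observed and predicted values as sketched here. The report does not state which R² definition was used, so the formula 1 - SSE/SST is an assumption; the helper names `rmse()` and `r2()` and the example values are hypothetical.

```r
# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Coefficient of determination, defined here as 1 - SSE/SST (an assumption)
r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observed and predicted AQI values
obs  <- c(30, 50, 70, 90)
pred <- c(35, 45, 75, 85)

rmse(obs, pred)  # 5
r2(obs, pred)    # 0.95
```

Every prediction in the example is off by 5 AQI points, so the RMSE is 5. The squared errors sum to 100 against a total sum of squares of 2000, which gives R² = 0.95.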
The table compares each supervised learning model trained first without and then with the mountain-related variables.
Adding mountain variables improved predictive performance for all three algorithms, although the gains were modest: R² rose by about 0.02 for Decision Tree (0.345 to 0.363) and Random Forest (0.440 to 0.458), and by about 0.06 for Artificial Neural Networks (0.126 to 0.188). Random Forest remained the best-performing model (R² = 0.458, RMSE = 27.94), followed by Decision Tree (R² = 0.363) and Artificial Neural Networks (R² = 0.188).
These findings suggest that topographical information contributes some additional explanatory power when predicting urban air quality.
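The with/without comparison can be sketched for a single algorithm as follows. This is an illustration on synthetic data, not a reproduction of the report's results: the column names mirror the dataset, but the values, the seed and the helper `evaluate_tree()` are hypothetical, and only a regression tree from rpart is fitted.

```r
library(rpart)

set.seed(42)  # hypothetical seed

# Hypothetical synthetic data; column names mirror the report, values do not
n <- 500
cities <- data.frame(
  wd_lat    = runif(n, -55, 67),
  wd_lon    = runif(n, -150, 176),
  mtn_dstkm = runif(n, 0, 1500)
)
cities$aqi_val <- 60 + 0.5 * cities$wd_lat - 0.02 * cities$mtn_dstkm +
  rnorm(n, sd = 10)

train_idx <- sample(seq_len(n), size = round(0.8 * n))
train_set <- cities[train_idx, ]
test_set  <- cities[-train_idx, ]

# Fit a regression tree on the given predictors and score it on the test set
evaluate_tree <- function(features) {
  fit  <- rpart(reformulate(features, response = "aqi_val"),
                data = train_set, method = "anova")
  pred <- predict(fit, newdata = test_set)
  obs  <- test_set$aqi_val
  data.frame(
    RMSE = sqrt(mean((obs - pred)^2)),
    R2   = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  )
}

# Same model, same split, two feature sets
results <- rbind(
  cbind(Mountains = "No",  evaluate_tree(c("wd_lat", "wd_lon"))),
  cbind(Mountains = "Yes", evaluate_tree(c("wd_lat", "wd_lon", "mtn_dstkm")))
)
results
```

Keeping the train/test split fixed across both fits means that any difference in the scores comes from the feature set rather than from a different sample of cities.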
Latitude was the most influential variable, followed by mountain proximity variables. This indicates that both geographical location and topography play an important role in explaining AQI differences across cities.
This experiment investigates whether geographical location by itself can explain air quality patterns. Models were trained using only latitude and longitude and then compared to models using all available variables.
## Model Variables RMSE R2
## 1 Decision Tree Lat/Lon only 28.37967 0.4051640
## 2 Decision Tree All variables 29.21333 0.3629131
## 3 Random Forest Lat/Lon only 25.60413 0.5088631
## 4 Random Forest All variables 27.93833 0.4583826
## 5 ANN Lat/Lon only 32.88257 0.1879975
## 6 ANN All variables 32.05728 0.2297159
This experiment shows that simple spatial information can be highly informative for AQI prediction. With latitude and longitude alone, both tree-based models outperformed their all-variable counterparts: Decision Tree reached R² = 0.405 versus 0.363, and Random Forest reached R² = 0.509 versus 0.458. One possible explanation is that geographical coordinates implicitly capture regional climate, urbanization patterns and large-scale environmental differences between cities. Only the Artificial Neural Network benefited from the additional variables (R² = 0.230 versus 0.188), which suggests that they may still provide complementary information for some models.
K-Means clustering was applied to identify groups of cities with similar pollution, population, industrial and mountain characteristics.
Variables used for clustering:
AQI
Population
Population density
Industrial activity
Distance to mountains
Choosing the Number of Clusters
The Elbow Method was used to determine an appropriate number of clusters by examining the reduction in within-cluster variation.
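The elbow computation can be sketched with base R's kmeans() as follows. The synthetic data, the seed and the range of k are assumptions made for illustration.

```r
set.seed(1)  # hypothetical seed

# Hypothetical synthetic data: three well-separated groups in two variables
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Standardise so that no variable dominates the distance computation
x_scaled <- scale(x)

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(x_scaled, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which further reductions become small;
# plotting wss against ks shows it as a bend in the curve
data.frame(k = ks, wss = round(wss, 1))
```

The factoextra package, which the report loads, provides fviz_nbclust() to produce the corresponding plot directly; the manual loop is shown here only to make the quantity being examined explicit.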
The K-Means algorithm grouped cities into four clusters based on AQI, population, population density, industrial activity and mountain proximity. The visualization uses Principal Component Analysis (PCA) to reduce the multidimensional data to two dimensions. Dim1 explains 31.9% of the total variance and Dim2 explains 20.4%, so the two-dimensional view captures 52.3% of the variance. Although some overlap exists, several groups are clearly separated, indicating that cities with similar environmental and demographic characteristics tend to cluster together.
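The percentages attached to Dim1 and Dim2 are the shares of total variance carried by the first two principal components. A minimal sketch of that calculation, assuming standardised inputs and using hypothetical synthetic data, is:

```r
set.seed(7)  # hypothetical seed

# Hypothetical synthetic data with five clustering variables
x <- data.frame(
  aqi_val   = rnorm(200, 60, 30),
  pop       = rlnorm(200, 11, 1.5),
  popden_km = rlnorm(200, 6, 1),
  ind_all   = rpois(200, 80),
  mtn_dstkm = runif(200, 0, 1500)
)

# Cluster on standardised variables, then run PCA on the same scaling
km  <- kmeans(scale(x), centers = 4, nstart = 25)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance carried by each principal component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_share[1:2], 1)   # percentages for Dim1 and Dim2

# Coordinates on the first two components, labelled by cluster
plot_data <- data.frame(pca$x[, 1:2], cluster = factor(km$cluster))
head(plot_data)
```

The clustering itself runs on all five standardised variables; PCA is used only to project the result onto two axes for plotting.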
## Cities per cluster:
##
## 1 2 3 4
## 2450 20 736 26
The distribution of cities across clusters is highly unbalanced. Cluster 1 contains the majority of cities (2450), followed by Cluster 3 (736 cities), while Clusters 2 and 4 contain only a small number of observations (20 and 26 cities, respectively). This suggests that most cities share relatively similar pollution and demographic characteristics, whereas a few cities represent extreme or uncommon profiles.
By examining the cluster summaries, Cluster 2 appears to contain highly populated and densely urbanized cities with elevated AQI values. Cluster 1 consists primarily of cities with relatively low AQI values, moderate industrial activity and shorter distances to mountain regions. Cluster 3 contains cleaner cities that are generally farther away from mountains. Cluster 4 represents a small group of cities characterized by extremely high industrial activity and larger populations.
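Cluster summaries of this kind are per-cluster means of the original, unscaled variables. The sketch below shows one way to compute them in base R; the data, the seed and the object names are hypothetical.

```r
set.seed(11)  # hypothetical seed

# Hypothetical synthetic data with three of the clustering variables
cities <- data.frame(
  aqi_val   = rnorm(300, 60, 30),
  pop       = rlnorm(300, 11, 1.5),
  mtn_dstkm = runif(300, 0, 1500)
)

# Cluster on standardised variables
km <- kmeans(scale(cities), centers = 4, nstart = 25)
cities$cluster <- km$cluster

# Number of cities per cluster
cluster_sizes <- table(cities$cluster)
cluster_sizes

# Mean of each original (unscaled) variable within each cluster
cluster_means <- aggregate(. ~ cluster, data = cities, FUN = mean)
cluster_means
```

Summarising the unscaled values keeps the table interpretable in the original units (AQI points, inhabitants, kilometres), even though the clustering was run on standardised data.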
Overall, the clustering analysis indicates that urban air pollution patterns are not homogeneous across the world and that cities can be grouped into distinct environmental profiles. Population, industrial activity and spatial characteristics appear to contribute to these differences.
This study showed that spatial information can predict urban air quality to a moderate degree. Random Forest achieved the best predictive performance among the supervised learning algorithms (R² = 0.458 with all variables and 0.509 with latitude and longitude alone). Mountain-related variables improved performance for every model, although modestly, suggesting that topographical information contributes useful explanatory power. Additionally, K-Means clustering revealed distinct groups of cities with different pollution, demographic and industrial characteristics. Overall, combining spatial, demographic, industrial and topographical variables provides a useful framework for understanding urban air pollution patterns.