Introduction

Air pollution is a major environmental concern due to its recorded health impacts. The World Health Organization (WHO) estimates that in 2019 alone, 4.2 million people prematurely died as a result of health complications such as cardiovascular and respiratory diseases from breathing in ambient or outdoor air with poor air quality. To bring awareness to this concern, the WHO maintains the WHO Ambient Air Quality Database.

This database compiles data on annual mean concentrations of gaseous pollutant nitrogen dioxide (NO2) as well as particulate matter with a diameter equal to or smaller than 10µm (PM10) and particulate matter equal to or smaller than 2.5µm (PM2.5). The database has been updated every 2-3 years since 2011, and as of the latest update released in January 2024, contains data from 7,182 locations across over 120 countries or independent states.

The vast majority of these locations are in high or middle-income regions such as North America, Europe, China, and India. The data compiled originates from official data collected by varying governing bodies. Due to varied air quality monitoring methods and incomplete representations of countries, the data cannot be used to draw direct comparisons between countries or cities but may be used in other ways, such as clustering. By clustering this data, profiles of pollution composition may be developed, such as locations with instances with high mixed pollution that score high in all three pollutants or instances that are dust-dominated with high particulate pollution but low NO2.

Exploratory Data Analysis (EDA)

#library packages
library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

set.seed(100)
options(scipen = 999) #Disable scientific notation

#read in dataset
air<-read_excel("air_pollution.xlsx")

## Warning: Expecting numeric in E2769 / R2769C5: got 'NA'

## Warning: Expecting numeric in E8509 / R8509C5: got 'NA'

## Warning: Expecting numeric in E27891 / R27891C5: got 'NA'

Several metadata variables such as version identifiers, reference fields, and web links were removed because they do not contribute meaningful information for clustering analysis and would not provide interpretable numerical structure

#remove metadata variables
air$version<-NULL
air$type_of_stations<-NULL
air$reference<-NULL
air$web_link<-NULL
air$population_source<-NULL
air$who_ms<-NULL

Relevant pollution, temperature coverage, and population variables were explicitly converted to numeric format to ensure compatibility with subsequent statistical procedures, including imputation, correlation analysis, PCA, and clustering preparation. This conversion also ensured that missing values were handled correctly during numerical computations and predictive mean matching imputation.

#convert to numeric data type
air <- air %>%
  mutate(across(c(pm25_concentration, pm10_concentration, no2_concentration,
                  pm25_tempcov, pm10_tempcov, no2_tempcov, population),
                ~ as.numeric(.)))

## Warning: There were 7 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.

Because multiple yearly observations existed for many cities, records were aggregated into a single observation per city using feature-wise averages. Missing yearly measurements were ignored during averaging using na.rm=TRUE, allowing partially complete city records to be retained while reducing duplication within the dataset.

#Merge into one entry per city
cityAir <- air %>%
  group_by(city, country_name, iso3, who_region) %>%
  summarise(
    pm25 = mean(pm25_concentration, na.rm=TRUE),
    pm10 = mean(pm10_concentration, na.rm=TRUE),
    no2  = mean(no2_concentration,  na.rm=TRUE),
    pm25_tempcov = mean(pm25_tempcov, na.rm=TRUE),
    pm10_tempcov = mean(pm10_tempcov, na.rm=TRUE),
    no2_tempcov  = mean(no2_tempcov,  na.rm=TRUE),
    population   = mean(population,   na.rm=TRUE),
    latitude     = mean(latitude),
    longitude    = mean(longitude),
    .groups = "drop"
  )
#Check if there are any NaN values, some will be there but less than the initial dataset
colSums(is.na(cityAir))

##         city country_name         iso3   who_region         pm25         pm10 
##            0            0            0            0         2808         2100 
##          no2 pm25_tempcov pm10_tempcov  no2_tempcov   population     latitude 
##         2310         3647         3327         3060         1518            0 
##    longitude 
##            0

sum(is.na(cityAir$pm25) & is.na(cityAir$pm10) & is.na(cityAir$no2))

## [1] 0

Missing values were concentrated in the pollution concentration variables, reflecting the uneven global deployment of air quality monitoring infrastructure. Complete deletion was rejected because it would have removed a large proportion of the dataset and introduced systematic geographic bias, cities in lower-income regions with sparser monitoring networks would have been disproportionately excluded.

Predictive Mean Matching (PMM) through the miced package was selected for imputation. PMM replaces each missing value by drawing from observed values that are close in predicted magnitude, preserving the marginal distributions of each variable and avoiding implausible estimates outside the observed range. Five imputed datasets were generated and the first complete dataset was used for subsequent analysis.

# City and country are excluded here as mice() only accepts numeric columns.
features <- cityAir %>%
  select(pm25,pm10, no2, pm25_tempcov,pm10_tempcov, no2_tempcov, population, latitude,longitude)

# Impute missing values using Predictive Mean Matching (pmm).
imputed <- mice(features, m=5, method='pmm', seed=100)

## 
##  iter imp variable
##   1   1  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   1   2  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   1   3  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   1   4  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   1   5  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   2   1  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   2   2  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   2   3  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   2   4  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   2   5  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   3   1  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   3   2  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   3   3  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   3   4  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   3   5  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   4   1  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   4   2  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   4   3  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   4   4  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   4   5  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   5   1  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   5   2  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   5   3  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   5   4  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population
##   5   5  pm25  pm10  no2  pm25_tempcov  pm10_tempcov  no2_tempcov  population

airCleaned <- complete(imputed, 1)

colSums(is.na(airCleaned))

##         pm25         pm10          no2 pm25_tempcov pm10_tempcov  no2_tempcov 
##            0            0            0            0            0            0 
##   population     latitude    longitude 
##            0            0            0

Summary statistics, variance, and skewness are reported below for all features. Histograms and boxplots are provided to give a visual account of distributional shape and the presence of outliers.

#Library needed for skewness
library(e1071)

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:Hmisc':
## 
##     impute

#Summary of the cleaned dataset, provides mean, median and quartiles
summary(airCleaned[,1:9])

##       pm25              pm10              no2           pm25_tempcov     
##  Min.   :  1.778   Min.   :  2.564   Min.   :  0.002   Min.   :  0.6667  
##  1st Qu.:  9.606   1st Qu.: 17.845   1st Qu.: 10.766   1st Qu.: 81.0000  
##  Median : 14.657   Median : 24.324   Median : 18.104   Median : 91.6667  
##  Mean   : 20.682   Mean   : 39.369   Mean   : 21.406   Mean   : 83.6047  
##  3rd Qu.: 25.780   3rd Qu.: 46.000   3rd Qu.: 28.000   3rd Qu.: 97.0000  
##  Max.   :220.410   Max.   :540.000   Max.   :928.639   Max.   :100.0000  
##   pm10_tempcov       no2_tempcov       population          latitude     
##  Min.   :  0.6667   Min.   :  0.00   Min.   :       5   Min.   :-53.16  
##  1st Qu.: 83.5714   1st Qu.: 84.75   1st Qu.:   15598   1st Qu.: 32.41  
##  Median : 93.1000   Median : 93.78   Median :   68520   Median : 42.71  
##  Mean   : 86.2460   Mean   : 83.32   Mean   :  484143   Mean   : 37.52  
##  3rd Qu.: 97.1000   3rd Qu.: 96.80   3rd Qu.:  372234   3rd Qu.: 48.95  
##  Max.   :100.0000   Max.   :100.00   Max.   :37393129   Max.   : 69.66  
##    longitude       
##  Min.   :-159.366  
##  1st Qu.:   2.143  
##  Median :  13.617  
##  Mean   :  24.836  
##  3rd Qu.:  57.505  
##  Max.   : 178.450

sapply(airCleaned[,1:9], var)

##               pm25               pm10                no2       pm25_tempcov 
##           298.7241          1431.4131           655.5026           439.5059 
##       pm10_tempcov        no2_tempcov         population           latitude 
##           333.3706           631.0209 2556345745806.5518           356.0047 
##          longitude 
##          3654.8028

#Skewness for all variables
sapply(airCleaned[,1:9], skewness)

##         pm25         pm10          no2 pm25_tempcov pm10_tempcov  no2_tempcov 
##   2.73438899   3.07702082  22.40486907  -1.97988255  -2.48010847  -2.20553820 
##   population     latitude    longitude 
##   9.60756482  -2.32877831  -0.02976501

#Visual representation of skewness, we need histograms

hist(airCleaned$pm10,
     main = "PM10 Distribution",
     xlab = "µg/m3 of PM10")

hist(airCleaned$pm25,
     main = "PM10 Distribution",
     xlab = "µg/m3 of PM2.5")

hist(airCleaned$no2,
     main = "NO2 Distribution",
     xlab = "µg/m3 of NO2")

hist(airCleaned$pm10_tempcov,
     main = "PM10 Tempcov Distribution",
     xlab = "Percent")

hist(airCleaned$pm25_tempcov,
     main = "PM2.5 Tempcov Distribution",
     xlab = "Percent")

hist(airCleaned$no2_tempcov,
     main = "NO2 Tempcov Distribution",
     xlab = "Percent")

hist(airCleaned$population,
     main = "Population Distribution",
     xlab = "Population")

hist(airCleaned$latitude,
     main = "Latitude Distribution",
     xlab = "Degrees")

hist(airCleaned$longitude,
     main = "Longitude Distribution",
     xlab = "Degrees")

#Find the outliers through the use of boxplots
boxplot(airCleaned$pm10,
        las = 2,
        main = "PM10",
        ylab = "µg/m3")

boxplot(airCleaned$pm25,
        las = 2,
        main = "PM2.5",
        ylab = "µg/m3")

boxplot(airCleaned$no2,
        las = 2,
        main = "NO2",
        ylab = "µg/m3")

boxplot(airCleaned$pm10_tempcov,
        las = 2,
        main = "PM10 Annual temporal coverage",
        ylab = "Percent")

boxplot(airCleaned$pm25_tempcov,
        las = 2,
        main = "PM2.5 Annual temporal coverage",
        ylab = "Percent")

boxplot(airCleaned$no2_tempcov,
        las = 2,
        main = "NO2 Annual temporal coverage",
        ylab = "Percent")

boxplot(airCleaned$population,
        las = 0,
        main = "Population",
        ylab = "People")

boxplot(airCleaned$latitude,
        las = 2,
        main = "Latitude",
        ylab = "Degrees")

boxplot(airCleaned$longitude,
        las = 2,
        main = "Longitude",
        ylab = "Degrees")

PM10 concentrations range from 2.56 to 540, with a mean of 39.37 and a median of 24.32. The distribution is strongly right-skewed (skewness = 3.08), indicating that most cities experience low to moderate particulate pollution, while a small number of locations exhibit extremely high values. PM2.5 shows a similar but slightly less extreme pattern, ranging from 1.78 to 220.41, with a mean of 20.68 and a median of 14.66. Its positive skewness (2.73) suggests that fine particulate pollution follows a comparable structure to PM10 but at a lower magnitude, reflecting consistent behavior across particulate pollution metrics.

NO2 concentrations exhibit the most extreme skewness among all pollution variables (skewness = 22.40), ranging from near zero (0.002) to 928.64. The large gap between the mean (21.41) and median (18.10), together with the extreme upper bound, indicates the presence of a small number of cities with exceptionally high nitrogen dioxide levels that heavily influence the distribution. Its variance (655.50) further confirms substantial dispersion across locations.

The temperature coverage variables (PM10_tempcov, PM2.5_tempcov, and NO2_tempcov) are consistently concentrated at high values, with medians above 90 in all cases. All three variables exhibit negative skewness (−2.48, −1.98, and −2.21 respectively), indicating that most cities have high levels of data completeness, with relatively few cases of low monitoring coverage. Their variances (333.37, 439.51, and 631.02) are comparatively moderate, reflecting more constrained variability compared to pollution and demographic variables.

Population displays the most extreme distributional imbalance in the dataset. Values range from 5 to over 37 million, with a mean of 484,143 and a median of 68,520. The strong positive skewness (9.61) reflects a highly uneven global population distribution, where a small number of large metropolitan areas dominate the upper tail. This is further supported by its extremely large variance (2.56 × 10^12), which is several orders of magnitude greater than all other variables, indicating severe scale disparity.

Latitude and longitude provide complementary spatial structure to the dataset. Latitude ranges from −53.16 to 69.66 with mild negative skewness (−2.33), reflecting an uneven but broadly global distribution of cities across hemispheres. Longitude ranges from −159.37 to 178.45 and is approximately symmetric (skewness = −0.03), indicating a balanced global spread across east–west geography. While latitude shows moderate dispersion (variance = 356.00), longitude exhibits higher variability (variance = 3654.80), reflecting the full global span of urban locations.

The dataset contains a mixture of moderately skewed environmental variables, highly skewed pollution extremes (particularly NO2), and extreme scale imbalance driven primarily by population. In contrast, spatial variables provide relatively stable global coverage. These characteristics highlight both distributional asymmetry and magnitude differences across features, strongly justifying the use of feature standardization prior to clustering to ensure that no single variable disproportionately influences distance-based methods.

#correlation matrix
library(corrplot)

## corrplot 0.95 loaded

cor_matrix <- cor(airCleaned, use = "complete.obs")
round(cor_matrix, 2)

##               pm25  pm10   no2 pm25_tempcov pm10_tempcov no2_tempcov population
## pm25          1.00  0.87  0.24         0.09        -0.06       -0.10       0.21
## pm10          0.87  1.00  0.29         0.02        -0.06       -0.26       0.21
## no2           0.24  0.29  1.00         0.07         0.06       -0.01       0.09
## pm25_tempcov  0.09  0.02  0.07         1.00         0.69        0.55      -0.07
## pm10_tempcov -0.06 -0.06  0.06         0.69         1.00        0.57      -0.12
## no2_tempcov  -0.10 -0.26 -0.01         0.55         0.57        1.00      -0.17
## population    0.21  0.21  0.09        -0.07        -0.12       -0.17       1.00
## latitude     -0.18 -0.21 -0.02         0.09         0.13        0.23      -0.14
## longitude     0.53  0.45  0.11         0.32         0.13        0.17       0.11
##              latitude longitude
## pm25            -0.18      0.53
## pm10            -0.21      0.45
## no2             -0.02      0.11
## pm25_tempcov     0.09      0.32
## pm10_tempcov     0.13      0.13
## no2_tempcov      0.23      0.17
## population      -0.14      0.11
## latitude         1.00     -0.25
## longitude       -0.25      1.00

#heatmap of correlation matrix
corrplot(cor_matrix,
         method  = "color",
         type    = "upper",
         addCoef.col = "black",
         tl.col  = "black",
         tl.srt  = 45,
         number.cex = 0.7,
         title   = "Feature Correlation Matrix",
         mar     = c(0, 0, 2, 0))

The heatmap reveals a strong positive correlation between PM2.5 and PM10 (r = 0.87). Both variables measure airborne particulate matter and are therefore capturing closely related physical phenomena. Retaining both would disproportionately weight particulate pollution in clustering distance calculations relative to gaseous pollutants and geographic features. PM2.5 was removed and PM10 retained, as PM10 monitors all particles of 10 microns or smaller and thus subsumes the particle size range measured by PM2.5.

Following the removal of PM2.5, its associated temporal coverage variable (PM25_tempcov) was also removed. Once the underlying pollutant measure is absent from the feature set, its coverage indicator carries no interpretable meaning. The moderate correlation between PM25_tempcov and PM10_tempcov (r = 0.69) further supports this decision, as it suggests the two coverage variables were already capturing overlapping information about monitoring infrastructure completeness.

A moderate correlation remains between PM10_tempcov and NO2_tempcov (r = 0.57), and between PM25_tempcov and NO2_tempcov (r = 0.55). With PM25_tempcov removed, only the 0.57 pairing persists in the final feature set. This was considered insufficient to justify further removal, as both coverage indicators reflect distinct monitoring programs and their retention preserves potentially meaningful variation in national measurement infrastructure.

Longitude shows moderate correlations with PM2.5 (r = 0.53) and PM10 (r = 0.45), reflecting the geographic concentration of high-particulate cities in South and East Asia. However, since PM2.5 is removed and longitude’s correlation with PM10 falls below the 0.7 threshold, it was initially retained.

Longitude was ultimately excluded from the final feature set. While it shows moderate correlations with PM2.5 (r = 0.53) and PM10 (r = 0.45), the more fundamental reason for its exclusion is that retaining it would pull the clustering solution toward geographic proximity rather than pollution profile similarity. Our research question is concerned with grouping cities by their air quality and monitoring characteristics, two cities on opposite sides of the world with similar pollution burdens should cluster together. Including longitude would work against this by penalizing cities for being geographically distant even when their environmental profiles are similar. Latitude was retained because it carries substantive meaning beyond pure geography: the North-South gradient reflects genuine differences in climate, industrial development patterns, and regulatory environments that are relevant to how pollution profiles are structured.

The final selected feature set for clustering is: PM10, NO2, PM10_tempcov, NO2_tempcov, population, and latitude.

#Add city and country back as identifier columns
airCleaned$city <- cityAir$city
airCleaned$country_name <- cityAir$country_name
airCleaned2 <- airCleaned #Creates copy so that longitude may be used by GMM later
airCleaned <- airCleaned %>%
  select(city, country_name, pm10, no2, pm10_tempcov, no2_tempcov, population, latitude)

airCleaned_Scaled<- scale(airCleaned[, 3:8])

All six features were standardized to zero mean and unit variance using scale(). Standardization was selected over alternatives such as min-max normalization because the dataset contains extreme outliers, particularly in NO2 and population, and min-max normalization compresses all other observations into a very narrow range when outliers are present.

Standardization places each variable on a common scale where one unit corresponds to one standard deviation. This allows distance-based algorithms such as DBScan to treat each feature as making an equally weighted contribution to pairwise distances, rather than allowing high-variance features to dominate the solution. This step is essential as the variance of population (≈2.56 × 10¹²) is roughly ten billion times larger than that of latitude (356) in raw units

##PCA

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:e1071':
## 
##     element

pca <- prcomp(airCleaned_Scaled, center = FALSE, scale. = FALSE)  # Already scaled
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6
## Standard deviation     1.3815 1.1357 0.9539 0.9254 0.8052 0.62205
## Proportion of Variance 0.3181 0.2150 0.1517 0.1427 0.1081 0.06449
## Cumulative Proportion  0.3181 0.5331 0.6847 0.8274 0.9355 1.00000

par(mfrow = c(1, 2))

# Scree plot
plot(pca, type = "l", main = "Scree Plot")

# Cumulative variance
var_explained <- cumsum(pca$sdev^2 / sum(pca$sdev^2)) * 100
plot(var_explained, type = "b",
     xlab = "Number of Components",
     ylab = "Cumulative Variance Explained (%)",
     main = "Cumulative Variance",
     ylim = c(0, 100))
abline(h = 90, col = "red", lty = 2)

par(mfrow = c(1, 1))
round(pca$rotation, 3)

##                 PC1    PC2    PC3    PC4    PC5    PC6
## pm10          0.409 -0.488  0.047  0.155 -0.709  0.256
## no2           0.142 -0.640  0.482  0.134  0.564 -0.052
## pm10_tempcov -0.480 -0.466 -0.324  0.040 -0.199 -0.637
## no2_tempcov  -0.578 -0.287 -0.217 -0.084  0.092  0.722
## population    0.339 -0.222 -0.316 -0.847  0.132 -0.027
## latitude     -0.364  0.066  0.717 -0.481 -0.336 -0.064

PC1 and PC2 together explain 53.3% of total variance (31.8% and 21.5% respectively), meaning that just two components capture the majority of the dataset’s structure. The scree plot shows a gradual decline rather than a sharp elbow, suggesting variance is distributed relatively evenly across components rather than being dominated by one or two dimensions. Five components are needed to reach the 90% threshold, indicating that while dimensionality can be reduced, the dataset does contain meaningful variation across multiple axes.

The loadings table reveals a clear interpretive structure. PC1 is dominated by the temporal coverage variables (no2_tempcov: −0.578, pm10_tempcov: −0.480) and pollution concentrations (pm10: 0.409, population: 0.339), representing a broad axis contrasting cities with high monitoring completeness against cities with high pollution and population. PC2 is driven primarily by NO2 (−0.640) and PM10 (−0.488), representing an axis of overall pollution intensity. Latitude loads strongly onto PC3 (0.717), confirming that geographic position operates as a largely independent dimension from pollution and monitoring characteristics.

pca_scores <- as.data.frame(pca$x)
pca_scores$who_region <- cityAir$who_region

ggplot(pca_scores, aes(x = PC1, y = PC2, colour = who_region)) +
  geom_point(alpha = 0.5, size = 1.5) +
  labs(title    = "PC1 vs PC2 PCA",
       subtitle = "Coloured by WHO Region",
       x        = paste0("PC1 (", round(summary(pca)$importance[2,1]*100, 1), "% variance)"),
       y        = paste0("PC2 (", round(summary(pca)$importance[2,2]*100, 1), "% variance)"),
       colour   = "WHO Region") +
  theme_minimal() +
  theme(legend.position = "bottom")

The PCA biplot shows a pronounced right-skewed distribution along PC1, consistent with the heavily skewed pollution and population distributions identified earlier. The dense mass of cities near the origin represents the majority of locations with moderate pollution profiles, while a long tail of points extending along the positive PC1 axis corresponds to high-pollution, high-population outliers. Regional separation is limited in PCA space, with most WHO regions overlapping substantially, confirming that pollution profiles do not map cleanly onto geographic boundaries and that a data-driven clustering approach is necessary.

UMAP (Uniform Manifold Approximation and Projection) was applied as a non-linear complement to PCA. While PCA preserves global variance structure, UMAP is better suited to revealing local neighborhood relationships and non-linear cluster geometry that PCA may compress or obscure.

#UMAP
library(umap)
umap_result <- umap(airCleaned_Scaled,
                    n_neighbors = 15,
                    min_dist    = 0.1,
                    metric      = "euclidean")

umap_df <- data.frame(
  UMAP1      = umap_result$layout[, 1],
  UMAP2      = umap_result$layout[, 2],
  who_region = cityAir$who_region
)

umap_df <- umap_df %>%
  mutate(who_region = recode(who_region,
    "1_Afr"   = "Africa",
    "2_Amr"   = "Americas",
    "3_Sear"  = "South-East Asia",
    "4_Eur"   = "Europe",
    "5_Emr"   = "Eastern Mediterranean",
    "6_Wpr"   = "Western Pacific",
    "7_NonMS" = "Non-Member State"
  ))
my_colours <- c(
  "Africa"                = "#E41A1C",  # Red
  "Americas"              = "#377EB8",  # Blue
  "South-East Asia"       = "#FF7F00",  # Orange
  "Europe"                = "#4DAF4A",  # Green
  "Eastern Mediterranean" = "#984EA3",  # Purple
  "Western Pacific"       = "#A65628",  # Brown
  "Non-Member State"      = "#000000"   # Black
)
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, colour = who_region)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_colour_manual(values = my_colours) +
  labs(title    = "UMAP Projection",
       subtitle = "Coloured by WHO Region",
       x        = "UMAP Dimension 1",
       y        = "UMAP Dimension 2",
       colour   = "WHO Region") +
  theme_minimal() +
  theme(legend.position = "bottom")

The UMAP projection (Figure 7) reveals considerably more structure than PCA. Two visually distinct groupings are immediately apparent: a large, dense green mass corresponding predominantly to European cities occupying the upper portion of the plot, and a separate mixed-region cluster of South-East Asian, Western Pacific, and Eastern Mediterranean cities in the lower central area. A small, isolated cluster is visible in the lower left, separated from the main body; these likely correspond to the extreme outlier cities identified in the distributional analysis, potentially cities with exceptionally high NO2 or PM10 values. A further strand of predominantly American cities extends to the left, suggesting a subgroup with a distinct pollution profile. The partial but imperfect regional separation confirms that while geography correlates with pollution profiles, it does not fully determine them.

region_labels <- recode(cityAir$who_region,
  "1_Afr"   = "Africa",
  "2_Amr"   = "Americas",
  "3_Sear"  = "South-East Asia",
  "4_Eur"   = "Europe",
  "5_Emr"   = "Eastern Mediterranean",
  "6_Wpr"   = "Western Pacific",
  "7_NonMS" = "Non-Member State"
)
umap_5  <- umap(airCleaned_Scaled, n_neighbors = 5,  min_dist = 0.1)
umap_30 <- umap(airCleaned_Scaled, n_neighbors = 30, min_dist = 0.1)

df5  <- data.frame(UMAP1 = umap_5$layout[,1],  UMAP2 = umap_5$layout[,2],  n = "n=5",  who_region = region_labels)
df30 <- data.frame(UMAP1 = umap_30$layout[,1], UMAP2 = umap_30$layout[,2], n = "n=30", who_region = region_labels)
df15 <- data.frame(UMAP1 = umap_result$layout[,1], UMAP2 = umap_result$layout[,2], n = "n=15", who_region = region_labels)
umap_sens <- rbind(df5, df15, df30)

ggplot(umap_sens, aes(x = UMAP1, y = UMAP2, colour = who_region)) +
  geom_point(alpha = 0.4, size = 1) +
  scale_colour_manual(values = my_colours) +
  facet_wrap(~n) +
  labs(title  = "UMAP Sensitivity to n_neighbors",
       colour = "WHO Region") +
  theme_minimal() +
  theme(legend.position = "bottom")

The sensitivity analysis shows that the core structure identified at n=15 is largely preserved across n=5 and n=30. The separation between the European mass and the mixed cluster persists across all three settings, as does the isolated group. At n=5 the projection becomes more fragmented as UMAP prioritises very local neighbourhood relationships, while at n=30 the structure becomes slightly more compressed. Since the key features, the two main groupings and the outlier separation, are consistent across all settings, the n=15 projection is reliable and not an artifact of parameter choice.

The exploratory analysis collectively suggests that the data contains genuine cluster structure, but of a complex nature that motivates careful method selection. The data is heavily right-skewed with multiple extreme outliers, particularly in NO2 and population. Distance-based algorithms will be sensitive to these extremes, and their influence should be monitored in the cluster outputs. DBSCAN, which explicitly identifies outliers as noise rather than forcing them into clusters, is particularly well-suited to this characteristic.

The strong positive correlation between PM2.5 and PM10 was addressed by removing PM2.5, and the resulting six-variable feature set exhibits no critically redundant pairs. Standardization was applied to remove scale imbalance, without which population would dominate all distance calculations. Both PCA and UMAP confirm the presence of a large central mass of cities with moderate pollution profiles alongside smaller, more extreme groupings.

The soft, overlapping boundaries between regional groupings visible in UMAP motivate probabilistic approaches such as Gaussian Mixture Models, which assign soft cluster membership rather than hard boundaries, reflecting the reality that cities can genuinely share characteristics across multiple pollution profiles.

The outlier cities visible in both projections, particularly those separated in the upper left of the UMAP plot suggest that density-based methods such as DBSCAN, which can explicitly label extreme points as noise rather than forcing them into clusters, are well suited to this data.

Hierarchical clustering provides a complementary perspective that neither GMM nor DBSCAN can offer by producing a dendrogram, it reveals not just which cities cluster together but the relative strength of those relationships and at what level of similarity they merge. This is particularly valuable for a dataset structured around pollution profiles, where understanding the degree of similarity between city groupings is as informative as the groupings themselves.

Clustering

##Hierarchical Clustering The WHO Air Quality dataset contains cities from a wide range of geographic regions and pollution profiles. Hierarchical clustering is well suited for this dataset as it makes no assumptions about the number of clusters in advance which is an important advantage when the natural grouping structure of the global cities is not known prior. This contrasts with methods like K-means which does require specifying the number of k clusters before running.

Hierarchical clustering produces a dendrogram which is a tree diagram that visualizes the entire merging history of observations. This is useful because we can observe how cities group at multiple scales and make an informed decision about where to cut the tree rather than committing to a single solution blindly. This clustering also works effectively with standardized continuous variable and can identify compact pollution groups using variance-minimizing linkage criteria.

Hierarchical clustering is an unsupervised learning algorithm that works by building a nested sequence of groups by either starting with every observation in its own cluster and merging upward (agglomerative), or by starting with all observations in one cluster and splitting downward (divisive). In this analysis, the agglomerative approach will be used.

The agglomerative method begins with each observation treated as its own cluster. Clusters are then merged iteratively according to a distance metric and linkage criterion until all observations belong to a single hierarchy.

The agglomerative algorithm proceeds in the following steps.

It first begins with initialization. Each of the n cities begins as its own singleton cluster which yields \(n\) clusters total. A pairwise distance matrix is then computed between all observations.

For \(n\) cities, the algorithm constructs an \(n \times n\) symmetric distance matrix \(\mathbf{D}\) where entry \(D_{ij} = d(\mathbf{x}_i, \mathbf{x}_j)\). The diagonal is zero, as every city has zero distance to itself (\(D_{ii} = 0\)), and the matrix is symmetric (\(D_{ij} = D_{ji}\)). At each merging step, two rows and columns are collapsed and replaced with a single row/column representing the new merged cluster.

The similarity between observations was measured using Euclidean distance: \[d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2}\] where \(\mathbf{x}\) and \(\mathbf{y}\) are two city feature vectors, \(p\) is the number of features, and \(x_i\), \(y_i\) are the \(i\)-th scaled feature values for each city.

Next, the two clusters that are closest to each other, which will be defined by the chosen linkage criterion, are merged into a single cluster. This reduces the total number of clusters by one. The distance matrix is then updated to reflect the new cluster distances according to the linkage method. This gets repeated until all observations have been merged into a single cluster. A horizontal cut is made in the resulting dendrogram at a chosen height to produce a flat partition of k clusters.

Hierarchical clustering uses different linkage methods to determine how the distance between two clusters is defined when one or both clusters contain multiple observations. Some common linkage methods include:

Single linkage is the minimum distance between any pair of points across the two clusters. This tends to produce elongated and chained clusters.

Complete linkage is the maximum distance between any pair of points. This tends to produce compact and roughly equal sized cluster.

Average linkage is the average of all pairwise distances between the two clusters which produces a balanced compromise between the above two linkages.

Ward’s linkage merges the two clusters that results in the smallest increase in total within-cluster variance. This often produces the most interpretable and balanced clusters.

The key decisions in hierarchical clustering are the distance metric, the linkage method, and the number of clusters. The distance metric that is being used is the Euclidean distance measure. Euclidean distance was selected because all clustering variables were continuous and standardized using z-score normalization. Standardization ensure that variables measured on different scales contributed equally to distance calculations. The features were standardized with scale() prior to clustering. This is done so that variables with large numeric ranges do not dominate the distance computation over variables with smaller ranges. The linkage method that is being used is Ward’s linkage method. This method minimizes within-cluster variance at each merge step which tends to produce compact and well separated clusters. The number of k clusters will be determined by the dendrogram and silhouette score.

#Compute the Euclidean distance matrix
dist_matrix <- dist(airCleaned_Scaled, method = "euclidean")

#Perform agglomerative clustering with Ward's linkage
hc <- hclust(dist_matrix, method = "ward.D2")

#Determine optimal K via silhouette score
library(cluster)
sil_scores <- sapply(2:10, function(k) {
  labels <- cutree(hc, k = k)
  mean(silhouette(labels, dist_matrix)[, 3])
})

plot(2:10, sil_scores, type = "b",
     xlab = "Number of Cluster (k)",
     ylab = "Average Silhouette Width",
     main = "Silhouette Scores for Hierarchical Clustering")

k_opt <- which.max(sil_scores) + 1 
k_opt

## [1] 3

#Assign Clusters
airClusters <- airCleaned2
airClusters$hc_cluster <- cutree(hc, k = k_opt)

The hierarchical clustering implementation followed several stages. First were data cleaning and preprocessing. Missing pollutant observations were removed as well as highly redundant variables were reduced through feature selection. Numeric features were standardized using z-score normalization to ensure equal weighting in distance calculations. A Euclidean distance matrix was computed between all observations. Agglomerative hierarchical clustering was performed using Ward’s linkage criterion. A dendrogram inspection and silhouette analysis were used to determine the appropriate number of clusters.

Hierarchical clustering offers several advantages for environmental data analysis. It does not require the number of clusters to be specified initially. Its dendrogram provides a clear visual representation of the nested cluster structure. Ward’s method produces compact and interpretable groups. The method is effective for exploratory analysis where cluster structure is unknown. Hierarchical clustering is also flexible because cluster assignments can be examined at multiple levels of granularity without rerunning the algorithm.

Despite its strengths, hierarchical clustering also has several limitations. The method can be computationally expensive for large datasets because pairwise distance must be calculated. Once clusters are merged, the algorithm cannot reverse earlier decisions. Results can be sensitive to the choice of linkage method and distance metric. Outliers may distort cluster structure if preprocessing and scaling are not handled carefully. In addition, environmental pollution data often exhibit gradual transitions rather than perfectly separated groups, meaning some overlap between clusters is expected. As a result, hierarchical clustering may produce fewer flexible boundaries compared with probabilistic approaches such as Gaussian Mixture Models.

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model (GMM) was chosen as a cluster method for its ability to use soft assignments for determining cluster membership. This means a given observation will have a probability of being assigned to each cluster, such as 90% likelihood to Cluster A and 10% to Cluster B. This relates to this dataset as the PCA demonstrated that regional groupings did not reveal harsh boundaries, so the true clusters may experience overlap. Furthermore, if Cluster A is defined by pm10 and Cluster B is defined by no2, but the city contains high levels of both pollutants, it is unclear which cluster the city should be assigned to. The ability GMM has to create soft assignments comes in handy here, if the city has a higher concentration of pm10 than no2, then it has a higher probability of being a part of Cluster A than Cluster B, but the hard assignment does not happen until the algorithm converges and stops. A related aspect is that, if cities can have a combination of pm10 and no2 pollutants as well as other defining features such as population size, then it is likely that the clusters will overlap with each other as it is unlikely the physical location of cities determine their population and pollution concentration for each pollutant. Other advantages include its ability to handle elliptical or uneven shapes. This is because GMM uses a covariance matrix. The limitations of GMM are that it cannot handle data that has a non-convex shape and can be sensitive to outlier values or noise.

GMMs first require an input of the initial data, parameters, and number of components (or clusters). Then the algorithm calculates the responsibility value, the soft assignment mentioned before, for each observation to determine the likelihood that point was generated from a given component. Once completed, the algorithm updates the parameters using the responsibility values and repeats the soft assignment calculations. This process continues until the values converge, and the observations are then hard assigned to the cluster they have the highest probability of coming from.

To determine the number of clusters, first the GMM algorithm is run with a range of values for G to calculate AIC and BIC values. The highest values for AIC and/or BIC determine the suggested number of components. AIC and BIC apply penalty terms to the algorithm to prevent overfitting and between the two, BIC is more commonly used for clustering as the penalty term is harsher. However, adding more components will always increase these values and therefore suggest using more components than necessary, so other values should be considered. For this project, the uncertainty value will also be considered. The uncertainty value is the summation of the responsibility values for an observation except for its highest value. For example, if observation 1 has a 0.95 probability for cluster A and 0.05 probability for Cluster B, then the uncertainty value is 0.05. From that example it can be shown that a lower uncertainty is better; it means the algorithm is confident in its assessment on where each observation comes from.

The implementation of GMM for this project is as follows: First, the number of clusters to use needs to be calculated. As stated before, this is done by using a range of values, and the algorithm will generate BIC values for each number of components. The number of components with the highest value is the recommended number of components to use for the GMM. Therefore, an initial range of 2:20 components was used. As shown in the graph below, the BIC value increases as the number of components increases as well. However, BIC values will always increase as components increase, even though BIC has a penalty term as part of its calculation. Therefore, even though the graph shows 20 components as the highest BIC value, 20 components may not be the best fit for the GMM. It should be noted that after 10 components, the value of BIC only marginally increases, meaning that the BIC for 10 components should be a similar, but lower value, of the BIC of 20 components while having half the number of components.

library(mclust)

## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.

## 
## Attaching package: 'mclust'

## The following object is masked from 'package:dplyr':
## 
##     count

set.seed(100)

#Using 2 to 20 components should give a wide enough range to find best value of G
GMM_results <- Mclust(data = airCleaned_Scaled,
                      G = 2:20,
                      verbose = FALSE)
#BIC plot
plot(GMM_results$BIC)

#uncertainty
uncertainty <- as.matrix(GMM_results$uncertainty)
summary(GMM_results$uncertainty)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03594 0.12396 0.19078 0.32248 0.73738

#percent above 0.2
sum(uncertainty > 0.2)

## [1] 2751

mean(uncertainty > 0.2)

## [1] 0.3830409

Therefore, to determine which component number to use, the uncertainty value will be used and the threshold of an acceptable uncertainty value for a given observation will be at or below 0.2. The mean value of uncertainty for 20 components is 0.18716, so the average value of uncertainty is just under the threshold. There are 2709 of the 7182 observations that have an uncertainty value above 0.2, or about 38% of the observations. For 10 components mean uncertainty value is 0.126538, a decrease of 0.060622, and the number of observations above 0.2 is 1820 or about 25% of the observations, a 13% decrease. While 20 components have a higher BIC, the decrease in the BIC value for 10 components is worth the decrease in the uncertainty values.

#GMM with 10 components
set.seed(100)
GMM10_results <- Mclust(data = airCleaned_Scaled,
                      G = 10,
                      verbose = FALSE)
#uncertainty
summary(GMM10_results$uncertainty)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.005166 0.052777 0.126538 0.203791 0.729436

uncertainty10 <- as.matrix(GMM10_results$uncertainty)
#percentage greater than 0.2
sum(uncertainty10 > 0.2)

## [1] 1820

mean(uncertainty10 > 0.2)

## [1] 0.2534113

##Density Based Spatial Clustering of Applications with Noise (DBSCAN)

Since DBSCAN explicitly identifies and labels noise points instead of considering them for clustering, it is well-suited for this dataset that presents with extreme outliers. Furthermore, DBSCAN allows for unique identification of clusters using dense regions rather than defined shapes like GMM.

DBSCAN defines clusters as dense regions of points separated by sparse regions of points. DBSCAN classifies the points in a dataset as either a core point, border point, or noise point depending on the density of its local neighborhood. This is estimated by counting the points within a fixed-radius (Ɛ) neighborhood and connecting points that lie within two neighborhoods, represented as Nε(x) = {y∈Rd: ∥y − x∥ ≤ ε}.For a core point, |Nε(p)| ≥ MinPts; for a border point, |Nε(p)| < MinPts but p ∈ Nε(q); and for a noise point it is neither a core point nor within the Ɛ-neighborhood of any core point. DBSCAN follows a six step process: 1. Select an unvisited data point 2. Compute its Ɛ-neighborhood. 3. If it is a core point, start a new cluster. 4. Expand the cluster using density-reachability. 5. If it is not a core point, label it as noise. 6. Repeat until all points in the dataset are processed.

Density-reachability explains how clusters grow but aren’t symmetric relationships. Direct density-reachability means that a point q lies within the Ɛ-neighborhood of a core point p, q ∈ Nε(p) and |Nε(p)| ≥ MinPts. Density-reachability between q and p requires that q is directly density-reachable from the sequence of core points between q and p. Density-connectivity, on the other hand, is symmetric and allows two border pints to belong to the same cluster despite neither being density-reachable from the other. Density-connectivity is true if a point o exists so that both p and q are density reachable from o.

DBSCAN requires two pre-determined parameters: Ɛ, the neighborhood radius around each point and minpts, the minimum number of points within that radius to qualify as a dense region. Ɛ or eps was determined using the k-distance plot. The “elbow” of the plot lies at ~1.

library(dbscan)

## 
## Attaching package: 'dbscan'

## The following object is masked from 'package:stats':
## 
##     as.dendrogram

library(fpc)

## 
## Attaching package: 'fpc'

## The following object is masked from 'package:dbscan':
## 
##     dbscan

#determine eps & minpts
#kNN distance plot
kNNdistplot(airCleaned_Scaled, k = 7)
title(main = "kNN Distance Plot (k=7)")
abline(h = 1, col = "red",    lty = 2, lwd = 2)

There are six features in the airCleaned_Scaled data set, so a grid of all combinations of eps = 0.8, 0.9, 1, 1.1, 1.2 and minpts = 7, 8, 9, 10, 11, 12 was created. Noise points increase as eps decreases and minpts increases. With this in mind, the heat map was examined for a suitable combination of eps and minpts. eps = 0.9 and minpts = 10 was chosen. eps = 0.9 is both large to avoid merging clusters and a very stable column in the heat map with low variability across changes in minpts. It results in 7-9 clusters, with one instance of 4 clusters. Changes in eps, however, is much more variable regardless of minpts value. At minpts = 10 in the 0.9 column, there is a little bit more stability than the other rows. The neighboring combinations do not exhibit quite as large of a difference in cluster number as other values of minpts.

#create grid of eps & minpts possibilities
eps_vals    <- seq(0.8, 1.2, by = 0.1)
minpts_vals <- c(7, 8, 9, 10, 11, 12)
grid_results <- expand.grid(eps = eps_vals, minPts = minpts_vals)
grid_results$n_clusters <- NA
grid_results$n_noise     <- NA
for (i in seq_len(nrow(grid_results))) {
  fit <- dbscan(airCleaned_Scaled,
                eps    = grid_results$eps[i],
                MinPts = grid_results$minPts[i])
  k   <- max(fit$cluster)
  grid_results$n_clusters[i] <- k
  grid_results$n_noise[i]    <- sum(fit$cluster == 0)
}
#heatmap of number of clusters
ggplot(grid_results, aes(x = eps, y = factor(minPts), fill = n_clusters)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(option = "plasma", name = "# Clusters") +
  labs(title = "DBSCAN Grid Search for Number of Clusters",
       subtitle = "",
       x = "eps (search radius)", y = "minPts") +
  theme_minimal(base_size = 13)

#DBSCAN
dbscan_model <- dbscan(airCleaned_Scaled, eps = 0.9, MinPts = 10)

DBSCAN’s biggest advantage is that it does not require a K to be defined prior to clustering. It also overcomes limitations such as the assumption of convex or spherical clusters made by K-means clustering. However, DBSCAN does require eps and minpts to be specified, which are parameters that tend to be sensitive to small changes and lead to variable cluster number and cluster size. Here, eps = 0.9 was chosen, but changing eps by 0.1 in either direction while maintaining minpts results in a reduction to 5 and 3 clusters.

#Results ##Hierarchical Clustering

table(airClusters$hc_cluster)

## 
##    1    2    3 
## 6147 1030    5

#Dendrogram with cut at chosen k
cut_height <- hc$height[nrow(airCleaned_Scaled) - k_opt]

plot(hc, labels = FALSE, main = "Dendrogram - Ward's Linkage", 
     xlab = "Cities", ylab = "Height (Within-Cluster Variance)")

abline(h = cut_height, col = "red", lty = 2, lwd = 2)

Hierarchical clustering resulted in 3 clusters. Cluster 1 has 6,147 members, Cluster 2 has 1,030 members, and Cluster 3 has 5 members. These Clusters are visually displayed in the cut dendrogram with Cluster 2 as the left most cluster, Cluster 3 as the second to the left most cluster, and Cluster 1 as the right most cluster. Clusters 3 and 1 are much closer in relation to one another than either of their respective relationships to Cluster 2.

#hierarchical plots
ggplot(airCleaned,aes(no2, population, color = as.factor(airClusters$hc_cluster))) + 
  geom_point() + labs(title = "Air Pollution vs. Population with Hierarchical Clusters", 
                      color = "Cluster") + xlab("Air Pollution (NO2)") +
  ylab("Population") + theme_minimal()

ggplot(airCleaned,aes(pm10, latitude, color = as.factor(airClusters$hc_cluster))) + 
  geom_point() + labs(title = "Particulate Pollution vs. Latitude with Hierarchical Clusters", 
                      color = "Cluster") + xlab("Particulate Pollution (PM10)") +
  ylab("Latitude") + theme_minimal()

Air pollution was plotted against population and particulate pollution was plotted against latitude. The former was done as a result of known biases where it is assumed that increased population results in more air pollution due to increased human activity. The latter investigates another type of pollution against a variable with more of a bell distribution for ease of visualization. In the plot of air pollution against population with the Hierarchical clusters reveals that Cluster 1 is mostly made up of small populations and low air pollution levels, Cluster 2 retains low air pollution but has larger populations, and Cluster 3 has small populations but much, much higher air pollution. Particulate pollution against latitude tells a much different story with all three clusters interspersed in each other.

##GMM

table(GMM10_results$classification)

## 
##    1    2    3    4    5    6    7    8    9   10 
##  556 1522  731  122  650  922  680  754  238 1007

airClusters$gmm_cluster <- factor(GMM10_results$classification)

GMM resulted in 10 clusters. Cluster 1 contains 556 members, Cluster 2 with 1,522 members, Cluster 3 with 731, Cluster 4 with 122, Cluster 5 with 650, Cluster 6 with 922, Cluster 7 with 680, Cluster 8 with 754, Cluster 9 with 238, and Cluster 10 with 1,007.

#gmm plots
ggplot(airCleaned,aes(no2, population, color = as.factor(airClusters$gmm_cluster))) + 
  geom_point() + labs(title = "Air Pollution vs. Population with GMM Clusters", 
                      color = "Cluster") + xlab("Air Pollution (NO2)") +
  ylab("Population") + theme_minimal()

ggplot(airCleaned,aes(pm10, latitude, color = as.factor(airClusters$gmm_cluster))) + 
  geom_point() + labs(title = "Particulate Pollution vs. Latitude with GMM Clusters", 
                      color = "Cluster") + xlab("Particulate Pollution (PM10)") +
  ylab("Latitude") + theme_minimal()

When plotting the GMM clusters in a plot of air pollution against population, it is clear that Cluster 5 has cities with the largest populations. Cluster 4 comes in a close second with varied air pollution including the most polluted cities. The other clusters peak through in the low population low pollution area that is mostly overrun by the largest Cluster 2. In the graph of particulate pollution against latitude, Cluster 4 has members across the latitude axis. The other clusters are much more concentrated in their respective latitudinal locations.

There is a sufficient number of clusters and cluster size to further examine their relationships on a world map.

#Prints cities onto world map colored by their cluster assignment
library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:mclust':
## 
##     map

## The following object is masked from 'package:cluster':
## 
##     votes.repub

airClusters$gmm_cluster <- as.factor(airClusters$gmm_cluster)
world_map <- map_data("world")
cluster_colors <- c(
  "#E41A1C", "#377EB8","#4DAF4A","#984EA3","#FF7F00",
  "#FFFF33","#A65628","#F781BF","#17BECF","#000000")
base_map <- ggplot() +
  
  geom_polygon(data = world_map,
               aes(long, lat, group = group),
               fill = "gray90",
               color = "white",
               linewidth = 0.2) +
  
  geom_point(data = airClusters,
             aes(x = longitude,
                 y = latitude,
                 color = gmm_cluster),
             size = 2.5,
             alpha = 0.85) +
  
  scale_color_manual(values = cluster_colors) +
  
  labs(color = "Cluster Assignment") +
  
  theme_minimal() +
  
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )

base_map

base_map +
  coord_cartesian(
    xlim = c(-170, -50),
    ylim = c(5, 75)
  ) +
  labs(title = "North America Zoom")

base_map +
  coord_cartesian(
    xlim = c(-15, 45),
    ylim = c(30, 72)
  ) +
  labs(title = "Europe Zoom")

base_map +
  coord_cartesian(
    xlim = c(60, 150),
    ylim = c(-10, 60)
  ) +
  labs(title = "Asia Zoom")

Since North America, Europe, and Asia are very crowded, they have additional zoomed in plots to examine. From the world map, we can see that most of the clusters are in the northern hemisphere, with only Cluster 5 and 9 having a significant number of points below the equator. Between the two, Cluster 9 has almost all of its points below the equator, whereas Cluster 5 appears equally spread between the northern and southern hemisphere. Both Cluster 5 and 9 profiles are high pm10 pollutants, low no2 pollutants and high population. Further investigation may be needed to understand what differentiates the cities in the southern hemisphere that are in Cluster 5 and therefore have similar cities in the north, versus the southern cities in Cluster 9 that have few northern cities that are similar.

From the North America map, it appears the predominate cluster found in the region and not others are Cluster 7, moderate pm10 with low no2 and moderate population. It appears mostly centered around latitude 50, around the Canadian and USA border line but still has a significant number of cities towards the center of each country. It is important to note that Cluster 2, the predominate cluster of Asia, has many points here as well focused in the Central American region up through the South-east coastline to the middle east coastline of the USA. Therefore, it should be investigated what it is about these cities that make them more similar to the cities of Asia rather than the other USA cities around them.

The European continent is dominated by Cluster 3, 6, 8, and 10. All these clusters have low populations and either moderate or low pollutant amounts. North America and Asia do have some points of these clusters within them, mostly inland for each continent. It should be investigated why these clusters are predominately in Europe, some possible explanations could be that given the smaller size of Europe getting measurements from smaller cities was possible whereas the rural zones of North America or Asia are harder to get measurements for. It could be due to Europe being the birthplace of the industrial revolution and thus has allowed more time for pollutants to permeate even to the smallest cities in the region.

Asia is predominately filled with Cluster 2 and 5, which are high pm10 with low no2 and either high or moderate population. Cluster 2 especially is concentrated in Asia while having some points in the USA, as mentioned before. Cluster 2 is the one with moderate population size, given the population and size of the Asian countries, especially China, it stands to reason this continent would have more cities that fit the moderate population size with high pm10 than you would be able to find in North America or Europe, where those countries or continents would have some but nowhere near as many.

##DBSCAN

#DBSCAN results
dbscan_model

## dbscan Pts=7182 MinPts=10 eps=0.9
##          0    1  2  3  4  5  6  7
## border 551  252 14  6 11 17  9 14
## seed     0 6254 18 24  4  4  1  3
## total  551 6506 32 30 15 21 10 17

#core, noise, & border pts
core_pts <- is.corepoint(x = airCleaned_Scaled, eps = 0.9, minPts = 10)
noise_pts <- dbscan_model$cluster == 0
border_pts <- dbscan_model$cluster > 0 & !core_pts
as.data.frame(table(core_pts, noise_pts, border_pts))

airClusters$db_cluster <- dbscan_model$cluster

DBSCAN resulted in seven clusters. Cluster 1 is one large cluster with a total of 6,506 points. The remaining 6 clusters are fragmented small clusters ranging from 32 to 10 members. DBSCAN categorized the data points as 6,308 core points, 551 noise points, and 323 border points.

#DBSCAN plots
ggplot(airCleaned,aes(no2, population, color = as.factor(dbscan_model$cluster))) + 
  geom_point() + labs(title = "Air Pollution vs. Population with DBSCAN Clusters", 
                      color = "Cluster") + xlab("Air Pollution (NO2)") +
  ylab("Population") + theme_minimal()

ggplot(airCleaned,aes(pm10, latitude, color = as.factor(dbscan_model$cluster))) + 
  geom_point() + labs(title = "Particulate Pollution vs. Latitude with DBSCAN Clusters", 
                      color = "Cluster") + xlab("Particulate Pollution (PM10)") +
  ylab("Latitude") + theme_minimal()

A plot of the air pollution against population demonstrates that most cities lie low on both, thereby forming one large Cluster 1 with data points on the edge of Cluster 1 classified as noise. The remaining six clusters are not visible through the large mass. Examining particulate pollution against latitude reveals that it too presents as a large cluster with many noise points on the edges of the cluster, however, the smaller clusters are visually clear here with demonstrated distinction from one another but still interspersed within Cluster 1 and the noise points.

#Hierarchical cluster quality metrics
#silhouette score
hcsil <- silhouette(airClusters$hc_cluster, dist_matrix)
hc_sil_df <- as.data.frame(hcsil)
hc_sil_perclus <- aggregate(sil_width ~ cluster,data = hc_sil_df,mean)
#Davies-Bouldin Index (DBI)
library(clusterSim)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

db_index <- index.DB(airCleaned_Scaled, airClusters$hc_cluster)
#Calinski-Harabasz Index
ch_index <- index.G1(airCleaned_Scaled, airClusters$hc_cluster)

#GMM Cluster quality metrics
#Calculating the silhouette score
airClusters$gmm_cluster <- as.numeric(airClusters$gmm_cluster)
gmmsil <- silhouette(airClusters$gmm_cluster, dist_matrix)
gmm_sil_df <- as.data.frame(gmmsil)
gmm_sil_perclus <- aggregate(sil_width ~ cluster,data = gmm_sil_df,mean)
#Calculates the Davies-Bouldin Index
gmmdb <- index.DB(airCleaned_Scaled,airClusters$gmm_cluster)
#Calinski-Harabasz Index
gmmch <- index.G1(airCleaned_Scaled, airClusters$gmm_cluster)

#DBSCAN Cluster quality metrics
#silhouette score
nonnoise <- dbscan_model$cluster != 0
distancematrix_nonnoise <- dist(airCleaned_Scaled[nonnoise,])
db_sil <- silhouette(dbscan_model$cluster[nonnoise], distancematrix_nonnoise)
#per cluster
db_sil_df <- as.data.frame(db_sil)
db_sil_perclus <- aggregate(sil_width ~ cluster,data = db_sil_df,mean)
#Davies-Bouldin Index
dbscan_db <- index.DB(airCleaned_Scaled[nonnoise,],dbscan_model$cluster[nonnoise])
#Calinski-Harabasz Index
dbscan_ch <- index.G1(airCleaned_Scaled[nonnoise,],dbscan_model$cluster[nonnoise])

#Comparison of cluster assignments against each other
#Hierarchical & DBSCAN
#Adjusted Rand Index
ari_hcdb <- adjustedRandIndex(airClusters$db_cluster[nonnoise], airClusters$hc_cluster[nonnoise])
#Normalised Mutual Information
library(aricode)
nmi_hcdb <- NMI(airClusters$db_cluster[nonnoise], airClusters$hc_cluster[nonnoise])
#Hierarchical & GMM
#ARI
ari_hcgmm <- adjustedRandIndex(airClusters$gmm_cluster, airClusters$hc_cluster)
#NMI
nmi_hcgmm <- NMI(airClusters$gmm_cluster, airClusters$hc_cluster)
#GMM & DBSCAN
#ARI
ari_dbgmm <- adjustedRandIndex(airClusters$db_cluster[nonnoise], airClusters$gmm_cluster[nonnoise])
#NMI
nmi_dbgmm <- NMI(airClusters$db_cluster[nonnoise], airClusters$gmm_cluster[nonnoise])

#tables
silhouette_scores <- c(mean(hcsil[,3]), mean(gmmsil[,3]), mean(db_sil[,3]))
davies_bouldin_index <- c(db_index$DB, gmmdb$DB, dbscan_db$DB)
calinski_harabasz_index <- c(ch_index, gmmch, dbscan_ch)
aicmetric <- c(NA, AIC(GMM10_results), NA)
bicmetric <- c(NA, BIC(GMM10_results), NA)
metrics <- data.frame(silhouette_scores,davies_bouldin_index,calinski_harabasz_index, aicmetric, bicmetric)
rownames(metrics) <- c("Hierarchical", "GMM", "DBSCAN")
metrics

#ARI table
adjusted_rand_index <- data.frame(
  Hierarchical = c(NA, ari_hcgmm, ari_hcdb),
  GMM = c(ari_hcgmm, NA, ari_dbgmm),
  DBSCAN = c(ari_hcdb, ari_dbgmm, NA),
  row.names = c("Hierarchical", "GMM", "DBSCAN"))
adjusted_rand_index

#NMI table
normalised_mutual_information <- data.frame(
  Hierarchical = c(NA, nmi_hcgmm, nmi_hcdb),
  GMM = c(nmi_hcgmm, NA, nmi_dbgmm),
  DBSCAN = c(nmi_hcdb, nmi_dbgmm, NA),
  row.names = c("Hierarchical", "GMM", "DBSCAN"))
normalised_mutual_information

#silhouette scores per cluster
hc_sil_clus <- c(hc_sil_perclus$sil_width, NA, NA, NA, NA, NA, NA, NA)
db_sil_clus <- c(db_sil_perclus$sil_width, NA, NA, NA)
sil_per_cluster <- data.frame(
  Hierarchical = hc_sil_clus,
  GMM = gmm_sil_perclus$sil_width,
  DBSCAN = db_sil_clus, 
  row.names = 1:10)
sil_per_cluster

Cluster validity gives insights into model performance and its suitability for the data. The first metric, silhouette score, was used on the clustering overall as well as the individual clusters. Silhouette scores range from -1 to 1. Values above 0.5 indicate well-performing models with well-separated, cohesive clusters. Hierarchical clustering scored a 0.52, just barely scraping above the 0.5. GMM scored -0.01 and DBSCAN scored 0.45. GMM’s score is expected because it searches for overlapping clusters so it would not score well on a metric that seeks well-separated clusters. As for silhouette scores per cluster, the hierarchical cluster performed very well. Cluster 1 scored a 0.60, Cluster 2 scored a 0.01, and Cluster 3 scored 0.84. These are all above 0.5 aside from Cluster 2. Cluster 3 was not a surprise since it only contains 5 members, but Cluster 2 indicates that this cluster overlaps with the other two which might indicate that the dataset is one large cluster. GMM clusters scored either below 0.5 or negatively with one Cluster 9 that scored a 0.56. This cluster has 238 members and is the only meaningful cluster according to the silhouette scores. Lastly, the DBSCAN clusters did fairly well with only three of the seven clusters falling below 0.5. They are all strong or acceptable scores with the exception of Cluster 5 that only scored 0.26.

The David-Bouldin Index measures the average ratio of within-cluster scatter to between-cluster separation. Hierarchical clustering scored 1.19, GMM scored 3.04, and DBSCAN scored 0.45. The lower the value, the better as that means either the between-cluster separation is large, the within-cluster scatter is small, or both. Unsurprisingly GMM has a very poor score due to overlapping clusters. Next, the Calinski-Harabasz Index scores clusters that are dense and well-separated highly. Here, the higher the value, the better. Hierarchical clustering scored the highest at 1,689, GMM the second highest at 575, and DBSCAN the lowest at 181. As with the other measures, the Calinski-Harabasz Index is not a good fit for measuring the quality of clusters made by GMM due to again, the overlapping clusters. The score for DBSCAN is also unsurprising as its clusters are not dense at all.

When comparing the resultant clusters against one another, the Adjusted Rand Index (ARI) and Normalised Mutual Information (NMI) metrics were used. ARI ranges from 0 to 1 with measurements above 0.7 indicating strong agreement. Hierarchical and GMM clusters scored 0.065, Hierarchical and DBSCAN scored 0.271, and GMM and DBSCAN scored 0.009. None show strong agreement and GMM has little agreement at all with the two other clustering results. For NMI, the values again fall between 0 and 1 and measure the mutual dependence between the cluster assignments. Hierarchical and GMM scored 0.12, Hierarchical and DBSCAN scored 0.14, and GMM and DBSCAN scored 0.03. When comparing DBSCAN, the points classified by the model as noise were omitted.

GMM has two additional values that indicate model performance. AIC rewards models that fit the data well and penalizes complexity. BIC does the same but with a stricter penalty to complexity. GMM scored a 38,860.02 AIC and 40,779.36 BIC. For both, the lower the score, the better the model fits the data. AIC and BIC scores are not applicable to Hierarchical and DBSCAN models so these values have no basis of comparison. However, it is known that outliers and skewed distributions affect GMMs negatively as skewed distributions directly contradict the normal distribution assumption that is the basis of this modeling. Outliers can result in distorting cluster means or directly forming new clusters that do not have real analytical value. Despite skewed distributions being skewed, GMMs will attempt to fit them in a normal distribution leading to a myriad of issues such as misclassifications and low model reliability. This dataset has plenty of outliers and the distributions of the majority of the feature variables are skewed.

In summary, hierarchical clustering has the best mean silhouette score and Calinski-Harabasz Index. DBSCAN has the best silhouette score per cluster and Davies-Bouldin Index. For ARI and NMI, the highest scores were obtained when comparing DBSCAN and Hierarchical clusters. Overall, the best performing model is the DBSCAN model.

#Cluster Profiling and Discussion ##Hierarchical clusters

#Summarize cluster profiles
cluster_summary <- airClusters %>%
  group_by(hc_cluster) %>%
  summarise(across(c(pm10, no2, pm10_tempcov, no2_tempcov, population, 
                     latitude),
                   mean, na.rm = TRUE),
            n = n())

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(...)`.
## ℹ In group 1: `hc_cluster = 1`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

print(cluster_summary)

## # A tibble: 3 × 8
##   hc_cluster  pm10   no2 pm10_tempcov no2_tempcov population latitude     n
##        <int> <dbl> <dbl>        <dbl>       <dbl>      <dbl>    <dbl> <int>
## 1          1  36.5  20.9         91.1        91.9    315190.     38.6  6147
## 2          2  55.9  20.5         57.2        32.2   1492886.     30.8  1030
## 3          3 111.  822.          97.0        92.0    394428      35.7     5

Cluster 1 is characterized by 36.55 pm10, 20.90 no2, 315,190 population, 91.10 pm10_tempcov, 91.87 no2_tempcov, and 38.64 latitude. This cluster consists of mostly North American and European cities with average pollution levels and small populations that are consistently recording pollution.

Cluster 2 has an average of 55.85 pm10, 20.53 no2, 57.20 pm10_tempcov, 32.22 no2_tempcov, 1,492,886 population, and 30.78 latitude. This cluster features cities that are slightly southern but still in the northern hemisphere. These cities have more particulate pollution than cities in Cluster 1 while maintaining average air pollution, very large populations, and have only been recording pollution levels for about half the years that the WHO has been aggregating the database.

Cluster 3 has an average of 110.66 pm10, 821.60 no2, 97.03 pm10_tempcov, 91.96 no2_tempcov, 394,428 population, and 35.73 latitude. This cluster is, again, North American and European cities, but those that are highly polluted with low populations and well documented pollution levels. They are especially polluted in NO2 levels.

#hierarchical radar chart
library(fmsb)
#create data frame
radar_hcdf <- cluster_summary
radar_hcdf$n <- NULL
radar_hcdf <- as.data.frame(radar_hcdf)
rownames(radar_hcdf) <- radar_hcdf$hc_cluster
radar_hcdf$hc_cluster <- NULL
#scale data
normalize <- function(x) {(x - min(x)) / (max(x) - min(x))}
radar_hcscaled <- as.data.frame(lapply(radar_hcdf, normalize))
#plot
radar_hcplot <- rbind(
  apply(radar_hcscaled, 2, max),
  apply(radar_hcscaled, 2, min),
  radar_hcscaled)
radarchart(radar_hcplot,
           axistype = 1,
           pcol = rainbow(nrow(radar_hcscaled)),
           plty = 1,
           plwd = 2)
title("Radar Chart of Mean Feature Values by Hierarchical Clusters")
legend("topright",legend = rownames(radar_hcdf),col = rainbow(nrow(radar_hcscaled)),
       lty = 1,lwd = 2,title = "Cluster")

The radar chart illustrates the mean values of each of the six feature variables for each cluster. This visual allows easier demonstration of the similarities and differences between clusters. Here, it is clearly indicated that all three hierarchical clusters occupy distinct domains. Hierarchical Cluster 3 features the highest pollution and temporal coverage values with low population and is positioned somewhat north. Hierarchical Cluster 1 is the most northern with low pollution and population but very consistent temporal coverage. Lastly, Hierarchical Cluster 2 has some particulate pollution and the largest mean population but is low on all other feature variables.

##GMM clusters

#Summarize GMM cluster profiles
gmm_clusters <- airClusters %>%
  group_by(gmm_cluster) %>%
  summarise(across(c(pm10, no2, pm10_tempcov, no2_tempcov, population, 
                     latitude),
                   mean, na.rm = TRUE),
            n = n())
print(gmm_clusters)

## # A tibble: 10 × 8
##    gmm_cluster  pm10   no2 pm10_tempcov no2_tempcov population latitude     n
##          <dbl> <dbl> <dbl>        <dbl>       <dbl>      <dbl>    <dbl> <int>
##  1           1  20.4  23.6         93.9        95.1    259964.     48.5   556
##  2           2  68.5  27.1         90.3        91.8    593386.     32.5  1522
##  3           3  21.5  16.5         77.8        87.0     55907.     45.6   731
##  4           4  59.2  75.6         91.7        85.9   2371450.     29.5   122
##  5           5  91.0  27.2         65.9        37.1   2806262.     18.4   650
##  6           6  27.2  20.7         96.2        95.7     51329.     43.6   922
##  7           7  29.9  15.7         62.5        44.9    255484.     40.4   680
##  8           8  18.9  13.5         89.8        92.9      8814.     46.0   754
##  9           9  24.8  15.1         91.1        89.4    138469.    -27.4   238
## 10          10  19.4  16.7         97.6        96.8     14429.     47.9  1007

Cluster 1 consists of a pm10 average of 20.36, no2 of 23.59, pm10_tempcov of 93.91, no2_tempcov of 95.10, population of 259,964, and latitude of 48.54. This cluster consists of northern cities in locations such as Canada, Russia, and the Nordics. These cities exhibit low particulate pollution, average air pollution, consistent pollution documentation, and small populations. Cluster 2 has an average pm10 at 68.49, no2 at 27.15, pm10_tempcov at 90.31, no2_tempcov at 91.83, population at 593,386, and latitude at 32.51. These cities are in the southern United States, northern Africa, and Asia and have high pollution concentrations, consistent reporting, and average population sizes.

Cluster 3 has an average pm10 at 21.46, no2 at 16.51, pm10_tempcov at 77.78, no2_tempcov at 87.05, population at 55,907, and latitude at 45.58. These cities are located very north like Cluster 1, with similar pollution levels, less consistent reporting, and very small populations.

Cluster 4 has an average pm10 at 59.24, no2 at 75.56, pm10_tempcov at 91.66, no2_tempcov at 85.86, population at 2,371,450, and latitude at 29.54. These cities are located similarly to Cluster 2 cities, with similar particulate pollution and pollution recording habits, but higher air pollution and much larger populations.

Cluster 5 has an average pm10 at 90.95, no2 at 27.20, pm10_tempcov at 65.93, no2_tempcov at 37.09, population at 2,806,262, and latitude at 18.44. This cluster is made up of cities in southern Asia, Africa, and Central America. They have high particulate pollution, average air pollution, inconsistent pollution reporting, and extremely large populations.

Cluster 6 has an average pm10 at 27.17, no2 at 20.70, pm10_tempcov at 96.16, no2_tempcov at 95.71, population at 51,329, and latitude at 43.55. These cities are just south of Cluster 1 and 3 and present similarly to Cluster 3. Cluster 6 has higher pollution concentrations and more consistent pollution reporting.

Cluster 7 has an average pm10 at 29.88, no2 at 15.72, pm10_tempcov at 62.53, no2_tempcov at 44.86, population at 255,484, and latitude at 40.38. Cluster 7 is just south of Cluster 6 with average particulate pollution and low air pollution. These cities are inconsistent with their pollution measuring and feature small populations.

Cluster 8 has an average pm10 at 18.95, no2 at 13.51, pm10_tempcov at 89.83, no2_tempcov at 92.91, population at 8,814, and latitude at 46.04 This cluster is located just south of Cluster 1, exhibiting low pollution levels, high reporting habits, and extremely small populations.

Cluster 9 has an average pm10 at 24.83, no2 at 15.05, pm10_tempcov at 91.10, no2_tempcov at 89.39, population at 138,469, and latitude at -27.36 This cluster is concentrated in the global south around Australia, South Africa, and Central America. These cities have average pollution levels, consistent report pollution levels, and have small populations.

Cluster 10 has an average pm10 at 19.44, no2 at 16.68, pm10_tempcov at 95.58, no2_tempcov at 96.77, population at 14,429, and latitude at 47.90. This Cluster is closely located to Cluster 1 and 8 and exhibits similarly to Cluster 8 with low pollution, high reporting, and extremely small populations. The difference between the two clusters is that Cluster 10 has been reporting either longer or more consistently and has higher pollution concentrations.

#GMM radar chart
#create data frame
radar_gmmdf <- gmm_clusters
radar_gmmdf$n <- NULL
radar_gmmdf <- as.data.frame(radar_gmmdf)
rownames(radar_gmmdf) <- radar_hcdf$gmm_cluster
radar_gmmdf$gmm_cluster <- NULL
#scale data
radar_gmmscaled <- as.data.frame(lapply(radar_gmmdf, normalize))
#plot
radar_gmmplot <- rbind(
  apply(radar_gmmscaled, 2, max),
  apply(radar_gmmscaled, 2, min),
  radar_gmmscaled)
radarchart(radar_gmmplot,
           axistype = 1,
           pcol = rainbow(nrow(radar_gmmscaled)),
           plty = 1,
           plwd = 2)
title("Radar Chart of Mean Feature Values by GMM Clusters")
legend("topright",legend = rownames(radar_gmmdf),col = rainbow(nrow(radar_gmmscaled)),
       lty = 1,lwd = 2,title = "Cluster")

The GMM radar chart tells an entirely different story from the hierarchical clusters. GMM clusters 1, 3, 6, 8, and 10 are all nearly identical in shape. GMM Clusters 4 and 5 are the most distinct. The remaining GMM Clusters 2, 7, and 9 have some overlap as well. This is consistent with the overlapping nature of GMM clusters.

##DBSCAN clusters

#Summarize DBSCAN cluster profiles
db_clusters <- airClusters %>%
  group_by(db_cluster) %>%
  summarise(across(c(pm10, no2, pm10_tempcov, no2_tempcov, population, 
                     latitude),
                   mean, na.rm = TRUE),
            n = n())
db_clusters <- db_clusters[-1,] #remove noise "cluster"
print(db_clusters)

## # A tibble: 7 × 8
##   db_cluster  pm10   no2 pm10_tempcov no2_tempcov population latitude     n
##        <dbl> <dbl> <dbl>        <dbl>       <dbl>      <dbl>    <dbl> <int>
## 1          1  34.0  20.0       88.3         88.1     263616.     39.7  6506
## 2          2  47.8  20.2       81.3          4.54    321831.     16.7    32
## 3          3  34.6  14.0        0.989        1.51    417853.    -29.2    30
## 4          4 199.   37.2       94.5         20.1     711289.     29.2    15
## 5          5 126.   28.1       91.4         23.7     342006.     28.2    21
## 6          6 125.   34.5       67.8         24.8     940944.     30.0    10
## 7          7  21.3  12.3       81.7          5.19    108695.    -27.6    17

Cluster 1 consists of a pm10 average of 34.02, no2 of 19.97, pm10_tempcov of 88.28, no2_tempcov of 88.15, population of 263,616, and latitude of 39.67. This cluster consists of North American and European cities with average pollution levels, small populations, and consistent pollution monitoring.

Cluster 2 has an average pm10 at 47.79, no2 at 20.19, pm10_tempcov at 81.32, no2_tempcov at 4.54, population at 321,831, and latitude at 16.74. The cities in this cluster are in Central America, Africa, and South Asia. They have average pollution levels, small populations, consistent particulate pollution reporting, and are new to measuring NO2 air pollution.

Cluster 3 has an average pm10 of 34.56, no2 of 14.01, pm10_tempcov of 0.99, no2_tempcov of 1.51, population of 417,854, and latitude of -29.17. These cities are Australian, Southern African, and South American. They have average particulate pollution levels, low air pollution, average population sizes, and are extremely new to recording ambient air quality.

Cluster 4 has an average pm10 of 198.78, no2 of 37.22, pm10_tempcov of 94.48, no2_tempcov of 20.06, population of 711,289, and latitude of 29.24. This cluster consists of cities in southern America, northern Africa, and Asia. They have high pollution levels, especially in particulate pollution. They regularly measure this particulate pollution but fall behind in air pollution measuring. They also have large populations.

Cluster 5 has an average pm10 of 125.87, no2 of 28.15, pm10_tempcov of 91.41, no2_tempcov of 23.72, population of 342,006, and latitude of 28.23. The cities in this cluster are located similarly to Cluster 4 cities and present similar characteristics with the differences being a much smaller population size and still high but not as high particulate pollution.

Cluster 6 has an average pm10 of 124.88, no2 of 34.48, pm10_tempcov of 67.78, no2_tempcov of 24.81, population of 940,944, and latitude of 30.04. These cities are again located similarly to Cluster 5 and 4 cities, however they exhibit different characteristics. Cluster 6 cities have the largest population of the three clusters, are heavily polluted similarly to Cluster 5’s profile, and have not been measuring pollution levels for as long or consistently as the other two clusters.

Lastly, Cluster 7 has an average pm10 of 21.29, no2 of 12.28, pm10_tempcov of 81.74, no2_tempcov of 5.19, population of 108,695, andlatitude` of -27.58. These cities are similarly located to Cluster 3 cities in the southern hemisphere. They have low pollution levels, have a longer history of reporting particulate pollution, and small populations.

DBSCAN is the best performing model despite appearing in a tie with the hierarchical model. The silhouette scores per cluster indicated the hierarchical clusters are more suited to be two clusters; one very large with the vast majority of data points and one very small with only 5 or more members. DBSCAN, on the other hand, had clusters that were at least moderately supported by the silhouette score. Furthermore, DBSCAN demonstrated the most amount of consensus and that consensus was with hierarchical clustering in the ARI and NMI comparisons. As the best performing model, DBSCAN clusters are the clusters of interest.

Interestingly, the one meaningful cluster from the GMM clusters, Cluster 9, is not fully represented in the DBSCAN clusters. The closest cluster is DBSCAN Cluster 7 which has a similar latitude, population, and pollution concentrations. The major difference is that it has a NO2 temporal coverage of 5.19. The meaningful Hierarchical Cluster 1 is mostly represented by DBSCAN Cluster 1. The Hierarchical Cluster 1 is similarly polluted, more populated, and has better temporal coverage than DBSCAN Cluster 1.

#dbscan spider/radar map
#create data frame
radar_df <- db_clusters
radar_df$n <- NULL
radar_df <- as.data.frame(radar_df)
rownames(radar_df) <- radar_df$db_cluster
radar_df$db_cluster <- NULL
#scale data
radar_scaled <- as.data.frame(lapply(radar_df, normalize))
#plot
radar_plot <- rbind(
  apply(radar_scaled, 2, max),
  apply(radar_scaled, 2, min),
  radar_scaled)
radarchart(radar_plot,
           axistype = 1,
           pcol = rainbow(nrow(radar_scaled)),
           plty = 1,
           plwd = 2)
title("Radar Chart of Mean Feature Values by DBSCAN Clusters")
legend("topright",legend = rownames(radar_df),col = rainbow(nrow(radar_scaled)),
       lty = 1,lwd = 2,title = "Cluster")

The radar chart of DBSCAN clusters allows for easier demonstration of the differences between clusters. Clusters such as Cluster 3 and 7 have low pollution but the radar chart clearly illustrates that the major differentiation of the two is Cluster 7’s large pm10 temporal coverage, and Cluster 3’s moderate population size. Cluster 4 displays as the most polluted cluster. Clusters 1 and 2 have similar shapes with Cluster 1’s extremely high no2_tempcov setting it apart from Cluster 2. Clusters 6 and 4 have the highest populations but notably different pollution profiles. Overall, the radar chart reveals that the DBSCAN clustering captures distinct clusters from the clear variation in cluster shapes.

#boxplots
library(tidyr)
clboxplot <- airClusters %>%
  filter(db_cluster != 0) %>%
  dplyr::select(db_cluster,
         pm10,
         no2,
         pm10_tempcov,
         no2_tempcov,
         population,
         latitude) %>%
  pivot_longer(
    cols = -db_cluster,
    names_to = "Feature",
    values_to = "Value")
ggplot(clboxplot,
       aes(x = as.factor(db_cluster),
           y = Value,
           fill = as.factor(db_cluster))) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~Feature, scales = "free_y") +
  labs(
    title = "Feature Distributions Across DBSCAN Clusters",
    x = "Cluster",
    y = "Value",
    fill = "Cluster") +
  theme_minimal() +
  theme(
    plot.title = element_text(
      hjust = 0.5,
      face = "bold"),
    legend.position = "none")

Boxplots per cluster per feature variable (Figure 17) allows further illustration of the differences and similarities between the clusters. Cluster 1 is most certainly distinct from all of the other clusters. As the largest cluster, it also features an obscene number of outliers in its boxplots. Latitude is mostly concentrated in the northern hemisphere except for Clusters 3 and 7. Cluster 4 displays high no2 concentrations with Clusters 3 and 7 at the lower levels. Cluster 1 is the only cluster with high no2_tempcov. Clusters 3 and 7 are the lowest here. Cluster 4 is the most polluted cluster in terms of particulate pollution. Clusters 5 and 6 are high as well, with Cluster 7 at the lowest. Almost all the clusters have high pm10_tempcov. Cluster 6 is a little bit lower, and Cluster 3 is practically 0. Clusters 4 and 6 are the most populated clusters with Cluster 7 as the least populated. Although some of these distributions demonstrate similarity and overlap, other distributions demonstrate clear separation. This supports distinct cluster profiles as not all profiles are completely different from one another.

#interactive dbscan cluster graph
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:Hmisc':
## 
##     subplot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

airClusters$db_cluster <- as.factor(airClusters$db_cluster)
nonnoise_clusters <- airClusters[nonnoise,]
#world map base
world_map <- map_data("world")
dbscanmap <- plot_ly()
dbscanmap <- dbscanmap %>%
  add_polygons(
    data = world_map,
    x = ~long,
    y = ~lat,
    split = ~group,
    fillcolor = "gray90",
    line = list(color = "white", width = 0.2),
    hoverinfo = "none",
    showlegend = FALSE)
#add cluster points
dbscanmap <- dbscanmap %>%
  add_markers(
    data = airClusters,
    x = ~longitude,
    y = ~latitude,
    color = ~db_cluster,
    marker = list(size = 6, opacity = 0.85),
    text = ~paste(
      "Cluster:", db_cluster,
      "<br>PM10:", pm10,
      "<br>NO2:", no2,
      "<br>Population:", population),
    hoverinfo = "text") %>%
  #fade out unselected
  highlight(
    on = "plotly_hover",
    off = "plotly_doubleclick",
    dynamic = FALSE,
    selected = attrs_selected(
      marker = list(opacity = 1)),
    unselected = attrs_selected(
      marker = list(
        opacity = 0.15,
        color = "gray"))) %>%
  layout(title = list(
      text = "Interactive World Map of DBSCAN Clusters",
      x = 0.5),
      legend = list(title = list(text = "Cluster")))

## Warning: The following arguments are not supported:
## unselected
## Arguments such as: hoverinfo and showInLegend 
## have been replaced by selected and other

dbscanmap

This world map allows for interactive user exploration of the DBSCAN clusters. Hovering over a point gives the pm10, no2, and population values of that city while fading out the color intensity of other cities that are not fellow cluster members.

#Conclusion

DBSCAN outperformed both the hierarchical and GMM models in defining multiple meaningful clusters. DBSCAN identified the following seven clusters: 1. North American and European cities with average pollution levels, small populations, and consistent pollution monitoring.
2, Central American, African, and South Asian cities with average pollution levels, small populations, consistent particulate pollution reporting, and are new to measuring NO2 air pollution.
3. Australian, Southern African, and South American cities with average particulate pollution levels, low air pollution, average population sizes, and are extremely new to recording ambient air quality. 4, Southern American, northern African, and Asian cities with the highest pollution levels, consistent particulate pollution reporting, average air pollution reporting, and large populations.
5. Southern American, northern African, and Asian cities with high pollution, consistent particulate pollution reporting, average air pollution reporting, and average population size.
6. Southern American, northern African, and Asian cities with high pollution, average pollution reporting, and the largest populations. 7. Australian, Southern African, and South American cities with the lowest pollution levels, average particulate pollution reporting, new air pollution reporting, and the smallest populations.

Future work should expand this project by clustering again with the exemption of temporal coverage data. Although this data provides meaningful insights into cities’ priorities and history regarding ambient air pollution, it would be interesting to see how clusters differ when considering just location, population, and pollution level. Although DBSCAN performed the best, it did not receive the best possible scores with the scoring metrics. Other clustering methods should be implemented to test for even better performing models.

DATA 542 Group 2 project

Bartlomiej Szaro, Vicky Hu, Kaitlin Norton, Benjamin Stec

2026-05-14

Introduction

Exploratory Data Analysis (EDA)

Clustering

Gaussian Mixture Model (GMM)