1. Introduction

In this project, I analyze the “Global Air Pollution” dataset using unsupervised learning. The goal is to find patterns among cities worldwide by clustering them on air quality indicators: the overall AQI value and the AQI values for Ozone, Carbon Monoxide (CO), Nitrogen Dioxide (NO2), and Particulate Matter (PM2.5).

2. Data Preparation

First, we load the necessary libraries and the dataset. To ensure accurate clustering, we handle missing values and scale the numeric features.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(cluster)
library(factoextra)

library(NbClust)

# Loading data
df <- read.csv("global air pollution dataset.csv")

# Data Cleaning: Removing NA values and selecting numeric columns for clustering
df_clean <- df %>% 
  na.omit() %>%
  select(AQI.Value, CO.AQI.Value, Ozone.AQI.Value, NO2.AQI.Value, PM2.5.AQI.Value)

# Scaling the data (Standardization: Mean=0, SD=1)
df_scaled <- scale(df_clean)

# Displaying first 5 rows of the prepared data
head(df_scaled, 5)

##    AQI.Value CO.AQI.Value Ozone.AQI.Value NO2.AQI.Value PM2.5.AQI.Value
## 1 -0.3748245   -0.2010668      0.02869492    -0.5830359     -0.31972430
## 2 -0.5532200   -0.2010668     -1.07455804    -0.3927086     -0.50221790
## 3 -0.1072312   -0.2010668      0.13546134    -0.2023814     -0.04598391
## 4 -0.6780968   -0.2010668     -0.04248269    -0.5830359     -0.88545445
## 5 -0.8921715   -0.7468994     -0.46954835    -0.5830359     -1.14094549
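
As a quick check of what scale() does, the sketch below standardizes a small matrix by hand and compares the result with scale(). The matrix toy and its values are made up for illustration and are not taken from the dataset.

```r
# Toy matrix with made-up values (not from the air pollution dataset)
toy <- cbind(AQI = c(20, 50, 80, 150), PM2.5 = c(10, 40, 90, 160))

# Manual standardization: subtract the column mean, divide by the column SD
manual <- sweep(sweep(toy, 2, colMeans(toy)), 2, apply(toy, 2, sd), "/")

# scale() does the same and stores the centers and SDs as attributes
scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(scaled)
col_sds <- apply(scaled, 2, sd)
```

Standardizing matters here because the overall AQI and PM2.5 values reach into the hundreds while the CO values stay in single digits. Without it, the large-valued columns would dominate the distances.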

Finding the Best Number of Clusters

The dataset has more than 23,000 rows, and computing the distance between every pair of cities takes too much memory, because the number of pairs grows with the square of the number of rows. To work around this, I use a random sample of 5,000 cities to find the best number of clusters. The sample is large enough to show the general pattern without running out of memory.
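
A rough back-of-the-envelope check makes the memory problem concrete. The sketch below assumes that the main cost is a full pairwise distance object as stored by dist(), with 8 bytes per value. The helper names dist_entries and dist_gb are hypothetical, and the row count of 23,463 is the sum of the cluster sizes reported later in this document.

```r
# Number of entries a full pairwise distance object (as stored by dist()) needs
dist_entries <- function(n) n * (n - 1) / 2

# Approximate size in gigabytes, assuming 8 bytes per double
dist_gb <- function(n) dist_entries(n) * 8 / 1e9

n_full <- 23463   # total rows (sum of the cluster counts in this report)
n_sample <- 5000  # sample size used for the Elbow and Silhouette plots

size_full <- dist_gb(n_full)      # roughly 2.2 GB
size_sample <- dist_gb(n_sample)  # roughly 0.1 GB
```

Under these assumptions the sample needs about 1/22 of the memory of the full data, which is consistent with the sample running where the full dataset did not.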

I will use the Elbow Method and the Silhouette Method on this sample.

# Setting a seed for reproducibility
set.seed(123)

# Taking a sample of 5000 rows to avoid memory errors
df_sample <- df_scaled[sample(nrow(df_scaled), 5000), ]

# Elbow Method
fviz_nbclust(df_sample, clara, method = "wss") +
  theme_minimal() +
  labs(title = "Elbow Method (on 5000 samples)", x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares")

# Silhouette Method
fviz_nbclust(df_sample, clara, method = "silhouette") +
  theme_minimal() +
  labs(title = "Silhouette Method (on 5000 samples)", x = "Number of Clusters (k)", y = "Average Silhouette Width")

From these two plots I chose k = 3. With k fixed, the CLARA algorithm is applied to the whole dataset, because CLARA clusters on small subsamples and does not need the full distance matrix.
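
To show what the two plots measure, the sketch below computes the total within-cluster sum of squares and the average silhouette width by hand for several values of k. It runs on synthetic data with three made-up groups, not on the air pollution sample. The helper total_wss is hypothetical, and its numbers are not guaranteed to match the ones fviz_nbclust() plots.

```r
library(cluster)

set.seed(123)
# Synthetic data: three made-up groups in two dimensions (illustration only)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)
d <- dist(toy)

# Total within-cluster sum of squares around each cluster's mean
total_wss <- function(x, labels) {
  sum(sapply(split(as.data.frame(x), labels), function(g) {
    sum(sweep(as.matrix(g), 2, colMeans(g))^2)
  }))
}

ks <- 2:6
wss <- numeric(length(ks))
avg_sil <- numeric(length(ks))
for (i in seq_along(ks)) {
  fit <- clara(toy, k = ks[i])
  wss[i] <- total_wss(toy, fit$clustering)
  # silhouette() returns one row per observation; average the widths
  avg_sil[i] <- mean(silhouette(fit$clustering, d)[, "sil_width"])
}

data.frame(k = ks, wss = wss, avg_sil = avg_sil)
```

In the Elbow plot one looks for the k after which wss stops dropping sharply. In the Silhouette plot one looks for the k with the largest avg_sil.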

Applying CLARA Clustering

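Now that we have decided on k = 3 clusters, we apply the CLARA algorithm to the full dataset. CLARA (Clustering Large Applications) is more efficient than standard PAM (k-medoids) on large datasets like ours. It draws several random subsamples, runs PAM on each one, assigns every row to its nearest medoid, and keeps the set of medoids with the lowest average dissimilarity over the whole dataset.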
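
The idea behind CLARA can be sketched in a few lines. This is a simplified illustration, not the implementation in the cluster package: the function clara_sketch is hypothetical, it uses Euclidean distance only, and it skips refinements of the real algorithm such as carrying the best medoids into later samples.

```r
library(cluster)

# Hypothetical helper: a simplified CLARA loop, for illustration only
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  best <- NULL
  for (s in seq_len(samples)) {
    # 1. Draw a random subsample and run PAM on it
    idx <- sample(nrow(x), min(sampsize, nrow(x)))
    medoids <- pam(x[idx, , drop = FALSE], k)$medoids  # k x p matrix

    # 2. Euclidean distance from every row of the full data to each medoid
    d <- sapply(seq_len(k), function(j) {
      sqrt(rowSums(sweep(x, 2, medoids[j, ])^2))
    })

    # 3. Cost = average distance of each row to its nearest medoid
    cost <- mean(apply(d, 1, min))

    # 4. Keep the medoids from the sample with the lowest cost
    if (is.null(best) || cost < best$cost) {
      best <- list(
        medoids = medoids,
        clustering = max.col(-d, ties.method = "first"),
        cost = cost
      )
    }
  }
  best
}

set.seed(123)
# Made-up data: three groups of 100 points each
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2),
  matrix(rnorm(200, mean = 12), ncol = 2)
)
res <- clara_sketch(toy, k = 3)
table(res$clustering)
```

PAM only ever sees sampsize rows at a time, so the expensive step stays small no matter how many rows the full dataset has.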

After clustering, I add the cluster labels back to the cleaned data (df_clean) so we can see which row belongs to which group. Note that df_clean holds only the numeric columns, so city names are not part of df_final.

# We use k = 3 based on our previous analysis
final_clusters <- clara(df_scaled, k = 3, samples = 50, pamLike = TRUE)

# Adding the cluster information to our cleaned original data
df_final <- df_clean %>%
  mutate(Cluster = as.factor(final_clusters$clustering))

# Showing the first few rows with the new Cluster column
head(df_final)

##   AQI.Value CO.AQI.Value Ozone.AQI.Value NO2.AQI.Value PM2.5.AQI.Value Cluster
## 1        51            1              36             0              51       1
## 2        41            1               5             1              41       1
## 3        66            1              39             2              66       1
## 4        34            1              34             0              20       1
## 5        22            0              22             0               6       1
## 6        54            1              14            11              54       2
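
Since df_final keeps only the numeric columns, the labels are not yet tied to city names. One possible way to attach them is sketched below on made-up rows. It assumes the raw table has a City column (the raw columns are not shown in this report) and that the rows are in the same order as the rows passed to clara().

```r
library(cluster)

# Made-up rows; the City column and all values are invented for illustration
toy <- data.frame(
  City = c("A", "B", "C", "D", "E", "F"),
  AQI.Value = c(20, 25, 30, 150, 160, 170),
  PM2.5.AQI.Value = c(18, 22, 28, 140, 155, 165)
)

# Cluster on the scaled numeric columns only, as in the report
features <- scale(toy[, c("AQI.Value", "PM2.5.AQI.Value")])
fit <- clara(features, k = 2)

# Rows keep their order, so the labels can be attached to the full table
toy$Cluster <- factor(fit$clustering)
toy[, c("City", "Cluster")]
```

If rows were dropped during cleaning, the same filter has to be applied to the table with the city names before the labels are attached, otherwise the rows will not line up.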

Visualizing the Clusters

Now we show the clusters on a graph. Since there are five air quality variables, we cannot plot them all at once. fviz_cluster() therefore projects the data onto its first two principal components (PCA) and draws the cities in that 2D space. Each color represents one of the three clusters we found.

# Visualizing the clustering results
fviz_cluster(final_clusters, 
             geom = "point", 
             ellipse.type = "convex", 
             ggtheme = theme_minimal()) +
  labs(title = "Air Quality Cluster Map", 
       subtitle = "Visualization of cities in 2D space")
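
The projection behind this kind of plot can be reproduced with prcomp(). The sketch below uses made-up data with five numeric columns as a stand-in for the five AQI features. It illustrates the general technique and is not a copy of what fviz_cluster() does internally.

```r
set.seed(123)
# Made-up data with five numeric columns (stand-in for the five AQI features)
toy <- matrix(rnorm(200 * 5), ncol = 5)
toy[, 2] <- toy[, 1] + rnorm(200, sd = 0.3)  # make two columns correlated

# Principal component analysis on standardized columns
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Coordinates of every row on the first two components (the 2D plot axes)
coords <- pca$x[, 1:2]

# Share of the total variance captured by each of the first two components
var_share <- pca$sdev[1:2]^2 / sum(pca$sdev^2)
```

The value sum(var_share) tells how much of the original variation the 2D picture keeps. When it is low, clusters that overlap in the plot may still be well separated in the full five-dimensional space.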

Cluster Profiling

To understand what each cluster represents, I calculate the average values of the pollution indicators for each group. This helps me identify which cluster is “Clean” and which one is “Highly Polluted.”

# Calculating the mean of each variable for each cluster
cluster_profile <- df_final %>%
  group_by(Cluster) %>%
  summarise(
    Count = n(),
    Avg_AQI = mean(AQI.Value),
    Avg_CO = mean(CO.AQI.Value),
    Avg_Ozone = mean(Ozone.AQI.Value),
    Avg_PM2.5 = mean(PM2.5.AQI.Value)
  )

# Display the profile table
cluster_profile

## # A tibble: 3 × 6
##   Cluster Count Avg_AQI Avg_CO Avg_Ozone Avg_PM2.5
##   <fct>   <int>   <dbl>  <dbl>     <dbl>     <dbl>
## 1 1       16929    49.6  0.901      31.5      46.2
## 2 2        3639    88.4  2.56       17.7      88.4
## 3 3        2895   182.   2.60       79.0     174.

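The table above shows clear differences. Cluster 3 has the highest Avg_AQI (about 182) and Avg_PM2.5 (about 174), so it is the most polluted group. Cluster 1 has the lowest values (49.6 and 46.2) and represents cities with better air quality. Cluster 2 lies in between on AQI and PM2.5 but has the lowest Avg_Ozone (17.7).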
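
The profile code in this section lists each column by hand and leaves out NO2.AQI.Value, although that column was used for clustering. A possible way to average every numeric column at once is sketched below in base R. The rows in toy are made up, and only the column names follow the report.

```r
# Made-up labeled rows (values invented; column names follow the report)
toy <- data.frame(
  Cluster = factor(c(1, 1, 2, 2, 3, 3)),
  AQI.Value = c(40, 50, 85, 95, 170, 190),
  NO2.AQI.Value = c(1, 3, 8, 12, 4, 6),
  PM2.5.AQI.Value = c(35, 45, 85, 90, 165, 180)
)

# Mean of every remaining column per cluster
profile <- aggregate(. ~ Cluster, data = toy, FUN = mean)

# Number of rows per cluster, added as a Count column
profile$Count <- as.vector(table(toy$Cluster))
profile
```

Averaging all columns in one call avoids silently dropping a feature when the list of clustering variables changes.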

Conclusion

In this project, I used the CLARA clustering algorithm to analyze global air pollution. By looking at the cluster profiles, we can conclude the following:

Cluster 1: Cities with low pollution (average AQI about 50). This is the largest group, with 16,929 rows.
Cluster 2: Cities with moderate pollution (average AQI about 88).
Cluster 3: Cities with high pollution, with the highest average AQI (about 182), PM2.5 (about 174), and Ozone (about 79).

The visualization showed a clear separation between these groups. This analysis is useful because, once the cluster labels are linked back to city and country names, it can help identify which regions in the world need more environmental protection. Unsupervised learning allowed us to see these patterns without any predefined labels.

AI Usage Statement

This project was conceptually designed and executed by me, with specific technical support from AI (Large Language Model) in the following areas:

Problem Solving: I encountered a memory allocation error during the clustering of the full dataset (23,000+ rows). I consulted AI to find a workaround, which led to the implementation of the sampling method for the Elbow/Silhouette tests.

Methodology Refinement: While I decided to use CLARA for its efficiency with large data, AI assisted in providing the specific R syntax for the cluster and factoextra libraries.

Data Transformation: I guided the data cleaning process (selecting numeric AQI values and scaling), and AI provided the dplyr code blocks to perform these operations efficiently.

Quality Control: AI was used as a proofreading tool to ensure the English explanations were grammatically correct and maintained a professional tone.

I personally interpreted all statistical outputs, including the determination of the optimal k=3 from the plots and the characterization of the final cluster profiles.

Clustering Global Cities Based on Air Pollution Profiles

Enes Pircek

2026-02-22