Load packages

library(factoextra)
library(cluster)
library(ggplot2)
library(gridExtra)
library(fpc)
library(scales)

Introduction

Clustering is an unsupervised machine learning technique used to group data into clusters based on similarity, where items within the same cluster are similar and distinct from those in other clusters. This technique helps uncover natural groupings in data, providing insights, enabling anomaly detection, and simplifying complex datasets.

The primary objective of this project is to apply three non-hierarchical clustering algorithms to a pollution dataset in order to segment air quality. The dataset originally distinguishes four levels of air quality: Good, Moderate, Poor, and Hazardous. This project evaluates how well this four-way division is supported by clustering analysis.

Dataset

The Air Quality and Pollution dataset comes from: https://www.kaggle.com/datasets/mujtabamatin/air-quality-and-pollution-assessment/data It contains information about air pollution across various regions of South East Asia. There are 5,000 observations, each described by the following 9 variables plus a categorical air quality label:

  • Temperature (°C): Average temperature of the region.
  • Humidity (%): Relative humidity recorded in the region.
  • PM2.5 Concentration (µg/m³): Fine particulate matter levels.
  • PM10 Concentration (µg/m³): Coarse particulate matter levels.
  • NO2 Concentration (ppb): Nitrogen dioxide levels.
  • SO2 Concentration (ppb): Sulfur dioxide levels.
  • CO Concentration (ppm): Carbon monoxide levels.
  • Proximity to Industrial Areas (km): Distance to the nearest industrial zone.
  • Population Density (people/km²): Number of people per square kilometer in the region.
# Loading of the data
dane <- read.csv("pollution.csv")

# Summary of the data

head(dane)
##   Temperature Humidity PM2.5 PM10  NO2  SO2   CO Proximity_to_Industrial_Areas
## 1        29.8     59.1   5.2 17.9 18.9  9.2 1.72                           6.3
## 2        28.3     75.6   2.3 12.2 30.8  9.7 1.64                           6.0
## 3        23.1     74.7  26.7 33.8 24.4 12.6 1.63                           5.2
## 4        27.1     39.1   6.1  6.3 13.5  5.3 1.15                          11.1
## 5        26.5     70.7   6.9 16.0 21.9  5.6 1.01                          12.7
## 6        39.4     96.6  14.6 35.5 42.9 17.9 1.82                           3.1
##   Population_Density Air.Quality
## 1                319    Moderate
## 2                611    Moderate
## 3                619    Moderate
## 4                551        Good
## 5                303        Good
## 6                674   Hazardous
str(dane)
## 'data.frame':    5000 obs. of  10 variables:
##  $ Temperature                  : num  29.8 28.3 23.1 27.1 26.5 39.4 41.7 31 29.4 33.2 ...
##  $ Humidity                     : num  59.1 75.6 74.7 39.1 70.7 96.6 82.5 59.6 93.8 80.5 ...
##  $ PM2.5                        : num  5.2 2.3 26.7 6.1 6.9 14.6 1.7 5 10.3 11.1 ...
##  $ PM10                         : num  17.9 12.2 33.8 6.3 16 35.5 15.8 16.8 22.7 24.4 ...
##  $ NO2                          : num  18.9 30.8 24.4 13.5 21.9 42.9 31.1 24.2 45.1 32 ...
##  $ SO2                          : num  9.2 9.7 12.6 5.3 5.6 17.9 12.7 13.6 11.8 15.3 ...
##  $ CO                           : num  1.72 1.64 1.63 1.15 1.01 1.82 1.8 1.38 2.03 1.69 ...
##  $ Proximity_to_Industrial_Areas: num  6.3 6 5.2 11.1 12.7 3.1 4.6 6.3 5.4 4.9 ...
##  $ Population_Density           : int  319 611 619 551 303 674 735 443 486 535 ...
##  $ Air.Quality                  : chr  "Moderate" "Moderate" "Moderate" "Good" ...
dim(dane)
## [1] 5000   10
# We remove the categorical variable assessing air quality
dane <- dane[, -10]

head(dane)
##   Temperature Humidity PM2.5 PM10  NO2  SO2   CO Proximity_to_Industrial_Areas
## 1        29.8     59.1   5.2 17.9 18.9  9.2 1.72                           6.3
## 2        28.3     75.6   2.3 12.2 30.8  9.7 1.64                           6.0
## 3        23.1     74.7  26.7 33.8 24.4 12.6 1.63                           5.2
## 4        27.1     39.1   6.1  6.3 13.5  5.3 1.15                          11.1
## 5        26.5     70.7   6.9 16.0 21.9  5.6 1.01                          12.7
## 6        39.4     96.6  14.6 35.5 42.9 17.9 1.82                           3.1
##   Population_Density
## 1                319
## 2                611
## 3                619
## 4                551
## 5                303
## 6                674
# Check if the data is complete: share of missing values per column
missing_in_cols <- sapply(dane, function(x) sum(is.na(x))/nrow(dane))
percent(missing_in_cols)
##                   Temperature                      Humidity 
##                          "0%"                          "0%" 
##                         PM2.5                          PM10 
##                          "0%"                          "0%" 
##                           NO2                           SO2 
##                          "0%"                          "0%" 
##                            CO Proximity_to_Industrial_Areas 
##                          "0%"                          "0%" 
##            Population_Density 
##                          "0%"

Because the variables are measured on different scales, we normalize (standardize) the data.

dane.s <- as.data.frame(lapply(dane, scale))
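As a quick sanity check (a minimal sketch, output not shown here), each standardized column should have a mean of approximately 0 and a standard deviation of 1:

# Verify the standardization: column means ~0 and standard deviations ~1
round(colMeans(dane.s), 3)
round(apply(dane.s, 2, sd), 3)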

Assessing the clustering tendency with the Hopkins statistic.

# Hopkins statistic
# Checking whether the dataset is suitable for clustering using the Hopkins statistic
get_clust_tendency(dane, 2, graph = TRUE, gradient = list(low = "red", mid = "white", high = "blue"), seed = 123)
## $hopkins_stat
## [1] 0.9533171
## 
## $plot

The Hopkins statistic is above 0.95, which indicates that the dataset has a high clustering tendency (values close to 1 suggest highly clusterable data, while values around 0.5 indicate roughly uniform, structureless data).
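The check above was run on the raw data with only two sampled points. A complementary sketch (an assumption, not part of the original analysis) would be to repeat it on the standardized data with a larger random sample:

# Hopkins statistic on the standardized data with a larger sample (sketch; n = 100 is an assumption)
get_clust_tendency(dane.s, n = 100, graph = FALSE, seed = 123)$hopkins_stat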

The choice of the number of clusters “k”

The first method used to estimate the appropriate number of clusters "k" is the average silhouette score.

a <- fviz_nbclust(dane.s, kmeans, method = "silhouette") + ggtitle("kmeans")
b <- fviz_nbclust(dane.s, pam, method = "silhouette") + ggtitle("pam")
c <- fviz_nbclust(dane.s, cluster::clara, method = "silhouette") + ggtitle("clara")
grid.arrange(a, b, c, ncol = 2, top = "Optimal number of clusters")

According to the silhouette analysis, all three methods suggest 2 clusters.

Another method for determining "k" is the within-cluster sum of squares (WSS) statistic, also known as the elbow method.

d <- fviz_nbclust(dane.s, kmeans, method = "wss") + ggtitle("kmeans")
e <- fviz_nbclust(dane.s, pam, method = "wss") + ggtitle("pam")
f <- fviz_nbclust(dane.s, cluster::clara, method = "wss") + ggtitle("clara")
grid.arrange(d, e, f, ncol = 2, top = "Optimal number of clusters")

The elbow method confirms the previous results from the silhouette score. However, based on the k-means plot, we will also consider 4 clusters.
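For reference, the quantity behind the elbow plot can also be computed directly. The sketch below (not part of the original analysis; nstart = 10 is an added assumption for stability) extracts the total within-cluster sum of squares for several values of k:

# Total within-cluster sum of squares for k = 1..8 (the values plotted by the elbow method)
wss <- sapply(1:8, function(k) kmeans(dane.s, centers = k, nstart = 10)$tot.withinss)
round(wss, 1)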

Clustering and plots

# k-means on the standardized data (results depend on the random initialization)
km2 <- kmeans(dane.s, 2)
km4 <- kmeans(dane.s, 4)
# PAM via factoextra::eclust, applied to the original (unscaled) data
pam <- eclust(dane, "pam", k = 2)
# CLARA via factoextra::eclust, applied to the original (unscaled) data
clara <- eclust(dane, "clara", k = 2)
# Clustering plots
c1 <- fviz_cluster(km2, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle("K-means with 2 clusters")
c2 <- fviz_cluster(km4, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle("K-means with 4 clusters")
c3 <- fviz_cluster(pam, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle("PAM with 2 clusters")
c4 <- fviz_cluster(clara, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle("CLARA with 2 clusters")

grid.arrange(arrangeGrob(c1, c2, c3, c4, ncol = 2, top = "Clustering"))

In all algorithms, the resulting clusters lie close to each other. At first sight, k-means with 2 clusters looks the most appropriate: the two groups do not overlap. In the other methods, especially PAM and CLARA, it is hard to tell which cluster many observations belong to.

Clustering quality

To confirm our initial impressions, we will check the clustering quality with two methods. Calinski-Harabasz Index: the Calinski-Harabasz index evaluates clustering quality through the ratio of between-cluster variance to within-cluster variance, so it accounts for both cluster compactness and separation. Higher values indicate better-defined clusters.
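For transparency, the index can also be computed by hand. The sketch below (not part of the original analysis) implements CH = (BSS / (k - 1)) / (WSS / (n - k)) and should agree with fpc::calinhara() up to rounding:

# Manual Calinski-Harabasz index: ratio of between- to within-cluster variance (sketch)
ch_index <- function(X, cl) {
  X <- as.matrix(X)
  n <- nrow(X)
  k <- length(unique(cl))
  overall <- colMeans(X)
  wss <- 0
  bss <- 0
  for (g in unique(cl)) {
    Xg <- X[cl == g, , drop = FALSE]
    cg <- colMeans(Xg)
    wss <- wss + sum(sweep(Xg, 2, cg)^2)           # within-cluster sum of squares
    bss <- bss + nrow(Xg) * sum((cg - overall)^2)  # between-cluster sum of squares
  }
  (bss / (k - 1)) / (wss / (n - k))
}
# ch_index(dane.s, km2$cluster)  # compare with calinhara(dane.s, km2$cluster)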

#Calinski-Harabasz index
round(calinhara(dane.s, km2$cluster),digits=2) #fpc::calinhara()
## [1] 3196.44
round(calinhara(dane.s, km4$cluster),digits=2) #fpc::calinhara()
## [1] 2230.43
round(calinhara(dane.s, pam$cluster),digits=2) #fpc::calinhara()
## [1] 1107.92
round(calinhara(dane.s, clara$cluster),digits=2) #fpc::calinhara()
## [1] 1126.47

The index values support our earlier observations: k-means with 2 clusters performed best, with a large advantage over the other clustering methods.

Silhouette Score: The silhouette score evaluates clustering by measuring both cluster separation and compactness. For each observation i it is defined as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in the same cluster and b(i) is the mean distance to the points in the nearest other cluster. It ranges from -1 to 1:

  • +1: the observation is well matched to its own cluster and poorly matched to others.
  • 0: the observation lies between overlapping clusters.
  • -1: the observation is likely misclassified.

The score is calculated per observation and averaged, which also helps identify the optimal number of clusters.

#Silhouette Index

sil_km2 <- silhouette(km2$cluster, dist(dane.s))
v1 <- fviz_silhouette(sil_km2) +
  ggtitle("Silhouette for K-means with 2 clusters")
##   cluster size ave.sil.width
## 1       1 2118          0.18
## 2       2 2882          0.46
sil_km4 <- silhouette(km4$cluster, dist(dane.s))
v2 <- fviz_silhouette(sil_km4) +
  ggtitle("Silhouette for K-means with 4 clusters")
##   cluster size ave.sil.width
## 1       1 1642          0.21
## 2       2  366          0.14
## 3       3  924          0.15
## 4       4 2068          0.39
sil_pam <- silhouette(pam$clustering, dist(dane.s))
v3<-fviz_silhouette(sil_pam) +
  ggtitle("Silhouette for PAM")
##   cluster size ave.sil.width
## 1       1 2402          0.31
## 2       2 2598          0.02
sil_clara <- silhouette(clara$clustering, dist(dane.s))
v4<- fviz_silhouette(sil_clara) +
  ggtitle("Silhouette for CLARA")
##   cluster size ave.sil.width
## 1       1 2573          0.29
## 2       2 2427          0.03
grid.arrange(arrangeGrob(v1, v2, v3, v4, ncol = 2, top = "Silhouette plots"))
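To summarize the silhouette results in a single number per solution, a short sketch (using only the objects already created above) averages the per-point silhouette widths:

# Overall average silhouette width for each clustering solution (sketch)
sapply(list(kmeans_k2 = sil_km2, kmeans_k4 = sil_km4, pam = sil_pam, clara = sil_clara),
       function(s) round(mean(s[, "sil_width"]), 3))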

In this last test, k-means again proved the most effective. Although the difference between the 2-cluster and 4-cluster solutions was not large, the 2-cluster solution still came out ahead.

The optimal algorithm

K-means with 2 clusters

Both clusters achieved positive silhouette scores, with one cluster reaching nearly 0.5. This confirms the conclusions drawn from the cluster visualizations. The strength of this solution is that points within each cluster lie close to one another; unfortunately, they are also close to the neighbouring cluster, which makes the cluster shapes less than ideal.
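To interpret what the two clusters represent, a possible follow-up sketch (not part of the original analysis) compares the cluster means on the original variable scales:

# Mean of each variable per k-means cluster, on the original (unscaled) data
aggregate(dane, by = list(cluster = km2$cluster), FUN = mean)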

Conclusions

After loading the data, we applied three different clustering algorithms: K-Means, PAM, and CLARA. We then visualized the created clusters, which led to our initial conclusions. To support our final decision, we calculated two statistical metrics: the Calinski-Harabasz Index and the Silhouette Score. Both metrics favored K-Means with two clusters. While this solution is not statistically perfect, it is more accurate than the one with four clusters.

The final outcome indicates that k-means with two clusters is the best clustering solution for this dataset, which does not support the original four-level segmentation of air quality.