Clustering is an unsupervised machine learning technique that groups data into clusters based on similarity, so that items within the same cluster resemble one another and differ from items in other clusters. It helps uncover natural groupings in data, providing insights, enabling anomaly detection, and simplifying complex datasets.
The primary objective of this project is to apply three non-hierarchical clustering algorithms to a pollution dataset in order to segment air quality. The dataset originally distinguishes four levels of air quality: Good, Moderate, Poor, and Hazardous. This project evaluates how plausible that division is by performing a clustering analysis.
The Air Quality and Pollution dataset comes from https://www.kaggle.com/datasets/mujtabamatin/air-quality-and-pollution-assessment/data. It contains information about air pollution across various regions of South East Asia. There are 5000 observations described by 9 numeric variables plus the Air.Quality label:
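The import step is not shown in the report; a minimal sketch, assuming base R and the packages used later in the analysis (the CSV file name is an assumption and should match the file downloaded from Kaggle):
library(factoextra)   # get_clust_tendency, fviz_nbclust, fviz_cluster, fviz_silhouette
library(cluster)      # pam, clara, silhouette
library(gridExtra)    # grid.arrange, arrangeGrob
library(scales)       # percent
# file name assumed; adjust to the downloaded Kaggle CSV
pollution <- read.csv("pollution_dataset.csv")
head(pollution)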
## Temperature Humidity PM2.5 PM10 NO2 SO2 CO Proximity_to_Industrial_Areas
## 1 29.8 59.1 5.2 17.9 18.9 9.2 1.72 6.3
## 2 28.3 75.6 2.3 12.2 30.8 9.7 1.64 6.0
## 3 23.1 74.7 26.7 33.8 24.4 12.6 1.63 5.2
## 4 27.1 39.1 6.1 6.3 13.5 5.3 1.15 11.1
## 5 26.5 70.7 6.9 16.0 21.9 5.6 1.01 12.7
## 6 39.4 96.6 14.6 35.5 42.9 17.9 1.82 3.1
## Population_Density Air.Quality
## 1 319 Moderate
## 2 611 Moderate
## 3 619 Moderate
## 4 551 Good
## 5 303 Good
## 6 674 Hazardous
## 'data.frame': 5000 obs. of 10 variables:
## $ Temperature : num 29.8 28.3 23.1 27.1 26.5 39.4 41.7 31 29.4 33.2 ...
## $ Humidity : num 59.1 75.6 74.7 39.1 70.7 96.6 82.5 59.6 93.8 80.5 ...
## $ PM2.5 : num 5.2 2.3 26.7 6.1 6.9 14.6 1.7 5 10.3 11.1 ...
## $ PM10 : num 17.9 12.2 33.8 6.3 16 35.5 15.8 16.8 22.7 24.4 ...
## $ NO2 : num 18.9 30.8 24.4 13.5 21.9 42.9 31.1 24.2 45.1 32 ...
## $ SO2 : num 9.2 9.7 12.6 5.3 5.6 17.9 12.7 13.6 11.8 15.3 ...
## $ CO : num 1.72 1.64 1.63 1.15 1.01 1.82 1.8 1.38 2.03 1.69 ...
## $ Proximity_to_Industrial_Areas: num 6.3 6 5.2 11.1 12.7 3.1 4.6 6.3 5.4 4.9 ...
## $ Population_Density : int 319 611 619 551 303 674 735 443 486 535 ...
## $ Air.Quality : chr "Moderate" "Moderate" "Moderate" "Good" ...
## [1] 5000 10
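The output below shows the data with the Air.Quality label removed before clustering; the code for that step is not printed. A minimal sketch, assuming the raw data frame is called pollution (as in the loading sketch above):
# drop the Air.Quality label so that only the nine numeric variables enter the clustering
dane <- pollution[ , names(pollution) != "Air.Quality"]
head(dane)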
## Temperature Humidity PM2.5 PM10 NO2 SO2 CO Proximity_to_Industrial_Areas
## 1 29.8 59.1 5.2 17.9 18.9 9.2 1.72 6.3
## 2 28.3 75.6 2.3 12.2 30.8 9.7 1.64 6.0
## 3 23.1 74.7 26.7 33.8 24.4 12.6 1.63 5.2
## 4 27.1 39.1 6.1 6.3 13.5 5.3 1.15 11.1
## 5 26.5 70.7 6.9 16.0 21.9 5.6 1.01 12.7
## 6 39.4 96.6 14.6 35.5 42.9 17.9 1.82 3.1
## Population_Density
## 1 319
## 2 611
## 3 619
## 4 551
## 5 303
## 6 674
# Check that the data set is complete (no missing values)
missing_in_cols <- sapply(dane, function(x) sum(is.na(x))/nrow(dane))
percent(missing_in_cols)  # percent() from the scales package
## Temperature Humidity
## "0%" "0%"
## PM2.5 PM10
## "0%" "0%"
## NO2 SO2
## "0%" "0%"
## CO Proximity_to_Industrial_Areas
## "0%" "0%"
## Population_Density
## "0%"
Because the variables are measured on different scales, the data is normalized before clustering.
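The scaled data set dane.s used in the following steps is not created in the printed code; a minimal sketch using base R standardization:
# standardize every variable to mean 0 and standard deviation 1
dane.s <- scale(dane)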
Next, the clustering tendency is assessed with the Hopkins statistic.
# Hopkins statistic
# Checking whether the dataset is suitable for clustering using the Hopkins statistic
get_clust_tendency(dane, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9533171
##
## $plot
The test yielded a value above 0.95, which indicates that our dataset has a high clustering tendency.
The first method used to determine the appropriate number of clusters k is the silhouette score.
a <- fviz_nbclust(dane.s, kmeans, method = "silhouette") + ggtitle("kmeans")
b <- fviz_nbclust(dane.s, pam, method = "silhouette") + ggtitle("pam")
c <- fviz_nbclust(dane.s, cluster::clara, method = "silhouette") + ggtitle("clara")
grid.arrange(a, b, c, ncol=2, top = "Optimal number of clusters")
According to the silhouette analysis, all three methods should be run with 2 clusters.
Another method to determine k is the within-cluster sum of squares (WSS) statistic, also known as the elbow method.
d <- fviz_nbclust(dane.s, kmeans, method = "wss") + ggtitle("kmeans")
e <- fviz_nbclust(dane.s, pam, method = "wss") + ggtitle("pam")
f <- fviz_nbclust(dane.s, cluster::clara, method = "wss") + ggtitle("clara")
grid.arrange(d, e, f, ncol=2, top = "Optimal number of clusters")
The elbow method confirms our previous results from the silhouette score. However, based on the plot for the k-means method, we will also consider 4 clusters.
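The fitted cluster objects used below (km2, km4, pam, clara) are not defined in the printed code; a minimal sketch of how they could be obtained with the stats and cluster packages (the seed and nstart values are assumptions, chosen only for reproducibility):
set.seed(123)                                        # assumed seed
km2   <- kmeans(dane.s, centers = 2, nstart = 25)    # k-means with 2 clusters
km4   <- kmeans(dane.s, centers = 4, nstart = 25)    # k-means with 4 clusters
pam   <- cluster::pam(dane.s, k = 2)                 # PAM with 2 clusters
clara <- cluster::clara(dane.s, k = 2)               # CLARA with 2 clusters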
# Clustering plots
c1 <- fviz_cluster(km2, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle('K-means with 2 clusters')
c2 <- fviz_cluster(km4, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle('K-means with 4 clusters')
c3 <- fviz_cluster(pam, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle('PAM with 2 clusters')
c4 <- fviz_cluster(clara, data = dane.s, geom = c("point"), ellipse.type = "convex") + ggtitle('CLARA with 2 clusters')
grid.arrange(arrangeGrob(c1, c2, c3, c4, ncol=2, top = "Clustering"))
In all algorithms the resulting clusters lie very close to each other. At first sight, k-means with 2 clusters looks the most appropriate: the separated groups do not overlap. In the other methods, especially PAM and CLARA, it is hard to tell which cluster an observation belongs to.
To verify our initial impressions, we assess the clustering quality with two measures. Calinski-Harabasz index: the Calinski-Harabasz index evaluates clustering quality using within-cluster and between-cluster variance. Higher values indicate better-defined clusters; the index accounts for both cluster compactness and separation.
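The code behind the four index values below is not shown in the report; a minimal sketch, assuming the fpc package and the cluster objects fitted above, with the output order inferred from the discussion that follows:
library(fpc)
# Calinski-Harabasz index, presumably reported in this order:
# k-means (k = 2), k-means (k = 4), PAM (k = 2), CLARA (k = 2)
calinhara(dane.s, km2$cluster)
calinhara(dane.s, km4$cluster)
calinhara(dane.s, pam$clustering)
calinhara(dane.s, clara$clustering)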
## [1] 3196.44
## [1] 2230.43
## [1] 1107.92
## [1] 1126.47
The index values largely confirm our previous results: k-means performed best, with a clear advantage over the other clustering methods.
Silhouette Score: The silhouette score evaluates clustering by measuring cluster separation and compactness, ranging from -1 to 1:
+1: Well-matched to its cluster, poorly matched to others.
0: Overlapping clusters.
-1: Likely misclassified.
The score is calculated per data point and averaged, which also helps identify the optimal number of clusters.
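For a single observation i, the silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other points in its own cluster and b(i) is the average distance from i to the points in the nearest other cluster.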
# Silhouette index
sil_km2 <- silhouette(km2$cluster, dist(dane.s))
v1 <- fviz_silhouette(sil_km2) +
  ggtitle("Silhouette for K-means with 2 clusters")
## cluster size ave.sil.width
## 1 1 2118 0.18
## 2 2 2882 0.46
sil_km4 <- silhouette(km4$cluster, dist(dane.s))
v2 <- fviz_silhouette(sil_km4) +
  ggtitle("Silhouette for K-means with 4 clusters")
## cluster size ave.sil.width
## 1 1 1642 0.21
## 2 2 366 0.14
## 3 3 924 0.15
## 4 4 2068 0.39
sil_pam <- silhouette(pam$clustering, dist(dane.s))
v3 <- fviz_silhouette(sil_pam) +
  ggtitle("Silhouette for PAM")
## cluster size ave.sil.width
## 1 1 2402 0.31
## 2 2 2598 0.02
sil_clara <- silhouette(clara$clustering, dist(dane.s))
v4 <- fviz_silhouette(sil_clara) +
  ggtitle("Silhouette for CLARA")
## cluster size ave.sil.width
## 1 1 2573 0.29
## 2 2 2427 0.03
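The four silhouette plots stored in v1 to v4 are not displayed in the printed output; they could be arranged together in the same way as the earlier figures, for example:
grid.arrange(v1, v2, v3, v4, ncol = 2, top = "Silhouette plots")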
In this last test, k-means was again the most efficient. Although the difference between the 2- and 4-cluster solutions was not large, the 2-cluster solution still had the advantage.
K-means with 2 clusters
Both clusters achieved positive silhouette scores, with one cluster reaching nearly 0.5. This confirms the conclusions we drew from the clustering visualizations. The strength of this solution is that points within each cluster lie very close to one another; unfortunately, they also lie close to the neighbouring cluster, which makes the cluster shapes less than ideal.
After loading the data, we applied three different clustering algorithms: K-Means, PAM, and CLARA. We then visualized the created clusters, which led to our initial conclusions. To support our final decision, we calculated two statistical metrics: the Calinski-Harabasz Index and the Silhouette Score. Both metrics favored K-Means with two clusters. While this solution is not statistically perfect, it is more accurate than the one with four clusters.
The final outcome indicates that k-means with two clusters is the best clustering solution for our dataset, which speaks against the original four-level segmentation of air quality.