In recent years, with the rapid development of the Chinese economy and the acceleration of urbanization, air pollution has become an increasingly serious problem that threatens public health and the ecological environment. At the same time, there are significant differences in the sources and characteristics of air pollution in different cities.
This study aims to categorize and compare the air pollution conditions of 28 cities in China through cluster analysis methods. In order to better understand the air pollution levels and characteristics of cities, I will use several clustering methods, including partitioning clustering K-means and hierarchical clustering.
The data comes from China PM2.5 Historical Data, which includes PM2.5
and weather information data of 367 cities in China.
Source: https://www.aqistudy.cn/historydata/
### Dataset Variables
City
PM2.5
PM10
CO
SO2
NO2
O3
# Load necessary libraries
library(ggplot2)
library(ape)
library(cluster)
library(factoextra)
library(naniar)
library(corrplot)
# Load the dataset
data <- read.csv("E:/Rstata/28cities(average).csv", sep = ",", dec = ".", header = TRUE)
head(data)
## City PM2.5 PM10 CO SO2 NO2 O3
## 1 Beijing 49.85821 75.64179 813.5821 7.149254 36.46269 95.01493
## 2 Shanghai 36.91791 53.23134 716.1716 10.171642 38.31343 99.30597
## 3 Guangzhou 30.85821 50.74627 822.7239 9.343284 39.88060 93.93284
## 4 Shenzhen 23.63433 41.23881 712.5896 6.656716 26.37313 84.57463
## 5 Hangzhou 40.46269 66.91045 779.3731 9.776119 39.11940 95.90299
## 6 Tianjin 54.05224 89.13433 1057.9851 16.843284 41.70149 95.70896
There are no missing values in the Dataset.
vis_miss(data)
First, we need to preprocess the data by Min-Max Scaling each column and adding a small constant 1e-8 to prevent division by zero.
#(Min-Max Scaling)
data_normalized <- data
data_normalized[, 2:7] <- apply(data[, 2:7], 2, function(x) (x - min(x)) / (max(x) - min(x))+ 1e-8)
data_normalized[, 2:7]
## PM2.5 PM10 CO SO2 NO2 O3
## 1 0.82866473 0.58553036 0.38816930 0.01766596 0.75598195 0.59112151
## 2 0.53067538 0.32250154 0.23841539 0.12607067 0.81196389 0.71325405
## 3 0.39113251 0.29333451 0.40222343 0.09635975 0.85936796 0.56032286
## 4 0.22478090 0.18174653 0.23290847 0.00000001 0.45079008 0.29396772
## 5 0.61230453 0.48305160 0.33557818 0.11188438 0.83634313 0.61639763
## 6 0.92524490 0.74389070 0.76390213 0.36536404 0.91444696 0.61087512
## 7 0.81113594 0.61224491 0.44424814 0.08217346 0.89458240 0.46559049
## 8 0.62948961 0.55627574 0.37133877 0.17987153 0.82167044 0.82412915
## 9 0.99690669 0.94832269 0.88738342 0.27676661 1.00000001 0.37829228
## 10 0.82144699 0.58903391 0.59922216 0.20904712 0.92844245 0.59048429
## 11 0.86750302 0.74248928 0.58045272 0.16648823 0.66139956 0.33687342
## 12 0.65440798 0.52728388 0.31367669 0.41916489 0.69774267 0.23258285
## 13 0.81577592 0.45476045 0.42302354 0.15364027 0.60406322 0.38190315
## 14 0.89310879 0.82018044 0.79035831 0.16782656 0.93318285 0.04694139
## 15 0.91596495 0.91740388 0.87525671 1.00000001 0.93724606 0.55947325
## 16 0.44045370 0.34676361 0.47267764 0.13062099 0.51760723 0.11172473
## 17 0.26568140 0.23815364 0.32910754 0.19807282 0.44920994 0.28122346
## 18 0.35555938 0.26548131 0.12458269 0.18308352 0.31602710 0.00000001
## 19 0.56607666 0.60199703 1.00000001 0.55487153 0.72528218 0.35216654
## 20 1.00000001 1.00000001 0.60362770 0.64453962 0.90383748 1.00000001
## 21 0.50214815 0.64675485 0.73659696 0.50910065 0.75575622 0.41588786
## 22 0.78553017 0.58842079 0.36860825 0.12901500 0.75823929 0.39825829
## 23 0.81268260 0.57712185 0.42592615 0.61001072 0.77404064 0.02060324
## 24 0.00000001 0.00000001 0.12302239 0.05701286 0.00000001 0.24150383
## 25 0.28097612 0.41569590 0.11376388 0.37312635 0.27494358 0.69350044
## 26 0.53789312 0.56529737 0.24731825 0.28667025 0.65395035 0.68309262
## 27 0.46571577 0.38740476 0.28804654 0.32467881 0.47516931 0.60450298
## 28 0.03866645 0.15424368 0.00000001 0.03131693 0.21580136 0.72387427
As we can see we don’t have missing values. This means that we can move forward.
colSums(is.na(data_normalized))
## City PM2.5 PM10 CO SO2 NO2 O3
## 0 0 0 0 0 0 0
vis_miss(data_normalized)
From this ordered dissimilarity plot, it appears that there is some structure and pattern in the data. Often, distinct chunks or patterns of color in the plot can indicate potential clustering structures.
d<-dist(data_normalized[,2:7])
fviz_dist(d, show_labels = FALSE)+ labs(title = "28 Cities of Pollution Data")
corrplot(cor(data_normalized[,2:7], use="complete"), method="number", type="upper", diag=T, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="cars74", mar=c(0,0,1,0))
Let’s examine the average Silhouette and Elbow graphs. For the average Silhouette graph, the optimal number of clusters is where the graph peaks, and for the Elbow graph, it’s at the lower peak.
# extract the average silhouette width
silhouette_score <- function(k) {
km <- kmeans(data_normalized[, 2:7], centers = k, nstart = 25)
ss <- silhouette(km$cluster, dist(data_normalized[, 2:7]))
mean(ss[, 3])
}
# range of k for silhouette scores
k_sil <- 2:10 # silhouette is undefined for k = 1
avg_sil <- sapply(k_sil, silhouette_score)
plot(k_sil, avg_sil, type = "b", xlab = "Number of clusters k", ylab = "Average silhouette width")
# Silhouette Graph
fviz_nbclust(data_normalized[,2:7], kmeans, method = "silhouette") +labs(subtitle = "Silhouette Method")
# Elbow graph
distance_matrix <- dist(data_normalized[, 2:7], method = "euclidean")
wss <- sapply(1:10, function(k){
clusters <- cutree(hclust(distance_matrix), k)
sum(sapply(unique(clusters), function(cl){
cluster_data <- data_normalized[, 2:7][clusters == cl, ]
sum(scale(cluster_data, scale = FALSE)^2)
}))
})
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters K",
ylab = "Total Within-Cluster Sum of Squares WSS")
Therefore, depending on the Elbow graph, a choice of 2 or 3 clusters may be appropriate. Combined with the previous average Silhouette graph, 3 clusters is a good choice.
h.clust <- hclust(distance_matrix, method = "complete")
h.clust$labels <- data_normalized$City
plot(h.clust, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "",las = 2)
clusters <- cutree(h.clust, k = 3)
rect.hclust(h.clust, k =3, border = "red")
Urumqi is located higher in the dendrogram, which may indicate that
it is an outlier and its pollution data is more different from the other
cities. xining and Haerbin are also located higher, which may indicate
that their pollution characteristics are different from the other
cities.
Guiyang and Shenzhen may be very close to each other at a certain
height, indicating that they have similar pollution levels, and Beijing
and Shanghai may be close to each other at another height, indicating
that they have similar pollution characteristics.
km <- kmeans(data_normalized[,2:7], centers = 3, nstart = 25)
fviz_cluster(
km,
data = data_normalized[, 2:7],
geom = "point",
label = data_normalized$City,
show.clust.cent = TRUE,
repel = TRUE,
palette = c("red", "blue", "green")
) +
geom_text(aes(label = data_normalized$City), vjust = -1)
Centering on Urumqi indicates that the cluster has a high level of
pollution or has unique pollution characteristics.Cities included:
Haerbin, Xinning, Changchun, Hongqi, Changsha, Hefei, ChengDu, Wuhan,
Beijing, Hangzhou, Qingdao, Shanghai, Nanjing.
Cities such as Guiyang and Kunming are often known for their better
ecological environment, with Guiyang as the center, indicating lower
pollution levels or better environmental quality.Cities included:
Nanning, Kunming, Shenzhen, Qianghai, Zhangjakou, Lahsa.
Centered around Taiyuan, which is typically an industrial city,
indicates a medium level of pollution or similar characteristics in some
pollution indicators.
Both methods are very effective in analyzing urban pollution data, but they have their own advantages and disadvantages: Hierarchical clustering is more suitable for exploratory analysis, which can show the hierarchical structure of the data and adjust the number of clusters dynamically, while K-mean clustering is more suitable for obtaining clear grouping results quickly, but the number of clusters needs to be specified in advance. Hierarchical clustering is a better choice if you need to study the inner relationship of the data in depth; if you need to get clear grouping results quickly, K-mean clustering is more suitable.