Introduction
Review of the Dataset
Data pre-processing
- Data Normalization
- Relevant analysis
Clustering Analysis
- Hierarchical clustering results
- Kmeans clustering results
Conclusion

Introduction

In recent years, with the rapid development of the Chinese economy and the acceleration of urbanization, air pollution has become an increasingly serious problem that threatens public health and the ecological environment. At the same time, there are significant differences in the sources and characteristics of air pollution in different cities.

This study aims to categorize and compare the air pollution conditions of 28 cities in China through cluster analysis methods. In order to better understand the air pollution levels and characteristics of cities, I will use several clustering methods, including partitioning clustering K-means and hierarchical clustering.

Review of the Dataset

The data comes from China PM2.5 Historical Data, which includes PM2.5 and weather information data of 367 cities in China.
Source: https://www.aqistudy.cn/historydata/ ### Dataset Variables
City
PM2.5
PM10
CO
SO2
NO2
O3

# Load necessary libraries
library(ggplot2)
library(ape)
library(cluster)
library(factoextra)
library(naniar)
library(corrplot)
# Load the dataset
data <- read.csv("E:/Rstata/28cities(average).csv", sep = ",", dec = ".", header = TRUE)
head(data)

##        City    PM2.5     PM10        CO       SO2      NO2       O3
## 1   Beijing 49.85821 75.64179  813.5821  7.149254 36.46269 95.01493
## 2  Shanghai 36.91791 53.23134  716.1716 10.171642 38.31343 99.30597
## 3 Guangzhou 30.85821 50.74627  822.7239  9.343284 39.88060 93.93284
## 4  Shenzhen 23.63433 41.23881  712.5896  6.656716 26.37313 84.57463
## 5  Hangzhou 40.46269 66.91045  779.3731  9.776119 39.11940 95.90299
## 6   Tianjin 54.05224 89.13433 1057.9851 16.843284 41.70149 95.70896

There are no missing values in the Dataset.

vis_miss(data)

Data pre-processing

Data Normalization

First, we need to preprocess the data by Min-Max Scaling each column and adding a small constant 1e-8 to prevent division by zero.

#（Min-Max Scaling）
data_normalized <- data
data_normalized[, 2:7] <- apply(data[, 2:7], 2, function(x) (x - min(x)) / (max(x) - min(x))+ 1e-8)
data_normalized[, 2:7]

##         PM2.5       PM10         CO        SO2        NO2         O3
## 1  0.82866473 0.58553036 0.38816930 0.01766596 0.75598195 0.59112151
## 2  0.53067538 0.32250154 0.23841539 0.12607067 0.81196389 0.71325405
## 3  0.39113251 0.29333451 0.40222343 0.09635975 0.85936796 0.56032286
## 4  0.22478090 0.18174653 0.23290847 0.00000001 0.45079008 0.29396772
## 5  0.61230453 0.48305160 0.33557818 0.11188438 0.83634313 0.61639763
## 6  0.92524490 0.74389070 0.76390213 0.36536404 0.91444696 0.61087512
## 7  0.81113594 0.61224491 0.44424814 0.08217346 0.89458240 0.46559049
## 8  0.62948961 0.55627574 0.37133877 0.17987153 0.82167044 0.82412915
## 9  0.99690669 0.94832269 0.88738342 0.27676661 1.00000001 0.37829228
## 10 0.82144699 0.58903391 0.59922216 0.20904712 0.92844245 0.59048429
## 11 0.86750302 0.74248928 0.58045272 0.16648823 0.66139956 0.33687342
## 12 0.65440798 0.52728388 0.31367669 0.41916489 0.69774267 0.23258285
## 13 0.81577592 0.45476045 0.42302354 0.15364027 0.60406322 0.38190315
## 14 0.89310879 0.82018044 0.79035831 0.16782656 0.93318285 0.04694139
## 15 0.91596495 0.91740388 0.87525671 1.00000001 0.93724606 0.55947325
## 16 0.44045370 0.34676361 0.47267764 0.13062099 0.51760723 0.11172473
## 17 0.26568140 0.23815364 0.32910754 0.19807282 0.44920994 0.28122346
## 18 0.35555938 0.26548131 0.12458269 0.18308352 0.31602710 0.00000001
## 19 0.56607666 0.60199703 1.00000001 0.55487153 0.72528218 0.35216654
## 20 1.00000001 1.00000001 0.60362770 0.64453962 0.90383748 1.00000001
## 21 0.50214815 0.64675485 0.73659696 0.50910065 0.75575622 0.41588786
## 22 0.78553017 0.58842079 0.36860825 0.12901500 0.75823929 0.39825829
## 23 0.81268260 0.57712185 0.42592615 0.61001072 0.77404064 0.02060324
## 24 0.00000001 0.00000001 0.12302239 0.05701286 0.00000001 0.24150383
## 25 0.28097612 0.41569590 0.11376388 0.37312635 0.27494358 0.69350044
## 26 0.53789312 0.56529737 0.24731825 0.28667025 0.65395035 0.68309262
## 27 0.46571577 0.38740476 0.28804654 0.32467881 0.47516931 0.60450298
## 28 0.03866645 0.15424368 0.00000001 0.03131693 0.21580136 0.72387427

As we can see we don’t have missing values. This means that we can move forward.

colSums(is.na(data_normalized))

##  City PM2.5  PM10    CO   SO2   NO2    O3 
##     0     0     0     0     0     0     0

vis_miss(data_normalized)

From this ordered dissimilarity plot, it appears that there is some structure and pattern in the data. Often, distinct chunks or patterns of color in the plot can indicate potential clustering structures.

Relevant analysis

d<-dist(data_normalized[,2:7])
fviz_dist(d, show_labels = FALSE)+ labs(title = "28 Cities of Pollution Data")

corrplot(cor(data_normalized[,2:7], use="complete"), method="number", type="upper", diag=T, tl.col="black", tl.srt=30, tl.cex=0.9, number.cex=0.85, title="cars74", mar=c(0,0,1,0))

Clustering Analysis

Let’s examine the average Silhouette and Elbow graphs. For the average Silhouette graph, the optimal number of clusters is where the graph peaks, and for the Elbow graph, it’s at the lower peak.

# extract the average silhouette width
silhouette_score <- function(k) {
  km <- kmeans(data_normalized[, 2:7], centers = k, nstart = 25)
  ss <- silhouette(km$cluster, dist(data_normalized[, 2:7]))
  mean(ss[, 3])
}

# range of k for silhouette scores
k_sil <- 2:10 # silhouette is undefined for k = 1
avg_sil <- sapply(k_sil, silhouette_score)
plot(k_sil, avg_sil, type = "b", xlab = "Number of clusters k", ylab = "Average silhouette width")

# Silhouette Graph
fviz_nbclust(data_normalized[,2:7], kmeans, method = "silhouette") +labs(subtitle = "Silhouette Method")

# Elbow graph
distance_matrix <- dist(data_normalized[, 2:7], method = "euclidean")
wss <- sapply(1:10, function(k){
  clusters <- cutree(hclust(distance_matrix), k)
  sum(sapply(unique(clusters), function(cl){
    cluster_data <- data_normalized[, 2:7][clusters == cl, ]
    sum(scale(cluster_data, scale = FALSE)^2)
  }))
})
plot(1:10, wss, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of Clusters K", 
     ylab = "Total Within-Cluster Sum of Squares WSS")

Therefore, depending on the Elbow graph, a choice of 2 or 3 clusters may be appropriate. Combined with the previous average Silhouette graph, 3 clusters is a good choice.

h.clust <- hclust(distance_matrix, method = "complete")
h.clust$labels <- data_normalized$City
plot(h.clust, main = "Hierarchical Clustering Dendrogram", xlab = "", sub = "",las = 2)
clusters <- cutree(h.clust, k = 3)
rect.hclust(h.clust, k =3, border = "red")

Hierarchical clustering results

Urumqi is located higher in the dendrogram, which may indicate that it is an outlier and its pollution data is more different from the other cities. xining and Haerbin are also located higher, which may indicate that their pollution characteristics are different from the other cities.
Guiyang and Shenzhen may be very close to each other at a certain height, indicating that they have similar pollution levels, and Beijing and Shanghai may be close to each other at another height, indicating that they have similar pollution characteristics.

km <- kmeans(data_normalized[,2:7], centers = 3, nstart = 25)
fviz_cluster(
  km, 
  data = data_normalized[, 2:7], 
  geom = "point",          
  label = data_normalized$City, 
  show.clust.cent = TRUE,  
  repel = TRUE,            
  palette = c("red", "blue", "green") 
) +
  geom_text(aes(label = data_normalized$City), vjust = -1)

Kmeans clustering results

Centering on Urumqi indicates that the cluster has a high level of pollution or has unique pollution characteristics.Cities included: Haerbin, Xinning, Changchun, Hongqi, Changsha, Hefei, ChengDu, Wuhan, Beijing, Hangzhou, Qingdao, Shanghai, Nanjing.
Cities such as Guiyang and Kunming are often known for their better ecological environment, with Guiyang as the center, indicating lower pollution levels or better environmental quality.Cities included: Nanning, Kunming, Shenzhen, Qianghai, Zhangjakou, Lahsa.
Centered around Taiyuan, which is typically an industrial city, indicates a medium level of pollution or similar characteristics in some pollution indicators.

Conclusion

Both methods are very effective in analyzing urban pollution data, but they have their own advantages and disadvantages: Hierarchical clustering is more suitable for exploratory analysis, which can show the hierarchical structure of the data and adjust the number of clusters dynamically, while K-mean clustering is more suitable for obtaining clear grouping results quickly, but the number of clusters needs to be specified in advance. Hierarchical clustering is a better choice if you need to study the inner relationship of the data in depth; if you need to get clear grouping results quickly, K-mean clustering is more suitable.

Cluster Analysis of Air Pollution Characteristics in the 28 Cities of China

Guojie Tao

2025-01-26