The objective of this study is to apply Principal Component Analysis (PCA) to reduce the dimensionality of air quality variables while preserving as much of the total variance in the dataset as possible. The dataset consists of multiple air pollution indicators, meteorological variables, and geographic coordinates that are likely to be correlated with one another. These correlations may introduce redundancy and increase the complexity of the data structure.
Principal Component Analysis transforms the original correlated variables into a smaller set of uncorrelated variables called principal components. Each principal component is a linear combination of the original variables and is constructed to capture the maximum possible variance. By retaining only the most important principal components, PCA reduces the dimensionality of the dataset while minimizing the loss of information. This dimensionality reduction improves the efficiency of data analysis, simplifies interpretation, and facilitates further analysis such as clustering, regression modeling, and environmental pattern identification. In addition, PCA helps reveal the underlying structure of the air quality data by identifying dominant patterns and relationships among variables.
# Read data
data <- read.csv("air_quality.csv")
cat("Original Data Structure:\n")
## Original Data Structure:
str(data)
## 'data.frame': 5882208 obs. of 25 variables:
## $ date : chr "2024-08-31 23:00" "2024-08-31 23:00" "2024-08-31 23:00" "2024-08-31 23:00" ...
## $ sitename : chr "Hukou" "Zhongming" "Zhudong" "Hsinchu" ...
## $ county : chr "Hsinchu County" "Taichung City" "Hsinchu County" "Hsinchu City" ...
## $ aqi : num 62 50 45 42 50 40 39 44 46 49 ...
## $ pollutant: chr "PM2.5" "" "" "" ...
## $ status : chr "Moderate" "Good" "Good" "Good" ...
## $ so2 : chr "0.9" "1.6" "0.4" "0.8" ...
## $ co : chr "0.17" "0.32" "0.17" "0.2" ...
## $ o3 : chr "35.0" "27.9" "25.1" "30.0" ...
## $ o3_8hr : chr "40.2" "35.1" "40.6" "35.9" ...
## $ pm10 : chr "18.0" "27.0" "21.0" "19.0" ...
## $ pm2.5 : chr "17.0" "14.0" "13.0" "10.0" ...
## $ no2 : chr "2.3" "7.6" "2.9" "4.0" ...
## $ nox : chr "2.6" "9.3" "4.1" "4.8" ...
## $ no : chr "0.3" "1.6" "1.1" "0.7" ...
## $ windspeed: chr "2.3" "1.1" "0.4" "1.9" ...
## $ winddirec: chr "225" "184" "210" "239" ...
## $ unit : logi NA NA NA NA NA NA ...
## $ co_8hr : chr "0.2" "0.2" "0.2" "0.2" ...
## $ pm2.5_avg: chr "20.1" "15.3" "13.8" "13.0" ...
## $ pm10_avg : chr "26.0" "23.0" "24.0" "26.0" ...
## $ so2_avg : chr "1.0" "1.0" "0.0" "1.0" ...
## $ longitude: num 121 121 121 121 121 ...
## $ latitude : num 24.9 24.2 24.7 24.8 24.7 ...
## $ siteid : num 22 31 23 24 25 26 27 28 29 30 ...
# Selecting numeric variables
X <- data[, c("aqi","so2","co","o3","o3_8hr","pm10","pm2.5",
"no2","nox","no","windspeed","winddirec",
"co_8hr","pm2.5_avg","pm10_avg","so2_avg",
"longitude","latitude")]
# Convert to numeric
X <- data.frame(lapply(X, as.numeric))
# Delete NA
X <- na.omit(X)
# Take 10000 data
X <- X[1:10000, ]
cat("Structure After Data Preparation:\n")
## Structure After Data Preparation:
str(X)
## 'data.frame': 10000 obs. of 18 variables:
## $ aqi : num 62 50 45 42 50 40 39 44 46 49 ...
## $ so2 : num 0.9 1.6 0.4 0.8 1 1.1 0.9 1.3 2.5 0.7 ...
## $ co : num 0.17 0.32 0.17 0.2 0.16 0.17 0.18 0.24 0.2 0.24 ...
## $ o3 : num 35 27.9 25.1 30 33.5 35.2 35.3 24.6 30.3 29.4 ...
## $ o3_8hr : num 40.2 35.1 40.6 35.9 35.9 35 42.9 39.7 40.4 37 ...
## $ pm10 : num 18 27 21 19 18 15 14 21 33 20 ...
## $ pm2.5 : num 17 14 13 10 14 12 9 12 16 12 ...
## $ no2 : num 2.3 7.6 2.9 4 1.8 4 2.4 6.8 7.3 5.6 ...
## $ nox : num 2.6 9.3 4.1 4.8 3.1 5.1 3.1 7.3 7.7 6.3 ...
## $ no : num 0.3 1.6 1.1 0.7 1.2 1.1 0.7 0.5 0.3 0.7 ...
## $ windspeed: num 2.3 1.1 0.4 1.9 1.8 1.4 0.6 1.2 0.5 0.9 ...
## $ winddirec: num 225 184 210 239 259 235 203 38 8 97 ...
## $ co_8hr : num 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.2 0.1 0.2 ...
## $ pm2.5_avg: num 20.1 15.3 13.8 13 15.3 12.2 11.4 13.5 14.2 15.1 ...
## $ pm10_avg : num 26 23 24 26 28 17 16 21 26 21 ...
## $ so2_avg : num 1 1 0 1 1 1 0 1 1 0 ...
## $ longitude: num 121 121 121 121 121 ...
## $ latitude : num 24.9 24.2 24.7 24.8 24.7 ...
## - attr(*, "na.action")= 'omit' Named int [1:1591314] 35 41 47 53 59 65 68 121 127 137 ...
## ..- attr(*, "names")= chr [1:1591314] "35" "41" "47" "53" ...
The air quality dataset used in this study consists of 18 numerical variables representing air quality indicators, including the Air Quality Index (AQI), pollutant concentrations (SO₂, CO, O₃, 8-hour O₃, PM10, PM2.5, NO₂, NOx, and NO), meteorological variables (wind speed and wind direction), averaged pollutant measurements, and geographic coordinates. Since PCA requires numerical input variables, all selected variables were converted to numeric format, and observations with missing values were removed using listwise deletion. Due to the large size of the dataset, a subset of 10,000 observations was selected to improve computational efficiency while maintaining sufficient sample size. The prepared dataset is suitable for subsequent PCA analysis.
# Data Standardization
scaled.X <- scale(X)
scaled.mean <- apply(scaled.X, 2, mean)
scaled.sd <- apply(scaled.X, 2, sd)
check <- data.frame(
Mean = round(scaled.mean,4),
Standard_Deviation = round(scaled.sd,4)
)
check
## Mean Standard_Deviation
## aqi 0 1
## so2 0 1
## co 0 1
## o3 0 1
## o3_8hr 0 1
## pm10 0 1
## pm2.5 0 1
## no2 0 1
## nox 0 1
## no 0 1
## windspeed 0 1
## winddirec 0 1
## co_8hr 0 1
## pm2.5_avg 0 1
## pm10_avg 0 1
## so2_avg 0 1
## longitude 0 1
## latitude 0 1
All variables were standardized using the z-score method so that each variable has a mean of 0 and a standard deviation of 1. Standardization is necessary because the variables have different units and scales. Without standardization, variables with larger scales could dominate the PCA results. By standardizing the data, all variables contribute equally to the formation of principal components, and PCA is effectively performed based on the correlation matrix, which is appropriate for variables measured on different scales.
# Visual Examination of Correlation Matrix
cor.mat.initial <- cor(scaled.X)
# Count correlations > 0.30
sum(abs(cor.mat.initial[upper.tri(cor.mat.initial)]) > 0.3)
## [1] 53
round(cor.mat.initial, 2)
## aqi so2 co o3 o3_8hr pm10 pm2.5 no2 nox no windspeed
## aqi 1.00 0.16 0.36 0.39 0.54 0.51 0.70 0.31 0.26 0.04 0.11
## so2 0.16 1.00 0.15 0.16 0.08 0.17 0.22 0.22 0.20 0.08 0.09
## co 0.36 0.15 1.00 -0.03 0.03 0.36 0.40 0.79 0.83 0.61 -0.20
## o3 0.39 0.16 -0.03 1.00 0.62 0.23 0.28 -0.16 -0.17 -0.12 0.48
## o3_8hr 0.54 0.08 0.03 0.62 1.00 0.10 0.20 -0.01 -0.07 -0.18 0.27
## pm10 0.51 0.17 0.36 0.23 0.10 1.00 0.67 0.29 0.28 0.13 0.03
## pm2.5 0.70 0.22 0.40 0.28 0.20 0.67 1.00 0.34 0.30 0.10 0.00
## no2 0.31 0.22 0.79 -0.16 -0.01 0.29 0.34 1.00 0.95 0.48 -0.25
## nox 0.26 0.20 0.83 -0.17 -0.07 0.28 0.30 0.95 1.00 0.73 -0.23
## no 0.04 0.08 0.61 -0.12 -0.18 0.13 0.10 0.48 0.73 1.00 -0.09
## windspeed 0.11 0.09 -0.20 0.48 0.27 0.03 0.00 -0.25 -0.23 -0.09 1.00
## winddirec 0.02 0.00 -0.06 0.21 0.10 -0.01 0.01 -0.09 -0.08 -0.02 0.17
## co_8hr 0.41 0.10 0.75 0.03 0.02 0.33 0.37 0.64 0.65 0.41 -0.16
## pm2.5_avg 0.88 0.17 0.39 0.28 0.30 0.58 0.82 0.33 0.29 0.09 0.06
## pm10_avg 0.71 0.12 0.38 0.29 0.25 0.76 0.64 0.30 0.28 0.12 0.08
## so2_avg 0.11 0.41 0.08 0.03 0.03 0.00 0.10 0.13 0.12 0.05 0.06
## longitude 0.14 -0.02 0.30 -0.01 0.00 0.21 0.11 0.29 0.26 0.10 -0.01
## latitude 0.52 0.03 0.38 0.19 0.25 0.39 0.43 0.35 0.31 0.11 0.03
## winddirec co_8hr pm2.5_avg pm10_avg so2_avg longitude latitude
## aqi 0.02 0.41 0.88 0.71 0.11 0.14 0.52
## so2 0.00 0.10 0.17 0.12 0.41 -0.02 0.03
## co -0.06 0.75 0.39 0.38 0.08 0.30 0.38
## o3 0.21 0.03 0.28 0.29 0.03 -0.01 0.19
## o3_8hr 0.10 0.02 0.30 0.25 0.03 0.00 0.25
## pm10 -0.01 0.33 0.58 0.76 0.00 0.21 0.39
## pm2.5 0.01 0.37 0.82 0.64 0.10 0.11 0.43
## no2 -0.09 0.64 0.33 0.30 0.13 0.29 0.35
## nox -0.08 0.65 0.29 0.28 0.12 0.26 0.31
## no -0.02 0.41 0.09 0.12 0.05 0.10 0.11
## windspeed 0.17 -0.16 0.06 0.08 0.06 -0.01 0.03
## winddirec 1.00 -0.03 0.00 0.00 0.02 -0.01 0.02
## co_8hr -0.03 1.00 0.46 0.45 0.09 0.35 0.43
## pm2.5_avg 0.00 0.46 1.00 0.78 0.13 0.12 0.52
## pm10_avg 0.00 0.45 0.78 1.00 0.02 0.28 0.52
## so2_avg 0.02 0.09 0.13 0.02 1.00 -0.05 0.02
## longitude -0.01 0.35 0.12 0.28 -0.05 1.00 0.60
## latitude 0.02 0.43 0.52 0.52 0.02 0.60 1.00
The correlation matrix was examined to assess the suitability of the data for Principal Component Analysis (PCA). Out of 153 pairwise correlations, 53 (34.6%) have an absolute value greater than 0.30, indicating a substantial proportion of moderate to strong relationships among the variables. Several strong correlations are observed, such as between NO₂ and NOx (r = 0.95), CO and NOx (r = 0.83), and PM2.5 and PM2.5_avg (r = 0.82), suggesting considerable shared variance. Since a meaningful number of variables are sufficiently intercorrelated, the correlation requirement for PCA is satisfied.
library(psych)
# KMO test
kmo.result <- KMO(cor(scaled.X))
kmo.result
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(scaled.X))
## Overall MSA = 0.73
## MSA for each item =
## aqi so2 co o3 o3_8hr pm10 pm2.5 no2
## 0.84 0.71 0.93 0.70 0.61 0.75 0.80 0.60
## nox no windspeed winddirec co_8hr pm2.5_avg pm10_avg so2_avg
## 0.62 0.42 0.75 0.78 0.89 0.76 0.80 0.61
## longitude latitude
## 0.66 0.84
The Kaiser-Meyer-Olkin (KMO) test was conducted to evaluate sampling adequacy and determine whether the dataset was suitable for Principal Component Analysis (PCA). The KMO statistic measures the proportion of common variance among variables that can be explained by the principal components. The overall KMO value of 0.73 falls into the middling category, indicating that the dataset is adequate and appropriate for PCA.
However, the individual Measure of Sampling Adequacy (MSA) values showed that the variable NO had an MSA of 0.42, which is below the recommended threshold of 0.50. This indicates that the variable has weak correlations with other variables and is not suitable for inclusion in PCA. Therefore, the variable NO was removed to improve sampling adequacy and the overall quality of the PCA results.
# Remove variable NO because MSA < 0.50
X.reduced <- X[, colnames(X) != "no"]
# Re-standardize after removing NO
scaled.X.reduced <- scale(X.reduced)
# Recalculate KMO
kmo.result.reduced <- KMO(cor(scaled.X.reduced))
kmo.result.reduced
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(scaled.X.reduced))
## Overall MSA = 0.77
## MSA for each item =
## aqi so2 co o3 o3_8hr pm10 pm2.5 no2
## 0.84 0.70 0.86 0.68 0.58 0.75 0.79 0.76
## nox windspeed winddirec co_8hr pm2.5_avg pm10_avg so2_avg longitude
## 0.73 0.72 0.77 0.88 0.76 0.80 0.61 0.64
## latitude
## 0.84
After removing the variable NO, the Kaiser-Meyer-Olkin (KMO) test was recalculated using the reduced dataset. The overall KMO value increased from 0.73 to 0.77, indicating improved sampling adequacy and a stronger shared variance structure among the variables.
Furthermore, all remaining variables have individual Measure of Sampling Adequacy (MSA) values above 0.50, satisfying the minimum requirement for PCA. This indicates that each variable shares sufficient common variance with other variables and is appropriate for inclusion in the PCA. Therefore, the reduced dataset is considered adequate and suitable for Principal Component Analysis.
# Bartlett test
bartlett.result <- cortest.bartlett(
cor(scaled.X.reduced),
n = nrow(scaled.X.reduced)
)
bartlett.result
## $chisq
## [1] 126772
##
## $p.value
## [1] 0
##
## $df
## [1] 136
Bartlett’s Test of Sphericity was conducted to evaluate whether the correlation matrix significantly differs from an identity matrix. An identity matrix indicates that variables are uncorrelated, making PCA inappropriate. The test produced a chi-square value of 126772 with 136 degrees of freedom and a p-value less than 0.001. Since the p-value is smaller than the 0.05 significance level, the null hypothesis is rejected. This indicates that significant correlations exist among the variables. Therefore, the dataset satisfies the assumption required for Principal Component Analysis and is suitable for dimensionality reduction.
Since all variables were standardized using the z-score method, the correlation matrix was used as the basis for Principal Component Analysis (PCA). The correlation matrix describes the strength and direction of linear relationships among variables and ensures equal contribution from each variable.
# Correlation Matrix
cor.matrix <- cor(scaled.X.reduced)
cor.matrix
## aqi so2 co o3 o3_8hr
## aqi 1.00000000 0.164509471 0.35707946 0.38700356 0.538214367
## so2 0.16450947 1.000000000 0.15172909 0.16170686 0.081371894
## co 0.35707946 0.151729089 1.00000000 -0.02577027 0.027846115
## o3 0.38700356 0.161706858 -0.02577027 1.00000000 0.621154546
## o3_8hr 0.53821437 0.081371894 0.02784612 0.62115455 1.000000000
## pm10 0.50669565 0.169141703 0.35789109 0.22973549 0.095936871
## pm2.5 0.70408166 0.216043093 0.40172123 0.28192513 0.197515987
## no2 0.30702568 0.224221121 0.78550941 -0.16496489 -0.010628170
## nox 0.25547911 0.202912043 0.82773043 -0.17002092 -0.073419232
## windspeed 0.11434546 0.093168384 -0.20274296 0.47777990 0.267573766
## winddirec 0.02452998 -0.001651485 -0.05968979 0.21312595 0.097510779
## co_8hr 0.40636707 0.097227726 0.75181279 0.03230909 0.019318277
## pm2.5_avg 0.87585533 0.170956162 0.39154590 0.28321702 0.301143767
## pm10_avg 0.71173948 0.117053089 0.37958877 0.28525304 0.248544682
## so2_avg 0.11323843 0.406264892 0.07563397 0.03055103 0.032649434
## longitude 0.13710433 -0.022278121 0.30450249 -0.01006536 0.002216489
## latitude 0.52282365 0.029362618 0.37648524 0.19031034 0.252213696
## pm10 pm2.5 no2 nox windspeed
## aqi 0.506695649 0.704081663 0.30702568 0.25547911 0.114345461
## so2 0.169141703 0.216043093 0.22422112 0.20291204 0.093168384
## co 0.357891087 0.401721233 0.78550941 0.82773043 -0.202742958
## o3 0.229735487 0.281925126 -0.16496489 -0.17002092 0.477779899
## o3_8hr 0.095936871 0.197515987 -0.01062817 -0.07341923 0.267573766
## pm10 1.000000000 0.671348389 0.29457799 0.27607929 0.027173681
## pm2.5 0.671348389 1.000000000 0.34388678 0.30337689 0.004203052
## no2 0.294577992 0.343886783 1.00000000 0.95038312 -0.247974192
## nox 0.276079286 0.303376891 0.95038312 1.00000000 -0.226710503
## windspeed 0.027173681 0.004203052 -0.24797419 -0.22671050 1.000000000
## winddirec -0.007840368 0.006433025 -0.08978834 -0.07691247 0.171625724
## co_8hr 0.326011338 0.366952515 0.64267805 0.64847017 -0.157239088
## pm2.5_avg 0.580653693 0.817961473 0.33426973 0.29345953 0.060540705
## pm10_avg 0.764868572 0.638745102 0.30332653 0.27781692 0.076572965
## so2_avg 0.004994085 0.102854513 0.13219290 0.12200082 0.055369279
## longitude 0.211823403 0.106203241 0.28742476 0.25894448 -0.010085544
## latitude 0.387198841 0.426323271 0.34998958 0.31291769 0.026202058
## winddirec co_8hr pm2.5_avg pm10_avg so2_avg
## aqi 0.024529983 0.40636707 0.875855333 0.711739478 0.113238433
## so2 -0.001651485 0.09722773 0.170956162 0.117053089 0.406264892
## co -0.059689789 0.75181279 0.391545905 0.379588769 0.075633975
## o3 0.213125953 0.03230909 0.283217018 0.285253038 0.030551026
## o3_8hr 0.097510779 0.01931828 0.301143767 0.248544682 0.032649434
## pm10 -0.007840368 0.32601134 0.580653693 0.764868572 0.004994085
## pm2.5 0.006433025 0.36695252 0.817961473 0.638745102 0.102854513
## no2 -0.089788340 0.64267805 0.334269728 0.303326528 0.132192901
## nox -0.076912473 0.64847017 0.293459525 0.277816921 0.122000817
## windspeed 0.171625724 -0.15723909 0.060540705 0.076572965 0.055369279
## winddirec 1.000000000 -0.03176882 0.001072184 -0.004474842 0.020199018
## co_8hr -0.031768819 1.00000000 0.460358541 0.446852638 0.093773885
## pm2.5_avg 0.001072184 0.46035854 1.000000000 0.780432180 0.132234352
## pm10_avg -0.004474842 0.44685264 0.780432180 1.000000000 0.023640991
## so2_avg 0.020199018 0.09377388 0.132234352 0.023640991 1.000000000
## longitude -0.011418065 0.34718306 0.124403289 0.284463743 -0.054999314
## latitude 0.020384251 0.42506912 0.524739963 0.518260767 0.021254289
## longitude latitude
## aqi 0.137104333 0.52282365
## so2 -0.022278121 0.02936262
## co 0.304502486 0.37648524
## o3 -0.010065364 0.19031034
## o3_8hr 0.002216489 0.25221370
## pm10 0.211823403 0.38719884
## pm2.5 0.106203241 0.42632327
## no2 0.287424760 0.34998958
## nox 0.258944483 0.31291769
## windspeed -0.010085544 0.02620206
## winddirec -0.011418065 0.02038425
## co_8hr 0.347183063 0.42506912
## pm2.5_avg 0.124403289 0.52473996
## pm10_avg 0.284463743 0.51826077
## so2_avg -0.054999314 0.02125429
## longitude 1.000000000 0.60188012
## latitude 0.601880125 1.00000000
The correlation matrix shows several strong correlations among variables, such as NO₂ and NOx (0.95), CO and NOx (0.83), and AQI and PM2.5_avg (0.88), indicating overlapping or redundant information. Negative correlations are also present, such as between O₃ and NOx (-0.17), indicating inverse relationships. The presence of moderate to strong correlations confirms that the dataset is suitable for PCA, as PCA transforms correlated variables into a smaller set of uncorrelated principal components while retaining most of the original variance.
# Eigenvalues and Eigenvectors
eigen.decomp <- eigen(cor.matrix)
eigen.vals <- eigen.decomp$values
eigen.vecs <- eigen.decomp$vectors
sorted.idx <- order(eigen.vals, decreasing = TRUE)
sorted.eigen.vals <- eigen.vals[sorted.idx]
sorted.eigen.vecs <- eigen.vecs[, sorted.idx]
cat("Eigenvalues (first 5):\n")
## Eigenvalues (first 5):
print(sorted.eigen.vals[1:5])
## [1] 5.829627 2.857237 1.519862 1.301185 1.004075
cat("Eigenvectors (first 5):\n")
## Eigenvectors (first 5):
print(sorted.eigen.vecs[,1:5])
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.329841314 0.2349299240 -0.001552796 -0.102604557 -0.131592641
## [2,] -0.107284094 0.0279740070 0.589413972 -0.002831555 0.312143540
## [3,] -0.301151098 -0.2869922526 0.066134937 0.149529357 -0.221373877
## [4,] -0.109373160 0.4318095939 0.099117587 0.277512472 -0.171150543
## [5,] -0.122755496 0.3566617357 0.064362654 0.283507178 -0.337182038
## [6,] -0.288881123 0.1042630599 -0.106168899 -0.288697680 0.161044524
## [7,] -0.320626073 0.1388658880 0.019927181 -0.313542425 -0.008024958
## [8,] -0.280459825 -0.3478899338 0.145493999 0.134157075 -0.166675429
## [9,] -0.268377160 -0.3686684557 0.151496966 0.147394880 -0.190626489
## [10,] 0.008641508 0.3485897840 0.114527541 0.326065647 0.112598865
## [11,] 0.005748088 0.1495170107 0.056617954 0.346646077 -0.151533897
## [12,] -0.296542428 -0.2181903588 -0.008260336 0.157068980 -0.122872245
## [13,] -0.344801023 0.1760638377 -0.015092203 -0.250183310 -0.028144911
## [14,] -0.331718982 0.1555023335 -0.154550996 -0.192437636 0.111761463
## [15,] -0.060658322 0.0004244882 0.583006072 0.009860619 0.387968032
## [16,] -0.162372511 -0.1122786733 -0.349575577 0.415977452 0.534850453
## [17,] -0.279319584 0.0403982530 -0.275419656 0.253822409 0.320210381
Based on the eigen decomposition of the correlation matrix, eigenvalues and eigenvectors were obtained, which define the principal components in PCA. Eigenvalues indicate the amount of variance explained by each component. Since PCA was performed on the correlation matrix of 17 variables, the total variance equals 17. The results show that the first five eigenvalues are 5.83, 2.86, 1.52, 1.30, and 1.00, indicating that the first component explains the largest portion of variance, followed by the subsequent components. The eigenvectors represent the loadings or weights of each variable in forming the principal components. Variables with larger absolute loadings contribute more strongly to the corresponding component.
# Variance explained
tot.eigen <- sum(eigen.vals)
var.pc <- (sorted.eigen.vals / tot.eigen) * 100
print(var.pc)
## [1] 34.2919242 16.8072746 8.9403660 7.6540276 5.9063227 5.5142418
## [7] 4.5754007 3.6614624 2.8421060 2.5354423 2.1423479 1.6192732
## [13] 1.2236326 1.0129921 0.6873144 0.3581646 0.2277069
# Cumulative Variance
cum.var <- cumsum(var.pc)
data.frame(
PC = 1:length(var.pc),
Variance = var.pc,
Cumulative = cum.var
)
## PC Variance Cumulative
## 1 1 34.2919242 34.29192
## 2 2 16.8072746 51.09920
## 3 3 8.9403660 60.03956
## 4 4 7.6540276 67.69359
## 5 5 5.9063227 73.59991
## 6 6 5.5142418 79.11416
## 7 7 4.5754007 83.68956
## 8 8 3.6614624 87.35102
## 9 9 2.8421060 90.19313
## 10 10 2.5354423 92.72857
## 11 11 2.1423479 94.87092
## 12 12 1.6192732 96.49019
## 13 13 1.2236326 97.71382
## 14 14 1.0129921 98.72681
## 15 15 0.6873144 99.41413
## 16 16 0.3581646 99.77229
## 17 17 0.2277069 100.00000
cat("Total variance explained by first 5 PCs:", cum.var[5])
## Total variance explained by first 5 PCs: 73.59991
The cumulative variance was used to determine the number of components to retain. The results show that the first five principal components explain approximately 73.60% of the total variance. Although this value is below 90%, it satisfies the Kaiser criterion (eigenvalue > 1) and captures a substantial proportion of the dataset variability. Therefore, five principal components were retained for further analysis.
# Scree plot
eigen.temp <- eigen(cor(scaled.X.reduced))
plot(eigen.temp$values,
type = "b",
pch = 19,
xlab = "Principal Component",
ylab = "Eigenvalue",
main = "Scree Plot of Eigenvalues")
The scree plot shows a sharp decline in eigenvalues in the first few
components, followed by a gradual leveling off. Based on the Kaiser
criterion, five components have eigenvalues greater than 1, indicating
that these components explain substantial variance. Since the first five
principal components explain approximately 73.60% of the total variance,
five components were retained for further analysis.
# Run PCA using psych to obtain communalities
pca.psych <- principal(scaled.X.reduced,
nfactors = 5,
rotate = "none")
# Extract communalities
communalities <- pca.psych$communality
communalities
## aqi so2 co o3 o3_8hr pm10 pm2.5 no2
## 0.8230221 0.6951888 0.8489824 0.7470476 0.6763439 0.6691779 0.7829763 0.8878351
## nox windspeed winddirec co_8hr pm2.5_avg pm10_avg so2_avg longitude
## 0.9078698 0.5186379 0.2483502 0.6960308 0.8642261 0.8075990 0.6893045 0.8878328
## latitude
## 0.7615603
Communalities represent the proportion of variance of each variable explained by the retained principal components. Since five principal components were selected, explaining approximately 73.60% of the total variance, the communality values indicate how well each variable is represented in this 5-component solution. Most variables show moderate to high communalities (generally above 0.70), indicating that a substantial portion of their variance is captured by the retained components. However, some variables such as wind direction exhibit lower communality, suggesting that they are less well represented in the reduced component space.
The correlation circle displays the relationships between air quality variables and the first two principal components (PC1 and PC2). The length of each arrow represents the strength of the correlation between a variable and the principal components, while the direction indicates whether the correlation is positive or negative with respect to each axis. Variables located close to the circumference are well represented in the two-dimensional space. In the contribution plot, variables such as pm2.5, pm2.5_avg, pm10, pm10_avg, aqi, co, nox, and no2 show relatively high contributions to PC1 and PC2, indicating that these pollutants are major drivers of variability in the dataset. In the cos² plot, variables with higher cos² values (brighter colors) are better represented by PC1 and PC2, meaning that most of their variance is captured in this two-dimensional projection. Variables pointing in similar directions (e.g., pm2.5, pm2.5_avg, pm10_avg, and aqi) indicate strong positive correlations, whereas variables positioned in opposite directions reflect negative correlations.
# Contribution Plot for All PC
``` r
contrib_list <- list()
for(i in 1:5){
contrib_list[[i]] <- fviz_contrib(
pca.model,
choice = "var",
axes = i,
top = 10
) +
ggtitle(paste0("Top 10 Variable Contributions to PC", i))
}
do.call(grid.arrange, c(contrib_list, ncol = 2))
The contribution plots show the variables that play the most important
roles in forming the five retained principal components (PC1–PC5). For
PC1, the dominant contributors are pm2.5_avg, pm10_avg, aqi, pm2.5, and
co, indicating that this component is strongly associated with overall
air pollution intensity, particularly particulate matter and related
gaseous pollutants. PC2 is mainly driven by o3, nox, o3_8hr, windspeed,
and no2, suggesting that it reflects variability linked to ozone
formation, nitrogen oxides, and atmospheric dynamics. In both
components, variables exceeding the red dashed reference line contribute
more than the average expected contribution and are therefore considered
the primary drivers of the respective principal component.
For PC3, the largest contributions come from so2 and so2_avg, followed by geographic variables such as longitude and latitude, indicating that this component captures sulfur dioxide variation combined with spatial effects. PC4 is dominated by longitude, wind direction, and windspeed, along with particulate variables, suggesting a component related to spatial distribution and pollutant transport mechanisms. Finally, PC5 shows a very strong contribution from longitude, followed by so2_avg, o3_8hr, and latitude, implying that this component primarily reflects geographic gradients with secondary influence from sulfur dioxide and ozone variables. Overall, the five components represent distinct but complementary dimensions of air pollution variability and spatial–meteorological structure in the dataset.
# Cos² Correlation Circle
``` r
pc.comb <- combn(1:5, 2)
plot.list <- list()
for(i in 1:ncol(pc.comb)){
p <- fviz_pca_var(
pca.model,
axes = pc.comb[, i],
col.var = "cos2",
gradient.cols = c("white","blue","red"),
repel = TRUE
) +
ggtitle(paste0("PC", pc.comb[1,i], " vs PC", pc.comb[2,i]))
if(i != 1){
p <- p + theme(legend.position = "none")
}
plot.list[[i]] <- p
}
do.call(grid.arrange, c(plot.list, ncol = 2))
The matrix of cos² correlation circles shows the quality of
representation of each variable across different pairs of principal
components. Variables such as pm2.5, pm2.5_avg, pm10, pm10_avg, aqi, co,
no2, and nox consistently appear close to the circumference and in
darker red in plots involving PC1, indicating high cos² values and
strong representation. These variables are aligned in similar
directions, confirming strong positive correlations and showing that PC1
captures the dominant variation related to overall pollution intensity.
In contrast, ozone-related variables (o3 and o3_8hr) are more strongly
aligned with PC2, while nitrogen oxides tend to point in the opposite
direction along that dimension, suggesting a negative association
consistent with atmospheric photochemical processes.
Higher components reveal more specific structures. SO2 and so2_avg are strongly represented in dimensions involving PC3, indicating that this component isolates sulfur-related variability. Meanwhile, longitude and latitude show stronger representation in PC4 and PC5, confirming that these components primarily capture spatial heterogeneity rather than general pollution levels. Meteorological variables such as wind direction tend to have shorter vectors and lighter colors across most projections, suggesting lower cos² values and weaker representation in the principal component structure. Overall, the plots confirm a clear dimensional pattern separating general pollution intensity, ozone dynamics, sulfur variation, and spatial effects.
# Biplot
pc.comb <- combn(1:5, 2)
plot.list <- list()
for(i in 1:ncol(pc.comb)){
plot.list[[i]] <- fviz_pca_biplot(
pca.model,
axes = pc.comb[, i],
geom.ind = "point",
pointsize = 0.3,
col.ind = "gray70",
alpha.ind = 0.2,
col.var = "contrib",
gradient.cols = c("blue","yellow","red"),
repel = TRUE
) +
ggtitle(paste0("PC", pc.comb[1,i], " vs PC", pc.comb[2,i]))
}
do.call(grid.arrange, c(plot.list, ncol = 2))
The PCA biplot matrix displays both observations (gray points) and
variable vectors across several pairs of principal components, with
arrow length and color intensity representing each variable’s
contribution (contrib) to the corresponding dimensions. In plots
involving PC1 (Dim1 = 34.3%), variables such as PM2.5, PM2.5_avg, PM10,
PM10_avg, AQI, NOx, NO2, and CO consistently show long arrows with warm
colors, pointing toward the positive side of PC1. This indicates that
these pollutants are the primary contributors to the first component and
confirms that PC1 represents overall pollution intensity, largely driven
by particulate matter and combustion-related gases. Observations
projected in the same direction as these vectors tend to exhibit higher
concentrations of these pollutants.
In contrast, O3 and O3_8hr show strong contributions in dimensions involving PC2, generally pointing upward along Dim2 (16.8%), while nitrogen oxides tend to point in the opposite direction in that space, reflecting their contrasting behavior in atmospheric processes. Furthermore, SO2 and SO2_avg contribute strongly to PC3 (Dim3 = 8.9%), clearly separating sulfur-related variability from general pollution intensity. Geographic variables such as longitude and latitude become more prominent in PC4 and PC5, indicating that these higher components primarily capture spatial heterogeneity rather than chemical pollution structure. Overall, the biplot matrix confirms a structured dimensional pattern separating general pollution intensity, ozone–meteorological dynamics, sulfur variability, and spatial effects.
set.seed(123)
# Split 50% - 50%
n <- nrow(scaled.X.reduced)
idx.train <- sample(1:n, size = floor(0.5*n))
train.data <- scaled.X.reduced[idx.train, ]
test.data <- scaled.X.reduced[-idx.train, ]
# PCA on training data
pca.train <- principal(train.data,
nfactors = 5,
rotate = "none")
# PCA on testing data
pca.test <- principal(test.data,
nfactors = 5,
rotate = "none")
# Compare eigenvalues
train.eigen <- pca.train$values[1:5]
test.eigen <- pca.test$values[1:5]
data.frame(
PC = paste0("PC",1:5),
Train = train.eigen,
Test = test.eigen
)
## PC Train Test
## 1 PC1 5.9202895 5.746111
## 2 PC2 2.8708215 2.843728
## 3 PC3 1.5247238 1.517155
## 4 PC4 1.2970054 1.309180
## 5 PC5 0.9939569 1.014680
To evaluate the stability and robustness of the PCA solution, a split-sample validation approach was applied by randomly dividing the dataset into two equal subsets (50% training and 50% testing). PCA was then performed separately on each subset using five components without rotation. The eigenvalues of the first five principal components from the training and testing datasets were compared. The results show high consistency between both subsets. For example, PC1 has an eigenvalue of 5.92 in the training set and 5.75 in the testing set, while PC2 shows 2.87 and 2.84, respectively. The relatively small differences across all five components indicate that the component structure is stable and not sensitive to sampling variation. Therefore, the 5-component solution can be considered robust and reliable in representing the underlying structure of the air quality dataset.
# Select 5 main components (after removing NO)
selected.eigen.vecs <- sorted.eigen.vecs[, 1:5]
# Projection using reduced standardized data
pca.result <- scaled.X.reduced %*% selected.eigen.vecs
# Convert to data frame
pca.X <- as.data.frame(pca.result)
colnames(pca.X) <- paste("PC", 1:5, sep = "")
head(pca.X)
## PC1 PC2 PC3 PC4 PC5
## 1 -0.8167489 1.71322813 -0.2936654 0.06963302 0.75180078
## 2 -0.7460992 -0.44531337 0.7481311 -0.37058947 0.15315757
## 3 0.4049785 0.05127665 -1.8503856 -0.06475385 -0.06313232
## 4 0.3857649 0.33479703 -0.2777437 0.75025761 0.79189242
## 5 0.2987517 1.29735854 -0.1886397 0.06565303 0.90606408
## 6 1.2999600 0.27999411 0.3030849 0.62746386 0.56580928
After selecting the five principal components, the standardized data were projected onto the new component space using the selected eigenvectors. This projection was performed by multiplying the standardized data matrix by the matrix of eigenvectors, resulting in principal component scores for each observation. The PC1–PC5 scores are linear combinations of the original variables and form a new set of uncorrelated variables. The transformed dataset consists of five principal components that explain approximately 73.60% of the total variance in the original dataset. This lower-dimensional representation preserves most of the essential information and is suitable for further analyses such as clustering, regression or factor analysis.
Factor Analysis was conducted using Principal Component extraction via the principal() function in the psych package, followed by Varimax orthogonal rotation to improve interpretability. Prior diagnostic tests confirmed the adequacy of the data (KMO = 0.77 and Bartlett’s Test p < 0.001), indicating sufficient shared variance for factor modeling. Based on the Kaiser criterion and scree plot evaluation, five factors were retained, explaining approximately 74% of the total variance. The rotated solution produced a clearer loading structure with reduced cross-loadings and communality estimates suggest that most variables are well represented by the extracted factors.
fa.data <- scaled.X.reduced
library(corrplot)
## corrplot 0.95 loaded
cor.fa <- cor(fa.data)
corrplot(cor.fa,
method = "color",
type = "upper",
tl.col = "black",
tl.cex = 0.8,
tl.srt = 45,
mar = c(0,0,1,0),
diag = FALSE,
addCoef.col = "black",
number.cex = 0.7,
title = "Correlation Heatmap for Factor Analysis")
#KMO & Barllet
library(psych)
kmo_result <- KMO(cor.fa)
bartlett_result <- cortest.bartlett(cor.fa, n = nrow(fa.data))
result_df <- data.frame(
Test = c("KMO", "Bartlett"),
Overall_MSA = c(kmo_result$MSA, NA),
Chi_Square = c(NA, bartlett_result$chisq),
df = c(NA, bartlett_result$df),
p_value = c(NA, bartlett_result$p.value)
)
result_df
## Test Overall_MSA Chi_Square df p_value
## 1 KMO 0.7722823 NA NA NA
## 2 Bartlett NA 126772 136 0
The assumption tests indicate that the dataset is appropriate for Factor Analysis. The Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy produced an overall value of 0.77, which falls within the acceptable range and suggests that the variables share sufficient common variance for factor extraction. Additionally, all individual MSA values exceed the minimum threshold of 0.50, indicating that each variable is adequately correlated with the others and suitable for inclusion in the analysis. Bartlett’s Test of Sphericity yielded a chi-square value of 126,772 with 136 degrees of freedom and a p-value less than 0.001. Since the test is highly significant, the null hypothesis that the correlation matrix is an identity matrix is rejected. This confirms the presence of significant correlations among variables. Therefore, the dataset satisfies the necessary assumptions and is suitable for conducting Factor Analysis.
# Eigenvalue & Scree Plot
eigen.fa <- eigen(cor.fa)
plot(eigen.fa$values,
type = "b",
pch = 19,
xlab = "Factor Number",
ylab = "Eigenvalue",
main = "Scree Plot for Factor Analysis")
abline(h = 1, col = "red")
var_explained <- eigen.fa$values / sum(eigen.fa$values) * 100
cum_var <- cumsum(var_explained)
barplot(var_explained[1:5], names.arg=paste0("F",1:5), col="red",
main="Variance Explained by Factors", ylab="Percentage of Variance")
lines(1:5, cum_var[1:5], type="b", pch=19, col="yellow")
text(1:5, cum_var[1:5]+2, labels=round(cum_var[1:5],1), col="yellow")
abline(h=1*100/sum(eigen.fa$values), col="yellow", lty=2)
The scree plot presents the eigenvalues derived from the correlation
matrix of the standardized variables. Each eigenvalue represents the
amount of variance explained by a corresponding factor. Since the
analysis involves 17 variables, the total variance equals 17 and each
factor accounts for a portion of this total variance. The plot shows a
sharp decline from the first to the second factor, followed by a more
gradual decrease after the fifth factor. The first five factors have
eigenvalues greater than 1 and are therefore retained based on the
Kaiser criterion (eigenvalue > 1). In addition, the visible “elbow”
around the fifth factor indicates that additional factors contribute
only marginal increases in explained variance. Therefore, a five-factor
solution is considered sufficient to represent the underlying latent
structure of the air quality variables in this study.
# Unrotated Factor
library(psych)
fa.unrotated <- principal(
fa.data,
nfactors = 5,
rotate = "none"
)
loading_df <- as.data.frame(unclass(fa.unrotated$loadings))
loading_df <- round(loading_df, 3)
eigen_df <- data.frame(
Factor = paste0("Factor", 1:length(fa.unrotated$values)),
Eigenvalue = round(fa.unrotated$values, 3)
)
variance_df <- data.frame(
Factor = paste0("Factor", 1:ncol(fa.unrotated$Vaccounted)),
SS_Loadings = round(fa.unrotated$Vaccounted["SS loadings", ], 3),
Proportion_Var = round(fa.unrotated$Vaccounted["Proportion Var", ], 3),
Cumulative_Var = round(fa.unrotated$Vaccounted["Cumulative Var", ], 3)
)
loading_df
## PC1 PC2 PC3 PC4 PC5
## aqi 0.796 0.397 -0.002 -0.117 -0.132
## so2 0.259 0.047 0.727 -0.003 0.313
## co 0.727 -0.485 0.082 0.171 -0.222
## o3 0.264 0.730 0.122 0.317 -0.171
## o3_8hr 0.296 0.603 0.079 0.323 -0.338
## pm10 0.697 0.176 -0.131 -0.329 0.161
## pm2.5 0.774 0.235 0.025 -0.358 -0.008
## no2 0.677 -0.588 0.179 0.153 -0.167
## nox 0.648 -0.623 0.187 0.168 -0.191
## windspeed -0.021 0.589 0.141 0.372 0.113
## winddirec -0.014 0.253 0.070 0.395 -0.152
## co_8hr 0.716 -0.369 -0.010 0.179 -0.123
## pm2.5_avg 0.833 0.298 -0.019 -0.285 -0.028
## pm10_avg 0.801 0.263 -0.191 -0.220 0.112
## so2_avg 0.146 0.001 0.719 0.011 0.389
## longitude 0.392 -0.190 -0.431 0.475 0.536
## latitude 0.674 0.068 -0.340 0.290 0.321
eigen_df
## Factor Eigenvalue
## 1 Factor1 5.830
## 2 Factor2 2.857
## 3 Factor3 1.520
## 4 Factor4 1.301
## 5 Factor5 1.004
## 6 Factor6 0.937
## 7 Factor7 0.778
## 8 Factor8 0.622
## 9 Factor9 0.483
## 10 Factor10 0.431
## 11 Factor11 0.364
## 12 Factor12 0.275
## 13 Factor13 0.208
## 14 Factor14 0.172
## 15 Factor15 0.117
## 16 Factor16 0.061
## 17 Factor17 0.039
variance_df
## Factor SS_Loadings Proportion_Var Cumulative_Var
## PC1 Factor1 5.830 0.343 0.343
## PC2 Factor2 2.857 0.168 0.511
## PC3 Factor3 1.520 0.089 0.600
## PC4 Factor4 1.301 0.077 0.677
## PC5 Factor5 1.004 0.059 0.736
library(reshape2)
library(ggplot2)
library(ggrepel)
load.unrot <- as.data.frame(unclass(fa.unrotated$loadings))
load.unrot$Variable <- rownames(load.unrot)
melt.unrot <- melt(load.unrot, id.vars = "Variable")
ggplot(melt.unrot,
aes(x = variable,
y = Variable,
fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = round(value, 2)),
size = 3.5) +
scale_fill_gradient2(low = "blue",
mid = "white",
high = "red",
midpoint = 0) +
coord_fixed(ratio = 0.3) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank()
) +
labs(title = "Unrotated Factor Loading Heatmap",
x = "Factor",
y = "Variable")
In the initial extraction stage without rotation (unrotated solution),
five factors were extracted using the Principal Component method.
Collectively, these five factors explain 74% of the total variance,
indicating that the model adequately captures the underlying correlation
structure among the air quality variables. The first factor (PC1) has an
eigenvalue of 5.83 and accounts for 34% of the total variance, making it
the most dominant component in the model.The loading pattern reveals
that major pollutants such as PM2.5, PM10, PM2.5_avg, PM10_avg, CO, NO₂,
NOx and AQI load heavily on the first factor. This suggests that the
first factor represents a general air pollution dimension, primarily
associated with particulate matter and combustion-related emissions.
However, several variables also exhibit moderate loadings on other
factors, indicating the presence of cross-loadings.
Subsequent factors begin to capture more specific sources of variation, including ozone-related variables (O₃ and O₃_8hr), sulfur dioxide (SO₂ and SO₂_avg), as well as meteorological and spatial characteristics. Nevertheless, the unrotated solution does not yet achieve a clear and simple structure. The mean item complexity of 2.1 suggests that, on average, variables are associated with more than one factor. Although the model demonstrates excellent goodness-of-fit (RMSR = 0.04 and Fit = 0.98), interpretability remains limited due to the dominance of the first factor and overlapping loadings.
# rotated Factor Solution
fa.rotated <- principal(
fa.data,
nfactors = 5,
rotate = "varimax"
)
print(fa.rotated)
## Principal Components Analysis
## Call: principal(r = fa.data, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC4 RC2 RC5 RC3 h2 u2 com
## aqi 0.81 0.23 0.33 0.02 0.03 0.82 0.177 1.5
## so2 0.13 0.12 0.08 -0.05 0.81 0.70 0.305 1.1
## co 0.25 0.88 -0.03 0.12 0.03 0.85 0.151 1.2
## o3 0.30 -0.13 0.80 -0.01 0.06 0.75 0.253 1.4
## o3_8hr 0.27 0.03 0.77 -0.08 -0.06 0.68 0.324 1.3
## pm10 0.79 0.11 -0.08 0.17 0.06 0.67 0.331 1.2
## pm2.5 0.85 0.20 0.03 -0.02 0.11 0.78 0.217 1.2
## no2 0.17 0.90 -0.12 0.10 0.13 0.89 0.112 1.2
## nox 0.12 0.92 -0.13 0.08 0.12 0.91 0.092 1.1
## windspeed 0.02 -0.30 0.61 0.15 0.19 0.52 0.481 1.8
## winddirec -0.13 0.02 0.48 0.04 -0.01 0.25 0.752 1.2
## co_8hr 0.30 0.75 0.00 0.22 0.00 0.70 0.304 1.5
## pm2.5_avg 0.89 0.22 0.13 0.04 0.07 0.86 0.136 1.2
## pm10_avg 0.85 0.16 0.07 0.24 0.00 0.81 0.192 1.2
## so2_avg 0.02 0.06 0.01 -0.01 0.83 0.69 0.311 1.0
## longitude 0.06 0.20 -0.01 0.92 -0.05 0.89 0.112 1.1
## latitude 0.45 0.25 0.16 0.69 -0.04 0.76 0.238 2.2
##
## RC1 RC4 RC2 RC5 RC3
## SS loadings 4.13 3.41 2.03 1.51 1.44
## Proportion Var 0.24 0.20 0.12 0.09 0.08
## Cumulative Var 0.24 0.44 0.56 0.65 0.74
## Proportion Explained 0.33 0.27 0.16 0.12 0.12
## Cumulative Proportion 0.33 0.60 0.76 0.88 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.04
## with the empirical chi square 4933.69 with prob < 0
##
## Fit based upon off diagonal values = 0.98
load.rot <- as.data.frame(unclass(fa.rotated$loadings))
load.rot$Variable <- rownames(load.rot)
library(reshape2)
melt.rot <- melt(load.rot, id.vars = "Variable")
library(ggplot2)
library(ggrepel)
ggplot(melt.rot,
aes(x = variable,
y = Variable,
fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", value)),
size = 3) +
scale_fill_gradient2(low = "blue",
mid = "white",
high = "red",
midpoint = 0,
name = "") +
coord_fixed(ratio = 0.35) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.5)
) +
labs(title = "Rotated Factor Loading Heatmap (Varimax)",
x = "Factor",
y = "Variable")
# Example biplot rotated RC1 vs RC2
library(ggrepel)
ggplot(load.rot, aes(x=RC1, y=RC2, label=Variable)) +
geom_point(color="blue") +
geom_text_repel() +
geom_hline(yintercept=0, linetype="dashed") +
geom_vline(xintercept=0, linetype="dashed") +
labs(title="Factor Loading Plot (Rotated Varimax): RC1 vs RC2", x="RC1", y="RC2") +
theme_minimal()
To enhance interpretability, an orthogonal Varimax rotation was applied.
While the total variance explained remains unchanged at 74%, rotation
redistributes the variance across factors to achieve a simpler and more
interpretable structure.
After rotation, the contribution of each factor becomes more balanced. The first rotated factor (RC1) clearly captures particulate-related variables such as PM2.5, PM10, PM2.5_avg, PM10_avg and AQI, and can therefore be interpreted as a particulate pollution factor. The second factor (RC4) is strongly associated with CO, NO₂, NOx and CO_8hr, representing combustion-related gaseous emissions, likely linked to vehicular and industrial sources. The third factor (RC2) loads heavily on O₃ and O₃_8hr, along with windspeed, suggesting a photochemical and atmospheric dynamics dimension. The fourth factor (RC5) is primarily defined by longitude and latitude, indicating a spatial or geographical dimension. Finally, the fifth factor (RC3) loads strongly on SO₂ and SO₂_avg, reflecting a sulfur-related pollution component.
Importantly, the Varimax rotation reduces the mean item complexity from 2.1 to 1.3, indicating that most variables now load strongly on a single dominant factor. This confirms that rotation effectively reduces cross-loadings and produces a clearer, more stable latent structure, thereby improving substanti ve interpretability without altering overall model fit.
# Communalities Variable
comm <- data.frame(
Variable = names(fa.rotated$communality),
Communality = round(fa.rotated$communality, 3)
)
comm
## Variable Communality
## aqi aqi 0.823
## so2 so2 0.695
## co co 0.849
## o3 o3 0.747
## o3_8hr o3_8hr 0.676
## pm10 pm10 0.669
## pm2.5 pm2.5 0.783
## no2 no2 0.888
## nox nox 0.908
## windspeed windspeed 0.519
## winddirec winddirec 0.248
## co_8hr co_8hr 0.696
## pm2.5_avg pm2.5_avg 0.864
## pm10_avg pm10_avg 0.808
## so2_avg so2_avg 0.689
## longitude longitude 0.888
## latitude latitude 0.762
comm <- fa.rotated$communality
barplot(comm, names.arg=names(comm), col="red", las=2,
main="Communalities per Variable")
abline(h=0.5, col="yellow", lty=2)
Communality represents the proportion of each variable’s variance that
is explained by the extracted factors. In other words, higher
communality values indicate that a variable is well represented by the
factor model.
The results show that most variables have high communality values (above 0.6), including NOx (0.91), NO₂ (0.89), longitude (0.89), PM2.5_avg (0.86), CO (0.85) and AQI (0.82). This suggests that these variables strongly contribute to the latent structure and are well captured by the five-factor solution.Some variables display moderate communality, such as windspeed (0.52), which remains within an acceptable range. However, wind direction (0.25) has a relatively low communality, indicating that a substantial portion of its variance is not explained by the extracted factors. Wind direction is a circular variable measured in degrees (0–360) and its linear treatment in PCA or FA may limit its representation in the factor structure. This suggests that wind direction may behave more independently or is less strongly associated with the main latent dimensions. Overall, since the majority of communality values exceed 0.5, the factor model can be considered adequate in representing the variance structure of the air quality variables.
# Factor scores from rotated FA
scores_df <- as.data.frame(fa.rotated$scores)
head(scores_df)
## RC1 RC4 RC2 RC5 RC3
## 1 0.6887317 -0.7577945 0.4681434 0.6603423 0.2475130
## 2 0.2148040 0.2954464 -0.2958117 -0.2301398 0.6230258
## 3 0.1187411 -0.3661897 -0.2581393 0.5613031 -1.3263265
## 4 -0.3104363 -0.3693298 0.2374175 0.9178974 0.2024324
## 5 0.2390425 -0.9158682 0.2067130 0.6285434 0.3390911
## 6 -0.6287834 -0.4316504 0.2474547 0.4017330 0.4414345
scores_melt <- melt(scores_df, id.vars = NULL)
summary(scores_df)
## RC1 RC4 RC2 RC5
## Min. :-2.50366 Min. :-3.1150 Min. :-3.35280 Min. :-4.0164
## 1st Qu.:-0.67266 1st Qu.:-0.5999 1st Qu.:-0.77615 1st Qu.:-0.7507
## Median :-0.04383 Median :-0.2253 Median :-0.02577 Median :-0.1106
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.58804 3rd Qu.: 0.2970 3rd Qu.: 0.75046 3rd Qu.: 0.8115
## Max. : 7.72838 Max. : 8.8727 Max. : 3.33392 Max. : 2.4551
## RC3
## Min. :-2.0501
## 1st Qu.:-0.8376
## Median : 0.1008
## Mean : 0.0000
## 3rd Qu.: 0.5918
## Max. :18.4754
library(ggplot2)
library(reshape2)
scores_melt <- melt(scores_df, variable.name="Factor", value.name="Score")
## No id variables; using all as measure variables
ggplot(scores_melt, aes(x=Factor, y=Score, fill=Factor)) +
geom_violin(trim=FALSE, alpha=0.6) +
geom_boxplot(width=0.1, fill="white") +
labs(title="Distribution of Factor Scores", x="Factor", y="Score") +
theme_minimal()
After determining that five factors should be retained, a Varimax
rotation was applied to obtain a clearer and more interpretable
structure. Because Varimax produces orthogonal components, each factor
represents an independent dimension of variation, with variables loading
strongly on one component and minimally on others. The rotated solution
reveals a meaningful five-factor structure that captures distinct but
complementary aspects of air quality dynamics.
The first rotated component (RC1) is primarily defined by strong loadings from particulate matter variables, particularly PM2.5 and PM10, indicating overall particulate pollution intensity. This dimension reflects variations in suspended solid particles, commonly associated with traffic emissions, industrial activity and urban processes. RC4 is dominated by combustion-related gases such as carbon monoxide (CO) and nitrogen dioxide (NO₂), representing emissions from fossil fuel combustion and vehicular sources. RC2 loads strongly on ozone (O₃), capturing photochemical processes driven by atmospheric reactions and sunlight exposure. RC3 is characterized by sulfur dioxide (SO₂), suggesting sulfur-related industrial emissions and the presence of localized or episodic concentration spikes. Finally, RC5 reflects spatial variation across monitoring locations, capturing underlying geographic heterogeneity rather than specific pollutant mechanisms.
To evaluate how these latent dimensions are distributed across observations, factor scores were computed. Since the data were standardized prior to extraction, all scores are centered around zero, with positive values indicating above-average alignment with a particular factor and negative values reflecting relatively lower tendencies. The distribution of scores shows substantial variability across components. The particulate (RC1) and combustion-related (RC4) dimensions exhibit relatively wide ranges, indicating heterogeneity in pollution intensity across time and location. The sulfur-related factor (RC3) displays several extreme positive values, consistent with episodic spikes in sulfur emissions. In contrast, the spatial factor (RC5) presents a more moderate and gradual distribution, reflecting systematic geographic differences rather than abrupt fluctuations.
Overall, the rotated factor structure and corresponding factor scores provide a statistically coherent and substantively interpretable representation of air quality variation, capturing multiple independent pollution dynamics while reducing the complexity of the original dataset.
This study used Principal Component Analysis (PCA) and Exploratory Factor Analysis (FA) to better understand and simplify the multivariate structure of Taiwan’s Air Quality Index dataset from 2016 to 2024. The dataset includes various correlated air pollutants, meteorological indicators and geographic variables, making dimensional reduction necessary for clearer interpretation. Preliminary tests confirmed that the data were appropriate for multivariate analysis (KMO = 0.77; Bartlett’s Test p < 0.001), indicating sufficient shared variance among variables.
The PCA results showed that five principal components could explain about 73.6% of the total variance. This means that most of the information contained in the original 17 variables can be summarized into a smaller set of independent dimensions. The components reflect major pollution patterns, including particulate concentration, combustion-related gases, ozone-related atmospheric processes, sulfur variation and spatial–meteorological influences. Additional validation suggested that the PCA structure was stable and consistent.
To explore the underlying structure more clearly, Exploratory Factor Analysis was then conducted using Principal Component extraction with Varimax rotation. The five-factor solution explained approximately 74% of the total variance and produced a more interpretable loading pattern. The resulting factors represent distinct dimensions of air pollution, such as particulate pollution, combustion emissions, photochemical ozone processes, spatial–geographical variation and sulfur-related pollution. Most variables showed high communalities, indicating that the factors adequately capture their shared variance.
Overall, the findings suggest that air quality variation in Taiwan is inherently multidimensional. Pollution patterns are shaped by the interaction of particulate matter, gaseous emissions, atmospheric chemistry and geographic influences rather than a single dominant factor. By reducing complex and correlated variables into a smaller number of meaningful dimensions, PCA and FA provide a clearer and more structured way to understand air pollution dynamics.