Implementation of Principal Component Analysis (PCA) and Factor Analysis (FA) on Taiwan Air Quality Index Data 2016-2024

1. Principal Component Analysis

STAGE 1 — MAKE OBJECTIVE

The objective of this study is to apply Principal Component Analysis (PCA) to reduce the dimensionality of air quality variables while preserving as much of the total variance in the dataset as possible. The dataset consists of multiple air pollution indicators, meteorological variables, and geographic coordinates that are likely to be correlated with one another. These correlations may introduce redundancy and increase the complexity of the data structure.

Principal Component Analysis transforms the original correlated variables into a smaller set of uncorrelated variables called principal components. Each principal component is a linear combination of the original variables and is constructed to capture the maximum possible variance. By retaining only the most important principal components, PCA reduces the dimensionality of the dataset while minimizing the loss of information. This dimensionality reduction improves the efficiency of data analysis, simplifies interpretation, and facilitates further analysis such as clustering, regression modeling, and environmental pattern identification. In addition, PCA helps reveal the underlying structure of the air quality data by identifying dominant patterns and relationships among variables.

STAGE 2 — DESIGN

# Read data
data <- read.csv("air_quality.csv")

cat("Original Data Structure:\n")

## Original Data Structure:

str(data)

## 'data.frame':    5882208 obs. of  25 variables:
##  $ date     : chr  "2024-08-31 23:00" "2024-08-31 23:00" "2024-08-31 23:00" "2024-08-31 23:00" ...
##  $ sitename : chr  "Hukou" "Zhongming" "Zhudong" "Hsinchu" ...
##  $ county   : chr  "Hsinchu County" "Taichung City" "Hsinchu County" "Hsinchu City" ...
##  $ aqi      : num  62 50 45 42 50 40 39 44 46 49 ...
##  $ pollutant: chr  "PM2.5" "" "" "" ...
##  $ status   : chr  "Moderate" "Good" "Good" "Good" ...
##  $ so2      : chr  "0.9" "1.6" "0.4" "0.8" ...
##  $ co       : chr  "0.17" "0.32" "0.17" "0.2" ...
##  $ o3       : chr  "35.0" "27.9" "25.1" "30.0" ...
##  $ o3_8hr   : chr  "40.2" "35.1" "40.6" "35.9" ...
##  $ pm10     : chr  "18.0" "27.0" "21.0" "19.0" ...
##  $ pm2.5    : chr  "17.0" "14.0" "13.0" "10.0" ...
##  $ no2      : chr  "2.3" "7.6" "2.9" "4.0" ...
##  $ nox      : chr  "2.6" "9.3" "4.1" "4.8" ...
##  $ no       : chr  "0.3" "1.6" "1.1" "0.7" ...
##  $ windspeed: chr  "2.3" "1.1" "0.4" "1.9" ...
##  $ winddirec: chr  "225" "184" "210" "239" ...
##  $ unit     : logi  NA NA NA NA NA NA ...
##  $ co_8hr   : chr  "0.2" "0.2" "0.2" "0.2" ...
##  $ pm2.5_avg: chr  "20.1" "15.3" "13.8" "13.0" ...
##  $ pm10_avg : chr  "26.0" "23.0" "24.0" "26.0" ...
##  $ so2_avg  : chr  "1.0" "1.0" "0.0" "1.0" ...
##  $ longitude: num  121 121 121 121 121 ...
##  $ latitude : num  24.9 24.2 24.7 24.8 24.7 ...
##  $ siteid   : num  22 31 23 24 25 26 27 28 29 30 ...

# Selecting numeric variables
X <- data[, c("aqi","so2","co","o3","o3_8hr","pm10","pm2.5",
              "no2","nox","no","windspeed","winddirec",
              "co_8hr","pm2.5_avg","pm10_avg","so2_avg",
              "longitude","latitude")]

# Convert to numeric
X <- data.frame(lapply(X, as.numeric))

# Delete NA
X <- na.omit(X)

# Take 10000 data
X <- X[1:10000, ]

cat("Structure After Data Preparation:\n")

## Structure After Data Preparation:

str(X)

## 'data.frame':    10000 obs. of  18 variables:
##  $ aqi      : num  62 50 45 42 50 40 39 44 46 49 ...
##  $ so2      : num  0.9 1.6 0.4 0.8 1 1.1 0.9 1.3 2.5 0.7 ...
##  $ co       : num  0.17 0.32 0.17 0.2 0.16 0.17 0.18 0.24 0.2 0.24 ...
##  $ o3       : num  35 27.9 25.1 30 33.5 35.2 35.3 24.6 30.3 29.4 ...
##  $ o3_8hr   : num  40.2 35.1 40.6 35.9 35.9 35 42.9 39.7 40.4 37 ...
##  $ pm10     : num  18 27 21 19 18 15 14 21 33 20 ...
##  $ pm2.5    : num  17 14 13 10 14 12 9 12 16 12 ...
##  $ no2      : num  2.3 7.6 2.9 4 1.8 4 2.4 6.8 7.3 5.6 ...
##  $ nox      : num  2.6 9.3 4.1 4.8 3.1 5.1 3.1 7.3 7.7 6.3 ...
##  $ no       : num  0.3 1.6 1.1 0.7 1.2 1.1 0.7 0.5 0.3 0.7 ...
##  $ windspeed: num  2.3 1.1 0.4 1.9 1.8 1.4 0.6 1.2 0.5 0.9 ...
##  $ winddirec: num  225 184 210 239 259 235 203 38 8 97 ...
##  $ co_8hr   : num  0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.2 0.1 0.2 ...
##  $ pm2.5_avg: num  20.1 15.3 13.8 13 15.3 12.2 11.4 13.5 14.2 15.1 ...
##  $ pm10_avg : num  26 23 24 26 28 17 16 21 26 21 ...
##  $ so2_avg  : num  1 1 0 1 1 1 0 1 1 0 ...
##  $ longitude: num  121 121 121 121 121 ...
##  $ latitude : num  24.9 24.2 24.7 24.8 24.7 ...
##  - attr(*, "na.action")= 'omit' Named int [1:1591314] 35 41 47 53 59 65 68 121 127 137 ...
##   ..- attr(*, "names")= chr [1:1591314] "35" "41" "47" "53" ...

The air quality dataset used in this study consists of 18 numerical variables representing air quality indicators, including the Air Quality Index (AQI), pollutant concentrations (SO₂, CO, O₃, 8-hour O₃, PM10, PM2.5, NO₂, NOx, and NO), meteorological variables (wind speed and wind direction), averaged pollutant measurements, and geographic coordinates. Since PCA requires numerical input variables, all selected variables were converted to numeric format, and observations with missing values were removed using listwise deletion. Due to the large size of the dataset, a subset of 10,000 observations was selected to improve computational efficiency while maintaining sufficient sample size. The prepared dataset is suitable for subsequent PCA analysis.

# Data Standardization
scaled.X <- scale(X)

scaled.mean <- apply(scaled.X, 2, mean)
scaled.sd <- apply(scaled.X, 2, sd)

check <- data.frame(
  Mean = round(scaled.mean,4),
  Standard_Deviation = round(scaled.sd,4)
)

check

##           Mean Standard_Deviation
## aqi          0                  1
## so2          0                  1
## co           0                  1
## o3           0                  1
## o3_8hr       0                  1
## pm10         0                  1
## pm2.5        0                  1
## no2          0                  1
## nox          0                  1
## no           0                  1
## windspeed    0                  1
## winddirec    0                  1
## co_8hr       0                  1
## pm2.5_avg    0                  1
## pm10_avg     0                  1
## so2_avg      0                  1
## longitude    0                  1
## latitude     0                  1

All variables were standardized using the z-score method so that each variable has a mean of 0 and a standard deviation of 1. Standardization is necessary because the variables have different units and scales. Without standardization, variables with larger scales could dominate the PCA results. By standardizing the data, all variables contribute equally to the formation of principal components, and PCA is effectively performed based on the correlation matrix, which is appropriate for variables measured on different scales.

STAGE 3 — ASSUMPTIONS

# Visual Examination of Correlation Matrix
cor.mat.initial <- cor(scaled.X)
# Count correlations > 0.30
sum(abs(cor.mat.initial[upper.tri(cor.mat.initial)]) > 0.3)

## [1] 53

round(cor.mat.initial, 2)

##            aqi   so2    co    o3 o3_8hr  pm10 pm2.5   no2   nox    no windspeed
## aqi       1.00  0.16  0.36  0.39   0.54  0.51  0.70  0.31  0.26  0.04      0.11
## so2       0.16  1.00  0.15  0.16   0.08  0.17  0.22  0.22  0.20  0.08      0.09
## co        0.36  0.15  1.00 -0.03   0.03  0.36  0.40  0.79  0.83  0.61     -0.20
## o3        0.39  0.16 -0.03  1.00   0.62  0.23  0.28 -0.16 -0.17 -0.12      0.48
## o3_8hr    0.54  0.08  0.03  0.62   1.00  0.10  0.20 -0.01 -0.07 -0.18      0.27
## pm10      0.51  0.17  0.36  0.23   0.10  1.00  0.67  0.29  0.28  0.13      0.03
## pm2.5     0.70  0.22  0.40  0.28   0.20  0.67  1.00  0.34  0.30  0.10      0.00
## no2       0.31  0.22  0.79 -0.16  -0.01  0.29  0.34  1.00  0.95  0.48     -0.25
## nox       0.26  0.20  0.83 -0.17  -0.07  0.28  0.30  0.95  1.00  0.73     -0.23
## no        0.04  0.08  0.61 -0.12  -0.18  0.13  0.10  0.48  0.73  1.00     -0.09
## windspeed 0.11  0.09 -0.20  0.48   0.27  0.03  0.00 -0.25 -0.23 -0.09      1.00
## winddirec 0.02  0.00 -0.06  0.21   0.10 -0.01  0.01 -0.09 -0.08 -0.02      0.17
## co_8hr    0.41  0.10  0.75  0.03   0.02  0.33  0.37  0.64  0.65  0.41     -0.16
## pm2.5_avg 0.88  0.17  0.39  0.28   0.30  0.58  0.82  0.33  0.29  0.09      0.06
## pm10_avg  0.71  0.12  0.38  0.29   0.25  0.76  0.64  0.30  0.28  0.12      0.08
## so2_avg   0.11  0.41  0.08  0.03   0.03  0.00  0.10  0.13  0.12  0.05      0.06
## longitude 0.14 -0.02  0.30 -0.01   0.00  0.21  0.11  0.29  0.26  0.10     -0.01
## latitude  0.52  0.03  0.38  0.19   0.25  0.39  0.43  0.35  0.31  0.11      0.03
##           winddirec co_8hr pm2.5_avg pm10_avg so2_avg longitude latitude
## aqi            0.02   0.41      0.88     0.71    0.11      0.14     0.52
## so2            0.00   0.10      0.17     0.12    0.41     -0.02     0.03
## co            -0.06   0.75      0.39     0.38    0.08      0.30     0.38
## o3             0.21   0.03      0.28     0.29    0.03     -0.01     0.19
## o3_8hr         0.10   0.02      0.30     0.25    0.03      0.00     0.25
## pm10          -0.01   0.33      0.58     0.76    0.00      0.21     0.39
## pm2.5          0.01   0.37      0.82     0.64    0.10      0.11     0.43
## no2           -0.09   0.64      0.33     0.30    0.13      0.29     0.35
## nox           -0.08   0.65      0.29     0.28    0.12      0.26     0.31
## no            -0.02   0.41      0.09     0.12    0.05      0.10     0.11
## windspeed      0.17  -0.16      0.06     0.08    0.06     -0.01     0.03
## winddirec      1.00  -0.03      0.00     0.00    0.02     -0.01     0.02
## co_8hr        -0.03   1.00      0.46     0.45    0.09      0.35     0.43
## pm2.5_avg      0.00   0.46      1.00     0.78    0.13      0.12     0.52
## pm10_avg       0.00   0.45      0.78     1.00    0.02      0.28     0.52
## so2_avg        0.02   0.09      0.13     0.02    1.00     -0.05     0.02
## longitude     -0.01   0.35      0.12     0.28   -0.05      1.00     0.60
## latitude       0.02   0.43      0.52     0.52    0.02      0.60     1.00

The correlation matrix was examined to assess the suitability of the data for Principal Component Analysis (PCA). Out of 153 pairwise correlations, 53 (34.6%) have an absolute value greater than 0.30, indicating a substantial proportion of moderate to strong relationships among the variables. Several strong correlations are observed, such as between NO₂ and NOx (r = 0.95), CO and NOx (r = 0.83), and PM2.5 and PM2.5_avg (r = 0.82), suggesting considerable shared variance. Since a meaningful number of variables are sufficiently intercorrelated, the correlation requirement for PCA is satisfied.

library(psych)

# KMO test
kmo.result <- KMO(cor(scaled.X))

kmo.result

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(scaled.X))
## Overall MSA =  0.73
## MSA for each item = 
##       aqi       so2        co        o3    o3_8hr      pm10     pm2.5       no2 
##      0.84      0.71      0.93      0.70      0.61      0.75      0.80      0.60 
##       nox        no windspeed winddirec    co_8hr pm2.5_avg  pm10_avg   so2_avg 
##      0.62      0.42      0.75      0.78      0.89      0.76      0.80      0.61 
## longitude  latitude 
##      0.66      0.84

The Kaiser-Meyer-Olkin (KMO) test was conducted to evaluate sampling adequacy and determine whether the dataset was suitable for Principal Component Analysis (PCA). The KMO statistic measures the proportion of common variance among variables that can be explained by the principal components. The overall KMO value of 0.73 falls into the middling category, indicating that the dataset is adequate and appropriate for PCA.

However, the individual Measure of Sampling Adequacy (MSA) values showed that the variable NO had an MSA of 0.42, which is below the recommended threshold of 0.50. This indicates that the variable has weak correlations with other variables and is not suitable for inclusion in PCA. Therefore, the variable NO was removed to improve sampling adequacy and the overall quality of the PCA results.

# Remove variable NO because MSA < 0.50
X.reduced <- X[, colnames(X) != "no"]

# Re-standardize after removing NO
scaled.X.reduced <- scale(X.reduced)

# Recalculate KMO
kmo.result.reduced <- KMO(cor(scaled.X.reduced))

kmo.result.reduced

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(scaled.X.reduced))
## Overall MSA =  0.77
## MSA for each item = 
##       aqi       so2        co        o3    o3_8hr      pm10     pm2.5       no2 
##      0.84      0.70      0.86      0.68      0.58      0.75      0.79      0.76 
##       nox windspeed winddirec    co_8hr pm2.5_avg  pm10_avg   so2_avg longitude 
##      0.73      0.72      0.77      0.88      0.76      0.80      0.61      0.64 
##  latitude 
##      0.84

After removing the variable NO, the Kaiser-Meyer-Olkin (KMO) test was recalculated using the reduced dataset. The overall KMO value increased from 0.73 to 0.77, indicating improved sampling adequacy and a stronger shared variance structure among the variables.

Furthermore, all remaining variables have individual Measure of Sampling Adequacy (MSA) values above 0.50, satisfying the minimum requirement for PCA. This indicates that each variable shares sufficient common variance with other variables and is appropriate for inclusion in the PCA. Therefore, the reduced dataset is considered adequate and suitable for Principal Component Analysis.

# Bartlett test
bartlett.result <- cortest.bartlett(
  cor(scaled.X.reduced),
  n = nrow(scaled.X.reduced)
)

bartlett.result

## $chisq
## [1] 126772
## 
## $p.value
## [1] 0
## 
## $df
## [1] 136

Bartlett’s Test of Sphericity was conducted to evaluate whether the correlation matrix significantly differs from an identity matrix. An identity matrix indicates that variables are uncorrelated, making PCA inappropriate. The test produced a chi-square value of 126772 with 136 degrees of freedom and a p-value less than 0.001. Since the p-value is smaller than the 0.05 significance level, the null hypothesis is rejected. This indicates that significant correlations exist among the variables. Therefore, the dataset satisfies the assumption required for Principal Component Analysis and is suitable for dimensionality reduction.

STAGE 4 — DERIVING FACTORS AND ASSESSING OVERALL FIT

Since all variables were standardized using the z-score method, the correlation matrix was used as the basis for Principal Component Analysis (PCA). The correlation matrix describes the strength and direction of linear relationships among variables and ensures equal contribution from each variable.

# Correlation Matrix
cor.matrix <- cor(scaled.X.reduced)

cor.matrix

##                  aqi          so2          co          o3       o3_8hr
## aqi       1.00000000  0.164509471  0.35707946  0.38700356  0.538214367
## so2       0.16450947  1.000000000  0.15172909  0.16170686  0.081371894
## co        0.35707946  0.151729089  1.00000000 -0.02577027  0.027846115
## o3        0.38700356  0.161706858 -0.02577027  1.00000000  0.621154546
## o3_8hr    0.53821437  0.081371894  0.02784612  0.62115455  1.000000000
## pm10      0.50669565  0.169141703  0.35789109  0.22973549  0.095936871
## pm2.5     0.70408166  0.216043093  0.40172123  0.28192513  0.197515987
## no2       0.30702568  0.224221121  0.78550941 -0.16496489 -0.010628170
## nox       0.25547911  0.202912043  0.82773043 -0.17002092 -0.073419232
## windspeed 0.11434546  0.093168384 -0.20274296  0.47777990  0.267573766
## winddirec 0.02452998 -0.001651485 -0.05968979  0.21312595  0.097510779
## co_8hr    0.40636707  0.097227726  0.75181279  0.03230909  0.019318277
## pm2.5_avg 0.87585533  0.170956162  0.39154590  0.28321702  0.301143767
## pm10_avg  0.71173948  0.117053089  0.37958877  0.28525304  0.248544682
## so2_avg   0.11323843  0.406264892  0.07563397  0.03055103  0.032649434
## longitude 0.13710433 -0.022278121  0.30450249 -0.01006536  0.002216489
## latitude  0.52282365  0.029362618  0.37648524  0.19031034  0.252213696
##                   pm10       pm2.5         no2         nox    windspeed
## aqi        0.506695649 0.704081663  0.30702568  0.25547911  0.114345461
## so2        0.169141703 0.216043093  0.22422112  0.20291204  0.093168384
## co         0.357891087 0.401721233  0.78550941  0.82773043 -0.202742958
## o3         0.229735487 0.281925126 -0.16496489 -0.17002092  0.477779899
## o3_8hr     0.095936871 0.197515987 -0.01062817 -0.07341923  0.267573766
## pm10       1.000000000 0.671348389  0.29457799  0.27607929  0.027173681
## pm2.5      0.671348389 1.000000000  0.34388678  0.30337689  0.004203052
## no2        0.294577992 0.343886783  1.00000000  0.95038312 -0.247974192
## nox        0.276079286 0.303376891  0.95038312  1.00000000 -0.226710503
## windspeed  0.027173681 0.004203052 -0.24797419 -0.22671050  1.000000000
## winddirec -0.007840368 0.006433025 -0.08978834 -0.07691247  0.171625724
## co_8hr     0.326011338 0.366952515  0.64267805  0.64847017 -0.157239088
## pm2.5_avg  0.580653693 0.817961473  0.33426973  0.29345953  0.060540705
## pm10_avg   0.764868572 0.638745102  0.30332653  0.27781692  0.076572965
## so2_avg    0.004994085 0.102854513  0.13219290  0.12200082  0.055369279
## longitude  0.211823403 0.106203241  0.28742476  0.25894448 -0.010085544
## latitude   0.387198841 0.426323271  0.34998958  0.31291769  0.026202058
##              winddirec      co_8hr   pm2.5_avg     pm10_avg      so2_avg
## aqi        0.024529983  0.40636707 0.875855333  0.711739478  0.113238433
## so2       -0.001651485  0.09722773 0.170956162  0.117053089  0.406264892
## co        -0.059689789  0.75181279 0.391545905  0.379588769  0.075633975
## o3         0.213125953  0.03230909 0.283217018  0.285253038  0.030551026
## o3_8hr     0.097510779  0.01931828 0.301143767  0.248544682  0.032649434
## pm10      -0.007840368  0.32601134 0.580653693  0.764868572  0.004994085
## pm2.5      0.006433025  0.36695252 0.817961473  0.638745102  0.102854513
## no2       -0.089788340  0.64267805 0.334269728  0.303326528  0.132192901
## nox       -0.076912473  0.64847017 0.293459525  0.277816921  0.122000817
## windspeed  0.171625724 -0.15723909 0.060540705  0.076572965  0.055369279
## winddirec  1.000000000 -0.03176882 0.001072184 -0.004474842  0.020199018
## co_8hr    -0.031768819  1.00000000 0.460358541  0.446852638  0.093773885
## pm2.5_avg  0.001072184  0.46035854 1.000000000  0.780432180  0.132234352
## pm10_avg  -0.004474842  0.44685264 0.780432180  1.000000000  0.023640991
## so2_avg    0.020199018  0.09377388 0.132234352  0.023640991  1.000000000
## longitude -0.011418065  0.34718306 0.124403289  0.284463743 -0.054999314
## latitude   0.020384251  0.42506912 0.524739963  0.518260767  0.021254289
##              longitude   latitude
## aqi        0.137104333 0.52282365
## so2       -0.022278121 0.02936262
## co         0.304502486 0.37648524
## o3        -0.010065364 0.19031034
## o3_8hr     0.002216489 0.25221370
## pm10       0.211823403 0.38719884
## pm2.5      0.106203241 0.42632327
## no2        0.287424760 0.34998958
## nox        0.258944483 0.31291769
## windspeed -0.010085544 0.02620206
## winddirec -0.011418065 0.02038425
## co_8hr     0.347183063 0.42506912
## pm2.5_avg  0.124403289 0.52473996
## pm10_avg   0.284463743 0.51826077
## so2_avg   -0.054999314 0.02125429
## longitude  1.000000000 0.60188012
## latitude   0.601880125 1.00000000

The correlation matrix shows several strong correlations among variables, such as NO₂ and NOx (0.95), CO and NOx (0.83), and AQI and PM2.5_avg (0.88), indicating overlapping or redundant information. Negative correlations are also present, such as between O₃ and NOx (-0.17), indicating inverse relationships. The presence of moderate to strong correlations confirms that the dataset is suitable for PCA, as PCA transforms correlated variables into a smaller set of uncorrelated principal components while retaining most of the original variance.

# Eigenvalues and Eigenvectors
eigen.decomp <- eigen(cor.matrix)

eigen.vals <- eigen.decomp$values
eigen.vecs <- eigen.decomp$vectors

sorted.idx <- order(eigen.vals, decreasing = TRUE)
sorted.eigen.vals <- eigen.vals[sorted.idx]
sorted.eigen.vecs <- eigen.vecs[, sorted.idx]

cat("Eigenvalues (first 5):\n")

## Eigenvalues (first 5):

print(sorted.eigen.vals[1:5])

## [1] 5.829627 2.857237 1.519862 1.301185 1.004075

cat("Eigenvectors (first 5):\n")

## Eigenvectors (first 5):

print(sorted.eigen.vecs[,1:5])

##               [,1]          [,2]         [,3]         [,4]         [,5]
##  [1,] -0.329841314  0.2349299240 -0.001552796 -0.102604557 -0.131592641
##  [2,] -0.107284094  0.0279740070  0.589413972 -0.002831555  0.312143540
##  [3,] -0.301151098 -0.2869922526  0.066134937  0.149529357 -0.221373877
##  [4,] -0.109373160  0.4318095939  0.099117587  0.277512472 -0.171150543
##  [5,] -0.122755496  0.3566617357  0.064362654  0.283507178 -0.337182038
##  [6,] -0.288881123  0.1042630599 -0.106168899 -0.288697680  0.161044524
##  [7,] -0.320626073  0.1388658880  0.019927181 -0.313542425 -0.008024958
##  [8,] -0.280459825 -0.3478899338  0.145493999  0.134157075 -0.166675429
##  [9,] -0.268377160 -0.3686684557  0.151496966  0.147394880 -0.190626489
## [10,]  0.008641508  0.3485897840  0.114527541  0.326065647  0.112598865
## [11,]  0.005748088  0.1495170107  0.056617954  0.346646077 -0.151533897
## [12,] -0.296542428 -0.2181903588 -0.008260336  0.157068980 -0.122872245
## [13,] -0.344801023  0.1760638377 -0.015092203 -0.250183310 -0.028144911
## [14,] -0.331718982  0.1555023335 -0.154550996 -0.192437636  0.111761463
## [15,] -0.060658322  0.0004244882  0.583006072  0.009860619  0.387968032
## [16,] -0.162372511 -0.1122786733 -0.349575577  0.415977452  0.534850453
## [17,] -0.279319584  0.0403982530 -0.275419656  0.253822409  0.320210381

Based on the eigen decomposition of the correlation matrix, eigenvalues and eigenvectors were obtained, which define the principal components in PCA. Eigenvalues indicate the amount of variance explained by each component. Since PCA was performed on the correlation matrix of 17 variables, the total variance equals 17. The results show that the first five eigenvalues are 5.83, 2.86, 1.52, 1.30, and 1.00, indicating that the first component explains the largest portion of variance, followed by the subsequent components. The eigenvectors represent the loadings or weights of each variable in forming the principal components. Variables with larger absolute loadings contribute more strongly to the corresponding component.

# Variance explained
tot.eigen <- sum(eigen.vals)

var.pc <- (sorted.eigen.vals / tot.eigen) * 100

print(var.pc)

##  [1] 34.2919242 16.8072746  8.9403660  7.6540276  5.9063227  5.5142418
##  [7]  4.5754007  3.6614624  2.8421060  2.5354423  2.1423479  1.6192732
## [13]  1.2236326  1.0129921  0.6873144  0.3581646  0.2277069

# Cumulative Variance
cum.var <- cumsum(var.pc)

data.frame(
  PC = 1:length(var.pc),
  Variance = var.pc,
  Cumulative = cum.var
)

##    PC   Variance Cumulative
## 1   1 34.2919242   34.29192
## 2   2 16.8072746   51.09920
## 3   3  8.9403660   60.03956
## 4   4  7.6540276   67.69359
## 5   5  5.9063227   73.59991
## 6   6  5.5142418   79.11416
## 7   7  4.5754007   83.68956
## 8   8  3.6614624   87.35102
## 9   9  2.8421060   90.19313
## 10 10  2.5354423   92.72857
## 11 11  2.1423479   94.87092
## 12 12  1.6192732   96.49019
## 13 13  1.2236326   97.71382
## 14 14  1.0129921   98.72681
## 15 15  0.6873144   99.41413
## 16 16  0.3581646   99.77229
## 17 17  0.2277069  100.00000

cat("Total variance explained by first 5 PCs:", cum.var[5])

## Total variance explained by first 5 PCs: 73.59991

The cumulative variance was used to determine the number of components to retain. The results show that the first five principal components explain approximately 73.60% of the total variance. Although this value is below 90%, it satisfies the Kaiser criterion (eigenvalue > 1) and captures a substantial proportion of the dataset variability. Therefore, five principal components were retained for further analysis.

# Scree plot
eigen.temp <- eigen(cor(scaled.X.reduced))

plot(eigen.temp$values,
     type = "b",
     pch = 19,
     xlab = "Principal Component",
     ylab = "Eigenvalue",
     main = "Scree Plot of Eigenvalues")

The scree plot shows a sharp decline in eigenvalues in the first few components, followed by a gradual leveling off. Based on the Kaiser criterion, five components have eigenvalues greater than 1, indicating that these components explain substantial variance. Since the first five principal components explain approximately 73.60% of the total variance, five components were retained for further analysis.

STAGE 5 — Interpreting the Principal Components

# Run PCA using psych to obtain communalities
pca.psych <- principal(scaled.X.reduced,
                       nfactors = 5,
                       rotate = "none")

# Extract communalities
communalities <- pca.psych$communality

communalities

##       aqi       so2        co        o3    o3_8hr      pm10     pm2.5       no2 
## 0.8230221 0.6951888 0.8489824 0.7470476 0.6763439 0.6691779 0.7829763 0.8878351 
##       nox windspeed winddirec    co_8hr pm2.5_avg  pm10_avg   so2_avg longitude 
## 0.9078698 0.5186379 0.2483502 0.6960308 0.8642261 0.8075990 0.6893045 0.8878328 
##  latitude 
## 0.7615603

Communalities represent the proportion of variance of each variable explained by the retained principal components. Since five principal components were selected, explaining approximately 73.60% of the total variance, the communality values indicate how well each variable is represented in this 5-component solution. Most variables show moderate to high communalities (generally above 0.70), indicating that a substantial portion of their variance is captured by the retained components. However, some variables such as wind direction exhibit lower communality, suggesting that they are less well represented in the reduced component space.

The correlation circle displays the relationships between air quality variables and the first two principal components (PC1 and PC2). The length of each arrow represents the strength of the correlation between a variable and the principal components, while the direction indicates whether the correlation is positive or negative with respect to each axis. Variables located close to the circumference are well represented in the two-dimensional space. In the contribution plot, variables such as pm2.5, pm2.5_avg, pm10, pm10_avg, aqi, co, nox, and no2 show relatively high contributions to PC1 and PC2, indicating that these pollutants are major drivers of variability in the dataset. In the cos² plot, variables with higher cos² values (brighter colors) are better represented by PC1 and PC2, meaning that most of their variance is captured in this two-dimensional projection. Variables pointing in similar directions (e.g., pm2.5, pm2.5_avg, pm10_avg, and aqi) indicate strong positive correlations, whereas variables positioned in opposite directions reflect negative correlations.

# Contribution Plot for All PC

``` r
contrib_list <- list()

for(i in 1:5){
  
  contrib_list[[i]] <- fviz_contrib(
    pca.model,
    choice = "var",
    axes = i,
    top = 10
  ) +
  ggtitle(paste0("Top 10 Variable Contributions to PC", i))
}

do.call(grid.arrange, c(contrib_list, ncol = 2))

The contribution plots show the variables that play the most important roles in forming the five retained principal components (PC1–PC5). For PC1, the dominant contributors are pm2.5_avg, pm10_avg, aqi, pm2.5, and co, indicating that this component is strongly associated with overall air pollution intensity, particularly particulate matter and related gaseous pollutants. PC2 is mainly driven by o3, nox, o3_8hr, windspeed, and no2, suggesting that it reflects variability linked to ozone formation, nitrogen oxides, and atmospheric dynamics. In both components, variables exceeding the red dashed reference line contribute more than the average expected contribution and are therefore considered the primary drivers of the respective principal component.

For PC3, the largest contributions come from so2 and so2_avg, followed by geographic variables such as longitude and latitude, indicating that this component captures sulfur dioxide variation combined with spatial effects. PC4 is dominated by longitude, wind direction, and windspeed, along with particulate variables, suggesting a component related to spatial distribution and pollutant transport mechanisms. Finally, PC5 shows a very strong contribution from longitude, followed by so2_avg, o3_8hr, and latitude, implying that this component primarily reflects geographic gradients with secondary influence from sulfur dioxide and ozone variables. Overall, the five components represent distinct but complementary dimensions of air pollution variability and spatial–meteorological structure in the dataset.

# Cos² Correlation Circle

``` r
pc.comb <- combn(1:5, 2)
plot.list <- list()

for(i in 1:ncol(pc.comb)){
  
  p <- fviz_pca_var(
    pca.model,
    axes = pc.comb[, i],
    col.var = "cos2",
    gradient.cols = c("white","blue","red"),
    repel = TRUE
  ) +
  ggtitle(paste0("PC", pc.comb[1,i], " vs PC", pc.comb[2,i]))
  
  if(i != 1){
    p <- p + theme(legend.position = "none")
  }
  
  plot.list[[i]] <- p
}

do.call(grid.arrange, c(plot.list, ncol = 2))

The matrix of cos² correlation circles shows the quality of representation of each variable across different pairs of principal components. Variables such as pm2.5, pm2.5_avg, pm10, pm10_avg, aqi, co, no2, and nox consistently appear close to the circumference and in darker red in plots involving PC1, indicating high cos² values and strong representation. These variables are aligned in similar directions, confirming strong positive correlations and showing that PC1 captures the dominant variation related to overall pollution intensity. In contrast, ozone-related variables (o3 and o3_8hr) are more strongly aligned with PC2, while nitrogen oxides tend to point in the opposite direction along that dimension, suggesting a negative association consistent with atmospheric photochemical processes.

Higher components reveal more specific structures. SO2 and so2_avg are strongly represented in dimensions involving PC3, indicating that this component isolates sulfur-related variability. Meanwhile, longitude and latitude show stronger representation in PC4 and PC5, confirming that these components primarily capture spatial heterogeneity rather than general pollution levels. Meteorological variables such as wind direction tend to have shorter vectors and lighter colors across most projections, suggesting lower cos² values and weaker representation in the principal component structure. Overall, the plots confirm a clear dimensional pattern separating general pollution intensity, ozone dynamics, sulfur variation, and spatial effects.

# Biplot
pc.comb <- combn(1:5, 2)
plot.list <- list()

for(i in 1:ncol(pc.comb)){
  
  plot.list[[i]] <- fviz_pca_biplot(
    pca.model,
    axes = pc.comb[, i],
    geom.ind = "point",
    pointsize = 0.3,
    col.ind = "gray70",
    alpha.ind = 0.2,
    col.var = "contrib",
    gradient.cols = c("blue","yellow","red"),
    repel = TRUE
  ) +
  ggtitle(paste0("PC", pc.comb[1,i], " vs PC", pc.comb[2,i]))
}

do.call(grid.arrange, c(plot.list, ncol = 2))

The PCA biplot matrix displays both observations (gray points) and variable vectors across several pairs of principal components, with arrow length and color intensity representing each variable’s contribution (contrib) to the corresponding dimensions. In plots involving PC1 (Dim1 = 34.3%), variables such as PM2.5, PM2.5_avg, PM10, PM10_avg, AQI, NOx, NO2, and CO consistently show long arrows with warm colors, pointing toward the positive side of PC1. This indicates that these pollutants are the primary contributors to the first component and confirms that PC1 represents overall pollution intensity, largely driven by particulate matter and combustion-related gases. Observations projected in the same direction as these vectors tend to exhibit higher concentrations of these pollutants.

In contrast, O3 and O3_8hr show strong contributions in dimensions involving PC2, generally pointing upward along Dim2 (16.8%), while nitrogen oxides tend to point in the opposite direction in that space, reflecting their contrasting behavior in atmospheric processes. Furthermore, SO2 and SO2_avg contribute strongly to PC3 (Dim3 = 8.9%), clearly separating sulfur-related variability from general pollution intensity. Geographic variables such as longitude and latitude become more prominent in PC4 and PC5, indicating that these higher components primarily capture spatial heterogeneity rather than chemical pollution structure. Overall, the biplot matrix confirms a structured dimensional pattern separating general pollution intensity, ozone–meteorological dynamics, sulfur variability, and spatial effects.

STAGE 6 — Validation of PCA

set.seed(123)

# Split 50% - 50%
n <- nrow(scaled.X.reduced)
idx.train <- sample(1:n, size = floor(0.5*n))

train.data <- scaled.X.reduced[idx.train, ]
test.data  <- scaled.X.reduced[-idx.train, ]

# PCA on training data
pca.train <- principal(train.data,
                       nfactors = 5,
                       rotate = "none")

# PCA on testing data
pca.test <- principal(test.data,
                      nfactors = 5,
                      rotate = "none")

# Compare eigenvalues
train.eigen <- pca.train$values[1:5]
test.eigen  <- pca.test$values[1:5]

data.frame(
  PC = paste0("PC",1:5),
  Train = train.eigen,
  Test = test.eigen
)

##    PC     Train     Test
## 1 PC1 5.9202895 5.746111
## 2 PC2 2.8708215 2.843728
## 3 PC3 1.5247238 1.517155
## 4 PC4 1.2970054 1.309180
## 5 PC5 0.9939569 1.014680

To evaluate the stability and robustness of the PCA solution, a split-sample validation approach was applied by randomly dividing the dataset into two equal subsets (50% training and 50% testing). PCA was then performed separately on each subset using five components without rotation. The eigenvalues of the first five principal components from the training and testing datasets were compared. The results show high consistency between both subsets. For example, PC1 has an eigenvalue of 5.92 in the training set and 5.75 in the testing set, while PC2 shows 2.87 and 2.84, respectively. The relatively small differences across all five components indicate that the component structure is stable and not sensitive to sampling variation. Therefore, the 5-component solution can be considered robust and reliable in representing the underlying structure of the air quality dataset.

STAGE 7 — PCA Score Computation

# Select 5 main components (after removing NO)
selected.eigen.vecs <- sorted.eigen.vecs[, 1:5]

# Projection using reduced standardized data
pca.result <- scaled.X.reduced %*% selected.eigen.vecs

# Convert to data frame
pca.X <- as.data.frame(pca.result)
colnames(pca.X) <- paste("PC", 1:5, sep = "")

head(pca.X)

##          PC1         PC2        PC3         PC4         PC5
## 1 -0.8167489  1.71322813 -0.2936654  0.06963302  0.75180078
## 2 -0.7460992 -0.44531337  0.7481311 -0.37058947  0.15315757
## 3  0.4049785  0.05127665 -1.8503856 -0.06475385 -0.06313232
## 4  0.3857649  0.33479703 -0.2777437  0.75025761  0.79189242
## 5  0.2987517  1.29735854 -0.1886397  0.06565303  0.90606408
## 6  1.2999600  0.27999411  0.3030849  0.62746386  0.56580928

After selecting the five principal components, the standardized data were projected onto the new component space using the selected eigenvectors. This projection was performed by multiplying the standardized data matrix by the matrix of eigenvectors, resulting in principal component scores for each observation. The PC1–PC5 scores are linear combinations of the original variables and form a new set of uncorrelated variables. The transformed dataset consists of five principal components that explain approximately 73.60% of the total variance in the original dataset. This lower-dimensional representation preserves most of the essential information and is suitable for further analyses such as clustering, regression or factor analysis.

2. Faktor Analysis (FA) with function Principal

Factor Analysis was conducted using Principal Component extraction via the principal() function in the psych package, followed by Varimax orthogonal rotation to improve interpretability. Prior diagnostic tests confirmed the adequacy of the data (KMO = 0.77 and Bartlett’s Test p < 0.001), indicating sufficient shared variance for factor modeling. Based on the Kaiser criterion and scree plot evaluation, five factors were retained, explaining approximately 74% of the total variance. The rotated solution produced a clearer loading structure with reduced cross-loadings and communality estimates suggest that most variables are well represented by the extracted factors.

STAGE 1 — Assumption Recheck for FA

fa.data <- scaled.X.reduced
library(corrplot)

## corrplot 0.95 loaded

cor.fa <- cor(fa.data)
corrplot(cor.fa,
         method = "color",
         type = "upper",
         tl.col = "black",
         tl.cex = 0.8,
         tl.srt = 45,
         mar = c(0,0,1,0),
         diag = FALSE,
         addCoef.col = "black",
         number.cex = 0.7,
         title = "Correlation Heatmap for Factor Analysis")

#KMO & Barllet
library(psych)

kmo_result <- KMO(cor.fa)
bartlett_result <- cortest.bartlett(cor.fa, n = nrow(fa.data))

result_df <- data.frame(
  Test = c("KMO", "Bartlett"),
  Overall_MSA = c(kmo_result$MSA, NA),
  Chi_Square = c(NA, bartlett_result$chisq),
  df = c(NA, bartlett_result$df),
  p_value = c(NA, bartlett_result$p.value)
)

result_df

##       Test Overall_MSA Chi_Square  df p_value
## 1      KMO   0.7722823         NA  NA      NA
## 2 Bartlett          NA     126772 136       0

The assumption tests indicate that the dataset is appropriate for Factor Analysis. The Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy produced an overall value of 0.77, which falls within the acceptable range and suggests that the variables share sufficient common variance for factor extraction. Additionally, all individual MSA values exceed the minimum threshold of 0.50, indicating that each variable is adequately correlated with the others and suitable for inclusion in the analysis. Bartlett’s Test of Sphericity yielded a chi-square value of 126,772 with 136 degrees of freedom and a p-value less than 0.001. Since the test is highly significant, the null hypothesis that the correlation matrix is an identity matrix is rejected. This confirms the presence of significant correlations among variables. Therefore, the dataset satisfies the necessary assumptions and is suitable for conducting Factor Analysis.

STAGE 2 — Determine Number of Factors

# Eigenvalue & Scree Plot
eigen.fa <- eigen(cor.fa)

plot(eigen.fa$values,
     type = "b",
     pch = 19,
     xlab = "Factor Number",
     ylab = "Eigenvalue",
     main = "Scree Plot for Factor Analysis")
abline(h = 1, col = "red")

var_explained <- eigen.fa$values / sum(eigen.fa$values) * 100
cum_var <- cumsum(var_explained)

barplot(var_explained[1:5], names.arg=paste0("F",1:5), col="red",
        main="Variance Explained by Factors", ylab="Percentage of Variance")
lines(1:5, cum_var[1:5], type="b", pch=19, col="yellow")
text(1:5, cum_var[1:5]+2, labels=round(cum_var[1:5],1), col="yellow")
abline(h=1*100/sum(eigen.fa$values), col="yellow", lty=2)

The scree plot presents the eigenvalues derived from the correlation matrix of the standardized variables. Each eigenvalue represents the amount of variance explained by a corresponding factor. Since the analysis involves 17 variables, the total variance equals 17 and each factor accounts for a portion of this total variance. The plot shows a sharp decline from the first to the second factor, followed by a more gradual decrease after the fifth factor. The first five factors have eigenvalues greater than 1 and are therefore retained based on the Kaiser criterion (eigenvalue > 1). In addition, the visible “elbow” around the fifth factor indicates that additional factors contribute only marginal increases in explained variance. Therefore, a five-factor solution is considered sufficient to represent the underlying latent structure of the air quality variables in this study.

STAGE 3 — Factor Extraction

# Unrotated Factor
library(psych)

fa.unrotated <- principal(
  fa.data,
  nfactors = 5,
  rotate = "none"
)

loading_df <- as.data.frame(unclass(fa.unrotated$loadings))
loading_df <- round(loading_df, 3)

eigen_df <- data.frame(
  Factor = paste0("Factor", 1:length(fa.unrotated$values)),
  Eigenvalue = round(fa.unrotated$values, 3)
)

variance_df <- data.frame(
  Factor = paste0("Factor", 1:ncol(fa.unrotated$Vaccounted)),
  SS_Loadings = round(fa.unrotated$Vaccounted["SS loadings", ], 3),
  Proportion_Var = round(fa.unrotated$Vaccounted["Proportion Var", ], 3),
  Cumulative_Var = round(fa.unrotated$Vaccounted["Cumulative Var", ], 3)
)

loading_df

##              PC1    PC2    PC3    PC4    PC5
## aqi        0.796  0.397 -0.002 -0.117 -0.132
## so2        0.259  0.047  0.727 -0.003  0.313
## co         0.727 -0.485  0.082  0.171 -0.222
## o3         0.264  0.730  0.122  0.317 -0.171
## o3_8hr     0.296  0.603  0.079  0.323 -0.338
## pm10       0.697  0.176 -0.131 -0.329  0.161
## pm2.5      0.774  0.235  0.025 -0.358 -0.008
## no2        0.677 -0.588  0.179  0.153 -0.167
## nox        0.648 -0.623  0.187  0.168 -0.191
## windspeed -0.021  0.589  0.141  0.372  0.113
## winddirec -0.014  0.253  0.070  0.395 -0.152
## co_8hr     0.716 -0.369 -0.010  0.179 -0.123
## pm2.5_avg  0.833  0.298 -0.019 -0.285 -0.028
## pm10_avg   0.801  0.263 -0.191 -0.220  0.112
## so2_avg    0.146  0.001  0.719  0.011  0.389
## longitude  0.392 -0.190 -0.431  0.475  0.536
## latitude   0.674  0.068 -0.340  0.290  0.321

eigen_df

##      Factor Eigenvalue
## 1   Factor1      5.830
## 2   Factor2      2.857
## 3   Factor3      1.520
## 4   Factor4      1.301
## 5   Factor5      1.004
## 6   Factor6      0.937
## 7   Factor7      0.778
## 8   Factor8      0.622
## 9   Factor9      0.483
## 10 Factor10      0.431
## 11 Factor11      0.364
## 12 Factor12      0.275
## 13 Factor13      0.208
## 14 Factor14      0.172
## 15 Factor15      0.117
## 16 Factor16      0.061
## 17 Factor17      0.039

variance_df

##      Factor SS_Loadings Proportion_Var Cumulative_Var
## PC1 Factor1       5.830          0.343          0.343
## PC2 Factor2       2.857          0.168          0.511
## PC3 Factor3       1.520          0.089          0.600
## PC4 Factor4       1.301          0.077          0.677
## PC5 Factor5       1.004          0.059          0.736

library(reshape2)
library(ggplot2)
library(ggrepel)

load.unrot <- as.data.frame(unclass(fa.unrotated$loadings))
load.unrot$Variable <- rownames(load.unrot)

melt.unrot <- melt(load.unrot, id.vars = "Variable")

ggplot(melt.unrot,
       aes(x = variable,
           y = Variable,
           fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)),
            size = 3.5) +
  scale_fill_gradient2(low = "blue",
                       mid = "white",
                       high = "red",
                       midpoint = 0) +
  coord_fixed(ratio = 0.3) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid = element_blank()
  ) +
  labs(title = "Unrotated Factor Loading Heatmap",
       x = "Factor",
       y = "Variable")

In the initial extraction stage without rotation (unrotated solution), five factors were extracted using the Principal Component method. Collectively, these five factors explain 74% of the total variance, indicating that the model adequately captures the underlying correlation structure among the air quality variables. The first factor (PC1) has an eigenvalue of 5.83 and accounts for 34% of the total variance, making it the most dominant component in the model.The loading pattern reveals that major pollutants such as PM2.5, PM10, PM2.5_avg, PM10_avg, CO, NO₂, NOx and AQI load heavily on the first factor. This suggests that the first factor represents a general air pollution dimension, primarily associated with particulate matter and combustion-related emissions. However, several variables also exhibit moderate loadings on other factors, indicating the presence of cross-loadings.

Subsequent factors begin to capture more specific sources of variation, including ozone-related variables (O₃ and O₃_8hr), sulfur dioxide (SO₂ and SO₂_avg), as well as meteorological and spatial characteristics. Nevertheless, the unrotated solution does not yet achieve a clear and simple structure. The mean item complexity of 2.1 suggests that, on average, variables are associated with more than one factor. Although the model demonstrates excellent goodness-of-fit (RMSR = 0.04 and Fit = 0.98), interpretability remains limited due to the dominance of the first factor and overlapping loadings.

STAGE 4 — Rotation (Varimax)

# rotated Factor Solution
fa.rotated <- principal(
  fa.data,
  nfactors = 5,
  rotate = "varimax"
)

print(fa.rotated)

## Principal Components Analysis
## Call: principal(r = fa.data, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##             RC1   RC4   RC2   RC5   RC3   h2    u2 com
## aqi        0.81  0.23  0.33  0.02  0.03 0.82 0.177 1.5
## so2        0.13  0.12  0.08 -0.05  0.81 0.70 0.305 1.1
## co         0.25  0.88 -0.03  0.12  0.03 0.85 0.151 1.2
## o3         0.30 -0.13  0.80 -0.01  0.06 0.75 0.253 1.4
## o3_8hr     0.27  0.03  0.77 -0.08 -0.06 0.68 0.324 1.3
## pm10       0.79  0.11 -0.08  0.17  0.06 0.67 0.331 1.2
## pm2.5      0.85  0.20  0.03 -0.02  0.11 0.78 0.217 1.2
## no2        0.17  0.90 -0.12  0.10  0.13 0.89 0.112 1.2
## nox        0.12  0.92 -0.13  0.08  0.12 0.91 0.092 1.1
## windspeed  0.02 -0.30  0.61  0.15  0.19 0.52 0.481 1.8
## winddirec -0.13  0.02  0.48  0.04 -0.01 0.25 0.752 1.2
## co_8hr     0.30  0.75  0.00  0.22  0.00 0.70 0.304 1.5
## pm2.5_avg  0.89  0.22  0.13  0.04  0.07 0.86 0.136 1.2
## pm10_avg   0.85  0.16  0.07  0.24  0.00 0.81 0.192 1.2
## so2_avg    0.02  0.06  0.01 -0.01  0.83 0.69 0.311 1.0
## longitude  0.06  0.20 -0.01  0.92 -0.05 0.89 0.112 1.1
## latitude   0.45  0.25  0.16  0.69 -0.04 0.76 0.238 2.2
## 
##                        RC1  RC4  RC2  RC5  RC3
## SS loadings           4.13 3.41 2.03 1.51 1.44
## Proportion Var        0.24 0.20 0.12 0.09 0.08
## Cumulative Var        0.24 0.44 0.56 0.65 0.74
## Proportion Explained  0.33 0.27 0.16 0.12 0.12
## Cumulative Proportion 0.33 0.60 0.76 0.88 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.04 
##  with the empirical chi square  4933.69  with prob <  0 
## 
## Fit based upon off diagonal values = 0.98

load.rot <- as.data.frame(unclass(fa.rotated$loadings))
load.rot$Variable <- rownames(load.rot)

library(reshape2)
melt.rot <- melt(load.rot, id.vars = "Variable")

library(ggplot2)
library(ggrepel)
ggplot(melt.rot,
       aes(x = variable,
           y = Variable,
           fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", value)),
            size = 3) +
  scale_fill_gradient2(low = "blue",
                       mid = "white",
                       high = "red",
                       midpoint = 0,
                       name = "") +
  coord_fixed(ratio = 0.35) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid = element_blank(),
    plot.title = element_text(face = "bold", hjust = 0.5)
  ) +
  labs(title = "Rotated Factor Loading Heatmap (Varimax)",
       x = "Factor",
       y = "Variable")

# Example biplot rotated RC1 vs RC2
library(ggrepel)

ggplot(load.rot, aes(x=RC1, y=RC2, label=Variable)) +
  geom_point(color="blue") +
  geom_text_repel() +
  geom_hline(yintercept=0, linetype="dashed") +
  geom_vline(xintercept=0, linetype="dashed") +
  labs(title="Factor Loading Plot (Rotated Varimax): RC1 vs RC2", x="RC1", y="RC2") +
  theme_minimal()

To enhance interpretability, an orthogonal Varimax rotation was applied. While the total variance explained remains unchanged at 74%, rotation redistributes the variance across factors to achieve a simpler and more interpretable structure.

After rotation, the contribution of each factor becomes more balanced. The first rotated factor (RC1) clearly captures particulate-related variables such as PM2.5, PM10, PM2.5_avg, PM10_avg and AQI, and can therefore be interpreted as a particulate pollution factor. The second factor (RC4) is strongly associated with CO, NO₂, NOx and CO_8hr, representing combustion-related gaseous emissions, likely linked to vehicular and industrial sources. The third factor (RC2) loads heavily on O₃ and O₃_8hr, along with windspeed, suggesting a photochemical and atmospheric dynamics dimension. The fourth factor (RC5) is primarily defined by longitude and latitude, indicating a spatial or geographical dimension. Finally, the fifth factor (RC3) loads strongly on SO₂ and SO₂_avg, reflecting a sulfur-related pollution component.

Importantly, the Varimax rotation reduces the mean item complexity from 2.1 to 1.3, indicating that most variables now load strongly on a single dominant factor. This confirms that rotation effectively reduces cross-loadings and produces a clearer, more stable latent structure, thereby improving substanti ve interpretability without altering overall model fit.

STAGE 5 — Communalities

# Communalities Variable
comm <- data.frame(
  Variable = names(fa.rotated$communality),
  Communality = round(fa.rotated$communality, 3)
)

comm

##            Variable Communality
## aqi             aqi       0.823
## so2             so2       0.695
## co               co       0.849
## o3               o3       0.747
## o3_8hr       o3_8hr       0.676
## pm10           pm10       0.669
## pm2.5         pm2.5       0.783
## no2             no2       0.888
## nox             nox       0.908
## windspeed windspeed       0.519
## winddirec winddirec       0.248
## co_8hr       co_8hr       0.696
## pm2.5_avg pm2.5_avg       0.864
## pm10_avg   pm10_avg       0.808
## so2_avg     so2_avg       0.689
## longitude longitude       0.888
## latitude   latitude       0.762

comm <- fa.rotated$communality
barplot(comm, names.arg=names(comm), col="red", las=2,
        main="Communalities per Variable")
abline(h=0.5, col="yellow", lty=2)

Communality represents the proportion of each variable’s variance that is explained by the extracted factors. In other words, higher communality values indicate that a variable is well represented by the factor model.

The results show that most variables have high communality values (above 0.6), including NOx (0.91), NO₂ (0.89), longitude (0.89), PM2.5_avg (0.86), CO (0.85) and AQI (0.82). This suggests that these variables strongly contribute to the latent structure and are well captured by the five-factor solution.Some variables display moderate communality, such as windspeed (0.52), which remains within an acceptable range. However, wind direction (0.25) has a relatively low communality, indicating that a substantial portion of its variance is not explained by the extracted factors. Wind direction is a circular variable measured in degrees (0–360) and its linear treatment in PCA or FA may limit its representation in the factor structure. This suggests that wind direction may behave more independently or is less strongly associated with the main latent dimensions. Overall, since the majority of communality values exceed 0.5, the factor model can be considered adequate in representing the variance structure of the air quality variables.

STAGE 6 — Factor Scores

# Factor scores from rotated FA
scores_df <- as.data.frame(fa.rotated$scores)
head(scores_df)

##          RC1        RC4        RC2        RC5        RC3
## 1  0.6887317 -0.7577945  0.4681434  0.6603423  0.2475130
## 2  0.2148040  0.2954464 -0.2958117 -0.2301398  0.6230258
## 3  0.1187411 -0.3661897 -0.2581393  0.5613031 -1.3263265
## 4 -0.3104363 -0.3693298  0.2374175  0.9178974  0.2024324
## 5  0.2390425 -0.9158682  0.2067130  0.6285434  0.3390911
## 6 -0.6287834 -0.4316504  0.2474547  0.4017330  0.4414345

scores_melt <- melt(scores_df, id.vars = NULL)

summary(scores_df)

##       RC1                RC4               RC2                RC5         
##  Min.   :-2.50366   Min.   :-3.1150   Min.   :-3.35280   Min.   :-4.0164  
##  1st Qu.:-0.67266   1st Qu.:-0.5999   1st Qu.:-0.77615   1st Qu.:-0.7507  
##  Median :-0.04383   Median :-0.2253   Median :-0.02577   Median :-0.1106  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.58804   3rd Qu.: 0.2970   3rd Qu.: 0.75046   3rd Qu.: 0.8115  
##  Max.   : 7.72838   Max.   : 8.8727   Max.   : 3.33392   Max.   : 2.4551  
##       RC3         
##  Min.   :-2.0501  
##  1st Qu.:-0.8376  
##  Median : 0.1008  
##  Mean   : 0.0000  
##  3rd Qu.: 0.5918  
##  Max.   :18.4754

library(ggplot2)
library(reshape2)

scores_melt <- melt(scores_df, variable.name="Factor", value.name="Score")

## No id variables; using all as measure variables

ggplot(scores_melt, aes(x=Factor, y=Score, fill=Factor)) +
  geom_violin(trim=FALSE, alpha=0.6) +
  geom_boxplot(width=0.1, fill="white") +
  labs(title="Distribution of Factor Scores", x="Factor", y="Score") +
  theme_minimal()

After determining that five factors should be retained, a Varimax rotation was applied to obtain a clearer and more interpretable structure. Because Varimax produces orthogonal components, each factor represents an independent dimension of variation, with variables loading strongly on one component and minimally on others. The rotated solution reveals a meaningful five-factor structure that captures distinct but complementary aspects of air quality dynamics.

The first rotated component (RC1) is primarily defined by strong loadings from particulate matter variables, particularly PM2.5 and PM10, indicating overall particulate pollution intensity. This dimension reflects variations in suspended solid particles, commonly associated with traffic emissions, industrial activity and urban processes. RC4 is dominated by combustion-related gases such as carbon monoxide (CO) and nitrogen dioxide (NO₂), representing emissions from fossil fuel combustion and vehicular sources. RC2 loads strongly on ozone (O₃), capturing photochemical processes driven by atmospheric reactions and sunlight exposure. RC3 is characterized by sulfur dioxide (SO₂), suggesting sulfur-related industrial emissions and the presence of localized or episodic concentration spikes. Finally, RC5 reflects spatial variation across monitoring locations, capturing underlying geographic heterogeneity rather than specific pollutant mechanisms.

To evaluate how these latent dimensions are distributed across observations, factor scores were computed. Since the data were standardized prior to extraction, all scores are centered around zero, with positive values indicating above-average alignment with a particular factor and negative values reflecting relatively lower tendencies. The distribution of scores shows substantial variability across components. The particulate (RC1) and combustion-related (RC4) dimensions exhibit relatively wide ranges, indicating heterogeneity in pollution intensity across time and location. The sulfur-related factor (RC3) displays several extreme positive values, consistent with episodic spikes in sulfur emissions. In contrast, the spatial factor (RC5) presents a more moderate and gradual distribution, reflecting systematic geographic differences rather than abrupt fluctuations.

Overall, the rotated factor structure and corresponding factor scores provide a statistically coherent and substantively interpretable representation of air quality variation, capturing multiple independent pollution dynamics while reducing the complexity of the original dataset.

STAGE 7 — Summary

This study used Principal Component Analysis (PCA) and Exploratory Factor Analysis (FA) to better understand and simplify the multivariate structure of Taiwan’s Air Quality Index dataset from 2016 to 2024. The dataset includes various correlated air pollutants, meteorological indicators and geographic variables, making dimensional reduction necessary for clearer interpretation. Preliminary tests confirmed that the data were appropriate for multivariate analysis (KMO = 0.77; Bartlett’s Test p < 0.001), indicating sufficient shared variance among variables.

The PCA results showed that five principal components could explain about 73.6% of the total variance. This means that most of the information contained in the original 17 variables can be summarized into a smaller set of independent dimensions. The components reflect major pollution patterns, including particulate concentration, combustion-related gases, ozone-related atmospheric processes, sulfur variation and spatial–meteorological influences. Additional validation suggested that the PCA structure was stable and consistent.

To explore the underlying structure more clearly, Exploratory Factor Analysis was then conducted using Principal Component extraction with Varimax rotation. The five-factor solution explained approximately 74% of the total variance and produced a more interpretable loading pattern. The resulting factors represent distinct dimensions of air pollution, such as particulate pollution, combustion emissions, photochemical ozone processes, spatial–geographical variation and sulfur-related pollution. Most variables showed high communalities, indicating that the factors adequately capture their shared variance.

Overall, the findings suggest that air quality variation in Taiwan is inherently multidimensional. Pollution patterns are shaped by the interaction of particulate matter, gaseous emissions, atmospheric chemistry and geographic influences rather than a single dominant factor. By reducing complex and correlated variables into a smaller number of meaningful dimensions, PCA and FA provide a clearer and more structured way to understand air pollution dynamics.