In this project, I worked with environmental performance data and applied several clustering techniques to group similar observations: K-Means, PAM, Agglomerative Hierarchical Clustering, and Divisive Hierarchical Clustering (DIANA).
To improve the clustering results, I used Principal Component Analysis (PCA) for dimensionality reduction, which removed redundant information and made the data easier to analyze.
After applying PCA, I ran the clustering algorithms again on the reduced data, which allowed me to compare how well the clusters performed before and after PCA.
options(repos = c(CRAN = "https://cloud.r-project.org"))
required_packages <- c("factoextra", "cluster", "gridExtra", "dbscan", "caret",
"hopkins", "cclust", "clValid", "fpc","corrplot","psych")
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
}
library(pkg, character.only = TRUE)
}
file_path <- "C:/Users/nijat/Desktop/clus_dim_data.csv"
data <- read.csv(file_path)
head(data)
## code iso country EPI.old EPI.new ECO.old ECO.new BDH.old BDH.new
## 1 4 AFG Afghanistan 18.2 31.0 21.5 31.9 25.2 31.7
## 2 8 ALB Albania 46.1 52.2 50.9 51.9 50.6 50.3
## 3 12 DZA Algeria 38.5 41.7 39.3 41.8 33.1 33.0
## 4 24 AGO Angola 31.9 40.1 36.6 44.1 42.3 41.5
## 5 28 ATG Antigua and Barbuda 54.4 55.6 52.4 53.2 52.2 52.5
## 6 32 ARG Argentina 45.9 47.0 41.2 46.7 31.7 34.7
## MKP.old MKP.new MHP.old MHP.new MPE.old MPE.new PAR.old PAR.new SPI.old
## 1 NA NA NA NA NA NA 0.3 0.3 0.6
## 2 21.2 25.5 24.4 24.4 100.0 96.7 46.4 46.4 54.4
## 3 0.0 0.0 0.0 0.0 50.0 100.0 6.1 6.1 73.2
## 4 0.0 0.0 0.0 0.0 NA NA 35.6 35.6 37.1
## 5 74.7 74.7 33.1 33.1 50.0 88.5 NA NA 19.6
## 6 13.4 17.1 9.5 33.0 77.9 89.1 14.9 14.9 30.9
## SPI.new TBN.old TBN.new TKP.old TKP.new PAE.old PAE.new PHL.old PHL.new
## 1 9.1 0.3 11.7 0.6 24.6 79.5 79.5 80.9 80.9
## 2 56.8 55.5 60.7 55.9 58.4 65.2 65.2 63.4 63.4
## 3 73.8 4.2 4.3 5.9 6.0 20.3 20.3 58.8 58.8
## 4 37.1 34.1 34.1 74.7 74.7 82.1 82.1 89.8 89.8
## 5 19.9 51.5 51.5 72.9 72.9 28.3 28.3 90.0 90.0
## 6 32.9 26.2 27.9 39.8 40.5 37.6 37.6 71.6 71.6
## RLI.old RLI.new SHI.old SHI.new BER.old BER.new ECS.old ECS.new PFL.old
## 1 75.8 75.6 67.5 66.4 35.4 35.7 NA NA NA
## 2 70.5 70.0 77.2 53.5 28.2 28.9 58.4 62.4 NA
## 3 74.3 73.8 76.3 74.0 54.1 53.9 NA NA NA
## 4 77.8 77.7 82.7 71.6 58.1 56.0 50.8 49.2 68.5
## 5 67.3 64.0 NA NA 83.7 80.9 35.4 28.7 0.0
## 6 54.4 54.2 64.3 48.2 25.6 25.2 35.6 48.9 47.2
## PFL.new IFL.old IFL.new FCL.old FCL.new TCG.old TCG.new FLI.old FLI.new
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA 65.0 71.7 36.5 36.5 67.5 67.5
## 3 NA NA NA NA NA NA NA NA NA
## 4 50.5 41.1 49.3 58.1 49.1 27.9 27.9 83.5 83.5
## 5 0.0 NA NA 70.6 59.0 30.2 30.2 46.1 46.1
## 6 62.7 44.9 51.3 21.4 44.2 0.0 0.0 72.2 72.2
## FSH.old FSH.new FSS.old FSS.new FCD.old FCD.new BTZ.old BTZ.new BTO.old
## 1 NA NA NA NA NA NA NA NA NA
## 2 13.7 15.8 NA NA 24.2 24.0 9.0 3.5 5.6
## 3 53.1 51.6 85.6 75.7 40.8 37.7 44.8 49.2 45.3
## 4 38.8 37.6 21.0 29.0 48.1 45.4 27.2 26.9 38.5
## 5 97.6 97.3 81.5 100.0 NA NA 100.0 100.0 100.0
## 6 52.0 38.5 77.5 70.5 53.4 50.8 12.8 26.8 15.8
## BTO.new RMS.old RMS.new APO.old APO.new OEB.old OEB.new OEC.old OEC.new
## 1 NA NA NA 5.2 42.0 19.4 18.0 39.9 37.6
## 2 7.8 100.0 100.0 80.3 74.5 25.8 25.9 23.3 21.8
## 3 50.1 20.8 58.2 49.7 63.7 25.8 18.7 12.0 7.8
## 4 43.4 59.3 45.3 13.5 73.2 69.5 69.1 87.5 84.4
## 5 100.0 51.3 56.4 64.6 72.0 40.0 36.2 44.4 39.6
## 6 24.6 49.9 48.3 57.0 80.5 80.1 77.0 94.4 92.5
## NXA.old NXA.new SDA.old SDA.new AGR.old AGR.new SNM.old SNM.new PSU.old
## 1 0.0 34.2 0.0 55.5 45.0 41.1 35.3 31.2 100.0
## 2 100.0 100.0 100.0 69.2 48.4 50.4 20.3 22.0 35.7
## 3 7.1 60.7 100.0 86.8 42.9 46.7 35.0 38.6 100.0
## 4 0.0 69.6 0.0 75.4 39.3 41.0 33.0 35.2 81.6
## 5 39.5 84.8 76.3 72.9 30.2 31.4 5.2 3.8 40.2
## 6 36.7 69.0 62.0 90.4 84.1 81.4 82.6 87.7 100.0
## PSU.new PRS.old PRS.new RCY.old RCY.new WRS.old WRS.new WWG.old WWG.new
## 1 97.2 60.6 52.0 40.2 41.9 11.2 11.2 89.7 89.7
## 2 36.2 67.0 65.7 68.6 73.6 23.5 39.6 64.8 56.6
## 3 100.0 76.7 74.6 39.8 38.7 53.1 55.7 62.7 62.7
## 4 97.5 97.0 97.8 24.0 18.4 15.8 15.8 86.5 86.5
## 5 39.8 29.4 41.2 54.1 54.1 48.3 48.3 56.5 56.7
## 6 100.0 37.3 29.5 80.8 95.3 46.8 46.8 32.8 32.8
## WWC.old WWC.new WWT.old WWT.new WWR.old WWR.new HLT.old HLT.new AIR.old
## 1 0.0 0.0 0.0 0.0 0.0 0.0 17.6 18.2 16.9
## 2 19.0 26.5 10.9 49.0 33.4 33.4 39.9 44.1 31.7
## 3 65.0 72.0 41.7 41.7 41.7 41.7 48.1 48.2 47.3
## 4 7.6 7.6 4.3 4.3 4.3 4.3 19.8 21.8 18.7
## 5 50.8 50.8 44.4 44.4 44.4 44.4 65.6 69.6 72.6
## 6 55.0 55.0 55.0 55.0 11.8 11.8 52.9 54.3 48.7
## AIR.new HPE.old HPE.new HFD.old HFD.new OZD.old OZD.new NOD.old NOD.new
## 1 15.8 22.8 18.8 1.7 3.6 7.5 10.3 32.6 34.0
## 2 36.4 32.8 36.6 21.1 27.6 43.6 60.2 35.4 33.7
## 3 46.0 46.8 37.1 55.7 64.5 30.5 29.6 30.9 30.6
## 4 19.9 14.4 15.4 9.8 13.8 24.4 32.7 34.8 42.4
## 5 77.8 93.9 93.5 58.0 62.7 100.0 100.0 41.1 34.2
## 6 47.6 47.9 44.6 49.8 54.5 49.0 48.6 12.1 18.0
## SOE.old SOE.new COE.old COE.new VOE.old VOE.new H2O.old H2O.new USD.old
## 1 63.5 58.7 51.0 50.2 36.5 37.1 25.5 32.3 22.8
## 2 39.0 44.6 59.6 64.4 48.9 47.2 67.2 71.0 67.4
## 3 36.5 24.0 54.0 42.7 33.2 27.9 61.0 64.1 62.8
## 4 63.7 60.3 37.5 43.2 4.3 9.8 16.9 22.9 12.5
## 5 63.0 68.8 82.6 85.8 87.3 91.0 54.6 55.8 53.4
## 6 65.6 66.9 66.4 64.5 14.2 17.4 64.6 73.4 63.3
## USD.new UWD.old UWD.new HMT.old HMT.new LED.old LED.new WMG.old WMG.new
## 1 31.1 24.4 33.1 0.0 0.0 0.0 0.0 25.2 25.2
## 2 73.2 65.1 69.6 52.9 55.2 53.7 55.2 16.4 16.4
## 3 68.4 57.7 61.3 29.9 32.6 28.6 32.6 34.3 37.6
## 4 23.5 12.3 22.5 33.1 33.8 31.1 33.8 26.1 26.1
## 5 55.1 55.5 56.2 49.4 52.1 49.7 52.1 35.6 35.6
## 6 75.2 61.4 72.2 71.9 76.7 70.6 76.7 26.6 26.6
## WPC.old WPC.new SMW.old SMW.new WRR.old WRR.new PCC.old PCC.new CCH.old
## 1 60.6 60.6 0.0 0.0 2.4 2.4 13.7 40.2 13.7
## 2 35.6 35.6 0.0 0.0 5.3 5.3 44.1 59.4 44.1
## 3 44.6 47.8 61.6 71.6 10.3 10.5 29.3 36.2 29.3
## 4 64.1 64.1 0.0 0.0 1.1 1.1 35.2 49.4 35.2
## 5 39.0 39.0 100.0 100.0 0.0 0.0 47.1 46.4 47.1
## 6 32.5 32.5 56.1 56.1 6.0 6.0 47.1 41.4 47.1
## CCH.new CDA.old CDA.new CDF.old CDF.new CHA.old CHA.new FGA.old FGA.new
## 1 40.2 0.0 38.4 0.0 100.0 4.2 48.0 36.8 4.3
## 2 59.4 37.8 54.7 47.5 77.2 82.4 87.6 36.8 3.8
## 3 36.2 34.8 40.7 21.3 30.4 48.0 40.5 34.3 12.6
## 4 49.4 26.4 54.7 78.8 100.0 55.5 54.7 36.8 3.7
## 5 46.4 42.1 46.3 22.8 28.8 67.9 57.4 NA NA
## 6 41.4 41.8 50.8 30.7 44.8 87.6 46.6 42.6 6.8
## NDA.old NDA.new BCA.old BCA.new LUF.old LUF.new GTI.old GTI.new GTP.old
## 1 9.3 50.8 1.2 39.3 42.2 44.4 0.7 35.0 8.1
## 2 72.3 100.0 65.5 100.0 49.7 50.0 43.3 56.9 44.0
## 3 9.4 46.1 3.8 59.8 49.3 49.7 29.8 32.1 32.1
## 4 49.2 49.0 5.8 52.5 47.9 48.0 38.3 48.8 42.5
## 5 52.1 57.5 57.2 59.9 50.3 50.2 39.1 42.6 36.8
## 6 58.4 27.3 41.7 54.3 45.1 48.5 44.9 43.4 39.9
## GTP.new GHN.old GHN.new CBP.old CBP.new
## 1 46.7 17.9 22.6 97.7 97.7
## 2 56.3 32.7 39.5 87.6 87.6
## 3 34.7 7.7 6.9 51.5 51.5
## 4 52.9 15.3 20.4 88.5 88.5
## 5 39.0 51.5 51.6 54.0 54.0
## 6 39.1 7.4 6.6 53.0 53.0
str(data)
## 'data.frame': 180 obs. of 149 variables:
## $ code : int 4 8 12 24 28 32 51 36 40 31 ...
## $ iso : chr "AFG" "ALB" "DZA" "AGO" ...
## $ country: chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ EPI.old: num 18.2 46.1 38.5 31.9 54.4 45.9 42.9 58.9 68.8 40.8 ...
## $ EPI.new: num 31 52.2 41.7 40.1 55.6 47 44.9 63.1 68.9 40.5 ...
## $ ECO.old: num 21.5 50.9 39.3 36.6 52.4 41.2 47.1 59.9 77.9 45.1 ...
## $ ECO.new: num 31.9 51.9 41.8 44.1 53.2 46.7 48.1 62.5 77.6 44.7 ...
## $ BDH.old: num 25.2 50.6 33.1 42.3 52.2 31.7 48.6 50.5 74.6 36.1 ...
## $ BDH.new: num 31.7 50.3 33 41.5 52.5 34.7 47.6 55.2 74.3 36.7 ...
## $ MKP.old: num NA 21.2 0 0 74.7 13.4 NA 36.5 NA 0.8 ...
## $ MKP.new: num NA 25.5 0 0 74.7 17.1 NA 57.2 NA 0.8 ...
## $ MHP.old: num NA 24.4 0 0 33.1 9.5 NA 30.1 NA 20 ...
## $ MHP.new: num NA 24.4 0 0 33.1 33 NA 42.5 NA 20 ...
## $ MPE.old: num NA 100 50 NA 50 77.9 NA 71.1 NA 50 ...
## $ MPE.new: num NA 96.7 100 NA 88.5 89.1 NA 57.2 NA 50 ...
## $ PAR.old: num 0.3 46.4 6.1 35.6 NA 14.9 14.4 50.9 86.4 7.2 ...
## $ PAR.new: num 0.3 46.4 6.1 35.6 NA 14.9 14.4 50.9 86.4 7.2 ...
## $ SPI.old: num 0.6 54.4 73.2 37.1 19.6 30.9 38.7 53.5 78.9 42.1 ...
## $ SPI.new: num 9.1 56.8 73.8 37.1 19.9 32.9 38.7 65.8 83.6 43.1 ...
## $ TBN.old: num 0.3 55.5 4.2 34.1 51.5 26.2 77.2 54.7 85.9 31.3 ...
## $ TBN.new: num 11.7 60.7 4.3 34.1 51.5 27.9 77.2 67.1 86.9 31.4 ...
## $ TKP.old: num 0.6 55.9 5.9 74.7 72.9 39.8 39.5 47.3 70.5 38 ...
## $ TKP.new: num 24.6 58.4 6 74.7 72.9 40.5 39.5 58.1 70.6 41.3 ...
## $ PAE.old: num 79.5 65.2 20.3 82.1 28.3 37.6 49.8 89.1 76.7 55.2 ...
## $ PAE.new: num 79.5 65.2 20.3 82.1 28.3 37.6 49.8 89.1 76.7 55.2 ...
## $ PHL.old: num 80.9 63.4 58.8 89.8 90 71.6 71.5 92.2 57.9 31 ...
## $ PHL.new: num 80.9 63.4 58.8 89.8 90 71.6 71.5 92.2 57.9 31 ...
## $ RLI.old: num 75.8 70.5 74.3 77.8 67.3 54.4 49.9 48 62.6 80.8 ...
## $ RLI.new: num 75.6 70 73.8 77.7 64 54.2 47.9 39.5 59.1 81.2 ...
## $ SHI.old: num 67.5 77.2 76.3 82.7 NA 64.3 89.8 81 71.6 83.9 ...
## $ SHI.new: num 66.4 53.5 74 71.6 NA 48.2 80.4 50.3 59.9 80.1 ...
## $ BER.old: num 35.4 28.2 54.1 58.1 83.7 25.6 47.5 29.3 47.3 11.9 ...
## $ BER.new: num 35.7 28.9 53.9 56 80.9 25.2 49.8 28.9 47.2 13.3 ...
## $ ECS.old: num NA 58.4 NA 50.8 35.4 35.6 80.3 60.2 58.9 79.5 ...
## $ ECS.new: num NA 62.4 NA 49.2 28.7 48.9 83.4 42.3 47.5 82.3 ...
## $ PFL.old: num NA NA NA 68.5 0 47.2 NA 97.5 NA NA ...
## $ PFL.new: num NA NA NA 50.5 0 62.7 NA 97.4 NA NA ...
## $ IFL.old: num NA NA NA 41.1 NA 44.9 NA 21.1 NA NA ...
## $ IFL.new: num NA NA NA 49.3 NA 51.3 NA 0 NA NA ...
## $ FCL.old: num NA 65 NA 58.1 70.6 21.4 88.8 64.7 66.7 92.1 ...
## $ FCL.new: num NA 71.7 NA 49.1 59 44.2 99.3 20.1 53.4 98.3 ...
## $ TCG.old: num NA 36.5 NA 27.9 30.2 0 58.2 44.9 38.7 50.7 ...
## $ TCG.new: num NA 36.5 NA 27.9 30.2 0 58.2 44.9 38.7 50.7 ...
## $ FLI.old: num NA 67.5 NA 83.5 46.1 72.2 54.2 72.2 35.6 65.4 ...
## $ FLI.new: num NA 67.5 NA 83.5 46.1 72.2 54.2 72.2 35.6 65.4 ...
## $ FSH.old: num NA 13.7 53.1 38.8 97.6 52 NA 48.5 NA NA ...
## $ FSH.new: num NA 15.8 51.6 37.6 97.3 38.5 NA 48.1 NA NA ...
## $ FSS.old: num NA NA 85.6 21 81.5 77.5 NA 29.4 NA NA ...
## $ FSS.new: num NA NA 75.7 29 100 70.5 NA 28.6 NA NA ...
## $ FCD.old: num NA 24.2 40.8 48.1 NA 53.4 NA 45.4 NA NA ...
## $ FCD.new: num NA 24 37.7 45.4 NA 50.8 NA 47.7 NA NA ...
## $ BTZ.old: num NA 9 44.8 27.2 100 12.8 NA 46.3 NA NA ...
## $ BTZ.new: num NA 3.5 49.2 26.9 100 26.8 NA 51.1 NA NA ...
## $ BTO.old: num NA 5.6 45.3 38.5 100 15.8 NA 49 NA NA ...
## $ BTO.new: num NA 7.8 50.1 43.4 100 24.6 NA 53.3 NA NA ...
## $ RMS.old: num NA 100 20.8 59.3 51.3 49.9 NA 77.6 NA NA ...
## $ RMS.new: num NA 100 58.2 45.3 56.4 48.3 NA 57.2 NA NA ...
## $ APO.old: num 5.2 80.3 49.7 13.5 64.6 57 38.7 87.4 93.1 77.1 ...
## $ APO.new: num 42 74.5 63.7 73.2 72 80.5 54.2 95.9 92.9 67 ...
## $ OEB.old: num 19.4 25.8 25.8 69.5 40 80.1 40.9 79 56.3 45 ...
## $ OEB.new: num 18 25.9 18.7 69.1 36.2 77 37.3 77.3 54.2 39.3 ...
## $ OEC.old: num 39.9 23.3 12 87.5 44.4 94.4 45.3 86.2 65.5 56.4 ...
## $ OEC.new: num 37.6 21.8 7.8 84.4 39.6 92.5 40.6 82.1 61 49.8 ...
## $ NXA.old: num 0 100 7.1 0 39.5 36.7 41.2 91.2 100 70.5 ...
## $ NXA.new: num 34.2 100 60.7 69.6 84.8 69 18.6 98.2 100 60.3 ...
## $ SDA.old: num 0 100 100 0 76.3 62 50.8 79.7 100 100 ...
## $ SDA.new: num 55.5 69.2 86.8 75.4 72.9 90.4 95.8 100 100 82.7 ...
## $ AGR.old: num 45 48.4 42.9 39.3 30.2 84.1 56.6 53.8 68.9 62.7 ...
## $ AGR.new: num 41.1 50.4 46.7 41 31.4 81.4 35.5 65.3 72.5 63 ...
## $ SNM.old: num 35.3 20.3 35 33 5.2 82.6 29.7 52.9 64.6 45.8 ...
## $ SNM.new: num 31.2 22 38.6 35.2 3.8 87.7 13.1 49.7 64 48.1 ...
## $ PSU.old: num 100 35.7 100 81.6 40.2 100 100 53.4 41 100 ...
## $ PSU.new: num 97.2 36.2 100 97.5 39.8 100 100 51.8 40.9 100 ...
## $ PRS.old: num 60.6 67 76.7 97 29.4 37.3 66.3 57.5 75.3 80.7 ...
## $ PRS.new: num 52 65.7 74.6 97.8 41.2 29.5 49.3 59 71.6 64.2 ...
## $ RCY.old: num 40.2 68.6 39.8 24 54.1 80.8 59.9 58.9 71.5 55.3 ...
## $ RCY.new: num 41.9 73.6 38.7 18.4 54.1 95.3 46.7 84.7 84.1 74.4 ...
## $ WRS.old: num 11.2 23.5 53.1 15.8 48.3 46.8 30.3 87 85.7 24.3 ...
## $ WRS.new: num 11.2 39.6 55.7 15.8 48.3 46.8 34 87.2 87.3 29.7 ...
## $ WWG.old: num 89.7 64.8 62.7 86.5 56.5 32.8 0 26.2 0 43 ...
## $ WWG.new: num 89.7 56.6 62.7 86.5 56.7 32.8 0 26.2 10.8 44 ...
## $ WWC.old: num 0 19 65 7.6 50.8 55 38 100 100 18.9 ...
## $ WWC.new: num 0 26.5 72 7.6 50.8 55 42.9 100 100 25.9 ...
## $ WWT.old: num 0 10.9 41.7 4.3 44.4 55 38 92.4 95.2 18.9 ...
## $ WWT.new: num 0 49 41.7 4.3 44.4 55 42.9 92.9 96 25.9 ...
## $ WWR.old: num 0 33.4 41.7 4.3 44.4 11.8 14.6 92.9 100 38.1 ...
## $ WWR.new: num 0 33.4 41.7 4.3 44.4 11.8 14.6 92.9 100 38.1 ...
## $ HLT.old: num 17.6 39.9 48.1 19.8 65.6 52.9 39.5 81.1 66.2 39.1 ...
## $ HLT.new: num 18.2 44.1 48.2 21.8 69.6 54.3 41.8 84 71 40.4 ...
## $ AIR.old: num 16.9 31.7 47.3 18.7 72.6 48.7 29 78.8 56.3 37.1 ...
## $ AIR.new: num 15.8 36.4 46 19.9 77.8 47.6 30.9 80.9 61.5 38.2 ...
## $ HPE.old: num 22.8 32.8 46.8 14.4 93.9 47.9 22.9 90 30.8 36.3 ...
## $ HPE.new: num 18.8 36.6 37.1 15.4 93.5 44.6 19.3 94.9 46.1 37.9 ...
## $ HFD.old: num 1.7 21.1 55.7 9.8 58 49.8 29.9 82.1 84.7 28.1 ...
## $ HFD.new: num 3.6 27.6 64.5 13.8 62.7 54.5 38.5 85.1 88.9 36.8 ...
## $ OZD.old: num 7.5 43.6 30.5 24.4 100 49 27.2 85.5 41.6 41.2 ...
## $ OZD.new: num 10.3 60.2 29.6 32.7 100 48.6 35.2 71 39.3 37.1 ...
## $ NOD.old: num 32.6 35.4 30.9 34.8 41.1 12.1 29.6 19.2 20.6 36.3 ...
## $ NOD.new: num 34 33.7 30.6 42.4 34.2 18 27.7 25.9 25 35.4 ...
## [list output truncated]
I used the Environmental Performance Index (EPI) dataset in my project. This dataset has different environmental indicators like air quality, biodiversity, climate change, and pollution. Each row represents a country, and columns show environmental scores.
I then cleaned the dataset: I removed the columns ending in ".old" because they were not needed, renamed the remaining columns by dropping the ".new" suffix to simplify the names, and deleted the "iso" and "code" columns because they were not useful for the analysis.
data_cleaned <- data[, !grepl("\\.old$", names(data))]
names(data_cleaned) <- gsub("\\.new$", "", names(data_cleaned))
data_cleaned <- data_cleaned[, !(names(data_cleaned) %in% c("iso", "code"))]
Next, I handled the missing values: I removed columns with more than two missing values to keep only well-populated variables, and filled the remaining gaps with the mean of each column to avoid holes in the data.
data_cleaned <- data_cleaned[, colSums(is.na(data_cleaned)) <= 2]
for (col in names(data_cleaned)) {
if (is.numeric(data_cleaned[[col]])) {
data_cleaned[[col]][is.na(data_cleaned[[col]])] <- mean(data_cleaned[[col]], na.rm = TRUE)
}
}
data_cleaned[!complete.cases(data_cleaned),]
## [1] country EPI ECO BDH SPI TBN TKP RLI BER
## [10] APO OEB NXA SDA AGR SNM PSU PRS RCY
## [19] WRS WWG WWC WWT WWR HLT AIR HPE HFD
## [28] OZD NOD SOE COE VOE H2O USD UWD HMT
## [37] LED WMG WPC SMW WRR PCC CCH CDA CDF
## [46] CHA NDA BCA GTI GTP GHN CBP
## <0 rows> (or 0-length row.names)
No missing values remain, so the data is ready for further analysis.
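As a quick sanity check (not part of the original output), the total NA count can be confirmed directly:
sum(is.na(data_cleaned))  # should return 0 after mean imputation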
numeric_columns <- names(data_cleaned)[!(names(data_cleaned) %in% c("country"))]
for (col in numeric_columns) {
boxplot(data_cleaned[[col]],
main = paste("Box Plot of", col),
ylab = col,
col = "lightblue",
border = "blue")
}
I then checked for outliers: I selected the numeric columns (excluding "country") and created a box plot for each one to visualize its distribution and spot extreme values.
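A complementary numeric check, shown here as an illustrative sketch rather than part of the original workflow, counts how many values fall outside the 1.5 * IQR whiskers in each numeric column:
outlier_counts <- sapply(data_cleaned[numeric_columns], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)  # points beyond the box-plot whiskers
})
sort(outlier_counts, decreasing = TRUE)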
data_cleaned <- data_cleaned[, !(names(data_cleaned) %in% c("GTP", "GTI", "CDA", "WRR", "CCH"))]
I removed the GTP, GTI, CDA, WRR, and CCH columns to eliminate variables that were not useful for the analysis. This simplifies the dataset and helps clustering and PCA focus on the more informative features.
cor_matrix <- cor(data_cleaned[, -c(1)])
high_cor <- findCorrelation(cor_matrix, cutoff = 0.9)
names_to_drop <- colnames(data_cleaned[, -c(1)])[high_cor]
data_cleaned <- data_cleaned[, !(names(data_cleaned) %in% names_to_drop)]
I then identified variable pairs with a correlation above 0.9 and dropped one variable from each pair to reduce redundancy and avoid multicollinearity in the clustering and PCA steps.
data_cleaned_scaled <- scale(data_cleaned[2:40])
data_cleaned <- cbind(data_cleaned[, 1,drop = FALSE], as.data.frame(data_cleaned_scaled))
I standardized the numeric columns to ensure all features have the same scale.
clustering_data <- data_cleaned[, !(names(data_cleaned) %in% c("country"))]
set.seed(123)
hopkins_stat <- hopkins::hopkins(clustering_data)
cat("Hopkins statistic:", hopkins_stat, "\n")
## Hopkins statistic: 0.9997001
distance_matrix <- dist(clustering_data)
fviz_dist(distance_matrix,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"),
show_labels = FALSE)
I prepared the data for clustering and checked its cluster tendency using the Hopkins statistic. A Hopkins statistic of 0.9997 means the data has a very strong clustering tendency (values close to 1 indicate highly clusterable data).
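As an illustrative reference point (not part of the original analysis), the same statistic can be computed on uniformly random data of the same shape; for structureless data it should fall near 0.5, far below the 0.9997 observed here:
set.seed(123)
random_reference <- as.data.frame(
  apply(clustering_data, 2, function(x) runif(length(x), min(x), max(x)))
)
hopkins::hopkins(random_reference)  # expected to be close to 0.5 for uniform noise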
data_for_clustering <- clustering_data
methods <- list(kmeans = kmeans, pam = pam, hcut = hcut)
titles <- c("Silhouette: K-Means", "Silhouette: PAM", "Silhouette: Hierarchical")
sil_plots <- lapply(1:3, function(i) {
fviz_nbclust(data_for_clustering, FUNcluster = methods[[i]], method = "silhouette") +
ggtitle(titles[i]) + theme_minimal()
})
grid.arrange(grobs = sil_plots, nrow = 3)
I checked the silhouette scores for K-Means, PAM, and Hierarchical clustering. The values were very low, meaning the clusters were not well-separated. To improve this, I selected only 10 important columns from my dataset.
selected_columns <- c("EPI", "ECO", "SPI", "PRS", "RLI", "WWT", "HFD", "NOD", "LED", "SMW")
final_columns <- c("country", selected_columns)
data_cleaned <- data_cleaned[, final_columns]
I then recomputed the Hopkins statistic on the reduced set of variables to verify that it still indicates a strong clustering tendency.
clustering_data <- data_cleaned[, !(names(data_cleaned) %in% c("country"))]
data_for_clustering <- clustering_data
set.seed(123)
hopkins_stat <- hopkins::hopkins(data_for_clustering)
cat("Hopkins statistic:", hopkins_stat, "\n")
## Hopkins statistic: 0.9973144
distance_matrix <- dist(data_for_clustering)
fviz_dist(distance_matrix,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"),
show_labels = FALSE)
In this step, I identify the optimal number of clusters for the dataset using the average silhouette width and the elbow method.
methods <- list(kmeans = kmeans, pam = pam, hcut = hcut)
titles <- c("Silhouette: K-Means", "Silhouette: PAM", "Silhouette: Hierarchical")
sil_plots <- lapply(1:3, function(i) {
fviz_nbclust(data_for_clustering, FUNcluster = methods[[i]], method = "silhouette") +
ggtitle(titles[i]) + theme_minimal()
})
grid.arrange(grobs = sil_plots, nrow = 3)
After reducing the data, silhouette scores increased for all clustering methods, and the Hopkins statistic confirmed that the data is clusterable.
Silhouette methods also showed that k = 2 is the best cluster number for all clustering methods.
k_selected <- 2
sil_values <- sapply(1:3, function(i) {
plot_data <- sil_plots[[i]]$data
plot_data[plot_data$clusters == k_selected, "y"]
})
cat("Silhouette value for k =", k_selected, "\n")
## Silhouette value for k = 2
cat("K-Means:", round(sil_values[1], 2), "\n")
## K-Means: 0.33
cat("PAM:", round(sil_values[2], 2), "\n")
## PAM: 0.34
cat("Hierarchical:", round(sil_values[3], 2), "\n")
## Hierarchical: 0.32
I also used the elbow method, which confirmed that k = 2 is the optimal number of clusters for all clustering methods.
methods <- list(kmeans = kmeans, pam = pam, hcut = hcut)
titles <- c("Elbow: K-Means", "Elbow: PAM", "Elbow: Hierarchical")
elbow_plots <- lapply(1:3, function(i) {
fviz_nbclust(data_for_clustering, FUNcluster = methods[[i]], method = "wss") +
ggtitle(titles[i]) + theme_minimal()
})
grid.arrange(grobs = elbow_plots, nrow = 3)
Clustering Visualization
kmeans_result <- eclust(data_for_clustering, "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(kmeans_result, geom = "point", main = "K-Means Clustering") +
theme_minimal()
The cluster plot shows that K-Means separated the countries into two well-defined clusters with a clear boundary between them.
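To see which countries fall into each group, the cluster labels can be matched back to the country names (an optional inspection step, assuming the rows of data_for_clustering are still aligned with data_cleaned$country):
country_clusters <- split(data_cleaned$country, kmeans_result$cluster)
lapply(country_clusters, head)  # preview a few countries per cluster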
SILHOUETTE INDEX
fviz_silhouette(kmeans_result, label = TRUE) +
ggtitle("Silhouette Plot for K-Means") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 122 0.31
## 2 2 58 0.37
The silhouette index results (0.31 for Cluster 1 and 0.37 for Cluster 2) suggest that the clusters are distinct but still show some degree of overlap.
Calinski-Harabasz
calinski_harabasz <- cluster.stats(d = dist(data_for_clustering), clustering = kmeans_result$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 102.59
I calculated the Calinski-Harabasz Index, which gave a value of 102.59; higher values indicate better separation, so this suggests a reasonably well-defined clustering structure.
Clustering Visualization
pam_result <- eclust(data_for_clustering, FUNcluster = "pam", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(pam_result, geom = "point", main = "PAM Clustering") +
theme_minimal()
I performed PAM (Partitioning Around Medoids) clustering with k = 2 using the Euclidean distance metric and visualized the clusters to analyze the structure of the data. The cluster plot shows that PAM also separated the countries into two well-defined clusters with a clear boundary between them.
SILHOUETTE INDEX
fviz_silhouette(pam_result, label = TRUE) +
ggtitle("Silhouette Plot for PAM") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 135 0.30
## 2 2 45 0.47
I computed the silhouette plot for PAM clustering with k = 2. The average silhouette width is 0.30 for Cluster 1 and 0.47 for Cluster 2, indicating that Cluster 2 is better separated than Cluster 1, but the overall clustering structure could still be improved.
Calinski-Harabasz
calinski_harabasz <- cluster.stats(d = dist(data_for_clustering), clustering = pam_result$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 100.11
I computed the Calinski-Harabasz Index for PAM clustering, which gave 100.11. This is slightly lower than the K-Means score (102.59), suggesting that K-Means produces slightly better-defined clusters on this dataset.
Overall, K-Means performed slightly better, as indicated by the higher Calinski-Harabasz Index, while PAM formed a better-defined second cluster, as shown by its higher silhouette score for Cluster 2. If cluster compactness is the priority, K-Means is preferable.
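The before-PCA comparison of the two partitioning methods can be collected into a small table (an illustrative summary built from the values reported above, not part of the original code):
partition_summary <- data.frame(
  method            = c("K-Means", "PAM"),
  avg_silhouette    = c(0.33, 0.34),      # average silhouette widths at k = 2
  calinski_harabasz = c(102.59, 100.11)   # from cluster.stats()
)
print(partition_summary)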
Next, I proceed with hierarchical clustering using two approaches, agglomerative and divisive. I first perform the clustering and then evaluate and compare the quality measures.
linkage_methods <- c("average", "single", "complete", "ward")
names(linkage_methods) <- linkage_methods
calc_ac <- function(method) {
agnes(data_for_clustering, method = method)$ac
}
ac_values <- sapply(linkage_methods, calc_ac)
print(ac_values)
## average single complete ward
## 0.7179504 0.4412106 0.8511947 0.9538338
I calculated the agglomerative coefficient (AC) to check how well each linkage method forms structured clusters; a higher AC means a stronger clustering structure.
I chose Ward's method because it has the highest AC (0.95), meaning it produces the most compact, well-separated clusters. Ward linkage minimizes the within-cluster variance, making it the most suitable choice here.
hc_model <- agnes(data_for_clustering, method = "ward")
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
hc_hclust <- as.hclust(hc_model)
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
hc_clusters <- cutree(hc_hclust, k = 2)
table(hc_clusters)
## hc_clusters
## 1 2
## 148 32
rect.hclust(hc_hclust, k = 2, border = "blue")
I converted the agnes model to an hclust object, cut the tree into two clusters, counted the cluster sizes, and highlighted the groups with rectangles on the dendrogram to show the final grouping.
fviz_cluster(
list(data = data_for_clustering, cluster = hc_clusters),
main = "Hierarchical Clustering - Ward's Method"
) +
theme_minimal() +
scale_color_manual(values = c("red", "blue")) +
scale_fill_manual(values = c("red", "blue"))
I visualized the Ward's-method clusters by plotting the observations with their cluster assignments in different colors (red and blue), using a minimal theme for a clean presentation.
agg_cluster_stats <- cluster.stats(
dist(data_for_clustering),
hc_clusters
)
print(agg_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 148 32
##
## $min.cluster.size
## [1] 32
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.834906 4.313502
##
## $average.distance
## [1] 3.813062 2.242219
##
## $median.distance
## [1] 3.734420 2.146629
##
## $separation
## [1] 1.464251 1.464251
##
## $average.toother
## [1] 5.380658 5.380658
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.000000 1.464251
## [2,] 1.464251 0.000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 5.380658
## [2,] 5.380658 0.000000
##
## $average.between
## [1] 5.380658
##
## $average.within
## [1] 3.533801
##
## $n.between
## [1] 4736
##
## $n.within
## [1] 11374
##
## $max.diameter
## [1] 7.834906
##
## $min.separation
## [1] 1.464251
##
## $within.cluster.ss
## [1] 1256.328
##
## $clus.avg.silwidths
## 1 2
## 0.2595458 0.5771514
##
## $avg.silwidth
## [1] 0.316009
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.508956
##
## $dunn
## [1] 0.1868881
##
## $dunn2
## [1] 1.411112
##
## $entropy
## [1] 0.468007
##
## $wb.ratio
## [1] 0.6567601
##
## $ch
## [1] 75.61212
##
## $cwidegap
## [1] 2.399012 1.677730
##
## $widestgap
## [1] 2.399012
##
## $sindex
## [1] 1.805722
##
## $corrected.rand
## NULL
##
## $vi
## NULL
I evaluated the cluster quality for the Ward's-method solution. The dataset was divided into two clusters (sizes 148 and 32). The average silhouette width is 0.32, indicating a moderate clustering structure; the Dunn index is 0.19, suggesting compact clusters with moderate separation; and the Calinski-Harabasz index is 75.61, showing acceptable cluster validity.
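For easier reading, these key measures can also be pulled directly out of the cluster.stats() result instead of scanning the full printout (a small convenience step, not in the original code):
with(agg_cluster_stats,
     c(avg.silwidth = avg.silwidth, dunn = dunn, ch = ch, min.separation = min.separation))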
hc_div <- diana(data_for_clustering)
cat("Dissociation Coefficient:", round(hc_div$dc, 2), "\n")
## Dissociation Coefficient: 0.83
I performed divisive hierarchical clustering (DIANA); the divisive coefficient of 0.83 indicates a fairly strong hierarchical clustering structure.
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
div_clusters <- cutree(as.hclust(hc_div), k = 2)
table(div_clusters)
## div_clusters
## 1 2
## 120 60
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
rect.hclust(as.hclust(hc_div), k = 2, border = viridisLite::viridis(2, option = "C"))
fviz_cluster(
list(data = data_for_clustering, cluster = div_clusters),
main = "Hierarchical Clustering Using DIANA"
) +
scale_color_manual(values = viridisLite::viridis(2, option = "C")) +
scale_fill_manual(values = viridisLite::viridis(2, option = "C")) +
theme_classic()
I visualized the clusters formed by Divisive Hierarchical Clustering (DIANA) in a scatter plot with distinct colors, which helps show the separation and structure of the clusters.
Compared to Agglomerative Clustering, the Divisive Clustering (DIANA) method produced clusters that are closer to each other. This suggests that Agglomerative Clustering created more distinct clusters, while Divisive Clustering resulted in groups that are less separated.
div_cluster_stats <- cluster.stats(dist(data_for_clustering), div_clusters)
print(div_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 120 60
##
## $min.cluster.size
## [1] 60
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.834906 7.045571
##
## $average.distance
## [1] 3.483922 3.269464
##
## $median.distance
## [1] 3.451303 3.238025
##
## $separation
## [1] 1.269166 1.269166
##
## $average.toother
## [1] 5.19601 5.19601
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.000000 1.269166
## [2,] 1.269166 0.000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.00000 5.19601
## [2,] 5.19601 0.00000
##
## $average.between
## [1] 5.19601
##
## $average.within
## [1] 3.412436
##
## $n.between
## [1] 7200
##
## $n.within
## [1] 8910
##
## $max.diameter
## [1] 7.834906
##
## $min.separation
## [1] 1.269166
##
## $within.cluster.ss
## [1] 1136.888
##
## $clus.avg.silwidths
## 1 2
## 0.3124926 0.3488300
##
## $avg.silwidth
## [1] 0.3246051
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5956821
##
## $dunn
## [1] 0.1619887
##
## $dunn2
## [1] 1.491426
##
## $entropy
## [1] 0.6365142
##
## $wb.ratio
## [1] 0.6567415
##
## $ch
## [1] 102.2563
##
## $cwidegap
## [1] 2.314239 2.589898
##
## $widestgap
## [1] 2.589898
##
## $sindex
## [1] 1.469359
##
## $corrected.rand
## NULL
##
## $vi
## NULL
The average silhouette width (0.32) is essentially the same as for Agglomerative Clustering, so neither method is clearly better by this measure. Cluster separation (1.27 versus 1.46) is smaller, indicating the clusters sit closer together. The Calinski-Harabasz Index (102.26) is noticeably higher than for Agglomerative Clustering (75.61), suggesting a better-defined structure in terms of compactness, while the Dunn Index (0.16 versus 0.19) is lower, meaning weaker worst-case separation.
Conclusion: by the separation-based measures (Dunn index, minimum separation), Agglomerative Clustering formed the more distinct clusters, although Divisive Clustering scored higher on the Calinski-Harabasz Index.
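A side-by-side summary of the two hierarchical approaches before PCA can be built from the cluster.stats() objects computed above (a small illustrative table, not in the original code):
hier_summary <- data.frame(
  method         = c("Agglomerative (Ward)", "Divisive (DIANA)"),
  avg_silwidth   = c(agg_cluster_stats$avg.silwidth,   div_cluster_stats$avg.silwidth),
  dunn           = c(agg_cluster_stats$dunn,           div_cluster_stats$dunn),
  ch             = c(agg_cluster_stats$ch,             div_cluster_stats$ch),
  min_separation = c(agg_cluster_stats$min.separation, div_cluster_stats$min.separation)
)
print(hier_summary)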
data_for_dimreduction <- clustering_data
data_matrix <- data.matrix(data_for_dimreduction)
correlation_matrix <- cor(data_matrix)
print(correlation_matrix, digits = 2)
## EPI ECO SPI PRS RLI WWT HFD NOD LED SMW
## EPI 1.000 0.81 0.44 -0.084 0.318 0.70 0.76 -0.38 0.762 0.689
## ECO 0.810 1.00 0.74 0.177 0.452 0.54 0.50 -0.37 0.498 0.456
## SPI 0.440 0.74 1.00 0.326 0.405 0.21 0.17 -0.17 0.237 0.153
## PRS -0.084 0.18 0.33 1.000 0.284 -0.14 -0.26 0.12 -0.223 -0.150
## RLI 0.318 0.45 0.41 0.284 1.000 0.14 0.19 -0.23 0.088 0.071
## WWT 0.699 0.54 0.21 -0.145 0.135 1.00 0.82 -0.52 0.713 0.722
## HFD 0.756 0.50 0.17 -0.257 0.191 0.82 1.00 -0.59 0.776 0.745
## NOD -0.382 -0.37 -0.17 0.123 -0.232 -0.52 -0.59 1.00 -0.419 -0.436
## LED 0.762 0.50 0.24 -0.223 0.088 0.71 0.78 -0.42 1.000 0.704
## SMW 0.689 0.46 0.15 -0.150 0.071 0.72 0.75 -0.44 0.704 1.000
corrplot(correlation_matrix,
method = "circle",
type = "lower",
title = "Variables Correlation Matrix",
cl.cex = 0.8,
tl.cex = 0.7,
tl.col = "black",
col = viridisLite::inferno(10))
EPI and ECO have a strong positive correlation (0.81), meaning they move together. WWT and HFD are highly correlated (0.82), indicating a close relationship between them. PRS has weak correlations with most variables, meaning it behaves independently. NOD is negatively correlated with most variables, suggesting it has an inverse relationship with them.
preproc_data <- preProcess(data_for_dimreduction, method = c("center", "scale"))
data_standardized <- predict(preproc_data, data_for_dimreduction)
summary(data_standardized)
## EPI ECO SPI PRS
## Min. :-1.9511 Min. :-2.1818 Min. :-1.76428 Min. :-3.01195
## 1st Qu.:-0.7222 1st Qu.:-0.7086 1st Qu.:-0.77931 1st Qu.:-0.63046
## Median :-0.1262 Median :-0.1193 Median : 0.03399 Median :-0.02317
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.5418 3rd Qu.: 0.7033 3rd Qu.: 0.77914 3rd Qu.: 0.75477
## Max. : 2.5107 Max. : 2.4406 Max. : 1.77562 Max. : 1.98691
## RLI WWT HFD NOD
## Min. :-1.7063 Min. :-1.1300 Min. :-1.2168 Min. :-1.8830
## 1st Qu.:-0.8269 1st Qu.:-0.9518 1st Qu.:-0.9755 1st Qu.:-0.7257
## Median : 0.1641 Median :-0.1747 Median :-0.1730 Median :-0.2207
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8120 3rd Qu.: 0.8701 3rd Qu.: 0.7697 3rd Qu.: 0.4908
## Max. : 1.5025 Max. : 1.8277 Max. : 1.9513 Max. : 2.9491
## LED SMW
## Min. :-2.2892 Min. :-1.0577
## 1st Qu.:-0.8327 1st Qu.:-1.0284
## Median :-0.1968 Median :-0.2639
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6689 3rd Qu.: 1.2053
## Max. : 2.0072 Max. : 1.3809
data_cov <- cov(data_standardized)
data_eigen <- eigen(data_cov)
data_eigen$values
## [1] 5.01428541 1.95742220 0.81018767 0.65437394 0.56047192 0.28325759
## [7] 0.27232235 0.24049993 0.14874866 0.05843033
Eigenvalues indicate the variance explained by each principal component; for standardized data they sum to the number of variables (here 10), so dividing each eigenvalue by that total gives the proportion of variance explained. The first component explains the most variance, and the loadings (rotation matrix) printed below show how much each original variable contributes to each component.
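To make that concrete, the proportion of variance per component can be computed directly from the eigenvalues obtained above:
round(data_eigen$values / sum(data_eigen$values), 3)  # proportion of variance per component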
data_pca <- prcomp(data_standardized, center = FALSE, scale. = FALSE)
print(data_pca)
## Standard deviations (1, .., p=10):
## [1] 2.2392600 1.3990791 0.9001043 0.8089338 0.7486467 0.5322195 0.5218451
## [8] 0.4904079 0.3856795 0.2417237
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5 PC6
## EPI 0.40549207 0.05789696 0.228998814 -0.14106795 0.182926893 0.01282849
## ECO 0.34768543 0.35227007 0.176755269 -0.19345796 -0.159569309 -0.15318355
## SPI 0.20458947 0.51114054 0.244851265 -0.25337883 -0.439876492 -0.02862643
## PRS -0.04763079 0.52444926 0.035583656 0.81468143 0.034410195 0.18072539
## RLI 0.15078386 0.44685437 -0.534534651 -0.23846831 0.619686571 -0.02767151
## WWT 0.38239634 -0.15049582 -0.006608801 0.22575912 0.003635861 0.03444704
## HFD 0.39551487 -0.20059831 -0.119760926 0.05762272 0.111224616 0.15292999
## NOD -0.27594467 0.07771055 0.704760010 -0.08793807 0.564816113 -0.02225317
## LED 0.37426584 -0.17533737 0.206320367 0.01782226 0.104462867 0.64453681
## SMW 0.35909659 -0.19207403 0.125420754 0.30913067 0.141685159 -0.70825581
## PC7 PC8 PC9 PC10
## EPI -0.01257448 -0.53269884 0.07630410 0.660687673
## ECO -0.18846551 -0.42595623 -0.16114763 -0.628090876
## SPI 0.11679132 0.53255672 0.17556072 0.224357497
## PRS 0.03312675 -0.12159614 0.08592030 0.024520032
## RLI 0.06072918 0.18291673 -0.11412347 -0.006563924
## WWT -0.71910830 0.33681681 -0.36355920 0.127184031
## HFD -0.13696586 0.08277248 0.81784100 -0.235439847
## NOD -0.18207977 0.18722637 0.10495555 -0.121082461
## LED 0.46236788 0.13342246 -0.31499623 -0.170727077
## SMW 0.40383601 0.17256974 -0.07552226 -0.050957330
fviz_pca_var(data_pca, col.var = "blue") +
theme_minimal() +
ggtitle("PCA Variable Contributions")
Loadings Analysis: Loadings indicate the importance of each variable in a principal component. Higher loadings mean the variable has a stronger influence on that component. If a variable is positively correlated with a component, its loading is positive; if negatively correlated, the loading is negative.
Since we are reducing the data to three principal components, we will analyze only the first three PCs to understand their influence.
The rotation matrix shows how much each original variable contributes to the principal components (PCs). The first PC (PC1) has the largest variance and is mainly influenced by EPI, ECO, WWT, HFD, LED, and SMW (high positive loadings). The second PC (PC2) is mainly affected by SPI, PRS, and RLI.
Variables with higher absolute values in a component have stronger influence on that PC. For example, NOD contributes significantly to PC3, while PRS strongly impacts PC4. This analysis helps us understand which variables drive the variation in the dataset and will guide us in reducing dimensions effectively.
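The strongest contributors to PC1 can also be read off directly from the rotation matrix by sorting the absolute loadings (a quick inspection step, not in the original code):
sort(abs(data_pca$rotation[, "PC1"]), decreasing = TRUE)  # variables ordered by influence on PC1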
var_contrib <- get_pca_var(data_pca)
fviz_contrib(data_pca, choice = "var", axes = 1, fill = "blue", color = "blue") +
theme_minimal() +
ggtitle("Variable Contributions to PCA Axis 1") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
fviz_contrib(data_pca, choice = "var", axes = 2, fill = "red", color = "red") +
theme_minimal() +
ggtitle("Variable Contributions to PCA Axis 2") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
fviz_contrib(data_pca, choice = "var", axes = 3, fill = "purple", color = "purple") +
theme_minimal() +
ggtitle("Variable Contributions to PCA Axis 3") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
I analyzed the PCA results to understand which variables contribute most to the main components, visualizing the contributions for the first three principal components.
Why I did this: To identify key variables that explain most of the variance and to simplify my dataset for better clustering.
Result:
PC1: EPI, HFD, WWT, and LED are the most important variables, explaining the highest variance. PC2: PRS, SPI, and RLI are key contributors. PC3: NOD and RLI dominate.
eigen_plot <- fviz_eig(data_pca, choice = "eigenvalue",
barcolor = "blue",
barfill = "blue",
main = "Eigenvalues Scree Plot") +
theme_classic()
var_perc_plot <- fviz_eig(data_pca,
barcolor = "red",
barfill = "red",
main = "Percentage of Explained Variance") +
theme_classic()
grid.arrange(eigen_plot, var_perc_plot, nrow = 2)
The scree plot shows that the first three components have the highest eigenvalues and explain most of the variance. The explained-variance plot confirms that about 78% of the variance is captured by the first three components, making them a good choice for dimensionality reduction.
pca_rotated_analysis <- principal(data_standardized, nfactors = 3, rotate = "varimax")
print(pca_rotated_analysis)
## Principal Components Analysis
## Call: principal(r = data_standardized, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 h2 u2 com
## EPI 0.86 0.35 0.09 0.87 0.13 1.4
## ECO 0.60 0.70 0.13 0.87 0.13 2.0
## SPI 0.26 0.84 -0.01 0.77 0.23 1.2
## PRS -0.31 0.67 0.00 0.55 0.45 1.4
## RLI -0.01 0.60 0.61 0.74 0.26 2.0
## WWT 0.85 0.03 0.25 0.78 0.22 1.2
## HFD 0.87 -0.05 0.35 0.87 0.13 1.3
## NOD -0.43 0.05 -0.78 0.80 0.20 1.6
## LED 0.89 0.02 0.06 0.80 0.20 1.0
## SMW 0.85 -0.02 0.12 0.73 0.27 1.0
##
## RC1 RC2 RC3
## SS loadings 4.43 2.14 1.21
## Proportion Var 0.44 0.21 0.12
## Cumulative Var 0.44 0.66 0.78
## Proportion Explained 0.57 0.28 0.16
## Cumulative Proportion 0.57 0.84 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 72.73 with prob < 1.6e-08
##
## Fit based upon off diagonal values = 0.98
I performed a rotated PCA analysis using the Varimax method to better understand the relationships between variables and principal components.
Why I did this: To simplify the interpretation of loadings by maximizing the variance of squared loadings for each component, making it easier to see which variables contribute most to each component.
Result: Three components were extracted, explaining 78% of the total variance. Component RC1 is dominated by variables like EPI, WWT, HFD, LED, and SMW. Component RC2 highlights ECO, SPI, and PRS. Component RC3 strongly relates to NOD and partially RLI.
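To make the rotated structure easier to scan, the loadings can be reprinted with small values suppressed (a readability step, assuming the standard print method for loadings objects):
print(pca_rotated_analysis$loadings, cutoff = 0.5)  # show only loadings with |value| >= 0.5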
eig_val <- get_eigenvalue(data_pca)
print(eig_val)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.01428541 50.1428541 50.14285
## Dim.2 1.95742220 19.5742220 69.71708
## Dim.3 0.81018767 8.1018767 77.81895
## Dim.4 0.65437394 6.5437394 84.36269
## Dim.5 0.56047192 5.6047192 89.96741
## Dim.6 0.28325759 2.8325759 92.79999
## Dim.7 0.27232235 2.7232235 95.52321
## Dim.8 0.24049993 2.4049993 97.92821
## Dim.9 0.14874866 1.4874866 99.41570
## Dim.10 0.05843033 0.5843033 100.00000
data_PCA_reduced3 <- data_pca$x[, 1:3]
summary(data_PCA_reduced3)
## PC1 PC2 PC3
## Min. :-3.5531 Min. :-3.43099 Min. :-2.64700
## 1st Qu.:-1.8307 1st Qu.:-1.00431 1st Qu.:-0.53937
## Median :-0.3537 Median : 0.04206 Median :-0.00979
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.4997 3rd Qu.: 0.97038 3rd Qu.: 0.54657
## Max. : 5.2195 Max. : 3.16497 Max. : 2.65803
I calculated the eigenvalues from the PCA results to determine the variance explained by each principal component and to decide how many components to keep for further analysis.
Result: First 3 components explain 77.82% of the total variance (50.14%, 19.57%, and 8.10%, respectively). Components beyond the third contribute very little to the variance (<7% each), making them less relevant for the analysis. This supports reducing the data to 3 dimensions while retaining most of the meaningful information.
kmeans_pca <- eclust(data_PCA_reduced3, "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(kmeans_pca, geom = "point", main = "K-Means Clustering") +
theme_minimal()
Two clusters are formed (red and blue regions). The PCA reduction has helped in visualizing the separation, but there is some overlap, indicating that the clusters may not be perfectly distinct in this space.
fviz_silhouette(kmeans_pca, label = TRUE) +
ggtitle("Silhouette Plot for K-Means") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 122 0.40
## 2 2 58 0.44
The average silhouette widths indicate how well each data point is clustered, with higher values representing better-defined and more cohesive clusters. By comparing the silhouette scores before and after dimension reduction using PCA, we can observe an improvement in clustering quality.
The increase in silhouette widths demonstrates that PCA helped reduce noise and redundancy in the data, making it easier to group points into more distinct clusters. This confirms that dimension reduction improves clustering quality by retaining only the most relevant information.
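The improvement for K-Means can be tabulated directly from the silhouette outputs reported before and after PCA (an illustrative summary, not part of the original code):
kmeans_sil_comparison <- data.frame(
  cluster    = c(1, 2),
  before_pca = c(0.31, 0.37),  # silhouette widths in the original feature space
  after_pca  = c(0.40, 0.44)   # silhouette widths on the first three principal components
)
print(kmeans_sil_comparison)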
calinski_harabasz <- cluster.stats(d = dist(data_PCA_reduced3), clustering = kmeans_pca$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 157.27
The Calinski-Harabasz Index (CH Index) measures clustering quality as the ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-defined clusters. For K-Means it increased from 102.59 before PCA to 157.27 after PCA, confirming the improvement.
pam_pca <- eclust(data_PCA_reduced3, FUNcluster = "pam", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(pam_pca, geom = "point", main = "PAM Clustering") +
theme_minimal()
After PCA, the clusters are more compact and well-separated when visualized. This indicates that dimensionality reduction helped in identifying distinct groups.
fviz_silhouette(pam_pca, label = TRUE) +
ggtitle("Silhouette Plot for PAM") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 127 0.40
## 2 2 53 0.47
Cluster 1: The silhouette width improved from 0.30 to 0.40, indicating better cohesion within the cluster. Cluster 2: The silhouette width remained the same at 0.47, showing consistent quality for this cluster. Overall: The clustering quality improved, as the average silhouette width increased.
calinski_harabasz <- cluster.stats(d = dist(data_PCA_reduced3), clustering = pam_pca$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 156.74
The Calinski-Harabasz Index increased significantly from 100.11 to 156.74, meaning the clusters are now more compact and well-separated. PCA reduced noise and enhanced clustering performance.
I then repeated the hierarchical clustering, agglomerative and divisive, on the PCA-reduced data and compared the quality measures with the results obtained before PCA.
calc_ac <- function(method) {
agnes(data_PCA_reduced3, method = method)$ac
}
ac_values <- sapply(linkage_methods, calc_ac)
print(ac_values)
## average single complete ward
## 0.8790348 0.6694455 0.9385899 0.9822595
The agglomerative coefficients increased for all linkage methods after PCA. Ward's method still has the highest coefficient (0.98), indicating the strongest cluster structure on the reduced data.
hc_model <- agnes(data_PCA_reduced3, method = "ward")
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
The dendrogram structure changed due to PCA-reduced dimensions. Clusters are more balanced and easier to interpret compared to before PCA.
hc_hclust <- as.hclust(hc_model)
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
hc_clusters <- cutree(hc_hclust, k = 2)
table(hc_clusters)
## hc_clusters
## 1 2
## 95 85
rect.hclust(hc_hclust, k = 2, border = "blue")
Cluster sizes became more balanced (95 and 85) compared to before PCA. The separation between clusters is visually clearer.
fviz_cluster(
list(data = data_PCA_reduced3, cluster = hc_clusters),
main = "Hierarchical Clustering - Ward's Method"
) +
theme_minimal() +
scale_color_manual(values = c("red", "blue")) +
scale_fill_manual(values = c("red", "blue"))
PCA allowed better visualization of clusters in 2D space. Clusters overlap less, and their boundaries are clearer.
agg_cluster_stats <- cluster.stats(
dist(data_PCA_reduced3),
hc_clusters
)
print(agg_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 95 85
##
## $min.cluster.size
## [1] 85
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.415800 8.465768
##
## $average.distance
## [1] 2.424508 3.084513
##
## $median.distance
## [1] 2.343930 2.928705
##
## $separation
## [1] 0.6625428 0.6625428
##
## $average.toother
## [1] 4.432048 4.432048
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.0000000 0.6625428
## [2,] 0.6625428 0.0000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 4.432048
## [2,] 4.432048 0.000000
##
## $average.between
## [1] 4.432048
##
## $average.within
## [1] 2.736177
##
## $n.between
## [1] 8075
##
## $n.within
## [1] 8035
##
## $max.diameter
## [1] 8.465768
##
## $min.separation
## [1] 0.6625428
##
## $within.cluster.ss
## [1] 813.6629
##
## $clus.avg.silwidths
## 1 2
## 0.4361703 0.2589685
##
## $avg.silwidth
## [1] 0.3524917
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5151331
##
## $dunn
## [1] 0.0782614
##
## $dunn2
## [1] 1.436871
##
## $entropy
## [1] 0.6916032
##
## $wb.ratio
## [1] 0.6173619
##
## $ch
## [1] 126.7291
##
## $cwidegap
## [1] 1.578496 1.456253
##
## $widestgap
## [1] 1.578496
##
## $sindex
## [1] 0.7915576
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Calinski-Harabasz Index: increased from 75.61 to 126.73, indicating better-defined clusters. Average Silhouette Width: improved slightly, from 0.32 to 0.35. Cluster Sizes: became much more balanced (95/85 versus 148/32 before PCA), while the clusters remain cohesive.
hc_div <- diana(data_PCA_reduced3)
cat("Dissociation Coefficient:", round(hc_div$dc, 2), "\n")
## Dissociation Coefficient: 0.93
The divisive coefficient rose from 0.83 before PCA to 0.93 after PCA, showing improved separation and a stronger hierarchical structure on the reduced data.
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
The dendrogram is cleaner and easier to interpret after PCA because of the reduced dimensionality, and the clusters are more distinguishable. The cluster sizes (120 and 60) remained the same as before PCA, indicating that PCA preserved the overall cluster distribution.
div_clusters <- cutree(as.hclust(hc_div), k = 2)
table(div_clusters)
## div_clusters
## 1 2
## 120 60
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
rect.hclust(as.hclust(hc_div), k = 2, border = viridisLite::viridis(2, option = "C"))
fviz_cluster(
list(data = data_PCA_reduced3, cluster = div_clusters),
main = "Hierarchical Clustering Using DIANA"
) +
scale_color_manual(values = viridisLite::viridis(2, option = "C")) +
scale_fill_manual(values = viridisLite::viridis(2, option = "C")) +
theme_classic()
The cluster plot in the reduced PCA space is more compact and easier to interpret than before dimensionality reduction.
div_cluster_stats <- cluster.stats(dist(data_PCA_reduced3), div_clusters)
print(div_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 120 60
##
## $min.cluster.size
## [1] 60
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.415800 6.716694
##
## $average.distance
## [1] 2.671542 2.603787
##
## $median.distance
## [1] 2.606436 2.476841
##
## $separation
## [1] 0.6069928 0.6069928
##
## $average.toother
## [1] 4.714224 4.714224
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.0000000 0.6069928
## [2,] 0.6069928 0.0000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 4.714224
## [2,] 4.714224 0.000000
##
## $average.between
## [1] 4.714224
##
## $average.within
## [1] 2.648957
##
## $n.between
## [1] 7200
##
## $n.within
## [1] 8910
##
## $max.diameter
## [1] 7.4158
##
## $min.separation
## [1] 0.6069928
##
## $within.cluster.ss
## [1] 740.4278
##
## $clus.avg.silwidths
## 1 2
## 0.4099194 0.4162648
##
## $avg.silwidth
## [1] 0.4120345
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.614367
##
## $dunn
## [1] 0.08185129
##
## $dunn2
## [1] 1.764608
##
## $entropy
## [1] 0.6365142
##
## $wb.ratio
## [1] 0.5619073
##
## $ch
## [1] 156.8696
##
## $cwidegap
## [1] 1.670422 1.364736
##
## $widestgap
## [1] 1.670422
##
## $sindex
## [1] 0.7446522
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Average Distance (Within Clusters): Decreased after PCA, showing tighter cluster grouping. Separation (Between Clusters): Reduced from 1.27 to 0.61, but this is balanced by better-defined clusters. Silhouette Widths: Improved from 0.32 to 0.41, indicating better-defined and separated clusters. Calinski-Harabasz Index: Increased significantly from 102.26 to 156.87, showing much better-defined clusters after PCA.
The main goal of this study was to explore different clustering techniques and evaluate how PCA affects clustering performance. I tested K-Means, PAM, Agglomerative Hierarchical Clustering (Ward’s Method), and Divisive Hierarchical Clustering (DIANA) before and after applying PCA. The results provided valuable insights into how PCA influences clustering quality.
1. Initial Clustering Results (Before PCA)
Before performing PCA, I applied different clustering methods to the dataset. Here’s what I observed:
Hopkins Statistic: 0.99, indicating that the data was highly clusterable. Silhouette Scores: all three methods produced similar, fairly low average widths (0.32-0.34), meaning the clusters were not sharply separated. Elbow and Silhouette Methods: suggested that k = 2 was the best number of clusters. Calinski-Harabasz Index: K-Means (102.59) scored slightly higher than PAM (100.11), but both values were modest. Hierarchical Clustering: Ward's method showed a high agglomerative coefficient (0.95), meaning strong clustering potential, but the resulting clusters were loosely formed. Divisive Clustering: the divisive coefficient of 0.83 indicated that the clusters were not strongly separated.
2. PCA Dimension Reduction
PCA was applied to reduce the dataset from 10 dimensions to 3 principal components, keeping 78% of total variance while removing redundancy in the variables.
Eigenvalues and Scree Plot: The first three principal components captured most of the variance. Variable Loadings: PCA helped understand which variables had the most influence on the dataset. Data Standardization: Ensured that all variables contributed equally to PCA. PCA allowed me to remove correlated variables while preserving most of the information. This reduced noise and redundancy, making the data more suitable for clustering.
3. Clustering Results After PCA
K-Means Clustering After PCA
Silhouette Scores Improved: Increased from 0.31/0.37 to 0.40/0.44, meaning clusters were more compact. Calinski-Harabasz Index Increased: From 102.59 to 157.27, proving that clustering quality significantly improved. Visualization Showed Clearer Clusters: PCA made K-Means clustering more defined and separated clusters better. Overall: PCA helped K-Means clustering significantly, leading to clearer, better-separated clusters.
PAM Clustering After PCA
Silhouette Scores Improved: Increased from 0.30/0.47 to 0.40/0.47. Calinski-Harabasz Index Increased: From 100.11 to 156.74, indicating that PCA improved the clustering. Cluster Separation Was Still Weak: Even though the metrics improved, PAM was still weaker than K-Means. Overall: PCA improved PAM clustering but not as much as K-Means. The results were still weaker compared to K-Means.
Agglomerative Hierarchical Clustering (Ward’s Method) After PCA
Agglomerative Coefficient Remained High: 0.95 before PCA and 0.98 after, meaning strong clustering potential. Cluster Sizes Changed: from 148/32 to 95/85, so PCA balanced the cluster sizes. Silhouette Score Improved: from 0.32 to 0.35, meaning slightly better-defined clusters. Calinski-Harabasz Index Increased: from 75.61 to 126.73, meaning the clusters became more compact and better separated. Overall: PCA improved hierarchical clustering, but the clusters were still more spread out than those from K-Means.
Divisive Hierarchical Clustering (DIANA) After PCA
Divisive Coefficient Improved: from 0.83 to 0.93, meaning a stronger hierarchical structure. Cluster Sizes Stayed the Same: 120/60 before and after PCA. Silhouette Score Increased: from 0.32 to 0.41, meaning clusters became tighter and more defined. Calinski-Harabasz Index Increased: from 102.26 to 156.87, meaning PCA improved the cluster separation. Overall: PCA significantly improved DIANA clustering by making clusters tighter and better separated.
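The Calinski-Harabasz improvements across all four methods, collected from the outputs reported in this document, can be summarized in one table (an illustrative recap, not part of the original code):
ch_summary <- data.frame(
  method        = c("K-Means", "PAM", "Agglomerative (Ward)", "Divisive (DIANA)"),
  ch_before_pca = c(102.59, 100.11, 75.61, 102.26),
  ch_after_pca  = c(157.27, 156.74, 126.73, 156.87)
)
print(ch_summary)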
PCA helped improve clustering quality by reducing noise and making the data more structured. However, PCA is not a magic solution—it works best when the correct clustering method is chosen. In this case, K-Means was the best clustering method, and PCA made it even better. For future research, testing this approach on different datasets could help understand when PCA is most effective in clustering tasks.