In this project, I worked with environmental performance data and applied several clustering techniques to group similar observations: K-Means, PAM, Agglomerative Hierarchical Clustering, and Divisive Hierarchical Clustering (DIANA).
To improve the clustering results, I used Principal Component Analysis (PCA) for dimensionality reduction, which removed redundant information and made the data easier to analyze.
After applying PCA, I ran the clustering algorithms again on the reduced data, which allowed me to compare how well the clusters performed before and after PCA.
options(repos = c(CRAN = "https://cloud.r-project.org"))
required_packages <- c("factoextra", "cluster", "gridExtra", "dbscan", "caret",
"hopkins", "cclust", "clValid", "fpc","corrplot","psych")
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
}
library(pkg, character.only = TRUE)
}
file_path <- "C:/Users/nijat/Desktop/clus_dim_data.csv"
data <- read.csv(file_path)
head(data)
## code iso country EPI.old EPI.new ECO.old ECO.new BDH.old BDH.new
## 1 4 AFG Afghanistan 18.2 31.0 21.5 31.9 25.2 31.7
## 2 8 ALB Albania 46.1 52.2 50.9 51.9 50.6 50.3
## 3 12 DZA Algeria 38.5 41.7 39.3 41.8 33.1 33.0
## 4 24 AGO Angola 31.9 40.1 36.6 44.1 42.3 41.5
## 5 28 ATG Antigua and Barbuda 54.4 55.6 52.4 53.2 52.2 52.5
## 6 32 ARG Argentina 45.9 47.0 41.2 46.7 31.7 34.7
## MKP.old MKP.new MHP.old MHP.new MPE.old MPE.new PAR.old PAR.new SPI.old
## 1 NA NA NA NA NA NA 0.3 0.3 0.6
## 2 21.2 25.5 24.4 24.4 100.0 96.7 46.4 46.4 54.4
## 3 0.0 0.0 0.0 0.0 50.0 100.0 6.1 6.1 73.2
## 4 0.0 0.0 0.0 0.0 NA NA 35.6 35.6 37.1
## 5 74.7 74.7 33.1 33.1 50.0 88.5 NA NA 19.6
## 6 13.4 17.1 9.5 33.0 77.9 89.1 14.9 14.9 30.9
## SPI.new TBN.old TBN.new TKP.old TKP.new PAE.old PAE.new PHL.old PHL.new
## 1 9.1 0.3 11.7 0.6 24.6 79.5 79.5 80.9 80.9
## 2 56.8 55.5 60.7 55.9 58.4 65.2 65.2 63.4 63.4
## 3 73.8 4.2 4.3 5.9 6.0 20.3 20.3 58.8 58.8
## 4 37.1 34.1 34.1 74.7 74.7 82.1 82.1 89.8 89.8
## 5 19.9 51.5 51.5 72.9 72.9 28.3 28.3 90.0 90.0
## 6 32.9 26.2 27.9 39.8 40.5 37.6 37.6 71.6 71.6
## RLI.old RLI.new SHI.old SHI.new BER.old BER.new ECS.old ECS.new PFL.old
## 1 75.8 75.6 67.5 66.4 35.4 35.7 NA NA NA
## 2 70.5 70.0 77.2 53.5 28.2 28.9 58.4 62.4 NA
## 3 74.3 73.8 76.3 74.0 54.1 53.9 NA NA NA
## 4 77.8 77.7 82.7 71.6 58.1 56.0 50.8 49.2 68.5
## 5 67.3 64.0 NA NA 83.7 80.9 35.4 28.7 0.0
## 6 54.4 54.2 64.3 48.2 25.6 25.2 35.6 48.9 47.2
## PFL.new IFL.old IFL.new FCL.old FCL.new TCG.old TCG.new FLI.old FLI.new
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA 65.0 71.7 36.5 36.5 67.5 67.5
## 3 NA NA NA NA NA NA NA NA NA
## 4 50.5 41.1 49.3 58.1 49.1 27.9 27.9 83.5 83.5
## 5 0.0 NA NA 70.6 59.0 30.2 30.2 46.1 46.1
## 6 62.7 44.9 51.3 21.4 44.2 0.0 0.0 72.2 72.2
## FSH.old FSH.new FSS.old FSS.new FCD.old FCD.new BTZ.old BTZ.new BTO.old
## 1 NA NA NA NA NA NA NA NA NA
## 2 13.7 15.8 NA NA 24.2 24.0 9.0 3.5 5.6
## 3 53.1 51.6 85.6 75.7 40.8 37.7 44.8 49.2 45.3
## 4 38.8 37.6 21.0 29.0 48.1 45.4 27.2 26.9 38.5
## 5 97.6 97.3 81.5 100.0 NA NA 100.0 100.0 100.0
## 6 52.0 38.5 77.5 70.5 53.4 50.8 12.8 26.8 15.8
## BTO.new RMS.old RMS.new APO.old APO.new OEB.old OEB.new OEC.old OEC.new
## 1 NA NA NA 5.2 42.0 19.4 18.0 39.9 37.6
## 2 7.8 100.0 100.0 80.3 74.5 25.8 25.9 23.3 21.8
## 3 50.1 20.8 58.2 49.7 63.7 25.8 18.7 12.0 7.8
## 4 43.4 59.3 45.3 13.5 73.2 69.5 69.1 87.5 84.4
## 5 100.0 51.3 56.4 64.6 72.0 40.0 36.2 44.4 39.6
## 6 24.6 49.9 48.3 57.0 80.5 80.1 77.0 94.4 92.5
## NXA.old NXA.new SDA.old SDA.new AGR.old AGR.new SNM.old SNM.new PSU.old
## 1 0.0 34.2 0.0 55.5 45.0 41.1 35.3 31.2 100.0
## 2 100.0 100.0 100.0 69.2 48.4 50.4 20.3 22.0 35.7
## 3 7.1 60.7 100.0 86.8 42.9 46.7 35.0 38.6 100.0
## 4 0.0 69.6 0.0 75.4 39.3 41.0 33.0 35.2 81.6
## 5 39.5 84.8 76.3 72.9 30.2 31.4 5.2 3.8 40.2
## 6 36.7 69.0 62.0 90.4 84.1 81.4 82.6 87.7 100.0
## PSU.new PRS.old PRS.new RCY.old RCY.new WRS.old WRS.new WWG.old WWG.new
## 1 97.2 60.6 52.0 40.2 41.9 11.2 11.2 89.7 89.7
## 2 36.2 67.0 65.7 68.6 73.6 23.5 39.6 64.8 56.6
## 3 100.0 76.7 74.6 39.8 38.7 53.1 55.7 62.7 62.7
## 4 97.5 97.0 97.8 24.0 18.4 15.8 15.8 86.5 86.5
## 5 39.8 29.4 41.2 54.1 54.1 48.3 48.3 56.5 56.7
## 6 100.0 37.3 29.5 80.8 95.3 46.8 46.8 32.8 32.8
## WWC.old WWC.new WWT.old WWT.new WWR.old WWR.new HLT.old HLT.new AIR.old
## 1 0.0 0.0 0.0 0.0 0.0 0.0 17.6 18.2 16.9
## 2 19.0 26.5 10.9 49.0 33.4 33.4 39.9 44.1 31.7
## 3 65.0 72.0 41.7 41.7 41.7 41.7 48.1 48.2 47.3
## 4 7.6 7.6 4.3 4.3 4.3 4.3 19.8 21.8 18.7
## 5 50.8 50.8 44.4 44.4 44.4 44.4 65.6 69.6 72.6
## 6 55.0 55.0 55.0 55.0 11.8 11.8 52.9 54.3 48.7
## AIR.new HPE.old HPE.new HFD.old HFD.new OZD.old OZD.new NOD.old NOD.new
## 1 15.8 22.8 18.8 1.7 3.6 7.5 10.3 32.6 34.0
## 2 36.4 32.8 36.6 21.1 27.6 43.6 60.2 35.4 33.7
## 3 46.0 46.8 37.1 55.7 64.5 30.5 29.6 30.9 30.6
## 4 19.9 14.4 15.4 9.8 13.8 24.4 32.7 34.8 42.4
## 5 77.8 93.9 93.5 58.0 62.7 100.0 100.0 41.1 34.2
## 6 47.6 47.9 44.6 49.8 54.5 49.0 48.6 12.1 18.0
## SOE.old SOE.new COE.old COE.new VOE.old VOE.new H2O.old H2O.new USD.old
## 1 63.5 58.7 51.0 50.2 36.5 37.1 25.5 32.3 22.8
## 2 39.0 44.6 59.6 64.4 48.9 47.2 67.2 71.0 67.4
## 3 36.5 24.0 54.0 42.7 33.2 27.9 61.0 64.1 62.8
## 4 63.7 60.3 37.5 43.2 4.3 9.8 16.9 22.9 12.5
## 5 63.0 68.8 82.6 85.8 87.3 91.0 54.6 55.8 53.4
## 6 65.6 66.9 66.4 64.5 14.2 17.4 64.6 73.4 63.3
## USD.new UWD.old UWD.new HMT.old HMT.new LED.old LED.new WMG.old WMG.new
## 1 31.1 24.4 33.1 0.0 0.0 0.0 0.0 25.2 25.2
## 2 73.2 65.1 69.6 52.9 55.2 53.7 55.2 16.4 16.4
## 3 68.4 57.7 61.3 29.9 32.6 28.6 32.6 34.3 37.6
## 4 23.5 12.3 22.5 33.1 33.8 31.1 33.8 26.1 26.1
## 5 55.1 55.5 56.2 49.4 52.1 49.7 52.1 35.6 35.6
## 6 75.2 61.4 72.2 71.9 76.7 70.6 76.7 26.6 26.6
## WPC.old WPC.new SMW.old SMW.new WRR.old WRR.new PCC.old PCC.new CCH.old
## 1 60.6 60.6 0.0 0.0 2.4 2.4 13.7 40.2 13.7
## 2 35.6 35.6 0.0 0.0 5.3 5.3 44.1 59.4 44.1
## 3 44.6 47.8 61.6 71.6 10.3 10.5 29.3 36.2 29.3
## 4 64.1 64.1 0.0 0.0 1.1 1.1 35.2 49.4 35.2
## 5 39.0 39.0 100.0 100.0 0.0 0.0 47.1 46.4 47.1
## 6 32.5 32.5 56.1 56.1 6.0 6.0 47.1 41.4 47.1
## CCH.new CDA.old CDA.new CDF.old CDF.new CHA.old CHA.new FGA.old FGA.new
## 1 40.2 0.0 38.4 0.0 100.0 4.2 48.0 36.8 4.3
## 2 59.4 37.8 54.7 47.5 77.2 82.4 87.6 36.8 3.8
## 3 36.2 34.8 40.7 21.3 30.4 48.0 40.5 34.3 12.6
## 4 49.4 26.4 54.7 78.8 100.0 55.5 54.7 36.8 3.7
## 5 46.4 42.1 46.3 22.8 28.8 67.9 57.4 NA NA
## 6 41.4 41.8 50.8 30.7 44.8 87.6 46.6 42.6 6.8
## NDA.old NDA.new BCA.old BCA.new LUF.old LUF.new GTI.old GTI.new GTP.old
## 1 9.3 50.8 1.2 39.3 42.2 44.4 0.7 35.0 8.1
## 2 72.3 100.0 65.5 100.0 49.7 50.0 43.3 56.9 44.0
## 3 9.4 46.1 3.8 59.8 49.3 49.7 29.8 32.1 32.1
## 4 49.2 49.0 5.8 52.5 47.9 48.0 38.3 48.8 42.5
## 5 52.1 57.5 57.2 59.9 50.3 50.2 39.1 42.6 36.8
## 6 58.4 27.3 41.7 54.3 45.1 48.5 44.9 43.4 39.9
## GTP.new GHN.old GHN.new CBP.old CBP.new
## 1 46.7 17.9 22.6 97.7 97.7
## 2 56.3 32.7 39.5 87.6 87.6
## 3 34.7 7.7 6.9 51.5 51.5
## 4 52.9 15.3 20.4 88.5 88.5
## 5 39.0 51.5 51.6 54.0 54.0
## 6 39.1 7.4 6.6 53.0 53.0
str(data)
## 'data.frame': 180 obs. of 149 variables:
## $ code : int 4 8 12 24 28 32 51 36 40 31 ...
## $ iso : chr "AFG" "ALB" "DZA" "AGO" ...
## $ country: chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ EPI.old: num 18.2 46.1 38.5 31.9 54.4 45.9 42.9 58.9 68.8 40.8 ...
## $ EPI.new: num 31 52.2 41.7 40.1 55.6 47 44.9 63.1 68.9 40.5 ...
## $ ECO.old: num 21.5 50.9 39.3 36.6 52.4 41.2 47.1 59.9 77.9 45.1 ...
## $ ECO.new: num 31.9 51.9 41.8 44.1 53.2 46.7 48.1 62.5 77.6 44.7 ...
## $ BDH.old: num 25.2 50.6 33.1 42.3 52.2 31.7 48.6 50.5 74.6 36.1 ...
## $ BDH.new: num 31.7 50.3 33 41.5 52.5 34.7 47.6 55.2 74.3 36.7 ...
## $ MKP.old: num NA 21.2 0 0 74.7 13.4 NA 36.5 NA 0.8 ...
## $ MKP.new: num NA 25.5 0 0 74.7 17.1 NA 57.2 NA 0.8 ...
## $ MHP.old: num NA 24.4 0 0 33.1 9.5 NA 30.1 NA 20 ...
## $ MHP.new: num NA 24.4 0 0 33.1 33 NA 42.5 NA 20 ...
## $ MPE.old: num NA 100 50 NA 50 77.9 NA 71.1 NA 50 ...
## $ MPE.new: num NA 96.7 100 NA 88.5 89.1 NA 57.2 NA 50 ...
## $ PAR.old: num 0.3 46.4 6.1 35.6 NA 14.9 14.4 50.9 86.4 7.2 ...
## $ PAR.new: num 0.3 46.4 6.1 35.6 NA 14.9 14.4 50.9 86.4 7.2 ...
## $ SPI.old: num 0.6 54.4 73.2 37.1 19.6 30.9 38.7 53.5 78.9 42.1 ...
## $ SPI.new: num 9.1 56.8 73.8 37.1 19.9 32.9 38.7 65.8 83.6 43.1 ...
## $ TBN.old: num 0.3 55.5 4.2 34.1 51.5 26.2 77.2 54.7 85.9 31.3 ...
## $ TBN.new: num 11.7 60.7 4.3 34.1 51.5 27.9 77.2 67.1 86.9 31.4 ...
## $ TKP.old: num 0.6 55.9 5.9 74.7 72.9 39.8 39.5 47.3 70.5 38 ...
## $ TKP.new: num 24.6 58.4 6 74.7 72.9 40.5 39.5 58.1 70.6 41.3 ...
## $ PAE.old: num 79.5 65.2 20.3 82.1 28.3 37.6 49.8 89.1 76.7 55.2 ...
## $ PAE.new: num 79.5 65.2 20.3 82.1 28.3 37.6 49.8 89.1 76.7 55.2 ...
## $ PHL.old: num 80.9 63.4 58.8 89.8 90 71.6 71.5 92.2 57.9 31 ...
## $ PHL.new: num 80.9 63.4 58.8 89.8 90 71.6 71.5 92.2 57.9 31 ...
## $ RLI.old: num 75.8 70.5 74.3 77.8 67.3 54.4 49.9 48 62.6 80.8 ...
## $ RLI.new: num 75.6 70 73.8 77.7 64 54.2 47.9 39.5 59.1 81.2 ...
## $ SHI.old: num 67.5 77.2 76.3 82.7 NA 64.3 89.8 81 71.6 83.9 ...
## $ SHI.new: num 66.4 53.5 74 71.6 NA 48.2 80.4 50.3 59.9 80.1 ...
## $ BER.old: num 35.4 28.2 54.1 58.1 83.7 25.6 47.5 29.3 47.3 11.9 ...
## $ BER.new: num 35.7 28.9 53.9 56 80.9 25.2 49.8 28.9 47.2 13.3 ...
## $ ECS.old: num NA 58.4 NA 50.8 35.4 35.6 80.3 60.2 58.9 79.5 ...
## $ ECS.new: num NA 62.4 NA 49.2 28.7 48.9 83.4 42.3 47.5 82.3 ...
## $ PFL.old: num NA NA NA 68.5 0 47.2 NA 97.5 NA NA ...
## $ PFL.new: num NA NA NA 50.5 0 62.7 NA 97.4 NA NA ...
## $ IFL.old: num NA NA NA 41.1 NA 44.9 NA 21.1 NA NA ...
## $ IFL.new: num NA NA NA 49.3 NA 51.3 NA 0 NA NA ...
## $ FCL.old: num NA 65 NA 58.1 70.6 21.4 88.8 64.7 66.7 92.1 ...
## $ FCL.new: num NA 71.7 NA 49.1 59 44.2 99.3 20.1 53.4 98.3 ...
## $ TCG.old: num NA 36.5 NA 27.9 30.2 0 58.2 44.9 38.7 50.7 ...
## $ TCG.new: num NA 36.5 NA 27.9 30.2 0 58.2 44.9 38.7 50.7 ...
## $ FLI.old: num NA 67.5 NA 83.5 46.1 72.2 54.2 72.2 35.6 65.4 ...
## $ FLI.new: num NA 67.5 NA 83.5 46.1 72.2 54.2 72.2 35.6 65.4 ...
## $ FSH.old: num NA 13.7 53.1 38.8 97.6 52 NA 48.5 NA NA ...
## $ FSH.new: num NA 15.8 51.6 37.6 97.3 38.5 NA 48.1 NA NA ...
## $ FSS.old: num NA NA 85.6 21 81.5 77.5 NA 29.4 NA NA ...
## $ FSS.new: num NA NA 75.7 29 100 70.5 NA 28.6 NA NA ...
## $ FCD.old: num NA 24.2 40.8 48.1 NA 53.4 NA 45.4 NA NA ...
## $ FCD.new: num NA 24 37.7 45.4 NA 50.8 NA 47.7 NA NA ...
## $ BTZ.old: num NA 9 44.8 27.2 100 12.8 NA 46.3 NA NA ...
## $ BTZ.new: num NA 3.5 49.2 26.9 100 26.8 NA 51.1 NA NA ...
## $ BTO.old: num NA 5.6 45.3 38.5 100 15.8 NA 49 NA NA ...
## $ BTO.new: num NA 7.8 50.1 43.4 100 24.6 NA 53.3 NA NA ...
## $ RMS.old: num NA 100 20.8 59.3 51.3 49.9 NA 77.6 NA NA ...
## $ RMS.new: num NA 100 58.2 45.3 56.4 48.3 NA 57.2 NA NA ...
## $ APO.old: num 5.2 80.3 49.7 13.5 64.6 57 38.7 87.4 93.1 77.1 ...
## $ APO.new: num 42 74.5 63.7 73.2 72 80.5 54.2 95.9 92.9 67 ...
## $ OEB.old: num 19.4 25.8 25.8 69.5 40 80.1 40.9 79 56.3 45 ...
## $ OEB.new: num 18 25.9 18.7 69.1 36.2 77 37.3 77.3 54.2 39.3 ...
## $ OEC.old: num 39.9 23.3 12 87.5 44.4 94.4 45.3 86.2 65.5 56.4 ...
## $ OEC.new: num 37.6 21.8 7.8 84.4 39.6 92.5 40.6 82.1 61 49.8 ...
## $ NXA.old: num 0 100 7.1 0 39.5 36.7 41.2 91.2 100 70.5 ...
## $ NXA.new: num 34.2 100 60.7 69.6 84.8 69 18.6 98.2 100 60.3 ...
## $ SDA.old: num 0 100 100 0 76.3 62 50.8 79.7 100 100 ...
## $ SDA.new: num 55.5 69.2 86.8 75.4 72.9 90.4 95.8 100 100 82.7 ...
## $ AGR.old: num 45 48.4 42.9 39.3 30.2 84.1 56.6 53.8 68.9 62.7 ...
## $ AGR.new: num 41.1 50.4 46.7 41 31.4 81.4 35.5 65.3 72.5 63 ...
## $ SNM.old: num 35.3 20.3 35 33 5.2 82.6 29.7 52.9 64.6 45.8 ...
## $ SNM.new: num 31.2 22 38.6 35.2 3.8 87.7 13.1 49.7 64 48.1 ...
## $ PSU.old: num 100 35.7 100 81.6 40.2 100 100 53.4 41 100 ...
## $ PSU.new: num 97.2 36.2 100 97.5 39.8 100 100 51.8 40.9 100 ...
## $ PRS.old: num 60.6 67 76.7 97 29.4 37.3 66.3 57.5 75.3 80.7 ...
## $ PRS.new: num 52 65.7 74.6 97.8 41.2 29.5 49.3 59 71.6 64.2 ...
## $ RCY.old: num 40.2 68.6 39.8 24 54.1 80.8 59.9 58.9 71.5 55.3 ...
## $ RCY.new: num 41.9 73.6 38.7 18.4 54.1 95.3 46.7 84.7 84.1 74.4 ...
## $ WRS.old: num 11.2 23.5 53.1 15.8 48.3 46.8 30.3 87 85.7 24.3 ...
## $ WRS.new: num 11.2 39.6 55.7 15.8 48.3 46.8 34 87.2 87.3 29.7 ...
## $ WWG.old: num 89.7 64.8 62.7 86.5 56.5 32.8 0 26.2 0 43 ...
## $ WWG.new: num 89.7 56.6 62.7 86.5 56.7 32.8 0 26.2 10.8 44 ...
## $ WWC.old: num 0 19 65 7.6 50.8 55 38 100 100 18.9 ...
## $ WWC.new: num 0 26.5 72 7.6 50.8 55 42.9 100 100 25.9 ...
## $ WWT.old: num 0 10.9 41.7 4.3 44.4 55 38 92.4 95.2 18.9 ...
## $ WWT.new: num 0 49 41.7 4.3 44.4 55 42.9 92.9 96 25.9 ...
## $ WWR.old: num 0 33.4 41.7 4.3 44.4 11.8 14.6 92.9 100 38.1 ...
## $ WWR.new: num 0 33.4 41.7 4.3 44.4 11.8 14.6 92.9 100 38.1 ...
## $ HLT.old: num 17.6 39.9 48.1 19.8 65.6 52.9 39.5 81.1 66.2 39.1 ...
## $ HLT.new: num 18.2 44.1 48.2 21.8 69.6 54.3 41.8 84 71 40.4 ...
## $ AIR.old: num 16.9 31.7 47.3 18.7 72.6 48.7 29 78.8 56.3 37.1 ...
## $ AIR.new: num 15.8 36.4 46 19.9 77.8 47.6 30.9 80.9 61.5 38.2 ...
## $ HPE.old: num 22.8 32.8 46.8 14.4 93.9 47.9 22.9 90 30.8 36.3 ...
## $ HPE.new: num 18.8 36.6 37.1 15.4 93.5 44.6 19.3 94.9 46.1 37.9 ...
## $ HFD.old: num 1.7 21.1 55.7 9.8 58 49.8 29.9 82.1 84.7 28.1 ...
## $ HFD.new: num 3.6 27.6 64.5 13.8 62.7 54.5 38.5 85.1 88.9 36.8 ...
## $ OZD.old: num 7.5 43.6 30.5 24.4 100 49 27.2 85.5 41.6 41.2 ...
## $ OZD.new: num 10.3 60.2 29.6 32.7 100 48.6 35.2 71 39.3 37.1 ...
## $ NOD.old: num 32.6 35.4 30.9 34.8 41.1 12.1 29.6 19.2 20.6 36.3 ...
## $ NOD.new: num 34 33.7 30.6 42.4 34.2 18 27.7 25.9 25 35.4 ...
## [list output truncated]
I used the Environmental Performance Index (EPI) dataset in my project. This dataset has different environmental indicators like air quality, biodiversity, climate change, and pollution. Each row represents a country, and columns show environmental scores.
I then cleaned the dataset: I removed the columns ending in ".old" because they were not needed, renamed the remaining columns by dropping the ".new" suffix to simplify the names, and deleted the "iso" and "code" columns because they were not useful for the analysis.
data_cleaned <- data[, !grepl("\\.old$", names(data))]
names(data_cleaned) <- gsub("\\.new$", "", names(data_cleaned))
data_cleaned <- data_cleaned[, !(names(data_cleaned) %in% c("iso", "code"))]
Next, I handled the missing values: I removed columns with more than two missing values to keep only well-populated variables, and filled the remaining gaps with the mean of each column to avoid holes in the data.
data_cleaned <- data_cleaned[, colSums(is.na(data_cleaned)) <= 2]
for (col in names(data_cleaned)) {
if (is.numeric(data_cleaned[[col]])) {
data_cleaned[[col]][is.na(data_cleaned[[col]])] <- mean(data_cleaned[[col]], na.rm = TRUE)
}
}
data_cleaned[!complete.cases(data_cleaned),]
## [1] country EPI ECO BDH SPI TBN TKP RLI BER
## [10] APO OEB NXA SDA AGR SNM PSU PRS RCY
## [19] WRS WWG WWC WWT WWR HLT AIR HPE HFD
## [28] OZD NOD SOE COE VOE H2O USD UWD HMT
## [37] LED WMG WPC SMW WRR PCC CCH CDA CDF
## [46] CHA NDA BCA GTI GTP GHN CBP
## <0 rows> (or 0-length row.names)
No missing values remain, so the data is ready for further analysis.
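As a quick sanity check (not part of the original output), the total NA count can be confirmed directly:
sum(is.na(data_cleaned))  # should return 0 after mean imputation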
numeric_columns <- names(data_cleaned)[!(names(data_cleaned) %in% c("country"))]
for (col in numeric_columns) {
boxplot(data_cleaned[[col]],
main = paste("Box Plot of", col),
ylab = col,
col = "lightblue",
border = "blue")
}
I then checked for outliers: I selected the numeric columns (excluding "country") and created a box plot for each one to visualize its distribution and spot extreme values.
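A complementary numeric check, shown here as an illustrative sketch rather than part of the original workflow, counts how many values fall outside the 1.5 * IQR whiskers in each numeric column:
outlier_counts <- sapply(data_cleaned[numeric_columns], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)  # points beyond the box-plot whiskers
})
sort(outlier_counts, decreasing = TRUE)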
data_cleaned <- data_cleaned[, !(names(data_cleaned) %in% c("GTP", "GTI", "CDA", "WRR", "CCH"))]
I removed the GTP, GTI, CDA, WRR, and CCH columns to eliminate variables that were not useful for the analysis. This simplifies the dataset and helps clustering and PCA focus on the more informative features.
cor_matrix <- cor(data_cleaned[, -c(1)])
high_cor <- findCorrelation(cor_matrix, cutoff = 0.9)
names_to_drop <- colnames(data_cleaned[, -c(1)])[high_cor]
data_cleaned <- data_cleaned[, !(names(data_cleaned) %in% names_to_drop)]
I then identified variable pairs with a correlation above 0.9 and dropped one variable from each pair to reduce redundancy and avoid multicollinearity in the clustering and PCA steps.
data_cleaned_scaled <- scale(data_cleaned[2:40])
data_cleaned <- cbind(data_cleaned[, 1,drop = FALSE], as.data.frame(data_cleaned_scaled))
I standardized the numeric columns to ensure all features have the same scale.
clustering_data <- data_cleaned[, !(names(data_cleaned) %in% c("country"))]
set.seed(123)
hopkins_stat <- hopkins::hopkins(clustering_data)
cat("Hopkins statistic:", hopkins_stat, "\n")
## Hopkins statistic: 0.9997001
distance_matrix <- dist(clustering_data)
fviz_dist(distance_matrix,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"),
show_labels = FALSE)
I prepared the data for clustering and checked its cluster tendency using the Hopkins statistic. A Hopkins statistic of 0.9997 means the data has a very strong clustering tendency (values close to 1 indicate highly clusterable data).
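As an illustrative reference point (not part of the original analysis), the same statistic can be computed on uniformly random data of the same shape; for structureless data it should fall near 0.5, far below the 0.9997 observed here:
set.seed(123)
random_reference <- as.data.frame(
  apply(clustering_data, 2, function(x) runif(length(x), min(x), max(x)))
)
hopkins::hopkins(random_reference)  # expected to be close to 0.5 for uniform noise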
data_for_clustering <- clustering_data
methods <- list(kmeans = kmeans, pam = pam, hcut = hcut)
titles <- c("Silhouette: K-Means", "Silhouette: PAM", "Silhouette: Hierarchical")
sil_plots <- lapply(1:3, function(i) {
fviz_nbclust(data_for_clustering, FUNcluster = methods[[i]], method = "silhouette") +
ggtitle(titles[i]) + theme_minimal()
})
grid.arrange(grobs = sil_plots, nrow = 3)
I checked the silhouette scores for K-Means, PAM, and Hierarchical clustering. The values were very low, meaning the clusters were not well-separated. To improve this, I selected only 10 important columns from my dataset.
selected_columns <- c("EPI", "ECO", "SPI", "PRS", "RLI", "WWT", "HFD", "NOD", "LED", "SMW")
final_columns <- c("country", selected_columns)
data_cleaned <- data_cleaned[, final_columns]
I then recomputed the Hopkins statistic on the reduced set of variables to verify that it still indicates a strong clustering tendency.
clustering_data <- data_cleaned[, !(names(data_cleaned) %in% c("country"))]
data_for_clustering <- clustering_data
set.seed(123)
hopkins_stat <- hopkins::hopkins(data_for_clustering)
cat("Hopkins statistic:", hopkins_stat, "\n")
## Hopkins statistic: 0.9973144
distance_matrix <- dist(data_for_clustering)
fviz_dist(distance_matrix,
gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"),
show_labels = FALSE)
In this step, I identify the optimal number of clusters for the dataset using the average silhouette width and the elbow method.
methods <- list(kmeans = kmeans, pam = pam, hcut = hcut)
titles <- c("Silhouette: K-Means", "Silhouette: PAM", "Silhouette: Hierarchical")
sil_plots <- lapply(1:3, function(i) {
fviz_nbclust(data_for_clustering, FUNcluster = methods[[i]], method = "silhouette") +
ggtitle(titles[i]) + theme_minimal()
})
grid.arrange(grobs = sil_plots, nrow = 3)
After reducing the data, silhouette scores increased for all clustering methods, and the Hopkins statistic confirmed that the data is clusterable.
Silhouette methods also showed that k = 2 is the best cluster number for all clustering methods.
k_selected <- 2
sil_values <- sapply(1:3, function(i) {
plot_data <- sil_plots[[i]]$data
plot_data[plot_data$clusters == k_selected, "y"]
})
cat("Silhouette value for k =", k_selected, "\n")
## Silhouette value for k = 2
cat("K-Means:", round(sil_values[1], 2), "\n")
## K-Means: 0.33
cat("PAM:", round(sil_values[2], 2), "\n")
## PAM: 0.34
cat("Hierarchical:", round(sil_values[3], 2), "\n")
## Hierarchical: 0.32
I also used the elbow method, which confirmed that k = 2 is the optimal number of clusters for all clustering methods.
methods <- list(kmeans = kmeans, pam = pam, hcut = hcut)
titles <- c("Elbow: K-Means", "Elbow: PAM", "Elbow: Hierarchical")
elbow_plots <- lapply(1:3, function(i) {
fviz_nbclust(data_for_clustering, FUNcluster = methods[[i]], method = "wss") +
ggtitle(titles[i]) + theme_minimal()
})
grid.arrange(grobs = elbow_plots, nrow = 3)
Clustering Visualization
kmeans_result <- eclust(data_for_clustering, "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(kmeans_result, geom = "point", main = "K-Means Clustering") +
theme_minimal()
The cluster plot shows that K-Means separated the countries into two well-defined clusters with a clear boundary between them.
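To see which countries fall into each group, the cluster labels can be matched back to the country names (an optional inspection step, assuming the rows of data_for_clustering are still aligned with data_cleaned$country):
country_clusters <- split(data_cleaned$country, kmeans_result$cluster)
lapply(country_clusters, head)  # preview a few countries per cluster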
SILHOUETTE INDEX
fviz_silhouette(kmeans_result, label = TRUE) +
ggtitle("Silhouette Plot for K-Means") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 122 0.31
## 2 2 58 0.37
The silhouette index results (0.31 for Cluster 1 and 0.37 for Cluster 2) suggest that the clusters are distinct but still show some degree of overlap.
Calinski-Harabasz
calinski_harabasz <- cluster.stats(d = dist(data_for_clustering), clustering = kmeans_result$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 102.59
I calculated the Calinski-Harabasz Index, which gave a value of 102.59; higher values indicate better separation, so this suggests a reasonably well-defined clustering structure.
Clustering Visualization
pam_result <- eclust(data_for_clustering, FUNcluster = "pam", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(pam_result, geom = "point", main = "PAM Clustering") +
theme_minimal()
I performed PAM (Partitioning Around Medoids) clustering with k = 2 using the Euclidean distance metric and visualized the clusters to analyze the structure of the data. The cluster plot shows that PAM also separated the countries into two well-defined clusters with a clear boundary between them.
SILHOUETTE INDEX
fviz_silhouette(pam_result, label = TRUE) +
ggtitle("Silhouette Plot for PAM") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 135 0.30
## 2 2 45 0.47
I computed the silhouette plot for PAM clustering with k = 2. The average silhouette width is 0.30 for Cluster 1 and 0.47 for Cluster 2, indicating that Cluster 2 is better separated than Cluster 1, but the overall clustering structure could still be improved.
Calinski-Harabasz
calinski_harabasz <- cluster.stats(d = dist(data_for_clustering), clustering = pam_result$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 100.11
I computed the Calinski-Harabasz Index for PAM clustering, which gave 100.11. This is slightly lower than the K-Means score (102.59), suggesting that K-Means produces slightly better-defined clusters on this dataset.
Overall, K-Means performed slightly better, as indicated by the higher Calinski-Harabasz Index, while PAM formed a better-defined second cluster, as shown by its higher silhouette score for Cluster 2. If cluster compactness is the priority, K-Means is preferable.
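The before-PCA comparison of the two partitioning methods can be collected into a small table (an illustrative summary built from the values reported above, not part of the original code):
partition_summary <- data.frame(
  method            = c("K-Means", "PAM"),
  avg_silhouette    = c(0.33, 0.34),      # average silhouette widths at k = 2
  calinski_harabasz = c(102.59, 100.11)   # from cluster.stats()
)
print(partition_summary)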
Next, I proceed with hierarchical clustering using two approaches, agglomerative and divisive. I first perform the clustering and then evaluate and compare the quality measures.
linkage_methods <- c("average", "single", "complete", "ward")
names(linkage_methods) <- linkage_methods
calc_ac <- function(method) {
agnes(data_for_clustering, method = method)$ac
}
ac_values <- sapply(linkage_methods, calc_ac)
print(ac_values)
## average single complete ward
## 0.7179504 0.4412106 0.8511947 0.9538338
I calculated the agglomerative coefficient (AC) to check how well each linkage method forms structured clusters; a higher AC means a stronger clustering structure.
I chose Ward's method because it has the highest AC (0.95), meaning it produces the most compact, well-separated clusters. Ward linkage minimizes the within-cluster variance, making it the most suitable choice here.
hc_model <- agnes(data_for_clustering, method = "ward")
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
hc_hclust <- as.hclust(hc_model)
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
hc_clusters <- cutree(hc_hclust, k = 2)
table(hc_clusters)
## hc_clusters
## 1 2
## 148 32
rect.hclust(hc_hclust, k = 2, border = "blue")
I converted the agnes model to an hclust object, cut the tree into two clusters, counted the cluster sizes, and highlighted the groups with rectangles on the dendrogram to show the final grouping.
fviz_cluster(
list(data = data_for_clustering, cluster = hc_clusters),
main = "Hierarchical Clustering - Ward's Method"
) +
theme_minimal() +
scale_color_manual(values = c("red", "blue")) +
scale_fill_manual(values = c("red", "blue"))
I visualized the Ward's-method clusters by plotting the observations with their cluster assignments in different colors (red and blue), using a minimal theme for a clean presentation.
agg_cluster_stats <- cluster.stats(
dist(data_for_clustering),
hc_clusters
)
print(agg_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 148 32
##
## $min.cluster.size
## [1] 32
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.834906 4.313502
##
## $average.distance
## [1] 3.813062 2.242219
##
## $median.distance
## [1] 3.734420 2.146629
##
## $separation
## [1] 1.464251 1.464251
##
## $average.toother
## [1] 5.380658 5.380658
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.000000 1.464251
## [2,] 1.464251 0.000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 5.380658
## [2,] 5.380658 0.000000
##
## $average.between
## [1] 5.380658
##
## $average.within
## [1] 3.533801
##
## $n.between
## [1] 4736
##
## $n.within
## [1] 11374
##
## $max.diameter
## [1] 7.834906
##
## $min.separation
## [1] 1.464251
##
## $within.cluster.ss
## [1] 1256.328
##
## $clus.avg.silwidths
## 1 2
## 0.2595458 0.5771514
##
## $avg.silwidth
## [1] 0.316009
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.508956
##
## $dunn
## [1] 0.1868881
##
## $dunn2
## [1] 1.411112
##
## $entropy
## [1] 0.468007
##
## $wb.ratio
## [1] 0.6567601
##
## $ch
## [1] 75.61212
##
## $cwidegap
## [1] 2.399012 1.677730
##
## $widestgap
## [1] 2.399012
##
## $sindex
## [1] 1.805722
##
## $corrected.rand
## NULL
##
## $vi
## NULL
I evaluated the cluster quality for the Ward's-method solution. The dataset was divided into two clusters (sizes 148 and 32). The average silhouette width is 0.32, indicating a moderate clustering structure; the Dunn index is 0.19, suggesting compact clusters with moderate separation; and the Calinski-Harabasz index is 75.61, showing acceptable cluster validity.
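For easier reading, these key measures can also be pulled directly out of the cluster.stats() result instead of scanning the full printout (a small convenience step, not in the original code):
with(agg_cluster_stats,
     c(avg.silwidth = avg.silwidth, dunn = dunn, ch = ch, min.separation = min.separation))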
hc_div <- diana(data_for_clustering)
cat("Dissociation Coefficient:", round(hc_div$dc, 2), "\n")
## Dissociation Coefficient: 0.83
I performed divisive hierarchical clustering (DIANA); the divisive coefficient of 0.83 indicates a fairly strong hierarchical clustering structure.
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
div_clusters <- cutree(as.hclust(hc_div), k = 2)
table(div_clusters)
## div_clusters
## 1 2
## 120 60
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
rect.hclust(as.hclust(hc_div), k = 2, border = viridisLite::viridis(2, option = "C"))
fviz_cluster(
list(data = data_for_clustering, cluster = div_clusters),
main = "Hierarchical Clustering Using DIANA"
) +
scale_color_manual(values = viridisLite::viridis(2, option = "C")) +
scale_fill_manual(values = viridisLite::viridis(2, option = "C")) +
theme_classic()
I visualized the clusters formed by Divisive Hierarchical Clustering (DIANA) in a scatter plot with distinct colors, which helps show the separation and structure of the clusters.
Compared to Agglomerative Clustering, the Divisive Clustering (DIANA) method produced clusters that are closer to each other. This suggests that Agglomerative Clustering created more distinct clusters, while Divisive Clustering resulted in groups that are less separated.
div_cluster_stats <- cluster.stats(dist(data_for_clustering), div_clusters)
print(div_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 120 60
##
## $min.cluster.size
## [1] 60
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.834906 7.045571
##
## $average.distance
## [1] 3.483922 3.269464
##
## $median.distance
## [1] 3.451303 3.238025
##
## $separation
## [1] 1.269166 1.269166
##
## $average.toother
## [1] 5.19601 5.19601
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.000000 1.269166
## [2,] 1.269166 0.000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.00000 5.19601
## [2,] 5.19601 0.00000
##
## $average.between
## [1] 5.19601
##
## $average.within
## [1] 3.412436
##
## $n.between
## [1] 7200
##
## $n.within
## [1] 8910
##
## $max.diameter
## [1] 7.834906
##
## $min.separation
## [1] 1.269166
##
## $within.cluster.ss
## [1] 1136.888
##
## $clus.avg.silwidths
## 1 2
## 0.3124926 0.3488300
##
## $avg.silwidth
## [1] 0.3246051
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5956821
##
## $dunn
## [1] 0.1619887
##
## $dunn2
## [1] 1.491426
##
## $entropy
## [1] 0.6365142
##
## $wb.ratio
## [1] 0.6567415
##
## $ch
## [1] 102.2563
##
## $cwidegap
## [1] 2.314239 2.589898
##
## $widestgap
## [1] 2.589898
##
## $sindex
## [1] 1.469359
##
## $corrected.rand
## NULL
##
## $vi
## NULL
The average silhouette width (0.32) is essentially the same as for Agglomerative Clustering, so neither method is clearly better by this measure. Cluster separation (1.27 versus 1.46) is smaller, indicating the clusters sit closer together. The Calinski-Harabasz Index (102.26) is noticeably higher than for Agglomerative Clustering (75.61), suggesting a better-defined structure in terms of compactness, while the Dunn Index (0.16 versus 0.19) is lower, meaning weaker worst-case separation.
Conclusion: by the separation-based measures (Dunn index, minimum separation), Agglomerative Clustering formed the more distinct clusters, although Divisive Clustering scored higher on the Calinski-Harabasz Index.
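A side-by-side summary of the two hierarchical approaches before PCA can be built from the cluster.stats() objects computed above (a small illustrative table, not in the original code):
hier_summary <- data.frame(
  method         = c("Agglomerative (Ward)", "Divisive (DIANA)"),
  avg_silwidth   = c(agg_cluster_stats$avg.silwidth,   div_cluster_stats$avg.silwidth),
  dunn           = c(agg_cluster_stats$dunn,           div_cluster_stats$dunn),
  ch             = c(agg_cluster_stats$ch,             div_cluster_stats$ch),
  min_separation = c(agg_cluster_stats$min.separation, div_cluster_stats$min.separation)
)
print(hier_summary)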
data_for_dimreduction <- clustering_data
data_matrix <- data.matrix(data_for_dimreduction)
correlation_matrix <- cor(data_matrix)
print(correlation_matrix, digits = 2)
## EPI ECO SPI PRS RLI WWT HFD NOD LED SMW
## EPI 1.000 0.81 0.44 -0.084 0.318 0.70 0.76 -0.38 0.762 0.689
## ECO 0.810 1.00 0.74 0.177 0.452 0.54 0.50 -0.37 0.498 0.456
## SPI 0.440 0.74 1.00 0.326 0.405 0.21 0.17 -0.17 0.237 0.153
## PRS -0.084 0.18 0.33 1.000 0.284 -0.14 -0.26 0.12 -0.223 -0.150
## RLI 0.318 0.45 0.41 0.284 1.000 0.14 0.19 -0.23 0.088 0.071
## WWT 0.699 0.54 0.21 -0.145 0.135 1.00 0.82 -0.52 0.713 0.722
## HFD 0.756 0.50 0.17 -0.257 0.191 0.82 1.00 -0.59 0.776 0.745
## NOD -0.382 -0.37 -0.17 0.123 -0.232 -0.52 -0.59 1.00 -0.419 -0.436
## LED 0.762 0.50 0.24 -0.223 0.088 0.71 0.78 -0.42 1.000 0.704
## SMW 0.689 0.46 0.15 -0.150 0.071 0.72 0.75 -0.44 0.704 1.000
corrplot(correlation_matrix,
method = "circle",
type = "lower",
title = "Variables Correlation Matrix",
cl.cex = 0.8,
tl.cex = 0.7,
tl.col = "black",
col = viridisLite::inferno(10))
EPI and ECO have a strong positive correlation (0.81), meaning they move together. WWT and HFD are highly correlated (0.82), indicating a close relationship between them. PRS has weak correlations with most variables, meaning it behaves independently. NOD is negatively correlated with most variables, suggesting it has an inverse relationship with them.
preproc_data <- preProcess(data_for_dimreduction, method = c("center", "scale"))
data_standardized <- predict(preproc_data, data_for_dimreduction)
summary(data_standardized)
## EPI ECO SPI PRS
## Min. :-1.9511 Min. :-2.1818 Min. :-1.76428 Min. :-3.01195
## 1st Qu.:-0.7222 1st Qu.:-0.7086 1st Qu.:-0.77931 1st Qu.:-0.63046
## Median :-0.1262 Median :-0.1193 Median : 0.03399 Median :-0.02317
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.5418 3rd Qu.: 0.7033 3rd Qu.: 0.77914 3rd Qu.: 0.75477
## Max. : 2.5107 Max. : 2.4406 Max. : 1.77562 Max. : 1.98691
## RLI WWT HFD NOD
## Min. :-1.7063 Min. :-1.1300 Min. :-1.2168 Min. :-1.8830
## 1st Qu.:-0.8269 1st Qu.:-0.9518 1st Qu.:-0.9755 1st Qu.:-0.7257
## Median : 0.1641 Median :-0.1747 Median :-0.1730 Median :-0.2207
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8120 3rd Qu.: 0.8701 3rd Qu.: 0.7697 3rd Qu.: 0.4908
## Max. : 1.5025 Max. : 1.8277 Max. : 1.9513 Max. : 2.9491
## LED SMW
## Min. :-2.2892 Min. :-1.0577
## 1st Qu.:-0.8327 1st Qu.:-1.0284
## Median :-0.1968 Median :-0.2639
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6689 3rd Qu.: 1.2053
## Max. : 2.0072 Max. : 1.3809
data_cov <- cov(data_standardized)
data_eigen <- eigen(data_cov)
data_eigen$values
## [1] 5.01428541 1.95742220 0.81018767 0.65437394 0.56047192 0.28325759
## [7] 0.27232235 0.24049993 0.14874866 0.05843033
Eigenvalues indicate the variance explained by each principal component; for standardized data they sum to the number of variables (here 10), so dividing each eigenvalue by that total gives the proportion of variance explained. The first component explains the most variance, and the loadings (rotation matrix) printed below show how much each original variable contributes to each component.
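To make that concrete, the proportion of variance per component can be computed directly from the eigenvalues obtained above:
round(data_eigen$values / sum(data_eigen$values), 3)  # proportion of variance per component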
data_pca <- prcomp(data_standardized, center = FALSE, scale. = FALSE)
print(data_pca)
## Standard deviations (1, .., p=10):
## [1] 2.2392600 1.3990791 0.9001043 0.8089338 0.7486467 0.5322195 0.5218451
## [8] 0.4904079 0.3856795 0.2417237
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5 PC6
## EPI 0.40549207 0.05789696 0.228998814 -0.14106795 0.182926893 0.01282849
## ECO 0.34768543 0.35227007 0.176755269 -0.19345796 -0.159569309 -0.15318355
## SPI 0.20458947 0.51114054 0.244851265 -0.25337883 -0.439876492 -0.02862643
## PRS -0.04763079 0.52444926 0.035583656 0.81468143 0.034410195 0.18072539
## RLI 0.15078386 0.44685437 -0.534534651 -0.23846831 0.619686571 -0.02767151
## WWT 0.38239634 -0.15049582 -0.006608801 0.22575912 0.003635861 0.03444704
## HFD 0.39551487 -0.20059831 -0.119760926 0.05762272 0.111224616 0.15292999
## NOD -0.27594467 0.07771055 0.704760010 -0.08793807 0.564816113 -0.02225317
## LED 0.37426584 -0.17533737 0.206320367 0.01782226 0.104462867 0.64453681
## SMW 0.35909659 -0.19207403 0.125420754 0.30913067 0.141685159 -0.70825581
## PC7 PC8 PC9 PC10
## EPI -0.01257448 -0.53269884 0.07630410 0.660687673
## ECO -0.18846551 -0.42595623 -0.16114763 -0.628090876
## SPI 0.11679132 0.53255672 0.17556072 0.224357497
## PRS 0.03312675 -0.12159614 0.08592030 0.024520032
## RLI 0.06072918 0.18291673 -0.11412347 -0.006563924
## WWT -0.71910830 0.33681681 -0.36355920 0.127184031
## HFD -0.13696586 0.08277248 0.81784100 -0.235439847
## NOD -0.18207977 0.18722637 0.10495555 -0.121082461
## LED 0.46236788 0.13342246 -0.31499623 -0.170727077
## SMW 0.40383601 0.17256974 -0.07552226 -0.050957330
fviz_pca_var(data_pca, col.var = "blue") +
theme_minimal() +
ggtitle("PCA Variable Contributions")
Loadings Analysis: Loadings indicate the importance of each variable in a principal component. Higher loadings mean the variable has a stronger influence on that component. If a variable is positively correlated with a component, its loading is positive; if negatively correlated, the loading is negative.
Since we are reducing the data to three principal components, we will analyze only the first three PCs to understand their influence.
The rotation matrix shows how much each original variable contributes to the principal components (PCs). The first PC (PC1) has the largest variance and is mainly influenced by EPI, ECO, WWT, HFD, LED, and SMW (high positive loadings). The second PC (PC2) is mainly affected by SPI, PRS, and RLI.
Variables with higher absolute values in a component have stronger influence on that PC. For example, NOD contributes significantly to PC3, while PRS strongly impacts PC4. This analysis helps us understand which variables drive the variation in the dataset and will guide us in reducing dimensions effectively.
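The strongest contributors to PC1 can also be read off directly from the rotation matrix by sorting the absolute loadings (a quick inspection step, not in the original code):
sort(abs(data_pca$rotation[, "PC1"]), decreasing = TRUE)  # variables ordered by influence on PC1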
var_contrib <- get_pca_var(data_pca)
fviz_contrib(data_pca, choice = "var", axes = 1, fill = "blue", color = "blue") +
theme_minimal() +
ggtitle("Variable Contributions to PCA Axis 1") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
fviz_contrib(data_pca, choice = "var", axes = 2, fill = "red", color = "red") +
theme_minimal() +
ggtitle("Variable Contributions to PCA Axis 2") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
fviz_contrib(data_pca, choice = "var", axes = 3, fill = "purple", color = "purple") +
theme_minimal() +
ggtitle("Variable Contributions to PCA Axis 3") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
I analyzed the PCA results to understand which variables contribute most to the main components, visualizing the contributions for the first three principal components.
Why I did this: To identify key variables that explain most of the variance and to simplify my dataset for better clustering.
Result:
PC1: EPI, HFD, WWT, and LED are the most important variables, explaining the highest variance. PC2: PRS, SPI, and RLI are key contributors. PC3: NOD and RLI dominate.
eigen_plot <- fviz_eig(data_pca, choice = "eigenvalue",
barcolor = "blue",
barfill = "blue",
main = "Eigenvalues Scree Plot") +
theme_classic()
var_perc_plot <- fviz_eig(data_pca,
barcolor = "red",
barfill = "red",
main = "Percentage of Explained Variance") +
theme_classic()
grid.arrange(eigen_plot, var_perc_plot, nrow = 2)
The scree plot shows that the first three components have the highest eigenvalues and explain most of the variance. The explained-variance plot confirms that about 78% of the variance is captured by the first three components, making them a good choice for dimensionality reduction.
pca_rotated_analysis <- principal(data_standardized, nfactors = 3, rotate = "varimax")
print(pca_rotated_analysis)
## Principal Components Analysis
## Call: principal(r = data_standardized, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 h2 u2 com
## EPI 0.86 0.35 0.09 0.87 0.13 1.4
## ECO 0.60 0.70 0.13 0.87 0.13 2.0
## SPI 0.26 0.84 -0.01 0.77 0.23 1.2
## PRS -0.31 0.67 0.00 0.55 0.45 1.4
## RLI -0.01 0.60 0.61 0.74 0.26 2.0
## WWT 0.85 0.03 0.25 0.78 0.22 1.2
## HFD 0.87 -0.05 0.35 0.87 0.13 1.3
## NOD -0.43 0.05 -0.78 0.80 0.20 1.6
## LED 0.89 0.02 0.06 0.80 0.20 1.0
## SMW 0.85 -0.02 0.12 0.73 0.27 1.0
##
## RC1 RC2 RC3
## SS loadings 4.43 2.14 1.21
## Proportion Var 0.44 0.21 0.12
## Cumulative Var 0.44 0.66 0.78
## Proportion Explained 0.57 0.28 0.16
## Cumulative Proportion 0.57 0.84 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 72.73 with prob < 1.6e-08
##
## Fit based upon off diagonal values = 0.98
I performed a rotated PCA analysis using the Varimax method to better understand the relationships between variables and principal components.
Why I did this: To simplify the interpretation of loadings by maximizing the variance of squared loadings for each component, making it easier to see which variables contribute most to each component.
Result: Three components were extracted, explaining 78% of the total variance. Component RC1 is dominated by variables like EPI, WWT, HFD, LED, and SMW. Component RC2 highlights ECO, SPI, and PRS. Component RC3 strongly relates to NOD and partially RLI.
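To make the rotated structure easier to scan, the loadings can be reprinted with small values suppressed (a readability step, assuming the standard print method for loadings objects):
print(pca_rotated_analysis$loadings, cutoff = 0.5)  # show only loadings with |value| >= 0.5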
eig_val <- get_eigenvalue(data_pca)
print(eig_val)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.01428541 50.1428541 50.14285
## Dim.2 1.95742220 19.5742220 69.71708
## Dim.3 0.81018767 8.1018767 77.81895
## Dim.4 0.65437394 6.5437394 84.36269
## Dim.5 0.56047192 5.6047192 89.96741
## Dim.6 0.28325759 2.8325759 92.79999
## Dim.7 0.27232235 2.7232235 95.52321
## Dim.8 0.24049993 2.4049993 97.92821
## Dim.9 0.14874866 1.4874866 99.41570
## Dim.10 0.05843033 0.5843033 100.00000
data_PCA_reduced3 <- data_pca$x[, 1:3]
summary(data_PCA_reduced3)
## PC1 PC2 PC3
## Min. :-3.5531 Min. :-3.43099 Min. :-2.64700
## 1st Qu.:-1.8307 1st Qu.:-1.00431 1st Qu.:-0.53937
## Median :-0.3537 Median : 0.04206 Median :-0.00979
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.4997 3rd Qu.: 0.97038 3rd Qu.: 0.54657
## Max. : 5.2195 Max. : 3.16497 Max. : 2.65803
I calculated the eigenvalues from the PCA results to determine the variance explained by each principal component and to decide how many components to keep for further analysis.
Result: First 3 components explain 77.82% of the total variance (50.14%, 19.57%, and 8.10%, respectively). Components beyond the third contribute very little to the variance (<7% each), making them less relevant for the analysis. This supports reducing the data to 3 dimensions while retaining most of the meaningful information.
kmeans_pca <- eclust(data_PCA_reduced3, "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(kmeans_pca, geom = "point", main = "K-Means Clustering") +
theme_minimal()
Two clusters are formed (red and blue regions). The PCA reduction has helped in visualizing the separation, but there is some overlap, indicating that the clusters may not be perfectly distinct in this space.
fviz_silhouette(kmeans_pca, label = TRUE) +
ggtitle("Silhouette Plot for K-Means") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 122 0.40
## 2 2 58 0.44
The average silhouette widths indicate how well each data point is clustered, with higher values representing better-defined and more cohesive clusters. By comparing the silhouette scores before and after dimension reduction using PCA, we can observe an improvement in clustering quality.
The increase in silhouette widths demonstrates that PCA helped reduce noise and redundancy in the data, making it easier to group points into more distinct clusters. This confirms that dimension reduction improves clustering quality by retaining only the most relevant information.
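The improvement for K-Means can be tabulated directly from the silhouette outputs reported before and after PCA (an illustrative summary, not part of the original code):
kmeans_sil_comparison <- data.frame(
  cluster    = c(1, 2),
  before_pca = c(0.31, 0.37),  # silhouette widths in the original feature space
  after_pca  = c(0.40, 0.44)   # silhouette widths on the first three principal components
)
print(kmeans_sil_comparison)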
calinski_harabasz <- cluster.stats(d = dist(data_PCA_reduced3), clustering = kmeans_pca$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 157.27
The Calinski-Harabasz Index (CH Index) measures clustering quality as the ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-defined clusters. For K-Means it increased from 102.59 before PCA to 157.27 after PCA, confirming the improvement.
pam_pca <- eclust(data_PCA_reduced3, FUNcluster = "pam", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(pam_pca, geom = "point", main = "PAM Clustering") +
theme_minimal()
After PCA, the clusters are more compact and well-separated when visualized. This indicates that dimensionality reduction helped in identifying distinct groups.
fviz_silhouette(pam_pca, label = TRUE) +
ggtitle("Silhouette Plot for PAM") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 127 0.40
## 2 2 53 0.47
Cluster 1: The silhouette width improved from 0.30 to 0.40, indicating better cohesion within the cluster. Cluster 2: The silhouette width remained the same at 0.47, showing consistent quality for this cluster. Overall: The clustering quality improved, as the average silhouette width increased.
calinski_harabasz <- cluster.stats(d = dist(data_PCA_reduced3), clustering = pam_pca$cluster)$ch
cat("Calinski-Harabasz Index:", round(calinski_harabasz, 2), "\n")
## Calinski-Harabasz Index: 156.74
The Calinski-Harabasz Index increased significantly from 100.11 to 156.74, meaning the clusters are now more compact and well-separated. PCA reduced noise and enhanced clustering performance.
I then repeated the hierarchical clustering, agglomerative and divisive, on the PCA-reduced data and compared the quality measures with the results obtained before PCA.
calc_ac <- function(method) {
agnes(data_PCA_reduced3, method = method)$ac
}
ac_values <- sapply(linkage_methods, calc_ac)
print(ac_values)
## average single complete ward
## 0.8790348 0.6694455 0.9385899 0.9822595
The agglomerative coefficients increased for all linkage methods after PCA. Ward's method still has the highest coefficient (0.98), indicating the strongest cluster structure on the reduced data.
hc_model <- agnes(data_PCA_reduced3, method = "ward")
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
The dendrogram structure changed due to PCA-reduced dimensions. Clusters are more balanced and easier to interpret compared to before PCA.
hc_hclust <- as.hclust(hc_model)
pltree(hc_model, cex = 0.6, hang = -1, main = "Dendrogram - Ward's Method")
hc_clusters <- cutree(hc_hclust, k = 2)
table(hc_clusters)
## hc_clusters
## 1 2
## 95 85
rect.hclust(hc_hclust, k = 2, border = "blue")
Cluster sizes became more balanced (95 and 85) compared to before PCA. The separation between clusters is visually clearer.
fviz_cluster(
list(data = data_PCA_reduced3, cluster = hc_clusters),
main = "Hierarchical Clustering - Ward's Method"
) +
theme_minimal() +
scale_color_manual(values = c("red", "blue")) +
scale_fill_manual(values = c("red", "blue"))
PCA allowed better visualization of clusters in 2D space. Clusters overlap less, and their boundaries are clearer.
agg_cluster_stats <- cluster.stats(
dist(data_PCA_reduced3),
hc_clusters
)
print(agg_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 95 85
##
## $min.cluster.size
## [1] 85
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.415800 8.465768
##
## $average.distance
## [1] 2.424508 3.084513
##
## $median.distance
## [1] 2.343930 2.928705
##
## $separation
## [1] 0.6625428 0.6625428
##
## $average.toother
## [1] 4.432048 4.432048
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.0000000 0.6625428
## [2,] 0.6625428 0.0000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 4.432048
## [2,] 4.432048 0.000000
##
## $average.between
## [1] 4.432048
##
## $average.within
## [1] 2.736177
##
## $n.between
## [1] 8075
##
## $n.within
## [1] 8035
##
## $max.diameter
## [1] 8.465768
##
## $min.separation
## [1] 0.6625428
##
## $within.cluster.ss
## [1] 813.6629
##
## $clus.avg.silwidths
## 1 2
## 0.4361703 0.2589685
##
## $avg.silwidth
## [1] 0.3524917
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5151331
##
## $dunn
## [1] 0.0782614
##
## $dunn2
## [1] 1.436871
##
## $entropy
## [1] 0.6916032
##
## $wb.ratio
## [1] 0.6173619
##
## $ch
## [1] 126.7291
##
## $cwidegap
## [1] 1.578496 1.456253
##
## $widestgap
## [1] 1.578496
##
## $sindex
## [1] 0.7915576
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Calinski-Harabasz Index: increased from 75.61 to 126.73, indicating better-defined clusters. Average Silhouette Width: improved slightly, from 0.32 to 0.35. Cluster Sizes: became much more balanced (95/85 versus 148/32 before PCA), while the clusters remain cohesive.
hc_div <- diana(data_PCA_reduced3)
cat("Dissociation Coefficient:", round(hc_div$dc, 2), "\n")
## Dissociation Coefficient: 0.93
The divisive coefficient rose from 0.83 before PCA to 0.93 after PCA, showing improved separation and a stronger hierarchical structure on the reduced data.
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
The dendrogram is cleaner and easier to interpret after PCA because of the reduced dimensionality, and the clusters are more distinguishable. The cluster sizes (120 and 60) remained the same as before PCA, indicating that PCA preserved the overall cluster distribution.
div_clusters <- cutree(as.hclust(hc_div), k = 2)
table(div_clusters)
## div_clusters
## 1 2
## 120 60
pltree(hc_div, cex = 0.6, hang = -1, main = "Dendrogram - DIANA")
rect.hclust(as.hclust(hc_div), k = 2, border = viridisLite::viridis(2, option = "C"))
fviz_cluster(
list(data = data_PCA_reduced3, cluster = div_clusters),
main = "Hierarchical Clustering Using DIANA"
) +
scale_color_manual(values = viridisLite::viridis(2, option = "C")) +
scale_fill_manual(values = viridisLite::viridis(2, option = "C")) +
theme_classic()
The cluster plot in the reduced PCA space is more compact and easier to interpret than before dimensionality reduction.
div_cluster_stats <- cluster.stats(dist(data_PCA_reduced3), div_clusters)
print(div_cluster_stats)
## $n
## [1] 180
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 120 60
##
## $min.cluster.size
## [1] 60
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7.415800 6.716694
##
## $average.distance
## [1] 2.671542 2.603787
##
## $median.distance
## [1] 2.606436 2.476841
##
## $separation
## [1] 0.6069928 0.6069928
##
## $average.toother
## [1] 4.714224 4.714224
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.0000000 0.6069928
## [2,] 0.6069928 0.0000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 4.714224
## [2,] 4.714224 0.000000
##
## $average.between
## [1] 4.714224
##
## $average.within
## [1] 2.648957
##
## $n.between
## [1] 7200
##
## $n.within
## [1] 8910
##
## $max.diameter
## [1] 7.4158
##
## $min.separation
## [1] 0.6069928
##
## $within.cluster.ss
## [1] 740.4278
##
## $clus.avg.silwidths
## 1 2
## 0.4099194 0.4162648
##
## $avg.silwidth
## [1] 0.4120345
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.614367
##
## $dunn
## [1] 0.08185129
##
## $dunn2
## [1] 1.764608
##
## $entropy
## [1] 0.6365142
##
## $wb.ratio
## [1] 0.5619073
##
## $ch
## [1] 156.8696
##
## $cwidegap
## [1] 1.670422 1.364736
##
## $widestgap
## [1] 1.670422
##
## $sindex
## [1] 0.7446522
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Average Distance (Within Clusters): Decreased after PCA, showing tighter cluster grouping. Separation (Between Clusters): Reduced from 1.27 to 0.61, but this is balanced by better-defined clusters. Silhouette Widths: Improved from 0.32 to 0.41, indicating better-defined and separated clusters. Calinski-Harabasz Index: Increased significantly from 102.26 to 156.87, showing much better-defined clusters after PCA.
The main goal of this study was to explore different clustering techniques and evaluate how PCA affects clustering performance. I tested K-Means, PAM, Agglomerative Hierarchical Clustering (Ward’s Method), and Divisive Hierarchical Clustering (DIANA) before and after applying PCA. The results provided valuable insights into how PCA influences clustering quality.
1. Initial Clustering Results (Before PCA)
Before performing PCA, I applied different clustering methods to the dataset. Here’s what I observed:
Hopkins Statistic: 0.99, indicating that the data was highly clusterable. Silhouette Scores: all three methods produced similar, fairly low average widths (0.32-0.34), meaning the clusters were not sharply separated. Elbow and Silhouette Methods: suggested that k = 2 was the best number of clusters. Calinski-Harabasz Index: K-Means (102.59) scored slightly higher than PAM (100.11), but both values were modest. Hierarchical Clustering: Ward's method showed a high agglomerative coefficient (0.95), meaning strong clustering potential, but the resulting clusters were loosely formed. Divisive Clustering: the divisive coefficient of 0.83 indicated that the clusters were not strongly separated.
2. PCA Dimension Reduction
PCA was applied to reduce the dataset from 10 dimensions to 3 principal components, keeping 78% of total variance while removing redundancy in the variables.
Eigenvalues and Scree Plot: The first three principal components captured most of the variance. Variable Loadings: PCA helped understand which variables had the most influence on the dataset. Data Standardization: Ensured that all variables contributed equally to PCA. PCA allowed me to remove correlated variables while preserving most of the information. This reduced noise and redundancy, making the data more suitable for clustering.
3. Clustering Results After PCA
K-Means Clustering After PCA
Silhouette Scores Improved: Increased from 0.31/0.37 to 0.40/0.44, meaning clusters were more compact. Calinski-Harabasz Index Increased: From 102.59 to 157.27, proving that clustering quality significantly improved. Visualization Showed Clearer Clusters: PCA made K-Means clustering more defined and separated clusters better. Overall: PCA helped K-Means clustering significantly, leading to clearer, better-separated clusters.
PAM Clustering After PCA
Silhouette Scores Improved: Increased from 0.30/0.47 to 0.40/0.47. Calinski-Harabasz Index Increased: From 100.11 to 156.74, indicating that PCA improved the clustering. Cluster Separation Was Still Weak: Even though the metrics improved, PAM was still weaker than K-Means. Overall: PCA improved PAM clustering but not as much as K-Means. The results were still weaker compared to K-Means.
Agglomerative Hierarchical Clustering (Ward’s Method) After PCA
Agglomerative Coefficient Remained High: 0.95 before PCA and 0.98 after, meaning strong clustering potential. Cluster Sizes Changed: from 148/32 to 95/85, so PCA balanced the cluster sizes. Silhouette Score Improved: from 0.32 to 0.35, meaning slightly better-defined clusters. Calinski-Harabasz Index Increased: from 75.61 to 126.73, meaning the clusters became more compact and better separated. Overall: PCA improved hierarchical clustering, but the clusters were still more spread out than those from K-Means.
Divisive Hierarchical Clustering (DIANA) After PCA
Divisive Coefficient Improved: from 0.83 to 0.93, meaning a stronger hierarchical structure. Cluster Sizes Stayed the Same: 120/60 before and after PCA. Silhouette Score Increased: from 0.32 to 0.41, meaning clusters became tighter and more defined. Calinski-Harabasz Index Increased: from 102.26 to 156.87, meaning PCA improved the cluster separation. Overall: PCA significantly improved DIANA clustering by making clusters tighter and better separated.
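The Calinski-Harabasz improvements across all four methods, collected from the outputs reported in this document, can be summarized in one table (an illustrative recap, not part of the original code):
ch_summary <- data.frame(
  method        = c("K-Means", "PAM", "Agglomerative (Ward)", "Divisive (DIANA)"),
  ch_before_pca = c(102.59, 100.11, 75.61, 102.26),
  ch_after_pca  = c(157.27, 156.74, 126.73, 156.87)
)
print(ch_summary)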
PCA helped improve clustering quality by reducing noise and making the data more structured. However, PCA is not a magic solution—it works best when the correct clustering method is chosen. In this case, K-Means was the best clustering method, and PCA made it even better. For future research, testing this approach on different datasets could help understand when PCA is most effective in clustering tasks.