0.1 Project description

For my final project I decided to combine two topics - clustering and dimension reduction. The paper is structured as follows:

In the first part I included the data description, basic statistics, exploratory data analysis and data cleaning. Then I moved to the clustering part, using the k-means, PAM and hierarchical clustering algorithms. After investigating the clustering quality I performed dimension reduction and repeated the clustering. At the end I compared both clusterings and described the most interesting findings.

0.2 Statistics and exploratory data analysis

0.2.1 Data preparation

In my project I decided to use the university ranking database from kaggle.com. The data contains statistics about different universities. Variables are presented as ranks, widely used in school rankings (like the Polish high-school ranking Perspektywy). The variables describe quality measures of the investigated universities.

  1. world_rank - world ranking place of the university
  2. institution - name of the university
  3. country - country of the university
  4. national_rank - rank of the university within its country
  5. quality_of_education - rank of quality of education
  6. alumni_employment - rank of employment among graduates
  7. quality_of_faculty - rank of quality of faculty
  8. publications - rank of publication number
  9. influence - rank of the influence of the university
  10. citations - rank for number of citations
  11. broad_impact - rank for broad impact
  12. patents - rank for patents
  13. score - total score, used to determine place in world ranking
  14. year

To perform clustering we will drop columns:

- country - in the clustering we would like to focus on the quality values

- world_rank - it already encodes the university’s overall ranking position and could dominate the clustering algorithm. I want to investigate the clustering tendency based on the other quality measures, not on the position in the world ranking.

I will also focus only on one year - 2015.
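
The filtering and column-dropping step itself is not shown in this excerpt; a minimal sketch of how it could look, assuming the raw Kaggle file is named cwurData.csv and uses the column names listed above:

# assumed preparation step: keep 2015 only, drop the ranking position and country
university <- read.csv("cwurData.csv")
university <- subset(university, year == 2015,
                     select = -c(world_rank, country, year))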

Final data structure:

str(university)
## 'data.frame':    1000 obs. of  11 variables:
##  $ institution         : chr  "Harvard University" "Stanford University" "Massachusetts Institute of Technology" "University of Cambridge" ...
##  $ national_rank       : int  1 2 3 1 2 4 5 6 7 8 ...
##  $ quality_of_education: int  1 9 3 2 7 13 5 11 4 12 ...
##  $ alumni_employment   : int  1 2 11 10 13 6 21 14 15 18 ...
##  $ quality_of_faculty  : int  1 4 2 5 10 9 6 8 3 14 ...
##  $ publications        : int  1 5 15 11 7 13 10 17 72 24 ...
##  $ influence           : int  1 3 2 6 12 13 4 16 25 15 ...
##  $ citations           : int  1 3 2 12 7 11 4 12 24 25 ...
##  $ broad_impact        : int  1 4 2 13 9 12 7 22 33 22 ...
##  $ patents             : int  3 10 1 48 15 4 29 141 225 11 ...
##  $ score               : num  100 98.7 97.5 96.8 96.5 ...

We can easily observe that we have 1000 records in our data. All rank variables are stored as integers (institution is a character column and score is numeric). In the next step we need to check whether there are any missing values in the data:

university[!complete.cases(university),]
##  [1] institution          national_rank        quality_of_education
##  [4] alumni_employment    quality_of_faculty   publications        
##  [7] influence            citations            broad_impact        
## [10] patents              score               
## <0 rows> (or 0-length row.names)

No missing values - the data is ready for further analysis.

0.2.2 Data analysis

Before clustering we need to investigate the quality of the data and check whether there are any outliers (they can affect the clustering quality). We will also investigate the scale of the data - if the scales differ among variables, we will have to standardize the data before clustering.

summary(university)
##  institution        national_rank    quality_of_education alumni_employment
##  Length:1000        Min.   :  1.00   Min.   :  1.0        Min.   :  1.0    
##  Class :character   1st Qu.:  7.00   1st Qu.:250.8        1st Qu.:250.8    
##  Mode  :character   Median : 22.00   Median :367.0        Median :500.5    
##                     Mean   : 42.51   Mean   :299.8        Mean   :406.5    
##                     3rd Qu.: 52.00   3rd Qu.:367.0        3rd Qu.:567.0    
##                     Max.   :229.00   Max.   :367.0        Max.   :567.0    
##  quality_of_faculty  publications      influence       citations    
##  Min.   :  1.0      Min.   :   1.0   Min.   :  1.0   Min.   :  1.0  
##  1st Qu.:218.0      1st Qu.: 250.8   1st Qu.:250.8   1st Qu.:234.0  
##  Median :218.0      Median : 500.5   Median :500.5   Median :428.0  
##  Mean   :194.3      Mean   : 500.4   Mean   :500.3   Mean   :451.3  
##  3rd Qu.:218.0      3rd Qu.: 750.0   3rd Qu.:750.2   3rd Qu.:645.0  
##  Max.   :218.0      Max.   :1000.0   Max.   :991.0   Max.   :812.0  
##   broad_impact       patents          score       
##  Min.   :   1.0   Min.   :  1.0   Min.   : 44.02  
##  1st Qu.: 250.0   1st Qu.:250.8   1st Qu.: 44.30  
##  Median : 495.0   Median :500.5   Median : 44.78  
##  Mean   : 496.7   Mean   :491.7   Mean   : 46.86  
##  3rd Qu.: 741.0   3rd Qu.:749.0   3rd Qu.: 46.54  
##  Max.   :1000.0   Max.   :871.0   Max.   :100.00

0.2.2.1 Data distributions

The scales differ among variables, so we will need to standardize the data. For a better view of the variables we will investigate their distributions with index plots and boxplots:

num_vars <- colnames(university)[2:11]   # numeric rank variables (avoid masking base::colnames)
par(mfrow = c(1, 2))
for (i in num_vars) {
  plot(university[, i], main = paste("Plot of", i), ylab = i, col = paletteInferno[2], cex.main = 1)
  boxplot(university[, i], main = paste("Boxplot of", i), ylab = i, col = paletteInferno[6], border = paletteInferno[2], cex.main = 1)
}

OBSERVATIONS: Some of the variables have a lot of outliers or odd distributions. This can interfere with the clustering tendency, and indeed when I first performed clustering it affected the results. We need to drop those variables in order to obtain the best possible clustering quality.
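
To back this observation with numbers, one could count the classic boxplot outliers (points beyond 1.5 IQR from the quartiles) for each variable - a small sketch using base R's boxplot.stats (counts not reproduced here):

# number of boxplot outliers per numeric variable
sapply(university[, 2:11], function(x) length(boxplot.stats(x)$out))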

Dropped variables: national_rank, quality_of_education, quality_of_faculty, score

university<-university[c(-2,-3,-5,-11)]
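
The same operation written with column names instead of positional indices (equivalent to the line above, given the column order shown in str()):

# equivalent to university[c(-2,-3,-5,-11)]
university <- university[, !(names(university) %in%
                             c("national_rank", "quality_of_education",
                               "quality_of_faculty", "score"))]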

0.2.2.2 Correlation plot

The correlation plot shows how the variables are correlated. This will be useful in the further analysis when I perform PCA. For now we can check whether correlations exist and how strong they are. For the variables connected with the university’s publishing activity we can observe strong correlations. That indicates that PCA should be able to reduce the dimensionality. We will investigate this in the second part of this paper.

university_matrix <- data.matrix(university[2:7])
correlation_matrix<-cor(university_matrix)
corrplot(correlation_matrix, method="circle", type="lower", title = "Variables correlation matrix", cl.cex=0.8, tl.cex=0.7, tl.col="black", col=inferno(10))

0.2.2.3 Data standardization

To get the most accurate clustering we need to standardize the data. As a result, all variables have mean 0 and standard deviation 1.

university<-scale(university[2:7])
summary(university)
##  alumni_employment  publications          influence          citations       
##  Min.   :-2.1712   Min.   :-1.7295293   Min.   :-1.73079   Min.   :-1.78640  
##  1st Qu.:-0.8340   1st Qu.:-0.8646244   1st Qu.:-0.86501   1st Qu.:-0.86213  
##  Median : 0.5031   Median : 0.0002805   Median : 0.00078   Median :-0.09256  
##  Mean   : 0.0000   Mean   : 0.0000000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.8591   3rd Qu.: 0.8643197   3rd Qu.: 0.86657   3rd Qu.: 0.76824  
##  Max.   : 0.8591   Max.   : 1.7300903   Max.   : 1.70115   Max.   : 1.43070  
##   broad_impact          patents        
##  Min.   :-1.727306   Min.   :-1.77842  
##  1st Qu.:-0.859583   1st Qu.:-0.87321  
##  Median :-0.005799   Median : 0.03199  
##  Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.851470   3rd Qu.: 0.93266  
##  Max.   : 1.754042   Max.   : 1.37485

0.3 Clustering - kmeans and PAM

In this step we will investigate the optimal number of clusters and try different clustering methods. There are many clustering methods, which we discussed during the USL classes, but in this paper I will focus on only three:

K-MEANS, the classic partitioning method

PAM (Partitioning Around Medoids)

CLARA, an algorithm created for clustering large applications; my dataset isn’t large, so there is no need to apply it

HIERARCHICAL

In this paper I will apply three of these methods - k-means, PAM and hierarchical clustering.

0.3.1 Clustering tendency

Hopkins statistic - check whether data can be clustered.

The obtained Hopkins statistic indicates that the dataset is likely to have a clustering structure. My result (0.61) suggests a reasonable clustering tendency, and the ordered dissimilarity plot points in the same direction. That indicates that we can move on to the clustering process.

get_clust_tendency(university_matrix, 2, graph=TRUE, gradient=list(low=paletteInferno[3], mid="white", high=paletteInferno[2]))
## $hopkins_stat
## [1] 0.606107
## 
## $plot
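
Note that only 2 points are sampled here (the second argument of get_clust_tendency); the Hopkins statistic is usually estimated on a larger random sample, which gives a more stable value. A sketch of such a re-run (not executed in this paper):

set.seed(123)
# sample 100 points instead of 2 for a more stable Hopkins estimate
get_clust_tendency(university_matrix, n = 100, graph = FALSE)$hopkins_stat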

0.3.2 Optimal number of clusters

In this step I will determine the optimal number of clusters for my data. I will use the silhouette width and the elbow method to decide how many clusters to use in the clustering algorithms.

SILHOUETTE WIDTH

sil_kmeans <- fviz_nbclust(university, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
sil_pam <- fviz_nbclust(university, FUNcluster = pam, method = "silhouette") + theme_classic() 
sil_hcut <- fviz_nbclust(university, FUNcluster = hcut, method = "silhouette") + theme_classic() 
grid.arrange(sil_kmeans, sil_pam, sil_hcut, nrow=3)

ELBOW METHOD

elbow_kmeans<-fviz_nbclust(university, FUNcluster = kmeans, method = "wss")+theme_classic()
elbow_pam<-fviz_nbclust(university, FUNcluster = cluster::pam, method = "wss")+theme_classic()
elbow_hcut<-fviz_nbclust(university, FUNcluster = hcut, method = "wss")+theme_classic()
grid.arrange(elbow_kmeans, elbow_pam, elbow_hcut, nrow=3)

OBSERVATION: Both methods suggest that the optimal number of clusters is 2. We will proceed with this number of clusters in further steps.

0.3.3 K-means

Clustering visualization

cl_kmeans <- eclust(university, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)

kmeans_plot <- fviz_cluster(cl_kmeans, data = university, ellipse.type = "convex", main="Clustering using kmeans", xlim=c(-5,5)) + 
  scale_color_manual(values = paletteInferno[2:4]) + scale_fill_manual(values = paletteInferno[2:4]) +
  theme_classic()

kmeans_plot 

The clustering graph shows that the algorithm did not divide the universities into two well-separated clusters. The clusters lie close to each other and the boundary between them isn’t clear.

Clustering quality measures

SILHOUETTE INDEX

kmeans_sil<-fviz_silhouette(cl_kmeans)+theme_classic()
##   cluster size ave.sil.width
## 1       1  584          0.43
## 2       2  416          0.44
kmeans_sil

OBSERVATIONS: The clusters are similar in terms of the number of observations. The index results aren’t fully satisfactory - an average silhouette width around 0.4 means that the clustering is meaningful (the clusters differ from each other), but the clusters overlap.
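
If a single overall number is preferred over the per-cluster averages, the eclust object should also carry the full silhouette information (an assumption about the structure of the factoextra object):

cl_kmeans$silinfo$clus.avg.widths   # per-cluster averages, as in the table above
cl_kmeans$silinfo$avg.width         # overall average silhouette width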

Rand Index

set.per1 <- university[, 1:3]   # first half of the (scaled) variables
set.per2 <- university[, 4:6]   # second half of the (scaled) variables


d1_kmeans<-cclust(set.per1, 2, dist="euclidean")
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
## Also defined by 'kernlab'
d2_kmeans<-cclust(set.per2, 2, dist="euclidean")
kmeans_stat<-comPart(d1_kmeans, d2_kmeans)
kmeans_stat
##       ARI        RI         J        FM 
## 0.6459092 0.8230310 0.7048469 0.8268982

OBSERVATIONS:

ARI - measures whether each observation was assigned to the same cluster in the two partitions (here: clusterings built on the two variable subsets). The result of 0.6459 indicates a good level of agreement.

RI - a measure similar to ARI (ARI is the chance-adjusted version); the value of 0.8230 indicates good agreement between the partitions.

J (Jaccard index) - measures the overlap between the two partitions. A value of 0.7048 indicates that most pairs of observations agree, but in some cases the memberships differ.

FM (Fowlkes-Mallows index) - similar to RI, but defined as a geometric mean over pairs of observations. A value of 0.8269 indicates very good agreement.
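
As a quick intuition for the ARI, a toy example on two hand-made label vectors (unrelated to the project data), using the same mclust function that is applied to the PAM results later:

library(mclust)
a <- c(1, 1, 1, 2, 2, 2)    # first partition of six objects
b <- c(1, 1, 2, 2, 2, 2)    # second partition, with one object moved
adjustedRandIndex(a, a)     # identical partitions -> 1
adjustedRandIndex(a, b)     # partial agreement -> roughly 0.32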

Calinski-Harabasz & Duda-Hart

CH_kmeans<-round(calinhara(university, cl_kmeans$cluster),digits=2)
DH_kmeans<-dudahart2(university, cl_kmeans$cluster) 

CH_kmeans
## [1] 1150.77

OBSERVATION: A value of 1150.77 is quite high, which means that the clustering is reasonably well separated. This index is mostly useful for comparing different numbers of clusters (or different clusterings), so it will be more useful in the next parts of this paper.

DH_kmeans
## $p.value
## [1] 0
## 
## $dh
## [1] 0.464451
## 
## $compare
## [1] 0.8414263
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232

OBSERVATIONS: For the Duda-Hart test we can focus on the cluster1 value. In my case cluster1 = FALSE, so the H0 of homogeneity is rejected in favour of H1. That indicates a statistically significant cluster structure, with heterogeneous clusters.

0.3.3.1 Conclusions

According to the quality measures checked above, the clustering performance is quite strong - the data was split into reasonably separated and meaningful clusters (according to the DH and CH indices). The silhouette index showed acceptable cluster separation (although it could be better), and the ARI and Jaccard indices showed that the clustering is fairly robust.

Just to be sure, I also compare the Calinski-Harabasz index across different numbers of clusters to check whether increasing the number of clusters improves the quality. We can observe that the CH index decreases as the number of clusters grows, which indicates that increasing the number of clusters is pointless in this case.

k_range=c(2,3,4,5,6)

for(k in k_range){
  cl_kmeans <- eclust(university, k, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
  print(calinhara(university, cl_kmeans$cluster))
}
## [1] 1150.773
## [1] 883.5275
## [1] 732.0458
## [1] 652.2331
## [1] 592.676

0.3.4 PAM

Let’s start with clustering visualization.

cl_pam <- eclust(university, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
pam_plot <- fviz_cluster(cl_pam, data = university, ellipse.type = "convex", main="Clustering using PAM")+
  scale_color_manual(values = paletteInferno[3:5]) + scale_fill_manual(values = paletteInferno[3:5]) +
  theme_classic()
pam_plot

OBSERVATION: As with the previous algorithm (k-means), at first sight the clusters overlap and the border between them isn’t visible. That means that, once again, the clustering resulted in meaningful but not well separated clusters.

Clustering quality measures

SILHOUETTE INDEX

pam_sil<-fviz_silhouette(cl_pam)
##   cluster size ave.sil.width
## 1       1  415          0.44
## 2       2  585          0.43
pam_sil

OBSERVATION: The clusters are similar in size, as in the k-means algorithm, but the silhouette results aren’t fully satisfactory - values around 0.4 mean that the clustering is meaningful (the clusters differ from each other) but the clusters overlap, which is the same conclusion as for k-means.

Rand Index

d1_pam<-pam(set.per1, 2, metric="euclidean")
d2_pam<-pam(set.per2, 2, metric="euclidean")

clustering1 <- d1_pam$clustering
clustering2 <- d2_pam$clustering

ari_pam <- adjustedRandIndex(clustering1, clustering2)
print(ari_pam)
## [1] 0.6141993

OBSERVATION: The PAM clustering output has a different structure than the k-means output, so it is harder to obtain all the same comparison statistics. The adjusted Rand index equals 0.61, slightly worse than for k-means. Nevertheless, as with the previous algorithm, this result indicates high agreement between the cluster assignments obtained on the two variable subsets.

Calinski-Harabasz & Duda-Hart

CH_pam<-round(calinhara(university, cl_pam$clustering),digits=2)
CH_pam
## [1] 1150.29

OBSERVATION: A value of 1150.29 is high, which means that the clustering is quite well separated. As mentioned before, this index is most useful for comparing clustering outcomes, so we will mainly use it when comparing the PAM and k-means methods.

DH_pam<-dudahart2(university, cl_pam$clustering) 
DH_pam
## $p.value
## [1] 0
## 
## $dh
## [1] 0.4645555
## 
## $compare
## [1] 0.8414263
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232

OBSERVATION: For the Duda-Hart test we focus on the cluster1 value. In my case cluster1 = FALSE, so the H0 of homogeneity is rejected in favour of H1. That indicates a statistically significant cluster structure.

0.3.4.1 Kmeans and PAM comparison

The goal of this paper is to find the most accurate clustering method for this dataset. In this part I compare the k-means and PAM algorithms. To make the comparison easier, the results of the quality tests are presented below:

results_classic
##                RI                                           
## results_names  "ARI"               "CH index" "DH cluster 1"
## kmeans_results "0.823031031031031" "1150.77"  "0"           
## pam_results    "0.614199289814419" "1150.29"  "0"

0.3.5 Conclusions - Kmeans and PAM

K-means has a higher agreement index and a slightly higher CH value than PAM, suggesting better cluster separation and higher agreement between the partitions built on the two variable subsets.

PAM shows a comparable CH value, but its agreement index is lower, indicating that its partitions on the two variable subsets align less well.

Both clustering methods have significant Duda-Hart results, suggesting that the clustering structures are meaningful. Both methods result in a reasonable clustering structure, but k-means is slightly better.

0.4 Hierarchical clustering

Now we will move to hierarchical clustering, using two approaches: agglomerative and divisive. In this section I will first focus on the clustering itself, and then I will compare the quality measures.

0.4.1 Agglomerative hierarchical clustering

First we will compare different linkage methods - “average”, “single”, “complete” and “ward” - and choose the one with the highest agglomerative coefficient. The coefficient measures how well the hierarchical structure captures the data; in my case the highest value (0.993), obtained for the Ward method, indicates that Ward linkage gives the strongest clustering structure and is the best fit.

clustering_met<- c( "average", "single", "complete", "ward")
names(clustering_met) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(university, method = x)$ac
}

map_dbl(clustering_met, ac)
##   average    single  complete      ward 
## 0.8816280 0.6330420 0.9365883 0.9931705

Ward hierarchical clustering

hc_agg <- agnes(university, method = "ward")
pltree(hc_agg, cex = 0.6, hang = -1, main = "Agglomerative dendrogram - agnes")

We will cut the tree into 2 clusters - the number suggested by the silhouette index and the elbow method.

pltree(hc_agg, cex = 0.6, hang = -1, main = "Agglomerative dendrogram - agnes")

agg_clusters<-as.hclust(hc_agg)
hc_agg_clusters <- cutree(agg_clusters, k = 2)
table(hc_agg_clusters)
## hc_agg_clusters
##   1   2 
## 587 413
rect.hclust(agg_clusters, k = 2, border = paletteInferno[3])

Visualization of the clustering result:

fviz_cluster(list(data = university, cluster = hc_agg_clusters), main="Hierarchical clustering using Agnes")+
  scale_color_manual(values = paletteInferno[2:4]) + scale_fill_manual(values = paletteInferno[2:4]) +
  theme_classic()

OBSERVATION: In the hierarchical clustering the clusters overlap each other even more than with the k-means or PAM methods.

Ward hierarchical clustering - evaluation

hc_agg.stat<-cluster.stats(university, hc_agg_clusters)
## Warning in as.dist.default(d): non-square matrix
## Warning in df[lo] <- x: number of items to replace is not a multiple of
## replacement length
## Warning in df[up] <- df[lo] <- x: number of items to replace is not a multiple
## of replacement length
hc_agg.stat
## $n
## [1] 1000
## 
## $cluster.number
## [1] 2
## 
## $cluster.size
## [1] 587 413
## 
## $min.cluster.size
## [1] 413
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.754042 1.754042
## 
## $average.distance
## [1] -0.02310134  0.02427086
## 
## $median.distance
## [1] 0.0220799 0.1126856
## 
## $separation
## [1] -2.16581 -2.16581
## 
## $average.toother
## [1] 0.01929488 0.01929488
## 
## $separation.matrix
##          [,1]     [,2]
## [1,]  0.00000 -2.16581
## [2,] -2.16581  0.00000
## 
## $ave.between.matrix
##            [,1]       [,2]
## [1,] 0.00000000 0.01929488
## [2,] 0.01929488 0.00000000
## 
## $average.between
## [1] 0.01929488
## 
## $average.within
## [1] -0.003536621
## 
## $n.between
## [1] 242431
## 
## $n.within
## [1] 257069
## 
## $max.diameter
## [1] 1.754042
## 
## $min.separation
## [1] -2.16581
## 
## $within.cluster.ss
## [1] 498.2024
## 
## $clus.avg.silwidths
##         1         2 
##  4.578812 -3.134112 
## 
## $avg.silwidth
## [1] 1.393375
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.01340713
## 
## $dunn
## [1] -1.234754
## 
## $dunn2
## [1] 0.7949813
## 
## $entropy
## [1] 0.6779319
## 
## $wb.ratio
## [1] -0.1832933
## 
## $ch
## [1] -986.1227
## 
## $cwidegap
## [1] -1.722603 -1.200402
## 
## $widestgap
## [1] -1.200402
## 
## $sindex
## [1] -2.165596
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

OBSERVATIONS: The obtained results suggest that the clustering quality is poor - the CH index is low and the silhouette widths are very unstable across clusters. Note also the warnings above: cluster.stats() was given the raw data matrix instead of a distance matrix, which distorts several of these statistics (for example the silhouette values outside [-1, 1] and the negative Dunn and CH indices). We will investigate the quality of the hierarchical algorithms in more detail later, when comparing the two methods.
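
The "non-square matrix" warnings above suggest that cluster.stats() received the raw data matrix, while it expects a dissimilarity object as its first argument; that would also explain the silhouette values outside [-1, 1] and the negative Dunn and CH indices. A hedged sketch of the presumably intended call:

library(fpc)
d_university <- dist(university)                        # Euclidean dissimilarities
hc_agg.stat  <- cluster.stats(d_university, hc_agg_clusters)
hc_agg.stat$avg.silwidth                                # now bounded in [-1, 1]
hc_agg.stat$ch                                          # Calinski-Harabasz index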

0.4.2 Divisive hierarchical clustering

Now we will move to divisive hierarchical clustering. First we will calculate the divisive coefficient (DC).

hc_div<- diana(university)
hc_div$dc
## [1] 0.9272264

OBSERVATION: The divisive coefficient (DC) describes how strongly the hierarchical structure divides the dataset. Like the agglomerative coefficient, values closer to 1 indicate a stronger clustering structure, while lower values suggest a weaker one.

The obtained value (0.93) therefore suggests that the divisive algorithm captures a fairly strong hierarchical structure in the data.

Dendrogram and cluster sizes - both clusters are similar in size, as we observed with k-means and PAM.

pltree(hc_div, cex = 0.6, hang = -1, main = "dendrogram - diana")

div_clusters<-as.hclust(hc_div)
hc_div_clusters <- cutree(div_clusters,k=2)
table(hc_div_clusters)
## hc_div_clusters
##   1   2 
## 421 579

Dividing the dendrogram into 2 clusters

pltree(hc_div, cex = 0.6, hang = -1, main = "dendrogram - diana")
rect.hclust(div_clusters, k = 2, border = paletteInferno[2])

Clustering visualisation

fviz_cluster(list(data = university, cluster = hc_div_clusters), main="Hierarchical clustering using Diana")+
  scale_color_manual(values = paletteInferno[c(2,5)]) + scale_fill_manual(values = paletteInferno[c(2,5)]) +
  theme_classic()

OBSERVATION: Compared to the agglomerative method, the divisive clustering managed the separation slightly better - the border is more visible and the clusters overlap less. Now we will move to the evaluation of the hierarchical clustering algorithms, focusing only on a few chosen quality measures.

Clustering quality

hc_div.stat<-cluster.stats(university, hc_div_clusters)
## Warning in as.dist.default(d): non-square matrix
## Warning in df[lo] <- x: number of items to replace is not a multiple of
## replacement length
## Warning in df[up] <- df[lo] <- x: number of items to replace is not a multiple
## of replacement length
hc_div.stat
## $n
## [1] 1000
## 
## $cluster.number
## [1] 2
## 
## $cluster.size
## [1] 421 579
## 
## $min.cluster.size
## [1] 421
## 
## $noisen
## [1] 0
## 
## $diameter
## [1] 1.754042 1.754042
## 
## $average.distance
## [1] -0.03879061  0.00772279
## 
## $median.distance
## [1] -0.002442874  0.088589119
## 
## $separation
## [1] -2.16581 -2.16581
## 
## $average.toother
## [1] 0.02012884 0.02012884
## 
## $separation.matrix
##          [,1]     [,2]
## [1,]  0.00000 -2.16581
## [2,] -2.16581  0.00000
## 
## $ave.between.matrix
##            [,1]       [,2]
## [1,] 0.00000000 0.02012884
## [2,] 0.02012884 0.00000000
## 
## $average.between
## [1] 0.02012884
## 
## $average.within
## [1] -0.01185935
## 
## $n.between
## [1] 243759
## 
## $n.within
## [1] 255741
## 
## $max.diameter
## [1] 1.754042
## 
## $min.separation
## [1] -2.16581
## 
## $within.cluster.ss
## [1] 507.4393
## 
## $clus.avg.silwidths
##         1         2 
## -2.765817  6.157679 
## 
## $avg.silwidth
## [1] 2.400887
## 
## $g2
## NULL
## 
## $g3
## NULL
## 
## $pearsongamma
## [1] 0.01429613
## 
## $dunn
## [1] -1.234754
## 
## $dunn2
## [1] 2.606421
## 
## $entropy
## [1] 0.6806127
## 
## $wb.ratio
## [1] -0.5891721
## 
## $ch
## [1] -986.3389
## 
## $cwidegap
## [1] -1.712214 -1.418798
## 
## $widestgap
## [1] -1.418798
## 
## $sindex
## [1] -2.164953
## 
## $corrected.rand
## NULL
## 
## $vi
## NULL

0.4.2.1 Hierarchical clustering - results comparison

0.4.3 Conclusions

In the conclusion I mostly focused on the chosen evaluation metrics.

results_hierarchical
##          Method MIN_SEPARATION AVG_BETWEEN_CLUST_DIST AVG_SILWIDTHS
## 1 Agglomerative       -2.16581             0.01929488      1.393375
## 2      Divisive       -2.16581             0.02012884      2.400887
##   SS_WITHIN_CLUSTERS DUNN_INDEX CALINSKI_HARABSZ
## 1           498.2024  -1.234754        -986.1227
## 2           507.4393  -1.234754        -986.3389

The results suggest that both methods are not the most suitable ones for this data. Both algorithms have issues with cluster separation (we can see that by analyzing the silhouette values). Keep in mind, however, that these statistics come from the cluster.stats() call discussed above, so the absolute values should be interpreted with caution.

Silhouette width - the divisive approach has a higher mean silhouette width, which implies that divisive clustering may form more cohesive clusters.

Within-cluster sum of squares - the value for the agglomerative approach is lower, which suggests that the variance within its clusters is lower and the clusters may be more homogeneous.

Dunn index - measures cluster separation; both methods have negative values, which indicates poor between-cluster separation.

Calinski-Harabasz index - both methods have negative CH values, which means the clusters are not well formed.

0.5 Dimensions reduction using PCA

In this part we will focus on dimension reduction, in order to compare how the clustering differs when we conduct it with and without PCA. I decided to perform the dimension reduction with PCA because some of the analyzed variables are highly correlated with each other. That means that we can reduce the dimensionality of the dataset and still explain the majority of its variance.

PCA also has several other advantages: we can easily observe the correlations and visualize the attributes of the variables, extract and analyze eigenvalues and eigenvectors, and investigate the contribution of each variable to the dimensions. First we need to analyze the variables in our dataset once again.

0.5.1 Correlation analysis

university_nominal_matrix <- data.matrix(university_nominal_clean[2:7])
correlation_matrix_nominal<-cor(university_nominal_matrix)
print(correlation_matrix_nominal, digits=2)
##                   alumni_employment publications influence citations
## alumni_employment              1.00         0.48      0.42      0.45
## publications                   0.48         1.00      0.85      0.79
## influence                      0.42         0.85      1.00      0.81
## citations                      0.45         0.79      0.81      1.00
## broad_impact                   0.43         0.92      0.91      0.85
## patents                        0.40         0.62      0.55      0.52
##                   broad_impact patents
## alumni_employment         0.43    0.40
## publications              0.92    0.62
## influence                 0.91    0.55
## citations                 0.85    0.52
## broad_impact              1.00    0.58
## patents                   0.58    1.00
corrplot(correlation_matrix_nominal, method="circle", type="lower", title = "Variables correlation matrix", cl.cex=0.8, tl.cex=0.7, tl.col="black", col=inferno(10))

OBSERVATION: We observe some strongly correlated variables: influence & publications, citations & publications, citations & influence. That means that conducting PCA may have a positive impact on the clustering quality, and dimension reduction is a rational step.

Once again we make sure that the data is standardized, checking the summary statistics (when the data is standardized, each variable has mean 0).

preproc_uni <- preProcess(university_nominal_clean[2:7], method=c("center", "scale"))
university.s <- predict(preproc_uni, university_nominal_clean[2:7])

summary(university.s)
##  alumni_employment  publications          influence          citations       
##  Min.   :-2.1712   Min.   :-1.7295293   Min.   :-1.73079   Min.   :-1.78640  
##  1st Qu.:-0.8340   1st Qu.:-0.8646244   1st Qu.:-0.86501   1st Qu.:-0.86213  
##  Median : 0.5031   Median : 0.0002805   Median : 0.00078   Median :-0.09256  
##  Mean   : 0.0000   Mean   : 0.0000000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.8591   3rd Qu.: 0.8643197   3rd Qu.: 0.86657   3rd Qu.: 0.76824  
##  Max.   : 0.8591   Max.   : 1.7300903   Max.   : 1.70115   Max.   : 1.43070  
##   broad_impact          patents        
##  Min.   :-1.727306   Min.   :-1.77842  
##  1st Qu.:-0.859583   1st Qu.:-0.87321  
##  Median :-0.005799   Median : 0.03199  
##  Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.851470   3rd Qu.: 0.93266  
##  Max.   : 1.754042   Max.   : 1.37485

0.5.2 Covariance analysis - eigenvalues

university.cov<-cov(university.s)
university.eigen<-eigen(university.cov)
university.eigen$values
## [1] 4.28023726 0.73919451 0.55877791 0.22001324 0.14697211 0.05480496
head(university.eigen$vectors)
##            [,1]       [,2]       [,3]        [,4]        [,5]        [,6]
## [1,] -0.2834652  0.8971355 -0.3278963  0.04756746  0.05508051 -0.04452889
## [2,] -0.4531682 -0.1143020 -0.0244138  0.43146852 -0.63749181  0.43406992
## [3,] -0.4450427 -0.2336659 -0.1388872  0.23841346  0.74978038  0.33020600
## [4,] -0.4312656 -0.1568901 -0.2189824 -0.84005368 -0.13612053  0.13123708
## [5,] -0.4605423 -0.2321552 -0.1078096  0.18880058 -0.06177110 -0.82639005
## [6,] -0.3437060  0.2212975  0.9016774 -0.11573866  0.07781547 -0.02051324

OBSERVATION: The eigenvalues show the amount of variance explained by the corresponding principal components, which define the new axes of the data. Each column of the eigenvector matrix corresponds to one principal component (the vectors are normalized to length 1), and the rows contain the loadings of the original variables on that component. The first component explains the most variance.
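
The share of variance carried by each component follows directly from these eigenvalues:

# proportion and cumulative proportion of variance per eigenvalue
round(university.eigen$values / sum(university.eigen$values), 3)
round(cumsum(university.eigen$values) / sum(university.eigen$values), 3)   # first three ~ 0.93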

0.5.3 Principal components analysis

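The object xxx.pca1 is not created anywhere in this excerpt; judging by the summary format below, it is presumably a prcomp result computed on the standardized data (an assumption on my part):

xxx <- university.s        # standardized variables; the name xxx is reused by principal() later
xxx.pca1 <- prcomp(xxx)    # data already centered and scaled, so the defaults are fine
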
summary(xxx.pca1)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6
## Standard deviation     2.0689 0.8598 0.74751 0.46906 0.3834 0.23410
## Proportion of Variance 0.7134 0.1232 0.09313 0.03667 0.0245 0.00913
## Cumulative Proportion  0.7134 0.8366 0.92970 0.96637 0.9909 1.00000

OBSERVATION: We can observe that more than 90% of the variance is explained by the first three principal components, so it may be a good idea to restrict the analysis to the first three components, losing only about 7% of the explained variance.

Visualization:

The first plot shows the amount of variance explained by each principal component. The values are not expressed as percentages, so it is slightly harder to read them as fractions of the explained variance.

variance <- plot(xxx.pca1, col=paletteInferno[2:7], border="white", main="Variance explained by the components", xlab="PCA")

To better investigate the patterns in the data we can look at the charts below. The left chart shows the eigenvalue of each dimension, which carries the same information as the variance. The right chart shows the percentage of the total variance explained by each dimension (component). Since the first three components explain more than 90% of the variance, we can narrow our investigation to those 3 components.

eigenvalue <- fviz_eig(xxx.pca1, choice='eigenvalue', barcolor = paletteInferno[4], barfill = paletteInferno[4], main="Eigenvalues scree plot")+
  theme_classic()

var_perc<-fviz_eig(xxx.pca1, barcolor = paletteInferno[6], barfill = paletteInferno[6], main = "Percentage of explained variance scree plot")+theme_classic() 

grid.arrange(eigenvalue, var_perc, ncol=2)

0.5.3.1 Variables contribution to each components

Before we move to clustering on the reduced number of dimensions, we will investigate the influence of the variables across the principal components and take a closer look at the variables.

fviz_pca_var(xxx.pca1, col.var=paletteInferno[3])+theme_classic()

OBSERVATIONS:

The length of an arrow indicates how strongly the variable contributes to the principal components, while the angle between two arrows indicates how correlated the variables are: an angle close to 0 degrees suggests strong correlation, while 90 degrees suggests little to no correlation. Knowing that, we can investigate the relationships between the variables.

alumni_employment and patents have the longest arrows, meaning they contribute strongly to at least one of the principal components

publications, citations and broad_impact are grouped together and contribute to the principal components to a similar extent.

Direction and angle analysis:

publications, citations, broad_impact and influence point in the same direction with small angles between them, so they are strongly correlated.

alumni_employment points in a different direction from the other variables (its angle with them is closer to 90 degrees than to 0), which means that this variable contributes to the components in a unique way (it is only weakly correlated with the rest).

We can also investigate the contribution of the variables to each dimension:

var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90, fill = paletteInferno[2], color = paletteInferno[2], cex.lab=0.4)+
  theme_minimal()
a

b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90, fill = paletteInferno[4], color = paletteInferno[4])+theme_minimal()
b

c<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90, fill = paletteInferno[6], color = paletteInferno[6])+theme_minimal()
c

We can observe that for the first dimension the most important variables are broad_impact, publications, influence and citations - all connected with the research output of the university’s employees and students.

Comparing the other contributions, we can observe that for the remaining dimensions only one variable dominates. For the second dimension it is alumni_employment, connected with the employment rate among graduates. For Dim-3 the most significant variable is the patents rank.

0.5.3.2 Rotation

In the last step of the variable-contribution analysis I will investigate the meaning of each component. In order to do that we rotate the PCA solution to obtain a simpler loading structure that is easier to interpret. We restrict the analysis to 3 factors, as stated in the homework description.

xxx.pca4<-principal(xxx, nfactors=3, rotate="varimax")
xxx.pca4
## Principal Components Analysis
## Call: principal(r = xxx, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                    RC1  RC3  RC2   h2     u2 com
## alumni_employment 0.24 0.16 0.96 1.00 0.0011 1.2
## publications      0.85 0.34 0.23 0.89 0.1110 1.5
## influence         0.90 0.23 0.16 0.90 0.1011 1.2
## citations         0.87 0.18 0.23 0.84 0.1589 1.2
## broad_impact      0.92 0.27 0.17 0.95 0.0458 1.2
## patents           0.33 0.92 0.18 1.00 0.0039 1.3
## 
##                        RC1  RC3  RC2
## SS loadings           3.32 1.15 1.11
## Proportion Var        0.55 0.19 0.18
## Cumulative Var        0.55 0.74 0.93
## Proportion Explained  0.59 0.21 0.20
## Cumulative Proportion 0.59 0.80 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.03 
##  with the empirical chi square  23.09  with prob <  NA 
## 
## Fit based upon off diagonal values = 1

KEY OBSERVATIONS:

RC1 - it seems to be a component representing the research impact of the university. The highest positive loadings belong to broad_impact, influence, publications and citations, so this component is directly related to the university’s research activity.

RC2 - the component dominated by alumni_employment (loading 0.96); it represents graduate employment.

RC3 - the component dominated by patents (loading 0.92); it represents the university’s patent activity.

Additionally, I took a look at the other statistics present in the summary after rotation:

complexity (com) - it tells how many components are needed to explain a given variable; higher values mean that more components are involved. In this output the most complex variable is publications (com = 1.5). Overall, high complexity is undesirable; below I use a benchmark of 2.

uniqueness (u2) - the proportion of a variable’s variance that is not shared with the components (u2 = 1 - h2, where h2 is the communality); for example, publications has h2 = 0.89 and u2 = 0.11. We want uniqueness to be low, because then the variables are well represented in a smaller number of dimensions.

We can visualize both characteristics on the graph with the benchmark points. We want uniqueness below 0.7 and complexity below 2:

plot(xxx.pca4$complexity, xxx.pca4$uniqueness, xlim=c(0, 3), ylim=c(0,1), xlab="complexity", ylab="uniqueness", main="Complexity and uniqueness graph",cex=0.8)
text(xxx.pca4$complexity, xxx.pca4$uniqueness, labels=names(xxx.pca4$uniqueness), cex=0.8)
abline(h=c(0.7), lty=3, col=2)
abline(v=c(2), lty=3, col=2)

OBSERVATION: None of the variables exceed the benchmarks, so there are no problematic variables in terms of complexity or uniqueness, and the PCA solution is not distorted by them.

0.5.4 Summary

eig.val<-get_eigenvalue(xxx.pca1)
eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.28023726        71.337288                    71.33729
## Dim.2 0.73919451        12.319909                    83.65720
## Dim.3 0.55877791         9.312965                    92.97016
## Dim.4 0.22001324         3.666887                    96.63705
## Dim.5 0.14697211         2.449535                    99.08658
## Dim.6 0.05480496         0.913416                   100.00000

SUMMARY:

In PC1 the greatest impact comes from the variables describing the university’s research performance - broad_impact, publications and influence. Unlike the other components, all variables load on it to a fairly similar extent, and PC1 explains the major part of the variance.

PC2 is dominated by alumni_employment, with a large loading (0.897). This component separates universities based on the employment of their graduates.

PC3 is mainly about patents, with a high loading of 0.902. It differentiates universities based on their patent-related activity.

This investigation allows us to restrict the analysis to the first three components. By doing so we lose less than 10% of the explained variance, which is an acceptable price for halving the number of dimensions.

0.5.4.1 Dimension reduction

Based on the PCA we can restrict our data to 3 dimensions and still explain more than 90% of the variance in the data.

university_PCA_reduced3<-xxx.pca1$x[,1:3]

summary(university_PCA_reduced3)
##       PC1               PC2                PC3          
##  Min.   :-4.3442   Min.   :-2.55251   Min.   :-2.23709  
##  1st Qu.:-1.7540   1st Qu.:-0.64442   1st Qu.:-0.49156  
##  Median : 0.2766   Median : 0.06422   Median : 0.02032  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.7654   3rd Qu.: 0.62544   3rd Qu.: 0.48428  
##  Max.   : 3.5502   Max.   : 2.13453   Max.   : 2.36828

0.6 Clustering after dimension reduction

In the last part of the paper I conduct the clustering after dimension reduction. Here we deal with three-dimensional data, which is why I also included 3D plots of the clustering results.

This part is structured as follows: I will run the k-means and PAM algorithms as in the first part (followed by the hierarchical methods) and focus on the quality measures, ending with a full comparison. Due to the dimension reduction some of the robustness metrics will be omitted.

The number of clusters stays the same - we want to investigate how PCA influences the quality of the clustering.

0.6.1 Clustering using kmeans

cl_kmeans_PCA3 <- eclust(university_PCA_reduced3, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
clustered_data_kmeans3 <- as.data.frame(university_PCA_reduced3)
clustered_data_kmeans3$Cluster <- as.factor(cl_kmeans_PCA3$cluster)

kmeans_plot_PCA3 <- fviz_cluster(cl_kmeans_PCA3, data = university_PCA_reduced3, ellipse.type = "convex", main="Clustering using kmeans", xlim=c(-5,5)) + 
  scale_color_manual(values = paletteInferno[2:3]) + scale_fill_manual(values = paletteInferno[2:3]) +
  theme_classic()
kmeans_plot_PCA3

OBSERVATION: The points and clusters are more compact now. Since we reduced the data to 3 dimensions, we can also plot the clustering in 3D to see this better - the clustering tendency is now visible, although some points still overlap, as in the previous clustering.

kmeans_3d_plot <- plot_ly(data = clustered_data_kmeans3, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Cluster, colors = paletteInferno[2:3], type = "scatter3d", mode = "markers")

kmeans_3d_plot

0.6.1.1 Kmeans comparison

Silhouette index - before PCA and after

kmeans_sil_PCA3<-fviz_silhouette(cl_kmeans_PCA3)+theme_classic()
##   cluster size ave.sil.width
## 1       1  584          0.50
## 2       2  416          0.44
grid.arrange(kmeans_sil, kmeans_sil_PCA3,ncol=2)

OBSERVATION: Dimension reduction improved the silhouette widths (cluster 1 rose from 0.43 to 0.50, cluster 2 stayed at 0.44). As before, this means that there are still some overlapping observations, but the clusters form meaningful structures. The cluster sizes remain the same.

Calinski-Harabasz & Duda-Hart

Final quality comparison:

results_comparison
##                          CH DH cluster 1
## kmeans_results      1150.77            0
## kmeans_results_PCA3 1355.80            0

OBSERVATION: Dimensionality reduction improved the Calinski-Harabasz index, which means that the clusters are better separated than before the reduction. The Duda-Hart test still gives cluster1 = FALSE, which means the clustering has a statistically significant structure.
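
The post-PCA values in the table above were presumably computed in the same way as before the reduction (an assumption - the assembly of results_comparison is not shown):

CH_kmeans_PCA3 <- round(calinhara(university_PCA_reduced3, cl_kmeans_PCA3$cluster), digits=2)
DH_kmeans_PCA3 <- dudahart2(university_PCA_reduced3, cl_kmeans_PCA3$cluster)
CH_kmeans_PCA3
DH_kmeans_PCA3$cluster1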

0.6.2 Clustering using PAM

cl_pam_PCA3 <- eclust(university_PCA_reduced3, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
clustered_data_pam<-as.data.frame(university_PCA_reduced3)
clustered_data_pam$Cluster <- as.factor(cl_pam_PCA3$cluster)

pam_plot_PCA3 <- fviz_cluster(cl_pam_PCA3, data = university_PCA_reduced3, ellipse.type = "convex", main="Clustering using PAM")+
  scale_color_manual(values = paletteInferno[4:5]) + scale_fill_manual(values = paletteInferno[4:5]) +
  theme_classic()
pam_plot_PCA3

OBSERVATION: In the PAM clustering the clusters are not as compact as in the k-means result. Again, we can plot the clustering in 3D to see this better - the clustering tendency is visible, but some points overlap, as in the previous clustering.

pam_3d_plot <- plot_ly(data = clustered_data_pam, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Cluster, colors = paletteInferno[4:5], type = "scatter3d", mode = "markers")

pam_3d_plot

Compared to the k-means algorithm the overlap is more visible, and the separation looks worse than it was before dimension reduction.

0.6.2.1 PAM comparison

Silhouette index - before PCA and after

pam_sil_PCA3<-fviz_silhouette(cl_pam_PCA3)
##   cluster size ave.sil.width
## 1       1  406          0.45
## 2       2  594          0.49
grid.arrange(pam_sil, pam_sil_PCA3,ncol=2)

OBSERVATION: PCA slightly improved the silhouette values, as in the k-means case, but for PAM the difference is less pronounced, which we also observed in the clustering plot - the clusters were overlapping.

Calinski-Harabasz & Duda-Hart

Final quality comparison:

results_comparison_pam
##                       CH DH cluster 1
## pam_results      1150.29            0
## pam_results_PCA3 1342.67            0

OBSERVATION: Dimension reduction also increased the Calinski-Harabasz index for the PAM algorithm (from 1150.29 to 1342.67), although the gain is slightly smaller than for k-means. The Duda-Hart test still gives cluster1 = FALSE, which means the clustering has a statistically significant structure.

0.6.3 Agglomerative hierarchical clustering after PCA

Now we will move to hierarchical clustering, using the same two approaches as before. As in the sections above, I will conduct the clustering on the reduced dimensions and compare the results at the end.

First we will look at the dendrogram and the clustering output. As in the previous sections we stick with 2 clusters to make the results comparable. The clustering has a lot of overlapping observations and the two clusters blend into each other, which is clearly visible compared with the k-means or even the PAM method.

hc_agg_PCA <- agnes(university_PCA_reduced3, method = "ward")
pltree(hc_agg_PCA, cex = 0.6, hang = -1, main = "Agglomerative dendrogram - agnes")
agg_clusters_PCA<-as.hclust(hc_agg_PCA)
hc_agg_clusters_PCA <- cutree(agg_clusters_PCA, k = 2)
table(hc_agg_clusters_PCA)
## hc_agg_clusters_PCA
##   1   2 
## 379 621
rect.hclust(agg_clusters_PCA, k = 2, border = paletteInferno[3])

Visualization of the clustering result - both the two-dimensional plot and the 3D plot. After rotating and zooming the 3D plot we can also see that the clusters blend into each other, as was observed for the PAM algorithm.

fviz_cluster(list(data = university_PCA_reduced3, cluster = hc_agg_clusters_PCA), main="Hierarchical clustering using Agnes")+
  scale_color_manual(values = paletteInferno[5:6]) + scale_fill_manual(values = paletteInferno[5:6]) +
  theme_classic()

clustered_data_agg_PCA <- as.data.frame(university_PCA_reduced3)
clustered_data_agg_PCA$Cluster<-as.factor(hc_agg_clusters_PCA)

agg_3d_plot <- plot_ly(data = clustered_data_agg_PCA, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Cluster, colors = paletteInferno[5:6], type = "scatter3d", mode = "markers")

agg_3d_plot

OBSERVATION: In the hierarchical agglomerative clustering the clusters significantly overlap each other. At first sight the clustering quality isn’t good, but let’s evaluate it with the proper evaluation metrics:

0.6.3.1 Agglomerative clustering comparison

Now we will formally investigate the quality of the achieved clusters:

results_hierarchical_agg_comparison
##              Method MIN_SEPARATION AVG_BETWEEN_CLUST_DIST AVG_SILWIDTHS
## 1     Agglomerative       -2.16581             0.01929488      1.393375
## 2 Agglomerative PCA       -4.31633             0.03289139      3.993856
##   SS_WITHIN_CLUSTERS DUNN_INDEX CALINSKI_HARABSZ
## 1           498.2024  -1.234754        -986.1227
## 2           961.1514  -1.215783        -992.2356

OBSERVATION: The agglomerative clustering before PCA performed better in terms of the within-cluster sum of squares, which indicates that its clusters were tighter and more internally ordered.

On the other hand, agglomerative clustering after PCA performed better in terms of the average between-cluster distance and the average silhouette width, which means that after PCA the clusters were better separated and better defined.

The Dunn index is slightly higher after PCA, while the Calinski-Harabasz index is slightly lower; both remain negative (see the note about cluster.stats above), so they should be interpreted with caution. Overall, if we are looking for clearer, more distinct clusters, the clustering after PCA looks better; if we care more about how tightly the points are grouped, the clustering without PCA is preferable.

0.6.4 Divisive hierarchical clustering after PCA

As with the agglomerative clustering, first we will plot the dendrogram with the separation into two clusters and the clustering output. Then we move to the overall clustering quality.

hc_div_PCA<- diana(university_PCA_reduced3)

pltree(hc_div_PCA, cex = 0.6, hang = -1, main = "dendrogram - diana after PCA")
div_clusters_PCA<-as.hclust(hc_div_PCA)
hc_div_clusters_PCA <- cutree(div_clusters_PCA,k=2)
table(hc_div_clusters_PCA)
## hc_div_clusters_PCA
##   1   2 
## 411 589
rect.hclust(div_clusters_PCA, k = 2, border = paletteInferno[3])

Visualization of the clustering result - both the two-dimensional plot and the 3D plot

fviz_cluster(list(data = university_PCA_reduced3, cluster = hc_div_clusters_PCA), main="Hierarchical clustering using Diana")+
  scale_color_manual(values = paletteInferno[c(3,6)])+ scale_fill_manual(values = paletteInferno[c(3,6)]) +
  theme_classic()

clustered_data_div_PCA <- as.data.frame(university_PCA_reduced3)
clustered_data_div_PCA$Cluster<-as.factor(hc_div_clusters_PCA)

div_3d_plot <- plot_ly(data = clustered_data_div_PCA, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Cluster, colors = paletteInferno[c(3,6)], type = "scatter3d", mode = "markers")

div_3d_plot

Compared with the agglomerative hierarchical clustering, the divisive clustering now results in tighter and better separated clusters. The separation between the two clusters is easier to see and the clusters don’t overlap as much.

0.6.4.1 Divisive clustering comparison

results_hierarchical_div_comparison
##         Method MIN_SEPARATION AVG_BETWEEN_CLUST_DIST AVG_SILWIDTHS
## 1     Divisive       -2.16581             0.02012884      2.400887
## 2 Divisive PCA       -4.31633             0.03277042      2.832564
##   SS_WITHIN_CLUSTERS DUNN_INDEX CALINSKI_HARABSZ
## 1           507.4393  -1.234754        -986.3389
## 2           972.9624  -1.215783        -992.3056

COMMENT:

Divisive clustering after PCA gives better results in terms of the average between-cluster distance and the average silhouette widths, which indicates better separation and cohesion of the clusters. On the other hand, the within-cluster sum of squares is higher, which indicates looser clusters.

Divisive clustering before PCA resulted in tighter clusters (lower within-cluster sum of squares), which indicates that it was better organised within the clusters.

Overall, divisive clustering after PCA gives better results, but if we care about tighter, better organised clusters, it may be better to use the divisive method without PCA.

0.7 Conclusions

The main goal of this paper was to investigate different methods of clustering data and analyze how PCA impacts the clustering quality.

The first part showed that k-means is slightly better than PAM for this dataset. As for hierarchical clustering, both sets of results - before and after PCA - suggested that hierarchical clustering isn’t a good choice for this dataset.

In the second part, while performing PCA dimension reduction, I revealed interesting connections across the analyzed variables. PCA allows us to analyze the impact of the variables and their contribution to the overall structure of the data. In that part I managed to reduce the number of dimensions from 6 to 3 (by half) and to investigate the patterns and the quality of the dimension reduction.

The last part was devoted to analyzing how the clustering results differ when the clustering is conducted with and without PCA. The results were as follows:

  • For k-means I observed better clustering quality after PCA. The k-means method was already the best choice for my data in the first part.

  • For the PAM method the results after PCA also improved, but less clearly than for k-means.

  • For the hierarchical clustering the comparison results were mixed - overall the results after PCA were better, but the obtained clusters were looser (worse in terms of within-cluster tightness).

PCA contributed positively for the algorithm that was already the best fit for my data. That may indicate that PCA helps to obtain better clustering results mainly when the right clustering method is applied to the data. PCA also gives us a deeper understanding of the data: in my case, despite the statistics and exploratory data analysis conducted at the beginning, PCA gave me more detailed information about the key insights and relationships between the variables in my dataset.

It is also worth mentioning that in the first step I needed to remove some variables due to the large number of outliers and odd distributions in the dataset. It may be helpful to conduct this type of experiment on a different dataset and compare the results to investigate the real impact of PCA.

0.7.0.1 Sources: