Clustering

Clustering is an unsupervised learning method used to divide data into groups of similar observations. Each cluster is a set of homogeneous data, while data from different clusters are heterogeneous (i.e. dissimilar when comparing observations from two different clusters).

Project introduction

The second part of the project focuses on clustering. The aim of this publication is to understand to what extent clustering on the raw data and clustering on a less complex dataset (the result of dimension reduction) are alike. To determine the similarities, clustering was performed on both datasets. To identify the best clustering technique, the average silhouette widths of various clustering methods were compared on the raw data before comparing the results with the data after PCA. The first part of the project, dimension reduction, can be found at https://rpubs.com/meggie/863152.


The dataset used contains players from the FIFA’22 computer game (https://www.futbin.com/22/players?page=1&version=gold_rare&sort=version&order=desc, accessed 22-23.01.2022). It was reduced to Gold Rare players only, as these are the most valuable players. It is composed of 19 variables and 107 observations:
- Index -> identifier of each observation, kept to make differences between clusterings easier to trace
- Player Name -> character string with the name of each player
- Position -> position the player plays in; can be used for filtering, as requirements differ for goalkeepers, defenders and strikers
- Version -> “rare”, the same for all players in the dataset
- Rating -> rating in the game
- Player’s price -> player’s price in the game
- Skills -> rating out of 5 stars
- Weak foot -> rating out of 5 stars
- Pace -> score out of 100
- Passing -> score out of 100
- Shooting -> score out of 100
- Dribbling -> score out of 100
- Defense -> score out of 100
- Physically -> score out of 100
- Popularity -> number of clicks (profile views of each player) on the Futbin website
- Base statistics -> statistics of the player
- In game statistics -> statistics of the player
- Games played -> number of games played
- Average goals per game -> average goals scored in one game

Loading necessary libraries

library(corrplot)
## corrplot 0.92 loaded
library(stats)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
## Loading required package: gtools
library(Rcpp)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(NbClust)
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
library(gridExtra)
library(ggplot2)
library(grid)

Data preparation

For the purpose of clustering, the data was limited to numerical attributes. This restriction allows better clustering, because statistical measures such as means and Euclidean distances can be calculated.

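# Read the semicolon-separated file; decimals use a comma
db <- read.csv("FIFA_RARE.csv", sep = ";", dec = ",", header = TRUE)
# Keep the Index and the numerical attributes (columns 1 and 5 to 19)
db1 <- db[, c(1, 5:19)]
# Standardise every column to mean 0 and standard deviation 1
db1 <- as.data.frame(lapply(db1, scale))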
summary(db1)
##      Index            Rating           PS_price           Skills       
##  Min.   :-1.708   Min.   :-1.9864   Min.   :-0.3265   Min.   :-2.0130  
##  1st Qu.:-0.854   1st Qu.:-0.6326   1st Qu.:-0.3241   1st Qu.:-0.6204  
##  Median : 0.000   Median :-0.0911   Median :-0.3112   Median :-0.1562  
##  Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.854   3rd Qu.: 0.5858   3rd Qu.:-0.1982   3rd Qu.: 0.7722  
##  Max.   : 1.708   Max.   : 2.3457   Max.   : 7.4759   Max.   : 1.7007  
##    Weak_Foot            Pace            Passing           Shooting      
##  Min.   :-2.1035   Min.   :-2.6719   Min.   :-2.6866   Min.   :-3.1951  
##  1st Qu.:-0.7558   1st Qu.:-0.5568   1st Qu.:-0.5289   1st Qu.:-0.6201  
##  Median : 0.5920   Median : 0.1138   Median : 0.2660   Median : 0.1061  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5920   3rd Qu.: 0.7328   3rd Qu.: 0.6446   3rd Qu.: 0.5023  
##  Max.   : 1.9398   Max.   : 1.8677   Max.   : 1.5531   Max.   : 2.0869  
##    Dribbling          Defense          Physically        Popularity     
##  Min.   :-3.1494   Min.   :-1.7149   Min.   :-2.6802   Min.   :-0.8749  
##  1st Qu.:-0.4716   1st Qu.:-0.8778   1st Qu.:-0.7942   1st Qu.:-0.5589  
##  Median : 0.0922   Median : 0.1164   Median : 0.1219   Median :-0.3956  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7264   3rd Qu.: 0.9536   3rd Qu.: 0.7685   3rd Qu.: 0.1277  
##  Max.   : 1.9244   Max.   : 1.5291   Max.   : 1.8462   Max.   : 5.4479  
##    Base_Stats       In_game_stats       Games_played         Avg_goals      
##  Min.   :-2.68112   Min.   :-3.50388   Min.   :-0.603840   Min.   :-0.8710  
##  1st Qu.:-0.76510   1st Qu.:-0.01005   1st Qu.:-0.577556   1st Qu.:-0.8041  
##  Median :-0.02483   Median : 0.29693   Median :-0.452398   Median :-0.3358  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 0.69368   3rd Qu.: 0.57176   3rd Qu.:-0.000867   3rd Qu.: 0.2998  
##  Max.   : 2.41374   Max.   : 0.94015   Max.   : 3.764747   Max.   : 2.8421
cat("Number of observations in the dataset:", nrow(db1))
## Number of observations in the dataset: 107
cat("Number of variables in the analysis:", ncol(db1))
## Number of variables in the analysis: 16
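
As a minimal sketch of what the scaling step does, the toy example below standardises each column to z-scores, so that every variable has mean 0 and standard deviation 1 and contributes comparably to distance calculations. The values are made up and are not taken from FIFA_RARE.csv; the column names are borrowed from the dataset purely for illustration.

```r
# Toy data frame with made-up values (assumption: NOT the real FIFA data)
toy <- data.frame(
  Rating = c(84, 86, 89, 91, 85),
  Pace   = c(70, 92, 78, 85, 66)
)

# Same call pattern as in the data preparation step: scale each column
toy_scaled <- as.data.frame(lapply(toy, scale))

# Manual z-score of one column for comparison: (x - mean) / sd
rating_z <- (toy$Rating - mean(toy$Rating)) / sd(toy$Rating)

print(round(colMeans(toy_scaled), 10))  # column means after scaling
print(sapply(toy_scaled, sd))           # column standard deviations after scaling
print(all.equal(as.numeric(toy_scaled$Rating), rating_z))
```

Without this step, variables measured on large scales (such as price or popularity) would dominate the Euclidean distances used by the clustering methods.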

Correlation

The chart below reveals potential relationships between several pairs of variables. Some attributes are positively correlated (in_game_stats and skills, games_played and rating, games_played and popularity, dribbling and passing), whereas others are negatively correlated (defense and passing, defense and avg_goals, skills and physically). Please see the dimension reduction part of the project for an in-depth evaluation.

cor_matrix <- cor(db1, method = "pearson")  # separate name avoids masking cor()
corrplot(cor_matrix)
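
Reading the strongest relationships off a correlation plot can be supplemented by listing them. The sketch below uses simulated toy data (an assumption, not the FIFA dataset) with hypothetical variable names a to d; it extracts the upper triangle of the Pearson correlation matrix and sorts the pairs by absolute correlation.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
toy <- data.frame(
  a = x,
  b = x + rnorm(n, sd = 0.3),    # built to correlate positively with a
  c = -x + rnorm(n, sd = 0.5),   # built to correlate negatively with a
  d = rnorm(n)                   # independent noise
)

cor_matrix <- cor(toy, method = "pearson")

# Keep each pair once by taking only the upper triangle
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Strongest relationships first, regardless of sign
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```

The same pattern could be applied to the scaled data frame from the data preparation step to obtain a ranked list of the pairs named above.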

Assessing clustering tendency

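To assess the clustering tendency of the dataset, the Hopkins statistic was used. It compares nearest-neighbour distances in the real data with those of points drawn uniformly at random over the same range; it is not a percentage of observations that can be clustered. The two functions below appear to report it on complementary scales: for hopkins() from clustertend, values well below 0.5 point to clusterable data, whereas for get_clust_tendency() from factoextra values close to 1 do, and values near 0.5 indicate uniformly distributed data. The result of 0.74 from get_clust_tendency() therefore suggests a noticeable clustering tendency; after removing the Index column the statistic rises above 0.81.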

hopkins(db1, n=nrow(db1)-1, byrow=F, header=F) 
## $H
## [1] 0.2227116
get_clust_tendency(db1, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.7395165
## 
## $plot
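
To make the statistic less of a black box, the following is a simplified sketch of one common formulation, written for illustration only. It is not the implementation used by clustertend or factoextra, which may differ in sampling details and in whether distances are raised to a power. The function name hopkins_sketch, its arguments and the toy data are all hypothetical. In this formulation, values near 1 suggest clustered data and values near 0.5 suggest uniformly scattered data.

```r
# Hypothetical helper: simplified Hopkins-type statistic
hopkins_sketch <- function(X, m, seed = 123) {
  set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # Distance from point p to its nearest neighbour in pool
  nearest <- function(p, pool) min(sqrt(rowSums(sweep(pool, 2, p)^2)))

  # m real observations and their nearest-neighbour distances within the data
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))

  # m artificial points drawn uniformly inside the bounding box of the data
  mins <- apply(X, 2, min)
  maxs <- apply(X, 2, max)
  U <- matrix(runif(m * d, rep(mins, each = m), rep(maxs, each = m)), nrow = m)
  u <- apply(U, 1, nearest, pool = X)

  sum(u) / (sum(u) + sum(w))
}

# Toy data: two tight groups versus uniformly scattered points
set.seed(10)
clustered <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
                   matrix(rnorm(100, 5, 0.3), ncol = 2))
uniform <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform   <- hopkins_sketch(uniform, m = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

In clustered data the real points sit close to their neighbours while random points tend to fall into empty space, which pushes the ratio towards 1.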

Optimal numbers of clusters

The functions below display the optimal number of clusters for various clustering methods, based on the average silhouette width:

a <- fviz_nbclust(db1,kmeans,method = "silhouette") +ggtitle("kmeans")
b <- fviz_nbclust(db1,pam,method = "silhouette")+ggtitle("pam")
c <- fviz_nbclust(db1,clara,method = "silhouette")+ggtitle("clara")
d <- fviz_nbclust(db1,hcut,method = "silhouette")+ggtitle("hierarchical")
grid.arrange(a,b,c,d, ncol=2, top = "Optimal number of clusters")
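
The silhouette criterion can also be sketched by hand: for each candidate k, cluster the data, compute the silhouette width of every observation, and keep the k with the highest average. The example below is a simplified illustration on simulated toy data with three well-separated groups (an assumption, not the FIFA dataset). It uses kmeans from stats and silhouette from the cluster package; the object names are hypothetical.

```r
library(cluster)  # provides silhouette()

# Toy data: three groups of 30 points each (made up for illustration)
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km  <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

Using several random starts (nstart) makes the K-Means result less dependent on the initial centres, which matters when values for neighbouring k are close.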

K-Means: 2 or 3 clusters? Calinski-Harabasz index

A higher Calinski-Harabasz index indicates a better partition, with clusters that are more compact and better separated.

km2 <- kmeans(db1, 2)
round(calinhara(db1, km2$cluster),digits=2)
## [1] 29.23
km3 <- kmeans(db1, 3)
round(calinhara(db1, km3$cluster),digits=2)
## [1] 26.79
p1 <- fviz_cluster(km2, geom = "point", db1) + ggtitle("k = 2")
p2 <- fviz_cluster(km3, geom = "point", db1) + ggtitle("k = 3")
grid.arrange(p1, p2, nrow=1)
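
The Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = (BSS / (k - 1)) / (WSS / (n - k)). As a sketch, it can be computed directly from the sums of squares that kmeans already returns. The helper name ch_index and the simulated two-group data are hypothetical; the result is expected, though not verified here, to correspond to what calinhara reports for the same partition.

```r
# Hypothetical helper: Calinski-Harabasz index from a fitted kmeans object
ch_index <- function(km) {
  n <- length(km$cluster)   # number of observations
  k <- length(km$size)      # number of clusters
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

# Toy data: two groups of 40 points each (made up for illustration)
set.seed(7)
toy <- rbind(matrix(rnorm(80, 0, 1), ncol = 2),
             matrix(rnorm(80, 5, 1), ncol = 2))

km2_toy <- kmeans(toy, centers = 2, nstart = 25)
km3_toy <- kmeans(toy, centers = 3, nstart = 25)

print(round(c(k2 = ch_index(km2_toy), k3 = ch_index(km3_toy)), 2))
```

Dividing by k - 1 and n - k penalises partitions that reduce the within-cluster dispersion only by adding more clusters.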

Based on the above analysis, it is rather clear that 2 clusters are appropriate for the K-Means method: the Calinski-Harabasz index is higher for k = 2 (29.23) than for k = 3 (26.79).

km2$size
## [1] 24 83
km2$centers
##        Index     Rating   PS_price      Skills   Weak_Foot       Pace
## 1 -1.2057352  1.3078332  1.0018879 -0.19486939  0.25506721  0.7457205
## 2  0.3486463 -0.3781686 -0.2897025  0.05634778 -0.07375437 -0.2156300
##      Passing   Shooting  Dribbling    Defense Physically Popularity Base_Stats
## 1  0.9064251  0.5297975  0.7734180 -0.5376560  0.6248130  1.0987941  1.0547475
## 2 -0.2620988 -0.1531944 -0.2236389  0.1554668 -0.1806688 -0.3177236 -0.3049872
##   In_game_stats Games_played  Avg_goals
## 1    -0.3920861    1.3943311  0.5576664
## 2     0.1133743   -0.4031801 -0.1612529
fviz_cluster(list(data=db1, cluster=km2$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE)

sil<-silhouette(km2$cluster, dist(db1))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   24          0.12
## 2       2   83          0.34

PAM

pam4 <- pam(db1, 4)
fviz_cluster(list(data=db1, cluster=pam4$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE)

sil<-silhouette(pam4$cluster, dist(db1))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   15          0.09
## 2       2   10          0.38
## 3       3   43          0.21
## 4       4   39          0.32

The average silhouette width is lower than for K-Means, which indicates that K-Means clustering is more appropriate for this dataset in terms of effectiveness.

CLARA

clara4 <- clara(db1, 4)
fviz_cluster(list(data=db1, cluster=clara4$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE)

sil<-silhouette(clara4$cluster, dist(db1))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   16          0.08
## 2       2   10          0.38
## 3       3   42          0.22
## 4       4   39          0.32

For CLARA the average silhouette width is again lower than for K-Means, which indicates that K-Means clustering is more appropriate for this dataset than CLARA.
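
The comparison between methods rests on the overall average silhouette width, which can be recovered approximately from the per-cluster tables printed above as a size-weighted mean. The sketch below does this arithmetic; because the per-cluster values are rounded to two decimals, the results are approximate.

```r
# Cluster sizes and average silhouette widths copied from the tables above
sil_tables <- list(
  kmeans = data.frame(size = c(24, 83),         ave = c(0.12, 0.34)),
  pam    = data.frame(size = c(15, 10, 43, 39), ave = c(0.09, 0.38, 0.21, 0.32)),
  clara  = data.frame(size = c(16, 10, 42, 39), ave = c(0.08, 0.38, 0.22, 0.32))
)

# Overall average = per-cluster averages weighted by cluster size
overall <- sapply(sil_tables, function(t) weighted.mean(t$ave, t$size))
print(round(overall, 2))
# kmeans about 0.29, pam about 0.25, clara about 0.25
```

All three values are fairly low, so the gap between K-Means and the medoid-based methods is visible but not large.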

Hierarchical clustering

hc <- eclust(db1, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(hc, cex=0.6, hang=-1, main = "Dendrogram of HAC (single linkage)")
rect.hclust(hc, k=2, border='red')

hc1 <- eclust(db1, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(hc1, cex=0.6, hang=-1, main = "Dendrogram of HAC (complete linkage)")
rect.hclust(hc1, k=2, border='red')
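
The two calls above differ only in the linkage rule: single linkage measures the distance between two clusters by their closest pair of points, complete linkage by their farthest pair. A minimal base-R sketch of the same comparison, on simulated toy data (an assumption) and without the eclust wrapper, is shown below. stats::cutree is called with its namespace because dendextend masks cutree when loaded.

```r
# Toy data: two well-separated groups of 20 points each (made up)
set.seed(3)
toy <- rbind(matrix(rnorm(40, 0, 0.4), ncol = 2),
             matrix(rnorm(40, 5, 0.4), ncol = 2))

d <- dist(toy, method = "euclidean")

# Same distances, two different linkage rules
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Cut both trees into two clusters
cl_single   <- stats::cutree(hc_single, k = 2)
cl_complete <- stats::cutree(hc_complete, k = 2)

# Cross-tabulate the two partitions to see where they agree
print(table(single = cl_single, complete = cl_complete))
```

On clearly separated groups the two rules tend to agree; on less separated data single linkage is prone to chaining, where one large cluster absorbs points one at a time.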

Clustering of the PCA results

db2 <- read.csv("pca_result.csv", sep = ",", dec = ".", header = TRUE)
db2 <- db2[, 2:6]
db2 <- as.data.frame(lapply(db2, scale))
summary(db2)
##       PC1                PC2                PC3                PC4          
##  Min.   :-2.88106   Min.   :-2.26648   Min.   :-3.34222   Min.   :-2.21029  
##  1st Qu.:-0.55762   1st Qu.:-0.76253   1st Qu.:-0.41568   1st Qu.:-0.67512  
##  Median :-0.04522   Median :-0.01572   Median : 0.07531   Median :-0.09085  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.62448   3rd Qu.: 0.52987   3rd Qu.: 0.71519   3rd Qu.: 0.58816  
##  Max.   : 2.19363   Max.   : 2.90671   Max.   : 2.10662   Max.   : 2.58365  
##       PC5          
##  Min.   :-2.63945  
##  1st Qu.:-0.67736  
##  Median : 0.08502  
##  Mean   : 0.00000  
##  3rd Qu.: 0.62249  
##  Max.   : 2.91394
cat("Number of observations in the dataset:", nrow(db2))
## Number of observations in the dataset: 107
cat("Number of variables in the analysis:", ncol(db2))
## Number of variables in the analysis: 5

Optimal numbers of clusters for PCA dataset

The functions below display the optimal number of clusters for various clustering methods, based on the average silhouette width:

a <- fviz_nbclust(db2,kmeans,method = "silhouette") +ggtitle("kmeans")
b <- fviz_nbclust(db2,pam,method = "silhouette")+ggtitle("pam")
c <- fviz_nbclust(db2,clara,method = "silhouette")+ggtitle("clara")
d <- fviz_nbclust(db2,hcut,method = "silhouette")+ggtitle("hierarchical")
grid.arrange(a,b,c,d, ncol=2, top = "Optimal number of clusters")

The optimal number of clusters for the PCA results is much higher for each of the methods except hierarchical clustering. Hence, there is little need to check the extent to which the clusterings are similar, since the difference will most probably be rather large.
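
Should a quantitative check of the similarity between two clusterings be wanted after all, one simple option is the Rand index: the share of observation pairs on which two partitions agree (both put the pair together, or both keep it apart). The sketch below is illustrative only; rand_index is a hypothetical helper written for this example, and the label vectors are made up rather than taken from the analysis.

```r
# Hypothetical helper: Rand index between two label vectors
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")   # TRUE where a pair shares a cluster in a
  same_b <- outer(b, b, "==")   # TRUE where a pair shares a cluster in b
  agree  <- same_a == same_b    # both together or both apart
  mean(agree[upper.tri(agree)]) # each pair counted once
}

# Made-up labels: a 2-cluster and a 3-cluster partition of six observations
labels_raw <- c(1, 1, 1, 2, 2, 2)
labels_pca <- c(1, 1, 2, 2, 3, 3)

print(table(raw = labels_raw, pca = labels_pca))
print(rand_index(labels_raw, labels_pca))
```

The plain Rand index is not corrected for chance agreement, so for partitions with different numbers of clusters an adjusted variant would usually be preferred.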

Conclusion

Judging by the silhouette coefficient of the presented clustering methods on the raw dataset, K-Means seems to be the most effective. However, when comparing the clustering of the raw data with the clustering of the PCA results, the optimal number of clusters remained unchanged only for hierarchical clustering. This could indicate that this type of clustering is more stable for this dataset. Furthermore, the fact that the optimal number of clusters increased for all other methods could indicate that the data with reduced dimensions has a greater relative variance, due to the removal of shared (correlated) information.

References

https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
https://www.datanovia.com/en/lessons/assessing-clustering-tendency/
https://uc-r.github.io/hc_clustering
https://towardsdatascience.com/clustering-analysis-in-r-using-k-means-73eca4fb7967
https://rpubs.com/eosowska/clustering