Clustering is an unsupervised learning method used to divide data into groups of similar observations. Observations within a cluster are homogeneous, whereas observations from different clusters are heterogeneous (i.e. they differ when data from two different clusters are compared).
The second part of the project focuses on clustering. The aim of this publication is to understand to what extent clustering on raw data and clustering on a less complex dataset (the result of dimension reduction) are alike. To determine the similarities, clustering was performed on both datasets. To identify the best clustering technique, the average silhouette widths of various clustering methods were compared on the raw data before comparing the results with the data after PCA. The first part of the project, dimension reduction, can be found at https://rpubs.com/meggie/863152.
The dataset used was FIFA’22 computer game players (https://www.futbin.com/22/players?page=1&version=gold_rare&sort=version&order=desc, accessed 22-23.01.2022). The dataset was reduced to only Gold Rare players as these are the most valuable players. It is composed of 19 variables and 107 observations:
- Index -> index of each observation, used to better understand the differences in clustering
- Player Name -> character string with the name of each player
- Position -> position in which the player plays; can be used for filtering, as there are different requirements for goalkeepers, defenders and strikers
- Version -> the same for all players in the dataset (“rare”)
- Rating -> rating in the game
- Player’s price -> player’s price in the game
- Skills -> rating out of 5 stars
- Weak foot -> rating out of 5 stars
- Pace -> score out of 100
- Passing -> score out of 100
- Shooting -> score out of 100
- Dribbling -> score out of 100
- Defense -> score out of 100
- Physically -> score out of 100
- Popularity -> number of clicks (profile views of each player) on the Futbin website
- Base statistics -> statistics of the player
- In game statistics -> statistics of the player
- Games played -> number of games played
- Average goals per game -> average goals scored in one game
library(corrplot)
## corrplot 0.92 loaded
library(stats)
library(factoextra)
## Loading required package: ggplot2
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
## Loading required package: gtools
library(Rcpp)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(NbClust)
library(dendextend)
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
library(gridExtra)
library(ggplot2)
library(grid)
For the purpose of clustering, the data was limited to numerical attributes. This restriction allows for better clustering, as some statistical measures can be calculated only for numerical data.
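# Read the data (semicolon-separated, decimal commas)
db <- read.csv("FIFA_RARE.csv", sep = ";", dec = ",", header = TRUE)
# Keep the Index column and the numerical attributes (columns 5 to 19)
db1 <- db[, c(1, 5:19)]
# Standardize every column to mean 0 and standard deviation 1
db1 <- as.data.frame(lapply(db1, scale))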
summary(db1)
## Index Rating PS_price Skills
## Min. :-1.708 Min. :-1.9864 Min. :-0.3265 Min. :-2.0130
## 1st Qu.:-0.854 1st Qu.:-0.6326 1st Qu.:-0.3241 1st Qu.:-0.6204
## Median : 0.000 Median :-0.0911 Median :-0.3112 Median :-0.1562
## Mean : 0.000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.854 3rd Qu.: 0.5858 3rd Qu.:-0.1982 3rd Qu.: 0.7722
## Max. : 1.708 Max. : 2.3457 Max. : 7.4759 Max. : 1.7007
## Weak_Foot Pace Passing Shooting
## Min. :-2.1035 Min. :-2.6719 Min. :-2.6866 Min. :-3.1951
## 1st Qu.:-0.7558 1st Qu.:-0.5568 1st Qu.:-0.5289 1st Qu.:-0.6201
## Median : 0.5920 Median : 0.1138 Median : 0.2660 Median : 0.1061
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5920 3rd Qu.: 0.7328 3rd Qu.: 0.6446 3rd Qu.: 0.5023
## Max. : 1.9398 Max. : 1.8677 Max. : 1.5531 Max. : 2.0869
## Dribbling Defense Physically Popularity
## Min. :-3.1494 Min. :-1.7149 Min. :-2.6802 Min. :-0.8749
## 1st Qu.:-0.4716 1st Qu.:-0.8778 1st Qu.:-0.7942 1st Qu.:-0.5589
## Median : 0.0922 Median : 0.1164 Median : 0.1219 Median :-0.3956
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7264 3rd Qu.: 0.9536 3rd Qu.: 0.7685 3rd Qu.: 0.1277
## Max. : 1.9244 Max. : 1.5291 Max. : 1.8462 Max. : 5.4479
## Base_Stats In_game_stats Games_played Avg_goals
## Min. :-2.68112 Min. :-3.50388 Min. :-0.603840 Min. :-0.8710
## 1st Qu.:-0.76510 1st Qu.:-0.01005 1st Qu.:-0.577556 1st Qu.:-0.8041
## Median :-0.02483 Median : 0.29693 Median :-0.452398 Median :-0.3358
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000 Mean : 0.0000
## 3rd Qu.: 0.69368 3rd Qu.: 0.57176 3rd Qu.:-0.000867 3rd Qu.: 0.2998
## Max. : 2.41374 Max. : 0.94015 Max. : 3.764747 Max. : 2.8421
cat("Number of observations in the dataset:", nrow(db1))
## Number of observations in the dataset: 107
cat("Number of variables in the analysis:", ncol(db1))
## Number of variables in the analysis: 16
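The standardization step above can be illustrated on a small example. This is a minimal sketch on made-up values (the `toy` data frame is hypothetical, not taken from FIFA_RARE.csv) showing that `scale` performs a z-score transformation, so every column ends up with mean 0 and standard deviation 1:

```r
# Toy data frame standing in for the numeric player attributes (values are made up).
toy <- data.frame(
  Rating = c(84, 86, 89, 91, 85),
  Pace   = c(70, 88, 93, 76, 81)
)

# Manual z-score: subtract the column mean, divide by the sample standard deviation.
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# The same transformation via scale(); as.numeric() drops the matrix attributes.
z_scale <- as.data.frame(lapply(toy, function(x) as.numeric(scale(x))))

print(all.equal(z_manual, z_scale))
print(round(colMeans(z_scale), 10))  # column means are zero up to rounding
print(sapply(z_scale, sd))           # column standard deviations are one
```

Standardizing matters here because distance-based methods would otherwise be dominated by attributes with large ranges, such as price or popularity.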
The chart below suggests potential relationships between several pairs of variables. Some attributes are positively correlated (in_game_stats and skills, games_played and rating, games_played and popularity, dribbling and passing), whereas others are negatively correlated (defense and passing, defense and avg_goals, skills and physically). Please see the dimension reduction part of the project for an in-depth evaluation.
cor_mat <- cor(db1, method = "pearson")  # named cor_mat to avoid shadowing cor()
corrplot(cor_mat)
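Besides reading the plot, the strongest relationships can be listed directly from the correlation matrix. The following is a sketch on synthetic data (the columns `a`, `b` and `c` are hypothetical stand-ins for the player attributes), ranking each pair of variables by absolute Pearson correlation:

```r
set.seed(1)
# Synthetic stand-in for db1: three columns with built-in relationships.
n <- 100
x <- rnorm(n)
toy <- data.frame(
  a = x,
  b = x + rnorm(n, sd = 0.3),   # strongly positively related to a
  c = -x + rnorm(n, sd = 0.3)   # strongly negatively related to a
)

cor_mat <- cor(toy, method = "pearson")

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

Applied to the real data, the same approach would give a ranked list to check against the positively and negatively correlated pairs named in the text.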
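To assess the clustering tendency of the dataset, the Hopkins statistic was used. The two calls below appear to report the statistic on opposite scales: for hopkins from clustertend, values close to 0 indicate clusterable data, whereas for get_clust_tendency values close to 1 do (in both cases 0.5 corresponds to uniformly distributed data). The get_clust_tendency result of 0.7395 is well above 0.5, which suggests that the dataset has a clustering tendency; after removing the Index column the statistic rises above 0.81.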
hopkins(db1, n = nrow(db1) - 1, byrow = FALSE, header = FALSE)
## $H
## [1] 0.2227116
get_clust_tendency(db1, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.7395165
##
## $plot
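To make the Hopkins statistic more concrete, here is a hand-rolled sketch of one common definition. The helper `hopkins_sketch` is hypothetical and is not the code used by clustertend or factoextra. It compares nearest-neighbour distances of sampled real observations (`w`) with those of points drawn uniformly from the bounding box of the data (`u`). Under this definition, values near 0.5 suggest uniformly distributed data and values near 1 suggest a clustering tendency.

```r
# Hand-rolled Hopkins statistic (illustrative helper, not a library implementation).
hopkins_sketch <- function(data, m, seed = 123) {
  set.seed(seed)
  X <- as.matrix(data)
  n <- nrow(X)
  # Euclidean distance from a point to its nearest neighbour among the rows of pool.
  nearest <- function(point, pool) {
    min(sqrt(rowSums(sweep(pool, 2, point)^2)))
  }
  # w: nearest-neighbour distances for m sampled real observations (excluding themselves).
  sampled <- sample(n, m)
  w <- sapply(sampled, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))
  # u: nearest-neighbour distances for m points drawn uniformly from the bounding box.
  uniform <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(uniform, 1, function(p) nearest(p, X))
  sum(u) / (sum(u) + sum(w))
}

set.seed(42)
# Two tight, well-separated blobs versus uniformly scattered points (synthetic data).
clustered <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.2), ncol = 2)
)
uniform_data <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform <- hopkins_sketch(uniform_data, m = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

Because the statistic depends on random sampling, results vary with the seed and with the number of sampled points, which is one reason two implementations can return different values for the same data.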
The functions below display the optimal number of clusters for various clustering methods, based on the average silhouette width:
# Plot names avoid single letters such as c, which would shadow base::c()
p_kmeans <- fviz_nbclust(db1, kmeans, method = "silhouette") + ggtitle("kmeans")
p_pam <- fviz_nbclust(db1, pam, method = "silhouette") + ggtitle("pam")
p_clara <- fviz_nbclust(db1, clara, method = "silhouette") + ggtitle("clara")
p_hc <- fviz_nbclust(db1, hcut, method = "silhouette") + ggtitle("hierarchical")
grid.arrange(p_kmeans, p_pam, p_clara, p_hc, ncol = 2, top = "Optimal number of clusters")
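The silhouette-based selection performed by `fviz_nbclust` can be approximated by hand. The sketch below uses synthetic data with three obvious groups (the `toy` matrix is hypothetical) and assumes the `cluster` package is installed. It computes the average silhouette width of a K-Means partition for each candidate k and picks the k with the highest value:

```r
library(cluster)  # silhouette()

set.seed(123)
# Synthetic stand-in for db1: three well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(toy)

# Average silhouette width of a K-Means partition for each candidate k.
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
print(round(avg_sil, 3))

best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(best_k)
```

Using `nstart = 25` reruns K-Means from several random starts, which makes the chosen k less sensitive to a single unlucky initialization.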
The Calinski-Harabasz index compares between-cluster dispersion to within-cluster dispersion; the higher its value, the better the partition.
km2 <- kmeans(db1, 2)
round(calinhara(db1, km2$cluster),digits=2)
## [1] 29.23
km3 <- kmeans(db1, 3)
round(calinhara(db1, km3$cluster),digits=2)
## [1] 26.79
p1 <- fviz_cluster(km2, geom = "point", db1) + ggtitle("k = 2")
p2 <- fviz_cluster(km3, geom = "point", db1) + ggtitle("k = 3")
grid.arrange(p1, p2, nrow=1)
Since the Calinski-Harabasz index is higher for k = 2 (29.23) than for k = 3 (26.79), two clusters appear appropriate for the K-Means method.
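The Calinski-Harabasz comparison above can be reproduced from the sums of squares that `kmeans` already returns. This is a sketch on synthetic two-group data using the standard definition of the index; the helper `ch_index` is a hypothetical name and not part of fpc:

```r
set.seed(123)
# Synthetic data with two well-separated groups (not the FIFA data).
toy <- rbind(
  matrix(rnorm(80, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(80, mean = 4, sd = 0.5), ncol = 2)
)

# CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
ch_index <- function(km, n) {
  k <- nrow(km$centers)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

km2 <- kmeans(toy, centers = 2, nstart = 25)
km3 <- kmeans(toy, centers = 3, nstart = 25)
ch <- c(k2 = ch_index(km2, nrow(toy)), k3 = ch_index(km3, nrow(toy)))
print(round(ch, 2))
```

On data with two real groups, splitting one of them into a third cluster barely reduces the within-cluster sum of squares while increasing k, so the index drops, which mirrors the pattern reported for the player data.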
km2$size
## [1] 24 83
km2$centers
## Index Rating PS_price Skills Weak_Foot Pace
## 1 -1.2057352 1.3078332 1.0018879 -0.19486939 0.25506721 0.7457205
## 2 0.3486463 -0.3781686 -0.2897025 0.05634778 -0.07375437 -0.2156300
## Passing Shooting Dribbling Defense Physically Popularity Base_Stats
## 1 0.9064251 0.5297975 0.7734180 -0.5376560 0.6248130 1.0987941 1.0547475
## 2 -0.2620988 -0.1531944 -0.2236389 0.1554668 -0.1806688 -0.3177236 -0.3049872
## In_game_stats Games_played Avg_goals
## 1 -0.3920861 1.3943311 0.5576664
## 2 0.1133743 -0.4031801 -0.1612529
fviz_cluster(list(data=db1, cluster=km2$cluster),
ellipse.type="norm", geom="point", stand=FALSE)
sil<-silhouette(km2$cluster, dist(db1))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 24 0.12
## 2 2 83 0.34
pam4 <- pam(db1, 4)
fviz_cluster(list(data=db1, cluster=pam4$cluster),
ellipse.type="norm", geom="point", stand=FALSE)
sil<-silhouette(pam4$cluster, dist(db1))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 15 0.09
## 2 2 10 0.38
## 3 3 43 0.21
## 4 4 39 0.32
The average silhouette width is lower than for K-Means (roughly 0.25 versus 0.29 when the per-cluster values above are weighted by cluster size), which indicates that K-Means clustering is more appropriate for this dataset.
clara4 <- clara(db1, 4)
fviz_cluster(list(data=db1, cluster=clara4$cluster),
ellipse.type="norm", geom="point", stand=FALSE)
sil<-silhouette(clara4$cluster, dist(db1))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 16 0.08
## 2 2 10 0.38
## 3 3 42 0.22
## 4 4 39 0.32
For CLARA the average silhouette width is again lower than for K-Means (roughly 0.25 versus 0.29), which indicates that K-Means clustering is more appropriate for this dataset than CLARA.
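The comparison between the three methods rests on the overall average silhouette width, which is the size-weighted mean of the per-cluster averages. The sketch below recomputes it from the cluster sizes and widths printed in the outputs above; because those widths are rounded to two decimals, the results are approximate:

```r
# Cluster sizes and per-cluster average silhouette widths copied from the printed outputs.
sil_tables <- list(
  kmeans = data.frame(size = c(24, 83),         width = c(0.12, 0.34)),
  pam    = data.frame(size = c(15, 10, 43, 39), width = c(0.09, 0.38, 0.21, 0.32)),
  clara  = data.frame(size = c(16, 10, 42, 39), width = c(0.08, 0.38, 0.22, 0.32))
)

# Overall average silhouette width = size-weighted mean of the per-cluster averages.
overall <- sapply(sil_tables, function(t) weighted.mean(t$width, t$size))
print(round(overall, 2))
```

This gives approximately 0.29 for K-Means and 0.25 for both PAM and CLARA. Note that the partitions differ in the number of clusters (2 versus 4), so the comparison covers the method and the choice of k together.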
hc <- eclust(db1, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(hc, cex=0.6, hang=-1, main = "Dendrogram of HAC (single linkage)")
rect.hclust(hc, k=2, border='red')
hc1 <- eclust(db1, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
plot(hc1, cex=0.6, hang=-1, main = "Dendrogram of HAC (complete linkage)")
rect.hclust(hc1, k=2, border='red')
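The two dendrograms above differ only in the linkage method. As a minimal sketch with base R functions on synthetic data (the `toy` matrix is hypothetical), the cluster assignments from single and complete linkage can be compared with a cross-tabulation after cutting both trees at k = 2:

```r
set.seed(123)
# Synthetic data with two well-separated groups of 30 observations each.
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)
)
d <- dist(toy, method = "euclidean")

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Cut both trees into two clusters; stats::cutree is named explicitly
# because dendextend masks cutree when it is attached.
cl_single   <- stats::cutree(hc_single, k = 2)
cl_complete <- stats::cutree(hc_complete, k = 2)

print(table(single = cl_single, complete = cl_complete))
```

On well-separated data both linkages recover the same groups. On real data, single linkage tends to chain observations together and may produce one large cluster plus a few isolated points, so a cross-tabulation like this is a quick way to see whether the choice of linkage matters.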
db2 <- read.csv("pca_result.csv", sep = ",", dec = ".", header = TRUE)
db2 <- db2[, 2:6]
db2 <- as.data.frame(lapply(db2, scale))
summary(db2)
## PC1 PC2 PC3 PC4
## Min. :-2.88106 Min. :-2.26648 Min. :-3.34222 Min. :-2.21029
## 1st Qu.:-0.55762 1st Qu.:-0.76253 1st Qu.:-0.41568 1st Qu.:-0.67512
## Median :-0.04522 Median :-0.01572 Median : 0.07531 Median :-0.09085
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.62448 3rd Qu.: 0.52987 3rd Qu.: 0.71519 3rd Qu.: 0.58816
## Max. : 2.19363 Max. : 2.90671 Max. : 2.10662 Max. : 2.58365
## PC5
## Min. :-2.63945
## 1st Qu.:-0.67736
## Median : 0.08502
## Mean : 0.00000
## 3rd Qu.: 0.62249
## Max. : 2.91394
cat("Number of observations in the dataset:", nrow(db2))
## Number of observations in the dataset: 107
cat("Number of variables in the analysis:", ncol(db2))
## Number of variables in the analysis: 5
The functions below display the optimal number of clusters for various clustering methods, based on the average silhouette width:
# Plot names avoid single letters such as c, which would shadow base::c()
p_kmeans <- fviz_nbclust(db2, kmeans, method = "silhouette") + ggtitle("kmeans")
p_pam <- fviz_nbclust(db2, pam, method = "silhouette") + ggtitle("pam")
p_clara <- fviz_nbclust(db2, clara, method = "silhouette") + ggtitle("clara")
p_hc <- fviz_nbclust(db2, hcut, method = "silhouette") + ggtitle("hierarchical")
grid.arrange(p_kmeans, p_pam, p_clara, p_hc, ncol = 2, top = "Optimal number of clusters")
The optimal number of clusters for the PCA results is much higher for each of the methods except hierarchical clustering. Hence, the extent to which the two clusterings are similar was not checked, since the degree of difference will most probably be rather large.
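If one did want to quantify how similar two partitions are, one option is the Rand index, i.e. the share of observation pairs on which both partitions agree. The sketch below is not part of the original analysis; the helper `rand_index` and the label vectors are hypothetical and only illustrate the idea:

```r
# Rand index: share of observation pairs on which two partitions agree
# (both put the pair in the same cluster, or both put it in different clusters).
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)  # count each pair of observations once
  mean(same_a[pairs] == same_b[pairs])
}

# Hypothetical labels standing in for a clustering of the raw data and of the PCA scores.
labels_raw <- c(1, 1, 1, 2, 2, 2)
labels_pca <- c(2, 2, 1, 1, 1, 1)  # label names swapped, one observation moved

print(rand_index(labels_raw, labels_raw))
print(rand_index(labels_raw, labels_pca))
```

The index does not depend on how clusters are numbered and can be computed even when the two partitions have a different number of clusters, which is the situation described above.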
Judging by the silhouette coefficient of the presented clustering methods on the raw dataset, K-Means seems to be the most effective way of clustering. However, when comparing the clustering of the raw data with the clustering of the PCA results, the optimal number of clusters remained unchanged only for hierarchical clustering. This could indicate that hierarchical clustering is more stable for this dataset. Furthermore, the fact that the optimal number of clusters increased for all other methods could indicate that the data with reduced dimensions has a greater relative variance due to the reduced common elements.