Clustering

Hard clustering

We deal with clustering in almost every aspect of daily life, and it is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. In data mining, clustering deals with very large data sets carrying many different attributes, which imposes unique computational requirements on the algorithms. A variety of algorithms have emerged that meet these requirements and have been successfully applied to real-life problems.
Clustering methods are divided into two basic types: hierarchical and flat clustering, and within each type there is a wealth of subtypes and algorithms for finding the clusters. The goal of flat clustering is to create clusters that are internally coherent and clearly different from each other: data within a cluster should be as similar as possible, while data in one cluster should be as dissimilar as possible from data in other clusters. Hierarchical clustering builds a cluster hierarchy that can be represented as a tree of clusters, in which each cluster can be a child, a parent, and a sibling of other clusters. Even though hierarchical clustering is superior to flat clustering in representing the structure of the data, it has the drawback of being computationally intensive. Among the most popular clustering techniques we may list (a minimal sketch of the two families follows the list):

  • k-means,
  • k-medoids (PAM),
  • Clustering LARge Applications (CLARA),
  • Hierarchical clustering.
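As a minimal sketch of the two families (on simulated data, not the Pokemon set), both approaches take only a few lines of base R:

# toy data: two well-separated groups in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# flat clustering: the number of clusters has to be fixed in advance
km <- kmeans(toy, centers = 2)
km$cluster                           # hard assignment of every observation

# hierarchical clustering: build the whole tree first, then cut it at a chosen level
hc <- hclust(dist(toy), method = "ward.D2")
cutree(hc, k = 2)                    # an assignment of the same kind, read off the tree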

Fuzzy clustering

In contrast to the hard methods above, there are algorithms that cluster data without forcing binary membership; such methods are called fuzzy clustering. They allow objects to belong to several clusters simultaneously, with different degrees of membership. In many situations fuzzy clustering is more natural than hard clustering: objects on the boundaries between several classes are not forced to fully belong to one of them, but are instead assigned membership degrees between 0 and 1 indicating their partial membership.

To visualize the biggest difference between flat and fuzzy clustering, we may refer to the picture below. Flat clustering assigns a one (1) when an observation belongs to a cluster and a zero (0) when it does not, so the left-hand matrix represents such an assignment. Fuzzy clustering, on the other hand, assigns a degree of membership to each of the clusters, as the right-hand matrix shows.
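Since the figure is easier to read next to concrete numbers, below is a small, made-up example of both kinds of membership matrices for five observations and two clusters (the values are illustrative only):

# hard (flat) membership: every row contains a single 1
hard <- matrix(c(1, 0,
                 1, 0,
                 0, 1,
                 0, 1,
                 1, 0), ncol = 2, byrow = TRUE)

# fuzzy membership: every row holds degrees of membership that sum to 1
fuzzy <- matrix(c(0.90, 0.10,
                  0.75, 0.25,
                  0.20, 0.80,
                  0.05, 0.95,
                  0.55, 0.45), ncol = 2, byrow = TRUE)
rowSums(fuzzy)   # each row sums to 1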

The most popular fuzzy clustering algorithm is fuzzy c-means (FCM); in this project I use the closely related fanny algorithm from the cluster package.
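FCM minimizes the following objective, where $u_{ij}$ is the degree of membership of observation $x_i$ in cluster $j$, $c_j$ is the cluster centre, and $m > 1$ is the fuzziness exponent (the memb.exp parameter of fanny plays the same role):

$$J_m = \sum_{i=1}^{N}\sum_{j=1}^{C} u_{ij}^{\,m}\,\lVert x_i - c_j\rVert^2, \qquad \text{subject to } \sum_{j=1}^{C} u_{ij} = 1 \text{ for every } i.$$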

Dataset

The dataset this project operates on contains data on 721 Pokemon (800 rows once alternate forms such as Mega evolutions are included), with their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. It was obtained from Kaggle (https://www.kaggle.com/abcsds/pokemon).

The main idea is to cluster Pokemon according to similarities in their statistics, hence we end up with a dataset in the form presented below.

There are Legendary Pokemon in each generation of games. They have more powerful statistics, so I also wanted to check whether a dataset limited to those observations clusters well.
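The code below assumes the following packages are attached; this list is inferred from the functions used in the report and is not guaranteed to be complete:

library(dplyr)         # select, filter, %>%
library(cluster)       # pam, clara, fanny
library(factoextra)    # get_dist, fviz_dist, fviz_nbclust, fviz_cluster, fviz_silhouette
library(clustertend)   # hopkins
library(fpc)           # calinhara
library(gridExtra)     # grid.arrange
library(dendextend)    # labels_colors, rect.dendrogram
library(RColorBrewer)  # brewer.pal
library(clValid)       # clValid
library(corrplot)      # corrplot
# groupBWplot(), used later for the per-cluster boxplots, comes from an additional
# package that is not identified here.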

data <- read.csv('Pokemon.csv')
summary(data)
##        X.                             Name         Type.1         Type.2        Total             HP             Attack       Defense          Sp..Atk          Sp..Def          Speed          Generation    Legendary  
##  Min.   :  1.0   Abomasnow              :  1   Water  :112           :386   Min.   :180.0   Min.   :  1.00   Min.   :  5   Min.   :  5.00   Min.   : 10.00   Min.   : 20.0   Min.   :  5.00   Min.   :1.000   False:735  
##  1st Qu.:184.8   AbomasnowMega Abomasnow:  1   Normal : 98   Flying  : 97   1st Qu.:330.0   1st Qu.: 50.00   1st Qu.: 55   1st Qu.: 50.00   1st Qu.: 49.75   1st Qu.: 50.0   1st Qu.: 45.00   1st Qu.:2.000   True : 65  
##  Median :364.5   Abra                   :  1   Grass  : 70   Ground  : 35   Median :450.0   Median : 65.00   Median : 75   Median : 70.00   Median : 65.00   Median : 70.0   Median : 65.00   Median :3.000              
##  Mean   :362.8   Absol                  :  1   Bug    : 69   Poison  : 34   Mean   :435.1   Mean   : 69.26   Mean   : 79   Mean   : 73.84   Mean   : 72.82   Mean   : 71.9   Mean   : 68.28   Mean   :3.324              
##  3rd Qu.:539.2   AbsolMega Absol        :  1   Psychic: 57   Psychic : 33   3rd Qu.:515.0   3rd Qu.: 80.00   3rd Qu.:100   3rd Qu.: 90.00   3rd Qu.: 95.00   3rd Qu.: 90.0   3rd Qu.: 90.00   3rd Qu.:5.000              
##  Max.   :721.0   Accelgor               :  1   Fire   : 52   Fighting: 26   Max.   :780.0   Max.   :255.00   Max.   :190   Max.   :230.00   Max.   :194.00   Max.   :230.0   Max.   :180.00   Max.   :6.000              
##                  (Other)                :794   (Other):342   (Other) :189
head(data)
##   X.                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk Sp..Def Speed Generation Legendary
## 1  1             Bulbasaur  Grass Poison   318 45     49      49      65      65    45          1     False
## 2  2               Ivysaur  Grass Poison   405 60     62      63      80      80    60          1     False
## 3  3              Venusaur  Grass Poison   525 80     82      83     100     100    80          1     False
## 4  3 VenusaurMega Venusaur  Grass Poison   625 80    100     123     122     120    80          1     False
## 5  4            Charmander   Fire          309 39     52      43      60      50    65          1     False
## 6  5            Charmeleon   Fire          405 58     64      58      80      65    80          1     False
# Legendary Pokemon: keep only the six base stats
legendary <- data %>% select(-c(X., Name, Type.1, Type.2, Total, Generation)) %>% filter(Legendary == 'True') %>% select(-Legendary)
# generation of each Legendary Pokemon (used later to colour the dendrogram labels)
legendary_g <- data %>% select(c(Generation, Legendary)) %>% filter(Legendary == 'True') %>% select(-Legendary)
# whole dataset: keep only the six base stats
data <- data %>% select(-c(X., Name, Type.1, Type.2, Total, Generation, Legendary))


plot(data)

plot(legendary)

First of all we should check whether the datasets are clusterable at all. To do that I used the Hopkins statistic, a simple measure of the cluster tendency of a data set. In its classical form, a value close to 1 indicates highly clustered data, random data tends to give values around 0.5, and regularly spaced data gives values close to 0. Note, however, that the hopkins() function from the clustertend package used below reports the complementary value, so there values close to 0 indicate clusterable data and values around 0.5 indicate random data.
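In its classical form the statistic compares, for $m$ sampled points, the nearest-neighbour distance $u_i$ from an artificial point placed uniformly at random in the data space with the nearest-neighbour distance $w_i$ of a randomly chosen real observation ($d$ is the number of variables):

$$H = \frac{\sum_{i=1}^{m} u_i^{\,d}}{\sum_{i=1}^{m} u_i^{\,d} + \sum_{i=1}^{m} w_i^{\,d}}.$$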

data_h <- hopkins(data, nrow(data)-1)

cat("All dataset: Hopkins statistic is equal to:",data_h$H)
## All dataset: Hopkins statistic is equal to: 0.1888738
leg_h <- hopkins(legendary, nrow(legendary)-1)
cat("Legendary dataset: Hopkins statistic is equal to:",leg_h$H)
## Legendary dataset: Hopkins statistic is equal to: 0.1842238

The Hopkins statistic is about 0.19 for both datasets. Under the convention of the clustertend implementation, values well below 0.5 point to some cluster tendency rather than to its absence, so let us proceed with the analysis.

Below we can see the Ordered Dissimilarity Matrices, which show pairs of distant observations in blue and close ones in red. The red blocks along the diagonal make the clusters easy to spot in both of our datasets.

d<-get_dist(data, method = "euclidean")
fviz_dist(d, show_labels = F) + labs(title = "Ordered Dissimilarity Matrix - All data")

d<-get_dist(legendary, method = "euclidean")
fviz_dist(d, show_labels = F) + labs(title = "Ordered Dissimilarity Matrix - Legendaries")

The next step is pre-diagnostics concerning the optimal number of clusters. The most popular technique uses the silhouette, a statistic measuring how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
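For an observation $i$, with $a(i)$ the mean distance to the other members of its own cluster and $b(i)$ the mean distance to the members of the nearest other cluster, the silhouette is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1],$$

so values near 1 indicate a well-placed observation and negative values indicate an observation that probably sits in the wrong cluster.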

a <- fviz_nbclust(data,FUNcluster=kmeans,method = "s")
b <- fviz_nbclust(data,FUNcluster=pam,method = "s")
c <- fviz_nbclust(data,FUNcluster=clara,method = "s")

# using the fanny function, which relies on the same silhouette idea as the methods above
d<-fviz_nbclust(data,FUNcluster=fanny,method = "s")

grid.arrange(a,b,c,d, top = "Optimal number of clusters")

clusnum <- 2


#LEGENDS
a <- fviz_nbclust(legendary,FUNcluster=kmeans,method = "s") + labs(title = "K-means")
b <- fviz_nbclust(legendary,FUNcluster=pam,method = "s") + labs(title = "PAM")
c <- fviz_nbclust(legendary,FUNcluster=clara,method = "s") + labs(title = "CLARA")
d <- fviz_nbclust(legendary,FUNcluster=fanny,method = "s") + labs(title = "Fuzzy clustering")

grid.arrange(a,b,c,d, top = "Optimal number of clusters")

clus_leg = 6

The greater the average silhouette width, the better, so we should pick the number of clusters that maximizes it. It can easily be noted that the optimal number of clusters equals 2 for each of the flat and fuzzy partitioning methods, so this is the number that will be used from now on.

For the Legendary Pokemon dataset the optimal value varies between methods. That is why I picked the number of clusters myself: 6, because that is the number of generations available in the dataset.

Clustering - All data

The main part of the analysis begins here. I applied each of the flat partitioning algorithms (the fuzzy one comes at the end) to the dataset. The division into two clusters is very similar across the methods, with some differences for observations in the middle of the dataset. Moreover, to understand the dimensions of the plotted data I provide a small PCA. The first PC (x axis) loads on all the analysed variables, while the second PC (y axis) consists mostly of Speed and Special Attack (contrasted with Defense). The second cluster therefore groups the observations (Pokemon) with high values of all attributes.

There is also a silhouette diagram for the PAM and CLARA methods. CLARA seems to do better, because its clusters are more homogeneous internally and better separated from each other (note that CLARA's silhouette is computed only on its sampled subset, hence the small cluster sizes in its plot). The observations closest to the centre of a cluster have high silhouettes, while those lying far from the centre have small or even negative ones.

Overall the clustering is rather poor, because the last observations of the second cluster have negative silhouettes. For k-means I calculated the Calinski-Harabasz statistic (a variance ratio), which will later be compared with the 3-cluster case.
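For reference, the Calinski-Harabasz index for a partition of $n$ observations into $k$ clusters is the variance ratio

$$CH = \frac{B/(k-1)}{W/(n-k)},$$

where $B$ and $W$ are the between-cluster and within-cluster sums of squares; higher values indicate a better partition.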

prcomp(data)$rotation[,1:2]
##                PC1         PC2
## HP      -0.3008079 -0.04221029
## Attack  -0.4928918 -0.07654480
## Defense -0.3806345 -0.69521578
## Sp..Atk -0.5089806  0.38331141
## Sp..Def -0.3943698 -0.17389431
## Speed   -0.3272626  0.57607928
ckm <- kmeans(data, clusnum)
ckm_p <- fviz_cluster(list(data=data, cluster=ckm$cluster), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "K-means")+theme(legend.position="bottom") 

cpam <- pam(data, clusnum)
cpam_p <- fviz_cluster(list(data=data, cluster=cpam$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "PAM")+theme(legend.position="bottom")

claraa <-clara(data, clusnum)
clara_p <- fviz_cluster(list(data=data, cluster=claraa$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "CLARA")+theme(legend.position="bottom")

grid.arrange(ckm_p,cpam_p,clara_p,heights=unit(0.8, "npc") , top = "Clustering all data, clusters = 2")

chara2 <- round(calinhara(data,ckm$cluster),digits=2)
sil_pam <- fviz_silhouette(cpam) + labs(title = "PAM")
##   cluster size ave.sil.width
## 1       1  296          0.47
## 2       2  504          0.16
sil_cla <- fviz_silhouette(claraa) + labs(title = "CLARA")
##   cluster size ave.sil.width
## 1       1   22          0.49
## 2       2   22          0.13
grid.arrange(sil_pam,sil_cla,top = "Silhouette plots, clusters = 2")

Clustering - Legendary Pokemon

When it comes to the Legendary Pokemon, there are 6 clusters - as many as there are generations in the dataset. With only 65 observations available, this makes some clusters very small and slightly disturbs the analysis. Nevertheless there are significant differences between the clusters.

First of all, the first dimension consists mostly of the Attack, Special Attack and Speed attributes (contrasted with the defensive ones), while the second dimension loads on all variables except Speed. The main disadvantage of these clusterings is that they differ between methods, so no clustering of high quality emerges.

Some of the strongest Pokemon stay in the same cluster regardless of the method, but the biggest differences come from the outlying observations, which are assigned to different clusters each time.

The silhouette diagrams show that there is not much difference between the PAM and CLARA clusterings.

prcomp(legendary)$rotation[,1:2]
##                 PC1        PC2
## HP       0.01346974  0.2874259
## Attack   0.49750799  0.4532307
## Defense -0.44611129  0.3321664
## Sp..Atk  0.54138483  0.4464765
## Sp..Def -0.39580888  0.5163867
## Speed    0.32175594 -0.3682897
ckm <- kmeans(legendary, clus_leg)
ckm_p <- fviz_cluster(list(data=legendary, cluster=ckm$cluster), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "K-means")+theme(legend.position="bottom") 

cpam <- pam(legendary, clus_leg)
cpam_p <- fviz_cluster(list(data=legendary, cluster=cpam$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "PAM")+theme(legend.position="bottom")

claraa <-clara(legendary, clus_leg)
clara_p <- fviz_cluster(list(data=legendary, cluster=claraa$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "CLARA")+theme(legend.position="bottom")

grid.arrange(ckm_p,cpam_p,clara_p,heights=unit(0.8, "npc") , top = "Clustering legendary Pokemon, clusters = 6")

sil_pam <- fviz_silhouette(cpam) + labs(title = "PAM")
##   cluster size ave.sil.width
## 1       1   18          0.25
## 2       2   13          0.09
## 3       3    4          0.15
## 4       4    6          0.21
## 5       5   13          0.11
## 6       6   11          0.25
sil_cla <- fviz_silhouette(claraa) + labs(title = "CLARA")
##   cluster size ave.sil.width
## 1       1   10          0.07
## 2       2   14          0.09
## 3       3   12          0.19
## 4       4    6          0.25
## 5       5    8          0.33
## 6       6    2          0.43
grid.arrange(sil_pam,sil_cla,top = "Silhouette plots, clusters = 6")

Clustering - All data - 3 clusters

What if there were 3 clusters? The clustering seems less logical now: most of the clusters overlap each other, and the silhouette values for PAM and CLARA are noticeably lower than before. The same holds for the Calinski-Harabasz index, which is much lower than the 2-cluster value computed earlier. We can therefore say that the 3-cluster partition is clearly worse than the 2-cluster one, and only the smaller number of clusters should be considered.

ckm <- kmeans(data, 3)
ckm_p <- fviz_cluster(list(data=data, cluster=ckm$cluster), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "K-means")+theme(legend.position="bottom") 

cpam <- pam(data, 3)
cpam_p <- fviz_cluster(list(data=data, cluster=cpam$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "PAM")+theme(legend.position="bottom")

claraa <-clara(data, 3)
clara_p <- fviz_cluster(list(data=data, cluster=claraa$clustering), ellipse.type="convex", geom="point",stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "CLARA")+theme(legend.position="bottom")

grid.arrange(ckm_p,cpam_p,clara_p,heights=unit(0.8, "npc") , top = "Clustering all data, clusters = 3")

round(calinhara(data,ckm$cluster),digits=2)
## [1] 273.86
sil_pam <- fviz_silhouette(cpam) + labs(title = "PAM")
##   cluster size ave.sil.width
## 1       1  266          0.33
## 2       2  234          0.13
## 3       3  300          0.06
sil_cla <- fviz_silhouette(claraa) + labs(title = "CLARA")
##   cluster size ave.sil.width
## 1       1   10          0.56
## 2       2   16          0.19
## 3       3   20          0.09
grid.arrange(sil_pam,sil_cla,top = "Silhouette plots, clusters = 3")

Dendrogram clustering

There is also a branch of clustering called hierarchical clustering, a method of cluster analysis that builds a hierarchy of clusters. Below is the clustering of the whole dataset with the tree cut into two clusters. Because of the size of the dataset, the dendrogram becomes highly unreadable beyond roughly the sixth level.
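Ward's criterion (the ward.D2 method used below) merges at every step the pair of clusters $A$, $B$ whose fusion gives the smallest increase in total within-cluster variance; that increase equals

$$\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\,\lVert \bar{x}_A - \bar{x}_B \rVert^2,$$

where $\bar{x}_A$ and $\bar{x}_B$ are the cluster centroids.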

d <- dist(data, method = "euclidean")
res <- hclust(d, method = "ward.D2" )
grp_all <- cutree(res, k = 2)
plot(res, cex = 0.6) 
rect.hclust(res, k = 2, border = c('red','green')) 

That is why I also provide the dendrogram for the Legendary Pokemon dataset. Below is its clustering with the tree cut at the sixth level, which is much more readable. Each observation is denoted by a leaf at the bottom of the tree. Similar observations are joined into groups at each level, up to the top of the tree where there is only one group. The important part is to find a cut-off that reliably produces the best clustering.

To check whether the clusters of Pokemon resemble the generations, the labels are coloured by generation. It is easily noticeable (by eye) that each cluster contains observations coming from various generations.

d <- dist(legendary, method = "euclidean")
res <- as.dendrogram(hclust(d, method = "ward.D2"))
cols <- brewer.pal(6, "Dark2")
# colour each label by the generation of the corresponding Pokemon
# (labels are stored in dendrogram order, hence the reordering)
labels_colors(res) <- cols[legendary_g$Generation][order.dendrogram(res)]
grp_leg <- cutree(res, k = 2)

plot(res, cex = 0.6) 
rect.dendrogram(res, k = 6, border = cols) 

Cluster assessment

To statistically pick which clustering method provided the best results, let us use the clValid function, which reports the Connectivity index, the Dunn index and the Silhouette coefficient together. The Dunn index and Silhouette should be as high as possible, whereas Connectivity should be as low as possible.

The Dunn index identifies sets of clusters that are compact, with small variance between members of the same cluster, and well separated, meaning that the means of different clusters are sufficiently far apart relative to the within-cluster variance.

Compactness assesses cluster homogeneity, usually by looking at the intra-cluster variance, while separation quantifies the degree of separation between clusters (usually by measuring the distance between cluster centroids).
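Formally, for a partition into clusters $C_1, \dots, C_k$ the Dunn index is the smallest between-cluster distance divided by the largest cluster diameter, so higher values are better:

$$D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{l} \operatorname{diam}(C_l)}.$$

Connectivity, in contrast, counts how often observations end up in a different cluster than their nearest neighbours, so lower values are better.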

assess <- clValid(data, nClust = 2:4, clMethods = c("hierarchical", "kmeans", "pam", "clara","fanny"), validation = "internal",maxitems = 1000 )
summary(assess)
## 
## Clustering Methods:
##  hierarchical kmeans pam clara fanny 
## 
## Cluster sizes:
##  2 3 4 
## 
## Validation Measures:
##                                   2        3        4
##                                                      
## hierarchical Connectivity    2.9290   6.8869  31.1976
##              Dunn            0.4024   0.3389   0.1521
##              Silhouette      0.6322   0.5718   0.3944
## kmeans       Connectivity  120.8877 249.0107 247.4766
##              Dunn            0.0273   0.0313   0.0393
##              Silhouette      0.2883   0.2325   0.2262
## pam          Connectivity   84.4429 303.2147 278.5552
##              Dunn            0.0347   0.0214   0.0472
##              Silhouette      0.2733   0.1696   0.2273
## clara        Connectivity  134.9786 259.0056 313.6921
##              Dunn            0.0282   0.0314   0.0383
##              Silhouette      0.2844   0.2300   0.2025
## fanny        Connectivity  117.8472       NA       NA
##              Dunn            0.0273       NA       NA
##              Silhouette      0.2878       NA       NA
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 2.9290 hierarchical 2       
## Dunn         0.4024 hierarchical 2       
## Silhouette   0.6322 hierarchical 2
assess_leg <- clValid(legendary, nClust = 2:4, clMethods = c("hierarchical", "kmeans", "pam", "clara"), validation = "internal",maxitems = 1000 )
summary(assess_leg)
## 
## Clustering Methods:
##  hierarchical kmeans pam clara 
## 
## Cluster sizes:
##  2 3 4 
## 
## Validation Measures:
##                                  2       3       4
##                                                   
## hierarchical Connectivity   3.8579 12.9012 15.8302
##              Dunn           0.2812  0.2332  0.2366
##              Silhouette     0.4253  0.3376  0.2980
## kmeans       Connectivity  35.2234 14.7782 26.4984
##              Dunn           0.0863  0.2019  0.2366
##              Silhouette     0.1942  0.3064  0.2360
## pam          Connectivity  35.9476 44.6702 45.0056
##              Dunn           0.0851  0.0858  0.1410
##              Silhouette     0.1933  0.1934  0.1797
## clara        Connectivity  30.8187 37.0107 44.7567
##              Dunn           0.0537  0.1395  0.1410
##              Silhouette     0.1889  0.1812  0.1745
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 3.8579 hierarchical 2       
## Dunn         0.2812 hierarchical 2       
## Silhouette   0.4253 hierarchical 2

For both cases (the whole dataset and the Legendary Pokemon) the most reasonable choice is hierarchical clustering with the tree cut into two clusters, because all of the measures are best there: Connectivity is the lowest, and the Dunn index and Silhouette are the highest.

Boxplots and statistics

Below we can see boxplots for the whole dataset and for the Legendary Pokemon, with one box per cluster for each variable. A boxplot presents the median and the overall spread of a variable.

All Pokemon

As mentioned before, it is easily noticeable that the second cluster consists of much better Pokemon: all of their statistics are higher. Interestingly, the variance within this cluster is also higher, and it contains more observations.

Legendary Pokemon

Cutting the dendrogram into two clusters created one cluster with a large number of observations and another with just five very strong, defensive Legendary Pokemon. The variances in both clusters are comparable.

groupBWplot(data, as.factor(grp_all), alpha=0.05)

groupBWplot(legendary, as.factor(grp_leg), alpha=0.05)

Fuzzy clustering

res.f <- fanny(data, k=2, diss=FALSE, memb.exp = 1.2, metric = "euclidean", 
      stand = FALSE, maxit = 500)

res.f3 <- fanny(data, k=2, diss=FALSE, memb.exp = 1.3, metric = "euclidean", 
      stand = FALSE, maxit = 500)

head(res.f$membership)
##             [,1]        [,2]
## [1,] 0.997187736 0.002812264
## [2,] 0.805354144 0.194645856
## [3,] 0.005398136 0.994601864
## [4,] 0.009690168 0.990309832
## [5,] 0.997989775 0.002010225
## [6,] 0.821683501 0.178316499
res.f2 <- fanny(legendary, k=2, diss=FALSE, memb.exp = 2, metric = "euclidean", 
      stand = FALSE, maxit = 500)


res.f4 <- fanny(legendary, k=2, diss=FALSE, memb.exp = 2.5, metric = "euclidean", 
      stand = FALSE, maxit = 500)

head(res.f2$membership)
##      [,1] [,2]
## [1,]  0.5  0.5
## [2,]  0.5  0.5
## [3,]  0.5  0.5
## [4,]  0.5  0.5
## [5,]  0.5  0.5
## [6,]  0.5  0.5

We can see that the algorithm provides degrees of membership in each of the clusters. Note that for the Legendary dataset with memb.exp = 2 every membership equals 0.5, i.e. the solution is completely fuzzy and carries no information about the cluster structure; a lower membership exponent gives crisper assignments.
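Dunn's partition coefficient (not the Dunn index used earlier) offers a quick way to quantify how fuzzy each solution is; fanny stores it in the coeff component of its result, and it equals 1/k for a completely fuzzy solution and 1 for a completely crisp one.

res.f$coeff    # all data, memb.exp = 1.2: a low exponent gives nearly crisp memberships
res.f2$coeff   # Legendary, memb.exp = 2: expected near 1/k = 0.5, i.e. completely fuzzy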

res.f_p <- fviz_cluster(list(data=data, cluster=res.f$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "Fuzzy = 1.2")+theme(legend.position="bottom") 

res.f3_p <- fviz_cluster(list(data=data, cluster=res.f3$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "Fuzzy = 1.3")+theme(legend.position="bottom") 

grid.arrange(res.f_p,res.f3_p,heights=unit(0.8, "npc") , top = "Fuzzy clustering - all data, clusters = 2")

sil_1 <- fviz_silhouette(res.f) + labs(title = "Fuzzy = 1.2")
##   cluster size ave.sil.width
## 1       1  365          0.43
## 2       2  435          0.17
sil_3 <- fviz_silhouette(res.f3) + labs(title = "Fuzzy = 1.3")
##   cluster size ave.sil.width
## 1       1  372          0.42
## 2       2  428          0.17
grid.arrange(sil_1,sil_3,top = "Silhouette plots, clusters = 2")

res.f2_p <- fviz_cluster(list(data=legendary, cluster=res.f2$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "Fuzzy = 2")+theme(legend.position="bottom") 

res.f4_p <- fviz_cluster(list(data=legendary, cluster=res.f4$clustering), ellipse.type="convex", geom="point", stand=FALSE, palette="Dark2", ggtheme=theme_minimal()) + labs(title = "Fuzzy = 2.5")+theme(legend.position="bottom") 

grid.arrange(res.f2_p,res.f4_p,heights=unit(0.8, "npc") , top = "Fuzzy clustering - Legendary Pokemon, clusters = 2")

sil_2 <- fviz_silhouette(res.f2) + labs(title = "Fuzzy = 2")
##   cluster size ave.sil.width
## 1       1   32          0.19
## 2       2   33          0.20
sil_4 <- fviz_silhouette(res.f4) + labs(title = "Fuzzy = 2.5")
##   cluster size ave.sil.width
## 1       1   31          0.20
## 2       2   34          0.19
grid.arrange(sil_2,sil_4,top = "Silhouette plots, clusters = 2")

Increasing the fuzziness exponent barely changes the silhouette scores for either dataset. In both cases the silhouette coefficients are somewhat lower than those obtained with 2 clusters using PAM or CLARA.

For the smaller dataset it is also possible to present the degrees of membership in each of the clusters, as shown in the plot below.

corrplot(t(res.f2$membership), is.corr = FALSE, method = "shade")

To sum up, the winning method in both cases was hierarchical clustering, which provided the clusters with the best statistics, so for clustering datasets of this kind dendrograms should be used. Fuzzy (fanny) clustering, which was one of the aims of this analysis, showed mediocre results.

There are a few differences between the applications of flat and hierarchical clustering. In particular, hierarchical clustering is appropriate for search-results clustering or collection clustering (as with the Pokemon here). In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, a predetermined number of clusters, non-determinism) is a concern. In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering, and the results of this study support that view.

List of Pokemon: https://pokemondb.net/pokedex/national