Clustering is a branch of unsupervised learning used to identify subgroups in data when labels are not provided. It can be applied to virtually any kind of data and therefore has many applications, e.g. customer segmentation, document clustering or identifying crime-prone localities. Among the many types of clustering algorithms, the most popular are partitional (flat), hierarchical, fuzzy and density-based algorithms.
In this study, I will try to find the clustering approach that groups the wines in the most efficient and stable way. To do this, I will implement several partitional, hierarchical and fuzzy algorithms.
The data used in the analysis is the Wine Data Set, which contains the results of a chemical analysis of wines and can be used to determine their origin. It is provided by the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine) and can be imported directly from the website with the read.table() function.
# loading the dataset directly from the UCI repository
wine_data <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",")
wine <- wine_data[,-1] # the first column is the class label, so it is dropped before clustering
| Variable Name | Variable description |
|---|---|
| V2 | Alcohol |
| V3 | Malic acid |
| V4 | Ash |
| V5 | Alcalinity of ash |
| V6 | Magnesium |
| V7 | Total phenols |
| V8 | Flavanoids |
| V9 | Nonflavanoid phenols |
| V10 | Proanthocyanins |
| V11 | Color intensity |
| V12 | Hue |
| V13 | OD280/OD315 of diluted wines |
| V14 | Proline |
In the first steps of the analysis, I will inspect the data in detail, calculate basic statistics and explore the relationships between variables.
cat("Number of different wines in the dataset:", nrow(wine))
## Number of different wines in the dataset: 178
cat("Number of features describing wines:", ncol(wine))
## Number of features describing wines: 13
The following output shows basic statistics of the wine characteristics: the minimum, 1st quartile, median, mean, 3rd quartile and maximum value.
summary(wine)
## V2 V3 V4 V5 V6 V7
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60 Min. : 70.00 Min. :0.980
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742
## Median :13.05 Median :1.865 Median :2.360 Median :19.50 Median : 98.00 Median :2.355
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49 Mean : 99.74 Mean :2.295
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00 Max. :162.00 Max. :3.880
## V8 V9 V10 V11 V12 V13
## Min. :0.340 Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.205 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :2.135 Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :2.029 Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:2.875 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :5.080 Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## V14
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
Since the variables are on very different scales, I will standardize them with the scale() function in order to get proper and interpretable results.
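Here scale() centres each variable and divides it by its standard deviation, i.e. it applies the usual z-score transformation, with the mean and standard deviation computed per variable:
\[z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}\]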
wine <- scale(wine)
The next step is to examine the relationships between the variables.
library(corrplot)
wine_matrix <- data.matrix(wine, rownames.force = NA)
M <- cor(wine_matrix)
corrplot(M, method = "number", number.cex = 0.75, order="hclust")
The correlation matrix shows that the dataset contains some highly correlated features, e.g. flavanoids and total phenols (0.86), OD280/OD315 of diluted wines and flavanoids (0.79), OD280/OD315 of diluted wines and total phenols (0.70), flavanoids and proanthocyanins (0.65) or proline and alcohol (0.64). The strongest dependencies (above 0.6) are mainly positive.
In order to check whether the dataset contains outliers, one can use the interquartile range (IQR). The equation is as follows:
\[IQR = Q_3 - Q_1\]
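The code below flags an observation as a potential outlier when it lies outside the standard 1.5 · IQR fences:
\[x < Q_1 - 1.5 \cdot IQR \quad \text{or} \quad x > Q_3 + 1.5 \cdot IQR\]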
vars <- c(colnames(wine))
Outliers <- c()
for(i in vars){
max <- quantile(wine[,i], 0.75) + (IQR(wine[,i]) * 1.5 )
min <- quantile(wine[,i], 0.25) - (IQR(wine[,i]) * 1.5 )
idx <- which(wine[,i] < min | wine[,i] > max)
print(paste(i, length(idx), sep=' ')) # printing variable and number of potential outliers
Outliers <- c(Outliers, idx)
}
## [1] "V2 0"
## [1] "V3 3"
## [1] "V4 3"
## [1] "V5 4"
## [1] "V6 4"
## [1] "V7 0"
## [1] "V8 0"
## [1] "V9 0"
## [1] "V10 2"
## [1] "V11 4"
## [1] "V12 1"
## [1] "V13 0"
## [1] "V14 0"
The rule flagged up to four potential outliers per feature. Let's now plot the variables that were flagged as containing outliers.
par(mfrow=c(2,2))
outlier_vars <- colnames(wine)[c(2:5, 9:11)] # variables flagged by the IQR rule
for (i in outlier_vars) {
  plot(wine[,i], main = paste("Plot of", i), ylab = i)
}
After examining the plots, I decided not to exclude any of the observations from the analysis.
In order to choose the clustering algorithm best suited to the data, one must consider the characteristics of the dataset. From the initial analysis we know that we are dealing with continuous variables, many of which are correlated (mainly positively), and that the dataset does not contain worrisome outliers.
I will start by running pre-diagnostics in order to check whether the data can be clustered at all and to choose the optimal number of clusters.
In order to assess the clusterability of the data, I will compute the Hopkins statistic. The null hypothesis states that the dataset is uniformly distributed and does not contain meaningful clusters.
library(clustertend)
library(factoextra)
get_clust_tendency(wine, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.3469492
##
## $plot
The Hopkins statistic is equal to 0.35. As this value is below 0.5, we can conclude that the dataset is significantly clusterable. The same conclusion can be drawn from the ordered dissimilarity plot: one can see blocks of different colours, which indicates that clusters can most likely be found in the data.
The next step is to find the optimal number of clusters. To do this, I will use the average silhouette statistic and apply it to three flat clustering algorithms (k-means, PAM and CLARA), to hierarchical clustering and to fuzzy clustering.
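For reference, the silhouette of an observation i compares its average distance to the other members of its own cluster, a(i), with the smallest average distance to the members of any other cluster, b(i):
\[s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\]
The candidate number of clusters that maximizes the mean silhouette over all observations is reported as optimal.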
library(gridExtra)
a <- fviz_nbclust(wine, FUNcluster = kmeans, method = "silhouette") + theme_classic()
b <- fviz_nbclust(wine, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
c <- fviz_nbclust(wine, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
d <- fviz_nbclust(wine, FUNcluster = hcut, method = "silhouette") + theme_classic()
e <- fviz_nbclust(wine, FUNcluster = cluster::fanny, method = "silhouette") + theme_classic()
grid.arrange(a, b, c, d, e, ncol=2)
According to the silhouette statistic, the optimal number of clusters is three for the flat and hierarchical algorithms and two for the fuzzy one.
As to partitional clustering, the most popular algorithms are k-means, PAM and CLARA. However, in this chapter only k-means and PAM will be implemented, because CLARA is an implementation of the PAM algorithm designed for large datasets: rather than the whole dataset, it uses subsets of the data to generate clusters, and each subset is partitioned into k clusters with the same algorithm as PAM. Since the dataset used in this analysis is rather small, there is no need to run CLARA.
Another point is that, given the linear dependencies between the variables, an important step in flat clustering is choosing a distance measure appropriate for this kind of data. One can consider at least two of them: Pearson's correlation and the Mahalanobis distance. The Mahalanobis distance is typically used when variables exhibit some degree of covariance: it rescales the data so that this covariance is removed. The formal equation of the Mahalanobis distance is as follows:
\[D^2 = (x- \bar x)^TS^{-1}(x- \bar x)\] In the analysis, I will compare the performance of the algorithms using these distances as well as the very common Euclidean distance.
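Since dist() does not offer the Mahalanobis metric directly, a minimal sketch of how a pairwise Mahalanobis distance matrix could be obtained is to whiten the data with the inverse Cholesky factor of the covariance matrix and then take ordinary Euclidean distances; the object name dm_mahal is mine and does not appear in the original code.
# sketch: pairwise Mahalanobis distances via whitening (assumes `wine` is the scaled data matrix)
S <- cov(wine)                        # covariance of the (scaled) features
wine_white <- wine %*% solve(chol(S)) # whitening step: removes the correlation between features
dm_mahal <- dist(wine_white)          # Euclidean distances in the whitened space equal Mahalanobis distances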
K-means
The basic concept of k-means is that each cluster is represented by its centre, i.e. the mean of its members. The goal is to minimize the squared error of the intra-cluster dissimilarity, which means that the algorithm aims for clusters that are internally homogeneous and well separated from each other.
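Formally, k-means searches for the partition into K clusters that minimizes the total within-cluster sum of squares, where each cluster is represented by its mean:
\[\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2\]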
Pearson correlation
library(factoextra)
cl_kmeans <- eclust(wine, k=3, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
a <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 62 0.34
## 2 2 51 0.35
## 3 3 65 0.18
b <- fviz_cluster(cl_kmeans, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)
Efficiency testing:
table(cl_kmeans$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 3 0
## 2 0 3 48
## 3 0 65 0
Euclidean distance
cl_kmeans1 <- eclust(wine, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
g <- fviz_silhouette(cl_kmeans1)
## cluster size ave.sil.width
## 1 1 62 0.34
## 2 2 51 0.35
## 3 3 65 0.18
h <- fviz_cluster(cl_kmeans1, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(g, h, ncol=2)
Efficiency testing:
table(cl_kmeans1$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 3 0
## 2 0 3 48
## 3 0 65 0
# TODO: k-means with the Mahalanobis distance will be added at some point
Summarizing the above results, there is essentially no difference between Pearson's correlation and the Euclidean distance: the average silhouette width is 0.28 in both cases. Regarding accuracy, note that the cluster numbers returned by k-means are arbitrary; after matching each cluster to its majority class (cluster 1 to class 1, cluster 2 to class 3, cluster 3 to class 2), only 6 wines, all of class 2, end up in the wrong cluster.
PAM
PAM is a so-called medoid-based method, which means that it chooses actual data points as the centres (medoids) of the clusters. It is generally more robust to outliers than k-means, since it minimizes dissimilarities to medoids instead of squared distances to means.
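In terms of the objective function, PAM replaces the squared distances to cluster means with dissimilarities to the medoids, each of which is itself one of the observations:
\[\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, m_k)\]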
Pearson correlation
cl_pam <- eclust(wine, k=3, FUNcluster="pam", hc_metric="pearson", graph=FALSE)
c <- fviz_silhouette(cl_pam)
## cluster size ave.sil.width
## 1 1 74 0.25
## 2 2 55 0.22
## 3 3 49 0.34
d <- fviz_cluster(cl_pam, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(c, d, ncol=2)
Efficiency testing:
table(cl_pam$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 15 0
## 2 0 55 0
## 3 0 1 48
Euclidean distance
cl_pam1 <- eclust(wine, k=3, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
i <- fviz_silhouette(cl_pam1)
## cluster size ave.sil.width
## 1 1 74 0.25
## 2 2 55 0.22
## 3 3 49 0.34
j <- fviz_cluster(cl_pam1, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(i, j, ncol=2)
Efficiency testing:
table(cl_pam1$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 15 0
## 2 0 55 0
## 3 0 1 48
Mahalanobis distance
library(ClusterR)
cl_pam2 <- Cluster_Medoids(wine, 3, distance_metric = "mahalanobis", verbose = FALSE, seed = 1)
Silhouette_Dissimilarity_Plot(cl_pam2, silhouette = TRUE)
## [1] TRUE
Efficiency testing:
table(cl_pam2$clusters, wine_data$V1)
##
## 1 2 3
## 1 49 33 8
## 2 4 25 0
## 3 6 13 40
Comparing the above PAM implementations, there is again no difference between Pearson's correlation and the Euclidean distance, just as in the case of k-means. The average silhouette width is 0.27 for both and the number of wrongly assigned labels is 16 in both cases. With the Mahalanobis distance the silhouette statistic is much lower (0.065) and the number of wrongly assigned labels is 64. Thus there is no advantage in using a distance other than the Euclidean one.
To sum up the comparison of k-means and PAM: with the Euclidean distance, k-means produced the slightly more accurate partition (6 misassigned wines versus 16 for PAM), while the average silhouette widths of the two algorithms are very similar.
However, besides testing for efficiency, cluster validation also involves testing the stability of the clusters, i.e. checking whether the clusters remain the same when the initial dataset changes, e.g. when subsamples are drawn or noise is added to the data. The most common method of testing stability is the bootstrap, provided in R by the fpc::clusterboot() function. One has to specify at least the clustering method, its parameters and the number of clusters. I will test the stability of the PAM solution: the chosen clustering method is pamkCBI and the number of clusters remains 3.
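The clusterboot() call itself is not shown above; a call along the following lines (the seed and the krange/count arguments are my assumptions, with B = 100 matching the 100 bootstrap iterations discussed below) produces the cboot.hclust object inspected in the next chunks.
library(fpc)
cboot.hclust <- clusterboot(wine, B = 100,           # 100 bootstrap resamples
                            clustermethod = pamkCBI, # PAM-based interface from fpc
                            krange = 3,              # fix the number of clusters at 3
                            seed = 123,              # assumed seed for reproducibility
                            count = FALSE)           # suppress per-run progress output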
cboot.hclust$bootmean
## [1] 0.9147406 0.8687969 0.9548098
The above vector indicates the cluster stability. Considering that a value of 1 means a perfectly stable cluster, we may conclude that the three obtained clusters are highly stable, with clusterwise bootstrap means equal to 0.91, 0.87 and 0.95.
cboot.hclust$bootbrd
## [1] 0 0 0
The bootbrd component indicates how many times each cluster was dissolved; here, out of 100 bootstrap iterations, no cluster was ever dissolved.
cboot.hclust$bootrecover
## [1] 100 100 100
The analogous bootrecover component indicates how many times each cluster was successfully recovered; here, all clusters were recovered in every one of the 100 bootstrap iterations.
Taking all this information into account, we can consider the clusters obtained with PAM to be highly stable. I will later compare the performance of this algorithm with the best algorithms obtained in the further clustering analysis.
Hierarchical clustering, as the name indicates, is based on building a hierarchy of clusters. There are two types of hierarchical clustering: agglomerative (bottom-up; HAC or AGNES) and divisive (top-down; DIANA). The main difference is that in agglomerative clustering each observation starts in its own cluster and clusters are successively merged, whereas in divisive clustering all observations start in one cluster which is then recursively split into smaller clusters, down to one cluster per observation.
In order to cluster the data agglomeratively, one may use the stats::hclust() or cluster::agnes() functions; I will use the first one. To do this, the dissimilarity matrix has to be computed first, e.g. with the dist() function. Inside hclust(), one must also specify the linkage method. There are several options, including single linkage, complete linkage, average linkage and Ward's method. With single linkage, the distance between clusters is the distance between the closest members of the two clusters. With complete linkage, it is the distance between the most dissimilar members of the two clusters. Average linkage uses the average distance from any data point in one cluster to any data point in the other cluster. Ward's method, in turn, aims to minimize the total within-cluster variance, which measures the compactness of the clustering.
My approach will be to run the clustering with the four above-mentioned linkage methods and to compare the results with the class label assigned to each observation.
The following plots show the dendrograms of the agglomerative hierarchical clustering algorithm, with three clusters assumed beforehand.
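The calls below operate on a dissimilarity matrix dm, which is not defined explicitly in the chunks above; I therefore compute it here from the scaled data, assuming the Euclidean distance (consistent with hc_metric = "euclidean" used in the calls).
dm <- dist(wine, method = "euclidean") # dissimilarity matrix used by the hierarchical and fuzzy algorithms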
Single linkage
hc <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
plot(hc, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc, k=3, border='red')
Efficiency testing:
clusterCut <- cutree(hc, 3)
table(clusterCut, wine_data$V1)
##
## clusterCut 1 2 3
## 1 59 69 48
## 2 0 1 0
## 3 0 1 0
Complete linkage
hc1 <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(hc1, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc1, k=3, border='red')
Efficiency testing:
clusterCut1 <- cutree(hc1, 3)
table(clusterCut1, wine_data$V1)
##
## clusterCut1 1 2 3
## 1 33 12 2
## 2 26 56 0
## 3 0 3 46
Average linkage
hc2 <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
plot(hc2, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc2, k=3, border='red')
Efficiency testing:
clusterCut2 <- cutree(hc2, 3)
table(clusterCut2, wine_data$V1)
##
## clusterCut2 1 2 3
## 1 59 61 2
## 2 0 9 0
## 3 0 1 46
Ward’s method
hc3 <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
plot(hc3, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc3, k=3, border='red')
Efficiency testing:
clusterCut3 <- cutree(hc3, 3)
table(clusterCut3, wine_data$V1)
##
## clusterCut3 1 2 3
## 1 57 0 0
## 2 2 70 1
## 3 0 1 47
It turns out that Ward's method is the most accurate: only four observations were assigned to the wrong cluster. Further analysis will therefore be conducted using Ward's method. In order to learn more about the partition, I will run the cluster.stats() function and display some basic statistics of the clustering object.
library(fpc)
hc_stats <- cluster.stats(dm, hc3$cluster)
hc_stats$cluster.size # number of observations per cluster
## [1] 57 73 48
hc_stats$within.cluster.ss # within cluster sum of squares
## [1] 1306.373
hc_stats$avg.silwidth # average silhouette width
## [1] 0.2750407
hc_stats$clus.avg.silwidths # average silhouette widths for each cluster
## 1 2 3
## 0.4069831 0.1036613 0.3789986
The algorithm created three clusters containing 57, 73 and 48 observations. The within-cluster sum of squares is equal to 1306.37, the average silhouette width is 0.275, and the average silhouette widths for the individual clusters are 0.41, 0.10 and 0.38.
Another valuable statistic is the Dunn index, a relative criterion of clustering quality. A higher Dunn index indicates better clustering, i.e. clusters that are compact and well separated from each other. It can be calculated by dividing the minimum separation by the maximum cluster diameter; both values are available in the cluster.stats() output.
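Formally, the Dunn index is the ratio of the smallest distance between observations in different clusters to the largest within-cluster diameter:
\[D = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}\]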
hc_stats$min.separation
## [1] 1.730825
hc_stats$max.diameter
## [1] 11.17996
dunn <- hc_stats$min.separation / hc_stats$max.diameter
cat("Dunn Index is equal to", round(dunn, 2))
## Dunn Index is equal to 0.15
One can also calculate the agglomerative coefficient, which measures the strength of the clustering structure found. Values closer to 1 indicate a strong, tight clustering structure, while lower values indicate less well-formed clusters.
library(cluster)
cat("Agglomerative coefficient is equal to", round(coef.hclust(hc3), 2))
## Agglomerative coefficient is equal to 0.96
As to divisive hierarchical clustering, it can be implemented with the cluster::diana() function. Here, the only parameter that has to be set is the number of clusters.
hc4 <- eclust(dm, k=3, FUNcluster="diana")
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc4, k=3, border='red')
clusterCut4 <- cutree(hc4, 3)
table(clusterCut4, wine_data$V1)
##
## clusterCut4 1 2 3
## 1 54 20 0
## 2 5 7 0
## 3 0 44 48
It is clearly visible that the algorithm performed worse than the best agglomerative approach. This is also confirmed by the following statistics, the same ones that were calculated for the clustering object obtained with Ward's method.
hc_stats1 <- cluster.stats(dm, hc4$cluster)
hc_stats1$cluster.size # number of observations per cluster
## [1] 74 12 92
hc_stats1$within.cluster.ss # within cluster sum of squares
## [1] 1613.311
hc_stats1$avg.silwidth # average silhouette width
## [1] 0.2129193
hc_stats1$clus.avg.silwidths # average silhouette widths for each cluster
## 1 2 3
## 0.2893495 -0.1487402 0.1986159
dunn <- hc_stats1$min.separation / hc_stats1$max.diameter
cat("Dunn Index is equal to", round(dunn, 2))
## Dunn Index is equal to 0.14
cat("Divisive coefficient is equal to", round(coef(hc4), 2))
## Divisive coefficient is equal to 0.83
As before, the last step of the assessment will be stability validation using the fpc::clusterboot() function. In this case, I chose hclustCBI as the clustering method, which is an interface to the hclust function with an optional noise cluster. As the linkage method, I chose Ward's method, since it was the best-performing option. The number of clusters remains the same.
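Again, the exact clusterboot() call is not shown in the original; a call of roughly this form (seed and count arguments assumed) would produce the cboot.hclust1 object inspected below.
cboot.hclust1 <- clusterboot(wine, B = 100,
                             clustermethod = hclustCBI, # hclust interface with optional noise cluster
                             method = "ward.D2",        # Ward's linkage, the best-performing option
                             k = 3,                     # cut the tree into three clusters
                             seed = 123,                # assumed seed for reproducibility
                             count = FALSE)             # suppress per-run progress output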
cboot.hclust1$bootmean
## [1] 0.8398482 0.9301395 0.7649944
The clusterwise means of the bootstrap result are equal to 0.84, 0.93 and 0.76. We can say that the first two clusters seem to be highly stable, while the third cluster is only moderately stable.
cboot.hclust1$bootbrd
## [1] 1 0 3
As to the bootbrd component, out of 100 bootstrap iterations the first cluster was dissolved once, the second cluster was not dissolved at all and the third cluster was dissolved 3 times.
cboot.hclust1$bootrecover
## [1] 79 98 59
As to successful recovery, out of 100 bootstrap iterations the first cluster was recovered 79 times, the second cluster 98 times and the third cluster only 59 times.
Fuzzy clustering is an algorithm in which each element has a degree of membership in every cluster, rather than the binary membership used in hard clustering algorithms. Thus, the result is a membership degree for each observation in each cluster. Fuzzy clustering can be implemented in R with the cluster::fanny() function, in which one has to specify at least three parameters: the number of clusters, the membership exponent and the metric used for calculating dissimilarities. In our case, the prediagnostics indicated that the optimal number of clusters for fuzzy clustering is two; since the other algorithms used three clusters, I will run the algorithm with both two and three clusters in order to check its accuracy. As to the membership exponent, I will try three values (1.2, 1.5 and 2) in order to see the differences between the outputs and decide which one to choose. The metric will be the Euclidean distance.
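For reference, cluster::fanny() looks for membership degrees u_iv (observation i, cluster v) that minimize an objective of roughly the following form, where r is the membership exponent (memb.exp) and d(i, j) is the dissimilarity between observations i and j:
\[\sum_{v=1}^{k} \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} u_{iv}^{\,r}\, u_{jv}^{\,r}\, d(i,j)}{2 \sum_{j=1}^{n} u_{jv}^{\,r}}\]
Larger values of r flatten the memberships towards equal degrees, which is exactly what happens below for memb.exp = 2.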
library(cluster)
clust_fanny <- fanny(dm, k=2, diss=TRUE, memb.exp = 1.2, metric = "euclidean")
head(clust_fanny$membership, n=10)
## [,1] [,2]
## [1,] 0.9878966 0.012103415
## [2,] 0.9678145 0.032185465
## [3,] 0.9846820 0.015318021
## [4,] 0.9768821 0.023117889
## [5,] 0.8855022 0.114497847
## [6,] 0.9886276 0.011372388
## [7,] 0.9870559 0.012944070
## [8,] 0.9736858 0.026314166
## [9,] 0.9819113 0.018088673
## [10,] 0.9937507 0.006249258
library(cluster)
clust_fanny1 <- fanny(dm, k=3, diss=TRUE, memb.exp = 1.2, metric = "euclidean")
head(clust_fanny1$membership, n=10)
## [,1] [,2] [,3]
## [1,] 0.9881048 0.009546268 0.002348908
## [2,] 0.8739605 0.114163537 0.011876000
## [3,] 0.9714936 0.024164061 0.004342341
## [4,] 0.9834670 0.011232010 0.005300994
## [5,] 0.7695419 0.181498124 0.048959974
## [6,] 0.9946211 0.003829950 0.001548986
## [7,] 0.9885032 0.009110240 0.002386550
## [8,] 0.9796753 0.014452082 0.005872613
## [9,] 0.9757306 0.019652784 0.004616626
## [10,] 0.9905939 0.008155433 0.001250690
clust_fanny2 <- fanny(dm, k=2, diss=TRUE, memb.exp = 1.5, metric = "euclidean")
head(clust_fanny2$membership, n=10)
## [,1] [,2]
## [1,] 0.7644276 0.2355724
## [2,] 0.7166768 0.2833232
## [3,] 0.7546862 0.2453138
## [4,] 0.7311154 0.2688846
## [5,] 0.6369950 0.3630050
## [6,] 0.7693791 0.2306209
## [7,] 0.7664869 0.2335131
## [8,] 0.7300331 0.2699669
## [9,] 0.7485009 0.2514991
## [10,] 0.7983968 0.2016032
clust_fanny3 <- fanny(dm, k=3, diss=TRUE, memb.exp = 1.5, metric = "euclidean")
head(clust_fanny3$membership, n=10)
## [,1] [,2] [,3]
## [1,] 0.6768370 0.2052493 0.11791361
## [2,] 0.5431846 0.3085460 0.14826937
## [3,] 0.6372803 0.2363846 0.12633518
## [4,] 0.6478315 0.2079265 0.14424198
## [5,] 0.4596641 0.3307317 0.20960427
## [6,] 0.7129983 0.1742018 0.11279991
## [7,] 0.6806755 0.2023862 0.11693833
## [8,] 0.6328367 0.2214725 0.14569080
## [9,] 0.6413502 0.2280630 0.13058685
## [10,] 0.7118584 0.1930817 0.09505991
clust_fanny4 <- fanny(dm, k=2, diss=TRUE, memb.exp = 2, metric = "euclidean")
head(clust_fanny4$membership, n=10)
## [,1] [,2]
## [1,] 0.5 0.5
## [2,] 0.5 0.5
## [3,] 0.5 0.5
## [4,] 0.5 0.5
## [5,] 0.5 0.5
## [6,] 0.5 0.5
## [7,] 0.5 0.5
## [8,] 0.5 0.5
## [9,] 0.5 0.5
## [10,] 0.5 0.5
clust_fanny5 <- fanny(dm, k=3, diss=TRUE, memb.exp = 2, metric = "euclidean")
head(clust_fanny5$membership, n=10)
## [,1] [,2] [,3]
## [1,] 0.3333334 0.3333333 0.3333333
## [2,] 0.3333334 0.3333333 0.3333333
## [3,] 0.3333334 0.3333333 0.3333333
## [4,] 0.3333334 0.3333333 0.3333333
## [5,] 0.3333334 0.3333333 0.3333333
## [6,] 0.3333334 0.3333333 0.3333333
## [7,] 0.3333334 0.3333333 0.3333333
## [8,] 0.3333334 0.3333333 0.3333333
## [9,] 0.3333334 0.3333333 0.3333333
## [10,] 0.3333334 0.3333333 0.3333333
Based on the above output, one can see that as the membership exponent increases, the memberships become more and more diffuse; with a membership exponent of 2 the membership degrees are practically equal across clusters.
a <- fviz_silhouette(clust_fanny)
## cluster size ave.sil.width
## 1 1 88 0.28
## 2 2 90 0.24
b <- fviz_silhouette(clust_fanny2)
## cluster size ave.sil.width
## 1 1 88 0.28
## 2 2 90 0.24
c <- fviz_silhouette(clust_fanny4)
## cluster size ave.sil.width
## 1 1 89 0.28
## 2 2 89 0.24
grid.arrange(a, b, c, ncol=1)
d <- fviz_silhouette(clust_fanny1)
## cluster size ave.sil.width
## 1 1 61 0.36
## 2 2 66 0.16
## 3 3 51 0.35
e <- fviz_silhouette(clust_fanny3)
## cluster size ave.sil.width
## 1 1 62 0.36
## 2 2 63 0.17
## 3 3 53 0.34
f <- fviz_silhouette(clust_fanny5)
## cluster size ave.sil.width
## 1 1 89 0.28
## 2 2 89 0.24
grid.arrange(d, e, f, ncol=1)
Changing the membership exponent and the number of clusters changes the silhouette statistic. The highest silhouette is obtained for three clusters and a membership exponent of 1.2. Thus, the next step will be testing the clustering accuracy of the clust_fanny1 object.
table(clust_fanny1$clustering, wine_data$V1)
##
## 1 2 3
## 1 59 2 0
## 2 0 66 0
## 3 0 3 48
Unfortunately, I did not find a suitable implementation of bootstrapping for the fuzzy algorithm, analogous to what was done in the previous part of the study. However, assuming that fuzzy clustering is still weaker than hierarchical clustering in terms of efficiency, and that PAM is very strong in terms of stability, I will keep these algorithms as the leaders in their respective fields. Nevertheless, I will keep looking for the right tool (or, in the end, write my own function) in order to conduct a valid validation; a possible starting point is sketched below.
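Should one want to close this gap, clusterboot() accepts any clustering function that follows its CBI interface (a list with result, nc, clusterlist, partition and clustermethod components), so a small wrapper around fanny() could be written. The sketch below is my own untested assumption of what such a wrapper might look like, not code from the original study.
# hypothetical CBI-style wrapper around fanny() for use with fpc::clusterboot()
fannyCBI <- function(data, k, ...) {
  res <- cluster::fanny(data, k = k, ...)
  partition <- res$clustering
  list(result = res,
       nc = k,
       clusterlist = lapply(seq_len(k), function(j) partition == j), # one logical vector per cluster
       partition = partition,
       clustermethod = "fanny")
}
# Example usage (parameters assumed):
# cboot.fanny <- clusterboot(wine, B = 100, clustermethod = fannyCBI,
#                            k = 3, memb.exp = 1.2, seed = 123, count = FALSE)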
The best algorithm in terms of efficiency turned out to be agglomerative hierarchical clustering with Ward's linkage. In terms of stability, however, the best of those that were validated is PAM with the Euclidean distance. The choice of a particular algorithm and of the properties it should have is usually driven by business needs, and there is usually a trade-off, as in this case. Thus, I will not point to a single best algorithm, since the choice is subjective.