Clustering is a branch of unsupervised learning used to identify subgroups in data when labels are not provided. It can be applied to virtually any kind of data and therefore has many applications, e.g. customer segmentation, document clustering or identifying crime-prone localities. Among the many types of clustering algorithms, the most popular are partitional (flat), hierarchical, fuzzy and density-based algorithms.
In this study, I will try to find the clustering approach that groups the wines in the most efficient and stable way. To do this, I will implement several partitional, hierarchical and fuzzy algorithms.
The data used in the analysis is the Wine Data Set, which contains the results of a chemical analysis of wines and can be used to determine their origin. It is provided by the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine) and can be imported directly from the website with the read.table() function.
# loading the dataset directly from the UCI repository
wine_data <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",")
wine <- wine_data[,-1] # the first column is the class label, so it is dropped before clustering
| Variable Name | Variable description |
|---|---|
| V2 | Alcohol |
| V3 | Malic acid |
| V4 | Ash |
| V5 | Alcalinity of ash |
| V6 | Magnesium |
| V7 | Total phenols |
| V8 | Flavanoids |
| V9 | Nonflavanoid phenols |
| V10 | Proanthocyanins |
| V11 | Color intensity |
| V12 | Hue |
| V13 | OD280/OD315 of diluted wines |
| V14 | Proline |
In the first steps of the analysis, I will inspect the data in detail, calculate basic statistics and explore the relationships between variables.
cat("Number of different wines in the dataset:", nrow(wine))
## Number of different wines in the dataset: 178
cat("Number of features describing wines:", ncol(wine))
## Number of features describing wines: 13
The following output shows basic statistics of the wine characteristics: the minimum, 1st quartile, median, mean, 3rd quartile and maximum value.
summary(wine)
## V2 V3 V4 V5 V6 V7
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60 Min. : 70.00 Min. :0.980
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742
## Median :13.05 Median :1.865 Median :2.360 Median :19.50 Median : 98.00 Median :2.355
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49 Mean : 99.74 Mean :2.295
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00 Max. :162.00 Max. :3.880
## V8 V9 V10 V11 V12 V13
## Min. :0.340 Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.205 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :2.135 Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :2.029 Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:2.875 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :5.080 Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## V14
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
Since the variables are on very different scales, I will standardize them with the scale() function in order to get proper and interpretable results.
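Here scale() centres each variable and divides it by its standard deviation, i.e. it applies the usual z-score transformation, with the mean and standard deviation computed per variable:
\[z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}\]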
wine <- scale(wine)
The next step is to examine the relationships between the variables.
library(corrplot)
wine_matrix <- data.matrix(wine, rownames.force = NA)
M <- cor(wine_matrix)
corrplot(M, method = "number", number.cex = 0.75, order="hclust")
The correlation matrix shows that the dataset contains some highly correlated features, e.g. flavanoids and total phenols (0.86), OD280/OD315 of diluted wines and flavanoids (0.79), OD280/OD315 of diluted wines and total phenols (0.70), flavanoids and proanthocyanins (0.65) or proline and alcohol (0.64). The strongest dependencies (above 0.6) are mainly positive.
In order to check whether the dataset contains outliers, one can use the interquartile range (IQR). The equation is as follows:
\[IQR = Q_3 - Q_1\]
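The code below flags an observation as a potential outlier when it lies outside the standard 1.5 · IQR fences:
\[x < Q_1 - 1.5 \cdot IQR \quad \text{or} \quad x > Q_3 + 1.5 \cdot IQR\]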
vars <- c(colnames(wine))
Outliers <- c()
for(i in vars){
max <- quantile(wine[,i], 0.75) + (IQR(wine[,i]) * 1.5 )
min <- quantile(wine[,i], 0.25) - (IQR(wine[,i]) * 1.5 )
idx <- which(wine[,i] < min | wine[,i] > max)
print(paste(i, length(idx), sep=' ')) # printing variable and number of potential outliers
Outliers <- c(Outliers, idx)
}
## [1] "V2 0"
## [1] "V3 3"
## [1] "V4 3"
## [1] "V5 4"
## [1] "V6 4"
## [1] "V7 0"
## [1] "V8 0"
## [1] "V9 0"
## [1] "V10 2"
## [1] "V11 4"
## [1] "V12 1"
## [1] "V13 0"
## [1] "V14 0"
The rule flagged up to four potential outliers per feature. Let's now plot the variables that were flagged as containing outliers.
par(mfrow=c(2,2))
outlier_vars <- colnames(wine)[c(2:5, 9:11)] # variables flagged by the IQR rule
for (i in outlier_vars) {
  plot(wine[,i], main = paste("Plot of", i), ylab = i)
}
After examining the plots, I decided not to exclude any of the observations from the analysis.
In order to choose the clustering algorithm best suited to the data, one must consider the characteristics of the dataset. From the initial analysis we know that we are dealing with continuous variables, many of which are correlated (mainly positively), and that the dataset does not contain worrisome outliers.
I will start by running pre-diagnostics in order to check whether the data can be clustered at all and to choose the optimal number of clusters.
In order to assess the clusterability of the data, I will compute the Hopkins statistic. The null hypothesis states that the dataset is uniformly distributed and does not contain meaningful clusters.
library(clustertend)
library(factoextra)
get_clust_tendency(wine, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.3469492
##
## $plot
The Hopkins statistic is equal to 0.35. As this value is below 0.5, we can conclude that the dataset is significantly clusterable. The same conclusion can be drawn from the ordered dissimilarity plot: one can see blocks of different colours, which indicates that clusters can most likely be found in the data.
The next step is to find the optimal number of clusters. To do this, I will use the average silhouette statistic and apply it to three flat clustering algorithms (k-means, PAM and CLARA), to hierarchical clustering and to fuzzy clustering.
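For reference, the silhouette of an observation i compares its average distance to the other members of its own cluster, a(i), with the smallest average distance to the members of any other cluster, b(i):
\[s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\]
The candidate number of clusters that maximizes the mean silhouette over all observations is reported as optimal.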
library(gridExtra)
a <- fviz_nbclust(wine, FUNcluster = kmeans, method = "silhouette") + theme_classic()
b <- fviz_nbclust(wine, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
c <- fviz_nbclust(wine, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
d <- fviz_nbclust(wine, FUNcluster = hcut, method = "silhouette") + theme_classic()
e <- fviz_nbclust(wine, FUNcluster = cluster::fanny, method = "silhouette") + theme_classic()
grid.arrange(a, b, c, d, e, ncol=2)
According to the silhouette statistic, the optimal number of clusters is three for the flat and hierarchical algorithms and two for the fuzzy one.
As to partitional clustering, the most popular algorithms are k-means, PAM and CLARA. However, in this chapter only k-means and PAM will be implemented, because CLARA is an implementation of the PAM algorithm designed for large datasets: rather than the whole dataset, it uses subsets of the data to generate clusters, and each subset is partitioned into k clusters with the same algorithm as PAM. Since the dataset used in this analysis is rather small, there is no need to run CLARA.
Another point is that, given the linear dependencies between the variables, an important step in flat clustering is choosing a distance measure appropriate for this kind of data. One can consider at least two of them: Pearson's correlation and the Mahalanobis distance. The Mahalanobis distance is typically used when variables exhibit some degree of covariance: it rescales the data so that this covariance is removed. The formal equation of the Mahalanobis distance is as follows:
\[D^2 = (x- \bar x)^TS^{-1}(x- \bar x)\] In the analysis, I will compare the performance of the algorithms using these distances as well as the very common Euclidean distance.
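Since dist() does not offer the Mahalanobis metric directly, a minimal sketch of how a pairwise Mahalanobis distance matrix could be obtained is to whiten the data with the inverse Cholesky factor of the covariance matrix and then take ordinary Euclidean distances; the object name dm_mahal is mine and does not appear in the original code.
# sketch: pairwise Mahalanobis distances via whitening (assumes `wine` is the scaled data matrix)
S <- cov(wine)                        # covariance of the (scaled) features
wine_white <- wine %*% solve(chol(S)) # whitening step: removes the correlation between features
dm_mahal <- dist(wine_white)          # Euclidean distances in the whitened space equal Mahalanobis distances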
K-means
The basic concept of k-means is that each cluster is represented by its centre, i.e. the mean of its members. The goal is to minimize the squared error of the intra-cluster dissimilarity, which means that the algorithm aims for clusters that are internally homogeneous and well separated from each other.
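Formally, k-means searches for the partition into K clusters that minimizes the total within-cluster sum of squares, where each cluster is represented by its mean:
\[\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2\]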
Pearson correlation
library(factoextra)
cl_kmeans <- eclust(wine, k=3, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
a <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 62 0.34
## 2 2 51 0.35
## 3 3 65 0.18
b <- fviz_cluster(cl_kmeans, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)
Efficiency testing:
table(cl_kmeans$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 3 0
## 2 0 3 48
## 3 0 65 0
Euclidean distance
cl_kmeans1 <- eclust(wine, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
g <- fviz_silhouette(cl_kmeans1)
## cluster size ave.sil.width
## 1 1 62 0.34
## 2 2 51 0.35
## 3 3 65 0.18
h <- fviz_cluster(cl_kmeans1, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(g, h, ncol=2)
Efficiency testing:
table(cl_kmeans1$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 3 0
## 2 0 3 48
## 3 0 65 0
# TODO: k-means with the Mahalanobis distance will be added at some point
Summarizing the above results, there is essentially no difference between Pearson's correlation and the Euclidean distance: the average silhouette width is 0.28 in both cases. Regarding accuracy, note that the cluster numbers returned by k-means are arbitrary; after matching each cluster to its majority class (cluster 1 to class 1, cluster 2 to class 3, cluster 3 to class 2), only 6 wines, all of class 2, end up in the wrong cluster.
PAM
PAM is a so-called medoid-based method, which means that it chooses actual data points as the centres (medoids) of the clusters. It is generally more robust to outliers than k-means, since it minimizes dissimilarities to medoids instead of squared distances to means.
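In terms of the objective function, PAM replaces the squared distances to cluster means with dissimilarities to the medoids, each of which is itself one of the observations:
\[\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, m_k)\]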
Pearson correlation
cl_pam <- eclust(wine, k=3, FUNcluster="pam", hc_metric="pearson", graph=FALSE)
c <- fviz_silhouette(cl_pam)
## cluster size ave.sil.width
## 1 1 74 0.25
## 2 2 55 0.22
## 3 3 49 0.34
d <- fviz_cluster(cl_pam, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(c, d, ncol=2)
Efficiency testing:
table(cl_pam$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 15 0
## 2 0 55 0
## 3 0 1 48
Euclidean distance
cl_pam1 <- eclust(wine, k=3, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
i <- fviz_silhouette(cl_pam1)
## cluster size ave.sil.width
## 1 1 74 0.25
## 2 2 55 0.22
## 3 3 49 0.34
j <- fviz_cluster(cl_pam1, data = wine, ellipse.type = "convex") + theme_minimal()
grid.arrange(i, j, ncol=2)
Efficiency testing:
table(cl_pam1$cluster, wine_data$V1)
##
## 1 2 3
## 1 59 15 0
## 2 0 55 0
## 3 0 1 48
Mahalanobis distance
library(ClusterR)
cl_pam2 <- Cluster_Medoids(wine, 3, distance_metric = "mahalanobis", verbose = FALSE, seed = 1)
Silhouette_Dissimilarity_Plot(cl_pam2, silhouette = TRUE)
## [1] TRUE
Efficiency testing:
table(cl_pam2$clusters, wine_data$V1)
##
## 1 2 3
## 1 49 33 8
## 2 4 25 0
## 3 6 13 40
Comparing the above PAM implementations, there is again no difference between Pearson's correlation and the Euclidean distance, just as in the case of k-means. The average silhouette width is 0.27 for both and the number of wrongly assigned labels is 16 in both cases. With the Mahalanobis distance the silhouette statistic is much lower (0.065) and the number of wrongly assigned labels is 64. Thus there is no advantage in using a distance other than the Euclidean one.
To sum up the comparison of k-means and PAM: with the Euclidean distance, k-means produced the slightly more accurate partition (6 misassigned wines versus 16 for PAM), while the average silhouette widths of the two algorithms are very similar.
However, besides testing for efficiency, cluster validation also involves testing the stability of the clusters, i.e. checking whether the clusters remain the same when the initial dataset changes, e.g. when subsamples are drawn or noise is added to the data. The most common method of testing stability is the bootstrap, provided in R by the fpc::clusterboot() function. One has to specify at least the clustering method, its parameters and the number of clusters. I will test the stability of the PAM solution: the chosen clustering method is pamkCBI and the number of clusters remains 3.
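The clusterboot() call itself is not shown above; a call along the following lines (the seed and the krange/count arguments are my assumptions, with B = 100 matching the 100 bootstrap iterations discussed below) produces the cboot.hclust object inspected in the next chunks.
library(fpc)
cboot.hclust <- clusterboot(wine, B = 100,           # 100 bootstrap resamples
                            clustermethod = pamkCBI, # PAM-based interface from fpc
                            krange = 3,              # fix the number of clusters at 3
                            seed = 123,              # assumed seed for reproducibility
                            count = FALSE)           # suppress per-run progress output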
cboot.hclust$bootmean
## [1] 0.9147406 0.8687969 0.9548098
The above vector indicates the cluster stability. Considering that a value of 1 means a perfectly stable cluster, we may conclude that the three obtained clusters are highly stable, with clusterwise bootstrap means equal to 0.91, 0.87 and 0.95.
cboot.hclust$bootbrd
## [1] 0 0 0
The bootbrd component indicates how many times each cluster was dissolved; here, out of 100 bootstrap iterations, no cluster was ever dissolved.
cboot.hclust$bootrecover
## [1] 100 100 100
The analogous bootrecover component indicates how many times each cluster was successfully recovered; here, all clusters were recovered in every one of the 100 bootstrap iterations.
Taking all this information into account, we can consider the clusters obtained with PAM to be highly stable. I will later compare the performance of this algorithm with the best algorithms obtained in the further clustering analysis.
Hierarchical clustering, as the name indicates, is based on building a hierarchy of clusters. There are two types of hierarchical clustering: agglomerative (bottom-up; HAC or AGNES) and divisive (top-down; DIANA). The main difference is that in agglomerative clustering each observation starts in its own cluster and clusters are successively merged, whereas in divisive clustering all observations start in one cluster which is then recursively split into smaller clusters, down to one cluster per observation.
In order to cluster the data agglomeratively, one may use the stats::hclust() or cluster::agnes() functions; I will use the first one. To do this, the dissimilarity matrix has to be computed first, e.g. with the dist() function. Inside hclust(), one must also specify the linkage method. There are several options, including single linkage, complete linkage, average linkage and Ward's method. With single linkage, the distance between clusters is the distance between the closest members of the two clusters. With complete linkage, it is the distance between the most dissimilar members of the two clusters. Average linkage uses the average distance from any data point in one cluster to any data point in the other cluster. Ward's method, in turn, aims to minimize the total within-cluster variance, which measures the compactness of the clustering.
My approach will be to run the clustering with the four above-mentioned linkage methods and to compare the results with the class label assigned to each observation.
The following plots show the dendrograms of the agglomerative hierarchical clustering algorithm, with three clusters assumed beforehand.
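The calls below operate on a dissimilarity matrix dm, which is not defined explicitly in the chunks above; I therefore compute it here from the scaled data, assuming the Euclidean distance (consistent with hc_metric = "euclidean" used in the calls).
dm <- dist(wine, method = "euclidean") # dissimilarity matrix used by the hierarchical and fuzzy algorithms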
Single linkage
hc <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
plot(hc, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc, k=3, border='red')
Efficiency testing:
clusterCut <- cutree(hc, 3)
table(clusterCut, wine_data$V1)
##
## clusterCut 1 2 3
## 1 59 69 48
## 2 0 1 0
## 3 0 1 0
Complete linkage
hc1 <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(hc1, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc1, k=3, border='red')
Efficiency testing:
clusterCut1 <- cutree(hc1, 3)
table(clusterCut1, wine_data$V1)
##
## clusterCut1 1 2 3
## 1 33 12 2
## 2 26 56 0
## 3 0 3 46
Average linkage
hc2 <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
plot(hc2, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc2, k=3, border='red')
Efficiency testing:
clusterCut2 <- cutree(hc2, 3)
table(clusterCut2, wine_data$V1)
##
## clusterCut2 1 2 3
## 1 59 61 2
## 2 0 9 0
## 3 0 1 46
Ward’s method
hc3 <- eclust(dm, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
plot(hc3, cex=0.6, hang=-1, main = "Dendrogram of HAC")
rect.hclust(hc3, k=3, border='red')
Efficiency testing:
clusterCut3 <- cutree(hc3, 3)
table(clusterCut3, wine_data$V1)
##
## clusterCut3 1 2 3
## 1 57 0 0
## 2 2 70 1
## 3 0 1 47
It turns out that Ward's method is the most accurate: only four observations were assigned to the wrong cluster. Further analysis will therefore be conducted using Ward's method. In order to learn more about the partition, I will run the cluster.stats() function and display some basic statistics of the clustering object.
library(fpc)
hc_stats <- cluster.stats(dm, hc3$cluster)
hc_stats$cluster.size # number of observations per cluster
## [1] 57 73 48
hc_stats$within.cluster.ss # within cluster sum of squares
## [1] 1306.373
hc_stats$avg.silwidth # average silhouette width
## [1] 0.2750407
hc_stats$clus.avg.silwidths # average silhouette widths for each cluster
## 1 2 3
## 0.4069831 0.1036613 0.3789986
The algorithm created three clusters containing 57, 73 and 48 observations. The within-cluster sum of squares is equal to 1306.37, the average silhouette width is 0.275, and the average silhouette widths for the individual clusters are 0.41, 0.10 and 0.38.
Another valuable statistic is the Dunn index, a relative criterion of clustering quality. A higher Dunn index indicates better clustering, i.e. clusters that are compact and well separated from each other. It can be calculated by dividing the minimum separation by the maximum cluster diameter; both values are available in the cluster.stats() output.
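Formally, the Dunn index is the ratio of the smallest distance between observations in different clusters to the largest within-cluster diameter:
\[D = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}\]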
hc_stats$min.separation
## [1] 1.730825
hc_stats$max.diameter
## [1] 11.17996
dunn <- hc_stats$min.separation / hc_stats$max.diameter
cat("Dunn Index is equal to", round(dunn, 2))
## Dunn Index is equal to 0.15
One can also calculate the agglomerative coefficient, which measures the strength of the clustering structure found. Values closer to 1 indicate a strong, tight clustering structure, while lower values indicate less well-formed clusters.
library(cluster)
cat("Agglomerative coefficient is equal to", round(coef.hclust(hc3), 2))
## Agglomerative coefficient is equal to 0.96
As to divisive hierarchical clustering, it can be implemented with the cluster::diana() function. Here, the only parameter that has to be set is the number of clusters.
hc4 <- eclust(dm, k=3, FUNcluster="diana")
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc4, k=3, border='red')
clusterCut4 <- cutree(hc4, 3)
table(clusterCut4, wine_data$V1)
##
## clusterCut4 1 2 3
## 1 54 20 0
## 2 5 7 0
## 3 0 44 48
It is clearly visible that the algorithm performed worse than the best agglomerative approach. This is also confirmed by the following statistics, the same ones that were calculated for the clustering object obtained with Ward's method.
hc_stats1 <- cluster.stats(dm, hc4$cluster)
hc_stats1$cluster.size # number of observations per cluster
## [1] 74 12 92
hc_stats1$within.cluster.ss # within cluster sum of squares
## [1] 1613.311
hc_stats1$avg.silwidth # average silhouette width
## [1] 0.2129193
hc_stats1$clus.avg.silwidths # average silhouette widths for each cluster
## 1 2 3
## 0.2893495 -0.1487402 0.1986159
dunn <- hc_stats1$min.separation / hc_stats1$max.diameter
cat("Dunn Index is equal to", round(dunn, 2))
## Dunn Index is equal to 0.14
cat("Divisive coefficient is equal to", round(coef(hc4), 2))
## Divisive coefficient is equal to 0.83
As before, the last step of the assessment will be stability validation using the fpc::clusterboot() function. In this case, I chose hclustCBI as the clustering method, which is an interface to the hclust function with an optional noise cluster. As the linkage method, I chose Ward's method, since it was the best-performing option. The number of clusters remains the same.
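Again, the exact clusterboot() call is not shown in the original; a call of roughly this form (seed and count arguments assumed) would produce the cboot.hclust1 object inspected below.
cboot.hclust1 <- clusterboot(wine, B = 100,
                             clustermethod = hclustCBI, # hclust interface with optional noise cluster
                             method = "ward.D2",        # Ward's linkage, the best-performing option
                             k = 3,                     # cut the tree into three clusters
                             seed = 123,                # assumed seed for reproducibility
                             count = FALSE)             # suppress per-run progress output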
cboot.hclust1$bootmean
## [1] 0.8398482 0.9301395 0.7649944
The clusterwise means of the bootstrap result are equal to 0.84, 0.93 and 0.76. We can say that the first two clusters seem to be highly stable, while the third cluster is only moderately stable.
cboot.hclust1$bootbrd
## [1] 1 0 3
As to the bootbrd component, out of 100 bootstrap iterations the first cluster was dissolved once, the second cluster was not dissolved at all and the third cluster was dissolved 3 times.
cboot.hclust1$bootrecover
## [1] 79 98 59
As to successful recovery, out of 100 bootstrap iterations the first cluster was recovered 79 times, the second cluster 98 times and the third cluster only 59 times.
Fuzzy clustering is an algorithm in which each element has a degree of membership in every cluster, rather than the binary membership used in hard clustering algorithms. Thus, the result is a membership degree for each observation in each cluster. Fuzzy clustering can be implemented in R with the cluster::fanny() function, in which one has to specify at least three parameters: the number of clusters, the membership exponent and the metric used for calculating dissimilarities. In our case, the prediagnostics indicated that the optimal number of clusters for fuzzy clustering is two; since the other algorithms used three clusters, I will run the algorithm with both two and three clusters in order to check its accuracy. As to the membership exponent, I will try three values (1.2, 1.5 and 2) in order to see the differences between the outputs and decide which one to choose. The metric will be the Euclidean distance.
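For reference, cluster::fanny() looks for membership degrees u_iv (observation i, cluster v) that minimize an objective of roughly the following form, where r is the membership exponent (memb.exp) and d(i, j) is the dissimilarity between observations i and j:
\[\sum_{v=1}^{k} \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} u_{iv}^{\,r}\, u_{jv}^{\,r}\, d(i,j)}{2 \sum_{j=1}^{n} u_{jv}^{\,r}}\]
Larger values of r flatten the memberships towards equal degrees, which is exactly what happens below for memb.exp = 2.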
library(cluster)
clust_fanny <- fanny(dm, k=2, diss=TRUE, memb.exp = 1.2, metric = "euclidean")
head(clust_fanny$membership, n=10)
## [,1] [,2]
## [1,] 0.9878966 0.012103415
## [2,] 0.9678145 0.032185465
## [3,] 0.9846820 0.015318021
## [4,] 0.9768821 0.023117889
## [5,] 0.8855022 0.114497847
## [6,] 0.9886276 0.011372388
## [7,] 0.9870559 0.012944070
## [8,] 0.9736858 0.026314166
## [9,] 0.9819113 0.018088673
## [10,] 0.9937507 0.006249258
library(cluster)
clust_fanny1 <- fanny(dm, k=3, diss=TRUE, memb.exp = 1.2, metric = "euclidean")
head(clust_fanny1$membership, n=10)
## [,1] [,2] [,3]
## [1,] 0.9881048 0.009546268 0.002348908
## [2,] 0.8739605 0.114163537 0.011876000
## [3,] 0.9714936 0.024164061 0.004342341
## [4,] 0.9834670 0.011232010 0.005300994
## [5,] 0.7695419 0.181498124 0.048959974
## [6,] 0.9946211 0.003829950 0.001548986
## [7,] 0.9885032 0.009110240 0.002386550
## [8,] 0.9796753 0.014452082 0.005872613
## [9,] 0.9757306 0.019652784 0.004616626
## [10,] 0.9905939 0.008155433 0.001250690
clust_fanny2 <- fanny(dm, k=2, diss=TRUE, memb.exp = 1.5, metric = "euclidean")
head(clust_fanny2$membership, n=10)
## [,1] [,2]
## [1,] 0.7644276 0.2355724
## [2,] 0.7166768 0.2833232
## [3,] 0.7546862 0.2453138
## [4,] 0.7311154 0.2688846
## [5,] 0.6369950 0.3630050
## [6,] 0.7693791 0.2306209
## [7,] 0.7664869 0.2335131
## [8,] 0.7300331 0.2699669
## [9,] 0.7485009 0.2514991
## [10,] 0.7983968 0.2016032
clust_fanny3 <- fanny(dm, k=3, diss=TRUE, memb.exp = 1.5, metric = "euclidean")
head(clust_fanny3$membership, n=10)
## [,1] [,2] [,3]
## [1,] 0.6768370 0.2052493 0.11791361
## [2,] 0.5431846 0.3085460 0.14826937
## [3,] 0.6372803 0.2363846 0.12633518
## [4,] 0.6478315 0.2079265 0.14424198
## [5,] 0.4596641 0.3307317 0.20960427
## [6,] 0.7129983 0.1742018 0.11279991
## [7,] 0.6806755 0.2023862 0.11693833
## [8,] 0.6328367 0.2214725 0.14569080
## [9,] 0.6413502 0.2280630 0.13058685
## [10,] 0.7118584 0.1930817 0.09505991
clust_fanny4 <- fanny(dm, k=2, diss=TRUE, memb.exp = 2, metric = "euclidean")
head(clust_fanny4$membership, n=10)
## [,1] [,2]
## [1,] 0.5 0.5
## [2,] 0.5 0.5
## [3,] 0.5 0.5
## [4,] 0.5 0.5
## [5,] 0.5 0.5
## [6,] 0.5 0.5
## [7,] 0.5 0.5
## [8,] 0.5 0.5
## [9,] 0.5 0.5
## [10,] 0.5 0.5
clust_fanny5 <- fanny(dm, k=3, diss=TRUE, memb.exp = 2, metric = "euclidean")
head(clust_fanny5$membership, n=10)
## [,1] [,2] [,3]
## [1,] 0.3333334 0.3333333 0.3333333
## [2,] 0.3333334 0.3333333 0.3333333
## [3,] 0.3333334 0.3333333 0.3333333
## [4,] 0.3333334 0.3333333 0.3333333
## [5,] 0.3333334 0.3333333 0.3333333
## [6,] 0.3333334 0.3333333 0.3333333
## [7,] 0.3333334 0.3333333 0.3333333
## [8,] 0.3333334 0.3333333 0.3333333
## [9,] 0.3333334 0.3333333 0.3333333
## [10,] 0.3333334 0.3333333 0.3333333
Based on the above output, one can see that as the membership exponent increases, the memberships become more and more diffuse; with a membership exponent of 2 the membership degrees are practically equal across clusters.
a <- fviz_silhouette(clust_fanny)
## cluster size ave.sil.width
## 1 1 88 0.28
## 2 2 90 0.24
b <- fviz_silhouette(clust_fanny2)
## cluster size ave.sil.width
## 1 1 88 0.28
## 2 2 90 0.24
c <- fviz_silhouette(clust_fanny4)
## cluster size ave.sil.width
## 1 1 89 0.28
## 2 2 89 0.24
grid.arrange(a, b, c, ncol=1)
d <- fviz_silhouette(clust_fanny1)
## cluster size ave.sil.width
## 1 1 61 0.36
## 2 2 66 0.16
## 3 3 51 0.35
e <- fviz_silhouette(clust_fanny3)
## cluster size ave.sil.width
## 1 1 62 0.36
## 2 2 63 0.17
## 3 3 53 0.34
f <- fviz_silhouette(clust_fanny5)
## cluster size ave.sil.width
## 1 1 89 0.28
## 2 2 89 0.24
grid.arrange(d, e, f, ncol=1)
Changing the membership exponent and the number of clusters changes the silhouette statistic. The highest silhouette is obtained for three clusters and a membership exponent of 1.2. Thus, the next step will be testing the clustering accuracy of the clust_fanny1 object.
table(clust_fanny1$clustering, wine_data$V1)
##
## 1 2 3
## 1 59 2 0
## 2 0 66 0
## 3 0 3 48
Unfortunately, I did not find a suitable implementation of bootstrapping for the fuzzy algorithm, analogous to what was done in the previous part of the study. However, assuming that fuzzy clustering is still weaker than hierarchical clustering in terms of efficiency, and that PAM is very strong in terms of stability, I will keep these algorithms as the leaders in their respective fields. Nevertheless, I will keep looking for the right tool (or, in the end, write my own function) in order to conduct a valid validation; a possible starting point is sketched below.
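Should one want to close this gap, clusterboot() accepts any clustering function that follows its CBI interface (a list with result, nc, clusterlist, partition and clustermethod components), so a small wrapper around fanny() could be written. The sketch below is my own untested assumption of what such a wrapper might look like, not code from the original study.
# hypothetical CBI-style wrapper around fanny() for use with fpc::clusterboot()
fannyCBI <- function(data, k, ...) {
  res <- cluster::fanny(data, k = k, ...)
  partition <- res$clustering
  list(result = res,
       nc = k,
       clusterlist = lapply(seq_len(k), function(j) partition == j), # one logical vector per cluster
       partition = partition,
       clustermethod = "fanny")
}
# Example usage (parameters assumed):
# cboot.fanny <- clusterboot(wine, B = 100, clustermethod = fannyCBI,
#                            k = 3, memb.exp = 1.2, seed = 123, count = FALSE)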
The best algorithm in terms of efficiency turned out to be agglomerative hierarchical clustering with Ward's linkage. In terms of stability, however, the best of those that were validated is PAM with the Euclidean distance. The choice of a particular algorithm and of the properties it should have is usually driven by business needs, and there is usually a trade-off, as in this case. Thus, I will not point to a single best algorithm, since the choice is subjective.