We have 210 observations on 7 continuous variables divided into 3 classes of 70 observations each.
## species
## 1 2 3
## 70 70 70
We fit a Gaussian mixture model for different values of k (the number of clusters) and examine the Gap statistic as a function of k.
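A minimal sketch of this step, assuming the seven measurements are stored in a numeric matrix `X` (a hypothetical name). `clusGap` expects a clustering function that returns a list with a `cluster` component, so we wrap `Mclust`:

```r
library(mclust)
library(cluster)

# Wrap Mclust so clusGap can call it as FUN(x, k)
mclustFUN <- function(x, k) {
  list(cluster = Mclust(x, G = k, verbose = FALSE)$classification)
}

# Gap statistic for the GMM over k = 1..9
gap_gmm <- clusGap(X, FUN = mclustFUN, K.max = 9, B = 50)
plot(gap_gmm, main = "Gap statistic for the GMM")
```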
Next we plot the Silhouette index as a function of k.
We then plot the Dunn index as a function of k.
We see that the Silhouette index gives a good estimate of the number of clusters, while the Dunn index and the Gap statistic both overestimate it.
Fitting the model with 3 and with 9 clusters, we see that the 3-cluster model has a much better BIC (66.0 versus -208.6).
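A sketch of the two fits that produce the summaries below, reusing the assumed matrix `X`:

```r
fit3 <- Mclust(X, G = 3)  # 3-component Gaussian mixture
fit9 <- Mclust(X, G = 9)  # 9-component Gaussian mixture
summary(fit3)
summary(fit9)
```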
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## 286.9906 210 95 66.00606 61.46508
##
## Clustering table:
## 1 2 3
## 76 69 65
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
## log-likelihood n df BIC ICL
## 630.9067 210 275 -208.6411 -214.0946
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 36 15 37 6 47 33 12 10 14
We can visualise the pairwise scatterplots to see the clustering obtained with 3 components.
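With mclust this is a single call on the 3-component fit (`fit3` from the sketch above):

```r
# Pairwise scatterplots of the 7 variables, coloured by fitted component
plot(fit3, what = "classification")
```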
We now compute the Gap and Silhouette statistics for k-means. The Gap statistic is maximised at k = 3, while the Silhouette index is maximised at k = 2.
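A sketch of both computations, again with the assumed matrix `X`:

```r
# Gap statistic for k-means over k = 1..9
gap_km <- clusGap(X, FUN = kmeans, nstart = 25, K.max = 9, B = 50)
plot(gap_km, main = "Gap statistic for k-means")

# Average silhouette width for k = 2..9
d <- dist(X)
avg_sil <- sapply(2:9, function(k) {
  cl <- kmeans(X, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
plot(2:9, avg_sil, type = "b", xlab = "k", ylab = "average silhouette width")
```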
Turning to the Dunn score, we see that it is optimal at 3 clusters, while the Silhouette score again points to 2 clusters.
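The table below has the shape of a clValid internal-validation summary; a minimal sketch of such a run, assuming `X` again (the analogous call with `clMethods = "pam"` gives the PAM table further down):

```r
library(clValid)
cv_km <- clValid(X, nClust = 2:9, clMethods = "kmeans",
                 validation = "internal")
summary(cv_km)
```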
##
## Clustering Methods:
## kmeans
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9
##
## Validation Measures:
## 2 3 4 5 6 7 8 9
##
## kmeans Connectivity 16.8964 39.5476 63.6409 79.9706 98.7131 97.6437 101.8389 103.5853
## Dunn 0.0870 0.1188 0.0888 0.0814 0.0814 0.0986 0.0590 0.0590
## Silhouette 0.4658 0.4007 0.3379 0.2881 0.2740 0.2837 0.2686 0.2687
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 16.8964 kmeans 2
## Dunn 0.1188 kmeans 3
## Silhouette 0.4658 kmeans 2
So we perform k-means clustering with 3 clusters and visualise the clusters in two dimensions.
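A sketch: fit k-means with 3 centres and project the data onto the first two principal components for a 2-D view:

```r
km <- kmeans(X, centers = 3, nstart = 25)
pc <- prcomp(X, scale. = TRUE)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means (k = 3) on the first two PCs")
```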
We compute the Gap and Silhouette statistics for PAM (k-medoids), which suggest 3 and 2 clusters respectively.
Looking at the Dunn and Silhouette scores, the Silhouette index again suggests 2 clusters and the Dunn index suggests 3.
##
## Clustering Methods:
## pam
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9
##
## Validation Measures:
## 2 3 4 5 6 7 8 9
##
## pam Connectivity 28.5028 41.4413 55.4944 73.2921 96.9325 103.9544 116.5790 128.4750
## Dunn 0.0588 0.1115 0.1114 0.0816 0.0817 0.0711 0.0839 0.0840
## Silhouette 0.4648 0.3982 0.3335 0.2678 0.2336 0.2401 0.2488 0.2345
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 28.5028 pam 2
## Dunn 0.1115 pam 3
## Silhouette 0.4648 pam 2
So we fit PAM with 3 clusters and obtain the clustering and the plot.
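A sketch using the cluster package; `clusplot` draws the clusters in the plane of the first two principal components:

```r
pm <- pam(X, k = 3)
clusplot(pm, main = "PAM with k = 3")
```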
We now turn to hierarchical clustering, starting with single linkage. Checking the Gap and Silhouette statistics, we obtain suggested cluster numbers of 3 and 2 respectively.
So we cut the tree into three clusters and plot the dendrogram. The result is quite poor.
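A sketch of the single-linkage fit and the 3-cluster cut:

```r
hc_single <- hclust(dist(X), method = "single")
plot(hc_single, main = "Single linkage dendrogram")
cl_single <- cutree(hc_single, k = 3)
```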
Checking the confusion matrix confirms this: nearly all observations are placed in a single cluster.
##
## species 1 2 3
## 1 68 1 1
## 2 70 0 0
## 3 68 2 0
We repeat the analysis with a different linkage and check the Gap and Silhouette statistics: the Gap statistic suggests 3 clusters as optimal and the Silhouette index suggests 2.
We also check the Dunn index.
Choosing 3 as the number of clusters, we obtain the clustering and the dendrogram. This time the dendrogram is much more reasonable.
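A sketch of this step; the report does not name the second linkage, so complete linkage is assumed here purely for illustration:

```r
hc2 <- hclust(dist(X), method = "complete")  # assumed linkage
plot(hc2, main = "Dendrogram (assumed complete linkage)")
cl_hc2 <- cutree(hc2, k = 3)
```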
The k-nearest-neighbour distance plot shows a knee at around 1.1, so we set eps = 1.1.
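The knee is typically read off a k-nearest-neighbour distance plot; a sketch with the dbscan package, using k equal to the minPts chosen below:

```r
library(dbscan)
kNNdistplot(X, k = 17)
abline(h = 1.1, lty = 2)  # knee at roughly 1.1
```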
We run DBSCAN with eps = 1.1 and minPts = 17 and obtain 3 predicted clusters, along with 42 noise points.
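A sketch of the call that produces the output below:

```r
db <- dbscan(X, eps = 1.1, minPts = 17)
db
```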
## DBSCAN clustering for 210 objects.
## Parameters: eps = 1.1, minPts = 17
## The clustering contains 3 cluster(s) and 42 noise points.
##
## 0 1 2 3
## 42 116 17 35
##
## Available fields: cluster, eps, minPts
We plot the predicted clusters.
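dbscan's `hullplot` is one convenient way to draw them (noise points appear in black):

```r
hullplot(X, db)  # convex hulls around the predicted clusters
```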
Using the same parameter values as for DBSCAN, we run OPTICS and obtain the reachability plot.
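A sketch using the OPTICS implementation in the dbscan package:

```r
op <- optics(X, eps = 1.1, minPts = 17)
plot(op)                                # reachability plot
res <- extractDBSCAN(op, eps_cl = 1.1)  # DBSCAN-style clustering from OPTICS
```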
We look at the clustering obtained.
Most methods identify 3 as the optimal number of clusters.
Hierarchical clustering with single linkage gave the worst predictions.
k-means performed very well; we tabulate its predictions against the true labels.
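A sketch of the tabulation; in this run the k-means labels happen to line up with the class labels, so the diagonal of the table gives the accuracy directly (`species` is the label vector, `km` the k-means fit from above):

```r
tab <- table(species, km$cluster)
tab
acc <- sum(diag(tab)) / sum(tab)
cat("Accuracy of kmeans =", round(100 * acc, 1), "%\n")
```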
##
## species 1 2 3
## 1 62 2 6
## 2 5 65 0
## 3 4 0 66
## Accuracy of kmeans = 91.9 %