We have 210 observations on 7 continuous variables divided into 3 classes of 70 observations each.
## species
## 1 2 3
## 70 70 70
We fit a Gaussian mixture model for different values of k (the number of clusters) and examine the Gap statistic as a function of k.
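A minimal sketch of this step, assuming the seven measurements are stored in a numeric matrix `X` (a hypothetical name). `clusGap` expects a clustering function that returns a list with a `cluster` component, so we wrap `Mclust`:

```r
library(mclust)
library(cluster)

# Wrap Mclust so clusGap can call it as FUN(x, k)
mclustFUN <- function(x, k) {
  list(cluster = Mclust(x, G = k, verbose = FALSE)$classification)
}

# Gap statistic for the GMM over k = 1..9
gap_gmm <- clusGap(X, FUN = mclustFUN, K.max = 9, B = 50)
plot(gap_gmm, main = "Gap statistic for the GMM")
```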
Next we plot the Silhouette index as a function of k.
We then plot the Dunn index as a function of k.
We see that the Silhouette index gives a good estimate of the number of clusters, while the Dunn index and the Gap statistic both overestimate it.
Fitting the model with 3 and with 9 clusters, we see that the 3-cluster model has a much better BIC (66.0 versus -208.6).
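A sketch of the two fits that produce the summaries below, reusing the assumed matrix `X`:

```r
fit3 <- Mclust(X, G = 3)  # 3-component Gaussian mixture
fit9 <- Mclust(X, G = 9)  # 9-component Gaussian mixture
summary(fit3)
summary(fit9)
```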
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## 286.9906 210 95 66.00606 61.46508
##
## Clustering table:
## 1 2 3
## 76 69 65
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
## log-likelihood n df BIC ICL
## 630.9067 210 275 -208.6411 -214.0946
##
## Clustering table:
## 1 2 3 4 5 6 7 8 9
## 36 15 37 6 47 33 12 10 14
We can visualise the pairwise scatterplots to see the clustering obtained with 3 components.
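With mclust this is a single call on the 3-component fit (`fit3` from the sketch above):

```r
# Pairwise scatterplots of the 7 variables, coloured by fitted component
plot(fit3, what = "classification")
```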
We now compute the Gap and Silhouette statistics for k-means. The Gap statistic is maximised at k = 3, while the Silhouette index is maximised at k = 2.
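A sketch of both computations, again with the assumed matrix `X`:

```r
# Gap statistic for k-means over k = 1..9
gap_km <- clusGap(X, FUN = kmeans, nstart = 25, K.max = 9, B = 50)
plot(gap_km, main = "Gap statistic for k-means")

# Average silhouette width for k = 2..9
d <- dist(X)
avg_sil <- sapply(2:9, function(k) {
  cl <- kmeans(X, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
plot(2:9, avg_sil, type = "b", xlab = "k", ylab = "average silhouette width")
```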
Turning to the Dunn score, we see that it is optimal at 3 clusters, while the Silhouette score again points to 2 clusters.
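The table below has the shape of a clValid internal-validation summary; a minimal sketch of such a run, assuming `X` again (the analogous call with `clMethods = "pam"` gives the PAM table further down):

```r
library(clValid)
cv_km <- clValid(X, nClust = 2:9, clMethods = "kmeans",
                 validation = "internal")
summary(cv_km)
```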
##
## Clustering Methods:
## kmeans
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9
##
## Validation Measures:
## 2 3 4 5 6 7 8 9
##
## kmeans Connectivity 16.8964 39.5476 63.6409 79.9706 98.7131 97.6437 101.8389 103.5853
## Dunn 0.0870 0.1188 0.0888 0.0814 0.0814 0.0986 0.0590 0.0590
## Silhouette 0.4658 0.4007 0.3379 0.2881 0.2740 0.2837 0.2686 0.2687
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 16.8964 kmeans 2
## Dunn 0.1188 kmeans 3
## Silhouette 0.4658 kmeans 2
So we perform k-means clustering with 3 clusters and visualise the clusters in two dimensions.
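A sketch: fit k-means with 3 centres and project the data onto the first two principal components for a 2-D view:

```r
km <- kmeans(X, centers = 3, nstart = 25)
pc <- prcomp(X, scale. = TRUE)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means (k = 3) on the first two PCs")
```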
We compute the Gap and Silhouette statistics for PAM (k-medoids), which suggest 3 and 2 clusters respectively.
Looking at the Dunn and Silhouette scores, the Silhouette index again suggests 2 clusters and the Dunn index suggests 3.
##
## Clustering Methods:
## pam
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9
##
## Validation Measures:
## 2 3 4 5 6 7 8 9
##
## pam Connectivity 28.5028 41.4413 55.4944 73.2921 96.9325 103.9544 116.5790 128.4750
## Dunn 0.0588 0.1115 0.1114 0.0816 0.0817 0.0711 0.0839 0.0840
## Silhouette 0.4648 0.3982 0.3335 0.2678 0.2336 0.2401 0.2488 0.2345
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 28.5028 pam 2
## Dunn 0.1115 pam 3
## Silhouette 0.4648 pam 2
So we fit PAM with 3 clusters and obtain the clustering and the plot.
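A sketch using the cluster package; `clusplot` draws the clusters in the plane of the first two principal components:

```r
pm <- pam(X, k = 3)
clusplot(pm, main = "PAM with k = 3")
```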
We now turn to hierarchical clustering, starting with single linkage. Checking the Gap and Silhouette statistics, we obtain suggested cluster numbers of 3 and 2 respectively.
So we cut the tree into three clusters and plot the dendrogram. The result is quite poor.
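A sketch of the single-linkage fit and the 3-cluster cut:

```r
hc_single <- hclust(dist(X), method = "single")
plot(hc_single, main = "Single linkage dendrogram")
cl_single <- cutree(hc_single, k = 3)
```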
Checking the confusion matrix confirms this: nearly all observations are placed in a single cluster.
##
## species 1 2 3
## 1 68 1 1
## 2 70 0 0
## 3 68 2 0
We repeat the analysis with a different linkage and check the Gap and Silhouette statistics: the Gap statistic suggests 3 clusters as optimal and the Silhouette index suggests 2.
We also check the Dunn index.
Choosing 3 as the number of clusters, we obtain the clustering and the dendrogram. This time the dendrogram is much more reasonable.
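A sketch of this step; the report does not name the second linkage, so complete linkage is assumed here purely for illustration:

```r
hc2 <- hclust(dist(X), method = "complete")  # assumed linkage
plot(hc2, main = "Dendrogram (assumed complete linkage)")
cl_hc2 <- cutree(hc2, k = 3)
```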
The k-nearest-neighbour distance plot shows a knee at around 1.1, so we set eps = 1.1.
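The knee is typically read off a k-nearest-neighbour distance plot; a sketch with the dbscan package, using k equal to the minPts chosen below:

```r
library(dbscan)
kNNdistplot(X, k = 17)
abline(h = 1.1, lty = 2)  # knee at roughly 1.1
```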
We run DBSCAN with eps = 1.1 and minPts = 17 and obtain 3 predicted clusters, along with 42 noise points.
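A sketch of the call that produces the output below:

```r
db <- dbscan(X, eps = 1.1, minPts = 17)
db
```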
## DBSCAN clustering for 210 objects.
## Parameters: eps = 1.1, minPts = 17
## The clustering contains 3 cluster(s) and 42 noise points.
##
## 0 1 2 3
## 42 116 17 35
##
## Available fields: cluster, eps, minPts
We plot the predicted clusters.
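dbscan's `hullplot` is one convenient way to draw them (noise points appear in black):

```r
hullplot(X, db)  # convex hulls around the predicted clusters
```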
Using the same parameter values as for DBSCAN, we run OPTICS and obtain the reachability plot.
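A sketch using the OPTICS implementation in the dbscan package:

```r
op <- optics(X, eps = 1.1, minPts = 17)
plot(op)                                # reachability plot
res <- extractDBSCAN(op, eps_cl = 1.1)  # DBSCAN-style clustering from OPTICS
```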
We look at the clustering obtained.
Most methods identify 3 as the optimal number of clusters.
Hierarchical clustering with single linkage gave the worst predictions.
k-means performed very well; we tabulate its predictions against the true labels.
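A sketch of the tabulation; in this run the k-means labels happen to line up with the class labels, so the diagonal of the table gives the accuracy directly (`species` is the label vector, `km` the k-means fit from above):

```r
tab <- table(species, km$cluster)
tab
acc <- sum(diag(tab)) / sum(tab)
cat("Accuracy of kmeans =", round(100 * acc, 1), "%\n")
```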
##
## species 1 2 3
## 1 62 2 6
## 2 5 65 0
## 3 4 0 66
## Accuracy of kmeans = 91.9 %