This mini-project is based on the K-Means exercise from ‘R in Action’. Go here for the original blog post and solutions http://www.r-bloggers.com/k-means-clustering-from-r-in-action/.

Exercise 0: Install these packages if you don’t have them already

# install.packages(c("cluster", "rattle","NbClust"))
library(cluster)
library(rattle)
library(NbClust)

Now load the data and look at the first few rows

data(wine, package="rattle")
head(wine)
##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
## 4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
## 5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
## 6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185
## 4          0.24            2.18  7.80 0.86     3.45    1480
## 5          0.39            1.82  4.32 1.04     2.93     735
## 6          0.34            1.97  6.75 1.05     2.85    1450

Exercise 1: Remove the first column from the data and scale it using the scale() function

df <- scale(wine[,-1])

Now we’d like to cluster the data using K-Means. How do we decide how many clusters to use if you don’t know that already? We’ll try two methods.

Method 1: A plot of the total within-groups sums of squares against the number of clusters in a K-means solution can be helpful. A bend in the graph can suggest the appropriate number of clusters.

wssplot <- function(data, nc=15, seed=1234){
                  wss <- (nrow(data)-1)*sum(apply(data,2,var))
                      for (i in 2:nc){
                set.seed(seed)
                    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
              plot(1:nc, wss, type="b", xlab="Number of Clusters",
                            ylab="Within groups sum of squares")
              wss
       }

wssplot(df)

##  [1] 2301.0000 1649.4400 1270.7491 1168.6143 1098.7390 1039.2957  977.5410
##  [8]  952.5328  920.4558  883.7607  846.7963  806.6972  744.7018  729.1297
## [15]  702.2454

Exercise 2:

There is a distinct drop in within groups sum of squares when moving from 1 to 3 clusters. After three clusters, this decrease drops off, suggesting that a 3-cluster solution may be a good fit to the data.

Method 2: Use the NbClust library, which runs many experiments and gives a distribution of potential number of clusters.

library(NbClust)
set.seed(1234)
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 15 proposed 3 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 12 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
barplot(table(nc$Best.n[1,]),
              xlab="Numer of Clusters", ylab="Number of Criteria",
                    main="Number of Clusters Chosen by 26 Criteria")
table(nc$Best.n[1,])
## 
##  0  1  2  3 10 12 14 15 
##  2  1  4 15  1  1  1  1

Exercise 3: How many clusters does this method suggest?

14 of 24 criteria provided by the NbClust package suggest a 3-cluster solution.

Exercise 4: Once you’ve picked the number of clusters, run k-means using this number of clusters. Output the result of calling kmeans() into a variable fit.km

set.seed(1234)
fit.km <- kmeans(df, centers=3,  nstart=25)
fit.km$size
## [1] 62 65 51

Now we want to evaluate how well this clustering does.

Exercise 5: using the table() function, show how the clusters in fit.km\(clusters compares to the actual wine types in wine\)Type. Would you consider this a good clustering?

table(fit.km$cluster,wine$Type)
##    
##      1  2  3
##   1 59  3  0
##   2  0 65  0
##   3  0  3 48

Exercise 6:

clusplot(pam(df,3))