Clustering is a technique in machine learning that attempts to find groups or clusters of observations within a dataset.

The goal is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.

K-means clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm just tries to find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:

  1. Reassign data points to the cluster whose centroid is closest.
  2. Calculate the new centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
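For illustration, here is a minimal sketch of this two-step iteration in R. The helper naive_kmeans is hypothetical and deliberately simplified (for instance, it does not handle clusters that become empty); in practice we use the built-in kmeans(), as below.

naive_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # randomly assign each observation to one of the k clusters
  cl <- sample(rep_len(1:k, nrow(X)))
  for (i in seq_len(max_iter)) {
    # step 2: centroid of each cluster (column means of its members)
    centers <- t(sapply(1:k, function(j) colMeans(X[cl == j, , drop = FALSE])))
    # step 1: squared Euclidean distance from every point to every centroid
    d2 <- sapply(1:k, function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    new_cl <- max.col(-d2)            # index of the closest centroid
    if (identical(new_cl, cl)) break  # converged: no point changed cluster
    cl <- new_cl
  }
  list(cluster = cl, centers = centers,
       tot.withinss = sum(d2[cbind(seq_len(nrow(X)), cl)]))
}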

Illustration:

The iris dataset contains data about sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:

library(datasets)
head(iris)
irisB <- iris
names(irisB)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
summary(irisB$Species)
##     setosa versicolor  virginica 
##         50         50         50

Petal.Length and Petal.Width are similar within each species but vary considerably between species, as the plot below shows.

str(irisB)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

Let us now try to cluster the data.

Since we know that there are 3 species involved, we ask the algorithm to group the data into 3 clusters, and since the starting assignments are random, we specify nstart = 30. This means that R will try 30 different random starting assignments and then select the one with the lowest within-cluster variation.
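As a quick, hedged illustration of what nstart does (exact numbers depend on the seed), compare the total within-cluster variation of a single random start with the best of 30:

set.seed(1)
kmeans(iris[, 3:4], 3, nstart = 1)$tot.withinss   # one random start
kmeans(iris[, 3:4], 3, nstart = 30)$tot.withinss  # best of 30 starts, typically no worse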

We can see the cluster centroids, the clusters that each data point was assigned to, and the within cluster variation.

set.seed(30)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 30)  # cluster on Petal.Length and Petal.Width
irisCluster
## K-means clustering with 3 clusters of sizes 52, 48, 50
## 
## Cluster means:
##   Petal.Length Petal.Width
## 1     4.269231    1.342308
## 2     5.595833    2.037500
## 3     1.462000    0.246000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [149] 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 13.05769 16.29167  2.02200
##  (between_SS / total_SS =  94.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Comparison:

table(irisCluster$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48         4
##   2      0          2        46
##   3     50          0         0

As we can see, the data belonging to the setosa species got grouped into cluster 3, versicolor mostly into cluster 1, and virginica mostly into cluster 2. The algorithm misclassified two data points belonging to versicolor and four belonging to virginica.
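As a rough sanity check (an illustrative sketch: cluster labels are arbitrary, so we match each cluster to its majority species):

tab <- table(irisCluster$cluster, iris$Species)
sum(apply(tab, 1, max)) / sum(tab)  # proportion correctly grouped: 144/150 = 0.96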

Plot to visualize the clusters:

irisCluster$cluster <- as.factor(irisCluster$cluster)  # factor, so ggplot uses discrete colors
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()

A simple ad-hoc way, clustering on all four measurements:

irisB$Species <- NULL              # drop the class label before clustering
kmeans.result <- kmeans(irisB, 3)  # k-means on all four measurements
kmeans.result
## K-means clustering with 3 clusters of sizes 38, 62, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.850000    3.073684     5.742105    2.071053
## 2     5.901613    2.748387     4.393548    1.433871
## 3     5.006000    3.428000     1.462000    0.246000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 23.87947 39.82097 15.15100
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The clustering result is then compared with the class label (Species):

table(iris$Species, kmeans.result$cluster)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor  2 48  0
##   virginica  36 14  0

Finally, the clusters and their centers are plotted (centers drawn as filled squares):

plot(irisB[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 15, cex = 2)  # cluster centers as filled squares


k-Medoids Clustering:


One of the most common forms of clustering is known as k-means clustering.

Unfortunately, this method can be influenced by outliers, so an alternative that is often used is k-medoids clustering.
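A tiny sketch with made-up data of why this matters: the mean that k-means uses is dragged toward an outlier, whereas a medoid must be an actual observation and stays central.

set.seed(1)
z <- c(rnorm(20), 100)                        # 20 ordinary points plus one extreme outlier
mean(z)                                       # the mean is pulled toward 100
z[which.min(colSums(abs(outer(z, z, "-"))))]  # the 1-medoid remains near 0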

We need the following packages:

library(fpc)
library(cluster)

We will use Partitioning Around Medoids (PAM).

Description:

Partitioning (clustering) of the data into k clusters “around medoids”, a more robust version of k-means.

By default, when medoids are not specified, the algorithm first looks for a good initial set of medoids (this is called the build phase).

Then it finds a local minimum for the objective function, that is, a solution such that there is no single switch of an observation with a medoid (i.e. a ‘swap’) that will decrease the objective (this is called the swap phase).

When the medoids are specified (or randomly generated), their order does not matter; in general, the algorithms have been designed to not depend on the order of the observations.

The pam-algorithm is based on the search for k representative objects or medoids among the observations of the dataset.

These observations should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid.

The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.
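The following is a naive sketch of this idea, illustration only: the helper naive_pam is hypothetical, and its build phase is simplified to a random start (cluster::pam uses a greedy build instead). It can be run on any dissimilarity matrix, e.g. one produced by dist() or daisy().

naive_pam <- function(D, k) {
  D <- as.matrix(D)          # pairwise dissimilarity matrix
  n <- nrow(D)
  med <- sort(sample(n, k))  # simplified "build": random initial medoids
  # objective: sum of dissimilarities from each point to its closest medoid
  cost <- function(m) sum(apply(D[, m, drop = FALSE], 1, min))
  repeat {                   # swap phase
    best <- list(cost = cost(med), med = med)
    for (i in seq_len(k)) for (j in setdiff(seq_len(n), med)) {
      cand <- replace(med, i, j)  # try swapping one medoid for a non-medoid
      if (cost(cand) < best$cost) best <- list(cost = cost(cand), med = cand)
    }
    if (identical(best$med, med)) break  # no improving swap: local minimum
    med <- best$med
  }
  list(medoids = med, clustering = apply(D[, med], 1, which.min))
}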

Compared to the k-means approach in kmeans, the function pam has the following features:

  1. it also accepts a dissimilarity matrix;

  2. it is more robust because it minimizes a sum of dissimilarities instead of a sum of squared euclidean distances;

  3. it provides a novel graphical display, the silhouette plot;

  4. it allows one to select the number of clusters using mean(silhouette(pr)[, "sil_width"]) (see the sketch below).
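For instance, here is a hedged sketch of choosing k this way on the iris measurements (irisB, with the Species column removed earlier; for a pam object, mean(silhouette(pr)[, "sil_width"]) is also available as $silinfo$avg.width):

avg_sil <- sapply(2:6, function(k) pam(irisB, k)$silinfo$avg.width)
(2:6)[which.max(avg_sil)]  # the k with the highest average silhouette width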

Illustrations:

  1. On synthetic data:
## generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
           cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
pamx <- pam(x, 2)
pamx 
## Medoids:
##      ID                     
## [1,]  3 -0.2558064 0.2002753
## [2,] 18  4.9238847 5.1347784
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Objective function:
##     build      swap 
## 0.6563598 0.6069484 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
summary(pamx)
## Medoids:
##      ID                     
## [1,]  3 -0.2558064 0.2002753
## [2,] 18  4.9238847 5.1347784
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Objective function:
##     build      swap 
## 0.6563598 0.6069484 
## 
## Numerical information per cluster:
##      size max_diss   av_diss diameter separation
## [1,]   10 1.689258 0.6720692 1.966291   5.485607
## [2,]   15 1.445219 0.5635345 2.327790   5.485607
## 
## Isolated clusters:
##  L-clusters: character(0)
##  L*-clusters: [1] 1 2
## 
## Silhouette plot information:
##    cluster neighbor sil_width
## 3        1        2 0.8946288
## 9        1        2 0.8929426
## 4        1        2 0.8829436
## 1        1        2 0.8748196
## 7        1        2 0.8695144
## 6        1        2 0.8674877
## 5        1        2 0.8617928
## 10       1        2 0.8494896
## 8        1        2 0.8391896
## 2        1        2 0.7691407
## 18       2        1 0.9117813
## 19       2        1 0.9096561
## 17       2        1 0.9070590
## 25       2        1 0.9053284
## 12       2        1 0.8994183
## 11       2        1 0.8994172
## 16       2        1 0.8971066
## 14       2        1 0.8935551
## 24       2        1 0.8910501
## 13       2        1 0.8833141
## 15       2        1 0.8658695
## 23       2        1 0.8610017
## 22       2        1 0.8452902
## 21       2        1 0.8094472
## 20       2        1 0.8040749
## Average silhouette width per cluster:
## [1] 0.8601949 0.8788913
## Average silhouette width of total data set:
## [1] 0.8714128
## 
## 300 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.08849 0.82446 3.90670 3.81250 6.74290 8.08030 
## Metric :  euclidean 
## Number of objects : 25
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
plot(pamx)

## use obs. 1 & 16 as starting medoids -- same result (typically)
(p2m <- pam(x, 2, medoids = c(1,16)))
## Medoids:
##      ID                     
## [1,]  3 -0.2558064 0.2002753
## [2,] 18  4.9238847 5.1347784
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Objective function:
##     build      swap 
## 0.7344326 0.6069484 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
## no _build_ *and* no _swap_ phase: just cluster all obs. around (1, 16):
p2.s <- pam(x, 2, medoids = c(1,16), do.swap = FALSE)
p2.s
## Medoids:
##      ID                     
## [1,]  1 -0.4577633 0.4231422
## [2,] 16  5.4257682 5.1306639
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Objective function:
##     build      swap 
## 0.7344326 0.7344326 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"


Tracing the algorithm, and starting from deliberately poor initial medoids:

p3m <- pam(x, 3, trace = 2)
## C pam(): computing 301 dissimilarities from  25 x 2  matrix: [Ok]
## pam()'s bswap(*, s=8.08029, pamonce=0): build 3 medoids:
##     new repr. 24
##     new repr. 3
##     new repr. 16
##   after build: medoids are  3 16 24
##    swp new 17 <-> 24 old; decreasing diss. 12.7992 by -0.391213
## end{bswap()}, end{cstat()}
(p3m. <- pam(x, 3, medoids = 3:1, trace = 1))
## C pam(): computing 301 dissimilarities from  25 x 2  matrix: [Ok]
## pam()'s bswap(*, s=8.08029, pamonce=0): medoids given;   after build: medoids are  1  2  3
## end{bswap()}, end{cstat()}
## Medoids:
##      ID                     
## [1,]  3 -0.2558064 0.2002753
## [2,]  6  0.4870282 0.5573946
## [3,] 18  4.9238847 5.1347784
## Clustering vector:
##  [1] 1 2 1 1 2 2 1 2 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## Objective function:
##     build      swap 
## 4.1042897 0.5128779 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
plot(p3m)

plot(p3m.)

Manhattan metric:

daisy: Dissimilarity Matrix Calculation:

Compute all the pairwise dissimilarities (distances) between observations in the data set.
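As a quick aside (a hedged sketch reusing the synthetic x from above), daisy() returns a "dissimilarity" object that pam() accepts directly:

d <- daisy(x, metric = "manhattan")
summary(d)  # 300 pairwise dissimilarities among the 25 observations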

m <- pam(daisy(x, metric = "manhattan"), 2, diss = TRUE)
m
## Medoids:
##      ID   
## [1,]  9  9
## [2,] 19 19
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Objective function:
##     build      swap 
## 1.0320143 0.7840973 
## 
## Available components:
## [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
## [6] "clusinfo"   "silinfo"    "diss"       "call"

Illustration on the ruspini data, loaded with data(ruspini): a data frame with 75 observations on 2 variables, giving the x and y coordinates of the points, respectively.

Source:

E. H. Ruspini (1970) Numerical methods for fuzzy clustering. Inform. Sci. 2, 319–350.

data(ruspini)
## Plot similar to Figure 4 in Struyf et al (1996)
kmrus <- kmeans(ruspini, 4, nstart = 1)
plot(ruspini, col = kmrus$cluster)

plot(pam(ruspini, 4))


Iris data clustering using k-medoids:

pam.result <- pam(irisB, 3)
table(pam.result$clustering, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
pam.result
## Medoids:
##       ID Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   8          5.0         3.4          1.5         0.2
## [2,]  79          6.0         2.9          4.5         1.5
## [3,] 113          6.8         3.0          5.5         2.1
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## Objective function:
##     build      swap 
## 0.6709391 0.6542077 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
layout(matrix(c(1,2),1,2)) # 2 graphs per page
plot(pam.result)
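
Afterwards, it is good practice to restore the default single-plot layout:

layout(matrix(1))  # back to one plot per page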