Clustering is a machine learning technique that enables researchers and data scientists to partition and segment data. It is an unsupervised learning method, i.e., it is used when the dataset has no labels. Segmenting data into appropriate groups is a core task in exploratory analysis, and clustering plays a big role in modern machine learning. It can be done in a number of ways, the two most popular being partition and hierarchical clustering. In terms of a data.frame, a clustering algorithm finds out which rows are similar to each other: rows that are grouped together should have high similarity to each other and low similarity to rows outside the group.

...

Partition Clustering

One of the more popular algorithms for partition clustering is K-means. It divides the observations into a chosen number of discrete groups based on a distance metric, with each cluster represented by the centroid of the observations assigned to it. K-means works as follows (a minimal base-R sketch of the same loop is shown after the list):

  1. Choose the number of clusters, k.
  2. Initialise k centroids (for example, k randomly chosen observations).
  3. Calculate the Euclidean distance from each observation to each centroid.
  4. Assign each observation to its nearest centroid.
  5. Recompute each centroid as the mean of the observations assigned to it.
  6. Repeat steps 3–5 until the assignments no longer change.
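
To make these steps concrete, here is a minimal base-R sketch of the same loop. simple_kmeans is my own illustrative helper, not part of the original example, and empty clusters are not handled:

# Illustrative sketch of the K-means loop (hypothetical helper, not from the original post)
simple_kmeans <- function(x, k, iters = 100) {
  x <- as.matrix(x)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]      # step 2: pick k initial centroids
  for (it in seq_len(iters)) {
    # steps 3-4: squared Euclidean distance to each centroid, then assign to the nearest
    dists <- sapply(seq_len(k), function(j) colSums((t(x) - centroids[j, ])^2))
    cl <- max.col(-dists)
    # step 5: recompute each centroid as the mean of its assigned points
    new_centroids <- t(sapply(seq_len(k), function(j) colMeans(x[cl == j, , drop = FALSE])))
    if (all(abs(new_centroids - centroids) < 1e-8)) break  # step 6: stop when stable
    centroids <- new_centroids
  }
  list(cluster = cl, centers = centroids)
}

On the iris measurements, simple_kmeans(iris[,-5], 3)$cluster should give groupings broadly similar to the built-in kmeans() used below.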

Here we’ll take a very popular dataset called iris to demonstrate K-means clustering.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Printing the summary of the dataset.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Since the Species column is the categorical target variable, I’m creating a new data frame called data, which contains all the columns of the iris dataset except Species, and a vector called class, which contains only the Species column.

data <- iris[,-5]
class <- iris[,5]

Now we’ll find the optimal number of clusters (the value of k) using an elbow plot.

wss <- numeric(15)
# total within-cluster sum of squares for k = 1 to 15
for (i in 1:15) wss[i] <- kmeans(data,centers=i)$tot.withinss
plot(1:15, wss, type="b",
     xlab="Number of Clusters",
     ylab="Within groups sum of squares",col="blue",pch=16,lwd=3)

From the elbow plot we can observe an inflection point, the “elbow” of the graph, at k = 3. Beyond k = 3, the within-group sum of squares decreases only marginally, which suggests that 3 is a suitable value of k. Let’s run kmeans with k = 3 and see the result.

results <- kmeans(data,centers=3)
class(results)
## [1] "kmeans"
results$size
## [1] 50 62 38
results$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
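
Besides the sizes and assignments, the fitted kmeans object also contains the cluster centroids and the within/between sum-of-squares decomposition, which are worth a look (output omitted here):

results$centers       # the three centroids, one row per cluster
results$tot.withinss  # total within-cluster sum of squares
results$betweenss     # between-cluster sum of squares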

Now let’s cross-tabulate the species against the cluster assignments to see how well the clustering recovered the species.

table(class,results$cluster)
##             
## class         1  2  3
##   setosa     50  0  0
##   versicolor  0 48  2
##   virginica   0 14 36
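
One quick summary of this table is the proportion of observations that fall in the dominant cluster of their species, a rough accuracy measure (keeping in mind that the cluster numbers themselves are arbitrary labels):

tab <- table(class,results$cluster)
sum(apply(tab,1,max))/sum(tab)   # (50 + 48 + 36) / 150, roughly 0.89 for this run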

The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS. It has a wealth of options.

library(gmodels)
CrossTable(class,results$cluster)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  150 
## 
##  
##              | results$cluster 
##        class |         1 |         2 |         3 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##       setosa |        50 |         0 |         0 |        50 | 
##              |    66.667 |    20.667 |    12.667 |           | 
##              |     1.000 |     0.000 |     0.000 |     0.333 | 
##              |     1.000 |     0.000 |     0.000 |           | 
##              |     0.333 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##   versicolor |         0 |        48 |         2 |        50 | 
##              |    16.667 |    36.151 |     8.982 |           | 
##              |     0.000 |     0.960 |     0.040 |     0.333 | 
##              |     0.000 |     0.774 |     0.053 |           | 
##              |     0.000 |     0.320 |     0.013 |           | 
## -------------|-----------|-----------|-----------|-----------|
##    virginica |         0 |        14 |        36 |        50 | 
##              |    16.667 |     2.151 |    42.982 |           | 
##              |     0.000 |     0.280 |     0.720 |     0.333 | 
##              |     0.000 |     0.226 |     0.947 |           | 
##              |     0.000 |     0.093 |     0.240 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |        50 |        62 |        38 |       150 | 
##              |     0.333 |     0.413 |     0.253 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 

Now plotting the first two principal components, coloured by cluster and by species, for comparison.

p <- princomp(data)
par(mfrow=c(1,2))
plot(p$scores[,1],p$scores[,2],col=results$cluster,
     pch=16,
     xlab="Principal Component 1",
     ylab="Principal Component 2",
     main="By cluster")
plot(p$scores[,1],p$scores[,2],col=class,
     pch=16,
     xlab="Principal Component 1",
     ylab="Principal Component 2",
     main="By species")

Now applying clusplot() from the cluster package to the data.

library(cluster) 
par(mfrow=c(1,1))
clusplot(data,results$cluster,color=TRUE,shade=F,labels=0,
         main="Results for k = 3 clusters",xlab="Principal Component 1",
         ylab="Principal Component 2")

Drawbacks

There are two problems with K-means clustering. These are:

  • it does not work with categorical data
  • it is susceptible to outliers; an alternative is K-medoids (sketched below)
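
As a brief, hedged sketch of the K-medoids alternative, the pam() function from the cluster package (already loaded above for clusplot) can be applied to the same numeric data:

library(cluster)                     # already attached above
pam_result <- pam(data,k=3)          # K-medoids with 3 clusters
table(class,pam_result$clustering)   # compare medoid-based clusters with the species
pam_result$medoids                   # the medoid observations themselves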

Hierarchical Clustering

A hierarchical clustering method works by grouping data into a tree of clusters. The aim is to produce a hierarchical series of nested clusters, and a diagram called a dendrogram graphically represents this hierarchy: an inverted tree that describes the order in which clusters are merged (bottom-up) or split (top-down). Here we’ll discuss only the agglomerative, bottom-up approach: initially every data point is considered an individual cluster, and at every iteration the nearest pair of clusters is merged, until only a single cluster remains. The algorithm is as follows:

  1. Consider every data point as an individual cluster.
  2. Calculate the similarity of each cluster with all the other clusters, i.e., the proximity matrix (a toy illustration follows the list).
  3. Merge the two clusters that are closest, i.e., most similar, to each other.
  4. Recalculate the proximity matrix for the new set of clusters.
  5. Repeat steps 3 and 4 until only a single cluster remains.
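
As a tiny, purely hypothetical illustration of the proximity matrix in step 2, consider four made-up points; the two closest points are the ones merged first:

toy <- data.frame(x=c(1,1.2,5,5.1), y=c(1,1.1,5,5.2))   # hypothetical toy data
dist(toy)                 # the proximity (Euclidean distance) matrix between all points
hclust(dist(toy))$merge   # the order in which points/clusters are merged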

Here we’ll take an inbuilt dataset called mtcars to demonstrate Hierarchical clustering.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Printing the summary of the dataset.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Finding the distance matrix.

d <- dist(mtcars)
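
One optional aside (not part of the original walk-through): the mtcars columns are on very different scales, so the raw Euclidean distance is dominated by large-valued variables such as disp and hp. A common variant is to standardise the columns before computing distances; the examples below continue with the unscaled d.

d_scaled <- dist(scale(mtcars))   # distances on standardised (z-scored) columns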

Now applying hierarchical clustering (Ward’s method) and generating the dendrogram.

ward <- hclust(d,method="ward.D")
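
Ward’s method is only one choice of linkage; hclust() also supports, among others, "complete", "average", and "single" linkage, which could be swapped in for comparison (not explored further here):

complete <- hclust(d,method="complete")   # complete linkage, for comparison
average <- hclust(d,method="average")     # average linkage (UPGMA)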

Display the dendrogram and mark the clusters for k = 3, 5, and 7, and see the results.

par(mfrow=c(1,1))

plot(ward,cex=0.8)
rect.hclust(ward, k=3)

plot(ward,cex=0.8)

rect.hclust(ward, k=5)

plot(ward,cex=0.8)

rect.hclust(ward, k=7)
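
rect.hclust() only draws boxes around the branches; to extract the actual cluster memberships at a chosen k, cutree() can be used, for example:

groups <- cutree(ward, k=3)   # cluster label for each car when the tree is cut into 3 groups
table(groups)                 # number of cars in each cluster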

Hope I was able to share some helpful concepts with you. See you in the next article. My website: Krishnamita.