Clustering is a machine learning technique that lets researchers and data scientists partition and segment data. It is an unsupervised learning method, i.e., it is used when the dataset has no labels. Segmenting data into appropriate groups is a core task in exploratory analysis. Clustering, which plays a big role in modern machine learning, is the partitioning of data into groups, and it can be done in a number of ways, the two most popular being partition clustering and hierarchical clustering. In terms of a data.frame, a clustering algorithm finds out which rows are similar to each other: rows that are grouped together should have high similarity to one another and low similarity to rows outside the group.
…
One of the more popular algorithms for partition clustering is K-means. It divides the observations into a chosen number of discrete groups based on a distance metric. The working of K-means is as follows: pick k initial centroids, assign every observation to the cluster with the nearest centroid, recompute each centroid as the mean of its assigned observations, and repeat the assignment and recomputation steps until the clusters stop changing. A minimal sketch of this loop is shown below.
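The following from-scratch sketch is illustrative only and is my own addition (the name simple_kmeans and its arguments are made up for this example); the rest of the article uses R's built-in kmeans().
# Illustrative sketch of the K-means loop; use the built-in kmeans() in practice.
simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  for (i in seq_len(iters)) {
    # distance of every row to every centroid, then assign each row to the nearest one
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    assignment <- max.col(-d)
    # recompute each centroid as the mean of the rows assigned to it
    # (this sketch ignores the empty-cluster edge case that kmeans() handles)
    centers <- apply(x, 2, function(col) tapply(col, assignment, mean))
  }
  list(cluster = assignment, centers = centers)
}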
Here we’ll take a very popular dataset called iris to demonstrate K-means clustering.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Printing the summary of the dataset.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Since the Species column is a categorical variable and is the target variable, I'm creating a data frame called data that contains all the columns of the iris dataset except Species, and a vector called class that contains only the Species column.
data <- iris[,-5]
class <- iris[,5]
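As a quick added check (not part of the original walk-through), we can confirm what was created: data holds the four numeric measurements and class is a factor with three species of 50 observations each.
dim(data)      # 150 rows, 4 numeric columns
table(class)   # 50 setosa, 50 versicolor, 50 virginica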
Now we’ll find a suitable number of clusters, i.e., the k value, using an elbow plot.
# Total within-cluster sum of squares for k = 1 to 15
# (results vary slightly between runs because kmeans() uses random starting centroids)
wss <- numeric(15)
for (i in 1:15) wss[i] <- kmeans(data, centers = i)$tot.withinss
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares", col = "blue", pch = 16, lwd = 3)
From the elbow plot we can observe an inflection point, the "elbow" of the graph, at k = 3. Beyond k = 3 the within-group sum of squares decreases only marginally, which suggests that 3 is a good value of k. Let's run kmeans with k = 3 and see the result.
results <- kmeans(data,centers=3)
class(results)
## [1] "kmeans"
results$size
## [1] 50 62 38
results$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
Now let's build a confusion matrix of species against cluster assignments to see how well the clustering recovered the species (the cluster numbers themselves are arbitrary labels).
table(class,results$cluster)
##
## class 1 2 3
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 14 36
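As an added check (my addition, not part of the original write-up), we can map each cluster to its majority species and compute the overall agreement:
conf <- table(class, results$cluster)
majority <- apply(conf, 2, which.max)            # majority species (row index) per cluster
predicted <- levels(class)[majority[results$cluster]]
mean(predicted == class)                         # (50 + 48 + 36) / 150, roughly 0.89 for the table above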
The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ in SAS or CROSSTABS in SPSS. It has a wealth of options.
library(gmodels)
CrossTable(class,results$cluster)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 150
##
##
## | results$cluster
## class | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## setosa | 50 | 0 | 0 | 50 |
## | 66.667 | 20.667 | 12.667 | |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## versicolor | 0 | 48 | 2 | 50 |
## | 16.667 | 36.151 | 8.982 | |
## | 0.000 | 0.960 | 0.040 | 0.333 |
## | 0.000 | 0.774 | 0.053 | |
## | 0.000 | 0.320 | 0.013 | |
## -------------|-----------|-----------|-----------|-----------|
## virginica | 0 | 14 | 36 | 50 |
## | 16.667 | 2.151 | 42.982 | |
## | 0.000 | 0.280 | 0.720 | 0.333 |
## | 0.000 | 0.226 | 0.947 | |
## | 0.000 | 0.093 | 0.240 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 50 | 62 | 38 | 150 |
## | 0.333 | 0.413 | 0.253 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
Now plotting the first two principal components, colored by cluster and by species, for comparison.
p <- princomp(data)
par(mfrow = c(1, 2))
plot(p$scores[, 1], p$scores[, 2], col = results$cluster,
     pch = 16,
     xlab = "Principal Component 1",
     ylab = "Principal Component 2",
     main = "By cluster")
plot(p$scores[, 1], p$scores[, 2], col = class,
     pch = 16,
     xlab = "Principal Component 1",
     ylab = "Principal Component 2",
     main = "By species")
Now applying clusplot() from the cluster package to the data.
library(cluster)
par(mfrow = c(1, 1))
clusplot(data, results$cluster, color = TRUE, shade = FALSE, labels = 0,
         main = "Results for k = 3 clusters", xlab = "Principal Component 1",
         ylab = "Principal Component 2")
There are two well-known problems with K-means clustering: the number of clusters k must be specified in advance, and the result depends on the randomly chosen initial centroids, so different runs can produce different clusterings. The snippet below illustrates the second point.
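A small illustration (the seeds here are arbitrary choices of mine): comparing the total within-cluster sum of squares across different random starts shows whether the algorithm settled on different solutions, and the nstart argument of kmeans() mitigates the issue by keeping the best of several starts.
set.seed(1);  kmeans(data, centers = 3)$tot.withinss
set.seed(42); kmeans(data, centers = 3)$tot.withinss
kmeans(data, centers = 3, nstart = 25)$tot.withinss   # best of 25 random starts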
A hierarchical clustering method works by grouping data into a tree of clusters. The aim is to produce a hierarchical series of nested clusters, and a diagram called a dendrogram represents this hierarchy graphically: it is an inverted tree that records the order in which clusters are merged or split. Here we'll discuss only agglomerative clustering, which is a bottom-up method. The algorithm is as follows: initially every data point is treated as its own cluster; at every step the two nearest clusters are merged; the merging continues until only one cluster remains.
Here we’ll take the built-in dataset mtcars to demonstrate hierarchical clustering.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Printing the summary of the dataset.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Computing the distance matrix (Euclidean distance by default).
d <- dist(mtcars)
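One point worth noting (my addition, not part of the original walk-through): the mtcars columns are on very different scales, so raw Euclidean distances are dominated by large-valued variables such as disp and hp; standardizing the columns first is a common alternative.
d_scaled <- dist(scale(mtcars))   # distances computed on standardized (z-scored) columns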
Now applying hierarchical clustering and generating the dendrogram.
ward <- hclust(d,method="ward.D")
Display the dendrogram and cut it into k = 3, 5, and 7 clusters to see the result.
par(mfrow=c(1,1))
plot(ward,cex=0.8)
rect.hclust(ward, k=3)
plot(ward,cex=0.8)
rect.hclust(ward, k=5)
plot(ward,cex=0.8)
rect.hclust(ward, k=7)
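To turn the dendrogram into an explicit cluster assignment, cutree() can be used; comparing the groups against the number of cylinders (my own addition) gives a rough feel for what drives the clusters.
groups <- cutree(ward, k = 3)      # flat cluster membership for k = 3
table(groups, mtcars$cyl)          # how the clusters line up with cylinder count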
Hope I was able to share some helpful concepts with you. See you in the next article. My website: Krishnamita.