k-means_clustering

IRIS dataset

data("iris")
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

iris2 = iris[,-5]

mean.iris = apply(iris2, 2, mean)
sd.iris = apply(iris2, 2, sd)

scaled.iris = scale(iris2, mean.iris, sd.iris)

3 methods of Clustering validation:

Cohesion: measure of similarity between data-points in a cluster. measured by within cluster sum of square errors. minimum the better.
Separation: measure of distinction between clusters. measured by between clusters sum of squares. Higher the better.
Silhouette coefficient: a measure of combination of both cohesion and separation. a coefficient from 0 to 1. 1 being the best.

finding best K where wss is optimum

wss = (nrow(scaled.iris)-1) * sum(apply(scaled.iris,2,var))
wss# this is max wss possible for this dataset which we get when K=1.

## [1] 596

for (i in 1:15){
        kmeans.model2 = kmeans(scaled.iris, i,nstart = 50)
        wss[i] = kmeans.model2$tot.withinss
}

plot(1:15, wss, type="b",
     xlab = "Number of Clusters",
     ylab = "within cluster sum of squares")

at K=3 we get best clustering with minimum wss. after that the advantage is not that big.

bss = rep(0,15)

for (i in 1:15){
        kmeans.model2 = kmeans(scaled.iris, i, nstart = 50)
        bss[i] = kmeans.model2$betweenss
}

plot(1:15, bss, type="b",
     xlab = "Number of Clusters",
     ylab = "Between cluster sum of squares")

at K=3 we get best clustering with maximum bss. after that the advantage is not that big.

# calculating euclidean distance
distance = dist(scaled.iris)

library(cluster)

plot(silhouette(kmeans(scaled.iris, centers = 3, nstart = 50)$cluster, dist(scaled.iris)))

running k-means clustering with K=3 ………

Run multiple times and keep the one with minimum wss and aximum bss.

set.seed(1)

kmeans.model = kmeans(scaled.iris, centers = 3, nstart = 50)
wss = kmeans.model$tot.withinss
bss = kmeans.model$betweenss
wss

## [1] 138.8884

bss

## [1] 457.1116

plot(iris2$Petal.Length, iris2$Petal.Width, 
     col=kmeans.model$cluster, 
     cex=2, 
     pch=19,
     xlab = "petal.length",
     ylab = "petal.width")
points(kmeans.model$centers, pch=3, cex=5, col=1:3, lwd=4)

plot of Species type from original data.

plot(iris$Petal.Length, iris$Petal.Width, 
     col=iris$Species, 
     cex=2, 
     pch=19,
     xlab = "petal.length",
     ylab = "petal.width")

Limitations of k-means clustering:

no. of iterations required to cnverge cannot be determined.
poor choice of initial centroids may result in suboptimal cluster assigments.(solutions (1) performing multiple runs (2) choosing the centroids using heirarchical approach)
cannot handle noisy data well.
cannot handle missign data.
not suitable to discover clusters with non-convex (irregular) or concentric (same center) shapes.
does not work well with unequal sized clusters.
not robust to changes in data set. even minor changes results in changes in cluster changes.
every data point is forced to be a part of only one cluster, no overlap allowed.

k-means_clustering

Maulik Patel

November 4, 2016

GOAL: The purpose of clustering is to find natural grouping of data points in data set such that points in same cluster are more similar to each other than points in different cluster.

clustering can be an end goal of the project or results of clustering can be used for further research to explore the causes of similarities and dissimilarities. also it is a great data exploratoration technique in exploratory data analysis phase.

Process:

for categorical variables, we have to use K-medoids method. medoids are mode of a set. in k-medoids method, medoids are data-points itself.

IRIS dataset

3 methods of Clustering validation:

finding best K where wss is optimum

at K=3 we get best clustering with minimum wss. after that the advantage is not that big.

at K=3 we get best clustering with maximum bss. after that the advantage is not that big.

running k-means clustering with K=3 ………

Run multiple times and keep the one with minimum wss and aximum bss.

plot of Species type from original data.

Limitations of k-means clustering: