What is Unsupervised Learning?

Unsupervised learning is a branch of machine learning used to find underlying patterns in data, and it is often applied in exploratory data analysis. Unlike supervised learning, it does not use labeled data; instead, it works directly from the data's features.

What is k-means clustering?

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. Data points are clustered based on feature similarity.
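
To make this concrete, here is a minimal sketch (using hypothetical toy data, not the iris data analysed below) of running k-means in base R:

# A minimal sketch: two well-separated blobs of 2-D points (toy data)
set.seed(42)                                  # kmeans uses random starts
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
fit <- kmeans(toy, centers = 2)               # ask for K = 2 groups
fit$cluster                                   # cluster label for each point
fit$centers                                   # coordinates of the fitted centroids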

Importing necessary libraries

library(gmodels)   # provides CrossTable() for cross-tabulation
library(cluster)   # clustering utilities
library(psych)     # provides describe() for descriptive statistics

Overview of the Iris dataset

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
dim(iris)
## [1] 150   5
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
describe(iris)
##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

Number of missing values

colSums(is.na(iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            0            0            0            0            0

This confirms that there are no missing values, so no data cleaning is needed.

Clustering the Iris dataset without the species column

data <- iris[,-5]    # the four numeric features only
class <- iris[,5]    # the true species labels, kept aside for comparison

Computing the within-cluster sum of squares for k = 1 to 15

set.seed(1)        # kmeans uses random starts; a seed keeps runs reproducible
wss <- numeric(15) # within-cluster sum of squares for each k
for (i in 1:15) wss[i] <- kmeans(data, centers = i)$tot.withinss
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares", col = "blue", pch = 16, lwd = 3)

There is an inflection point, or "elbow", in the graph at k = 3; some knowledge of the data (namely, the number of species) also tells us that it might be logical to look for three clusters in our data.
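
As a rough programmatic check (a sketch, not part of the original analysis), we can look at the percentage drop in WSS from each k to the next; the drops are typically large up to the elbow and small afterwards:

# Percentage decrease in WSS from each k to k+1; the drop flattens after the elbow
round(100 * -diff(wss) / wss[-length(wss)], 1)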

Taking k = 3 clusters

results <- kmeans(data, centers = 3)   # fit k-means with k = 3
class(results)                         # the fit is returned as a "kmeans" object
## [1] "kmeans"

Comparing the cluster assignments with the actual species in tabular form

table(class)
## class
##     setosa versicolor  virginica 
##         50         50         50
results$size
## [1] 50 62 38
results$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
table(class,results$cluster)
##             
## class         1  2  3
##   setosa     50  0  0
##   versicolor  0 48  2
##   virginica   0 14 36
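
Because the cluster labels 1 to 3 are arbitrary, one simple way to quantify agreement is to map each cluster to its majority species and count matches; from the table above this works out to (50 + 48 + 36) / 150, about 0.89:

# Map each cluster to its most common species and measure agreement
tab <- table(class, results$cluster)
sum(apply(tab, 2, max)) / sum(tab)   # proportion matching the majority species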

Cross-tabulation

CrossTable(class,results$cluster)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  150 
## 
##  
##              | results$cluster 
##        class |         1 |         2 |         3 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##       setosa |        50 |         0 |         0 |        50 | 
##              |    66.667 |    20.667 |    12.667 |           | 
##              |     1.000 |     0.000 |     0.000 |     0.333 | 
##              |     1.000 |     0.000 |     0.000 |           | 
##              |     0.333 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##   versicolor |         0 |        48 |         2 |        50 | 
##              |    16.667 |    36.151 |     8.982 |           | 
##              |     0.000 |     0.960 |     0.040 |     0.333 | 
##              |     0.000 |     0.774 |     0.053 |           | 
##              |     0.000 |     0.320 |     0.013 |           | 
## -------------|-----------|-----------|-----------|-----------|
##    virginica |         0 |        14 |        36 |        50 | 
##              |    16.667 |     2.151 |    42.982 |           | 
##              |     0.000 |     0.280 |     0.720 |     0.333 | 
##              |     0.000 |     0.226 |     0.947 |           | 
##              |     0.000 |     0.093 |     0.240 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |        50 |        62 |        38 |       150 | 
##              |     0.333 |     0.413 |     0.253 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 

Visualisation: comparing cluster assignments with actual species

In Petal.Length and Petal.Width

par(mfrow=c(1,2))
plot(data$Petal.Length,data$Petal.Width,col=results$cluster,pch=19,
     xlab="Petal Length",ylab="Petal Width",main="By cluster")
plot(data$Petal.Length,data$Petal.Width,col=class,pch=19,
     xlab="Petal Length",ylab="Petal Width",main="By species")

In Sepal.Length and Sepal.Width

par(mfrow=c(1,2))
plot(data$Sepal.Length, data$Sepal.Width,col=results$cluster,pch=19,
     xlab="Sepal Length",ylab="Sepal Width",main="By cluster")
plot(data$Sepal.Length, data$Sepal.Width,col=class,pch=19,
     xlab="Sepal Length",ylab="Sepal Width",main="By Species")

By Principal Component Analysis

p <- princomp(data)   # PCA on the four numeric features
par(mfrow=c(1,2))
plot(p$scores[,1],p$scores[,2],col=results$cluster,
     pch=16,
     xlab="Principal Component 1",
     ylab="Principal Component 2",
     main="By cluster")
plot(p$scores[,1],p$scores[,2],col=class,
     pch=16,
     xlab="Principal Component 1",
     ylab="Principal Component 2",
     main="By species")

Finally, we can conclude that dividing the data into three clusters greatly reduces the within-cluster sum of squares, and thus the explained (between-cluster) part of the variation increases. So k = 3 is an appropriate choice.
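
This can be verified directly from the fit, since the kmeans object reports the between-cluster sum of squares as a share of the total, which is the "explained" part referred to above:

# Share of total variation explained by the clustering (between_SS / total_SS)
results$betweenss / results$totss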

Thank You!!