What is Unsupervised Learning?
Unsupervised learning is a branch of machine learning used to find underlying patterns in data, often as part of exploratory data analysis. Unlike supervised learning, it does not use labeled data; it works only with the data’s features.
What is k-means clustering?
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. Data points are clustered based on feature similarity.
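To make the idea of grouping by feature similarity concrete, the two alternating steps behind k-means (assign each point to its nearest center, then move each center to the mean of its points) can be sketched in a few lines of R. This is only an illustrative sketch: `simple_kmeans` is a hypothetical helper written for this note, it uses squared Euclidean distance, and it is not the algorithm that R's built-in `kmeans()` runs by default.

```r
# Minimal sketch of the k-means loop; simple_kmeans is a hypothetical helper,
# not part of any package, and is meant for illustration only.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k rows chosen at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of the nearest center
    if (all(new_cluster == cluster)) break             # assignments stable: stop
    cluster <- new_cluster
    # Update step: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # arbitrary seed, only so the random start is repeatable
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(fit$cluster)
```

Because the starting centers are random, different seeds can give different groupings; that is one reason to prefer the built-in `kmeans()` for real work.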
Importing necessary libraries
library(gmodels)
library(cluster)
library(psych)
Overview of Iris dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
dim(iris)
## [1] 150 5
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
describe(iris)
##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07
Number of missing values
colSums(is.na(iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##            0            0            0            0            0
This shows that there are no missing values, so no imputation or removal of incomplete rows is needed.
Clustering Iris dataset without species column
data <- iris[,-5]
class <- iris[,5]
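Computing the within-cluster sum of squares for 1 to 15 clusters
set.seed(123)  # k-means starts from random centers; an arbitrary fixed seed makes the curve repeatable
wss <- numeric(15)  # one total within-cluster sum of squares per value of k
for (i in 1:15) wss[i] <- kmeans(data, centers = i)$tot.withinss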
plot(1:15, wss, type="b",
xlab="Number of Clusters",
ylab="Within groups sum of squares",col="blue",pch=16,lwd=3)
There is an inflection point, or “elbow”, in the graph at k = 3; some knowledge of the data (namely the number of species) also tells us that it is logical to look for three clusters in our data.
Taking clusters = 3
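The elbow can also be checked numerically rather than by eye. The sketch below recomputes the curve on its own (it does not reuse `wss`) and prints how much the within-cluster sum of squares falls with each extra cluster. The names `features`, `wss_stable` and `drops`, the seed, and the choice of `nstart = 20` and `iter.max = 50` are assumptions made for this illustration; `nstart` simply asks `kmeans()` to try several random starts and keep the best one, which makes the curve less sensitive to an unlucky start.

```r
set.seed(42)                 # arbitrary seed, chosen only for repeatability
features <- iris[, 1:4]      # the four measurements, without Species

# Total within-cluster sum of squares for k = 1..15, best of 20 random starts each
wss_stable <- sapply(1:15, function(k) {
  kmeans(features, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# Reduction gained by moving from k - 1 to k clusters
drops <- -diff(wss_stable)
round(data.frame(k = 2:15, drop = drops), 2)
```

A large drop followed by much smaller ones marks the elbow; on this data the reductions should shrink sharply after the first few values of k.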
results <- kmeans(data,centers=3)
class(results)
## [1] "kmeans"
Comparing clustering prediction with actual species in tabular form
table(class)
## class
##     setosa versicolor  virginica
##         50         50         50
results$size
## [1] 50 62 38
results$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
table(class,results$cluster)
##
## class         1  2  3
##   setosa     50  0  0
##   versicolor  0 48  2
##   virginica   0 14 36
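The table can be condensed into a single agreement figure. Cluster numbers are arbitrary labels, so each cluster is first matched to the species that supplies most of its members. The sketch below refits the model on its own; the names `fit`, `tab`, `majority`, `predicted` and `agreement`, the seed, and `nstart = 20` are assumptions for illustration, so the cluster numbering it produces may differ from the output shown above.

```r
set.seed(42)   # arbitrary seed; cluster numbering can differ between runs
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Rows are species, columns are cluster labels
tab <- table(iris$Species, fit$cluster)

# For each cluster (column), the species that contributes most of its members
majority <- rownames(tab)[apply(tab, 2, which.max)]

# Share of flowers that sit in a cluster dominated by their own species
predicted <- majority[fit$cluster]
agreement <- mean(predicted == as.character(iris$Species))
agreement
```

For the table shown above, the same calculation by hand gives (50 + 48 + 36) / 150, or about 0.89.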
Cross-tabulation form
CrossTable(class,results$cluster)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 150
##
##
## | results$cluster
## class | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## setosa | 50 | 0 | 0 | 50 |
## | 66.667 | 20.667 | 12.667 | |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## versicolor | 0 | 48 | 2 | 50 |
## | 16.667 | 36.151 | 8.982 | |
## | 0.000 | 0.960 | 0.040 | 0.333 |
## | 0.000 | 0.774 | 0.053 | |
## | 0.000 | 0.320 | 0.013 | |
## -------------|-----------|-----------|-----------|-----------|
## virginica | 0 | 14 | 36 | 50 |
## | 16.667 | 2.151 | 42.982 | |
## | 0.000 | 0.280 | 0.720 | 0.333 |
## | 0.000 | 0.226 | 0.947 | |
## | 0.000 | 0.093 | 0.240 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 50 | 62 | 38 | 150 |
## | 0.333 | 0.413 | 0.253 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
Visualisation: comparing clustering prediction with actual species
In Petal.Length and Petal.Width
par(mfrow=c(1,2))
plot(data$Petal.Length,data$Petal.Width,col=results$cluster,pch=19,
xlab="Petal Length",ylab="Petal Width",main="By cluster")
plot(data$Petal.Length,data$Petal.Width,col=class,pch=19,
xlab="Petal Length",ylab="Petal Width",main="By species")
In Sepal.Length and Sepal.Width
par(mfrow=c(1,2))
plot(data$Sepal.Length, data$Sepal.Width,col=results$cluster,pch=19,
xlab="Sepal Length",ylab="Sepal Width",main="By cluster")
plot(data$Sepal.Length, data$Sepal.Width,col=class,pch=19,
xlab="Sepal Length",ylab="Sepal Width",main="By Species")
By Principal Component Analysis
p <- princomp(data)
par(mfrow=c(1,2))
plot(p$scores[,1],p$scores[,2],col=results$cluster,
pch=16,
xlab="Principal Component 1",
ylab="Principal Component 2",
main="By cluster")
plot(p$scores[,1],p$scores[,2],col=class,
pch=16,
xlab="Principal Component 1",
ylab="Principal Component 2",
main="By species")
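Plotting only the first two principal components is reasonable when they carry most of the variance. A quick way to check this, sketched here with the assumed name `var_share`, is to turn the component standard deviations returned by `princomp()` into variance proportions.

```r
p <- princomp(iris[, 1:4])

# Proportion of the total variance carried by each component
var_share <- p$sdev^2 / sum(p$sdev^2)

# Cumulative proportion: the second entry is the share shown in the two plots
round(cumsum(var_share), 3)
```

If the second cumulative value is close to 1, the two-dimensional plots lose little information about how the clusters are separated.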
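Finally, we can conclude that dividing the data into 3 clusters greatly reduces the within-cluster sum of squares, so the explained part (the between-cluster share of the total sum of squares) increases. In the comparison table, 134 of the 150 flowers (50 + 48 + 36) fall in the cluster dominated by their own species. So, cluster = 3 is appropriate.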
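The "explained part" can be read directly from a fitted `kmeans` object as the ratio of the between-cluster sum of squares to the total sum of squares. The sketch below computes it for k = 1 to 6; the name `explained`, the seed, and `nstart = 20` are assumptions made for this illustration.

```r
set.seed(42)   # arbitrary seed, only for repeatability

# Between-cluster share of the total sum of squares for k = 1..6
explained <- sapply(1:6, function(k) {
  fit <- kmeans(iris[, 1:4], centers = k, nstart = 20)
  fit$betweenss / fit$totss
})
round(explained, 3)
```

The ratio always rises as clusters are added, so what matters is where the gains level off, which is the same elbow seen in the within-groups plot.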
Thank You!!