Display the final confusion matrix and the Rand index. Also plot a two-dimensional graph of petal length against petal width, colored first by the given iris species and then by the cluster assigned by your algorithm.
require("datasets")
data("iris") # load Iris Dataset
str(iris) #view structure of dataset
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
Let’s plot the scatterplot.
library(ggplot2)
df <- iris
ggplot(df, aes(Petal.Length, Petal.Width)) + geom_point(aes(col=Species), size=4)
Let’s use a seed value of 1234, nstart=25, and k=3.
set.seed(1234)
irisCluster <- kmeans(df[,1:4], centers=3, nstart=25)
irisCluster
K-means clustering with 3 clusters of sizes 50, 62, 38
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[52] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
[103] 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
(between_SS / total_SS = 88.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
Let’s compare the predicted clusters with the original species labels.
table(irisCluster$cluster, df$Species)
    setosa versicolor virginica
  1     50          0         0
  2      0         48        14
  3      0          2        36
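The task statement also asks for the Rand index, which can be computed directly from a confusion matrix like the one above. Here is a minimal sketch in base R. Note that `rand_index` and `km_fit` are names introduced here for illustration, not functions or objects from any package. The helper counts the pairs of observations on which the clustering and the species labels agree (same cluster and same species, or different cluster and different species) and divides by the total number of pairs.

```r
# Hypothetical helper (not from a package): Rand index from a contingency table
rand_index <- function(tab) {
  n            <- sum(tab)
  total_pairs  <- choose(n, 2)                   # all pairs of observations
  same_both    <- sum(choose(tab, 2))            # same cluster AND same species
  same_cluster <- sum(choose(rowSums(tab), 2))   # same cluster
  same_species <- sum(choose(colSums(tab), 2))   # same species
  # agreements = pairs together in both + pairs apart in both
  (total_pairs + 2 * same_both - same_cluster - same_species) / total_pairs
}

set.seed(1234)
km_fit   <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
conf_mat <- table(km_fit$cluster, iris$Species)
rand_index(conf_mat)
```

For the confusion matrix shown above, this works out to 9831 / 11175, or about 0.88. The value does not depend on how the clusters happen to be numbered.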
Let’s plot out these clusters.
library(cluster)
clusplot(df[,1:4], irisCluster$cluster, color=TRUE, shade=TRUE, labels=0, lines=0) # numeric columns only, so Species does not leak into the projection
We can see that setosa forms its own cleanly separated cluster, while the versicolor and virginica clusters overlap somewhat.
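The task statement also asks for the petal length against petal width plot colored by cluster, to compare with the species-colored scatterplot. A minimal sketch follows, mirroring the earlier ggplot call. The names `km_fit`, `plot_df`, and `Cluster` are introduced here for illustration only.

```r
library(ggplot2)

set.seed(1234)
km_fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Copy the data and attach the cluster assignment as a factor,
# so ggplot uses a discrete color scale
plot_df <- iris
plot_df$Cluster <- factor(km_fit$cluster)

p <- ggplot(plot_df, aes(Petal.Length, Petal.Width)) +
  geom_point(aes(col = Cluster), size = 4)
print(p)
```

Keep in mind that the cluster numbers are arbitrary labels, so the colors need not line up with those of the species plot. The grouping of the points is what should be compared.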
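To check whether 3 is a sensible number of centers, we can use the elbow method: run k-means for k = 1 to 10 and record the total within-cluster sum of squares for each k.
tot.withinss <- vector(mode="numeric", length=10) # numeric, so the values can be plotted
for (i in 1:10){
  fit <- kmeans(df[,1:4], centers=i, nstart=20) # separate name keeps irisCluster intact
  tot.withinss[i] <- fit$tot.withinss
}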
Let’s visualize it.
plot(1:10, tot.withinss, type="b", pch=19)
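The total within-cluster sum of squares drops sharply up to k = 3 and flattens out afterwards, so the elbow suggests 3 clusters, which matches the three species.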
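Reading an elbow off a plot is a judgment call, so it can help to look at the numbers as well. The sketch below is one possible way to do that, not a standard procedure: it computes the relative drop in total within-cluster sum of squares each time k increases by one. The names `wss` and `rel_drop` are introduced here for illustration.

```r
set.seed(1234)

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
})

# Fraction of the remaining within-cluster variation removed by going from k-1 to k
rel_drop <- -diff(wss) / head(wss, -1)

round(data.frame(k = 2:10, wss = wss[-1], rel_drop = rel_drop), 3)
```

A large `rel_drop` followed by clearly smaller ones marks the candidate elbow.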