Assignment 6

Write the R code to perform k-means clustering on the Iris dataset. The Iris dataset comes built in with R. You can also go online to read more about the dataset. Use a seed value of 1234, nstart=25, and k=3.

Display the final confusion matrix and the Rand index. Also plot a 2-dimensional graph of petal length against petal width, first colored by the given iris species and then by the cluster generated by your algorithm.

require("datasets")
data("iris") # load Iris Dataset
str(iris) #view structure of dataset

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris)

Let’s draw a scatterplot of petal length against petal width, colored by species.

library(ggplot2)

df <- iris
ggplot(df, aes(Petal.Length, Petal.Width)) + geom_point(aes(col=Species), size=4)

Let’s use a seed value of 1234, nstart=25, and k=3.

set.seed(1234)
irisCluster <- kmeans(df[,1:4], centers=3, nstart=25) # cluster on the four numeric columns only
irisCluster

K-means clustering with 3 clusters of sizes 50, 62, 38

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
 [52] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
[103] 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
 (between_SS / total_SS =  88.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
[7] "size"         "iter"         "ifault"
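
As a quick check on the summary above, the percentage that kmeans prints can be recomputed from the components it lists. This is only a sketch: it refits the model under the assumed name `fit` (not a name used elsewhere in this notebook) so that it runs on its own, and `ratio` is likewise just an illustrative name.

```r
# Refit with the same settings so this chunk runs on its own
# (`fit` and `ratio` are illustrative names, not from the assignment)
set.seed(1234)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# The total sum of squares splits into a within-cluster and a between-cluster part
all.equal(fit$tot.withinss + fit$betweenss, fit$totss)

# Share of the total variation captured by the clustering, in percent
ratio <- 100 * fit$betweenss / fit$totss
round(ratio, 1)
```

If the fit matches the one printed above, the rounded value should agree with the between_SS / total_SS figure in that output.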

Let’s compare the predicted clusters with the original data.

table(irisCluster$cluster, df$Species)

   
    setosa versicolor virginica
  1     50          0         0
  2      0         48        14
  3      0          2        36
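
The assignment also asks for the Rand index. Base R has no built-in function for it, so here is a minimal sketch that computes it directly from the confusion matrix by counting pairs of flowers on which the clustering and the species labels agree. The helper name `rand_index` and the variables `fit` and `conf` are assumptions made for this example.

```r
# Rand index from a contingency table (rows = clusters, columns = true classes).
# `rand_index`, `fit` and `conf` are illustrative names.
rand_index <- function(tab) {
  n <- sum(tab)
  pairs_total <- choose(n, 2)                         # all pairs of observations
  pairs_same_both <- sum(choose(tab, 2))              # same cluster and same class
  pairs_same_cluster <- sum(choose(rowSums(tab), 2))  # same cluster
  pairs_same_class <- sum(choose(colSums(tab), 2))    # same class
  # agreements = pairs together in both + pairs apart in both
  (pairs_total + 2 * pairs_same_both - pairs_same_cluster - pairs_same_class) / pairs_total
}

set.seed(1234)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
conf <- table(fit$cluster, iris$Species)
rand_index(conf)
```

For the counts in the table above this works out to 9831 / 11175, or about 0.88. The index does not depend on how the clusters happen to be numbered, only on which flowers are grouped together.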

Let’s plot out these clusters.

library(cluster)
clusplot(df[,1:4], irisCluster$cluster, color=TRUE, shade=TRUE, labels=0, lines=0) # plot only the columns that were clustered

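We can see that setosa is separated perfectly, while versicolor and virginica overlap slightly: in the table above, 14 virginica flowers fall in the mostly-versicolor cluster and 2 versicolor flowers fall in the mostly-virginica cluster.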
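
The assignment also asks for the petal length versus petal width plot colored by cluster, to set beside the species plot from earlier. A possible sketch follows; it refits the model so that it runs on its own, and `fit`, `plot_df`, `Cluster` and `p` are names chosen here for illustration.

```r
library(ggplot2)

# Refit with the assignment's settings (`fit` is an illustrative name)
set.seed(1234)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Attach the cluster labels as a factor so ggplot uses discrete colors
plot_df <- iris
plot_df$Cluster <- factor(fit$cluster)

# Same axes as the species plot, but colored by k-means cluster
p <- ggplot(plot_df, aes(Petal.Length, Petal.Width)) +
  geom_point(aes(col = Cluster), size = 4)
p
```

Comparing this with the species plot shows where the two groupings disagree. The cluster numbers are arbitrary, so the colors need not line up with the species colors.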

Let’s check whether 3 is a reasonable number of centers using the elbow method: we run k-means for k = 1 to 10 and record the total within-cluster sum of squares for each k.

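set.seed(1234) # make the random starts reproducible
tot.withinss <- numeric(10) # numeric, not character, so the values plot correctly
for (i in 1:10){
  km <- kmeans(df[,1:4], centers=i, nstart=20) # separate name keeps the k=3 irisCluster intact
  tot.withinss[i] <- km$tot.withinss
}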

Let’s visualize it.

plot(1:10, tot.withinss, type="b", pch=19, xlab="Number of clusters k", ylab="Total within-cluster sum of squares")

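The curve drops steeply up to k = 3 and flattens afterwards, so 3 is a reasonable number of clusters, which also matches the three species in the data.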
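
Reading an elbow off a plot is a judgment call, so it can help to look at the numbers as well. The sketch below is one way to do that, not the only one: it recomputes the curve under the assumed name `wss` and reports the relative drop at each step (`rel_drop` is also an illustrative name).

```r
# Total within-cluster sum of squares for k = 1..10
# (`wss` and `rel_drop` are illustrative names)
set.seed(1234)
wss <- sapply(1:10, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
})

# Relative reduction when moving from k - 1 to k clusters, for k = 2..10
rel_drop <- -diff(wss) / wss[-length(wss)]
names(rel_drop) <- 2:10
round(rel_drop, 2)
```

Large values mean the extra cluster removes a lot of the remaining within-cluster variation. Once the values become small and similar, further clusters add little.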
