K means clustering

set.seed(1)
#import iris data
irisData<- iris
#a feature is another word for an explanatory variable in machine learning. 

#create a data frame of features by dropping the Species label
iris.features<- irisData
iris.features$Species <- NULL

#use kmeans to separate the observations into three clusters
results<- kmeans(iris.features, 3)

#view kmeans output
results

## K-means clustering with 3 clusters of sizes 50, 38, 62
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.901613    2.748387     4.393548    1.433871
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 23.87947 39.82097
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
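
The between_SS / total_SS figure in the output can be recomputed from the components listed above. The following is a minimal sketch, not part of the original analysis: the name `fit` and the `nstart = 25` setting are assumptions added here so that the result is less sensitive to the random starting centers.

```r
# Sketch only: `fit` and nstart = 25 are illustrative choices, not from the original run
set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Share of the total variation that lies between clusters rather than within them
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)

# The total sum of squares splits into between- and within-cluster parts
all.equal(fit$totss, fit$betweenss + fit$tot.withinss)
```

A ratio close to 1 means the clusters are tight relative to the spread of the whole data set. Cluster numbers from this sketch may differ from those shown above, because k-means labels are arbitrary.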

We expect k-means to separate the flowers into three groups. If the species differ enough from one another, those groups will line up closely with the species labels that we removed from the features. Let’s compare our k-means results to the characteristics of the actual flower species.

#compare kmeans cluster centers to per-species means computed with plyr
results$centers

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.901613    2.748387     4.393548    1.433871

library("plyr")

labeledmeans<- ddply(irisData, "Species", numcolwise(mean))

labeledmeans

##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
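
Rather than comparing the two tables by eye, each cluster center can be paired with the species mean closest to it. This is a rough sketch under a few assumptions: it uses base R `aggregate()` instead of plyr so that it needs no extra package, it measures closeness with plain Euclidean distance, and the names `fit`, `mean_matrix`, `d`, and `nearest` are introduced here for illustration.

```r
# Sketch only: names and nstart = 25 are illustrative, not from the original run
set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Per-species means of the four measurements (base R stand-in for the ddply call)
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
mean_matrix <- as.matrix(species_means[, -1])
rownames(mean_matrix) <- as.character(species_means$Species)

# Euclidean distance from each cluster center (rows) to each species mean (columns)
d <- as.matrix(dist(rbind(fit$centers, mean_matrix)))[1:3, 4:6]

# Species whose mean lies closest to each cluster center
nearest <- colnames(d)[apply(d, 1, which.min)]
nearest
```

If the clustering has recovered the species structure, each species should appear once in `nearest`.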

#compare original classification to kmeans cluster classification 

table(irisData$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

table(irisData$Species, results$cluster)

##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0  2 48
##   virginica   0 36 14

#plot petal length by width with color dependent upon clustering from kmeans
plot(irisData[c("Petal.Length", "Petal.Width")], col = results$cluster)

title("K means clustering on the Iris dataset")

The clusters do not match the species perfectly. Cluster 1 contains all 50 setosa, but 14 virginica were assigned to the mostly versicolor cluster (cluster 3), and two versicolor were assigned to the mostly virginica cluster (cluster 2). This isn’t too bad, though: only 16 of the 150 flowers (about 11%) ended up in the wrong cluster, and the algorithm never saw the species labels!
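
The count of 16 can be derived from the cross-tabulation by giving each cluster the species that is most common in it and then counting the flowers whose species differs from that label. This is a hedged sketch: `fit`, `majority`, `predicted`, and `mismatches` are names introduced here, and `nstart = 25` is an added assumption to reduce sensitivity to the random start.

```r
# Sketch only: names and nstart = 25 are illustrative, not from the original run
set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Species (rows) by cluster (columns)
tab <- table(iris$Species, fit$cluster)

# Most common species in each cluster, indexed by cluster number
majority <- rownames(tab)[apply(tab, 2, which.max)]

# Label every flower with its cluster's majority species and count disagreements
predicted <- majority[fit$cluster]
mismatches <- sum(predicted != as.character(iris$Species))
mismatches
mismatches / nrow(iris)
```

Because the majority label is chosen per cluster, this count does not depend on which number k-means happens to assign to each cluster.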


Leland Krych

May 29, 2016