Dataset Description

The well-known Iris dataset contains 150 plants from 3 species of iris. Four attributes were measured on each plant: sepal length, sepal width, petal length, and petal width. The goal is to assign every plant to its true species. Let’s load the packages needed first.

library(ggplot2)
library(DT)
library(EMCluster)
library(FactoMineR)
library(reshape2)
library(summarytools)
library(dplyr)

The dataset is available directly in R as iris. Descriptive statistics are shown below.
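As a minimal sketch, the descriptive statistics can be obtained with base R alone. The choice of summary() and table() is an assumption for illustration; the loaded summarytools and DT packages offer richer displays.

```r
# Base-R overview of the data; no extra packages required.
data(iris)

dim(iris)             # number of rows (plants) and columns
table(iris$Species)   # number of plants per species
summary(iris[, -5])   # min, quartiles, mean and max of the 4 measurements
```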


Clustering Methods

We’ll compare two ways of performing cluster analysis, k-means and the EM algorithm, applied both to the raw (standardized) data and to the first two principal components (PCs). The PCA was performed with the PCA() function from the FactoMineR package. For k-means we used the built-in kmeans() function from the stats package.

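set.seed(123)

# PCA on standardized variables, keeping the first two components
pca_iris <- PCA(iris[, -5], scale.unit = TRUE, ncp = 2, graph = FALSE)
# Standardized copy of the four measurements (the "raw" data used below)
iris_sc <- scale(iris[, -5], center = TRUE, scale = TRUE)

# Raw (standardized) data: k-means with 3 clusters
km_raw <- kmeans(iris_sc, centers = 3)

# Raw (standardized) data: initialize with exhaust.EM(), refine with
# shortemcluster(), then fit and assign classes with emcluster()
emobj <- exhaust.EM(iris_sc, nclass = 3)
emobj <- shortemcluster(iris_sc, emobj, maxiter = 1000)
em_raw <- emcluster(iris_sc, emobj, assign.class = TRUE)

dat_clust_raw <- data.frame(iris_sc, Species = as.factor(iris$Species),
                            kmeans = as.factor(km_raw$cluster),
                            EM = as.factor(em_raw$class))

# PCA coordinates: same two methods on the first two components
km_pca <- kmeans(pca_iris$ind$coord, centers = 3)

emobj_pca <- exhaust.EM(pca_iris$ind$coord, nclass = 3)
emobj_pca <- shortemcluster(pca_iris$ind$coord, emobj_pca, maxiter = 1000)
em_pca <- emcluster(pca_iris$ind$coord, emobj_pca, assign.class = TRUE)

dat_clust_pca <- data.frame(pca_iris$ind$coord, Species = iris$Species,
                            kmeans = as.factor(km_pca$cluster),
                            EM = as.factor(em_pca$class))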
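Because only two components are kept, it is worth checking how much of the total variance they carry. The sketch below uses base R's prcomp() as a stand-in for FactoMineR's PCA() (an assumption made to keep the example dependency-free); with FactoMineR the same information should be available in pca_iris$eig.

```r
# prcomp() is a base-R stand-in here, not the FactoMineR call used above.
pc <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Share of variance per component and its running total.
var_share <- pc$sdev^2 / sum(pc$sdev^2)
cum_share <- cumsum(var_share)

round(cum_share, 3)   # the second entry is the share kept by two components
```

If the second entry is close to 1, clustering on two components loses little information compared with clustering on all four variables.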

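The EM algorithm is expected to perform better than standard k-means when the true clusters are non-spherical or small. K-means assigns each point to the nearest centroid, which implicitly favors round clusters of similar size, whereas the Gaussian mixture fitted by EM estimates a separate covariance matrix and mixing proportion for each cluster.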
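The difference between the two methods can be made concrete by writing down what each one optimizes. The notation below is an assumption introduced for illustration: $x_i$ are the observations, $c_i$ the cluster label of observation $i$, $\mu_k$ the cluster centers, $\Sigma_k$ the cluster covariance matrices, and $\pi_k$ the mixing proportions.

```latex
% k-means: minimize the within-cluster sum of squared Euclidean distances
\min_{c,\;\mu_1,\dots,\mu_K} \; \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2

% Gaussian mixture fitted by EM: maximize the log-likelihood
\max_{\pi,\;\mu,\;\Sigma} \; \sum_{i=1}^{n} \log \sum_{k=1}^{K}
  \pi_k \, \mathcal{N}\!\left(x_i \mid \mu_k, \Sigma_k\right)
```

The k-means objective has no counterpart to $\Sigma_k$ or $\pi_k$, which is why it cannot adapt to elongated clusters or to clusters of very different sizes.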

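# Long format: one row per plant and labelling (Species, kmeans, EM)
dat_raw <- dat_clust_raw %>% melt(id.vars = colnames(iris[, -5]))

dat_raw %>% ggplot(aes(x = Sepal.Length, y = Petal.Width, color = value)) +
  geom_point() +
  facet_wrap(~variable) + theme_grey() +
  theme(strip.background = element_rect(fill = "darkred")) +
  theme(strip.text = element_text(colour = "white")) +
  guides(color = "none")  # guides(color = FALSE) is deprecated in recent ggplot2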
Figure: Raw Data


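# Long format: one row per plant and labelling (Species, kmeans, EM)
dat_pca <- dat_clust_pca %>% melt(id.vars = c("Dim.1", "Dim.2"))

dat_pca %>% ggplot(aes(x = Dim.1, y = Dim.2, color = value)) +
  geom_point() +
  facet_wrap(~variable) + theme_grey() +
  theme(strip.background = element_rect(fill = "darkred")) +
  theme(strip.text = element_text(colour = "white")) +
  guides(color = "none")  # guides(color = FALSE) is deprecated in recent ggplot2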
Figure: PCA Data


In both the raw data and the PCA coordinates, EM appears to recover the species better than k-means. Compare for yourself: look at the “Species” facet and see which of the two methods comes closer to the true grouping. Keep in mind that cluster labels are arbitrary, so matching colors across facets is not meaningful; only the grouping of the points is.
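The visual comparison can be backed by a cross-tabulation of cluster labels against the true species. The sketch below is self-contained and uses only base R k-means; the nstart value and the purity measure are assumptions added for illustration. Replacing km$cluster with em_raw$class from the code above would give the corresponding table for EM.

```r
set.seed(123)
iris_sc <- scale(iris[, -5])

# nstart = 25 restarts k-means from several random centers (illustrative choice).
km <- kmeans(iris_sc, centers = 3, nstart = 25)

# Rows are true species, columns are cluster labels (labels are arbitrary).
conf <- table(Species = iris$Species, Cluster = km$cluster)
print(conf)

# Purity: share of plants that belong to the majority species of their cluster.
purity <- sum(apply(conf, 2, max)) / sum(conf)
purity
```

A purity of 1 would mean every cluster contains a single species; the method with the higher value is closer to the true grouping.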


Conclusion

The EM algorithm is a solid alternative to traditional k-means clustering. It produces stable solutions by fitting a multivariate Gaussian distribution to each cluster. The EMCluster package supports both unsupervised and semi-supervised learning; for more information, see https://cran.r-project.org/web/packages/EMCluster/EMCluster.pdf



Written by: Jhon Parra
jhonparra939@gmail.com