The well-known Iris dataset contains 150 plants from 3 iris species, with 4 attributes measured on each plant: sepal length, sepal width, petal length and petal width. The goal is to classify every plant into its true species. Let’s load the packages needed first.
library(ggplot2)
library(DT)
library(EMCluster)
library(FactoMineR)
library(reshape2)
library(summarytools)
library(dplyr)
The dataset is available directly in R as iris. Descriptive statistics are shown below.
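The descriptive output is not reproduced here. As a minimal sketch, assuming only base R (the post itself loads summarytools and DT, which may have been used for the original tables), the basic statistics can be obtained like this. The name species_means is introduced for illustration only.

```r
# Overall summary of the four measurements plus the species counts
print(summary(iris))

# Mean of each measurement within each species
# (species_means is a hypothetical name, not from the original post)
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

The per-species means already hint at which variables separate the species well.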
We’ll compare two ways of performing cluster analysis, k-means and the EM algorithm, over the raw (standardized) data and over principal components (PCs). The PCA was performed with the PCA() function from the FactoMineR package. For k-means we used the built-in kmeans() function from the stats package.
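set.seed(123)
# PCA on the four standardized measurements, keeping the first two components
pca_iris=PCA(iris[,-5],scale.unit = TRUE,ncp = 2,graph = FALSE)
# Standardize the measurements before clustering on the original variables
iris_sc=scale(iris[,-5],center = TRUE,scale = TRUE)
# k-means with 3 clusters
km_raw=kmeans(iris_sc, centers=3)
# EM with 3 Gaussian components: initialize, refine, then fit and assign classes
emobj <- exhaust.EM(iris_sc,nclass = 3)
emobj <- shortemcluster(iris_sc, emobj,maxiter = 1000)
em_raw <- emcluster(iris_sc, emobj, assign.class = TRUE)
# Gather the true species and both cluster assignments in one data frame
dat_clust_raw=data.frame(iris_sc,"Species"=as.factor(iris$Species),
  kmeans=as.factor(km_raw$cluster), EM=as.factor(em_raw$class))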
# Repeat both methods on the first two principal component coordinates
km_pca=kmeans(pca_iris$ind$coord, centers=3)
emobj_pca <- exhaust.EM(pca_iris$ind$coord, nclass = 3)
emobj_pca <- shortemcluster(pca_iris$ind$coord, emobj_pca,maxiter = 1000)
em_pca <- emcluster(pca_iris$ind$coord, emobj_pca, assign.class = TRUE)
# Gather the true species and both cluster assignments in one data frame
dat_clust_pca=data.frame(pca_iris$ind$coord,"Species"=iris$Species,
  kmeans=as.factor(km_pca$cluster), EM=as.factor(em_pca$class))
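Since the PCA keeps only two components (ncp = 2), it is worth checking how much of the total variance those two dimensions retain. The following is a minimal sketch that uses base R's prcomp() as a stand-in for FactoMineR's PCA(); the names pca_base, var_share and cum_share are assumptions made for this example.

```r
# prcomp() with scale. = TRUE standardizes the variables,
# which mirrors scale.unit = TRUE in the PCA() call
pca_base <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total
var_share <- pca_base$sdev^2 / sum(pca_base$sdev^2)
cum_share <- cumsum(var_share)
print(round(cum_share, 3))
```

The second entry of cum_share is the fraction of variance kept by the two coordinates that the clustering on PCs works with.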
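The EM algorithm is expected to perform better than standard k-means when the true clusters are non-spherical or small. K-means implicitly favors roughly spherical clusters of similar size, whereas the Gaussian mixture fitted by EM estimates a separate covariance matrix and mixing proportion for each cluster.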
# Reshape to long format: one row per plant and labelling (Species, kmeans, EM)
dat_raw=dat_clust_raw %>% melt(id.vars=colnames(iris[,-5]))
# One panel per labelling, points colored by label
dat_raw %>% ggplot(aes(x=Sepal.Length,y=Petal.Width, color=value))+geom_point()+
  facet_wrap(~variable)+theme_grey()+
  theme(strip.background =element_rect(fill="darkred"))+
  theme(strip.text = element_text(colour = 'white'))+guides(color="none")
Raw Data
# Same reshaping and plot, now on the two principal component coordinates
dat_pca=dat_clust_pca %>% melt(id.vars=c("Dim.1","Dim.2"))
dat_pca %>% ggplot(aes(x=Dim.1,y=Dim.2, color=value))+geom_point()+facet_wrap(~variable)+theme_grey()+
  theme(strip.background =element_rect(fill="darkred"))+
  theme(strip.text = element_text(colour = 'white'))+guides(color="none")
PCA Data
On both the raw data and the PCA coordinates, EM appears to perform better than k-means. Compare for yourself: look at the “Species” facet and judge which of the two methods comes closer to the true grouping.
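The visual comparison can be backed by a number. Below is a minimal sketch that cross-tabulates cluster labels against the true species and computes a simple purity score. The helper cluster_purity() is hypothetical and not part of any package used in the post. To stay self-contained, the block refits k-means only.

```r
# Hypothetical helper: share of observations that fall in the
# majority species of their assigned cluster (1 = perfect agreement)
cluster_purity <- function(truth, cluster) {
  tab <- table(truth, cluster)
  sum(apply(tab, 2, max)) / sum(tab)
}

set.seed(123)
iris_sc <- scale(iris[, -5])
km_raw <- kmeans(iris_sc, centers = 3)

# Rows are the true species, columns are the k-means clusters
print(table(Species = iris$Species, kmeans = km_raw$cluster))

purity_km <- cluster_purity(iris$Species, km_raw$cluster)
print(purity_km)

# With EMCluster loaded and em_raw fitted as in the post, the same call
# would be: cluster_purity(iris$Species, em_raw$class)
```

Purity ignores the arbitrary numbering of clusters, so the k-means and EM scores can be compared directly.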
The EM algorithm is a solid alternative to traditional k-means clustering, and EMCluster supports it for both unsupervised and semi-supervised learning. It fits a multivariate Gaussian distribution to each cluster, which tends to produce stable solutions. For more information on the EMCluster package, go to https://cran.r-project.org/web/packages/EMCluster/EMCluster.pdf
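To make the idea of fitting one Gaussian per cluster concrete, here is a minimal base-R sketch of EM for a two-component mixture in one dimension. It is an illustration only, not EMCluster's implementation, which handles multivariate data and uses more careful initialization. The function em_two_gauss() and its starting values are assumptions made for this example.

```r
# Hypothetical helper: EM for a two-component 1-D Gaussian mixture
em_two_gauss <- function(x, iters = 200) {
  # Crude starting values: quartiles as means, overall sd, equal weights
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))
  sigma <- rep(sd(x), 2)
  pi_k <- c(0.5, 0.5)
  for (i in seq_len(iters)) {
    # E-step: posterior probability of each component for each point
    dens <- cbind(pi_k[1] * dnorm(x, mu[1], sigma[1]),
                  pi_k[2] * dnorm(x, mu[2], sigma[2]))
    resp <- dens / rowSums(dens)
    # M-step: re-estimate weights, means and standard deviations
    n_k <- colSums(resp)
    pi_k <- n_k / length(x)
    mu <- colSums(resp * x) / n_k
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / n_k)
  }
  list(pi = pi_k, mu = mu, sigma = sigma,
       class = max.col(resp, ties.method = "first"))
}

# Petal length alone: component 1 should capture the short-petal plants
fit <- em_two_gauss(iris$Petal.Length)
print(fit$mu)
print(table(Species = iris$Species, component = fit$class))
```

Unlike k-means, each component gets its own spread and weight, which is what lets a mixture model cope with clusters of different shapes and sizes.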
Written by: Jhon Parra
jhonparra939@gmail.com