Cluster analysis

Loading libraries

library(caret)    # confusionMatrix
library(cluster)  # clusplot
library(fpc)      # plotcluster

Reading and preparing data

mydata<- iris
str(mydata)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

mydata <- na.omit(mydata) # listwise deletion of rows with missing values
mydata <- scale(mydata[, 1:4]) # standardize the four numeric variables (drops Species)
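
As a quick sanity check on the standardization step, the sketch below recomputes the column means and standard deviations after scale(). The name scaled is only illustrative and is not used elsewhere in this document.

```r
# Standardize the four numeric iris columns (same call as above)
scaled <- scale(iris[, 1:4])

# After standardization each column should have mean ~0 and standard deviation 1
col_means <- colMeans(scaled)
col_sds <- apply(scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# scale() returns a matrix and keeps the original centers and scales as attributes
print(is.matrix(scaled))
print(attr(scaled, "scaled:center"))
print(attr(scaled, "scaled:scale"))
```

Because scale() returns a matrix rather than a data frame, columns are selected with mydata[, 1:4] rather than with $.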

Partitioning

K-means, the most popular partitioning method, requires you to specify the number of clusters in advance. A plot of the within-groups sum of squares against the number of clusters extracted can help determine an appropriate number. This is called the elbow method: you look for a bend in the plot, beyond which adding clusters yields only a small reduction in the within-groups sum of squares, and take the number of clusters at the bend.

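# k = 1: total sum of squares of the data
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
# k = 2..15: within-cluster sums of squares, added up over the clusters
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")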
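
Because kmeans() starts from random centers, the curve can change slightly from run to run. The following is a minimal sketch of a more repeatable variant, assuming a fixed seed and several random starts per k (set.seed(42), nstart = 25, and the names x, ks, and wss_k are choices made here, not part of the original analysis). It also checks that the k = 1 value equals the total sum of squares used to initialise wss above.

```r
set.seed(42)  # assumption: any fixed seed works; it only makes the run repeatable
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..10.
# nstart = 25 runs k-means from 25 random starts and keeps the best solution.
ks <- 1:10
wss_k <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# With a single cluster, the within-cluster sum of squares is the total sum of squares
total_ss <- (nrow(x) - 1) * sum(apply(x, 2, var))
print(all.equal(wss_k[1], total_ss))

plot(ks, wss_k, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```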

K-means Cluster Analysis

fit <- kmeans(mydata, 5) # 5 clusters here; you choose the number of clusters for kmeans
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)

##   Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       1   -0.9987207   0.9032290  -1.29875725 -1.25214931
## 2       2    0.2383957  -0.1485335   0.39724690  0.32642682
## 3       3    0.4937294  -0.6801714   0.80454443  0.80514277
## 4       4    1.3926646   0.2323817   1.15674505  1.21327591
## 5       5   -0.4201099  -1.4246794   0.03924137 -0.05279511

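# append the cluster assignment for each observation to the dataset
# (mydata becomes a data frame with a fifth column, fit.cluster;
#  use mydata[, 1:4] wherever only the four measurements are wanted)
mydata <- data.frame(mydata, fit$cluster)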
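
The species labels were dropped before clustering, so one way to see how well the clusters recover them is to cross-tabulate cluster assignments against iris$Species. The sketch below is only an illustration: it assumes k = 3 (to match the three species, not the k = 5 used above), a fixed seed, and the hypothetical names x, fit3, and tab.

```r
set.seed(42)  # assumption: fixed seed for a repeatable result
x <- scale(iris[, 1:4])

# k = 3 is chosen here only because iris has three species
fit3 <- kmeans(x, centers = 3, nstart = 25)

# Rows: k-means cluster number, columns: true species
tab <- table(cluster = fit3$cluster, species = iris$Species)
print(tab)
```

Cluster numbers are arbitrary, so a cluster should be matched to the species that dominates its row rather than compared by number.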

Model-based Clustering

Model-based clustering fits a variety of models by maximum likelihood estimation and uses a Bayesian criterion to identify the most likely model and number of clusters. The Mclust() function (in the mclust package) selects the optimal model.
One chooses the model and number of clusters with the largest BIC (Bayesian information criterion, https://medium.com/@analyttica/what-is-bayesian-information-criterion-bic-b3396a894be6). Try help(mclustModelNames) to see more details.

library(mclust)
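fit <- Mclust(mydata) # mydata still includes the fit.cluster column appended above; use mydata[, 1:4] to fit the four measurements only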
#plot(fit) # plot results
summary(fit) # display the best model

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 4 components: 
## 
##  log-likelihood   n df       BIC      ICL
##       -164.5763 150 71 -684.9076 -684.952
## 
## Clustering table:
##  1  2  3  4 
## 49 22 29 50
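
The BIC in this summary can be reproduced from the other numbers in the table. The sketch below uses the sign convention mclust follows, BIC = 2 * logLik - df * log(n), which is why larger values are better. The parameter-count formula for the VEV model is my reading of that parameterisation and should be treated as an assumption. It gives 71 only for d = 5, which is consistent with the fit.cluster column appended earlier having been part of the data passed to Mclust().

```r
# Values copied from the summary(fit) output above
loglik <- -164.5763
n <- 150
df <- 71

# mclust convention: BIC = 2 * logLik - df * log(n), so larger is better
bic <- 2 * loglik - df * log(n)
print(bic)

# Assumed free-parameter count for a VEV model with G components in d dimensions:
# (G - 1) mixing proportions + G * d means + G volumes
# + (d - 1) shared shape parameters + G * d * (d - 1) / 2 orientation parameters
G <- 4
d <- 5  # four scaled measurements plus the appended fit.cluster column
n_par <- (G - 1) + G * d + G + (d - 1) + G * d * (d - 1) / 2
print(n_par)
```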

Plotting clusters

# K-Means Clustering with the number of clusters that you specify
fit <- kmeans(mydata, 5)

# Cluster plot against the first two principal components
# (vary the parameters for the most readable graph)

clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
   labels= 2, lines=0)

# Centroid plot against the first two discriminant functions

plotcluster(mydata, fit$cluster)
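
To make "first two principal components" concrete, the projection can be computed by hand with prcomp() and plotted with base graphics. This is a minimal sketch under stated assumptions (k = 3, a fixed seed, and the illustrative names x, fit3, pc, scores, and var_explained). It is not a reimplementation of clusplot(), which also draws ellipses around the clusters.

```r
set.seed(42)  # assumption: fixed seed for a repeatable result
x <- scale(iris[, 1:4])
fit3 <- kmeans(x, centers = 3, nstart = 25)  # k = 3 chosen for illustration

# Principal components of the standardized measurements
pc <- prcomp(x)
scores <- pc$x[, 1:2]  # coordinates of each observation on PC1 and PC2

# Share of the total variance captured by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var_explained[1:2], 3))

# Scatter plot of the first two components, colored by cluster
plot(scores, col = fit3$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
```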

Resources

Cluster Analysis https://www.statmethods.net/advstats/cluster.html