library(caret) #confusionMatrix
library(cluster) #clustering
library(fpc) #plotcluster
mydata<- iris
str(mydata)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata[, 1:4]) # standardize variables
The k-means algorithm, the most popular partitioning model, requires you to specify the number of clusters in advance. A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number. This is called the elbow method: you look for a bend in the plot, which indicates an optimal number of clusters.
# within-groups sum of squares for k = 1 (total variance of the data)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
# within-groups sum of squares for k = 2..15
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
fit <- kmeans(mydata, 5) # fit k-means with the number of clusters you decided on (here 5)
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 0.3558492 -0.3930869 0.5846038 0.5466361525
## 2 2 -0.3628650 -1.4097814 0.1074147 0.0008746178
## 3 3 -1.3477916 0.1187465 -1.3100027 -1.2931622378
## 4 4 -0.7467198 1.4252951 -1.2932659 -1.2173430935
## 5 5 1.3926646 0.2323817 1.1567451 1.2132759051
# append the cluster assignment for the observations in the dataset
# (stored in a new data frame so that mydata keeps only the scaled variables
# for the Mclust() and clusplot() calls below)
mydata_clustered <- data.frame(mydata, cluster = fit$cluster)
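Since the true species labels are known for iris, a quick sanity check is to cross-tabulate them against the cluster assignments. A sketch using base R's table(); with 5 clusters versus 3 species the table is not square, so caret's confusionMatrix() does not apply directly here:
table(Cluster = fit$cluster, Species = iris$Species)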
Model-based clustering tries a variety of models, applying maximum likelihood estimation and the Bayesian Information Criterion (BIC, see https://medium.com/@analyttica/what-is-bayesian-information-criterion-bic-b3396a894be6) to identify the most likely model and number of clusters. The Mclust() function (in the mclust package) selects the optimal model: you choose the model and number of clusters with the largest BIC. Try help(mclustModelNames) to see more details.
library(mclust)
fit <- Mclust(mydata)
#plot(fit) # plot results
summary(fit) # display the best model
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## -139.8621 150 71 -635.4792 -635.4792
##
## Clustering table:
## 1 2 3 4
## 50 21 29 50
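To see why this model was selected, you can plot the BIC values across all candidate covariance models and numbers of components, and cross-tabulate the resulting classification against the known species. A short sketch using the standard plot method and the classification field of an Mclust fit:
plot(fit, what = "BIC") # BIC for each covariance model and number of components
table(Cluster = fit$classification, Species = iris$Species)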
# K-means clustering with the number of clusters that you specify (re-fit here for plotting)
fit <- kmeans(mydata, 5)
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)
# Centroid Plot against 1st 2 discriminant functions
plotcluster(mydata, fit$cluster)
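Beyond visual inspection, you can quantify cluster quality with a silhouette analysis. A minimal sketch using silhouette() from the cluster package loaded above; widths near 1 indicate well-separated points, widths near 0 indicate points on a cluster boundary:
d <- dist(mydata)                 # Euclidean distances on the scaled variables
sil <- silhouette(fit$cluster, d) # per-observation silhouette widths
summary(sil)$avg.width            # average silhouette width across all points
plot(sil)                         # silhouette plot, one bar per observation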
Reference: Cluster Analysis (Quick-R), https://www.statmethods.net/advstats/cluster.html