library(caret) #confusionMatrix
library(cluster) #clustering
library(fpc)
mydata<- iris
str(mydata)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
mydata <- na.omit(mydata) # listwise deletion of missing values
mydata <- scale(mydata[, 1:4]) # standardize variables
K-means, the most popular partitioning method, requires you to specify the number of clusters in advance. A plot of the within-groups sum of squares against the number of clusters extracted can help determine an appropriate number. This is called the elbow method: look for a bend in the plot, which suggests an optimal number of clusters.
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
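Because kmeans() starts from random centers, the elbow curve can change from run to run. The following is a minimal sketch of a more repeatable variant, not the original author's code: the seed value, `nstart = 25`, and `iter.max = 50` are arbitrary assumptions, and the variables `x`, `ks`, and `wss_stable` are names introduced here for illustration.

```r
# Standardize the four iris measurements, as in the text above.
x <- scale(iris[, 1:4])

set.seed(42)  # arbitrary seed (assumption), used only for repeatability
ks <- 1:15

# For each k, keep the best of 25 random starts and record the
# total within-cluster sum of squares.
wss_stable <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

plot(ks, wss_stable, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

With standardized data, the value at k = 1 equals (n - 1) times the number of variables, which gives a convenient sanity check on the first point of the curve.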
fit <- kmeans(mydata, 5) #you can decide the number of clusters for kmeans
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 -0.9987207 0.9032290 -1.29875725 -1.25214931
## 2 2 0.2383957 -0.1485335 0.39724690 0.32642682
## 3 3 0.4937294 -0.6801714 0.80454443 0.80514277
## 4 4 1.3926646 0.2323817 1.15674505 1.21327591
## 5 5 -0.4201099 -1.4246794 0.03924137 -0.05279511
# append cluster assignment for the observations in the dataset
# (this adds a fifth column, fit.cluster, so later fits on mydata include it)
mydata <- data.frame(mydata, fit$cluster)
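Since iris also carries known species labels, one way to get a feel for a clustering is to cross-tabulate cluster assignments against `Species`. This is a sketch under stated assumptions: using 3 clusters (to match the three species), the seed, and `nstart = 25` are choices made here for illustration, and `km` and `tab` are hypothetical names.

```r
# Cluster the standardized measurements only (no label column).
x <- scale(iris[, 1:4])

set.seed(42)  # arbitrary seed (assumption)
km <- kmeans(x, centers = 3, nstart = 25)  # 3 clusters is an assumption

# Rows are the true species, columns are the k-means cluster numbers.
tab <- table(Species = iris$Species, Cluster = km$cluster)
print(tab)
```

Cluster numbers are arbitrary labels, so a species that is well separated shows up as a row concentrated in a single column, whichever column that happens to be.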
Model-based clustering fits a variety of models and applies maximum likelihood estimation and the Bayesian information criterion (BIC) to identify the most likely model and number of clusters. The Mclust() function (in the mclust package) selects the optimal model.
One chooses the model and number of clusters with the largest BIC (see https://medium.com/@analyttica/what-is-bayesian-information-criterion-bic-b3396a894be6). Try help(mclustModelNames) for more details on the available models.
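mclust reports BIC on a scale where larger is better: twice the log-likelihood minus the number of estimated parameters times log(n). As a quick arithmetic check, the sketch below plugs in the log-likelihood, n, and df values printed in the summary output further down in this section. The variable names are introduced here for illustration.

```r
# Values copied from the summary(fit) output in this section.
loglik <- -164.5763
n      <- 150
df     <- 71

# BIC as used by mclust: larger values indicate a better model.
bic <- 2 * loglik - df * log(n)
print(bic)
```

The result agrees with the BIC column of that summary up to rounding of the printed log-likelihood.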
library(mclust)
fit <- Mclust(mydata)
#plot(fit) # plot results
summary(fit) # display the best model
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## -164.5763 150 71 -684.9076 -684.952
##
## Clustering table:
## 1 2 3 4
## 49 22 29 50
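The df of 71 in the summary above is what a VEV model with 4 components has in five dimensions, which suggests the appended k-means cluster column was part of the data passed to Mclust(). The following is a minimal sketch, assuming the intent is to model only the four standardized measurements. `x` and `fit4` are hypothetical names, and no output is shown because it was not part of the original run.

```r
library(mclust)

# Number of estimated parameters for a VEV model with
# 4 components in 5 dimensions (compare with df in the summary).
nMclustParams("VEV", d = 5, G = 4)

# Fit on the four standardized measurements only.
x <- scale(iris[, 1:4])
fit4 <- Mclust(x)

summary(fit4)                # selected model and number of components
table(fit4$classification)   # cluster sizes
```

The selected model and number of components may differ from the five-variable fit shown above, since the cluster label is no longer among the inputs.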
# K-Means Clustering with the number of clusters that you specify
fit <- kmeans(mydata, 5)
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels= 2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
plotcluster(mydata, fit$cluster)
Cluster Analysis https://www.statmethods.net/advstats/cluster.html