library(caret) #confusionMatrix
library(cluster) #clustering
library(fpc) #plotcluster
mydata<- iris
str(mydata)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata[, 1:4]) # standardize variables
The k-means algorithm, the most popular partitioning model, requires you to specify the number of clusters in advance. A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number. This is called the elbow method: you look for a bend in the plot, which indicates an optimal number of clusters.
# within-groups sum of squares for k = 1 (total variance of the data)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
# within-groups sum of squares for k = 2..15
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
fit <- kmeans(mydata, 5) # fit k-means with the number of clusters you decided on (here 5)
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 0.3558492 -0.3930869 0.5846038 0.5466361525
## 2 2 -0.3628650 -1.4097814 0.1074147 0.0008746178
## 3 3 -1.3477916 0.1187465 -1.3100027 -1.2931622378
## 4 4 -0.7467198 1.4252951 -1.2932659 -1.2173430935
## 5 5 1.3926646 0.2323817 1.1567451 1.2132759051
# append the cluster assignment for the observations in the dataset
# (stored in a new data frame so that mydata keeps only the scaled variables
# for the Mclust() and clusplot() calls below)
mydata_clustered <- data.frame(mydata, cluster = fit$cluster)
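Since the true species labels are known for iris, a quick sanity check is to cross-tabulate them against the cluster assignments. A sketch using base R's table(); with 5 clusters versus 3 species the table is not square, so caret's confusionMatrix() does not apply directly here:
table(Cluster = fit$cluster, Species = iris$Species)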
Model-based clustering tries a variety of models, applying maximum likelihood estimation and the Bayesian Information Criterion (BIC, see https://medium.com/@analyttica/what-is-bayesian-information-criterion-bic-b3396a894be6) to identify the most likely model and number of clusters. The Mclust() function (in the mclust package) selects the optimal model: you choose the model and number of clusters with the largest BIC. Try help(mclustModelNames) to see more details.
library(mclust)
fit <- Mclust(mydata)
#plot(fit) # plot results
summary(fit) # display the best model
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## -139.8621 150 71 -635.4792 -635.4792
##
## Clustering table:
## 1 2 3 4
## 50 21 29 50
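To see why this model was selected, you can plot the BIC values across all candidate covariance models and numbers of components, and cross-tabulate the resulting classification against the known species. A short sketch using the standard plot method and the classification field of an Mclust fit:
plot(fit, what = "BIC") # BIC for each covariance model and number of components
table(Cluster = fit$classification, Species = iris$Species)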
# K-means clustering with the number of clusters that you specify (re-fit here for plotting)
fit <- kmeans(mydata, 5)
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)
# Centroid Plot against 1st 2 discriminant functions
plotcluster(mydata, fit$cluster)
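Beyond visual inspection, you can quantify cluster quality with a silhouette analysis. A minimal sketch using silhouette() from the cluster package loaded above; widths near 1 indicate well-separated points, widths near 0 indicate points on a cluster boundary:
d <- dist(mydata)                 # Euclidean distances on the scaled variables
sil <- silhouette(fit$cluster, d) # per-observation silhouette widths
summary(sil)$avg.width            # average silhouette width across all points
plot(sil)                         # silhouette plot, one bar per observation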
Reference: Cluster Analysis (Quick-R), https://www.statmethods.net/advstats/cluster.html