K-Medoids in R: Algorithm and Practical Examples
The k-medoids algorithm is a clustering approach related to k-means clustering for partitioning a data set into k groups or clusters. In k-medoids clustering, each cluster is represented by one of the data point in the cluster. These points are named cluster medoids.
K-medoid is a robust alternative to k-means clustering. This means that, the algorithm is less sensitive to noise and outliers, compared to k-means, because it uses medoids as cluster centers instead of means (used in k-means).
The most common k-medoids clustering methods is the PAM algorithm (Partitioning Around Medoids, (Kaufman and Rousseeuw 1990)).
We’ll use the demo data sets “USArrests”, which we start by scaling (Chapter data preparation and R packages) using the R function scale() as follow:
data("USArrests") # Load the data set
df <- scale(USArrests) # Scale the data
head(df, n = 3) # View the firt 3 rows of the data
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
In the following examples, we’ll describe only the function pam(), which simplified format is:
library("cluster")
library("factoextra")
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# pam(x, k, metric = "euclidean", stand = FALSE)
To estimate the optimal number of clusters, we’ll use the average silhouette method.
fviz_nbclust(
df,
FUNcluster = clara,
method = c("silhouette"),
diss = NULL,
k.max = 10,
nboot = 100,
verbose = interactive(),
barfill = "steelblue",
barcolor = "steelblue",
linecolor = "steelblue",
print.summary = TRUE,
)
From the plot, the suggested number of clusters is 2. In the next section, we’ll classify the observations into 2 clusters.
The R code below computes PAM algorithm with k = 2:
pam.res <- pam(df, 2)
print(pam.res)
## Medoids:
## ID Murder Assault UrbanPop Rape
## New Mexico 31 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska 27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
## Objective function:
## build swap
## 1.441358 1.368969
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
The printed output shows: > the cluster medoids: a matrix, which rows are the medoids and columns are variables > the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
If you want to add the point classifications to the original data, use this:
dd <- cbind(USArrests, cluster = pam.res$cluster)
head(dd, n = 3)
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 1
## Arizona 8.1 294 80 31.0 1
The function pam() returns an object of class pam which components include: > medoids: Objects that represent clusters > clustering: a vector containing the cluster number of each object
These components can be accessed as follow:
# Cluster medoids: New Mexico, Nebraska
pam.res$medoids
## Murder Assault UrbanPop Rape
## New Mexico 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska -0.8008247 -0.8250772 -0.2445636 -0.5052109
# Cluster numbers
head(pam.res$clustering)
## Alabama Alaska Arizona Arkansas California Colorado
## 1 1 1 2 1 1
fviz_cluster(
pam.res,
data = df,
choose.vars = NULL,
stand = TRUE,
axes = c(1, 2),
geom = c("point", "text"),
repel = FALSE,
show.clust.cent = TRUE,
ellipse = TRUE,
ellipse.type = "convex",
ellipse.level = 0.95,
ellipse.alpha = 0.2,
shape = NULL,
pointsize = 1.5,
labelsize = 12,
main = "Cluster plot",
xlab = NULL,
ylab = NULL,
outlier.color = "black",
outlier.shape = 19,
outlier.pointsize = pointsize,
outlier.labelsize = labelsize,
ggtheme = theme_grey()
)
Source: https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples/