Clustering is an exploratory data analysis technique used for discovering groups or patterns in a dataset. There are two standard clustering strategies: partitioning methods and hierarchical clustering.
The most well-known and commonly used partitioning algorithms include:
K-means clustering (MacQueen, 1967), in which each cluster is represented by the center, or mean, of the data points belonging to the cluster.
K-medoids clustering, or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster.
We'll also describe a variant of PAM named CLARA (Clustering Large Applications), which is used for analyzing large datasets. The sketch below previews how these three algorithms are typically called in R.
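As a minimal preview (not part of the worked example that follows), here is how the three algorithms can be invoked; pam() and clara() come from the cluster package shipped with R, and k = 4 is an arbitrary choice for illustration:
library(cluster)          # provides pam() and clara()
df <- scale(USArrests)    # standardize before distance-based clustering
km <- kmeans(df, centers = 4, nstart = 25)   # K-means: clusters around means
pm <- pam(df, k = 4)                         # PAM: clusters around medoids
cl <- clara(df, k = 4, samples = 50)         # CLARA: PAM on subsamples, for large data
table(km$cluster, pm$clustering)             # compare the two partitions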
The demo dataset, USArrests, contains the number of arrests for murder, assault, and rape (per 100,000 residents) in each of the 50 US states in 1973. It also contains the percentage of the population living in urban areas.
Cluster analysis is popular in many fields, including:
If you want to invest in the housing market, you can use clustering to identify groups of houses according to their type, value, and location.
In cancer research, for classifying patients into subgroups according to their gene expression profiles. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
In marketing, for market segmentation: identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.
Factoextra official online documentation: http://www.sthda.com/english/rpkgs/factoextra.
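factoextra is available on CRAN; if it is not yet installed, a one-time installation is enough (a sketch, assuming an internet connection):
install.packages("factoextra")   # one-time install; the cluster package ships with R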
It contains 50 observations on 4 variables:
[,1] Murder    numeric  Murder arrests (per 100,000)
[,2] Assault   numeric  Assault arrests (per 100,000)
[,3] UrbanPop  numeric  Percent urban population
[,4] Rape      numeric  Rape arrests (per 100,000)
data("USArrests")
rawdf <- na.omit(USArrests)
head(rawdf)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Note that the variables have very different means and variances, as the descriptive statistics below show. Since we don't want the k-means result to depend on arbitrary variable units, we will standardize the data using the R function scale() before applying k-means.
desc_stats <- data.frame(
Min = apply(rawdf, 2, min), # minimum
Med = apply(rawdf, 2, median), # median
Mean = apply(rawdf, 2, mean), # mean
SD = apply(rawdf, 2, sd), # Standard deviation
Max = apply(rawdf, 2, max) # Maximum
)
desc_stats <- round(desc_stats, 1)
head(desc_stats)
## Min Med Mean SD Max
## Murder 0.8 7.2 7.8 4.4 17.4
## Assault 45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0 66.0 65.5 14.5 91.0
## Rape 7.3 20.1 21.2 9.4 46.0
# Standardize the variables (mean 0, sd 1) before clustering
df <- scale(rawdf)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
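As a quick sanity check (not part of the original output), each standardized column should now have mean 0 and standard deviation 1:
round(colMeans(df), 10)  # column means are 0 up to floating-point error
apply(df, 2, sd)         # column standard deviations are exactly 1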
# K-means with k = 4 clusters and 25 random starts; seed set for reproducibility
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
km.res
## K-means clustering with 4 clusters of sizes 13, 16, 13, 8
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 2 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 3 0.6950701 1.0394414 0.7226370 1.27693964
## 4 1.4118898 0.8743346 -0.8145211 0.01927104
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 4 3 3 4 3
## Colorado Connecticut Delaware Florida Georgia
## 3 2 2 3 4
## Hawaii Idaho Illinois Indiana Iowa
## 2 1 3 2 1
## Kansas Kentucky Louisiana Maine Maryland
## 2 1 4 1 3
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 3 1 4 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 3 1 2
## New Mexico New York North Carolina North Dakota Ohio
## 3 3 4 1 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 4
## South Dakota Tennessee Texas Utah Vermont
## 1 4 3 2 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 11.952463 16.212213 19.922437 8.316061
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
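Each of these components can be accessed with the usual $ operator; for example:
km.res$size          # number of observations in each cluster
km.res$centers       # cluster means in the standardized space
km.res$tot.withinss  # total within-cluster sum of squares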
# Compute the mean of each variable within the clusters, using the original (unscaled) data:
aggregate(USArrests, by=list(cluster=km.res$cluster), mean)
## cluster Murder Assault UrbanPop Rape
## 1 1 3.60000 78.53846 52.07692 12.17692
## 2 2 5.65625 138.87500 73.87500 18.78125
## 3 3 10.81538 257.38462 76.00000 33.19231
## 4 4 13.93750 243.62500 53.75000 21.41250
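A common follow-up, sketched here, is to attach the cluster assignment to the original data for further inspection:
# Add the cluster assignment as an extra column of the raw data
dd <- cbind(USArrests, cluster = km.res$cluster)
head(dd)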
The function fviz_cluster() [factoextra package] can be used to visualize the results. Observations are represented by points in the plot, using the first two principal components when ncol(data) > 2. An ellipse is drawn around each cluster.
library("factoextra")
## Loading required package: ggplot2
fviz_cluster(km.res, data = df,
palette = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"),
ggtheme = theme_minimal(),
main = "Partitioning Clustering Plot"
)
Cluster validation using the silhouette coefficient (Si): for each observation i, Si = (bi - ai) / max(ai, bi), where ai is the average distance between i and all other members of its own cluster, and bi is the average distance between i and the members of the nearest neighboring cluster. A value of Si close to 1 indicates that the object is well clustered; a value close to -1 indicates that it is poorly clustered, and a value near 0 means it lies between two clusters.
The figure below shows the silhouette plot of a k-means clustering and introduces the silhouette() function [cluster package].
library(cluster)
library(HSAUR)
## Loading required package: tools
data(pottery)                       # chemical composition of Romano-British pottery
set.seed(123)                       # for a reproducible k-means partition
km <- kmeans(pottery, 3)
dissE <- daisy(pottery)             # pairwise dissimilarities between observations
dE2 <- dissE^2                      # squared dissimilarities, as used in the silhouette plot
sk2 <- silhouette(km$cluster, dE2)  # silhouette widths for the k-means partition
plot(sk2)
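The plot can be summarized by the average silhouette width; a sketch of extracting it from the sk2 object above:
summary(sk2)$avg.width   # average silhouette width (closer to 1 is better)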