Clustering refers to the classification or categorization of observations based on their inherent similarities. Cluster analysis has numerous applications; in marketing, for example, it is used for market segmentation: identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising. In this post we will perform a cluster analysis of the “USArrests” dataset, which is available in R’s built-in “datasets” package.

Data

This dataset contains the number of arrests (per 100,000 residents) for murder, assault and rape in each of the 50 US states, as well as the percentage of each state’s population living in urban areas.

data("USArrests")
us_arrests<-USArrests #Making copy
head(us_arrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
str(us_arrests)
## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
summary(us_arrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

The data is fairly clean and does not have any missing values.
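
As a quick check (a small addition, not shown in the original output), the missing values can be counted per column:

colSums(is.na(us_arrests)) #Count of NAs per column; all zeros confirm no missing values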

Cluster Analysis

The aim of our cluster analysis is to group states based on the similarity of their crime profiles. Several clustering algorithms could be used for this analysis, namely:

  1. Hierarchical clustering

  2. K-means clustering

  3. Partitioning Around Medoids (PAM)

  4. Clustering Large Applications (CLARA)

All of these clustering methods are sensitive to the scale (units) of the data, so it is essential to bring all variables of our dataset onto a similar scale.

sus_arrests<-scale(us_arrests) #Standardize each variable to mean 0 and sd 1
head(sus_arrests)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

Hierarchical clustering

This algorithm produces a hierarchy over the observations of the dataset. To assess the similarity between observations, a distance matrix is required to quantify how close observations are to one another. The ‘dist’ function can be used in R to compute this distance matrix:

dist_data<-dist(sus_arrests, method = 'euclidean')

The above distance matrix is based on Euclidean distances; however, the method can be changed to “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Hierarchical clustering is best visualized by plotting a dendrogram.

hdata<-hclust(dist_data) #Agglomerative clustering; complete linkage by default
plot(hdata)
abline(h=3.75, lty=2) #Horizontal line cutting the dendrogram

Referring to the dendrogram, choose a suitable height and draw a horizontal line there; the number of branches the line cuts is the number of clusters in the data. Cutting at a height of 3.75, our hierarchical clustering produces 4 clusters.
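
The cluster labels can also be extracted programmatically with ‘cutree’ (a small addition, not in the original post), by specifying either the number of clusters or the cut height:

hclusters<-cutree(hdata, k=4) #Assign each state to one of 4 clusters
table(hclusters) #Number of states per cluster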

K-Means clustering

This is probably the most popular clustering algorithm. It partitions the data into a user-specified number of clusters, with each cluster centered at the mean of its members. Because the number of clusters must be chosen before running k-means, the results are sensitive to that choice. Based on the hierarchical clustering above, we can proceed with k-means using 4 centers.

set.seed(123)
kus_arrests<-kmeans(sus_arrests, centers = 4, nstart = 50)
#Simple visualisation of clusters 
plot(x=sus_arrests[,1], y=sus_arrests[,2], col=kus_arrests$cluster)
points(kus_arrests$centers, pch=3, cex=2)

The reason for setting a seed before k-means is that the algorithm chooses its starting centers at random, so it could produce different results in another run if the seed is not set. This also explains ‘nstart’: nstart is the number of times the algorithm is initialized with new random centers (the best solution is kept), so it is highly recommended to pick a higher value to stabilize the results.
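
To illustrate this sensitivity (an added illustration, not from the original post), running k-means with nstart = 1 under different seeds may yield different total within-cluster sums of squares:

set.seed(1)
kmeans(sus_arrests, centers = 4, nstart = 1)$tot.withinss #Single random start
set.seed(2)
kmeans(sus_arrests, centers = 4, nstart = 1)$tot.withinss #May differ from the run above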

The above plot is a simple visualization of the clusters, but it uses only 2 of the 4 variables. A better way to plot the data is to use the principal components of the dataset.

library(cluster)
clusplot(sus_arrests, kus_arrests$cluster, color = T, labels = 2, main = 'Cluster Plot')

Another way of analyzing this is via silhouette plots:

plot(silhouette(kus_arrests$cluster, dist = dist_data), col=2:5)

The silhouette plot shows the number of elements (nj) per cluster. Each horizontal line corresponds to an element. The length of each line is the element’s silhouette width si = (bi − ai) / max(ai, bi), where ai is the average dissimilarity of the element to the other members of its own cluster and bi is its average dissimilarity to the nearest neighboring cluster. Observations with a large si (close to 1) are very well clustered, a small si (around 0) means that the observation lies between two clusters, and observations with a negative si are probably placed in the wrong cluster.
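
The average silhouette width gives a single summary of clustering quality (a small addition, not in the original post); values closer to 1 indicate better-separated clusters:

sil<-silhouette(kus_arrests$cluster, dist = dist_data)
mean(sil[, "sil_width"]) #Average silhouette width across all states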

Partitioning Around Medoids (PAM)

The k-means algorithm uses means to define cluster centers; however, means are sensitive to outliers. PAM addresses this problem: in place of means, it searches for k representative objects, or medoids, among the observations. After finding a set of k medoids, k clusters are constructed by assigning each observation to its nearest medoid.

PAMus_arrests<-pam(sus_arrests, 4)
clusplot(sus_arrests, PAMus_arrests$clustering, color = T, main = 'Cluster Plot')

CLARA is a partitioning method designed for much larger datasets (more than several thousand observations): it applies PAM to samples of the data in order to reduce the computing time and memory usage, which could otherwise become prohibitive for PAM.
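
Although USArrests is far too small to need CLARA, a call would look like the following (a minimal sketch; the value of ‘samples’ is just an illustrative choice):

CLARAus_arrests<-clara(sus_arrests, k = 4, samples = 50) #PAM applied to 50 random subsamples
clusplot(sus_arrests, CLARAus_arrests$clustering, color = T, main = 'Cluster Plot')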

Determining optimal number of clusters

The partitioning methods above (k-means, PAM and CLARA), unlike hierarchical clustering, take k (the number of clusters) as an input from the user, so the optimal number of clusters should be assessed for these algorithms.

Elbow Method

The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation or total within-cluster sum of squares) is minimized. Hence, we can plot this quantity against the number of clusters:

kmax<-10 #Maximum number of clusters to evaluate
WSSus_arrests<-sapply(1:kmax, function(k) kmeans(sus_arrests, centers = k, nstart = 10)$tot.withinss) #Total within-cluster SS for each k

plot(1:kmax, WSSus_arrests, type = 'b', xlab = 'k', ylab = 'Total wss')
abline(v=4, lty=2)

The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

Gap Statistic method

This approach can be applied to any clustering method (k-means clustering, hierarchical clustering, etc.). The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data, i.e. a distribution with no obvious clustering. The reference datasets are generated using Monte Carlo simulations of the sampling process. Choose the smallest k such that

Gap(k) ≥ Gap(k+1) − s(k+1), where s(k+1) denotes the standard error of Gap(k+1).

Refer to additional material and the function documentation for details.

set.seed(123)
gap_stat <- clusGap(sus_arrests, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

plot(gap_stat, xlab = "Number of clusters k")
abline(v = 4, lty = 2)
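
The suggested number of clusters can also be read off programmatically with ‘maxSE’ from the cluster package (a small addition, not in the original post; “firstSEmax” is its default selection rule):

gap_tab<-as.data.frame(gap_stat$Tab)
maxSE(gap_tab$gap, gap_tab$SE.sim) #Recommended k under the first-SE-max rule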