Clustering

Introduction

The aim of cluster analysis is to group observations such that observations within a group are similar to each other and observations in different groups are not. The main idea behind clustering is to try to find hidden structure in data which are not already labelled; hence we organize data into clusters to show any internal structure the data might have.There are many ways to define similarity between observations and many different ways to group the observations once similarity is defined. Since clustering is the grouping of similar instances/objects, some sort of measure that can determine whether two objects are similar or dissimilar is required. Before we become a bit more specific about the different similarity definitions and the ways to group observations; let’s first have a look at some situations where clustering can be helpful.

Assume that you are running a large online store and you want to personalise each user’s shopping experience. You do not know each user’s personal preferences but you do have lots of information about him/her (i.e. records of all his/hers previous purchases). How can you use all this information to give each user a set of unique recommendations in order for him/her to buy more products? (yes you are not just doing this to help the user but you want to make some money also)

You are also working part time at an internet and telephone company which needs to establish its network by using its towers in an isolated region. Where should the company locate the towers, in order for the users to always have a strong signal?

Finally, your online store went bankrupt and now you were just hired to the Glasgow city council where your job is to identify groups of houses according to their house type, their value, and their geographical location.

You can use and take advantage of clustering techniques for all the previous situations.

Why clustering?

Clustering is a technique used extensively in many different fields; including image analysis, statistics, social science, machine learning, pattern recognition, and many others. One can perform a cluster analysis technique to

Detect a hidden pattern of the data
As a stand-alone-tool to get insight into the data distribution.
As a preprocessing step for other algorithms/techniques.

Clustering Methods: hierarchical methods

Hierarchical methods construct the clusters by recursively partitioning the instances in either a top-down (divisive) or bottom-up (agglomerative) fashion. Agglomerative hierarchical clustering begins with every case being a cluster by itself. At successive steps, similar clusters are merged. The algorithm ends with all observations in one cluster. Divisive clustering starts with all observations in one cluster and ends up with each observation in its own cluster. Obviously, neither the first step nor the last step is a worthwhile solution with either method.

In order to decide which clusters should be combined or split, depending on which of the previous hierarchical approaches we follow, a measure of similarity/dissimilarity must be introduced. Specifically, we need to define similarity between observations and dissimilarity between clusters. There are two main types of measures used to estimate this relation: distance measures and similarity measures. We can denote the distance between two observations \(x\) and \(y\) as \(d(x,y)\). To be a valid distance measure \(d\) should be symmetric and obtain its minimum value (usually zero) for \(x=y\). In order for the distance measure to be a metric distance measure it must also satisfy is must satisfy the following properties:

Triangle inequality \(d(x,y) < d(x,z)+d(z,y) \ \ \ \forall x, z, y, \ \) and
\(d(x, y)=0 \Rightarrow x=y\).

Popular distance metrics

To define simmilarity between observations we can use a metric distance measure. Some commonly used metrics can be seen on the Table below.

Metric	Formula
Euclidean distance	\(d(x,y)=\sqrt{\sum_{i}(x_i-y_i)^2}\)
Squared Euclidean distance	\(d(x,y)=\sum_{i}(x_i-y_i)^2\)
maximum distance	\(d(x,y)=\max\limits_i{\\|x_i-y_i\\|}\)
Manhattan distance	\(d(x,y)=\sum_i{\\|x_i-y_i\\|}\)

Cluster distance methods

To define dissimilarity between clusters we can use a linkage criterion. Some commonly used linkage criteria can be seen on the Table below.

Linkage criterion	Formula
single linkage	\(\min\{d(x,y): x \in A, y \in B \}\)
complete linkage	\(\max\{d(x,y): x \in A, y \in B \}\)
average linkage	\(\frac{1}{\\|A\\| \\|B\\|}\sum_{x \in A}\sum_{y \in B}d(x,y)\)

Clustering Methods: partitioning (or non-hierarchical) methods

Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods typically require that the number of clusters will be pre-set by the user.

By far, the most common partitioning method is the K-means cluser analysis. The procedure for this algorithm can be summarised as

Select K centroids
Assign each observation to its closest centroid.
Compute new centroids as the average of all observations in a cluster.
Assign each observation to its closest centroid.
Continue steps \(3\) and \(4\) until the observations are not reassigned or the maximum number of iterations is reached.

Question 1 I did seem to forget something important when describing the procedure for assigning observations to centroids. What is it?

K-means clustering can handle larger datasets than hierarchical cluster approaches. Additionally, observations are not permanently committed to a cluster. They are moved when doing so improves the overall solution. However, the use of means implies that all variables must be continuous and the approach can be severely affected by outliers.

Since K-means cluster analysis starts with k randomly chosen centroids, a different solution can be obtained each time the function is invoked. Use the set.seed() function to guarantee that the results are reproducible. Additionally, this clustering approach can be sensitive to the initial selection of centroids. The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart=25 will generate \(25\) initial configurations. This approach is often recommended.

Model-based clustering methods

Model-based methods assume that the data come from a source with several (K) sub-populations. Overall population is modelled by a weighted summation of the individual component densities which is known as a finite mixture model. \[\underline{x} \sim \sum_{g=1}^G\pi_gf_g(\underline{x}), \text{with} \ 0<\pi_g\leq1,\ \forall g, \ \ \sum_{g=1}^G\pi_g =1\] The \(\pi_g\) are known as the mixing weights and the full vector of them as the mixing distribution. Usually, components are given by Gaussian densities with (different) mean vectors and (possibly different) covariance matrices (also known as Gaussian mixture modelling) \[ \underline{x} \sim \sum_{g=1}^G\pi_gN(\underline{x}\mid \underline{\mu}_g,\Sigma_g)\] This is commonly estimated in a Bayesian fashion or by the EM algorithm; in order to estimate parameters and find cluster groupings

Reliability of clustering

In order to assess how “good” a specific clustering structure is, we can:

assess the strength of cluster assignments for observations through resampling,
assess cluster homogeneity,
use statistical inference to check the distributional properties of clustering results.

Silhouette plots

Silhouette plots offer another method to interpret and validate the consistency within the clusters. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from \(-1\) to \(1\), where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.

Specifically if

\(a_i\) is the average distance between \(i\) and all other observations in the cluster containing the \(i\) observation and,
\(b_i\) is the minimum average distance of the \(i\) observation to observations in other clusters; then we can define a silhouette width by \(s_i = (b_i - a_i)/\max\{a_i, b_i\}\)

For \(s_i\) to be close to \(1\) we would like \(a_i << b_i\) since is a measure of how dissimilar \(i\) is to its own cluster, a small value means it is well matched. Furthermore, a large \(b_i\) implies that \(i\) is badly matched to its neighbouring cluster. Thus an \(s_i\) close to one means that the observation is appropriately clustered. If \(s_i\) is close to \(-1\), then by the same logic we can see that \(i\) would be more appropriate if it was clustered in its neighbouring cluster. An \(s_i\) close to \(0\) means that the observation is on the border of two natural clusters.

The average \(s_i\) over all observations of a cluster is a measure of how tightly grouped all the data in the cluster are. Thus the average \(s_i\) over all observations of the entire data set is a measure of how appropriately the observations have been clustered. If there are too many or too few clusters, as may occur when a poor choice of \(K\) is used in the clustering algorithm (e.g.: k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and averages may be used to determine the natural number of clusters within a dataset.

Objects with large silhouette width \(s_i\) are well-clustered, those with small \(s_i\) tend to lie between clusters.

Other methods

Besides hierarchical, partitioning, and model-based methods, there are also bensity-based, and grid-based methods.

Clustering issues

Some issues that could pop up using clustering techniques are:

Choice of starting values. This is very crucial and a good practice is to use a range such as random starts, and hierarchical clustering-based starts
How many clusters \(K\) should we choose? Beware of Overfitting: A \(K\)-component mixture distribution is empirically indistinguishable from a mixture with more than \(K\) components.
Cluster shapes not captured by the method. Circular, normally distributed values are easiest to cluster by most existing methods; otherwise, we can try transforming the data.

Violent Crime rates by US State

Location of data set

This data set contains statistics, in arrests per \(100.000\) residents for assault, murder, and rape in each of the \(50\) US states in \(1973\). Also given is the percent of the population living in urban areas. The data set is located in the package datasets and is titled USArrests. You can load the data set with the command data(USArrests).

Important R functions for clustering

To compute and return the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix we will need to use the function dist. to use different distance measures we will use its argument method (e.g. method="euclidean"). Now in order to do a hierarchical cluster analysis on a set of dissimilarities we need the function hclust. To choose the linkage criterion we will use its argument method (e.g. method="complete").

Dendrograms

You can think of a dendrogram as a graphical representation of a matrix of distances where the objects are joined together in a hierarchical fashion from the closest, that is most similar, to the furthest apart, that is the most different. The y-axis is a measure of closeness of either individual data points or clusters.

A dendrogram is a branching diagram that represents the relationships of similarity among a group of entities. Each branch is called a clade. The terminal end of each clade is called a leaf. Clades can have just one leaf,or they can have more than one (there is no limit to the number of leaves in a clade). The arrangement of the clades tells us which leaves are most similar to each other. The height of the branch points indicates how similar or different they are from each other: the greater the height, the greater the difference. We can use a dendrogram to represent the relationships between any kinds of entities as long as we can measure their similarity to each other. There are two ways to interpret a dendrogram: in terms of large-scale groups or in terms of similarities among individual chunks.

Let’s now plot the results of our clustering approach using what is known as a cluster dendrogram. For the next plot we will create a dendrogram plot with horizontal labels but also having the leaves hang according to their height (for the plot we have used the Euclidean distance and the complete linkage method).

usarrests.clus=hclust(dist(USArrests),method = "complete")
plot(as.dendrogram(usarrests.clus,hang = 0.01),ylab="complete linkage")

To identify large-scale groups, we start reading from the top down, finding the branch points that are at high levels in the structure. In the previous dendrogram we can see that the first two clades are at values \(160\) and \(100\). The states at the 2nd clade (at value \(100\) are more similar to each other than the states at the 1st clade ( at value \(160\)). The longer the line, the greater the difference. The difference between the two aforementioned clades is \(60\).

If we are trying to identify which segments are most similar to each other, we read the dendrogram from the bottom up, identifying the first clades to join together as we move from bottom to top. In this case, New Hampshire and Iowa seem to be the first two leaves that constitute a clade. Kansas and Indiana are similarly clustered and they are more similar to each other than they are to any other states. The height of the vertical lines (in this case the horizontal lines) indicates the degree of difference between branches.

Question 2 If we don’t just focus on New Hampshire and Iowa but also include Wisconsin; which states are most similar to each other?the previous triplet?or Vermont and North Dakota?

Basic plots

We can now use the function cutree to cut the dendrogram (i.e. tree) into several groups. In this case we will choose 3 groups and plot the data with respect to the number of assaults and the percentage of urban population.

 usarrests.labels=cutree(usarrests.clus,k=3)
plot(USArrests[,2:3],col=usarrests.labels,xlab="Number of assaults (per 100000)",ylab="Percentage of urban population")

Question 3 Locate Florida and Illinois in the previous plot. After you do that, assume that you have information that Florida and Illinois should be clustered together. What can you do to make this happen?

K-means clustering

In this case I will choose for the data set to be split in 3 clusters.

usarr.kcl = kmeans(USArrests, centers=3)
plot(USArrests, col = usarr.kcl$cluster)

Question 4 Can you explain what you see in the plot?

In order to get the cluster means and also append the cluster assignment

# get cluster means 
aggregate(USArrests,by=list(usarr.kcl$cluster),FUN=mean)

  Group.1    Murder  Assault UrbanPop     Rape
1       1 11.812500 272.5625 68.31250 28.37500
2       2  4.270000  87.5500 59.75000 14.39000
3       3  8.214286 173.2857 70.64286 22.84286

# append cluster assignment
USArrests_cl <- data.frame(USArrests, usarr.kcl$cluster)

We should also have a look at the cluster results

library(cluster) 
clusplot(USArrests_cl, usarr.kcl$cluster, color=TRUE, shade=TRUE, 
    labels=2, lines=0)

Question 4 What are principal components?

We have already said that choosing \(K\) (i.e. the numbers of clusters) can sometimes look arbitrary. IN that case we may be interested to compare the similarity between two cluster solutions using a variety of information criteria.

library(fpc)
usarr.kcl.3 = kmeans(USArrests, centers=3)

usarr.kcl.5 = kmeans(USArrests, centers=5)
d <- dist(USArrests)
cl.stats <- cluster.stats(d, usarr.kcl.3$cluster, usarr.kcl.5$cluster)

The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is \(-1\) (no agreement) to \(1\) (perfect agreement). Agreement between the species types and the cluster solution is 0.75 using Rand index and 0.41 using Meila’s VI.

Question 5 What is the Rand index and Meila’s VI?

To use model-based clustering in R we need the Mclust function that is located in the mclust package. We can use the function where we set the number of clusters between \(2\) and \(4\).

library(mclust)
usarrests.mclus <- Mclust(USArrests, G=2:4) 
usarrests.mclus$class

       Alabama         Alaska        Arizona       Arkansas     California       Colorado 
             1              1              1              3              1              1 
   Connecticut       Delaware        Florida        Georgia         Hawaii          Idaho 
             3              3              1              1              3              2 
      Illinois        Indiana           Iowa         Kansas       Kentucky      Louisiana 
             1              3              2              3              3              1 
         Maine       Maryland  Massachusetts       Michigan      Minnesota    Mississippi 
             2              1              3              1              2              1 
      Missouri        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             1              3              3              1              2              3 
    New Mexico       New York North Carolina   North Dakota           Ohio       Oklahoma 
             1              1              1              2              3              3 
        Oregon   Pennsylvania   Rhode Island South Carolina   South Dakota      Tennessee 
             3              3              3              1              2              1 
         Texas           Utah        Vermont       Virginia     Washington  West Virginia 
             1              3              2              3              3              2 
     Wisconsin        Wyoming 
             2              3

usarrests.mclus$BIC

Bayesian Information Criterion (BIC): 
        EII       VII       EEI       VEI       EVI       VVI       EEE       EVE       VEE
2 -1909.053 -1912.947 -1607.452 -1609.023 -1615.866 -1617.668 -1606.974 -1615.833 -1603.649
3 -1830.779 -1832.879 -1606.625 -1593.359 -1624.123 -1608.130 -1613.336 -1632.235 -1603.548
4 -1818.534 -1825.686 -1598.203 -1599.814 -1623.513 -1624.106 -1614.600 -1622.765 -1616.619
        VVE       EEV       VEV       EVV       VVV
2 -1608.602 -1608.856 -1610.551 -1615.973 -1618.261
3 -1620.902 -1636.859 -1625.687 -1614.129 -1618.224
4 -1635.793 -1626.683 -1638.374 -1639.442 -1665.643

Top 3 models based on the BIC criterion: 
    VEI,3     EEI,4     VEI,4 
-1593.359 -1598.203 -1599.814

Question 6 Can you interpret the results?

Let’s have a look at a plot

plot(USArrests, col = usarrests.mclus$class,
pch= usarrests.mclus$class+16)

In order to visualise and assess the clustering structure we can use a silhouette plot (for more info on silhouette plots look at the section Clustering). We can use the function silhouette within the package cluster.

 usarrests.kmeans  <-  kmeans(USArrests, centers=3, iter.max=1000)
usarrests.mclus <- Mclust(USArrests, G=2:4)  
si.k <- silhouette(usarrests.kmeans$cluster, dist(USArrests))
si.m <- silhouette(usarrests.mclus$class, dist(USArrests))
plot(si.k, col = c("red", "green", "blue"), main = "silhouette plot for k-means") # with cluster-wise coloring

plot(si.m, col = c("red", "green", "blue"), main = "silhouette plot for model-based clustering")

Question 6 Can you interpret the results on the previous two graphs?

We can also visualise the results by looking at a map of the United States.

library(maps)
usarrests.kmeans  <-  kmeans(USArrests, centers=3, iter.max=1000)
usarrests.mclus <- Mclust(USArrests, G=2:4)  
si.k <- silhouette(usarrests.kmeans$cluster, dist(USArrests))
si.m <- silhouette(usarrests.mclus$class, dist(USArrests))
par(mfrow=c(1,2))
map('state', col = usarrests.kmeans$cluster+1, fill=T)#k-means results
title(main="K-means clustering results")  
map('state', col = usarrests.mclus$class+1, fill=T) #mclust results
title(main="Model-based clustering results")

Question 7 Can you interpret the clustering results on the two maps?

Cluster analysis on percentage of votes

Description of data set

The data set you will work on, on Lecture 5, is titled votes.repub and is located in the package cluster. You should install it with the command library(cluster). The data set refers to the percentage of votes given to the republican candidate in presidential elections from \(1856\) to \(1976\). Rows represent the 50 states, and columns the \(31\) elections.

Individual project

Things to do:

Perform a cluster analysis on the previous dataset using hierarchical and partitioning techniques.

Things to be careful of or/and think about:

Limitations of each technique.
What is the question that you try to answer using cluster analysis?
Why did you choose this clustering instead of another?
Do you have any concerns about any of the variables or even the data set as a whole?

Cluster analysis on different cars

Description of data set

The data set you will work on, on Lecture 6, is titled cars and is located on Moodle. You should load the data set with the command cars <- read.csv("cars_try.csv",header=TRUE,sep="\t") after you have put the file in your working directory. The data set refers to \(38\) different cars from \(1978\) to \(1979\). Rows represent each car, and columns refer to each car’s characteristics (e.g. MPG, weight, etc).

Individual project

Things to do:

Perform a cluster analysis on the previous dataset using hierarchical,partitioning, and model-based techniques.

Things to be careful of or/and think about:

Limitations of each technique.
What is the question that you try to answer using cluster analysis?
Why did you choose this clustering instead of another?
Do you have any concerns about any of the variables or even the data set as a whole?

One way to start

Just to help you out, you can start with:

cars <- read.csv("cars.csv",header=TRUE,sep="\t")# This line may be different for you
cars.use = cars[,-c(1,2)]
cars.use = scale(cars.use,center=TRUE,scale=TRUE)
head(cars.use)

            MPG     Weight Drive_Ratio Horsepower Displacement  Cylinders
[1,] -1.2005727  2.1179346 -0.70204953  2.0141161    1.9432589  1.6252130
[2,] -1.4144009  1.6850405 -1.60998615  1.5225285    1.9545104  1.6252130
[3,] -0.8492836  1.0498463 -1.03045214  0.8796831    1.0093813  1.6252130
[4,] -0.9561977  1.5237662 -1.24294794  1.8250439    2.0557742  1.6252130
[5,]  0.8002478 -1.0014491  1.17177710 -1.2757395   -0.8921284 -0.8700635
[6,]  0.4184118 -0.4285011 -0.08387992 -0.2547499   -0.4870731 -0.8700635

Question 8 Why did we use the scale function?

And now focusing on hierarchical partitioning we can plot the results with:

cars.dist = dist(cars.use)
cars.hclust = hclust(cars.dist)
plot(cars.hclust,labels=cars$Car,main='Default from hclust')

Question 9 Can you continue working on different clustering techniques?

Clustering

Introduction

Why clustering?

Clustering Methods: hierarchical methods

Popular distance metrics

Cluster distance methods

Clustering Methods: partitioning (or non-hierarchical) methods

Model-based clustering methods

Reliability of clustering

Silhouette plots

Other methods

Clustering issues

Further reading

Reference

Self help

Violent Crime rates by US State

Location of data set

Important R functions for clustering

Dendrograms

Basic plots

K-means clustering

Cluster analysis on percentage of votes

Description of data set

Individual project

Cluster analysis on different cars

Description of data set

Individual project

One way to start