ID5059 Lecture 17 - Clustering

C. Donovan
18 April 2018

Administrivia

  • Industrial action: averted
  • Presentations: I'll circulate times today
  • Marking: I suck, there's lots, it'll be this week, watch Moodle

NB: If it's not in the lecture or lab, it's not in the exam

Today

Clustering

  • The only unsupervised learning methods we cover (i.e. target/response is not defined/known)
  • We're simply looking for general similarity of observation vectors

What is a `cluster'?

Patterns of points

What is a `cluster'?

So:

  • We're good at finding clusters!…
  • … even when they `don't exist', and
  • … only in up to 3 dimensions.
  • Internally/intuitively we have quite a vague definition.

What is a `cluster'?

“Clusters may be described as continuous regions of (a) space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points.” (Everitt 1980).

  • does encompass most of what people would consider clusters
  • clear need to choose how to measure (dis)similarity, and
  • somehow control the spatial scale at which we consider clusters to be distinct

A basic problem

  • Consider an \( n\times p \) matrix \( \mathbf{Y} \) - the \( n \) rows give our observations and the \( p \) variables represent the variety of measurements we have made on each observation.
  • seek to group the \( n \) observations into a set of groups whose members are similar in terms of their measurements on the \( p \) variables
  • seek succinct descriptions of the clusters
  • seek some definition of the clusters to give rules for predicting class membership of new observations.

Observe there is no response variable (unsupervised)

Dissimilarity

We need something to summarise how similar things are

  • We usually use a number that is larger the less alike things are
  • For example, distance in the everyday sense: the further apart two points are, the more dissimilar they are

A basic approach

The basic principle will be a comparison of vectors from our data matrix \( \mathbf{Y} \), e.g. taking rows \( i \) and \( j \) has us comparing, in some way, \( \mathbf{y}_i \) and \( \mathbf{y}_j \):

\[ \begin{bmatrix} y_{i1}\\ y_{i2}\\ \vdots \\ y_{ip} \end{bmatrix} \leftrightarrow \begin{bmatrix} y_{j1}\\ y_{j2}\\ \vdots \\ y_{jp} \end{bmatrix} \]

Need-to-knows

  • In general how the similarity of two points may be measured.
  • How Euclidean and Manhattan distances are calculated.

Euclidean distance

Consider two \( p \)-dimensional column vectors \( \mathbf{y}_1 \) and \( \mathbf{y}_2 \) i.e. observations/rows taken from our data matrix. If we define a vector \( \mathbf{d} \) to be \( \mathbf{y}_1-\mathbf{y}_2 \), then:

\[ \|\mathbf{d}\|=\sqrt{\mathbf{d}^T\mathbf{d}} \qquad {\rm or} \qquad \left[\sum_{i=1}^p d_i^2\right]^{1/2} \]

This is the measure of distance that we encounter in day-to-day life (albeit only in one to three dimensions). You may notice this is simply a compact expression of the Pythagorean calculation of the hypotenuse.
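
As a concrete illustration, a minimal Python/NumPy sketch (not part of the original slides; the two vectors are made up):

    import numpy as np

    y1 = np.array([1.0, 4.0, 2.0])   # two illustrative observations with p = 3
    y2 = np.array([3.0, 1.0, 2.0])

    d = y1 - y2                      # d = y1 - y2
    euclidean = np.sqrt(d @ d)       # sqrt(d^T d); equivalently np.linalg.norm(d)
    print(euclidean)                 # sqrt(4 + 9 + 0) ~ 3.61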

Manhattan distance

The Manhattan or taxi-cab distance is given by

\[ \|\mathbf{d}\|_1=|\mathbf{d}|^T\mathbf{1}_p \qquad {\rm or} \qquad \sum_{i=1}^p |d_i| \]

(where \( |\mathbf{d}| \) denotes the vector of element-wise absolute values), which can be thought of as the distance between points when you can only move parallel to one axis at a time, e.g. travelling around the block.
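
Continuing the same made-up vectors (again a sketch, not from the slides), the element-wise absolute differences are summed:

    import numpy as np

    y1 = np.array([1.0, 4.0, 2.0])
    y2 = np.array([3.0, 1.0, 2.0])
    d = y1 - y2

    manhattan = np.sum(np.abs(d))    # sum_i |d_i| = 2 + 3 + 0 = 5
    print(manhattan)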

Summarising variability

  • Our \( n\times p \) data matrix \( \mathbf{Y} \) (which can be thought of as a cloud of \( n \) points in \( p \) dimensional space) has a `centroid' vector \( \bar{\mathbf{y}} \).
  • Create the sum-of-squares and cross-products matrix (the basis of a variance/covariance matrix) by subtracting this centroid from each row of \( \mathbf{Y} \) to give the matrix \( \mathbf{X} \), and performing a simple multiplication:

\[ \mathbf{T}=\mathbf{X^T}\mathbf{X} \]

  • The matrix \( \mathbf{T} \) will be termed the Total scatter or dispersion matrix (a short sketch of its construction follows).
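
A minimal NumPy sketch of this construction (illustrative only; the data are randomly generated):

    import numpy as np

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(10, 3))     # n = 10 observations on p = 3 variables

    ybar = Y.mean(axis=0)            # centroid vector
    X = Y - ybar                     # subtract the centroid from each row
    T = X.T @ X                      # total scatter / dispersion matrix (p x p)

    # dividing T by (n - 1) would give the usual sample covariance matrix
    print(T.shape)                   # (3, 3)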

Dispersion matrices

In terms of clustering we can decompose this to:

\[ \mathbf{T}=\mathbf{W}+\mathbf{B} \]

The within-cluster bits and the between-cluster bits.
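
As a sketch of what this decomposition means (illustrative; the data and the cluster labels are made up), \( \mathbf{W} \) sums each cluster's scatter about its own centroid and \( \mathbf{B} \) the scatter of the cluster centroids about the overall centroid:

    import numpy as np

    rng = np.random.default_rng(1)
    Y = rng.normal(size=(12, 2))                       # n = 12, p = 2
    labels = np.repeat([0, 1, 2], 4)                   # assumed cluster labels

    ybar = Y.mean(axis=0)
    T = (Y - ybar).T @ (Y - ybar)                      # total scatter

    p = Y.shape[1]
    W = np.zeros((p, p))                               # within-cluster scatter
    B = np.zeros((p, p))                               # between-cluster scatter
    for k in np.unique(labels):
        Yk = Y[labels == k]
        ck = Yk.mean(axis=0)                           # cluster centroid
        W += (Yk - ck).T @ (Yk - ck)
        B += len(Yk) * np.outer(ck - ybar, ck - ybar)

    print(np.allclose(T, W + B))                       # True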

Clustering methods

We consider only two broad approaches to clustering here.

  • hierarchical clustering
  • non-hierarchical clustering (with \( k \)-means as the only example)

Need-to-knows

  • What is meant by hierarchical and \( k \)-means clustering, agglomerative and divisive clustering. How do they operate?
  • What dispersion matrices are in clustering, and how they are used.
  • A description of what single, complete and group average linkage mean, as well as Ward's method.
  • What a dendrogram is and how to interpret it.
  • The short-comings (and assumptions) of the cluster methods covered, and difficulties in clustering for prediction and description in general.

k-means

  • We will only cover \( k \)-means clustering as a non-hierarchical clustering method
  • The principle is this: assume some number of clusters \( k \), and then organise the points into clusters to optimise some criterion - for example minimising the trace of \( \mathbf{W} \)
  • That is, we want small within cluster variability and large between cluster variability.

k-means

Algorithm: K-Means Clustering

  • Set K, the total number of clusters.
  • Randomly assign each observation to a cluster.
  • Iterate until no further improvement is achieved:

  • Compute cluster centroid for each of the \( K \) clusters, where the \( k^{th} \) cluster centroid is the mean vector for the observations in the \( k^{th} \) cluster.

  • Assign each observation to the cluster whose centroid is `closest', e.g. by Euclidean distance (a minimal code sketch of this algorithm follows).
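
A minimal NumPy sketch of the algorithm exactly as stated above (illustrative; it does not handle empty clusters, and in practice a library routine such as sklearn.cluster.KMeans would normally be used):

    import numpy as np

    def kmeans(Y, K, max_iter=100, seed=0):
        """Naive K-means: random initial assignment, then iterate."""
        rng = np.random.default_rng(seed)
        labels = rng.integers(K, size=Y.shape[0])       # random initial clusters
        for _ in range(max_iter):
            # centroid of each cluster = mean vector of its members
            centroids = np.array([Y[labels == k].mean(axis=0) for k in range(K)])
            # reassign each observation to the closest centroid (Euclidean distance)
            dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):      # no further improvement
                break
            labels = new_labels
        return labels, centroids

    # two made-up, well-separated groups in 2-D
    rng = np.random.default_rng(42)
    Y = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
    labels, centroids = kmeans(Y, K=2)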

k-means algorithm in action

Dispersion matrices - recall

\[ \mathbf{T}=\mathbf{X}^T\mathbf{X} \]

  • The matrix \( \mathbf{T} \) will be termed the Total scatter or dispersion matrix. We can decompose this into Between- and Within-group variability:

\[ \mathbf{T}=\mathbf{W}+\mathbf{B} \]

Hierarchical clustering

In contrast to \( k \)-means:

  • assume a hierarchical system of clusters exists i.e. within a cluster there are a number of smaller clusters, which are themselves made from a number of smaller clusters
  • at one extreme we have \( n \) clusters, where each point forms its own cluster; at the other, a single super-cluster containing all points

Hierarchical clustering

The two approaches to fusing/dividing clusters are termed as follows:

  • agglomerative clustering: the iterative process of building clusters by fusing clusters together. Suppose a series of points form their own clusters; a larger cluster is created by combining a cluster with its nearest neighbour (`nearest' being subject to definition).
  • divisive clustering: all points combined initially form a single cluster. More clusters are formed by iteratively splitting the existing cluster(s) to give the best separation of points (`best' being subject to definition).

Agglomeration process

  • place all points that are zero distance apart into clusters - this gives up to \( n \) clusters.
  • the distance value governing fusion is raised until two existing clusters are found similar enough to fuse.
  • the relevant clusters are fused, and the distance value governing fusion is increased - repeat from step 2 until all clusters are fused into one super-cluster.

Hierarchical clustering

  • The history of the fusing process described above naturally gives rise to dendrograms
  • The single observation clusters are at one end.
  • The cluster of all observations at the other.
  • The vertical distance reflects the distance between clusters being fused/divided.
  • Big distances indicate heterogeneous clusters joining/splitting

Cluster likeness

Distance between observations is relatively simple. Distance between clusters, less so. The common linkage choices (with a short code sketch after the list):

  • Single linkage clustering - nearest neighbor
  • Complete linkage clustering - farthest neighbour
  • Group average linkage
  • Ward's method
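
These linkage criteria can be compared directly with, for example, SciPy's hierarchical clustering tools (a sketch on made-up data; the slides do not prescribe any particular software):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    Y = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(4, 1, (15, 2))])

    for method in ["single", "complete", "average", "ward"]:
        Z = linkage(Y, method=method)                     # agglomerative fusion history
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut into two clusters
        print(method, np.bincount(labels))

    # scipy.cluster.hierarchy.dendrogram(Z) would draw the fusion history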

Single linkage - nearest neighbor

  • Single linkage is fusing the two clusters with the smallest distance (nearest neighbours) between objects.
  • Monte Carlo simulations have shown that single linkage performs poorly.
  • Single linkage `suffers' from chaining (long clusters).

Complete linkage clustering - farthest neighbour

  • Complete linkage is fusing the two clusters with the largest distance (farthest neighbours) between objects.
  • Monte Carlo simulations have shown that complete linkage nearly always performs better than single linkage.
  • Highly sensitive to outliers.
  • Complete linkage avoids chaining. It leads to compact clusters, but they are not necessarily far apart.

Group average linkage

  • Average linkage is fusing the two clusters with the smallest average distance between the objects.
  • Compromise between single and complete linkage.
  • Monte Carlo simulations have shown that average linkage generally performs well.
  • Average linkage leads to clusters which are relatively compact and relatively far apart.

Ward's method

  • Ward's method has a slightly different strategy than the methods described before.
  • It is not based on simple proximity: the two clusters whose fusion gives the smallest increase in within-cluster variation are merged.
  • Monte Carlo simulations have shown that Ward's method is among the best-performing hierarchical methods.
  • Tends to produce equal sized clusters that are convex and compact.
  • Can be seen as the hierarchical version of K-means

How many clusters?

This is akin to determining model complexity

  • more clusters means a more complex model
  • we'd like objective measures
  • it is hard to find satisfactory solutions

Choosing cluster number

Both partition-based and hierarchical clustering require that the number of clusters in the final solution be determined. This is the common problem - how complex should our `model' be?

  • Approaches:

    • make a fairly qualitative decision. Look at dendrograms - balance practical requirements and distinctness of clusters.
    • choose an entirely pragmatic number of clusters for the problem at hand (segmentation).
    • use one of the quantitative measures given below.

In practice there are often only so many clusters that can be dealt with.

Choosing cluster number

(We'll not cover these due to time constraints) Metrics include:

  • Calinski and Harabasz's index / pseudo-\( F \)
  • Duda and Hart's index
  • pseudo \( t^2 \) statistic
  • The GAP statistic

Prediction and cluster interpretation

Have clusters - now what?

  • If we have hierarchical clustering - `cut' the dendrogram
  • If \( k \)-means, run with desired \( k \)

In keeping with the two over-arching uses of statistical models:

  • we will typically want to allocate new observations to a cluster/group - prediction
  • or investigate the nature of the clusters we have identified - description

Prediction

Prediction is conceptually easy

  • allocate to the nearest cluster centroid (sketched below).
  • For any vector of new values, the same simple approach applies.
  • (Although we can also have fuzzy memberships, i.e. some numeric measure of degree of membership.)
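
A tiny sketch of the nearest-centroid rule (illustrative; it assumes a \( K\times p \) array of centroids from some earlier clustering is available):

    import numpy as np

    def allocate(y_new, centroids):
        """Allocate a new observation to the cluster with the nearest (Euclidean) centroid."""
        dists = np.linalg.norm(centroids - y_new, axis=1)
        return int(dists.argmin())

    centroids = np.array([[0.0, 0.0], [5.0, 5.0]])   # assumed, e.g. from k-means
    print(allocate(np.array([4.2, 5.1]), centroids)) # -> 1 (closer to the second centroid)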

Description

Description is relatively hard

  • Describing/interpreting the clusters is a more involved process.
  • Our clusters exist in \( p \)-dimensions which makes their description a daunting task.
  • Some of our \( p \) variables may be more important in defining some clusters than others
  • cluster definitions may also involve complex interactions between variables

Cluster description

  • Summary statistics: the predicted class memberships can be used as a simple index for subsetting the data
  • Reduced spaces: the data, and hence the clusters, will generally be \( p \)-dimensional. However, reduced-space methods do allow us to explore these visually.

Principal Components Analysis (PCA)

(Covered if time.) We can use dimension reduction to explore the clusters visually and to help describe them, as sketched below.
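
A sketch of the idea (illustrative; it uses scikit-learn's KMeans and PCA on made-up data, which the slides do not prescribe): cluster in the full \( p \)-dimensional space, then plot the first two principal component scores coloured by cluster label.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(100, 6))                          # p = 6 dimensional data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y)

    scores = PCA(n_components=2).fit_transform(Y)          # first two PC scores
    # plotting scores[:, 0] against scores[:, 1], coloured by `labels`,
    # gives a 2-D view of the p-dimensional clusters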