C. Donovan
18 April 2018
NB: If it's not in the lecture or lab, it's not in the exam
Clustering
So, a working definition:
“Clusters may be described as continuous regions of (a) space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points.” (Everitt 1980).
Observe there is no response variable, i.e. this is unsupervised learning.
We need something to summarise how similar things are.
The basic principle will be a comparison of vectors from our data matrix \( \mathbf{Y} \), e.g. taking rows \( i \) and \( j \) means comparing, in some way, \( \mathbf{y}_i \) and \( \mathbf{y}_j \):
\[ \begin{bmatrix} y_{i1}\\ y_{i2}\\ \vdots\\ y_{ip} \end{bmatrix} \leftrightarrow \begin{bmatrix} y_{j1}\\ y_{j2}\\ \vdots\\ y_{jp} \end{bmatrix} \]
Consider two \( p \)-dimensional column vectors \( \mathbf{y}_1 \) and \( \mathbf{y}_2 \) i.e. observations/rows taken from our data matrix. If we define a vector \( \mathbf{d} \) to be \( \mathbf{y}_1-\mathbf{y}_2 \), then:
\[ \|\mathbf{d}\|=\sqrt{\mathbf{d}^T\mathbf{d}} \qquad {\rm or} \qquad \left[\sum_{i=1}^p d_i^2\right]^{1/2} \]
This is the Euclidean distance: the measure of distance we encounter in day-to-day life (albeit only in one to three dimensions). You may notice this is simply a compact expression of the Pythagorean calculation of the hypotenuse.
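To make this concrete, a minimal NumPy sketch (mine, not from the lab; the two vectors are made-up data):

```python
import numpy as np

# two made-up p-dimensional observations (rows of a data matrix)
y1 = np.array([1.0, 2.0, 3.0])
y2 = np.array([4.0, 0.0, 3.0])

d = y1 - y2              # the difference vector d = y1 - y2
euclid = np.sqrt(d @ d)  # ||d|| = sqrt(d^T d)
print(euclid)            # same result as np.linalg.norm(d)
```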
The Manhattan or Taxi-cab distance is given by
\[ \|\mathbf{d}\|_1=|\mathbf{d}|^T\mathbf{1}_p \qquad {\rm or} \qquad \sum_{i=1}^p |d_i| \]
where \( |\mathbf{d}| \) takes absolute values element-wise. This can be thought of as the distance between points where you can only move parallel to one axis at a time, e.g. travelling around the block.
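Continuing the same made-up example, the Manhattan distance as a sketch:

```python
import numpy as np

y1 = np.array([1.0, 2.0, 3.0])
y2 = np.array([4.0, 0.0, 3.0])

d = np.abs(y1 - y2)              # element-wise |d_i|
manhattan = d @ np.ones(len(d))  # |d|^T 1_p, i.e. the sum of the |d_i|
print(manhattan)                 # same result as np.abs(y1 - y2).sum()
```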
For a column-centred data matrix \( \mathbf{X} \), the total sums-of-squares-and-products matrix is:
\[ \mathbf{T}=\mathbf{X}^T\mathbf{X} \]
In terms of clustering we can decompose this to:
\[ \mathbf{T}=\mathbf{W}+\mathbf{B} \]
i.e. the within-cluster bits (\( \mathbf{W} \)) plus the between-cluster bits (\( \mathbf{B} \)).
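To see the decomposition in action, a small sketch with made-up data and an arbitrary two-cluster labelling, verifying numerically that \( \mathbf{T}=\mathbf{W}+\mathbf{B} \) (the construction of \( \mathbf{W} \) and \( \mathbf{B} \) below follows the standard MANOVA-style definitions; the data and labels are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))          # made-up data: 10 observations, p = 3
labels = np.array([0]*5 + [1]*5)      # an arbitrary 2-cluster allocation

Xc = X - X.mean(axis=0)               # centre columns on the grand mean
T = Xc.T @ Xc                         # total SSP matrix

W = np.zeros((3, 3))
B = np.zeros((3, 3))
for k in np.unique(labels):
    Xk = X[labels == k]
    ck = Xk.mean(axis=0) - X.mean(axis=0)   # centroid, relative to grand mean
    Dk = Xk - Xk.mean(axis=0)               # deviations within cluster k
    W += Dk.T @ Dk                          # within-cluster scatter
    B += len(Xk) * np.outer(ck, ck)         # between-cluster scatter

print(np.allclose(T, W + B))          # True: the decomposition holds
```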
We consider only two broad methods of clustering here: partitioning (e.g. \( k \)-means) and hierarchical clustering.
Algorithm: K-Means Clustering
Start by randomly assigning each observation to one of the \( K \) clusters, then iterate until no further improvement is achieved:
1. Compute the cluster centroid for each of the \( K \) clusters, where the \( k^{th} \) cluster centroid is the mean vector for the observations in the \( k^{th} \) cluster.
2. Assign each observation to the cluster whose centroid is `closest', e.g. in Euclidean distance.
Recall
\[ \mathbf{T}=\mathbf{X}^T\mathbf{X} \qquad {\rm and} \qquad \mathbf{T}=\mathbf{W}+\mathbf{B} \]
Since \( \mathbf{T} \) is fixed by the data, \( k \)-means is effectively seeking an allocation of observations that makes the within-cluster scatter small (formally, minimising \( {\rm tr}(\mathbf{W}) \), equivalently maximising \( {\rm tr}(\mathbf{B}) \)).
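A minimal \( k \)-means sketch along the lines of the algorithm above (the function name, random initialisation and made-up data are mine; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal k-means: random initial allocation, then alternate
    centroid computation and nearest-centroid reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))      # random initial clusters
    for _ in range(n_iter):
        # centroid of each cluster = mean vector of its members
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # squared Euclidean distance of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels): # no change: converged
            break
        labels = new_labels
    return labels, centroids

# made-up data: two loose groups in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)
```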
In contrast to \( k \)-means, hierarchical clustering does not fix the number of clusters in advance; instead it builds a nested sequence of clusterings. The two methods for fusing/dividing clusters have the following terms: agglomerative (bottom-up, successively fusing the closest clusters) and divisive (top-down, successively splitting clusters).
Distance between observations is relatively simple. Distance between clusters, less so: we must choose how to measure the distance between groups of points (the linkage).
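A sketch of agglomerative clustering using SciPy; the choice of `average' linkage for the between-cluster distance, and the made-up data, are mine, purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# made-up data: two loose groups in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# agglomerative clustering: successively fuse the closest clusters,
# with "average" linkage defining the between-cluster distance
Z = linkage(X, method="average")

# cut the tree to obtain (at most) 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```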
This is akin to determining model complexity: both partition and hierarchical clustering require that the number of clusters in your final solution be determined. This is the common problem - how complex should our `model' be?
Approaches:
In practice there are often only so many clusters that can be dealt with.
There are also formal metrics for choosing the number of clusters (we'll not cover these due to time constraints), though one informal approach is sketched below.
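As a hedged illustration only (not lecture material): one informal approach plots the within-cluster sum of squares, \( {\rm tr}(\mathbf{W}) \), against \( K \) and looks for an `elbow' where further clusters stop helping much. scikit-learn's KMeans exposes this quantity as inertia_:

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up data: two loose groups in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# tr(W) for a range of K; look for the K past which it stops
# dropping sharply (here we expect a clear elbow at K = 2)
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```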
Have clusters - now what?
In keeping with the two over-arching uses of statistical models:
Prediction is conceptually easy, e.g. assign a new observation to the cluster with the nearest centroid (see the sketch at the end of this section).
Description is relatively hard: we need to characterise what the clusters mean in terms of the original variables.
(Covered if time) We can use dimension reduction to explore the clusters and help with description.
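And the promised prediction sketch: assign a new observation to the cluster whose centroid is nearest (the centroids and the new point below are made up):

```python
import numpy as np

# centroids from a fitted clustering (made-up values here)
centroids = np.array([[0.0, 0.0],
                      [5.0, 5.0]])

def predict(x_new, centroids):
    """Assign a new observation to the nearest centroid (Euclidean)."""
    d2 = ((centroids - x_new) ** 2).sum(axis=1)
    return int(d2.argmin())

print(predict(np.array([4.2, 4.8]), centroids))  # -> 1
```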