Clustering

Jake

11/08/2021

  • Unsupervised methods are used when there is no response variable \(Y\) associated with the observations.
    • The goal is to find interesting structure in the measurements on the predictors.
    • The analysis is subjective and typically forms part of exploratory data analysis.
  • With too many dimensions training is slower and it can be harder to find a good solution
    • Curse of dimensionality
    • Common dimensionality reduction techniques use projections and manifold learning
  • Dimensionality reduction often speeds up training but can lead to worse results

PCA

  • PCA is used when we want a low-dimensional representation of the data that captures as much information as possible.
    • Often used to find the first two principal components and plot the data in two dimensions
  • We assume all variables are centered to mean 0 (as we only care about variance) and also standardised to standard deviation 1.
  • The first principal component is the normalised linear combination of the features that has the largest variance:

\[ Z_1 = \phi_{11}X_1 + ... + \phi_{p1}X_p,\quad \sum^p_{j=1}\phi^2_{j1}=1\]

  • Equivalently, the first principal component is the one whose component scores have the largest sample variance, which leads to the maximisation problem below.
    • Note that as all \(\bar{x}_j = 0\), the average of the principal component scores is 0 as well.
    • The second principal component is orthogonal to the first.

\[ \max_{\phi_{11},...,\phi_{p1}} \left[\frac{1}{n}\sum^n_{i=1}\underbrace{\left(\sum^p_{j=1}\phi_{j1}x_{ij}\right)^2}_{z_{i1}^2}\right],\quad \sum^p_{j=1}\phi_{j1}^2=1\]

  • PCA can also be interpreted as providing the low-dimensional surface closest to all observations.

  • The following is a biplot of a four-variable dataset projected onto its first two principal components
    • The top and right scales refer to the principal component loading vectors, drawn as arrows. Arrows that point in similar directions indicate correlated variables.
    • The bottom and left scales are the principal component scores for the observations.
    • The direction of each arrow indicates where on the component plot that variable increases. In this example, high positive scores on the first principal component indicate high crime rates (rape, assault, murder), while high positive scores on the second principal component are more related to a high urban population.
Figure: biplot of the first two principal components

Figure: principal component loading vectors

  • A measure of how ‘good’ a principal component is, is the proportion of variance explained (PVE).
  • Assuming all variables have been centered to have mean 0, the total variance of a dataset is defined as:

\[ \sum^p_{j=1}Var(X_j) = \sum^p_{j=1}\frac{1}{n}\sum^n_{i=1}x_{ij}^2\]

  • The variance explained by the mth principal component is:

\[ \frac{1}{n}\sum^n_{i=1}z^2_{im}=\frac{1}{n}\sum^n_{i=1}\left(\sum^p_{j=1}\phi_{jm}x_{ij}\right)^2\]

  • Therefore the PVE of the mth principal component is the following:
    • Note that to get the cumulative PVE we just sum this over the relevant m components.
    • There are at most \(\min(n-1,p)\) principal components, and their PVEs sum to 1.

\[ \frac{\sum^n_{i=1}\left(\sum^p_{j=1}\phi_{jm}x_{ij}\right)^2}{\sum^p_{j=1}\sum^n_{i=1}x_{ij}^2}\]

  • An ad hoc way to determine how many principal components should be used is to visualise the PVE using a scree plot (see the sketch after the figure below):

Figure: PCA scree plot
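A minimal scikit-learn sketch of the PVE calculation above, on synthetic placeholder data (not the biplot dataset): `explained_variance_ratio_` gives the PVE of each component, and its cumulative sum is what a scree plot displays.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # n = 200 observations, p = 4 synthetic features
X[:, 1] += 0.8 * X[:, 0]                   # add some correlation so there is structure to find

X_std = StandardScaler().fit_transform(X)  # center to mean 0, scale to sd 1

pca = PCA()                                # keep every component (no reduction yet)
scores = pca.fit_transform(X_std)          # z_im: the principal component scores

pve = pca.explained_variance_ratio_        # PVE of each component; sums to 1
print("PVE:           ", np.round(pve, 3))
print("Cumulative PVE:", np.round(np.cumsum(pve), 3))
# Plotting pve (or its cumulative sum) against component number gives the scree plot.
```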

PCA Notes Math

  • PCA loading vectors are the eigenvectors of the covariance matrix (see the sketch after this list)
    • Can be found through an eigenvalue decomposition, which applies to square matrices
    • However, SVD is often used instead as it can be applied to rectangular matrices
  • Advantages
    • Removes correlated features
    • Can help reduce overfitting by reducing the number of features
    • PCA can help visualise high dimensional data
  • Disadvantages
    • Cannot be used directly on sparse data
    • Provides little benefit if the features are uncorrelated
    • Features obtained from PCA are hard to interpret
  • Incremental PCA can be used to apply PCA on large datasets or online
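A small numpy sketch of the point above that the loading vectors are eigenvectors of the covariance matrix, and that the SVD of the (rectangular) centered data matrix recovers the same directions; the data here is synthetic and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                # synthetic data: 100 observations, 3 features
Xc = X - X.mean(axis=0)                      # center each column to mean 0

# Route 1: eigenvectors of the (square, p x p) covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # columns are loading vectors (ascending eigenvalue)

# Route 2: SVD of the (rectangular, n x p) centered data matrix itself
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings_svd = Vt                            # rows are loading vectors (descending singular value)

# Up to sign and ordering, the two routes give the same directions
print(np.allclose(np.abs(eigvecs[:, ::-1]), np.abs(loadings_svd.T)))
```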

ICA

  • Independent Component Analysis (ICA) works to separate mixed signals into additive, statistically independent subcomponents
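A minimal sketch of the idea using scikit-learn's FastICA on two synthetic mixed signals; the sources, mixing matrix, and parameters are illustrative assumptions, not a specific example from these notes.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                 # source 2: square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],                   # mixing matrix: each observed signal is a
              [0.5, 1.0]])                  # weighted sum of the two sources
X = S @ A.T                                 # observed, mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                # estimated sources, recovered up to scale and order
```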

Clustering

  • Clustering methods attempt to find clusters/subgroups in the data
  • Often useful in combination with domain knowledge
  • Applications can include
    • Customer segmentation
    • Anomaly detection
    • Image processing and segmentation
    • Data analysis
    • Semi-supervised learning

K-Means Clustering

  • K-means clustering is a method where the number of clusters, \(K\), must be specified before the analysis starts.
    • The algorithm then assigns each observation to exactly one of the \(K\) clusters
  • K-means clustering attempts to create clusters whose within-cluster variation is as small as possible.
    • \(W(C_k)\) denotes the within-cluster variation of cluster \(C_k\).

\[ \min_{C_1,...,C_K}\left[\sum^K_{k=1}W(C_k)\right]\]

  • The within-cluster variation metric is most often squared Euclidean distance.
    • \(|C_k|\) is the number of observations in the kth cluster.

\[ W(C_k) = \frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum^p_{j=1}(x_{ij}-x_{i'j})^2\]
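A small numpy sketch of the formula above; `within_cluster_variation` is a made-up helper name, and `labels` is assumed to hold the cluster assignment of each row of `X`.

```python
import numpy as np

def within_cluster_variation(X, labels, k):
    """W(C_k): sum of squared Euclidean distances over all pairs in cluster k, divided by |C_k|."""
    C_k = X[labels == k]                        # observations assigned to cluster k
    diffs = C_k[:, None, :] - C_k[None, :, :]   # x_i - x_i' for every ordered pair (i, i')
    return (diffs ** 2).sum() / len(C_k)        # divide by |C_k|, matching the formula above

# The K-means objective is then sum(within_cluster_variation(X, labels, k) for k in range(K))
```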

  • There are almost \(K^n\) ways to partition the data, therefore we look for a good local optimum rather than the global optimum.
    • While this often converges to a good solution, the initialisation is random, so the algorithm should be run multiple times and the best solution (minimal within-cluster variation) chosen (see the sketch after the algorithm below).
  • The following algorithm is used to create the clusters
    • Note the centroid is the average of all the observations in each cluster
Figure: the K-means algorithm
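A minimal scikit-learn sketch of the procedure above on synthetic blob data: `n_init` controls how many random initialisations are tried, and the run with the lowest within-cluster sum of squares (`inertia_`) is kept.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data where K = 4 is known

km = KMeans(n_clusters=4, n_init=10, random_state=0)          # 10 random restarts, best one kept
labels = km.fit_predict(X)

print(km.cluster_centers_)   # each centroid is the mean of the observations in its cluster
print(km.inertia_)           # total within-cluster sum of squares of the best run
```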

Optimal K

  • To pick the number of clusters \(K\), a useful method is again the scree (elbow) plot, however this is inherently ad hoc.
Figure: K-means scree (elbow) plot

  • Can also use silhouette diagrams
    • Silhouette coefficients vary from -1 to 1, depending on how close an instance is to the rest of its own cluster relative to the nearest other cluster (1 means it sits deep within its own cluster, -1 means it may have been assigned to the wrong cluster).
  • We can pick \(K\) based on the highest average silhouette coefficient across instances, or by analysing the silhouette diagram (see the sketch after the figures below).
Figure: average silhouette coefficient vs K

Figure: silhouette diagrams
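A minimal sketch of both selection methods with scikit-learn on synthetic data: the inertia values give the scree/elbow plot, and `silhouette_score` gives the average silhouette coefficient for each candidate \(K\).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)   # synthetic data

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))

# Plot inertia vs k for the elbow; pick the k with the highest average silhouette,
# or inspect per-instance values with sklearn.metrics.silhouette_samples.
```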

  • Hard clustering assigns each instance to a cluster, while soft clustering assigns a score per cluster to each instance
    • Score can be any similarity measure, such as distance from centroid
    • Can be used as a form of dimensionality reduction

Mini-Batch K-Means

  • Mini-batch K-means allows K-means to be fitted on mini-batches of data. This is useful if the entire dataset cannot be held in memory, and also for large datasets in general as it is a much faster implementation
    • However, it has slightly worse performance
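A minimal sketch using scikit-learn's `MiniBatchKMeans`; the streamed batches here are random placeholder data standing in for chunks read from disk or a stream.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)

for _ in range(100):                            # each batch stands in for a chunk streamed from disk
    batch = rng.normal(size=(1024, 10))
    mbk.partial_fit(batch)                      # update the centroids with just this mini-batch

labels = mbk.predict(rng.normal(size=(5, 10)))  # assign new instances to the learned clusters
```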

Limits of K-Means

  • Doesn’t work well with varying cluster sizes, different densities or non spherical shapes

Hierarchical Clustering

  • Hierarchical clustering doesn’t need the number of clusters to be set before beginning, and is therefore considered more flexible than K-means.
    • It also results in a tree-like diagram that is easy to interpret.
    • However, it relies on the assumption that clusters are nested within the clusters obtained from cutting at a greater height, which is not always the case
      • e.g. if the best 2 clusters split by gender and the best 3 clusters split by nationality, the nationality clusters wouldn’t be nested within the gender clusters even though each is the most relevant clustering at its level.
  • The algorithm is as follows:
Figure: the hierarchical clustering algorithm

Dendrogram

  • Each leaf represents an observation
  • As we move up the dendrogram, leaves fuse into branches, indicating similarity
    • The earlier this happens, the more similar the observations
    • Note that position on the horizontal axis means nothing; similarity is read from the height at which branches fuse. If an observation merges with a group at a certain height, it is considered equally related to each member of that group.
Figure: example dendrogram

  • We can cut the dendrogram at any height to obtain between \(1\) and \(n\) clusters, however where to cut is not always clear.
Figure: dendrogram cut to give 3 clusters

  • The way the dendrogram forms is based on the dissimilarity and linkage measures.
  • The dissimilarity measure is computed between pairs of observations; squared Euclidean distance is most often used (see the sketch after the equation below).
    • Another common choice is correlation-based distance

\[ \sum^p_{j=1}(x_{ij}-x_{i'j})^2\]
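A small scipy sketch of the two dissimilarity measures on synthetic data; `pdist` computes all pairwise dissimilarities and `squareform` arranges them into an \(n \times n\) matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))                          # 6 observations, 5 features

d_euc = squareform(pdist(X, metric="sqeuclidean"))   # squared Euclidean distance, as above
d_cor = squareform(pdist(X, metric="correlation"))   # correlation-based distance (1 - correlation)
```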

  • The linkage measure defines how the dissimilarity is measured when one (or both) of the items being compared is a group of observations (see the sketch after the figures below).
    • Average and complete linkage tend to produce the most balanced trees.
    • Centroid linkage is generally unfavourable as it can lead to inversion, where two clusters fuse at a height below that of either individual cluster.
Figure: linkage measures

Figure: linkage diagram
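A minimal scipy sketch of agglomerative clustering on synthetic data, showing where the linkage method is chosen and how the dendrogram is cut into a fixed number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                             # synthetic observations

Z = linkage(X, method="complete", metric="euclidean")    # try "average", "single" or "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")          # cut the dendrogram into 3 clusters

# dendrogram(Z) draws the tree itself (requires matplotlib)
```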

Density Based Clustering

  • Uses local density estimation to identify clusters of arbitrary shape, which is not possible with K-means
    • Core instances are those that sit in dense regions
    • All instances in the neighbourhood of a core instance belong to the same cluster
      • \(\epsilon\) is the neighbourhood distance from a point
    • Any instance that is not a core instance and does not have one in its neighbourhood is considered an outlier/anomaly
  • DB clustering works well if the clusters are dense enough and well separated by low-density regions
  • Don’t need to specify number of clusters
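A minimal scikit-learn sketch using DBSCAN on the synthetic two-moons dataset (a non-spherical shape K-means struggles with); the `eps` and `min_samples` values are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)   # non-spherical synthetic clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = neighbourhood radius, min_samples = density threshold
labels = db.labels_                          # cluster index per instance; -1 marks outliers/anomalies
core = db.core_sample_indices_               # indices of the core instances
print(set(labels))
```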

Notes On Clustering

  • Need to consider if and how to scale variables before clustering is performed (see the sketch below).
  • Forcing outliers into a cluster may distort the clusters (which happens in K-means).
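A small sketch of one common choice, standardising features before K-means with a scikit-learn pipeline; whether to scale (and how) remains a judgement call for the data at hand.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X[:, 0] *= 100                                # exaggerate one feature's units

model = make_pipeline(StandardScaler(),       # scale first so no variable dominates the distance
                      KMeans(n_clusters=3, n_init=10, random_state=0))
labels = model.fit_predict(X)
```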