Introduction

This code through explores the statistical method of cluster analysis and covers two of the most commonly used algorithms: hierarchical clustering and k-means clustering.

Cluster analysis, or clustering, is a way of processing data in which groups of items within a data set are identified based on their within-group similarities (homogeneity) and between-group differences (heterogeneity). The resulting groups are called clusters, and their number can vary with the data. Choosing an algorithm for a particular data set is often an experimental process involving multiple iterations with different analyses. There are dozens of published clustering algorithms, some of which were designed for specific types of models and are inappropriate for use with others.

Overview

In this code through, we will focus on the following types of cluster analysis techniques:

  1. Hierarchical clustering, or connectivity-based clustering

  2. k-means clustering, also known as centroid-based clustering

Learning Objectives

  • Develop a deeper understanding of hierarchical and k-means clustering analysis

  • Compare the pros and cons of each algorithm

  • Learn where clustering shows up in real-world applications


Hierarchical Cluster Analysis

Hierarchical cluster analysis is based on the concept that elements closer together are more similar than those farther apart. Hierarchical clustering algorithms connect items in a data set to form “clusters” of items based on their distance. As we will see below, there are multiple methods that differ in the way distances are computed. Hierarchical clustering algorithms fall into two categories: agglomerative and divisive.

Classifications

Divisive: a “top-down” approach where all observations start in one cluster and subclusters are formed from that cluster iteratively until each object is its own cluster. The algorithm is known as DIANA (DIvisive ANAlysis Clustering) and was first described by Kaufman and Rousseeuw (1990). This algorithm is not as commonly used as the agglomerative approach.

Agglomerative: a “bottom-up” approach where each observation starts in its own cluster and pairs of clusters are then merged successively based on their level of similarity. The analyst has to choose the similarity criteria for the analysis: a distance metric to employ and a method (linkage) for comparing clusters to each other. This algorithm is performed using the agnes() function in R, as sketched below.
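As a minimal illustration of both approaches (assuming the cluster package is installed, and using the built-in USArrests data set purely as an example):

```r
# Sketch: agglomerative (agnes) and divisive (diana) clustering
# with the cluster package and the built-in USArrests data set.
library(cluster)

df <- scale(USArrests)   # standardize variables so no single variable dominates the distance

ag <- agnes(df, metric = "euclidean", method = "average")   # agglomerative ("bottom-up")
dv <- diana(df, metric = "euclidean")                       # divisive ("top-down")

ag$ac   # agglomerative coefficient: values near 1 suggest a strong clustering structure
dv$dc   # divisive coefficient, the analogous measure for diana
```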

Process

The basic steps of agglomerative hierarchical clustering are outlined below:

  1. Every object starts out as its own cluster.

  2. The distance between each pair of clusters is determined by some metric. The Euclidean distance is the most commonly used and is based on the Pythagorean Theorem, where the hypotenuse of a right triangle represents the distance between two points. The choice of measure is up to the analyst and should be based on how best to interpret the particular sample.

  3. The distances are compared and the two most similar clusters in the entire data set are grouped first, becoming a single cluster.

  4. The distances between all clusters are recalculated based on the chosen linkage method:

  • Single-linkage measures the distance between two clusters by their closest pair of points
  • Complete-linkage measures the distance between two clusters by their furthest pair of points
  • Average-linkage measures the distance between two clusters by the average distance between all pairs of points; the related centroid-linkage compares the means (centroids) of the clusters

  5. The next pair of most similar clusters is merged, and the process repeats until a single cluster encompassing all others exists.

Divisive clustering performs essentially the same steps in reverse, repeatedly splitting clusters instead of merging them.
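The agglomerative procedure above can be sketched in base R (again using the built-in USArrests data purely as an example):

```r
# Sketch of agglomerative hierarchical clustering with base R.
df <- scale(USArrests)                  # standardize the variables

d  <- dist(df, method = "euclidean")    # step 2: pairwise Euclidean distances
hc <- hclust(d, method = "complete")    # steps 3-5: merge clusters using complete linkage

# Other linkage choices include method = "single", "average", and "ward.D2".
plot(hc, cex = 0.6, hang = -1)          # draw the resulting dendrogram
```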

Dendrogram

This process creates a dendrogram, which indicates the similarity of the clusters and the order in which they were formed.

Image Source: Tim Brock


To interpret a dendrogram, a horizontal line is drawn at a given height. The number of vertical lines (connections) it intersects determines the number of clusters. The greater the height at which two clusters merge, the more they differ on the chosen features. Where to draw this line is up to the analyst and can be guided by cluster-optimization techniques such as the elbow method, which will be discussed later.

Image Source: Tim Brock
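Continuing the base R sketch above, cutting the tree at a chosen number of clusters is done with cutree() (the choice of four groups here is arbitrary, purely for illustration):

```r
# Cut the dendrogram from the hclust() sketch above into a chosen number of groups.
groups <- cutree(hc, k = 4)    # equivalent to drawing a horizontal line across the tree
table(groups)                  # how many observations fall into each cluster

plot(hc, cex = 0.6, hang = -1)
rect.hclust(hc, k = 4, border = "red")   # outline the four clusters on the dendrogram
```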


Pros

  • The analyst gets to see the full structure of the data at once, which makes it easier to interpret and draw conclusions.
  • It does not require the number of clusters to be specified before performing the analysis.

Cons

  • The method does not work as well with extremely large samples because the dendrogram becomes so compacted that it is difficult to interpret.
  • It is computationally intensive, and as a result can be very time consuming on large data sets.


Applications

Some examples of hierarchical clustering at work include:

  • Genetics (Child > Parent > Grandparent)
  • Taxonomy (Animal > Mammal > Dog > Poodle)
  • Geography (Earth > North America > USA > Arizona > Phoenix)


k-means Cluster Analysis

k-means clustering is a method of partitioning a sample into a number of clusters, k, selected by the analyst prior to clustering. Just like the hierarchical algorithm, the clusters are drawn to maximize intra-cluster similarity while keeping inter-cluster similarity low.

Process

The following steps outline the process:

  1. The number of clusters, k, is determined by the analyst. This can be arbitrary or based on a visualization of the data.

  2. k objects are selected from the data to serve as the initial center points (centroids).

  3. Each remaining observation is assigned to its closest centroid based on the chosen distance metric. The Euclidean distance is the most widely used for this type of clustering as well.

  4. After assignment, the centroid values are recalculated based on the values of all observations belonging to each cluster. The new mean values become the centroid for the cluster.

  5. This process of cluster reassignment and centroid updating is repeated until the clusters stop changing. The point at which the assignments no longer improve is known as convergence. A maximum number of iterations may also be set by the analyst.

Image Source: Chris Piech
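A minimal sketch of these steps in R, again using the built-in USArrests data and an arbitrary choice of k = 4:

```r
# Sketch of k-means clustering with base R.
df <- scale(USArrests)            # standardize the variables

set.seed(123)                     # k-means starts from random centroids, so fix the seed
km <- kmeans(df, centers = 4,     # step 1: the analyst chooses k (4 is arbitrary here)
             nstart = 25)         # run 25 random starts and keep the best solution

km$centers                        # final centroid values for each cluster
km$cluster                        # cluster assignment for each observation
km$tot.withinss                   # total within-cluster sum of squares (used by the elbow method)
```

Because the initial centroids are chosen at random, the nstart argument runs the algorithm several times and keeps the best result, which makes the output more stable.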

Cluster Optimization

The elbow method, which builds on the same within-cluster variance idea as Ward’s minimum variance criterion, helps determine the optimal number of clusters to select for a given data set. First, the clustering algorithm is computed for a range of values of k (e.g., from 1 to 10). After each run, the total within-cluster sum of squares (wss) is calculated. The values of k are then plotted against the wss, producing a curve showing how the variation changes. These plots usually have an inflection point that resembles an elbow, and that point is considered a good indicator of the optimal number of clusters for the data.

Image Source: University of Cincinnati Business Analytics
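A minimal sketch of the elbow method for the k-means example above (the range of 1 to 10 clusters is arbitrary):

```r
# Elbow method sketch: total within-cluster sum of squares for k = 1 to 10.
df <- scale(USArrests)
set.seed(123)

k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

# Plot k against wss and look for the "elbow" where the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```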

Pros

  • It is easy to compute this algorithm even on large data sets.
  • The algorithm is adaptable and, with minor variations, can accommodate clusters of different shapes and sizes.

Cons

  • The number of clusters needs to be selected prior to the analysis, which can affect how the data are interpreted.
  • The algorithm prefers clusters of similar size, which can make the borders between clusters unstable and sometimes incorrect.

Applications

Some examples of k-means clustering at work include:

  • Customer Segmentation (comparing a user’s purchase history to recommend similar products)
  • Computer vision (machine learning to collect information from images)
  • Astronomy (classifying clusters of stars)


Further Resources

Get a more comprehensive overview of cluster analysis

Explore the various examples of clustering algorithms

Learn more about hierarchical clustering and the hclust function

Explore the agnes and diana functions in R

Learn more about k-means clustering and the kmeans function

Look into the criteria and the process of Ward’s Method


Works Cited

This code through references and cites the following sources: