Hierarchical Clustering

HC Intuition, Lecture 143: https://www.udemy.com/machinelearning/learn/lecture/5714428

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:[1]

Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.

https://en.wikipedia.org/wiki/Hierarchical_clustering
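
To make the agglomerative idea concrete, here's a minimal R sketch (the four points are made up for illustration). hclust() performs the bottom-up merging, and its merge and height components record which clusters were merged at each step and at what distance - exactly the information a dendrogram displays.

# a minimal sketch: agglomerative clustering of four made-up 1-D points
points = matrix(c(1, 2, 6, 10), ncol = 1)
hc_toy = hclust(dist(points, method = 'euclidean'))
hc_toy$merge  # which observations/clusters were merged at each step
hc_toy$height # the distance at which each merge happened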

Check the working directory with getwd().

Hierarchical Clustering Steps

knitr::include_graphics("hc_steps.png")

What’s going on here

Let’s recall Euclidean distance, as distances are an important element of the clustering steps.

knitr::include_graphics("Euclidean_Distances.png")
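
As a quick sanity check with two made-up points, the Euclidean distance can be computed by hand and compared against R's built-in dist():

p1 = c(1, 2)
p2 = c(4, 6)
# by hand: sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5
sqrt(sum((p1 - p2)^2))
# same result via dist()
dist(rbind(p1, p2), method = 'euclidean')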

See minute 6:36 in the intuition lecture above for a visual walk-through of the steps.

Working out the distances between clusters

knitr::include_graphics("Euclidean_Distances_Options.png")
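
Each option in the figure corresponds to a linkage method in hclust(). A small sketch (on random, made-up data) showing how the same distance matrix can be clustered under different definitions of cluster-to-cluster distance:

set.seed(42)                     # made-up data, fixed for reproducibility
m = matrix(rnorm(20), ncol = 2)  # 10 random 2-D points
d = dist(m, method = 'euclidean')
hc_single   = hclust(d, method = 'single')    # distance between closest points
hc_complete = hclust(d, method = 'complete')  # distance between furthest points
hc_average  = hclust(d, method = 'average')   # average of all pairwise distances
hc_ward     = hclust(d, method = 'ward.D')    # minimizes within-cluster variance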

How the Dendrogram works

What the hierarchical clustering algorithm does while it walks through the steps above is store the memory of those steps in a dendrogram :-)

https://en.wikipedia.org/wiki/Dendrogram

https://www.udemy.com/machinelearning/learn/lecture/5714432 covers how the dendrogram is created.

knitr::include_graphics("How_Dendogram_Forms.png")

What we do is set a dissimilarity threshold based on a certain Euclidean distance. We can work out where to set this by visualizing the dendrogram.

Dendrogram Threshold

knitr::include_graphics("Dendogram_Threshold.png")

Threshold into Cluster Count

knitr::include_graphics("Threshold_intoClusterCount.png")
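
In R, cutree() handles both directions: pass h = to cut the tree at a dissimilarity threshold, or k = to request a number of clusters directly. A minimal sketch on the toy object from earlier (the threshold value here is made up):

hc_toy = hclust(dist(matrix(c(1, 2, 6, 10), ncol = 1)))
cutree(hc_toy, h = 3) # cut at a dissimilarity threshold (height) of 3
cutree(hc_toy, k = 2) # or ask for exactly 2 clusters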

Import data

dataset = read.csv('Mall_Customers.csv')

What’s this about

knitr::include_graphics("/Users/markloessi/Machine_Learning/Machine Learning A-Z New/Part 4 - Clustering/MallCustomer_Task.png")

Quick look

summary(dataset)
##    CustomerID        Genre          Age        Annual.Income..k..
##  Min.   :  1.00   Female:112   Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50                Median :36.00   Median : 61.50    
##  Mean   :100.50                Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

Another look

head(dataset)
##   CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Build array

Now we want to subset the two columns we want to cluster on: Annual Income and Spending Score.

X = dataset[4:5]       # to follow the lecture
dataset = dataset[4:5] # keep only the Annual Income and Spending Score columns

Quick look

summary(dataset)
##  Annual.Income..k.. Spending.Score..1.100.
##  Min.   : 15.00     Min.   : 1.00         
##  1st Qu.: 41.50     1st Qu.:34.75         
##  Median : 61.50     Median :50.00         
##  Mean   : 60.56     Mean   :50.20         
##  3rd Qu.: 78.00     3rd Qu.:73.00         
##  Max.   :137.00     Max.   :99.00

Another look

head(dataset)
##   Annual.Income..k.. Spending.Score..1.100.
## 1                 15                     39
## 2                 15                     81
## 3                 16                      6
## 4                 16                     77
## 5                 17                     40
## 6                 17                     76

Splitting the dataset into the Training set and Test set - not needed for hierarchical clustering (just as it wasn’t for K-Means).

Feature Scaling - also skipped here (just as it was for K-Means).

Using the dendrogram to find the optimal number of clusters

# the ward.D method tries to minimize the within cluster variance
dendrogram = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
plot(dendrogram,
     main = paste('Dendrogram'),
     xlab = 'Customers',
     ylab = 'Euclidean distances')

Let’s interpret:

knitr::include_graphics("R_HCluster_Dendogram.png")
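
One way to double-check the cut visually is stats::rect.hclust(), which draws boxes around the clusters on the dendrogram plotted above (here with the 5 clusters we settle on below):

plot(dendrogram,
     main = paste('Dendrogram'),
     xlab = 'Customers',
     ylab = 'Euclidean distances')
rect.hclust(dendrogram, k = 5, border = 'red') # outline the 5 clusters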

Fitting Hierarchical Clustering to the dataset

And setting the number of clusters to 5 based on our analysis of the dendrogram.

# we're going to make another object
hc = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
# then 'cut' the tree at the level that gives us 5 clusters ;-)
y_hc = cutree(hc, 5)
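
A quick sanity check on the result: count how many customers fall into each cluster, and look at each cluster's average income and spending score:

table(y_hc) # number of customers per cluster
aggregate(dataset, by = list(cluster = y_hc), FUN = mean) # cluster averages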

Visualising the clusters

This code only works for two-dimensional clustering.

library(cluster)
clusplot(dataset,
         y_hc,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels= 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')
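
Since our data here is already two dimensional, the same picture can be drawn with a plain base-R scatter plot colored by cluster (a minimal sketch, no extra packages needed):

plot(dataset$Annual.Income..k..,
     dataset$Spending.Score..1.100.,
     col = y_hc, # color points by their cluster label (1-5)
     pch = 19,
     main = 'Clusters of customers',
     xlab = 'Annual Income (k$)',
     ylab = 'Spending Score (1-100)')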

Clustering Pros and Cons

knitr::include_graphics("Clustering_ProsCons.png")

=========================
GitHub files: https://github.com/ghettocounselor

Useful PDF for common questions in the lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf