Hierarchical Clustering

HC Intuition, Lecture 143: https://www.udemy.com/machinelearning/learn/lecture/5714428

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:[1]

Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.

https://en.wikipedia.org/wiki/Hierarchical_clustering
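
To make the agglomerative idea concrete, here's a minimal R sketch (the four points are made up for illustration). hclust() performs the bottom-up merging, and its merge and height components record which clusters were merged at each step and at what distance - exactly the information a dendrogram displays.

# a minimal sketch: agglomerative clustering of four made-up 1-D points
points = matrix(c(1, 2, 6, 10), ncol = 1)
hc_toy = hclust(dist(points, method = 'euclidean'))
hc_toy$merge  # which observations/clusters were merged at each step
hc_toy$height # the distance at which each merge happened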

Check the working directory with getwd().

Hierarchical Clustering Steps

knitr::include_graphics("hc_steps.png")

What’s going on here

Let’s recall Euclidean distance, as distances are an important element of the clustering steps.

knitr::include_graphics("Euclidean_Distances.png")
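
As a quick sanity check with two made-up points, the Euclidean distance can be computed by hand and compared against R's built-in dist():

p1 = c(1, 2)
p2 = c(4, 6)
# by hand: sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5
sqrt(sum((p1 - p2)^2))
# same result via dist()
dist(rbind(p1, p2), method = 'euclidean')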

See minute 6:36 in the intuition lecture above for a visual walk-through of the steps.

Working out the distances between clusters

knitr::include_graphics("Euclidean_Distances_Options.png")
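
Each option in the figure corresponds to a linkage method in hclust(). A small sketch (on random, made-up data) showing how the same distance matrix can be clustered under different definitions of cluster-to-cluster distance:

set.seed(42)                     # made-up data, fixed for reproducibility
m = matrix(rnorm(20), ncol = 2)  # 10 random 2-D points
d = dist(m, method = 'euclidean')
hc_single   = hclust(d, method = 'single')    # distance between closest points
hc_complete = hclust(d, method = 'complete')  # distance between furthest points
hc_average  = hclust(d, method = 'average')   # average of all pairwise distances
hc_ward     = hclust(d, method = 'ward.D')    # minimizes within-cluster variance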

How the Dendrogram works

What the hierarchical clustering algorithm does while it walks through the steps above is store the memory of those steps in a dendrogram :-)

https://en.wikipedia.org/wiki/Dendrogram

https://www.udemy.com/machinelearning/learn/lecture/5714432 covers how the dendrogram is created.

knitr::include_graphics("How_Dendogram_Forms.png")

What we do is set a dissimilarity threshold based on a certain Euclidean distance. We can work out where to set this by visualizing the dendrogram.

Dendrogram Threshold

knitr::include_graphics("Dendogram_Threshold.png")

Threshold into Cluster Count

knitr::include_graphics("Threshold_intoClusterCount.png")
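
In R, cutree() handles both directions: pass h = to cut the tree at a dissimilarity threshold, or k = to request a number of clusters directly. A minimal sketch on the toy object from earlier (the threshold value here is made up):

hc_toy = hclust(dist(matrix(c(1, 2, 6, 10), ncol = 1)))
cutree(hc_toy, h = 3) # cut at a dissimilarity threshold (height) of 3
cutree(hc_toy, k = 2) # or ask for exactly 2 clusters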

Import data

dataset = read.csv('Mall_Customers.csv')

What’s this about

knitr::include_graphics("/Users/markloessi/Machine_Learning/Machine Learning A-Z New/Part 4 - Clustering/MallCustomer_Task.png")

Quick look

summary(dataset)
##    CustomerID        Genre          Age        Annual.Income..k..
##  Min.   :  1.00   Female:112   Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50                Median :36.00   Median : 61.50    
##  Mean   :100.50                Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

Another look

head(dataset)
##   CustomerID  Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Build array

Now we want to subset the two columns we want to cluster on: Annual Income and Spending Score.

X = dataset[4:5]       # to follow the lecture
dataset = dataset[4:5] # keep only the Annual Income and Spending Score columns

Quick look

summary(dataset)
##  Annual.Income..k.. Spending.Score..1.100.
##  Min.   : 15.00     Min.   : 1.00         
##  1st Qu.: 41.50     1st Qu.:34.75         
##  Median : 61.50     Median :50.00         
##  Mean   : 60.56     Mean   :50.20         
##  3rd Qu.: 78.00     3rd Qu.:73.00         
##  Max.   :137.00     Max.   :99.00

Another look

head(dataset)
##   Annual.Income..k.. Spending.Score..1.100.
## 1                 15                     39
## 2                 15                     81
## 3                 16                      6
## 4                 16                     77
## 5                 17                     40
## 6                 17                     76

Splitting the dataset into the Training set and Test set - not needed for hierarchical clustering (just as it wasn’t for K-Means).

Feature Scaling - also skipped here (just as it was for K-Means).

Using the dendrogram to find the optimal number of clusters

# the ward.D method tries to minimize the within cluster variance
dendrogram = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
plot(dendrogram,
     main = paste('Dendrogram'),
     xlab = 'Customers',
     ylab = 'Euclidean distances')

Let’s interpret:

knitr::include_graphics("R_HCluster_Dendogram.png")
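
One way to double-check the cut visually is stats::rect.hclust(), which draws boxes around the clusters on the dendrogram plotted above (here with the 5 clusters we settle on below):

plot(dendrogram,
     main = paste('Dendrogram'),
     xlab = 'Customers',
     ylab = 'Euclidean distances')
rect.hclust(dendrogram, k = 5, border = 'red') # outline the 5 clusters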

Fitting Hierarchical Clustering to the dataset

And setting the number of clusters to 5 based on our analysis of the dendrogram.

# we're going to make another object
hc = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
# then 'cut' the tree at the level that gives us 5 clusters ;-)
y_hc = cutree(hc, 5)
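
A quick sanity check on the result: count how many customers fall into each cluster, and look at each cluster's average income and spending score:

table(y_hc) # number of customers per cluster
aggregate(dataset, by = list(cluster = y_hc), FUN = mean) # cluster averages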

Visualising the clusters

This code only works for two-dimensional clustering.

library(cluster)
clusplot(dataset,
         y_hc,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels= 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters of customers'),
         xlab = 'Annual Income',
         ylab = 'Spending Score')
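
Since our data here is already two dimensional, the same picture can be drawn with a plain base-R scatter plot colored by cluster (a minimal sketch, no extra packages needed):

plot(dataset$Annual.Income..k..,
     dataset$Spending.Score..1.100.,
     col = y_hc, # color points by their cluster label (1-5)
     pch = 19,
     main = 'Clusters of customers',
     xlab = 'Annual Income (k$)',
     ylab = 'Spending Score (1-100)')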

Clustering Pros and Cons

knitr::include_graphics("Clustering_ProsCons.png")

=========================
GitHub files: https://github.com/ghettocounselor

Useful PDF for common questions in the lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf