HC Intuition Lecture 143 https://www.udemy.com/machinelearning/learn/lecture/5714428
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:[1]
Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
https://en.wikipedia.org/wiki/Hierarchical_clustering
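Both strategies are available in R: hclust() in base R is agglomerative, while diana() (DIvisive ANAlysis) from the cluster package is divisive. A minimal sketch on made-up data, just to contrast the two:
library(cluster)
set.seed(1)
toy = matrix(rnorm(20), ncol = 2)   # made-up points, 10 observations

agglo = hclust(dist(toy))           # bottom-up: 10 singletons, merged pairwise
divis = diana(toy)                  # top-down: one cluster, split recursively

par(mfrow = c(1, 2))
plot(agglo, main = 'Agglomerative')
plot(divis, which.plots = 2, main = 'Divisive')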
Check the working directory with getwd().
knitr::include_graphics("hc_steps.png")
Steps of the agglomerative hierarchical clustering algorithm
Let’s review Euclidean distance, as distance measurement is an important element in the clustering steps.
knitr::include_graphics("Euclidean_Distances.png")
Euclidean distance between two points
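A quick sketch of the computation, with made-up points p and q:
p = c(1, 2)
q = c(4, 6)
sqrt(sum((q - p)^2))                    # manual formula: sqrt((x2-x1)^2 + (y2-y1)^2) -> 5
dist(rbind(p, q), method = 'euclidean') # same result from base R's dist()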
See minute 6:36 in the intuition lecture above for a visual walk-through of the steps.
knitr::include_graphics("Euclidean_Distances_Options.png")
Options for measuring the distance between two clusters
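Assuming the figure above shows the usual options for measuring the distance between two clusters (closest points, furthest points, average of all pairwise distances, distance between centroids), each option maps onto a linkage method in hclust(). A sketch on made-up points:
set.seed(1)
pts = matrix(rnorm(20), ncol = 2)   # made-up points for illustration
d = dist(pts, method = 'euclidean')

hc_single   = hclust(d, method = 'single')    # closest points of the two clusters
hc_complete = hclust(d, method = 'complete')  # furthest points (hclust's default)
hc_average  = hclust(d, method = 'average')   # average of all pairwise distances
hc_centroid = hclust(d, method = 'centroid')  # centroids; properly defined on squared Euclidean distances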
What the hierarchical clustering algorithm does while it walks through the steps above is store a memory of the steps in a dendrogram :-)
https://en.wikipedia.org/wiki/Dendrogram
https://www.udemy.com/machinelearning/learn/lecture/5714432 Basically, how the dendrogram is created.
knitr::include_graphics("How_Dendogram_Forms.png")
How the dendrogram forms as clusters are merged
What we do is set a dissimilarity threshold based on a certain Euclidean distance. We can work out where to set this by visualizing the dendrogram.
knitr::include_graphics("Dendogram_Threshold.png")
Setting a dissimilarity threshold on the dendrogram
knitr::include_graphics("Threshold_intoClusterCount.png")
Translating the threshold into a cluster count
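A sketch of that translation in R, on made-up points: draw a candidate threshold on the dendrogram with abline(), then cut the tree at that height with cutree() and count the resulting clusters. The threshold of 5 here is an arbitrary value for illustration.
set.seed(1)
pts = matrix(rnorm(40), ncol = 2)     # made-up points for illustration
hc_toy = hclust(dist(pts), method = 'ward.D')
plot(hc_toy)
abline(h = 5, lty = 2)                # candidate dissimilarity threshold
table(cutree(hc_toy, h = 5))          # cluster sizes implied by that threshold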
dataset = read.csv('Mall_Customers.csv')
What’s this about?
knitr::include_graphics("/Users/markloessi/Machine_Learning/Machine Learning A-Z New/Part 4 - Clustering/MallCustomer_Task.png")
The mall customer segmentation task
Quick look
summary(dataset)
## CustomerID Genre Age Annual.Income..k..
## Min. : 1.00 Female:112 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Male : 88 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending.Score..1.100.
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
Another look
head(dataset)
## CustomerID Genre Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
Now we want to build a data frame of the two columns we want to cluster on.
X = dataset[4:5] # keep a copy named X to follow the lecture
dataset = dataset[4:5] # overwrite dataset with just the two columns for this documentation
Quick look
summary(dataset)
## Annual.Income..k.. Spending.Score..1.100.
## Min. : 15.00 Min. : 1.00
## 1st Qu.: 41.50 1st Qu.:34.75
## Median : 61.50 Median :50.00
## Mean : 60.56 Mean :50.20
## 3rd Qu.: 78.00 3rd Qu.:73.00
## Max. :137.00 Max. :99.00
Another look
head(dataset)
## Annual.Income..k.. Spending.Score..1.100.
## 1 15 39
## 2 15 81
## 3 16 6
## 4 16 77
## 5 17 40
## 6 17 76
Splitting the dataset into the Training set and Test set - won’t be done for hierarchical clustering, just as it wasn’t for K-Means
Feature Scaling - won’t be done for hierarchical clustering, just as it wasn’t for K-Means
# the ward.D method tries to minimize the within-cluster variance
dendrogram = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
plot(dendrogram,
main = paste('Dendrogram'),
xlab = 'Customers',
ylab = 'Euclidean distances')
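One note on the method argument: the course uses 'ward.D', but base R (3.1.0+) also offers 'ward.D2', which implements Ward’s minimum-variance criterion on plain Euclidean distances by squaring the dissimilarities internally; 'ward.D' matches the classic criterion only when fed squared distances. An alternative fit, if you want to compare the two dendrograms:
dendrogram2 = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D2')
plot(dendrogram2,
main = paste('Dendrogram (ward.D2)'),
xlab = 'Customers',
ylab = 'Euclidean distances')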
Let’s interpret:
knitr::include_graphics("R_HCluster_Dendogram.png")
Dendrogram of the mall customers produced in R
And adjusting the number of clusters to 5 based on our analysis of the dendrogram
# we're going to make another object
hc = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D')
# then 'cut' the tree at the level that gives us 5 clusters ;-)
y_hc = cutree(hc, 5)
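A quick sanity check on the assignment vector:
table(y_hc) # how many customers landed in each of the 5 clusters
head(y_hc)  # cluster label for the first few customers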
Note that this visualization code is only suitable for two-dimensional clustering.
library(cluster)
clusplot(dataset,
y_hc,
lines = 0,
shade = TRUE,
color = TRUE,
labels= 2,
plotchar = FALSE,
span = TRUE,
main = paste('Clusters of customers'),
xlab = 'Annual Income',
ylab = 'Spending Score')
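A small follow-up sketch: average income and spending score per cluster, using the X copy of the two columns we kept earlier.
aggregate(X, by = list(cluster = y_hc), FUN = mean) # mean income & score per cluster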
knitr::include_graphics("Clustering_ProsCons.png")
Pros and cons of clustering methods
=========================
GitHub files: https://github.com/ghettocounselor
Useful PDF for common questions in the lectures:
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf