Cluster Analysis: Distance Between Observations and Hierarchical Clustering

Kate C

2022-02-08

Load Packages and Dataset

  • dplyr - for data manipulation

  • dummies - for converting categorical values into binary (dummy) feature representations

  • ggplot2 - for visualization

  • dendextend - for coloring dendrograms (tree diagrams)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(ggplot2)
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree

Key Points

  • simple definition: grouping observations so that members within each group are similar to one another

  • a more formal definition: a form of exploratory data analysis where observations are divided into meaningful groups that share common characteristics (aka features)

  • typical workflow: pre-process the data, select a similarity measure, cluster, then analyze; each step may require several iterations.

  • in this section we focus on:

    • understanding what it means for two observations to be similar - or, more specifically, dissimilar

    • why the features of the data need to be comparable to one another

Distance between two observations

  • how dissimilar are these observations?

  • distance = 1 - similarity

  • Euclidean distance as the measure: for two points (x1, y1) and (x2, y2), distance = sqrt((x1 - x2)^2 + (y1 - y2)^2)

Practice

  • calculate and plot the distance between two players. First, create a simple data frame.
two_players <- tibble(x = c(5,15), y = c(4,10))
  • plot the positions of the two players
ggplot(two_players, aes(x = x, y = y)) + 
  geom_point() +
  # Assuming a 40x60 field
  lims(x = c(-30,30), y = c(-20, 20))

  • split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]
  • calculate and print the distance using the Euclidean distance formula
player_distance <- sqrt(
  (player1$x - player2$x)^2 + 
  (player1$y - player2$y)^2
)
player_distance
## [1] 11.6619
  • an easier and quicker way to calculate the distance is the dist() function - its default method is euclidean.
dist(two_players)
##         1
## 2 11.6619

The importance of scale

  • standardization - rescale each feature to mean 0 and standard deviation 1 so that no single feature dominates the distance

  • use the scale() function
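  • a minimal sketch of why scaling matters, using made-up height/weight values (not from the course data): the raw distance is dominated by the feature with the larger scale
players <- data.frame(height = c(1.8, 1.9), weight = c(80, 95))
# raw features: weight (in kg) dwarfs height (in m), so weight dominates the distance
dist(players)
# standardized features: scale() centers each column to mean 0 and sd 1 first
dist(scale(players))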

Measuring distance for categorical data

  • Jaccard index - this measure of similarity is the ratio of the intersection of A and B to the union of A and B; the corresponding distance is 1 minus the similarity.

    • for binary features: the number of features that are TRUE in both observations, divided by the number of features that are TRUE in either.
  • use dummy.data.frame() to convert categorical data to binary features

job_survey <- data.frame(job_satisfaction = c("Hi", "Hi", "Hi", "Hi", "Mid"), is_happy = c("No", "No", "No", "Yes", "No"))
  • dummify the survey data
dummy_survey <- dummy.data.frame(job_survey)
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
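  • note: the dummies package is no longer actively maintained; as a sketch, the same binary features can be built in base R with model.matrix(), disabling contrasts so every factor level gets its own column (column names will differ slightly)
# convert the character columns to factors, then request full dummy coding
job_survey_f <- as.data.frame(lapply(job_survey, factor))
model.matrix(~ . - 1, data = job_survey_f,
             contrasts.arg = lapply(job_survey_f, contrasts, contrasts = FALSE))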
  • calculate the distance - the binary method correctly captures that observations 1, 2, and 3 are identical, hence their pairwise distances are 0.
dist_survey <- dist(dummy_survey, method = "binary")
dist_survey
##           1         2         3         4
## 2 0.0000000                              
## 3 0.0000000 0.0000000                    
## 4 0.6666667 0.6666667 0.6666667          
## 5 0.6666667 0.6666667 0.6666667 1.0000000
job_survey
##   job_satisfaction is_happy
## 1               Hi       No
## 2               Hi       No
## 3               Hi       No
## 4               Hi      Yes
## 5              Mid       No
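  • as a sanity check, a minimal sketch recomputing the Jaccard distance between observations 1 and 4 by hand
# Jaccard distance = 1 - (features TRUE in both) / (features TRUE in either)
obs1 <- unlist(dummy_survey[1, ]) == 1
obs4 <- unlist(dummy_survey[4, ]) == 1
1 - sum(obs1 & obs4) / sum(obs1 | obs4) # 0.6666667, matching the dist() output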

Compare more than two observations

  • linkage criteria

    • complete linkage - maximum distance between two sets

    • single linkage - minimum distance between two sets

    • average linkage - average distance between two sets

Practice - calculate linkage

  • create a distance matrix for three players
three_players <- tibble(x = c(5,15,3), y = c(1,10,9), z = c(3,1,10))
dist_players <- dist(three_players)
  • extract the pairwise distances
distance_1_2 <- dist_players[1]
distance_1_3 <- dist_players[2]
distance_2_3 <- dist_players[3]
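  • note: a dist object stores only the lower triangle in column order, which is why indices 1 to 3 map to the pairs (1,2), (1,3) and (2,3); as a small sketch, converting with as.matrix() makes the pairing explicit
dist_matrix <- as.matrix(dist_players)
dist_matrix[1, 2] # same value as dist_players[1]
dist_matrix[2, 3] # same value as dist_players[3]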
  • calculate the complete linkage distance between group {1, 2} and observation 3 - that is, the maximum of the distances from 3 to 1 and from 3 to 2.
complete <- max(c(distance_1_3, distance_2_3))
complete
## [1] 15.0333
  • calculate the single linkage distance between group {1, 2} and observation 3 - that is, the minimum of the distances from 3 to 1 and from 3 to 2.
single <- min(c(distance_1_3, distance_2_3))
single
## [1] 10.81665
  • calculate the average linkage distance between group {1, 2} and observation 3 - the mean of the two distances
average <- mean(c(distance_1_3, distance_2_3))
average
## [1] 12.92498

Capture K clusters

Group observations into a pre-defined number of clusters.

  • Objectives: use the hclust() function (hierarchical clustering) to calculate the iterative linkage steps, then use the cutree() function to extract the cluster assignments for the desired number of clusters (k).
lineup <- tibble(x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26), y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0))
  • we know that there are two teams (k = 2) in the match, so we use clustering to assign each player to a team based on position (given in the lineup data frame)

  • calculate the distance

dist_players <- dist(lineup, method = "euclidean")
  • perform the hierarchical clustering with complete linkage using hclust() and store the result
hc_players <- hclust(dist_players, method = "complete")
plot(hc_players)

  • build the cluster assignment vector using cutree() with k = 2
clusters_k2 <- cutree(hc_players, k = 2)
  • append the cluster assignments as a new column, cluster, to the lineup data frame, and save the result to a new data frame called lineup_k2_complete
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)
  • explore the clusters: count the observations in each cluster, then plot the players' positions colored by cluster. We can tell whether the clustering makes sense by inspecting the plot and reasoning from what we know about the data, i.e. two teams with the same number of players on each.
count(lineup_k2_complete, cluster)
## # A tibble: 2 × 2
##   cluster     n
##     <int> <int>
## 1       1     6
## 2       2     6
ggplot(lineup_k2_complete, aes(x,y, color = factor(cluster))) + 
  geom_point()

Dendrogram

  • perform the hierarchical clustering calculation using each of the three linkage methods

  • plot the three dendrograms side by side and compare the differences

hc_complete <- hclust(dist_players, method = "complete")
hc_single <- hclust(dist_players, method = "single")
hc_average <- hclust(dist_players, method = "average")

par(mfrow = c(1,3))
plot(hc_complete, main = "complete linkage")
plot(hc_single, main = "single linkage")
plot(hc_average, main = "average linkage")

Cutting the tree

  • cut the tree at any desired height

  • a height cutoff describes the clusters in terms of the data itself: every member of a cluster lies within that linkage distance (here, complete linkage Euclidean distance) of the other members

library(dendextend)
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)

# Plot the dendrogram
plot(dend_players)

# Color branches by the clusters formed from cuts at heights of 10, 20, 30 & 40
dend_10 <- color_branches(dend_players, h = 10)
dend_20 <- color_branches(dend_players, h = 20)
dend_30 <- color_branches(dend_players, h = 30)
dend_40 <- color_branches(dend_players, h = 40)

# Plot each dendrogram with its branches colored below the cut height
plot(dend_10, main = "height of 10")

plot(dend_20, main = "height of 20")

plot(dend_30, main = "height of 30")

plot(dend_40, main = "height of 40")

Explore the branches cut from the tree

  • note that we have already created variables dist_players (euclidean method) and hc_players (complete linkage)
cluster_20 <- cutree(hc_players, h = 20) # calculate the assignment vector with h = 20
  • the call below means that, within each cluster, every member is at a maximum (complete linkage) Euclidean (euclidean method) distance of less than 40 from all other members of its cluster.
cluster_40 <- cutree(hc_players, h = 40)
lineup_h20_complete <- mutate(lineup, cluster = cluster_20)
lineup_h40_complete <- mutate(lineup, cluster = cluster_40)
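  • a quick sanity check (sketch): within each cluster from the h = 40 cut, the largest pairwise Euclidean distance should indeed fall below 40
# maximum within-cluster distance for each cluster of the h = 40 cut
sapply(split(lineup, cluster_40), function(members) max(dist(members)))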
  • plot the positions of the players and color them according to their cluster for height = 20
ggplot(lineup_h20_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

  • for height = 40
ggplot(lineup_h40_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

Making Sense of the clusters

Load new data: the amounts spent by 45 different clients of a wholesale distributor on the food categories Milk, Grocery, and Frozen.

Objective - to assign these clients into meaningful clusters.

customers_spend <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/ws_customers.rds")
  • segment the customers
dist_customers <- dist(customers_spend, method = "euclidean")
hc_customers <- hclust(dist_customers, method = "complete")
plot(hc_customers)

clust_customers <- cutree(hc_customers, h = 15000)
segment_customers <- mutate(customers_spend, cluster = clust_customers)
  • explore the wholesale customer clusters

  • since we are working with more than 2 dimensions, it would be challenging to visualize a scatter plot of the clusters; instead, we rely on summary statistics to explore them.

  • in this exercise we analyze the mean amount spent in each cluster for all three categories.

  • count the observations in each cluster

cluster_counts <- count(segment_customers, cluster)
  • color and plot the dendrogram using a cut height of 15,000
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)
plot(dend_colored)

  • calculate the mean of each category per cluster, and append the cluster counts using inner_join().
segment_customers %>% 
  group_by(cluster) %>% 
  summarize_all(list(mean)) %>% 
  inner_join(cluster_counts, by = "cluster")
## # A tibble: 4 × 5
##   cluster   Milk Grocery Frozen     n
##     <int>  <dbl>   <dbl>  <dbl> <int>
## 1       1 16950   12891.   991.     5
## 2       2  2513.   5229.  1796.    29
## 3       3 10452.  22551.  1355.     5
## 4       4  1250.   3917. 10889.     6
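  • the same table can be produced in a single pipeline with across() and n() (assuming dplyr >= 1.0), avoiding the separate count step
segment_customers %>%
  group_by(cluster) %>%
  summarize(across(everything(), mean), n = n())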
  • the majority of customers fall into cluster 2, which does not show excessive spending in any category

  • meaningful conclusions can only be drawn from the clusters in light of the business context