Load Packages and Dataset
dplyr - for data manipulation
dummies - for converting categorical values into binary feature value representation
ggplot2 - for visualization
dendextend - for drawing colorful dendrograms (tree diagrams)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(ggplot2)
library(dendextend)
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
Key Points
simple definition: grouping observations so that members of the same group are similar to one another
a more formal definition: a form of exploratory data analysis where observations are divided into meaningful groups that share common characteristics (aka features)
typical workflow: pre-process the data, select a similarity measure, cluster, analyze. These steps may need several iterations.
in this section we focus on:
understanding what it means for two observations to be similar, or more specifically dissimilar
learning why the features of the data need to be comparable to one another
Distance between two observations
how dissimilar are these observations?
distance = 1 - similarity
Euclidean distance as the measurement
Practice
- calculate and plot the distance between two players. Here we create a simple data frame.
two_players <- tibble(x = c(5,15), y = c(4,10))
- plot the positions of the two players
ggplot(two_players, aes(x = x, y = y)) +
  geom_point() +
  # Assuming a 60x40 field
  lims(x = c(-30,30), y = c(-20, 20))
- split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]
- calculate and print the distance using the Euclidean distance formula
player_distance <- sqrt(
  (player1$x - player2$x)^2 +
  (player1$y - player2$y)^2
)
player_distance
## [1] 11.6619
- an easier and quicker way to calculate the distance is the dist() function; its default method is Euclidean.
dist(two_players)
##         1
## 2 11.6619
The importance of scale
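standardization - rescale each feature to mean 0 and standard deviation 1 so that features measured in different units contribute comparably to the distance
use the scale() function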
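To see why scale matters, here is a minimal sketch in base R. The `people` data frame (heights in feet, weights in pounds) is hypothetical and not part of these notes. Before scaling, the weight column dominates the Euclidean distance simply because its numbers are larger.

```r
# Hypothetical data: two features on very different scales
people <- data.frame(
  height_ft = c(6.0, 6.1, 5.9),
  weight_lb = c(200, 150, 195)
)

# Distances on the raw values are driven almost entirely by weight
dist_raw <- dist(people)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled_people <- scale(people)

# Distances on the standardized values weigh both features comparably
dist_scaled <- dist(scaled_people)

print(dist_raw)
print(dist_scaled)
```

After scaling, the height differences are no longer drowned out by the weight differences.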
Measuring distance for categorical data
Jaccard index - This measure of similarity captures the ratio between the intersection of A and B to the union of A and B.
- The ratio between the number of times the features of both observations are TRUE to the number of times they are ever TRUE.
use dummy.data.frame() to convert categorical data to binary features
job_survey <- data.frame(job_satisfaction = c("Hi", "Hi", "Hi", "Hi", "Mid"), is_happy = c("No", "No", "No", "Yes", "No"))
- dummify the survey data
dummy_survey <- dummy.data.frame(job_survey)
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
- calculate the distance. The result correctly captures that observations 1, 2, and 3 are identical, so the distances between them are 0.
dist_survey <- dist(dummy_survey, method = "binary")
dist_survey
##           1         2         3         4
## 2 0.0000000
## 3 0.0000000 0.0000000
## 4 0.6666667 0.6666667 0.6666667
## 5 0.6666667 0.6666667 0.6666667 1.0000000
job_survey
##   job_satisfaction is_happy
## 1               Hi       No
## 2               Hi       No
## 3               Hi       No
## 4               Hi      Yes
## 5              Mid       No
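The 0.6666667 between observations 1 and 4 can be reproduced by hand. The sketch below is base R only. The helper `jaccard_distance()` is a hypothetical name, and the two vectors are hand-written binary encodings of survey rows 1 and 4 (columns assumed to be Hi, Mid, No, Yes) rather than the output of dummy.data.frame().

```r
# Hypothetical helper: Jaccard distance = 1 - |A and B| / |A or B|
jaccard_distance <- function(a, b) {
  a <- as.logical(a)
  b <- as.logical(b)
  1 - sum(a & b) / sum(a | b)
}

# Hand-encoded binary features for survey observations 1 and 4
obs1 <- c(Hi = 1, Mid = 0, No = 1, Yes = 0)
obs4 <- c(Hi = 1, Mid = 0, No = 0, Yes = 1)

# One shared TRUE feature (Hi) out of three features that are ever TRUE
by_hand <- jaccard_distance(obs1, obs4)

# dist(method = "binary") computes the same quantity
via_dist <- as.numeric(dist(rbind(obs1, obs4), method = "binary"))

print(c(by_hand = by_hand, via_dist = via_dist))
```

Both values equal 2/3: the observations share one TRUE feature out of three that are TRUE in at least one of them, so the similarity is 1/3 and the distance is 1 - 1/3.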
Compare more than two observations
linkage criteria
complete linkage - maximum distance between two sets
single linkage - min distance between two sets
average linkage - average distance between two sets
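The three criteria can be sketched as one small base R function. The name `linkage_distance()` and the three example points are hypothetical; the points are chosen so that the pairwise distances are 3, 4, and 5.

```r
# Hypothetical helper: distance between two groups of observations
# under a chosen linkage criterion, given a dist object
linkage_distance <- function(d, group_a, group_b,
                             method = c("complete", "single", "average")) {
  method <- match.arg(method)
  # All pairwise distances between members of group_a and members of group_b
  pairwise <- as.matrix(d)[group_a, group_b, drop = FALSE]
  switch(method,
    complete = max(pairwise),
    single   = min(pairwise),
    average  = mean(pairwise)
  )
}

# Hypothetical points: d(1,2) = 3, d(1,3) = 4, d(2,3) = 5
pts <- data.frame(x = c(0, 3, 0), y = c(0, 0, 4))
d <- dist(pts)

complete_link <- linkage_distance(d, c(1, 2), 3, "complete")
single_link   <- linkage_distance(d, c(1, 2), 3, "single")
average_link  <- linkage_distance(d, c(1, 2), 3, "average")

print(c(complete = complete_link, single = single_link, average = average_link))
```

For the group {1, 2} against observation 3, complete linkage takes the larger of 4 and 5, single linkage the smaller, and average linkage their mean.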
Practice - calculate linkage
- create the distance matrix for three players
three_players <- tibble(x = c(5,15,3), y = c(1,10,9), z = c(3,1,10))
dist_players <- dist(three_players)
- extract pair distances
distance_1_2 <- dist_players[1]
distance_1_3 <- dist_players[2]
distance_2_3 <- dist_players[3]
- calculate the complete distance between group 1-2 and 3, that is, the maximum of the distances from 3 to 1 and from 3 to 2.
complete <- max(c(distance_1_3, distance_2_3))
complete
## [1] 15.0333
- calculate the single distance between group 1-2 and 3, that is, the minimum of the distances from 3 to 1 and from 3 to 2.
single <- min(c(distance_1_3, distance_2_3))
single
## [1] 10.81665
- calculate the average distance between group 1-2 and 3
average <- mean(c(distance_1_3, distance_2_3))
average
## [1] 12.92498
Capture K clusters
Group observations into a pre-defined number of clusters.
- Objectives: use the hclust() function (hierarchical clustering) to calculate the iterative linkage steps and use the cutree() function to extract the cluster assignments for the desired number (k) of clusters.
lineup <- tibble(x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26), y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0))
We know that there are two teams (k = 2) in the match, so we use clustering to assign each player to a team based on their position (stored in the lineup dataset).
- calculate the distance
dist_players <- dist(lineup, method = "euclidean")
- perform the complete linkage calculation for hierarchical clustering using hclust and store the result
hc_players <- hclust(dist_players, method = "complete")
plot(hc_players)
- build the cluster assignment vector using cutree with k = 2
clusters_k2 <- cutree(hc_players, k = 2)
- append the cluster assignments as a column cluster to the lineup data frame and save the result to a new data frame called lineup_k2_complete
lineup_k2_complete <- mutate(lineup, cluster = clusters_k2)
- explore the clusters. Count the number of players in each cluster and plot the players' positions, colored by cluster. We can tell whether the clustering makes sense by inspecting the plot and reasoning from what we know about the data, i.e. two teams with the same number of players on each team.
count(lineup_k2_complete, cluster)
## # A tibble: 2 × 2
##   cluster     n
##     <int> <int>
## 1       1     6
## 2       2     6
ggplot(lineup_k2_complete, aes(x,y, color = factor(cluster))) +
  geom_point()
Dendrogram
perform the linkage calculation for the hierarchical clustering using three linkage criteria
plot the three dendrograms side by side and review the differences
hc_complete <- hclust(dist_players, method = "complete")
hc_single <- hclust(dist_players, method = "single")
hc_average <- hclust(dist_players, method = "average")
par(mfrow = c(1,3))
plot(hc_complete, main = "complete linkage")
plot(hc_single, main = "single linkage")
plot(hc_average, main = "average linkage")
Cutting the tree
cut the tree at any desired height
the height cutoff, together with the linkage criterion and the distance method, describes a characteristic of the resulting clusters
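What a height cutoff guarantees can be checked directly. The sketch below is base R only and reuses the lineup coordinates from these notes, stored in a plain data.frame instead of a tibble (an assumption made to avoid extra packages). With complete linkage and Euclidean distance, cutting at height h should leave clusters in which no two members are farther apart than h.

```r
# Lineup coordinates copied from the notes, as a base R data.frame
lineup <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26),
  y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
)

d  <- dist(lineup, method = "euclidean")
hc <- hclust(d, method = "complete")

# Cut the tree at a chosen height and get one cluster label per player
h <- 20
clusters <- cutree(hc, h = h)

# For each cluster, find the largest pairwise distance among its members
dm <- as.matrix(d)
max_within <- sapply(
  split(seq_along(clusters), clusters),
  function(idx) max(dm[idx, idx, drop = FALSE])
)

# Every value printed here should be at most h
print(max_within)
```

The same check with `method = "single"` would generally fail, because single linkage only bounds the distance to the nearest member of the cluster, not to every member.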
library(dendextend)
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")
# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)
# Plot the dendrogram
plot(dend_players)
# Color branches by the clusters formed from cuts at heights of 10, 20, 30, and 40
dend_10 <- color_branches(dend_players, h = 10)
dend_20 <- color_branches(dend_players, h = 20)
dend_30 <- color_branches(dend_players, h = 30)
dend_40 <- color_branches(dend_players, h = 40)
# Plot each colored dendrogram
plot(dend_10, main = "height of 10")
plot(dend_20, main = "height of 20")
plot(dend_30, main = "height of 30")
plot(dend_40, main = "height of 40")
Explore the branches cut from the tree
- note that we have already created the variables dist_players (Euclidean method) and hc_players (complete linkage)
cluster_20 <- cutree(hc_players, h = 20) # calculate the assignment vector with an h of 20
- the cut below at h = 40 means that, with complete linkage and Euclidean distance, each member of a cluster is at a distance of at most 40 from every other member of its cluster.
cluster_40 <- cutree(hc_players, h = 40)
lineup_h20_complete <- mutate(lineup, cluster = cluster_20)
lineup_h40_complete <- mutate(lineup, cluster = cluster_40)
- plot the positions of the players and color them according to their cluster for height = 20
ggplot(lineup_h20_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
- for height = 40
ggplot(lineup_h40_complete, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
Making Sense of the clusters
load new data - the amount spent by 45 different clients of a wholesale distributor for the food categories of Milk, Grocery, and Frozen.
Objective - to assign these clients into meaningful clusters.
customers_spend <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/ws_customers.rds")
- segment the customers
dist_customers <- dist(customers_spend, method = "euclidean")
hc_customers <- hclust(dist_customers, method = "complete")
plot(hc_customers)
clust_customers <- cutree(hc_customers, h = 15000)
segment_customers <- mutate(customers_spend, cluster = clust_customers)
explore wholesale customer clusters
Since we are working with more than 2 dimensions it would be challenging to visualize a scatter plot of the clusters, instead you will rely on summary statistics to explore these clusters.
In this exercise we will analyze the mean amount spent in each cluster for all three categories.
- calculate each cluster count
count <- count(segment_customers, cluster)
- color and plot the dendrogram using the height of 15,000
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h = 15000)
plot(dend_colored)
- calculate the mean for each category and append the cluster count using inner_join.
segment_customers %>%
  group_by(cluster) %>%
  summarize_all(list(mean)) %>%
  inner_join(count, by = "cluster")
## # A tibble: 4 × 5
##   cluster   Milk Grocery Frozen     n
##     <int>  <dbl>   <dbl>  <dbl> <int>
## 1       1 16950   12891.   991.     5
## 2       2  2513.   5229.  1796.    29
## 3       3 10452.  22551.  1355.     5
## 4       4  1250.   3917. 10889.     6
the majority of customers (29 of 45) fell into cluster 2 and did not show excessive spending in any category
meaningful conclusions depend on the business context of the clustering
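The same kind of per-cluster summary can be sketched in base R without dplyr. The `spend` table below is hypothetical toy data, not the ws_customers file, and k = 2 is chosen only for illustration.

```r
# Hypothetical toy spend data: two low-Milk and two high-Milk customers
spend <- data.frame(
  Milk    = c(100, 120, 900, 950),
  Grocery = c(200, 220, 300, 310),
  Frozen  = c(50, 60, 40, 45)
)

# Cluster with complete linkage on Euclidean distance and cut into two groups
clusters <- cutree(hclust(dist(spend), method = "complete"), k = 2)

# Mean spend per category within each cluster
cluster_means <- aggregate(spend, by = list(cluster = clusters), FUN = mean)

# Append the number of customers in each cluster
# (aggregate() and table() both order clusters by ascending label)
cluster_means$n <- as.integer(table(clusters))

print(cluster_means)
```

Reading the means next to the counts is what makes a segment interpretable: a cluster with a high average in one category and only a few members tells a different story than a large cluster with unremarkable averages.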