Ch. 1 - Calculating distance between observations

What is cluster analysis?

Cluster analysis is a form of exploratory data analysis (EDA) where observations are divided into meaningful groups that share common characteristics (features).

The Flow of Cluster Analysis

  1. Pre-process data
  2. Select similarity measure
  3. Cluster
  4. Analyze (may need to iterate back to #2)
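
The four steps above can be sketched in base R. This is only an illustration under stated assumptions: the `measurements` data frame and its values are made up, and `hclust()` with `cutree()` is just one possible choice for steps 3 and 4.

```r
# Hypothetical data, invented for illustration only
measurements <- data.frame(
  height_cm = c(150, 152, 181, 185, 167, 169),
  weight_kg = c(50, 54, 90, 95, 70, 68)
)

# 1. Pre-process: put the features on a comparable scale
measurements_scaled <- scale(measurements)

# 2. Select a similarity measure: Euclidean distance
dist_measurements <- dist(measurements_scaled, method = "euclidean")

# 3. Cluster: hierarchical clustering with complete linkage
hc <- hclust(dist_measurements, method = "complete")

# 4. Analyze: cut the tree into k = 2 groups and inspect the group sizes
clusters <- cutree(hc, k = 2)
table(clusters)
```

If the groups are hard to interpret, a different distance measure in step 2 is the natural place to iterate.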

Two types of clustering are covered:

  1. Hierarchical clustering
  2. K-means clustering

When to cluster?

In which of these scenarios would clustering methods likely be appropriate?

  1. Using consumer behavior data to identify distinct segments within a market.
  2. Predicting whether a given user will click on an ad.
  3. Identifying distinct groups of stocks that follow similar trading patterns.
  4. Modeling & predicting GDP growth.

  • 1
  • 2
  • 4
  • [*] 1 & 3
  • 2 & 4

Distance between two observations

Distance vs. Similarity

Distance = 1 - Similarity

# print(two_players)
#         X     Y
# BLUE    0     0
# RED     9     12

# dist(two_players, method = 'euclidean')
#         RED
# BLUE    15


# print(three_players)
#         X     Y
# BLUE    0     0
# RED     9     12
# GREEN   -2    19

# dist(three_players, method = 'euclidean')
#         BLUE      RED
# RED     15.00000  
# GREEN   19.10497  13.03840
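
The slide output above can be reproduced with a short sketch. It assumes `three_players` is a data frame rebuilt from the printed coordinates; `euclidean()` is a hypothetical helper written for this example, not a base R function.

```r
# Rebuild the slide's data from the printed coordinates
three_players <- data.frame(
  X = c(0, 9, -2),
  Y = c(0, 12, 19),
  row.names = c("BLUE", "RED", "GREEN")
)

# Euclidean distance between two numeric vectors (hypothetical helper);
# it works for any number of features, not just two
euclidean <- function(a, b) sqrt(sum((a - b)^2))

blue_red <- euclidean(
  unlist(three_players["BLUE", ]),
  unlist(three_players["RED", ])
)
blue_red
## [1] 15

# dist() computes the same quantity for every pair of rows
dist_three <- dist(three_players, method = "euclidean")
dist_three
```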

Calculate & plot the distance between two players

# Plot the positions of the players
ggplot(two_players, aes(x = x, y = y)) + 
  geom_point() +
  # Assuming a 60x40 field (x from -30 to 30, y from -20 to 20)
  lims(x = c(-30,30), y = c(-20, 20))

# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]

# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$x - player2$x)^2 + (player1$y - player2$y)^2 )
player_distance
## [1] 11.6619

Using the dist() function

# Calculate the Distance Between two_players
dist_two_players <- dist(two_players)
dist_two_players
##         1
## 2 11.6619
# Calculate the Distance Between three_players
dist_three_players <- dist(three_players)
dist_three_players
##          1        2
## 2 11.66190         
## 3 16.76305 18.02776

Who are the closest players?

You are given a data frame containing the positions of 4 players on a soccer field.

This data is preloaded as four_players in your environment and is displayed below.

# Player   x    y
# 1        5    4
# 2        15   10
# 3        0    20
# 4       -5    5

Work in the R console to answer the following question:

# Calculate the Distance Between four_players
dist_four_players <- dist(four_players)
dist_four_players
##          1        2        3
## 2 11.66190                  
## 3 16.76305 18.02776         
## 4 10.04988 20.61553 15.81139

Which two players are closest to one another?

  • 1 & 2
  • 1 & 3
  • [*] 1 & 4
  • 2 & 3
  • 2 & 4
  • 3 & 4
  • Not enough information to decide
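
For larger distance matrices, reading the closest pair by eye stops being practical. One possible way to find it programmatically is sketched below, assuming `four_players` is rebuilt from the table above; the variable names are illustrative.

```r
# Rebuild the data from the table above
four_players <- data.frame(
  x = c(5, 15, 0, -5),
  y = c(4, 10, 20, 5)
)

dist_four_players <- dist(four_players)

# Convert to a full square matrix and blank out the zero diagonal
dist_matrix <- as.matrix(dist_four_players)
diag(dist_matrix) <- NA

# Row/column indices of the smallest off-diagonal distance
closest <- which(dist_matrix == min(dist_matrix, na.rm = TRUE), arr.ind = TRUE)
closest_pair <- sort(unique(as.vector(closest)))
closest_pair
## [1] 1 4
```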

The importance of scale

Standardization

height_scaled = [height - mean(height)] / sd(height)

Use the scale() function to standardize height and weight.
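
A small sketch comparing the manual formula with `scale()`. The `height` and `weight` values are made up for illustration and are not from the course data.

```r
# Hypothetical measurements, invented for illustration only
height <- c(150, 160, 170, 180, 190)  # cm
weight <- c(50, 65, 60, 80, 95)       # kg

# Manual standardization: subtract the mean, divide by the standard deviation
height_scaled <- (height - mean(height)) / sd(height)

# scale() applies the same transformation to every column
scaled <- scale(data.frame(height, weight))

# The two approaches agree
all.equal(height_scaled, as.numeric(scaled[, "height"]))
## [1] TRUE
```

After standardization each column has mean 0 and standard deviation 1, so no single feature dominates the distance because of its units.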

Effects of scale

# Calculate distance for three_trees 
dist_trees <- dist(three_trees)

# Scale three trees & calculate the distance  
scaled_three_trees <- scale(three_trees)
dist_scaled_trees <- dist(scaled_three_trees)

# Output the results of both Matrices
print('Without Scaling')
## [1] "Without Scaling"
dist_trees
##          1        2
## 2 60.00075         
## 3 24.10062 84.02149
print('With Scaling')
## [1] "With Scaling"
dist_scaled_trees
##          1        2
## 2 1.409365         
## 3 1.925659 2.511082

When to scale data?

Below are examples of datasets and their corresponding features.

In which of these examples would scaling not be necessary?

  • Taxi Trips - tip earned ($), distance traveled (km).
  • Health Measurements of Individuals - height (meters), weight (grams), body fat percentage (%).
  • Student Attributes - average test score (1-100), distance from school (km), annual household income ($).
  • Salespeople Commissions - total yearly commission ($), number of trips taken.
  • [*] None of the above, they all should be scaled when measuring distance.

Measuring distance for categorical data

# dist(survey_a, method = "binary")

# Dummification in R
# library(dummies)
# dummy.data.frame(survey_b)

# print(survey_b)
# dummy_survey_b <- dummy.data.frame(survey_b)
# dist(dummy_survey_b, method = "binary")

Calculating distance between categorical variables

# Dummify the Survey Data
dummy_survey <- dummy.data.frame(job_survey)
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
# Calculate the Distance
dist_survey <- dist(dummy_survey, method = "binary")

# Print the Original Data
job_survey
##   job_satisfaction is_happy
## 1              Low       No
## 2              Low       No
## 3               Hi      Yes
## 4              Low       No
## 5              Mid       No
# Print the Distance Matrix
dist_survey
##           1         2         3         4
## 2 0.0000000                              
## 3 1.0000000 1.0000000                    
## 4 0.0000000 0.0000000 1.0000000          
## 5 0.6666667 0.6666667 1.0000000 0.6666667
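
The same distances can be reproduced without the dummies package. The sketch below swaps in base R for the dummification step: `one_hot()` is a hypothetical helper written for this example, `job_survey` is rebuilt from the printed data, and the resulting column names differ from those produced by `dummy.data.frame()`.

```r
# Rebuild the survey data from the printed output above
job_survey <- data.frame(
  job_satisfaction = c("Low", "Low", "Hi", "Low", "Mid"),
  is_happy = c("No", "No", "Yes", "No", "No"),
  stringsAsFactors = TRUE
)

# One 0/1 indicator column per level of a factor (hypothetical helper)
one_hot <- function(f) {
  sapply(levels(f), function(lvl) as.integer(f == lvl))
}

# Bind the indicator columns of every variable into one matrix
dummy_survey <- do.call(cbind, lapply(job_survey, one_hot))

# Binary distance: among the columns where at least one of the two
# observations is 1, the share of columns where they disagree
dist_survey <- dist(dummy_survey, method = "binary")
dist_survey
```

For observations 1 and 5, three indicator columns are switched on in at least one of them (Low, Mid, No) and two of those differ, which gives the 2/3 shown in the matrix above.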

The closest observation to a pair

Below you see a pre-calculated distance matrix between four players on a soccer field. You can clearly see that players 1 & 4 are the closest to one another with a Euclidean distance value of 10.

#       1      2      3
# 2     11.7
# 3     16.8   18.0
# 4     10.0   20.6   15.8

If 1 and 4 are the closest players among the four, which player is closest to players 1 and 4?

  • Clearly it's player 2!
  • No! Player 3 makes more sense.
  • [*] Are you kidding me? There isn’t enough information to decide.

Ch. 2 - Hierarchical clustering

Comparing more than two observations

# max(D(2,1), D(2,4))

Calculating linkage

# Extract the pair distances
distance_1_2 <- dist_players[1]
distance_1_3 <- dist_players[2]
distance_2_3 <- dist_players[3]

# Calculate the complete distance between group 1-2 and 3
complete <- max(c(distance_1_3, distance_2_3))
complete
## [1] 18.02776
# Calculate the single distance between group 1-2 and 3
single <- min(c(distance_1_3, distance_2_3))
single
## [1] 16.76305
# Calculate the average distance between group 1-2 and 3
average <- mean(c(distance_1_3, distance_2_3))
average
## [1] 17.39541
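
As a cross-check, the same three values appear as merge heights when the players are clustered with `hclust()` from base R's stats package. This sketch assumes the three players sit at the first three positions of the four_players table, which is consistent with the distances printed above.

```r
# Assumed positions: first three rows of the four_players table
three_players <- data.frame(x = c(5, 15, 0), y = c(4, 10, 20))
dist_players <- dist(three_players)

# Players 1 and 2 merge first (distance 11.66). The second merge height is
# the linkage distance between group 1-2 and player 3.
linkage_heights <- sapply(c("complete", "single", "average"), function(m) {
  hclust(dist_players, method = m)$height[2]
})
linkage_heights
```

The only thing that changes between the three runs is the `method` argument, which selects the linkage rule used to compare a group with an observation.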

Revisited: The closest observation to a pair

You are now ready to answer this question!

Below you see a pre-calculated distance matrix between four players on a soccer field. You can clearly see that players 1 & 4 are the closest to one another with a Euclidean distance value of 10. This distance matrix is available for your exploration as the variable dist_players.

#       1      2      3
# 2     11.7
# 3     16.8   18.0
# 4     10.0   20.6   15.8

If 1 and 4 are the closest players among the four, which player is closest to players 1 and 4?

  • [*] Complete Linkage: Player 3, Single & Average Linkage: Player 2
  • Complete Linkage: Player 2, Single & Average Linkage: Player 3
  • Player 2 using Complete, Single & Average Linkage methods
  • Player 3 using Complete, Single & Average Linkage methods
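
The comparison can be worked through in a short sketch. It assumes the distance matrix comes from the four_players positions listed earlier in these notes; `linkage_to_group()` is a hypothetical helper written for this example.

```r
# Positions taken from the four_players table earlier in the notes
four_players <- data.frame(x = c(5, 15, 0, -5), y = c(4, 10, 20, 5))
d <- as.matrix(dist(four_players))

# Distance from one player to the group {1, 4} under each linkage rule
linkage_to_group <- function(player, group = c(1, 4)) {
  pair_distances <- d[player, group]
  c(complete = max(pair_distances),
    single   = min(pair_distances),
    average  = mean(pair_distances))
}

linkage <- sapply(c(player_2 = 2, player_3 = 3), linkage_to_group)
round(linkage, 1)
##          player_2 player_3
## complete     20.6     16.8
## single       11.7     15.8
## average      16.1     16.3
```

The smaller value in each row identifies the closer player: player 3 under complete linkage, player 2 under single and average linkage.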

Capturing K clusters

Assign cluster membership

Exploring the clusters

Validating the clusters

Visualizing the Dendrogram

Comparing average, single & complete linkage

Height of the tree

Cutting the tree

Clusters based on height

Exploring the branches cut from the tree

What do we know about our clusters?

Making sense of the clusters

Segment wholesale customers

Explore wholesale customer clusters

Interpreting the wholesale customer clusters


Ch. 3 - K-means clustering

Introduction to K-means

K-means on a soccer field

K-means on a soccer field (part 2)

Evaluating different values of K by eye

Many K’s many models

Elbow (Scree) plot

Interpreting the elbow plot

Silhouette analysis: Observation level performance

Silhouette analysis

Making sense of the K-means clusters

Revisiting wholesale data: “Best” k

Revisiting wholesale data: Exploration


Ch. 4 - Case Study: National Occupational mean wage

Occupational wage data

Initial exploration of the data

Hierarchical clustering: Occupation trees

Hierarchical clustering: Preparing for exploration

Hierarchical clustering: Plotting occupational clusters

Reviewing the HC Results

K-means: Elbow analysis

K-means: Average Silhouette Widths

The “best” number of clusters

Review K-means Results


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With nearly 20 years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | michaelmallari.com