Cluster analysis is a form of exploratory data analysis (EDA) where observations are divided into meaningful groups that share common characteristics (features).
The Flow of Cluster Analysis
You will learn two types of clustering.
In which of these scenarios would clustering methods likely be appropriate?
1. Using consumer behavior data to identify distinct segments within a market.
2. Predicting whether a given user will click on an ad.
3. Identifying distinct groups of stocks that follow similar trading patterns.
4. Modeling and predicting GDP growth.
Distance vs. Similarity
Distance = 1 - Similarity
# print(two_players)
# X Y
# BLUE 0 0
# RED 9 12
# dist(two_players, method = 'euclidean')
# RED
# BLUE 15
# print(three_players)
# X Y
# BLUE 0 0
# RED 9 12
# GREEN -2 19
# dist(three_players, method = 'euclidean')
# BLUE RED
# RED 15.00000
# GREEN 19.10497 13.03840
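The commented printout above can be reproduced with a short sketch. The data frame is rebuilt here from the printed coordinates; `dist_three` and `dist_matrix` are illustrative names, not objects from the original exercise.

```r
# Rebuild the three_players data shown in the printout above
three_players <- data.frame(
  X = c(0, 9, -2),
  Y = c(0, 12, 19),
  row.names = c("BLUE", "RED", "GREEN")
)

# dist() returns the lower triangle of pairwise distances
dist_three <- dist(three_players, method = "euclidean")
dist_three

# as.matrix() gives a full symmetric matrix that can be indexed by name
dist_matrix <- as.matrix(dist_three)
dist_matrix["BLUE", "RED"]
## [1] 15
```

Converting to a full matrix is optional, but it makes individual pairs easier to look up than the compact `dist` object.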
# Plot the positions of the players
library(ggplot2)
ggplot(two_players, aes(x = x, y = y)) +
  geom_point() +
  # Assuming a 60x40 field (x from -30 to 30, y from -20 to 20)
  lims(x = c(-30, 30), y = c(-20, 20))
# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]
# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$x - player2$x)^2 + (player1$y - player2$y)^2 )
player_distance
## [1] 11.6619
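The formula used in the code above can be written out as a worked calculation. This assumes the two players sit at (5, 4) and (15, 10), the coordinates listed for players 1 and 2 in the four_players table later in these notes, which is consistent with the printed result.

```latex
d(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
            = \sqrt{(5 - 15)^2 + (4 - 10)^2}
            = \sqrt{100 + 36}
            = \sqrt{136} \approx 11.6619
```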
# Calculate the Distance Between two_players
dist_two_players <- dist(two_players)
dist_two_players
## 1
## 2 11.6619
# Calculate the Distance Between three_players
dist_three_players <- dist(three_players)
dist_three_players
## 1 2
## 2 11.66190
## 3 16.76305 18.02776
You are given a data frame containing the positions of 4 players on a soccer field.
This data is preloaded as four_players in your environment and is displayed below.
# Player x y
# 1 5 4
# 2 15 10
# 3 0 20
# 4 -5 5
Work in the R console to answer the following question:
# Calculate the Distance Between four_players
dist_four_players <- dist(four_players)
dist_four_players
## 1 2 3
## 2 11.66190
## 3 16.76305 18.02776
## 4 10.04988 20.61553 15.81139
Which two players are closest to one another?
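One way to answer this without reading the matrix by eye is sketched below. The data frame is rebuilt from the four_players table above; `dist_matrix`, `closest`, and `closest_pair` are illustrative names.

```r
# Positions from the four_players table above
four_players <- data.frame(
  x = c(5, 15, 0, -5),
  y = c(4, 10, 20, 5)
)

dist_matrix <- as.matrix(dist(four_players))

# The diagonal is all zeros (each player to itself), so mask it out
diag(dist_matrix) <- NA

# Row and column index of the smallest remaining distance
closest <- which(dist_matrix == min(dist_matrix, na.rm = TRUE), arr.ind = TRUE)
closest_pair <- sort(unique(as.vector(closest)))
closest_pair
## [1] 1 4
```

Because the full matrix is symmetric, `which()` finds the minimum twice (once per triangle); `unique()` collapses the two hits into a single pair.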
Standardization
height_scaled = [height - mean(height)] / sd(height)
Use the scale() function to standardize height and weight.
# Calculate distance for three_trees
dist_trees <- dist(three_trees)
# Scale three trees & calculate the distance
scaled_three_trees <- scale(three_trees)
dist_scaled_trees <- dist(scaled_three_trees)
# Output the results of both Matrices
print('Without Scaling')
## [1] "Without Scaling"
dist_trees
## 1 2
## 2 60.00075
## 3 24.10062 84.02149
print('With Scaling')
## [1] "With Scaling"
dist_scaled_trees
## 1 2
## 2 1.409365
## 3 1.925659 2.511082
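As a rough check that scale() applies the standardization formula given earlier, the sketch below compares a manual calculation with scale(). The `trees` data frame is hypothetical and is not the three_trees data from the exercise.

```r
# Hypothetical measurements; these are NOT the three_trees values from the source
trees <- data.frame(
  girth  = c(8.3, 8.6, 10.5),
  height = c(840, 780, 864)
)

# Manual standardization: subtract the column mean, divide by the column sd
manual_scaled <- as.data.frame(
  lapply(trees, function(col) (col - mean(col)) / sd(col))
)

# scale() does the same for every column by default (center = TRUE, scale = TRUE)
auto_scaled <- scale(trees)

# The two results agree up to floating-point error
all.equal(as.matrix(manual_scaled), auto_scaled, check.attributes = FALSE)
## [1] TRUE
```

After scaling, every feature has mean 0 and standard deviation 1, so a feature measured in large units no longer dominates the distance.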
Below are examples of datasets and their corresponding features.
In which of these examples would scaling not be necessary?
# dist(survey_a, method = "binary")
# Dummification in R
# library(dummies)
# dummy.data.frame(survey_b)
# print(survey_b)
# dummy_survey_b <- dummy.data.frame(survey_b)
# dist(dummy_survey_b, method = "binary")
# Dummify the Survey Data
dummy_survey <- dummy.data.frame(job_survey)
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
# Calculate the Distance
dist_survey <- dist(dummy_survey, method = "binary")
# Print the Original Data
job_survey
## job_satisfaction is_happy
## 1 Low No
## 2 Low No
## 3 Hi Yes
## 4 Low No
## 5 Mid No
# Print the Distance Matrix
dist_survey
## 1 2 3 4
## 2 0.0000000
## 3 1.0000000 1.0000000
## 4 0.0000000 0.0000000 1.0000000
## 5 0.6666667 0.6666667 1.0000000 0.6666667
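To see where a value such as 0.6666667 comes from, the sketch below rebuilds job_survey from the printout and creates the 0/1 indicator columns with base R instead of the dummies package. That substitution is an assumption made for illustration, and `indicators` and `dummy_matrix` are illustrative names.

```r
# job_survey as printed above
job_survey <- data.frame(
  job_satisfaction = c("Low", "Low", "Hi", "Low", "Mid"),
  is_happy         = c("No", "No", "Yes", "No", "No"),
  stringsAsFactors = TRUE
)

# One 0/1 indicator column per category level, built with base R
indicators <- lapply(job_survey, function(col) {
  sapply(levels(col), function(lvl) as.integer(col == lvl))
})
dummy_matrix <- do.call(cbind, indicators)

# Binary distance: of the columns where at least one row is "on",
# the share where only one of the two rows is "on"
dist_survey <- dist(dummy_matrix, method = "binary")
round(as.matrix(dist_survey)[5, 1], 4)
```

Rows 1 and 5 both have "No" switched on but differ on "Low" versus "Mid". Three columns are on in at least one of the two rows, and two of those are on in only one row, giving 2/3.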
Below you see a pre-calculated distance matrix between four players on a soccer field. You can clearly see that players 1 & 4 are the closest to one another with a Euclidean distance value of 10.
# 1 2 3
# 2 11.7
# 3 16.8 18.0
# 4 10.0 20.6 15.8
If 1 and 4 are the closest players among the four, which player is closest to players 1 and 4?
# max(D(2,1), D(2,4))
# Extract the pair distances
distance_1_2 <- dist_players[1]
distance_1_3 <- dist_players[2]
distance_2_3 <- dist_players[3]
# Calculate the complete distance between group 1-2 and 3
complete <- max(c(distance_1_3, distance_2_3))
complete
## [1] 18.02776
# Calculate the single distance between group 1-2 and 3
single <- min(c(distance_1_3, distance_2_3))
single
## [1] 16.76305
# Calculate the average distance between group 1-2 and 3
average <- mean(c(distance_1_3, distance_2_3))
average
## [1] 17.39541
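The three calculations above correspond to the following linkage rules for the distance between the group {1, 2} and player 3, where D(i, j) denotes the pairwise Euclidean distance. The notation follows the D(2,1), D(2,4) comment earlier in this section.

```latex
\text{complete}: \quad \max\bigl(D(1,3),\, D(2,3)\bigr) = \max(16.76305,\ 18.02776) = 18.02776

\text{single}: \quad \min\bigl(D(1,3),\, D(2,3)\bigr) = \min(16.76305,\ 18.02776) = 16.76305

\text{average}: \quad \tfrac{1}{2}\bigl(D(1,3) + D(2,3)\bigr) = \tfrac{1}{2}(16.76305 + 18.02776) \approx 17.39541
```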
You are now ready to answer this question!
Below you see a pre-calculated distance matrix between four players on a soccer field. You can clearly see that players 1 & 4 are the closest to one another with a Euclidean distance value of 10. This distance matrix is available for your exploration as the variable dist_players
# 1 2 3
# 2 11.7
# 3 16.8 18.0
# 4 10.0 20.6 15.8
If 1 and 4 are the closest players among the four, which player is closest to players 1 and 4?
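The answer depends on the linkage rule, which the sketch below makes explicit. It rebuilds the distances from the four_players coordinates rather than using the preloaded dist_players object; `d`, `group`, `others`, `complete`, and `single` are illustrative names.

```r
# Positions from the four_players table earlier in these notes
four_players <- data.frame(x = c(5, 15, 0, -5), y = c(4, 10, 20, 5))
d <- as.matrix(dist(four_players))

# Linkage distance from the group {1, 4} to each remaining player
group  <- c(1, 4)
others <- c(2, 3)

complete <- sapply(others, function(p) max(d[p, group]))
single   <- sapply(others, function(p) min(d[p, group]))
names(complete) <- names(single) <- others

round(complete, 1)
##    2    3
## 20.6 16.8
round(single, 1)
##    2    3
## 11.7 15.8

# Player with the smallest linkage distance under each rule
names(which.min(complete))
## [1] "3"
names(which.min(single))
## [1] "2"
```

Under complete linkage player 3 is closest to the group, while under single linkage player 2 is, so the linkage criterion has to be stated before the question has a unique answer.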
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With nearly 20 years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.