This is a recap of what I have learned about cluster analysis. I have essentially reproduced examples found in various sources (acknowledged at the bottom) to learn how to implement the algorithms in R.
GOAL: putting like with like. It is considered Exploratory Data Mining because it helps tease out relationships in large datasets.
DIFFERENCE FROM CLASSIFICATION: Classification identifies and matches observations with an EXISTING label/group (spam vs non-spam / cats vs dogs) – Clustering is “pragmatic grouping” done for a reason that makes sense to me (THESE ARE NOT NATURAL GROUPINGS, THESE ARE GROUPS OF CONVENIENCE). Its success depends on how well it serves my purpose.
RESULTS WILL CHANGE A LOT DEPENDING ON THE VARIABLES YOU USE!
Some of the algorithms REQUIRE CONTINUOUS DATA.
ALGORITHMS:
- Centroid-based (e.g. kmeans): each cluster's centroid is defined by a mean
- Hierarchical: start with every observation separate and combine (agglomerative)
I am using 2 different datasets:
1. a data set `states` with search-term data collected for US states and information about 5 personality traits. Data and hypothesis taken from "Divided We Stand: Three Psychological Regions of the United States and their Political, Economic, Social, and Health Correlates" https://www.apa.org/pubs/journals/releases/psp-a0034434.pdf
2. a (subset of a) bank marketing dataset `bank`
stats::hclust / data = states
Source: LinkedIn Data Mining Course (DM_03_03.R)
STEPS:
1. Create distance matrix (Euclidean)
2. Perform hierarchical cluster analysis on a set of dissimilarities (distance)
3. Plot
(*) Here only numeric variables are considered
# Check
colnames(states)
## [1] "State" "state_code" "data.science"
## [4] "cluster.analysis" "college" "startup"
## [7] "entrepreneur" "ceo" "mortgage"
## [10] "nba" "nfl" "mlb"
## [13] "fifa" "modern.dance" "prius"
## [16] "escalade" "subaru" "jello"
## [19] "bbq" "royal.family" "obfuscation"
## [22] "unicorn" "Extraversion" "Agreeableness"
## [25] "Conscientiousness" "Neuroticism" "Openness"
## [28] "PsychRegions" "region" "division"
# Save numerical data only
st <- states[, 3:27]
# use the State Abbreviation as index of rows (so they don't get analyzed as variables)
row.names(st) <- states[, 2] # set row names for df
colnames(st)
## [1] "data.science" "cluster.analysis" "college"
## [4] "startup" "entrepreneur" "ceo"
## [7] "mortgage" "nba" "nfl"
## [10] "mlb" "fifa" "modern.dance"
## [13] "prius" "escalade" "subaru"
## [16] "jello" "bbq" "royal.family"
## [19] "obfuscation" "unicorn" "Extraversion"
## [22] "Agreeableness" "Conscientiousness" "Neuroticism"
## [25] "Openness"
# CLUSTERING (hierarchical clustering)###############################################
# 1) Create distance matrix (Euclidean)
d <- dist(st)
# 2) Hierarchical clustering
c <- hclust(d, method = "complete") # or one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
c # Info on clustering
##
## Call:
## hclust(d = d, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 48
# 3.a) Plot dendrogram of clusters
plot(c , main = "Cluster with All Searches and Personality")
# Save the dendrogram (with the cluster boxes added below) to a file
png(file = "graphs/dendrogramALLsearches.png")
# plot
plot(c) # , main = "Cluster with All Searches and Personality")
# 4) Cut the tree (resulting from hclust) into several groups
# Put observations in groups
# Need to specify either k = groups or h = height
g3 <- cutree(c, k = 3) # "g3" = "groups 3"
# cutree(c, h = <height>) would cut at a given height instead of into a fixed number of groups
g3
## AL AZ AR CA CO CT DE FL GA ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV
## 1 2 3 2 2 3 3 1 1 2 2 2 1 2 2 2 3 3 3 1 1 1 1 2 1 2
## NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
## 3 3 2 3 1 1 2 2 2 2 3 1 1 1 1 1 2 2 2 3 1 2
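A quick optional check (my addition, not in the original source): how many states fall in each group, and which states are in a given group.
table(g3)              # number of states per cluster
names(g3)[g3 == 1]     # states assigned to cluster 1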
# Or do several levels of groups at once
# "gm" = "groups/multiple"
gm <- cutree(c, k = 2:5) # or k = c(2, 4)
gm
## 2 3 4 5
## AL 1 1 1 1
## AZ 1 2 2 2
## AR 2 3 3 3
## CA 1 2 2 2
## CO 1 2 2 2
## CT 2 3 3 3
## DE 2 3 3 3
## FL 1 1 1 1
## GA 1 1 1 1
## ID 1 2 2 2
## IL 1 2 2 2
## IN 1 2 2 2
## IA 1 1 4 4
## KS 1 2 2 2
## KY 1 2 2 2
## LA 1 2 2 2
## ME 2 3 3 3
## MD 2 3 3 3
## MA 2 3 3 3
## MI 1 1 1 1
## MN 1 1 4 4
## MS 1 1 1 1
## MO 1 1 1 1
## MT 1 2 2 5
## NE 1 1 4 4
## NV 1 2 2 2
## NH 2 3 3 3
## NJ 2 3 3 3
## NM 1 2 2 5
## NY 2 3 3 3
## NC 1 1 1 1
## ND 1 1 4 4
## OH 1 2 2 2
## OK 1 2 2 2
## OR 1 2 2 5
## PA 1 2 2 2
## RI 2 3 3 3
## SC 1 1 1 1
## SD 1 1 4 4
## TN 1 1 1 1
## TX 1 1 1 1
## UT 1 1 1 1
## VT 1 2 2 5
## VA 1 2 2 2
## WA 1 2 2 5
## WV 2 3 3 3
## WI 1 1 4 4
## WY 1 2 2 2
# Draw boxes around clusters
rect.hclust(c, k = 2, border = "gray")
rect.hclust(c, k = 3, border = "blue")
rect.hclust(c, k = 4, border = "green4")
rect.hclust(c, k = 5, border = "darkred")
#Saving the file
dev.off()
## quartz_off_screen
## 2
# 3.b) Plot dendrogram of clusters
dend <- c %>% stats::as.dendrogram()
# Naming the file
png(file = "graphs/dendrogramALLsearches2.png")
plot(dend)
# 4.b) Cut the tree (resulting from hclust) into several groups
plot(dend)
dendextend::rect.dendrogram(dend,
k = 3,
border = 2:4, # colors
#density = 2,
text = c("1", "b", "miao"),
text_cex = 3
)
# or
# Vectorize(rect.dendrogram, "k")(dend, 4:5, border = 6)
#Saving the file
dev.off()
## quartz_off_screen
## 2
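Another dendextend option (my addition, not from the course): color the branches by cluster membership instead of drawing boxes around them.
# color branches by cluster (k = 3) and re-plot
dend_col <- dendextend::color_branches(dend, k = 3)
plot(dend_col, main = "Dendrogram with branches colored by cluster")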
For comparison purposes, we could create a df with the sports search data only (left commented out here):
# sports <- st[, 8:11]
# head(sports)
#
# # For comparison, TRY AND LOOK USING ONLY THE SPORT SEARCHES
# # 1//2/3) Or nest commands in one line (for sports data)
# #Naming the file
# png(file = "graphs/dendrogramSPORTsearches.png")
# plot(hclust(dist(sports)),
# main = "Sports Searches")
# dev.off() #Saving the file
K-means clustering is an unsupervised machine learning algorithm that partitions n observations into k clusters, where k is a predefined (user-chosen) constant. It assigns each observation to the nearest cluster centroid so as to minimize the total distance of observations from the centroid of their cluster (across all dimensions). It is a partitioning algorithm.
CONDITIONS: k = the number of clusters and has to be set beforehand. k should be picked as a number that is in the ballpark of what you are willing to act on (e.g. 5 if I want 5 segments of donors).
APPLICATIONS: works best when the data has a small number of dimensions and is numeric and continuous. Examples: document clustering, identifying crime-prone areas, customer segmentation, insurance fraud detection, public transport data analysis, clustering of IT alerts, etc.
(*) We can scale/standardize the variables before clustering the data --> this gives each variable the same weight when calculating distances.
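A minimal sketch of that idea (my own illustration, not part of the original examples; the same approach is applied to the bank data further down with bank1_scaled):
# z-score each numeric column so no single variable dominates the distances
st_scaled <- scale(st)
# k-means on the standardized data (k = 3 picked arbitrarily for the sketch)
km_demo <- kmeans(st_scaled, centers = 3)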
stats::kmeans / data = bank
bank1 <- bank[, c(1, 6, 12, 13, 14)] # Select numeric variables: age, balance, duration, campaign, pdays
# ====================================================== OPTION k = 3
km1 <- stats::kmeans(bank1, centers=3)
km1
## K-means clustering with 3 clusters of sizes 52, 11, 337
##
## Cluster means:
## age balance duration campaign pdays
## 1 43.30769 5162.5577 241.5577 2.730769 73.36538
## 2 46.18182 15943.0000 141.0909 2.090909 12.36364
## 3 41.43027 556.0979 290.5223 2.816024 36.31454
##
## Clustering vector:
## [1] 3 3 3 2 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 1 1
## [38] 1 1 3 3 1 3 1 3 3 2 3 1 2 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 1 1 3 3 3 1 3 3 3 3 3 3 1 3 1 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3
## [112] 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3
## [149] 3 1 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 3 3 3 1 1 3 3 3 3 1 3 3 1 3 3 1 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 2 3 3 3 2 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 1 3 1 2 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 1 3 2 3 3
## [297] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 1 3 3 3 3 3 3 3
## [371] 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 161739541 240946360 198937914
## (between_SS / total_SS = 84.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Graph based on k-means
cluster::clusplot(bank1, # data frame
km1$cluster, # cluster data
color = TRUE, # color
# shade = TRUE, # Lines in clusters
lines = 3, # Lines connecting centroids
labels = 2) # Labels clusters and cases
# here I get the centers
km1$centers
## age balance duration campaign pdays
## 1 43.30769 5162.5577 241.5577 2.730769 73.36538
## 2 46.18182 15943.0000 141.0909 2.090909 12.36364
## 3 41.43027 556.0979 290.5223 2.816024 36.31454
# here I get the cluster assignment (vector)
km1$cluster
## [1] 3 3 3 2 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 1 1
## [38] 1 1 3 3 1 3 1 3 3 2 3 1 2 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 1 1 3 3 3 1 3 3 3 3 3 3 1 3 1 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3
## [112] 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3
## [149] 3 1 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 3 3 3 1 1 3 3 3 3 1 3 3 1 3 3 1 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 2 3 3 3 2 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 1 3 1 2 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 1 3 2 3 3
## [297] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 1 3 3 3 3 3 3 3
## [371] 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3
# You can evaluate the clusters by looking at $totss and $betweenss.
km1$totss
## [1] 3891981317
km1$betweenss # sum of the squared distance between cluster centers. Ideally you want cluster centers far apart from each other.
## [1] 3290357502
km1$withinss # sum of the square of the distance from each data point to the cluster center. Lower is better. (high withinss would indicate either outliers are in your data or you need to create more clusters.)
## [1] 161739541 240946360 198937914
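A handy single number (my addition) is the ratio already printed in the kmeans output above (between_SS / total_SS = 84.5 %); it can be recomputed directly:
# proportion of total variance captured by the clustering (higher is better)
km1$betweenss / km1$totss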
# ====================================================== OPTION k = 5
# k-means clustering - I pick 5 clusters
km2 <- kmeans(bank1, 5)
km2
## K-means clustering with 5 clusters of sizes 276, 78, 10, 2, 34
##
## Cluster means:
## age balance duration campaign pdays
## 1 40.84058 277.0761 294.8623 2.750000 36.53261
## 2 43.85897 2181.6538 246.8077 2.923077 31.10256
## 3 45.90000 13523.4000 119.1000 2.100000 -1.00000
## 4 54.00000 25057.5000 209.0000 2.000000 72.50000
## 5 43.00000 5860.0000 287.5588 3.000000 104.26471
##
## Clustering vector:
## [1] 1 2 1 3 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 3 2 5 2 2 3 1 1 2 5 5
## [38] 5 5 1 2 2 1 5 1 1 4 2 2 3 1 1 2 1 1 1 5 1 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [75] 1 5 2 1 1 1 5 1 1 1 1 1 2 5 1 5 1 1 2 3 1 1 1 1 2 1 1 5 1 1 1 5 1 1 1 1 1
## [112] 1 5 5 1 1 1 1 1 1 1 2 2 1 1 1 1 5 2 1 1 1 1 1 1 2 1 1 1 1 5 1 1 1 1 5 1 1
## [149] 1 5 1 1 1 1 1 1 1 1 1 1 1 2 1 5 5 1 1 1 1 1 5 2 1 2 1 2 2 1 2 2 1 1 2 1 2
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 4 2 1 1 3 1 1 2 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 5 1 1 1 5 1 2 2 1 1 2 1 2 1 1 2 2 1 5 1 2 1 1 1 1 2
## [260] 1 1 1 1 1 1 1 2 2 1 1 2 1 5 3 2 1 1 1 1 2 1 2 1 2 2 1 1 5 2 1 2 5 1 3 2 1
## [297] 2 1 1 1 1 1 1 1 2 1 1 5 1 1 2 1 1 1 2 1 1 1 1 1 2 1 2 5 1 1 5 1 2 1 1 2 1
## [334] 1 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 3 1 3 1 1 2 2 1 1 1 1 2 1 2
## [371] 2 1 2 1 1 1 1 1 1 5 1 5 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 62512468 56610278 47941653 3901147 71994363
## (between_SS / total_SS = 93.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Graph based on k-means
clusplot(bank1, # data frame
km2$cluster, # cluster data
color = TRUE, # color
# shade = TRUE, # Lines in clusters
lines = 3, # Lines connecting centroids
labels = 2) # Labels clusters and cases
# here I get the centers
km2$centers
## age balance duration campaign pdays
## 1 40.84058 277.0761 294.8623 2.750000 36.53261
## 2 43.85897 2181.6538 246.8077 2.923077 31.10256
## 3 45.90000 13523.4000 119.1000 2.100000 -1.00000
## 4 54.00000 25057.5000 209.0000 2.000000 72.50000
## 5 43.00000 5860.0000 287.5588 3.000000 104.26471
# You can evaluate the clusters by looking at $totss and $betweenss.
km2$totss
## [1] 3891981317
km2$withinss # sum of the square of the distance from each data point to the cluster center. Lower is better. (high withinss would indicate either outliers are in your data or you need to create more clusters.)
## [1] 62512468 56610278 47941653 3901147 71994363
km2$betweenss # sum of the squared distance between cluster centers. Ideally you want cluster centers far apart from each other.
## [1] 3649021408
# ====================================================== Finding optimal k --> 5
# ----- Find ideal k
rng <- 2:20   # range of k = from 2 to 20
tries <- 100  # run the k-means algorithm 100 times for each k
avg.totw.ss <- integer(length(rng))  # empty vector to hold the average tot.withinss for each k
for(v in rng){                       # for each value of k in the range
  v.totw.ss <- integer(tries)        # empty vector to hold the 100 tries
  for(i in 1:tries){
    k.temp <- kmeans(bank1, centers = v)  # run kmeans
    v.totw.ss[i] <- k.temp$tot.withinss   # store the total withinss
  }
  avg.totw.ss[v - 1] <- mean(v.totw.ss)   # average of the 100 total withinss
}
plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K",
ylab="Average Total Within Sum of Squares",
xlab="Value of K")
# Somewhere around K = 5 we start losing dramatic gains. So I’m satisfied with 5 clusters.
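One caveat (my note, not in the source): kmeans() starts from random centroids, so single runs can differ between executions. Setting a seed and using the nstart argument makes the result more stable, e.g.:
set.seed(123)                                          # for reproducibility
km_stable <- kmeans(bank1, centers = 5, nstart = 25)   # keep the best of 25 random starts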
(*) Looks like I might need to rescale to get meaningful clusters
# ========== standardize variables
bank1_scaled <- scale(bank1)
# ========== Re-cluster on bank1_scaled df (k = 5)
# k-means clustering - I pick 5 clusters
km3 <- kmeans(bank1_scaled, 5)
km3
## K-means clustering with 5 clusters of sizes 174, 16, 118, 44, 48
##
## Cluster means:
## age balance duration campaign pdays
## 1 -0.6390405 -0.22663116 -0.34795581 -0.14172258 -0.3621368
## 2 -0.4247785 0.19609827 -0.57516352 3.74315049 -0.2890690
## 3 1.0914910 0.42778460 -0.24918680 -0.09660068 -0.2474364
## 4 -0.1328877 -0.03102809 -0.08670816 -0.26629039 2.4360097
## 5 -0.1033204 -0.26702286 2.14512770 -0.25239629 -0.2156254
##
## Clustering vector:
## [1] 5 1 3 3 1 3 1 1 3 1 1 3 1 1 5 1 4 3 5 4 1 5 1 1 1 3 3 3 1 1 3 3 1 1 5 3 3
## [38] 1 1 4 3 1 1 4 5 3 3 3 3 3 5 1 5 5 3 1 1 3 5 3 1 4 2 3 3 1 3 1 3 1 3 5 1 2
## [75] 1 2 3 1 1 1 3 1 3 3 2 1 3 4 4 4 1 1 1 3 1 1 4 1 3 4 3 5 3 1 1 3 5 1 3 4 3
## [112] 4 4 3 1 1 3 4 5 1 1 3 1 5 3 4 1 3 1 1 1 5 1 1 1 4 1 2 3 1 2 5 1 1 1 1 3 1
## [149] 1 4 5 1 3 1 3 1 3 1 3 1 5 3 1 3 1 5 1 3 1 1 5 3 3 5 3 1 1 3 1 3 2 1 1 1 4
## [186] 3 1 1 3 1 1 4 4 1 1 3 1 1 1 4 1 4 4 1 1 1 3 3 3 5 3 4 3 1 1 1 3 5 1 1 1 4
## [223] 1 4 1 1 3 5 1 1 3 1 1 2 1 1 2 3 1 4 3 3 3 4 3 5 1 1 3 2 5 2 1 1 3 4 5 3 2
## [260] 1 1 1 5 5 5 1 2 1 1 3 3 1 3 3 3 3 3 3 3 3 1 1 3 1 1 1 1 1 5 3 3 3 2 3 1 5
## [297] 4 1 1 4 1 3 1 4 1 4 5 3 5 4 3 1 1 3 3 1 3 5 1 3 2 5 3 3 4 5 4 3 1 1 1 3 3
## [334] 4 3 5 3 1 1 5 1 3 1 5 1 4 1 4 1 1 5 4 1 5 1 1 3 5 1 1 1 1 3 1 5 1 1 1 5 1
## [371] 3 3 1 4 5 5 1 2 1 4 3 3 3 1 1 4 1 1 1 3 1 3 1 3 1 3 1 1 4 1
##
## Within cluster sum of squares by cluster:
## [1] 186.10147 58.42526 469.99692 121.52461 123.96767
## (between_SS / total_SS = 51.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# ========== Plot clusters based on k-means (scaled data) -- clusplot projects onto the first 2 principal components
clusplot(bank1_scaled, # data frame
  km3$cluster, # cluster assignments from the k-means run on the scaled data (km3, not km2)
  color = TRUE, # color
  shade = TRUE, # Lines in clusters
  lines = 3, # Lines connecting centroids
  labels = 2
  ) # Labels clusters and cases
# =========== Centroid Plot against 1st 2 discriminant functions
library(fpc)
fpc::plotcluster(x = bank1_scaled, km3$cluster)
function / data =
Source: http://blog.ephorie.de/learning-data-science-understanding-and-using-k-means-clustering
http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/
Because there are too many possible combinations of all possible clusters comprising all possible data points, k-means follows an iterative approach known as the expectation-maximization (EM) algorithm:
1. Initialization: assign data points to clusters at random
2. E-step (expectation): assign each observation to the "nearest" cluster (based on Euclidean distance to its centroid)
3. M-step (maximization): determine new centroids based on the mean of the assigned objects
4. Repeat steps 2 and 3 until no further changes occur
n <- 3 # no. of centroids
set.seed(1415) # set seed for reproducibility
M1 <- matrix(round(runif(100, 1, 5), 1), ncol = 2)
M2 <- matrix(round(runif(100, 7, 12), 1), ncol = 2)
M3 <- matrix(round(runif(100, 20, 25), 1), ncol = 2)
M <- rbind(M1, M2, M3)
C <- M[1:n, ] # define centroids as first n objects
obs <- length(M) / 2
A <- sample(1:n, obs, replace = TRUE) # assign objects to centroids at random
colors <- seq(10, 200, 25)
# helper function for plotting the steps
clusterplot <- function(M, C, txt) {
plot(M, main = txt, xlab = "", ylab = "")
for(i in 1:n) {
points(C[i, , drop = FALSE], pch = 23, lwd = 3, col = colors[i])
points(M[A == i, , drop = FALSE], col = colors[i])
}
}
clusterplot(M, C, "Initialization")
#------------k-means algorithm iterative
# diamonds are the Centroids
repeat {
# calculate Euclidean distance between objects and centroids
D <- matrix(data = NA, nrow = n, ncol = obs)
for(i in 1:n) {
for(j in 1:obs) {
D[i, j] <- sqrt((M[j, 1] - C[i, 1])^2 + (M[j, 2] - C[i, 2])^2)
}
}
O <- A
## E-step: parameters are fixed, distributions are optimized
A <- max.col(t(-D)) # assign objects to centroids
if(all(O == A)) break # if no change stop
clusterplot(M, C, "E-step")
## M-step: distributions are fixed, parameters are optimized
# determine new centroids based on mean of assigned objects
for(i in 1:n) {
C[i, ] <- apply(M[A == i, , drop = FALSE], 2, mean)
}
clusterplot(M, C, "M-step")
}
#Check results
(custom <- C[order(C[ , 1]), ])
## [,1] [,2]
## [1,] 3.008 2.740
## [2,] 9.518 9.326
## [3,] 22.754 22.396
#------------k-means algorithm BASE R
cl <- kmeans(M, n)
clusterplot(M, cl$centers, "Base R")
#Check results
(base <- cl$centers[order(cl$centers[ , 1]), ])
## [,1] [,2]
## 2 3.008 2.740
## 1 9.518 9.326
## 3 22.754 22.396
# same!
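As a quick programmatic double-check (my addition), the two sets of ordered centroids can be compared directly; the dimnames differ, so strip them first:
# compare hand-rolled centroids with base R's (should be TRUE given the identical printouts above)
all.equal(unname(custom), unname(base))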
function / data = bank
Source: https://towardsdatascience.com/clustering-on-mixed-type-data-8bbd0a2569c3 (copied from(?) http://dpmartin42.github.io/posts/r/cluster-mixed-types)
In the context of unsupervised classification, we may need to analyze
datasets made of mixed-type data, where numeric, nominal
(categorical, not ordered) or ordinal (categorical, ordered) features
coexist.
Gower distance ~ the average of partial dissimilarities, each scaled to the range [0, 1].
k-medoids is a classical partitioning technique that clusters a data set of n objects into k clusters known a priori.
The silhouette coefficient is used to determine the optimal number of clusters (when we don't know it a priori).
We first need to define some notion of (dis)similarity between observations. A popular choice for clustering is Euclidean distance, but it is only valid for continuous variables, so we have to use a distance metric that can handle mixed data types --> Gower distance.
* pros: Intuitive to understand and straightforward to calculate
* cons: Sensitive to non-normality and outliers in continuous variables --> transformations as a pre-processing step might be necessary (a sketch of this follows the daisy() call below).
Most similar and dissimilar clients according to Gower distance:
# data manipulation: convert character columns to factors
Factors <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
bank[ ,Factors] <- lapply( bank[ ,Factors], as.factor)
glimpse(bank)
## Rows: 400
## Columns: 17
## $ age <dbl> 46, 39, 48, 36, 38, 56, 37, 31, 59, 34, 36, 63, 43, 38, 36, …
## $ job <fct> technician, blue-collar, technician, self-employed, manageme…
## $ marital <fct> married, married, married, married, married, married, marrie…
## $ education <fct> secondary, primary, secondary, tertiary, tertiary, secondary…
## $ default <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
## $ balance <dbl> 57, 1564, 454, 16430, 1198, 1504, 137, 469, 3573, 415, 1852,…
## $ housing <fct> no, no, yes, yes, yes, yes, no, no, no, no, no, no, yes, yes…
## $ loan <fct> no, no, no, no, no, no, no, yes, no, yes, no, no, no, no, no…
## $ contact <fct> unknown, unknown, cellular, unknown, cellular, unknown, cell…
## $ day <dbl> 28, 17, 10, 6, 2, 29, 13, 2, 15, 23, 8, 25, 9, 7, 14, 6, 20,…
## $ month <fct> may, jun, jul, jun, feb, may, aug, feb, may, jul, may, jan, …
## $ duration <dbl> 796, 57, 114, 197, 63, 66, 244, 111, 44, 361, 362, 423, 100,…
## $ campaign <dbl> 1, 1, 1, 3, 1, 2, 6, 1, 1, 2, 1, 1, 1, 2, 2, 3, 1, 10, 4, 1,…
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, no…
#' Compute Gower distance with the daisy {cluster} function
gower_dist <- daisy(bank, metric = "gower")
gower_mat <- as.matrix(gower_dist)
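As mentioned above under the cons, skewed positive variables can be transformed first. A hedged sketch (my addition, not in the source): daisy() can log-transform selected ratio-scale variables via its type argument, e.g. for the strictly positive duration column:
# Gower distance with "duration" log-transformed before computing dissimilarities
gower_dist_log <- cluster::daisy(bank,
                                 metric = "gower",
                                 type = list(logratio = "duration"))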
# Check...
#' Print most similar clients
bank[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 × 17
## age job marital educa…¹ default balance housing loan contact day month
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct>
## 1 54 techn… married second… no 89 yes no cellul… 10 jul
## 2 48 techn… married second… no 454 yes no cellul… 10 jul
## # … with 6 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, y <fct>, and abbreviated variable name
## # ¹education
#' Print most dissimilar clients
bank[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 × 17
## age job marital educa…¹ default balance housing loan contact day month
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct>
## 1 32 servi… single unknown no 6145 yes no cellul… 14 may
## 2 55 retir… divorc… second… no 1580 no yes unknown 19 jun
## # … with 6 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, y <fct>, and abbreviated variable name
## # ¹education
We use the Partitioning Around Medoids (PAM) clustering algorithm, an iterative clustering procedure with the following steps:
1. Choose k random entities to become the medoids
2. Assign every entity to its closest medoid (using the custom distance matrix)
3. For each cluster, check whether any of its members would lower the average within-cluster distance if it became the medoid; if so, make it the new medoid
4. If at least one medoid has changed, return to step 2; otherwise, stop
But first, we look for the right number of clusters with the silhouette width [-1, 1] (where higher is better).
sil_width <- c(NA)
# calculating silhouette width for clusters ranging from 2 to 8 for the PAM algorithm
for(i in 2:8){
pam_fit <- pam(gower_dist, diss = TRUE, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
# Plot sihouette width (higher is better)
plot(1:8, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:8, sil_width)
The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered; objects with a low value may be outliers. This index works well with k-medoids clustering and is also used to determine the optimal number of clusters. See the Wikipedia page for further details on computation and interpretation.
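For reference, the standard definition (my addition, not in the source notes): for each observation i, let a(i) be its average distance to the other members of its own cluster and b(i) its average distance to the members of the nearest other cluster; then
s(i) = (b(i) - a(i)) / max(a(i), b(i))
so s(i) ranges from -1 (likely misassigned) to +1 (well clustered), and the plot above shows the average s(i) over all observations for each candidate k.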
Here --> pick k = 3.
Cluster Interpretation
#Summary of each cluster
k <- 3
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- bank %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## age job marital education default
## Min. :22.00 blue-collar:62 divorced: 20 primary : 27 no :152
## 1st Qu.:33.75 services :22 married :100 secondary:101 yes: 0
## Median :39.50 admin. :19 single : 32 tertiary : 20
## Mean :41.30 technician :15 unknown : 4
## 3rd Qu.:50.00 management : 8
## Max. :60.00 retired : 8
## (Other) :18
## balance housing loan contact day
## Min. :-1206 no : 19 no :130 cellular : 41 Min. : 2.00
## 1st Qu.: 176 yes:133 yes: 22 telephone: 11 1st Qu.: 9.00
## Median : 624 unknown :100 Median :15.00
## Mean : 1604 Mean :15.82
## 3rd Qu.: 1585 3rd Qu.:23.00
## Max. :16992 Max. :30.00
##
## month duration campaign pdays
## may :100 Min. : 9.0 Min. : 1.000 Min. : -1.00
## jun : 25 1st Qu.: 112.2 1st Qu.: 1.000 1st Qu.: -1.00
## apr : 7 Median : 200.0 Median : 2.000 Median : -1.00
## jul : 4 Mean : 297.3 Mean : 2.579 Mean : 37.05
## nov : 4 3rd Qu.: 367.0 3rd Qu.: 3.000 3rd Qu.: -1.00
## aug : 3 Max. :1529.0 Max. :17.000 Max. :397.00
## (Other): 9
## previous poutcome y cluster
## Min. : 0.0000 failure: 13 no :142 Min. :1
## 1st Qu.: 0.0000 other : 3 yes: 10 1st Qu.:1
## Median : 0.0000 success: 3 Median :1
## Mean : 0.3816 unknown:133 Mean :1
## 3rd Qu.: 0.0000 3rd Qu.:1
## Max. :13.0000 Max. :1
##
##
## [[2]]
## age job marital education default
## Min. :19.00 technician :44 divorced:14 primary : 8 no :121
## 1st Qu.:33.00 services :20 married :67 secondary:103 yes: 2
## Median :40.00 admin. :18 single :42 tertiary : 9
## Mean :42.17 retired : 9 unknown : 3
## 3rd Qu.:50.00 blue-collar: 8
## Max. :86.00 housemaid : 5
## (Other) :19
## balance housing loan contact day
## Min. : -411.0 no :73 no :97 cellular :94 Min. : 1.00
## 1st Qu.: 0.0 yes:50 yes:26 telephone:18 1st Qu.: 9.50
## Median : 255.0 unknown :11 Median :17.00
## Mean : 1162.4 Mean :16.98
## 3rd Qu.: 867.5 3rd Qu.:24.00
## Max. :26452.0 Max. :31.00
##
## month duration campaign pdays
## jul :52 Min. : 12.0 Min. : 1.000 Min. : -1.00
## jun :16 1st Qu.: 99.0 1st Qu.: 1.000 1st Qu.: -1.00
## aug :14 Median :177.0 Median : 2.000 Median : -1.00
## jan : 9 Mean :244.9 Mean : 3.114 Mean : 32.01
## apr : 8 3rd Qu.:330.5 3rd Qu.: 4.000 3rd Qu.: -1.00
## feb : 8 Max. :978.0 Max. :15.000 Max. :462.00
## (Other):16
## previous poutcome y cluster
## Min. : 0.0000 failure: 7 no :106 Min. :2
## 1st Qu.: 0.0000 other : 8 yes: 17 1st Qu.:2
## Median : 0.0000 success: 3 Median :2
## Mean : 0.5447 unknown:105 Mean :2
## 3rd Qu.: 0.0000 3rd Qu.:2
## Max. :10.0000 Max. :2
##
##
## [[3]]
## age job marital education default
## Min. :25.00 management :69 divorced:14 primary : 13 no :124
## 1st Qu.:34.00 technician : 8 married :81 secondary: 6 yes: 1
## Median :39.00 blue-collar : 7 single :30 tertiary :100
## Mean :42.06 entrepreneur: 7 unknown : 6
## 3rd Qu.:48.00 housemaid : 7
## Max. :83.00 retired : 7
## (Other) :20
## balance housing loan contact day
## Min. :-1212 no :78 no :114 cellular :108 Min. : 2.00
## 1st Qu.: 96 yes:47 yes: 11 telephone: 8 1st Qu.: 9.00
## Median : 699 unknown : 9 Median :17.00
## Mean : 1956 Mean :16.21
## 3rd Qu.: 2144 3rd Qu.:22.00
## Max. :23663 Max. :30.00
##
## month duration campaign pdays
## aug :40 Min. : 8.0 Min. : 1.000 Min. : -1.00
## nov :21 1st Qu.: 115.0 1st Qu.: 1.000 1st Qu.: -1.00
## may :11 Median : 187.0 Median : 2.000 Median : -1.00
## jul :10 Mean : 293.7 Mean : 2.712 Mean : 52.96
## jun :10 3rd Qu.: 361.0 3rd Qu.: 3.000 3rd Qu.: 85.00
## apr : 9 Max. :1532.0 Max. :24.000 Max. :761.00
## (Other):24
## previous poutcome y cluster
## Min. : 0.000 failure:18 no :100 Min. :3
## 1st Qu.: 0.000 other : 7 yes: 25 1st Qu.:3
## Median : 0.000 success: 7 Median :3
## Mean : 0.752 unknown:93 Mean :3
## 3rd Qu.: 1.000 3rd Qu.:3
## Max. :18.000 Max. :3
##
Here one can attempt to derive some common patterns for clients within a cluster. For example, judging from the summaries above, cluster 1 is dominated by blue-collar clients with secondary education, no default and a housing loan, while cluster 3 is dominated by management clients with tertiary education, no default and mostly no housing loan.
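A compact way to eyeball such patterns (my own sketch, assuming dplyr is loaded) is to count the most frequent jobs per cluster:
# top 3 jobs within each PAM cluster
bank %>%
  mutate(cluster = pam_fit$clustering) %>%
  count(cluster, job) %>%
  group_by(cluster) %>%
  slice_max(order_by = n, n = 3)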
# BTW, in the PAM algorithm the medoids serve as exemplars of each cluster
bank[pam_fit$medoids,]
## # A tibble: 3 × 17
## age job marital educa…¹ default balance housing loan contact day month
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct>
## 1 43 blue-… married second… no 3060 yes no unknown 15 may
## 2 35 techn… married second… no 0 no no cellul… 14 jul
## 3 40 manag… married tertia… no 51 no no cellul… 19 aug
## # … with 6 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, y <fct>, and abbreviated variable name
## # ¹education
Visualization in a lower-dimensional space with t-distributed stochastic neighbor embedding, or t-SNE.
# t-SNE = dimension reduction technique that tries to preserve local structure so as to make clusters visible in a 2D or 3D visualization. While it typically uses Euclidean distance, it can handle a custom distance metric.
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering))
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster))
Alternatives (for clustering mixed data): https://www.researchgate.net/publication/323422280_kamila_Clustering_Mixed-Type_Data_in_R_and_Hadoop