This is a recap of what I have learned about cluster analysis. I have essentially reproduced examples found in various sources (acknowledged at the bottom) to learn how to implement the algorithms in R.
GOAL: putting like with like. It is considered Exploratory Data Mining because it helps tease out relationships in large datasets.
DIFFERENCE FROM CLASSIFICATION: Classification identifies and matches observations with an EXISTING label/group (spam vs non-spam / cats vs dogs) – Clustering is “pragmatic grouping” done for a reason that makes sense to me (THESE ARE NOT NATURAL GROUPINGS, THESE ARE GROUPS OF CONVENIENCE). Its success depends on how well it serves my purpose.
RESULTS WILL CHANGE A LOT DEPENDING ON THE VARIABLES YOU USE!
Some of the algorithms REQUIRE CONTINUOUS DATA.
ALGORITHMS:
- Centroid-based (e.g. kmeans): each cluster's centroid is defined by a mean
- Hierarchical: start with every observation separate and combine (agglomerative)
I am using 2 different datasets:
1. a data set `states` with search-term data collected for US states and information about 5 personality traits. Data and hypothesis taken from "Divided We Stand: Three Psychological Regions of the United States and their Political, Economic, Social, and Health Correlates" https://www.apa.org/pubs/journals/releases/psp-a0034434.pdf
2. a (subset of a) bank marketing dataset `bank`
stats::hclust / data = states
Source: LinkedIn Data Mining Course (DM_03_03.R)
STEPS:
1. Create distance matrix (Euclidean)
2. Perform hierarchical cluster analysis on a set of dissimilarities (distance)
3. Plot
(*) Here only numeric variables are considered
# Check
colnames(states)
## [1] "State" "state_code" "data.science"
## [4] "cluster.analysis" "college" "startup"
## [7] "entrepreneur" "ceo" "mortgage"
## [10] "nba" "nfl" "mlb"
## [13] "fifa" "modern.dance" "prius"
## [16] "escalade" "subaru" "jello"
## [19] "bbq" "royal.family" "obfuscation"
## [22] "unicorn" "Extraversion" "Agreeableness"
## [25] "Conscientiousness" "Neuroticism" "Openness"
## [28] "PsychRegions" "region" "division"
# Save numerical data only
st <- states[, 3:27]
# use the State Abbreviation as index of rows (so they don't get analyzed as variables)
row.names(st) <- states[, 2] # set row names for df
colnames(st)
## [1] "data.science" "cluster.analysis" "college"
## [4] "startup" "entrepreneur" "ceo"
## [7] "mortgage" "nba" "nfl"
## [10] "mlb" "fifa" "modern.dance"
## [13] "prius" "escalade" "subaru"
## [16] "jello" "bbq" "royal.family"
## [19] "obfuscation" "unicorn" "Extraversion"
## [22] "Agreeableness" "Conscientiousness" "Neuroticism"
## [25] "Openness"
# CLUSTERING (hierarchical clustering)###############################################
# 1) Create distance matrix (Euclidean)
d <- dist(st)
# 2) Hierarchical clustering
c <- hclust(d, method = "complete") # or one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
c # Info on clustering
##
## Call:
## hclust(d = d, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 48
# 3.a) Plot dendrogram of clusters
plot(c , main = "Cluster with All Searches and Personality")
# Save the dendrogram (with the cluster boxes added below) to a file
png(file = "graphs/dendrogramALLsearches.png")
# plot
plot(c) # , main = "Cluster with All Searches and Personality")
# 4) Cut the tree (resulting from hclust) into several groups
# Put observations in groups
# Need to specify either k = groups or h = height
g3 <- cutree(c, k = 3) # "g3" = "groups 3"
# cutree(c, h = <height>) would cut at a given height instead of into a fixed number of groups
g3
## AL AZ AR CA CO CT DE FL GA ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV
## 1 2 3 2 2 3 3 1 1 2 2 2 1 2 2 2 3 3 3 1 1 1 1 2 1 2
## NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
## 3 3 2 3 1 1 2 2 2 2 3 1 1 1 1 1 2 2 2 3 1 2
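A quick optional check (my addition, not in the original source): how many states fall in each group, and which states are in a given group.
table(g3)              # number of states per cluster
names(g3)[g3 == 1]     # states assigned to cluster 1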
# Or do several levels of groups at once
# "gm" = "groups/multiple"
gm <- cutree(c, k = 2:5) # or k = c(2, 4)
gm
## 2 3 4 5
## AL 1 1 1 1
## AZ 1 2 2 2
## AR 2 3 3 3
## CA 1 2 2 2
## CO 1 2 2 2
## CT 2 3 3 3
## DE 2 3 3 3
## FL 1 1 1 1
## GA 1 1 1 1
## ID 1 2 2 2
## IL 1 2 2 2
## IN 1 2 2 2
## IA 1 1 4 4
## KS 1 2 2 2
## KY 1 2 2 2
## LA 1 2 2 2
## ME 2 3 3 3
## MD 2 3 3 3
## MA 2 3 3 3
## MI 1 1 1 1
## MN 1 1 4 4
## MS 1 1 1 1
## MO 1 1 1 1
## MT 1 2 2 5
## NE 1 1 4 4
## NV 1 2 2 2
## NH 2 3 3 3
## NJ 2 3 3 3
## NM 1 2 2 5
## NY 2 3 3 3
## NC 1 1 1 1
## ND 1 1 4 4
## OH 1 2 2 2
## OK 1 2 2 2
## OR 1 2 2 5
## PA 1 2 2 2
## RI 2 3 3 3
## SC 1 1 1 1
## SD 1 1 4 4
## TN 1 1 1 1
## TX 1 1 1 1
## UT 1 1 1 1
## VT 1 2 2 5
## VA 1 2 2 2
## WA 1 2 2 5
## WV 2 3 3 3
## WI 1 1 4 4
## WY 1 2 2 2
# Draw boxes around clusters
rect.hclust(c, k = 2, border = "gray")
rect.hclust(c, k = 3, border = "blue")
rect.hclust(c, k = 4, border = "green4")
rect.hclust(c, k = 5, border = "darkred")
#Saving the file
dev.off()
## quartz_off_screen
## 2
# 3.b) Plot dendrogram of clusters
dend <- c %>% stats::as.dendrogram()
# Naming the file
png(file = "graphs/dendrogramALLsearches2.png")
plot(dend)
# 4.b) Cut the tree (resulting from hclust) into several groups
plot(dend)
dendextend::rect.dendrogram(dend,
k = 3,
border = 2:4, # colors
#density = 2,
text = c("1", "b", "miao"),
text_cex = 3
)
# or
# Vectorize(rect.dendrogram, "k")(dend, 4:5, border = 6)
#Saving the file
dev.off()
## quartz_off_screen
## 2
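Another dendextend option (my addition, not from the course): color the branches by cluster membership instead of drawing boxes around them.
# color branches by cluster (k = 3) and re-plot
dend_col <- dendextend::color_branches(dend, k = 3)
plot(dend_col, main = "Dendrogram with branches colored by cluster")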
For comparison purposes, we could create a df with the sports search data only (left commented out here):
# sports <- st[, 8:11]
# head(sports)
#
# # For comparison, TRY AND LOOK USING ONLY THE SPORT SEARCHES
# # 1//2/3) Or nest commands in one line (for sports data)
# #Naming the file
# png(file = "graphs/dendrogramSPORTsearches.png")
# plot(hclust(dist(sports)),
# main = "Sports Searches")
# dev.off() #Saving the file
K-means clustering is an unsupervised machine learning algorithm that partitions n observations into k clusters, where k is a predefined (user-chosen) constant. It assigns each observation to the nearest cluster centroid so as to minimize the total distance of observations from the centroid of their cluster (across all dimensions). It is a partitioning algorithm.
CONDITIONS: k = the number of clusters and has to be set beforehand. k should be picked as a number that is in the ballpark of what you are willing to act on (e.g. 5 if I want 5 segments of donors).
APPLICATIONS: works best when the data has a small number of dimensions and is numeric and continuous. Examples: document clustering, identifying crime-prone areas, customer segmentation, insurance fraud detection, public transport data analysis, clustering of IT alerts, etc.
(*) We can scale/standardize the variables before clustering the data --> this gives each variable the same weight when calculating distances.
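A minimal sketch of that idea (my own illustration, not part of the original examples; the same approach is applied to the bank data further down with bank1_scaled):
# z-score each numeric column so no single variable dominates the distances
st_scaled <- scale(st)
# k-means on the standardized data (k = 3 picked arbitrarily for the sketch)
km_demo <- kmeans(st_scaled, centers = 3)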
stats::kmeans / data = bank
bank1 <- bank[, c(1, 6, 12, 13, 14)] # Select numeric variables: age, balance, duration, campaign, pdays
# ====================================================== OPTION k = 3
km1 <- stats::kmeans(bank1, centers=3)
km1
## K-means clustering with 3 clusters of sizes 52, 11, 337
##
## Cluster means:
## age balance duration campaign pdays
## 1 43.30769 5162.5577 241.5577 2.730769 73.36538
## 2 46.18182 15943.0000 141.0909 2.090909 12.36364
## 3 41.43027 556.0979 290.5223 2.816024 36.31454
##
## Clustering vector:
## [1] 3 3 3 2 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 1 1
## [38] 1 1 3 3 1 3 1 3 3 2 3 1 2 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 1 1 3 3 3 1 3 3 3 3 3 3 1 3 1 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3
## [112] 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3
## [149] 3 1 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 3 3 3 1 1 3 3 3 3 1 3 3 1 3 3 1 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 2 3 3 3 2 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 1 3 1 2 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 1 3 2 3 3
## [297] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 1 3 3 3 3 3 3 3
## [371] 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 161739541 240946360 198937914
## (between_SS / total_SS = 84.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Graph based on k-means
cluster::clusplot(bank1, # data frame
km1$cluster, # cluster data
color = TRUE, # color
# shade = TRUE, # Lines in clusters
lines = 3, # Lines connecting centroids
labels = 2) # Labels clusters and cases
# here I get the centers
km1$centers
## age balance duration campaign pdays
## 1 43.30769 5162.5577 241.5577 2.730769 73.36538
## 2 46.18182 15943.0000 141.0909 2.090909 12.36364
## 3 41.43027 556.0979 290.5223 2.816024 36.31454
# here I get the cluster assignment (vector)
km1$cluster
## [1] 3 3 3 2 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 1 1
## [38] 1 1 3 3 1 3 1 3 3 2 3 1 2 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 1 1 3 3 3 1 3 3 3 3 3 3 1 3 1 3 3 3 1 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3
## [112] 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3
## [149] 3 1 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 3 3 3 1 1 3 3 3 3 1 3 3 1 3 3 1 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 2 3 3 3 2 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 1 3 1 2 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 1 3 2 3 3
## [297] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 3 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3 3 3 1 3 3 3 3 3 3 3
## [371] 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3
# You can evaluate the clusters by looking at $totss and $betweenss.
km1$totss
## [1] 3891981317
km1$betweenss # sum of the squared distance between cluster centers. Ideally you want cluster centers far apart from each other.
## [1] 3290357502
km1$withinss # sum of the square of the distance from each data point to the cluster center. Lower is better. (high withinss would indicate either outliers are in your data or you need to create more clusters.)
## [1] 161739541 240946360 198937914
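A handy single number (my addition) is the ratio already printed in the kmeans output above (between_SS / total_SS = 84.5 %); it can be recomputed directly:
# proportion of total variance captured by the clustering (higher is better)
km1$betweenss / km1$totss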
# ====================================================== OPTION k = 5
# k-means clustering - I pick 5 clusters
km2 <- kmeans(bank1, 5)
km2
## K-means clustering with 5 clusters of sizes 276, 78, 10, 2, 34
##
## Cluster means:
## age balance duration campaign pdays
## 1 40.84058 277.0761 294.8623 2.750000 36.53261
## 2 43.85897 2181.6538 246.8077 2.923077 31.10256
## 3 45.90000 13523.4000 119.1000 2.100000 -1.00000
## 4 54.00000 25057.5000 209.0000 2.000000 72.50000
## 5 43.00000 5860.0000 287.5588 3.000000 104.26471
##
## Clustering vector:
## [1] 1 2 1 3 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 3 2 5 2 2 3 1 1 2 5 5
## [38] 5 5 1 2 2 1 5 1 1 4 2 2 3 1 1 2 1 1 1 5 1 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## [75] 1 5 2 1 1 1 5 1 1 1 1 1 2 5 1 5 1 1 2 3 1 1 1 1 2 1 1 5 1 1 1 5 1 1 1 1 1
## [112] 1 5 5 1 1 1 1 1 1 1 2 2 1 1 1 1 5 2 1 1 1 1 1 1 2 1 1 1 1 5 1 1 1 1 5 1 1
## [149] 1 5 1 1 1 1 1 1 1 1 1 1 1 2 1 5 5 1 1 1 1 1 5 2 1 2 1 2 2 1 2 2 1 1 2 1 2
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 4 2 1 1 3 1 1 2 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 5 1 1 1 5 1 2 2 1 1 2 1 2 1 1 2 2 1 5 1 2 1 1 1 1 2
## [260] 1 1 1 1 1 1 1 2 2 1 1 2 1 5 3 2 1 1 1 1 2 1 2 1 2 2 1 1 5 2 1 2 5 1 3 2 1
## [297] 2 1 1 1 1 1 1 1 2 1 1 5 1 1 2 1 1 1 2 1 1 1 1 1 2 1 2 5 1 1 5 1 2 1 1 2 1
## [334] 1 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 3 1 3 1 1 2 2 1 1 1 1 2 1 2
## [371] 2 1 2 1 1 1 1 1 1 5 1 5 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 62512468 56610278 47941653 3901147 71994363
## (between_SS / total_SS = 93.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Graph based on k-means
clusplot(bank1, # data frame
km2$cluster, # cluster data
color = TRUE, # color
# shade = TRUE, # Lines in clusters
lines = 3, # Lines connecting centroids
labels = 2) # Labels clusters and cases
# here I get the centers
km2$centers
## age balance duration campaign pdays
## 1 40.84058 277.0761 294.8623 2.750000 36.53261
## 2 43.85897 2181.6538 246.8077 2.923077 31.10256
## 3 45.90000 13523.4000 119.1000 2.100000 -1.00000
## 4 54.00000 25057.5000 209.0000 2.000000 72.50000
## 5 43.00000 5860.0000 287.5588 3.000000 104.26471
# You can evaluate the clusters by looking at $totss and $betweenss.
km2$totss
## [1] 3891981317
km2$withinss # sum of the square of the distance from each data point to the cluster center. Lower is better. (high withinss would indicate either outliers are in your data or you need to create more clusters.)
## [1] 62512468 56610278 47941653 3901147 71994363
km2$betweenss # sum of the squared distance between cluster centers. Ideally you want cluster centers far apart from each other.
## [1] 3649021408
# ====================================================== Finding optimal k --> 5
# ----- Find ideal k
rng <- 2:20   # range of k = from 2 to 20
tries <- 100  # run the k-means algorithm 100 times for each k
avg.totw.ss <- integer(length(rng))  # empty vector to hold the average tot.withinss for each k
for(v in rng){                       # for each value of k in the range
  v.totw.ss <- integer(tries)        # empty vector to hold the 100 tries
  for(i in 1:tries){
    k.temp <- kmeans(bank1, centers = v)  # run kmeans
    v.totw.ss[i] <- k.temp$tot.withinss   # store the total withinss
  }
  avg.totw.ss[v - 1] <- mean(v.totw.ss)   # average of the 100 total withinss
}
plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K",
ylab="Average Total Within Sum of Squares",
xlab="Value of K")
# Somewhere around K = 5 we start losing dramatic gains. So I’m satisfied with 5 clusters.
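One caveat (my note, not in the source): kmeans() starts from random centroids, so single runs can differ between executions. Setting a seed and using the nstart argument makes the result more stable, e.g.:
set.seed(123)                                          # for reproducibility
km_stable <- kmeans(bank1, centers = 5, nstart = 25)   # keep the best of 25 random starts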
(*) Looks like I might need to rescale to get meaningful clusters
# ========== standardize variables
bank1_scaled <- scale(bank1)
# ========== Re-cluster on bank1_scaled df (k = 5)
# k-means clustering - I pick 5 clusters
km3 <- kmeans(bank1_scaled, 5)
km3
## K-means clustering with 5 clusters of sizes 174, 16, 118, 44, 48
##
## Cluster means:
## age balance duration campaign pdays
## 1 -0.6390405 -0.22663116 -0.34795581 -0.14172258 -0.3621368
## 2 -0.4247785 0.19609827 -0.57516352 3.74315049 -0.2890690
## 3 1.0914910 0.42778460 -0.24918680 -0.09660068 -0.2474364
## 4 -0.1328877 -0.03102809 -0.08670816 -0.26629039 2.4360097
## 5 -0.1033204 -0.26702286 2.14512770 -0.25239629 -0.2156254
##
## Clustering vector:
## [1] 5 1 3 3 1 3 1 1 3 1 1 3 1 1 5 1 4 3 5 4 1 5 1 1 1 3 3 3 1 1 3 3 1 1 5 3 3
## [38] 1 1 4 3 1 1 4 5 3 3 3 3 3 5 1 5 5 3 1 1 3 5 3 1 4 2 3 3 1 3 1 3 1 3 5 1 2
## [75] 1 2 3 1 1 1 3 1 3 3 2 1 3 4 4 4 1 1 1 3 1 1 4 1 3 4 3 5 3 1 1 3 5 1 3 4 3
## [112] 4 4 3 1 1 3 4 5 1 1 3 1 5 3 4 1 3 1 1 1 5 1 1 1 4 1 2 3 1 2 5 1 1 1 1 3 1
## [149] 1 4 5 1 3 1 3 1 3 1 3 1 5 3 1 3 1 5 1 3 1 1 5 3 3 5 3 1 1 3 1 3 2 1 1 1 4
## [186] 3 1 1 3 1 1 4 4 1 1 3 1 1 1 4 1 4 4 1 1 1 3 3 3 5 3 4 3 1 1 1 3 5 1 1 1 4
## [223] 1 4 1 1 3 5 1 1 3 1 1 2 1 1 2 3 1 4 3 3 3 4 3 5 1 1 3 2 5 2 1 1 3 4 5 3 2
## [260] 1 1 1 5 5 5 1 2 1 1 3 3 1 3 3 3 3 3 3 3 3 1 1 3 1 1 1 1 1 5 3 3 3 2 3 1 5
## [297] 4 1 1 4 1 3 1 4 1 4 5 3 5 4 3 1 1 3 3 1 3 5 1 3 2 5 3 3 4 5 4 3 1 1 1 3 3
## [334] 4 3 5 3 1 1 5 1 3 1 5 1 4 1 4 1 1 5 4 1 5 1 1 3 5 1 1 1 1 3 1 5 1 1 1 5 1
## [371] 3 3 1 4 5 5 1 2 1 4 3 3 3 1 1 4 1 1 1 3 1 3 1 3 1 3 1 1 4 1
##
## Within cluster sum of squares by cluster:
## [1] 186.10147 58.42526 469.99692 121.52461 123.96767
## (between_SS / total_SS = 51.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# ========== Plot clusters based on k-means (scaled data) -- clusplot projects onto the first 2 principal components
clusplot(bank1_scaled, # data frame
  km3$cluster, # cluster assignments from the k-means run on the scaled data (km3, not km2)
  color = TRUE, # color
  shade = TRUE, # Lines in clusters
  lines = 3, # Lines connecting centroids
  labels = 2
  ) # Labels clusters and cases
# =========== Centroid Plot against 1st 2 discriminant functions
library(fpc)
fpc::plotcluster(x = bank1_scaled, km3$cluster)
function / data =
Source: http://blog.ephorie.de/learning-data-science-understanding-and-using-k-means-clustering
http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/
Because there are too many possible combinations of all possible clusters comprising all possible data points, k-means follows an iterative approach known as the expectation-maximization (EM) algorithm:
1. Initialization: assign data points to clusters at random
2. E-step (expectation): assign each observation to the "nearest" cluster (based on Euclidean distance to its centroid)
3. M-step (maximization): determine new centroids based on the mean of the assigned objects
4. Repeat steps 2 and 3 until no further changes occur
n <- 3 # no. of centroids
set.seed(1415) # set seed for reproducibility
M1 <- matrix(round(runif(100, 1, 5), 1), ncol = 2)
M2 <- matrix(round(runif(100, 7, 12), 1), ncol = 2)
M3 <- matrix(round(runif(100, 20, 25), 1), ncol = 2)
M <- rbind(M1, M2, M3)
C <- M[1:n, ] # define centroids as first n objects
obs <- length(M) / 2
A <- sample(1:n, obs, replace = TRUE) # assign objects to centroids at random
colors <- seq(10, 200, 25)
# helper function for plotting the steps
clusterplot <- function(M, C, txt) {
plot(M, main = txt, xlab = "", ylab = "")
for(i in 1:n) {
points(C[i, , drop = FALSE], pch = 23, lwd = 3, col = colors[i])
points(M[A == i, , drop = FALSE], col = colors[i])
}
}
clusterplot(M, C, "Initialization")
#------------k-means algorithm iterative
# diamonds are the Centroids
repeat {
# calculate Euclidean distance between objects and centroids
D <- matrix(data = NA, nrow = n, ncol = obs)
for(i in 1:n) {
for(j in 1:obs) {
D[i, j] <- sqrt((M[j, 1] - C[i, 1])^2 + (M[j, 2] - C[i, 2])^2)
}
}
O <- A
## E-step: parameters are fixed, distributions are optimized
A <- max.col(t(-D)) # assign objects to centroids
if(all(O == A)) break # if no change stop
clusterplot(M, C, "E-step")
## M-step: distributions are fixed, parameters are optimized
# determine new centroids based on mean of assigned objects
for(i in 1:n) {
C[i, ] <- apply(M[A == i, , drop = FALSE], 2, mean)
}
clusterplot(M, C, "M-step")
}
#Check results
(custom <- C[order(C[ , 1]), ])
## [,1] [,2]
## [1,] 3.008 2.740
## [2,] 9.518 9.326
## [3,] 22.754 22.396
#------------k-means algorithm BASE R
cl <- kmeans(M, n)
clusterplot(M, cl$centers, "Base R")
#Check results
(base <- cl$centers[order(cl$centers[ , 1]), ])
## [,1] [,2]
## 2 3.008 2.740
## 1 9.518 9.326
## 3 22.754 22.396
# same!
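As a quick programmatic double-check (my addition), the two sets of ordered centroids can be compared directly; the dimnames differ, so strip them first:
# compare hand-rolled centroids with base R's (should be TRUE given the identical printouts above)
all.equal(unname(custom), unname(base))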
function / data = bank
Source: https://towardsdatascience.com/clustering-on-mixed-type-data-8bbd0a2569c3 (copied from(?) http://dpmartin42.github.io/posts/r/cluster-mixed-types)
In the context of unsupervised classification, we may need to analyze
datasets made of mixed-type data, where numeric, nominal
(categorical, not ordered) or ordinal (categorical, ordered) features
coexist.
Gower distance ~ the average of partial dissimilarities, each scaled to the range [0, 1].
k-medoids is a classical partitioning technique that clusters a data set of n objects into k clusters known a priori.
The silhouette coefficient is used to determine the optimal number of clusters (when we don't know it a priori).
We first need to define some notion of (dis)similarity between observations. A popular choice for clustering is Euclidean distance, but it is only valid for continuous variables, so we have to use a distance metric that can handle mixed data types --> Gower distance.
* pros: Intuitive to understand and straightforward to calculate
* cons: Sensitive to non-normality and outliers in continuous variables --> transformations as a pre-processing step might be necessary (a sketch of this follows the daisy() call below).
Most similar and dissimilar clients according to Gower distance:
# data manipulation: convert character columns to factors
Factors <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
bank[ ,Factors] <- lapply( bank[ ,Factors], as.factor)
glimpse(bank)
## Rows: 400
## Columns: 17
## $ age <dbl> 46, 39, 48, 36, 38, 56, 37, 31, 59, 34, 36, 63, 43, 38, 36, …
## $ job <fct> technician, blue-collar, technician, self-employed, manageme…
## $ marital <fct> married, married, married, married, married, married, marrie…
## $ education <fct> secondary, primary, secondary, tertiary, tertiary, secondary…
## $ default <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
## $ balance <dbl> 57, 1564, 454, 16430, 1198, 1504, 137, 469, 3573, 415, 1852,…
## $ housing <fct> no, no, yes, yes, yes, yes, no, no, no, no, no, no, yes, yes…
## $ loan <fct> no, no, no, no, no, no, no, yes, no, yes, no, no, no, no, no…
## $ contact <fct> unknown, unknown, cellular, unknown, cellular, unknown, cell…
## $ day <dbl> 28, 17, 10, 6, 2, 29, 13, 2, 15, 23, 8, 25, 9, 7, 14, 6, 20,…
## $ month <fct> may, jun, jul, jun, feb, may, aug, feb, may, jul, may, jan, …
## $ duration <dbl> 796, 57, 114, 197, 63, 66, 244, 111, 44, 361, 362, 423, 100,…
## $ campaign <dbl> 1, 1, 1, 3, 1, 2, 6, 1, 1, 2, 1, 1, 1, 2, 2, 3, 1, 10, 4, 1,…
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, no…
#' Compute Gower distance with the daisy {cluster} function
gower_dist <- daisy(bank, metric = "gower")
gower_mat <- as.matrix(gower_dist)
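As mentioned above under the cons, skewed positive variables can be transformed first. A hedged sketch (my addition, not in the source): daisy() can log-transform selected ratio-scale variables via its type argument, e.g. for the strictly positive duration column:
# Gower distance with "duration" log-transformed before computing dissimilarities
gower_dist_log <- cluster::daisy(bank,
                                 metric = "gower",
                                 type = list(logratio = "duration"))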
# Check...
#' Print most similar clients
bank[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 × 17
## age job marital educa…¹ default balance housing loan contact day month
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct>
## 1 54 techn… married second… no 89 yes no cellul… 10 jul
## 2 48 techn… married second… no 454 yes no cellul… 10 jul
## # … with 6 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, y <fct>, and abbreviated variable name
## # ¹education
#' Print most dissimilar clients
bank[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 × 17
## age job marital educa…¹ default balance housing loan contact day month
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct>
## 1 32 servi… single unknown no 6145 yes no cellul… 14 may
## 2 55 retir… divorc… second… no 1580 no yes unknown 19 jun
## # … with 6 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, y <fct>, and abbreviated variable name
## # ¹education
We use the Partitioning Around Medoids (PAM) clustering algorithm, an iterative clustering procedure with the following steps:
1. Choose k random entities to become the medoids
2. Assign every entity to its closest medoid (using the custom distance matrix)
3. For each cluster, check whether any of its members would lower the average within-cluster distance if it became the medoid; if so, make it the new medoid
4. If at least one medoid has changed, return to step 2; otherwise, stop
But first, we look for the right number of clusters with the silhouette width [-1, 1] (where higher is better).
sil_width <- c(NA)
# calculating silhouette width for clusters ranging from 2 to 8 for the PAM algorithm
for(i in 2:8){
pam_fit <- pam(gower_dist, diss = TRUE, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
# Plot sihouette width (higher is better)
plot(1:8, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:8, sil_width)
The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered; objects with a low value may be outliers. This index works well with k-medoids clustering and is also used to determine the optimal number of clusters. See the Wikipedia page for further details on computation and interpretation.
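For reference, the standard definition (my addition, not in the source notes): for each observation i, let a(i) be its average distance to the other members of its own cluster and b(i) its average distance to the members of the nearest other cluster; then
s(i) = (b(i) - a(i)) / max(a(i), b(i))
so s(i) ranges from -1 (likely misassigned) to +1 (well clustered), and the plot above shows the average s(i) over all observations for each candidate k.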
Here --> pick k = 3.
Cluster Interpretation
#Summary of each cluster
k <- 3
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- bank %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## age job marital education default
## Min. :22.00 blue-collar:62 divorced: 20 primary : 27 no :152
## 1st Qu.:33.75 services :22 married :100 secondary:101 yes: 0
## Median :39.50 admin. :19 single : 32 tertiary : 20
## Mean :41.30 technician :15 unknown : 4
## 3rd Qu.:50.00 management : 8
## Max. :60.00 retired : 8
## (Other) :18
## balance housing loan contact day
## Min. :-1206 no : 19 no :130 cellular : 41 Min. : 2.00
## 1st Qu.: 176 yes:133 yes: 22 telephone: 11 1st Qu.: 9.00
## Median : 624 unknown :100 Median :15.00
## Mean : 1604 Mean :15.82
## 3rd Qu.: 1585 3rd Qu.:23.00
## Max. :16992 Max. :30.00
##
## month duration campaign pdays
## may :100 Min. : 9.0 Min. : 1.000 Min. : -1.00
## jun : 25 1st Qu.: 112.2 1st Qu.: 1.000 1st Qu.: -1.00
## apr : 7 Median : 200.0 Median : 2.000 Median : -1.00
## jul : 4 Mean : 297.3 Mean : 2.579 Mean : 37.05
## nov : 4 3rd Qu.: 367.0 3rd Qu.: 3.000 3rd Qu.: -1.00
## aug : 3 Max. :1529.0 Max. :17.000 Max. :397.00
## (Other): 9
## previous poutcome y cluster
## Min. : 0.0000 failure: 13 no :142 Min. :1
## 1st Qu.: 0.0000 other : 3 yes: 10 1st Qu.:1
## Median : 0.0000 success: 3 Median :1
## Mean : 0.3816 unknown:133 Mean :1
## 3rd Qu.: 0.0000 3rd Qu.:1
## Max. :13.0000 Max. :1
##
##
## [[2]]
## age job marital education default
## Min. :19.00 technician :44 divorced:14 primary : 8 no :121
## 1st Qu.:33.00 services :20 married :67 secondary:103 yes: 2
## Median :40.00 admin. :18 single :42 tertiary : 9
## Mean :42.17 retired : 9 unknown : 3
## 3rd Qu.:50.00 blue-collar: 8
## Max. :86.00 housemaid : 5
## (Other) :19
## balance housing loan contact day
## Min. : -411.0 no :73 no :97 cellular :94 Min. : 1.00
## 1st Qu.: 0.0 yes:50 yes:26 telephone:18 1st Qu.: 9.50
## Median : 255.0 unknown :11 Median :17.00
## Mean : 1162.4 Mean :16.98
## 3rd Qu.: 867.5 3rd Qu.:24.00
## Max. :26452.0 Max. :31.00
##
## month duration campaign pdays
## jul :52 Min. : 12.0 Min. : 1.000 Min. : -1.00
## jun :16 1st Qu.: 99.0 1st Qu.: 1.000 1st Qu.: -1.00
## aug :14 Median :177.0 Median : 2.000 Median : -1.00
## jan : 9 Mean :244.9 Mean : 3.114 Mean : 32.01
## apr : 8 3rd Qu.:330.5 3rd Qu.: 4.000 3rd Qu.: -1.00
## feb : 8 Max. :978.0 Max. :15.000 Max. :462.00
## (Other):16
## previous poutcome y cluster
## Min. : 0.0000 failure: 7 no :106 Min. :2
## 1st Qu.: 0.0000 other : 8 yes: 17 1st Qu.:2
## Median : 0.0000 success: 3 Median :2
## Mean : 0.5447 unknown:105 Mean :2
## 3rd Qu.: 0.0000 3rd Qu.:2
## Max. :10.0000 Max. :2
##
##
## [[3]]
## age job marital education default
## Min. :25.00 management :69 divorced:14 primary : 13 no :124
## 1st Qu.:34.00 technician : 8 married :81 secondary: 6 yes: 1
## Median :39.00 blue-collar : 7 single :30 tertiary :100
## Mean :42.06 entrepreneur: 7 unknown : 6
## 3rd Qu.:48.00 housemaid : 7
## Max. :83.00 retired : 7
## (Other) :20
## balance housing loan contact day
## Min. :-1212 no :78 no :114 cellular :108 Min. : 2.00
## 1st Qu.: 96 yes:47 yes: 11 telephone: 8 1st Qu.: 9.00
## Median : 699 unknown : 9 Median :17.00
## Mean : 1956 Mean :16.21
## 3rd Qu.: 2144 3rd Qu.:22.00
## Max. :23663 Max. :30.00
##
## month duration campaign pdays
## aug :40 Min. : 8.0 Min. : 1.000 Min. : -1.00
## nov :21 1st Qu.: 115.0 1st Qu.: 1.000 1st Qu.: -1.00
## may :11 Median : 187.0 Median : 2.000 Median : -1.00
## jul :10 Mean : 293.7 Mean : 2.712 Mean : 52.96
## jun :10 3rd Qu.: 361.0 3rd Qu.: 3.000 3rd Qu.: 85.00
## apr : 9 Max. :1532.0 Max. :24.000 Max. :761.00
## (Other):24
## previous poutcome y cluster
## Min. : 0.000 failure:18 no :100 Min. :3
## 1st Qu.: 0.000 other : 7 yes: 25 1st Qu.:3
## Median : 0.000 success: 7 Median :3
## Mean : 0.752 unknown:93 Mean :3
## 3rd Qu.: 1.000 3rd Qu.:3
## Max. :18.000 Max. :3
##
Here one can attempt to derive some common patterns for clients within a cluster. For example, judging from the summaries above, cluster 1 is dominated by blue-collar clients with secondary education, no default and a housing loan, while cluster 3 is dominated by management clients with tertiary education, no default and mostly no housing loan.
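A compact way to eyeball such patterns (my own sketch, assuming dplyr is loaded) is to count the most frequent jobs per cluster:
# top 3 jobs within each PAM cluster
bank %>%
  mutate(cluster = pam_fit$clustering) %>%
  count(cluster, job) %>%
  group_by(cluster) %>%
  slice_max(order_by = n, n = 3)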
# BTW, in the PAM algorithm the medoids serve as exemplars of each cluster
bank[pam_fit$medoids,]
## # A tibble: 3 × 17
## age job marital educa…¹ default balance housing loan contact day month
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct>
## 1 43 blue-… married second… no 3060 yes no unknown 15 may
## 2 35 techn… married second… no 0 no no cellul… 14 jul
## 3 40 manag… married tertia… no 51 no no cellul… 19 aug
## # … with 6 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, y <fct>, and abbreviated variable name
## # ¹education
Visualization in a lower-dimensional space with t-distributed stochastic neighbor embedding, or t-SNE.
# t-SNE = dimension reduction technique that tries to preserve local structure so as to make clusters visible in a 2D or 3D visualization. While it typically uses Euclidean distance, it can handle a custom distance metric.
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering))
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster))
Alternatives (for clustering mixed data): https://www.researchgate.net/publication/323422280_kamila_Clustering_Mixed-Type_Data_in_R_and_Hadoop