Study of Consumer Behavior in Marketing using K-means clustering technique

Data Science Project

Chetana Kulkarni

06-08-2022

Install libraries; Read the input file (dataset);

#install.packages("rmarkdown", repos = "https://packagemanager.rstudio.com/all/latest")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
#install.packages("ggcorrplot")
library(ggcorrplot)
#install.packages("pastecs")
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following objects are masked from 'package:dplyr':
## 
##     first, last

setwd("C:/Users/Admin/Desktop/Oregon/Spring 2022/Data Science/Project/Clustering-with-K-means-in-R")
#list.files()
raw.data <- read.csv("retailMarketingDI.csv", stringsAsFactors = TRUE) # needed on R >= 4.0 to get the factor columns shown by str() below
View(raw.data)

Introduction
Customers today are heavy users of online shopping, and customer data is far more available than ever before. By applying data science methods, one can find key trends and tailor marketing strategies to better capture customers and build successful long-term relationships. In this project, I use a dataset of physical retail shoppers. The project aims to analyze the data and apply the K-means clustering method to group customers and derive insights from each cluster, such as the factors that influence customers' decision making. In the end, this adds value for shopping platforms: it helps them enhance their marketing strategies, which leads to satisfied customers and shareholders, and identify gaps in the market, which helps to determine which products are needed and which are obsolete.

K-means Clustering
Clustering is a data mining method that uses customer data to segment customers into groups such that members of a group are highly similar to one another and share few similarities with members of other groups. One of the most widespread clustering methods is K-means.
K-means clustering is a type of unsupervised learning, which is used when we have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data.
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows us to find and analyze the groups that have formed organically.
Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.
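
The two outputs described above can be seen in a minimal sketch on synthetic data. The matrix `toy`, the point `new_point` and the choice of two clusters are illustrative assumptions, not part of the retail dataset; only base R's `kmeans()` is used.

```r
set.seed(42)

# Two synthetic groups of 30 points each, centred near (0, 0) and (6, 6)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2)
)

km <- kmeans(toy, centers = 2, nstart = 25)

km$centers         # output 1: one row of feature values per centroid
table(km$cluster)  # output 2: number of points labelled with each cluster

# Label a new point by its nearest centroid (squared Euclidean distance)
new_point <- c(5.5, 6.2)
nearest <- which.min(colSums((t(km$centers) - new_point)^2))
nearest
```

Reading the rows of `km$centers` is the qualitative interpretation step: each row summarises what a typical member of that cluster looks like.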

Dataset and Data Cleaning
1. Dataset: Retail shopping site.
2. 10 variables and 1000 records (before cleaning the data).
3. 7 factors and 2 integers in the dataset.
4. Discrete (factor) variables: Age, Gender, Own Home, Married, Location, Children and Catalogs. Continuous variables: Salary and Amount Spent.
5. Removed a variable with missing values.
6. Removed 6 records with missing values in the ‘Amount Spent’ variable, leaving us with 9 variables and 994 records to analyze.

str(raw.data)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Age        : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
##  $ Gender     : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ OwnHome    : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
##  $ Married    : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
##  $ Location   : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
##  $ Salary     : int  47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
##  $ Children   : int  0 0 0 1 0 0 0 0 3 0 ...
##  $ History    : Factor w/ 3 levels "High","Low","Medium": 1 1 2 1 1 2 3 1 2 NA ...
##  $ Catalogs   : int  6 6 18 18 12 6 12 18 6 18 ...
##  $ AmountSpent: int  755 1318 296 2436 1304 495 782 1155 158 3034 ...

Check the number of missing values for each variable. (It also displays the count of NAs if present.)

table(is.na(raw.data$Age)) #no NA
table(is.na(raw.data$Gender)) #no NA
table(is.na(raw.data$OwnHome)) #no NA
table(is.na(raw.data$Married)) #no NA
table(is.na(raw.data$Location)) #no NA
table(is.na(raw.data$Salary)) #no NA
table(is.na(raw.data$Children)) #no NA
table(is.na(raw.data$History)) # 303 NAs, which I will replace with 'Unknown'
table(is.na(raw.data$Catalogs)) #no NA
table(is.na(raw.data$AmountSpent)) # 6 NAs; these records can either be removed or imputed (decided below)
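
The per-column checks above can also be sketched as a single call. The snippet is a minimal illustration on a made-up data frame (`toy` and its values are assumptions, not the retail data): `colSums(is.na(...))` returns the NA count of every column at once.

```r
# Small made-up data frame with a few missing values
toy <- data.frame(
  Salary      = c(47500, 63600, NA),
  History     = factor(c("High", NA, NA)),
  AmountSpent = c(755, NA, 296)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```

Applied to the real data, the equivalent call would be `colSums(is.na(raw.data))`.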

Replace the NAs of History with ‘Unknown’

#first I will replace the NAs of History with 'Unknown':
raw.data$History <- as.character(raw.data$History)
raw.data$History[is.na(raw.data$History)] <- 'Unknown'
raw.data$History <- factor(raw.data$History)
table(raw.data$History) # worked successfully
## 
##    High     Low  Medium Unknown 
##     255     230     212     303

More preprocessing of Data

#first I will remove the 6 records with a missing AmountSpent
retail.df <- raw.data[!is.na(raw.data$AmountSpent),]

# I will factorize the Children variable
retail.df$Children <- factor(retail.df$Children)  

“Distribution”

View(retail.df %>%
       group_by(Catalogs) %>%
       summarise(mean_of_amount = mean(AmountSpent), number_of_appearances = n()))
#By looking at the table above we can see that the variable Catalogs is actually a factor variable, where 6 stands for 'low_end' products and 24 for 'high_end' products,
#and 12 and 18 are the mid-range products.
#Therefore I will change the notation to a more intuitive one (although there is no real change in the content):

retail.df<- (retail.df %>%
       mutate(Catalog = ifelse (Catalogs ==6, 'low_end',
                                (ifelse(Catalogs == 12, "low_midrange", 
                                        (ifelse(Catalogs == 18, "high_midrange", "high_end")))))))
#I will factorize the new variable
retail.df$Catalog <- as.factor(retail.df$Catalog)

#And remove the old one:
retail.df$Catalogs <- NULL
str(retail.df) # we are left with 10 variables, 8 of them are factors and 2 are integers (salary + amount spent)
## 'data.frame':    994 obs. of  10 variables:
##  $ Age        : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
##  $ Gender     : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ OwnHome    : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
##  $ Married    : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
##  $ Location   : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
##  $ Salary     : int  47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
##  $ Children   : Factor w/ 4 levels "0","1","2","3": 1 1 1 2 1 1 1 1 4 1 ...
##  $ History    : Factor w/ 4 levels "High","Low","Medium",..: 1 1 2 1 1 2 3 1 2 4 ...
##  $ AmountSpent: int  755 1318 296 2436 1304 495 782 1155 158 3034 ...
##  $ Catalog    : Factor w/ 4 levels "high_end","high_midrange",..: 3 3 2 2 4 3 4 2 3 2 ...
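
The recoding of Catalogs shown above uses nested `ifelse()` calls. As an alternative sketch (not the author's code), base R's `factor()` can map the four numeric values to labels in one step. The vector `catalogs` is a made-up example. Two behaviours differ from the nested version: the levels keep the order given in `levels` instead of alphabetical order, and an unexpected value becomes `NA` rather than falling through to "high_end".

```r
# Made-up example values of the Catalogs column
catalogs <- c(6, 12, 18, 24, 6)

# Map each numeric code to its label; levels and labels are matched by position
catalog <- factor(
  catalogs,
  levels = c(6, 12, 18, 24),
  labels = c("low_end", "low_midrange", "high_midrange", "high_end")
)

as.character(catalog)
levels(catalog)
```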

Analyze data. (summary statistics and EDA)

raw.data <- raw.data[!is.na(raw.data$AmountSpent),]
head(raw.data$Age, 10)

cor.data <- raw.data
levels(raw.data$Age)
cor.data$Age <- ifelse(cor.data$Age == 'Young', 0,
                        ifelse(cor.data$Age == 'Middle',1,2))

levels(raw.data$Gender)
cor.data$Gender <- ifelse(cor.data$Gender == "Female", 0 ,1)
levels(raw.data$OwnHome)
cor.data$OwnHome <- ifelse(cor.data$OwnHome == "Rent", 0 ,1)

levels(raw.data$Married)
cor.data$Married <- ifelse(cor.data$Married == "Single", 0 ,1)

levels(raw.data$Location)
cor.data$Location_close <- ifelse(cor.data$Location == "Far", 0 ,1)

cor.data$History<- NULL
cor.data$Location<- NULL

Observations
1. Most of the customers are middle-aged: 504, vs. 205 old and 285 young.
2. Gender is distributed evenly, and so are the Own Home and Married variables.
3. The majority live close: 706 vs. 288 far.
4. The largest group of customers in the dataset, 462, do not have children.
5. ‘Catalog’ indicates the type of products the customer has bought, and it is distributed fairly evenly as well.

After Data Cleaning

str(cor.data)
## 'data.frame':    994 obs. of  9 variables:
##  $ Age           : num  2 1 0 1 1 0 1 1 1 2 ...
##  $ Gender        : num  0 1 0 1 0 1 0 1 0 1 ...
##  $ OwnHome       : num  1 0 0 1 1 1 0 1 1 1 ...
##  $ Married       : num  0 0 0 1 0 1 0 0 1 1 ...
##  $ Salary        : int  47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
##  $ Children      : int  0 0 0 1 0 0 0 0 3 0 ...
##  $ Catalogs      : int  6 6 18 18 12 6 12 18 6 18 ...
##  $ AmountSpent   : int  755 1318 296 2436 1304 495 782 1155 158 3034 ...
##  $ Location_close: num  0 1 1 1 1 1 1 1 1 0 ...
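cor.matrix <- cor(cor.data, method = "pearson", use = "complete.obs")

# corrplot() is not among the loaded packages, so ggcorrplot() from the loaded ggcorrplot package is used
ggcorrplot(cor.matrix)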

#explain the matrix, high correlation, low correlation etc. 

Observations from the Correlation matrix
1. Married and Salary, as well as Amount Spent and Salary, are strongly positively correlated.
2. Number of children and age are negatively correlated: in this dataset, older people have few children, either 0 or 1.
3. Marriage and number of children are not correlated. Married and single customers have about the same number of children.
4. No correlation is apparent between Gender and Age.

library(ggplot2)

par(mfrow=c(1,7))

barplot(table(raw.data$Age), main="Age", col = "#69b3a2")
barplot(table(raw.data$Gender), main="Gender", col = "#A9A9A9")
barplot(table(raw.data$OwnHome), main="Own Home?", col = "#69b3a2")
barplot(table(raw.data$Married), main="Married", col = "#A9A9A9")
barplot(table(raw.data$Location), main="Location", col = "#69b3a2")
barplot(table(raw.data$Children), main="Children", col = "#A9A9A9")
barplot(table(raw.data$Catalogs), main="Catalog", col = "#69b3a2")

par(
  mfrow=c(1,2),
  mar=c(4,4,1,0)
)
hist((raw.data$AmountSpent), xlab="", main="Amount Spent", col = "#69b3a2")
hist((raw.data$Salary), xlab="", ylab="", main="Salary", col = "#A9A9A9")

#I will show the distribution of each categorical variable
lapply( retail.df %>%
          select(c("Age", "Gender", "OwnHome", "Married", "Location", "Children", "History","Catalog"))
        ,table)
## $Age
## 
## Middle    Old  Young 
##    504    205    285 
## 
## $Gender
## 
## Female   Male 
##    501    493 
## 
## $OwnHome
## 
##  Own Rent 
##  514  480 
## 
## $Married
## 
## Married  Single 
##     500     494 
## 
## $Location
## 
## Close   Far 
##   706   288 
## 
## $Children
## 
##   0   1   2   3 
## 462 267 143 122 
## 
## $History
## 
##    High     Low  Medium Unknown 
##     254     229     211     300 
## 
## $Catalog
## 
##      high_end high_midrange       low_end  low_midrange 
##           232           232           250           280
ggplot(data = retail.df, aes(x = Salary))+
  geom_histogram(bins = 50, colour = 'white', fill = 'darkblue')+
  scale_x_continuous(breaks = seq(0,150000,25000))+
  scale_y_continuous(breaks = seq(0,70,10))+
  xlab("Salary")+
  ylab("Frequency")+
  ggtitle("Distribution of salaries")+
  geom_vline(xintercept = mean(retail.df$Salary), color = 'red')+
  labs(subtitle  = 'red line represents the average salary')

Salary distribution is skewed to the right, with an average salary of 56032.

mean_salary_female <- mean(retail.df$Salary[retail.df$Gender =="Female"])
mean_salary_male <- mean(retail.df$Salary[retail.df$Gender =="Male"])


ggplot(data = retail.df, aes(x = Salary))+
  geom_histogram(bins = 50, colour = 'white', fill = 'darkblue')+
  scale_x_continuous(breaks = seq(0,150000,35000))+
  scale_y_continuous(breaks = seq(0,70,10))+
  xlab("Salary")+
  ylab("Frequency")+
  ggtitle("Distribution of salaries faceted by gender")+
  geom_vline(xintercept = mean_salary_female, color = 'pink',size=1.5)+
  geom_vline(xintercept = mean_salary_male, color = 'red', alpha= 0.6)+
  labs(subtitle  = "red line: male average salary; pink line: female average salary")+
  facet_wrap(~Gender)

Distribution of salaries for Male is close to a normal distribution, while the distribution of salaries for Female has a heavy tail and positive skewness. As we already have seen in the correlation matrix, males have higher average salaries.

#explain the distributions: the male distribution looks closer to normal, while the female one is right-skewed

mean_AmountSpent_female <- mean(retail.df$AmountSpent[retail.df$Gender =="Female"])
mean_AmountSpent_male <- mean(retail.df$AmountSpent[retail.df$Gender =="Male"])

ggplot(data = retail.df, aes(x = AmountSpent))+
  geom_histogram(bins = 50, colour = 'white', fill = 'lightgreen')+
  scale_x_continuous()+
  scale_y_continuous()+
  xlab("Amount Spent")+
  ylab("Frequency")+
  ggtitle("Distribution of Amount Spent faceted by gender")+
  labs(subtitle  = "red line: male average amount spent; pink line: female average amount spent")+
  facet_wrap(~Gender)+
  geom_vline(xintercept = mean_AmountSpent_female, color = 'pink',size=1.5)+
  geom_vline(xintercept = mean_AmountSpent_male, color = 'red', alpha= 0.6)

#Again explain the distributions: the male distribution looks closer to normal, while the female one is right-skewed

The distribution of Amount Spent by gender is close to the distribution of Salaries by gender (and, as we have seen, the two are positively correlated). Males spend on average 37.3% more than Females.
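
A figure such as "37.3% more" is a relative difference between two group means. The sketch below shows the arithmetic on made-up numbers. The helper `pct_more` and both vectors are illustrative assumptions, not the retail data.

```r
# Relative difference of mean(a) over mean(b), in percent
pct_more <- function(a, b) {
  (mean(a) - mean(b)) / mean(b) * 100
}

# Made-up spending values for two groups
male_spent   <- c(1200, 1500, 1800)  # mean 1500
female_spent <- c(900, 1000, 1100)   # mean 1000

pct_more(male_spent, female_spent)   # 50, i.e. the first group spends 50% more
```

With the variables defined earlier in this report, the same computation would presumably be `(mean_AmountSpent_male - mean_AmountSpent_female) / mean_AmountSpent_female * 100`.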

K-means Clustering (a few points)

The analysis results in 4 clusters, that is, 4 different customer segments, each with its own characteristics. The optimal number of clusters was chosen using several validity measures: total within-cluster sum of squares (‘Elbow method’), silhouette score and Calinski-Harabasz index.

library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(clustertend)
## Package `clustertend` is deprecated.  Use package `hopkins` instead.
library(ClusterR)
## Loading required package: gtools
library(data.table)
library(ggplot2)
retail.df <- raw.data[!is.na(raw.data$AmountSpent),]
clustering.df <- cor.data #make sure you have the current correlation_script before you run this line
dim(clustering.df)[2] # make sure that you get 9 after running this line

Choosing optimal number of clusters.
First, before running k-Means, let’s decide how many clusters we want to generate. Let’s start with the elbow method.
Let’s decide the maximum K to cluster. Say 10.

# Choosing optimal number of clusters ----------------------------
#first before we run k-Means, let's decide how many clusters we want to generate 
#we can do it in many various ways, I will start with the elbow method: 


########### explanation about WSS ########### 

#let's decide the maximum K to cluster. Say 10: 
k.max <- 10

#we will create a vector of the total within-cluster sum of squares, in order to visualize it 
wss <- sapply(1:k.max, function(k){kmeans(clustering.df, k, nstart=50,iter.max = 1000 )$tot.withinss})

options("scipen"=999)
ggplot()+ aes(x = 1:k.max, y = wss) + geom_point() + geom_line()+
  labs(x = "Number of clusters K", y = "Total within-clusters sum of squares")+
  scale_x_continuous(breaks = seq(0,10,1))+
  ggtitle("The Elbow Method")

#install.packages("broom")
#install.packages("broom", type="binary")
library(factoextra)
#if(!require(broom)) install.packages("broom",repos = "http://cran.us.r-project.org")
#We can use the built-in function presented to us in class, fviz_nbclust: 

library(rlang)
## 
## Attaching package: 'rlang'
## The following object is masked from 'package:gtools':
## 
##     chr
## The following object is masked from 'package:data.table':
## 
##     :=
library(dplyr)
library(tibble)

Total WSS decreases with each additional cluster. However, the change clearly becomes less significant after the 3rd or, some would say, the 4th cluster.
Although the ‘Elbow method’ did not provide conclusive evidence on how many clusters would be optimal for our data, it did give us the impression that it would be either 2 or 4.

fviz_nbclust(clustering.df, FUN = kmeans,method = "wss" ,nstart = 50)

Silhouette Score
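
For reference, the standard definition of the silhouette score of a data point x, which the interpretation in this section appears to describe, is given here. The symbols a(x) and b(x) are the ones used in the text.

```latex
s(x) = \frac{b(x) - a(x)}{\max\{a(x),\, b(x)\}}
```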

Interpretation of the silhouette score formula above is as follows: b(x) is the minimum average distance between x and the points of the closest neighboring cluster, while a(x) is the average distance between x and the other points within its own cluster. The difference between these two averages is normalized by the larger of the two. In simple words, if all points were assigned optimally, the difference between b(x) and a(x) would be large and the score would be close to +1. On the other hand, if all the points were assigned to the wrong cluster, we would get a score close to -1. A score of 0 simply means that there is a neighboring cluster that would be as good as the cluster originally assigned.
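
To make a(x) and b(x) concrete, here is a minimal base-R sketch that computes the score by hand on synthetic data. The helper `silhouette_scores` and the matrix `toy` are illustrative assumptions. The analysis in this report relies on `Optimal_Clusters_KMeans()` instead, and packaged implementations may differ in details such as the distance used.

```r
# Silhouette score of every point, given data x and cluster labels
silhouette_scores <- function(x, labels) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  clusters <- sort(unique(labels))
  sapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE                  # a(x) excludes the point itself
    if (!any(own)) return(0)         # convention: singleton cluster scores 0
    a <- mean(d[i, own])             # average distance within own cluster
    b <- min(sapply(setdiff(clusters, labels[i]),
                    function(cl) mean(d[i, labels == cl])))  # closest other cluster
    (b - a) / max(a, b)
  })
}

set.seed(1)
# Two tight, well-separated synthetic groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)

sil <- silhouette_scores(toy, km$cluster)
mean(sil)  # close to +1 because the two groups are far apart
```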

#When looking at the Elbow Method, one cannot tell for sure what's the optimal 
# number of clusters K. could be either 3 or 4 
#(some would say only 2), therefore we shall look into the silhouette score 
#using the built-in function Optimal_Clusters_KMeans:


########### explanation about silhouette Score ########### 

opt.k.sil<- Optimal_Clusters_KMeans(clustering.df, max_clusters=10, plot_clusters=TRUE, 
                                    criterion="silhouette")

#both 2 and 4 clusters generated a high silhouette score of 0.59 
#combining that with the WSS output we can conclude that the optimal number of clusters would be 4. 

Both 2 and 4 clusters generated a high silhouette score of 0.59.
Combining that with the WSS output we can conclude that the optimal number of clusters would be 4.

Calinski-Harabasz index

########### explanation about Calinski-Harabasz index ########### 

#the deciding measure between 2 and 4 clusters will be the Calinski-Harabasz index
km_2k <- kmeans(clustering.df, 2) 
km_4k <- kmeans(clustering.df, 4) 

round(calinhara(clustering.df,km_2k$cluster),digits=1)
## [1] 2257.6
round(calinhara(clustering.df,km_4k$cluster),digits=1)
## [1] 4018.9
#It is obvious now that 4 clusters would be best and we can move on

# Clustering ---------------------------------------------------
#We can start our clustering 
retail.df <- raw.data[!is.na(raw.data$AmountSpent),]
retail.df$History <- NULL # dropped after re-creating retail.df so that the removal takes effect

KMC <- kmeans(clustering.df,centers = 4,iter.max = 999, nstart=50) 

The higher the index, the better the result. In our data, the Calinski-Harabasz index is 2257.6 for 2 clusters and 4018.9 for 4 clusters.
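
As a rough sketch of what the index measures: it is the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom, (BSS / (k - 1)) / (WSS / (n - k)). The helper `ch_index` and the synthetic `toy` data below are illustrative assumptions. The values reported in this document come from `calinhara()` in the fpc package, not from this helper.

```r
# Calinski-Harabasz index from the components stored in a kmeans() result
ch_index <- function(km) {
  n <- length(km$cluster)   # number of observations
  k <- nrow(km$centers)     # number of clusters
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

set.seed(7)
# Synthetic data: two groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))

km2 <- kmeans(toy, centers = 2, nstart = 25)
km4 <- kmeans(toy, centers = 4, nstart = 25)

c(k2 = ch_index(km2), k4 = ch_index(km4))
```

Setting `nstart` above 1 reduces the chance that a poor random start distorts the comparison between the two values of k.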

After applying 3 independent validity measures (total within-cluster sum of squares, silhouette score, and Calinski-Harabasz index), we can conclude that the optimal number of clusters for our data is 4.

retail.clustered <- (cbind(retail.df, cluster= KMC$cluster)) 
# Create a new DF consisting of the original DF
# plus the cluster number for each observation
table_of_cluster_distribution <- table(retail.clustered$cluster) # the result:
# 1   2   3   4 
# 157 285 283 269 
#barplot(table_of_cluster_distribution, xlab="Clusters", 
#        ylab="# of customers", main="# of customers in each cluster",
#        col="#69b3a2")

retail.clustered <- data.table(retail.clustered)
retail.clustered[, avg_AmountSpent_in_cluster := mean(AmountSpent),by=list(cluster)]
retail.clustered[, avg_SalarySpent_in_cluster := mean(Salary),by=list(cluster)]

retail.clustered  <-  retail.clustered[, c("Age", "Gender", "OwnHome", "Married",                   
         "Location", "Children", "Catalogs", "Salary","AmountSpent", 
         "avg_AmountSpent_in_cluster", "avg_SalarySpent_in_cluster", "cluster" )]

cluster_1 <- retail.clustered[retail.clustered$cluster==1,]
cluster_2 <- retail.clustered[retail.clustered$cluster==2,]
cluster_3 <- retail.clustered[retail.clustered$cluster==3,]
cluster_4 <- retail.clustered[retail.clustered$cluster==4,]
#View(cluster_1)
lapply(retail.clustered[,1:7],table)
## $Age
## 
## Middle    Old  Young 
##    504    205    285 
## 
## $Gender
## 
## Female   Male 
##    501    493 
## 
## $OwnHome
## 
##  Own Rent 
##  514  480 
## 
## $Married
## 
## Married  Single 
##     500     494 
## 
## $Location
## 
## Close   Far 
##   706   288 
## 
## $Children
## 
##   0   1   2   3 
## 462 267 143 122 
## 
## $Catalogs
## 
##   6  12  18  24 
## 250 280 232 232
data.with.clustering <- cbind(clustering.df, retail.clustered)
View(data.with.clustering)

“Table”

Results

  1. Cluster number 1: mostly young single women with no children, who rent their homes. The cluster has the lowest average salary and the lowest average amount spent.

  2. Cluster number 2: middle-aged and old men and women, mostly with no children or one child, who buy mid-range products. This cluster has the second-lowest average salary and amount spent.

  3. Cluster number 3: mostly middle-aged men who own homes. Every single person in the cluster is married. Relative to the other clusters, they have the highest share of customers with 2 or 3 children. They earn the highest salaries and spend the most.

  4. Cluster number 4: middle-aged customers who mostly own homes and are married. They buy high-end products and spend the second-highest amount.

More Pointers:
Cluster number 1 has the “least valuable” customers when it comes to generating money; however, we have no data about the purchasing frequency of this segment. The picture that emerges from this segment is of a young female student who doesn’t make a lot of money and doesn’t spend much either. Cluster number 3, on the other hand, consists of middle-aged men who have a steady high income, have children and spend the highest amount. Cluster number 2 consists of middle-aged and old customers who don’t make a lot of money and don’t spend much. Cluster number 4 is similar in age, but these middle-aged customers do have high salaries and do spend a lot, mostly on high-end products.

Overall Idea:
The results are clear, and each clustered segment has distinguishing characteristics. That is why these methods are widely used in the marketing industry, both for segmenting clients and for matching them with products similar to what they have already bought.
The result is more satisfied clients. Satisfied clients tend to use our service or product more often, which generates more income. All parties are satisfied, and the social surplus increases.

Compare Clusters
One way to compare the learned weights is to compare the expressive power of their respective classifiers. One way to do so is by utilizing the squared L2 difference.

$$L = \sum_{c} \sum_{x \in c} \lVert x - \mu_c \rVert^2$$

Where c is every cluster, x represents every data-point registered to that cluster, and $\mu_c$ is the centroid of cluster c. The smaller the loss the better the classifier.
Also, comparing the cluster weights might not give us much information about individual runs of k-means.
Let’s say we want to compare the centroids of two k-means runs. Let the i-th k-means run produce k centroids $C_i = \{c_{i1}, c_{i2}, \ldots, c_{ik}\}$. The problem with directly comparing clusters at the same index across different k-means runs is that clusters are not tied to a specific index. In one run a centroid might appear at the first index, while in a separate run that same centroid might appear at a different index. Here is a very naive approach. For two runs i and j, build the matrix of distances between every pair of centroids:

$$D^{(i,j)}_{ab} = \lVert c_{ia} - c_{jb} \rVert^2, \qquad a, b = 1, \ldots, k$$

This is the distance between every centroid in two k-means runs. Because the distance function used (squared L2) is symmetric, the distance from centroid a to centroid b equals the distance from b to a, so each pair of centroids only has to be evaluated once. Then we can calculate the total distance between two k-means runs as

$$D(i,j) = \sum_{a=1}^{k} \min_{b} D^{(i,j)}_{ab}$$

This is a very naive approach. It does not set a constraint that indices can only show up once in the minimum function (a 1-to-1 mapping between the centroids of two k-means runs). To generalise this approach to n k-means runs, we can construct another distance matrix like the one above, with every entry $D_{i,j} = D(i,j)$. Despite its simplicity, it does show the distance between two k-means runs, and it has the nice property that if two runs learned exactly the same centroids, regardless of order, the distance is $D(i,j) = 0$.
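
The naive distance described above can be sketched in a few lines of base R. The function `run_distance` and the hand-made centroid matrices are illustrative assumptions. Each row of a matrix is one centroid, as in the `centers` element returned by `kmeans()`.

```r
# Naive distance between two k-means runs:
# for every centroid of run 1, take the squared L2 distance to its
# nearest centroid of run 2, then sum these minima.
run_distance <- function(c1, c2) {
  d <- matrix(
    sapply(seq_len(nrow(c2)), function(b) rowSums(sweep(c1, 2, c2[b, ])^2)),
    nrow = nrow(c1)
  )                        # d[a, b] = squared distance between c1[a, ] and c2[b, ]
  sum(apply(d, 1, min))    # no 1-to-1 constraint: a centroid of run 2 may be reused
}

centroids <- matrix(c(0, 0,
                      5, 5,
                      10, 0), ncol = 2, byrow = TRUE)
shuffled  <- centroids[c(3, 1, 2), ]  # same centroids, different index order
shifted   <- centroids + 1            # every centroid moved by (1, 1)

run_distance(centroids, shuffled)  # 0: order does not matter
run_distance(centroids, shifted)   # 6: each centroid is 1^2 + 1^2 = 2 away
```

A stricter variant would enforce the 1-to-1 mapping by solving an assignment problem, which this sketch deliberately does not do.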

Using the squared L2 distance is thus one empirical way to quantify the similarity between clusterings.

If we have cluster labels we can also compute the similarity between clusterings by the consensus index. This is just an average similarity between all clustering pairs.

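For example, we might want to use the Adjusted Mutual Information (AMI) as the similarity measure between two clusterings U and V. Then the consensus index (CI) between n clusterings $U_1, \ldots, U_n$ is defined as:
$$\mathrm{CI}(U_1, \ldots, U_n) = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{AMI}(U_i, U_j)$$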

If we use an adjusted similarity measure for clustering comparisons, the CI is equal to 0 if the clusterings are random and independent, and equal to 1 when they are identical.
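
The consensus index can be sketched in base R as the mean similarity over all pairs of clusterings. Base R has no AMI function, so this sketch swaps in the adjusted Rand index as the adjusted similarity measure. That substitution, the helper names `adjusted_rand` and `consensus_index`, and the label vectors are all assumptions made for illustration.

```r
# Adjusted Rand index between two label vectors (stand-in for AMI)
adjusted_rand <- function(u, v) {
  tab <- table(u, v)                        # contingency table of the two labelings
  sum_ij <- sum(choose(tab, 2))
  sum_a  <- sum(choose(rowSums(tab), 2))
  sum_b  <- sum(choose(colSums(tab), 2))
  expected  <- sum_a * sum_b / choose(length(u), 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

# Consensus index: average similarity over all pairs of clusterings
consensus_index <- function(labelings, sim = adjusted_rand) {
  pairs <- combn(length(labelings), 2)
  mean(apply(pairs, 2, function(p) sim(labelings[[p[1]]], labelings[[p[2]]])))
}

u1 <- c(1, 1, 2, 2, 3, 3)
u2 <- c(2, 2, 3, 3, 1, 1)  # same partition as u1 under different label names
u3 <- c(1, 1, 1, 2, 2, 2)  # a different partition

consensus_index(list(u1, u2))      # 1: the partitions are identical
consensus_index(list(u1, u2, u3))  # below 1: u3 disagrees with the others
```

Passing the similarity measure as the `sim` argument keeps the averaging step separate from the choice of measure, so an AMI implementation could be substituted without changing `consensus_index`.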