Install libraries and read the input file (dataset)
#install.packages("rmarkdown", repos = "https://packagemanager.rstudio.com/all/latest")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
#install.packages("ggcorrplot")
library(ggcorrplot)
#install.packages("pastecs")
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:data.table':
##
## first, last
## The following objects are masked from 'package:dplyr':
##
## first, last
setwd("C:/Users/Admin/Desktop/Oregon/Spring 2022/Data Science/Project/Clustering-with-K-means-in-R")
#list.files()
raw.data <- read.csv("retailMarketingDI.csv")
View(raw.data)
Introduction
Nowadays customers are heavy users of online shopping, and for online
retailers customer data is more available than it has ever been. By
applying data science methods, one can find key trends and apply
different marketing strategies to better capture customers and build
successful long-term relationships. In this project I use a dataset of
physical retail shoppers. The aim is to analyze the data and apply the
K-means clustering method to group customers and derive insights from
each cluster, such as the factors that influence customers' decision
making. This, in the end, adds value for the shopping platforms: it
enhances their marketing strategies, which leads to satisfied customers
and shareholders, and it identifies gaps in the market, helping to
distinguish the products that are needed from the products that are
obsolete.
K-means Clustering
Clustering is a data mining method that uses customer data to segment
customers into groups such that members of the same group are highly
similar to one another, while having little in common with the members
of other groups. One of the most widespread clustering methods is the
K-means method.
K-means clustering is a type of unsupervised learning, which is used
when we have unlabeled data (i.e., data without defined categories or
groups). The goal of this algorithm is to find groups in the data, with
the number of groups represented by the variable K. The algorithm works
iteratively to assign each data point to one of K groups based on the
features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new
data.
2. Labels for the training data (each data point is assigned to a
single cluster).
Rather than defining groups before looking at the data, clustering
allows us to find and analyze the groups that have formed
organically.
Each centroid of a cluster is a collection of feature values which
define the resulting groups. Examining the centroid feature weights can
be used to qualitatively interpret what kind of group each cluster
represents.
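As a minimal illustration of these two outputs, here is a hedged sketch on synthetic data (the blob coordinates are made up): in R's kmeans() they correspond directly to the $centers and $cluster components.
set.seed(42)
toy <- data.frame(x = c(rnorm(50, 0), rnorm(50, 5)), #two synthetic blobs
                  y = c(rnorm(50, 0), rnorm(50, 5)))
km <- kmeans(toy, centers = 2, nstart = 25)
km$centers #1. the centroids of the K clusters (usable to label new data)
head(km$cluster) #2. the cluster label assigned to each training point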
Dataset and Data Cleaning
1. Dataset: retail shopping site.
2. 10 variables and 1000 records (before cleaning the data).
3. After preprocessing, 8 factors and 2 integers in the dataset.
4. Factor (discrete) variables: Age, Gender, OwnHome, Married,
Location, Children, History and Catalog. Continuous variables: Salary
and AmountSpent.
5. Replaced the 303 missing values of the History variable with
'Unknown' (History is later dropped for the correlation analysis).
6. Removed 6 records with missing values in the AmountSpent variable,
leaving us with 994 records to analyze.
str(raw.data)
## 'data.frame': 1000 obs. of 10 variables:
## $ Age : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
## $ OwnHome : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
## $ Married : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
## $ Location : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
## $ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
## $ Children : int 0 0 0 1 0 0 0 0 3 0 ...
## $ History : Factor w/ 3 levels "High","Low","Medium": 1 1 2 1 1 2 3 1 2 NA ...
## $ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
## $ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
Check each variable for missing values; the tables below show the count of NAs where present.
table(is.na(raw.data$Age)) #no NA
table(is.na(raw.data$Gender)) #no NA
table(is.na(raw.data$OwnHome)) #no NA
table(is.na(raw.data$Married)) #no NA
table(is.na(raw.data$Location)) #no NA
table(is.na(raw.data$Salary)) #no NA
table(is.na(raw.data$Children)) #no NA
table(is.na(raw.data$History)) #There are 303 NAs, which I will replace with 'Unknown'
table(is.na(raw.data$Catalogs)) #no NA
table(is.na(raw.data$AmountSpent)) #There are 6 NAs; these records can either be removed or imputed (forecast) - decided below: removed
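As an aside, a compact sketch that counts the NAs of every column in one call:
sapply(raw.data, function(x) sum(is.na(x))) #NA count per column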
Replace the NAs of History with ‘Unknown’
#first I will replace the NAs of History with 'Unknown':
raw.data$History <- as.character(raw.data$History)
raw.data$History[is.na(raw.data$History)] <- 'Unknown'
raw.data$History <- factor(raw.data$History)
table((raw.data$History)) # worked successfully
##
## High Low Medium Unknown
## 255 230 212 303
More Data Preprocessing
#First, remove the 6 records with missing AmountSpent
retail.df <- raw.data[!is.na(raw.data$AmountSpent),]
#Factorize the Children variable
retail.df$Children <- factor(retail.df$Children)
View(retail.df %>%
  group_by(Catalogs) %>%
  summarise(mean_of_amount = mean(AmountSpent), number_of_appearances = n()))
#The table above shows that the variable Catalogs is actually a factor variable,
#where 6 marks 'low_end' products and 24 'high_end' products,
#with 12 and 18 as the mid-range products.
#Therefore I will switch to a more intuitive notation (the content is unchanged):
retail.df <- (retail.df %>%
  mutate(Catalog = ifelse(Catalogs == 6, 'low_end',
                   ifelse(Catalogs == 12, 'low_midrange',
                   ifelse(Catalogs == 18, 'high_midrange', 'high_end')))))
#Factorize the new variable
retail.df$Catalog <- as.factor(retail.df$Catalog)
#And remove the old one:
retail.df$Catalogs <- NULL
str(retail.df) # 10 variables remain: 8 factors and 2 integers (Salary + AmountSpent)
## 'data.frame': 994 obs. of 10 variables:
## $ Age : Factor w/ 3 levels "Middle","Old",..: 2 1 3 1 1 3 1 1 1 2 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 1 2 1 2 1 2 ...
## $ OwnHome : Factor w/ 2 levels "Own","Rent": 1 2 2 1 1 1 2 1 1 1 ...
## $ Married : Factor w/ 2 levels "Married","Single": 2 2 2 1 2 1 2 2 1 1 ...
## $ Location : Factor w/ 2 levels "Close","Far": 2 1 1 1 1 1 1 1 1 2 ...
## $ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
## $ Children : Factor w/ 4 levels "0","1","2","3": 1 1 1 2 1 1 1 1 4 1 ...
## $ History : Factor w/ 4 levels "High","Low","Medium",..: 1 1 2 1 1 2 3 1 2 4 ...
## $ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
## $ Catalog : Factor w/ 4 levels "high_end","high_midrange",..: 3 3 2 2 4 3 4 2 3 2 ...
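As an aside, a hedged alternative sketch of the same recoding using dplyr::case_when, which reads more naturally than nested ifelse() calls (it assumes the original Catalogs column is still present, i.e. it would be run in place of the mutate above):
retail.df <- retail.df %>%
  mutate(Catalog = factor(case_when(
    Catalogs == 6  ~ 'low_end',
    Catalogs == 12 ~ 'low_midrange',
    Catalogs == 18 ~ 'high_midrange',
    TRUE           ~ 'high_end' #Catalogs == 24
  )))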
Analyze data (summary statistics and EDA)
#Work on a numerically encoded copy for the correlation matrix
raw.data <- raw.data[!is.na(raw.data$AmountSpent),]
head(raw.data$Age, 10)
cor.data <- raw.data
levels(raw.data$Age)
#Encode Age ordinally: Young = 0, Middle = 1, Old = 2
cor.data$Age <- ifelse(cor.data$Age == 'Young', 0,
                ifelse(cor.data$Age == 'Middle', 1, 2))
levels(raw.data$Gender)
cor.data$Gender <- ifelse(cor.data$Gender == "Female", 0, 1)
levels(raw.data$OwnHome)
cor.data$OwnHome <- ifelse(cor.data$OwnHome == "Rent", 0, 1)
levels(raw.data$Married)
cor.data$Married <- ifelse(cor.data$Married == "Single", 0, 1)
levels(raw.data$Location)
cor.data$Location_close <- ifelse(cor.data$Location == "Far", 0, 1)
#Drop History (303 'Unknown' values) and the original Location column
cor.data$History <- NULL
cor.data$Location <- NULL
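For comparison, a sketch of an automatic encoding via model.matrix(); unlike the manual mapping above, it produces one dummy column per factor level and does not preserve the Young < Middle < Old ordering of Age:
dummies <- model.matrix(~ Age + Gender + OwnHome + Married + Location - 1, data = raw.data)
head(dummies)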
Observations
1. Most customers are middle-aged: 504, versus 205 old and 285 young.
2. Gender is distributed evenly, as are the OwnHome and Married
variables.
3. The majority, 706, lives close, versus 288 who live far.
4. Most customers in the dataset, 462, do not have children.
5. 'Catalog' indicates the type of products the customer has bought,
and it is distributed fairly evenly as well.
After Data Cleaning
str(cor.data)
## 'data.frame': 994 obs. of 9 variables:
## $ Age : num 2 1 0 1 1 0 1 1 1 2 ...
## $ Gender : num 0 1 0 1 0 1 0 1 0 1 ...
## $ OwnHome : num 1 0 0 1 1 1 0 1 1 1 ...
## $ Married : num 0 0 0 1 0 1 0 0 1 1 ...
## $ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
## $ Children : int 0 0 0 1 0 0 0 0 3 0 ...
## $ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
## $ AmountSpent : int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
## $ Location_close: num 0 1 1 1 1 1 1 1 1 0 ...
cor.matrix <- cor(cor.data, method = "pearson", use = "complete.obs")
ggcorrplot(cor.matrix) #plot the correlation matrix (ggcorrplot is loaded above; the original corrplot() call belongs to the unloaded corrplot package)
Observations from the Correlation Matrix
1. Married and Salary, as well as AmountSpent and Salary, are strongly
positively correlated.
2. Number of children and age are negatively correlated: in this
dataset, older customers have few children, either 0 or 1.
3. Marriage and number of children are not correlated; married and
single customers have about the same number of children.
4. No correlation is apparent between Gender and Age.
library(ggplot2)
par(mfrow=c(1,7))
barplot(table(raw.data$Age), main="Age", col = "#69b3a2")
barplot(table(raw.data$Gender), main="Gender", col = "#A9A9A9")
barplot(table(raw.data$OwnHome), main="Own Home?", col = "#69b3a2")
barplot(table(raw.data$Married), main="Married", col = "#A9A9A9")
barplot(table(raw.data$Location), main="Location", col = "#69b3a2")
barplot(table(raw.data$Children), main="Children", col = "#A9A9A9")
barplot(table(retail.df$Catalog), main="Catalog", col = "#69b3a2") #Catalog lives in retail.df; raw.data only has the original Catalogs
par(
mfrow=c(1,2),
mar=c(4,4,1,0)
)
hist((raw.data$AmountSpent), xlab="", main="Amount Spent", col = "#69b3a2")
hist((raw.data$Salary), xlab="", ylab="", main="Salary", col = "#A9A9A9")
#I will show the distribution of each categorical variable
lapply( retail.df %>%
select(c("Age", "Gender", "OwnHome", "Married", "Location", "Children", "History","Catalog"))
,table)
## $Age
##
## Middle Old Young
## 504 205 285
##
## $Gender
##
## Female Male
## 501 493
##
## $OwnHome
##
## Own Rent
## 514 480
##
## $Married
##
## Married Single
## 500 494
##
## $Location
##
## Close Far
## 706 288
##
## $Children
##
## 0 1 2 3
## 462 267 143 122
##
## $History
##
## High Low Medium Unknown
## 254 229 211 300
##
## $Catalog
##
## high_end high_midrange low_end low_midrange
## 232 232 250 280
ggplot(data = retail.df, aes(x = Salary))+
geom_histogram(bins = 50, colour = 'white', fill = 'darkblue')+
scale_x_continuous(breaks = seq(0,150000,25000))+
scale_y_continuous(breaks = seq(0,70,10))+
xlab("Salary")+
ylab("Frequency")+
ggtitle("Distribution of salaries")+
geom_vline(xintercept = mean(retail.df$Salary), color = 'red')+
labs(subtitle = 'the red line represents the average salary')
The salary distribution is skewed to the right, with an average salary of 56,032.
mean_salary_female <- mean(retail.df$Salary[retail.df$Gender =="Female"])
mean_salary_male <- mean(retail.df$Salary[retail.df$Gender =="Male"])
ggplot(data = retail.df, aes(x = Salary))+
geom_histogram(bins = 50, colour = 'white', fill = 'darkblue')+
scale_x_continuous(breaks = seq(0,150000,35000))+
scale_y_continuous(breaks = seq(0,70,10))+
xlab("Salary")+
ylab("Frequency")+
ggtitle("Distribution of salaries faceted by gender")+
geom_vline(xintercept = mean_salary_female, color = 'pink',size=1.5)+
geom_vline(xintercept = mean_salary_male, color = 'red', alpha= 0.6)+
labs(subtitle = "red line is male's average salary, and pink's female's")+
facet_wrap(~Gender)
The distribution of salaries for males is close to a normal distribution, while the distribution for females has a heavy tail and positive skewness. As the correlation matrix already suggested, males have higher average salaries.
mean_AmountSpent_female <- mean(retail.df$AmountSpent[retail.df$Gender =="Female"])
mean_AmountSpent_male <- mean(retail.df$AmountSpent[retail.df$Gender =="Male"])
ggplot(data = retail.df, aes(x = AmountSpent))+
geom_histogram(bins = 50, colour = 'white', fill = 'lightgreen')+
scale_x_continuous()+
scale_y_continuous()+
xlab("Amount Spent")+
ylab("Frequency")+
ggtitle("Distribution of Amount Spent faceted by gender")+
labs(subtitle = "red line is male's average spent, and pink's female's")+
facet_wrap(~Gender)+
geom_vline(xintercept = mean_AmountSpent_female, color = 'pink',size=1.5)+
geom_vline(xintercept = mean_AmountSpent_male, color = 'red', alpha= 0.6)
The distribution of Amount Spent closely mirrors the distribution of salaries by gender (and, as we have seen, the two are positively correlated). Males spend on average 37.3% more than females.
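A quick hedged check of that figure, reusing the two group means computed above:
round(100 * (mean_AmountSpent_male - mean_AmountSpent_female) / mean_AmountSpent_female, 1) #should print roughly 37.3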
K-means Clustering (a few points)
4 clusters: 4 different types of customer segments, each with its own distinguishing characteristics. The optimal number of clusters was chosen using several validity measures: total within-cluster sum of squares (the 'Elbow method'), the silhouette score and the Calinski-Harabasz index.
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(fpc)
library(clustertend)
## Package `clustertend` is deprecated. Use package `hopkins` instead.
library(ClusterR)
## Loading required package: gtools
library(data.table)
library(ggplot2)
retail.df <- raw.data[!is.na(raw.data$AmountSpent),] #rebuild retail.df from raw.data (with the original Catalogs column) for the cluster tables below
clustering.df <- cor.data #the numerically encoded data from the correlation step above
dim(clustering.df)[2] #sanity check: this should be 9
Choosing optimal number of clusters.
First, before running k-means, let's decide how many clusters we want
to generate, starting with the elbow method.
Let's also set the maximum K to consider, say 10.
# Choosing optimal number of clusters ----------------------------
#Before running k-means, let's decide how many clusters we want to generate.
#There are several ways to do this; I will start with the elbow method.
#WSS = total within-cluster sum of squares: the sum, over all clusters, of the
#squared distances between each point and its cluster centroid.
#Let's set the maximum K to cluster. Say 10:
k.max <- 10
#Create a vector of the total within sum of squares, in order to visualize it
wss <- sapply(1:k.max, function(k){kmeans(clustering.df, k, nstart=50, iter.max = 1000)$tot.withinss})
options("scipen"=999)
ggplot()+ aes(x = 1:k.max, y = wss) + geom_point() + geom_line()+
labs(x = "Number of clusters K", y = "Total within-clusters sum of squares")+
scale_x_continuous(breaks = seq(0,10,1))+
ggtitle("The Elbow Method")
#install.packages("broom")
#install.packages("broom", type="binary")
library(factoextra)
#if(!require(broom)) install.packages("broom",repos = "http://cran.us.r-project.org")
#We can use the built in function persented to us in class, fviz_nbclust:
#remove.packages("rlang")
#remove.packages("dplyr")
#remove.packages("tibble")
#install.packages("rlang")
#install.packages("dplyr")
#install.packages("tibble")
library(rlang)
##
## Attaching package: 'rlang'
## The following object is masked from 'package:gtools':
##
## chr
## The following object is masked from 'package:data.table':
##
## :=
library(dplyr)
library(tibble)
Total WSS decreases with every additional cluster. However, it is also
easy to see that the change becomes much less significant after, some
would say the 3rd, some would say the 4th cluster.
While the 'Elbow method' did not provide conclusive evidence on how
many clusters would be optimal for our data, it did give us the
impression that it would be either 2 or 4.
fviz_nbclust(clustering.df, FUN = kmeans,method = "wss" ,nstart = 50)
Silhouette Score
The silhouette score of a point x is s(x) = (b(x) − a(x)) / max(a(x), b(x)), where a(x) is the average distance between x and the other points of its own cluster, and b(x) is the minimum average distance between x and the points of the closest neighboring cluster. The difference between these two averages is normalized by the maximum of the two. In simple words, if all points were assigned optimally, b(x) would greatly exceed a(x) and the score would be close to +1. On the other hand, if all points were assigned to the wrong cluster, we would get a score close to -1. A score of 0 simply means that there is a neighboring cluster that would fit a point as well as the cluster it was originally assigned to.
#When looking at the Elbow Method, one cannot tell for sure what the optimal
#number of clusters K is: it could be either 3 or 4 (some would say only 2).
#Therefore we look at the silhouette score,
#using the built-in function Optimal_Clusters_KMeans:
opt.k.sil <- Optimal_Clusters_KMeans(clustering.df, max_clusters=10, plot_clusters=TRUE,
                                     criterion="silhouette")
#Both 2 and 4 clusters generated a high silhouette score of about 0.59.
#Combining that with the WSS output, we conclude that the optimal number of clusters is 4.
Both 2 and 4 clusters generated a high silhouette score of about 0.59.
Combining that with the WSS output, we can conclude that the optimal
number of clusters is 4.
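As a cross-check, a minimal sketch (assuming clustering.df from above) that recomputes the average silhouette width for K = 4 with cluster::silhouette; the result should be in the same ballpark as the 0.59 reported above:
library(cluster)
km4 <- kmeans(clustering.df, centers = 4, nstart = 50)
sil <- silhouette(km4$cluster, dist(clustering.df))
mean(sil[, "sil_width"]) #average silhouette width over all points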
Calinski-Harabasz index
#Calinski-Harabasz index: the ratio of between-cluster dispersion to
#within-cluster dispersion, scaled by the degrees of freedom; higher is better.
#The final nail in the coffin: compare the index between 2 and 4 clusters
km_2k <- kmeans(clustering.df, 2)
km_4k <- kmeans(clustering.df, 4)
round(calinhara(clustering.df,km_2k$cluster),digits=1)
## [1] 2257.6
round(calinhara(clustering.df,km_4k$cluster),digits=1)
## [1] 4018.9
#It is obvious now that 4 clusters would be best and we can move on
# Clustering ---------------------------------------------------
#We can start our clustering
retail.df <- raw.data[!is.na(raw.data$AmountSpent),] #rebuild retail.df with the original columns
retail.df$History <- NULL #drop History before attaching the cluster labels
KMC <- kmeans(clustering.df, centers = 4, iter.max = 999, nstart = 50)
The higher the index, the better the clustering. In our data, the Calinski-Harabasz index for 2 clusters is 2257.6, while for 4 clusters it is 4018.9.
After conducting 3 independent validity measures, total within-cluster sum of squares, silhouette score and the Calinski-Harabasz index, we can conclude that the optimal number of clusters for our data is 4.
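To make the index concrete, here is a hedged sketch that computes it by hand from a kmeans fit, CH = (B / (k − 1)) / (W / (n − k)); the result should come close to the fpc::calinhara values above:
km <- kmeans(clustering.df, centers = 4, nstart = 50)
n <- nrow(clustering.df)
k <- 4
B <- km$betweenss    #between-cluster sum of squares
W <- km$tot.withinss #total within-cluster sum of squares
(B / (k - 1)) / (W / (n - k)) #Calinski-Harabasz index for this fit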
retail.clustered <- cbind(retail.df, cluster = KMC$cluster)
# Create a new DF, consistent with the original DF,
# with the cluster number for each observation
table_of_cluster_distribution <- table(retail.clustered$cluster) # the result:
# 1 2 3 4
# 157 285 283 269
#barplot(table_of_cluster_distribution, xlab="Clusters",
# ylab="# of customers", main="# of customers in each cluster",
# col="#69b3a2")
retail.clustered <- data.table(retail.clustered)
retail.clustered[, avg_AmountSpent_in_cluster := mean(AmountSpent),by=list(cluster)]
retail.clustered[, avg_SalarySpent_in_cluster := mean(Salary),by=list(cluster)]
retail.clustered <- retail.clustered[, c("Age", "Gender", "OwnHome", "Married",
"Location", "Children", "Catalogs", "Salary","AmountSpent",
"avg_AmountSpent_in_cluster", "avg_SalarySpent_in_cluster", "cluster" )]
cluster_1 <- retail.clustered[retail.clustered$cluster==1,]
cluster_2 <- retail.clustered[retail.clustered$cluster==2,]
cluster_3 <- retail.clustered[retail.clustered$cluster==3,]
cluster_4 <- retail.clustered[retail.clustered$cluster==4,]
#View(cluster_1)
lapply(retail.clustered[,1:7],table)
## $Age
##
## Middle Old Young
## 504 205 285
##
## $Gender
##
## Female Male
## 501 493
##
## $OwnHome
##
## Own Rent
## 514 480
##
## $Married
##
## Married Single
## 500 494
##
## $Location
##
## Close Far
## 706 288
##
## $Children
##
## 0 1 2 3
## 462 267 143 122
##
## $Catalogs
##
## 6 12 18 24
## 250 280 232 232
data.with.clustering <- cbind(clustering.df, retail.clustered)
View(data.with.clustering)
#install.packages("imager")
#library(imager)
#fpath <- system.file('C:/Users/Admin/Desktop/Oregon/Spring 2022/Data Science/Project/Clustering-with-K-means-in-R/Plot_photos/6Clustering_Results.png',package='imager')
#im <- load.image('C:/Users/Admin/Desktop/Oregon/Spring 2022/Data Science/Project/Clustering-with-K-means-in-R/Plot_photos/6Clustering_Results.png')
#plot(im)
“Table”
Results
Cluster 1: mostly young, single women with no children, who rent their homes. The cluster has the lowest average salary and the lowest average amount spent.
Cluster 2: middle-aged and old men and women, mostly with no children or a single child, who buy mid-range products. The second-lowest average salary and amount spent.
Cluster 3: mostly middle-aged men who own their homes. Every single person in the cluster is married. Relative to the other clusters, they have the highest share of 2 and 3 children. They earn the highest salaries and spend the most.
Cluster 4: middle-aged customers who mostly own their homes and are married. They buy high-end products and spend the second-highest amount.
More Pointers:
Cluster 1 holds the "least valuable" customers when it comes to
generating money; however, we have no data about the purchasing
frequency of this segment. The picture this segment depicts is of a
young female student who neither earns nor spends a lot of money.
Cluster 3, on the other hand, consists of middle-aged men with a steady
high income who have children and spend the highest amount. Cluster 2
is middle-aged and old customers who don't make a lot of money and
don't spend much; cluster 4 looks similar demographically, but its
middle-aged customers do have high salaries and do spend a lot, mostly
on high-end products.
Overall Idea:
The results are clear, and each clustered segment has distinct
distinguishing characteristics. That is why these methods are widely
used in the marketing industry, both for segmenting clients and for
matching products similar to what the clients have already bought.
The result is more satisfied clients; satisfied clients tend to use the
service/product more often, which generates more income. All parties
are satisfied, and the social surplus increases.
Compare Clusters
One way to compare the learned weights is to compare the expressive power of their respective classifiers, for example via the squared L2 loss:
L = Σ_c Σ_{x∈c} ‖x − c‖²
where c ranges over the cluster centroids and x over every data point
assigned to that cluster. The smaller the loss, the better the
classifier.
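A hedged sketch showing that this loss is exactly what kmeans() reports as tot.withinss (assuming clustering.df from above):
km <- kmeans(clustering.df, centers = 4, nstart = 50)
loss <- sum(sapply(1:4, function(cl) {
  pts <- as.matrix(clustering.df[km$cluster == cl, ])
  sum(sweep(pts, 2, km$centers[cl, ])^2) #squared L2 of each point to its centroid
}))
all.equal(loss, km$tot.withinss) #TRUE, up to floating point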
Also, comparing the cluster weights might not tell us much about
individual runs of k-means.
Say we want to compare the weights of two k-means runs, where the i-th
run produces k centroids C_i = {c_i1, c_i2, …, c_ik}. The problem with
directly comparing clusters at the same index across different k-means
runs is that clusters are not localized to a specific index: a centroid
that appears at the first index in one run might appear at a different
index in another run. Here is a very naive approach:
D[a, b] = ‖c_ia − c_jb‖²
This is the distance between every centroid of run i and every centroid of run j; when comparing a run with itself, only the upper triangle has to be calculated, due to the symmetry of the distance function used (squared L2). We can then calculate the total distance between two k-means runs as
D(i, j) = Σ_a min_b D[a, b]
This approach is naive: it does not enforce that each index shows up only once in the minimum (a 1-to-1 mapping between the centroids of the two runs). To generalize the approach to n k-means runs, we can construct another distance matrix like the one above, whose (i, j) entry is D(i, j). Naive as it is, it does measure the distance between two k-means runs, and it has the nice property that if two runs learned exactly the same centroids, regardless of order, then D(i, j) = 0.
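A hedged sketch of this matching distance between two runs (assuming clustering.df from above; the seeds are arbitrary):
set.seed(1); km_i <- kmeans(clustering.df, 4, nstart = 1)
set.seed(2); km_j <- kmeans(clustering.df, 4, nstart = 1)
centers_all <- rbind(km_i$centers, km_j$centers)
D <- as.matrix(dist(centers_all))[1:4, 5:8]^2 #squared L2 between every centroid pair
sum(apply(D, 1, min)) #D(i, j): each centroid of run i matched to its nearest in run j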
An empirical way to quantify the similarity between clusterings may indeed be the squared L2 distance.
If we have cluster labels, we can also compute the similarity between clusterings with the consensus index, which is just the average similarity over all clustering pairs.
For example, we might want to use the Adjusted Mutual Information
(AMI) as the similarity measure between two clusterings U and V. The
consensus index (CI) between n clusterings U_1, …, U_n is then defined as:
CI = (2 / (n(n − 1))) Σ_{i<j} AMI(U_i, U_j)
If we use an adjusted similarity measure for the clustering comparisons, the CI equals 0 when the clusterings are random and independent, and 1 when they are identical.
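A hedged sketch of this consensus index over a handful of k-means runs, assuming the aricode package (install.packages("aricode")) for the AMI computation:
library(aricode)
runs <- lapply(1:5, function(s) { set.seed(s); kmeans(clustering.df, 4, nstart = 1)$cluster })
pairs <- combn(length(runs), 2) #all clustering pairs
mean(apply(pairs, 2, function(p) AMI(runs[[p[1]]], runs[[p[2]]]))) #CI: 1 = identical, ~0 = random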