Efficient use of KMEANS and PAM clustering algorithms in the cookie business

INTRODUCTION

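In any business, including the cookie business, it is vital for the owner to understand the needs of different kinds of customers so as to improve service delivery towards them, which in turn increases profitability.

Customers are clustered so that the business can target them with offers and incentives personalized to their needs and preferences, and thereby market to them more effectively.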

DATASET SUMMARY

The data set was obtained from Kaggle.

Data Description:

Customer ID: Customer (buyer) identity number

Age: Age of the customer

Age Group: Age category

Postcode: Customer location

Gender: Sex of the customer

Favourite Cookie: Cookie purchased

Cookies bought each week: Number of cookies bought each week

Dataset overview

options(repos = list(CRAN="http://cran.rstudio.com/"))
library(readr)
cookie_business <- read_csv("cookie_business.csv")
head(cookie_business)

## # A tibble: 6 × 7
##   `Customer ID`   Age `Age Group` Postcode Gender `Favourite Cookie` Cookies b…¹
##           <dbl> <dbl> <chr>          <dbl> <chr>  <chr>                    <dbl>
## 1          1001    60 60-69           2000 M      Choc chip                    1
## 2          1002    53 50-59           2010 M      Choc chip                    1
## 3          1003    22 20-29           2010 F      Choc chip                    2
## 4          1004    30 30-39           2010 F      Choc chip                    6
## 5          1005    52 50-59           2010 F      Macadamia                    3
## 6          1006    22 20-29           2022 F      Macadamia                    3
## # … with abbreviated variable name ¹`Cookies bought each week`

Removing missing values

We do this to remove inconsistencies within the data, which would otherwise bias the analysis.

# Count the missing cells across the whole data frame
sum(is.na(cookie_business))

## [1] 0

Since there are no missing values, we proceed to load the libraries and carry out a basic descriptive analysis of the data. Had there been missing data, we would have removed the affected rows.
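Had the check returned a non-zero count, the incomplete rows could be dropped along the following lines. This is only a minimal sketch on a made-up data frame: toy_cookies and its values are hypothetical and are not part of the cookie data.

```r
# Hypothetical toy data with two incomplete rows (not the real cookie data)
toy_cookies <- data.frame(
  Age     = c(25, NA, 41, 33),
  Cookies = c(2, 5, NA, 1)
)

# Number of missing cells in the whole data frame
n_missing <- sum(is.na(toy_cookies))

# Keep only the rows in which every column is observed
toy_complete <- toy_cookies[complete.cases(toy_cookies), ]

n_missing
nrow(toy_complete)
```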

Libraries used

Below are the libraries that will be used for clustering and for visualizing the analysis.

requiredPackages = c("factoextra","flexclust", "fpc", "clustertend", "cluster","ClusterR", "grid",
                     "lattice","modeltools","stats4", "seriation", "devtools",  "moments")
# Install any package that is missing, then attach every package in the list
for(i in requiredPackages){if(!requireNamespace(i, quietly = TRUE)) install.packages(i)}
for(i in requiredPackages){library(i, character.only = TRUE)}

library(factoextra)
library(seriation)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(grid)
library(lattice)
library(modeltools)
library(stats4)
library(devtools)
library(moments)

Transformation of data

Since gender and favourite cookie cannot be used as character values, we convert them to numeric codes so that they can be used in clustering. I also dropped the age group and used the age itself for clustering.

cookie_business$Gender[cookie_business$Gender == 'F'] <- 1
cookie_business$Gender[cookie_business$Gender == 'M'] <- 2
cookie_business$`Favourite Cookie`[cookie_business$`Favourite Cookie` == "Choc chip"] <- 1
cookie_business$`Favourite Cookie`[cookie_business$`Favourite Cookie` == "Macadamia"] <- 2
cookie_business$`Favourite Cookie`[cookie_business$`Favourite Cookie` == "Mint"] <- 3
cookie_business$`Favourite Cookie`[cookie_business$`Favourite Cookie` == "Salted caramel"] <- 4
cookie_business$`Favourite Cookie`[cookie_business$`Favourite Cookie` == "Granola"] <- 5
cookie_business$`Favourite Cookie`[cookie_business$`Favourite Cookie` == "Triple choc"] <- 6
cookie_business[3] <- NULL
cookie_business$Gender <- as.numeric(cookie_business$Gender)
cookie_business$`Favourite Cookie`<- as.numeric(cookie_business$`Favourite Cookie`)
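The same recoding can be written more compactly with factors. The sketch below is illustrative only: toy_cookie is a made-up vector, and the level order is an assumption chosen to reproduce the codes 1 to 6 used above. Note that integer codes of this kind impose an ordering on the cookie types, which distance-based algorithms will treat as meaningful.

```r
# Hypothetical sample of favourite cookies (illustrative values only)
toy_cookie <- c("Mint", "Choc chip", "Granola", "Choc chip")

# Level order chosen to match the numeric codes used above (1 to 6)
cookie_levels <- c("Choc chip", "Macadamia", "Mint",
                   "Salted caramel", "Granola", "Triple choc")

# factor() maps each label to its position in cookie_levels
toy_code <- as.integer(factor(toy_cookie, levels = cookie_levels))
toy_code
```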

DATA SUMMARY

Basic data summaries are computed so that we can get a statistical overview of the data.

At this point we compute descriptive statistics, namely measures of central tendency, dispersion and shape.

summary(cookie_business)

##   Customer ID        Age           Postcode        Gender      Favourite Cookie
##  Min.   :1001   Min.   :12.00   Min.   :2000   Min.   :1.000   Min.   :1.000   
##  1st Qu.:1012   1st Qu.:20.25   1st Qu.:2000   1st Qu.:1.000   1st Qu.:1.250   
##  Median :1024   Median :31.50   Median :2014   Median :1.000   Median :2.000   
##  Mean   :1024   Mean   :34.17   Mean   :2136   Mean   :1.413   Mean   :2.826   
##  3rd Qu.:1035   3rd Qu.:44.75   3rd Qu.:2296   3rd Qu.:2.000   3rd Qu.:4.750   
##  Max.   :1046   Max.   :68.00   Max.   :2873   Max.   :2.000   Max.   :6.000   
##  Cookies bought each week
##  Min.   : 1.000          
##  1st Qu.: 1.250          
##  Median : 3.000          
##  Mean   : 3.957          
##  3rd Qu.: 5.000          
##  Max.   :20.000

skewness(cookie_business)

##              Customer ID                      Age                 Postcode 
##                0.0000000                0.5806652                1.7950816 
##                   Gender         Favourite Cookie Cookies bought each week 
##                0.3532086                0.6955363                2.2700018

kurtosis(cookie_business)

##              Customer ID                      Age                 Postcode 
##                 1.798865                 2.137290                 6.251993 
##                   Gender         Favourite Cookie Cookies bought each week 
##                 1.124756                 2.007562                 9.471059

var(cookie_business)

##                          Customer ID          Age     Postcode      Gender
## Customer ID              180.1666667   -75.622222   828.511111  0.12222222
## Age                      -75.6222222   262.102415 -1306.705314  1.05990338
## Postcode                 828.5111111 -1306.705314 41744.796135  8.73043478
## Gender                     0.1222222     1.059903     8.730435  0.24782609
## Favourite Cookie           3.7333333    -1.057971     5.816425  0.05120773
## Cookies bought each week   3.4888889   -12.592271   149.320773 -0.29275362
##                          Favourite Cookie Cookies bought each week
## Customer ID                    3.73333333                3.4888889
## Age                           -1.05797101              -12.5922705
## Postcode                       5.81642512              149.3207729
## Gender                         0.05120773               -0.2927536
## Favourite Cookie               3.16908213               -1.2299517
## Cookies bought each week      -1.22995169               13.3758454
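To make the shape measures concrete, the sketch below computes skewness and kurtosis by hand from central moments. It assumes the moment-based definitions (under which a normal distribution has kurtosis 3), which are taken here to match what the moments package reports; toy_weekly is a made-up vector, not the cookie data.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skew_manual <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (Pearson) kurtosis: a normal distribution scores 3
kurt_manual <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Hypothetical weekly purchases with one heavy buyer (illustrative only)
toy_weekly <- c(1, 1, 2, 3, 3, 5, 20)
skew_manual(toy_weekly)
kurt_manual(toy_weekly)
```

Read this way, the kurtosis of 9.47 for Cookies bought each week, together with its positive skewness, points to a few customers who buy far more than the rest.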

Assessing Clustering Tendency and visual inspection of data

We then proceed to cluster the cookie business data. First, however, we need to find out whether the data is clusterable, that is, whether it contains meaningful clusters, and if so, how many clusters there are.

df <- cookie_business[,c(2,4,5,6)]
library(ggplot2)


set.seed(123)
n <- nrow(df)
random_df <- data.frame(
  x = runif(nrow(df), min(df$Age), max(df$Age)),
  y = runif(nrow(df), min(df$Gender), max(df$Gender)))
# Plot the data
ggplot(random_df, aes(x, y)) + geom_point() + ggtitle("Random Data")

ggplot(df, aes(x=Age, y=Gender)) + ggtitle("Age vs Gender") +
  geom_point() +  # Scatter plot
  geom_density_2d()
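The random data generated here cover only Age and Gender. A uniform reference set that spans every clustering variable could be sketched as follows; toy_df is a hypothetical stand-in for the real data frame and its values are made up.

```r
set.seed(123)

# Hypothetical stand-in for the clustering variables (not the real data)
toy_df <- data.frame(
  Age     = c(22, 30, 52, 60, 19),
  Gender  = c(1, 1, 2, 2, 1),
  Cookies = c(2, 6, 3, 1, 4)
)

# Draw each column uniformly between that column's observed minimum and maximum
toy_random <- as.data.frame(
  lapply(toy_df, function(v) runif(length(v), min(v), max(v)))
)

dim(toy_random)
```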

From the above, we excluded the postcode and customer ID from the analysis. We then try out k-means on the df data frame, and the resulting plot suggests that our data are clusterable.

library(factoextra)
set.seed(123)
# K-means on df dataframe
km.res1 <- kmeans(df, 4)
# ellipse.type replaces the deprecated frame.type argument
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE)

Statistical method of assessing clustering tendency (Hopkins statistic)

At this stage we use the Hopkins statistic to assess whether our data set is uniformly distributed, thereby testing the spatial randomness of the data.

# Compute Hopkins statistic for df
set.seed(123)
hopkins(df, n = nrow(df)-1)

## $H
## [1] 0.3261212

# Compute Hopkins statistic for random data
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)

## $H
## [1] 0.5656129

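The Hopkins statistic for df (H = 0.33) lies below 0.5, whereas for the randomly generated data (random_df) it lies close to 0.5 (H = 0.57). Under the convention applied here, in which values below 0.5 point to a clustering tendency, df is therefore more clusterable than uniformly random data, and for random_df we fail to reject the null hypothesis of spatial randomness.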
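The idea behind the statistic can be sketched in base R. The function below is an illustration, not the implementation used by hopkins(): hopkins_sketch, the toy data and the exact formula (sum of real nearest-neighbour distances over the total, so that values below 0.5 suggest clustering) are all assumptions of this sketch.

```r
# Sketch of a Hopkins-type statistic; the formula and scaling are assumptions
hopkins_sketch <- function(x, m) {
  x <- as.matrix(x)
  n <- nrow(x)
  d_full <- as.matrix(dist(x))
  diag(d_full) <- Inf                      # ignore each point's distance to itself

  # w: nearest-neighbour distances for m sampled real observations
  idx <- sample(n, m)
  w <- apply(d_full[idx, , drop = FALSE], 1, min)

  # u: nearest-neighbour distances from m uniform points to the real data
  u <- numeric(m)
  for (j in seq_len(m)) {
    p <- apply(x, 2, function(v) runif(1, min(v), max(v)))
    u[j] <- min(sqrt(colSums((t(x) - p)^2)))
  }

  sum(w) / (sum(u) + sum(w))               # below 0.5 suggests clustering
}

set.seed(123)
# Hypothetical data: two tight, well-separated groups versus uniform noise
toy_blobs <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.1), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.1), ncol = 2)
)
toy_unif <- matrix(runif(80), ncol = 2)

h_blobs <- hopkins_sketch(toy_blobs, m = 10)
h_unif  <- hopkins_sketch(toy_unif,  m = 10)
c(blobs = h_blobs, uniform = h_unif)
```

In clustered data the real points sit much closer to their neighbours than uniformly placed points do, which pulls the ratio towards 0; for uniform data the two sets of distances are comparable and the ratio stays near 0.5.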

Visual methods for testing clustering tendency

For the visual assessment of clustering tendency, we use an image of the ordered dissimilarity matrix, in which groups of similar observations show up as small blocks along the diagonal; each block corresponds to a potential cluster.

set.seed(123)
get_clust_tendency(df, n=nrow(df)-1, graph=TRUE, gradient=list(low = "white",mid="Grey", high = "Dark grey"))

## $hopkins_stat
## [1] 0.7135737
## 
## $plot

get_clust_tendency(random_df, n=nrow(random_df)-1, graph=TRUE, gradient=list(low = "white",mid="Grey", high = "Dark grey") )

## $hopkins_stat
## [1] 0.3789834
## 
## $plot

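Note that the hopkins_stat values returned by get_clust_tendency() above (0.71 for df, 0.38 for random_df) run in the opposite direction to those from hopkins(), which suggests that the two functions report the statistic on opposite scales; either way, df departs further from 0.5 than the random data.

We go ahead to test and confirm the clustering tendencies in the data through the plotting method.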

# Plotting df dataset with 2 clusters
k2 <- kmeans(df, centers = 2, nstart = 25)
fviz_cluster(k2, data = df)

It is evident that the df dataset has meaningful clusters as compared to the random dataset.

# Plotting random dataset with 2 clusters
k2 <- kmeans(random_df, centers = 2, nstart = 25)
fviz_cluster(k2, data = random_df)

The dissimilarity matrix image confirms that there is a cluster structure in df but not the random data.
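How such an image comes about can be sketched with base R alone. The example below uses made-up two-group data, and the reordering through hclust() is an assumption of this sketch rather than a description of what get_clust_tendency() does internally.

```r
set.seed(123)

# Hypothetical two-group data (illustrative only)
toy_groups <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
# Shuffle the rows so that the groups are not already adjacent
toy_groups <- toy_groups[sample(nrow(toy_groups)), ]

# Pairwise Euclidean distances, reordered by a hierarchical clustering
d_toy <- dist(toy_groups)
ord   <- hclust(d_toy, method = "average")$order
d_ord <- as.matrix(d_toy)[ord, ord]

# Pale blocks along the diagonal mark groups of mutually close observations
image(d_ord, col = gray.colors(20, start = 1, end = 0.3), axes = FALSE,
      main = "Ordered dissimilarity matrix (toy data)")
```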

Optimal number of clusters

Here I used the gap statistic and the silhouette method to determine the number of clusters under the kmeans and PAM algorithms.

#Using the kmeans to test for number of clusters using both gap stat and silhouette
fviz_nbclust(df, kmeans, method = "gap_stat")

fviz_nbclust(df, kmeans, method = "silhouette")

#Using the PAM to test for number of clusters using both gap stat and silhouette
fviz_nbclust(df, pam, method ="gap_stat")+theme_minimal()

fviz_nbclust(df, pam, method ="silhouette")+theme_minimal()

From the above graphs, the gap statistic and the silhouette method suggest 4 and 2 clusters respectively for kmeans, and 3 and 2 clusters respectively for PAM. On this basis, kmeans appears to be more efficient than PAM on my dataset.
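The silhouette criterion behind these plots can be reproduced by hand: fit the algorithm for several values of k and keep the k with the largest average silhouette width. The sketch below does this for kmeans on made-up data with three well-separated groups; toy_three, avg_sil and best_k are hypothetical names, and the loop is not a reproduction of the internals of fviz_nbclust().

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Hypothetical data with three well-separated groups (illustrative only)
toy_three <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# Average silhouette width of the kmeans solution for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy_three, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, dist(toy_three))[, "sil_width"])
})
names(avg_sil) <- 2:6

# The suggested number of clusters is the k with the largest average width
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
best_k
```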

Doing further analysis with the Cluster silhouette plot

KMEANS

k2cluster <- kmeans(df, centers = 2, nstart = 25)
fviz_cluster(k2cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("Kmeans for 2 clusters")

sil<-silhouette(k2cluster$cluster, dist(df))
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1   30          0.61
## 2       2   16          0.55

library(ggpubr)
ggarrange(fviz_cluster(k2cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("Kmeans for 2 clusters"),fviz_silhouette(sil) , ncol=2, nrow = 1)

##   cluster size ave.sil.width
## 1       1   30          0.61
## 2       2   16          0.55

k4cluster <- kmeans(df, 4, nstart = 25)
fviz_cluster(k4cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("Kmeans for 4 clusters")

sil<-silhouette(k4cluster$cluster, dist(df))
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1   10          0.51
## 2       2   17          0.43
## 3       3   13          0.45
## 4       4    6          0.53

library(ggpubr)
ggarrange(fviz_cluster(k4cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("Kmeans for 4 clusters"),fviz_silhouette(sil) , ncol=2, nrow = 1)

##   cluster size ave.sil.width
## 1       1   10          0.51
## 2       2   17          0.43
## 3       3   13          0.45
## 4       4    6          0.53

PAM

k2cluster <- pam(df, 2, nstart = 25)
fviz_cluster(k2cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("PAM for 2 clusters")

# pam objects store the cluster labels in $clustering
sil<-silhouette(k2cluster$clustering, dist(df))
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1   17          0.51
## 2       2   29          0.62

library(ggpubr)
ggarrange(fviz_cluster(k2cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("PAM for 2 clusters"),fviz_silhouette(sil) , ncol=2, nrow = 1)

##   cluster size ave.sil.width
## 1       1   17          0.51
## 2       2   29          0.62

k4cluster <- pam(df, 4, nstart = 25)
fviz_cluster(k4cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("PAM for 4 clusters")

# pam objects store the cluster labels in $clustering
sil<-silhouette(k4cluster$clustering, dist(df))
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1    9          0.52
## 2       2   17          0.43
## 3       3   13          0.49
## 4       4    7          0.40

library(ggpubr)
ggarrange(fviz_cluster(k4cluster, data = df, ellipse.type="convex", geom=c("point")) + ggtitle("PAM for 4 clusters"),fviz_silhouette(sil) , ncol=2, nrow = 1)

##   cluster size ave.sil.width
## 1       1    9          0.52
## 2       2   17          0.43
## 3       3   13          0.49
## 4       4    7          0.40
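One practical difference between the two algorithms is that PAM represents each cluster by a medoid, which is an actual observation, whereas kmeans represents it by a mean. The sketch below shows this on a made-up age vector; toy_ages and its values are hypothetical.

```r
library(cluster)  # provides pam()

# Hypothetical one-variable data with two obvious groups (illustrative only)
toy_ages <- data.frame(Age = c(18, 20, 23, 58, 61, 66))

pam_fit <- pam(toy_ages, k = 2)
km_fit  <- kmeans(toy_ages, centers = 2, nstart = 25)

pam_fit$medoids          # actual observations chosen as cluster representatives
sort(km_fit$centers)     # cluster means, which need not be observed values
```

For customer segments this means that a PAM cluster can be described by a real, representative customer, while a kmeans centre is an average profile.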

Analysis of clustering results

Using the average silhouette width, we can identify the number of clusters that best fits the dataset under the kmeans and PAM algorithms. An average silhouette width close to 1 indicates well-separated clusters, while a value near zero indicates that the samples lie very close to the decision boundary between two neighbouring clusters.

Under kmeans, the average silhouette width for 2 and 4 clusters is 0.59 and 0.47 respectively, while under PAM it is 0.58 and 0.46 respectively. Under both algorithms, therefore, the 4-cluster solution has the lower value, that is, the one closer to 0.
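These overall averages can be recovered, up to rounding, as size-weighted means of the per-cluster widths printed by fviz_silhouette() in the previous section. The helper overall_width below is a hypothetical name introduced only for this check, and the inputs are the rounded figures from the printed tables.

```r
# Size-weighted mean of per-cluster average silhouette widths
overall_width <- function(size, width) sum(size * width) / sum(size)

# Cluster sizes and widths copied from the fviz_silhouette() tables
kmeans_2 <- overall_width(c(30, 16),        c(0.61, 0.55))
kmeans_4 <- overall_width(c(10, 17, 13, 6), c(0.51, 0.43, 0.45, 0.53))
pam_2    <- overall_width(c(17, 29),        c(0.51, 0.62))
pam_4    <- overall_width(c(9, 17, 13, 7),  c(0.52, 0.43, 0.49, 0.40))

widths <- round(c(kmeans_2 = kmeans_2, kmeans_4 = kmeans_4,
                  pam_2 = pam_2, pam_4 = pam_4), 2)
widths
```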

Joachim Ndhokero

2/17/2022

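In conclusion, because a higher average silhouette width indicates better-separated clusters, the silhouette results favour 2 clusters over 4 for this data frame under both the KMEANS and PAM algorithms. 4 clusters, as suggested by the gap statistic for kmeans, remain an option when a finer customer segmentation is needed, at the cost of a lower silhouette width.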
