INTRODUCTION

Customers are clustered so that the business can target them with offers and incentives personalized to their needs and preferences, and so make its customer marketing more effective.

DATASET SUMMARY

The data set was obtained from Kaggle.

Data Description:

Customer ID: customer/buyer identity number

Age: age of the customer

Age Group: age category

Postcode: customer location

Gender: sex

Favourite Cookie: cookie purchased

Cookies bought each week: number of cookies bought each week

Dataset overview

options(repos = list(CRAN="http://cran.rstudio.com/"))
library(readr)
cookie_business <- read_csv("cookie_business.csv")
head(cookie_business)
## # A tibble: 6 × 7
##   `Customer ID`   Age `Age Group` Postcode Gender `Favourite Cookie` Cookies b…¹
##           <dbl> <dbl> <chr>          <dbl> <chr>  <chr>                    <dbl>
## 1          1001    60 60-69           2000 M      Choc chip                    1
## 2          1002    53 50-59           2010 M      Choc chip                    1
## 3          1003    22 20-29           2010 F      Choc chip                    2
## 4          1004    30 30-39           2010 F      Choc chip                    6
## 5          1005    52 50-59           2010 F      Macadamia                    3
## 6          1006    22 20-29           2022 F      Macadamia                    3
## # … with abbreviated variable name ¹​`Cookies bought each week`

Removing missing values

We check for missing values because they introduce inconsistencies in the data that can bias the analysis.
sum(is.na(cookie_business))
## [1] 0
Since there are no missing values, we proceed to load the libraries and run a basic descriptive analysis of the data. Had there been missing data, we would have removed the corresponding rows.
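The row removal described above was not needed for this dataset, but it can be sketched in base R as follows. The data frame `toy` and its values are hypothetical and only illustrate the mechanics.

```r
# Hypothetical toy data with two missing cells (not taken from the cookie dataset)
toy <- data.frame(
  Age    = c(60, NA, 22),
  Gender = c("M", "M", NA),
  stringsAsFactors = FALSE
)

# Count missing cells overall and per column
n_missing   <- sum(is.na(toy))
per_column  <- colSums(is.na(toy))

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
```

`na.omit(toy)` would give the same rows; `complete.cases()` is used here because it makes the row filter explicit.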

Libraries used

Below are the libraries that will be used for clustering and for visualizing the analysis.
requiredPackages <- c("factoextra", "flexclust", "fpc", "clustertend", "cluster", "ClusterR", "grid",
                      "lattice", "modeltools", "stats4", "seriation", "devtools", "moments")
for (i in requiredPackages) {
  if (!requireNamespace(i, quietly = TRUE)) install.packages(i)  # install only if missing
  library(i, character.only = TRUE)                              # then attach the package
}

library(factoextra)
library(seriation)
library(flexclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(grid)
library(lattice)
library(modeltools)
library(stats4)
library(devtools)
library(moments)

Transformation of data
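The report does not show the transformation step, nor how the objects `df` and `random_df` used later are built. The summary below lists Gender with values 1 to 2 and Favourite Cookie with values 1 to 6, which suggests the categorical columns were converted to integer codes. A minimal sketch of one way this could be done is given here, using the six rows printed by `head()` above. The names `cookie_toy`, `encoded`, `df_toy` and `random_toy`, the choice of columns, the scaling, and the uniform sampling for the random reference data are all assumptions, not the author's code.

```r
# First six rows as printed by head() above; everything else is assumed
cookie_toy <- data.frame(
  Age      = c(60, 53, 22, 30, 52, 22),
  Postcode = c(2000, 2010, 2010, 2010, 2010, 2022),
  Gender   = c("M", "M", "F", "F", "F", "F"),
  `Favourite Cookie` = c("Choc chip", "Choc chip", "Choc chip",
                         "Choc chip", "Macadamia", "Macadamia"),
  `Cookies bought each week` = c(1, 1, 2, 6, 3, 3),
  check.names = FALSE
)

# Convert categorical columns to integer codes (levels are sorted alphabetically)
encoded <- cookie_toy
encoded$Gender             <- as.integer(factor(encoded$Gender))
encoded$`Favourite Cookie` <- as.integer(factor(encoded$`Favourite Cookie`))

# Assumption: standardise so that no single variable dominates the distances
df_toy <- scale(encoded)

# Assumption: a random reference set drawn uniformly within each column's range
set.seed(123)
random_toy <- apply(df_toy, 2, function(col) runif(length(col), min(col), max(col)))
```

Integer codes impose an arbitrary ordering on categories such as cookie type, so distances computed on them should be interpreted with care.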

DATA SUMMARY.

Basic data summaries are computed to give a statistical overview of the data.
At this point we compute descriptive statistics, namely measures of central tendency, dispersion and shape.
summary(cookie_business)
##   Customer ID        Age           Postcode        Gender      Favourite Cookie
##  Min.   :1001   Min.   :12.00   Min.   :2000   Min.   :1.000   Min.   :1.000   
##  1st Qu.:1012   1st Qu.:20.25   1st Qu.:2000   1st Qu.:1.000   1st Qu.:1.250   
##  Median :1024   Median :31.50   Median :2014   Median :1.000   Median :2.000   
##  Mean   :1024   Mean   :34.17   Mean   :2136   Mean   :1.413   Mean   :2.826   
##  3rd Qu.:1035   3rd Qu.:44.75   3rd Qu.:2296   3rd Qu.:2.000   3rd Qu.:4.750   
##  Max.   :1046   Max.   :68.00   Max.   :2873   Max.   :2.000   Max.   :6.000   
##  Cookies bought each week
##  Min.   : 1.000          
##  1st Qu.: 1.250          
##  Median : 3.000          
##  Mean   : 3.957          
##  3rd Qu.: 5.000          
##  Max.   :20.000
skewness(cookie_business)
##              Customer ID                      Age                 Postcode 
##                0.0000000                0.5806652                1.7950816 
##                   Gender         Favourite Cookie Cookies bought each week 
##                0.3532086                0.6955363                2.2700018
kurtosis(cookie_business)
##              Customer ID                      Age                 Postcode 
##                 1.798865                 2.137290                 6.251993 
##                   Gender         Favourite Cookie Cookies bought each week 
##                 1.124756                 2.007562                 9.471059
var(cookie_business)
##                          Customer ID          Age     Postcode      Gender
## Customer ID              180.1666667   -75.622222   828.511111  0.12222222
## Age                      -75.6222222   262.102415 -1306.705314  1.05990338
## Postcode                 828.5111111 -1306.705314 41744.796135  8.73043478
## Gender                     0.1222222     1.059903     8.730435  0.24782609
## Favourite Cookie           3.7333333    -1.057971     5.816425  0.05120773
## Cookies bought each week   3.4888889   -12.592271   149.320773 -0.29275362
##                          Favourite Cookie Cookies bought each week
## Customer ID                    3.73333333                3.4888889
## Age                           -1.05797101              -12.5922705
## Postcode                       5.81642512              149.3207729
## Gender                         0.05120773               -0.2927536
## Favourite Cookie               3.16908213               -1.2299517
## Cookies bought each week      -1.22995169               13.3758454
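To make the shape measures easier to read, the sketch below recomputes them by hand for the Customer ID column. It assumes that the column holds the 46 consecutive integers 1001 to 1046, which is an inference from the reported minimum, maximum and variance rather than something stated in the report. Skewness and kurtosis are taken as ratios of central moments (dividing by n), which is the definition that reproduces the values above; `central_moment` is a hypothetical helper.

```r
# Assumption: Customer ID is the sequence 1001..1046
x <- 1001:1046

# k-th central moment, dividing by n
central_moment <- function(x, k) mean((x - mean(x))^k)

skew <- central_moment(x, 3) / central_moment(x, 2)^1.5  # 0 for a symmetric sample
kurt <- central_moment(x, 4) / central_moment(x, 2)^2    # Pearson kurtosis, 3 for a normal distribution

round(c(skewness = skew, kurtosis = kurt, variance = var(x)), 6)
```

Because the kurtosis here is not the excess kurtosis, values should be compared with 3 rather than 0. On that scale, Cookies bought each week (9.47) and Postcode (6.25) are heavy-tailed, which agrees with their strong positive skewness.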

Assessing Clustering Tendency and visual inspection of data

Statistical method of assessing clustering tendency (Hopkins statistic)

At this stage we use the Hopkins statistic to test the spatial randomness of the data, that is, whether the data set is uniformly distributed (the null hypothesis) or shows a tendency to cluster.
# Compute Hopkins statistic for df
set.seed(123)
hopkins(df, n = nrow(df)-1)
## $H
## [1] 0.3261212
# Compute Hopkins statistic for random data
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)
## $H
## [1] 0.5656129
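With the hopkins() function from clustertend, values of H well below 0.5 point to clusterable data, while uniformly distributed data give values around 0.5. Here H = 0.33 for df and H = 0.57 for random_df, so df shows more clustering tendency than the randomly generated data, for which we fail to reject the null hypothesis of spatial randomness. No formal significance test is run, so this is an indication rather than proof of cluster structure.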
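The idea behind the statistic can be sketched in base R. This is a simplified illustration, not the implementation used by clustertend or factoextra: the function name `hopkins_sketch` is hypothetical, the dimension exponent used in some definitions is omitted, and the result is written on the scale where values near 1 indicate clustering and values near 0.5 indicate randomness. Some implementations report one minus this quantity.

```r
# Simplified Hopkins-type statistic (hypothetical helper, for illustration only)
hopkins_sketch <- function(x, m = floor(nrow(x) / 2)) {
  x <- as.matrix(x)
  idx <- sample(nrow(x), m)

  # w: distance from each sampled real point to its nearest other real point
  w <- vapply(idx, function(i) {
    others <- x[-i, , drop = FALSE]
    min(sqrt(rowSums(sweep(others, 2, x[i, ])^2)))
  }, numeric(1))

  # u: distance from uniform points in the bounding box to the nearest real point
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  u <- vapply(seq_len(m), function(j) {
    p <- runif(ncol(x), lo, hi)
    min(sqrt(rowSums(sweep(x, 2, p)^2)))
  }, numeric(1))

  sum(u) / (sum(u) + sum(w))
}

# Hypothetical toy data: two tight, well-separated groups
set.seed(123)
clustered <- rbind(matrix(rnorm(100, 0, 0.1), ncol = 2),
                   matrix(rnorm(100, 5, 0.1), ncol = 2))
h_clustered <- hopkins_sketch(clustered)
```

For clustered data the uniform points land far from any observation while real points sit close to their neighbours, so the ratio moves well away from 0.5.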

Visual methods for testing clustering tendency

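For the visual assessment of clustering tendency, we use an image of the ordered dissimilarity matrix, in which clusters show up as square blocks of similar shade along the diagonal. Note that get_clust_tendency() in recent factoextra versions appears to report the Hopkins statistic on the opposite scale to hopkins() above, so here values above 0.5 indicate clustering tendency.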
set.seed(123)
get_clust_tendency(df, n=nrow(df)-1, graph=TRUE, gradient=list(low = "white",mid="Grey", high = "Dark grey"))
## $hopkins_stat
## [1] 0.7135737
## 
## $plot

get_clust_tendency(random_df, n=nrow(random_df)-1, graph=TRUE, gradient=list(low = "white",mid="Grey", high = "Dark grey") )
## $hopkins_stat
## [1] 0.3789834
## 
## $plot
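The ordered dissimilarity image can be approximated in base R as sketched below. The toy data are hypothetical, and ordering the observations by a Ward hierarchical clustering is an assumption about how such images are typically ordered, not a statement about what get_clust_tendency() does internally.

```r
# Hypothetical toy data: two groups of 20 points each
set.seed(123)
toy <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 3, 0.3), ncol = 2))

# Pairwise Euclidean distances, reordered so that similar points are adjacent
d   <- dist(toy)
ord <- hclust(d, method = "ward.D2")$order
ordered_d <- as.matrix(d)[ord, ord]

# Light cells = small distances, dark cells = large distances
image(ordered_d, col = rev(grey.colors(50)), axes = FALSE,
      main = "Ordered dissimilarity matrix (toy data)")
```

With two well-separated groups the image shows two light blocks on the diagonal; for uniformly random data no such blocks appear.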

We go on to check the clustering tendency of the data visually by plotting a 2-cluster k-means solution.
# Plotting df dataset with 2 clusters
k2 <- kmeans(df, centers = 2, nstart = 25)
fviz_cluster(k2, data = df)

The plot suggests that the df dataset has meaningful clusters, in contrast to the random dataset.
# Plotting random dataset with 2 clusters
k2_random <- kmeans(random_df, centers = 2, nstart = 25)  # separate name so the df fit in k2 is kept
fviz_cluster(k2_random, data = random_df)

The dissimilarity matrix image supports the presence of a cluster structure in df but not in the random data.

Optimal number of clusters

Here the gap statistic and the silhouette method are used to determine the number of clusters, with both the k-means and PAM algorithms.
# Number of clusters for k-means, using the gap statistic and then the silhouette method
fviz_nbclust(df, kmeans, method = "gap_stat")

fviz_nbclust(df, kmeans, method = "silhouette")

# Number of clusters for PAM, using the gap statistic and then the silhouette method
fviz_nbclust(df, pam, method ="gap_stat")+theme_minimal()

fviz_nbclust(df, pam, method ="silhouette")+theme_minimal()

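From the above graphs, the gap statistic and the silhouette method suggest 4 and 2 clusters respectively for k-means, and 3 and 2 clusters respectively for PAM. The two algorithms therefore agree on 2 clusters under the silhouette method and differ only under the gap statistic. These plots alone do not show that one algorithm is more efficient than the other, so both k = 2 and k = 4 are examined below for k-means and PAM.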
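The silhouette method used above amounts to computing the average silhouette width for a range of k and keeping the k with the largest value. The sketch below shows this in base R on hypothetical toy data; `avg_silhouette` is a hand-written helper for illustration, not the function called by fviz_nbclust().

```r
# Average silhouette width for a given partition (hypothetical helper)
avg_silhouette <- function(x, cl) {
  d  <- as.matrix(dist(x))
  ks <- sort(unique(cl))
  s <- vapply(seq_len(nrow(d)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)  # usual convention for singleton clusters
    # a: mean distance to the other members of the point's own cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(vapply(setdiff(ks, cl[i]),
                    function(k) mean(d[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

# Hypothetical toy data: two well-separated groups of 30 points
set.seed(123)
toy <- rbind(matrix(rnorm(60, 0, 0.5), ncol = 2),
             matrix(rnorm(60, 10, 0.5), ncol = 2))

# Average silhouette width for k = 2..5; the largest value marks the suggested k
k_range <- 2:5
widths <- vapply(k_range, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  avg_silhouette(toy, fit$cluster)
}, numeric(1))
best_k <- k_range[which.max(widths)]
```

The gap statistic follows a different logic: it compares the within-cluster dispersion with that expected under a reference distribution, which is why the two methods can suggest different numbers of clusters.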

Doing further analysis with the Cluster silhouette plot

KMEANS

k2cluster <- kmeans(df, centers = 2, nstart = 25)
fviz_cluster(k2cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("Kmeans for 2 clusters")

# Silhouette widths for the 2-cluster k-means solution
sil <- silhouette(k2cluster$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   30          0.61
## 2       2   16          0.55

library(ggpubr)
# Cluster plot and silhouette plot side by side, both for the k-means fit on df
ggarrange(fviz_cluster(k2cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("Kmeans for 2 clusters"), fviz_silhouette(sil), ncol = 2, nrow = 1)
##   cluster size ave.sil.width
## 1       1   30          0.61
## 2       2   16          0.55

k4cluster <- kmeans(df, centers = 4, nstart = 25)
fviz_cluster(k4cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("Kmeans for 4 clusters")

# Silhouette widths for the 4-cluster k-means solution
sil <- silhouette(k4cluster$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   10          0.51
## 2       2   17          0.43
## 3       3   13          0.45
## 4       4    6          0.53

ggarrange(fviz_cluster(k4cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("Kmeans for 4 clusters"), fviz_silhouette(sil), ncol = 2, nrow = 1)
##   cluster size ave.sil.width
## 1       1   10          0.51
## 2       2   17          0.43
## 3       3   13          0.45
## 4       4    6          0.53

PAM

k2cluster <- pam(df, 2, nstart = 25)
fviz_cluster(k2cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("PAM for 2 clusters")

# pam() stores the cluster assignment in $clustering
sil <- silhouette(k2cluster$clustering, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   17          0.51
## 2       2   29          0.62

ggarrange(fviz_cluster(k2cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("PAM for 2 clusters"), fviz_silhouette(sil), ncol = 2, nrow = 1)
##   cluster size ave.sil.width
## 1       1   17          0.51
## 2       2   29          0.62

k4cluster <- pam(df, 4, nstart = 25)
fviz_cluster(k4cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("PAM for 4 clusters")

# pam() stores the cluster assignment in $clustering
sil <- silhouette(k4cluster$clustering, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1    9          0.52
## 2       2   17          0.43
## 3       3   13          0.49
## 4       4    7          0.40

ggarrange(fviz_cluster(k4cluster, data = df, ellipse.type = "convex", geom = "point") + ggtitle("PAM for 4 clusters"), fviz_silhouette(sil), ncol = 2, nrow = 1)
##   cluster size ave.sil.width
## 1       1    9          0.52
## 2       2   17          0.43
## 3       3   13          0.49
## 4       4    7          0.40

Analysis of clustering results

Using the average silhouette width, we can identify the number of clusters that best fits the dataset under the k-means and PAM algorithms. A silhouette width close to 1 indicates that a sample sits well inside its cluster, while a width near zero indicates that the sample is very close to the decision boundary between two neighbouring clusters.
Under k-means, the average silhouette width for 2 and 4 clusters is 0.59 and 0.47 respectively, while under PAM it is 0.58 and 0.46. For both algorithms the 4-cluster solution therefore has the lower width: it lies closer to 0, so its clusters are less well separated than those of the 2-cluster solution.
In conclusion, the silhouette criterion favours 2 clusters for both k-means and PAM. The 4-cluster solution, which the gap statistic selected for k-means, remains usable (every cluster has an average width of at least 0.40) and is the one to prefer when finer customer segments are needed for targeted offers.
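The overall averages quoted in this section can be recovered from the per-cluster tables printed by fviz_silhouette() as size-weighted means. The sketch below does this arithmetic; `weighted_sil` is a hypothetical helper, and because the per-cluster widths are already rounded to two decimals the results are approximate.

```r
# Size-weighted mean of per-cluster average silhouette widths
weighted_sil <- function(size, width) sum(size * width) / sum(size)

# Cluster sizes and widths copied from the output tables above
overall <- c(
  kmeans_k2 = weighted_sil(c(30, 16),        c(0.61, 0.55)),
  kmeans_k4 = weighted_sil(c(10, 17, 13, 6), c(0.51, 0.43, 0.45, 0.53)),
  pam_k2    = weighted_sil(c(17, 29),        c(0.51, 0.62)),
  pam_k4    = weighted_sil(c(9, 17, 13, 7),  c(0.52, 0.43, 0.49, 0.40))
)
round(overall, 2)
# kmeans_k2 kmeans_k4    pam_k2    pam_k4
#      0.59      0.47      0.58      0.46
```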