Homework 4

mydata <- read.table("~/hw17/mall clustering/Mall_Customers.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".") #Reading the dataset
head(mydata)
##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

Dataset Description:

There are 200 observations with 4 variables, excluding the customer ID.

The unit of observation is a customer.

This dataset records the spending of customers at a mall together with their age, gender, annual income, and spending score. Such data help with predicting sales trends, demographic influences, and purchasing behavior.

The source of this dataset is the Kaggle website: Credit card dataset for clustering. (2018, March 2). https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python/data

colnames(mydata) <- c("CustomerID", "Gender_beforefactor", "Age_values", "Annual.Income..k.._values", "Spending.Score..1.100._values")
mydata$Age <- scale(mydata$Age_values) #Standardizing the cluster variables
mydata$Annual.Income..k.. <- scale(mydata$Annual.Income..k.._values)
mydata$Spending.Score..1.100. <- scale(mydata$Spending.Score..1.100._values)

mydata$Gender <- factor(mydata$Gender_beforefactor, 
                              levels = c("Male", "Female"), 
                              labels = c("Male", "Female"))
mydata$Dissimilarity <- sqrt(mydata$Age^2 + mydata$Annual.Income..k..^2 + mydata$Spending.Score..1.100.^2) #Euclidean distance from the overall centroid of the standardized data, used to find outliers
head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
##     CustomerID Gender_beforefactor Age_values Annual.Income..k.._values
## 200        200                Male         30                       137
## 199        199                Male         32                       137
## 9            9                Male         64                        19
## 11          11                Male         67                        19
## 3            3              Female         20                        16
## 198        198                Male         32                       126
## 195        195              Female         47                       120
## 197        197              Female         45                       126
## 31          31                Male         60                        30
## 193        193                Male         33                       113
##     Spending.Score..1.100._values        Age Annual.Income..k..
## 200                            83 -0.6335454           2.910368
## 199                            18 -0.4903713           2.910368
## 9                               3  1.8004143          -1.582351
## 11                             14  2.0151754          -1.582351
## 3                               6 -1.3494159          -1.696572
## 198                            74 -0.4903713           2.491555
## 195                            16  0.5834344           2.263112
## 197                            28  0.4402603           2.491555
## 31                              4  1.5140661          -1.163538
## 193                             8 -0.4187842           1.996595
##     Spending.Score..1.100. Gender Dissimilarity
## 200              1.2701598   Male      3.238044
## 199             -1.2469252   Male      3.203986
## 9               -1.8277910   Male      3.014323
## 11              -1.4018227   Male      2.920595
## 3               -1.7116178 Female      2.762049
## 198              0.9216404   Male      2.701431
## 195             -1.3243740 Female      2.686268
## 197             -0.8596814 Female      2.672214
## 31              -1.7890666   Male      2.616673
## 193             -1.6341691   Male      2.613863
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")], 
                     method = "euclidean")

distance2 <- distance^2

fviz_dist(distance2) #Showing dissimilarity matrix

get_clust_tendency(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.6862594
## 
## $plot
## NULL

The data are suitable for clustering, because the Hopkins statistic (0.69) is above 0.5.
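
As an illustrative sanity check (a sketch added here, not part of the assignment output), the Hopkins statistic can be compared with the value obtained on uniformly random data of the same size and range, which should land close to 0.5. The seed and the object name random_data are arbitrary choices for the illustration.

set.seed(123) #Arbitrary seed, for reproducibility of the illustration only
random_data <- as.data.frame(
  apply(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")], 2,
        function(x) runif(length(x), min(x), max(x)))) #Uniform noise over the same ranges
get_clust_tendency(random_data,
                   n = nrow(random_data) - 1,
                   graph = FALSE)$hopkins_stat #Expected to be close to 0.5 for structureless data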

library(dplyr) #Allowing use of %>%
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
WARD <- mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")] %>%
  #scale() %>%                          #Not needed here - variables were already standardized above
  get_dist(method = "euclidean") %>%  #Selecting distance
  hclust(method = "ward.D2") #Selecting algorithm

WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 200
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Looking at the dendrogram, I will cut the sample into 6 groups.
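
As an optional cross-check of this choice (a sketch, not part of the assignment), factoextra's fviz_nbclust() can plot the elbow and average-silhouette criteria for hierarchical cuts of the same data; hcut() defaults to Ward.D2 linkage with Euclidean distance, matching the analysis above.

fviz_nbclust(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
             FUNcluster = hcut,
             method = "wss") #Elbow: total within-cluster sum of squares by k
fviz_nbclust(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
             FUNcluster = hcut,
             method = "silhouette") #Average silhouette width by k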

fviz_dend(WARD, 
          k = 6,
          cex = 0.5, 
          palette = "jama",
          color_labels_by_k = TRUE, 
          rect = TRUE)

mydata$ClusterWard <- cutree(WARD, 
                             k = 6) #Cutting the dendrogram
head(mydata)
##   CustomerID Gender_beforefactor Age_values Annual.Income..k.._values
## 1          1                Male         19                        15
## 2          2                Male         21                        15
## 3          3              Female         20                        16
## 4          4              Female         23                        16
## 5          5              Female         31                        17
## 6          6              Female         22                        17
##   Spending.Score..1.100._values        Age Annual.Income..k..
## 1                            39 -1.4210029          -1.734646
## 2                            81 -1.2778288          -1.734646
## 3                             6 -1.3494159          -1.696572
## 4                            77 -1.1346547          -1.696572
## 5                            40 -0.5619583          -1.658498
## 6                            76 -1.2062418          -1.658498
##   Spending.Score..1.100. Gender Dissimilarity ClusterWard
## 1             -0.4337131   Male      2.283934           1
## 2              1.1927111   Male      2.462601           2
## 3             -1.7116178 Female      2.762049           1
## 4              1.0378135 Female      2.289728           2
## 5             -0.3949887 Female      1.795113           1
## 6              0.9990891 Female      2.281187           2

Customer ID 1 was classified into group 1, customer ID 2 into group 2, and so on (column ClusterWard above).

Initial_leaders <- aggregate(mydata[, c("Age", "Annual.Income..k..", "Spending.Score..1.100.")], 
                             by = list(mydata$ClusterWard), 
                             FUN = mean)

Initial_leaders
##   Group.1        Age Annual.Income..k.. Spending.Score..1.100.
## 1       1  0.3914510         -1.3244867            -1.15891524
## 2       2 -1.0051162         -1.3303378             1.16320677
## 3       3 -0.8212625         -0.1160830            -0.16866621
## 4       4  1.2563527         -0.2006917            -0.07142498
## 5       5 -0.4408110          0.9891010             1.23640011
## 6       6  0.3610033          1.1698473            -1.29809671

The table above gives the positions of the initial leaders, i.e. the centroids of the six groups obtained with Ward's method; these will serve as the starting points for K-means.
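
As a sketch of what hkmeans() does internally (illustration only, with a hypothetical object name K_MEANS_manual), the same idea can be written out by hand: run a standard kmeans() with the Ward centroids above supplied as the initial centers. The analysis itself uses hkmeans() below.

K_MEANS_manual <- kmeans(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
                         centers = Initial_leaders[, -1]) #Ward centroids as starting points (dropping the Group.1 column)
K_MEANS_manual$centers #Final centroids, expected to be essentially the same as the hkmeans result below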

library(factoextra)

#Performing K-means clustering - initial leaders are the centroids of the groups found with hierarchical clustering
K_MEANS <- hkmeans(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")], 
                   k = 6,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS
## Hierarchical K-means clustering with 6 clusters of sizes 21, 24, 38, 45, 39, 33
## 
## Cluster means:
##          Age Annual.Income..k.. Spending.Score..1.100.
## 1  0.4777583         -1.3049552            -1.19344867
## 2 -0.9735839         -1.3221791             1.03458649
## 3 -0.8709130         -0.1135003            -0.09334615
## 4  1.2515802         -0.2396117            -0.04388764
## 5 -0.4408110          0.9891010             1.23640011
## 6  0.2211606          1.0805138            -1.28682305
## 
## Clustering vector:
##   [1] 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
##  [38] 2 1 2 4 2 1 2 1 2 4 3 3 3 4 3 3 4 4 4 4 4 3 4 4 3 4 4 4 3 4 4 3 3 4 4 4 4
##  [75] 4 3 4 3 3 4 4 3 4 4 3 4 4 3 3 4 4 3 4 3 3 3 4 3 4 3 3 4 4 3 4 3 4 4 4 4 4
## [112] 3 3 3 3 3 4 4 4 4 3 3 3 5 3 5 6 5 6 5 6 5 3 5 6 5 6 5 3 5 6 5 3 5 6 5 6 5
## [149] 6 5 6 5 6 5 6 5 6 5 6 5 4 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
## [186] 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5
## 
## Within cluster sum of squares by cluster:
## [1] 20.52332 11.71664 20.20990 23.87015 22.36267 34.51630
##  (between_SS / total_SS =  77.7 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

These are the final leaders (cluster means) after K-means; comparing them with the initial leaders shows that units were reclassified. The ratio of between-cluster SS to total SS is 77.7%, which is good, as it is above 50%.
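
The reported ratio can also be computed directly from the components of the result (betweenss and totss are listed among the available components above):

round(K_MEANS$betweenss / K_MEANS$totss * 100, 1) #Between SS as a share of total SS, in %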

fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("CustomerID", "ClusterWard", "ClusterK_Means")])
##   CustomerID ClusterWard ClusterK_Means
## 1          1           1              2
## 2          2           2              2
## 3          3           1              1
## 4          4           2              2
## 5          5           1              1
## 6          6           2              2

Customer ID 1 was reclassified from cluster 1 (Ward) to cluster 2 (K-means).

#Checking for reclassifications
table(mydata$ClusterWard)
## 
##  1  2  3  4  5  6 
## 22 21 45 45 39 28
table(mydata$ClusterK_Means)
## 
##  1  2  3  4  5  6 
## 21 24 38 45 39 33
table(mydata$ClusterWard, mydata$ClusterK_Means)
##    
##      1  2  3  4  5  6
##   1 21  1  0  0  0  0
##   2  0 21  0  0  0  0
##   3  0  2 38  2  0  3
##   4  0  0  0 43  0  2
##   5  0  0  0  0 39  0
##   6  0  0  0  0  0 28

Initially (Ward), there were 22 units in cluster 1, 21 in cluster 2, 45 in cluster 3, and so on. After K-means, the units were reclassified: 21 units in cluster 1, 24 in cluster 2, 38 in cluster 3, and so on. Even if a cluster's size stays the same, as with clusters 4 and 5, its membership need not be unchanged, because it can lose and gain the same number of units. For example, cluster 4 contains 45 units in both solutions, yet two of its Ward units moved to K-means cluster 6 and two units came in from Ward cluster 3, while cluster 5 kept all 39 of its original units.
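
Because the cluster labels line up between the two solutions (the large counts sit on the diagonal of the cross-tabulation), the off-diagonal cells count the reclassified units, here 10 of the 200. A small sketch, with crosstab as an illustrative object name:

crosstab <- table(mydata$ClusterWard, mydata$ClusterK_Means)
sum(crosstab) - sum(diag(crosstab)) #Number of units that changed cluster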

Centroids <- K_MEANS$centers
Centroids
##          Age Annual.Income..k.. Spending.Score..1.100.
## 1  0.4777583         -1.3049552            -1.19344867
## 2 -0.9735839         -1.3221791             1.03458649
## 3 -0.8709130         -0.1135003            -0.09334615
## 4  1.2515802         -0.2396117            -0.04388764
## 5 -0.4408110          0.9891010             1.23640011
## 6  0.2211606          1.0805138            -1.28682305
library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(Age, Annual.Income..k.., Spending.Score..1.100.))

Figure$Groups <- factor(Figure$id, 
                        levels = c(1, 2, 3, 4, 5, 6), 
                        labels = c("1", "2", "3", "4", "5", "6"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("Age", "Annual.Income..k..", "Spending.Score..1.100."), 
                            labels = c("Age", "Annual.Income..k..", "Spending.Score..1.100."))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-1.5, 1.5)

#Are all the cluster variables successful at classifying units into groups? Performing ANOVAs.

fit <- aov(cbind(Age, Annual.Income..k.., Spending.Score..1.100.)  ~ as.factor(ClusterK_Means), 
                 data = mydata)

summary(fit)
##  Response 1 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   5 136.047 27.2095  83.851 < 2.2e-16 ***
## Residuals                 194  62.953  0.3245                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   5 157.472 31.4945  147.13 < 2.2e-16 ***
## Residuals                 194  41.528  0.2141                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   5 170.281  34.056  230.06 < 2.2e-16 ***
## Residuals                 194  28.719   0.148                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: all means are the same. H1: at least one is different.

We reject H0 for all three variables (Response 1 = Age, Response 2 = Annual income, Response 3 = Spending score) at p < 0.001 and conclude that for every cluster variable at least one cluster mean differs from the others.
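
As an illustrative addition (not part of the original output), the strength of the separation can be summarized with eta-squared per variable, i.e. the between-group share of each variable's total sum of squares; for a one-way design this equals the R-squared of regressing the variable on the cluster factor. From the sums of squares above it should come out at roughly 0.68 for age, 0.79 for income, and 0.86 for spending score.

#Eta-squared per cluster variable (between SS / total SS)
sapply(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
       function(v) summary(lm(as.numeric(v) ~ as.factor(mydata$ClusterK_Means)))$r.squared)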

VALIDATION

CRITERION VALIDATION WITH A NEW VARIABLE (GENDER):

#Pearson Chi2 test

chisq_results <- chisq.test(mydata$Gender, as.factor(mydata$ClusterK_Means))
chisq_results
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Gender and as.factor(mydata$ClusterK_Means)
## X-squared = 3.7398, df = 5, p-value = 0.5875
addmargins(chisq_results$observed)
##              
## mydata$Gender   1   2   3   4   5   6 Sum
##        Male     8  10  14  19  18  19  88
##        Female  13  14  24  26  21  14 112
##        Sum     21  24  38  45  39  33 200
round(chisq_results$expected, 2)
##              
## mydata$Gender     1     2     3    4     5     6
##        Male    9.24 10.56 16.72 19.8 17.16 14.52
##        Female 11.76 13.44 21.28 25.2 21.84 18.48
round(chisq_results$residuals, 2) #Pearson residuals
##              
## mydata$Gender     1     2     3     4     5     6
##        Male   -0.41 -0.17 -0.67 -0.18  0.20  1.18
##        Female  0.36  0.15  0.59  0.16 -0.18 -1.04

H0: no association between gender and classification. H1: there is an association between gender and classification.

We cannot reject H0 at p = 0.59. There are no statistically significant differences in the gender structure between the groups.

We cannot validate the clusters with gender.

aggregate(mydata$Age_values, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")
##   Group.1        x
## 1       1 45.52381
## 2       2 25.25000
## 3       3 26.68421
## 4       4 56.33333
## 5       5 32.69231
## 6       6 41.93939
fit <- aov(Age_values ~ as.factor(ClusterK_Means), 
           data = mydata)

summary(fit)
##                            Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   5  26547    5309   83.85 <2e-16 ***
## Residuals                 194  12284      63                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: all means are the same. H1: at least one is different.

We reject H0 at p < 0.001. There are statistically significant differences in age between the clusters, so we can validate our clusters with age.

Group 4 has the oldest customers, with an average age of 56 years, followed by group 1 (45 years), group 6 (42 years), group 5 (33 years), and group 3 (27 years); the youngest is group 2, with an average age of 25 years.
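
The ANOVA only tells us that at least one cluster mean differs; as an optional follow-up sketch, Tukey's HSD on the same model shows which specific pairs of clusters differ in mean age.

TukeyHSD(aov(Age_values ~ as.factor(ClusterK_Means), data = mydata)) #Pairwise comparisons of cluster mean ages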

aggregate(mydata$Spending.Score..1.100._values, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")
##   Group.1        x
## 1       1 19.38095
## 2       2 76.91667
## 3       3 47.78947
## 4       4 49.06667
## 5       5 82.12821
## 6       6 16.96970
fit <- aov(Spending.Score..1.100._values ~ as.factor(ClusterK_Means),
           data = mydata)

summary(fit)
##                            Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   5 113553   22711   230.1 <2e-16 ***
## Residuals                 194  19151      99                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: all means are the same. H1: at least one is different.

There are statistically significant differences in spending score between the groups at p < 0.001, so we can validate the clusters with spending score.

Group 5 has the customers with the highest spending score, averaging 82 (on the possible range of 1 to 100), while group 6 has the lowest, with an average score of 17.

aggregate(mydata$Annual.Income..k.._values, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")
##   Group.1        x
## 1       1 26.28571
## 2       2 25.83333
## 3       3 57.57895
## 4       4 54.26667
## 5       5 86.53846
## 6       6 88.93939
fit <- aov(Annual.Income..k.._values ~ as.factor(ClusterK_Means),
           data = mydata)

summary(fit)
##                            Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   5 108630   21726   147.1 <2e-16 ***
## Residuals                 194  28647     148                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: all means are the same. H1: at least one is different.

There are statistically significant differences in customers' annual income between the groups at p < 0.001, so we can validate the clusters with annual income.

Group 6 has the customers with the highest annual income, 89 thousand dollars on average, while group 2 has the lowest, with an average of 26 thousand dollars per year.

Conclusion:

Our classification of 200 customers was based on three standardized variables: age, annual income (in thousands of dollars), and spending score (from 1 to 100).

For hierarchical clustering, we used Ward's algorithm and, based on the dendrogram, decided to divide the customers into 6 groups. The classification was then refined with K-means clustering.

Group 4 contains 22.5% of the customers (45/200), making it the largest of the 6 groups. It is characterized by slightly below-average annual income and below-average spending score, both still in the middle range relative to the other groups, and on average it has the oldest customers.