RESEARCH QUESTION: How do the characteristics of different groups of customers differ from one another?
EXPLANATION: We want to prepare an optimized marketing strategy for advertising our mall to different groups of customers based on their characteristics. Clustering is the most suitable method here, because it gives us a clear view of the number of groups and of each group's characteristics, which we can use to build the marketing plan.
dataframe <- read.csv("./primer/Mall_Customers.csv")
colnames(dataframe)[4] = "Annual_Income"
colnames(dataframe)[5] = "Spending_Score"
dataframe <- as.data.frame(dataframe)
str(dataframe)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual_Income : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending_Score: int 39 81 6 77 40 76 6 94 3 72 ...
head(dataframe)
## CustomerID Gender Age Annual_Income Spending_Score
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
DESCRIPTION OF THE DATA
Our unit of observation is a customer of our mall. The sample size is 200 units.
Description of variables: Gender (Male or Female), Age (in completed years), Annual income (in thousands of dollars), Spending Score (1-100; a higher score indicates greater spending by the customer).
SOURCE OF THE DATA
Dataset: Choudhary, VJ. “Customer Segmentation Tutorial in Python.” Kaggle, 2022. https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
summary(dataframe)
## CustomerID Gender Age Annual_Income
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending_Score
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
DESCRIPTIVE STATISTICS
Half of the customers in our sample are 36 years old or younger, while the average age is just under 39. The lowest annual income in the sample is 15 thousand dollars and the highest is 137 thousand dollars, giving a range of 122 thousand dollars. The spending score is fairly evenly distributed across customers (median 50, mean 50.2).
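As a quick numeric check of the figures quoted above (a sketch using base R only, not part of the original analysis):
#Numeric checks behind the descriptive statistics
diff(range(dataframe$Annual_Income))   # range of annual income in $1000s (137 - 15 = 122)
median(dataframe$Age)                  # half of the customers are this age or younger
mean(dataframe$Age)                    # average age, just under 39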
#Standardization of the data
variables_to_standardize <- dataframe[, c("Age", "Annual_Income", "Spending_Score")]
standardized_data <- scale(variables_to_standardize)
standardized_data <- as.data.frame(standardized_data)
#The "Razlicnost" (Slovenian for "dissimilarity") column is the Euclidean distance of each unit from the centre (the origin) of the standardized data
standardized_data$Razlicnost = sqrt(standardized_data$Age^2 + standardized_data$Annual_Income^2 + standardized_data$Spending_Score^2)
#Check for possible outliers: list the five units farthest from the centre
Razlicnost <- head(standardized_data[order(-standardized_data$Razlicnost), ], 5)
print(Razlicnost)
## Age Annual_Income Spending_Score Razlicnost
## 200 -0.6335454 2.910368 1.270160 3.238044
## 199 -0.4903713 2.910368 -1.246925 3.203986
## 9 1.8004143 -1.582351 -1.827791 3.014323
## 11 2.0151754 -1.582351 -1.401823 2.920595
## 3 -1.3494159 -1.696572 -1.711618 2.762049
The five most distant units lie only slightly more than three standardized units from the centre, so no obvious outlier was found.
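As an additional outlier screen (a sketch, not part of the original analysis), Mahalanobis distances can be compared against a chi-squared cutoff; with three variables, df = 3:
#Alternative outlier check using Mahalanobis distance
X <- standardized_data[c("Age", "Annual_Income", "Spending_Score")]
md <- mahalanobis(X, center = colMeans(X), cov = cov(X))
cutoff <- qchisq(0.999, df = 3)   # conservative 99.9% quantile
which(md > cutoff)                # indices of units flagged as potential outliers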
#Visualization of distances between units
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Razdalje <- get_dist(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
method="euclidean")
fviz_dist(Razdalje)
Based on this distance matrix, we can expect around six groups.
#Checking the clustering tendency using the Hopkins statistic
get_clust_tendency(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
n = nrow(standardized_data) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6862594
##
## $plot
## NULL
The Hopkins statistic is about 0.69; values clearly above 0.5 indicate a tendency towards clustering, so we treat the data as containing natural groups.
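As a baseline for comparison (a sketch, not part of the original analysis), the same statistic computed on uniformly random data of identical dimensions should land near 0.5, i.e. no clustering tendency:
#Hopkins statistic on random data as a baseline
set.seed(1)
vars <- standardized_data[c("Age", "Annual_Income", "Spending_Score")]
random_data <- as.data.frame(lapply(vars, function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_data, n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat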
#Implementation of hierarchical clustering
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
WARD <- standardized_data[c("Age", "Annual_Income", "Spending_Score")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 200
#Dendrogram for visual representation of the number of groups
library(factoextra)
fviz_dend(WARD, k = 2,
cex = 0.1,
palette = "jama",
color_labels_by_k = TRUE,
rect = FALSE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Although the dendrogram here is coloured by its two main branches (k = 2), the finer branch structure is still consistent with dividing our customers into six natural groups.
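As an informal cross-check of the dendrogram (a sketch; hcut defaults to Ward linkage), the silhouette and elbow heuristics from factoextra can be plotted for the same hierarchical procedure:
#Silhouette and elbow heuristics for the number of groups
fviz_nbclust(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
hcut, method = "silhouette", k.max = 7)
fviz_nbclust(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
hcut, method = "wss", k.max = 7)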
#Review of the optimal number of groups based on indices
set.seed(1)
library(dplyr)
library(NbClust)
library(factoextra)
OptimalGroupNumber <- standardized_data[c("Age", "Annual_Income", "Spending_Score")] %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 7,
method = "ward.D2",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 9 proposed 6 as the best number of clusters
## * 3 proposed 7 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 6
##
##
## *******************************************************************
The majority of indices suggest six clusters, so we proceed with six clusters.
#Classification of units into groups
dataframe$ClassificationWARD <- cutree(WARD,
k = 6)
standardized_data$ClassificationWARD <- cutree(WARD,
k = 6)
head(dataframe[, c("CustomerID", "ClassificationWARD")])
## CustomerID ClassificationWARD
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
## 5 5 1
## 6 6 2
#K-means: initial centroids ("Zac_voditelji") taken as the group means of the Ward solution
Zac_voditelji <- aggregate(standardized_data[, c("Age", "Annual_Income", "Spending_Score")],
by = list(standardized_data$ClassificationWARD),
FUN = mean)
Zac_voditelji
## Group.1 Age Annual_Income Spending_Score
## 1 1 0.3914510 -1.3244867 -1.15891524
## 2 2 -1.0051162 -1.3303378 1.16320677
## 3 3 -0.8212625 -0.1160830 -0.16866621
## 4 4 1.2563527 -0.2006917 -0.07142498
## 5 5 -0.4408110 0.9891010 1.23640011
## 6 6 0.3610033 1.1698473 -1.29809671
#K-means reclassification (hierarchical k-means)
library(factoextra)
K_means <- hkmeans(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
k = 6,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_means
## Hierarchical K-means clustering with 6 clusters of sizes 21, 24, 38, 45, 39, 33
##
## Cluster means:
## Age Annual_Income Spending_Score
## 1 0.4777583 -1.3049552 -1.19344867
## 2 -0.9735839 -1.3221791 1.03458649
## 3 -0.8709130 -0.1135003 -0.09334615
## 4 1.2515802 -0.2396117 -0.04388764
## 5 -0.4408110 0.9891010 1.23640011
## 6 0.2211606 1.0805138 -1.28682305
##
## Clustering vector:
## [1] 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
## [38] 2 1 2 4 2 1 2 1 2 4 3 3 3 4 3 3 4 4 4 4 4 3 4 4 3 4 4 4 3 4 4 3 3 4 4 4 4
## [75] 4 3 4 3 3 4 4 3 4 4 3 4 4 3 3 4 4 3 4 3 3 3 4 3 4 3 3 4 4 3 4 3 4 4 4 4 4
## [112] 3 3 3 3 3 4 4 4 4 3 3 3 5 3 5 6 5 6 5 6 5 3 5 6 5 6 5 3 5 6 5 3 5 6 5 6 5
## [149] 6 5 6 5 6 5 6 5 6 5 6 5 4 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
## [186] 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5
##
## Within cluster sum of squares by cluster:
## [1] 20.52332 11.71664 20.20990 23.87015 22.36267 34.51630
## (between_SS / total_SS = 77.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
Between-group differences account for 77.7% of the total variability in the data.
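The 77.7% figure can be recomputed directly from the components listed above:
#Between-group share of total variability
K_means$betweenss / K_means$totss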
#Visual presentation of clusters
fviz_cluster(K_means,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
#Classification - WARD
table(standardized_data$ClassificationWARD)
##
## 1 2 3 4 5 6
## 22 21 45 45 39 28
#Classification - K-means
cluster_assignments <- K_means$cluster
standardized_data$Classification_K_means <- cluster_assignments
table(standardized_data$Classification_K_means)
##
## 1 2 3 4 5 6
## 21 24 38 45 39 33
#Comparison of classifications
table(standardized_data$ClassificationWARD, standardized_data$Classification_K_means)
##
## 1 2 3 4 5 6
## 1 21 1 0 0 0 0
## 2 0 21 0 0 0 0
## 3 0 2 38 2 0 3
## 4 0 0 0 43 0 2
## 5 0 0 0 0 39 0
## 6 0 0 0 0 0 28
Ten units were reassigned to a different group by the k-means step (the off-diagonal counts in the table above).
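To quantify the agreement between the two partitions (a sketch; assumes the mclust package is installed), the adjusted Rand index equals 1 for identical partitions and is close to 0 for unrelated ones:
#Agreement between the Ward and k-means classifications
library(mclust)
adjustedRandIndex(standardized_data$ClassificationWARD,
standardized_data$Classification_K_means)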
#Centers of groups
Means <- K_means$centers
Means
## Age Annual_Income Spending_Score
## 1 0.4777583 -1.3049552 -1.19344867
## 2 -0.9735839 -1.3221791 1.03458649
## 3 -0.8709130 -0.1135003 -0.09334615
## 4 1.2515802 -0.2396117 -0.04388764
## 5 -0.4408110 0.9891010 1.23640011
## 6 0.2211606 1.0805138 -1.28682305
#Plot of group properties
library(ggplot2)
library(tidyr)
Group_properties <- as.data.frame(Means)
Group_properties$id <- 1:nrow(Group_properties)
Group_properties <- pivot_longer(Group_properties, cols = c("Age", "Annual_Income", "Spending_Score"))
Group_properties$Group <- factor(Group_properties$id,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("1", "2", "3", "4", "5", "6"))
Group_properties$nameFactor <- factor(Group_properties$name,
levels = c("Age", "Annual_Income", "Spending_Score"),
labels = c("Age", "Annual_Income", "Spending_Score"))
ggplot(Group_properties, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group), size = 3) +
geom_line(aes(group = id, linetype = Group), linewidth = 1) +
ylab("Mean") +
xlab("Properties") +
ylim(-1.5, 1.5)
#Checking that the clustering variables separate the groups (one-way ANOVA)
fit_Age <- aov(Age ~ as.factor(Classification_K_means), data = standardized_data)
fit_Annual_Income <- aov(Annual_Income ~ as.factor(Classification_K_means), data = standardized_data)
fit_Spending_Score <- aov(Spending_Score ~ as.factor(Classification_K_means), data = standardized_data)
summary(fit_Age)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Classification_K_means) 5 136.05 27.209 83.85 <2e-16 ***
## Residuals 194 62.95 0.324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Annual_Income)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Classification_K_means) 5 157.47 31.494 147.1 <2e-16 ***
## Residuals 194 41.53 0.214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Spending_Score)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Classification_K_means) 5 170.28 34.06 230.1 <2e-16 ***
## Residuals 194 28.72 0.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject H0 for all three variables (p < 0.001), which means that all three variables successfully separate the groups. Note that these tests are descriptive rather than confirmatory, since the same variables were used to form the groups.
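Beyond the overall F-tests, a post-hoc sketch with Tukey's HSD (base R) shows which specific pairs of groups differ on each variable:
#Post-hoc pairwise comparisons between groups
TukeyHSD(fit_Age)
TukeyHSD(fit_Annual_Income)
TukeyHSD(fit_Spending_Score)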
#Group profiling and validation based on Gender
Gender <- dataframe$Gender
standardized_data$Gender <- Gender
standardized_data <- mutate(standardized_data, Gender = ifelse(Gender == "Male", 0, 1))
aggregate(standardized_data$Gender,
by = list(standardized_data$Classification_K_means),
FUN = mean)
## Group.1 x
## 1 1 0.6190476
## 2 2 0.5833333
## 3 3 0.6315789
## 4 4 0.5777778
## 5 5 0.5384615
## 6 6 0.4242424
With Gender coded as 0 = Male and 1 = Female, the means above give the share of women in each group: Group 3 has the highest share of female customers (63.2%), while Group 6 has the lowest (42.4%).
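The same profile can be read off a cross-tabulation (a sketch), which avoids relying on the 0/1 coding:
#Gender shares within each k-means cluster
prop.table(table(Cluster = standardized_data$Classification_K_means,
Gender = dataframe$Gender), margin = 1)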
#Chi-squared test of association between gender and group membership
chisq <- chisq.test(standardized_data$Gender, as.factor(standardized_data$Classification_K_means))
chisq
##
## Pearson's Chi-squared test
##
## data: standardized_data$Gender and as.factor(standardized_data$Classification_K_means)
## X-squared = 3.7398, df = 5, p-value = 0.5875
We cannot reject H0 at the 5% significance level (p = 0.588), which means there is no statistically significant association between gender and group membership.
#Standardized residuals
standardized_data$Gender <- as.character(standardized_data$Gender)
round(chisq$res, 2)
##
## standardized_data$Gender 1 2 3 4 5 6
## 0 -0.41 -0.17 -0.67 -0.18 0.20 1.18
## 1 0.36 0.15 0.59 0.16 -0.18 -1.04
All standardized residuals are below 1.96 in absolute value, indicating no statistically significant deviation from the expected number of men or women in any group. We therefore could not validate the clustering using gender.
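For reference, the expected counts under independence can be extracted from the same test object (a sketch):
#Expected counts under independence
round(chisq$expected, 1)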
DESCRIPTION AND RESULTS:
We classified 200 customers based on three standardized variables: age, annual income and spending score.
For the hierarchical step we used Ward's clustering algorithm, and based on the dendrogram and the indices that track the increase in heterogeneity we decided to form six groups. We then refined the classification with the k-means method, and finally profiled the groups using the dichotomous variable Gender.
The largest share of customers falls in Cluster 4 (22.5%), which represents older people with slightly below-average annual income and spending score. In general, younger customers spend more than average even though their annual income is below average, and the pattern is reversed for older customers. There was no statistically significant deviation from the expected number of men or women in any group, so gender is not associated with membership in a specific group.
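To make the clusters easier to act on in a marketing plan, a final sketch profiles them in the original units (years, thousands of dollars, score points); the column name Cluster_K_means is introduced here for illustration:
#Cluster profiles in original units
dataframe$Cluster_K_means <- K_means$cluster
aggregate(dataframe[, c("Age", "Annual_Income", "Spending_Score")],
by = list(Cluster = dataframe$Cluster_K_means),
FUN = mean)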
Q: How do the characteristics of different groups of customers differ from one another? A: We can group our sample of customers into six clusters using cluster analysis. In general, younger people spend more than average in our mall while earning below-average annual incomes, whereas older people tend to spend less than average while earning roughly average incomes. Annual income and spending score are broadly related; the exceptions are the customers with the highest and the lowest incomes, who split into groups that spend the opposite of what that relationship would predict.