RESEARCH QUESTION: How do the characteristics of different groups of customers differ from one another?
EXPLANATION: We want to prepare an optimized marketing strategy for advertising our mall to different groups of customers based on their characteristics. Clustering is the most suitable method here, because it gives us a clear view of the number of groups and of each group's characteristics, which we can use to build the marketing plan.
dataframe <- read.csv("./primer/Mall_Customers.csv")
colnames(dataframe)[4] = "Annual_Income"
colnames(dataframe)[5] = "Spending_Score"
dataframe <- as.data.frame(dataframe)
str(dataframe)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual_Income : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending_Score: int 39 81 6 77 40 76 6 94 3 72 ...
head(dataframe)
## CustomerID Gender Age Annual_Income Spending_Score
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
DESCRIPTION OF THE DATA
Our unit of observation is a customer of our mall. The sample size is 200 units.
Description of variables: Gender (Male or Female), Age (in completed years), Annual income (in thousands of dollars), Spending Score (1-100; a higher score indicates greater spending by the customer).
SOURCE OF THE DATA
Dataset: Choudhary, VJ. “Customer Segmentation Tutorial in Python.” Kaggle, 2022. https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
summary(dataframe)
## CustomerID Gender Age Annual_Income
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending_Score
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
DESCRIPTIVE STATISTICS
Half of the customers in our sample are 36 years old or younger, while the average age is just under 39. The lowest annual income in the sample is 15 thousand dollars and the highest is 137 thousand dollars, giving a range of 122 thousand dollars. The spending score is fairly evenly distributed across customers (median 50, mean 50.2).
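As a quick numeric check of the figures quoted above (a sketch using base R only, not part of the original analysis):
#Numeric checks behind the descriptive statistics
diff(range(dataframe$Annual_Income))   # range of annual income in $1000s (137 - 15 = 122)
median(dataframe$Age)                  # half of the customers are this age or younger
mean(dataframe$Age)                    # average age, just under 39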
#Standardization of the data
variables_to_standardize <- dataframe[, c("Age", "Annual_Income", "Spending_Score")]
standardized_data <- scale(variables_to_standardize)
standardized_data <- as.data.frame(standardized_data)
#The "Razlicnost" (Slovenian for "dissimilarity") column is the Euclidean distance of each unit from the centre (the origin) of the standardized data
standardized_data$Razlicnost = sqrt(standardized_data$Age^2 + standardized_data$Annual_Income^2 + standardized_data$Spending_Score^2)
#Check for possible outliers: list the five units farthest from the centre
Razlicnost <- head(standardized_data[order(-standardized_data$Razlicnost), ], 5)
print(Razlicnost)
## Age Annual_Income Spending_Score Razlicnost
## 200 -0.6335454 2.910368 1.270160 3.238044
## 199 -0.4903713 2.910368 -1.246925 3.203986
## 9 1.8004143 -1.582351 -1.827791 3.014323
## 11 2.0151754 -1.582351 -1.401823 2.920595
## 3 -1.3494159 -1.696572 -1.711618 2.762049
The five most distant units lie only slightly more than three standardized units from the centre, so no obvious outlier was found.
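As an additional outlier screen (a sketch, not part of the original analysis), Mahalanobis distances can be compared against a chi-squared cutoff; with three variables, df = 3:
#Alternative outlier check using Mahalanobis distance
X <- standardized_data[c("Age", "Annual_Income", "Spending_Score")]
md <- mahalanobis(X, center = colMeans(X), cov = cov(X))
cutoff <- qchisq(0.999, df = 3)   # conservative 99.9% quantile
which(md > cutoff)                # indices of units flagged as potential outliers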
#Visualization of distances between units
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Razdalje <- get_dist(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
method="euclidean")
fviz_dist(Razdalje)
Based on this distance matrix, we can expect around six groups.
#Checking the clustering tendency using the Hopkins statistic
get_clust_tendency(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
n = nrow(standardized_data) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6862594
##
## $plot
## NULL
The Hopkins statistic is about 0.69; values clearly above 0.5 indicate a tendency towards clustering, so we treat the data as containing natural groups.
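As a baseline for comparison (a sketch, not part of the original analysis), the same statistic computed on uniformly random data of identical dimensions should land near 0.5, i.e. no clustering tendency:
#Hopkins statistic on random data as a baseline
set.seed(1)
vars <- standardized_data[c("Age", "Annual_Income", "Spending_Score")]
random_data <- as.data.frame(lapply(vars, function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_data, n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat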
#Implementation of hierarchical clustering
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
WARD <- standardized_data[c("Age", "Annual_Income", "Spending_Score")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 200
#Dendrogram for visual representation of the number of groups
library(factoextra)
fviz_dend(WARD, k = 2,
cex = 0.1,
palette = "jama",
color_labels_by_k = TRUE,
rect = FALSE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Although the dendrogram here is coloured by its two main branches (k = 2), the finer branch structure is still consistent with dividing our customers into six natural groups.
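As an informal cross-check of the dendrogram (a sketch; hcut defaults to Ward linkage), the silhouette and elbow heuristics from factoextra can be plotted for the same hierarchical procedure:
#Silhouette and elbow heuristics for the number of groups
fviz_nbclust(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
hcut, method = "silhouette", k.max = 7)
fviz_nbclust(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
hcut, method = "wss", k.max = 7)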
#Review of the optimal number of groups based on indices
set.seed(1)
library(dplyr)
library(NbClust)
library(factoextra)
OptimalGroupNumber <- standardized_data[c("Age", "Annual_Income", "Spending_Score")] %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 7,
method = "ward.D2",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 9 proposed 6 as the best number of clusters
## * 3 proposed 7 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 6
##
##
## *******************************************************************
The majority of indices suggest six clusters, so we proceed with six clusters.
#Classification of units into groups
dataframe$ClassificationWARD <- cutree(WARD,
k = 6)
standardized_data$ClassificationWARD <- cutree(WARD,
k = 6)
head(dataframe[, c("CustomerID", "ClassificationWARD")])
## CustomerID ClassificationWARD
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
## 5 5 1
## 6 6 2
#K-means: initial centroids ("Zac_voditelji") taken as the group means of the Ward solution
Zac_voditelji <- aggregate(standardized_data[, c("Age", "Annual_Income", "Spending_Score")],
by = list(standardized_data$ClassificationWARD),
FUN = mean)
Zac_voditelji
## Group.1 Age Annual_Income Spending_Score
## 1 1 0.3914510 -1.3244867 -1.15891524
## 2 2 -1.0051162 -1.3303378 1.16320677
## 3 3 -0.8212625 -0.1160830 -0.16866621
## 4 4 1.2563527 -0.2006917 -0.07142498
## 5 5 -0.4408110 0.9891010 1.23640011
## 6 6 0.3610033 1.1698473 -1.29809671
#K-means reclassification (hierarchical k-means)
library(factoextra)
K_means <- hkmeans(standardized_data[c("Age", "Annual_Income", "Spending_Score")],
k = 6,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_means
## Hierarchical K-means clustering with 6 clusters of sizes 21, 24, 38, 45, 39, 33
##
## Cluster means:
## Age Annual_Income Spending_Score
## 1 0.4777583 -1.3049552 -1.19344867
## 2 -0.9735839 -1.3221791 1.03458649
## 3 -0.8709130 -0.1135003 -0.09334615
## 4 1.2515802 -0.2396117 -0.04388764
## 5 -0.4408110 0.9891010 1.23640011
## 6 0.2211606 1.0805138 -1.28682305
##
## Clustering vector:
## [1] 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
## [38] 2 1 2 4 2 1 2 1 2 4 3 3 3 4 3 3 4 4 4 4 4 3 4 4 3 4 4 4 3 4 4 3 3 4 4 4 4
## [75] 4 3 4 3 3 4 4 3 4 4 3 4 4 3 3 4 4 3 4 3 3 3 4 3 4 3 3 4 4 3 4 3 4 4 4 4 4
## [112] 3 3 3 3 3 4 4 4 4 3 3 3 5 3 5 6 5 6 5 6 5 3 5 6 5 6 5 3 5 6 5 3 5 6 5 6 5
## [149] 6 5 6 5 6 5 6 5 6 5 6 5 4 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
## [186] 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5
##
## Within cluster sum of squares by cluster:
## [1] 20.52332 11.71664 20.20990 23.87015 22.36267 34.51630
## (between_SS / total_SS = 77.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
Between-group differences account for 77.7% of the total variability in the data.
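The 77.7% figure can be recomputed directly from the components listed above:
#Between-group share of total variability
K_means$betweenss / K_means$totss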
#Visual presentation of clusters
fviz_cluster(K_means,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
#Classification - WARD
table(standardized_data$ClassificationWARD)
##
## 1 2 3 4 5 6
## 22 21 45 45 39 28
#Classification - K-means
cluster_assignments <- K_means$cluster
standardized_data$Classification_K_means <- cluster_assignments
table(standardized_data$Classification_K_means)
##
## 1 2 3 4 5 6
## 21 24 38 45 39 33
#Comparison of classifications
table(standardized_data$ClassificationWARD, standardized_data$Classification_K_means)
##
## 1 2 3 4 5 6
## 1 21 1 0 0 0 0
## 2 0 21 0 0 0 0
## 3 0 2 38 2 0 3
## 4 0 0 0 43 0 2
## 5 0 0 0 0 39 0
## 6 0 0 0 0 0 28
Ten units were reassigned to a different group by the k-means step (the off-diagonal counts in the table above).
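To quantify the agreement between the two partitions (a sketch; assumes the mclust package is installed), the adjusted Rand index equals 1 for identical partitions and is close to 0 for unrelated ones:
#Agreement between the Ward and k-means classifications
library(mclust)
adjustedRandIndex(standardized_data$ClassificationWARD,
standardized_data$Classification_K_means)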
#Centers of groups
Means <- K_means$centers
Means
## Age Annual_Income Spending_Score
## 1 0.4777583 -1.3049552 -1.19344867
## 2 -0.9735839 -1.3221791 1.03458649
## 3 -0.8709130 -0.1135003 -0.09334615
## 4 1.2515802 -0.2396117 -0.04388764
## 5 -0.4408110 0.9891010 1.23640011
## 6 0.2211606 1.0805138 -1.28682305
#Plot of group properties
library(ggplot2)
library(tidyr)
Group_properties <- as.data.frame(Means)
Group_properties$id <- 1:nrow(Group_properties)
Group_properties <- pivot_longer(Group_properties, cols = c("Age", "Annual_Income", "Spending_Score"))
Group_properties$Group <- factor(Group_properties$id,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("1", "2", "3", "4", "5", "6"))
Group_properties$nameFactor <- factor(Group_properties$name,
levels = c("Age", "Annual_Income", "Spending_Score"),
labels = c("Age", "Annual_Income", "Spending_Score"))
ggplot(Group_properties, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group), size = 3) +
geom_line(aes(group = id, linetype = Group), linewidth = 1) +
ylab("Mean") +
xlab("Properties") +
ylim(-1.5, 1.5)
#Checking that the clustering variables separate the groups (one-way ANOVA)
fit_Age <- aov(Age ~ as.factor(Classification_K_means), data = standardized_data)
fit_Annual_Income <- aov(Annual_Income ~ as.factor(Classification_K_means), data = standardized_data)
fit_Spending_Score <- aov(Spending_Score ~ as.factor(Classification_K_means), data = standardized_data)
summary(fit_Age)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Classification_K_means) 5 136.05 27.209 83.85 <2e-16 ***
## Residuals 194 62.95 0.324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Annual_Income)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Classification_K_means) 5 157.47 31.494 147.1 <2e-16 ***
## Residuals 194 41.53 0.214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Spending_Score)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Classification_K_means) 5 170.28 34.06 230.1 <2e-16 ***
## Residuals 194 28.72 0.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject H0 for all three variables (p < 0.001), which means that all three variables successfully separate the groups. Note that these tests are descriptive rather than confirmatory, since the same variables were used to form the groups.
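Beyond the overall F-tests, a post-hoc sketch with Tukey's HSD (base R) shows which specific pairs of groups differ on each variable:
#Post-hoc pairwise comparisons between groups
TukeyHSD(fit_Age)
TukeyHSD(fit_Annual_Income)
TukeyHSD(fit_Spending_Score)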
#Group profiling and validation based on Gender
Gender <- dataframe$Gender
standardized_data$Gender <- Gender
standardized_data <- mutate(standardized_data, Gender = ifelse(Gender == "Male", 0, 1))
aggregate(standardized_data$Gender,
by = list(standardized_data$Classification_K_means),
FUN = mean)
## Group.1 x
## 1 1 0.6190476
## 2 2 0.5833333
## 3 3 0.6315789
## 4 4 0.5777778
## 5 5 0.5384615
## 6 6 0.4242424
With Gender coded as 0 = Male and 1 = Female, the means above give the share of women in each group: Group 3 has the highest share of female customers (63.2%), while Group 6 has the lowest (42.4%).
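The same profile can be read off a cross-tabulation (a sketch), which avoids relying on the 0/1 coding:
#Gender shares within each k-means cluster
prop.table(table(Cluster = standardized_data$Classification_K_means,
Gender = dataframe$Gender), margin = 1)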
#Chi-squared test of association between gender and group membership
chisq <- chisq.test(standardized_data$Gender, as.factor(standardized_data$Classification_K_means))
chisq
##
## Pearson's Chi-squared test
##
## data: standardized_data$Gender and as.factor(standardized_data$Classification_K_means)
## X-squared = 3.7398, df = 5, p-value = 0.5875
We cannot reject H0 at the 5% significance level (p = 0.588), which means there is no statistically significant association between gender and group membership.
#Standardized residuals
standardized_data$Gender <- as.character(standardized_data$Gender)
round(chisq$res, 2)
##
## standardized_data$Gender 1 2 3 4 5 6
## 0 -0.41 -0.17 -0.67 -0.18 0.20 1.18
## 1 0.36 0.15 0.59 0.16 -0.18 -1.04
All standardized residuals are below 1.96 in absolute value, indicating no statistically significant deviation from the expected number of men or women in any group. We therefore could not validate the clustering using gender.
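For reference, the expected counts under independence can be extracted from the same test object (a sketch):
#Expected counts under independence
round(chisq$expected, 1)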
DESCRIPTION AND RESULTS:
We classified 200 customers based on three standardized variables: age, annual income and spending score.
For the hierarchical step we used Ward's clustering algorithm, and based on the dendrogram and the indices that track the increase in heterogeneity we decided to form six groups. We then refined the classification with the k-means method, and finally profiled the groups using the dichotomous variable Gender.
The largest share of customers falls in Cluster 4 (22.5%), which represents older people with slightly below-average annual income and spending score. In general, younger customers spend more than average even though their annual income is below average, and the pattern is reversed for older customers. There was no statistically significant deviation from the expected number of men or women in any group, so gender is not associated with membership in a specific group.
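To make the clusters easier to act on in a marketing plan, a final sketch profiles them in the original units (years, thousands of dollars, score points); the column name Cluster_K_means is introduced here for illustration:
#Cluster profiles in original units
dataframe$Cluster_K_means <- K_means$cluster
aggregate(dataframe[, c("Age", "Annual_Income", "Spending_Score")],
by = list(Cluster = dataframe$Cluster_K_means),
FUN = mean)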
Q: How do the characteristics of different groups of customers differ from one another? A: We can group our sample of customers into six clusters using cluster analysis. In general, younger people spend more than average in our mall while earning below-average annual incomes, whereas older people tend to spend less than average while earning roughly average incomes. Annual income and spending score are broadly related; the exceptions are the customers with the highest and the lowest incomes, who split into groups that spend the opposite of what that relationship would predict.