mydata <- read.table("~/hw17/mall clustering/Mall_Customers.csv",
header = TRUE,
sep=",",
dec=".") #Creating the dataset
head(mydata)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
Dataset Description:
There are 200 observations on 4 variables, excluding the customer ID.
The unit of observation is a customer.
This dataset records the spending of mall customers together with their age, gender, annual income and spending score. Such data help with predicting sales trends, demographic influences and purchasing behavior.
Source: Mall Customer Segmentation Data, Kaggle (2018, March 2). https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python/data
colnames(mydata) <- c("CustomerID", "Gender_beforefactor", "Age_values", "Annual.Income..k.._values", "Spending.Score..1.100._values")
mydata$Age <- scale(mydata$Age_values)
mydata$Annual.Income..k.. <- scale(mydata$Annual.Income..k.._values)
mydata$Spending.Score..1.100. <- scale(mydata$Spending.Score..1.100._values)
mydata$Gender <- factor(mydata$Gender_beforefactor,
levels = c("Male", "Female"),
labels = c("Male", "Female"))
mydata$Dissimilarity <- sqrt(mydata$Age^2 + mydata$Annual.Income..k..^2 + mydata$Spending.Score..1.100.^2) #Finding outliers: Euclidean distance of each unit from the centroid (the origin, since the variables are standardized)
head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
## CustomerID Gender_beforefactor Age_values Annual.Income..k.._values
## 200 200 Male 30 137
## 199 199 Male 32 137
## 9 9 Male 64 19
## 11 11 Male 67 19
## 3 3 Female 20 16
## 198 198 Male 32 126
## 195 195 Female 47 120
## 197 197 Female 45 126
## 31 31 Male 60 30
## 193 193 Male 33 113
## Spending.Score..1.100._values Age Annual.Income..k..
## 200 83 -0.6335454 2.910368
## 199 18 -0.4903713 2.910368
## 9 3 1.8004143 -1.582351
## 11 14 2.0151754 -1.582351
## 3 6 -1.3494159 -1.696572
## 198 74 -0.4903713 2.491555
## 195 16 0.5834344 2.263112
## 197 28 0.4402603 2.491555
## 31 4 1.5140661 -1.163538
## 193 8 -0.4187842 1.996595
## Spending.Score..1.100. Gender Dissimilarity
## 200 1.2701598 Male 3.238044
## 199 -1.2469252 Male 3.203986
## 9 -1.8277910 Male 3.014323
## 11 -1.4018227 Male 2.920595
## 3 -1.7116178 Female 2.762049
## 198 0.9216404 Male 2.701431
## 195 -1.3243740 Female 2.686268
## 197 -0.8596814 Female 2.672214
## 31 -1.7890666 Male 2.616673
## 193 -1.6341691 Male 2.613863
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
method = "euclidian")
distance2 <- distance^2 #Squared Euclidean distances
fviz_dist(distance2) #Showing dissimilarity matrix
get_clust_tendency(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6862594
##
## $plot
## NULL
The data show clustering tendency: the Hopkins statistic (0.69) is above 0.5, while a value near 0.5 would indicate essentially random, structureless data.
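Note that the Hopkins statistic is computed from a random sample of the data, so its value varies slightly between runs; fixing a seed makes it reproducible. A minimal sketch (the seed value is arbitrary and not part of the original analysis):
set.seed(123) #Any fixed seed works; it only pins down the random sample
get_clust_tendency(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
n = nrow(mydata) - 1,
graph = FALSE)$hopkins_stat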
library(dplyr) #Allowing use of %>%
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD <- mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")] %>%
#scale() omitted - the variables were already standardized above
get_dist(method = "euclidean") %>% #Selecting distance
hclust(method = "ward.D2") #Selecting algorithm
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 200
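As an optional check on how faithfully the dendrogram preserves the original distances, one could compute the cophenetic correlation (a hypothetical addition, not part of the original analysis; values close to 1 mean little distortion):
#Correlation between the input distances and the dendrogram (cophenetic) distances
d_euc <- get_dist(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
method = "euclidean")
cor(cophenetic(WARD), d_euc)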
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Looking at the dendrogram, I will cut the tree into 6 groups.
fviz_dend(WARD,
k = 6,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
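To cross-check the choice of k = 6, one could look at the average silhouette width for different numbers of clusters; a sketch using factoextra's fviz_nbclust with its hierarchical-cut wrapper hcut (this check was not part of the original analysis):
fviz_nbclust(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
FUNcluster = hcut,
method = "silhouette",
k.max = 10) #A higher average silhouette width suggests a better k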
mydata$ClusterWard <- cutree(WARD,
k = 6) #Cutting the dendrogram
head(mydata)
## CustomerID Gender_beforefactor Age_values Annual.Income..k.._values
## 1 1 Male 19 15
## 2 2 Male 21 15
## 3 3 Female 20 16
## 4 4 Female 23 16
## 5 5 Female 31 17
## 6 6 Female 22 17
## Spending.Score..1.100._values Age Annual.Income..k..
## 1 39 -1.4210029 -1.734646
## 2 81 -1.2778288 -1.734646
## 3 6 -1.3494159 -1.696572
## 4 77 -1.1346547 -1.696572
## 5 40 -0.5619583 -1.658498
## 6 76 -1.2062418 -1.658498
## Spending.Score..1.100. Gender Dissimilarity ClusterWard
## 1 -0.4337131 Male 2.283934 1
## 2 1.1927111 Male 2.462601 2
## 3 -1.7116178 Female 2.762049 1
## 4 1.0378135 Female 2.289728 2
## 5 -0.3949887 Female 1.795113 1
## 6 0.9990891 Female 2.281187 2
Customer with ID 1 got classified into group 1; of the first six customers shown, 1, 3 and 5 fall into group 1 and 2, 4 and 6 into group 2.
Initial_leaders <- aggregate(mydata[, c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 0.3914510 -1.3244867 -1.15891524
## 2 2 -1.0051162 -1.3303378 1.16320677
## 3 3 -0.8212625 -0.1160830 -0.16866621
## 4 4 1.2563527 -0.2006917 -0.07142498
## 5 5 -0.4408110 0.9891010 1.23640011
## 6 6 0.3610033 1.1698473 -1.29809671
The initial leaders are calculated as the centroids of the Ward groups, i.e. the means of the three standardized variables within each cluster.
library(factoextra)
#Performing K-means clustering - initial leaders are the centroids of the groups found with hierarchical clustering
K_MEANS <- hkmeans(mydata[c("Age", "Annual.Income..k..", "Spending.Score..1.100.")],
k = 6,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 6 clusters of sizes 21, 24, 38, 45, 39, 33
##
## Cluster means:
## Age Annual.Income..k.. Spending.Score..1.100.
## 1 0.4777583 -1.3049552 -1.19344867
## 2 -0.9735839 -1.3221791 1.03458649
## 3 -0.8709130 -0.1135003 -0.09334615
## 4 1.2515802 -0.2396117 -0.04388764
## 5 -0.4408110 0.9891010 1.23640011
## 6 0.2211606 1.0805138 -1.28682305
##
## Clustering vector:
## [1] 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
## [38] 2 1 2 4 2 1 2 1 2 4 3 3 3 4 3 3 4 4 4 4 4 3 4 4 3 4 4 4 3 4 4 3 3 4 4 4 4
## [75] 4 3 4 3 3 4 4 3 4 4 3 4 4 3 3 4 4 3 4 3 3 3 4 3 4 3 3 4 4 3 4 3 4 4 4 4 4
## [112] 3 3 3 3 3 4 4 4 4 3 3 3 5 3 5 6 5 6 5 6 5 3 5 6 5 6 5 3 5 6 5 3 5 6 5 6 5
## [149] 6 5 6 5 6 5 6 5 6 5 6 5 4 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
## [186] 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5
##
## Within cluster sum of squares by cluster:
## [1] 20.52332 11.71664 20.20990 23.87015 22.36267 34.51630
## (between_SS / total_SS = 77.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
The output shows the final leaders after K-means; comparing them with the initial leaders, some units were reclassified. The ratio of between-cluster SS to total SS is 77.7%, which is good, as it is above 50%.
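The reported ratio can also be recovered directly from the components of the returned object:
#Between-cluster sum of squares as a share of the total sum of squares
K_MEANS$betweenss / K_MEANS$totss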
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("CustomerID", "ClusterWard", "ClusterK_Means")])
## CustomerID ClusterWard ClusterK_Means
## 1 1 1 2
## 2 2 2 2
## 3 3 1 1
## 4 4 2 2
## 5 5 1 1
## 6 6 2 2
Customer with ID 1 got reclassified from cluster 1 by Ward to cluster 2 by K-means.
#Checking for reclassifications
table(mydata$ClusterWard)
##
## 1 2 3 4 5 6
## 22 21 45 45 39 28
table(mydata$ClusterK_Means)
##
## 1 2 3 4 5 6
## 21 24 38 45 39 33
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3 4 5 6
## 1 21 1 0 0 0 0
## 2 0 21 0 0 0 0
## 3 0 2 38 2 0 3
## 4 0 0 0 43 0 2
## 5 0 0 0 0 39 0
## 6 0 0 0 0 0 28
Initially there were 22 units in cluster 1, 21 in cluster 2, 45 in cluster 3, and so on. K-means then reclassified some units, leaving 21 units in cluster 1, 24 in cluster 2, 38 in cluster 3, and so on. Even if the number of units in a cluster remained the same, as in clusters 4 and 5, the membership did not necessarily stay unchanged, because a cluster can gain and lose the same number of units. For example, the cross-tabulation shows that cluster 4 lost 2 units to cluster 6 but gained 2 from cluster 3, so its size stayed at 45 while its composition changed.
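The overall agreement between the two partitions could also be summarized with a single number, the adjusted Rand index; a sketch assuming the mclust package is installed (not part of the original analysis):
library(mclust)
#1 = identical partitions, values near 0 = agreement no better than chance
adjustedRandIndex(mydata$ClusterWard, mydata$ClusterK_Means)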
Centroids <- K_MEANS$centers
Centroids
## Age Annual.Income..k.. Spending.Score..1.100.
## 1 0.4777583 -1.3049552 -1.19344867
## 2 -0.9735839 -1.3221791 1.03458649
## 3 -0.8709130 -0.1135003 -0.09334615
## 4 1.2515802 -0.2396117 -0.04388764
## 5 -0.4408110 0.9891010 1.23640011
## 6 0.2211606 1.0805138 -1.28682305
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(Age, Annual.Income..k.., Spending.Score..1.100.))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("1", "2", "3", "4", "5", "6"))
Figure$nameFactor <- factor(Figure$name,
levels = c("Age", "Annual.Income..k..", "Spending.Score..1.100."),
labels = c("Age", "Annual.Income..k..", "Spending.Score..1.100."))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
ylim(-1.5, 1.5)
#Are all the cluster variables successful at classifying units into groups? Performing ANOVAs.
fit <- aov(cbind(Age, Annual.Income..k.., Spending.Score..1.100.) ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 136.047 27.2095 83.851 < 2.2e-16 ***
## Residuals 194 62.953 0.3245
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 157.472 31.4945 147.13 < 2.2e-16 ***
## Residuals 194 41.528 0.2141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 170.281 34.056 230.06 < 2.2e-16 ***
## Residuals 194 28.719 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: all means are the same. H1: at least one is different.
We reject H0 for all three variables (Response 1 = Age, Response 2 = Annual income, Response 3 = Spending score) at p < 0.001 and conclude that for every cluster variable at least one cluster mean differs from the others.
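The ANOVA only tells us that some cluster means differ; to see which specific pairs of clusters differ, a post-hoc test could follow. A sketch for the Age variable (this step was not part of the original analysis; TukeyHSD() needs a single-response aov, so the model is refitted):
fit_age <- aov(as.numeric(Age) ~ as.factor(ClusterK_Means),
data = mydata) #as.numeric() drops the matrix attribute left by scale()
TukeyHSD(fit_age) #Pairwise differences in cluster means with adjusted p-values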
CRITERION VALIDATION WITH A NEW VARIABLE:
#Pearson Chi2 test
chisq_results <- chisq.test(mydata$Gender, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$Gender and as.factor(mydata$ClusterK_Means)
## X-squared = 3.7398, df = 5, p-value = 0.5875
addmargins(chisq_results$observed)
##
## mydata$Gender 1 2 3 4 5 6 Sum
## Male 8 10 14 19 18 19 88
## Female 13 14 24 26 21 14 112
## Sum 21 24 38 45 39 33 200
round(chisq_results$expected, 2)
##
## mydata$Gender 1 2 3 4 5 6
## Male 9.24 10.56 16.72 19.8 17.16 14.52
## Female 11.76 13.44 21.28 25.2 21.84 18.48
round(chisq_results$res, 2) #Pearson residuals
##
## mydata$Gender 1 2 3 4 5 6
## Male -0.41 -0.17 -0.67 -0.18 0.20 1.18
## Female 0.36 0.15 0.59 0.16 -0.18 -1.04
H0: no association between gender and classification. H1: association between gender and classification.
We cannot reject H0 at p = 0.59. There are no statistically significant differences in gender structure between the groups.
We cannot validate the clusters with gender.
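The tested structure can also be inspected visually; a sketch of the gender shares per cluster (a hypothetical addition, not part of the original analysis):
#Stacked bars of the gender share within each cluster (columns sum to 1)
barplot(prop.table(chisq_results$observed, margin = 2),
legend.text = TRUE,
xlab = "Cluster",
ylab = "Share")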
aggregate(mydata$Age_values,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 45.52381
## 2 2 25.25000
## 3 3 26.68421
## 4 4 56.33333
## 5 5 32.69231
## 6 6 41.93939
fit <- aov(Age_values ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 26547 5309 83.85 <2e-16 ***
## Residuals 194 12284 63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: all means are the same. H1: at least one is different.
We reject H0 at p < 0.001. There are statistically significant differences in age between the clusters, so we can validate our clusters with age.
Group 4 has the oldest customers (average age 56), followed by group 1 (45), group 6 (42), group 5 (33) and group 3 (27), while group 2 is the youngest (average age 25).
aggregate(mydata$Spending.Score..1.100._values,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 19.38095
## 2 2 76.91667
## 3 3 47.78947
## 4 4 49.06667
## 5 5 82.12821
## 6 6 16.96970
fit <- aov(Spending.Score..1.100._values ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 113553 22711 230.1 <2e-16 ***
## Residuals 194 19151 99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: all means are the same. H1: at least one is different.
There are statistically significant differences in spending score between the groups at p < 0.001, so we can validate the clusters with spending score.
Group 5 has the customers with the highest spending score (possible range 1 to 100), averaging 82, while group 6 has the lowest, averaging 17.
aggregate(mydata$Annual.Income..k.._values,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 26.28571
## 2 2 25.83333
## 3 3 57.57895
## 4 4 54.26667
## 5 5 86.53846
## 6 6 88.93939
fit <- aov(Annual.Income..k.._values ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 108630 21726 147.1 <2e-16 ***
## Residuals 194 28647 148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: all means are the same. H1: at least one is different.
There are statistically significant differences in customers' annual income between the groups at p < 0.001, so we can validate the clusters with annual income.
Group 6 has the customers with the highest annual income (89 thousand dollars on average), while group 2 has the lowest (26 thousand dollars per year).
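The three cluster profiles can also be produced in a single call, equivalent to the three aggregate() calls above:
aggregate(mydata[c("Age_values", "Annual.Income..k.._values", "Spending.Score..1.100._values")],
by = list(Cluster = mydata$ClusterK_Means),
FUN = mean)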
Our classification of 200 customers was based on three standardized variables: age, annual income (in thousand dollars) and spending score (from 1 to 100).
For hierarchical clustering we used Ward's algorithm and, based on the dendrogram, divided the customers into 6 groups. The classification was then optimized with K-means clustering.
Group 4 contains 22.5% of the customers (45/200), making it the largest of the 6 groups. It is characterized by slightly below-average annual income and a roughly average spending score, and on average it is the oldest group of customers.