This project studies cluster analysis. I use the Mall Customers dataset from Kaggle for customer segmentation. The dataset describes mall customers by gender, age, annual income and spending score.
# Import the dataset (read.csv() is base R, so no extra package is needed)
df <- read.csv("Mall_Customers.csv")
print(head(df))
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
#Summary mall customer dataset
str(df)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income..k.. : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
The summary shows that the dataset is small: 200 observations of 5 variables. Gender is the only categorical variable; CustomerID is an identifier, and the remaining three variables are quantitative. Next, we analyze the dataset to gain deeper insight into the data.
#Descriptive analysis
#summary of data
summary(df)
## CustomerID Gender Age Annual.Income..k..
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending.Score..1.100.
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
The average customer age is 38.85 years, and 75% of customers are 49 or younger.
hist(df$Age,breaks=10,main="Customer ages",xlab='Age')
The histogram shows that customers aged 25 to 40 are the most frequent group, and the 45-50 age group is also well represented. Next, we look at which gender visits the mall more often.
a<-table(df$Gender)
prop.table(a)
##
## Female Male
## 0.56 0.44
barplot(a,main="Gender of customers",xlab='Gender',ylab="Count",col=rainbow(2),legend=rownames(a))
It is not surprising that more women than men shop at the mall (56% of
customers are female). What we still want to know is which gender is
willing to spend more on products. Having looked at age and gender, we
now analyze the annual income of the mall's customers.
hist(df$Annual.Income..k..,main='Annual income of mall customers',xlab="Annual income (in k-thousand)")
The distribution of annual income appears skewed to the right, with a
maximum of $137,000/year. 75% of customers earn no more than
$78,000/year, and the mean is about $60,560 (median $61,500).
Next, we want to know how the variables relate to one another. First, we compare the distribution of annual income by gender.
boxplot(Annual.Income..k..~Gender,data=df,ylab = "Annual Income (in thousand)",col=c("lightskyblue","lightseagreen"))
The boxplot shows little difference between the annual income distributions of men and women; male income is slightly higher.
In the Mall Customers dataset, the spending score is a score (out of 100) that the mall authorities assign to each customer based on the money spent and the customer's behavior. It is central to segmenting the mall's customers. High-income customers do not necessarily have a high spending score, so we analyze the relationship between age and spending score, and between annual income and spending score.
library("ggplot2") # Load ggplot2 package
## Warning: package 'ggplot2' was built under R version 4.2.2
library("GGally")
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data<-df[,c(3,4,5)]
group<-df[,2]
ggpairs(data,ggplot2::aes(colour=group))
From the scatter plots and correlation coefficients, the relationships between age, spending score and annual income are very weak (coefficients close to 0). The only notable coefficient is the one between spending score and age, -0.3, a weak negative correlation: younger customers tend to have higher spending scores. We therefore expect younger customers to be an attractive segment for the mall, since, compared with older customers, they tend to spend more and score better on shopping behavior.
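The coefficients shown in the pairs plot can also be computed directly with base R's `cor()`. The sketch below is illustrative only: it uses just the six rows printed by `head(df)` earlier as a stand-in for the full data, so its values will differ from the -0.3 reported for all 200 customers. The names `toy`, `Income_k` and `Score` are hypothetical.

```r
# Stand-in data: the six rows printed by head(df) earlier in this document.
toy <- data.frame(
  Age      = c(19, 21, 20, 23, 31, 22),
  Income_k = c(15, 15, 16, 16, 17, 17),
  Score    = c(39, 81, 6, 77, 40, 76)
)

# Pearson correlation matrix for the three numeric variables.
cor_mat <- cor(toy, method = "pearson")
print(round(cor_mat, 2))
```

On the full dataset the equivalent call would be `cor(df[, 3:5])`.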
After this exploratory analysis, the mall authorities want to split their customers into several potential groups. Based on the analysis and the scatter plots, we segment customers by two variables: annual income and spending score. These two variables can form distinct groups, such as customers who have a high annual income and also spend a lot on the mall's products. We therefore apply cluster analysis to this problem.
The first method we can use is partitioning clustering. Partitioning is
the most common and simplest type of clustering technique. The dataset
is divided into k mutually exclusive groups, or clusters: the algorithm
assigns the n observations to k partitions (k ≤ n). Clusters are formed
based on distance, so that observations in the same cluster are
“similar” to each other and “dissimilar” to observations in other
clusters.
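To make the distance-based assignment concrete, here is a minimal sketch of a Lloyd-style k-means loop in base R. It is not the implementation behind `kmeans()`, and the toy points and the helper names `assign_clusters` and `simple_kmeans` are hypothetical. It alternates two steps: assign each point to its nearest centre, then move each centre to the mean of its points.

```r
# Toy data: three loose groups of 2-D points (illustrative only, not the mall data).
set.seed(1)
pts <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.5), ncol = 2)
)

# Assign each point to its nearest centre (squared Euclidean distance).
assign_clusters <- function(x, centres) {
  d2 <- sapply(seq_len(nrow(centres)), function(j) {
    rowSums(sweep(x, 2, centres[j, ])^2)
  })
  max.col(-d2, ties.method = "first")
}

# Alternate assignment and centre update until the assignment stops changing.
simple_kmeans <- function(x, k, max_iter = 100) {
  centres <- x[sample(nrow(x), k), , drop = FALSE]  # random starting centres
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    new_cluster <- assign_clusters(x, centres)
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centres[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centres)
}

fit <- simple_kmeans(pts, k = 3)
print(table(fit$cluster))
```

A single random start like this can settle in a poor local optimum, which is one reason to run several starts and keep the best, as the `nstart` argument of `kmeans()` does.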
There are several partitioning methods; we use K-means for this analysis. The most difficult part of K-means is choosing the optimal number of clusters k. I introduce three methods for finding it.
Two are direct methods, which optimize a criterion such as the within-cluster sum of squares or the average silhouette. The corresponding methods are called the elbow and silhouette methods, respectively.
The third is a statistical testing method, which compares the evidence against a null hypothesis. An example is the gap statistic.
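The criterion behind the elbow method can be computed by hand with base R. The sketch below runs `kmeans()` for k = 1 to 8 on synthetic data (four made-up groups, not the mall data) and records the total within-cluster sum of squares for each k; the names `toy`, `ks` and `wss` are hypothetical.

```r
# Synthetic stand-in data: four well-separated 2-D groups (illustrative only).
set.seed(42)
toy <- rbind(
  cbind(rnorm(25, 20, 3), rnorm(25, 20, 3)),
  cbind(rnorm(25, 20, 3), rnorm(25, 80, 3)),
  cbind(rnorm(25, 80, 3), rnorm(25, 20, 3)),
  cbind(rnorm(25, 80, 3), rnorm(25, 80, 3))
)

# Total within-cluster sum of squares for each candidate k.
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss)))
```

Plotting `wss` against `ks` (for example with `plot(ks, wss, type = "b")`) gives the elbow curve; the "elbow" is the k after which adding clusters reduces the sum of squares only slightly.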
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(NbClust)
# Elbow method
fviz_nbclust(df[,4:5], kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
# Silhouette method
fviz_nbclust(df[,4:5], kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
#Gap statistic method
set.seed(123)
fviz_nbclust(df[,4:5], kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
The elbow method suggests 4 as the optimal number of clusters.
The silhouette method suggests 6.
The gap statistic method suggests 1.
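For reference, the average silhouette width can be sketched in a few lines of base R. For each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the width is (b - a) / max(a, b). The helper `avg_silhouette` and the synthetic data are hypothetical; the `cluster` package offers a `silhouette()` function for real use.

```r
# Synthetic stand-in data: four well-separated 2-D groups (illustrative only).
set.seed(42)
toy <- rbind(
  cbind(rnorm(25, 20, 3), rnorm(25, 20, 3)),
  cbind(rnorm(25, 20, 3), rnorm(25, 80, 3)),
  cbind(rnorm(25, 80, 3), rnorm(25, 20, 3)),
  cbind(rnorm(25, 80, 3), rnorm(25, 80, 3))
)

# Average silhouette width of a partition (requires at least two clusters).
avg_silhouette <- function(x, cluster) {
  d   <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)            # singleton cluster: width set to 0
    a <- sum(d[i, own]) / (sum(own) - 1)    # mean distance within own cluster
    b <- min(sapply(ids[ids != cluster[i]], function(g) {
      mean(d[i, cluster == g])              # mean distance to another cluster
    }))
    (b - a) / max(a, b)
  })
  mean(s)
}

ks  <- 2:8
sil <- sapply(ks, function(k) {
  avg_silhouette(toy, kmeans(toy, centers = k, nstart = 25)$cluster)
})
print(data.frame(k = ks, avg_silhouette = round(sil, 3)))
```

The silhouette method picks the k with the largest average width, `ks[which.max(sil)]`.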
The three methods disagree, so based on these results together with the scatter plot, I choose 5 clusters for the analysis.
set.seed(27)
kmean_model <- kmeans(df[,4:5], 5, nstart = 30)
fviz_cluster(kmean_model, data = df[,4:5],
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
) + scale_fill_discrete(labels=c("Spendthrift", "Target", "Careful",'General','Miser'))
## Warning: ggrepel: 106 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
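K-means clustering forms 5 groups of customers, named Spendthrift, Target, Careful, General and Miser. Because k-means numbers its clusters arbitrarily, the labels passed to scale_fill_discrete() should be checked against the cluster centers in kmean_model$centers.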
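One way to keep names and clusters aligned is to derive each name from the cluster's centre instead of its index. The sketch below is an assumption-laden illustration: the source does not define the five names, so the income/score thresholds in the hypothetical helper `name_cluster` are one possible reading, and the data are synthetic rather than the mall data.

```r
# Synthetic stand-in for df[, 4:5]: five made-up groups (illustrative only).
set.seed(27)
true_centres <- rbind(c(20, 20), c(20, 80), c(55, 50), c(90, 20), c(90, 80))
toy <- do.call(rbind, lapply(seq_len(5), function(j) {
  cbind(income = rnorm(30, true_centres[j, 1], 3),
        score  = rnorm(30, true_centres[j, 2], 3))
}))

fit <- kmeans(toy, centers = 5, nstart = 30)

# Assumed meaning of each name; these thresholds are not from the source.
name_cluster <- function(income, score) {
  if (income > 70 && score > 60) {
    "Target"
  } else if (income > 70 && score < 40) {
    "Careful"
  } else if (income < 40 && score > 60) {
    "Spendthrift"
  } else if (income < 40 && score < 40) {
    "Miser"
  } else {
    "General"
  }
}

# One name per cluster, looked up from the fitted centres.
labels  <- unname(mapply(name_cluster,
                         fit$centers[, "income"], fit$centers[, "score"]))
segment <- factor(labels[fit$cluster])

print(data.frame(round(fit$centers, 1), label = labels))
print(table(segment))
```

With this approach the names stay attached to the right groups even if a different seed reorders the cluster indices.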
The second technique we apply to segment customers is hierarchical clustering. Unlike the partitioning technique, it does not require choosing the number of clusters up front. After building the hierarchy, we can cut the dendrogram at our preferred number of clusters k.
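As a small illustration of that point, the sketch below builds one hierarchy with base R's `hclust()` and then cuts the same tree at several values of k with `cutree()`, without refitting. The data are synthetic (three made-up groups) and the names `toy`, `hc` and `cuts` are hypothetical.

```r
# Synthetic stand-in data: three 2-D groups (illustrative only, not the mall data).
set.seed(3)
toy <- rbind(
  cbind(rnorm(10, 20, 3), rnorm(10, 20, 3)),
  cbind(rnorm(10, 20, 3), rnorm(10, 80, 3)),
  cbind(rnorm(10, 80, 3), rnorm(10, 50, 3))
)

# Build the hierarchy once, using complete linkage on Euclidean distances.
hc <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Cut the same tree at k = 2, 3 and 4; each column is one partition.
cuts <- cutree(hc, k = 2:4)
print(head(cuts))

# Number of groups produced by each cut.
n_groups <- apply(cuts, 2, function(g) length(unique(g)))
print(n_groups)
```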
# Compute the dissimilarity matrix
df_distance <-dist(df[,4:5],method="euclidean")
df_hc <- hclust(d = df_distance, method = "complete")
# Cut the dendrogram into 5 clusters
library("factoextra")
fviz_dend(df_hc,k=5, cex = 1,
main="Dendrogram - Complete linkage",
xlab="Objects",ylab="Distance",sub="",
k_colors = c("#2E9FDF", "#FC4E07","plum2","seagreen3","#E7B800"),
color_labels_by_k = TRUE,
rect = TRUE, # Add rectangle around groups
rect_fill = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
# Cut the dendrogram into 3 clusters
library("factoextra")
fviz_dend(df_hc,k=3, cex = 1,
main="Dendrogram - Complete linkage",
xlab="Objects",ylab="Distance",sub="",
k_colors = c("#2E9FDF", "#FC4E07","plum2","seagreen3","#E7B800"),
color_labels_by_k = TRUE,
rect = TRUE, # Add rectangle around groups
rect_fill = TRUE)
## Warning in get_col(col, k): Length of color vector was longer than the number of
## clusters - first k elements are used
# Cut the tree into 5 groups and plot the resulting clusters
grp <- cutree(df_hc, k = 5)
fviz_cluster(list(data = df[, 4:5], cluster = grp),
palette = c("#2E9FDF", "#FC4E07", "plum2", "seagreen3", "#E7B800"),
ellipse.type = "convex", # Convex hull around each cluster
repel = TRUE, # Avoid label overplotting (slow)
show.clust.cent = FALSE, ggtheme = theme_minimal())
## Warning: ggrepel: 88 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
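How closely two clusterings agree can be checked with a cross-tabulation of their assignments. The sketch below does this on synthetic data (five made-up groups, not the mall data); the names `toy`, `km`, `hc_groups` and `agreement` are hypothetical. If both methods find the same groups, each row and each column of the table has a single non-zero cell, even though the cluster numbers themselves may differ.

```r
# Synthetic stand-in for df[, 4:5]: five made-up groups (illustrative only).
set.seed(27)
true_centres <- rbind(c(20, 20), c(20, 80), c(55, 50), c(90, 20), c(90, 80))
toy <- do.call(rbind, lapply(seq_len(5), function(j) {
  cbind(income = rnorm(30, true_centres[j, 1], 3),
        score  = rnorm(30, true_centres[j, 2], 3))
}))

# Partition the same data with both techniques.
km        <- kmeans(toy, centers = 5, nstart = 30)
hc_groups <- cutree(hclust(dist(toy), method = "complete"), k = 5)

# Rows: k-means clusters; columns: hierarchical clusters.
agreement <- table(kmeans = km$cluster, hclust = hc_groups)
print(agreement)
```

On the mall data, the equivalent table would be `table(kmean_model$cluster, grp)`.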
In conclusion, both partitioning clustering and hierarchical clustering are appropriate for this segmentation problem. With K-means, we identified 5 possible customer groups for the mall, and the groups produced by hierarchical clustering are nearly the same as those from partitioning clustering. We therefore conclude that the mall's customers can be divided into 5 potential groups: Spendthrift, Target, Careful, General and Miser. This segmentation can help marketing campaigns increase revenue by targeting each group's characteristics.