1.Introduction

The main idea of this paper is to use clustering methods to refine the segmentation of a shopping mall's members and to build user profiles. The dataset contains basic information about the members: CustomerID, Gender, Age, Annual Income (k$) and Spending Score (1-100). The Spending Score is a score based on the customer's behavior and purchasing data at the mall. Two methods, K-means and PAM, are used to cluster the data, and the clustering results are optimized by comparing clustering metrics.

2.Libraries

library(scatterplot3d)
library(ggplot2)
library(factoextra)
library(NbClust)
library(ClusterR)
library(psych)
library(flexclust)
library(clusterSim)
library(smacof)
library(fpc)

3.Data processing

Read the dataset and check for missing values. As the output below shows, no rows contain missing values, so no further processing is needed.

#read data
#setwd("E:/Master of Data science/6-unsupervised learing/archive")
MS<-read.csv("Mall_Customers.csv")
#checking for missing value
MS[!complete.cases(MS),]

## [1] CustomerID             Gender                 Age                   
## [4] Annual.Income..k..     Spending.Score..1.100.
## <0 rows> (or 0-length row.names)
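
As a complementary check, missing values can also be counted per column. The following is a minimal sketch on a small hypothetical data frame (`demo` and its values are made up for illustration and are not the mall data):

```r
# Hypothetical toy data frame with two missing entries (not the mall data)
demo <- data.frame(
  Age    = c(19, NA, 20),
  Income = c(15, 15, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(demo))

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(demo))

print(na_per_column)
print(n_incomplete)
```

Applied to `MS`, both quantities would be zero if the data has no missing values, which is consistent with the empty result above.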

head(MS)

##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76

4.Data visualization and analysis

4.1 Preview data distribution

pairs(data.frame(MS[,c(3:5)]))

The pairwise scatterplots give a first look at the distribution of the data. From the above graph, no obvious linear relationship appears between any pair of the three variables (age, annual income and spending score), so further analysis is required.

4.2 Relationship between variables by gender

The dataset also contains a gender variable. The following plots examine whether the relationships between the other variables differ by gender.

4.2.1 Relationship between age and annual income by gender

ggplot(MS[,c(2:4)])+
  geom_point(aes(x = Age, y = Annual.Income..k.., color = Gender),size = 3, alpha = 0.5) +
  labs(x = 'Age', y = 'Annual Income (k$)', title = 'Relationship between age and annual income by gender') +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_color_manual(values = c('Male' = 'blue', 'Female' = 'pink'))

As can be seen from the above, there is no apparent difference between genders in the relationship between age and income.

4.2.2 Relationship between age and consumption scores by gender

ggplot(MS[,c(2:3,5)])+
  geom_point(aes(x = Age, y = Spending.Score..1.100., color = Gender),size = 3, alpha = 0.5) +
  labs(x = 'Age', y = 'Spending.Score(1-100)', title = 'Relationship between age and spending scores by gender') +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_color_manual(values = c('Male' = 'blue', 'Female' = 'pink'))

As can be seen from the above, there is no apparent difference between genders in the relationship between age and spending score.

4.2.3 Relationship between annual income and consumption scores by gender

ggplot(MS[,c(2,4:5)])+
  geom_point(aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = Gender),size = 3, alpha = 0.5) +
  labs(x = 'Annual Income (k$)', y = 'Spending.Score(1-100)', title = 'Relationship between annual income and spending scores by gender') +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_color_manual(values = c('Male' = 'blue', 'Female' = 'pink'))

As can be seen from the above, there is no apparent difference between genders in the relationship between income and spending score.

5.Clustering

5.1 Prediagnostics - clustering tendency

Before the actual clustering, it is worthwhile to assess the general clustering tendency of the dataset.

MS_new<-MS[,c(3:5)]
get_clust_tendency(MS_new, 10, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)

## $hopkins_stat
## [1] 0.7279497
## 
## $plot

Clustering tendency can be assessed with the Hopkins’ statistic. Values close to 1 mean that we should reject the null hypothesis that the dataset is uniformly distributed. Here the statistic is about 0.73, so the data appears to be clusterable.

5.2 K-means

K-means is one of the most common clustering methods, but its results depend strongly on the chosen number of clusters. Different numbers of clusters lead to different clustering results, so the optimal number of clusters should be estimated before performing K-means clustering. The WSS (within-cluster sum of squares) and silhouette methods are used below to estimate it.

5.2.1 Optimal number of clusters

fviz_nbclust(MS_new, FUNcluster=kmeans, method="wss")

A smaller WSS means more compact clusters, but WSS always decreases as the number of clusters grows, so we look for the point where the decrease levels off (the elbow). From the above figure, 4 or 6 is a reasonable number of clusters.
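
The WSS curve can also be computed by hand, which makes the elbow reasoning explicit. The following is a minimal sketch on hypothetical toy data (`toy` is three simulated, well-separated blobs, not the mall data), using only base R:

```r
set.seed(123)  # reproducible toy data and K-means starts

# Hypothetical toy data: three well-separated 2-D blobs (not the mall data)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# WSS drops sharply up to the true number of blobs, then levels off
print(round(wss, 1))
```

On this toy data the sharp drop ends at k = 3. On real data such as `MS_new` the bend is usually less pronounced, which is why the WSS plot is combined with the silhouette criterion.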

fviz_nbclust(MS_new, FUNcluster=kmeans, method="silhouette")

The average silhouette width is another common criterion for choosing the number of clusters: the higher its value, the better the clustering. From the above figure, the optimal number of clusters is 6. Combining both methods, 4 and 6 are the candidate numbers of clusters.
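
The silhouette criterion can be sketched in the same way. The example below is an illustration only: it uses hypothetical toy data (`toy`, three simulated blobs, not the mall data) and `silhouette()` from the `cluster` package, and picks the k with the highest average silhouette width.

```r
library(cluster)  # provides silhouette()

set.seed(123)  # reproducible toy data and K-means starts

# Hypothetical toy data: three well-separated 2-D blobs (not the mall data)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)
d <- dist(toy)  # Euclidean distances between all observations

# Average silhouette width for k = 2..6
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  km  <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- k_values

# The k with the highest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 2))
print(best_k)
```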

5.2.2 Results and post-diagnostics

In the following, the data is clustered with K-means using 6 and 4 clusters respectively, and the two solutions are compared with the silhouette method to select the better one.

kmeans6 <- eclust(MS_new, "kmeans", hc_metric="euclidean", k=6)

kmeans4 <- eclust(MS_new, "kmeans", hc_metric="euclidean", k=4)

The above are the clustering results for cluster numbers 6 and 4, respectively.

fviz_silhouette(kmeans6)

##   cluster size ave.sil.width
## 1       1   11          0.25
## 2       2   17          0.32
## 3       3   28          0.51
## 4       4   10          0.28
## 5       5   39          0.57
## 6       6   95          0.22

fviz_silhouette(kmeans4)

##   cluster size ave.sil.width
## 1       1   23          0.47
## 2       2   38          0.48
## 3       3  100          0.28
## 4       4   39          0.55

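The clustering results are evaluated using the silhouette width, where a higher value indicates better clustering. Weighted by cluster size, the overall average silhouette width is about 0.39 for 4 clusters and about 0.34 for 6 clusters (computed from the rounded per-cluster values above), so the clustering is better when the number of clusters is 4.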
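
As a quick check of this comparison, the size-weighted average silhouette width of each solution can be computed from the per-cluster tables. The values below are copied from the `fviz_silhouette()` output above; because they are rounded to two decimals, the results are approximate. The variable names are chosen here for illustration.

```r
# Cluster sizes and average silhouette widths copied from the output above
size_k6 <- c(11, 17, 28, 10, 39, 95)
sil_k6  <- c(0.25, 0.32, 0.51, 0.28, 0.57, 0.22)

size_k4 <- c(23, 38, 100, 39)
sil_k4  <- c(0.47, 0.48, 0.28, 0.55)

# Overall average silhouette width = size-weighted mean of the cluster averages
overall_k6 <- weighted.mean(sil_k6, size_k6)
overall_k4 <- weighted.mean(sil_k4, size_k4)

print(round(c(k6 = overall_k6, k4 = overall_k4), 2))
```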

kmeans4_cluster<-kmeans4$centers
kmeans4_cluster

##        Age Annual.Income..k.. Spending.Score..1.100.
## 1 45.21739           26.30435               20.91304
## 2 40.39474           87.00000               18.63158
## 3 39.20000           48.26000               56.48000
## 4 32.69231           86.53846               82.12821

barplot(t(kmeans4_cluster),width=0.5,
        col = c('red', 'green', 'blue'), 
        ylim = c(0, 130),
        names.arg = c('Cluster 1','Cluster 2','Cluster 3','Cluster 4'),beside = TRUE)
legend("topleft",legend=c('Age', 'Annual Income(k$)','Spending Score(1-100)'),fill=c('red','green','blue'))

kmeans4$size

## [1]  23  38 100  39

The clustering results are analyzed and visualized above. They show that the customer base can be divided into 4 groups.

1) Group 1: age around 45, annual income around 26 k$ and spending score around 20. These customers are mainly middle-aged and older, with lower income and low spending.

2) Group 2: age around 40, annual income around 87 k$ and spending score around 18. These customers are mainly middle-aged, with higher income but low spending.

3) Group 3: age around 39, annual income around 48 k$ and spending score around 56. These customers are mainly middle-aged, with medium income and medium spending.

4) Group 4: age around 32, annual income around 86 k$ and spending score around 82. These customers are mainly young people with high income and high spending, so marketing can focus on this group.

In terms of group shares, the third group is the largest (100 of 200 customers, 50%) and the first group is the smallest (23 customers, 11.5%). We have thus divided the customers into 4 groups according to the clustering results, and a different marketing strategy can be developed for each group.

5.3 PAM

5.3.1 Optimal number of clusters

PAM (Partitioning Around Medoids) is similar to K-means, but it uses actual observations (medoids) as cluster centers, which reduces the influence of noise and outliers. Like K-means, it is sensitive to the number of clusters, so the optimal number of clusters still has to be estimated before clustering with PAM.
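
The difference between a K-means center and a PAM medoid can be illustrated on a tiny example. The following sketch uses a hypothetical one-dimensional dataset with a single extreme outlier (the values are made up and are not the mall data) and `pam()` from the `cluster` package:

```r
library(cluster)  # provides pam()

# Hypothetical 1-D toy data with one extreme outlier (not the mall data)
x <- matrix(c(1, 2, 3, 4, 100), ncol = 1)

# With a single cluster, the K-means center is the mean of all points
km_center <- as.numeric(kmeans(x, centers = 1)$centers)

# The PAM medoid is the observation with the smallest total distance to the others
pam_medoid <- as.numeric(pam(x, k = 1)$medoids)

print(c(kmeans_center = km_center, pam_medoid = pam_medoid))
```

The outlier pulls the K-means center far away from the bulk of the points, while the medoid stays at one of the typical observations.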

fviz_nbclust(MS_new, FUNcluster = cluster::pam, method = "silhouette", k.max = 10)

As can be judged from the above figure, the optimal number of clusters is 6.

fviz_nbclust(MS_new, FUNcluster = cluster::pam, method = "wss", k.max = 10)

From the above analysis, it can be determined that the optimal number of clusters is 6 or 4.

5.3.2 Results and post-diagnostics

pam6 <- eclust(MS_new, "pam", hc_metric="euclidean", k=6)

pam4 <- eclust(MS_new, "pam", hc_metric="euclidean", k=4)

The above are the clustering results for cluster numbers 6 and 4, respectively.

fviz_silhouette(pam6)

##   cluster size ave.sil.width
## 1       1   25          0.46
## 2       2   20          0.45
## 3       3   44          0.45
## 4       4   37          0.41
## 5       5   39          0.50
## 6       6   35          0.41

fviz_silhouette(pam4)

##   cluster size ave.sil.width
## 1       1   29          0.48
## 2       2   95          0.29
## 3       3   39          0.59
## 4       4   37          0.44

Weighted by cluster size, the overall average silhouette width is about 0.45 for 6 clusters and about 0.40 for 4 clusters (computed from the rounded per-cluster values above), so the 6-cluster PAM solution is analyzed below.

pam6_cluster<-pam6$medoids
pam6_cluster

##      Age Annual.Income..k.. Spending.Score..1.100.
## [1,]  25                 24                     73
## [2,]  49                 33                     14
## [3,]  57                 54                     51
## [4,]  27                 60                     50
## [5,]  29                 79                     83
## [6,]  42                 86                     20

barplot(t(as.matrix(pam6_cluster)),width=0.5,
        col = c('red', 'green', 'blue'), 
        ylim = c(0, 125),
        names.arg = c('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6'),
        beside = TRUE)
legend("topleft",legend=c('Age', 'Annual Income(k$)','Spending Score(1-100)'),fill=c('red','green','blue'))

pam6$clusinfo

##      size max_diss  av_diss diameter separation
## [1,]   25 35.67913 15.31559 62.22540   3.741657
## [2,]   20 34.55431 17.86436 47.77028   6.480741
## [3,]   44 24.00000 12.68394 39.50949   6.480741
## [4,]   37 23.60085 13.24367 42.95346   3.741657
## [5,]   39 58.00862 17.16133 69.05795  15.684387
## [6,]   35 52.00961 19.75251 71.65194   8.124038

The clustering results are analyzed and visualized above. They show that the customer base can be divided into 6 groups.

Group 1: age around 25, annual income around 24 k$ and spending score around 73. These customers are mainly young people with low income but a high spending score.
Group 2: age around 49, annual income around 33 k$ and spending score around 14. These customers are mainly middle-aged and older, with lower income and a lower spending score.
Group 3: age around 57, annual income around 54 k$ and spending score around 51. These customers are mainly middle-aged and older, with medium income and a medium spending score.
Group 4: age around 27, annual income around 60 k$ and spending score around 50. These customers are mainly young people, with moderate income and a moderate spending score.
Group 5: age around 29, annual income around 79 k$ and spending score around 83. These customers are mainly young people, with high income and a high spending score.
Group 6: age around 42, annual income around 86 k$ and spending score around 20. These customers are mainly middle-aged, with higher income but a lower spending score.

In terms of group shares, the third group is the largest (44 of 200 customers, 22%) and the second group is the smallest (20 customers, 10%). We have thus divided the customers into 6 groups according to the clustering results, and a different marketing strategy can be developed for each group.

6.Conclusion

From the above analysis, we can see that K-means and PAM give somewhat different results. Judging from the final cluster profiles, the PAM results are more practical: the groups are more clearly distinguished and the age groups are divided more finely, which better reflects the actual consumption of people of different ages.

Segmentation of customer using clustering methods

Yun Wu

2024-01-01
