The main idea of this paper is to use clustering methods to refine the classification of members of a shopping mall and to build user profiles. The dataset contains basic information about the mall's members: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100). The Spending Score is a score based on the customer's spending behavior and purchasing data. Two methods, K-means and PAM, are used to cluster the data, and the clustering results are compared using clustering metrics to select the best solution.
library(scatterplot3d)
library(ggplot2)
library(factoextra)
library(NbClust)
library(ClusterR)
library(psych)
library(flexclust)
library(clusterSim)
library(smacof)
library(fpc)
Read the dataset and check for missing values. As the output below shows, the data has no missing values, so no further processing is needed.
#read data
#setwd("E:/Master of Data science/6-unsupervised learing/archive")
MS<-read.csv("Mall_Customers.csv")
#checking for missing value
MS[!complete.cases(MS),]
## [1] CustomerID Gender Age
## [4] Annual.Income..k.. Spending.Score..1.100.
## <0 rows> (or 0-length row.names)
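The same check can be summarized per column. The sketch below uses a small made-up data frame (`toy`, not the mall dataset) so that the missing values are visible. On the real data, every count would be zero.

```r
# Toy data frame (made up for illustration; NOT the mall dataset)
toy <- data.frame(
  Age    = c(19, 21, NA, 23),
  Income = c(15, 15, 16, NA),
  Score  = c(39, 81, 6, 77)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that contain at least one missing value
incomplete_rows <- toy[!complete.cases(toy), ]
print(nrow(incomplete_rows))

# Single TRUE/FALSE answer for the whole table
print(anyNA(toy))
```

Applied to `MS`, `colSums(is.na(MS))` gives the same information as the `complete.cases()` call above, broken down by column.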
head(MS)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
pairs(data.frame(MS[,c(3:5)]))
Previewing the data gives a preliminary view of its distribution. The scatterplot matrix above shows no obvious linear relationship among the three variables, so further analysis is required.
The dataset also contains a gender variable. The following plots examine whether the other variables differ noticeably by gender.
ggplot(MS[,c(2:4)])+
geom_point(aes(x = Age, y = Annual.Income..k.., color = Gender),size = 3, alpha = 0.5) +
labs(x = 'Age', y = 'Annual Income (k$)', title = 'Relationship between age and annual income by gender') +
theme_minimal() +
theme(legend.position = "top") +
scale_color_manual(values = c('Male' = 'blue', 'Female' = 'pink'))
As the plot above shows, there is no visible difference between males and females in the relationship between age and income.
ggplot(MS[,c(2:3,5)])+
geom_point(aes(x = Age, y = Spending.Score..1.100., color = Gender),size = 3, alpha = 0.5) +
labs(x = 'Age', y = 'Spending Score (1-100)', title = 'Relationship between age and spending scores by gender') +
theme_minimal() +
theme(legend.position = "top") +
scale_color_manual(values = c('Male' = 'blue', 'Female' = 'pink'))
As the plot above shows, there is no visible difference between males and females in the relationship between age and spending score.
ggplot(MS[,c(2,4:5)])+
geom_point(aes(x = Annual.Income..k.., y = Spending.Score..1.100., color = Gender),size = 3, alpha = 0.5) +
labs(x = 'Annual Income (k$)', y = 'Spending Score (1-100)', title = 'Relationship between annual income and spending scores by gender') +
theme_minimal() +
theme(legend.position = "top") +
scale_color_manual(values = c('Male' = 'blue', 'Female' = 'pink'))
As the plot above shows, there is no visible difference between males and females in the relationship between income and spending score.
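A numeric summary by gender can complement the visual comparison. The sketch below is only an illustration: it uses the six rows printed by `head(MS)` earlier, with shortened column names (`Income`, `Score`) that are assumptions of this example. Six rows are far too few to draw conclusions. They only show the shape of the computation.

```r
# First six rows of the dataset, copied from the head(MS) output.
# Column names are shortened here for readability (an assumption of this sketch).
first_six <- data.frame(
  Gender = c("Male", "Male", "Female", "Female", "Female", "Female"),
  Age    = c(19, 21, 20, 23, 31, 22),
  Income = c(15, 15, 16, 16, 17, 17),
  Score  = c(39, 81, 6, 77, 40, 76)
)

# Mean of each numeric variable within each gender
by_gender <- aggregate(cbind(Age, Income, Score) ~ Gender,
                       data = first_six, FUN = mean)
print(by_gender)
```

On the full data the same call would use the original column names, for example `aggregate(cbind(Age, Annual.Income..k.., Spending.Score..1.100.) ~ Gender, data = MS, FUN = mean)`.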
Before clustering, it is worthwhile to assess the general clustering tendency of the dataset.
MS_new<-MS[,c(3:5)]
get_clust_tendency(MS_new, 10, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.7279497
##
## $plot
Clustering tendency can be assessed with the Hopkins statistic. Values close to 1 mean that we should reject the null hypothesis that the dataset is uniformly distributed, while values near 0.5 are consistent with uniformly distributed data. Here the statistic is about 0.73, so the data appears to be clusterable.
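To make the idea behind the statistic concrete, here is a minimal base R sketch of one common formulation. The function `hopkins_sketch` and the toy data are hypothetical and are not the implementation used by `get_clust_tendency()`, whose exact formula may differ (some variants raise the distances to a power). The sketch compares nearest-neighbor distances of real points with those of points drawn uniformly from the bounding box of the data.

```r
# Minimal Hopkins-statistic sketch (base R only, illustrative).
# H near 0.5 suggests uniformly scattered data; H near 1 suggests clusters.
hopkins_sketch <- function(x, n) {
  x <- as.matrix(x)
  idx <- sample(nrow(x), n)  # n real points drawn from the data
  # n artificial points drawn uniformly from the bounding box of the data
  rand <- apply(x, 2, function(col) runif(n, min(col), max(col)))
  # Euclidean distance from point p to its nearest row of pts
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))
  # w: distance from each sampled real point to its nearest OTHER real point
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))
  # u: distance from each artificial point to its nearest real point
  u <- apply(rand, 1, nearest, pts = x)
  sum(u) / (sum(u) + sum(w))
}

set.seed(1)
# Toy data (made up): two tight, well-separated groups in two dimensions
clustered <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
h <- hopkins_sketch(clustered, n = 10)
print(h)
```

For strongly clustered toy data like this, the artificial points are much farther from the data than the real points are from each other, which pushes the statistic toward 1.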
K-means is one of the most common clustering methods, but it is sensitive to the number of clusters: different numbers of clusters lead to different clustering results. It is therefore necessary to estimate the optimal number of clusters before running K-means. The WSS (within-cluster sum of squares) and silhouette methods are used below to estimate it.
fviz_nbclust(MS_new, FUNcluster=kmeans, method="wss")
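A smaller WSS means more compact clusters, but WSS always tends to decrease as the number of clusters grows, so the goal is to find the "elbow" where adding more clusters stops giving a large reduction. From the above figure, 4 and 6 are reasonable candidates for the number of clusters.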
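The WSS curve plotted by `fviz_nbclust()` can be approximated by hand with base R's `kmeans()`. The sketch below runs on made-up toy data with three obvious groups (not the mall dataset) and reports the relative drop in WSS for each additional cluster. The elbow is where this drop becomes small.

```r
set.seed(42)
# Toy data (made up): three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Relative reduction in WSS gained by going from k to k + 1 clusters
rel_drop <- -diff(wss) / wss[-length(wss)]

print(round(wss, 1))
print(round(rel_drop, 2))
```

For this toy data the large drops stop after three clusters. On `MS_new` the same loop could be used to inspect the curve numerically instead of reading it off the plot.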
fviz_nbclust(MS_new, FUNcluster=kmeans, method="silhouette")
The silhouette method is more commonly used to estimate the optimal number of clusters: the higher the average silhouette width, the better the clustering. From the above figure, the silhouette method suggests 6 clusters. Combining both analyses, 4 and 6 are the candidate numbers of clusters.
In the following, the data is clustered by K-means with 4 and with 6 clusters, and the two solutions are compared by silhouette width to choose the better one.
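For reference, the silhouette width of a point compares its mean distance to its own cluster, a, with its mean distance to the nearest other cluster, b, as (b - a) / max(a, b). The following base R sketch computes the average from that definition. The helper `avg_silhouette` and the toy data are hypothetical, and in practice `cluster::silhouette()` or `fviz_silhouette()` would be used instead.

```r
# Average silhouette width from its definition (base R only, illustrative).
# Assumes at least two clusters.
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))  # pairwise Euclidean distances
  ids <- unique(cluster)
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)  # singleton clusters score 0 by convention
    # a: mean distance to the other members of the own cluster (d[i, i] is 0)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(sapply(ids[ids != cluster[i]], function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
  mean(s)
}

set.seed(7)
# Toy data (made up): two tight, well-separated groups
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.4), ncol = 2)
)
sil2 <- avg_silhouette(toy, kmeans(toy, centers = 2, nstart = 10)$cluster)
sil4 <- avg_silhouette(toy, kmeans(toy, centers = 4, nstart = 10)$cluster)
print(c(k2 = sil2, k4 = sil4))
```

With two true groups, forcing four clusters splits compact groups apart, so the average width is lower for k = 4 than for k = 2.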
kmeans6 <- eclust(MS_new, "kmeans", hc_metric="euclidean", k=6)
kmeans4 <- eclust(MS_new, "kmeans", hc_metric="euclidean", k=4)
The above are the clustering results for cluster numbers 6 and 4, respectively.
fviz_silhouette(kmeans6)
## cluster size ave.sil.width
## 1 1 11 0.25
## 2 2 17 0.32
## 3 3 28 0.51
## 4 4 10 0.28
## 5 5 39 0.57
## 6 6 95 0.22
fviz_silhouette(kmeans4)
## cluster size ave.sil.width
## 1 1 23 0.47
## 2 2 38 0.48
## 3 3 100 0.28
## 4 4 39 0.55
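The clustering results are evaluated using the silhouette width, where a higher value indicates better clustering. Weighting the per-cluster widths above by cluster size gives an overall average of about 0.39 for 4 clusters and about 0.34 for 6 clusters, so the clustering is better when the number of clusters is 4.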
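The overall averages can be reproduced from the two printed tables by weighting each cluster's width by its size. Because the printed widths are rounded to two decimals, the results are approximate.

```r
# Per-cluster sizes and average silhouette widths, copied from the
# fviz_silhouette() output above (widths are rounded to two decimals)
sil_k6 <- data.frame(size  = c(11, 17, 28, 10, 39, 95),
                     width = c(0.25, 0.32, 0.51, 0.28, 0.57, 0.22))
sil_k4 <- data.frame(size  = c(23, 38, 100, 39),
                     width = c(0.47, 0.48, 0.28, 0.55))

# Size-weighted mean of the per-cluster widths
overall_k6 <- weighted.mean(sil_k6$width, sil_k6$size)
overall_k4 <- weighted.mean(sil_k4$width, sil_k4$size)

print(c(k6 = overall_k6, k4 = overall_k4))
```

A simple unweighted mean of the per-cluster widths would be misleading here, because the largest cluster in each solution (95 and 100 customers) also has the lowest width.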
kmeans4_cluster<-kmeans4$centers
kmeans4_cluster
## Age Annual.Income..k.. Spending.Score..1.100.
## 1 45.21739 26.30435 20.91304
## 2 40.39474 87.00000 18.63158
## 3 39.20000 48.26000 56.48000
## 4 32.69231 86.53846 82.12821
barplot(t(kmeans4_cluster),width=0.5,
col = c('red', 'green', 'blue'),
ylim = c(0, 130),
names.arg = c('Cluster 1','Cluster 2','Cluster 3','Cluster 4'),beside = TRUE)
legend("topleft",legend=c('Age', 'Annual Income(k$)','Spending Score(1-100)'),fill=c('red','green','blue'))
kmeans4$size
## [1] 23 38 100 39
The clustering results are analyzed and visualized above. They show that the customer base can be divided into 4 groups.
1) Group 1: age around 45, annual income around 26 k$, spending score around 21. These customers are mainly middle-aged and older, with low income and low spending.
2) Group 2: age around 40, annual income around 87 k$, spending score around 19. These customers are mainly middle-aged, with high income but low spending.
3) Group 3: age around 39, annual income around 48 k$, spending score around 56. These customers are mainly middle-aged, with moderate income and moderate spending.
4) Group 4: age around 33, annual income around 87 k$, spending score around 82. These customers are mainly young, with high income and high spending, and are the group most worth focusing on.
In terms of group size, the third group is the largest (100 customers) and the first group is the smallest (23 customers). The customers are thus subdivided into 4 groups according to the clustering results, and a different marketing strategy can be developed for each group.
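The share of each group follows directly from the cluster sizes printed by `kmeans4$size`. A short sketch using those printed values:

```r
# Cluster sizes copied from the kmeans4$size output above
sizes <- c(23, 38, 100, 39)
names(sizes) <- paste("Cluster", 1:4)

# Percentage of all customers in each cluster, rounded to one decimal
shares <- round(100 * prop.table(sizes), 1)
print(shares)
```

With the fitted object in memory, `prop.table(kmeans4$size)` gives the same proportions without retyping the sizes.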
PAM (Partitioning Around Medoids) is similar to K-means, but it uses actual data points (medoids) as cluster centers, which reduces the interference of noise and outliers. Like K-means, it is sensitive to the number of clusters, so it is still necessary to estimate the optimal number of clusters before clustering with PAM.
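A one-dimensional toy example (made-up numbers) illustrates why a medoid is less affected by an outlier than a mean. The medoid is the data point with the smallest total distance to all other points.

```r
# Toy values (made up): four ordinary points and one outlier
values <- c(1, 2, 3, 4, 100)

# K-means style center: the arithmetic mean, pulled toward the outlier
centroid <- mean(values)

# PAM style center: the observation minimizing the total distance to all others
total_dist <- sapply(values, function(v) sum(abs(values - v)))
medoid <- values[which.min(total_dist)]

print(c(centroid = centroid, medoid = medoid))
```

The mean lands at 22, far from every ordinary point, while the medoid stays at 3, inside the bulk of the data.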
fviz_nbclust(MS_new, FUNcluster = cluster::pam, method = "silhouette", k.max = 10, nboot = 100)
As can be judged from the above figure, the optimal number of clusters is 6.
fviz_nbclust(MS_new, FUNcluster = cluster::pam, method = "wss", k.max = 10, nboot = 100)
Combining the two plots, the candidate numbers of clusters are 6 and 4.
pam6 <- eclust(MS_new, "pam", hc_metric="euclidean", k=6)
pam4 <- eclust(MS_new, "pam", hc_metric="euclidean", k=4)
The above are the clustering results for cluster numbers 6 and 4, respectively.
fviz_silhouette(pam6)
## cluster size ave.sil.width
## 1 1 25 0.46
## 2 2 20 0.45
## 3 3 44 0.45
## 4 4 37 0.41
## 5 5 39 0.50
## 6 6 35 0.41
fviz_silhouette(pam4)
## cluster size ave.sil.width
## 1 1 29 0.48
## 2 2 95 0.29
## 3 3 39 0.59
## 4 4 37 0.44
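The clustering results are evaluated using the silhouette width, where a higher value indicates better clustering. Weighting the per-cluster widths above by cluster size gives an overall average of about 0.45 for 6 clusters and about 0.40 for 4 clusters, so the clustering is better when the number of clusters is 6.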
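As a check on this comparison, the size-weighted averages can be computed from the two PAM tables in one step. The values are copied from the printed output, so they carry its two-decimal rounding.

```r
# Sizes and average silhouette widths for both PAM solutions,
# copied from the fviz_silhouette() output above
pam_sil <- data.frame(
  k     = c(rep(6, 6), rep(4, 4)),
  size  = c(25, 20, 44, 37, 39, 35, 29, 95, 39, 37),
  width = c(0.46, 0.45, 0.45, 0.41, 0.50, 0.41, 0.48, 0.29, 0.59, 0.44)
)

# Size-weighted mean width for each value of k
overall <- sapply(split(pam_sil, pam_sil$k),
                  function(g) weighted.mean(g$width, g$size))
print(round(overall, 3))
```

The result is a named vector with one entry per number of clusters ("4" and "6"), which makes the two solutions easy to compare side by side.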
pam6_cluster<-pam6$medoids
pam6_cluster
## Age Annual.Income..k.. Spending.Score..1.100.
## [1,] 25 24 73
## [2,] 49 33 14
## [3,] 57 54 51
## [4,] 27 60 50
## [5,] 29 79 83
## [6,] 42 86 20
barplot(t(as.matrix(pam6_cluster)),width=0.5,
col = c('red', 'green', 'blue'),
ylim = c(0, 125),
names.arg = c('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6'),
beside = TRUE)
legend("topleft",legend=c('Age', 'Annual Income(k$)','Spending Score(1-100)'),fill=c('red','green','blue'))
pam6$clusinfo
## size max_diss av_diss diameter separation
## [1,] 25 35.67913 15.31559 62.22540 3.741657
## [2,] 20 34.55431 17.86436 47.77028 6.480741
## [3,] 44 24.00000 12.68394 39.50949 6.480741
## [4,] 37 23.60085 13.24367 42.95346 3.741657
## [5,] 39 58.00862 17.16133 69.05795 15.684387
## [6,] 35 52.00961 19.75251 71.65194 8.124038
The clustering results are analyzed and visualized above. They show that the customer base can be divided into 6 groups.
1) Group 1: age around 25, annual income around 24 k$, spending score around 73. These customers are mainly young, with low income but high spending.
2) Group 2: age around 49, annual income around 33 k$, spending score around 14. These customers are mainly middle-aged and older, with low income and low spending.
3) Group 3: age around 57, annual income around 54 k$, spending score around 51. These customers are mainly middle-aged and older, with moderate income and moderate spending.
4) Group 4: age around 27, annual income around 60 k$, spending score around 50. These customers are mainly young, with moderate income and moderate spending.
5) Group 5: age around 29, annual income around 79 k$, spending score around 83. These customers are mainly young, with high income and high spending.
6) Group 6: age around 42, annual income around 86 k$, spending score around 20. These customers are mainly middle-aged, with high income but low spending.
In terms of group size, the third group is the largest (44 customers) and the second group is the smallest (20 customers). The customers are thus subdivided into 6 groups according to the clustering results, and a different marketing strategy can be developed for each group.
The analysis shows that K-means and PAM give different results: K-means favors 4 clusters, while PAM favors 6. Judging by the final results, the PAM solution is more practical. Its groups are more clearly distinguished, and it divides customers into finer age groups, which better reflects the actual spending behavior of people of different ages.