The aim of this project is to segregate customers into groups based on common characteristics.
The method of segmenting clients into classes based on similar features is called customer segmentation.
This process is very useful for understanding the tastes and desires of each group and for adapting a company's marketing campaigns accordingly.
The dataset used in this project is made up of 200 observations, where each row represents a customer of the mall, and 5 columns:
- CustomerID: ordinal customer identification number.
- Gender: Female or Male.
- Age: age of the customer.
- Annual.Income..k..: annual income of the customer in thousands of dollars.
- Spending.Score..1.100.: the purchase score (1 to 100) that the shopping center assigns according to the buying behavior of the customer.
I rename the last two columns for better organization.
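# Packages assumed from the functions used in this analysis
# (the original does not show its library calls)
library(ggplot2)     # plots
library(dplyr)       # %>%, arrange, group_by, summarise
library(cluster)     # silhouette, pam
library(factoextra)  # fviz_cluster, fviz_silhouette, eclust
library(flexclust)   # cclust
library(ClusterR)    # Optimal_Clusters_KMeans
customers <- read.csv("Mall_Customers.csv")
colnames(customers)[colnames(customers) == "Annual.Income..k.."] <- "Yearly_Income"
colnames(customers)[colnames(customers) == "Spending.Score..1.100."] <- "Spending_Score"
str(customers)
## 'data.frame': 200 obs. of 5 variables: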
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Yearly_Income : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending_Score: int 39 81 6 77 40 76 6 94 3 72 ...
# Keep a full copy, since columns are dropped from customers later on
customers1 <- customers
First of all, I explore the data graphically.
Female customers make up 56% of the sample and male customers 44%.
The table below shows the distribution of gender by absolute frequencies.
The barplot illustrates the distribution of the variable Gender graphically.
table(customers$Gender)
##
## Female Male
## 112 88
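The percentages quoted for the two genders follow directly from these counts. A minimal check, using only the counts printed above (the variable names are chosen here for illustration):

```r
# Counts copied from the table() output
gender_counts <- c(Female = 112, Male = 88)

# prop.table() divides each count by the total number of customers
gender_pct <- 100 * prop.table(gender_counts)
gender_pct
```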
ggplot(customers, aes(Gender, fill = Gender)) +
  # Bars first, so the percentage labels are drawn on top of them
  geom_bar(alpha = 0.5) +
  # after_stat() replaces the deprecated ..count.. notation
  geom_text(aes(y = after_stat(count),
                label = after_stat(paste0(prop.table(count) * 100, "%"))),
            stat = "count",
            position = position_dodge(1),
            size = 5, color = "darkred") +
  labs(x = "Gender", y = "Count")
To study the distribution of age by gender, a violin plot is used.
The male customers are aged between 18 and 70, while the female customers are aged between 18 and 68.
The mean age (green point) is slightly higher among the male customers.
ggplot(customers, aes(y = Gender, x = Age)) +
  geom_violin() +
  # fun replaces the deprecated fun.y argument; the point marks the mean age
  stat_summary(fun = mean, geom = "point", color = "darkgreen", size = 4)
The histogram has its longer tail on the right, toward high incomes, so the distribution of Yearly Income is described as right-skewed.
The mean annual income is 60.56 thousand dollars and the median is 61.5 thousand; the two are close, so the skew is mild.
mean(customers$Yearly_Income)
## [1] 60.56
median(customers$Yearly_Income)
## [1] 61.5
ggplot(customers, aes(Yearly_Income)) +
  geom_histogram(aes(y = after_stat(density)), fill = "orange", color = "darkred") +
  labs(x = "Yearly Income", y = "Density") +
  geom_density(fill = "brown", alpha = 0.5)
The spending score distribution is more symmetrical than the previous one.
The spending score is roughly symmetric, with the mean and the median both close to 50.
mean(customers$Spending_Score)
## [1] 50.2
median(customers$Spending_Score)
## [1] 50
ggplot(customers, aes(Spending_Score)) +
  geom_histogram(aes(y = after_stat(density)), fill = "green", color = "darkgreen") +
  labs(x = "Spending Score", y = "Density") +
  geom_density(fill = "orange", alpha = 0.5)
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
The purpose of the algorithm is to find similarities between items and group them into clusters, minimizing the differences within each cluster and maximizing the differences between clusters.
The number of groups (clusters) is represented by K.
It assigns each data element to the closest cluster center, and the centroid's position is recalculated every time an element is added to the cluster.
In summary, the n observations are placed into k clusters so as to minimize an error function, and cluster similarity is measured by how close the values within a cluster are to each other.
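The error function is not written out above. A common choice, and the one assumed here, is the total within-cluster sum of squares. The symbols are introduced only for illustration: $x_i$ are the standardized observations, $C_j$ the clusters and $\mu_j$ their centroids.

```latex
W = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Under this assumption, a smaller $W$ means that the observations lie closer to the center of their own cluster.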
Now I can proceed with the code, keeping only Yearly Income and Spending Score and standardizing both.
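Standardizing here means converting each variable to z-scores, so that income and spending score contribute on the same scale to the distances. A small sketch on made-up values (the vector income is hypothetical, not taken from the dataset) shows that scale() does the same as subtracting the mean and dividing by the standard deviation:

```r
# Toy income values in thousands of dollars (made up for illustration)
income <- c(15, 40, 60, 85, 120)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (income - mean(income)) / sd(income)

# scale() returns a one-column matrix; as.numeric() drops its attributes
z_scale <- as.numeric(scale(income))

# TRUE when both approaches agree up to numerical tolerance
all.equal(z_manual, z_scale)
```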
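# Select the two clustering variables by name rather than by position
customers <- customers[, c("Yearly_Income", "Spending_Score")]
customers_z <- as.data.frame(lapply(customers, scale))
The first attempt uses 4 clusters.
I also store the resulting cluster labels in the dataset.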
set.seed(12345)
customers_clusters4 <- kmeans(customers_z, 4)
customers_clusters4$size
## [1]  39 100  38  23
customers_clusters4$centers
##   Yearly_Income Spending_Score
## 1     0.9891010      1.2364001
## 2    -0.4683088      0.2431891
## 3     1.0066735     -1.2224677
## 4    -1.3042458     -1.1341194
customers$cluster4 <- customers_clusters4$cluster
I apply the same procedure as above, this time with 5 clusters.
set.seed(12345)
customers_clusters5 <- kmeans(customers_z, 5)
customers_clusters5$size
## [1]  29 100  10  23  38
customers_clusters5$centers
##   Yearly_Income Spending_Score
## 1     0.6850149      1.2381121
## 2    -0.4683088      0.2431891
## 3     1.8709508      1.2314354
## 4    -1.3042458     -1.1341194
## 5     1.0066735     -1.2224677
customers$cluster5 <- customers_clusters5$cluster
customers1$cluster5 <- customers_clusters5$cluster
These are the results of K-means clustering with k = 4 and k = 5.
A first comparison of the plots suggests that 5 is the best number of clusters.
The function cclust performs an alternative k-means clustering.
fviz_cluster(list(data = customers_z, cluster = customers$cluster5),
             ellipse.type = "euclid", geom = "point", pointsize = 2, stand = FALSE,
             main = "Plot with 5 clusters", palette = "jco", ggtheme = theme_classic())
fviz_cluster(list(data = customers_z, cluster = customers$cluster4),
             ellipse.type = "norm", geom = "point", pointsize = 2, stand = FALSE,
             main = "Plot with 4 clusters", palette = "jco", ggtheme = theme_classic())
k5 <- cclust(customers_z, k = 5, simple = FALSE, save.data = TRUE)
plot(k5)
k4 <- cclust(customers_z, k = 4, simple = FALSE, save.data = TRUE)
plot(k4)
There is more than one method for choosing the number of clusters.
The average silhouette method computes the average silhouette width of the observations for different values of k.
The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
According to the plots below, clustering with k=5 gives the maximum average silhouette.
The per-cluster widths printed below for the two fits do not show this by themselves: weighted by cluster size, they average about 0.49 for k=4 and about 0.46 for k=5.
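For reference, the silhouette width of a single observation $i$ is usually defined as follows, where $a(i)$ is the mean distance from $i$ to the other members of its own cluster and $b(i)$ is the smallest mean distance from $i$ to the members of any other cluster (standard notation, not taken from this write-up):

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad -1 \le s(i) \le 1
```

Values close to 1 indicate an observation that sits well inside its cluster, values near 0 an observation on the border between two clusters, and negative values a probable misassignment.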
silh4<-silhouette(customers_clusters4$cluster, dist(customers_z))
fviz_silhouette(silh4)
##   cluster size ave.sil.width
## 1       1   39          0.54
## 2       2  100          0.44
## 3       3   38          0.54
## 4       4   23          0.55
silh5<-silhouette(customers_clusters5$cluster, dist(customers_z))
fviz_silhouette(silh5)
##   cluster size ave.sil.width
## 1       1   29          0.54
## 2       2  100          0.41
## 3       3   10          0.33
## 4       4   23          0.55
## 5       5   38          0.53
opt <- Optimal_Clusters_KMeans(customers_z, max_clusters = 15, plot_clusters = TRUE, criterion = "silhouette")
In the Elbow method, the within-cluster sum of squares is calculated for each number of clusters and plotted.
To find the optimal number of clusters, one looks for the point where the slope changes from steep to shallow.
As with the previous method, 5 clusters is the number indicated.
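The idea behind the elbow plot can be sketched with base R alone. The example below is only an illustration under stated assumptions: it uses synthetic data with three obvious groups rather than the mall data, and the names toy, ks and wss are made up here.

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible

# Synthetic data (not the mall data): three well-separated 2-D blobs
centers_true <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(50, centers_true[j, 1], 0.5),
        rnorm(50, centers_true[j, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# The elbow is the k after which the curve stops dropping steeply
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data like this the curve falls sharply up to the true number of groups and flattens afterwards; on real data the bend is often less clear, which is why the elbow is usually read together with another criterion such as the silhouette.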
opt <- Optimal_Clusters_KMeans(customers_z, max_clusters = 15, plot_clusters = TRUE)
PAM stands for partitioning around medoids, and it is useful for small datasets.
The algorithm finds a set of objects called medoids that are centrally located in the clusters.
This clustering method begins with an arbitrary selection of medoids.
It then swaps a selected medoid with a non-selected object if and only if the swap improves the quality of the clustering. The loop continues until there is no change.
Three models of the same clustering type are computed, to show different metrics for calculating the dissimilarities between observations (the default is Euclidean) and two different functions: pam and eclust.
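The calls that created the three models are not shown in this write-up. As a rough sketch only, the following fits PAM with two different metrics using cluster::pam on synthetic data. The toy data and the object names pam_euclid and pam_manhat are assumptions made for illustration and are not the original p1, p2 and p3; eclust is left out to keep the example dependent on the cluster package only.

```r
library(cluster)  # provides pam()

set.seed(42)  # arbitrary seed for the toy data

# Synthetic 2-D data (not the mall data): two separated groups of 30 points
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
             matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2))
colnames(toy) <- c("x1", "x2")

# Same algorithm, two dissimilarity metrics
pam_euclid <- pam(toy, k = 2, metric = "euclidean")
pam_manhat <- pam(toy, k = 2, metric = "manhattan")

# Medoids are actual observations, unlike k-means centroids
pam_euclid$medoids
pam_manhat$medoids

# Cross-tabulate the two partitions to see whether they agree
table(pam_euclid$clustering, pam_manhat$clustering)
```

When the groups are well separated, changing the metric typically leaves the partition unchanged, which is consistent with the identical medoids reported for the three models.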
The medoids obtained by partitioning around medoids with k=5 are the same in all three models.
p1$medoids
## Yearly_Income Spending_Score
## [1,] -1.2396857 -1.40182274
## [2,] -1.5442768 1.11526229
## [3,] -0.2497647 0.03097951
## [4,] 0.7020825 1.27015983
## [5,] 1.0447474 -1.36309836
p2$medoids
## Yearly_Income Spending_Score
## [1,] -1.2396857 -1.40182274
## [2,] -1.5442768 1.11526229
## [3,] -0.2497647 0.03097951
## [4,] 0.7020825 1.27015983
## [5,] 1.0447474 -1.36309836
p3$medoids
## Yearly_Income Spending_Score
## [1,] -1.2396857 -1.40182274
## [2,] -1.5442768 1.11526229
## [3,] -0.2497647 0.03097951
## [4,] 0.7020825 1.27015983
## [5,] 1.0447474 -1.36309836
As above, several types of plots can be used to visualize the partition.
fviz_cluster(p1, geom = "point", ellipse.type = "norm")
fviz_cluster(p2)
fviz_cluster(p3)
In conclusion, it is possible to see which cluster scores highest on each variable.
From the table of K-means centers (k=5), sorted below:
Group 4 contains the customers with the lowest yearly income and a low spending score (only group 5 has a slightly lower score). These customers are labelled Low-Value.
Group 3 contains the customers with the highest yearly income and a spending score practically tied with group 1 for the highest. These are the so-called High-Value individuals.
centers <- as.data.frame(customers_clusters5$centers)
centers$k <- 1:5
centers %>% arrange(Spending_Score, Yearly_Income)
##   Yearly_Income Spending_Score k
## 5     1.0066735     -1.2224677 5
## 4    -1.3042458     -1.1341194 4
## 2    -0.4683088      0.2431891 2
## 3     1.8709508      1.2314354 3
## 1     0.6850149      1.2381121 1
Let's look at the average customer of group 3 by gender:
| Gender | Average age | Yearly income (k USD) | Spending score |
|---|---|---|---|
| Male | 30 | 116 | 80.5 |
| Female | 33.7 | 106 | 83 |
k5_group<-customers1[which(customers_clusters5$cluster == 3),]
k5_group %>%
  group_by(Gender) %>%
  summarise(Avg_Spending = mean(Spending_Score),
            Avg_Yearly_Income = mean(Yearly_Income),
            Avg.age = mean(Age))
## # A tibble: 2 x 4
##   Gender Avg_Spending Avg_Yearly_Income Avg.age
##   <chr>         <dbl>             <dbl>   <dbl>
## 1 Female         83                106.    33.7
## 2 Male           80.5              116.    30