Introduction

The aim of this project is to segment customers into groups based on common characteristics.

The method of segmenting or segregating clients into classes based on similar features is called consumer segmentation.

This process is very useful for understanding the tastes and desires of each group, so that companies can adapt their marketing campaigns accordingly.

Dataset and first view

The dataset used in this project is made up of 200 observations, one row per mall customer, and 5 columns:

CustomerID: ordinal customer identification number.
Gender: Female and Male.
Age: age of customer.
Annual.Income..k..: annual income of clients in thousands of dollars.
Spending.Score..1.100.: purchase score that the shopping center assigns according to the buying behavior of the customer.

I rename the last two columns for readability.

customers <- read.csv("Mall_Customers.csv")

colnames(customers)[colnames(customers) == "Annual.Income..k.."] <- "Yearly_Income"
colnames(customers)[colnames(customers) == "Spending.Score..1.100."] <- "Spending_Score"
str(customers)

## 'data.frame':    200 obs. of  5 variables:
##  $ CustomerID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender        : chr  "Male" "Male" "Female" "Female" ...
##  $ Age           : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Yearly_Income : int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending_Score: int  39 81 6 77 40 76 6 94 3 72 ...

customers1<-customers

Exploratory data analysis

First of all, I will begin to explore the data graphically.

Distribution of Gender

Female customers make up 56% of the sample, while male customers make up 44%.

The table below shows the distribution of gender by absolute frequencies.

The barplot illustrates the distribution of the variable Gender graphically.

table(customers$Gender)

## 
## Female   Male 
##    112     88
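
The percentages quoted above follow directly from these counts. As a quick check, here is a small sketch that types the two counts in by hand (rather than reading the file) and converts them to shares with prop.table():

```r
# Counts copied from the table above
counts <- c(Female = 112, Male = 88)

# Share of each gender, in percent
pct <- round(100 * prop.table(counts))
pct
```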

library(ggplot2)

ggplot(customers, aes(Gender, fill = Gender)) +
  # Bars first, so the percentage labels are drawn on top of them
  geom_bar(alpha = 0.5) +
  geom_text(aes(y = after_stat(count),
                label = after_stat(paste0(prop.table(count) * 100, '%'))),
            stat = 'count',
            position = position_dodge(1),
            size = 5, color = "darkred") +
  labs(x = "Gender", y = "Count")

A violin plot is used to study how age is distributed within each gender.

The male customers are distributed between 18 and 70 while the female between 18 and 68.

The mean age (green point) is slightly higher for the male customers.

ggplot(customers, aes(y = Gender, x = Age)) +
  geom_violin() +
  # The green point marks the mean age of each gender
  stat_summary(fun = mean, geom = 'point', color = 'darkgreen', size = 4)

Density Yearly Income

The histogram has a longer tail on the right-hand side, so the distribution of Yearly Income is described as right-skewed, although the mean and the median are close to each other.

Mean annual income equals 60.56 thousand, whereas the median is equal to 61.5 thousand.

mean(customers$Yearly_Income)

## [1] 60.56

median(customers$Yearly_Income)

## [1] 61.5

ggplot(customers, aes(Yearly_Income)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "orange", color = "darkred") +
  labs(x = "Yearly Income", y = "Density")+
  geom_density(fill='brown',alpha=0.5)

Density Spending Score

The spending score distribution is more symmetrical than the previous one.

Spending score is fairly evenly distributed, with the mean and the median both close to 50.

mean(customers$Spending_Score)

## [1] 50.2

median(customers$Spending_Score)

## [1] 50

ggplot(customers, aes(Spending_Score)) + 
  geom_histogram(aes(y = after_stat(density)), fill = "green", color = "darkgreen") +
  labs(x = "Spending Score", y = "Density")+
  geom_density(fill='orange',alpha=0.5)

K-Means theory

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

The purpose of the algorithm is to find similarities between items and group them into clusters, minimizing the differences within each cluster and maximizing the differences between clusters.

The number of groups (clusters) is represented by K.

It assigns each data element to the closest cluster center (centroid), and the centroid's position is recalculated every time an element is added to the cluster.

In summary, the n observations are placed in k clusters so as to minimize an error function (the within-cluster sum of squares), and similarity within a cluster is measured by how close its values are to one another.
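
As a minimal sketch of the error function mentioned above, the snippet below runs kmeans() on a small made-up matrix (the toy data and its parameters are assumptions for illustration, not the mall data) and checks that the total within-cluster sum of squares reported by kmeans() equals the sum of squared distances from each point to its own centroid.

```r
# Toy data (hypothetical): two well-separated groups in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

# Centroid of the cluster each row was assigned to
own_center <- fit$centers[fit$cluster, , drop = FALSE]

# Error function: sum of squared distances from each point to its own centroid
wss_manual <- sum((toy - own_center)^2)

all.equal(wss_manual, fit$tot.withinss)
```

The nstart argument repeats the fit from several random starting centers and keeps the best one, which makes the result less dependent on a single initialization.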

K-means

Now I can proceed with the code, keeping only Yearly Income and Spending Score and standardizing both.

customers<-customers[,c(4,5)] 
customers_z <- as.data.frame(lapply(customers, scale))
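
To make the standardization step concrete, here is a small sketch on a made-up vector (the values are hypothetical). It shows that scale() subtracts the mean and divides by the standard deviation, so each standardized column has mean 0 and standard deviation 1 and both variables weigh equally in the distance computation.

```r
# Hypothetical income values, in thousands
x <- c(15, 16, 17, 18, 19, 60, 137)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; drop it to a plain vector to compare
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
```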

Clusters = 4

The first attempt uses 4 clusters.

I also store the resulting cluster assignments in the dataset.

set.seed(12345)
customers_clusters4 <- kmeans(customers_z, 4)
customers_clusters4$size

## [1]  39 100  38  23

customers_clusters4$centers

##   Yearly_Income Spending_Score
## 1     0.9891010      1.2364001
## 2    -0.4683088      0.2431891
## 3     1.0066735     -1.2224677
## 4    -1.3042458     -1.1341194

customers$cluster4 <- customers_clusters4$cluster

Clusters = 5

I apply the same procedure as above, this time with 5 clusters.

set.seed(12345)
customers_clusters5 <- kmeans(customers_z, 5)
customers_clusters5$size

## [1]  29 100  10  23  38

customers_clusters5$centers

##   Yearly_Income Spending_Score
## 1     0.6850149      1.2381121
## 2    -0.4683088      0.2431891
## 3     1.8709508      1.2314354
## 4    -1.3042458     -1.1341194
## 5     1.0066735     -1.2224677

customers$cluster5 <- customers_clusters5$cluster
customers1$cluster5 <- customers_clusters5$cluster

The dataset now contains the K-means cluster assignments for k = 4 and k = 5.

Visualize Clustering Results

A first comparison of the graphs suggests that 5 is the best number of clusters.

The function cclust performs an alternative k-means clustering.

library(factoextra)  # provides fviz_cluster()

fviz_cluster(list(data=customers_z, cluster=customers$cluster5), 
             ellipse.type="euclid", geom="point", pointsize=2, stand=FALSE,
             main='Plot with 5 clusters', palette="jco", ggtheme=theme_classic())

fviz_cluster(list(data=customers_z, cluster=customers$cluster4), 
             ellipse.type="norm", geom="point",pointsize=2, stand=FALSE,main='Plot with 4 clusters', palette="jco", ggtheme=theme_classic())

library(flexclust)  # assumed source of cclust(), given the simple and save.data arguments

k5<- cclust(customers_z, k=5, simple=FALSE, save.data=TRUE)
plot(k5)

k4<- cclust(customers_z, k=4, simple=FALSE, save.data=TRUE)
plot(k4)

Find out the best number of clusters

There is more than one method for choosing the best number of clusters.

Silhouette Method

Average silhouette method computes the average silhouette of observations for different values of k.

The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.

The following plots suggest that k=5 gives the maximum value of the average silhouette.

library(cluster)  # provides silhouette()

silh4<-silhouette(customers_clusters4$cluster, dist(customers_z))
fviz_silhouette(silh4)

##   cluster size ave.sil.width
## 1       1   39          0.54
## 2       2  100          0.44
## 3       3   38          0.54
## 4       4   23          0.55

silh5<-silhouette(customers_clusters5$cluster, dist(customers_z))
fviz_silhouette(silh5)

##   cluster size ave.sil.width
## 1       1   29          0.54
## 2       2  100          0.41
## 3       3   10          0.33
## 4       4   23          0.55
## 5       5   38          0.53

library(ClusterR)  # provides Optimal_Clusters_KMeans()

opt<-Optimal_Clusters_KMeans(customers_z, max_clusters=15, plot_clusters=TRUE, criterion="silhouette")
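
The same idea can be sketched by hand. For each observation the silhouette width is (b - a) / max(a, b), where a is the mean distance to the points of its own cluster and b is the mean distance to the points of the nearest other cluster. The snippet below is only an illustration on made-up data (the toy matrix, the range of k and nstart = 25 are assumptions): it fits kmeans() for several values of k, averages the widths returned by cluster::silhouette(), and keeps the k with the largest average.

```r
library(cluster)  # silhouette()

# Toy data (hypothetical): three well-separated groups in two dimensions
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.3), rnorm(30, mean = 4, sd = 0.3))
)
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

# Value of k with the largest average silhouette
best_k <- ks[which.max(avg_sil)]
best_k
```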

Elbow Method

In Elbow method, the sum of squares at each number of clusters is calculated and graphed.

To find the optimal number of clusters, look for the point where the slope of the curve changes from steep to shallow.

As with the previous method, 5 clusters appears to be the right choice.

opt<-Optimal_Clusters_KMeans(customers_z, max_clusters=15, plot_clusters=TRUE)
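
For reference, an elbow curve can also be built directly from kmeans(). This is a sketch on made-up data, not the computation performed by Optimal_Clusters_KMeans: the toy matrix, the range of k and nstart = 25 are assumptions chosen for illustration.

```r
# Toy data (hypothetical): three well-separated groups in two dimensions
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.3), rnorm(30, mean = 4, sd = 0.3))
)

# Total within-cluster sum of squares for k = 1, ..., 8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Plot the curve and look for the bend where it flattens out
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data with a clear group structure the curve drops sharply up to the true number of groups and flattens afterwards; on real data the bend is often less pronounced, which is why it is usually read together with another criterion such as the silhouette.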

PAM theory

PAM stands for Partitioning Around Medoids and is useful for small datasets.

The algorithm finds a set of objects called medoids that are centrally located in the clusters.

This clustering method begins with an arbitrary selection of medoids.

It then swaps a selected medoid with a non-selected object, if and only if the swap improves the quality of the clustering. The loop continues until there is no change.
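
A small sketch with cluster::pam() on made-up data (the toy matrix is an assumption for illustration) shows the main practical difference from k-means: each medoid is an actual observation of the dataset, not an average of points.

```r
library(cluster)  # pam()

# Toy data (hypothetical): two well-separated groups in two dimensions
set.seed(7)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

fit <- pam(toy, k = 2)

fit$id.med    # row indices of the observations chosen as medoids
fit$medoids   # coordinates of those observations

# Every medoid coincides with one of the original rows
medoids_are_rows <- isTRUE(all.equal(toy[fit$id.med, , drop = FALSE],
                                     fit$medoids,
                                     check.attributes = FALSE))
medoids_are_rows
```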

PAM code

Three models with the same type of clustering are computed, in order to show different metrics for calculating the dissimilarities between observations (default = Euclidean) and two different functions: pam and eclust.

The medoids obtained by partitioning around medoids with k=5 are the same in all three models.

p1$medoids

##      Yearly_Income Spending_Score
## [1,]    -1.2396857    -1.40182274
## [2,]    -1.5442768     1.11526229
## [3,]    -0.2497647     0.03097951
## [4,]     0.7020825     1.27015983
## [5,]     1.0447474    -1.36309836

p2$medoids

##      Yearly_Income Spending_Score
## [1,]    -1.2396857    -1.40182274
## [2,]    -1.5442768     1.11526229
## [3,]    -0.2497647     0.03097951
## [4,]     0.7020825     1.27015983
## [5,]     1.0447474    -1.36309836

p3$medoids

##      Yearly_Income Spending_Score
## [1,]    -1.2396857    -1.40182274
## [2,]    -1.5442768     1.11526229
## [3,]    -0.2497647     0.03097951
## [4,]     0.7020825     1.27015983
## [5,]     1.0447474    -1.36309836

As above, several types of graphs can be used to visualize the partition.

fviz_cluster(p1, geom = "point", ellipse.type = "norm")

fviz_cluster(p2)

fviz_cluster(p3)

Conclusion

In conclusion, it is possible to identify which clusters score highest and lowest across the variables.

In fact, the reordered table below shows that:

Group 4 contains customers with the lowest yearly income and a low spending score.

These customers are labeled Low-Value.

Group 3 contains customers with the highest yearly income and a spending score among the highest.

This group contains the so-called High-Value individuals.

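library(dplyr)  # provides %>%, arrange(), group_by() and summarise()

centers <- as.data.frame(customers_clusters5$centers)
centers$k <- 1:5  # keep the cluster number after reordering
centers %>% arrange(Spending_Score, Yearly_Income)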

##   Yearly_Income Spending_Score k
## 5     1.0066735     -1.2224677 5
## 4    -1.3042458     -1.1341194 4
## 2    -0.4683088      0.2431891 2
## 3     1.8709508      1.2314354 3
## 1     0.6850149      1.2381121 1

Let's profile the customers of group 3 by gender:

Gender	Average age	Yearly income (k USD)	Spending score
Male	30	116	80.5
Female	33.7	106	83

k5_group<-customers1[which(customers_clusters5$cluster == 3),]
k5_group %>%
  group_by(Gender)%>%
summarise(Avg_Spending = mean(Spending_Score),
          Avg_Yearly_Income= mean(Yearly_Income),
          Avg.age = mean(Age))

## # A tibble: 2 x 4
##   Gender Avg_Spending Avg_Yearly_Income Avg.age
##   <chr>         <dbl>             <dbl>   <dbl>
## 1 Female         83                106.    33.7
## 2 Male           80.5              116.    30

Customer clustering using K-Means/PAM

Matteo Pancaldi