Customer Segmentation Project In R

Customer segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify segments of customers, allowing them to target their potential user base. In this machine learning project, we will make use of K-means clustering, a fundamental algorithm for clustering unlabeled datasets.

What Is Customer Segmentation

Customer segmentation is the process of dividing a customer base into groups of individuals who are similar in ways that are relevant to marketing, such as gender, age, interests, and spending habits. Companies that deploy customer segmentation work on the premise that every customer has different requirements and that a specific marketing effort is needed to address them appropriately. Their aim is to gain a deeper understanding of the customers they are targeting, so the effort has to be specific and tailored to the requirements of each individual customer. Furthermore, through the data collected, companies can learn customer preferences and discover the valuable segments that would yield the most profit. This way, they can plan their marketing more efficiently and reduce the risk to their investment. Customer segmentation depends on several key differentiators that divide customers into groups to be targeted. Data on demographics, geography, economic status, and behavioral patterns play a crucial role in determining how the company addresses the various segments.

Read Data

customerData <- read.csv("mall.csv")
head(customerData)
##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76
tail(customerData)
##     CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 195        195 Female  47                120                     16
## 196        196 Female  35                120                     79
## 197        197 Female  45                126                     28
## 198        198   Male  32                126                     74
## 199        199   Male  32                137                     18
## 200        200   Male  30                137                     83

Structure And Summary Of The Data

str(customerData)
## 'data.frame':    200 obs. of  5 variables:
##  $ CustomerID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 1 1 1 2 1 ...
##  $ Age                   : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual.Income..k..    : int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending.Score..1.100.: int  39 81 6 77 40 76 6 94 3 72 ...
summary(customerData)
##    CustomerID        Gender         Age        Annual.Income..k..
##  Min.   :  1.00   Female:112   Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Male  : 88   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50                Median :36.00   Median : 61.50    
##  Mean   :100.50                Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                Max.   :70.00   Max.   :137.00    
##  Spending.Score..1.100.
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00
sd(customerData$Age)
## [1] 13.96901
sd(customerData$Annual.Income..k..)
## [1] 26.26472

Customer Gender Visualization

In this section, we will create a bar plot and a pie chart to show the gender distribution across our customerData dataset.

gender <- table(customerData$Gender)
barplot(gender,main = "Gender Comparison",xlab = "Gender",ylab = "Count",col =c("pink","lightblue"))

From the above bar plot, we observe that the number of females is higher than the number of males. Now, let us draw a pie chart to observe the ratio of female to male customers.

pct <- round(gender/sum(gender)*100)
lbs <- paste0(names(gender)," ",pct,"%")
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.6.1
pie3D(gender,labels = lbs,main="Pie Chart Depicting Ratio Of Female And Male",col = c("red","orange"))

From the above graph, we conclude that the percentage of females in the customer dataset is 56%, whereas the percentage of males is 44%.
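The 56% and 44% figures can be checked with a little arithmetic. The sketch below is illustrative only: it hard-codes the counts from the summary() output shown earlier (112 Female, 88 Male) rather than reading mall.csv, and the variable names are assumptions made for this example.

```r
# Counts copied from the summary() output earlier: 112 Female, 88 Male
gender_counts <- c(Female = 112, Male = 88)

# prop.table() divides each count by the total, giving proportions that sum to 1
gender_pct <- round(100 * prop.table(gender_counts))

# Build one label per gender, e.g. "Female 56%"
gender_labels <- paste0(names(gender_pct), " ", gender_pct, "%")
print(gender_labels)
```

With the real data, the same two steps could be applied to table(customerData$Gender) instead of the hard-coded vector.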

Visualization of Age Distribution

Let us plot a histogram to view the frequency distribution of customer ages. We will first take a summary of the Age variable.

summary(customerData$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   28.75   36.00   38.85   49.00   70.00
hist(customerData$Age,col = "lightblue",main = "Histogram to show count of Age Class",xlab = "Age Class",ylab = "Frequency",labels = TRUE)

boxplot(customerData$Age,col = "#ff0066",main="Boxplot for Descriptive Analysis of Age")

From the above two visualizations, we conclude that the most common customer ages are between 30 and 35. The minimum age of customers is 18, whereas the maximum age is 70.
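Reading the fullest bar off a histogram can also be done numerically. The following is a minimal sketch on synthetic ages (randomly generated, NOT the mall data), so its result says nothing about the real customers; the 5-year bin width and the variable names are assumptions for illustration.

```r
# Synthetic ages for illustration only (NOT the mall data)
set.seed(1)
ages <- round(runif(200, min = 18, max = 70))

# hist(plot = FALSE) returns the bin edges and counts without drawing anything
age_hist <- hist(ages, breaks = seq(15, 70, by = 5), plot = FALSE)

# Index of the fullest bin, then its lower and upper edge
top_bin <- which.max(age_hist$counts)
modal_range <- age_hist$breaks[c(top_bin, top_bin + 1)]
print(modal_range)
```

On the real data, hist(customerData$Age, plot = FALSE) would give the counts behind the histogram drawn above.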

Analysis Of The Annual Income Of The Customer

We will create visualizations to analyze the annual income of the customers. We will plot a histogram and then examine the data using a density plot.

summary(customerData$Annual.Income..k..)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   41.50   61.50   60.56   78.00  137.00
hist(customerData$Annual.Income..k..,col = "orange",main = "Histogram For Annual Income",xlab = "Annual Income Class",ylab = "Frequency",labels = TRUE)

plot(density(customerData$Annual.Income..k..),col="white",main = "Density Plot For Annual Income",xlab = "Annual Income Class",ylab = "Density")

polygon(density(customerData$Annual.Income..k..),
        col = "#ccff66")

From the above descriptive analysis, we conclude that the minimum annual income of the customers is 15 and the maximum is 137 (in thousands, as the column name Annual.Income..k.. indicates). In the histogram, the income class around 70 has the highest frequency count. The average income of all the customers is 60.56. In the kernel density plot displayed above, the annual income looks approximately normally distributed.

Analyzing Spending Score Of The Customers

summary(customerData$Spending.Score..1.100.)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   34.75   50.00   50.20   73.00   99.00
boxplot(customerData$Spending.Score..1.100.,horizontal = TRUE,col = "lightblue",main="Boxplot For Descriptive Analysis Of Spending Score")

hist(customerData$Spending.Score..1.100.,main = "Histogram For Spending Score",xlab = "Spending Score Class",ylab = "Frequency",col = "#2475B0",labels = TRUE)

The minimum spending score is 1, the maximum is 99, and the average is 50.20. From the histogram, we conclude that the class between 40 and 50 contains more customers than any other spending score class.

Using K-Means Algorithm

When using the k-means clustering algorithm, the first step is to specify the number of clusters (k) that we wish to produce in the final output.
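To make the role of k concrete, here is a minimal sketch of Lloyd's algorithm, the variant requested later through algorithm = "Lloyd". The helper simple_kmeans is hypothetical (it is not part of any package), and the data are two synthetic, well-separated blobs rather than the mall data. In practice the built-in kmeans() should be preferred.

```r
# Minimal sketch of Lloyd's k-means; `simple_kmeans` is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k distinct rows chosen at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every center (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to its nearest center
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stable: converged
    cluster <- new_cluster
    # Move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Synthetic data: two blobs of 20 points each (NOT the mall data)
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

The algorithm only alternates between assigning points and moving centers; it never changes k, which is why k has to be chosen up front.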

Elbow Method

library(purrr)
set.seed(123)
iss <- function(k){
  kmeans(customerData[,3:5],k,iter.max = 100,nstart = 100,algorithm = "Lloyd")$tot.withinss
}

k.values <- 1:10

iss_values <- map_dbl(k.values,iss)
plot(k.values,iss_values,type = "b",pch=19,frame=FALSE,xlab = "Number Of Clusters K",ylab = "Total Intra Clusters Sum Of Squares",col="#1287A5")
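The quantity plotted on the y-axis, tot.withinss, is the sum of squared distances from each point to the center of its own cluster. The sketch below recomputes it by hand on synthetic data (two blobs, NOT the mall data) and compares it with the value kmeans() reports; the variable names are assumptions for illustration.

```r
# Synthetic data: two blobs of 30 points each (NOT the mall data)
set.seed(7)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 10)

# fit$centers[fit$cluster, ] repeats each point's own cluster center row by row,
# so the difference below is point minus center for every observation
manual_wss <- sum((toy - fit$centers[fit$cluster, ])^2)
print(all.equal(manual_wss, fit$tot.withinss))

# The same quantity for several k; the elbow is where the drop levels off
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
print(round(wss, 1))
```

Because adding clusters can only help the best achievable fit, the curve keeps falling as k grows; the elbow method looks for the k after which the improvement becomes small.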

Average Silhouette Method

With the help of the average silhouette method, we can measure the quality of our clustering. The silhouette width measures how well each data object lies within its cluster. A high average silhouette width indicates a good clustering.
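As a compact illustration of the idea, the average silhouette width can be computed for several values of k and compared directly. This is a sketch on synthetic data (three blobs, NOT the mall data); silhouette() comes from the cluster package, while the variable names are assumptions for this example.

```r
library(cluster)  # provides silhouette()

# Synthetic data: three blobs of 30 points each (NOT the mall data)
set.seed(11)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))
toy_dist <- dist(toy)  # Euclidean distances, computed once and reused

# Average silhouette width for k = 2..6
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(fit$cluster, toy_dist)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the largest average width is the one this method suggests
best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```

Silhouette widths always lie between -1 and 1, so the averages are comparable across different k.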

library(cluster)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.1
library(grid)
# Fit k-means for k = 2..10 and draw the silhouette plot for each fit
feature_dist <- dist(customerData[,3:5],"euclidean")
kfits <- list()
for (k in 2:10) {
  kfits[[as.character(k)]] <- kmeans(customerData[,3:5],k,iter.max = 100,nstart = 50,algorithm = "Lloyd")
  plot(silhouette(kfits[[as.character(k)]]$cluster,feature_dist),col = "#1287A5")
}

Visualize The Optimal Number Of Clusters

library(NbClust)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(customerData[,3:5],kmeans,method = "silhouette")

Gap Statistic

To compute the gap statistic, we can use the clusGap function, which returns the gap statistic and its standard error for each number of clusters.

set.seed(125)
stat_gap <- clusGap(customerData[,3:5],FUN=kmeans,nstart=25,K.max = 10,B=50)
fviz_gap_stat(stat_gap)
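Besides plotting, the numbers behind the gap plot can be inspected directly. The sketch below runs clusGap() on synthetic data (two blobs, NOT the mall data) and applies maxSE() from the cluster package to pick a k. The choice of method = "firstSEmax" and the smaller B and K.max values are assumptions made to keep the example quick.

```r
library(cluster)  # provides clusGap() and maxSE()

# Synthetic data: two blobs of 30 points each (NOT the mall data)
set.seed(21)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))

# nstart is passed through to kmeans(); B is the number of reference data sets
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 5, B = 20,
               nstart = 10, verbose = FALSE)

# gap$Tab has one row per k with columns logW, E.logW, gap and SE.sim
gap_table <- gap$Tab
print(round(gap_table, 3))

# Apply one of the selection rules offered by maxSE()
chosen_k <- maxSE(gap_table[, "gap"], gap_table[, "SE.sim"],
                  method = "firstSEmax")
print(chosen_k)
```

The same two steps could be applied to the stat_gap object created above to read off the value that the plot highlights.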

Choosing K Value

clusterK <- kmeans(customerData[,3:5],6,iter.max = 100,nstart = 50,algorithm = "Lloyd")
clusterK
## K-means clustering with 6 clusters of sizes 45, 22, 21, 38, 35, 39
## 
## Cluster means:
##        Age Annual.Income..k.. Spending.Score..1.100.
## 1 56.15556           53.37778               49.08889
## 2 25.27273           25.72727               79.36364
## 3 44.14286           25.14286               19.52381
## 4 27.00000           56.65789               49.13158
## 5 41.68571           88.22857               17.28571
## 6 32.69231           86.53846               82.12821
## 
## Clustering vector:
##   [1] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
##  [36] 2 3 2 3 2 1 2 1 4 3 2 1 4 4 4 1 4 4 1 1 1 1 1 4 1 1 4 1 1 1 4 1 1 4 4
##  [71] 1 1 1 1 1 4 1 4 4 1 1 4 1 1 4 1 1 4 4 1 1 4 1 4 4 4 1 4 1 4 4 1 1 4 1
## [106] 4 1 1 1 1 1 4 4 4 4 4 1 1 1 1 4 4 4 6 4 6 5 6 5 6 5 6 4 6 5 6 5 6 5 6
## [141] 5 6 4 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5
## [176] 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
## 
## Within cluster sum of squares by cluster:
## [1]  8062.133  4099.818  7732.381  7742.895 16690.857 13972.359
##  (between_SS / total_SS =  81.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Visualizing the Clustering Results using the First Two Principal Components

pclust <- prcomp(customerData[,3:5],scale. = FALSE)
summary(pclust)
## Importance of components:
##                            PC1     PC2     PC3
## Standard deviation     26.4625 26.1597 12.9317
## Proportion of Variance  0.4512  0.4410  0.1078
## Cumulative Proportion   0.4512  0.8922  1.0000
pclust$rotation[,1:2]
##                               PC1        PC2
## Age                     0.1889742 -0.1309652
## Annual.Income..k..     -0.5886410 -0.8083757
## Spending.Score..1.100. -0.7859965  0.5739136
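The "Proportion of Variance" row follows from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The sketch below redoes that arithmetic with the rounded standard deviations copied from the summary(pclust) output above, so the results should agree with the printed proportions only to about three decimal places.

```r
# Standard deviations copied from the summary(pclust) output above
sdev <- c(PC1 = 26.4625, PC2 = 26.1597, PC3 = 12.9317)

# Variance of each component, then its share of the total variance
variances <- sdev^2
prop_var <- variances / sum(variances)
cum_var <- cumsum(prop_var)

print(round(prop_var, 3))
print(round(cum_var, 3))
```

With the fitted object itself, the same calculation is pclust$sdev^2 / sum(pclust$sdev^2). The first two components together carry roughly 89% of the variance, which is why a two-dimensional plot of them is a reasonable summary.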

Model Visualization Using Annual Income And Spending Score

Cluster 6: this cluster represents the customers having a high annual income as well as a high annual spend (cluster means of about 86.5 and 82.1 in the clusterK output above).
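Which cluster combines high income with high spending can be read from the cluster means rather than from a plot. The sketch below hard-codes the means from the clusterK output under Choosing K Value and the overall means from summary(customerData). Treating "high" as "above the overall mean" is an assumption made for this illustration.

```r
# Cluster means copied from the clusterK output (income and spending score)
centers <- data.frame(
  cluster  = 1:6,
  income   = c(53.37778, 25.72727, 25.14286, 56.65789, 88.22857, 86.53846),
  spending = c(49.08889, 79.36364, 19.52381, 49.13158, 17.28571, 82.12821)
)

# Overall means copied from summary(customerData)
overall_income <- 60.56
overall_spending <- 50.20

# Clusters whose mean income AND mean spending score are both above average
high_both <- centers$cluster[centers$income > overall_income &
                             centers$spending > overall_spending]
print(high_both)
```

Cluster numbers are arbitrary labels that can change between runs, so this kind of check is worth repeating whenever the model is refitted.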

Model Visualization Using The First Two Principal Components

# Map each cluster label to a distinct rainbow color
kCols <- function(vec) {
  cols <- rainbow(length(unique(vec)))
  cols[as.numeric(as.factor(vec))]
}
digCluster <- clusterK$cluster
dignm <- as.character(digCluster)
# Plot the observations on the first two principal components, colored by cluster
plot(pclust$x[,1:2],col = kCols(digCluster),pch = 19,xlab = "PC1",ylab = "PC2")
legend("bottomleft",unique(dignm),fill = unique(kCols(digCluster)))

  • Clusters 4 and 1: these two clusters consist of customers with medium PC1 and medium PC2 scores.
  • Cluster 6: this cluster represents customers with a high PC2 score and a low PC1 score.
  • Cluster 5: this cluster contains customers with a medium PC1 score and a low PC2 score.
  • Cluster 3: this cluster comprises customers with a high PC1 score and a high PC2 score.
  • Cluster 2: this cluster comprises customers with a high PC2 score and a medium annual spend.