Customer segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify distinct segments of customers, which allows them to target their potential user base. In this machine learning project, we will use K-means clustering, a fundamental algorithm for clustering unlabeled data.
Customer segmentation is the process of dividing a customer base into groups of individuals who are similar in ways that are relevant to marketing, such as gender, age, interests, and spending habits. Companies that use customer segmentation assume that customers have different requirements and that each group needs a specific marketing effort to address it appropriately. Through the data they collect, companies can gain a deeper understanding of customer preferences and discover the segments that are likely to be most profitable. This way, they can plan their marketing more efficiently and reduce the risk to their investment. Segmentation depends on several key differentiators that divide customers into groups to be targeted: demographic, geographic, economic, and behavioral data all play a crucial role in determining how a company addresses each segment.
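# stringsAsFactors = TRUE keeps Gender as a factor (the default changed to FALSE in R 4.0.0)
customerData <- read.csv("mall.csv", stringsAsFactors = TRUE)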
head(customerData)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
tail(customerData)
## CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 195 195 Female 47 120 16
## 196 196 Female 35 120 79
## 197 197 Female 45 126 28
## 198 198 Male 32 126 74
## 199 199 Male 32 137 18
## 200 200 Male 30 137 83
str(customerData)
## 'data.frame': 200 obs. of 5 variables:
## $ CustomerID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 1 1 1 2 1 ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual.Income..k.. : int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...
summary(customerData)
## CustomerID Gender Age Annual.Income..k..
## Min. : 1.00 Female:112 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Male : 88 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending.Score..1.100.
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
sd(customerData$Age)
## [1] 13.96901
sd(customerData$Annual.Income..k..)
## [1] 26.26472
In this section, we will create a bar plot and a pie chart to show the gender distribution across our customerData dataset.
gender <- table(customerData$Gender)
barplot(gender,main = "Gender Comparison",xlab = "Gender",ylab = "Count",col =c("pink","lightblue"))
From the above bar plot, we observe that there are more female customers than male customers. Now, let us draw a pie chart to see the ratio of female to male customers.
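# Percentage of each gender, rounded to whole numbers
pct <- round(gender/sum(gender)*100)
# Build labels such as "Female 56%" from the table's own names
lbs <- paste0(names(gender), " ", pct, "%")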
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.6.1
pie3D(gender,labels = lbs,main="Pie Chart Depicting Ratio Of Female And Male",col = c("red","orange"))
From the above graph, we conclude that 56% of the customers in the dataset are female, whereas 44% are male.
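The percentages quoted above can be reproduced without any plotting. The following is a minimal sketch that rebuilds a gender vector from the counts reported by summary() (112 female, 88 male); the names gender_vec, gender_tab, gender_pct and gender_labels are hypothetical and are not part of the original code.

```r
# Hypothetical stand-in for customerData$Gender, rebuilt from the counts in summary()
gender_vec <- factor(c(rep("Female", 112), rep("Male", 88)))
gender_tab <- table(gender_vec)

# prop.table() turns the counts into proportions that sum to 1
gender_pct <- round(100 * prop.table(gender_tab))

# Labels in the same form as the pie chart labels
gender_labels <- paste0(names(gender_tab), " ", gender_pct, "%")
print(gender_labels)
```

Using prop.table() avoids dividing by sum() by hand and keeps the category names attached to the values.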
Let us plot a histogram to view the distribution of customer ages. We will first take a summary of the Age variable.
summary(customerData$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 28.75 36.00 38.85 49.00 70.00
hist(customerData$Age,col = "lightblue",main = "Histogram to show count of Age Class",xlab = "Age Class",ylab = "Frequency",labels = TRUE)
boxplot(customerData$Age,col = "#ff0066",main="Boxplot for Descriptive Analysis of Age")
From the above two visualizations, we conclude that the largest group of customers is aged between 30 and 35. The minimum customer age is 18, whereas the maximum is 70.
Next, we will create visualizations to analyze the annual income of the customers. We will plot a histogram and then examine the same data using a density plot.
summary(customerData$Annual.Income..k..)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 41.50 61.50 60.56 78.00 137.00
hist(customerData$Annual.Income..k..,col = "orange",main = "Histogram For Annual Income",xlab = "Annual Income Class",ylab = "Frequency",labels = TRUE)
plot(density(customerData$Annual.Income..k..),col="white",main = "Density Plot For Annual Income",xlab = "Annual Income Class",ylab = "Density")
polygon(density(customerData$Annual.Income..k..),
col = "#ccff66")
From the above descriptive analysis, we conclude that the minimum annual income of the customers is 15 and the maximum is 137 (the column name suggests the unit is thousands). In the histogram, the income class around 70 has the highest frequency count. The average income of all the customers is 60.56. In the kernel density plot displayed above, the annual income looks approximately normal, although the summary statistics point to a longer right tail.
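A density plot alone is a weak basis for calling a variable normal, so it can help to check numerically. The following is a minimal sketch on simulated, deliberately right-skewed data (not the mall data); income_like and the skewness() helper are hypothetical names introduced for illustration.

```r
set.seed(1)
# Hypothetical right-skewed sample standing in for an income column
income_like <- rgamma(200, shape = 5, rate = 1 / 12)

# Sample skewness: about 0 for a symmetric distribution,
# positive when the right tail is longer
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}
skew_value <- skewness(income_like)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_p <- shapiro.test(income_like)$p.value

print(skew_value)
print(sw_p)
```

On the real data the same two calls could be applied to customerData$Annual.Income..k.. to see how far the income distribution departs from a normal shape.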
summary(customerData$Spending.Score..1.100.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 50.00 50.20 73.00 99.00
boxplot(customerData$Spending.Score..1.100.,horizontal = TRUE,col = "lightblue",main="Boxplot For Descriptive Analysis Of Spending Score")
hist(customerData$Spending.Score..1.100.,main = "Histogram For Spending Score",xlab = "Spending Score Class",ylab = "Frequency",col = "#2475B0",labels = TRUE)
The minimum spending score is 1, the maximum is 99, and the average is 50.20. From the histogram, we conclude that the class between 40 and 50 contains more customers than any other spending-score class.
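The most frequent class can also be read off programmatically instead of by eye. This is a minimal sketch on randomly generated scores (not the mall data); scores, h, modal and modal_class are hypothetical names.

```r
set.seed(2)
# Hypothetical scores standing in for customerData$Spending.Score..1.100.
scores <- sample(1:99, 200, replace = TRUE)

# plot = FALSE returns the bin breaks and counts without drawing anything
h <- hist(scores, breaks = seq(0, 100, by = 10), plot = FALSE)

# Index of the bin with the highest count, and that bin's boundaries
modal <- which.max(h$counts)
modal_class <- c(lower = h$breaks[modal], upper = h$breaks[modal + 1])
print(modal_class)
```

The counts in h$counts are the same numbers that labels = TRUE prints above the bars of the histogram.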
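While using the k-means clustering algorithm, the first step is to specify the number of clusters (k) that we wish to produce in the final output. One common way to choose k is the elbow method: we compute the total intra-cluster sum of squares for k = 1 to 10 and look for the bend in the resulting curve.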
library(purrr)
set.seed(123)
# Total intra-cluster sum of squares for a given number of clusters k
iss <- function(k) {
  kmeans(customerData[,3:5],k,iter.max = 100,nstart = 100,algorithm = "Lloyd")$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values,iss)
plot(k.values,iss_values,type = "b",pch=19,frame=FALSE,xlab = "Number Of Clusters K",ylab = "Total Intra Clusters Sum Of Squares",col="#1287A5")
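To make the quantity on the y-axis concrete, the sketch below recomputes tot.withinss from its definition and compares it with the value kmeans() reports. It runs on simulated toy data rather than the mall data, and the names toy, fit, manual_wss and wss are hypothetical.

```r
set.seed(42)
# Hypothetical two-feature data with three loose groups
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 1), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 1), ncol = 2)
)

fit <- kmeans(toy, centers = 3, nstart = 25)

# Total within-cluster sum of squares from the definition: the squared
# Euclidean distance of every point to the centre of its assigned cluster
manual_wss <- sum((toy - fit$centers[fit$cluster, ])^2)
print(all.equal(manual_wss, fit$tot.withinss))

# The same quantity for k = 1..6; the curve typically flattens after the "elbow"
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
print(round(wss, 1))
```

Because adding clusters can only help each point get closer to a centre, the curve keeps falling as k grows; the elbow marks where the improvement becomes small.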
With the help of the average silhouette method, we can measure the quality of our clustering. The silhouette width indicates how well each data object fits within its cluster. A high average silhouette width means that we have good clustering.
library(cluster)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.1
library(grid)
k2 <- kmeans(customerData[,3:5],2,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s2 <- plot(silhouette(k2$cluster,dist(customerData[,3:5],"euclidean")),col = "#1287A5")
k3 <- kmeans(customerData[,3:5],3,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s3 <- plot(silhouette(k3$cluster,dist(customerData[,3:5],"euclidean")),col="#1287A5")
k4 <- kmeans(customerData[,3:5],4,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s4 <- plot(silhouette(k4$cluster,dist(customerData[,3:5],"euclidean")),col="#1287A5")
k5 <- kmeans(customerData[,3:5],5,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s5 <- plot(silhouette(k5$cluster,dist(customerData[,3:5],"euclidean")),col="#1287A5")
k6 <- kmeans(customerData[,3:5],6,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s6 <- plot(silhouette(k6$cluster,dist(customerData[,3:5],"euclidean")),col="#1287A5")
k7 <- kmeans(customerData[,3:5],7,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s7 <- plot(silhouette(k7$cluster,dist(customerData[,3:5],"euclidean")),col = "#1287A5")
k8 <- kmeans(customerData[,3:5],8,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s8 <- plot(silhouette(k8$cluster,dist(customerData[,3:5],"euclidean")),col = "#1287A5")
k9 <- kmeans(customerData[,3:5],9,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s9 <- plot(silhouette(k9$cluster,dist(customerData[,3:5],"euclidean")),col = "#1287A5")
k10 <- kmeans(customerData[,3:5],10,iter.max = 100,nstart = 50,algorithm = "Lloyd")
s10 <- plot(silhouette(k10$cluster,dist(customerData[,3:5],"euclidean")),col = "#1287A5")
library(NbClust)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(customerData[,3:5],kmeans,method = "silhouette")
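The per-k silhouette calls above can be condensed into a loop that returns one average width per k. The following is a minimal sketch on simulated toy data (not the mall data), assuming the cluster package is installed; toy, avg_sil, k_values, sil_values and best_k are hypothetical names.

```r
library(cluster)  # provides silhouette()

set.seed(7)
# Hypothetical toy data with three well-separated groups
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)
d <- dist(toy, method = "euclidean")

# Average silhouette width of a k-means solution with k clusters
avg_sil <- function(k) {
  km <- kmeans(toy, centers = k, iter.max = 100, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
}

k_values <- 2:6
sil_values <- sapply(k_values, avg_sil)

# The k with the highest average width is the candidate number of clusters
best_k <- k_values[which.max(sil_values)]
print(round(sil_values, 3))
print(best_k)
```

The silhouette is undefined for a single cluster, which is why the loop starts at k = 2.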
To apply the gap statistic method, we can use the clusGap function, which returns the gap statistic and its standard error for each number of clusters.
set.seed(125)
stat_gap <- clusGap(customerData[,3:5],FUN=kmeans,nstart=25,K.max = 10,B=50)
fviz_gap_stat(stat_gap)
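Besides reading the plot, the number of clusters can be picked from the clusGap result with maxSE() from the same package. This is a minimal sketch on simulated toy data (not the mall data) with a small B to keep it quick; toy, gap, gap_tab and k_hat are hypothetical names, and the selection rule shown is one of several that maxSE() offers.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(11)
# Hypothetical toy data with three well-separated groups
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

# B is the number of bootstrap reference data sets; nstart is passed on to kmeans
gap <- clusGap(toy, FUNcluster = kmeans, nstart = 25, K.max = 6, B = 20,
               verbose = FALSE)

# One row per k, with columns logW, E.logW, gap and SE.sim
gap_tab <- gap$Tab

# Smallest k whose gap is within one standard error of the next k's gap
k_hat <- maxSE(gap_tab[, "gap"], gap_tab[, "SE.sim"], method = "Tibs2001SEmax")
print(k_hat)
```

Different method values can suggest different k on the same table, so it is worth comparing the result with the elbow and silhouette plots.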
clusterK <- kmeans(customerData[,3:5],6,iter.max = 100,nstart = 50,algorithm = "Lloyd")
clusterK
## K-means clustering with 6 clusters of sizes 45, 22, 21, 38, 35, 39
##
## Cluster means:
## Age Annual.Income..k.. Spending.Score..1.100.
## 1 56.15556 53.37778 49.08889
## 2 25.27273 25.72727 79.36364
## 3 44.14286 25.14286 19.52381
## 4 27.00000 56.65789 49.13158
## 5 41.68571 88.22857 17.28571
## 6 32.69231 86.53846 82.12821
##
## Clustering vector:
## [1] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## [36] 2 3 2 3 2 1 2 1 4 3 2 1 4 4 4 1 4 4 1 1 1 1 1 4 1 1 4 1 1 1 4 1 1 4 4
## [71] 1 1 1 1 1 4 1 4 4 1 1 4 1 1 4 1 1 4 4 1 1 4 1 4 4 4 1 4 1 4 4 1 1 4 1
## [106] 4 1 1 1 1 1 4 4 4 4 4 1 1 1 1 4 4 4 6 4 6 5 6 5 6 5 6 4 6 5 6 5 6 5 6
## [141] 5 6 4 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5
## [176] 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
##
## Within cluster sum of squares by cluster:
## [1] 8062.133 4099.818 7732.381 7742.895 16690.857 13972.359
## (between_SS / total_SS = 81.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
pclust <- prcomp(customerData[,3:5],scale. = FALSE)
summary(pclust)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 26.4625 26.1597 12.9317
## Proportion of Variance 0.4512 0.4410 0.1078
## Cumulative Proportion 0.4512 0.8922 1.0000
pclust$rotation[,1:2]
## PC1 PC2
## Age 0.1889742 -0.1309652
## Annual.Income..k.. -0.5886410 -0.8083757
## Spending.Score..1.100. -0.7859965 0.5739136
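The "Proportion of Variance" row in the summary follows directly from the component standard deviations. The sketch below shows the calculation on simulated data (not the mall data) and contrasts it with a standardised PCA; toy, pca_raw, prop_var and pca_std are hypothetical names.

```r
set.seed(3)
# Hypothetical three-column data on different scales
toy <- data.frame(
  age    = rnorm(100, mean = 40, sd = 14),
  income = rnorm(100, mean = 60, sd = 26),
  score  = rnorm(100, mean = 50, sd = 26)
)

pca_raw <- prcomp(toy, scale. = FALSE)

# Each component's share of the variance is its sdev^2 over the total
prop_var <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
print(round(prop_var, 4))
print(summary(pca_raw)$importance["Proportion of Variance", ])

# With scale. = TRUE every column is standardised first, so a column with a
# larger spread no longer dominates; the variances then sum to the column count
pca_std <- prcomp(toy, scale. = TRUE)
print(sum(pca_std$sdev^2))
```

With scale. = FALSE, as in the code above, columns with a larger spread contribute more to the leading components, which is worth keeping in mind when reading the rotation matrix.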
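Cluster 6: in the output shown above, this cluster represents the customers having a high annual income (mean of about 86.5) as well as a high spending score (mean of about 82.1). Cluster numbering can change between runs because k-means starts from random centers.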
# Map each cluster label to a distinct rainbow colour
kCols <- function(vec) {
  cols <- rainbow(length(unique(vec)))
  cols[as.numeric(as.factor(vec))]
}
digCluster <- clusterK$cluster
dignm <- as.character(digCluster)
# Plot the customers on the first two principal components, coloured by cluster
plot(pclust$x[,1:2],col=kCols(digCluster),pch=19,xlab = "PC1",ylab = "PC2")
legend("bottomleft",unique(dignm),fill = unique(kCols(digCluster)))
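To describe the segments in words, it helps to tabulate the mean of each feature per cluster and compare it with the overall mean. This is a minimal sketch on simulated data with two obvious groups (not the mall data); toy, fit, profile and the segment labels are hypothetical and only illustrate the idea.

```r
set.seed(5)
# Hypothetical customers: one low income/low spend group, one high/high group
toy <- data.frame(
  income = c(rnorm(50, mean = 25, sd = 5), rnorm(50, mean = 85, sd = 5)),
  score  = c(rnorm(50, mean = 20, sd = 5), rnorm(50, mean = 80, sd = 5))
)

fit <- kmeans(toy, centers = 2, nstart = 25)

# Mean of every feature per cluster (same values as fit$centers), plus cluster size
profile <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
profile$size <- as.vector(table(fit$cluster))

# Simple illustrative labelling rule: compare each cluster with the overall means
profile$segment <- ifelse(
  profile$income > mean(toy$income) & profile$score > mean(toy$score),
  "high income, high spend",
  "other"
)
print(profile)
```

Labelling by the cluster means rather than by the cluster number keeps the description valid even when the numbering changes between runs.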