Customer segmentation is the process of splitting a customer base into groups of individuals who are similar in ways that matter for marketing, such as gender, age, interests, demographics, economic status, geography, behavioral patterns, and spending habits.
Customer segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify segments of customers with similar behavior and target the potential user base in each segment.
Here we aim for value-based segmentation, which differentiates customers by their economic value and groups customers with the same value level into segments that can be targeted distinctly.
UPLOADING DATA SHEET
library(readr)
Mall_Customers <- read_csv("~/KAMILIMU/Data Science/customer-segmentation-dataset/customer-segmentation-dataset/Mall_Customers.csv")
## Parsed with column specification:
## cols(
## CustomerID = col_double(),
## Gender = col_character(),
## Age = col_double(),
## `Annual Income (k$)` = col_double(),
## `Spending Score (1-100)` = col_double()
## )
#table(Mall_Customers$`Annual Income (k$)`)
#table(Mall_Customers$`Spending Score (1-100)`)
DATA SUMMARY
The data set describes mall customers and includes the following columns: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100).
View(Mall_Customers)
summary(Mall_Customers) #summary of the data set
## CustomerID Gender Age Annual Income (k$)
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## Spending Score (1-100)
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
any(is.na(Mall_Customers))#To find missing values
## [1] FALSE
FALSE indicates that there is no missing data.
Therefore no data cleaning was needed, and we move on to data visualization.
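A single TRUE/FALSE answer does not show where the gaps are when some exist. As a minimal sketch, assuming a small made-up data frame `toy` (not the mall data), the missing values can be counted per column and the complete rows kept:

```r
# Hypothetical toy data with two deliberately missing values (not the mall data)
toy <- data.frame(
  Age    = c(19, NA, 35, 42),
  Income = c(15, 16, NA, 18),
  Score  = c(39, 81, 6, 77)
)

any_missing <- any(is.na(toy))               # same check as above: is anything missing?
missing_per_column <- colSums(is.na(toy))    # number of NA values in each column
complete_rows <- toy[complete.cases(toy), ]  # rows without any missing value

print(any_missing)
print(missing_per_column)
print(nrow(complete_rows))
```

For the mall data the per-column counts would all be zero, which is consistent with the FALSE result above.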
Data profiling report (file:///C:/Users/Koi%20Anyango/Documents/report.html#raw-counts).
Histograms
library(ggplot2)
ggplot(Mall_Customers,aes(x= Age, fill=Gender))+geom_histogram(bins = 50) # Histogram of Age filling Gender
ggplot(Mall_Customers,aes(x= `Annual Income (k$)`,fill=Gender)) +geom_histogram(bins = 50) # Histogram of Annual income filling Gender
ggplot(Mall_Customers,aes(x= `Spending Score (1-100)`,fill=Gender)) +geom_histogram(bins=50) # Histogram of Spending Score filling Gender
**NOTING**
There is no customer aged 62.
Most customers are between 30 and 35 years old. The minimum customer age is 18, whereas the maximum is 70.
The minimum spending score is 1, the maximum is 99, and the average is 50.20. From the histogram, the spending-score class between 40 and 50 contains the most customers of all the classes.
Bar Plot
library(ggplot2)
ggplot(Mall_Customers,aes(x= Gender))+geom_bar ()
table(Mall_Customers$Gender)
##
## Female Male
## 112 88
**NOTING**
From the bar plot above, we observe that there are more female than male customers (112 versus 88).
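The same comparison can be expressed as shares. A small sketch using the counts copied from the `table()` output above (the variable names `counts` and `shares` are only illustrative):

```r
# Counts copied from the table(Mall_Customers$Gender) output above
counts <- c(Female = 112, Male = 88)

shares <- prop.table(counts)  # each count divided by the total of 200
print(round(100 * shares, 1)) # shares as percentages
```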
Frequency Polygon
library(ggplot2)
ggplot(Mall_Customers,aes(x= `Spending Score (1-100)`, col=Gender)) + geom_freqpoly(bins=50)
ggplot(Mall_Customers,aes(x= `Annual Income (k$)`, col=Gender)) + geom_freqpoly(bins=50)
#This is similar to the histograms.
Box plot
Violin Plot
A violin plot is similar to a box plot, except that it also shows the kernel density estimate of the data at different values. Violin plots typically include a marker for the median and a box indicating the interquartile range, as in standard box plots.
library(ggplot2)
p<-ggplot(Mall_Customers,aes(y= `Annual Income (k$)`, x= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1) # annual income by gender, with a box plot overlaid
p<-ggplot(Mall_Customers,aes(y= `Annual Income (k$)`, x= Age,fill= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1) #overlapping plots
## Warning: position_dodge requires non-overlapping x intervals
p<-ggplot(Mall_Customers,aes(y= `Spending Score (1-100)`, x= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1) # spending score by gender, with a box plot overlaid
p<-ggplot(Mall_Customers,aes(y= `Spending Score (1-100)`, x= Age,fill= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1)# overlapping plot
## Warning: position_dodge requires non-overlapping x intervals
ggplot(Mall_Customers,aes(y= `Spending Score (1-100)`, x= `Annual Income (k$)`))+geom_violin(trim=FALSE) # violin plot of Spending score and annual income
Correlation is a statistical measure that indicates the extent to which two or more variables move together. A positive correlation indicates that the variables increase or decrease together. A negative correlation indicates that if one variable increases, the other decreases, and vice versa.
Collinearity is a linear association between two predictors. Multicollinearity is a situation where two or more predictors are highly linearly related. In general, an absolute correlation coefficient of >0.7 among two or more predictors indicates the presence of multicollinearity.
Although correlation describes a bivariate linear relationship whereas multicollinearity is multivariate, a correlation matrix is often, if not always, a good indicator of multicollinearity and of the need for further investigation.
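As a rough sketch of how the 0.7 rule of thumb can be applied, the following computes a correlation matrix and lists the variable pairs whose absolute correlation exceeds the threshold. The data frame `toy` and its columns `a`, `b`, `c` are made up for illustration; `b` is built to be almost a linear function of `a`.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = 2 * a + rnorm(n, sd = 0.1),  # nearly a linear function of a
  c = rnorm(n)                     # generated independently of a and b
)

r <- cor(toy)  # Pearson correlation matrix

# Keep each pair once (upper triangle) and flag those above the 0.7 rule of thumb
idx <- which(abs(r) > 0.7 & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(flagged)
```

A flagged pair is only a prompt for a closer look; it does not by itself show that a model is affected.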
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(Mall_Customers)
## Warning in ggcorr(Mall_Customers): data in column(s) 'Gender' are not numeric
## and were ignored
ggcorr(Mall_Customers, label = TRUE, label_alpha = TRUE)
## Warning in ggcorr(Mall_Customers, label = TRUE, label_alpha = TRUE): data in
## column(s) 'Gender' are not numeric and were ignored
par(mfrow=c(2,2))
plot(Mall_Customers)
X<-Mall_Customers[,2:5]
library(GGally)
ggpairs(X)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Summing up K-means clustering –
• We specify the number of clusters, k, that we want to create.
• The algorithm selects k objects at random from the data set. These objects serve as the initial cluster centers (means).
• Each observation is assigned to its closest centroid, based on the Euclidean distance between the observation and the centroid.
• For each of the k clusters, the centroid is updated by computing the new mean of all data points in the cluster. The centroid of the k-th cluster is a vector of length p containing the means of all variables for the observations in that cluster, where p is the number of variables.
• The assignment and update steps are repeated, iteratively minimizing the total within-cluster sum of squares, until the assignments stop changing or the maximum number of iterations is reached. By default, R's kmeans() uses a maximum of 10 iterations.
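The steps above can be sketched in a few lines of base R. This is an illustration of the idea only, not the code behind `stats::kmeans()`; the function name `simple_kmeans`, its arguments, and the toy data are assumptions made for this example.

```r
# Minimal Lloyd-style k-means, for illustration only (use stats::kmeans in practice)
simple_kmeans <- function(x, k, iter_max = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # k random observations as initial centers
  cluster <- rep(0L, nrow(x))
  for (i in seq_len(iter_max)) {
    # squared Euclidean distance from every observation to every centroid (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of the nearest centroid
    if (all(new_cluster == cluster)) break             # assignments stopped changing
    cluster <- new_cluster
    # each centroid becomes the mean of its observations (a vector of length p)
    for (j in seq_len(k)) {
      if (any(cluster == j)) centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
# Hypothetical toy data: two groups of 20 points in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

Because the starting centers are random, a single run can end in a poor local optimum, which is why `kmeans()` offers the `nstart` argument to try several starts.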
Determining optimal clusters: when working with clusters, you need to specify the number of clusters to use, and you would like that number to be optimal. Three popular methods help determine it: the elbow plot, the average silhouette method, and the gap statistic.
ELBOW PLOT
The main goal of cluster partitioning methods like k-means is to define the clusters such that the total intra-cluster variation is minimized.
$$\text{minimize} \; \sum_{k=1}^{K} W(C_k)$$
where $C_k$ is the k-th cluster, $K$ is the number of clusters, and $W(C_k)$ denotes the intra-cluster variation (for k-means, the sum of squared distances between the data points in a cluster and its centroid). The total intra-cluster variation measures the compactness of the clustering. We can then determine the optimal number of clusters as follows –
1. Run the clustering algorithm for several values of k, for example by varying k from 1 to 10.
2. For each k, calculate the total intra-cluster sum of squares (iss).
3. Plot iss against the number of clusters k.
4. The location of a bend (knee) in the plot indicates the appropriate number of clusters.
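For k-means, the quantity computed in step 2 is usually the within-cluster sum of squared Euclidean distances to the centroid. Writing $\mu_k$ for the centroid of cluster $C_k$ (a symbol introduced here, not used in the text above), it can be expressed as:

```latex
W(C_k) = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\text{iss} = \sum_{k=1}^{K} W(C_k)
```

In R, `kmeans()` reports the per-cluster terms as `withinss` and their sum as `tot.withinss`.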
library(purrr)
set.seed(123)
# function to calculate the total intra-cluster sum of squares for a given k
iss <- function(k) {
  kmeans(Mall_Customers[, 3:5], k, iter.max = 100, nstart = 100, algorithm = "Lloyd")$tot.withinss
}
k.value<-1:10
iss_value<- map_dbl(k.value, iss)
plot(k.value, iss_value,type="b", pch = 19, frame = FALSE,xlab="Number of clusters K",ylab="Total intra-clusters sum of squares")
AVERAGE SILHOUETTE METHOD
The average silhouette method measures the quality of a clustering. The silhouette width of an observation indicates how well it lies within its own cluster compared with the nearest neighboring cluster, and a high average silhouette width indicates a good clustering. The method computes the average silhouette width of the observations for different values of k; the optimal number of clusters is the k that maximizes this average over the range of candidate values.
We compute the average silhouette width for each k up to 10 and plot it against k to find the k with the highest average width. The silhouette is not defined for a single cluster, so the comparison is only meaningful from k = 2.
library(NbClust)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(Mall_Customers[,3:5], kmeans, method = "silhouette")
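To make the quantity behind the plot concrete, the average silhouette width can also be computed directly with `silhouette()` from the cluster package. This is a sketch on made-up data (`toy`), with a hypothetical helper `avg_sil`; it is not the code that `fviz_nbclust()` runs internally.

```r
library(cluster)  # provides silhouette()

set.seed(7)
# Hypothetical toy data: two well-separated groups of 25 points each
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 8), ncol = 2))

avg_sil <- function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  sil <- silhouette(fit$cluster, dist(toy))  # one row per observation
  mean(sil[, "sil_width"])                   # average silhouette width for this k
}

k_values <- 2:5  # the silhouette is undefined for a single cluster
widths <- sapply(k_values, avg_sil)
best_k <- k_values[which.max(widths)]

print(data.frame(k = k_values, avg_width = widths))
print(best_k)
```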
GAP STATISTIC METHOD
This method can be applied to any clustering method, such as k-means or hierarchical clustering. The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The reference data sets are produced by Monte Carlo simulation: for each variable in the data set, we compute its range [min(x), max(x)] and draw values uniformly from the lower bound to the upper bound.
To compute the gap statistic, we can use the clusGap function from the cluster package, which returns the gap statistic and its standard error for each k.
library(NbClust)
library(cluster) # provides clusGap()
library(factoextra)
stat_gap <- clusGap(Mall_Customers[,3:5], FUN = kmeans, nstart = 25,K.max = 10, B = 50)
fviz_gap_stat(stat_gap)
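The object returned by `clusGap()` can also be inspected without plotting. The sketch below uses made-up data (`toy`) and smaller settings (`K.max = 5`, `B = 20`) chosen only to keep the example fast. It checks that the reported gap is the expected log(W) under the reference distribution minus the observed log(W), and applies one common selection rule through `maxSE()`.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(11)
# Hypothetical toy data: two well-separated groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 8), ncol = 2))

gap <- clusGap(toy, FUNcluster = kmeans, nstart = 10, K.max = 5, B = 20, verbose = FALSE)
tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim

# gap(k) = E*[log W_k] - log W_k
gap_check <- isTRUE(all.equal(unname(tab[, "gap"]),
                              unname(tab[, "E.logW"] - tab[, "logW"])))

# Rule from Tibshirani et al. (2001): the smallest k whose gap is within
# one standard error of the gap at k + 1
k_hat <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "Tibs2001SEmax")

print(round(tab, 3))
print(k_hat)
```

Other rules are available through the `method` argument of `maxSE()`, and they do not always agree, so the choice of rule is itself a judgment call.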
We take k = 6 based on the three analyses above.
PLOTTING THE 6 CLUSTERS
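# k6 is used by the plots below but is never defined in the text; this call is an assumed
# reconstruction that reuses the elbow-plot settings. Cluster numbers depend on the random
# start and may not match the numbering in the ANALYSIS section.
k6 <- kmeans(Mall_Customers[, 3:5], 6, iter.max = 100, nstart = 100, algorithm = "Lloyd")
pcclust <- prcomp(Mall_Customers[, 3:5], scale = FALSE) # principal component analysis
summary(pcclust)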
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 26.4625 26.1597 12.9317
## Proportion of Variance 0.4512 0.4410 0.1078
## Cumulative Proportion 0.4512 0.8922 1.0000
pcclust$rotation[,1:2]
## PC1 PC2
## Age 0.1889742 -0.1309652
## Annual Income (k$) -0.5886410 -0.8083757
## Spending Score (1-100) -0.7859965 0.5739136
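The "Proportion of Variance" row in the summary follows directly from the standard deviations: each component's variance divided by the total variance. A small sketch using the values copied from the output above (the results can differ from the printed row in the last digit because the inputs are already rounded):

```r
# Standard deviations copied from the summary(pcclust) output above
sdev <- c(PC1 = 26.4625, PC2 = 26.1597, PC3 = 12.9317)

variance <- sdev^2                      # variance captured by each component
proportion <- variance / sum(variance)  # "Proportion of Variance" row
cumulative <- cumsum(proportion)        # "Cumulative Proportion" row

print(round(rbind(proportion, cumulative), 3))
```

The first two components together account for roughly 89% of the variance, so a two-dimensional view retains most of the structure in the three variables.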
library(ggplot2)
# annual income vs spending score clusters
ggplot(Mall_Customers, aes(x =`Annual Income (k$)`, y = `Spending Score (1-100)`)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
scale_color_discrete(name=" ",breaks=c("1", "2", "3", "4", "5","6"), labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")
#spending score vs age clusters
ggplot(Mall_Customers, aes(x =`Spending Score (1-100)`, y =Age)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
scale_color_discrete(name=" ",breaks=c("1", "2", "3", "4", "5","6"),labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")
#annual income vs age clusters
ggplot(Mall_Customers, aes(x =`Annual Income (k$)`, y =Age)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
scale_color_discrete(name=" ",breaks=c("1", "2", "3", "4", "5","6"),labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")
ANALYSIS
Cluster 1 – This cluster represents customers with a high annual income and a high spending score, aged between 28 and 40.
Cluster 2 – This cluster represents customers with an average annual income (above 25 and below 60), an average spending score (above 25 and below 60), and ages above 40.
Cluster 3 – This cluster denotes customers with a low annual income and a low spending score, with ages spread throughout the spectrum.
Cluster 4 – This cluster denotes customers with a high annual income and a low spending score, with ages spread throughout the spectrum.
Cluster 5 – This cluster represents customers with a low annual income but a high spending score.
Cluster 6 – This cluster represents customers with an average annual income (above 30 and below 80), an average spending score (above 25 and below 62), and ages of 40 and below.
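Descriptions like these can be backed by a per-cluster summary table. The sketch below shows one way to build it with `aggregate()`; it uses randomly generated stand-in data (`toy`) and three clusters, so its numbers say nothing about the mall customers.

```r
set.seed(3)
# Hypothetical stand-in for the mall data (values are random, not real customers)
toy <- data.frame(
  Age    = round(runif(60, 18, 70)),
  Income = round(runif(60, 15, 137)),
  Score  = round(runif(60, 1, 99))
)
fit <- kmeans(toy, centers = 3, nstart = 25)

# Mean of every variable within each cluster, plus the cluster sizes
profile <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
profile$n <- as.vector(table(fit$cluster))
print(profile)
```

Applied to the real fit, the same table would give the average age, income, and spending score of each of the six segments.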
MAPPING CLUSTERS BACK TO THE DATA
We take the 6 clusters and map them back to the data set.
o=order(k6$cluster)
data.frame(Mall_Customers$CustomerID[o],k6$cluster[o])# Using only customer ID for easy tracking
## Mall_Customers.CustomerID.o. k6.cluster.o.
## 1 124 1
## 2 126 1
## 3 128 1
## 4 130 1
## 5 132 1
## 6 134 1
## 7 136 1
## 8 138 1
## 9 140 1
## 10 142 1
## 11 144 1
## 12 146 1
## 13 148 1
## 14 150 1
## 15 152 1
## 16 154 1
## 17 156 1
## 18 158 1
## 19 160 1
## 20 162 1
## 21 164 1
## 22 166 1
## 23 168 1
## 24 170 1
## 25 172 1
## 26 174 1
## 27 176 1
## 28 178 1
## 29 180 1
## 30 182 1
## 31 184 1
## 32 186 1
## 33 188 1
## 34 190 1
## 35 192 1
## 36 194 1
## 37 196 1
## 38 198 1
## 39 200 1
## 40 41 2
## 41 43 2
## 42 47 2
## 43 51 2
## 44 54 2
## 45 55 2
## 46 56 2
## 47 57 2
## 48 58 2
## 49 60 2
## 50 61 2
## 51 63 2
## 52 64 2
## 53 65 2
## 54 67 2
## 55 68 2
## 56 71 2
## 57 72 2
## 58 73 2
## 59 74 2
## 60 75 2
## 61 77 2
## 62 80 2
## 63 81 2
## 64 83 2
## 65 84 2
## 66 86 2
## 67 87 2
## 68 90 2
## 69 91 2
## 70 93 2
## 71 97 2
## 72 99 2
## 73 102 2
## 74 103 2
## 75 105 2
## 76 107 2
## 77 108 2
## 78 109 2
## 79 110 2
## 80 111 2
## 81 117 2
## 82 118 2
## 83 119 2
## 84 120 2
## 85 1 3
## 86 3 3
## 87 5 3
## 88 7 3
## 89 9 3
## 90 11 3
## 91 13 3
## 92 15 3
## 93 17 3
## 94 19 3
## 95 21 3
## 96 23 3
## 97 25 3
## 98 27 3
## 99 29 3
## 100 31 3
## 101 33 3
## 102 35 3
## 103 37 3
## 104 39 3
## 105 45 3
## 106 127 4
## 107 129 4
## 108 131 4
## 109 135 4
## 110 137 4
## 111 139 4
## 112 141 4
## 113 145 4
## 114 147 4
## 115 149 4
## 116 151 4
## 117 153 4
## 118 155 4
## 119 157 4
## 120 159 4
## 121 161 4
## 122 163 4
## 123 165 4
## 124 167 4
## 125 169 4
## 126 171 4
## 127 173 4
## 128 175 4
## 129 177 4
## 130 179 4
## 131 181 4
## 132 183 4
## 133 185 4
## 134 187 4
## 135 189 4
## 136 191 4
## 137 193 4
## 138 195 4
## 139 197 4
## 140 199 4
## 141 2 5
## 142 4 5
## 143 6 5
## 144 8 5
## 145 10 5
## 146 12 5
## 147 14 5
## 148 16 5
## 149 18 5
## 150 20 5
## 151 22 5
## 152 24 5
## 153 26 5
## 154 28 5
## 155 30 5
## 156 32 5
## 157 34 5
## 158 36 5
## 159 38 5
## 160 40 5
## 161 42 5
## 162 46 5
## 163 44 6
## 164 48 6
## 165 49 6
## 166 50 6
## 167 52 6
## 168 53 6
## 169 59 6
## 170 62 6
## 171 66 6
## 172 69 6
## 173 70 6
## 174 76 6
## 175 78 6
## 176 79 6
## 177 82 6
## 178 85 6
## 179 88 6
## 180 89 6
## 181 92 6
## 182 94 6
## 183 95 6
## 184 96 6
## 185 98 6
## 186 100 6
## 187 101 6
## 188 104 6
## 189 106 6
## 190 112 6
## 191 113 6
## 192 114 6
## 193 115 6
## 194 116 6
## 195 121 6
## 196 122 6
## 197 123 6
## 198 125 6
## 199 133 6
## 200 143 6
# mapping using all the other columns
x <- data.frame(Mall_Customers$CustomerID[o], Mall_Customers$Gender[o], Mall_Customers$Age[o],
                Mall_Customers$`Annual Income (k$)`[o], Mall_Customers$`Spending Score (1-100)`[o],
                k6$cluster[o])
View(x) # Dataset with clusters
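An alternative that keeps the original column names is to add the cluster label as a new column and then sort by it. A minimal sketch on made-up data (`toy`, with three obvious groups); the column name `Cluster` is an assumption for illustration:

```r
set.seed(5)
# Hypothetical toy data with three clearly separated groups
toy <- data.frame(
  CustomerID = 1:12,
  Income     = c(15, 16, 17, 18, 60, 61, 62, 63, 120, 121, 122, 123),
  Score      = c(80, 82, 79, 81, 50, 49, 51, 52, 15, 14, 16, 13)
)
# CustomerID is an identifier, so it is left out of the clustering
fit <- kmeans(toy[, c("Income", "Score")], centers = 3, nstart = 25)

labelled <- toy
labelled$Cluster <- fit$cluster                  # one label per row, in row order
labelled <- labelled[order(labelled$Cluster), ]  # group rows by cluster for reading
print(labelled)
```

Since `kmeans()` returns labels in the same row order as its input, the assignment needs no join or reordering.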
CONCLUSION
Value-based segmentation differentiates customers by their economic value, grouping customers with the same value level into segments that can be distinctly targeted. This is important for targeted marketing and sales maximization.
With the given data set, Mall Customers, the spending score and annual income columns are used for the value-based segmentation. Age is also an interesting factor to look at. Therefore, the six clusters are analysed along these three factors.