CUSTOMER SEGMENTATION USING R

INTRODUCTION

Customer segmentation is the process of splitting a customer base into groups of individuals who are similar in ways that matter for marketing, such as gender, age, interests, demographics, economic status, geography, behavioral patterns and spending habits.

Customer segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify segments of customers with similar behavior and target each segment of the potential user base accordingly.

Here we aim for value-based segmentation, which differentiates customers by their economic value and groups customers with the same value level into segments that can be targeted distinctly.

UPLOADING DATA SHEET

library(readr)
Mall_Customers <- read_csv("~/KAMILIMU/Data Science/customer-segmentation-dataset/customer-segmentation-dataset/Mall_Customers.csv")
## Parsed with column specification:
## cols(
##   CustomerID = col_double(),
##   Gender = col_character(),
##   Age = col_double(),
##   `Annual Income (k$)` = col_double(),
##   `Spending Score (1-100)` = col_double()
## )
#table(Mall_Customers$`Annual Income (k$)`)
#table(Mall_Customers$`Spending Score (1-100)`)

DATA SUMMARY

The data set describes mall customers and includes the following columns: CustomerID, Gender, Age, Annual Income (k$) and Spending Score (1-100).
View(Mall_Customers)
summary(Mall_Customers) #summary of the data set 
##    CustomerID        Gender               Age        Annual Income (k$)
##  Min.   :  1.00   Length:200         Min.   :18.00   Min.   : 15.00    
##  1st Qu.: 50.75   Class :character   1st Qu.:28.75   1st Qu.: 41.50    
##  Median :100.50   Mode  :character   Median :36.00   Median : 61.50    
##  Mean   :100.50                      Mean   :38.85   Mean   : 60.56    
##  3rd Qu.:150.25                      3rd Qu.:49.00   3rd Qu.: 78.00    
##  Max.   :200.00                      Max.   :70.00   Max.   :137.00    
##  Spending Score (1-100)
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00
any(is.na(Mall_Customers)) # check for missing values
## [1] FALSE
FALSE indicates that there is no missing data.

Therefore, no data cleaning was needed, and we move on to data visualization.
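
A single TRUE/FALSE does not say where missing values sit when there are some. The sketch below shows a per-column count on a small made-up data frame (`toy` is hypothetical and only stands in for Mall_Customers); it is one possible follow-up check, not part of the original analysis.

```r
# Hypothetical toy data standing in for Mall_Customers
toy <- data.frame(
  Age    = c(19, 21, NA, 35),
  Income = c(15, 15, 16, NA),
  Score  = c(39, 81, 6, 77)
)

any(is.na(toy))                            # TRUE: something is missing somewhere
na_per_column <- colSums(is.na(toy))       # number of missing values in each column
complete_rows <- sum(complete.cases(toy))  # rows without any missing value
na_per_column
complete_rows
```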
Data profiling report (file:///C:/Users/Koi%20Anyango/Documents/report.html#raw-counts).

DATA VISUALIZATION

PLOTS

Histograms

library(ggplot2)
ggplot(Mall_Customers,aes(x= Age, fill=Gender))+geom_histogram(bins = 50) # Histogram of Age filling Gender

ggplot(Mall_Customers,aes(x= `Annual Income (k$)`,fill=Gender)) +geom_histogram(bins = 50) # Histogram of Annual income filling Gender

ggplot(Mall_Customers,aes(x= `Spending Score (1-100)`,fill=Gender)) +geom_histogram(bins=50) # Histogram of Spending Score filling Gender

**NOTING**
There is no customer aged 62.
Most customers are between 30 and 35 years old. The minimum age of customers is 18, whereas the maximum age is 70.

The minimum spending score is 1, the maximum is 99 and the average is 50.20. From the histogram, we conclude that the class between 40 and 50 is the most common spending score range.

Bar Plot

library(ggplot2)
ggplot(Mall_Customers,aes(x= Gender))+geom_bar ()

table(Mall_Customers$Gender)
## 
## Female   Male 
##    112     88
**NOTING**
From the above bar plot, we observe that there are more female customers (112) than male customers (88).

Frequency Polygon

library(ggplot2)
ggplot(Mall_Customers,aes(x= `Spending Score (1-100)`, col=Gender)) + geom_freqpoly(bins=50)

ggplot(Mall_Customers,aes(x= `Annual Income (k$)`, col=Gender)) + geom_freqpoly(bins=50)

#This is similar to the histograms. 

Box plot

Violin Plot

The violin plot is similar to the box plot, except that it also shows the kernel probability density of the data at different values. Violin plots typically include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots.
library(ggplot2)
p<-ggplot(Mall_Customers,aes(y= `Annual Income (k$)`, x= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1) # annual income by gender
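
To make the two ingredients of a violin plot concrete, the following sketch computes them directly in base R: a kernel density estimate (the outline of the violin) and the quartiles (the inner box). The `income` vector is simulated for illustration and is not the mall data.

```r
set.seed(1)
# Made-up income values with two bumps, for illustration only
income <- c(rnorm(60, mean = 50, sd = 10), rnorm(40, mean = 90, sd = 8))

dens      <- density(income)                     # kernel density estimate: the violin outline
quartiles <- quantile(income, c(0.25, 0.5, 0.75)) # Q1, median and Q3: the inner box
iqr       <- IQR(income)                         # height of the box

# A density curve encloses an area of roughly 1 over its evaluation grid
area <- sum(dens$y) * diff(dens$x[1:2])
quartiles
```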

p<-ggplot(Mall_Customers,aes(y= `Annual Income (k$)`, x= Age,fill= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1) #overlapping plots
## Warning: position_dodge requires non-overlapping x intervals

p<-ggplot(Mall_Customers,aes(y= `Spending Score (1-100)`, x= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1) # spending score by gender

p<-ggplot(Mall_Customers,aes(y= `Spending Score (1-100)`, x= Age,fill= Gender))+geom_violin(trim=FALSE)
p + geom_boxplot(width=0.1)# overlapping plot
## Warning: position_dodge requires non-overlapping x intervals

ggplot(Mall_Customers,aes(y= `Spending Score (1-100)`, x= `Annual Income (k$)`))+geom_violin(trim=FALSE) # violin plot of Spending score and annual income

CORRELATION AND MULTI-COLLINEARITY

Correlation is a statistical measure that indicates the extent to which two or more variables move together. A positive correlation indicates that the variables increase or decrease together. A negative correlation indicates that if one variable increases, the other decreases, and vice versa.

Collinearity is a linear association between two predictors. Multicollinearity is a situation where two or more predictors are highly linearly related. In general, an absolute correlation coefficient of >0.7 among two or more predictors indicates the presence of multicollinearity.

Although correlation describes a bivariate linear relationship whereas multicollinearity is multivariate, a correlation matrix is often, though not always, a good indicator of multicollinearity and can signal the need for further investigation.
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(Mall_Customers)
## Warning in ggcorr(Mall_Customers): data in column(s) 'Gender' are not numeric
## and were ignored

ggcorr(Mall_Customers, label = TRUE, label_alpha = TRUE)
## Warning in ggcorr(Mall_Customers, label = TRUE, label_alpha = TRUE): data in
## column(s) 'Gender' are not numeric and were ignored
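
The 0.7 rule of thumb mentioned earlier can also be checked numerically rather than read off a plot. The sketch below uses simulated data in which one column (`income_eur`, a hypothetical name) is nearly a copy of another, and lists every pair whose absolute Pearson correlation exceeds the threshold.

```r
set.seed(42)
n <- 200
# Simulated predictors; income_eur is deliberately almost a rescaled copy of income
toy <- data.frame(
  age    = rnorm(n, 40, 12),
  income = rnorm(n, 60, 25)
)
toy$income_eur <- toy$income * 0.9 + rnorm(n, 0, 2)
toy$score      <- runif(n, 1, 100)

cor_mat <- cor(toy)  # Pearson correlation matrix of the numeric columns

# Keep each pair once (upper triangle) and flag |r| > 0.7
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
flagged
```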

par(mfrow=c(2,2))
plot(Mall_Customers)

X<-Mall_Customers[,2:5]
library(GGally)
ggpairs(X)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

K-MEANS ALGORITHM

Summing up k-means clustering –
•   We specify the number of clusters k that we need to create.
•   The algorithm selects k objects at random from the dataset. These objects are the initial cluster centers, or means.
•   Each remaining observation is assigned to its closest centroid. We base this assignment on the Euclidean distance between the object and the centroid.
•   For each of the k clusters, the centroid is updated by calculating the new mean of all the data points in the cluster. The centroid of the k-th cluster is a vector of length p containing the means of all variables for the observations in that cluster, where p is the number of variables.
•   The assignment and update steps are repeated to minimize the total within-cluster sum of squares, until the assignments stop changing or the maximum number of iterations is reached. R uses a default of 10 for the maximum number of iterations.

Determining Optimal Clusters: When working with clusters, you need to specify the number of clusters to use, and ideally you want the optimal number. Three popular methods help determine it –
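
The steps listed above can be sketched as a few lines of base R. This is a minimal illustration of the assign-and-update loop, assuming plain Euclidean distance; `simple_kmeans` is a hypothetical helper written for this note and is not how `stats::kmeans` is implemented internally.

```r
# Minimal Lloyd-style k-means (illustrative helper, not stats::kmeans)
simple_kmeans <- function(x, k, max_iter = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # pick k random observations as centers
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every point to every centroid (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")  # index of the nearest centroid
    if (all(new_cluster == cluster)) break              # assignments stopped changing
    cluster <- new_cluster
    # each centroid becomes the mean of its current members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Usage on two simulated, well-separated groups of 20 points each
set.seed(1)
x <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
           matrix(rnorm(40, 5, 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
```

In practice `kmeans()` with several random starts (`nstart`) is preferable, because a single random initialization can settle in a poor local optimum on less clear-cut data.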

ELBOW PLOT

The main goal behind cluster partitioning methods like k-means is to define the clusters such that the intra-cluster variation stays minimum.

$$\min \sum_{k=1}^{K} W(C_k)$$

where $C_k$ represents the k-th cluster and $W(C_k)$ denotes its intra-cluster variation (for k-means, the sum of squared distances between the data points of the cluster and its centroid). By measuring the total intra-cluster variation, one can evaluate the compactness of the clustering. We can then proceed to define the optimal clusters as follows –
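
As a small check of what the intra-cluster variation means for k-means, the sketch below computes it by hand on simulated data, taking W for each cluster as the sum of squared Euclidean distances to the cluster mean, and compares the result with the `withinss` and `tot.withinss` values that `kmeans()` reports.

```r
set.seed(7)
# Two simulated groups of 30 points each (not the mall data)
x <- rbind(matrix(rnorm(60, 0, 1), ncol = 2),
           matrix(rnorm(60, 4, 1), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# W for each cluster: sum of squared distances from its members to their mean
w_ck <- sapply(1:2, function(k) {
  members <- x[fit$cluster == k, , drop = FALSE]
  sum(sweep(members, 2, colMeans(members))^2)
})
total_w <- sum(w_ck)  # the quantity k-means tries to make small

all.equal(w_ck, fit$withinss)          # per-cluster values agree
all.equal(total_w, fit$tot.withinss)   # and so does the total
```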

1.  Run the clustering algorithm for several values of k, for example by varying k from 1 to 10 clusters.
2.  For each k, calculate the total intra-cluster sum of squares (iss).
3.  Plot iss against the number of clusters k.
4.  In the plot, the location of a bend or knee indicates the optimum number of clusters.
library(purrr)
set.seed(123)
#function to calculate total intra-cluster sum of square
iss<-function(k)
{kmeans(Mall_Customers[,3:5],k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss}
k.value<-1:10
iss_value<- map_dbl(k.value, iss)
plot(k.value, iss_value,type="b", pch = 19, frame = FALSE,xlab="Number of clusters K",ylab="Total intra-clusters sum of squares")

AVERAGE SILHOUETTE METHOD

With the help of the average silhouette method, we can measure the quality of our clustering. It indicates how well each data object lies within its cluster. A high average silhouette width means that we have good clustering. The average silhouette method calculates the mean silhouette width of the observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values for k.

We compute the average silhouette width (which compares each observation's distance to its own cluster with its distance to the nearest neighboring cluster) for k up to 10, and plot it to find the k with the highest average width. The silhouette is undefined for a single cluster, so the comparison effectively starts at k = 2.
library(NbClust)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(Mall_Customers[,3:5], kmeans, method = "silhouette")
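
For readers who want to see what the silhouette plot is averaging, here is a from-scratch sketch in base R. `avg_silhouette` is a hypothetical helper written for this note; it assumes Euclidean distance and the usual definition s = (b - a) / max(a, b), where a is the mean distance to the observation's own cluster and b the mean distance to the nearest other cluster. The data are simulated with three clear groups.

```r
# Average silhouette width from first principles (illustrative helper)
avg_silhouette <- function(x, cluster) {
  d   <- as.matrix(dist(x))          # pairwise Euclidean distances
  ids <- sort(unique(cluster))
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)     # singleton clusters score 0 by convention
    a <- sum(d[i, own]) / (sum(own) - 1)   # mean distance to own cluster, self excluded
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(d[i, cluster == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

set.seed(3)
# Three simulated groups of 30 points each
x <- rbind(matrix(rnorm(60, 0, 0.5), ncol = 2),
           matrix(rnorm(60, 4, 0.5), ncol = 2),
           matrix(rnorm(60, 8, 0.5), ncol = 2))
k_values <- 2:6
widths <- sapply(k_values, function(k) avg_silhouette(x, kmeans(x, k, nstart = 25)$cluster))
best_k <- k_values[which.max(widths)]
best_k
```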

GAP STATISTIC METHOD

We can apply this method to any clustering method, such as k-means or hierarchical clustering. Using the gap statistic, one compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The reference data sets are produced with Monte Carlo simulation: for each variable in the dataset, we take the range between its minimum and maximum and generate values uniformly over that interval.
To compute the gap statistic, we can use the clusGap function, which returns the gap statistic as well as its standard error for each k.
library(cluster) # clusGap() is provided by the cluster package
library(factoextra)
stat_gap <- clusGap(Mall_Customers[,3:5], FUN = kmeans, nstart = 25,K.max = 10, B = 50)
fviz_gap_stat(stat_gap)
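
The idea behind the gap statistic can be sketched without any extra packages. The helper below (`gap_sketch`, a hypothetical name) is a simplified illustration under stated assumptions: the reference data are drawn uniformly over each variable's observed range, the gap is the mean log within-cluster sum of squares of the reference sets minus that of the observed data, and the standard-error rule that `clusGap` supports is left out.

```r
# Simplified gap statistic (illustrative helper; no standard errors)
gap_sketch <- function(x, k_max = 5, B = 20) {
  x <- as.matrix(x)
  log_w <- function(data, k) log(kmeans(data, k, nstart = 10)$tot.withinss)
  # one reference set: each column drawn uniformly between its observed min and max
  draw_reference <- function() {
    apply(x, 2, function(col) runif(nrow(x), min(col), max(col)))
  }
  sapply(seq_len(k_max), function(k) {
    ref <- replicate(B, log_w(draw_reference(), k))  # B Monte Carlo reference fits
    mean(ref) - log_w(x, k)                          # gap for this k
  })
}

set.seed(11)
# Two simulated, well-separated groups of 30 points each
x <- rbind(matrix(rnorm(60, 0, 0.5), ncol = 2),
           matrix(rnorm(60, 5, 0.5), ncol = 2))
gaps <- gap_sketch(x)
round(gaps, 2)
```

On data like this the gap should rise sharply from k = 1 to k = 2, which is the kind of jump to look for in the `fviz_gap_stat` plot.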

We take k = 6 based on the three analyses above.

PLOTTING THE 6 CLUSTERS

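# k6 is used in the plots below but its definition is missing from this listing.
# The call here is an assumption that mirrors the elbow-plot settings; cluster numbers can differ between runs.
set.seed(123)
k6 <- kmeans(Mall_Customers[,3:5], 6, iter.max=100, nstart=100, algorithm="Lloyd")
pcclust <- prcomp(Mall_Customers[,3:5], scale=FALSE) # principal component analysis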
summary(pcclust)
## Importance of components:
##                            PC1     PC2     PC3
## Standard deviation     26.4625 26.1597 12.9317
## Proportion of Variance  0.4512  0.4410  0.1078
## Cumulative Proportion   0.4512  0.8922  1.0000
pcclust$rotation[,1:2]
##                               PC1        PC2
## Age                     0.1889742 -0.1309652
## Annual Income (k$)     -0.5886410 -0.8083757
## Spending Score (1-100) -0.7859965  0.5739136
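
The "Proportion of Variance" row in the PCA summary follows directly from the standard deviations: square each one and divide by the total. The short sketch below redoes that arithmetic with the three values printed in the summary; small differences in the last digit come from the rounding of the printed standard deviations.

```r
# Standard deviations copied from the summary(pcclust) output
sdev <- c(PC1 = 26.4625, PC2 = 26.1597, PC3 = 12.9317)

variance   <- sdev^2                    # variance captured by each component
proportion <- variance / sum(variance)  # "Proportion of Variance" row
cumulative <- cumsum(proportion)        # "Cumulative Proportion" row

round(proportion, 3)  # roughly 0.451, 0.441, 0.108
round(cumulative, 3)  # roughly 0.451, 0.892, 1.000
```
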
library(ggplot2)
# annual income vs spending score clusters 
ggplot(Mall_Customers, aes(x =`Annual Income (k$)`, y = `Spending Score (1-100)`)) + 
  geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name=" ",breaks=c("1", "2", "3", "4", "5","6"), labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
  ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")

#spending score vs age clusters
ggplot(Mall_Customers, aes(x =`Spending Score (1-100)`, y =Age)) + 
  geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name=" ",breaks=c("1", "2", "3", "4", "5","6"),labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
  ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")

#annual income vs age clusters
ggplot(Mall_Customers, aes(x =`Annual Income (k$)`, y =Age)) + 
  geom_point(stat = "identity", aes(color = as.factor(k6$cluster))) +
  scale_color_discrete(name=" ",breaks=c("1", "2", "3", "4", "5","6"),labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
  ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")

ANALYSIS
Cluster 1 – This cluster represents customers with a high annual income and a high spending score, aged between 28 and 40.
Cluster 2 – This cluster represents customers with an average annual income (above 25 and below 60), an average spending score (above 25 and below 60) and ages above 40.
Cluster 3 – This cluster denotes customers with a low annual income and a low spending score, with ages spread throughout the spectrum.
Cluster 4 – This cluster denotes customers with a high annual income and a low spending score, with ages spread throughout the spectrum.
Cluster 5 – This cluster represents customers with a low annual income but a high spending score.
Cluster 6 – This cluster represents customers with an average annual income (above 30 and below 80), an average spending score (above 25 and below 62) and ages of 40 and below.
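
Descriptions like these are read off the scatter plots, but a numeric profile per cluster can back them up. The sketch below shows one way to do this with `aggregate()`. The data frame `toy` is simulated with two obvious groups and is clustered on income and score only, so it illustrates the technique rather than reproducing the six mall segments.

```r
set.seed(5)
# Made-up customers in two obvious groups (not the mall data)
toy <- data.frame(
  age    = c(round(runif(20, 20, 35)), round(runif(20, 40, 65))),
  income = c(rnorm(20, 85, 5), rnorm(20, 25, 5)),
  score  = c(rnorm(20, 80, 5), rnorm(20, 20, 5))
)
fit <- kmeans(toy[, c("income", "score")], centers = 2, nstart = 25)

# Mean of every variable within each cluster, plus the cluster sizes
profile <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
profile$size <- as.vector(table(fit$cluster))
profile
```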

MAPPING CLUSTERS BACK TO THE DATA

Taking the 6 clusters and mapping them to the data set.
o=order(k6$cluster)
data.frame(Mall_Customers$CustomerID[o],k6$cluster[o])# Using only customer ID for easy tracking
##     Mall_Customers.CustomerID.o. k6.cluster.o.
## 1                            124             1
## 2                            126             1
## 3                            128             1
## 4                            130             1
## 5                            132             1
## 6                            134             1
## 7                            136             1
## 8                            138             1
## 9                            140             1
## 10                           142             1
## 11                           144             1
## 12                           146             1
## 13                           148             1
## 14                           150             1
## 15                           152             1
## 16                           154             1
## 17                           156             1
## 18                           158             1
## 19                           160             1
## 20                           162             1
## 21                           164             1
## 22                           166             1
## 23                           168             1
## 24                           170             1
## 25                           172             1
## 26                           174             1
## 27                           176             1
## 28                           178             1
## 29                           180             1
## 30                           182             1
## 31                           184             1
## 32                           186             1
## 33                           188             1
## 34                           190             1
## 35                           192             1
## 36                           194             1
## 37                           196             1
## 38                           198             1
## 39                           200             1
## 40                            41             2
## 41                            43             2
## 42                            47             2
## 43                            51             2
## 44                            54             2
## 45                            55             2
## 46                            56             2
## 47                            57             2
## 48                            58             2
## 49                            60             2
## 50                            61             2
## 51                            63             2
## 52                            64             2
## 53                            65             2
## 54                            67             2
## 55                            68             2
## 56                            71             2
## 57                            72             2
## 58                            73             2
## 59                            74             2
## 60                            75             2
## 61                            77             2
## 62                            80             2
## 63                            81             2
## 64                            83             2
## 65                            84             2
## 66                            86             2
## 67                            87             2
## 68                            90             2
## 69                            91             2
## 70                            93             2
## 71                            97             2
## 72                            99             2
## 73                           102             2
## 74                           103             2
## 75                           105             2
## 76                           107             2
## 77                           108             2
## 78                           109             2
## 79                           110             2
## 80                           111             2
## 81                           117             2
## 82                           118             2
## 83                           119             2
## 84                           120             2
## 85                             1             3
## 86                             3             3
## 87                             5             3
## 88                             7             3
## 89                             9             3
## 90                            11             3
## 91                            13             3
## 92                            15             3
## 93                            17             3
## 94                            19             3
## 95                            21             3
## 96                            23             3
## 97                            25             3
## 98                            27             3
## 99                            29             3
## 100                           31             3
## 101                           33             3
## 102                           35             3
## 103                           37             3
## 104                           39             3
## 105                           45             3
## 106                          127             4
## 107                          129             4
## 108                          131             4
## 109                          135             4
## 110                          137             4
## 111                          139             4
## 112                          141             4
## 113                          145             4
## 114                          147             4
## 115                          149             4
## 116                          151             4
## 117                          153             4
## 118                          155             4
## 119                          157             4
## 120                          159             4
## 121                          161             4
## 122                          163             4
## 123                          165             4
## 124                          167             4
## 125                          169             4
## 126                          171             4
## 127                          173             4
## 128                          175             4
## 129                          177             4
## 130                          179             4
## 131                          181             4
## 132                          183             4
## 133                          185             4
## 134                          187             4
## 135                          189             4
## 136                          191             4
## 137                          193             4
## 138                          195             4
## 139                          197             4
## 140                          199             4
## 141                            2             5
## 142                            4             5
## 143                            6             5
## 144                            8             5
## 145                           10             5
## 146                           12             5
## 147                           14             5
## 148                           16             5
## 149                           18             5
## 150                           20             5
## 151                           22             5
## 152                           24             5
## 153                           26             5
## 154                           28             5
## 155                           30             5
## 156                           32             5
## 157                           34             5
## 158                           36             5
## 159                           38             5
## 160                           40             5
## 161                           42             5
## 162                           46             5
## 163                           44             6
## 164                           48             6
## 165                           49             6
## 166                           50             6
## 167                           52             6
## 168                           53             6
## 169                           59             6
## 170                           62             6
## 171                           66             6
## 172                           69             6
## 173                           70             6
## 174                           76             6
## 175                           78             6
## 176                           79             6
## 177                           82             6
## 178                           85             6
## 179                           88             6
## 180                           89             6
## 181                           92             6
## 182                           94             6
## 183                           95             6
## 184                           96             6
## 185                           98             6
## 186                          100             6
## 187                          101             6
## 188                          104             6
## 189                          106             6
## 190                          112             6
## 191                          113             6
## 192                          114             6
## 193                          115             6
## 194                          116             6
## 195                          121             6
## 196                          122             6
## 197                          123             6
## 198                          125             6
## 199                          133             6
## 200                          143             6
x = data.frame(Mall_Customers$CustomerID[o],Mall_Customers$Gender[o], Mall_Customers$Age[o], Mall_Customers$`Annual Income (k$)`[o], Mall_Customers$`Spending Score (1-100)`[o],k6$cluster[o])# mapping using all the other columns.
View(x) # Dataset with clusters 
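
An alternative to building a new data frame column by column is to attach the labels as one extra column and then sort. The sketch below shows that pattern on a tiny made-up table (`toy` and its values are hypothetical); it relies on the fact that `kmeans()` returns the labels in the same row order as its input.

```r
set.seed(9)
# Six made-up customers: three low income/low score, three high income/high score
toy <- data.frame(
  id     = 1:6,
  income = c(15, 16, 17, 120, 126, 137),
  score  = c(6, 10, 3, 91, 95, 83)
)
fit <- kmeans(toy[, c("income", "score")], centers = 2, nstart = 10)

toy$cluster <- fit$cluster                   # labels line up with the original rows
mapped <- toy[order(toy$cluster, toy$id), ]  # sort by cluster, then by customer id
mapped
```
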
CONCLUSION

Value-based segmentation differentiates customers by their economic value, grouping customers with the same value level into individual segments that can be distinctly targeted. This is important for targeted marketing and sales maximization.

With the given dataset, Mall Customers, the columns spending score and annual income are used for the value-based segmentation. Age is also an interesting factor to look at. Therefore, the six clusters are analysed in terms of these three factors.