Hackathon Project
Submitted By : Ajesh
Introduction
Customer Segmentation:
Customer segmentation is the process of dividing a customer base into groups of individuals who are similar in ways that are relevant to marketing. For this project we use the k-means clustering technique.
Clustering:
The process of dividing the entire data into groups known as clusters based on the patterns in the data.
K-means clustering:
K-means is a simple unsupervised learning algorithm used to solve clustering problems. It partitions a given data set into a number of clusters, k, which is fixed beforehand. First, k cluster centres (centroids) are placed as points in the data space and every observation is assigned to its nearest centre. Each centre is then recomputed as the mean of the observations assigned to it, and the assignment step is repeated with the updated centres until the assignments stop changing or a maximum number of iterations is reached.
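The procedure described above can be sketched in a few lines of R. This is only an illustration of the assign-and-update loop, not the code behind the built-in kmeans function; the function name simple_kmeans, the random choice of starting centres and the toy data are all assumptions made for this example.

```r
# Illustrative Lloyd-style k-means; not the implementation behind stats::kmeans().
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Assumption: initial centres are k distinct observations picked at random.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every centre (n x k).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of nearest centre
    if (all(new_cluster == cluster)) break             # assignments stable: stop
    cluster <- new_cluster
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Toy data: two well-separated groups of 20 points each.
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

In practice the built-in kmeans function is preferable, since it supports several random restarts through nstart.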
#loading data
customer_data=data.frame(mallcustomer)
#first 6 rows of the data
head(customer_data)
The output above shows the first 6 observations of the data.
#showing the structure of the data
str(customer_data)
'data.frame': 200 obs. of 5 variables:
$ CustomerID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : chr "Male" "Male" "Female" "Female" ...
$ Age : num 19 21 20 23 31 22 35 23 64 30 ...
$ Annual.Income..k.. : num 15 15 16 16 17 17 18 18 19 19 ...
$ Spending.Score..1.100.: num 39 81 6 77 40 76 6 94 3 72 ...
The output above shows the structure of the data, including the data type of each variable.
#summary of the data
summary(customer_data)
CustomerID Gender Age Annual.Income..k..
Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
Median :100.50 Mode :character Median :36.00 Median : 61.50
Mean :100.50 Mean :38.85 Mean : 60.56
3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
Max. :200.00 Max. :70.00 Max. :137.00
Spending.Score..1.100.
Min. : 1.00
1st Qu.:34.75
Median :50.00
Mean :50.20
3rd Qu.:73.00
Max. :99.00
The output above is the descriptive summary of the customer data.
#summary of Age variable
summary(customer_data$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 28.75 36.00 38.85 49.00 70.00
#standard deviation of age variable
print(paste("standard deviation",sd(customer_data$Age)))
[1] "standard deviation 13.9690073315589"
The output above shows the descriptive summary of the Age variable and its standard deviation.
#summary of Annual income
summary(customer_data$Annual.Income..k..)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.00 41.50 61.50 60.56 78.00 137.00
#standard deviation of Annual income
print(paste("standard deviation",sd(customer_data$Annual.Income..k..)))
[1] "standard deviation 26.2647211652712"
The output above shows the descriptive summary of the Annual income variable and its standard deviation.
#summary of Spending score
summary(customer_data$Spending.Score..1.100.)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 34.75 50.00 50.20 73.00 99.00
#standard deviation of Spending score
print(paste("standard deviation",sd(customer_data$Spending.Score..1.100.)))
[1] "standard deviation 25.8235216683702"
The output above shows the descriptive summary of the Spending score variable and its standard deviation.
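The per-variable summaries above can also be computed in one pass. The sketch below uses only the first 10 rows of each numeric column, as printed in the str() output earlier, so its numbers differ from the full 200-row results; the helper name describe is an assumption made for this example.

```r
# First 10 rows of the numeric columns, copied from the str() output above.
first_rows <- data.frame(
  Age = c(19, 21, 20, 23, 31, 22, 35, 23, 64, 30),
  Annual.Income..k.. = c(15, 15, 16, 16, 17, 17, 18, 18, 19, 19),
  Spending.Score..1.100. = c(39, 81, 6, 77, 40, 76, 6, 94, 3, 72)
)

# One row per variable: mean, median and sample standard deviation.
describe <- function(df) {
  t(sapply(df, function(v) c(mean = mean(v), median = median(v), sd = sd(v))))
}
describe(first_rows)
```

The same call on customer_data[,3:5] should give the means and standard deviations reported above for the full data.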
Visualisation of Data
library(ggplot2)
ggplot(customer_data,aes(Gender,fill=Gender))+geom_bar()+scale_fill_viridis_d()+theme_light()+ggtitle("Barplot for Gender ")

From the above plot we see that there are more females than males. To see the proportion of females and males in the data we can use a pie chart.
library(plotrix)
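#count customers by gender (table() orders the groups alphabetically: Female, Male)
gender_counts=table(customer_data$Gender)
per=round(gender_counts/sum(gender_counts)*100)
lbs=paste0(names(gender_counts)," ",per,"%")
pie3D(gender_counts,labels=lbs,
main="Pie Chart Depicting Ratio of Female and Male")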

From the above graph, we conclude that females make up 56% of the customer dataset and males 44%.
Now, to plot the frequency of the customers' ages, we draw a bar plot.
#barplot for Age
ggplot(customer_data,aes(Age))+geom_bar(fill="green",col="purple",alpha=0.5)+theme_dark()+ggtitle("Age Distribution ")

The above graph shows the distribution of the Age variable across the data.
It is clear from the above graph that customers aged 30 to 32 have the highest counts in the data.
#Descriptive boxplot for Age and annual income
ggplot(customer_data,aes(Age,Annual.Income..k..,group=Gender,fill=Gender))+geom_boxplot()+ggtitle("Descriptive boxplot for Age and annual income")

From the above plot we conclude that most customers are between 25 and 42 years old. The minimum age of the customers is 18 and the maximum age is 70.
Next we visualise the annual income to get some insight into it.
hist(customer_data$Annual.Income..k..,
col="#660033",
main="Histogram for Annual Income",
xlab="Annual Income Class",
ylab="Frequency",
labels=TRUE)

From the above histogram we conclude that the most common annual income class is 70 to 80k.
Next is the density plot of the customers' annual income.
ggplot(customer_data,aes(Annual.Income..k..,fill=Gender))+geom_density(alpha=0.5,col="lightblue")+scale_fill_manual(values = c("red","blue"))+theme_minimal()+ggtitle("Density plot for Annual income")

The above density plot shows the distribution of annual income for females and males separately. The two distributions are similar, so annual income does not appear to depend on Gender.
Next, to understand the spending score graphically, we draw some plots.
boxplot(customer_data$Spending.Score..1.100.,
horizontal=TRUE,
col="#990000",
main="BoxPlot for Descriptive Analysis of Spending Score")

From the above boxplot we can conclude that the minimum and maximum spending scores are 1 and 99 respectively, and the average spending score is near 50 (50.20 exactly, from the summary).
Histogram for Spending score
hist(customer_data$Spending.Score..1.100.,
main="Histogram for Spending Score",
xlab="Spending Score Class",
ylab="Frequency",
col="#6600cc",
labels=TRUE)

From the above plot we can clearly see that the largest group of customers has spending scores between 40 and 50.
K-means modeling:
For customer segmentation we use the k-means algorithm, first dividing the customer data into 2 groups.
#columns 3 to 5 are Age, Annual income and Spending score
k2<-kmeans(customer_data[,3:5],2,iter.max=100,nstart=50,algorithm="Lloyd")
k2
K-means clustering with 2 clusters of sizes 115, 85
Cluster means:
Age Annual.Income..k.. Spending.Score..1.100.
1 46.16522 59.36522 32.88696
2 28.95294 62.17647 73.62353
Clustering vector:
[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
[35] 1 2 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1
[69] 2 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 1 1 1 1 1 1 2 1 2 1 2 1 1
[103] 1 2 1 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2
[137] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
[171] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
Within cluster sum of squares by cluster:
[1] 124540.05 88300.12
(between_SS / total_SS = 31.1 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
The above output shows the cluster assigned to each customer and the mean values of each cluster. However, the ratio of between_SS to total_SS (SS = sum of squares) is low at 31.1%, so we need to improve the quality of the k-means model.
For a better-quality k-means model we need to find the optimum value of k (i.e. the number of clusters).
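The between_SS / total_SS figure printed by kmeans can be recomputed from the components of the fitted object. The sketch below does this on simulated stand-in data (an assumption for illustration, not the mall data) and also checks that the total sum of squares splits into a between-cluster part and a within-cluster part.

```r
set.seed(1)
# Simulated stand-in data (an assumption for illustration, not the mall data).
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# Share of the total variation that is explained by the grouping.
ratio <- fit$betweenss / fit$totss

# totss = betweenss + tot.withinss, so the ratio always lies between 0 and 1.
stopifnot(isTRUE(all.equal(fit$totss, fit$betweenss + fit$tot.withinss)))
print(round(100 * ratio, 1))
```

A ratio close to 100% means the clusters are compact relative to the spread of the whole data set. The ratio also rises as k grows, so on its own it cannot select k.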
Average Silhouette Method:
Using this method we can determine the optimum value of k.
In the silhouette method, the average sil_width measures the quality of a clustering. The higher the average sil_width, the better the observations fit the clusters they are assigned to.
sil(2)
sil_width
0.2931661
The above is the mean value of sil_width for k=2.
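The definition of the sil helper is not shown in this report. A minimal sketch of how such a helper could be written is given below, assuming the silhouette function from the cluster package; the name avg_sil_width, its arguments and the simulated data are assumptions made for this example and may differ from the helper actually used.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper: average silhouette width of a k-means fit with k clusters.
avg_sil_width <- function(data, k) {
  fit <- kmeans(data, centers = k, iter.max = 100, nstart = 50)
  # silhouette() needs the cluster labels and the pairwise distances.
  s <- silhouette(fit$cluster, dist(data))
  mean(s[, "sil_width"])
}

set.seed(1)
# Simulated stand-in data with two well-separated groups.
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))
avg_sil_width(x, 2)
```

The silhouette width of a single observation lies between -1 and 1, so the average does too; values near 1 indicate compact, well-separated clusters.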
To determine the optimum value of k we compute the average sil_width for several values of k and select the k that gives the maximum average sil_width.
So we write a function that computes the average sil_width for each value of k from 2 to 10 and reports the k with the maximum average sil_width.
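library(purrr) #for map_dbl
op_sil_k=function(k_values=2:10){
#average sil_width for each candidate k (sil is the helper used above)
avg_sil=map_dbl(k_values,sil)
#k whose average sil_width is largest
best_k=k_values[which.max(avg_sil)]
print(paste("optimum value of k = ",best_k))
}
op_sil_k()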
[1] "optimum value of k = 6"
Here we got 6 as the optimum value for k.
Now we can also confirm the optimum value of k by visualisation.

Hence from the above plot we can see the highest value of average sil_width corresponds to k=6.
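The code for this plot is not included in the report. One way such a plot could be produced is sketched below with base R graphics and the cluster package; the simulated three-group data and the range of k are assumptions made for illustration, so the best k here is not the one found for the mall data.

```r
library(cluster)

set.seed(1)
# Simulated stand-in data with three groups (an assumption for illustration).
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2),
           cbind(rnorm(30, mean = 0), rnorm(30, mean = 6)))

k_values <- 2:6
# Average silhouette width of a k-means fit for each candidate k.
avg_sil <- sapply(k_values, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, dist(x))[, "sil_width"])
})
best_k <- k_values[which.max(avg_sil)]

plot(k_values, avg_sil, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")
abline(v = best_k, lty = 2)  # mark the k with the highest average width
```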
Now we can make a model using k=6 for better quality.
k6<-kmeans(customer_data[,3:5],6,iter.max=100,nstart=50,algorithm="Lloyd")
k6
K-means clustering with 6 clusters of sizes 45, 22, 21, 38, 35, 39
Cluster means:
Age Annual.Income..k.. Spending.Score..1.100.
1 56.15556 53.37778 49.08889
2 25.27273 25.72727 79.36364
3 44.14286 25.14286 19.52381
4 27.00000 56.65789 49.13158
5 41.68571 88.22857 17.28571
6 32.69231 86.53846 82.12821
Clustering vector:
[1] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
[35] 3 2 3 2 3 2 1 2 1 4 3 2 1 4 4 4 1 4 4 1 1 1 1 1 4 1 1 4 1 1 1 4 1 1
[69] 4 4 1 1 1 1 1 4 1 4 4 1 1 4 1 1 4 1 1 4 4 1 1 4 1 4 4 4 1 4 1 4 4 1
[103] 1 4 1 4 1 1 1 1 1 4 4 4 4 4 1 1 1 1 4 4 4 6 4 6 5 6 5 6 5 6 4 6 5 6
[137] 5 6 5 6 5 6 4 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
[171] 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6
Within cluster sum of squares by cluster:
[1] 8062.133 4099.818 7732.381 7742.895 16690.857 13972.359
(between_SS / total_SS = 81.1 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
We have grouped the data into 6 clusters with better quality, as the ratio of between_SS to total_SS (SS = sum of squares) has increased to 81.1%.
Visualising the cluster result using ggplot2
library(ggplot2)
set.seed(1)
ggplot(customer_data, aes(x =Annual.Income..k.., y = Spending.Score..1.100.)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster)),size=4,alpha=0.6) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5","6"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")+theme(
panel.background = element_rect(fill = "lightblue", colour = "black",
size = 2, linetype = "solid"),plot.background = element_rect(fill = "#6D9EC1"))

Hence from the above graph we see that:
Cluster6 - high annual income and high spending score.
Cluster5 - high annual income but low spending score.
Cluster4 & Cluster1 - medium annual income and medium spending score.
Cluster3 - low annual income and low spending score.
Cluster2 - low annual income but high spending score.
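A cluster profile like the one above can also be read from a table of per-cluster means and sizes. The sketch below builds such a table with aggregate on simulated stand-in data; the column names income and score and the data itself are assumptions made for illustration.

```r
set.seed(1)
# Simulated stand-in for two of the report's variables (illustrative assumption).
toy <- data.frame(income = c(rnorm(30, 25, 3), rnorm(30, 85, 3)),
                  score  = c(rnorm(30, 20, 5), rnorm(30, 80, 5)))
fit <- kmeans(toy, centers = 2, nstart = 10)

# Mean of every variable within each cluster, plus the cluster sizes.
profile <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
profile$size <- as.vector(table(fit$cluster))
profile
```

The means in this table match the Cluster means printed by kmeans, so the table is mainly a convenient way to keep means and sizes side by side.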
Visualisation for Age and Annual income.
ggplot(customer_data, aes(x =Annual.Income..k.., y =Age)) +
geom_point(stat = "identity", aes(color = as.factor(k6$cluster)),size=4,alpha=0.6) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4", "5","6"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5","Cluster 6")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")

Hence from the above graph we see that:
Cluster6 - customers of age group 30-40 with high annual income.
Cluster5 - customers of age group 40-60 with high annual income.
Cluster4 - customers of age group 20-40 with medium annual income.
Cluster3 - customers of age group 40-60 with low annual income.
Cluster2 - customers of age group 20-40 with low annual income.
Cluster1 - customers of age group 40-60 with medium annual income.
Visualizing clusters using the fviz_cluster function.
library(factoextra)
fviz_cluster(k6, data = customer_data[,3:5],geom = "point")

Hence with the help of the clustering method we can segment the customers and approach each group in a more suitable way.
