Customer segmentation is an important tool for businesses and other organizations. It helps business owners and marketers identify key areas that need attention and devise more efficient strategies to satisfy their customers. Satisfied customers, in turn, tend to spend more on products and services and to recommend them to others.
Customer segmentation helps business managers and marketers identify the following key segments with respect to e-commerce:
The dataset used in this paper is available on the Kaggle website. It contains information about the customers of a particular business in relation to their debt history. The dataset contains 850 observations, one per customer, with ten (10) variables describing each customer. The column headings are interpreted as follows.
Customer.Id - The customer’s unique ID number
Age - The age of the customer
Edu - Indicates the educational background of the customer
Years.Employed - Number of years the customer has been actively employed
Income - Indicates the customer’s active income per month in USD
Card.Debt - The debt on the customer’s credit card
Other.Debt - The customer’s other debts
Defaulted - Shows whether or not the customer has defaulted on a debt payment (1 = Yes, 0 = No)
Address - The customer’s residence address
DebtIncomeRatio - Indicates the debt-to-income ratio of the customer
head(customerData)
## Customer.Id Age Edu Years.Employed Income Card.Debt Other.Debt Defaulted
## 1 1 41 2 6 19 0.124 1.073 0
## 2 2 47 1 26 100 4.582 8.218 0
## 3 3 33 2 10 57 6.111 5.802 1
## 4 4 29 2 4 19 0.681 0.516 0
## 5 5 47 1 31 253 9.308 8.908 0
## 6 6 40 1 23 81 0.998 7.831 NA
## Address DebtIncomeRatio
## 1 NBA001 6.3
## 2 NBA021 12.8
## 3 NBA013 20.9
## 4 NBA009 6.3
## 5 NBA008 7.2
## 6 NBA016 10.9
Next, I check for missing values and remove them, since they may interfere with the data structure and cause inconsistent results.
sum(is.na(customerData))
## [1] 150
Since the result reveals 150 NAs, I remove the affected observations from the dataset.
customerData <- na.omit(customerData)
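As a small illustration of the missing-value check, the sketch below counts NAs per column and per row before dropping them. The data frame toy is a made-up stand-in for customerData, and the variable names na_per_column, rows_with_na and clean are assumptions introduced for this example only.

```r
# Toy stand-in for customerData (values are illustrative, not from the dataset)
toy <- data.frame(
  Age       = c(41, 47, 33, 29),
  Income    = c(19, 100, NA, 19),
  Defaulted = c(0, NA, 1, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(toy))

# Keep only the complete rows, as na.omit() does in the text
clean <- na.omit(toy)

print(na_per_column)
print(rows_with_na)
print(nrow(clean))
```

Counting per column shows which variable the NAs come from, which a single total cannot tell you.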
We also remove the Customer.Id and Address columns to reduce the number of columns, since they carry no information that is useful for clustering.
customerData$Address <- NULL
trimData <- customerData[,-1]
head(trimData)
## Age Edu Years.Employed Income Card.Debt Other.Debt Defaulted DebtIncomeRatio
## 1 41 2 6 19 0.124 1.073 0 6.3
## 2 47 1 26 100 4.582 8.218 0 12.8
## 3 33 2 10 57 6.111 5.802 1 20.9
## 4 29 2 4 19 0.681 0.516 0 6.3
## 5 47 1 31 253 9.308 8.908 0 7.2
## 7 38 2 4 56 0.442 0.454 0 1.6
The following packages were used for the clustering, the analysis, and the visualization of the methods and results.
library("factoextra")
library("ggplot2")
library("tidyverse")
library("psych")
library("ClusterR")
library("clustertend")
library("fpc")
library("gridExtra")
library("corrplot")
Basic descriptive statistics help one to understand the dataset being analyzed. I ran a summary of the data to obtain, for every variable, measures of location such as the mean and median, together with the minimum and maximum observations. The correlation matrix between the variables and the corresponding correlation plot are also reported.
## vars n mean sd median trimmed mad min max range
## Age 1 700 34.86 8.00 34.00 34.49 8.90 20.00 56.00 36.00
## Edu 2 700 1.72 0.93 1.00 1.57 0.00 1.00 5.00 4.00
## Years.Employed 3 700 8.39 6.66 7.00 7.72 7.41 0.00 31.00 31.00
## Income 4 700 45.60 36.81 34.00 38.82 17.79 14.00 446.00 432.00
## Card.Debt 5 700 1.55 2.12 0.86 1.13 0.88 0.01 20.56 20.55
## Other.Debt 6 700 3.06 3.29 1.99 2.41 1.66 0.05 27.03 26.99
## Defaulted 7 700 0.26 0.44 0.00 0.20 0.00 0.00 1.00 1.00
## DebtIncomeRatio 8 700 10.26 6.83 8.60 9.48 6.23 0.40 41.30 40.90
## skew kurtosis se
## Age 0.36 -0.62 0.30
## Edu 1.20 0.72 0.04
## Years.Employed 0.83 0.21 0.25
## Income 3.84 25.89 1.39
## Card.Debt 3.88 21.74 0.08
## Other.Debt 2.72 10.21 0.12
## Defaulted 1.08 -0.83 0.02
## DebtIncomeRatio 1.09 1.19 0.26
## Age Edu Years.Employed Income Card.Debt
## Age 1.0000000 0.02232500 0.5364968 0.4787099 0.29521432
## Edu 0.0223250 1.00000000 -0.1536208 0.2351905 0.08827721
## Years.Employed 0.5364968 -0.15362077 1.0000000 0.6196813 0.40369784
## Income 0.4787099 0.23519050 0.6196813 1.0000000 0.57019584
## Card.Debt 0.2952143 0.08827721 0.4036978 0.5701958 1.00000000
## Other.Debt 0.3402130 0.16545833 0.4060894 0.6106627 0.63310841
## Other.Debt Defaulted DebtIncomeRatio
## Age 0.3402130 -0.13765710 0.016398077
## Edu 0.1654583 0.11467555 0.008838431
## Years.Employed 0.4060894 -0.28297839 -0.031182215
## Income 0.6106627 -0.07096966 -0.026777293
## Card.Debt 0.6331084 0.24473424 0.501772450
## Other.Debt 1.0000000 0.14571635 0.584867409
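The source does not show the calls that produced the two tables above. The summary table has the layout printed by describe() from the psych package, and a correlation matrix of this kind can be obtained with cor(). The base-R sketch below shows the idea on the first five rows of three columns taken from the head() output earlier; the names toy, stats and cm are assumptions for illustration.

```r
# First five rows of three columns, copied from the head() output above
toy <- data.frame(
  Years.Employed = c(6, 26, 10, 4, 31),
  Income         = c(19, 100, 57, 19, 253),
  Card.Debt      = c(0.124, 4.582, 6.111, 0.681, 9.308)
)

# A few descriptive statistics per variable (a small subset of what
# psych::describe() reports)
stats <- sapply(toy, function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), min = min(v), max = max(v))
})

# Pearson correlation matrix between the variables
cm <- cor(toy)

print(round(stats, 2))
print(round(cm, 3))
```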
At this point, I begin the clustering procedure. As a preliminary step, I check whether the dataset has a clustering tendency, that is, whether or not it is uniformly distributed. This can be assessed using the Hopkins statistic. The null hypothesis is “The data is uniformly distributed and contains no meaningful clusters”. The alternative hypothesis is that the data is not uniformly distributed and therefore contains meaningful clusters.
get_clust_tendency(trimData[,-8], 100, graph=TRUE, gradient=list(low = "red",mid="white", high = "steelblue"))
## $hopkins_stat
## [1] 0.91487
##
## $plot
We obtained a Hopkins statistic of about 0.91. Since this value is close to 1 (far above 0.5), we can conclude that the dataset is highly clusterable. More information on assessing clustering tendency can be found on the STHDA website or in the Help menu in RStudio.
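To make the statistic more concrete, here is a minimal base-R sketch of one common formulation of the Hopkins statistic. It is not the implementation behind get_clust_tendency(); the function name hopkins_sketch, the sample size m and the toy data are assumptions made for illustration.

```r
# One common formulation: compare nearest-neighbour distances of
# uniformly generated points (u) with those of sampled real points (w).
hopkins_sketch <- function(X, m = 20, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # w: distance from m sampled real points to their nearest other real point
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  idx <- sample(n, m)
  w <- apply(D[idx, , drop = FALSE], 1, min)

  # u: distance from m uniform points (inside the data's bounding box)
  # to their nearest real point
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- sapply(seq_len(d), function(j) runif(m, lo[j], hi[j]))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))

  # Close to 0.5 for uniform data, close to 1 for clustered data
  sum(u) / (sum(u) + sum(w))
}

# Toy data: two tight, well-separated groups (illustrative only)
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
h <- hopkins_sketch(toy, m = 20)
print(h)
```

Under this formulation, uniformly generated points lie far from the data when the data is clustered, so the numerator dominates and the value approaches 1.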
Before clustering the dataset, it is worth noting that only the K-means and PAM clustering methods are used in this paper, because CLARA (Clustering Large Applications) is just an extension of PAM (Partitioning Around Medoids) for large datasets, which this dataset is not.
Now we determine the number of clusters in the dataset. Here, I apply the Elbow and Silhouette methods for determining the optimal number of clusters to both the K-means and PAM algorithms. The optimal number of clusters is indicated by the dotted lines in the plots.
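The calls that produced these plots are not shown in the source. As a rough sketch of what the Elbow method computes, the base-R example below records the total within-cluster sum of squares for several values of k on made-up data; the names toy, ks and wss are assumptions for illustration.

```r
# Toy data: two well-separated groups (illustrative, not the customer data)
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
print(round(wss, 2))
# plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

On this toy data the curve drops sharply from k = 1 to k = 2 and then flattens, which is the pattern the Elbow method looks for.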
Since the Elbow and Silhouette methods propose different numbers of clusters, it is necessary to cluster separately for each proposed number and then decide based on the average silhouette width and the cluster silhouette plot.
To process the data, the K-means algorithm starts with a first group of randomly selected centroids, which serve as the starting points for the clusters, and then performs iterative calculations to optimize the positions of the centroids.
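The iteration described above can be sketched in a few lines of base R. This is a simplified illustration (a fixed number of iterations, no convergence check), not the code behind eclust() or kmeans(); the function name kmeans_sketch and the toy data are assumptions.

```r
kmeans_sketch <- function(X, k, iters = 20, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)

  # Step 1: pick k observations at random as the initial centroids
  centers <- X[sample(nrow(X), k), , drop = FALSE]

  for (i in seq_len(iters)) {
    # Step 2: squared distance from every point to every centroid (n x k)
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(X, 2, centers[j, ])^2)
    })
    # Assign each point to its nearest centroid
    cl <- max.col(-d2, ties.method = "first")

    # Step 3: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cl == j)) {
        centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cl, centers = centers)
}

# Toy data: two well-separated groups of 50 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
fit <- kmeans_sketch(toy, k = 2)
print(table(fit$cluster))
```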
# K-means with 2 clusters: cluster plot and silhouette plot side by side
kmclust2 <- eclust(trimData[,-8], k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
km_plot2 <- fviz_cluster(kmclust2, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
km_sil_plot2 <- fviz_silhouette(kmclust2)
grid.arrange(km_plot2, km_sil_plot2, ncol=2)
# K-means with 4 clusters
kmclust4 <- eclust(trimData[,-8], k=4, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
km_plot4 <- fviz_cluster(kmclust4, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 4 clusters")
km_sil_plot4 <- fviz_silhouette(kmclust4)
grid.arrange(km_plot4, km_sil_plot4, ncol=2)
The PAM clustering algorithm. PAM stands for “Partitioning Around Medoids”. The algorithm is intended to find a sequence of objects called medoids that are centrally located in clusters. Objects that are tentatively defined as medoids are placed into a set S of selected objects.
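To illustrate what “centrally located” means, the sketch below finds the medoid of a single small group: the observation with the smallest total distance to all other observations. It shows only the medoid concept, not the full PAM procedure; the vector x and the names total_dist and medoid_index are assumptions for this example.

```r
# Five one-dimensional observations, one of them an outlier (made-up values)
x <- c(0, 1, 2, 3, 20)

# Total distance from each observation to all the others
total_dist <- rowSums(as.matrix(dist(x)))

# The medoid is the observation with the smallest total distance
medoid_index <- unname(which.min(total_dist))

print(x[medoid_index])  # the medoid is an actual observation
print(mean(x))          # the mean is pulled towards the outlier
```

Unlike a centroid, a medoid is always one of the observations, which makes it less sensitive to outliers than the mean.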
# PAM with 2 clusters: cluster plot and silhouette plot side by side
pamclust2 <- eclust(trimData[,-8], k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
pam_plot2 <- fviz_cluster(pamclust2, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
pam_sil_plot2 <- fviz_silhouette(pamclust2)
grid.arrange(pam_plot2, pam_sil_plot2, ncol=2)
# PAM with 4 clusters
pamclust4 <- eclust(trimData[,-8], k=4, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
pam_plot4 <- fviz_cluster(pamclust4, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 4 clusters")
pam_sil_plot4 <- fviz_silhouette(pamclust4)
grid.arrange(pam_plot4, pam_sil_plot4, ncol=2)
Finally, I analyze the results using intrinsic methods, since we do not have the class information for the data points that extrinsic methods, such as the homogeneity and completeness scores, require. The intrinsic measure I use is the average silhouette score of each clustering method.
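For reference, the silhouette width of an observation i is s(i) = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The base-R sketch below computes the average of s(i) on made-up data; the function name silhouette_sketch and the toy labels are assumptions, and this is not the code behind fviz_silhouette().

```r
silhouette_sketch <- function(X, cl) {
  D <- as.matrix(dist(X))
  n <- nrow(D)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cl == cl[i]
    own[i] <- FALSE
    if (!any(own)) {       # singleton cluster: s(i) = 0 by convention
      s[i] <- 0
      next
    }
    # a: mean distance to the other members of the same cluster
    a <- mean(D[i, own])
    # b: smallest mean distance to the members of any other cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(g) mean(D[i, cl == g])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Toy data: two well-separated groups of 50 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
good_labels <- rep(1:2, each = 50)   # labels that match the groups
bad_labels  <- rep(1:2, times = 50)  # labels that ignore the groups

avg_good <- silhouette_sketch(toy, good_labels)
avg_bad  <- silhouette_sketch(toy, bad_labels)
print(c(good = avg_good, bad = avg_bad))
```

A value near 1 means observations sit much closer to their own cluster than to any other, while values near 0 or below indicate overlapping or misassigned clusters.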
For K-means, the silhouette plot for 2 clusters gives an average silhouette width of 0.68 (above the ), with only a very small portion of cluster 2 running into negative values. In comparison, the average silhouette score for 4 clusters is 0.5, which is lower than that for 2 clusters. We therefore conclude that clustering this dataset with 2 clusters is more appropriate.
For PAM, the silhouette plot for 2 clusters gives an average silhouette width of 0.56 (above the ), again with only a very small portion of cluster 2 running into negative values. In comparison, the average silhouette score for 4 clusters is 0.38, which is lower than that for 2 clusters. We therefore conclude that clustering this dataset with 2 clusters is also more appropriate here.
As discussed in the analysis of each clustering method, the more appropriate number of clusters for both methods was found to be 2. Based on the same intrinsic measure, we can also deduce that the K-means algorithm fits this dataset better than PAM, since its average silhouette width is higher (0.68 versus 0.56).
It is also worth noting that the Elbow method suggesting a different number of clusters does not make that number inappropriate, as the choice may depend on several other factors.