INTRODUCTION

Customer segmentation is an important tool for businesses and other organizations. It helps business owners and marketers identify the areas that need attention and devise more efficient strategies for satisfying their customers. Satisfied customers, in turn, tend to spend more on products and services and to recommend them to others.

Customer segmentation helps business managers and marketers identify key segments such as the following, with respect to e-commerce:

High spenders
One timers
Location
Loyal Customers
Inactive Customers
Coupon lovers
And many more

OVERVIEW OF DATASET

The dataset used in this paper is available on the Kaggle website. It contains information about the customers of a particular business in relation to their debt history. The dataset contains 850 observations, one per customer, with ten (10) variables describing each customer. The column headings are interpreted as follows.

Customer.Id - The customer’s unique ID number
Age - The age of the customer
Edu - The educational background of the customer
Years.Employed - Number of years the customer has been actively employed
Income - The customer’s active income per month in USD
Card.Debt - The debt on the customer’s credit card
Other.Debt - The customer’s other debts
Defaulted - Whether or not the customer has defaulted on a debt payment (1 = Yes, 0 = No)
Address - The customer’s residential address
DebtIncomeRatio - The customer’s debt-to-income ratio

head(customerData)

##   Customer.Id Age Edu Years.Employed Income Card.Debt Other.Debt Defaulted
## 1           1  41   2              6     19     0.124      1.073         0
## 2           2  47   1             26    100     4.582      8.218         0
## 3           3  33   2             10     57     6.111      5.802         1
## 4           4  29   2              4     19     0.681      0.516         0
## 5           5  47   1             31    253     9.308      8.908         0
## 6           6  40   1             23     81     0.998      7.831        NA
##   Address DebtIncomeRatio
## 1  NBA001             6.3
## 2  NBA021            12.8
## 3  NBA013            20.9
## 4  NBA009             6.3
## 5  NBA008             7.2
## 6  NBA016            10.9

Cleaning the Dataset

First, I check whether there are any missing values and remove them, as they may interfere with the data structure and cause inconsistencies in the results.

sum(is.na(customerData))

## [1] 150

Since the result shows that there are 150 missing values, I will go ahead and remove the affected observations from my dataset.

customerData <- na.omit(customerData)
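
The count above gives the total number of missing cells, not the columns or rows they sit in. As a minimal sketch, the breakdown can be checked with base R; the data frame `toy` and its values are hypothetical stand-ins for customerData, used only so that the snippet runs on its own.

```r
# Hypothetical stand-in for customerData (made-up values, not the Kaggle data)
toy <- data.frame(
  Age       = c(41, 47, 33, 29),
  Income    = c(19, 100, 57, NA),
  Defaulted = c(0, NA, 1, 0)
)

na_per_column <- colSums(is.na(toy))        # missing cells in each column
rows_with_na  <- sum(!complete.cases(toy))  # rows with at least one NA
cleaned       <- na.omit(toy)               # same call as used above

print(na_per_column)
print(rows_with_na)
print(nrow(cleaned))
```

Comparing the number of rows before and after na.omit() confirms how many observations were actually dropped.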

We also remove the Customer.Id and Address columns to cut down the number of columns, since they do not inform the clustering method.

customerData$Address <- NULL
trimData <- customerData[,-1]
head(trimData)

##   Age Edu Years.Employed Income Card.Debt Other.Debt Defaulted DebtIncomeRatio
## 1  41   2              6     19     0.124      1.073         0             6.3
## 2  47   1             26    100     4.582      8.218         0            12.8
## 3  33   2             10     57     6.111      5.802         1            20.9
## 4  29   2              4     19     0.681      0.516         0             6.3
## 5  47   1             31    253     9.308      8.908         0             7.2
## 7  38   2              4     56     0.442      0.454         0             1.6

Libraries

The following packages were used for the clustering, analysis, and visualization of the methods and results.

library("factoextra")
library("ggplot2")
library("tidyverse")
library("psych")
library("ClusterR")
library("clustertend")
library("fpc")
library("gridExtra")
library("corrplot")

STATISTICS AND EXPLORATORY DATA ANALYSIS

Basic statistical analysis helps one to properly understand the dataset being analyzed. I run an analysis on the data to obtain basic information about each variable, such as measures of location (mean, median and trimmed mean) and the minimum and maximum observations. The correlation and visualization tabs show the correlation matrix between the variables and the correlation plot, respectively.
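
As a rough illustration of how such measures and a correlation matrix can be obtained, the sketch below uses only base R on the first five Age and Income values from the head() output. It is not the code that produced the tables in this section, and the object names `toy`, `measures` and `corr` are my own.

```r
# First five Age and Income values from the head() output above
toy <- data.frame(
  Age    = c(41, 47, 33, 29, 47),
  Income = c(19, 100, 57, 19, 253)
)

# One column of summary measures per variable
measures <- sapply(toy, function(v) {
  c(mean = mean(v), median = median(v), min = min(v), max = max(v))
})

# Pearson correlation matrix between the variables
corr <- cor(toy)

print(measures)
print(corr)
```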

Results from SEDA

Measures

##                 vars   n  mean    sd median trimmed   mad   min    max  range
## Age                1 700 34.86  8.00  34.00   34.49  8.90 20.00  56.00  36.00
## Edu                2 700  1.72  0.93   1.00    1.57  0.00  1.00   5.00   4.00
## Years.Employed     3 700  8.39  6.66   7.00    7.72  7.41  0.00  31.00  31.00
## Income             4 700 45.60 36.81  34.00   38.82 17.79 14.00 446.00 432.00
## Card.Debt          5 700  1.55  2.12   0.86    1.13  0.88  0.01  20.56  20.55
## Other.Debt         6 700  3.06  3.29   1.99    2.41  1.66  0.05  27.03  26.99
## Defaulted          7 700  0.26  0.44   0.00    0.20  0.00  0.00   1.00   1.00
## DebtIncomeRatio    8 700 10.26  6.83   8.60    9.48  6.23  0.40  41.30  40.90
##                 skew kurtosis   se
## Age             0.36    -0.62 0.30
## Edu             1.20     0.72 0.04
## Years.Employed  0.83     0.21 0.25
## Income          3.84    25.89 1.39
## Card.Debt       3.88    21.74 0.08
## Other.Debt      2.72    10.21 0.12
## Defaulted       1.08    -0.83 0.02
## DebtIncomeRatio 1.09     1.19 0.26

Correlation

##                      Age         Edu Years.Employed    Income  Card.Debt
## Age            1.0000000  0.02232500      0.5364968 0.4787099 0.29521432
## Edu            0.0223250  1.00000000     -0.1536208 0.2351905 0.08827721
## Years.Employed 0.5364968 -0.15362077      1.0000000 0.6196813 0.40369784
## Income         0.4787099  0.23519050      0.6196813 1.0000000 0.57019584
## Card.Debt      0.2952143  0.08827721      0.4036978 0.5701958 1.00000000
## Other.Debt     0.3402130  0.16545833      0.4060894 0.6106627 0.63310841
##                Other.Debt   Defaulted DebtIncomeRatio
## Age             0.3402130 -0.13765710     0.016398077
## Edu             0.1654583  0.11467555     0.008838431
## Years.Employed  0.4060894 -0.28297839    -0.031182215
## Income          0.6106627 -0.07096966    -0.026777293
## Card.Debt       0.6331084  0.24473424     0.501772450
## Other.Debt      1.0000000  0.14571635     0.584867409

Correlation plot

CLUSTERING

At this point, I begin the clustering procedure. As a preliminary step, I check whether the dataset has a clustering tendency, that is, whether or not the dataset is uniformly distributed. This can be assessed using the Hopkins statistic. The null hypothesis can be stated as “Data is uniformly distributed and does not require clustering”. The alternative hypothesis is that the data is not uniformly distributed and therefore contains meaningful clusters.

get_clust_tendency(trimData[,-8], 100, graph=TRUE, gradient=list(low = "red",mid="white", high = "steelblue"))

## $hopkins_stat
## [1] 0.91487
## 
## $plot

We obtained a Hopkins statistic of about 0.91. Since this value is close to 1 (far above 0.5), we can conclude that the dataset is highly clusterable. More information on assessing clustering tendency can be found on the STHDA website or in the Help menu in RStudio.

Before proceeding to cluster the dataset, it is worth noting that only the K-means and PAM clustering methods are used in this paper, because CLARA (Clustering Large Applications) is just an extension of PAM (Partitioning Around Medoids) for large datasets, which is not the case here.
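
To make the statistic less of a black box, here is a minimal base-R sketch of one common formulation of the Hopkins statistic. It is an illustration under my own assumptions, not the implementation used by get_clust_tendency(): the function name `hopkins_sketch`, its arguments and the simulated data are hypothetical. It follows the convention used in this paper, where values near 0.5 point to uniform data and values near 1 point to clusterable data (some implementations report the complement instead).

```r
hopkins_sketch <- function(x, m = 20, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  idx <- sample(nrow(x), m)            # m real observations
  rng <- apply(x, 2, range)            # per-column minimum and maximum
  # m artificial points drawn uniformly inside the range of each column
  unif <- sapply(seq_len(ncol(x)), function(j) runif(m, rng[1, j], rng[2, j]))
  # Euclidean distance from point p to its nearest neighbour among rows of pts
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))
  # w: real point to nearest other real point; u: artificial point to nearest real point
  w <- sapply(seq_len(m), function(i) nearest(x[idx[i], ], x[-idx[i], , drop = FALSE]))
  u <- sapply(seq_len(m), function(i) nearest(unif[i, ], x))
  sum(u) / (sum(u) + sum(w))
}

# Simulated data with two tight, well-separated groups
set.seed(42)
clustered <- rbind(matrix(rnorm(100, 0, 0.1), ncol = 2),
                   matrix(rnorm(100, 5, 0.1), ncol = 2))
h <- hopkins_sketch(clustered)
print(h)
```

For clustered data the artificial points fall mostly in empty space, so their nearest-neighbour distances dominate the ratio and push it toward 1.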

Optimal Number Of Clusters

Next, we determine the number of clusters in the dataset. Here, I apply the Elbow and Silhouette methods for determining the optimal number of clusters to both the K-means and PAM algorithms. In each plot, the optimal number of clusters is indicated by the dotted line.
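
The idea behind the Elbow method can be sketched with base R alone: fit K-means for a range of k and look at where the total within-cluster sum of squares stops dropping sharply. The snippet below is an illustration on simulated data, not the code behind the plots in this section; `toy`, `wss` and `drops` are hypothetical names.

```r
# Simulated data with two well-separated groups
set.seed(123)
toy <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
             matrix(rnorm(60, 4, 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Reduction in WSS gained by each additional cluster
drops <- -diff(wss)

print(round(wss, 2))
print(which.max(drops) + 1)  # k reached by the largest single drop
```

Plotting `wss` against k gives the familiar elbow curve; the bend marks the point after which extra clusters add little.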

K-MEANS

PAM

Since the Elbow and Silhouette methods propose different numbers of clusters, it is necessary to cluster separately for each proposed number and then decide based on the average silhouette width and the cluster silhouette plot.

K-MEANS

To process the data, the K-means algorithm starts with an initial group of randomly selected centroids, which serve as the starting points for the clusters, and then performs iterative calculations to optimize the positions of the centroids.
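
The iteration described above can be sketched in a few lines of base R. This is a bare-bones illustration of the assign-then-update loop, not the kmeans() routine that eclust() calls; the function name `lloyd_sketch`, its arguments and the simulated data are hypothetical.

```r
lloyd_sketch <- function(x, k, iters = 10, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  # Step 1: pick k observations at random as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (it in seq_len(iters)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- apply(d, 1, which.min)  # assign each point to its nearest centroid
    # Step 3: move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Simulated data with two groups
set.seed(7)
toy <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
             matrix(rnorm(60, 4, 0.3), ncol = 2))
fit <- lloyd_sketch(toy, k = 2)
print(fit$centers)
print(table(fit$cluster))
```

A fixed number of iterations keeps the sketch short; a fuller version would stop once the assignments no longer change.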

2 Clusters

# Fit K-means with k = 2 without plotting immediately
kmclust2 <- eclust(trimData[,-8], k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)

# Cluster plot and silhouette plot, shown side by side
km_plot2 <- fviz_cluster(kmclust2, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 2 clusters")
km_sil_plot2 <- fviz_silhouette(kmclust2)

grid.arrange(km_plot2, km_sil_plot2, ncol=2)

4 Clusters

# Fit K-means with k = 4 without plotting immediately
kmclust4 <- eclust(trimData[,-8], k=4, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)

# Cluster plot and silhouette plot, shown side by side
km_plot4 <- fviz_cluster(kmclust4, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 4 clusters")
km_sil_plot4 <- fviz_silhouette(kmclust4)

grid.arrange(km_plot4, km_sil_plot4, ncol=2)

PAM

PAM stands for “Partitioning Around Medoids”. The algorithm is intended to find a set of objects called medoids that are centrally located in their clusters. Objects that are tentatively defined as medoids are placed into a set S of selected objects.
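
What “centrally located” means for a medoid can be shown with a tiny base-R example: within a group, the medoid is the observation whose total distance to all other members is smallest. The points below are made up for illustration, and the snippet only locates the medoid of a single group; it is not the full PAM procedure.

```r
# Five made-up points: four close together and one far away
pts <- rbind(c(1, 1), c(2, 1), c(1, 2), c(2, 2), c(10, 10))

d <- as.matrix(dist(pts))                      # pairwise Euclidean distances
medoid_index <- unname(which.min(rowSums(d)))  # smallest total distance to the rest
medoid   <- pts[medoid_index, ]
centroid <- colMeans(pts)                      # the mean, for comparison

print(medoid)
print(centroid)
```

Unlike the centroid used by K-means, the medoid is always one of the actual observations.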

2 Clusters

# Fit PAM with k = 2 without plotting immediately
pamclust2 <- eclust(trimData[,-8], k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)

# Cluster plot and silhouette plot, shown side by side
pam_plot2 <- fviz_cluster(pamclust2, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 2 clusters")
pam_sil_plot2 <- fviz_silhouette(pamclust2)

grid.arrange(pam_plot2, pam_sil_plot2, ncol=2)

4 Clusters

# Fit PAM with k = 4 without plotting immediately
pamclust4 <- eclust(trimData[,-8], k=4, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)

# Cluster plot and silhouette plot, shown side by side
pam_plot4 <- fviz_cluster(pamclust4, data=trimData[,-8], ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 4 clusters")
pam_sil_plot4 <- fviz_silhouette(pamclust4)

grid.arrange(pam_plot4, pam_sil_plot4, ncol=2)

Analysis of Cluster results

Here, I finally analyze the results using intrinsic methods. We do not have class labels for the data points, so extrinsic measures such as the homogeneity and completeness scores cannot be used. The intrinsic measure I will use is the average silhouette width of each clustering.
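
For reference, the average silhouette width can be computed by hand. For each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the width is (b - a) / max(a, b). The base-R sketch below follows that definition; the function name `avg_silhouette` and the toy data are hypothetical, and it assumes at least two clusters.

```r
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) next          # singleton cluster: width stays at 0
    a <- mean(d[i, own])         # mean distance within own cluster
    # smallest mean distance to the members of any other cluster
    b <- min(sapply(setdiff(unique(cluster), cluster[i]),
                    function(k) mean(d[i, cluster == k])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Four points on a line forming two obvious pairs
x <- matrix(c(0, 1, 10, 11), ncol = 1)
good <- avg_silhouette(x, c(1, 1, 2, 2))  # sensible grouping
bad  <- avg_silhouette(x, c(1, 2, 1, 2))  # deliberately poor grouping
print(good)
print(bad)
```

The sensible grouping scores close to 1, while the poor one is negative, which is the same logic used below to compare 2 clusters against 4.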

K-MEANS

From the silhouette plot for 2 clusters, the average silhouette width is 0.68 (above the ), with only a very small portion of cluster 2 running into negative values. In comparison, the average silhouette width of 0.5 for 4 clusters is lower than that for 2 clusters. We therefore conclude that clustering this dataset into 2 clusters is more appropriate.

PAM

Similarly, from the silhouette plot for 2 clusters, the average silhouette width is 0.56 (above the ), with only a very small portion of cluster 2 running into negative values. In comparison, the average silhouette width of 0.38 for 4 clusters is lower than that for 2 clusters. We therefore conclude that clustering this dataset into 2 clusters is also more appropriate for PAM.

CONCLUSION

As discussed in the analysis of each clustering method, the more appropriate number of clusters for both methods is 2. Based on the intrinsic analysis, we can also deduce that the K-means algorithm fits this dataset better than PAM, since its average silhouette width for 2 clusters is higher (0.68 against 0.56).

It is also worth noting that the Elbow method suggesting a different number of clusters does not make that number inappropriate, as the choice may depend on several other factors.

CUSTOMER SEGMENTATION USING K-MEANS AND PAM CLUSTERING ALGORITHMS

MARK ASAMOAH

11/02/2021
