# Loading required libraries
library(dplyr)        # data manipulation (arrange, pipes)
library(ggplot2)      # plotting backend for factoextra graphics
library(clustertend)  # clustering tendency statistics (Hopkins)
library(factoextra)   # eclust(), fviz_nbclust(), fviz_* helpers
library(fpc)          # cluster.stats() validation metrics
In this project we work with a customer churn dataset. The goal is to apply different clustering techniques to group customers based on the information present in the dataset. The dataset is taken from Kaggle and can be found at https://www.kaggle.com/datasets/thedevastator/predicting-credit-card-customer-attrition-with-m. It contains 20 attributes and 10127 observations.
The dataset was collected from a consumer credit card portfolio with the aim of helping researchers predict customer attrition. It contains demographic information such as age, gender, marital status, and income category, as well as details of each customer's relationship with the card issuer, such as the card type, the number of months on book, and periods of inactivity. It also records spending behaviour relevant to customers approaching the decision to leave, such as the total revolving balance, credit limit, average open-to-buy, and derived metrics such as the change in transaction amount from quarter 4 to quarter 1. Together these variables give an up-to-date picture of each account, useful both for managing a portfolio and for judging whether a long-standing account is likely to remain stable or to churn in the near future.
Before performing any data analysis, we need to clean and prepare the dataset. The very first step after reading the data is to remove the ID column, as it carries no useful information for clustering.
data <- read.csv("BankChurners.csv")
data <- data[, 2:21]   # drop the ID column, keep the remaining 20 attributes
head(data)
After dropping the ID column, the dataset contains 10127 observations and 20 attributes.
dim(data)
## [1] 10127 20
The dataset is checked for missing observations; none of the columns contains any missing values.
# Missing observations
colSums(is.na(data))
## Attrition_Flag Customer_Age Gender
## 0 0 0
## Dependent_count Education_Level Marital_Status
## 0 0 0
## Income_Category Card_Category Months_on_book
## 0 0 0
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## 0 0 0
## Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy
## 0 0 0
## Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct
## 0 0 0
## Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## 0 0
The categorical columns are checked for garbage values; all of them contain only valid levels, so no further cleaning is required.
# Checking garbage values in categorical columns
unique(data$Gender)
## [1] "M" "F"
unique(data$Education_Level)
## [1] "High School" "Graduate" "Uneducated" "Unknown"
## [5] "College" "Post-Graduate" "Doctorate"
unique(data$Marital_Status)
## [1] "Married" "Single" "Unknown" "Divorced"
unique(data$Income_Category)
## [1] "$60K - $80K" "Less than $40K" "$80K - $120K" "$40K - $60K"
## [5] "$120K +" "Unknown"
unique(data$Card_Category)
## [1] "Blue" "Gold" "Silver" "Platinum"
The summary of the dataset is obtained and is attached below:
summary(data)
## Attrition_Flag Customer_Age Gender Dependent_count
## Length:10127 Min. :26.00 Length:10127 Min. :0.000
## Class :character 1st Qu.:41.00 Class :character 1st Qu.:1.000
## Mode :character Median :46.00 Mode :character Median :2.000
## Mean :46.33 Mean :2.346
## 3rd Qu.:52.00 3rd Qu.:3.000
## Max. :73.00 Max. :5.000
## Education_Level Marital_Status Income_Category Card_Category
## Length:10127 Length:10127 Length:10127 Length:10127
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Months_on_book Total_Relationship_Count Months_Inactive_12_mon
## Min. :13.00 Min. :1.000 Min. :0.000
## 1st Qu.:31.00 1st Qu.:3.000 1st Qu.:2.000
## Median :36.00 Median :4.000 Median :2.000
## Mean :35.93 Mean :3.813 Mean :2.341
## 3rd Qu.:40.00 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :56.00 Max. :6.000 Max. :6.000
## Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy
## Min. :0.000 Min. : 1438 Min. : 0 Min. : 3
## 1st Qu.:2.000 1st Qu.: 2555 1st Qu.: 359 1st Qu.: 1324
## Median :2.000 Median : 4549 Median :1276 Median : 3474
## Mean :2.455 Mean : 8632 Mean :1163 Mean : 7469
## 3rd Qu.:3.000 3rd Qu.:11068 3rd Qu.:1784 3rd Qu.: 9859
## Max. :6.000 Max. :34516 Max. :2517 Max. :34516
## Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1
## Min. :0.0000 Min. : 510 Min. : 10.00 Min. :0.0000
## 1st Qu.:0.6310 1st Qu.: 2156 1st Qu.: 45.00 1st Qu.:0.5820
## Median :0.7360 Median : 3899 Median : 67.00 Median :0.7020
## Mean :0.7599 Mean : 4404 Mean : 64.86 Mean :0.7122
## 3rd Qu.:0.8590 3rd Qu.: 4741 3rd Qu.: 81.00 3rd Qu.:0.8180
## Max. :3.3970 Max. :18484 Max. :139.00 Max. :3.7140
## Avg_Utilization_Ratio
## Min. :0.0000
## 1st Qu.:0.0230
## Median :0.1760
## Mean :0.2749
## 3rd Qu.:0.5030
## Max. :0.9990
The numerical columns are checked for outliers using the 1.5 × IQR rule, and rows are removed for all columns where the summary statistics above show a large gap between the third quartile and the maximum.
# Removing outliers if present
# Flag values lying more than 1.5 * IQR beyond the quartiles
outliers <- function(x) {
  Q1 <- quantile(x, probs = 0.25)
  Q3 <- quantile(x, probs = 0.75)
  iqr <- Q3 - Q1
  upper_limit <- Q3 + 1.5 * iqr
  lower_limit <- Q1 - 1.5 * iqr
  x > upper_limit | x < lower_limit
}
# Drop every row that is an outlier in any of the given columns
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    df <- df[!outliers(df[[col]]), ]
  }
  df
}
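As a quick sanity check, the helper can be applied to a toy vector (not part of the original analysis); only the extreme value should be flagged:
outliers(c(1:10, 100))   # TRUE only for the last element (100)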
clean <- remove_outliers(data, c('Customer_Age', 'Avg_Utilization_Ratio',
'Total_Ct_Chng_Q4_Q1', 'Total_Trans_Ct',
'Total_Trans_Amt', 'Total_Amt_Chng_Q4_Q1',
'Avg_Open_To_Buy', 'Total_Revolving_Bal',
'Credit_Limit', 'Months_on_book'))
After removing the outliers we are left with 6903 observations.
dim(clean)
## [1] 6903 20
The target column, Attrition_Flag, is recoded: 2 for existing customers and 1 for attrited customers.
clean$Attrition_Flag[clean$Attrition_Flag == "Existing Customer"] <- 2
clean$Attrition_Flag[clean$Attrition_Flag == "Attrited Customer"] <- 1
unique(clean$Attrition_Flag)
## [1] "2" "1"
All categorical columns are converted to numerical columns using label encoding. This is because the clustering methods used here operate on a distance matrix, and distances cannot be computed directly on character columns.
clean$Gender <- as.numeric(factor(clean$Gender))
clean$Education_Level <- as.numeric(factor(clean$Education_Level))
clean$Marital_Status <- as.numeric(factor(clean$Marital_Status))
clean$Income_Category <- as.numeric(factor(clean$Income_Category))
clean$Card_Category <- as.numeric(factor(clean$Card_Category))
str(clean)
## 'data.frame': 6903 obs. of 20 variables:
## $ Attrition_Flag : chr "2" "2" "2" "1" ...
## $ Customer_Age : int 42 57 45 62 47 54 41 47 58 55 ...
## $ Gender : num 2 1 1 1 1 2 1 2 2 1 ...
## $ Dependent_count : int 5 2 2 0 4 2 3 4 0 1 ...
## $ Education_Level : num 6 3 3 3 7 7 3 4 3 1 ...
## $ Marital_Status : num 4 2 2 2 3 2 3 2 2 3 ...
## $ Income_Category : num 1 5 6 5 5 4 5 2 4 5 ...
## $ Card_Category : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Months_on_book : int 31 48 37 49 36 42 28 42 49 36 ...
## $ Total_Relationship_Count: int 5 5 6 2 3 4 6 6 6 4 ...
## $ Months_Inactive_12_mon : int 3 2 1 3 3 2 1 0 2 2 ...
## $ Contacts_Count_12_mon : int 2 2 2 3 2 3 2 0 2 1 ...
## $ Credit_Limit : num 6748 2436 14470 1438 2492 ...
## $ Total_Revolving_Bal : int 1467 680 1157 0 1560 0 1669 1362 1696 1914 ...
## $ Avg_Open_To_Buy : num 5281 1756 13313 1438 932 ...
## $ Total_Amt_Chng_Q4_Q1 : num 0.831 1.19 0.966 1.047 0.573 ...
## $ Total_Trans_Amt : int 1201 1570 1207 692 1126 1110 1051 1045 1291 1407 ...
## $ Total_Trans_Ct : int 42 29 21 16 23 21 22 38 24 43 ...
## $ Total_Ct_Chng_Q4_Q1 : num 0.68 0.611 0.909 0.6 0.353 0.75 0.833 0.9 0.714 0.483 ...
## $ Avg_Utilization_Ratio : num 0.217 0.279 0.08 0 0.626 0 0.215 0.285 0.135 0.544 ...
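Note that factor() assigns integer codes in the alphabetical order of the levels, so the mapping for any encoded column can be recovered from the original data, for example:
levels(factor(data$Income_Category))   # code i corresponds to the i-th level listed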
Finally, the target column is removed from the dataset.
data_features <- clean[,-1]
A correlation plot is created to check for feature pairs with high correlation, which could cause multicollinearity. Apart from one or two pairs, no highly correlated features appear, which is acceptable in this setting.
corrplot::corrplot(cor(data_features), method = "color")
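To make the "one or two pairs" claim precise, the strongly correlated pairs can be listed programmatically; a minimal sketch, where the 0.8 cutoff is an assumption rather than part of the original analysis:
cm <- cor(data_features)
cm[upper.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
high <- which(abs(cm) > 0.8, arr.ind = TRUE)    # assumed cutoff of |r| > 0.8
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r = cm[high])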
To choose K for K-means clustering, the average silhouette method is applied first. The graph attached below shows that the optimal number of clusters is 2, so K-means will be applied with K set to 2.
fviz_nbclust(data_features,
FUNcluster = kmeans,
method = "silhouette") +
theme_classic()
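For reference, the average silhouette widths that fviz_nbclust plots can be reproduced with the cluster package; a sketch under assumed settings (fixed seed, nstart = 25), noting that dist() on all 6903 rows needs a few hundred MB of memory:
library(cluster)   # for silhouette()
set.seed(123)
d <- dist(data_features)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(data_features, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
round(avg_sil, 3)   # k = 2 should score highest, matching the plot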
Next, K-means clustering is applied with K set to 2 and hc_metric set to "pearson". The table attached shows that cluster 1 contains 5232 observations while cluster 2 contains 1671. The silhouette plot below shows an overall average silhouette width of about 0.61.
kmeans_clustering <- eclust(data_features,
k = 2,
FUNcluster = "kmeans",
hc_metric = "pearson",
graph = FALSE)
fviz_silhouette(kmeans_clustering)
## cluster size ave.sil.width
## 1 1 5232 0.66
## 2 2 1671 0.46
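The overall average silhouette width of about 0.61 is simply the size-weighted mean of the two per-cluster values reported above:
weighted.mean(c(0.66, 0.46), w = c(5232, 1671))   # ~0.61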
To visualize the clusters, another plot is created and attached below.
fviz_cluster(kmeans_clustering,
data = data_features,
ellipse.type = "convex") +
theme_minimal()
Finally, the clustering is compared against the recoded attrition labels. Matching cluster 1 to attrited customers (1) and cluster 2 to existing customers (2), the table below shows that 4333 + 286 = 4619 observations are misclassified in total.
table(kmeans_clustering$cluster, clean$Attrition_Flag)
##
## 1 2
## 1 899 4333
## 2 286 1385
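The misclassification count follows directly from this table: under the direct label matching used here (cluster i to class i), the off-diagonal cells are the errors.
tab <- table(kmeans_clustering$cluster, clean$Attrition_Flag)
sum(tab) - sum(diag(tab))   # 4333 + 286 = 4619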
Next, the data is sampled down to speed up the hierarchical clustering process. Three linkage variants are compared: single, complete, and ward.
clean$Attrition_Flag <- as.numeric(clean$Attrition_Flag)
clean <- clean %>%
  arrange(-Attrition_Flag)   # existing customers (flag 2) come first
# After sorting, rows 1:5718 carry flag 2 and rows 5719:6903 carry flag 1,
# so this sample takes 1000 existing and 1001 attrited customers
data_sample <- clean[c(1:1000, 5719:6719),]
data_sample_features <- data_sample[,-1]
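A quick check confirms the composition of the sample; the expected counts follow from the contingency tables further below (1001 attrited, 1000 existing):
table(data_sample$Attrition_Flag)   # expect 1001 customers with flag 1, 1000 with flag 2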
First of all, the distance matrix is calculated using the Euclidean distance.
distance_mat <- dist(data_sample_features,
method = 'euclidean')
The first hierarchical clustering technique applied is single linkage.
cluster_single <- eclust(distance_mat,
k = 2,
FUNcluster="hclust",
hc_method = "single")
The plot thus obtained is attached below.
plot(cluster_single,
cex=0.6,
hang=-1,
main = "Dendrogram of HAC")
rect.hclust(cluster_single,
k = 2,
border='red')
Checking the efficiency of single linkage against the attrition labels, the table below shows that 999 + 0 = 999 observations are misclassified.
clusterCut <- cutree(cluster_single, 2)
table(clusterCut, data_sample$Attrition_Flag)
##
## clusterCut 1 2
## 1 1001 999
## 2 0 1
Next, the complete linkage method is applied.
cluster_complete <- eclust(distance_mat,
k = 2,
FUNcluster="hclust",
hc_method = "complete")
The dendrogram for complete linkage is attached below.
plot(cluster_complete,
cex=0.6,
hang=-1,
main = "Dendrogram of HAC")
rect.hclust(cluster_complete,
k = 2,
border='red')
Checking the efficiency of complete linkage, 811 + 123 = 934 observations are misclassified.
clusterCut <- cutree(cluster_complete, 2)
table(clusterCut, data_sample$Attrition_Flag)
##
## clusterCut 1 2
## 1 878 811
## 2 123 189
Finally, the ward linkage method is applied.
cluster_ward <- eclust(distance_mat,
k = 2,
FUNcluster="hclust",
hc_method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The plot thus created is attached below:
plot(cluster_ward,
cex=0.6,
hang=-1,
main = "Dendrogram of HAC")
rect.hclust(cluster_ward,
k = 2,
border='red')
Checking the efficiency of ward linkage, 721 + 210 = 931 observations are misclassified.
clusterCut <- cutree(cluster_ward, 2)
table(clusterCut, data_sample$Attrition_Flag)
##
## clusterCut 1 2
## 1 791 721
## 2 210 279
Based on the above results, the best performing method is ward linkage. Its cluster statistics are obtained and attached below: the cluster sizes are 1512 and 489, the average within-cluster distance is 3252.208, and the average silhouette width is about 0.64.
hc_stats <- cluster.stats(distance_mat, cluster_ward$cluster)
hc_stats
## $n
## [1] 2001
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 1512 489
##
## $min.cluster.size
## [1] 489
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7782.372 13252.449
##
## $average.distance
## [1] 2877.873 4409.663
##
## $median.distance
## [1] 2665.506 3755.352
##
## $separation
## [1] 264.0739 264.0739
##
## $average.toother
## [1] 9951.59 9951.59
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.0000 264.0739
## [2,] 264.0739 0.0000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.00 9951.59
## [2,] 9951.59 0.00
##
## $average.between
## [1] 9951.59
##
## $average.within
## [1] 3252.208
##
## $n.between
## [1] 739368
##
## $n.within
## [1] 1261632
##
## $max.diameter
## [1] 13252.45
##
## $min.separation
## [1] 264.0739
##
## $within.cluster.ss
## [1] 14422279193
##
## $clus.avg.silwidths
## 1 2
## 0.6837822 0.5160043
##
## $avg.silwidth
## [1] 0.642781
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.7782616
##
## $dunn
## [1] 0.01992642
##
## $dunn2
## [1] 2.256769
##
## $entropy
## [1] 0.5560738
##
## $wb.ratio
## [1] 0.3268029
##
## $ch
## [1] 4880.681
##
## $cwidegap
## [1] 1021.150 1347.484
##
## $widestgap
## [1] 1347.484
##
## $sindex
## [1] 975.2363
##
## $corrected.rand
## NULL
##
## $vi
## NULL