# Loading required libraries
library(dplyr)        # data manipulation (arrange, pipes)
library(ggplot2)      # plotting backend for factoextra graphics
library(clustertend)  # clustering tendency statistics (Hopkins)
library(factoextra)   # eclust(), fviz_nbclust(), fviz_* helpers
library(fpc)          # cluster.stats() validation metrics
In this project we work with a customer churn dataset. The goal is to apply different clustering techniques to group customers based on the information present in the dataset. The dataset is taken from Kaggle and can be found at https://www.kaggle.com/datasets/thedevastator/predicting-credit-card-customer-attrition-with-m. It contains 20 attributes and 10127 observations.
The dataset was collected from a consumer credit card portfolio with the aim of helping researchers predict customer attrition. It contains demographic information such as age, gender, marital status, and income category, as well as details of each customer's relationship with the card issuer, such as the card type, the number of months on book, and periods of inactivity. It also records spending behaviour relevant to customers approaching the decision to leave, such as the total revolving balance, credit limit, average open-to-buy, and derived metrics such as the change in transaction amount from quarter 4 to quarter 1. Together these variables give an up-to-date picture of each account, useful both for managing a portfolio and for judging whether a long-standing account is likely to remain stable or to churn in the near future.
Before performing any data analysis, we need to clean and prepare the dataset. The very first step after reading the data is to remove the ID column, as it carries no useful information for clustering.
data <- read.csv("BankChurners.csv")
data <- data[, 2:21]   # drop the ID column, keep the remaining 20 attributes
head(data)
After dropping the ID column, the dataset contains 10127 observations and 20 attributes.
dim(data)
## [1] 10127 20
The dataset is checked for missing observations; none of the columns contains any missing values.
# Missing observations
colSums(is.na(data))
## Attrition_Flag Customer_Age Gender
## 0 0 0
## Dependent_count Education_Level Marital_Status
## 0 0 0
## Income_Category Card_Category Months_on_book
## 0 0 0
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## 0 0 0
## Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy
## 0 0 0
## Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct
## 0 0 0
## Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## 0 0
The categorical columns are checked for garbage values; all of them contain only valid levels, so no further cleaning is required.
# Checking garbage values in categorical columns
unique(data$Gender)
## [1] "M" "F"
unique(data$Education_Level)
## [1] "High School" "Graduate" "Uneducated" "Unknown"
## [5] "College" "Post-Graduate" "Doctorate"
unique(data$Marital_Status)
## [1] "Married" "Single" "Unknown" "Divorced"
unique(data$Income_Category)
## [1] "$60K - $80K" "Less than $40K" "$80K - $120K" "$40K - $60K"
## [5] "$120K +" "Unknown"
unique(data$Card_Category)
## [1] "Blue" "Gold" "Silver" "Platinum"
The summary of the dataset is obtained and is attached below:
summary(data)
## Attrition_Flag Customer_Age Gender Dependent_count
## Length:10127 Min. :26.00 Length:10127 Min. :0.000
## Class :character 1st Qu.:41.00 Class :character 1st Qu.:1.000
## Mode :character Median :46.00 Mode :character Median :2.000
## Mean :46.33 Mean :2.346
## 3rd Qu.:52.00 3rd Qu.:3.000
## Max. :73.00 Max. :5.000
## Education_Level Marital_Status Income_Category Card_Category
## Length:10127 Length:10127 Length:10127 Length:10127
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Months_on_book Total_Relationship_Count Months_Inactive_12_mon
## Min. :13.00 Min. :1.000 Min. :0.000
## 1st Qu.:31.00 1st Qu.:3.000 1st Qu.:2.000
## Median :36.00 Median :4.000 Median :2.000
## Mean :35.93 Mean :3.813 Mean :2.341
## 3rd Qu.:40.00 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :56.00 Max. :6.000 Max. :6.000
## Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy
## Min. :0.000 Min. : 1438 Min. : 0 Min. : 3
## 1st Qu.:2.000 1st Qu.: 2555 1st Qu.: 359 1st Qu.: 1324
## Median :2.000 Median : 4549 Median :1276 Median : 3474
## Mean :2.455 Mean : 8632 Mean :1163 Mean : 7469
## 3rd Qu.:3.000 3rd Qu.:11068 3rd Qu.:1784 3rd Qu.: 9859
## Max. :6.000 Max. :34516 Max. :2517 Max. :34516
## Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1
## Min. :0.0000 Min. : 510 Min. : 10.00 Min. :0.0000
## 1st Qu.:0.6310 1st Qu.: 2156 1st Qu.: 45.00 1st Qu.:0.5820
## Median :0.7360 Median : 3899 Median : 67.00 Median :0.7020
## Mean :0.7599 Mean : 4404 Mean : 64.86 Mean :0.7122
## 3rd Qu.:0.8590 3rd Qu.: 4741 3rd Qu.: 81.00 3rd Qu.:0.8180
## Max. :3.3970 Max. :18484 Max. :139.00 Max. :3.7140
## Avg_Utilization_Ratio
## Min. :0.0000
## 1st Qu.:0.0230
## Median :0.1760
## Mean :0.2749
## 3rd Qu.:0.5030
## Max. :0.9990
The numerical columns are checked for outliers using the 1.5 × IQR rule, and rows are removed for all columns where the summary statistics above show a large gap between the third quartile and the maximum.
# Removing outliers if present
# Flag values lying more than 1.5 * IQR beyond the quartiles
outliers <- function(x) {
  Q1 <- quantile(x, probs = 0.25)
  Q3 <- quantile(x, probs = 0.75)
  iqr <- Q3 - Q1
  upper_limit <- Q3 + 1.5 * iqr
  lower_limit <- Q1 - 1.5 * iqr
  x > upper_limit | x < lower_limit
}
# Drop every row that is an outlier in any of the given columns
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    df <- df[!outliers(df[[col]]), ]
  }
  df
}
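As a quick sanity check, the helper can be applied to a toy vector (not part of the original analysis); only the extreme value should be flagged:
outliers(c(1:10, 100))   # TRUE only for the last element (100)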
clean <- remove_outliers(data, c('Customer_Age', 'Avg_Utilization_Ratio',
'Total_Ct_Chng_Q4_Q1', 'Total_Trans_Ct',
'Total_Trans_Amt', 'Total_Amt_Chng_Q4_Q1',
'Avg_Open_To_Buy', 'Total_Revolving_Bal',
'Credit_Limit', 'Months_on_book'))
After removing the outliers we are left with 6903 observations.
dim(clean)
## [1] 6903 20
The target column, Attrition_Flag, is recoded: 2 for existing customers and 1 for attrited customers.
clean$Attrition_Flag[clean$Attrition_Flag == "Existing Customer"] <- 2
clean$Attrition_Flag[clean$Attrition_Flag == "Attrited Customer"] <- 1
unique(clean$Attrition_Flag)
## [1] "2" "1"
All categorical columns are converted to numerical columns using label encoding. This is because the clustering methods used here operate on a distance matrix, and distances cannot be computed directly on character columns.
clean$Gender <- as.numeric(factor(clean$Gender))
clean$Education_Level <- as.numeric(factor(clean$Education_Level))
clean$Marital_Status <- as.numeric(factor(clean$Marital_Status))
clean$Income_Category <- as.numeric(factor(clean$Income_Category))
clean$Card_Category <- as.numeric(factor(clean$Card_Category))
str(clean)
## 'data.frame': 6903 obs. of 20 variables:
## $ Attrition_Flag : chr "2" "2" "2" "1" ...
## $ Customer_Age : int 42 57 45 62 47 54 41 47 58 55 ...
## $ Gender : num 2 1 1 1 1 2 1 2 2 1 ...
## $ Dependent_count : int 5 2 2 0 4 2 3 4 0 1 ...
## $ Education_Level : num 6 3 3 3 7 7 3 4 3 1 ...
## $ Marital_Status : num 4 2 2 2 3 2 3 2 2 3 ...
## $ Income_Category : num 1 5 6 5 5 4 5 2 4 5 ...
## $ Card_Category : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Months_on_book : int 31 48 37 49 36 42 28 42 49 36 ...
## $ Total_Relationship_Count: int 5 5 6 2 3 4 6 6 6 4 ...
## $ Months_Inactive_12_mon : int 3 2 1 3 3 2 1 0 2 2 ...
## $ Contacts_Count_12_mon : int 2 2 2 3 2 3 2 0 2 1 ...
## $ Credit_Limit : num 6748 2436 14470 1438 2492 ...
## $ Total_Revolving_Bal : int 1467 680 1157 0 1560 0 1669 1362 1696 1914 ...
## $ Avg_Open_To_Buy : num 5281 1756 13313 1438 932 ...
## $ Total_Amt_Chng_Q4_Q1 : num 0.831 1.19 0.966 1.047 0.573 ...
## $ Total_Trans_Amt : int 1201 1570 1207 692 1126 1110 1051 1045 1291 1407 ...
## $ Total_Trans_Ct : int 42 29 21 16 23 21 22 38 24 43 ...
## $ Total_Ct_Chng_Q4_Q1 : num 0.68 0.611 0.909 0.6 0.353 0.75 0.833 0.9 0.714 0.483 ...
## $ Avg_Utilization_Ratio : num 0.217 0.279 0.08 0 0.626 0 0.215 0.285 0.135 0.544 ...
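Note that factor() assigns integer codes in the alphabetical order of the levels, so the mapping for any encoded column can be recovered from the original data, for example:
levels(factor(data$Income_Category))   # code i corresponds to the i-th level listed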
Finally, the target column is removed from the dataset.
data_features <- clean[,-1]
A correlation plot is created to check for feature pairs with high correlation, which could cause multicollinearity. Apart from one or two pairs, no highly correlated features appear, which is acceptable in this setting.
corrplot::corrplot(cor(data_features), method = "color")
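To make the "one or two pairs" claim precise, the strongly correlated pairs can be listed programmatically; a minimal sketch, where the 0.8 cutoff is an assumption rather than part of the original analysis:
cm <- cor(data_features)
cm[upper.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
high <- which(abs(cm) > 0.8, arr.ind = TRUE)    # assumed cutoff of |r| > 0.8
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r = cm[high])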
To choose K for K-means clustering, the average silhouette method is applied first. The graph attached below shows that the optimal number of clusters is 2, so K-means will be applied with K set to 2.
fviz_nbclust(data_features,
FUNcluster = kmeans,
method = "silhouette") +
theme_classic()
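For reference, the average silhouette widths that fviz_nbclust plots can be reproduced with the cluster package; a sketch under assumed settings (fixed seed, nstart = 25), noting that dist() on all 6903 rows needs a few hundred MB of memory:
library(cluster)   # for silhouette()
set.seed(123)
d <- dist(data_features)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(data_features, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
round(avg_sil, 3)   # k = 2 should score highest, matching the plot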
Next, K-means clustering is applied with K set to 2 and hc_metric set to "pearson". The table attached shows that cluster 1 contains 5232 observations while cluster 2 contains 1671. The silhouette plot below shows an overall average silhouette width of about 0.61.
kmeans_clustering <- eclust(data_features,
k = 2,
FUNcluster = "kmeans",
hc_metric = "pearson",
graph = FALSE)
fviz_silhouette(kmeans_clustering)
## cluster size ave.sil.width
## 1 1 5232 0.66
## 2 2 1671 0.46
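The overall average silhouette width of about 0.61 is simply the size-weighted mean of the two per-cluster values reported above:
weighted.mean(c(0.66, 0.46), w = c(5232, 1671))   # ~0.61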
To visualize the clusters, another plot is created and attached below.
fviz_cluster(kmeans_clustering,
data = data_features,
ellipse.type = "convex") +
theme_minimal()
Finally, the clustering is compared against the recoded attrition labels. Matching cluster 1 to attrited customers (1) and cluster 2 to existing customers (2), the table below shows that 4333 + 286 = 4619 observations are misclassified in total.
table(kmeans_clustering$cluster, clean$Attrition_Flag)
##
## 1 2
## 1 899 4333
## 2 286 1385
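The misclassification count follows directly from this table: under the direct label matching used here (cluster i to class i), the off-diagonal cells are the errors.
tab <- table(kmeans_clustering$cluster, clean$Attrition_Flag)
sum(tab) - sum(diag(tab))   # 4333 + 286 = 4619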
Next, the data is sampled down to speed up the hierarchical clustering process. Three linkage variants are compared: single, complete, and ward.
clean$Attrition_Flag <- as.numeric(clean$Attrition_Flag)
clean <- clean %>%
  arrange(-Attrition_Flag)   # existing customers (flag 2) come first
# After sorting, rows 1:5718 carry flag 2 and rows 5719:6903 carry flag 1,
# so this sample takes 1000 existing and 1001 attrited customers
data_sample <- clean[c(1:1000, 5719:6719),]
data_sample_features <- data_sample[,-1]
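A quick check confirms the composition of the sample; the expected counts follow from the contingency tables further below (1001 attrited, 1000 existing):
table(data_sample$Attrition_Flag)   # expect 1001 customers with flag 1, 1000 with flag 2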
First of all, the distance matrix is calculated using the Euclidean distance.
distance_mat <- dist(data_sample_features,
method = 'euclidean')
The first hierarchical clustering technique applied is single linkage.
cluster_single <- eclust(distance_mat,
k = 2,
FUNcluster="hclust",
hc_method = "single")
The plot thus obtained is attached below.
plot(cluster_single,
cex=0.6,
hang=-1,
main = "Dendrogram of HAC")
rect.hclust(cluster_single,
k = 2,
border='red')
Checking the efficiency of single linkage against the attrition labels, the table below shows that 999 + 0 = 999 observations are misclassified.
clusterCut <- cutree(cluster_single, 2)
table(clusterCut, data_sample$Attrition_Flag)
##
## clusterCut 1 2
## 1 1001 999
## 2 0 1
Next, the complete linkage method is applied.
cluster_complete <- eclust(distance_mat,
k = 2,
FUNcluster="hclust",
hc_method = "complete")
The dendrogram for complete linkage is attached below.
plot(cluster_complete,
cex=0.6,
hang=-1,
main = "Dendrogram of HAC")
rect.hclust(cluster_complete,
k = 2,
border='red')
Checking the efficiency of complete linkage, 811 + 123 = 934 observations are misclassified.
clusterCut <- cutree(cluster_complete, 2)
table(clusterCut, data_sample$Attrition_Flag)
##
## clusterCut 1 2
## 1 878 811
## 2 123 189
Finally, the ward linkage method is applied.
cluster_ward <- eclust(distance_mat,
k = 2,
FUNcluster="hclust",
hc_method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
The plot thus created is attached below:
plot(cluster_ward,
cex=0.6,
hang=-1,
main = "Dendrogram of HAC")
rect.hclust(cluster_ward,
k = 2,
border='red')
Checking the efficiency of ward linkage, 721 + 210 = 931 observations are misclassified.
clusterCut <- cutree(cluster_ward, 2)
table(clusterCut, data_sample$Attrition_Flag)
##
## clusterCut 1 2
## 1 791 721
## 2 210 279
Based on the above results, the best performing method is ward linkage. Its cluster statistics are obtained and attached below: the cluster sizes are 1512 and 489, the average within-cluster distance is 3252.208, and the average silhouette width is about 0.64.
hc_stats <- cluster.stats(distance_mat, cluster_ward$cluster)
hc_stats
## $n
## [1] 2001
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 1512 489
##
## $min.cluster.size
## [1] 489
##
## $noisen
## [1] 0
##
## $diameter
## [1] 7782.372 13252.449
##
## $average.distance
## [1] 2877.873 4409.663
##
## $median.distance
## [1] 2665.506 3755.352
##
## $separation
## [1] 264.0739 264.0739
##
## $average.toother
## [1] 9951.59 9951.59
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.0000 264.0739
## [2,] 264.0739 0.0000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.00 9951.59
## [2,] 9951.59 0.00
##
## $average.between
## [1] 9951.59
##
## $average.within
## [1] 3252.208
##
## $n.between
## [1] 739368
##
## $n.within
## [1] 1261632
##
## $max.diameter
## [1] 13252.45
##
## $min.separation
## [1] 264.0739
##
## $within.cluster.ss
## [1] 14422279193
##
## $clus.avg.silwidths
## 1 2
## 0.6837822 0.5160043
##
## $avg.silwidth
## [1] 0.642781
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.7782616
##
## $dunn
## [1] 0.01992642
##
## $dunn2
## [1] 2.256769
##
## $entropy
## [1] 0.5560738
##
## $wb.ratio
## [1] 0.3268029
##
## $ch
## [1] 4880.681
##
## $cwidegap
## [1] 1021.150 1347.484
##
## $widestgap
## [1] 1347.484
##
## $sindex
## [1] 975.2363
##
## $corrected.rand
## NULL
##
## $vi
## NULL