Credit Card Clustering with R

About Dataset

This case requires to develop a customer segmentation to define marketing strategy. The sample Dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables.

Following is the Data Dictionary for Credit Card dataset :-

CUST_ID : Identification of Credit Card holder (Categorical) BALANCE : Balance amount left in their account to make purchases BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) PURCHASES : Amount of purchases made from account ONEOFF_PURCHASES : Maximum purchase amount done in one-go INSTALLMENTS_PURCHASES : Amount of purchase done in installment CASH_ADVANCE : Cash in advance given by the user PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased) ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased) PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done) CASHADVANCEFREQUENCY : How frequently the cash in advance being paid CASHADVANCETRX : Number of Transactions made with "Cash in Advanced" PURCHASES_TRX : Number of purchase transactions made CREDIT_LIMIT : Limit of Credit Card for user PAYMENTS : Amount of Payment done by user MINIMUM_PAYMENTS : Minimum amount of payments made by user PRCFULLPAYMENT : Percent of full payment paid by user TENURE : Tenure of credit card service for user

Strategy:

Data Preprocessing : After having imported the dataset I will have to check and see if the data requires any kind of scaling and if all the explanatory variables have the right data type. I might assume that the column representing customer id's is not necessary. I will check for missing values and remove these rows or replace them if needed.
Exploratory Data Analysis: I will try and understand the data a bit further and generate the descriptive statistics. After which I will check for outliers and remove them accordingly. I will then check and see if any of the explanatory variables are correlated to each other.
I will find an optimal number for the number of clusters using the Silhouette Method in combination with the Elbow Method.
I will use K Means Clustering with the newly generated optimal value for K to generate the clusters.
The final results will be analysed.

1. Data Preprocessing

library(dplyr)
library(ggplot2)
library(gridExtra)
library(tidyr)
library(cluster)
library(caret)
library(reshape2)
library(corrplot)
library(randomForest)
library(lmtest)
library(zoo)
library(factoextra)
library(compareGroups)

Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



Attaching package: 'gridExtra'


The following object is masked from 'package:dplyr':

    combine


Loading required package: lattice


Attaching package: 'reshape2'


The following object is masked from 'package:tidyr':

    smiths


corrplot 0.92 loaded

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'


The following object is masked from 'package:gridExtra':

    combine


The following object is masked from 'package:ggplot2':

    margin


The following object is masked from 'package:dplyr':

    combine


Loading required package: zoo


Attaching package: 'zoo'


The following objects are masked from 'package:base':

    as.Date, as.Date.numeric


Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

ccards <- read.csv("CC_General.csv")

# Checking to see if the data set loaded correctly
head(ccards, 10)

A data.frame: 10 × 18

	CUST_ID <chr>	BALANCE <dbl>	BALANCE_FREQUENCY <dbl>	PURCHASES <dbl>	ONEOFF_PURCHASES <dbl>	INSTALLMENTS_PURCHASES <dbl>	CASH_ADVANCE <dbl>	PURCHASES_FREQUENCY <dbl>	ONEOFF_PURCHASES_FREQUENCY <dbl>	PURCHASES_INSTALLMENTS_FREQUENCY <dbl>	CASH_ADVANCE_FREQUENCY <dbl>	CASH_ADVANCE_TRX <int>	PURCHASES_TRX <int>	CREDIT_LIMIT <dbl>	PAYMENTS <dbl>	MINIMUM_PAYMENTS <dbl>	PRC_FULL_PAYMENT <dbl>	TENURE <int>
1	C10001	40.90075	0.818182	95.40	0.00	95.40	0.000	0.166667	0.000000	0.083333	0.000000	0	2	1000	201.8021	139.5098	0.000000	12
2	C10002	3202.46742	0.909091	0.00	0.00	0.00	6442.945	0.000000	0.000000	0.000000	0.250000	4	0	7000	4103.0326	1072.3402	0.222222	12
3	C10003	2495.14886	1.000000	773.17	773.17	0.00	0.000	1.000000	1.000000	0.000000	0.000000	0	12	7500	622.0667	627.2848	0.000000	12
4	C10004	1666.67054	0.636364	1499.00	1499.00	0.00	205.788	0.083333	0.083333	0.000000	0.083333	1	1	7500	0.0000	NA	0.000000	12
5	C10005	817.71434	1.000000	16.00	16.00	0.00	0.000	0.083333	0.083333	0.000000	0.000000	0	1	1200	678.3348	244.7912	0.000000	12
6	C10006	1809.82875	1.000000	1333.28	0.00	1333.28	0.000	0.666667	0.000000	0.583333	0.000000	0	8	1800	1400.0578	2407.2460	0.000000	12
7	C10007	627.26081	1.000000	7091.01	6402.63	688.38	0.000	1.000000	1.000000	1.000000	0.000000	0	64	13500	6354.3143	198.0659	1.000000	12
8	C10008	1823.65274	1.000000	436.20	0.00	436.20	0.000	1.000000	0.000000	1.000000	0.000000	0	12	2300	679.0651	532.0340	0.000000	12
9	C10009	1014.92647	1.000000	861.49	661.49	200.00	0.000	0.333333	0.083333	0.250000	0.000000	0	5	7000	688.2786	311.9634	0.000000	12
10	C10010	152.22598	0.545455	1281.60	1281.60	0.00	0.000	0.166667	0.166667	0.000000	0.000000	0	3	11000	1164.7706	100.3023	0.000000	12

str(ccards)

'data.frame':   8950 obs. of  18 variables:
 $ CUST_ID                         : chr  "C10001" "C10002" "C10003" "C10004" ...
 $ BALANCE                         : num  40.9 3202.5 2495.1 1666.7 817.7 ...
 $ BALANCE_FREQUENCY               : num  0.818 0.909 1 0.636 1 ...
 $ PURCHASES                       : num  95.4 0 773.2 1499 16 ...
 $ ONEOFF_PURCHASES                : num  0 0 773 1499 16 ...
 $ INSTALLMENTS_PURCHASES          : num  95.4 0 0 0 0 ...
 $ CASH_ADVANCE                    : num  0 6443 0 206 0 ...
 $ PURCHASES_FREQUENCY             : num  0.1667 0 1 0.0833 0.0833 ...
 $ ONEOFF_PURCHASES_FREQUENCY      : num  0 0 1 0.0833 0.0833 ...
 $ PURCHASES_INSTALLMENTS_FREQUENCY: num  0.0833 0 0 0 0 ...
 $ CASH_ADVANCE_FREQUENCY          : num  0 0.25 0 0.0833 0 ...
 $ CASH_ADVANCE_TRX                : int  0 4 0 1 0 0 0 0 0 0 ...
 $ PURCHASES_TRX                   : int  2 0 12 1 1 8 64 12 5 3 ...
 $ CREDIT_LIMIT                    : num  1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
 $ PAYMENTS                        : num  202 4103 622 0 678 ...
 $ MINIMUM_PAYMENTS                : num  140 1072 627 NA 245 ...
 $ PRC_FULL_PAYMENT                : num  0 0.222 0 0 0 ...
 $ TENURE                          : int  12 12 12 12 12 12 12 12 12 12 ...

# This will remove the customer ID column which is not necessary for clustering
ccards <- subset(ccards, select = -CUST_ID)

Error in eval(substitute(select), nl, parent.frame()): object 'CUST_ID' not found
Traceback:

1. subset(ccards, select = -CUST_ID)
2. subset.data.frame(ccards, select = -CUST_ID)
3. eval(substitute(select), nl, parent.frame())
4. eval(substitute(select), nl, parent.frame())

# Check for missing values in the dataset
sum(is.na(ccards))

314

# Filling the missing values with the mean of each column
ccards <- na.aggregate(ccards, FUN = mean)

# Check for missing values in the dataset
sum(is.na(ccards))

2. Exploratory Data Analysis

attach(ccards)

summary(ccards)

    BALANCE        BALANCE_FREQUENCY   PURCHASES        ONEOFF_PURCHASES 
 Min.   :    0.0   Min.   :0.0000    Min.   :    0.00   Min.   :    0.0  
 1st Qu.:  128.3   1st Qu.:0.8889    1st Qu.:   39.63   1st Qu.:    0.0  
 Median :  873.4   Median :1.0000    Median :  361.28   Median :   38.0  
 Mean   : 1564.5   Mean   :0.8773    Mean   : 1003.20   Mean   :  592.4  
 3rd Qu.: 2054.1   3rd Qu.:1.0000    3rd Qu.: 1110.13   3rd Qu.:  577.4  
 Max.   :19043.1   Max.   :1.0000    Max.   :49039.57   Max.   :40761.2  
 INSTALLMENTS_PURCHASES  CASH_ADVANCE     PURCHASES_FREQUENCY
 Min.   :    0.0        Min.   :    0.0   Min.   :0.00000    
 1st Qu.:    0.0        1st Qu.:    0.0   1st Qu.:0.08333    
 Median :   89.0        Median :    0.0   Median :0.50000    
 Mean   :  411.1        Mean   :  978.9   Mean   :0.49035    
 3rd Qu.:  468.6        3rd Qu.: 1113.8   3rd Qu.:0.91667    
 Max.   :22500.0        Max.   :47137.2   Max.   :1.00000    
 ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
 Min.   :0.00000            Min.   :0.0000                  
 1st Qu.:0.00000            1st Qu.:0.0000                  
 Median :0.08333            Median :0.1667                  
 Mean   :0.20246            Mean   :0.3644                  
 3rd Qu.:0.30000            3rd Qu.:0.7500                  
 Max.   :1.00000            Max.   :1.0000                  
 CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX  PURCHASES_TRX     CREDIT_LIMIT  
 Min.   :0.0000         Min.   :  0.000   Min.   :  0.00   Min.   :   50  
 1st Qu.:0.0000         1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.: 1600  
 Median :0.0000         Median :  0.000   Median :  7.00   Median : 3000  
 Mean   :0.1351         Mean   :  3.249   Mean   : 14.71   Mean   : 4494  
 3rd Qu.:0.2222         3rd Qu.:  4.000   3rd Qu.: 17.00   3rd Qu.: 6500  
 Max.   :1.5000         Max.   :123.000   Max.   :358.00   Max.   :30000  
    PAYMENTS       MINIMUM_PAYMENTS   PRC_FULL_PAYMENT     TENURE     
 Min.   :    0.0   Min.   :    0.02   Min.   :0.0000   Min.   : 6.00  
 1st Qu.:  383.3   1st Qu.:  170.86   1st Qu.:0.0000   1st Qu.:12.00  
 Median :  856.9   Median :  335.63   Median :0.0000   Median :12.00  
 Mean   : 1733.1   Mean   :  864.21   Mean   :0.1537   Mean   :11.52  
 3rd Qu.: 1901.1   3rd Qu.:  864.21   3rd Qu.:0.1429   3rd Qu.:12.00  
 Max.   :50721.5   Max.   :76406.21   Max.   :1.0000   Max.   :12.00

Removing Outliers

# Calculate the z-score for each variable
z_score <- as.data.frame(lapply(ccards, scale))
# Identify the indices of outliers
outliers <- as.vector(which(abs(z_score) > 3, arr.ind = TRUE)[, 1])
# Remove the outliers
data_cleaned <- ccards[-outliers, ]

Examining Correlations of Explanatory Variables

# Subset the explanatory variables
explanatory_vars <- ccards[, 2:ncol(ccards)]

# Calculate the correlation matrix
corr_matrix <- cor(explanatory_vars)

# Convert the correlation matrix to a data frame
corr_df <- melt(corr_matrix)

# Create the heatmap
ggplot(corr_df, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, limit = c(-1,1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  geom_text(aes(label = round(value, 2)), size = 3, color = "black")

There are quite a few variables that are strongly correlated with one another such as PURCHASES and ONE_OFF_PURCHASES as well as PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY. This is not alarming for now, as K means Clustering is the algorithm which will be used for clustering. I am considering applying Principal Component Analysis because of this but it might not be necessary considering the results of the cluster analysis and if they seem adequate enough.

3. Determining the Number of Clusters

Using the Silhouette Method

fviz_nbclust(data_cleaned, kmeans, method='silhouette')

Using the Elbow Method

fviz_nbclust(data_cleaned, kmeans, method = "wss")

The results of the silhouette analysis are very determinable whereas it is more difficult to understand the plot generated when using the elbow method. It would seem that two clusters seem to be the optimal number but it might be worth trying a 3 or 4 cluster solution.

4. Final Clustering

km.final <- kmeans(data_cleaned, 2)

## Total Within cluster sum of square
km.final$tot.withinss

## Cluster sizes
km.final$size


data_cleaned$cluster <- km.final$cluster
head(data_cleaned, 6)

74629888182.833

2328
5106

A data.frame: 6 × 18

	BALANCE <dbl>	BALANCE_FREQUENCY <dbl>	PURCHASES <dbl>	ONEOFF_PURCHASES <dbl>	INSTALLMENTS_PURCHASES <dbl>	CASH_ADVANCE <dbl>	PURCHASES_FREQUENCY <dbl>	ONEOFF_PURCHASES_FREQUENCY <dbl>	PURCHASES_INSTALLMENTS_FREQUENCY <dbl>	CASH_ADVANCE_FREQUENCY <dbl>	CASH_ADVANCE_TRX <dbl>	PURCHASES_TRX <dbl>	CREDIT_LIMIT <dbl>	PAYMENTS <dbl>	MINIMUM_PAYMENTS <dbl>	PRC_FULL_PAYMENT <dbl>	TENURE <dbl>	cluster <int>
1	40.90075	0.818182	95.40	0.00	95.40	0.000	0.166667	0.000000	0.083333	0.000000	0	2	1000	201.8021	139.5098	0.000000	12	2
2	3202.46742	0.909091	0.00	0.00	0.00	6442.945	0.000000	0.000000	0.000000	0.250000	4	0	7000	4103.0326	1072.3402	0.222222	12	1
3	2495.14886	1.000000	773.17	773.17	0.00	0.000	1.000000	1.000000	0.000000	0.000000	0	12	7500	622.0667	627.2848	0.000000	12	1
4	1666.67054	0.636364	1499.00	1499.00	0.00	205.788	0.083333	0.083333	0.000000	0.083333	1	1	7500	0.0000	864.2065	0.000000	12	1
5	817.71434	1.000000	16.00	16.00	0.00	0.000	0.083333	0.083333	0.000000	0.000000	0	1	1200	678.3348	244.7912	0.000000	12	2
6	1809.82875	1.000000	1333.28	0.00	1333.28	0.000	0.666667	0.000000	0.583333	0.000000	0	8	1800	1400.0578	2407.2460	0.000000	12	2

5. Analysis

# Create subset of data for cluster 1
cluster1 <- subset(data_cleaned, cluster == 1)

# Create subset of data for cluster 2
cluster2 <- subset(data_cleaned, cluster == 2)

fviz_cluster(km.final, data = data_cleaned)

The clustering generated with the two-cluster solution seem adequate enough. The plots in the first cluster (red above) seem to be more spread out than the points in the second cluster. This solution deserves to be analysed further and I do not believe a three cluster solution would generate better results.


# Define the variables to compare
vars_to_compare <- setdiff(names(data_cleaned), "cluster")

# Calculate summary statistics for each cluster
summary_table <- data_cleaned %>%
  group_by(cluster) %>%
  summarize(across(all_of(vars_to_compare), list(mean = mean)))

# Combine summary statistics into a single data frame
summary_table <- bind_rows(summary_table, .id = "variable")

# Display the summary table
summary_table

A tibble: 2 × 19

variable <chr>	cluster <int>	BALANCE_mean <dbl>	BALANCE_FREQUENCY_mean <dbl>	PURCHASES_mean <dbl>	ONEOFF_PURCHASES_mean <dbl>	INSTALLMENTS_PURCHASES_mean <dbl>	CASH_ADVANCE_mean <dbl>	PURCHASES_FREQUENCY_mean <dbl>	ONEOFF_PURCHASES_FREQUENCY_mean <dbl>	PURCHASES_INSTALLMENTS_FREQUENCY_mean <dbl>	CASH_ADVANCE_FREQUENCY_mean <dbl>	CASH_ADVANCE_TRX_mean <dbl>	PURCHASES_TRX_mean <dbl>	CREDIT_LIMIT_mean <dbl>	PAYMENTS_mean <dbl>	MINIMUM_PAYMENTS_mean <dbl>	PRC_FULL_PAYMENT_mean <dbl>	TENURE_mean <dbl>
1	1	2241.7345	0.9213953	1196.8163	766.0233	430.9514	1157.526	0.5425487	0.3034786	0.3702646	0.13602377	3.054983	16.48325	7643.102	2044.9374	801.0417	0.1610406	11.86125
1	2	826.7081	0.8751705	550.8284	278.6528	272.4888	450.570	0.4598474	0.1416738	0.3516731	0.09766528	1.976890	10.04015	2338.399	956.8133	490.9992	0.1516196	11.70035

In observing the mean values of the variables per cluster above, it is not difficult to determine that there are major differences in the clusters generated. It could take some time to understand the distinctiveness of each cluster but there is certainly enough here to get started.

Generally, it would seem that credit card holders in cluster 1 tend to spend more money more frequently. This group could be considered as the Big Spenders group. Their Credit Limits are substantially higher than that of the Credit limits of the other group. The amount of PURCHASES made from the accounts in Cluster 1 are also substantially higher.

Interestingly enough, the difference in frequency of ONEOFF_PURCHASES is higher than the difference when viewing standard PURCHASES meaning the BIG SPENDER Group appropriately earn their name as they will more frequently and spontaneously make purchases. This leads one to believe that perhaps these accounts are the accounts of high earners, but that may not be explored any further as this data is non-existent in the dataset.

# Combine data for both clusters
cluster_combined <- rbind(cluster1, cluster2)

# Create histogram of PURCHASES for each cluster, colored by cluster
hist_combined <- ggplot(cluster_combined, aes(x = PURCHASES, fill = factor(cluster))) + 
  geom_histogram(position = "identity", alpha = 0.5, bins = 10) +
  labs(title = "Comparison of PURCHASES between Clusters", x = "PURCHASES", y = "Count") +
  scale_fill_discrete(name = "Cluster", labels = c("Cluster 1", "Cluster 2"))

# Display histogram
hist_combined

# Combine data for both clusters
cluster_combined <- rbind(cluster1, cluster2)

# Create histogram of PURCHASES for each cluster, colored by cluster
hist_combined <- ggplot(cluster_combined, aes(x = CREDIT_LIMIT, fill = factor(cluster))) + 
  geom_histogram(position = "identity", alpha = 0.5, bins = 10) +
  labs(title = "Comparison of CREDIT LIMIT between Clusters", x = "CREDIT_LIMIT", y = "Count") +
  scale_fill_discrete(name = "Cluster", labels = c("Cluster 1", "Cluster 2"))

# Display histogram
hist_combined