This case requires to develop a customer segmentation to define marketing strategy. The sample Dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables.
Following is the Data Dictionary for Credit Card dataset :-
CUST_ID : Identification of Credit Card holder (Categorical) BALANCE : Balance amount left in their account to make purchases BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) PURCHASES : Amount of purchases made from account ONEOFF_PURCHASES : Maximum purchase amount done in one-go INSTALLMENTS_PURCHASES : Amount of purchase done in installment CASH_ADVANCE : Cash in advance given by the user PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased) ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased) PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done) CASHADVANCEFREQUENCY : How frequently the cash in advance being paid CASHADVANCETRX : Number of Transactions made with "Cash in Advanced" PURCHASES_TRX : Number of purchase transactions made CREDIT_LIMIT : Limit of Credit Card for user PAYMENTS : Amount of Payment done by user MINIMUM_PAYMENTS : Minimum amount of payments made by user PRCFULLPAYMENT : Percent of full payment paid by user TENURE : Tenure of credit card service for user
library(dplyr)
library(ggplot2)
library(gridExtra)
library(tidyr)
library(cluster)
library(caret)
library(reshape2)
library(corrplot)
library(randomForest)
library(lmtest)
library(zoo)
library(factoextra)
library(compareGroups)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
Loading required package: lattice
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
corrplot 0.92 loaded
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:gridExtra':
combine
The following object is masked from 'package:ggplot2':
margin
The following object is masked from 'package:dplyr':
combine
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
ccards <- read.csv("CC_General.csv")
# Checking to see if the data set loaded correctly
head(ccards, 10)
A data.frame: 10 × 18
| CUST_ID <chr> | BALANCE <dbl> | BALANCE_FREQUENCY <dbl> | PURCHASES <dbl> | ONEOFF_PURCHASES <dbl> | INSTALLMENTS_PURCHASES <dbl> | CASH_ADVANCE <dbl> | PURCHASES_FREQUENCY <dbl> | ONEOFF_PURCHASES_FREQUENCY <dbl> | PURCHASES_INSTALLMENTS_FREQUENCY <dbl> | CASH_ADVANCE_FREQUENCY <dbl> | CASH_ADVANCE_TRX <int> | PURCHASES_TRX <int> | CREDIT_LIMIT <dbl> | PAYMENTS <dbl> | MINIMUM_PAYMENTS <dbl> | PRC_FULL_PAYMENT <dbl> | TENURE <int> | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C10001 | 40.90075 | 0.818182 | 95.40 | 0.00 | 95.40 | 0.000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000 | 201.8021 | 139.5098 | 0.000000 | 12 |
| 2 | C10002 | 3202.46742 | 0.909091 | 0.00 | 0.00 | 0.00 | 6442.945 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000 | 4103.0326 | 1072.3402 | 0.222222 | 12 |
| 3 | C10003 | 2495.14886 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500 | 622.0667 | 627.2848 | 0.000000 | 12 |
| 4 | C10004 | 1666.67054 | 0.636364 | 1499.00 | 1499.00 | 0.00 | 205.788 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500 | 0.0000 | NA | 0.000000 | 12 |
| 5 | C10005 | 817.71434 | 1.000000 | 16.00 | 16.00 | 0.00 | 0.000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200 | 678.3348 | 244.7912 | 0.000000 | 12 |
| 6 | C10006 | 1809.82875 | 1.000000 | 1333.28 | 0.00 | 1333.28 | 0.000 | 0.666667 | 0.000000 | 0.583333 | 0.000000 | 0 | 8 | 1800 | 1400.0578 | 2407.2460 | 0.000000 | 12 |
| 7 | C10007 | 627.26081 | 1.000000 | 7091.01 | 6402.63 | 688.38 | 0.000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0 | 64 | 13500 | 6354.3143 | 198.0659 | 1.000000 | 12 |
| 8 | C10008 | 1823.65274 | 1.000000 | 436.20 | 0.00 | 436.20 | 0.000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0 | 12 | 2300 | 679.0651 | 532.0340 | 0.000000 | 12 |
| 9 | C10009 | 1014.92647 | 1.000000 | 861.49 | 661.49 | 200.00 | 0.000 | 0.333333 | 0.083333 | 0.250000 | 0.000000 | 0 | 5 | 7000 | 688.2786 | 311.9634 | 0.000000 | 12 |
| 10 | C10010 | 152.22598 | 0.545455 | 1281.60 | 1281.60 | 0.00 | 0.000 | 0.166667 | 0.166667 | 0.000000 | 0.000000 | 0 | 3 | 11000 | 1164.7706 | 100.3023 | 0.000000 | 12 |
str(ccards)
'data.frame': 8950 obs. of 18 variables:
$ CUST_ID : chr "C10001" "C10002" "C10003" "C10004" ...
$ BALANCE : num 40.9 3202.5 2495.1 1666.7 817.7 ...
$ BALANCE_FREQUENCY : num 0.818 0.909 1 0.636 1 ...
$ PURCHASES : num 95.4 0 773.2 1499 16 ...
$ ONEOFF_PURCHASES : num 0 0 773 1499 16 ...
$ INSTALLMENTS_PURCHASES : num 95.4 0 0 0 0 ...
$ CASH_ADVANCE : num 0 6443 0 206 0 ...
$ PURCHASES_FREQUENCY : num 0.1667 0 1 0.0833 0.0833 ...
$ ONEOFF_PURCHASES_FREQUENCY : num 0 0 1 0.0833 0.0833 ...
$ PURCHASES_INSTALLMENTS_FREQUENCY: num 0.0833 0 0 0 0 ...
$ CASH_ADVANCE_FREQUENCY : num 0 0.25 0 0.0833 0 ...
$ CASH_ADVANCE_TRX : int 0 4 0 1 0 0 0 0 0 0 ...
$ PURCHASES_TRX : int 2 0 12 1 1 8 64 12 5 3 ...
$ CREDIT_LIMIT : num 1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
$ PAYMENTS : num 202 4103 622 0 678 ...
$ MINIMUM_PAYMENTS : num 140 1072 627 NA 245 ...
$ PRC_FULL_PAYMENT : num 0 0.222 0 0 0 ...
$ TENURE : int 12 12 12 12 12 12 12 12 12 12 ...
# This will remove the customer ID column which is not necessary for clustering
ccards <- subset(ccards, select = -CUST_ID)
Error in eval(substitute(select), nl, parent.frame()): object 'CUST_ID' not found
Traceback:
1. subset(ccards, select = -CUST_ID)
2. subset.data.frame(ccards, select = -CUST_ID)
3. eval(substitute(select), nl, parent.frame())
4. eval(substitute(select), nl, parent.frame())
# Check for missing values in the dataset
sum(is.na(ccards))
314
# Filling the missing values with the mean of each column
ccards <- na.aggregate(ccards, FUN = mean)
# Check for missing values in the dataset
sum(is.na(ccards))
0
attach(ccards)
summary(ccards)
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
Min. : 0.0 Min. :0.0000 Min. : 0.00 Min. : 0.0
1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63 1st Qu.: 0.0
Median : 873.4 Median :1.0000 Median : 361.28 Median : 38.0
Mean : 1564.5 Mean :0.8773 Mean : 1003.20 Mean : 592.4
3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13 3rd Qu.: 577.4
Max. :19043.1 Max. :1.0000 Max. :49039.57 Max. :40761.2
INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
Min. : 0.0 Min. : 0.0 Min. :0.00000
1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333
Median : 89.0 Median : 0.0 Median :0.50000
Mean : 411.1 Mean : 978.9 Mean :0.49035
3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667
Max. :22500.0 Max. :47137.2 Max. :1.00000
ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
Min. :0.00000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.0000
Median :0.08333 Median :0.1667
Mean :0.20246 Mean :0.3644
3rd Qu.:0.30000 3rd Qu.:0.7500
Max. :1.00000 Max. :1.0000
CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50
1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600
Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000
Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494
3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500
Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000
PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
1st Qu.: 383.3 1st Qu.: 170.86 1st Qu.:0.0000 1st Qu.:12.00
Median : 856.9 Median : 335.63 Median :0.0000 Median :12.00
Mean : 1733.1 Mean : 864.21 Mean :0.1537 Mean :11.52
3rd Qu.: 1901.1 3rd Qu.: 864.21 3rd Qu.:0.1429 3rd Qu.:12.00
Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
# Calculate the z-score for each variable
z_score <- as.data.frame(lapply(ccards, scale))
# Identify the indices of outliers
outliers <- as.vector(which(abs(z_score) > 3, arr.ind = TRUE)[, 1])
# Remove the outliers
data_cleaned <- ccards[-outliers, ]
# Subset the explanatory variables
explanatory_vars <- ccards[, 2:ncol(ccards)]
# Calculate the correlation matrix
corr_matrix <- cor(explanatory_vars)
# Convert the correlation matrix to a data frame
corr_df <- melt(corr_matrix)
# Create the heatmap
ggplot(corr_df, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, limit = c(-1,1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap", x = "", y = "") +
geom_text(aes(label = round(value, 2)), size = 3, color = "black")

There are quite a few variables that are strongly correlated with one another such as PURCHASES and ONE_OFF_PURCHASES as well as PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY. This is not alarming for now, as K means Clustering is the algorithm which will be used for clustering. I am considering applying Principal Component Analysis because of this but it might not be necessary considering the results of the cluster analysis and if they seem adequate enough.
fviz_nbclust(data_cleaned, kmeans, method='silhouette')

fviz_nbclust(data_cleaned, kmeans, method = "wss")

The results of the silhouette analysis are very determinable whereas it is more difficult to understand the plot generated when using the elbow method. It would seem that two clusters seem to be the optimal number but it might be worth trying a 3 or 4 cluster solution.
km.final <- kmeans(data_cleaned, 2)
## Total Within cluster sum of square
km.final$tot.withinss
## Cluster sizes
km.final$size
data_cleaned$cluster <- km.final$cluster
head(data_cleaned, 6)
74629888182.833
A data.frame: 6 × 18
| BALANCE <dbl> | BALANCE_FREQUENCY <dbl> | PURCHASES <dbl> | ONEOFF_PURCHASES <dbl> | INSTALLMENTS_PURCHASES <dbl> | CASH_ADVANCE <dbl> | PURCHASES_FREQUENCY <dbl> | ONEOFF_PURCHASES_FREQUENCY <dbl> | PURCHASES_INSTALLMENTS_FREQUENCY <dbl> | CASH_ADVANCE_FREQUENCY <dbl> | CASH_ADVANCE_TRX <dbl> | PURCHASES_TRX <dbl> | CREDIT_LIMIT <dbl> | PAYMENTS <dbl> | MINIMUM_PAYMENTS <dbl> | PRC_FULL_PAYMENT <dbl> | TENURE <dbl> | cluster <int> | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 40.90075 | 0.818182 | 95.40 | 0.00 | 95.40 | 0.000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000 | 201.8021 | 139.5098 | 0.000000 | 12 | 2 |
| 2 | 3202.46742 | 0.909091 | 0.00 | 0.00 | 0.00 | 6442.945 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000 | 4103.0326 | 1072.3402 | 0.222222 | 12 | 1 |
| 3 | 2495.14886 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500 | 622.0667 | 627.2848 | 0.000000 | 12 | 1 |
| 4 | 1666.67054 | 0.636364 | 1499.00 | 1499.00 | 0.00 | 205.788 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500 | 0.0000 | 864.2065 | 0.000000 | 12 | 1 |
| 5 | 817.71434 | 1.000000 | 16.00 | 16.00 | 0.00 | 0.000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200 | 678.3348 | 244.7912 | 0.000000 | 12 | 2 |
| 6 | 1809.82875 | 1.000000 | 1333.28 | 0.00 | 1333.28 | 0.000 | 0.666667 | 0.000000 | 0.583333 | 0.000000 | 0 | 8 | 1800 | 1400.0578 | 2407.2460 | 0.000000 | 12 | 2 |
# Create subset of data for cluster 1
cluster1 <- subset(data_cleaned, cluster == 1)
# Create subset of data for cluster 2
cluster2 <- subset(data_cleaned, cluster == 2)
fviz_cluster(km.final, data = data_cleaned)

The clustering generated with the two-cluster solution seem adequate enough. The plots in the first cluster (red above) seem to be more spread out than the points in the second cluster. This solution deserves to be analysed further and I do not believe a three cluster solution would generate better results.
# Define the variables to compare
vars_to_compare <- setdiff(names(data_cleaned), "cluster")
# Calculate summary statistics for each cluster
summary_table <- data_cleaned %>%
group_by(cluster) %>%
summarize(across(all_of(vars_to_compare), list(mean = mean)))
# Combine summary statistics into a single data frame
summary_table <- bind_rows(summary_table, .id = "variable")
# Display the summary table
summary_table
A tibble: 2 × 19
| variable <chr> | cluster <int> | BALANCE_mean <dbl> | BALANCE_FREQUENCY_mean <dbl> | PURCHASES_mean <dbl> | ONEOFF_PURCHASES_mean <dbl> | INSTALLMENTS_PURCHASES_mean <dbl> | CASH_ADVANCE_mean <dbl> | PURCHASES_FREQUENCY_mean <dbl> | ONEOFF_PURCHASES_FREQUENCY_mean <dbl> | PURCHASES_INSTALLMENTS_FREQUENCY_mean <dbl> | CASH_ADVANCE_FREQUENCY_mean <dbl> | CASH_ADVANCE_TRX_mean <dbl> | PURCHASES_TRX_mean <dbl> | CREDIT_LIMIT_mean <dbl> | PAYMENTS_mean <dbl> | MINIMUM_PAYMENTS_mean <dbl> | PRC_FULL_PAYMENT_mean <dbl> | TENURE_mean <dbl> |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2241.7345 | 0.9213953 | 1196.8163 | 766.0233 | 430.9514 | 1157.526 | 0.5425487 | 0.3034786 | 0.3702646 | 0.13602377 | 3.054983 | 16.48325 | 7643.102 | 2044.9374 | 801.0417 | 0.1610406 | 11.86125 |
| 1 | 2 | 826.7081 | 0.8751705 | 550.8284 | 278.6528 | 272.4888 | 450.570 | 0.4598474 | 0.1416738 | 0.3516731 | 0.09766528 | 1.976890 | 10.04015 | 2338.399 | 956.8133 | 490.9992 | 0.1516196 | 11.70035 |
In observing the mean values of the variables per cluster above, it
is not difficult to determine that there are major differences in the
clusters generated. It could take some time to understand the
distinctiveness of each cluster but there is certainly enough here to
get started.
Generally, it would seem that credit card
holders in cluster 1 tend to spend more money more frequently. This
group could be considered as the Big Spenders group. Their Credit Limits
are substantially higher than that of the Credit limits of the other
group. The amount of PURCHASES made from the accounts in Cluster 1 are
also substantially higher.
Interestingly enough, the
difference in frequency of ONEOFF_PURCHASES is higher than the
difference when viewing standard PURCHASES meaning the BIG SPENDER Group
appropriately earn their name as they will more frequently and
spontaneously make purchases. This leads one to believe that perhaps
these accounts are the accounts of high earners, but that may not be
explored any further as this data is non-existent in the dataset.
# Combine data for both clusters
cluster_combined <- rbind(cluster1, cluster2)
# Create histogram of PURCHASES for each cluster, colored by cluster
hist_combined <- ggplot(cluster_combined, aes(x = PURCHASES, fill = factor(cluster))) +
geom_histogram(position = "identity", alpha = 0.5, bins = 10) +
labs(title = "Comparison of PURCHASES between Clusters", x = "PURCHASES", y = "Count") +
scale_fill_discrete(name = "Cluster", labels = c("Cluster 1", "Cluster 2"))
# Display histogram
hist_combined

# Combine data for both clusters
cluster_combined <- rbind(cluster1, cluster2)
# Create histogram of PURCHASES for each cluster, colored by cluster
hist_combined <- ggplot(cluster_combined, aes(x = CREDIT_LIMIT, fill = factor(cluster))) +
geom_histogram(position = "identity", alpha = 0.5, bins = 10) +
labs(title = "Comparison of CREDIT LIMIT between Clusters", x = "CREDIT_LIMIT", y = "Count") +
scale_fill_discrete(name = "Cluster", labels = c("Cluster 1", "Cluster 2"))
# Display histogram
hist_combined
