Credit Card Holders Clustering
using K-Means Clustering

Atika Faradilla

3/16/2022

Introduction

A credit card is a piece of metal or plastic issued by a bank or financial services company that allows cardholders to borrow funds to pay for goods and services from different types of merchants. Credit cards help customers purchase goods and services on credit without having to put up cash at the point of sale. Instead, the charges accrue as a balance that must be paid off on a monthly billing cycle, giving the buyer more time to get the cash together.

However, nowadays, credit cards are becoming less of a source of credit and more of a transactional platform. Many credit card users are switching from being credit seekers to regular users of credit cards in place of cash, largely due to convenience: users do not need to carry large bundles of cash to make purchases, and credit cards also offer a platform where one can track spending and adjust it accordingly.

From the industrial point of view, the credit card industry is anchored on continued innovations in marketing and technology, leading to increased competition among credit card companies. Hence, credit card companies invest billions of dollars into marketing activities, with the goal of acquiring new customers and growing their customer base. Innovative marketing activities, such as reward programs, discounts, loyalty points, and zero interest rates, are being deployed in an attempt to get more customers into the programs. Such marketing activities are also needed to keep existing customers interested in the programs and to prevent churn.

In this report, we develop a customer segmentation to inform the marketing strategy of a credit card company. To fulfill this objective, an unsupervised learning method, K-Means Clustering, will be used. The process includes Data Preparation, Exploratory Data Analysis, Data Pre-Processing, K-Means Clustering, Principal Component Analysis (PCA), and Conclusion.


Data Preparation

# Library Input
library(tidyverse)
library(DT)
library(GGally)
library(RColorBrewer)
library(ggplot2)
library(ggforce)
library(concaveman)
library(factoextra)
library(FactoMineR)
# Data Input
cc <- read.csv("data/CC GENERAL.csv")

datatable(cc, options = list(scrollX = TRUE))

The data set used in this report summarizes the usage behavior of 8,950 active credit card holders over the last 6 months. The file is at the customer level, with 18 behavioral variables.

Following is the data dictionary for the credit card dataset:

  • CUST_ID : Identification of the credit card holder (categorical)
  • BALANCE : Balance amount left in the account to make purchases
  • BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
  • PURCHASES : Amount of purchases made from the account
  • ONEOFF_PURCHASES : Maximum purchase amount done in one go
  • INSTALLMENTS_PURCHASES : Amount of purchases done in installments
  • CASH_ADVANCE : Cash advance given to the user
  • PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
  • ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
  • PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
  • CASH_ADVANCE_FREQUENCY : How frequently the cash advance is being paid
  • CASH_ADVANCE_TRX : Number of transactions made with cash advance
  • PURCHASES_TRX : Number of purchase transactions made
  • CREDIT_LIMIT : Credit card limit of the user
  • PAYMENTS : Amount of payments done by the user
  • MINIMUM_PAYMENTS : Minimum amount of payments made by the user
  • PRC_FULL_PAYMENT : Percentage of full payment paid by the user
  • TENURE : Tenure (repayment period) of the credit card service for the user
# Checking Data Types
str(cc)
## 'data.frame':    8950 obs. of  18 variables:
##  $ CUST_ID                         : chr  "C10001" "C10002" "C10003" "C10004" ...
##  $ BALANCE                         : num  40.9 3202.5 2495.1 1666.7 817.7 ...
##  $ BALANCE_FREQUENCY               : num  0.818 0.909 1 0.636 1 ...
##  $ PURCHASES                       : num  95.4 0 773.2 1499 16 ...
##  $ ONEOFF_PURCHASES                : num  0 0 773 1499 16 ...
##  $ INSTALLMENTS_PURCHASES          : num  95.4 0 0 0 0 ...
##  $ CASH_ADVANCE                    : num  0 6443 0 206 0 ...
##  $ PURCHASES_FREQUENCY             : num  0.1667 0 1 0.0833 0.0833 ...
##  $ ONEOFF_PURCHASES_FREQUENCY      : num  0 0 1 0.0833 0.0833 ...
##  $ PURCHASES_INSTALLMENTS_FREQUENCY: num  0.0833 0 0 0 0 ...
##  $ CASH_ADVANCE_FREQUENCY          : num  0 0.25 0 0.0833 0 ...
##  $ CASH_ADVANCE_TRX                : int  0 4 0 1 0 0 0 0 0 0 ...
##  $ PURCHASES_TRX                   : int  2 0 12 1 1 8 64 12 5 3 ...
##  $ CREDIT_LIMIT                    : num  1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
##  $ PAYMENTS                        : num  202 4103 622 0 678 ...
##  $ MINIMUM_PAYMENTS                : num  140 1072 627 NA 245 ...
##  $ PRC_FULL_PAYMENT                : num  0 0.222 0 0 0 ...
##  $ TENURE                          : int  12 12 12 12 12 12 12 12 12 12 ...
# Checking Missing Values
colSums(is.na(cc))
##                          CUST_ID                          BALANCE 
##                                0                                0 
##                BALANCE_FREQUENCY                        PURCHASES 
##                                0                                0 
##                 ONEOFF_PURCHASES           INSTALLMENTS_PURCHASES 
##                                0                                0 
##                     CASH_ADVANCE              PURCHASES_FREQUENCY 
##                                0                                0 
##       ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY 
##                                0                                0 
##           CASH_ADVANCE_FREQUENCY                 CASH_ADVANCE_TRX 
##                                0                                0 
##                    PURCHASES_TRX                     CREDIT_LIMIT 
##                                0                                1 
##                         PAYMENTS                 MINIMUM_PAYMENTS 
##                                0                              313 
##                 PRC_FULL_PAYMENT                           TENURE 
##                                0                                0
# Calculating the percentage of missing values
colSums(is.na(cc))/nrow(cc)
##                          CUST_ID                          BALANCE 
##                     0.0000000000                     0.0000000000 
##                BALANCE_FREQUENCY                        PURCHASES 
##                     0.0000000000                     0.0000000000 
##                 ONEOFF_PURCHASES           INSTALLMENTS_PURCHASES 
##                     0.0000000000                     0.0000000000 
##                     CASH_ADVANCE              PURCHASES_FREQUENCY 
##                     0.0000000000                     0.0000000000 
##       ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY 
##                     0.0000000000                     0.0000000000 
##           CASH_ADVANCE_FREQUENCY                 CASH_ADVANCE_TRX 
##                     0.0000000000                     0.0000000000 
##                    PURCHASES_TRX                     CREDIT_LIMIT 
##                     0.0000000000                     0.0001117318 
##                         PAYMENTS                 MINIMUM_PAYMENTS 
##                     0.0000000000                     0.0349720670 
##                 PRC_FULL_PAYMENT                           TENURE 
##                     0.0000000000                     0.0000000000

Since the missing values in the affected columns amount to less than 5% of the observations, rows with any missing value will be dropped. Furthermore, since CUST_ID is unique for each observation and thus won't affect the clustering, we will drop that column as well.

# Drop NA
cc_na <- cc %>% 
  drop_na(CREDIT_LIMIT, MINIMUM_PAYMENTS)

cc_clean <- cc_na %>% 
  select(-CUST_ID)
# Checking the dimension of cleaned data
dim(cc_clean)
## [1] 8636   17

After data cleansing, our data contain 8,636 observations and 17 variables. All remaining variables are numeric, and there are no more missing values.

Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the data variables and look for patterns that may indicate correlations between them.

# Data Summary
summary(cc_clean)
##     BALANCE        BALANCE_FREQUENCY   PURCHASES        ONEOFF_PURCHASES  
##  Min.   :    0.0   Min.   :0.0000    Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:  148.1   1st Qu.:0.9091    1st Qu.:   43.37   1st Qu.:    0.00  
##  Median :  916.9   Median :1.0000    Median :  375.40   Median :   44.99  
##  Mean   : 1601.2   Mean   :0.8950    Mean   : 1025.43   Mean   :  604.90  
##  3rd Qu.: 2105.2   3rd Qu.:1.0000    3rd Qu.: 1145.98   3rd Qu.:  599.10  
##  Max.   :19043.1   Max.   :1.0000    Max.   :49039.57   Max.   :40761.25  
##  INSTALLMENTS_PURCHASES  CASH_ADVANCE     PURCHASES_FREQUENCY
##  Min.   :    0.00       Min.   :    0.0   Min.   :0.00000    
##  1st Qu.:    0.00       1st Qu.:    0.0   1st Qu.:0.08333    
##  Median :   94.78       Median :    0.0   Median :0.50000    
##  Mean   :  420.84       Mean   :  994.2   Mean   :0.49600    
##  3rd Qu.:  484.15       3rd Qu.: 1132.4   3rd Qu.:0.91667    
##  Max.   :22500.00       Max.   :47137.2   Max.   :1.00000    
##  ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
##  Min.   :0.00000            Min.   :0.0000                  
##  1st Qu.:0.00000            1st Qu.:0.0000                  
##  Median :0.08333            Median :0.1667                  
##  Mean   :0.20591            Mean   :0.3688                  
##  3rd Qu.:0.33333            3rd Qu.:0.7500                  
##  Max.   :1.00000            Max.   :1.0000                  
##  CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX  PURCHASES_TRX     CREDIT_LIMIT  
##  Min.   :0.0000         Min.   :  0.000   Min.   :  0.00   Min.   :   50  
##  1st Qu.:0.0000         1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.: 1600  
##  Median :0.0000         Median :  0.000   Median :  7.00   Median : 3000  
##  Mean   :0.1376         Mean   :  3.314   Mean   : 15.03   Mean   : 4522  
##  3rd Qu.:0.2500         3rd Qu.:  4.000   3rd Qu.: 18.00   3rd Qu.: 6500  
##  Max.   :1.5000         Max.   :123.000   Max.   :358.00   Max.   :30000  
##     PAYMENTS        MINIMUM_PAYMENTS   PRC_FULL_PAYMENT     TENURE     
##  Min.   :    0.05   Min.   :    0.02   Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:  418.56   1st Qu.:  169.16   1st Qu.:0.0000   1st Qu.:12.00  
##  Median :  896.68   Median :  312.45   Median :0.0000   Median :12.00  
##  Mean   : 1784.48   Mean   :  864.30   Mean   :0.1593   Mean   :11.53  
##  3rd Qu.: 1951.14   3rd Qu.:  825.50   3rd Qu.:0.1667   3rd Qu.:12.00  
##  Max.   :50721.48   Max.   :76406.21   Max.   :1.0000   Max.   :12.00

It is important to note that the variables are not on the same scale; hence, scaling will be needed before clustering.

Now, we are going to check the correlation of these variables through data visualization.

# Correlation of Each Variables
ggcorr(cc_clean, hjust = 1, layout.exp = 2, label = T, label_size = 4,
       low = "#7d9029", mid = "white", high = "#3580d2")

It can be seen that there are strong correlations between some variables, such as between PURCHASES and ONEOFF_PURCHASES, or between PURCHASES_FREQUENCY and ONEOFF_PURCHASES_FREQUENCY. This indicates that the dataset exhibits multicollinearity and might not be suitable for classification algorithms that assume the absence of multicollinearity.

Principal Component Analysis will be performed on this data to remove the multicollinearity, while also reducing the dimensionality of the data and retaining as much information as possible. The result of this analysis can then be used for classification purposes with lower computational cost.

Data Pre-Processing

Since the variables in the data set aren't on the same scale, we will first standardize them using z-score scaling, which transforms each variable to have mean 0 and standard deviation 1.

# Data Scaling
cc_z <- scale(cc_clean)
summary(cc_z)
##     BALANCE        BALANCE_FREQUENCY    PURCHASES        ONEOFF_PURCHASES   
##  Min.   :-0.7641   Min.   :-4.30933   Min.   :-0.47318   Min.   :-0.359139  
##  1st Qu.:-0.6934   1st Qu.: 0.06767   1st Qu.:-0.45317   1st Qu.:-0.359139  
##  Median :-0.3266   Median : 0.50538   Median :-0.29995   Median :-0.332425  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.2405   3rd Qu.: 0.50538   3rd Qu.: 0.05563   3rd Qu.:-0.003444  
##  Max.   : 8.3232   Max.   : 0.50538   Max.   :22.15586   Max.   :23.841455  
##  INSTALLMENTS_PURCHASES  CASH_ADVANCE      PURCHASES_FREQUENCY
##  Min.   :-0.45881       Min.   :-0.46863   Min.   :-1.236067  
##  1st Qu.:-0.45881       1st Qu.:-0.46863   1st Qu.:-1.028396  
##  Median :-0.35548       Median :-0.46863   Median : 0.009968  
##  Mean   : 0.00000       Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.06902       3rd Qu.: 0.06515   3rd Qu.: 1.048332  
##  Max.   :24.07116       Max.   :21.75062   Max.   : 1.256004  
##  ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
##  Min.   :-0.6862            Min.   :-0.9265                 
##  1st Qu.:-0.6862            1st Qu.:-0.9265                 
##  Median :-0.4085            Median :-0.5078                 
##  Mean   : 0.0000            Mean   : 0.0000                 
##  3rd Qu.: 0.4247            3rd Qu.: 0.9575                 
##  Max.   : 2.6465            Max.   : 1.5855                 
##  CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX   PURCHASES_TRX      CREDIT_LIMIT    
##  Min.   :-0.6819        Min.   :-0.47941   Min.   :-0.5970   Min.   :-1.2221  
##  1st Qu.:-0.6819        1st Qu.:-0.47941   1st Qu.:-0.5573   1st Qu.:-0.7986  
##  Median :-0.6819        Median :-0.47941   Median :-0.3190   Median :-0.4160  
##  Mean   : 0.0000        Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5570        3rd Qu.: 0.09925   3rd Qu.: 0.1178   3rd Qu.: 0.5405  
##  Max.   : 6.7515        Max.   :17.31443   Max.   :13.6203   Max.   : 6.9626  
##     PAYMENTS        MINIMUM_PAYMENTS   PRC_FULL_PAYMENT       TENURE       
##  Min.   :-0.61325   Min.   :-0.36428   Min.   :-0.53770   Min.   :-4.2216  
##  1st Qu.:-0.46942   1st Qu.:-0.29299   1st Qu.:-0.53770   1st Qu.: 0.3552  
##  Median :-0.30511   Median :-0.23260   Median :-0.53770   Median : 0.3552  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.05728   3rd Qu.:-0.01636   3rd Qu.: 0.02485   3rd Qu.: 0.3552  
##  Max.   :16.81794   Max.   :31.83974   Max.   : 2.83759   Max.   : 0.3552

The scaling process is done, and all of the variables are now on the same scale (mean 0, standard deviation 1).
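
As a quick sanity check, z-score scaling is simply z = (x - mean(x)) / sd(x) applied column by column. Below is a minimal sketch comparing scale() against the manual formula for one column (BALANCE is used here purely as an example):

# Sanity check: scale() is equivalent to (x - mean(x)) / sd(x) per column
manual_z <- (cc_clean$BALANCE - mean(cc_clean$BALANCE)) / sd(cc_clean$BALANCE)
all.equal(as.numeric(cc_z[, "BALANCE"]), manual_z)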

K-Means Clustering

The Optimal k

Before the cluster analysis begins, we must first determine the optimal number of clusters. To choose the optimal value of k in K-Means Clustering, we can use several methods:

1. Elbow Method

fviz_nbclust(cc_z, FUNcluster = kmeans, method = "wss", k.max = 10) +
  labs(subtitle = "Elbow method")
## Warning: did not converge in 10 iterations

As the name suggests, in the Elbow Method we pick the elbow of the curve as the number of clusters to use. Based on the plot above, k = 2 or k = 4 look like reasonable candidates.
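
For reference, the following is a minimal manual version of what fviz_nbclust computes here, assuming the scaled matrix cc_z from the previous section:

# Manual elbow curve: total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(cc_z, centers = k, iter.max = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within sum of squares")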

2. Silhouette Method

fviz_nbclust(cc_z, FUNcluster = kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette Method")
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

In the Silhouette Method, the optimal number of clusters is the one with the highest average silhouette score (the peak). Based on the plot above, k = 2 or k = 4 again stand out. Hence, we will try two clusterings, with k = 2 and k = 4.
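
Similarly, a minimal manual version of the silhouette computation, assuming the cluster package is installed (note that the full distance matrix for 8,636 observations takes roughly 300 MB of memory, so a random sample could be used instead):

# Manual average silhouette width for k = 2..10 (requires the cluster package)
library(cluster)
set.seed(123)
d <- dist(cc_z)  # pairwise Euclidean distances; large for 8,636 rows
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(cc_z, centers = k, iter.max = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:10, avg_sil, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")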

Clustering

K-Means with k = 2

set.seed(123)
cc_km2 <- kmeans(x = cc_z, centers = 2)
# Number of observations in each cluster
cc_km2$size
## [1] 5015 3621
# Location of the center of the cluster/centroid, commonly used for cluster profiling
cc_km2$centers
##       BALANCE BALANCE_FREQUENCY  PURCHASES ONEOFF_PURCHASES
## 1  0.07458767        -0.2054153 -0.3431598       -0.2380054
## 2 -0.10330217         0.2844953  0.4752683        0.3296319
##   INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1             -0.3737383    0.1889251          -0.7382587
## 2              0.5176187   -0.2616568           1.0224710
##   ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                 -0.3820750                       -0.6599696
## 2                  0.5291649                        0.9140425
##   CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT   PAYMENTS
## 1              0.2823300        0.1982823    -0.4591673   -0.1181199 -0.1093591
## 2             -0.3910204       -0.2746164     0.6359359    0.1635933  0.1514598
##   MINIMUM_PAYMENTS PRC_FULL_PAYMENT      TENURE
## 1     -0.003079014       -0.2844889 -0.07908798
## 2      0.004264362        0.3940105  0.10953499

After the clustering process, it can be seen above that we have 5015 observations in cluster 1 and 3621 observations in cluster 2. We can visualize the result using a PCA biplot, as below:

# Cluster Visualization
fviz_cluster(object = cc_km2, data = cc_z, geom = "point") + 
  ggtitle("K-Means Clustering Plot") +
  scale_color_brewer(palette= "Accent") +
  theme_minimal() + 
  theme(legend.position = "bottom")

K-Means with k = 4

set.seed(123)
cc_km4 <- kmeans(x = cc_z, centers = 4)
# Number of observations in each cluster
cc_km4$size
## [1] 3984 2934 1068  650
# Location of the center of the cluster/centroid, commonly used for cluster profiling
cc_km4$centers
##      BALANCE BALANCE_FREQUENCY  PURCHASES ONEOFF_PURCHASES
## 1 -0.2596493        -0.3121714 -0.3302001       -0.2344101
## 2 -0.1273203         0.3667280  0.5780483        0.4132288
## 3  1.5380670         0.3766507 -0.1897072       -0.1338074
## 4 -0.3610076        -0.3608489 -0.2736449       -0.2086418
##   INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1             -0.3497405  -0.21776739          -0.6458316
## 2              0.6070181  -0.35499461           1.0825484
## 3             -0.2027364   1.77189566          -0.5001745
## 4             -0.2632345   0.02577674          -0.1061808
##   ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                 -0.3748694                       -0.5717539
## 2                  0.6499571                        0.9651283
## 3                 -0.2159751                       -0.4308311
## 4                 -0.2812817                       -0.1441407
##   CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT   PAYMENTS
## 1             -0.1636267      -0.18927960    -0.4319162   -0.3151287 -0.2591746
## 2             -0.4689014      -0.35712895     0.7546139    0.2196558  0.1692113
## 3              1.7513685       1.71413995    -0.2384098    0.9185768  0.7379135
## 4              0.2418213      -0.04430031    -0.3671711   -0.5692883 -0.3877017
##   MINIMUM_PAYMENTS PRC_FULL_PAYMENT      TENURE
## 1      -0.10469509       -0.2129974  0.26747034
## 2      -0.01864834        0.4173833  0.29640425
## 3       0.57660716       -0.4145205  0.08375683
## 4      -0.22153383        0.1025953 -3.11492951

After the clustering process, it can be seen above that we have 3984 observations in cluster 1, 2934 in cluster 2, 1068 in cluster 3, and 650 in cluster 4. We can visualize the result using a PCA biplot, as below:

# Cluster Visualization
fviz_cluster(object = cc_km4, data = cc_z, geom = "point") + 
  ggtitle("K-Means Clustering Plot") +
  scale_color_brewer(palette= "Accent") +
  theme_minimal() + 
  theme(legend.position = "bottom")

Since the plot reduces the data to two dimensions, the clusters intersect one another, and some even clump together. This happens because two dimensions are too few to faithfully represent our 17-variable data.

Goodness of Fit

The quality of a clustering can be evaluated from three quantities:

  • Within Sum of Squares (WSS): the sum of squared distances from each observation to the centroid of its cluster.
  • Between Sum of Squares (BSS): the sum of squared distances from each cluster centroid to the overall data mean, weighted by cluster size.
  • Total Sum of Squares (TSS): the sum of squared distances from each observation to the overall data mean.
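
These quantities satisfy TSS = BSS + total WSS, so the ratio BSS/TSS measures the proportion of total variance explained by the clustering. A quick sanity check of this identity on a fitted kmeans object:

# Sanity check: totss = betweenss + tot.withinss (up to floating-point error)
all.equal(cc_km2$totss, cc_km2$betweenss + cc_km2$tot.withinss)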

Goodness of Fit of K-Means with k = 2

# Within Sum of Squares
cc_km2$withinss
## [1] 55676.70 67590.79
# Ratio of Between Sum of Squares to Total Sum of Squares (BSS/TSS)
cc_km2$betweenss/cc_km2$totss
## [1] 0.1602746

A "good" clustering has low Within Sum of Squares values and a BSS/TSS ratio close to 1. However, as can be seen from the result, the Within Sum of Squares values are large and the BSS/TSS ratio (about 0.16) is still far from 1, which indicates that this clustering might not be good enough.

Goodness of Fit of K-Means with k = 4

# Within Sum of Squares
cc_km4$withinss
## [1] 22578.74 50375.07 23843.58  5555.14
# Ratio of Between Sum of Squares to Total Sum of Squares (BSS/TSS)
cc_km4$betweenss/cc_km4$totss
## [1] 0.3027519

As in the previous section, a "good" clustering has low Within Sum of Squares values and a BSS/TSS ratio close to 1. Unfortunately, the Within Sum of Squares values are still large here and the BSS/TSS ratio (about 0.30) remains far from 1, which indicates that this clustering might not be good enough either. However, since this result is better than k = 2, we will use the k = 4 model for further analysis.

Cluster Profiling

Now that we have the cluster assignment of each observation, we will append the cluster column to the data set to interpret the characteristics of each cluster. We can also add the CUST_ID column back at the end to find out which cluster each customer belongs to.

# Combining the cluster label into the data set
cc_clean$CLUSTER <-  as.factor(cc_km4$cluster)

# Profiling with aggregation table
cc_clean  %>% 
  group_by(CLUSTER) %>% 
  summarise_all(mean)
## # A tibble: 4 x 18
##   CLUSTER BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PUR~
##   <fct>     <dbl>             <dbl>     <dbl>            <dbl>             <dbl>
## 1 1         1057.             0.830      310.             210.              100.
## 2 2         1334.             0.971     2278.            1301.              978.
## 3 3         4824.             0.973      614.             380.              235.
## 4 4          845.             0.820      432.             253.              179.
## # ... with 12 more variables: CASH_ADVANCE <dbl>, PURCHASES_FREQUENCY <dbl>,
## #   ONEOFF_PURCHASES_FREQUENCY <dbl>, PURCHASES_INSTALLMENTS_FREQUENCY <dbl>,
## #   CASH_ADVANCE_FREQUENCY <dbl>, CASH_ADVANCE_TRX <dbl>, PURCHASES_TRX <dbl>,
## #   CREDIT_LIMIT <dbl>, PAYMENTS <dbl>, MINIMUM_PAYMENTS <dbl>,
## #   PRC_FULL_PAYMENT <dbl>, TENURE <dbl>
# Profiling: which cluster has the min/max mean of each variable
cc_clean  %>% 
  group_by(CLUSTER) %>% 
  summarise_all(mean) %>% 
  tidyr::pivot_longer(-CLUSTER) %>% 
  group_by(name) %>% 
  summarize(cluster_min_val = which.min(value),
            cluster_max_val = which.max(value))
## # A tibble: 17 x 3
##    name                             cluster_min_val cluster_max_val
##    <chr>                                      <int>           <int>
##  1 BALANCE                                        4               3
##  2 BALANCE_FREQUENCY                              4               3
##  3 CASH_ADVANCE                                   2               3
##  4 CASH_ADVANCE_FREQUENCY                         2               3
##  5 CASH_ADVANCE_TRX                               2               3
##  6 CREDIT_LIMIT                                   4               3
##  7 INSTALLMENTS_PURCHASES                         1               2
##  8 MINIMUM_PAYMENTS                               4               3
##  9 ONEOFF_PURCHASES                               1               2
## 10 ONEOFF_PURCHASES_FREQUENCY                     1               2
## 11 PAYMENTS                                       4               3
## 12 PRC_FULL_PAYMENT                               3               2
## 13 PURCHASES                                      1               2
## 14 PURCHASES_FREQUENCY                            1               2
## 15 PURCHASES_INSTALLMENTS_FREQUENCY               1               2
## 16 PURCHASES_TRX                                  1               2
## 17 TENURE                                         4               2

Based on both aggregation tables, it can be concluded that:

  • Cluster 1: Customers with the lowest amounts across all purchase variables and few cash withdrawals, indicating fewer credit card transactions than the other clusters.

  • Cluster 2: Customers with the lowest cash advance amount and frequency, but the highest amounts across all purchase variables. They have the longest tenure and the highest percentage of full payments, indicating that they are mindful of their credit.

  • Cluster 3: Customers with a high balance, high cash advance, and high credit limit. Their balance is also updated frequently, indicating many credit card transactions. These customers also have high minimum payments but the lowest percentage of full payments, indicating larger loans and a tendency to withdraw a lot of money from the credit card.

  • Cluster 4: Customers with the lowest balance and the lowest credit limit. These customers also have the lowest minimum payments, payments, and tenure, indicating that the transactions made on these credit cards are small ones.

This customer profiling can help the business decide which programs will interest each customer segment the most, so that campaigns can be run efficiently and effectively. Besides aggregation tables, we can also use visualizations to simplify the profiling process. For example:

  1. Purchases of Each Cluster

As can be seen from the plot, clusters 1 & 4 have the lowest amount of purchases compared with the other clusters; hence, they could be the best targets for offers such as reward programs or discounts on credit card purchases.
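
The plotting code is not shown in the rendered report; the following is a minimal sketch of a comparable chart built from the profiled data (the bar chart of cluster means and the palette are our assumptions):

# Sketch: mean purchases per cluster (original plot code not shown)
cc_clean %>%
  group_by(CLUSTER) %>%
  summarise(MEAN_PURCHASES = mean(PURCHASES)) %>%
  ggplot(aes(x = CLUSTER, y = MEAN_PURCHASES, fill = CLUSTER)) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Accent") +
  labs(title = "Purchases of Each Cluster", x = "Cluster", y = "Mean Purchases") +
  theme_minimal()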

  2. Payments of Each Cluster

As can be seen from the plot, clusters 2 & 3 have the highest amount of payments compared with the other clusters, indicating how mindful they are of their credit. Hence, they could be the best targets for offers such as loyalty points. From both plots, cluster 4 has relatively the lowest amounts of purchases and payments compared with the other clusters; these customers could be offered a zero-interest program to increase their purchases and payments.
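
The same sketch reproduces the second chart by swapping PAYMENTS in for PURCHASES:

# Sketch: mean payments per cluster (original plot code not shown)
cc_clean %>%
  group_by(CLUSTER) %>%
  summarise(MEAN_PAYMENTS = mean(PAYMENTS)) %>%
  ggplot(aes(x = CLUSTER, y = MEAN_PAYMENTS, fill = CLUSTER)) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Accent") +
  labs(title = "Payments of Each Cluster", x = "Cluster", y = "Mean Payments") +
  theme_minimal()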

Which cluster do they belong to?

As stated earlier, we can add the CUST_ID column back to find out which cluster each customer belongs to.

# Combining the `CUST_ID` column into the data set
cc_id <-  cc_clean %>% 
  mutate(CUST_ID = cc_na$CUST_ID)

datatable(cc_id, options = list(scrollX = TRUE))

To find out which cluster a certain customer belongs to, type their ID into the Search box of the data table above.
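
The lookup can also be done programmatically; a short sketch (C10001 is simply the first ID in the data, used here as an example):

# Programmatic alternative to the Search box: look up one customer's cluster
cc_id %>%
  filter(CUST_ID == "C10001") %>%
  select(CUST_ID, CLUSTER)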

Principal Component Analysis (PCA)

Dimensionality reduction

Dimensionality reduction is often performed using PCA. Its main purpose is to reduce the number of variables (dimensions) in the data while retaining as much information as possible, which mitigates the problems of high-dimensional data such as the data set we are using.

# PCA using FactoMineR
cc_pca <- PCA(X = cc_clean,
              quali.sup = 18,  # column 18 is the CLUSTER factor, kept as supplementary
              scale.unit = T,
              ncp = 17,
              graph = F)
cc_pca$eig
##           eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.629300e+00           2.723118e+01                          27.23118
## comp 2  3.463632e+00           2.037431e+01                          47.60548
## comp 3  1.516117e+00           8.918334e+00                          56.52382
## comp 4  1.287513e+00           7.573609e+00                          64.09743
## comp 5  1.067025e+00           6.276618e+00                          70.37405
## comp 6  9.711736e-01           5.712786e+00                          76.08683
## comp 7  8.357519e-01           4.916188e+00                          81.00302
## comp 8  7.158248e-01           4.210734e+00                          85.21375
## comp 9  6.258884e-01           3.681697e+00                          88.89545
## comp 10 5.238563e-01           3.081508e+00                          91.97696
## comp 11 4.018466e-01           2.363804e+00                          94.34076
## comp 12 3.016717e-01           1.774540e+00                          96.11530
## comp 13 2.425407e-01           1.426710e+00                          97.54201
## comp 14 2.003722e-01           1.178660e+00                          98.72067
## comp 15 1.713765e-01           1.008097e+00                          99.72877
## comp 16 4.609765e-02           2.711627e-01                          99.99993
## comp 17 1.173078e-05           6.900460e-05                         100.00000
# Variance explained by each dimensions
fviz_eig(cc_pca, ncp = 17, addlabels = T, main = "Variance explained by each dimensions")

Based on the above results, if we want to retain 80% of the information in the data set, 7 dimensions are enough (about 81% cumulative variance). This means we can reduce the number of dimensions in our data set from 17 to 7. However, by doing so, the classification result becomes harder to interpret, since we can no longer interpret it by each original variable. We can, however, inspect the variable contribution to each principal component, as below.

# Variable Contribution of PC1
fviz_contrib(X = cc_pca, choice = "var", axes = 1)

# Variable Contribution of PC2
fviz_contrib(X = cc_pca, choice = "var", axes = 2)
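
Before moving on to visualization, here is a short sketch of how the reduced 7-dimensional data could be extracted from the FactoMineR object for downstream use (the object name cc_pca7 is ours, not from the original analysis):

# Keep the first 7 principal components (about 81% of cumulative variance)
cc_pca7 <- as.data.frame(cc_pca$ind$coord[, 1:7])
dim(cc_pca7)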

PCA Visualization

Individual Factor Map

The Individual Factor Map plots the distribution of observations to find out which indexes can be considered outliers.

plot.PCA(x = cc_pca , 
         choix = "ind",
         invisible = "quali",
         select = "contrib 8", 
         habillage = "CLUSTER") +
  scale_color_brewer(palette = "Accent") +
  theme(legend.position = "bottom")

From the plot, we can see 8 outlier indexes: 123, 247, 465, 513, 1167, 1510, 2055, and 3793. Most of the outliers are in cluster 2.

Variables Factor Map

The Variables Factor Map is used to find out each variable's contribution to each PC and the amount of information from each variable summarized in each PC, as well as the correlations between the initial variables.

fviz_pca_var(cc_pca, select.var = list(contrib = 17), col.var = "contrib", 
    gradient.cols = c("#7d9029", "white", "#3580d2"), repel = TRUE)

From the plot above, we can conclude that:

  • The most contributing variables in PC1 are PURCHASES and PURCHASES_TRX.

  • The most contributing variables in PC2 are CASH_ADVANCE and CASH_ADVANCE_TRX.

  • Highly positively correlated variable pairs are:

    • PURCHASES and ONEOFF_PURCHASES
    • PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY
    • CASH_ADVANCE and CASH_ADVANCE_TRX

Cluster Visualization with PCA

PCA can also be combined with the result of the K-Means Clustering to help visualize our data in fewer dimensions than the original features.

# PCA visualization + k-means clustering result
fviz_pca_biplot(cc_pca, 
                habillage = 18, 
                addEllipses = T, 
                geom.ind = "point") +
  theme_minimal() + 
  theme(legend.position = "bottom") +
  scale_color_brewer(palette= "Accent") 

However, the same problem as in the earlier clustering plot appears: the clusters intersect one another, and some even clump together. Again, this happens because two dimensions are too few to faithfully represent our 17-variable data.

Conclusion

Based on the data used in this report and the K-Means Clustering process that has been carried out, we can conclude that:

  • Cluster 1: Customers with the lowest amounts across all purchase variables and few cash withdrawals, indicating fewer credit card transactions than the other clusters.

  • Cluster 2: Customers with the lowest cash advance amount and frequency, but the highest amounts across all purchase variables. They have the longest tenure and the highest percentage of full payments, indicating that they are mindful of their credit.

  • Cluster 3: Customers with a high balance, high cash advance, and high credit limit. Their balance is also updated frequently, indicating many credit card transactions. These customers also have high minimum payments but the lowest percentage of full payments, indicating larger loans and a tendency to withdraw a lot of money from the credit card.

  • Cluster 4: Customers with the lowest balance and the lowest credit limit. These customers also have the lowest minimum payments, payments, and tenure, indicating that the transactions made on these credit cards are small ones.

Based on the clusters that have been produced, a few business suggestions can be made to benefit the company, such as:

  • Clusters 1 & 4 have the lowest amount of purchases compared with the other clusters; hence, they could be the best targets for offers such as reward programs or discounts on credit card purchases.

  • Clusters 2 & 3 have the highest amount of payments compared with the other clusters, indicating how mindful they are of their credit. Hence, they could be the best targets for offers such as loyalty points.

  • Cluster 4 has relatively the lowest amounts of purchases and payments compared with the other clusters; these customers could be offered a zero-interest program to increase their purchases and payments.

To find out which cluster a certain customer belongs to, type their ID into the Search box of the data table in the cluster profiling section above.