Introduction
A credit card is a piece of metal or plastic issued by a bank or other financial services company that allows cardholders to borrow funds to pay for goods and services from many types of merchants. Credit cards let customers purchase goods and services on credit without having to put up cash at the point of sale; instead, the charges accrue as a balance that must be paid off each monthly billing cycle, giving the buyer more time to get the cash together.
Nowadays, however, credit cards are becoming less a source of credit and more a transactional platform: many cardholders are switching from seeking credit to using the card in place of cash. Credit cards offer greater convenience, since users do not need to carry large bundles of cash to make purchases, and they also provide a platform for tracking spending and adjusting it accordingly.
From an industry point of view, the credit card business is anchored on continued innovation in marketing and technology, leading to increased competition among credit card companies. Credit card companies therefore invest billions of dollars in marketing activities aimed at acquiring new customers and growing their customer base. Innovative marketing activities, such as reward programs, discounts, loyalty points, and zero-interest offers, are deployed to attract more customers to these programs; such activities are also needed to keep existing customers interested so that they do not churn.
In this report, we develop a customer segmentation to inform the marketing strategy of a credit card company. To fulfill this objective, an unsupervised learning method, K-Means Clustering, will be used. The process covers Data Preparation, Exploratory Data Analysis, Data Pre-Processing, K-Means Clustering, Principal Component Analysis (PCA), and Conclusion.
Data Preparation
# Library Input
library(tidyverse)
library(DT)
library(GGally)
library(RColorBrewer)
library(ggplot2)
library(ggforce)
library(concaveman)
library(factoextra)
library(FactoMineR)

# Data Input
cc <- read.csv("data/CC GENERAL.csv")
datatable(cc, options = list(scrollX = TRUE))

The data set used in this report summarizes the usage behavior of about 8,950 active credit card holders during the last 6 months. The file is at the customer level with 18 behavioral variables.
Following is the data dictionary for the credit card dataset:
- CUST_ID: Identification of credit card holder (categorical)
- BALANCE: Balance amount left in the account to make purchases
- BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES: Amount of purchases made from the account
- ONEOFF_PURCHASES: Maximum purchase amount done in one go
- INSTALLMENTS_PURCHASES: Amount of purchases done in installments
- CASH_ADVANCE: Cash advance taken by the user
- PURCHASES_FREQUENCY: How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY: How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY: How frequently cash advances are taken
- CASH_ADVANCE_TRX: Number of transactions made with cash advance
- PURCHASES_TRX: Number of purchase transactions made
- CREDIT_LIMIT: Credit card limit of the user
- PAYMENTS: Amount of payments made by the user
- MINIMUM_PAYMENTS: Minimum amount of payments made by the user
- PRC_FULL_PAYMENT: Percent of full payment paid by the user
- TENURE: Tenure (repayment period) of the credit card service for the user
# Checking Data Types
str(cc)

## 'data.frame': 8950 obs. of 18 variables:
## $ CUST_ID : chr "C10001" "C10002" "C10003" "C10004" ...
## $ BALANCE : num 40.9 3202.5 2495.1 1666.7 817.7 ...
## $ BALANCE_FREQUENCY : num 0.818 0.909 1 0.636 1 ...
## $ PURCHASES : num 95.4 0 773.2 1499 16 ...
## $ ONEOFF_PURCHASES : num 0 0 773 1499 16 ...
## $ INSTALLMENTS_PURCHASES : num 95.4 0 0 0 0 ...
## $ CASH_ADVANCE : num 0 6443 0 206 0 ...
## $ PURCHASES_FREQUENCY : num 0.1667 0 1 0.0833 0.0833 ...
## $ ONEOFF_PURCHASES_FREQUENCY : num 0 0 1 0.0833 0.0833 ...
## $ PURCHASES_INSTALLMENTS_FREQUENCY: num 0.0833 0 0 0 0 ...
## $ CASH_ADVANCE_FREQUENCY : num 0 0.25 0 0.0833 0 ...
## $ CASH_ADVANCE_TRX : int 0 4 0 1 0 0 0 0 0 0 ...
## $ PURCHASES_TRX : int 2 0 12 1 1 8 64 12 5 3 ...
## $ CREDIT_LIMIT : num 1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
## $ PAYMENTS : num 202 4103 622 0 678 ...
## $ MINIMUM_PAYMENTS : num 140 1072 627 NA 245 ...
## $ PRC_FULL_PAYMENT : num 0 0.222 0 0 0 ...
## $ TENURE : int 12 12 12 12 12 12 12 12 12 12 ...
# Checking Missing Values
colSums(is.na(cc))

## CUST_ID BALANCE
## 0 0
## BALANCE_FREQUENCY PURCHASES
## 0 0
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 0 0
## CASH_ADVANCE PURCHASES_FREQUENCY
## 0 0
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 0 0
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## 0 0
## PURCHASES_TRX CREDIT_LIMIT
## 0 1
## PAYMENTS MINIMUM_PAYMENTS
## 0 313
## PRC_FULL_PAYMENT TENURE
## 0 0
# Calculating the percentage of missing values
colSums(is.na(cc)) / nrow(cc)

## CUST_ID BALANCE
## 0.0000000000 0.0000000000
## BALANCE_FREQUENCY PURCHASES
## 0.0000000000 0.0000000000
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 0.0000000000 0.0000000000
## CASH_ADVANCE PURCHASES_FREQUENCY
## 0.0000000000 0.0000000000
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 0.0000000000 0.0000000000
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## 0.0000000000 0.0000000000
## PURCHASES_TRX CREDIT_LIMIT
## 0.0000000000 0.0001117318
## PAYMENTS MINIMUM_PAYMENTS
## 0.0000000000 0.0349720670
## PRC_FULL_PAYMENT TENURE
## 0.0000000000 0.0000000000
Since the missing values in these columns amount to less than 5% of the observations, every row containing a missing value will be dropped. Furthermore, since CUST_ID is unique for each observation and therefore will not affect the clustering, we are going to drop it as well.
# Drop NA
cc_na <- cc %>%
drop_na(CREDIT_LIMIT, MINIMUM_PAYMENTS)
cc_clean <- cc_na %>%
  select(-CUST_ID)

# Checking the dimension of cleaned data
dim(cc_clean)

## [1] 8636 17
After data cleansing, our data contain 8636 observations and 17 variables. All columns have the desired data types, and there are no more missing values.
Exploratory Data Analysis
Exploratory data analysis is the phase where we explore the data variables and look for patterns that may indicate correlation between them.
# Data Summary
summary(cc_clean)

## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## Min. : 0.0 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 148.1 1st Qu.:0.9091 1st Qu.: 43.37 1st Qu.: 0.00
## Median : 916.9 Median :1.0000 Median : 375.40 Median : 44.99
## Mean : 1601.2 Mean :0.8950 Mean : 1025.43 Mean : 604.90
## 3rd Qu.: 2105.2 3rd Qu.:1.0000 3rd Qu.: 1145.98 3rd Qu.: 599.10
## Max. :19043.1 Max. :1.0000 Max. :49039.57 Max. :40761.25
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. : 0.00 Min. : 0.0 Min. :0.00000
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:0.08333
## Median : 94.78 Median : 0.0 Median :0.50000
## Mean : 420.84 Mean : 994.2 Mean :0.49600
## 3rd Qu.: 484.15 3rd Qu.: 1132.4 3rd Qu.:0.91667
## Max. :22500.00 Max. :47137.2 Max. :1.00000
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.08333 Median :0.1667
## Mean :0.20591 Mean :0.3688
## 3rd Qu.:0.33333 3rd Qu.:0.7500
## Max. :1.00000 Max. :1.0000
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600
## Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000
## Mean :0.1376 Mean : 3.314 Mean : 15.03 Mean : 4522
## 3rd Qu.:0.2500 3rd Qu.: 4.000 3rd Qu.: 18.00 3rd Qu.: 6500
## Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. : 0.05 Min. : 0.02 Min. :0.0000 Min. : 6.00
## 1st Qu.: 418.56 1st Qu.: 169.16 1st Qu.:0.0000 1st Qu.:12.00
## Median : 896.68 Median : 312.45 Median :0.0000 Median :12.00
## Mean : 1784.48 Mean : 864.30 Mean :0.1593 Mean :11.53
## 3rd Qu.: 1951.14 3rd Qu.: 825.50 3rd Qu.:0.1667 3rd Qu.:12.00
## Max. :50721.48 Max. :76406.21 Max. :1.0000 Max. :12.00
It is important to note that the variables are on very different scales, so scaling will be needed before clustering.
Now, we are going to check the correlation of these variables through data visualization.
# Correlation of Each Variables
ggcorr(cc_clean, hjust = 1, layout.exp = 2, label = T, label_size = 4,
       low = "#7d9029", mid = "white", high = "#3580d2")

It can be seen that some variables are strongly correlated, such as PURCHASES with ONEOFF_PURCHASES and PURCHASES_FREQUENCY with ONEOFF_PURCHASES_FREQUENCY. This indicates that the dataset exhibits multicollinearity and might not be suitable for classification algorithms that assume the absence of multicollinearity.

Principal Component Analysis will be performed on this data to produce uncorrelated components, reducing the dimensionality of the data while retaining as much information as possible. The result can then be used for classification at a lower computational cost.
Data Pre-Processing
Since the variables are not on the same scale, we will first standardize them with z-score scaling, i.e. z = (x − mean) / sd, so that each variable has mean 0 and standard deviation 1.
# Data Scaling
cc_z <- scale(cc_clean)
summary(cc_z)

## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## Min. :-0.7641 Min. :-4.30933 Min. :-0.47318 Min. :-0.359139
## 1st Qu.:-0.6934 1st Qu.: 0.06767 1st Qu.:-0.45317 1st Qu.:-0.359139
## Median :-0.3266 Median : 0.50538 Median :-0.29995 Median :-0.332425
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.2405 3rd Qu.: 0.50538 3rd Qu.: 0.05563 3rd Qu.:-0.003444
## Max. : 8.3232 Max. : 0.50538 Max. :22.15586 Max. :23.841455
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. :-0.45881 Min. :-0.46863 Min. :-1.236067
## 1st Qu.:-0.45881 1st Qu.:-0.46863 1st Qu.:-1.028396
## Median :-0.35548 Median :-0.46863 Median : 0.009968
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.06902 3rd Qu.: 0.06515 3rd Qu.: 1.048332
## Max. :24.07116 Max. :21.75062 Max. : 1.256004
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :-0.6862 Min. :-0.9265
## 1st Qu.:-0.6862 1st Qu.:-0.9265
## Median :-0.4085 Median :-0.5078
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4247 3rd Qu.: 0.9575
## Max. : 2.6465 Max. : 1.5855
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :-0.6819 Min. :-0.47941 Min. :-0.5970 Min. :-1.2221
## 1st Qu.:-0.6819 1st Qu.:-0.47941 1st Qu.:-0.5573 1st Qu.:-0.7986
## Median :-0.6819 Median :-0.47941 Median :-0.3190 Median :-0.4160
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5570 3rd Qu.: 0.09925 3rd Qu.: 0.1178 3rd Qu.: 0.5405
## Max. : 6.7515 Max. :17.31443 Max. :13.6203 Max. : 6.9626
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. :-0.61325 Min. :-0.36428 Min. :-0.53770 Min. :-4.2216
## 1st Qu.:-0.46942 1st Qu.:-0.29299 1st Qu.:-0.53770 1st Qu.: 0.3552
## Median :-0.30511 Median :-0.23260 Median :-0.53770 Median : 0.3552
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.05728 3rd Qu.:-0.01636 3rd Qu.: 0.02485 3rd Qu.: 0.3552
## Max. :16.81794 Max. :31.83974 Max. : 2.83759 Max. : 0.3552
The scaling process is done, and all variables are now on the same scale.
K-Means Clustering
The Optimal k
Before the cluster analysis begins, we must determine the optimal number of clusters. To choose the optimal value of k in K-Means Clustering, we can use several methods:
1. Elbow Method
fviz_nbclust(cc_z, FUNcluster = kmeans, method = "wss", k.max = 10) +
  labs(subtitle = "Elbow method")

## Warning: did not converge in 10 iterations
As the name suggests, with the Elbow Method we pick the "elbow" of the curve as the number of clusters to use. Based on the plot above, k = 2 or k = 4 looks reasonable.
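The convergence warning comes from the default iteration limit of kmeans. A minimal sketch of one way to address it, assuming fviz_nbclust forwards extra arguments such as iter.max and nstart to kmeans (its ... argument is documented to be passed to FUNcluster):

# Give kmeans more iterations and random restarts so it can converge
fviz_nbclust(cc_z, FUNcluster = kmeans, method = "wss", k.max = 10,
             iter.max = 30, nstart = 25) +
  labs(subtitle = "Elbow method")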
2. Silhouette Method
fviz_nbclust(cc_z, FUNcluster = kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette Method")

## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
With the Silhouette Method, the optimal number of clusters is the one with the highest average silhouette score (the peak). Based on the plot above, k = 2 or k = 4 again stand out, so we will try two clusterings, with k = 2 and with k = 4.
Clustering
K-Means with k = 2
set.seed(123)
cc_km2 <- kmeans(x = cc_z, centers = 2)

# Number of observations in each cluster
cc_km2$size

## [1] 5015 3621
# Location of the center of the cluster/centroid, commonly used for cluster profiling
cc_km2$centers

## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 0.07458767 -0.2054153 -0.3431598 -0.2380054
## 2 -0.10330217 0.2844953 0.4752683 0.3296319
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 -0.3737383 0.1889251 -0.7382587
## 2 0.5176187 -0.2616568 1.0224710
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 -0.3820750 -0.6599696
## 2 0.5291649 0.9140425
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 0.2823300 0.1982823 -0.4591673 -0.1181199 -0.1093591
## 2 -0.3910204 -0.2746164 0.6359359 0.1635933 0.1514598
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 -0.003079014 -0.2844889 -0.07908798
## 2 0.004264362 0.3940105 0.10953499
After the clustering process, we have 5015 observations in cluster 1 and 3621 observations in cluster 2. We can visualize the result with a PCA biplot as below:
# Cluster Visualization
fviz_cluster(object = cc_km2, data = cc_z, geom = "point") +
ggtitle("K-Means Clustering Plot") +
scale_color_brewer(palette= "Accent") +
theme_minimal() +
  theme(legend.position = "bottom")

K-Means with k = 4
set.seed(123)
cc_km4 <- kmeans(x = cc_z, centers = 4)

# Number of observations in each cluster
cc_km4$size

## [1] 3984 2934 1068 650
# Location of the center of the cluster/centroid, commonly used for cluster profiling
cc_km4$centers

## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 -0.2596493 -0.3121714 -0.3302001 -0.2344101
## 2 -0.1273203 0.3667280 0.5780483 0.4132288
## 3 1.5380670 0.3766507 -0.1897072 -0.1338074
## 4 -0.3610076 -0.3608489 -0.2736449 -0.2086418
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 -0.3497405 -0.21776739 -0.6458316
## 2 0.6070181 -0.35499461 1.0825484
## 3 -0.2027364 1.77189566 -0.5001745
## 4 -0.2632345 0.02577674 -0.1061808
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 -0.3748694 -0.5717539
## 2 0.6499571 0.9651283
## 3 -0.2159751 -0.4308311
## 4 -0.2812817 -0.1441407
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 -0.1636267 -0.18927960 -0.4319162 -0.3151287 -0.2591746
## 2 -0.4689014 -0.35712895 0.7546139 0.2196558 0.1692113
## 3 1.7513685 1.71413995 -0.2384098 0.9185768 0.7379135
## 4 0.2418213 -0.04430031 -0.3671711 -0.5692883 -0.3877017
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 -0.10469509 -0.2129974 0.26747034
## 2 -0.01864834 0.4173833 0.29640425
## 3 0.57660716 -0.4145205 0.08375683
## 4 -0.22153383 0.1025953 -3.11492951
After the clustering process, we have 3984 observations in cluster 1, 2934 in cluster 2, 1068 in cluster 3, and 650 in cluster 4. We can visualize the result with a PCA biplot as below:
# Cluster Visualization
fviz_cluster(object = cc_km4, data = cc_z, geom = "point") +
ggtitle("K-Means Clustering Plot") +
scale_color_brewer(palette= "Accent") +
theme_minimal() +
  theme(legend.position = "bottom")

Since the plot reduces the data to two dimensions, the clusters intersect each other, and some even clump together. This happens because two dimensions are too few to represent our 17-variable data.
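We can quantify this: the first two principal components of the scaled data capture well under half of the total variance (consistent with the PCA eigenvalue table later in this report), which is why the 2-D view flattens the clusters together. A minimal sketch, illustrative rather than part of the original analysis:

# Cumulative proportion of variance captured by the first two PCs
pca_tmp <- prcomp(cc_z)
summary(pca_tmp)$importance["Cumulative Proportion", 2]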
Goodness of Fit
The evaluation of a clustering result can be based on three quantities, all available on the fitted kmeans object (see the sketch after this list):
- Within Sum of Squares: the sum of squared distances from each observation to the centroid of its cluster.
- Between Sum of Squares: the sum of squared distances from each cluster centroid to the overall mean of the data.
- Total Sum of Squares: the sum of squared distances from each observation to the overall mean of the data.
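The Total Sum of Squares decomposes exactly into the within-cluster and between-cluster parts, which is why the betweenss/totss ratio used below can be read as the proportion of variance explained by the clustering. A quick check on the k = 2 model (illustrative, not from the original report):

# Sanity check: totss = tot.withinss + betweenss for any kmeans fit
all.equal(cc_km2$totss, cc_km2$tot.withinss + cc_km2$betweenss)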
Goodness of Fit of K-Means with k = 2
# Within Sum of Squares
cc_km2$withinss

## [1] 55676.70 67590.79
# Ratio of Between Sum of Squares to Total Sum of Squares
cc_km2$betweenss / cc_km2$totss

## [1] 0.1602746
A "good" clustering has low Within Sum of Squares values and a betweenss/totss ratio close to 1. Here, the Within Sum of Squares values are large and the ratio (about 0.16) is far from 1, which indicates that this clustering might not be good enough.
Goodness of Fit of K-Means with k = 4
# Within Sum of Squares
cc_km4$withinss

## [1] 22578.74 50375.07 23843.58 5555.14
# Ratio of Between Sum of Squares to Total Sum of Squares
cc_km4$betweenss / cc_km4$totss

## [1] 0.3027519
As in the previous section, a good clustering would have low Within Sum of Squares values and a betweenss/totss ratio close to 1. Here, the Within Sum of Squares values are still large and the ratio (about 0.30) is still far from 1, so this clustering might not be good enough either. However, since this result is better than with k = 2, we will use the k = 4 model for further analysis.
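Note that the betweenss/totss ratio necessarily increases with k, so it should be read together with the elbow and silhouette plots rather than maximized on its own. A small sketch (illustrative, not from the original report) that computes the ratio for several candidate k:

# betweenss/totss ratio for k = 2..6
set.seed(123)
sapply(2:6, function(k) {
  km <- kmeans(cc_z, centers = k, iter.max = 30)
  round(km$betweenss / km$totss, 3)
})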
Cluster Profiling
Now that we have the cluster assignment for each observation, we combine the cluster column with the data set to interpret the characteristics of each cluster. We can also add the CUST_ID column back at the end to find out which cluster each customer belongs to.
# Combining the cluster label into the data set
cc_clean$CLUSTER <- as.factor(cc_km4$cluster)
# Profiling with aggregation table
cc_clean %>%
group_by(CLUSTER) %>%
  summarise_all(mean)

## # A tibble: 4 x 18
## CLUSTER BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PUR~
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1057. 0.830 310. 210. 100.
## 2 2 1334. 0.971 2278. 1301. 978.
## 3 3 4824. 0.973 614. 380. 235.
## 4 4 845. 0.820 432. 253. 179.
## # ... with 12 more variables: CASH_ADVANCE <dbl>, PURCHASES_FREQUENCY <dbl>,
## # ONEOFF_PURCHASES_FREQUENCY <dbl>, PURCHASES_INSTALLMENTS_FREQUENCY <dbl>,
## # CASH_ADVANCE_FREQUENCY <dbl>, CASH_ADVANCE_TRX <dbl>, PURCHASES_TRX <dbl>,
## # CREDIT_LIMIT <dbl>, PAYMENTS <dbl>, MINIMUM_PAYMENTS <dbl>,
## # PRC_FULL_PAYMENT <dbl>, TENURE <dbl>
# Profiling with aggregation table
cc_clean %>%
group_by(CLUSTER) %>%
summarise_all(mean) %>%
tidyr::pivot_longer(-CLUSTER) %>%
group_by(name) %>%
summarize(cluster_min_val = which.min(value),
            cluster_max_val = which.max(value))

## # A tibble: 17 x 3
## name cluster_min_val cluster_max_val
## <chr> <int> <int>
## 1 BALANCE 4 3
## 2 BALANCE_FREQUENCY 4 3
## 3 CASH_ADVANCE 2 3
## 4 CASH_ADVANCE_FREQUENCY 2 3
## 5 CASH_ADVANCE_TRX 2 3
## 6 CREDIT_LIMIT 4 3
## 7 INSTALLMENTS_PURCHASES 1 2
## 8 MINIMUM_PAYMENTS 4 3
## 9 ONEOFF_PURCHASES 1 2
## 10 ONEOFF_PURCHASES_FREQUENCY 1 2
## 11 PAYMENTS 4 3
## 12 PRC_FULL_PAYMENT 3 2
## 13 PURCHASES 1 2
## 14 PURCHASES_FREQUENCY 1 2
## 15 PURCHASES_INSTALLMENTS_FREQUENCY 1 2
## 16 PURCHASES_TRX 1 2
## 17 TENURE 4 2
Based on both aggregation tables, it can be concluded that:
- Cluster 1: Customers with the lowest amounts across all purchase types and few cash withdrawals, indicating fewer credit card transactions than in the other clusters.
- Cluster 2: Customers with the lowest cash withdrawal amount and frequency, but the highest amounts across all purchase types. They have the longest tenure and the highest percentage of full payments paid, indicating that they are mindful of their credit.
- Cluster 3: Customers with a high balance, high cash advance, and high credit limit. Their balance is also updated frequently, indicating many credit card transactions. They also have high minimum payments but the lowest percentage of full payments paid, indicating larger loan amounts and a tendency to withdraw a lot of money on the card.
- Cluster 4: Customers with the lowest balance and the lowest credit limit. They also have the lowest minimum payments, payments, and tenure, indicating that the transactions made on these cards are small.
This customer profiling can help the business decide which programs will interest each customer the most, so that campaigns can be run efficiently and effectively. Besides aggregation tables, we can also use visualization to simplify the profiling process. For example:
- Purchases of Each Cluster
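A minimal sketch of such a plot, assuming a boxplot of PURCHASES per cluster on the profiled cc_clean data (the original report's chart type is not shown here):

# Distribution of purchase amounts in each cluster
ggplot(cc_clean, aes(x = CLUSTER, y = PURCHASES, fill = CLUSTER)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Accent") +
  labs(title = "Purchases of Each Cluster") +
  theme_minimal() +
  theme(legend.position = "none")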
As can be seen from the plot, clusters 1 and 4 have the lowest purchase amounts compared with the other clusters; hence, offers such as reward programs and credit card discounts could target them best.
- Payments of Each Cluster
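Again a minimal boxplot sketch, under the same assumptions as the purchases plot above:

# Distribution of payment amounts in each cluster
ggplot(cc_clean, aes(x = CLUSTER, y = PAYMENTS, fill = CLUSTER)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Accent") +
  labs(title = "Payments of Each Cluster") +
  theme_minimal() +
  theme(legend.position = "none")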
As can be seen from the plot, clusters 2 and 3 have the highest payment amounts compared with the other clusters, indicating how mindful they are of their credit; hence, offers such as loyalty points could target them best. From the two plots above, cluster 4 has relatively the lowest purchases and payments compared with the other clusters; they could also be offered a zero-interest program to increase their purchases and payments.
Which cluster do they belong?
As stated earlier, we can add the CUST_ID column back to find out which cluster each customer belongs to.
# Combining the `CUST_ID` column into the data set
cc_id <- cc_clean %>%
mutate(CUST_ID = cc_na$CUST_ID)
datatable(cc_id, options = list(scrollX = TRUE))

To find out which cluster a certain customer belongs to, type their ID into the Search tab of the data table above.
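The same lookup can also be done programmatically; a minimal sketch, using C10001 (the first ID in the data) as an example:

# Look up the cluster assignment of a single customer by ID
cc_id %>%
  filter(CUST_ID == "C10001") %>%
  select(CUST_ID, CLUSTER)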
Principal Component Analysis (PCA)
Dimensionality reduction
Dimensionality reduction is often performed with PCA as the algorithm; the main purpose is to reduce the number of variables (dimensions) in the data while retaining as much information as possible. Dimensionality reduction can alleviate the problems of high-dimensional data such as the data set we are using.
# PCA using FactoMineR
cc_pca <- PCA(X = cc_clean,
quali.sup = 18,
scale.unit = T,
ncp = 17,
graph = F)
cc_pca$eig

## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 4.629300e+00 2.723118e+01 27.23118
## comp 2 3.463632e+00 2.037431e+01 47.60548
## comp 3 1.516117e+00 8.918334e+00 56.52382
## comp 4 1.287513e+00 7.573609e+00 64.09743
## comp 5 1.067025e+00 6.276618e+00 70.37405
## comp 6 9.711736e-01 5.712786e+00 76.08683
## comp 7 8.357519e-01 4.916188e+00 81.00302
## comp 8 7.158248e-01 4.210734e+00 85.21375
## comp 9 6.258884e-01 3.681697e+00 88.89545
## comp 10 5.238563e-01 3.081508e+00 91.97696
## comp 11 4.018466e-01 2.363804e+00 94.34076
## comp 12 3.016717e-01 1.774540e+00 96.11530
## comp 13 2.425407e-01 1.426710e+00 97.54201
## comp 14 2.003722e-01 1.178660e+00 98.72067
## comp 15 1.713765e-01 1.008097e+00 99.72877
## comp 16 4.609765e-02 2.711627e-01 99.99993
## comp 17 1.173078e-05 6.900460e-05 100.00000
# Variance explained by each dimensions
fviz_eig(cc_pca, ncp = 17, addlabels = T, main = "Variance explained by each dimensions")

Based on the results above, retaining 80% of the information in the data set requires only 7 dimensions. This means we can reduce the number of dimensions of our data set from 17 to 7.
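To obtain the reduced data itself, the individual coordinates on the first 7 components can be pulled from the PCA object; a minimal sketch (the downstream use of cc_reduced is illustrative, not part of the original report):

# Keep the scores on the first 7 principal components (~80% of the variance)
cc_reduced <- as.data.frame(cc_pca$ind$coord[, 1:7])
dim(cc_reduced)

One drawback of this reduction is interpretability: we can no longer read the result variable by variable. We can, however, inspect the contribution of each variable to each principal component, as below.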
# Variable Contribution of PC1
fviz_contrib(X = cc_pca, choice = "var", axes = 1)

# Variable Contribution of PC2
fviz_contrib(X = cc_pca, choice = "var", axes = 2)

PCA Visualization
Individual Factor Map
The Individual Factor Map plots the distribution of observations to find out which indexes are considered outliers.
plot.PCA(x = cc_pca ,
choix = "ind",
invisible = "quali",
select = "contrib 8",
habillage = "CLUSTER") +
scale_color_brewer(palette = "Accent") +
  theme(legend.position = "bottom")

From the plot, we can identify 8 outlier indexes: 123, 247, 465, 513, 1167, 1510, 3793, and 2055. Most of the outliers are in cluster 2.
Variables Factor Map
The Variables Factor Map is used to find out the contribution of each variable to each PC, the amount of information from each variable summarized by each PC, and the correlation between the initial variables.
fviz_pca_var(cc_pca, select.var = list(contrib = 17), col.var = "contrib",
             gradient.cols = c("#7d9029", "white", "#3580d2"), repel = TRUE)

From the plot above, we can conclude that:

- The most contributing variables in PC1 are PURCHASES and PURCHASES_TRX.
- The most contributing variables in PC2 are CASH_ADVANCE and CASH_ADVANCE_TRX.
- Positively highly correlated variable pairs include PURCHASES and ONEOFF_PURCHASES, PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY, and CASH_ADVANCE and CASH_ADVANCE_TRX.
Cluster Visualization with PCA
PCA can also be combined with the result of the K-Means clustering to help visualize our data in fewer dimensions than the original features.
# PCA visualization + k-means clustering result
fviz_pca_biplot(cc_pca,
habillage = 18,
addEllipses = T,
geom.ind = "point") +
theme_minimal() +
theme(legend.position = "bottom") +
  scale_color_brewer(palette = "Accent")

However, the same problem as in the previous clustering plot appears: the clusters intersect each other, and some even clump together. Again, this happens because two dimensions are too few to represent our 17-variable data.
Conclusion
Based on the data used in this report and the K-Means Clustering process performed, we can conclude that:
- Cluster 1: Customers with the lowest amounts across all purchase types and few cash withdrawals, indicating fewer credit card transactions than in the other clusters.
- Cluster 2: Customers with the lowest cash withdrawal amount and frequency, but the highest amounts across all purchase types. They have the longest tenure and the highest percentage of full payments paid, indicating that they are mindful of their credit.
- Cluster 3: Customers with a high balance, high cash advance, and high credit limit. Their balance is also updated frequently, indicating many credit card transactions. They also have high minimum payments but the lowest percentage of full payments paid, indicating larger loan amounts and a tendency to withdraw a lot of money on the card.
- Cluster 4: Customers with the lowest balance and the lowest credit limit. They also have the lowest minimum payments, payments, and tenure, indicating that the transactions made on these cards are small.
Based on the clusters produced, a few business suggestions can be made:
- Clusters 1 and 4 have the lowest purchase amounts compared with the other clusters; hence, offers such as reward programs and credit card discounts could target them best.
- Clusters 2 and 3 have the highest payment amounts compared with the other clusters, indicating how mindful they are of their credit; hence, offers such as loyalty points could target them best.
- Cluster 4 has relatively the lowest purchases and payments compared with the other clusters; they could also be offered a zero-interest program to increase their purchases and payments.
To find out which cluster a certain customer belongs to, type their ID into the Search tab of the data table in the "Which cluster do they belong?" section.