Credit Card Users Clustering
Before you start reading the analysis below, let me first thank you for taking the time to read my work; it means a lot to me.
Everything you will see in this analysis is the result of my study in the Unsupervised Learning class at Algoritma Academy. To see what I've learned in more detail, you can visit the Algoritma Academy learning syllabus.
Everything I have written is entirely my personal opinion, based on my experience and knowledge so far. If something is wrong or missing, please feel free to contact me here; I'd love to discuss it with you. Thank you.
Introduction
Background
This sample dataset summarizes the usage behavior of around 9,000 active credit card holders. Each customer is described by 18 variables, ranging from their balance to their purchase activity.
The dataset was downloaded from Kaggle, where it was uploaded by Arjun Bhasin.
Objectives
Our objectives for this analysis are to:
- Find customer clusters based on their behavior.
- See how the variables characterize each cluster.
Load Libraries
library(tidyverse)  # data wrangling and visualization
library(FactoMineR) # PCA
library(factoextra) # clustering helpers and cluster/PCA visualization
Import Dataset
# Import the dataset
cc <- read_csv("dataset/CC GENERAL.xls")
glimpse(cc)
## Rows: 8,950
## Columns: 18
## $ CUST_ID <chr> "C10001", "C10002", "C10003", "C10004~
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 166~
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 0.63636~
## $ PURCHASES <dbl> 95.40, 0.00, 773.17, 1499.00, 16.00, ~
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0~
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2~
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, ~
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333~
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333~
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000~
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.08333~
## $ CASH_ADVANCE_TRX <dbl> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ PURCHASES_TRX <dbl> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, ~
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1~
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 0.0000~
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, NA,~
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000~
## $ TENURE <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1~
Variable Description:
- CUST_ID : Identification of credit card holder (categorical)
- BALANCE : Balance amount left in the account to make purchases
- BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES : Amount of purchases made from the account
- ONEOFF_PURCHASES : Maximum purchase amount done in one go
- INSTALLMENTS_PURCHASES : Amount of purchases done in installments
- CASH_ADVANCE : Cash advance given by the user
- PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY : How frequently the cash advance is being paid
- CASH_ADVANCE_TRX : Number of transactions made with cash advance
- PURCHASES_TRX : Number of purchase transactions made
- CREDIT_LIMIT : Credit card limit of the user
- PAYMENTS : Amount of payments done by the user
- MINIMUM_PAYMENTS : Minimum amount of payments made by the user
- PRC_FULL_PAYMENT : Percent of full payment paid by the user
- TENURE : Tenure of credit card service for the user
Our dataset contains 8,950 observations and 18 columns that describe the behavior of credit card customers.
Data Cleansing
Before we analyze our data and address our business objectives, we have to make sure that the dataset is ready to use: it should have no missing values, and each column should have the correct data type.
# Check for missing values
colSums(is.na(cc))
## CUST_ID BALANCE
## 0 0
## BALANCE_FREQUENCY PURCHASES
## 0 0
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 0 0
## CASH_ADVANCE PURCHASES_FREQUENCY
## 0 0
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 0 0
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## 0 0
## PURCHASES_TRX CREDIT_LIMIT
## 0 1
## PAYMENTS MINIMUM_PAYMENTS
## 0 313
## PRC_FULL_PAYMENT TENURE
## 0 0
Unfortunately, our data has missing values: 313 rows are empty in the MINIMUM_PAYMENTS column and 1 row is empty in CREDIT_LIMIT. But since we have 8,950 observations and 313 rows is only around 3.5% of our data, we can safely remove them.
# Proportions of missing values compared to total observations
313/8950
## [1] 0.03497207
# Remove rows that contains missing values
cc_clean <- na.omit(cc)
dim(cc_clean)
## [1] 8636 18
Other than CUST_ID, every column already has the correct data type. Now we only have to remove CUST_ID: it is just a unique identifier for each customer and plays no role in our analysis.
# Remove CUST_ID and turn it into row names
rownames(cc_clean) <- cc_clean$CUST_ID
## Warning: Setting row names on a tibble is deprecated.
cc_clean <- cc_clean %>%
select(-c(CUST_ID))
head(cc_clean)
## # A tibble: 6 x 17
## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40.9 0.818 95.4 0 95.4
## 2 3202. 0.909 0 0 0
## 3 2495. 1 773. 773. 0
## 4 818. 1 16 16 0
## 5 1810. 1 1333. 0 1333.
## 6 627. 1 7091. 6403. 688.
## # ... with 12 more variables: CASH_ADVANCE <dbl>, PURCHASES_FREQUENCY <dbl>,
## # ONEOFF_PURCHASES_FREQUENCY <dbl>, PURCHASES_INSTALLMENTS_FREQUENCY <dbl>,
## # CASH_ADVANCE_FREQUENCY <dbl>, CASH_ADVANCE_TRX <dbl>, PURCHASES_TRX <dbl>,
## # CREDIT_LIMIT <dbl>, PAYMENTS <dbl>, MINIMUM_PAYMENTS <dbl>,
## # PRC_FULL_PAYMENT <dbl>, TENURE <dbl>
Exploratory Data Analysis
As we can see, every column has its own range of values. We have to scale every column so that all features are on a comparable scale and easier to work with. To do this, we can use the scale() function, which standardizes each column to a z-score (it subtracts the column mean and divides by the column standard deviation).
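Since head() only previews a handful of rows, a quick range check (a small sketch, not part of the original output) makes the differing scales explicit:
# Min/max of every column; the ranges differ by several orders of magnitude,
# which is why we standardize before clustering
sapply(cc_clean, range)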
# Scale the data
cc_scale <- scale(cc_clean)
summary(cc_scale)
## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## Min. :-0.7641 Min. :-4.30933 Min. :-0.47318 Min. :-0.359139
## 1st Qu.:-0.6934 1st Qu.: 0.06767 1st Qu.:-0.45317 1st Qu.:-0.359139
## Median :-0.3266 Median : 0.50538 Median :-0.29995 Median :-0.332425
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.2405 3rd Qu.: 0.50538 3rd Qu.: 0.05563 3rd Qu.:-0.003444
## Max. : 8.3232 Max. : 0.50538 Max. :22.15586 Max. :23.841455
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. :-0.45881 Min. :-0.46863 Min. :-1.236067
## 1st Qu.:-0.45881 1st Qu.:-0.46863 1st Qu.:-1.028396
## Median :-0.35548 Median :-0.46863 Median : 0.009968
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.06902 3rd Qu.: 0.06515 3rd Qu.: 1.048332
## Max. :24.07116 Max. :21.75062 Max. : 1.256004
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :-0.6862 Min. :-0.9265
## 1st Qu.:-0.6862 1st Qu.:-0.9265
## Median :-0.4085 Median :-0.5078
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4247 3rd Qu.: 0.9575
## Max. : 2.6465 Max. : 1.5855
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :-0.6819 Min. :-0.47941 Min. :-0.5970 Min. :-1.2221
## 1st Qu.:-0.6819 1st Qu.:-0.47941 1st Qu.:-0.5573 1st Qu.:-0.7986
## Median :-0.6819 Median :-0.47941 Median :-0.3190 Median :-0.4160
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5570 3rd Qu.: 0.09925 3rd Qu.: 0.1178 3rd Qu.: 0.5405
## Max. : 6.7515 Max. :17.31443 Max. :13.6203 Max. : 6.9626
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. :-0.61325 Min. :-0.36428 Min. :-0.53770 Min. :-4.2216
## 1st Qu.:-0.46942 1st Qu.:-0.29299 1st Qu.:-0.53770 1st Qu.: 0.3552
## Median :-0.30511 Median :-0.23260 Median :-0.53770 Median : 0.3552
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.05728 3rd Qu.:-0.01636 3rd Qu.: 0.02485 3rd Qu.: 0.3552
## Max. :16.81794 Max. :31.83974 Max. : 2.83759 Max. : 0.3552
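As a quick sanity check (again a sketch, not part of the original workflow), we can confirm on one column that scale() is exactly this z-score transformation:
# Verify that scale() equals (x - mean(x)) / sd(x) for the BALANCE column
all.equal(
  as.numeric(cc_scale[, "BALANCE"]),
  (cc_clean$BALANCE - mean(cc_clean$BALANCE)) / sd(cc_clean$BALANCE)
)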
Now that our data is scaled, we can continue to find customer segments.
Outlier Detection
One of the most popular clustering methods is K-Means, which assigns observations to clusters based on their distance to the cluster centroids. Each centroid is the mean of the observations in its cluster, and because it is a mean, K-Means is very sensitive to outliers. To solve this issue, we have to take the outliers out first.
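To build intuition, here is a tiny toy example (not taken from our dataset) of how a single extreme value drags a mean, and therefore a K-Means centroid, away from the bulk of the data:
# A single outlier pulls the mean far away from the other five values
x <- c(1, 2, 3, 4, 5)
mean(x)         # 3
mean(c(x, 100)) # ~19.2, dominated by the one outlier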
We can spot the outliers using a PCA visualization.
# Do PCA
pca_cc <- PCA(X = cc_scale, # our scaled data
scale.unit = F, # no need to do scale again
graph = F)
plot.PCA(
  x = pca_cc,            # PCA object
  choix = "ind",         # plot the individuals (observations)
  select = "contrib 10"  # highlight the 10 observations that contribute most (the outermost outliers)
)
Now we store the outlier row indices in a vector.
outliers <- c(2055, 123, 465, 513, 247, 1510, 3565, 1546, 3793, 1167)
Because we are deleting the outliers, we remove them from the original (unscaled) data and then rescale, so that the scaling is no longer influenced by the removed rows.
# Remove the outliers from the original data, then rescale
cc_no_outlier <- cc_clean[-outliers,]
cc_no_outlier_scale <- scale(cc_no_outlier)
dim(cc_no_outlier_scale)
## [1] 8626 17
K-Means Clustering
Now that our dataset is clean and free of outliers, we can proceed to clustering. First we need to choose the number of clusters; for this, we can use the elbow method, which plots the total within-cluster sum of squares (WSS) against the number of clusters and looks for the point where the curve starts to flatten.
fviz_nbclust(cc_no_outlier_scale, FUNcluster = kmeans, method = "wss")
According to the elbow plot above, a good number of clusters is 7, because after 7 the decrease in WSS starts to level off.
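If you want to inspect the numbers behind the plot, a minimal manual sketch (assuming the same scaled data) computes the total within-cluster sum of squares for several values of k:
# Total WSS for k = 1..10; the "elbow" is where the decrease levels off
set.seed(333)
wss <- sapply(1:10, function(k) {
  kmeans(cc_no_outlier_scale, centers = k, nstart = 5)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")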
# Do the K-Means Clustering
set.seed(333)
cc_kmeans <- kmeans(cc_no_outlier_scale, centers = 7)
# Apply the cluster to our data
cc_no_outlier$group <- cc_kmeans$cluster
glimpse(cc_no_outlier)
## Rows: 8,626
## Columns: 18
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 817~
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 1.00000~
## $ PURCHASES <dbl> 95.40, 0.00, 773.17, 16.00, 1333.28, ~
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 16.00, 0.00, 6402~
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 1333.28, 688~
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 0.0000, 0.~
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333~
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333~
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000~
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.00000~
## $ CASH_ADVANCE_TRX <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ PURCHASES_TRX <dbl> 2, 0, 12, 1, 8, 64, 12, 5, 3, 12, 6, ~
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 1200, 1800, 13500, ~
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 678.33~
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, 244~
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000~
## $ TENURE <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1~
## $ group <int> 6, 5, 2, 6, 1, 3, 1, 6, 7, 1, 6, 1, 1~
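Before visualizing, a rough numeric profile of each cluster (a sketch, not part of the original workflow) can already hint at what distinguishes the groups:
# Average of every behavioral variable per cluster, plus the cluster size
cc_no_outlier %>%
  group_by(group) %>%
  summarise(across(everything(), mean), n = n())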
To understand the characteristics of each cluster better, we can create a biplot visualization.
# PCA
cc_pca <- PCA(X = cc_no_outlier,
scale.unit = T, # do scaling
quali.sup = 18, # group column
graph = F)
# PCA + K-Means Visualization
fviz_pca_biplot(X = cc_pca,
habillage = 18, # group column
geom.ind = "point", # display each observations without text
addEllipses = T)Conclusion
After analyzing our credit card customer data, we can conclude that the customers can be divided into 7 clusters, each with its own characteristics in terms of credit card usage. At the same time, almost every variable contributes strong variance: in the biplot above, nearly every variable has a long arrow. According to the biplot, group 4 also has the widest spread.