Credit Card Users Clustering
Before you start reading the analysis below, let me first thank you for taking the time to read my work; it means a lot to me.
Everything you will see in this analysis is the result of my study in the Unsupervised Learning class at Algoritma Academy. To see what I've learned in more detail, you can visit the Algoritma Academy learning syllabus.
Everything I have written is entirely my personal opinion, based on my experience and knowledge so far. If something is wrong or missing, please feel free to contact me here; I'd love to discuss it with you. Thank you.
Introduction
Background
This sample dataset summarizes the usage behavior of around 9,000 active credit card holders. Each customer is described by 18 variables, ranging from their balance to their purchase activity.
The dataset was downloaded from Kaggle, where it was uploaded by Arjun Bhasin.
Objectives
Our objectives for this analysis are to:
- Find customer clusters based on their behavior.
- See how the variables characterize each cluster.
Load Libraries
library(tidyverse)  # data wrangling and visualization
library(FactoMineR) # PCA
library(factoextra) # clustering helpers and cluster/PCA visualization
Import Dataset
# Import the dataset
cc <- read_csv("dataset/CC GENERAL.xls")
glimpse(cc)
## Rows: 8,950
## Columns: 18
## $ CUST_ID <chr> "C10001", "C10002", "C10003", "C10004~
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 166~
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 0.63636~
## $ PURCHASES <dbl> 95.40, 0.00, 773.17, 1499.00, 16.00, ~
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0~
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2~
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, ~
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333~
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333~
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000~
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.08333~
## $ CASH_ADVANCE_TRX <dbl> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ PURCHASES_TRX <dbl> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, ~
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1~
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 0.0000~
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, NA,~
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000~
## $ TENURE <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1~
Variable Description:
- CUST_ID : Identification of credit card holder (categorical)
- BALANCE : Balance amount left in the account to make purchases
- BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES : Amount of purchases made from the account
- ONEOFF_PURCHASES : Maximum purchase amount done in one go
- INSTALLMENTS_PURCHASES : Amount of purchases done in installments
- CASH_ADVANCE : Cash advance given by the user
- PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY : How frequently the cash advance is being paid
- CASH_ADVANCE_TRX : Number of transactions made with cash advance
- PURCHASES_TRX : Number of purchase transactions made
- CREDIT_LIMIT : Credit card limit of the user
- PAYMENTS : Amount of payments done by the user
- MINIMUM_PAYMENTS : Minimum amount of payments made by the user
- PRC_FULL_PAYMENT : Percent of full payment paid by the user
- TENURE : Tenure of credit card service for the user
Our dataset contains 8,950 observations and 18 columns that describe the behavior of credit card customers.
Data Cleansing
Before we analyze our data and address our business objectives, we have to make sure that the dataset is ready to use: it should have no missing values, and each column should have the correct data type.
# Check for missing values
colSums(is.na(cc))
## CUST_ID BALANCE
## 0 0
## BALANCE_FREQUENCY PURCHASES
## 0 0
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 0 0
## CASH_ADVANCE PURCHASES_FREQUENCY
## 0 0
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 0 0
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## 0 0
## PURCHASES_TRX CREDIT_LIMIT
## 0 1
## PAYMENTS MINIMUM_PAYMENTS
## 0 313
## PRC_FULL_PAYMENT TENURE
## 0 0
Unfortunately, our data has missing values: 313 rows are empty in the MINIMUM_PAYMENTS column and 1 row is empty in CREDIT_LIMIT. But since we have 8,950 observations and 313 rows is only around 3.5% of our data, we can safely remove them.
# Proportions of missing values compared to total observations
313/8950
## [1] 0.03497207
# Remove rows that contains missing values
cc_clean <- na.omit(cc)
dim(cc_clean)
## [1] 8636 18
Other than CUST_ID, every column already has the correct data type. Now we only have to remove CUST_ID: it is just a unique identifier for each customer and plays no role in our analysis.
# Remove CUST_ID and turn it into row names
rownames(cc_clean) <- cc_clean$CUST_ID
## Warning: Setting row names on a tibble is deprecated.
cc_clean <- cc_clean %>%
select(-c(CUST_ID))
head(cc_clean)
## # A tibble: 6 x 17
## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40.9 0.818 95.4 0 95.4
## 2 3202. 0.909 0 0 0
## 3 2495. 1 773. 773. 0
## 4 818. 1 16 16 0
## 5 1810. 1 1333. 0 1333.
## 6 627. 1 7091. 6403. 688.
## # ... with 12 more variables: CASH_ADVANCE <dbl>, PURCHASES_FREQUENCY <dbl>,
## # ONEOFF_PURCHASES_FREQUENCY <dbl>, PURCHASES_INSTALLMENTS_FREQUENCY <dbl>,
## # CASH_ADVANCE_FREQUENCY <dbl>, CASH_ADVANCE_TRX <dbl>, PURCHASES_TRX <dbl>,
## # CREDIT_LIMIT <dbl>, PAYMENTS <dbl>, MINIMUM_PAYMENTS <dbl>,
## # PRC_FULL_PAYMENT <dbl>, TENURE <dbl>
Exploratory Data Analysis
As we can see, every column has its own range of values. We have to scale every column so that all features are on a comparable scale and easier to work with. To do this, we can use the scale() function, which standardizes each column to a z-score (it subtracts the column mean and divides by the column standard deviation).
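Since head() only previews a handful of rows, a quick range check (a small sketch, not part of the original output) makes the differing scales explicit:
# Min/max of every column; the ranges differ by several orders of magnitude,
# which is why we standardize before clustering
sapply(cc_clean, range)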
# Scale the data
cc_scale <- scale(cc_clean)
summary(cc_scale)
## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## Min. :-0.7641 Min. :-4.30933 Min. :-0.47318 Min. :-0.359139
## 1st Qu.:-0.6934 1st Qu.: 0.06767 1st Qu.:-0.45317 1st Qu.:-0.359139
## Median :-0.3266 Median : 0.50538 Median :-0.29995 Median :-0.332425
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.2405 3rd Qu.: 0.50538 3rd Qu.: 0.05563 3rd Qu.:-0.003444
## Max. : 8.3232 Max. : 0.50538 Max. :22.15586 Max. :23.841455
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## Min. :-0.45881 Min. :-0.46863 Min. :-1.236067
## 1st Qu.:-0.45881 1st Qu.:-0.46863 1st Qu.:-1.028396
## Median :-0.35548 Median :-0.46863 Median : 0.009968
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.06902 3rd Qu.: 0.06515 3rd Qu.: 1.048332
## Max. :24.07116 Max. :21.75062 Max. : 1.256004
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## Min. :-0.6862 Min. :-0.9265
## 1st Qu.:-0.6862 1st Qu.:-0.9265
## Median :-0.4085 Median :-0.5078
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4247 3rd Qu.: 0.9575
## Max. : 2.6465 Max. : 1.5855
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## Min. :-0.6819 Min. :-0.47941 Min. :-0.5970 Min. :-1.2221
## 1st Qu.:-0.6819 1st Qu.:-0.47941 1st Qu.:-0.5573 1st Qu.:-0.7986
## Median :-0.6819 Median :-0.47941 Median :-0.3190 Median :-0.4160
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5570 3rd Qu.: 0.09925 3rd Qu.: 0.1178 3rd Qu.: 0.5405
## Max. : 6.7515 Max. :17.31443 Max. :13.6203 Max. : 6.9626
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## Min. :-0.61325 Min. :-0.36428 Min. :-0.53770 Min. :-4.2216
## 1st Qu.:-0.46942 1st Qu.:-0.29299 1st Qu.:-0.53770 1st Qu.: 0.3552
## Median :-0.30511 Median :-0.23260 Median :-0.53770 Median : 0.3552
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.05728 3rd Qu.:-0.01636 3rd Qu.: 0.02485 3rd Qu.: 0.3552
## Max. :16.81794 Max. :31.83974 Max. : 2.83759 Max. : 0.3552
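As a quick sanity check (again a sketch, not part of the original workflow), we can confirm on one column that scale() is exactly this z-score transformation:
# Verify that scale() equals (x - mean(x)) / sd(x) for the BALANCE column
all.equal(
  as.numeric(cc_scale[, "BALANCE"]),
  (cc_clean$BALANCE - mean(cc_clean$BALANCE)) / sd(cc_clean$BALANCE)
)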
Now that our data is scaled, we can continue to find customer segments.
Outlier Detection
One of the most popular clustering methods is K-Means, which assigns observations to clusters based on their distance to the cluster centroids. Each centroid is the mean of the observations in its cluster, and because it is a mean, K-Means is very sensitive to outliers. To solve this issue, we have to take the outliers out first.
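To build intuition, here is a tiny toy example (not taken from our dataset) of how a single extreme value drags a mean, and therefore a K-Means centroid, away from the bulk of the data:
# A single outlier pulls the mean far away from the other five values
x <- c(1, 2, 3, 4, 5)
mean(x)         # 3
mean(c(x, 100)) # ~19.2, dominated by the one outlier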
We can spot the outliers using a PCA visualization.
# Do PCA
pca_cc <- PCA(X = cc_scale, # our scaled data
scale.unit = F, # no need to do scale again
graph = F)
plot.PCA(
  x = pca_cc,            # PCA object
  choix = "ind",         # plot the individuals (observations)
  select = "contrib 10"  # highlight the 10 observations that contribute most (the outermost outliers)
)
Now we store the outlier row indices in a vector.
outliers <- c(2055, 123, 465, 513, 247, 1510, 3565, 1546, 3793, 1167)
Because we are deleting the outliers, we remove them from the original (unscaled) data and then rescale, so that the scaling is no longer influenced by the removed rows.
# Remove the outliers from the original data, then rescale
cc_no_outlier <- cc_clean[-outliers,]
cc_no_outlier_scale <- scale(cc_no_outlier)
dim(cc_no_outlier_scale)
## [1] 8626 17
K-Means Clustering
Now that our dataset is clean and free of outliers, we can proceed to clustering. First we need to choose the number of clusters; for this, we can use the elbow method, which plots the total within-cluster sum of squares (WSS) against the number of clusters and looks for the point where the curve starts to flatten.
fviz_nbclust(cc_no_outlier_scale, FUNcluster = kmeans, method = "wss")
According to the elbow plot above, a good number of clusters is 7, because after 7 the decrease in WSS starts to level off.
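If you want to inspect the numbers behind the plot, a minimal manual sketch (assuming the same scaled data) computes the total within-cluster sum of squares for several values of k:
# Total WSS for k = 1..10; the "elbow" is where the decrease levels off
set.seed(333)
wss <- sapply(1:10, function(k) {
  kmeans(cc_no_outlier_scale, centers = k, nstart = 5)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")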
# Do the K-Means Clustering
set.seed(333)
cc_kmeans <- kmeans(cc_no_outlier_scale, centers = 7)
# Apply the cluster to our data
cc_no_outlier$group <- cc_kmeans$cluster
glimpse(cc_no_outlier)
## Rows: 8,626
## Columns: 18
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 817~
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 1.00000~
## $ PURCHASES <dbl> 95.40, 0.00, 773.17, 16.00, 1333.28, ~
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 16.00, 0.00, 6402~
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 1333.28, 688~
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 0.0000, 0.~
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333~
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333~
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000~
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.00000~
## $ CASH_ADVANCE_TRX <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ PURCHASES_TRX <dbl> 2, 0, 12, 1, 8, 64, 12, 5, 3, 12, 6, ~
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 1200, 1800, 13500, ~
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 678.33~
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, 244~
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000~
## $ TENURE <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1~
## $ group <int> 6, 5, 2, 6, 1, 3, 1, 6, 7, 1, 6, 1, 1~
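Before visualizing, a rough numeric profile of each cluster (a sketch, not part of the original workflow) can already hint at what distinguishes the groups:
# Average of every behavioral variable per cluster, plus the cluster size
cc_no_outlier %>%
  group_by(group) %>%
  summarise(across(everything(), mean), n = n())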
To understand the characteristics of each cluster better, we can create a biplot visualization.
# PCA
cc_pca <- PCA(X = cc_no_outlier,
scale.unit = T, # do scaling
quali.sup = 18, # group column
graph = F)
# PCA + K-Means Visualization
fviz_pca_biplot(X = cc_pca,
habillage = 18, # group column
geom.ind = "point", # display each observations without text
addEllipses = T)Conclusion
After analyzing our credit card customer data, we can conclude that the customers can be divided into 7 clusters, each with its own characteristics in terms of credit card usage. At the same time, almost every variable contributes strong variance: in the biplot above, nearly every variable has a long arrow. According to the biplot, group 4 also has the widest spread.