Short Introduction

This case requires to develop a customer segmentation to define marketing strategy. The sample dataset summarizes the usage behavior of up to 10000 active credit card holders during a periode of the data. The file is at a customer level with 23 columns in raw data.

The analysis is using K-Prototype Clustering method which is an improvement of the K-Means and K-Modes clustering algorithm to handle clustering with the mixed data types.

Dataset Source: https://www.kaggle.com/sakshigoyal7/credit-card-customers

*there is definition of each variable in the source link

Required Packages

The below code is a function, which can help to install any package if it’s not installed yet.

ins_packages <- function(x){
  x <- as.character(match.call()[[2]])
  if (!require(x,character.only = TRUE)){
    install.packages(pkgs = x,repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}

ins_packages(rio); ins_packages(skimr); ins_packages(DT); ins_packages(tidyverse)
ins_packages(ggplot2); ins_packages(corrplot); ins_packages(psych); ins_packages(dplyr)
ins_packages(ggcorrplot); ins_packages(caret); ins_packages(clustMixType)

Load the Data

The dataset is loaded from personal directory which there is in ‘.csv’ format.

credit_df <- rio::import("C:/Users/ASUS/Downloads/BankChurners.csv/BankChurners.csv")
head(credit_df)
##   CLIENTNUM    Attrition_Flag Customer_Age Gender Dependent_count
## 1 768805383 Existing Customer           45      M               3
## 2 818770008 Existing Customer           49      F               5
## 3 713982108 Existing Customer           51      M               3
## 4 769911858 Existing Customer           40      F               4
## 5 709106358 Existing Customer           40      M               3
## 6 713061558 Existing Customer           44      M               2
##   Education_Level Marital_Status Income_Category Card_Category Months_on_book
## 1     High School        Married     $60K - $80K          Blue             39
## 2        Graduate         Single  Less than $40K          Blue             44
## 3        Graduate        Married    $80K - $120K          Blue             36
## 4     High School        Unknown  Less than $40K          Blue             34
## 5      Uneducated        Married     $60K - $80K          Blue             21
## 6        Graduate        Married     $40K - $60K          Blue             36
##   Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## 1                        5                      1                     3
## 2                        6                      1                     2
## 3                        4                      1                     0
## 4                        3                      4                     1
## 5                        5                      1                     0
## 6                        3                      1                     2
##   Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1
## 1        12691                 777           11914                1.335
## 2         8256                 864            7392                1.541
## 3         3418                   0            3418                2.594
## 4         3313                2517             796                1.405
## 5         4716                   0            4716                2.175
## 6         4010                1247            2763                1.376
##   Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## 1            1144             42               1.625                 0.061
## 2            1291             33               3.714                 0.105
## 3            1887             20               2.333                 0.000
## 4            1171             20               2.333                 0.760
## 5             816             28               2.500                 0.000
## 6            1088             24               0.846                 0.311
##   Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
## 1                                                                                                                         9.3448e-05
## 2                                                                                                                         5.6861e-05
## 3                                                                                                                         2.1081e-05
## 4                                                                                                                         1.3366e-04
## 5                                                                                                                         2.1676e-05
## 6                                                                                                                         5.5077e-05
##   Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
## 1                                                                                                                            0.99991
## 2                                                                                                                            0.99994
## 3                                                                                                                            0.99998
## 4                                                                                                                            0.99987
## 5                                                                                                                            0.99998
## 6                                                                                                                            0.99994

Data Wrangling

This step is requires as a process of cleaning, structuring, and enriching raw data into a better structure for analyze.

Clustering method is unsupervised learning and works with unlabeled data. Therefore, ‘Attrition_Flag’ variable can be discarded. On the other hand, there are a couple of variables, last two columns, which are suggested to delete by the data uploader.

Missing value should be checked. If there are missing value, median imputation will be used to handle it (replace missing value with variable median).

#column delete
credit_df <- credit_df[, -c(2, 22, 23)]

#fixing some variables type
credit_df$CLIENTNUM <- as.character(credit_df$CLIENTNUM)
credit_df$Gender <- factor(credit_df$Gender, order = FALSE)
credit_df$Marital_Status <- factor(credit_df$Marital_Status, order = FALSE)
credit_df$Education_Level <- factor(credit_df$Education_Level, order = TRUE,
                                    levels = c("Unknown", "Uneducated", "High School",
                                               "College", "Graduate", "Post-Graduate",
                                               "Doctorate"))
credit_df$Income_Category <- factor(credit_df$Income_Category, order = TRUE,
                                    levels = c("Unknown", "Less than $40K", 
                                               "$40K - $60K", "$60K - $80K", 
                                               "$80K - $120K", "$120K +"))
credit_df$Card_Category <- factor(credit_df$Card_Category, order =  TRUE,
                                  levels = c("Blue", "Silver", "Gold", "Platinum"))

#summary & missing value check
skimr::skim(credit_df)
Data summary
Name credit_df
Number of rows 10127
Number of columns 20
_______________________
Column type frequency:
character 1
factor 5
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
CLIENTNUM 0 1 9 9 0 10127 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 2 F: 5358, M: 4769
Education_Level 0 1 TRUE 7 Gra: 3128, Hig: 2013, Unk: 1519, Une: 1487
Marital_Status 0 1 FALSE 4 Mar: 4687, Sin: 3943, Unk: 749, Div: 748
Income_Category 0 1 TRUE 6 Les: 3561, $40: 1790, $80: 1535, $60: 1402
Card_Category 0 1 TRUE 4 Blu: 9436, Sil: 555, Gol: 116, Pla: 20

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Customer_Age 0 1 46.33 8.02 26.0 41.00 46.00 52.00 73.00 ▂▆▇▃▁
Dependent_count 0 1 2.35 1.30 0.0 1.00 2.00 3.00 5.00 ▇▇▇▅▁
Months_on_book 0 1 35.93 7.99 13.0 31.00 36.00 40.00 56.00 ▁▃▇▃▂
Total_Relationship_Count 0 1 3.81 1.55 1.0 3.00 4.00 5.00 6.00 ▇▇▆▆▆
Months_Inactive_12_mon 0 1 2.34 1.01 0.0 2.00 2.00 3.00 6.00 ▅▇▇▁▁
Contacts_Count_12_mon 0 1 2.46 1.11 0.0 2.00 2.00 3.00 6.00 ▅▇▇▃▁
Credit_Limit 0 1 8631.95 9088.78 1438.3 2555.00 4549.00 11067.50 34516.00 ▇▂▁▁▁
Total_Revolving_Bal 0 1 1162.81 814.99 0.0 359.00 1276.00 1784.00 2517.00 ▇▅▇▇▅
Avg_Open_To_Buy 0 1 7469.14 9090.69 3.0 1324.50 3474.00 9859.00 34516.00 ▇▂▁▁▁
Total_Amt_Chng_Q4_Q1 0 1 0.76 0.22 0.0 0.63 0.74 0.86 3.40 ▅▇▁▁▁
Total_Trans_Amt 0 1 4404.09 3397.13 510.0 2155.50 3899.00 4741.00 18484.00 ▇▅▁▁▁
Total_Trans_Ct 0 1 64.86 23.47 10.0 45.00 67.00 81.00 139.00 ▂▅▇▂▁
Total_Ct_Chng_Q4_Q1 0 1 0.71 0.24 0.0 0.58 0.70 0.82 3.71 ▇▆▁▁▁
Avg_Utilization_Ratio 0 1 0.27 0.28 0.0 0.02 0.18 0.50 1.00 ▇▂▂▂▁

There is no missing data in each variable. Therefore, this analysis using the dataset which shown in table below.

DT::datatable(credit_df, options=list(
       pageLength = 10,
       lengthMenu = c(10, 50, 100, 1000),
       searching = TRUE))

There is statistics summary of variables

summary(credit_df)
##   CLIENTNUM          Customer_Age   Gender   Dependent_count
##  Length:10127       Min.   :26.00   F:5358   Min.   :0.000  
##  Class :character   1st Qu.:41.00   M:4769   1st Qu.:1.000  
##  Mode  :character   Median :46.00            Median :2.000  
##                     Mean   :46.33            Mean   :2.346  
##                     3rd Qu.:52.00            3rd Qu.:3.000  
##                     Max.   :73.00            Max.   :5.000  
##                                                             
##       Education_Level  Marital_Status       Income_Category  Card_Category 
##  Unknown      :1519   Divorced: 748   Unknown       :1112   Blue    :9436  
##  Uneducated   :1487   Married :4687   Less than $40K:3561   Silver  : 555  
##  High School  :2013   Single  :3943   $40K - $60K   :1790   Gold    : 116  
##  College      :1013   Unknown : 749   $60K - $80K   :1402   Platinum:  20  
##  Graduate     :3128                   $80K - $120K  :1535                  
##  Post-Graduate: 516                   $120K +       : 727                  
##  Doctorate    : 451                                                        
##  Months_on_book  Total_Relationship_Count Months_Inactive_12_mon
##  Min.   :13.00   Min.   :1.000            Min.   :0.000         
##  1st Qu.:31.00   1st Qu.:3.000            1st Qu.:2.000         
##  Median :36.00   Median :4.000            Median :2.000         
##  Mean   :35.93   Mean   :3.813            Mean   :2.341         
##  3rd Qu.:40.00   3rd Qu.:5.000            3rd Qu.:3.000         
##  Max.   :56.00   Max.   :6.000            Max.   :6.000         
##                                                                 
##  Contacts_Count_12_mon  Credit_Limit   Total_Revolving_Bal Avg_Open_To_Buy
##  Min.   :0.000         Min.   : 1438   Min.   :   0        Min.   :    3  
##  1st Qu.:2.000         1st Qu.: 2555   1st Qu.: 359        1st Qu.: 1324  
##  Median :2.000         Median : 4549   Median :1276        Median : 3474  
##  Mean   :2.455         Mean   : 8632   Mean   :1163        Mean   : 7469  
##  3rd Qu.:3.000         3rd Qu.:11068   3rd Qu.:1784        3rd Qu.: 9859  
##  Max.   :6.000         Max.   :34516   Max.   :2517        Max.   :34516  
##                                                                           
##  Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct   Total_Ct_Chng_Q4_Q1
##  Min.   :0.0000       Min.   :  510   Min.   : 10.00   Min.   :0.0000     
##  1st Qu.:0.6310       1st Qu.: 2156   1st Qu.: 45.00   1st Qu.:0.5820     
##  Median :0.7360       Median : 3899   Median : 67.00   Median :0.7020     
##  Mean   :0.7599       Mean   : 4404   Mean   : 64.86   Mean   :0.7122     
##  3rd Qu.:0.8590       3rd Qu.: 4741   3rd Qu.: 81.00   3rd Qu.:0.8180     
##  Max.   :3.3970       Max.   :18484   Max.   :139.00   Max.   :3.7140     
##                                                                           
##  Avg_Utilization_Ratio
##  Min.   :0.0000       
##  1st Qu.:0.0230       
##  Median :0.1760       
##  Mean   :0.2749       
##  3rd Qu.:0.5030       
##  Max.   :0.9990       
## 
str(credit_df)
## 'data.frame':    10127 obs. of  20 variables:
##  $ CLIENTNUM               : chr  "768805383" "818770008" "713982108" "769911858" ...
##  $ Customer_Age            : int  45 49 51 40 40 44 51 32 37 48 ...
##  $ Gender                  : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 2 2 2 2 ...
##  $ Dependent_count         : int  3 5 3 4 3 2 4 0 3 2 ...
##  $ Education_Level         : Ord.factor w/ 7 levels "Unknown"<"Uneducated"<..: 3 5 5 3 2 5 1 3 2 5 ...
##  $ Marital_Status          : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 4 2 2 2 4 3 3 ...
##  $ Income_Category         : Ord.factor w/ 6 levels "Unknown"<"Less than $40K"<..: 4 2 5 2 4 3 6 4 4 5 ...
##  $ Card_Category           : Ord.factor w/ 4 levels "Blue"<"Silver"<..: 1 1 1 1 1 1 3 2 1 1 ...
##  $ Months_on_book          : int  39 44 36 34 21 36 46 27 36 36 ...
##  $ Total_Relationship_Count: int  5 6 4 3 5 3 6 2 5 6 ...
##  $ Months_Inactive_12_mon  : int  1 1 1 4 1 1 1 2 2 3 ...
##  $ Contacts_Count_12_mon   : int  3 2 0 1 0 2 3 2 0 3 ...
##  $ Credit_Limit            : num  12691 8256 3418 3313 4716 ...
##  $ Total_Revolving_Bal     : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
##  $ Avg_Open_To_Buy         : num  11914 7392 3418 796 4716 ...
##  $ Total_Amt_Chng_Q4_Q1    : num  1.33 1.54 2.59 1.4 2.17 ...
##  $ Total_Trans_Amt         : int  1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
##  $ Total_Trans_Ct          : int  42 33 20 20 28 24 31 36 24 32 ...
##  $ Total_Ct_Chng_Q4_Q1     : num  1.62 3.71 2.33 2.33 2.5 ...
##  $ Avg_Utilization_Ratio   : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...

There are categorical and numerical variables which compatible for mix data types clustering algorithm, K-Prototype Clustering. Therefore, all of variables there are used for analysis. Variables in dataset are relevant enough for credit card customer segmentation though. But, there are selection variables for high-correlational variables

Exploratory Data Analysis

This step display some of EDA method, histogram and count plot as visualization.

#histogram some numerical-continuous and discrete variables
credit_df %>%
  gather(attributes, value, -c(1, 3:4, 5:8, 10:12)) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(title = "Distribution of some numerical variables", x = "Values", y = "Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#histogram some numerical-discrete variables
credit_df %>%
  gather(attributes, value, c(4, 10:12)) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 5, fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(title = "Distribution of some numerical variables", x = "Values", y = "Frequency") +
  theme_bw()

#count plot of some categorical variables 
ggplot(credit_df, aes(x = credit_df$Gender, fill = credit_df$Income_Category)) + geom_bar(width = 0.25) + labs(title = "Gender ~ Income Category",                                                        x = "Gender Category", y = "Frequency")+
  guides(fill = guide_legend(title = "Income Level Category"))

#count plot of some categorical variables 
ggplot(credit_df, aes(x = credit_df$Card_Category, fill = credit_df$Education_Level)) + geom_bar(width = 0.25) + labs(title = "Card Category ~ Education Level",                                                        x = "Card Category", y = "Frequency")+
  guides(fill = guide_legend(title = "Education Level Category"))

##Clustering Assumptions By attached reference, there are assumptions for clustering analysis, Non-Multicollinearity & Data Adequacy.

#correlation plot for mixture data type
#change column names for more better labels visual
for (i in 2:ncol(credit_df)){
  colnames(credit_df)[i] <- paste0("X", i-1)
}
model.matrix(~0+., data = credit_df[, -1]) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  ggcorrplot(show.diag = FALSE, type = "full", tl.cex = 7, lab = TRUE, 
             lab_size = 1.5) +
  labs(title = "Correlation Heat Map of Used Variables")

There are some couple variables have a relatively high correlation which are multicollinearity indicate. Variable ‘X12’-‘X14’ & ‘X16’-‘X17’ have correlation up to 0.8, which is cut points of multicollinearity (by attached reference).Therefore, one of couple high-correlation will be deleted.

#correlation plot for mixture data type
#change column names for more better labels visual
for (i in 2:ncol(credit_df)){
  colnames(credit_df)[i] <- paste0("X", i-1)
}
model.matrix(~0+., data = credit_df[, -c(1, 13, 18)]) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  ggcorrplot(show.diag = FALSE, type = "full", tl.cex = 7, lab = TRUE, 
             lab_size = 1.5) +
  labs(title = "Correlation Heat Map of Used Variables *high-correlation deleted")

There are another assumptions for clustering analysis which is data adequacy is fulfill if MSA values at least 0.5

#Data Adequacy Assumption: Keiser-Meyer-Olkin Test
cor_mix <- model.matrix(~0+., data = credit_df[, -c(1, 13, 18)]) %>% 
  cor(use = "pairwise.complete.obs")
KMO(cor_mix)
## Error in solve.default(r) : 
##   Lapack routine dgesv: system is exactly singular: U[3,3] = 0
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_mix)
## Overall MSA =  0.5
## MSA for each item = 
##        X1       X2F       X2M        X3      X4.L      X4.Q      X4.C      X4^4 
##       0.5       0.5       0.5       0.5       0.5       0.5       0.5       0.5 
##      X4^5      X4^6 X5Married  X5Single X5Unknown      X6.L      X6.Q      X6.C 
##       0.5       0.5       0.5       0.5       0.5       0.5       0.5       0.5 
##      X6^4      X6^5      X7.L      X7.Q      X7.C        X8        X9       X10 
##       0.5       0.5       0.5       0.5       0.5       0.5       0.5       0.5 
##       X11       X13       X14       X15       X16       X18       X19 
##       0.5       0.5       0.5       0.5       0.5       0.5       0.5

Optimal Number of Cluster(s)

Elbow Method is used for calculating optimal number of cluster. Code and markdown for this step is available in personal Google Colab Jupyter Notebook. There is the link to access.

https://colab.research.google.com/drive/1WI_I8AzXuhWlM62hLIfrJhVPN7hF9EMF?usp=sharing

There is summary plot for this step. Elbow Method From theory, there are 3 choices for number of clusters, 2 or 3 or 4. By subjectively, 3 number is chosen for this analysis so will be constructed clustering model with 3-Prototype Clustering.

Cluster

credit_clusters <- kproto(credit_df[, -c(1, 13, 18)], 3)
## # NAs in variables:
##  X1  X2  X3  X4  X5  X6  X7  X8  X9 X10 X11 X13 X14 X15 X16 X18 X19 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## 0 observation(s) with NAs.
## 
## Estimated lambda: 13903110
credit_clusters
## Numeric predictors: 12 
## Categorical predictors: 5 
## Lambda: 13903110 
## 
## Number of Clusters: 3 
## Cluster sizes: 2112 1103 6912 
## Within cluster error: 1.36864e+11 81122744270 282627833223 
## 
## Cluster prototypes:
##         X1 X2       X3       X4      X5             X6   X7       X8       X9
## 1 46.28693  M 2.379735 Graduate Married   $80K - $120K Blue 36.03456 3.783144
## 2 46.35449  M 2.549411 Graduate  Single   $80K - $120K Blue 35.95286 3.517679
## 3 46.33333  F 2.303530 Graduate Married Less than $40K Blue 35.89207 3.868634
##        X10      X11      X13       X14       X15      X16       X18        X19
## 1 2.317708 2.491951 1100.750 12422.478 0.7648958 4884.566 0.7111259 0.08449384
## 2 2.312783 2.488667 1170.412 29415.736 0.7595947 5652.302 0.7102919 0.03866908
## 3 2.352865 2.438802 1180.566  2453.436 0.7584818 4058.086 0.7128655 0.37076736
#summary of each variable
summary(credit_clusters)
## X1 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1   26      41     46 46.28693      51   68
## 2   26      42     47 46.35449      51   65
## 3   26      41     46 46.33333      52   73
## 
## -----------------------------------------------------------------
## X2 
##        
## cluster     F     M
##       1 0.245 0.755
##       2 0.112 0.888
##       3 0.683 0.317
## 
## -----------------------------------------------------------------
## X3 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1    0       1      2 2.379735       3    5
## 2    0       2      3 2.549411       3    5
## 3    0       1      2 2.303530       3    5
## 
## -----------------------------------------------------------------
## X4 
##        
## cluster Unknown Uneducated High School College Graduate Post-Graduate Doctorate
##       1   0.145      0.152       0.204   0.093    0.309         0.052     0.044
##       2   0.150      0.160       0.190   0.108    0.297         0.051     0.044
##       3   0.151      0.143       0.198   0.101    0.311         0.051     0.045
## 
## -----------------------------------------------------------------
## X5 
##        
## cluster Divorced Married Single Unknown
##       1    0.076   0.450  0.394   0.080
##       2    0.086   0.383  0.441   0.090
##       3    0.071   0.479  0.380   0.070
## 
## -----------------------------------------------------------------
## X6 
##        
## cluster Unknown Less than $40K $40K - $60K $60K - $80K $80K - $120K $120K +
##       1   0.133          0.089       0.150       0.238        0.284   0.108
##       2   0.110          0.000       0.018       0.162        0.418   0.292
##       3   0.103          0.488       0.210       0.104        0.069   0.026
## 
## -----------------------------------------------------------------
## X7 
##        
## cluster  Blue Silver  Gold Platinum
##       1 0.885  0.101 0.012    0.002
##       2 0.608  0.297 0.080    0.015
##       3 0.998  0.002 0.000    0.000
## 
## -----------------------------------------------------------------
## X8 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1   13      32     36 36.03456      40   56
## 2   13      32     36 35.95286      40   56
## 3   13      31     36 35.89207      40   56
## 
## -----------------------------------------------------------------
## X9 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1    1       3      4 3.783144       5    6
## 2    1       2      3 3.517679       5    6
## 3    1       3      4 3.868634       5    6
## 
## -----------------------------------------------------------------
## X10 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1    0       2      2 2.317708       3    6
## 2    0       2      2 2.312783       3    6
## 3    0       2      2 2.352865       3    6
## 
## -----------------------------------------------------------------
## X11 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1    0       2      3 2.491951       3    6
## 2    0       2      2 2.488667       3    6
## 3    0       2      2 2.438802       3    6
## 
## -----------------------------------------------------------------
## X13 
##   Min. 1st Qu. Median     Mean 3rd Qu. Max.
## 1    0     0.0   1211 1100.750 1720.25 2517
## 2    0   524.5   1289 1170.412 1793.00 2517
## 3    0   493.0   1297 1180.566 1798.00 2517
## 
## -----------------------------------------------------------------
## X14 
##    Min.  1st Qu.  Median      Mean 3rd Qu.  Max.
## 1  5277  9164.50 11742.5 12422.478 15374.0 21442
## 2 20321 25127.00 31999.0 29415.736 33134.5 34516
## 3     3   921.75  1708.0  2453.436  3617.0  8934
## 
## -----------------------------------------------------------------
## X15 
##   Min. 1st Qu. Median      Mean 3rd Qu.  Max.
## 1    0 0.62875  0.741 0.7648958   0.869 3.397
## 2    0 0.63600  0.742 0.7595947   0.862 2.204
## 3    0 0.63100  0.735 0.7584818   0.856 2.594
## 
## -----------------------------------------------------------------
## X16 
##   Min. 1st Qu. Median     Mean 3rd Qu.  Max.
## 1  530    2004 3715.0 4884.566 4889.25 18484
## 2  597    2200 3918.0 5652.302 7887.00 17350
## 3  510    2201 3942.5 4058.086 4659.00 17995
## 
## -----------------------------------------------------------------
## X18 
##   Min. 1st Qu. Median      Mean 3rd Qu.  Max.
## 1    0   0.574  0.689 0.7111259  0.8100 3.571
## 2    0   0.583  0.706 0.7102919  0.8035 2.429
## 3    0   0.585  0.706 0.7128655  0.8220 3.714
## 
## -----------------------------------------------------------------
## X19 
##   Min. 1st Qu. Median       Mean 3rd Qu.  Max.
## 1    0 0.00000  0.086 0.08449384   0.133 0.269
## 2    0 0.01600  0.042 0.03866908   0.059 0.105
## 3    0 0.10675  0.359 0.37076736   0.609 0.999
## 
## -----------------------------------------------------------------
fit_df <- factor(credit_clusters$cluster, order =  TRUE,
                                  levels = c(1:3))
fit <- data.frame(credit_df[,-c(1, 13, 18)], fit_df)
result_df <- credit_clusters$centers
Member <- credit_clusters$size
result <- data.frame(Member, result_df)
result
##   clusters Freq       X1 X2       X3       X4      X5             X6   X7
## 1        1 2112 46.28693  M 2.379735 Graduate Married   $80K - $120K Blue
## 2        2 1103 46.35449  M 2.549411 Graduate  Single   $80K - $120K Blue
## 3        3 6912 46.33333  F 2.303530 Graduate Married Less than $40K Blue
##         X8       X9      X10      X11      X13       X14       X15      X16
## 1 36.03456 3.783144 2.317708 2.491951 1100.750 12422.478 0.7648958 4884.566
## 2 35.95286 3.517679 2.312783 2.488667 1170.412 29415.736 0.7595947 5652.302
## 3 35.89207 3.868634 2.352865 2.438802 1180.566  2453.436 0.7584818 4058.086
##         X18        X19
## 1 0.7111259 0.08449384
## 2 0.7102919 0.03866908
## 3 0.7128655 0.37076736
#one of plotting cluster result
ggplot(fit, aes(x = X14, y = X19, color = fit_df)) +
  geom_point() +
  labs(title = "Average Open Credit to Buy (X14) ~ Average Card Utilization Ratio (X19)",
       x = "Average Open Credit to Buy (X14)", y = "Average Card Utilization Ratio (X19)")   + guides(color = guide_legend(title = "Cluster"))

Conclusions

K-Prototype Clustering is an improvement of the K-Means and K-Modes clustering algorithm to handle clustering with the mixed data types.

There are some variables which are not used for the analysis.

Clustering assumptions is fulfilled, both for multicollinearity and adequacy data which are some variables had been deleted because indicate high-correlation.

Final number of clusters is 3 from elbow method.

References

Kaiser, H. F. (1974). “An Index of Factorial Simplicity”. Psychometric. 39: 31-36.

Puspitasari, et al. (2014). “Perbandingan Analisis Faktor Klasik dan Analisis Faktor Robust untuk Data Inflasi Kelompok Bahan Makanan di Jawa Tengah” Jurnal Gaussian. 3(3): 343-352.

Naes, T., et al. (2002). Multivariate Calibration and Classification. West Sussex: NIR Publication.