This case requires to develop a customer segmentation to define marketing strategy. The sample dataset summarizes the usage behavior of up to 10000 active credit card holders during a periode of the data. The file is at a customer level with 23 columns in raw data.
The analysis is using K-Prototype Clustering method which is an improvement of the K-Means and K-Modes clustering algorithm to handle clustering with the mixed data types.
Dataset Source: https://www.kaggle.com/sakshigoyal7/credit-card-customers
*there is definition of each variable in the source link
The below code is a function, which can help to install any package if it’s not installed yet.
ins_packages <- function(x){
x <- as.character(match.call()[[2]])
if (!require(x,character.only = TRUE)){
install.packages(pkgs = x,repos = "http://cran.r-project.org")
require(x, character.only = TRUE)
}
}
ins_packages(rio); ins_packages(skimr); ins_packages(DT); ins_packages(tidyverse)
ins_packages(ggplot2); ins_packages(corrplot); ins_packages(psych); ins_packages(dplyr)
ins_packages(ggcorrplot); ins_packages(caret); ins_packages(clustMixType)
The dataset is loaded from personal directory which there is in ‘.csv’ format.
credit_df <- rio::import("C:/Users/ASUS/Downloads/BankChurners.csv/BankChurners.csv")
head(credit_df)
## CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count
## 1 768805383 Existing Customer 45 M 3
## 2 818770008 Existing Customer 49 F 5
## 3 713982108 Existing Customer 51 M 3
## 4 769911858 Existing Customer 40 F 4
## 5 709106358 Existing Customer 40 M 3
## 6 713061558 Existing Customer 44 M 2
## Education_Level Marital_Status Income_Category Card_Category Months_on_book
## 1 High School Married $60K - $80K Blue 39
## 2 Graduate Single Less than $40K Blue 44
## 3 Graduate Married $80K - $120K Blue 36
## 4 High School Unknown Less than $40K Blue 34
## 5 Uneducated Married $60K - $80K Blue 21
## 6 Graduate Married $40K - $60K Blue 36
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## 1 5 1 3
## 2 6 1 2
## 3 4 1 0
## 4 3 4 1
## 5 5 1 0
## 6 3 1 2
## Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1
## 1 12691 777 11914 1.335
## 2 8256 864 7392 1.541
## 3 3418 0 3418 2.594
## 4 3313 2517 796 1.405
## 5 4716 0 4716 2.175
## 6 4010 1247 2763 1.376
## Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## 1 1144 42 1.625 0.061
## 2 1291 33 3.714 0.105
## 3 1887 20 2.333 0.000
## 4 1171 20 2.333 0.760
## 5 816 28 2.500 0.000
## 6 1088 24 0.846 0.311
## Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1
## 1 9.3448e-05
## 2 5.6861e-05
## 3 2.1081e-05
## 4 1.3366e-04
## 5 2.1676e-05
## 6 5.5077e-05
## Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
## 1 0.99991
## 2 0.99994
## 3 0.99998
## 4 0.99987
## 5 0.99998
## 6 0.99994
This step is requires as a process of cleaning, structuring, and enriching raw data into a better structure for analyze.
Clustering method is unsupervised learning and works with unlabeled data. Therefore, ‘Attrition_Flag’ variable can be discarded. On the other hand, there are a couple of variables, last two columns, which are suggested to delete by the data uploader.
Missing value should be checked. If there are missing value, median imputation will be used to handle it (replace missing value with variable median).
#column delete
credit_df <- credit_df[, -c(2, 22, 23)]
#fixing some variables type
credit_df$CLIENTNUM <- as.character(credit_df$CLIENTNUM)
credit_df$Gender <- factor(credit_df$Gender, order = FALSE)
credit_df$Marital_Status <- factor(credit_df$Marital_Status, order = FALSE)
credit_df$Education_Level <- factor(credit_df$Education_Level, order = TRUE,
levels = c("Unknown", "Uneducated", "High School",
"College", "Graduate", "Post-Graduate",
"Doctorate"))
credit_df$Income_Category <- factor(credit_df$Income_Category, order = TRUE,
levels = c("Unknown", "Less than $40K",
"$40K - $60K", "$60K - $80K",
"$80K - $120K", "$120K +"))
credit_df$Card_Category <- factor(credit_df$Card_Category, order = TRUE,
levels = c("Blue", "Silver", "Gold", "Platinum"))
#summary & missing value check
skimr::skim(credit_df)
| Name | credit_df |
| Number of rows | 10127 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 5 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| CLIENTNUM | 0 | 1 | 9 | 9 | 0 | 10127 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Gender | 0 | 1 | FALSE | 2 | F: 5358, M: 4769 |
| Education_Level | 0 | 1 | TRUE | 7 | Gra: 3128, Hig: 2013, Unk: 1519, Une: 1487 |
| Marital_Status | 0 | 1 | FALSE | 4 | Mar: 4687, Sin: 3943, Unk: 749, Div: 748 |
| Income_Category | 0 | 1 | TRUE | 6 | Les: 3561, $40: 1790, $80: 1535, $60: 1402 |
| Card_Category | 0 | 1 | TRUE | 4 | Blu: 9436, Sil: 555, Gol: 116, Pla: 20 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Customer_Age | 0 | 1 | 46.33 | 8.02 | 26.0 | 41.00 | 46.00 | 52.00 | 73.00 | ▂▆▇▃▁ |
| Dependent_count | 0 | 1 | 2.35 | 1.30 | 0.0 | 1.00 | 2.00 | 3.00 | 5.00 | ▇▇▇▅▁ |
| Months_on_book | 0 | 1 | 35.93 | 7.99 | 13.0 | 31.00 | 36.00 | 40.00 | 56.00 | ▁▃▇▃▂ |
| Total_Relationship_Count | 0 | 1 | 3.81 | 1.55 | 1.0 | 3.00 | 4.00 | 5.00 | 6.00 | ▇▇▆▆▆ |
| Months_Inactive_12_mon | 0 | 1 | 2.34 | 1.01 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 | ▅▇▇▁▁ |
| Contacts_Count_12_mon | 0 | 1 | 2.46 | 1.11 | 0.0 | 2.00 | 2.00 | 3.00 | 6.00 | ▅▇▇▃▁ |
| Credit_Limit | 0 | 1 | 8631.95 | 9088.78 | 1438.3 | 2555.00 | 4549.00 | 11067.50 | 34516.00 | ▇▂▁▁▁ |
| Total_Revolving_Bal | 0 | 1 | 1162.81 | 814.99 | 0.0 | 359.00 | 1276.00 | 1784.00 | 2517.00 | ▇▅▇▇▅ |
| Avg_Open_To_Buy | 0 | 1 | 7469.14 | 9090.69 | 3.0 | 1324.50 | 3474.00 | 9859.00 | 34516.00 | ▇▂▁▁▁ |
| Total_Amt_Chng_Q4_Q1 | 0 | 1 | 0.76 | 0.22 | 0.0 | 0.63 | 0.74 | 0.86 | 3.40 | ▅▇▁▁▁ |
| Total_Trans_Amt | 0 | 1 | 4404.09 | 3397.13 | 510.0 | 2155.50 | 3899.00 | 4741.00 | 18484.00 | ▇▅▁▁▁ |
| Total_Trans_Ct | 0 | 1 | 64.86 | 23.47 | 10.0 | 45.00 | 67.00 | 81.00 | 139.00 | ▂▅▇▂▁ |
| Total_Ct_Chng_Q4_Q1 | 0 | 1 | 0.71 | 0.24 | 0.0 | 0.58 | 0.70 | 0.82 | 3.71 | ▇▆▁▁▁ |
| Avg_Utilization_Ratio | 0 | 1 | 0.27 | 0.28 | 0.0 | 0.02 | 0.18 | 0.50 | 1.00 | ▇▂▂▂▁ |
There is no missing data in each variable. Therefore, this analysis using the dataset which shown in table below.
DT::datatable(credit_df, options=list(
pageLength = 10,
lengthMenu = c(10, 50, 100, 1000),
searching = TRUE))
There is statistics summary of variables
summary(credit_df)
## CLIENTNUM Customer_Age Gender Dependent_count
## Length:10127 Min. :26.00 F:5358 Min. :0.000
## Class :character 1st Qu.:41.00 M:4769 1st Qu.:1.000
## Mode :character Median :46.00 Median :2.000
## Mean :46.33 Mean :2.346
## 3rd Qu.:52.00 3rd Qu.:3.000
## Max. :73.00 Max. :5.000
##
## Education_Level Marital_Status Income_Category Card_Category
## Unknown :1519 Divorced: 748 Unknown :1112 Blue :9436
## Uneducated :1487 Married :4687 Less than $40K:3561 Silver : 555
## High School :2013 Single :3943 $40K - $60K :1790 Gold : 116
## College :1013 Unknown : 749 $60K - $80K :1402 Platinum: 20
## Graduate :3128 $80K - $120K :1535
## Post-Graduate: 516 $120K + : 727
## Doctorate : 451
## Months_on_book Total_Relationship_Count Months_Inactive_12_mon
## Min. :13.00 Min. :1.000 Min. :0.000
## 1st Qu.:31.00 1st Qu.:3.000 1st Qu.:2.000
## Median :36.00 Median :4.000 Median :2.000
## Mean :35.93 Mean :3.813 Mean :2.341
## 3rd Qu.:40.00 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :56.00 Max. :6.000 Max. :6.000
##
## Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy
## Min. :0.000 Min. : 1438 Min. : 0 Min. : 3
## 1st Qu.:2.000 1st Qu.: 2555 1st Qu.: 359 1st Qu.: 1324
## Median :2.000 Median : 4549 Median :1276 Median : 3474
## Mean :2.455 Mean : 8632 Mean :1163 Mean : 7469
## 3rd Qu.:3.000 3rd Qu.:11068 3rd Qu.:1784 3rd Qu.: 9859
## Max. :6.000 Max. :34516 Max. :2517 Max. :34516
##
## Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1
## Min. :0.0000 Min. : 510 Min. : 10.00 Min. :0.0000
## 1st Qu.:0.6310 1st Qu.: 2156 1st Qu.: 45.00 1st Qu.:0.5820
## Median :0.7360 Median : 3899 Median : 67.00 Median :0.7020
## Mean :0.7599 Mean : 4404 Mean : 64.86 Mean :0.7122
## 3rd Qu.:0.8590 3rd Qu.: 4741 3rd Qu.: 81.00 3rd Qu.:0.8180
## Max. :3.3970 Max. :18484 Max. :139.00 Max. :3.7140
##
## Avg_Utilization_Ratio
## Min. :0.0000
## 1st Qu.:0.0230
## Median :0.1760
## Mean :0.2749
## 3rd Qu.:0.5030
## Max. :0.9990
##
str(credit_df)
## 'data.frame': 10127 obs. of 20 variables:
## $ CLIENTNUM : chr "768805383" "818770008" "713982108" "769911858" ...
## $ Customer_Age : int 45 49 51 40 40 44 51 32 37 48 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 2 2 2 2 ...
## $ Dependent_count : int 3 5 3 4 3 2 4 0 3 2 ...
## $ Education_Level : Ord.factor w/ 7 levels "Unknown"<"Uneducated"<..: 3 5 5 3 2 5 1 3 2 5 ...
## $ Marital_Status : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 4 2 2 2 4 3 3 ...
## $ Income_Category : Ord.factor w/ 6 levels "Unknown"<"Less than $40K"<..: 4 2 5 2 4 3 6 4 4 5 ...
## $ Card_Category : Ord.factor w/ 4 levels "Blue"<"Silver"<..: 1 1 1 1 1 1 3 2 1 1 ...
## $ Months_on_book : int 39 44 36 34 21 36 46 27 36 36 ...
## $ Total_Relationship_Count: int 5 6 4 3 5 3 6 2 5 6 ...
## $ Months_Inactive_12_mon : int 1 1 1 4 1 1 1 2 2 3 ...
## $ Contacts_Count_12_mon : int 3 2 0 1 0 2 3 2 0 3 ...
## $ Credit_Limit : num 12691 8256 3418 3313 4716 ...
## $ Total_Revolving_Bal : int 777 864 0 2517 0 1247 2264 1396 2517 1677 ...
## $ Avg_Open_To_Buy : num 11914 7392 3418 796 4716 ...
## $ Total_Amt_Chng_Q4_Q1 : num 1.33 1.54 2.59 1.4 2.17 ...
## $ Total_Trans_Amt : int 1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
## $ Total_Trans_Ct : int 42 33 20 20 28 24 31 36 24 32 ...
## $ Total_Ct_Chng_Q4_Q1 : num 1.62 3.71 2.33 2.33 2.5 ...
## $ Avg_Utilization_Ratio : num 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
There are categorical and numerical variables which compatible for mix data types clustering algorithm, K-Prototype Clustering. Therefore, all of variables there are used for analysis. Variables in dataset are relevant enough for credit card customer segmentation though. But, there are selection variables for high-correlational variables
This step display some of EDA method, histogram and count plot as visualization.
#histogram some numerical-continuous and discrete variables
credit_df %>%
gather(attributes, value, -c(1, 3:4, 5:8, 10:12)) %>%
ggplot(aes(x = value)) +
geom_histogram(fill = 'lightblue2', color = 'black') +
facet_wrap(~attributes, scales = 'free_x') +
labs(title = "Distribution of some numerical variables", x = "Values", y = "Frequency") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#histogram some numerical-discrete variables
credit_df %>%
gather(attributes, value, c(4, 10:12)) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 5, fill = 'lightblue2', color = 'black') +
facet_wrap(~attributes, scales = 'free_x') +
labs(title = "Distribution of some numerical variables", x = "Values", y = "Frequency") +
theme_bw()
#count plot of some categorical variables
ggplot(credit_df, aes(x = credit_df$Gender, fill = credit_df$Income_Category)) + geom_bar(width = 0.25) + labs(title = "Gender ~ Income Category", x = "Gender Category", y = "Frequency")+
guides(fill = guide_legend(title = "Income Level Category"))
#count plot of some categorical variables
ggplot(credit_df, aes(x = credit_df$Card_Category, fill = credit_df$Education_Level)) + geom_bar(width = 0.25) + labs(title = "Card Category ~ Education Level", x = "Card Category", y = "Frequency")+
guides(fill = guide_legend(title = "Education Level Category"))
##Clustering Assumptions By attached reference, there are assumptions for clustering analysis, Non-Multicollinearity & Data Adequacy.
#correlation plot for mixture data type
#change column names for more better labels visual
for (i in 2:ncol(credit_df)){
colnames(credit_df)[i] <- paste0("X", i-1)
}
model.matrix(~0+., data = credit_df[, -1]) %>%
cor(use = "pairwise.complete.obs") %>%
ggcorrplot(show.diag = FALSE, type = "full", tl.cex = 7, lab = TRUE,
lab_size = 1.5) +
labs(title = "Correlation Heat Map of Used Variables")
There are some couple variables have a relatively high correlation which are multicollinearity indicate. Variable ‘X12’-‘X14’ & ‘X16’-‘X17’ have correlation up to 0.8, which is cut points of multicollinearity (by attached reference).Therefore, one of couple high-correlation will be deleted.
#correlation plot for mixture data type
#change column names for more better labels visual
for (i in 2:ncol(credit_df)){
colnames(credit_df)[i] <- paste0("X", i-1)
}
model.matrix(~0+., data = credit_df[, -c(1, 13, 18)]) %>%
cor(use = "pairwise.complete.obs") %>%
ggcorrplot(show.diag = FALSE, type = "full", tl.cex = 7, lab = TRUE,
lab_size = 1.5) +
labs(title = "Correlation Heat Map of Used Variables *high-correlation deleted")
There are another assumptions for clustering analysis which is data adequacy is fulfill if MSA values at least 0.5
#Data Adequacy Assumption: Keiser-Meyer-Olkin Test
cor_mix <- model.matrix(~0+., data = credit_df[, -c(1, 13, 18)]) %>%
cor(use = "pairwise.complete.obs")
KMO(cor_mix)
## Error in solve.default(r) :
## Lapack routine dgesv: system is exactly singular: U[3,3] = 0
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_mix)
## Overall MSA = 0.5
## MSA for each item =
## X1 X2F X2M X3 X4.L X4.Q X4.C X4^4
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## X4^5 X4^6 X5Married X5Single X5Unknown X6.L X6.Q X6.C
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## X6^4 X6^5 X7.L X7.Q X7.C X8 X9 X10
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## X11 X13 X14 X15 X16 X18 X19
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5
Elbow Method is used for calculating optimal number of cluster. Code and markdown for this step is available in personal Google Colab Jupyter Notebook. There is the link to access.
https://colab.research.google.com/drive/1WI_I8AzXuhWlM62hLIfrJhVPN7hF9EMF?usp=sharing
There is summary plot for this step. From theory, there are 3 choices for number of clusters, 2 or 3 or 4. By subjectively, 3 number is chosen for this analysis so will be constructed clustering model with 3-Prototype Clustering.
credit_clusters <- kproto(credit_df[, -c(1, 13, 18)], 3)
## # NAs in variables:
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X13 X14 X15 X16 X18 X19
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 0 observation(s) with NAs.
##
## Estimated lambda: 13903110
credit_clusters
## Numeric predictors: 12
## Categorical predictors: 5
## Lambda: 13903110
##
## Number of Clusters: 3
## Cluster sizes: 2112 1103 6912
## Within cluster error: 1.36864e+11 81122744270 282627833223
##
## Cluster prototypes:
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## 1 46.28693 M 2.379735 Graduate Married $80K - $120K Blue 36.03456 3.783144
## 2 46.35449 M 2.549411 Graduate Single $80K - $120K Blue 35.95286 3.517679
## 3 46.33333 F 2.303530 Graduate Married Less than $40K Blue 35.89207 3.868634
## X10 X11 X13 X14 X15 X16 X18 X19
## 1 2.317708 2.491951 1100.750 12422.478 0.7648958 4884.566 0.7111259 0.08449384
## 2 2.312783 2.488667 1170.412 29415.736 0.7595947 5652.302 0.7102919 0.03866908
## 3 2.352865 2.438802 1180.566 2453.436 0.7584818 4058.086 0.7128655 0.37076736
#summary of each variable
summary(credit_clusters)
## X1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 26 41 46 46.28693 51 68
## 2 26 42 47 46.35449 51 65
## 3 26 41 46 46.33333 52 73
##
## -----------------------------------------------------------------
## X2
##
## cluster F M
## 1 0.245 0.755
## 2 0.112 0.888
## 3 0.683 0.317
##
## -----------------------------------------------------------------
## X3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 1 2 2.379735 3 5
## 2 0 2 3 2.549411 3 5
## 3 0 1 2 2.303530 3 5
##
## -----------------------------------------------------------------
## X4
##
## cluster Unknown Uneducated High School College Graduate Post-Graduate Doctorate
## 1 0.145 0.152 0.204 0.093 0.309 0.052 0.044
## 2 0.150 0.160 0.190 0.108 0.297 0.051 0.044
## 3 0.151 0.143 0.198 0.101 0.311 0.051 0.045
##
## -----------------------------------------------------------------
## X5
##
## cluster Divorced Married Single Unknown
## 1 0.076 0.450 0.394 0.080
## 2 0.086 0.383 0.441 0.090
## 3 0.071 0.479 0.380 0.070
##
## -----------------------------------------------------------------
## X6
##
## cluster Unknown Less than $40K $40K - $60K $60K - $80K $80K - $120K $120K +
## 1 0.133 0.089 0.150 0.238 0.284 0.108
## 2 0.110 0.000 0.018 0.162 0.418 0.292
## 3 0.103 0.488 0.210 0.104 0.069 0.026
##
## -----------------------------------------------------------------
## X7
##
## cluster Blue Silver Gold Platinum
## 1 0.885 0.101 0.012 0.002
## 2 0.608 0.297 0.080 0.015
## 3 0.998 0.002 0.000 0.000
##
## -----------------------------------------------------------------
## X8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 13 32 36 36.03456 40 56
## 2 13 32 36 35.95286 40 56
## 3 13 31 36 35.89207 40 56
##
## -----------------------------------------------------------------
## X9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 3 4 3.783144 5 6
## 2 1 2 3 3.517679 5 6
## 3 1 3 4 3.868634 5 6
##
## -----------------------------------------------------------------
## X10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 2 2 2.317708 3 6
## 2 0 2 2 2.312783 3 6
## 3 0 2 2 2.352865 3 6
##
## -----------------------------------------------------------------
## X11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 2 3 2.491951 3 6
## 2 0 2 2 2.488667 3 6
## 3 0 2 2 2.438802 3 6
##
## -----------------------------------------------------------------
## X13
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 0.0 1211 1100.750 1720.25 2517
## 2 0 524.5 1289 1170.412 1793.00 2517
## 3 0 493.0 1297 1180.566 1798.00 2517
##
## -----------------------------------------------------------------
## X14
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 5277 9164.50 11742.5 12422.478 15374.0 21442
## 2 20321 25127.00 31999.0 29415.736 33134.5 34516
## 3 3 921.75 1708.0 2453.436 3617.0 8934
##
## -----------------------------------------------------------------
## X15
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 0.62875 0.741 0.7648958 0.869 3.397
## 2 0 0.63600 0.742 0.7595947 0.862 2.204
## 3 0 0.63100 0.735 0.7584818 0.856 2.594
##
## -----------------------------------------------------------------
## X16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 530 2004 3715.0 4884.566 4889.25 18484
## 2 597 2200 3918.0 5652.302 7887.00 17350
## 3 510 2201 3942.5 4058.086 4659.00 17995
##
## -----------------------------------------------------------------
## X18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 0.574 0.689 0.7111259 0.8100 3.571
## 2 0 0.583 0.706 0.7102919 0.8035 2.429
## 3 0 0.585 0.706 0.7128655 0.8220 3.714
##
## -----------------------------------------------------------------
## X19
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 0 0.00000 0.086 0.08449384 0.133 0.269
## 2 0 0.01600 0.042 0.03866908 0.059 0.105
## 3 0 0.10675 0.359 0.37076736 0.609 0.999
##
## -----------------------------------------------------------------
fit_df <- factor(credit_clusters$cluster, order = TRUE,
levels = c(1:3))
fit <- data.frame(credit_df[,-c(1, 13, 18)], fit_df)
result_df <- credit_clusters$centers
Member <- credit_clusters$size
result <- data.frame(Member, result_df)
result
## clusters Freq X1 X2 X3 X4 X5 X6 X7
## 1 1 2112 46.28693 M 2.379735 Graduate Married $80K - $120K Blue
## 2 2 1103 46.35449 M 2.549411 Graduate Single $80K - $120K Blue
## 3 3 6912 46.33333 F 2.303530 Graduate Married Less than $40K Blue
## X8 X9 X10 X11 X13 X14 X15 X16
## 1 36.03456 3.783144 2.317708 2.491951 1100.750 12422.478 0.7648958 4884.566
## 2 35.95286 3.517679 2.312783 2.488667 1170.412 29415.736 0.7595947 5652.302
## 3 35.89207 3.868634 2.352865 2.438802 1180.566 2453.436 0.7584818 4058.086
## X18 X19
## 1 0.7111259 0.08449384
## 2 0.7102919 0.03866908
## 3 0.7128655 0.37076736
#one of plotting cluster result
ggplot(fit, aes(x = X14, y = X19, color = fit_df)) +
geom_point() +
labs(title = "Average Open Credit to Buy (X14) ~ Average Card Utilization Ratio (X19)",
x = "Average Open Credit to Buy (X14)", y = "Average Card Utilization Ratio (X19)") + guides(color = guide_legend(title = "Cluster"))
K-Prototype Clustering is an improvement of the K-Means and K-Modes clustering algorithm to handle clustering with the mixed data types.
There are some variables which are not used for the analysis.
Clustering assumptions is fulfilled, both for multicollinearity and adequacy data which are some variables had been deleted because indicate high-correlation.
Final number of clusters is 3 from elbow method.
Kaiser, H. F. (1974). “An Index of Factorial Simplicity”. Psychometric. 39: 31-36.
Puspitasari, et al. (2014). “Perbandingan Analisis Faktor Klasik dan Analisis Faktor Robust untuk Data Inflasi Kelompok Bahan Makanan di Jawa Tengah” Jurnal Gaussian. 3(3): 343-352.
Naes, T., et al. (2002). Multivariate Calibration and Classification. West Sussex: NIR Publication.