#install.packages(ggplot2)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
#install.packages("ggfortify")
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.2.2
#install.packages("ranger")
library(ranger)
## Warning: package 'ranger' was built under R version 4.2.2
#install.packages("dplyr")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("Hmisc")
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.2.2
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
#install.packages("factoextra")
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#install.packages("cluster")
library(cluster)
## Warning: package 'cluster' was built under R version 4.2.2
#install.packages("magrittr")
library(magrittr)
#install.packages("NbClust")
library("NbClust")
mydata <- read.csv("~/IMB/Multivariate analysis/HOMEWORK MVA/HW4/CC GENERAL.csv", sep = ",")
set.seed(1)
mydata1 <- sample_n(mydata, 300)
head(mydata1, 10)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 C11051 2128.151695 1.000000 0.00 0.00
## 2 C18220 5.416921 0.545455 288.80 0.00
## 3 C14908 49.564532 1.000000 433.50 273.94
## 4 C18691 919.563867 1.000000 3000.00 0.00
## 5 C14164 4663.032241 1.000000 1325.31 1325.31
## 6 C19027 133.099445 1.000000 1170.00 0.00
## 7 C11345 2894.348509 1.000000 657.71 310.70
## 8 C18752 1869.632550 1.000000 0.00 0.00
## 9 C11858 2270.955171 1.000000 25.26 25.26
## 10 C18449 51.089340 0.818182 140.75 0.00
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 0.00 19.34573 0.000000
## 2 288.80 0.00000 0.916667
## 3 159.56 0.00000 0.909091
## 4 3000.00 823.97913 0.166667
## 5 0.00 0.00000 0.333333
## 6 1170.00 0.00000 1.000000
## 7 347.01 3384.77264 1.000000
## 8 0.00 1897.99301 0.000000
## 9 0.00 452.80366 0.083333
## 10 140.75 0.00000 0.416667
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 0.000000 0.000000
## 2 0.000000 0.833333
## 3 0.181818 0.909091
## 4 0.000000 0.083333
## 5 0.333333 0.000000
## 6 0.000000 0.900000
## 7 0.166667 0.916667
## 8 0.000000 0.000000
## 9 0.083333 0.000000
## 10 0.000000 0.416667
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 0.083333 1 0 2100 138.9072
## 2 0.000000 0 11 3000 254.4436
## 3 0.000000 0 13 9000 465.4661
## 4 0.083333 1 2 2000 4426.4983
## 5 0.000000 0 5 4500 2064.1108
## 6 0.000000 0 12 1000 1068.5025
## 7 0.583333 17 21 3500 3166.7656
## 8 0.166667 4 0 2000 454.3482
## 9 0.250000 3 1 3100 503.1965
## 10 0.000000 0 5 1000 347.2401
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 13266.38190 0.000000 12
## 2 32.47525 0.181818 12
## 3 148.91458 0.272727 11
## 4 1451.75634 0.083333 12
## 5 6004.09418 0.000000 12
## 6 161.81650 0.444444 10
## 7 1006.61472 0.000000 12
## 8 452.59794 0.000000 12
## 9 815.40061 0.000000 12
## 10 238.81136 0.111111 12
Description of the dataset: This case requires to develop a customer segmentation to define marketing strategy. The sample Dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables. The dataset is sourced from Kaggle: “https://www.kaggle.com/datasets/arjunbhasin2013/ccdata?resource=download”
Following is the Data Dictionary for Credit Card data set :
CUST_ID : Identification of Credit Card holder (Categorical)
BALANCE : Balance amount left in their account to make purchases
BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES : Amount of purchases made from account
ONEOFF_PURCHASES : Maximum purchase amount done in one-go
INSTALLMENTS_PURCHASES : Amount of purchase done in installment
CASH_ADVANCE : Cash in advance given by the user
PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASHADVANCEFREQUENCY : How frequently the cash in advance being paid
CASHADVANCETRX : Number of Transactions made with “Cash in Advanced”
PURCHASES_TRX : Numbe of purchase transactions made
CREDIT_LIMIT : Limit of Credit Card for user
PAYMENTS : Amount of Payment done by user
MINIMUM_PAYMENTS : Minimum amount of payments made by user
PRCFULLPAYMENT : Percent of full payment paid by user
TENURE : Tenure of credit card service for user
RESEARCH QUESTION: I will try to cluster the 300 active card holders based on 6 variables, and analyze the differences between the pertaining clusters.
mydata2 <- na.omit(mydata1)
mydata2$ONEOFF_PURCHASES_FREQUENCYF <- ifelse(mydata2$ONEOFF_PURCHASES_FREQUENCY < 0.5, "low", "high")
mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF <- ifelse(mydata2$PURCHASES_INSTALLMENTS_FREQUENCY < 0.5, "low", "high")
head(mydata2)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 C11051 2128.151695 1.000000 0.00 0.00
## 2 C18220 5.416921 0.545455 288.80 0.00
## 3 C14908 49.564532 1.000000 433.50 273.94
## 4 C18691 919.563867 1.000000 3000.00 0.00
## 5 C14164 4663.032241 1.000000 1325.31 1325.31
## 6 C19027 133.099445 1.000000 1170.00 0.00
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 0.00 19.34573 0.000000
## 2 288.80 0.00000 0.916667
## 3 159.56 0.00000 0.909091
## 4 3000.00 823.97913 0.166667
## 5 0.00 0.00000 0.333333
## 6 1170.00 0.00000 1.000000
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 0.000000 0.000000
## 2 0.000000 0.833333
## 3 0.181818 0.909091
## 4 0.000000 0.083333
## 5 0.333333 0.000000
## 6 0.000000 0.900000
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 0.083333 1 0 2100 138.9072
## 2 0.000000 0 11 3000 254.4436
## 3 0.000000 0 13 9000 465.4661
## 4 0.083333 1 2 2000 4426.4983
## 5 0.000000 0 5 4500 2064.1108
## 6 0.000000 0 12 1000 1068.5025
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE ONEOFF_PURCHASES_FREQUENCYF
## 1 13266.38190 0.000000 12 low
## 2 32.47525 0.181818 12 low
## 3 148.91458 0.272727 11 low
## 4 1451.75634 0.083333 12 low
## 5 6004.09418 0.000000 12 low
## 6 161.81650 0.444444 10 low
## PURCHASES_INSTALLMENTS_FREQUENCYF
## 1 low
## 2 high
## 3 high
## 4 low
## 5 low
## 6 high
mydata2$BALANCE_z <- scale(mydata2$BALANCE)
mydata2$PURCHASES_TRX <- scale(mydata2$PURCHASES_TRX)
mydata2$CASH_ADVANCE_z <- scale(mydata2$CASH_ADVANCE)
mydata2$PRC_FULL_PAYMENT_z <- scale(mydata2$PRC_FULL_PAYMENT)
mydata2$PAYMENTS_z <- scale(mydata2$PAYMENTS)
mydata2$INSTALLMENTS_PURCHASES <- scale(mydata2$PURCHASES_INSTALLMENTS_FREQUENCY)
#using the scale function to standardize the variables, so that the variables will have the same effect
#Zx1=(X1-(X1mean))/Sx1 <- xq(meanstd), S^x1=1
rcorr(as.matrix(mydata2[, c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")]), type = "pearson")
## BALANCE_z PURCHASES_TRX CASH_ADVANCE_z
## BALANCE_z 1.00 0.14 0.49
## PURCHASES_TRX 0.14 1.00 -0.10
## CASH_ADVANCE_z 0.49 -0.10 1.00
## INSTALLMENTS_PURCHASES -0.08 0.66 -0.22
## PRC_FULL_PAYMENT_z -0.35 0.07 -0.14
## PAYMENTS_z 0.32 0.40 0.49
## INSTALLMENTS_PURCHASES PRC_FULL_PAYMENT_z PAYMENTS_z
## BALANCE_z -0.08 -0.35 0.32
## PURCHASES_TRX 0.66 0.07 0.40
## CASH_ADVANCE_z -0.22 -0.14 0.49
## INSTALLMENTS_PURCHASES 1.00 0.20 0.10
## PRC_FULL_PAYMENT_z 0.20 1.00 0.02
## PAYMENTS_z 0.10 0.02 1.00
##
## n= 291
##
##
## P
## BALANCE_z PURCHASES_TRX CASH_ADVANCE_z
## BALANCE_z 0.0176 0.0000
## PURCHASES_TRX 0.0176 0.0857
## CASH_ADVANCE_z 0.0000 0.0857
## INSTALLMENTS_PURCHASES 0.1958 0.0000 0.0001
## PRC_FULL_PAYMENT_z 0.0000 0.2333 0.0157
## PAYMENTS_z 0.0000 0.0000 0.0000
## INSTALLMENTS_PURCHASES PRC_FULL_PAYMENT_z PAYMENTS_z
## BALANCE_z 0.1958 0.0000 0.0000
## PURCHASES_TRX 0.0000 0.2333 0.0000
## CASH_ADVANCE_z 0.0001 0.0157 0.0000
## INSTALLMENTS_PURCHASES 0.0005 0.0750
## PRC_FULL_PAYMENT_z 0.0005 0.7296
## PAYMENTS_z 0.0750 0.7296
#checking for correlation between variables
mydata2$Dissimilarity <- sqrt(mydata2$BALANCE_z^2 + mydata2$PURCHASES_TRX^2 + mydata2$CASH_ADVANCE_z^2 + mydata2$INSTALLMENTS_PURCHASES^2 + mydata2$PRC_FULL_PAYMENT_z^2 + mydata2$PAYMENTS_z^2) #Finding outliers in regards to dissimilarity
head(mydata2[order(-mydata2$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 108 C11261 1592.168 1.000000 0.00 0.00
## 193 C11819 5383.084 1.000000 9736.61 3684.52
## 289 C12078 5919.199 1.000000 9321.13 4678.79
## 282 C10750 13774.742 1.000000 404.24 0.00
## 294 C16238 1226.562 0.857143 360.00 360.00
## 67 C16604 3746.919 1.000000 7468.37 1310.10
## 28 C14758 4768.063 1.000000 374.81 333.20
## 75 C17716 6388.821 1.000000 5154.75 1622.06
## 86 C15758 1357.969 0.454545 0.00 0.00
## 287 C15344 481.961 0.909091 967.23 585.50
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 108 -0.8680692 15084.009 0.000000
## 193 1.7114505 2965.657 1.000000
## 289 1.7114505 0.000 1.000000
## 282 -0.2231893 3369.475 0.250000
## 294 -0.8680692 7883.542 0.857143
## 67 1.7114505 0.000 1.000000
## 28 -0.6531101 9917.639 0.333333
## 75 1.7114505 0.000 1.000000
## 86 -0.8680692 2903.355 0.000000
## 287 -0.4381484 6718.128 0.250000
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 108 0.000000 0.000000
## 193 0.416667 1.000000
## 289 1.000000 1.000000
## 282 0.000000 0.250000
## 294 0.857143 0.000000
## 67 0.750000 1.000000
## 28 0.250000 0.083333
## 75 0.500000 1.000000
## 86 0.000000 0.000000
## 287 0.166667 0.166667
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 108 0.333333 53 -0.58425072 4500
## 193 0.250000 7 4.82774868 9000
## 289 0.000000 0 6.27094853 15000
## 282 0.500000 7 -0.46398406 14500
## 294 0.857143 30 -0.34371741 2500
## 67 0.000000 0 4.98810422 10000
## 28 0.833333 18 -0.42389518 7000
## 75 0.000000 0 4.02597100 9500
## 86 0.250000 6 -0.58425072 6000
## 287 0.666667 19 0.01708255 1400
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 108 11382.366 472.7032 0.166667 12
## 193 15824.736 3364.0212 0.083333 12
## 289 7303.898 1395.7151 0.000000 12
## 282 3167.871 3533.4648 0.000000 12
## 294 10132.432 920.2878 0.600000 7
## 67 4793.231 949.0611 0.000000 12
## 28 5283.297 1589.2939 0.000000 12
## 75 6436.518 1823.7896 0.000000 12
## 86 12353.265 717.7885 0.200000 12
## 287 9282.027 217.8233 0.571429 12
## ONEOFF_PURCHASES_FREQUENCYF PURCHASES_INSTALLMENTS_FREQUENCYF BALANCE_z
## 108 low low -0.04381369
## 193 low high 1.70223980
## 289 high high 1.94916799
## 282 low low 5.56734317
## 294 high low -0.21220794
## 67 high high 0.94864033
## 28 low low 1.41896770
## 75 high high 2.16547068
## 86 low low -0.15168316
## 287 low low -0.55516280
## CASH_ADVANCE_z PRC_FULL_PAYMENT_z PAYMENTS_z Dissimilarity
## 108 7.8913180 0.05194399 4.4747329 9.132122
## 193 1.1614790 -0.23612650 6.5140403 8.542327
## 289 -0.4854771 -0.52419354 2.6024779 7.303182
## 282 1.3857362 -0.52419354 0.7037999 5.826730
## 294 3.8925908 1.54989744 3.9009405 5.804180
## 67 -0.4854771 -0.52419354 1.4499353 5.596690
## 28 5.0222121 -0.52419354 1.6749041 5.560792
## 75 -0.4854771 -0.52419354 2.2043001 5.403348
## 86 1.1268803 0.16717012 4.9204326 5.160074
## 287 3.2453865 1.45113268 3.5105547 5.046066
There was no severe jump in dissimilarity among the units of observation, so I presume it will not hurt the coming analysis, if I leave all of the units in the data set.
#mydata2 <- mydata2[-387, -447, - 481 ]
#removing the units with the highest dissimilarity
library(factoextra)
#Calculating Euclidean distances
distance <- get_dist(mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")],
method = "euclidian")
distance2 <- distance^2 #squared euklidian distance
fviz_dist(distance2) #Showing dissimilarity matrix
get_clust_tendency(mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")], #Hopkins statistics
n = nrow(mydata2) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8874106
##
## $plot
## NULL
Judging from the graph the data is clustarable. Furthermore, the Hopkins statistic, which is used to add a layer of objectivity to the possibility of clustering, confirms the latter. An appropriate statistic would be higher than 0.5. In our instance it is valued at approx. 0.89, this means the data is optimal for clustering.
ward <- mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
fviz_dend(ward,
k = 5,
cex = 0.5,
pallete = "Dark2",
color_labels_by_k = TRUE,
rect = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <]8;;https://github.com/kassambara/factoextra/issueshttps://github.com/kassambara/factoextra/issues]8;;>.
fviz_dend(ward,
k = 4,
cex = 0.5,
pallete = "Dark2",
color_labels_by_k = TRUE,
rect = TRUE)
Optimal_N_Clusters <- mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")] %>%
NbClust(distance = "euclidean",
min.nc = 2,
max.nc = 10,
method = "ward.D2",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 1 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 6 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 5
##
##
## *******************************************************************
According to the majority rule portraied in the output above, we should use 5 different clusters to fit our data the best. It is also sensible when looking at the dendrogram and the dissimilarity matrix.
mydata2$ClusterWard <- cutree(ward,
k = 5)
head(mydata2, 10)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 C11051 2128.151695 1.000000 0.00 0.00
## 2 C18220 5.416921 0.545455 288.80 0.00
## 3 C14908 49.564532 1.000000 433.50 273.94
## 4 C18691 919.563867 1.000000 3000.00 0.00
## 5 C14164 4663.032241 1.000000 1325.31 1325.31
## 6 C19027 133.099445 1.000000 1170.00 0.00
## 7 C11345 2894.348509 1.000000 657.71 310.70
## 8 C18752 1869.632550 1.000000 0.00 0.00
## 9 C11858 2270.955171 1.000000 25.26 25.26
## 10 C18449 51.089340 0.818182 140.75 0.00
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 -0.8680692 19.34573 0.000000
## 2 1.2815297 0.00000 0.916667
## 3 1.4769490 0.00000 0.909091
## 4 -0.6531101 823.97913 0.166667
## 5 -0.8680692 0.00000 0.333333
## 6 1.4534986 0.00000 1.000000
## 7 1.4964914 3384.77264 1.000000
## 8 -0.8680692 1897.99301 0.000000
## 9 -0.8680692 452.80366 0.083333
## 10 0.2067316 0.00000 0.416667
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 0.000000 0.000000
## 2 0.000000 0.833333
## 3 0.181818 0.909091
## 4 0.000000 0.083333
## 5 0.333333 0.000000
## 6 0.000000 0.900000
## 7 0.166667 0.916667
## 8 0.000000 0.000000
## 9 0.083333 0.000000
## 10 0.000000 0.416667
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 0.083333 1 -0.58425072 2100 138.9072
## 2 0.000000 0 -0.14327299 3000 254.4436
## 3 0.000000 0 -0.06309522 9000 465.4661
## 4 0.083333 1 -0.50407295 2000 4426.4983
## 5 0.000000 0 -0.38380630 4500 2064.1108
## 6 0.000000 0 -0.10318410 1000 1068.5025
## 7 0.583333 17 0.25761586 3500 3166.7656
## 8 0.166667 4 -0.58425072 2000 454.3482
## 9 0.250000 3 -0.54416183 3100 503.1965
## 10 0.000000 0 -0.38380630 1000 347.2401
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE ONEOFF_PURCHASES_FREQUENCYF
## 1 13266.38190 0.000000 12 low
## 2 32.47525 0.181818 12 low
## 3 148.91458 0.272727 11 low
## 4 1451.75634 0.083333 12 low
## 5 6004.09418 0.000000 12 low
## 6 161.81650 0.444444 10 low
## 7 1006.61472 0.000000 12 low
## 8 452.59794 0.000000 12 low
## 9 815.40061 0.000000 12 low
## 10 238.81136 0.111111 12 low
## PURCHASES_INSTALLMENTS_FREQUENCYF BALANCE_z CASH_ADVANCE_z
## 1 low 0.20305421 -0.47473357
## 2 high -0.77465367 -0.48547708
## 3 high -0.75431978 -0.48547708
## 4 low -0.35360782 -0.02788623
## 5 low 1.37059183 -0.48547708
## 6 high -0.71584453 -0.48547708
## 7 high 0.55595589 1.39423194
## 8 low 0.08398318 0.56855962
## 9 low 0.26882789 -0.23401584
## 10 low -0.75361747 -0.48547708
## PRC_FULL_PAYMENT_z PAYMENTS_z Dissimilarity ClusterWard
## 1 -0.5241935 -0.6866714 1.451824 1
## 2 0.1043182 -0.6336334 1.706164 2
## 3 0.4185741 -0.5367618 1.858323 2
## 4 -0.2361265 1.2815834 1.582615 1
## 5 -0.5241935 0.1971095 1.824470 1
## 6 1.0121686 -0.2599328 1.990848 2
## 7 -0.5241935 0.7032925 2.308294 2
## 8 -0.5241935 -0.5418655 1.411950 1
## 9 -0.5241935 -0.5194413 1.311978 1
## 10 -0.1401030 -0.5910344 1.167314 1
Performing K-MEANS clustering
K_Means <- hkmeans(mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")],
k = 5,
hc.metric = "euclidean",
hc.method = "ward.D2")
#Showing the K-means clusters performed
K_Means
## Hierarchical K-means clustering with 5 clusters of sizes 142, 66, 41, 29, 13
##
## Cluster means:
## BALANCE_z PURCHASES_TRX CASH_ADVANCE_z INSTALLMENTS_PURCHASES
## 1 -0.1799033 -0.4589024 -0.1748947 -0.7034867
## 2 -0.1994935 0.4489492 -0.3278804 1.2668574
## 3 -0.7379063 0.0190381 -0.4653924 0.3264442
## 4 1.9332450 -0.3948653 2.3148014 -0.6005484
## 5 0.9925302 3.5541557 -0.1210006 1.5626323
## PRC_FULL_PAYMENT_z PAYMENTS_z
## 1 -0.4444159 -0.2838668
## 2 -0.2473679 -0.2266477
## 3 2.2011945 -0.2658485
## 4 -0.3068373 1.3838796
## 5 -0.1474889 2.0027006
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 19 20 22
## 1 2 2 1 1 2 2 1 1 1 3 1 1 3 1 1 2 2 1 2
## 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 43
## 2 2 4 1 1 4 2 2 3 1 1 1 1 1 3 1 2 2 2 2
## 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
## 1 1 3 2 2 1 2 4 1 3 1 1 1 1 1 5 1 1 1 1
## 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 82 83 84
## 3 1 4 5 1 1 2 1 1 1 2 5 1 1 4 1 1 2 1 1
## 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
## 1 4 1 2 1 1 3 1 1 1 3 2 2 2 3 2 2 1 4 2
## 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124
## 4 4 2 4 1 1 5 1 4 1 1 1 1 1 1 1 1 1 1 2
## 125 126 127 128 129 130 131 132 133 134 135 137 138 139 140 141 142 143 144 145
## 2 3 3 2 1 5 2 3 5 1 4 1 4 1 2 1 1 1 2 3
## 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
## 3 1 4 3 1 1 2 1 1 1 3 1 2 2 2 1 5 1 3 1
## 166 167 168 169 170 171 172 173 174 175 177 178 179 180 181 182 183 184 185 186
## 2 1 4 1 3 1 1 1 3 1 3 1 2 1 4 5 4 1 3 1
## 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
## 1 1 1 2 1 2 5 1 1 1 4 1 3 2 2 1 3 2 1 1
## 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
## 2 1 2 1 3 1 1 1 2 1 1 1 3 3 4 1 1 2 1 2
## 229 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249
## 1 3 3 1 1 1 1 1 1 1 4 1 2 2 2 2 1 5 1 5
## 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
## 1 4 1 1 1 3 4 3 3 1 2 1 2 2 4 1 4 1 4 1
## 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
## 1 3 2 2 3 2 2 1 3 4 2 1 4 1 3 3 2 4 1 5
## 290 291 292 293 294 295 296 297 298 299 300
## 2 1 3 3 4 1 5 2 3 3 1
##
## Within cluster sum of squares by cluster:
## [1] 141.29704 100.18938 83.75134 210.38105 83.84743
## (between_SS / total_SS = 64.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
The larger the between/total SS => the higher the difference between clusters (different pertaining information/data). The ratio of between_SS / total_SS is valued at 64.4%, which suffices for our analysis (a value above 50% is deemed acceptable). The ratio itself tells us the ratio between the
In the next step, I used graphical means of presenting the clusters:
fviz_cluster(K_Means,
repel = FALSE,
ggtheme = theme_gray(),
palette = "Dark2")
The clusters appear to be overlapping heavily, this is however not necessarily the case, since the plot is presented in 2 dimension, yet was not measured on only 2 dimensions. If we were to present the data in more dimension, we would observe less overlapping.
In the next step, I will be looking at the potential re-classification that may have occurred.
mydata2$ClusterK_Means <- K_Means$cluster
mydata2$ClusterK_Means <- as.factor(mydata2$ClusterK_Means)
table(mydata2$ClusterWard, mydata2$ClusterK_Means)
##
## 1 2 3 4 5
## 1 140 0 0 0 0
## 2 1 65 0 0 1
## 3 0 0 41 0 1
## 4 1 0 0 29 0
## 5 0 1 0 0 11
From the table above, we can observe that based on the variables used, re-classifications did in fact occur. Namely, from cluster 2, there were 1 credit card holder re-classified to cluster 1. From cluster 3, 1 holder was re-classified to cluster 5. From cluster 4 there was 1 re-classification to cluster 1. From cluster 5, there was 1 re-classification to cluster 2.
centroids <- K_Means$centers
centroids
## BALANCE_z PURCHASES_TRX CASH_ADVANCE_z INSTALLMENTS_PURCHASES
## 1 -0.1799033 -0.4589024 -0.1748947 -0.7034867
## 2 -0.1994935 0.4489492 -0.3278804 1.2668574
## 3 -0.7379063 0.0190381 -0.4653924 0.3264442
## 4 1.9332450 -0.3948653 2.3148014 -0.6005484
## 5 0.9925302 3.5541557 -0.1210006 1.5626323
## PRC_FULL_PAYMENT_z PAYMENTS_z
## 1 -0.4444159 -0.2838668
## 2 -0.2473679 -0.2266477
## 3 2.2011945 -0.2658485
## 4 -0.3068373 1.3838796
## 5 -0.1474889 2.0027006
library(ggplot2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
figure <- as.data.frame(centroids)
figure$ID <- 1:nrow(figure)
figure <- pivot_longer(figure,
cols = c(BALANCE_z, PURCHASES_TRX, CASH_ADVANCE_z, INSTALLMENTS_PURCHASES, PRC_FULL_PAYMENT_z, PAYMENTS_z))
figure$Groups <- factor(figure$ID,
levels = c(1,2,3,4,5),
labels = c("Cluster 1", "Cluster 2","Cluster 3", "Cluster 4", "Cluster 5"))
figure$nameFactor <- factor(figure$name,
levels = c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z"),
labels = c("BALANCE", "PURCHASESx", "$ADV", "INSTALLMENT_P", "%FULLPAY", "PAYMENTS"))
ggplot(figure, aes(x = nameFactor, y = value))+
geom_hline(yintercept = 0)+
theme_gray()+
geom_line(aes(group = ID), linewidth = 0.7)+
geom_point(aes(shape = Groups, col = Groups), size = 2)+
ylab("Averages") +
xlab("Clustering variables")
From the above chart we can conclude the following: Cluster 5 has an above average balance on their accounts (second highest), it has the highest number of purchases, yet a bellow average of advance payments, the highest amount of installment payments, a bellow average level of payments made in full and the highest cumulative amount of payments. This seems to be the segment with the highest level of consumption, and a propensity to spend in installments. This seems to be the most profitable among the segments to target with marketing campaigns.
Cluster 4 has the highest balance on their account among the segments, it has the lowest number of purchases, the highest amount of cash advances, a bellow average number of payments by installments (second lowest), a bellow average percent of payments made in full and an above average (second highest) cumulative amount of payments. This seems to be a segment denoted by high income, high consumption and seems to be able to take on credit. This would be the second most profitable segment to target with a marketing campaign.
The credit card holders from the 1st cluster have a bellow average balance on their account, take on the lowest amount of purchases, put up a bellow average cash advance, use installment purchases the least, have the lowest percentage of full payments and a bellow average level (while still being the lowest) of the cumulative amount of payments. This would appear to be a lower income segment, or at the least a segment denoted by low consumption.
The 2nd cluster has bellow average balance, above average number of payments, a bellow average level of cash advances, they are however above average in installment payments (second highest), a bellow average level of full payments and a bellow average level of cumulative amount of payments. This segments appears to be keen on installment payments, and consumes most of the income with financing.
The 3rd cluster has the lowest balance among the segments, average number of purchases, the lowest level of cash advances, an above average level of installment payments, it has however the highest level of full payments and a bellow average cumulative amount of payments. This cluster appears to have used high amounts of financing or consuming in the past, and has fulfilled their payment obligations.
#In the next phase, I will be testing the clustering variables using ANOVA, in order to determine which of the variables pertains the most differences among the segments.
fit <- aov(cbind(BALANCE_z, PURCHASES_TRX, CASH_ADVANCE_z, INSTALLMENTS_PURCHASES, PRC_FULL_PAYMENT_z, PAYMENTS_z) ~ as.factor(ClusterK_Means), data = mydata2)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 150.74 37.685 77.394 < 2.2e-16 ***
## Residuals 286 139.26 0.487
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 211.959 52.990 194.2 < 2.2e-16 ***
## Residuals 286 78.041 0.273
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 175.9 43.975 110.23 < 2.2e-16 ***
## Residuals 286 114.1 0.399
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 222.772 55.693 236.93 < 2.2e-16 ***
## Residuals 286 67.228 0.235
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 5 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 233.753 58.438 297.14 < 2.2e-16 ***
## Residuals 286 56.247 0.197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 6 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 125.41 31.3524 54.479 < 2.2e-16 ***
## Residuals 286 164.59 0.5755
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA tests conducted above, we can observe that in fact all of the variables hold different means, between all of the clusters/segments. This is confirmed by the rejection of the null hypothesis, which states that the means of the values of the variables pertaining to each segment are the same. Due to the p value being < 0.05 in all cases, we can confirm the alternative hypothesis. Judging by the F-statistics, we can say that the variables NUMBER OF PURCHASES, INSTALLMENT PURCHASES and PERCENT OF FULL PAYMENTS are the variables which account for the most differences between the segments.
# VALIDATION
# I will now check for if there is a statistically significant difference between clusters, in regards to the frequency of their specific actions, which the credit card provides. Namely, if the cluster prefers to buy in installments or in one go.
# The null hypothesis for the chisquared test was: H0:There is no association between the two categorical variables (low/high frequency of installments payments)
# H1 there are differences
chisq_results <- chisq.test(mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF, as.factor(mydata2$ClusterK_Means))
## Warning in chisq.test(mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF,
## as.factor(mydata2$ClusterK_Means)): Chi-squared approximation may be incorrect
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF and as.factor(mydata2$ClusterK_Means)
## X-squared = 224.22, df = 4, p-value < 2.2e-16
addmargins(chisq_results$observed)
##
## 1 2 3 4 5 Sum
## high 1 64 19 2 13 99
## low 141 2 22 27 0 192
## Sum 142 66 41 29 13 291
round(chisq_results$expected, 2)
##
## 1 2 3 4 5
## high 48.31 22.45 13.95 9.87 4.42
## low 93.69 43.55 27.05 19.13 8.58
round(chisq_results$res, 2)
##
## 1 2 3 4 5
## high -6.81 8.77 1.35 -2.50 4.08
## low 4.89 -6.30 -0.97 1.80 -2.93
From the results of the chisq test, we can reject H0 at p-value < 0.05. From this we can conclude, there is an association between the clusters of the categorical variable PURCHASES_INSTALLMENTS_FREQUENCYF.
#VALIDATION - Tenure
fit <- aov(TENURE ~ as.factor(ClusterK_Means), data = mydata2)
# H0: µ(cluster1, TENURE) = µ(cluster1,2,3,4,5 TENURE)
# H1: µ(cluster1, TENURE) ≠ µ(cluster1,2,3,4,5 TENURE)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 5.1 1.277 0.789 0.533
## Residuals 286 462.7 1.618
Do to the P-value of the anova test being above 0.05, we can not reject the null hypothesis, from which we can conclude that there are no statistically significant differences, between the average tenures among the clusters.
CONCLUSTION: I performed hierarchical clustering of 300 units based on 6 cluster variables (squared Euclidean distance, Ward’s algorithm). The classification was further optimized by K-Means clustering.
The resulting conclusion is as follows: I believe that the most profitable segments to target with the marketing campaign are segments 4 and 5. This is mainly due to the fact that they possess the highest balance on their account, while also being the largest consumers among the segments. They outperform other segments by a large margin on both accounts. The description of the segments can be located bellow the cluster chart, but for ease of access, I will once again copy the analysis of the two clusters according to the used variables once again in the following pharagraphs:
Cluster 5 has an above average balance on their accounts (second highest), it has the highest number of purchases, yet a bellow average of advance payments, the highest amount of installment payments, a bellow average level of payments made in full and the highest cumulative amount of payments. This seems to be the segment with the highest level of consumption, and a propensity to spend in installments. This seems to be the most profitable among the segments to target with marketing campaigns.
Cluster 4 has the highest balance on their account among the segments, it has the lowest number of purchases, the highest amount of cash advances, a bellow average number of payments by installments (second lowest), a bellow average percent of payments made in full and an above average (second highest) cumulative amount of payments. This seems to be a segment denoted by high income, high consumption and seems to be able to take on credit. This would be the second most profitable segment to target with a marketing campaign.