HW4 - Clustering

#install.packages(ggplot2)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

#install.packages("ggfortify")
library(ggfortify)

## Warning: package 'ggfortify' was built under R version 4.2.2

#install.packages("ranger")
library(ranger)

## Warning: package 'ranger' was built under R version 4.2.2

#install.packages("dplyr")
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#install.packages("Hmisc")
library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.2.2

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

#install.packages("factoextra")
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.2.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#install.packages("cluster")
library(cluster)

## Warning: package 'cluster' was built under R version 4.2.2

#install.packages("magrittr")
library(magrittr)
#install.packages("NbClust")
library("NbClust")

mydata <- read.csv("~/IMB/Multivariate analysis/HOMEWORK MVA/HW4/CC GENERAL.csv", sep = ",")

set.seed(1)
mydata1 <- sample_n(mydata, 300)
head(mydata1, 10)

##    CUST_ID     BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1   C11051 2128.151695          1.000000      0.00             0.00
## 2   C18220    5.416921          0.545455    288.80             0.00
## 3   C14908   49.564532          1.000000    433.50           273.94
## 4   C18691  919.563867          1.000000   3000.00             0.00
## 5   C14164 4663.032241          1.000000   1325.31          1325.31
## 6   C19027  133.099445          1.000000   1170.00             0.00
## 7   C11345 2894.348509          1.000000    657.71           310.70
## 8   C18752 1869.632550          1.000000      0.00             0.00
## 9   C11858 2270.955171          1.000000     25.26            25.26
## 10  C18449   51.089340          0.818182    140.75             0.00
##    INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1                    0.00     19.34573            0.000000
## 2                  288.80      0.00000            0.916667
## 3                  159.56      0.00000            0.909091
## 4                 3000.00    823.97913            0.166667
## 5                    0.00      0.00000            0.333333
## 6                 1170.00      0.00000            1.000000
## 7                  347.01   3384.77264            1.000000
## 8                    0.00   1897.99301            0.000000
## 9                    0.00    452.80366            0.083333
## 10                 140.75      0.00000            0.416667
##    ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                    0.000000                         0.000000
## 2                    0.000000                         0.833333
## 3                    0.181818                         0.909091
## 4                    0.000000                         0.083333
## 5                    0.333333                         0.000000
## 6                    0.000000                         0.900000
## 7                    0.166667                         0.916667
## 8                    0.000000                         0.000000
## 9                    0.083333                         0.000000
## 10                   0.000000                         0.416667
##    CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT  PAYMENTS
## 1                0.083333                1             0         2100  138.9072
## 2                0.000000                0            11         3000  254.4436
## 3                0.000000                0            13         9000  465.4661
## 4                0.083333                1             2         2000 4426.4983
## 5                0.000000                0             5         4500 2064.1108
## 6                0.000000                0            12         1000 1068.5025
## 7                0.583333               17            21         3500 3166.7656
## 8                0.166667                4             0         2000  454.3482
## 9                0.250000                3             1         3100  503.1965
## 10               0.000000                0             5         1000  347.2401
##    MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1       13266.38190         0.000000     12
## 2          32.47525         0.181818     12
## 3         148.91458         0.272727     11
## 4        1451.75634         0.083333     12
## 5        6004.09418         0.000000     12
## 6         161.81650         0.444444     10
## 7        1006.61472         0.000000     12
## 8         452.59794         0.000000     12
## 9         815.40061         0.000000     12
## 10        238.81136         0.111111     12

Description of the dataset: This case requires to develop a customer segmentation to define marketing strategy. The sample Dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables. The dataset is sourced from Kaggle: “https://www.kaggle.com/datasets/arjunbhasin2013/ccdata?resource=download”

Following is the Data Dictionary for Credit Card data set :

CUST_ID : Identification of Credit Card holder (Categorical)

BALANCE : Balance amount left in their account to make purchases

BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

PURCHASES : Amount of purchases made from account

ONEOFF_PURCHASES : Maximum purchase amount done in one-go

INSTALLMENTS_PURCHASES : Amount of purchase done in installment

CASH_ADVANCE : Cash in advance given by the user

PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)

PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)

CASHADVANCEFREQUENCY : How frequently the cash in advance being paid

CASHADVANCETRX : Number of Transactions made with “Cash in Advanced”

PURCHASES_TRX : Numbe of purchase transactions made

CREDIT_LIMIT : Limit of Credit Card for user

PAYMENTS : Amount of Payment done by user

MINIMUM_PAYMENTS : Minimum amount of payments made by user

PRCFULLPAYMENT : Percent of full payment paid by user

TENURE : Tenure of credit card service for user

RESEARCH QUESTION: I will try to cluster the 300 active card holders based on 6 variables, and analyze the differences between the pertaining clusters.

mydata2 <- na.omit(mydata1)

mydata2$ONEOFF_PURCHASES_FREQUENCYF <- ifelse(mydata2$ONEOFF_PURCHASES_FREQUENCY < 0.5, "low", "high")   
mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF <- ifelse(mydata2$PURCHASES_INSTALLMENTS_FREQUENCY < 0.5, "low", "high")

head(mydata2)

##   CUST_ID     BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1  C11051 2128.151695          1.000000      0.00             0.00
## 2  C18220    5.416921          0.545455    288.80             0.00
## 3  C14908   49.564532          1.000000    433.50           273.94
## 4  C18691  919.563867          1.000000   3000.00             0.00
## 5  C14164 4663.032241          1.000000   1325.31          1325.31
## 6  C19027  133.099445          1.000000   1170.00             0.00
##   INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1                   0.00     19.34573            0.000000
## 2                 288.80      0.00000            0.916667
## 3                 159.56      0.00000            0.909091
## 4                3000.00    823.97913            0.166667
## 5                   0.00      0.00000            0.333333
## 6                1170.00      0.00000            1.000000
##   ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                   0.000000                         0.000000
## 2                   0.000000                         0.833333
## 3                   0.181818                         0.909091
## 4                   0.000000                         0.083333
## 5                   0.333333                         0.000000
## 6                   0.000000                         0.900000
##   CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT  PAYMENTS
## 1               0.083333                1             0         2100  138.9072
## 2               0.000000                0            11         3000  254.4436
## 3               0.000000                0            13         9000  465.4661
## 4               0.083333                1             2         2000 4426.4983
## 5               0.000000                0             5         4500 2064.1108
## 6               0.000000                0            12         1000 1068.5025
##   MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE ONEOFF_PURCHASES_FREQUENCYF
## 1      13266.38190         0.000000     12                         low
## 2         32.47525         0.181818     12                         low
## 3        148.91458         0.272727     11                         low
## 4       1451.75634         0.083333     12                         low
## 5       6004.09418         0.000000     12                         low
## 6        161.81650         0.444444     10                         low
##   PURCHASES_INSTALLMENTS_FREQUENCYF
## 1                               low
## 2                              high
## 3                              high
## 4                               low
## 5                               low
## 6                              high

mydata2$BALANCE_z <- scale(mydata2$BALANCE)
mydata2$PURCHASES_TRX <- scale(mydata2$PURCHASES_TRX)
mydata2$CASH_ADVANCE_z <- scale(mydata2$CASH_ADVANCE)
mydata2$PRC_FULL_PAYMENT_z <- scale(mydata2$PRC_FULL_PAYMENT)
mydata2$PAYMENTS_z <- scale(mydata2$PAYMENTS)
mydata2$INSTALLMENTS_PURCHASES <- scale(mydata2$PURCHASES_INSTALLMENTS_FREQUENCY)

  
#using the scale function to standardize the variables, so that the variables will have the same effect
#Zx1=(X1-(X1mean))/Sx1   <- xq(meanstd), S^x1=1

rcorr(as.matrix(mydata2[, c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")]), type = "pearson")

##                        BALANCE_z PURCHASES_TRX CASH_ADVANCE_z
## BALANCE_z                   1.00          0.14           0.49
## PURCHASES_TRX               0.14          1.00          -0.10
## CASH_ADVANCE_z              0.49         -0.10           1.00
## INSTALLMENTS_PURCHASES     -0.08          0.66          -0.22
## PRC_FULL_PAYMENT_z         -0.35          0.07          -0.14
## PAYMENTS_z                  0.32          0.40           0.49
##                        INSTALLMENTS_PURCHASES PRC_FULL_PAYMENT_z PAYMENTS_z
## BALANCE_z                               -0.08              -0.35       0.32
## PURCHASES_TRX                            0.66               0.07       0.40
## CASH_ADVANCE_z                          -0.22              -0.14       0.49
## INSTALLMENTS_PURCHASES                   1.00               0.20       0.10
## PRC_FULL_PAYMENT_z                       0.20               1.00       0.02
## PAYMENTS_z                               0.10               0.02       1.00
## 
## n= 291 
## 
## 
## P
##                        BALANCE_z PURCHASES_TRX CASH_ADVANCE_z
## BALANCE_z                        0.0176        0.0000        
## PURCHASES_TRX          0.0176                  0.0857        
## CASH_ADVANCE_z         0.0000    0.0857                      
## INSTALLMENTS_PURCHASES 0.1958    0.0000        0.0001        
## PRC_FULL_PAYMENT_z     0.0000    0.2333        0.0157        
## PAYMENTS_z             0.0000    0.0000        0.0000        
##                        INSTALLMENTS_PURCHASES PRC_FULL_PAYMENT_z PAYMENTS_z
## BALANCE_z              0.1958                 0.0000             0.0000    
## PURCHASES_TRX          0.0000                 0.2333             0.0000    
## CASH_ADVANCE_z         0.0001                 0.0157             0.0000    
## INSTALLMENTS_PURCHASES                        0.0005             0.0750    
## PRC_FULL_PAYMENT_z     0.0005                                    0.7296    
## PAYMENTS_z             0.0750                 0.7296

#checking for correlation between variables

mydata2$Dissimilarity <- sqrt(mydata2$BALANCE_z^2 + mydata2$PURCHASES_TRX^2 + mydata2$CASH_ADVANCE_z^2 + mydata2$INSTALLMENTS_PURCHASES^2 + mydata2$PRC_FULL_PAYMENT_z^2 + mydata2$PAYMENTS_z^2) #Finding outliers in regards to dissimilarity

head(mydata2[order(-mydata2$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity

##     CUST_ID   BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 108  C11261  1592.168          1.000000      0.00             0.00
## 193  C11819  5383.084          1.000000   9736.61          3684.52
## 289  C12078  5919.199          1.000000   9321.13          4678.79
## 282  C10750 13774.742          1.000000    404.24             0.00
## 294  C16238  1226.562          0.857143    360.00           360.00
## 67   C16604  3746.919          1.000000   7468.37          1310.10
## 28   C14758  4768.063          1.000000    374.81           333.20
## 75   C17716  6388.821          1.000000   5154.75          1622.06
## 86   C15758  1357.969          0.454545      0.00             0.00
## 287  C15344   481.961          0.909091    967.23           585.50
##     INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 108             -0.8680692    15084.009            0.000000
## 193              1.7114505     2965.657            1.000000
## 289              1.7114505        0.000            1.000000
## 282             -0.2231893     3369.475            0.250000
## 294             -0.8680692     7883.542            0.857143
## 67               1.7114505        0.000            1.000000
## 28              -0.6531101     9917.639            0.333333
## 75               1.7114505        0.000            1.000000
## 86              -0.8680692     2903.355            0.000000
## 287             -0.4381484     6718.128            0.250000
##     ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 108                   0.000000                         0.000000
## 193                   0.416667                         1.000000
## 289                   1.000000                         1.000000
## 282                   0.000000                         0.250000
## 294                   0.857143                         0.000000
## 67                    0.750000                         1.000000
## 28                    0.250000                         0.083333
## 75                    0.500000                         1.000000
## 86                    0.000000                         0.000000
## 287                   0.166667                         0.166667
##     CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 108               0.333333               53   -0.58425072         4500
## 193               0.250000                7    4.82774868         9000
## 289               0.000000                0    6.27094853        15000
## 282               0.500000                7   -0.46398406        14500
## 294               0.857143               30   -0.34371741         2500
## 67                0.000000                0    4.98810422        10000
## 28                0.833333               18   -0.42389518         7000
## 75                0.000000                0    4.02597100         9500
## 86                0.250000                6   -0.58425072         6000
## 287               0.666667               19    0.01708255         1400
##      PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 108 11382.366         472.7032         0.166667     12
## 193 15824.736        3364.0212         0.083333     12
## 289  7303.898        1395.7151         0.000000     12
## 282  3167.871        3533.4648         0.000000     12
## 294 10132.432         920.2878         0.600000      7
## 67   4793.231         949.0611         0.000000     12
## 28   5283.297        1589.2939         0.000000     12
## 75   6436.518        1823.7896         0.000000     12
## 86  12353.265         717.7885         0.200000     12
## 287  9282.027         217.8233         0.571429     12
##     ONEOFF_PURCHASES_FREQUENCYF PURCHASES_INSTALLMENTS_FREQUENCYF   BALANCE_z
## 108                         low                               low -0.04381369
## 193                         low                              high  1.70223980
## 289                        high                              high  1.94916799
## 282                         low                               low  5.56734317
## 294                        high                               low -0.21220794
## 67                         high                              high  0.94864033
## 28                          low                               low  1.41896770
## 75                         high                              high  2.16547068
## 86                          low                               low -0.15168316
## 287                         low                               low -0.55516280
##     CASH_ADVANCE_z PRC_FULL_PAYMENT_z PAYMENTS_z Dissimilarity
## 108      7.8913180         0.05194399  4.4747329      9.132122
## 193      1.1614790        -0.23612650  6.5140403      8.542327
## 289     -0.4854771        -0.52419354  2.6024779      7.303182
## 282      1.3857362        -0.52419354  0.7037999      5.826730
## 294      3.8925908         1.54989744  3.9009405      5.804180
## 67      -0.4854771        -0.52419354  1.4499353      5.596690
## 28       5.0222121        -0.52419354  1.6749041      5.560792
## 75      -0.4854771        -0.52419354  2.2043001      5.403348
## 86       1.1268803         0.16717012  4.9204326      5.160074
## 287      3.2453865         1.45113268  3.5105547      5.046066

There was no severe jump in dissimilarity among the units of observation, so I presume it will not hurt the coming analysis, if I leave all of the units in the data set.

#mydata2 <- mydata2[-387, -447, - 481 ]
#removing the units with the highest dissimilarity

library(factoextra)

#Calculating Euclidean distances
distance <- get_dist(mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")], 
                     method = "euclidian")

distance2 <- distance^2 #squared euklidian distance 

fviz_dist(distance2) #Showing dissimilarity matrix

get_clust_tendency(mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")], #Hopkins statistics
                   n = nrow(mydata2) - 1, 
                   graph = FALSE)

## $hopkins_stat
## [1] 0.8874106
## 
## $plot
## NULL

Judging from the graph the data is clustarable. Furthermore, the Hopkins statistic, which is used to add a layer of objectivity to the possibility of clustering, confirms the latter. An appropriate statistic would be higher than 0.5. In our instance it is valued at approx. 0.89, this means the data is optimal for clustering.

ward <- mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")] %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")


fviz_dend(ward,
          k = 5,
          cex = 0.5,
          pallete = "Dark2",
          color_labels_by_k = TRUE,
          rect = TRUE)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <]8;;https://github.com/kassambara/factoextra/issueshttps://github.com/kassambara/factoextra/issues]8;;>.

fviz_dend(ward,
          k = 4,
          cex = 0.5,
          pallete = "Dark2",
          color_labels_by_k = TRUE,
          rect = TRUE)

Optimal_N_Clusters <- mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")] %>%
  NbClust(distance = "euclidean",
          min.nc = 2,
          max.nc = 10,
          method = "ward.D2",
          index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 1 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 4 proposed 4 as the best number of clusters 
## * 6 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  5 
##  
##  
## *******************************************************************

According to the majority rule portraied in the output above, we should use 5 different clusters to fit our data the best. It is also sensible when looking at the dendrogram and the dissimilarity matrix.

mydata2$ClusterWard <- cutree(ward,
                              k = 5)
head(mydata2, 10)

##    CUST_ID     BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1   C11051 2128.151695          1.000000      0.00             0.00
## 2   C18220    5.416921          0.545455    288.80             0.00
## 3   C14908   49.564532          1.000000    433.50           273.94
## 4   C18691  919.563867          1.000000   3000.00             0.00
## 5   C14164 4663.032241          1.000000   1325.31          1325.31
## 6   C19027  133.099445          1.000000   1170.00             0.00
## 7   C11345 2894.348509          1.000000    657.71           310.70
## 8   C18752 1869.632550          1.000000      0.00             0.00
## 9   C11858 2270.955171          1.000000     25.26            25.26
## 10  C18449   51.089340          0.818182    140.75             0.00
##    INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1              -0.8680692     19.34573            0.000000
## 2               1.2815297      0.00000            0.916667
## 3               1.4769490      0.00000            0.909091
## 4              -0.6531101    823.97913            0.166667
## 5              -0.8680692      0.00000            0.333333
## 6               1.4534986      0.00000            1.000000
## 7               1.4964914   3384.77264            1.000000
## 8              -0.8680692   1897.99301            0.000000
## 9              -0.8680692    452.80366            0.083333
## 10              0.2067316      0.00000            0.416667
##    ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                    0.000000                         0.000000
## 2                    0.000000                         0.833333
## 3                    0.181818                         0.909091
## 4                    0.000000                         0.083333
## 5                    0.333333                         0.000000
## 6                    0.000000                         0.900000
## 7                    0.166667                         0.916667
## 8                    0.000000                         0.000000
## 9                    0.083333                         0.000000
## 10                   0.000000                         0.416667
##    CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT  PAYMENTS
## 1                0.083333                1   -0.58425072         2100  138.9072
## 2                0.000000                0   -0.14327299         3000  254.4436
## 3                0.000000                0   -0.06309522         9000  465.4661
## 4                0.083333                1   -0.50407295         2000 4426.4983
## 5                0.000000                0   -0.38380630         4500 2064.1108
## 6                0.000000                0   -0.10318410         1000 1068.5025
## 7                0.583333               17    0.25761586         3500 3166.7656
## 8                0.166667                4   -0.58425072         2000  454.3482
## 9                0.250000                3   -0.54416183         3100  503.1965
## 10               0.000000                0   -0.38380630         1000  347.2401
##    MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE ONEOFF_PURCHASES_FREQUENCYF
## 1       13266.38190         0.000000     12                         low
## 2          32.47525         0.181818     12                         low
## 3         148.91458         0.272727     11                         low
## 4        1451.75634         0.083333     12                         low
## 5        6004.09418         0.000000     12                         low
## 6         161.81650         0.444444     10                         low
## 7        1006.61472         0.000000     12                         low
## 8         452.59794         0.000000     12                         low
## 9         815.40061         0.000000     12                         low
## 10        238.81136         0.111111     12                         low
##    PURCHASES_INSTALLMENTS_FREQUENCYF   BALANCE_z CASH_ADVANCE_z
## 1                                low  0.20305421    -0.47473357
## 2                               high -0.77465367    -0.48547708
## 3                               high -0.75431978    -0.48547708
## 4                                low -0.35360782    -0.02788623
## 5                                low  1.37059183    -0.48547708
## 6                               high -0.71584453    -0.48547708
## 7                               high  0.55595589     1.39423194
## 8                                low  0.08398318     0.56855962
## 9                                low  0.26882789    -0.23401584
## 10                               low -0.75361747    -0.48547708
##    PRC_FULL_PAYMENT_z PAYMENTS_z Dissimilarity ClusterWard
## 1          -0.5241935 -0.6866714      1.451824           1
## 2           0.1043182 -0.6336334      1.706164           2
## 3           0.4185741 -0.5367618      1.858323           2
## 4          -0.2361265  1.2815834      1.582615           1
## 5          -0.5241935  0.1971095      1.824470           1
## 6           1.0121686 -0.2599328      1.990848           2
## 7          -0.5241935  0.7032925      2.308294           2
## 8          -0.5241935 -0.5418655      1.411950           1
## 9          -0.5241935 -0.5194413      1.311978           1
## 10         -0.1401030 -0.5910344      1.167314           1

Performing K-MEANS clustering

K_Means <- hkmeans(mydata2[c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z")],
                   k = 5,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

#Showing the K-means clusters performed
K_Means

## Hierarchical K-means clustering with 5 clusters of sizes 142, 66, 41, 29, 13
## 
## Cluster means:
##    BALANCE_z PURCHASES_TRX CASH_ADVANCE_z INSTALLMENTS_PURCHASES
## 1 -0.1799033    -0.4589024     -0.1748947             -0.7034867
## 2 -0.1994935     0.4489492     -0.3278804              1.2668574
## 3 -0.7379063     0.0190381     -0.4653924              0.3264442
## 4  1.9332450    -0.3948653      2.3148014             -0.6005484
## 5  0.9925302     3.5541557     -0.1210006              1.5626323
##   PRC_FULL_PAYMENT_z PAYMENTS_z
## 1         -0.4444159 -0.2838668
## 2         -0.2473679 -0.2266477
## 3          2.2011945 -0.2658485
## 4         -0.3068373  1.3838796
## 5         -0.1474889  2.0027006
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  14  15  16  17  18  19  20  22 
##   1   2   2   1   1   2   2   1   1   1   3   1   1   3   1   1   2   2   1   2 
##  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  43 
##   2   2   4   1   1   4   2   2   3   1   1   1   1   1   3   1   2   2   2   2 
##  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 
##   1   1   3   2   2   1   2   4   1   3   1   1   1   1   1   5   1   1   1   1 
##  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  82  83  84 
##   3   1   4   5   1   1   2   1   1   1   2   5   1   1   4   1   1   2   1   1 
##  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 
##   1   4   1   2   1   1   3   1   1   1   3   2   2   2   3   2   2   1   4   2 
## 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 
##   4   4   2   4   1   1   5   1   4   1   1   1   1   1   1   1   1   1   1   2 
## 125 126 127 128 129 130 131 132 133 134 135 137 138 139 140 141 142 143 144 145 
##   2   3   3   2   1   5   2   3   5   1   4   1   4   1   2   1   1   1   2   3 
## 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 
##   3   1   4   3   1   1   2   1   1   1   3   1   2   2   2   1   5   1   3   1 
## 166 167 168 169 170 171 172 173 174 175 177 178 179 180 181 182 183 184 185 186 
##   2   1   4   1   3   1   1   1   3   1   3   1   2   1   4   5   4   1   3   1 
## 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 
##   1   1   1   2   1   2   5   1   1   1   4   1   3   2   2   1   3   2   1   1 
## 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 
##   2   1   2   1   3   1   1   1   2   1   1   1   3   3   4   1   1   2   1   2 
## 229 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 
##   1   3   3   1   1   1   1   1   1   1   4   1   2   2   2   2   1   5   1   5 
## 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 
##   1   4   1   1   1   3   4   3   3   1   2   1   2   2   4   1   4   1   4   1 
## 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 
##   1   3   2   2   3   2   2   1   3   4   2   1   4   1   3   3   2   4   1   5 
## 290 291 292 293 294 295 296 297 298 299 300 
##   2   1   3   3   4   1   5   2   3   3   1 
## 
## Within cluster sum of squares by cluster:
## [1] 141.29704 100.18938  83.75134 210.38105  83.84743
##  (between_SS / total_SS =  64.4 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

The larger the between/total SS => the higher the difference between clusters (different pertaining information/data). The ratio of between_SS / total_SS is valued at 64.4%, which suffices for our analysis (a value above 50% is deemed acceptable). The ratio itself tells us the ratio between the

In the next step, I used graphical means of presenting the clusters:

fviz_cluster(K_Means,
             repel = FALSE,
             ggtheme = theme_gray(),
             palette = "Dark2")

The clusters appear to be overlapping heavily, this is however not necessarily the case, since the plot is presented in 2 dimension, yet was not measured on only 2 dimensions. If we were to present the data in more dimension, we would observe less overlapping.

In the next step, I will be looking at the potential re-classification that may have occurred.

mydata2$ClusterK_Means <- K_Means$cluster
mydata2$ClusterK_Means <- as.factor(mydata2$ClusterK_Means)
table(mydata2$ClusterWard, mydata2$ClusterK_Means)

##    
##       1   2   3   4   5
##   1 140   0   0   0   0
##   2   1  65   0   0   1
##   3   0   0  41   0   1
##   4   1   0   0  29   0
##   5   0   1   0   0  11

From the table above, we can observe that based on the variables used, re-classifications did in fact occur. Namely, from cluster 2, there were 1 credit card holder re-classified to cluster 1. From cluster 3, 1 holder was re-classified to cluster 5. From cluster 4 there was 1 re-classification to cluster 1. From cluster 5, there was 1 re-classification to cluster 2.

centroids <- K_Means$centers
centroids

##    BALANCE_z PURCHASES_TRX CASH_ADVANCE_z INSTALLMENTS_PURCHASES
## 1 -0.1799033    -0.4589024     -0.1748947             -0.7034867
## 2 -0.1994935     0.4489492     -0.3278804              1.2668574
## 3 -0.7379063     0.0190381     -0.4653924              0.3264442
## 4  1.9332450    -0.3948653      2.3148014             -0.6005484
## 5  0.9925302     3.5541557     -0.1210006              1.5626323
##   PRC_FULL_PAYMENT_z PAYMENTS_z
## 1         -0.4444159 -0.2838668
## 2         -0.2473679 -0.2266477
## 3          2.2011945 -0.2658485
## 4         -0.3068373  1.3838796
## 5         -0.1474889  2.0027006

library(ggplot2)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

figure <- as.data.frame(centroids)
figure$ID <- 1:nrow(figure)
figure <- pivot_longer(figure,
                       cols = c(BALANCE_z, PURCHASES_TRX, CASH_ADVANCE_z, INSTALLMENTS_PURCHASES, PRC_FULL_PAYMENT_z, PAYMENTS_z))

figure$Groups <- factor(figure$ID,
                        levels = c(1,2,3,4,5),
                        labels = c("Cluster 1", "Cluster 2","Cluster 3", "Cluster 4", "Cluster 5"))

figure$nameFactor <- factor(figure$name, 
                            levels = c("BALANCE_z", "PURCHASES_TRX", "CASH_ADVANCE_z", "INSTALLMENTS_PURCHASES", "PRC_FULL_PAYMENT_z", "PAYMENTS_z"), 
                            labels = c("BALANCE", "PURCHASESx", "$ADV", "INSTALLMENT_P", "%FULLPAY", "PAYMENTS"))

ggplot(figure, aes(x = nameFactor, y = value))+
  geom_hline(yintercept = 0)+
  theme_gray()+
  geom_line(aes(group = ID), linewidth = 0.7)+
  geom_point(aes(shape = Groups, col = Groups), size = 2)+
  ylab("Averages") + 
  xlab("Clustering variables")

From the above chart we can conclude the following: Cluster 5 has an above average balance on their accounts (second highest), it has the highest number of purchases, yet a bellow average of advance payments, the highest amount of installment payments, a bellow average level of payments made in full and the highest cumulative amount of payments. This seems to be the segment with the highest level of consumption, and a propensity to spend in installments. This seems to be the most profitable among the segments to target with marketing campaigns.

Cluster 4 has the highest balance on their account among the segments, it has the lowest number of purchases, the highest amount of cash advances, a bellow average number of payments by installments (second lowest), a bellow average percent of payments made in full and an above average (second highest) cumulative amount of payments. This seems to be a segment denoted by high income, high consumption and seems to be able to take on credit. This would be the second most profitable segment to target with a marketing campaign.

The credit card holders from the 1st cluster have a bellow average balance on their account, take on the lowest amount of purchases, put up a bellow average cash advance, use installment purchases the least, have the lowest percentage of full payments and a bellow average level (while still being the lowest) of the cumulative amount of payments. This would appear to be a lower income segment, or at the least a segment denoted by low consumption.

The 2nd cluster has bellow average balance, above average number of payments, a bellow average level of cash advances, they are however above average in installment payments (second highest), a bellow average level of full payments and a bellow average level of cumulative amount of payments. This segments appears to be keen on installment payments, and consumes most of the income with financing.

The 3rd cluster has the lowest balance among the segments, average number of purchases, the lowest level of cash advances, an above average level of installment payments, it has however the highest level of full payments and a bellow average cumulative amount of payments. This cluster appears to have used high amounts of financing or consuming in the past, and has fulfilled their payment obligations.

#In the next phase, I will be testing the clustering variables using ANOVA, in order to determine which of the variables pertains the most differences among the segments.

fit <- aov(cbind(BALANCE_z, PURCHASES_TRX, CASH_ADVANCE_z, INSTALLMENTS_PURCHASES, PRC_FULL_PAYMENT_z, PAYMENTS_z) ~ as.factor(ClusterK_Means), data = mydata2)
summary(fit)

##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 150.74  37.685  77.394 < 2.2e-16 ***
## Residuals                 286 139.26   0.487                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 211.959  52.990   194.2 < 2.2e-16 ***
## Residuals                 286  78.041   0.273                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4  175.9  43.975  110.23 < 2.2e-16 ***
## Residuals                 286  114.1   0.399                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 222.772  55.693  236.93 < 2.2e-16 ***
## Residuals                 286  67.228   0.235                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 233.753  58.438  297.14 < 2.2e-16 ***
## Residuals                 286  56.247   0.197                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 6 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 125.41 31.3524  54.479 < 2.2e-16 ***
## Residuals                 286 164.59  0.5755                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA tests conducted above, we can observe that in fact all of the variables hold different means, between all of the clusters/segments. This is confirmed by the rejection of the null hypothesis, which states that the means of the values of the variables pertaining to each segment are the same. Due to the p value being < 0.05 in all cases, we can confirm the alternative hypothesis. Judging by the F-statistics, we can say that the variables NUMBER OF PURCHASES, INSTALLMENT PURCHASES and PERCENT OF FULL PAYMENTS are the variables which account for the most differences between the segments.

# VALIDATION

# I will now check for if there is a statistically significant difference between clusters, in regards to the frequency of their specific actions, which the credit card provides. Namely, if the cluster prefers to buy in installments or in one go.

# The null hypothesis for the chisquared test was: H0:There is no association between the two categorical variables (low/high frequency of installments payments)
# H1 there are differences

chisq_results <- chisq.test(mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF, as.factor(mydata2$ClusterK_Means))

## Warning in chisq.test(mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF,
## as.factor(mydata2$ClusterK_Means)): Chi-squared approximation may be incorrect

chisq_results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata2$PURCHASES_INSTALLMENTS_FREQUENCYF and as.factor(mydata2$ClusterK_Means)
## X-squared = 224.22, df = 4, p-value < 2.2e-16

addmargins(chisq_results$observed)

##       
##          1   2   3   4   5 Sum
##   high   1  64  19   2  13  99
##   low  141   2  22  27   0 192
##   Sum  142  66  41  29  13 291

round(chisq_results$expected, 2)

##       
##            1     2     3     4    5
##   high 48.31 22.45 13.95  9.87 4.42
##   low  93.69 43.55 27.05 19.13 8.58

round(chisq_results$res, 2)

##       
##            1     2     3     4     5
##   high -6.81  8.77  1.35 -2.50  4.08
##   low   4.89 -6.30 -0.97  1.80 -2.93

From the results of the chisq test, we can reject H0 at p-value < 0.05. From this we can conclude, there is an association between the clusters of the categorical variable PURCHASES_INSTALLMENTS_FREQUENCYF.

#VALIDATION - Tenure

fit <- aov(TENURE ~ as.factor(ClusterK_Means), data = mydata2)

# H0: µ(cluster1, TENURE) = µ(cluster1,2,3,4,5 TENURE)
# H1: µ(cluster1, TENURE) ≠ µ(cluster1,2,3,4,5 TENURE)

summary(fit)

##                            Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means)   4    5.1   1.277   0.789  0.533
## Residuals                 286  462.7   1.618

Do to the P-value of the anova test being above 0.05, we can not reject the null hypothesis, from which we can conclude that there are no statistically significant differences, between the average tenures among the clusters.

CONCLUSTION: I performed hierarchical clustering of 300 units based on 6 cluster variables (squared Euclidean distance, Ward’s algorithm). The classification was further optimized by K-Means clustering.

The resulting conclusion is as follows: I believe that the most profitable segments to target with the marketing campaign are segments 4 and 5. This is mainly due to the fact that they possess the highest balance on their account, while also being the largest consumers among the segments. They outperform other segments by a large margin on both accounts. The description of the segments can be located bellow the cluster chart, but for ease of access, I will once again copy the analysis of the two clusters according to the used variables once again in the following pharagraphs:

Cluster 5 has an above average balance on their accounts (second highest), it has the highest number of purchases, yet a bellow average of advance payments, the highest amount of installment payments, a bellow average level of payments made in full and the highest cumulative amount of payments. This seems to be the segment with the highest level of consumption, and a propensity to spend in installments. This seems to be the most profitable among the segments to target with marketing campaigns.

HW4 - Clustering

Kristjan Puhek

2023-02-18