# Data exploration

In order to perform the cluster analysis correctly and efficiently, it is important to understand what kind of data one is dealing with. The data contains information about 8950 credit cards, which for convenience can be assumed to represent distinct customers. Besides the customer identifier, the data contains 17 variables, all of them concerning balance, purchases and credit history, so both financial and behavioral information. The variables are as follows:

colnames(df)
##  [1] "CUST_ID"                          "BALANCE"                         
##  [3] "BALANCE_FREQUENCY"                "PURCHASES"                       
##  [5] "ONEOFF_PURCHASES"                 "INSTALLMENTS_PURCHASES"          
##  [7] "CASH_ADVANCE"                     "PURCHASES_FREQUENCY"             
##  [9] "ONEOFF_PURCHASES_FREQUENCY"       "PURCHASES_INSTALLMENTS_FREQUENCY"
## [11] "CASH_ADVANCE_FREQUENCY"           "CASH_ADVANCE_TRX"                
## [13] "PURCHASES_TRX"                    "CREDIT_LIMIT"                    
## [15] "PAYMENTS"                         "MINIMUM_PAYMENTS"                
## [17] "PRC_FULL_PAYMENT"                 "TENURE"

Here, in order to analyze the variables, I would like to leverage one of the most powerful R libraries, DescTools. It provides the user with easily accessible tools for data exploration; however, since its output is very extensive, only the analysis of the variable BALANCE will be presented here.

library(DescTools)
Desc(df$BALANCE)
## ------------------------------------------------------------------------------ 
## df$BALANCE (numeric)
## 
##          length             n         NAs        unique            0s'
##           8'950         8'950           0         8'871            80
##                        100.0%        0.0%                        0.9%
##                                                                      
##             .05           .10         .25        median           .75
##        8.814518     23.575529  128.281915    873.385231  2'054.140036
##                                                                      
##           range            sd       vcoef           mad           IQR
##   19'043.138560  2'081.531879    1.330499  1'185.880141  1'925.858120
##                                                                      
##           mean        meanCI
##   1'564.474828  1'521.344934
##                 1'607.604721
##                             
##            .90           .95
##   4'338.563657  5'909.111808
##                             
##           skew          kurt
##       2.392584      7.667410
##                             
## lowest : 0.0 (80), 0.000199, 0.001146, 0.001214, 0.001289
## highest: 16'115.596400, 16'259.448570, 16'304.889250, 18'495.558550, 19'043.138560
## 
## ' 95%-CI (classic)

# use Desc(df) to analyze all variables in the dataframe

As you can see, there are several outliers in this variable, which may be an issue for the clustering, as some of these outliers may end up "occupying" clusters of their own and therefore hinder the performance of the algorithms. It would be quite tedious to remove these outliers variable by variable, so here I propose an alternative solution: utilizing PCA to locate these clients and exclude them from the analysis. PCA allows us to reduce the number of variables present in the dataset to a lower, more manageable number. In order to perform PCA, the data first has to be normalized, as the variables present have different orders of magnitude.
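
For reference, the "n1" normalization type used below (data.Normalization() comes from the clusterSim package) is classic standardization, (x - mean)/sd. A minimal base-R sketch of the same operation, assuming only the numeric columns are kept:

num.cols <- sapply(df, is.numeric)                # drop non-numeric columns such as CUST_ID
df.scaled <- as.data.frame(scale(df[, num.cols])) # center and scale each column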

df.n<-data.Normalization(df, type="n1", normalization="column")
## Warning in data.Normalization(df, type = "n1", normalization = "column"): Data
## not numeric, normalization not applicable
head(df.n)
##   CUST_ID            BALANCE  BALANCE_FREQUENCY          PURCHASES
## 1  C10001  -0.73194847204352 -0.249420545524583 -0.424875997162139
## 2  C10002  0.786916887744014  0.134317166670032 -0.469525649414068
## 3  C10003  0.447110151666223  0.518054878864647 -0.107662215120318
## 4  C10004 0.0490963964234747  -1.01689596991381  0.232044882301248
## 5  C10005  -0.35875525138393  0.518054878864647 -0.462037238134918
## 6  C10006  0.117871806693615  0.518054878864647  0.154483662477457
##     ONEOFF_PURCHASES INSTALLMENTS_PURCHASES       CASH_ADVANCE
## 1  -0.35691408120146      -0.34905931685907 -0.466759476151324
## 2  -0.35691408120146      -0.45455083424524   2.60545893975651
## 3  0.108882429440933      -0.45455083424524 -0.466759476151324
## 4  0.546158942134865      -0.45455083424524 -0.368632658647015
## 5 -0.347274876149171      -0.45455083424524 -0.466759476151324
## 6  -0.35691408120146       1.01976499699829 -0.466759476151324
##   PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
## 1  -0.806445289054105         -0.678622892777191
## 2   -1.22168980074932         -0.678622892777191
## 3    1.26977228649778           2.67330172076504
## 4   -1.01406879063275         -0.399296958956877
## 5   -1.01406879063275         -0.399296958956877
## 6   0.439285754569446         -0.678622892777191
##   PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY   CASH_ADVANCE_TRX
## 1               -0.707273649243451     -0.675311127842014 -0.476043220613547
## 2               -0.916943961286405      0.573930656428145  0.110067702038892
## 3               -0.916943961286405     -0.675311127842014 -0.476043220613547
## 4               -0.916943961286405     -0.258898865407673 -0.329515489950437
## 5               -0.916943961286405     -0.675311127842014 -0.476043220613547
## 6                 0.55075325512189     -0.675311127842014 -0.476043220613547
##        PURCHASES_TRX CREDIT_LIMIT           PAYMENTS MINIMUM_PAYMENTS
## 1 -0.511304683160103         <NA> -0.528949237947743             <NA>
## 2 -0.591762814625347         <NA>  0.818596391641855             <NA>
## 3 -0.109014025833886         <NA> -0.383783295747786             <NA>
## 4 -0.551533748892725         <NA> -0.598654813004152             <NA>
## 5 -0.551533748892725         <NA> -0.364347447101688             <NA>
## 6 -0.269930288764373         <NA> -0.115053107632055             <NA>
##     PRC_FULL_PAYMENT            TENURE
## 1 -0.525521609873799 0.360659393995693
## 2  0.234213811150449 0.360659393995693
## 3 -0.525521609873799 0.360659393995693
## 4 -0.525521609873799 0.360659393995693
## 5 -0.525521609873799 0.360659393995693
## 6 -0.525521609873799 0.360659393995693

Here we can notice that two of the columns, CREDIT_LIMIT and MINIMUM_PAYMENTS, contain missing values, so in order to avoid any errors this might cause, they will be deleted (together with the CUST_ID identifier) before proceeding with the PCA.
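
A quick check on the raw data (a sketch; output omitted) confirms which columns contain NAs:

colSums(is.na(df))  # number of missing values per column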

df.n <- subset(df.n, select = -c(CREDIT_LIMIT, MINIMUM_PAYMENTS, CUST_ID))
for(i in 1:15){
  df.n[,i] <- as.numeric(df.n[,i])  # make sure every remaining column is numeric
}
df.n.pca <- prcomp(df.n, center = FALSE)  # no centering needed, the data is already standardized
summary(df.n.pca)
summary(df.n.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.1165 1.7866 1.20891 1.09334 0.99055 0.94650 0.81307
## Proportion of Variance 0.2986 0.2128 0.09743 0.07969 0.06541 0.05972 0.04407
## Cumulative Proportion  0.2986 0.5114 0.60886 0.68855 0.75396 0.81369 0.85776
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.74951 0.69030 0.62732 0.52291 0.45549 0.41847 0.21391
## Proportion of Variance 0.03745 0.03177 0.02624 0.01823 0.01383 0.01167 0.00305
## Cumulative Proportion  0.89521 0.92698 0.95321 0.97144 0.98527 0.99695 1.00000
##                            PC15
## Standard deviation     0.003413
## Proportion of Variance 0.000000
## Cumulative Proportion  1.000000
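
For reference, these proportions can be recomputed directly from the prcomp object; a minimal sketch (output omitted):

pve <- df.n.pca$sdev^2 / sum(df.n.pca$sdev^2)  # proportion of variance explained
round(cumsum(pve), 3)                          # matches the cumulative proportion row above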

In order to establish the optimal number of principal components, the proportion of variance explained by each component has to be analyzed. This can be done just by looking at the summary above (the expectation is that the first 2 components will be enough); however, it is easier to visualize this and analyze the plot.

fviz_eig(df.n.pca)

The rule of thumb is to keep the components up to the point where the drop-off in explained variance to the next component is the biggest. In this case, that point is at 2 principal components. Another benefit of choosing 2 principal components is that 2-dimensional data is easy to visualize, so knowing all this, 2 principal components are chosen.
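
This drop-off rule can also be checked numerically, reusing pve from the sketch above:

round(-diff(pve), 3)  # the biggest drop (about 0.115) occurs between PC2 and PC3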

Below is the scatterplot of the observations according to their principal component values:

fviz_pca_ind(df.n.pca, col.ind="cos2", geom="point")

The colour in this plot describes the quality of representation (cos2) of each observation in the PCA. Regardless of the quality, it is visible here that we have several outliers in the dataset. In order to remove them from the analysis, I select a cutoff of 8 for either of the components: if an observation has a higher score on either PC, it is excluded from the analysis. Before I exclude the observations, it may prove useful to see what these components mean, that is, which variables in the original dataset they are constructed from.

fviz_pca_var(df.n.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

As can be seen in the plot above, all of the variables except for TENURE have a significant impact on at least one of the components, so it can be reasoned that if we exclude the observations that are outliers in the PCs, then we will most likely also exclude the outliers for the other variables. To that end, we should flag each observation as an outlier or not.
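
Such a boolean flag could be built directly from the component scores; a sketch (the subset() call below achieves the same filtering):

outlier <- df.n.pca$x[, "PC1"] >= 8 | df.n.pca$x[, "PC2"] >= 8  # TRUE for observations beyond the cutoff
table(outlier)                                                  # how many observations are flagged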

pca.results <- data.frame(df.n.pca$x)
df.n$PC1 <- pca.results$PC1
df.n$PC2 <- pca.results$PC2
df2 <- subset(df.n, PC1 < 8 & PC2 < 8)  # keep only the non-outlying observations
nrow(df2)
## [1] 8851

The resulting dataset consists of 8851 customers. The principal components themselves should not be used in the clustering, as they are by definition correlated with the original variables and would introduce information leakage, so they are dropped:

df3 <- subset(df2, select = -c(PC1,PC2))

# Customer segmentation

Now we can proceed to the clustering proper. First, the optimal number of clusters should be determined:

fviz_nbclust(df3, FUNcluster=kmeans)+ theme_classic()

According to the silhouette statistic (the default method of fviz_nbclust), the optimal number of clusters is 7; however, the silhouette is also high for 2 through 4 clusters. In order to fully explore the possible segmentations, all of these possibilities will be examined. The following part analyzes the possible clustering results.
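
The statistic behind this plot can also be computed by hand; a minimal sketch using base kmeans and the cluster package on a subsample (the full distance matrix would be heavy for 8851 rows):

library(cluster)
set.seed(1234)
idx <- sample(nrow(df3), 1000)  # subsample to keep dist() manageable
d <- dist(df3[idx, ])
for (k in 2:7) {
  km <- kmeans(df3[idx, ], centers = k, nstart = 25)
  cat(k, "clusters, average silhouette:",
      round(mean(silhouette(km$cluster, d)[, 3]), 3), "\n")
}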

km2<-eclust(df3, "kmeans", hc_metric="euclidean",k=2, seed = 1234)

With 2 clusters and the Euclidean distance measure, the clusters significantly overlap each other, which is definitely not desirable, so maybe for more clusters the situation will improve:

km3<-eclust(df3, "kmeans", hc_metric="euclidean",k=3, seed = 1234)

Here the results seem to be improving, since just by looking at the clusters one can notice less overlap. For brevity, I will go straight to 7 clusters.

km7<-eclust(df3, "kmeans", hc_metric="euclidean",k=7, seed = 1234)

The output is not very readable, so in order to provide a more understandable plot, 1000 observations were sampled for visualization:

samp <- df3[sample(nrow(df3), 1000), ]  # random subsample of 1000 customers for plotting
km7.samp <- eclust(samp, "kmeans", hc_metric = "euclidean", k = 7, seed = 1234)

To assess the quality of each cluster, the silhouette of each of them is analyzed. The silhouette of an observation compares its average distance to its own cluster (a) with that to the nearest other cluster (b), s = (b - a)/max(a, b), so values close to 1 indicate a well-assigned observation.

fviz_silhouette(km7.samp)
##   cluster size ave.sil.width
## 1       1   40          0.14
## 2       2  111          0.18
## 3       3   40          0.10
## 4       4  231          0.22
## 5       5  124          0.02
## 6       6  290          0.39
## 7       7  164          0.20

Though the sampled results are flipped along both axes, they are essentially the same, so from now on I will refer to each of the clusters by their number in the sampled clustering. Right off the bat, we can see several things about the results:

- clusters 4 and 2 are the most separated from the other clusters;
- the remaining clusters are quite similar to each other;
- cluster 6 contains the most clients (290), with cluster 4 being the second most populous (231);
- judging by its average silhouette width (0.39), cluster 6 is most likely of the highest quality.
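
These sizes can be read directly off the fitted object; assuming eclust() returns the standard kmeans fields:

km7.samp$size               # number of clients per cluster
round(km7.samp$centers, 2)  # cluster centroids, in standardized units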

In order to get some insights from the clustering results, we can use boxplots of each variable per cluster:

groupBWplot(samp[,1:3], km7.samp$cluster, alpha=0.05)
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

The results presented are for the first 3 variables, as with more of them the plot becomes unreadable. We can notice that as far as balance goes, clusters 2 and 4 contain the clients with the highest existing balance, so it might be in the bank's best interest to offer them some investment products. Cluster 5 contains the customers who update their balance the least frequently, so it may contain people who are not used to using their credit card, and the bank can therefore target them with a campaign encouraging more frequent use. As you can see, a lot of useful information can be extracted from even these 3 variables. In order to look into the results more deeply, one might want to see the descriptive statistics of the variables per cluster:

samp.cl <- data.frame(cbind(samp, km7.samp$cluster))  # attach the cluster assignments
colnames(samp.cl)[16] <- "clust"                      # name the appended column

describeBy(samp.cl[,7:9], samp.cl[,16])  # describeBy() comes from the psych package
## 
##  Descriptive statistics by group 
## group: 1
##                                  vars  n mean   sd median trimmed  mad   min
## PURCHASES_FREQUENCY                 1 40 1.07 0.36   1.27    1.15 0.00 -0.39
## ONEOFF_PURCHASES_FREQUENCY          2 40 2.07 0.72   2.39    2.19 0.41 -0.12
## PURCHASES_INSTALLMENTS_FREQUENCY    3 40 0.37 0.99   0.55    0.38 1.55 -0.92
##                                   max range  skew kurtosis   se
## PURCHASES_FREQUENCY              1.27  1.66 -2.21     5.15 0.06
## ONEOFF_PURCHASES_FREQUENCY       2.67  2.79 -1.17     0.65 0.11
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60  2.52 -0.18    -1.59 0.16
## ------------------------------------------------------------ 
## group: 2
##                                  vars   n mean   sd median trimmed  mad   min
## PURCHASES_FREQUENCY                 1 111 0.91 0.42   1.06    0.97 0.31 -0.11
## ONEOFF_PURCHASES_FREQUENCY          2 111 1.77 0.76   1.84    1.80 1.24  0.44
## PURCHASES_INSTALLMENTS_FREQUENCY    3 111 0.13 0.90   0.13    0.08 1.24 -0.92
##                                   max range  skew kurtosis   se
## PURCHASES_FREQUENCY              1.27  1.38 -0.82    -0.66 0.04
## ONEOFF_PURCHASES_FREQUENCY       2.67  2.23 -0.16    -1.46 0.07
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60  2.52  0.27    -1.33 0.09
## ------------------------------------------------------------ 
## group: 3
##                                  vars  n mean   sd median trimmed  mad   min
## PURCHASES_FREQUENCY                 1 40 1.23 0.12   1.27    1.26 0.00  0.65
## ONEOFF_PURCHASES_FREQUENCY          2 40 1.51 0.95   1.56    1.59 1.24 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY    3 40 1.42 0.39   1.60    1.52 0.00 -0.29
##                                   max range  skew kurtosis   se
## PURCHASES_FREQUENCY              1.27  0.62 -3.38    12.72 0.02
## ONEOFF_PURCHASES_FREQUENCY       2.67  3.35 -0.52    -0.71 0.15
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60  1.89 -2.81     8.28 0.06
## ------------------------------------------------------------ 
## group: 4
##                                  vars   n  mean   sd median trimmed  mad   min
## PURCHASES_FREQUENCY                 1 231  0.99 0.35   1.27    1.04 0.00 -0.18
## ONEOFF_PURCHASES_FREQUENCY          2 231 -0.38 0.45  -0.68   -0.46 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY    3 231  1.19 0.42   1.28    1.24 0.47  0.13
##                                   max range  skew kurtosis   se
## PURCHASES_FREQUENCY              1.27  1.45 -0.97    -0.18 0.02
## ONEOFF_PURCHASES_FREQUENCY       1.28  1.96  1.36     0.80 0.03
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60  1.47 -0.79    -0.34 0.03
## ------------------------------------------------------------ 
## group: 5
##                                  vars   n  mean   sd median trimmed  mad   min
## PURCHASES_FREQUENCY                 1 124 -0.61 0.78  -1.01   -0.74 0.31 -1.22
## ONEOFF_PURCHASES_FREQUENCY          2 124 -0.21 0.77  -0.68   -0.39 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY    3 124 -0.57 0.64  -0.92   -0.71 0.00 -0.92
##                                   max range skew kurtosis   se
## PURCHASES_FREQUENCY              1.27  2.49 1.05    -0.26 0.07
## ONEOFF_PURCHASES_FREQUENCY       2.67  3.35 2.07     3.88 0.07
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60  2.52 1.75     2.15 0.06
## ------------------------------------------------------------ 
## group: 6
##                                  vars   n  mean   sd median trimmed  mad   min
## PURCHASES_FREQUENCY                 1 290 -0.87 0.42  -1.01   -0.93 0.31 -1.22
## ONEOFF_PURCHASES_FREQUENCY          2 290 -0.42 0.40  -0.68   -0.50 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY    3 290 -0.75 0.33  -0.92   -0.84 0.00 -0.92
##                                   max range skew kurtosis   se
## PURCHASES_FREQUENCY              0.44  1.66 0.96    -0.25 0.02
## ONEOFF_PURCHASES_FREQUENCY       1.00  1.68 1.59     1.74 0.02
## PURCHASES_INSTALLMENTS_FREQUENCY 0.34  1.26 1.99     2.73 0.02
## ------------------------------------------------------------ 
## group: 7
##                                  vars   n  mean   sd median trimmed  mad   min
## PURCHASES_FREQUENCY                 1 164 -0.51 0.62  -0.60   -0.58 0.62 -1.22
## ONEOFF_PURCHASES_FREQUENCY          2 164 -0.38 0.46  -0.68   -0.48 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY    3 164 -0.45 0.62  -0.71   -0.55 0.31 -0.92
##                                   max range skew kurtosis   se
## PURCHASES_FREQUENCY              1.27  2.49 0.87     0.24 0.05
## ONEOFF_PURCHASES_FREQUENCY       1.45  2.13 1.78     2.57 0.04
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60  2.52 1.40     1.61 0.05

Finally, it is also possible to verify which variables drive the clustering, meaning which ones matter most in deciding which cluster a given client should belong to. In order to do that, a different package is used, so the clustering has to be performed again.

km7.imp <- kcca(samp, k=7)  # kcca() comes from the flexclust package
FeatureImp_km <- FeatureImpCluster(km7.imp, as.data.table(samp))  # permutation-based feature importance
plot(FeatureImp_km)

The variables with the lowest classification rate under permutation are the ones which drive the clustering. Here, the information about the purchases is the most significant, so when analyzing the results one must keep in mind that they may be very good with respect to purchase information, and less so for the other variables. It may be useful to perform a separate clustering on the non-purchase-related data.
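
A hypothetical follow-up along these lines, selecting the non-purchase columns by name pattern (k = 3 is purely illustrative):

nonpurch <- df3[, !grepl("PURCHASES", colnames(df3))]    # balance, cash advance, payments, tenure
km.np <- eclust(nonpurch, "kmeans", k = 3, seed = 1234)  # sketch of a separate segmentation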

# Conclusions

Although their uses might not be instantly obvious, the unsupervised learning methods presented in this analysis are invaluable for various businesses. Here, the utility of PCA and clustering was shown, with PCA playing a supporting, mostly technical role in the main clustering analysis. The clustering allowed for efficient customer segmentation and, even with only basic knowledge of the banking sector, let us derive some practical insights with real value attached to them. For this dataset it might be more practical to use clustering algorithms better suited to handling large quantities of data, such as CLARA, as the computation took some time.