# Data exploration
In order to perform the cluster analysis correctly and efficiently, it is important to understand what kind of data one is dealing with. The data contains information about 8950 credit cards, which for convenience can be assumed to correspond to distinct customers. There are 17 variables, all of them concerning balance, purchases and credit history, so the dataset covers both financial and behavioral information. The variables are as follows:
colnames(df)
## [1] "CUST_ID" "BALANCE"
## [3] "BALANCE_FREQUENCY" "PURCHASES"
## [5] "ONEOFF_PURCHASES" "INSTALLMENTS_PURCHASES"
## [7] "CASH_ADVANCE" "PURCHASES_FREQUENCY"
## [9] "ONEOFF_PURCHASES_FREQUENCY" "PURCHASES_INSTALLMENTS_FREQUENCY"
## [11] "CASH_ADVANCE_FREQUENCY" "CASH_ADVANCE_TRX"
## [13] "PURCHASES_TRX" "CREDIT_LIMIT"
## [15] "PAYMENTS" "MINIMUM_PAYMENTS"
## [17] "PRC_FULL_PAYMENT" "TENURE"
Here, in order to analyze the variables, I would like to leverage one of the most powerful R libraries, DescTools. It provides the user with easily accessible tools for data exploration; however, since its output is quite extensive, only the analysis of the variable BALANCE will be presented here.
library(DescTools) # Desc() comes from the DescTools package
Desc(df$BALANCE)
## ------------------------------------------------------------------------------
## df$BALANCE (numeric)
##
## length n NAs unique 0s'
## 8'950 8'950 0 8'871 80
## 100.0% 0.0% 0.9%
##
## .05 .10 .25 median .75
## 8.814518 23.575529 128.281915 873.385231 2'054.140036
##
## range sd vcoef mad IQR
## 19'043.138560 2'081.531879 1.330499 1'185.880141 1'925.858120
##
## mean meanCI
## 1'564.474828 1'521.344934
## 1'607.604721
##
## .90 .95
## 4'338.563657 5'909.111808
##
## skew kurt
## 2.392584 7.667410
##
## lowest : 0.0 (80), 0.000199, 0.001146, 0.001214, 0.001289
## highest: 16'115.596400, 16'259.448570, 16'304.889250, 18'495.558550, 19'043.138560
##
## ' 95%-CI (classic)
# use Desc(df) to analyze all variables in the dataframe
As you can see, there are several outliers in this variable, which may be an issue for the clustering: some of these outliers may end up "occupying" clusters of their own and therefore hinder the performance of the algorithms. It would be quite tedious to remove these outliers variable by variable, so here I propose an alternative solution - using PCA to locate these clients and exclude them from the analysis. PCA reduces the number of variables in the dataset to a lower, more manageable number. In order to perform it, the data first has to be normalized, as the variables have very different orders of magnitude.
# data.Normalization() comes from the clusterSim package; type = "n1" standardizes each column
df.n<-data.Normalization(df, type="n1", normalization="column")
## Warning in data.Normalization(df, type = "n1", normalization = "column"): Data
## not numeric, normalization not applicable
head(df.n)
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## 1 C10001 -0.73194847204352 -0.249420545524583 -0.424875997162139
## 2 C10002 0.786916887744014 0.134317166670032 -0.469525649414068
## 3 C10003 0.447110151666223 0.518054878864647 -0.107662215120318
## 4 C10004 0.0490963964234747 -1.01689596991381 0.232044882301248
## 5 C10005 -0.35875525138393 0.518054878864647 -0.462037238134918
## 6 C10006 0.117871806693615 0.518054878864647 0.154483662477457
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE
## 1 -0.35691408120146 -0.34905931685907 -0.466759476151324
## 2 -0.35691408120146 -0.45455083424524 2.60545893975651
## 3 0.108882429440933 -0.45455083424524 -0.466759476151324
## 4 0.546158942134865 -0.45455083424524 -0.368632658647015
## 5 -0.347274876149171 -0.45455083424524 -0.466759476151324
## 6 -0.35691408120146 1.01976499699829 -0.466759476151324
## PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
## 1 -0.806445289054105 -0.678622892777191
## 2 -1.22168980074932 -0.678622892777191
## 3 1.26977228649778 2.67330172076504
## 4 -1.01406879063275 -0.399296958956877
## 5 -1.01406879063275 -0.399296958956877
## 6 0.439285754569446 -0.678622892777191
## PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## 1 -0.707273649243451 -0.675311127842014 -0.476043220613547
## 2 -0.916943961286405 0.573930656428145 0.110067702038892
## 3 -0.916943961286405 -0.675311127842014 -0.476043220613547
## 4 -0.916943961286405 -0.258898865407673 -0.329515489950437
## 5 -0.916943961286405 -0.675311127842014 -0.476043220613547
## 6 0.55075325512189 -0.675311127842014 -0.476043220613547
## PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS
## 1 -0.511304683160103 <NA> -0.528949237947743 <NA>
## 2 -0.591762814625347 <NA> 0.818596391641855 <NA>
## 3 -0.109014025833886 <NA> -0.383783295747786 <NA>
## 4 -0.551533748892725 <NA> -0.598654813004152 <NA>
## 5 -0.551533748892725 <NA> -0.364347447101688 <NA>
## 6 -0.269930288764373 <NA> -0.115053107632055 <NA>
## PRC_FULL_PAYMENT TENURE
## 1 -0.525521609873799 0.360659393995693
## 2 0.234213811150449 0.360659393995693
## 3 -0.525521609873799 0.360659393995693
## 4 -0.525521609873799 0.360659393995693
## 5 -0.525521609873799 0.360659393995693
## 6 -0.525521609873799 0.360659393995693
Here we can notice that two of the columns, CREDIT_LIMIT and MINIMUM_PAYMENTS, contain missing values. In order to avoid any errors this might cause, they will be deleted before proceeding with the PCA, together with the non-numeric CUST_ID column.
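Before dropping them, a quick check of how many missing values each column actually contains can confirm this; a minimal sketch, assuming the raw data frame df is still in memory:
# count missing values per column of the original data
colSums(is.na(df))
# list only the affected columns
names(df)[colSums(is.na(df)) > 0]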
df.n <- subset(df.n, select = -c(CREDIT_LIMIT, MINIMUM_PAYMENTS, CUST_ID))

# make sure all 15 remaining columns are numeric before running PCA
for(i in 1:15){
  df.n[,i] <- as.numeric(df.n[,i])
}
# the data is already standardized, so prcomp() is run without further centering
df.n.pca <- prcomp(df.n, center = FALSE)
summary(df.n.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1165 1.7866 1.20891 1.09334 0.99055 0.94650 0.81307
## Proportion of Variance 0.2986 0.2128 0.09743 0.07969 0.06541 0.05972 0.04407
## Cumulative Proportion 0.2986 0.5114 0.60886 0.68855 0.75396 0.81369 0.85776
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.74951 0.69030 0.62732 0.52291 0.45549 0.41847 0.21391
## Proportion of Variance 0.03745 0.03177 0.02624 0.01823 0.01383 0.01167 0.00305
## Cumulative Proportion 0.89521 0.92698 0.95321 0.97144 0.98527 0.99695 1.00000
## PC15
## Standard deviation 0.003413
## Proportion of Variance 0.000000
## Cumulative Proportion 1.000000
In order to establish the optimal number of principal components, the proportion of variance explained by each component has to be analyzed. This can be done just by looking at the summary above (which already suggests that the first 2 components may be enough), but it is easier to visualize it and analyze the scree plot.
# fviz_eig() and the other fviz_* functions come from the factoextra package
fviz_eig(df.n.pca)
The rule of thumb is to pick the number of components after which the explained variance drops off most sharply, i.e. where adding the next component brings comparatively little. In this case that happens at 2 principal components. Another benefit of choosing 2 principal components is that 2-dimensional data is easy to visualize, so, knowing all this, 2 principal components are chosen.
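The same reasoning can be checked numerically on the PCA object itself; a short sketch (the values match the summary above, with the first two components explaining roughly half of the variance):
# proportion of variance explained by each principal component
expl.var <- df.n.pca$sdev^2 / sum(df.n.pca$sdev^2)
round(expl.var, 3)[1:5]
# cumulative proportion - the first two components cover about 51% of the variance
round(cumsum(expl.var), 3)[1:5]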
Below is the scatterplot of the observations according to their principal component values:
fviz_pca_ind(df.n.pca, col.ind="cos2", geom="point")
The colour in this plot describes the quality of representation (cos2) of each observation in the principal component space. Regardless of that quality, it is visible that we have several outliers in the dataset. In order to remove them from the analysis, I select a cutoff of 8 for either of the components: if an observation scores higher on either PC, it is excluded. Before excluding the observations, it may prove useful to see what these components mean, i.e. which variables in the original dataset contribute to them.
fviz_pca_var(df.n.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
As can be seen in the plot above, all of the variables except TENURE have a significant impact on at least one of the two components, so it can be reasoned that by excluding the observations that are outliers in the PCs we will most likely also exclude the outliers in the original variables. To that end, the principal component scores are attached to the data and the observations exceeding the cutoff are dropped.
pca.results <- data.frame(df.n.pca$x)
df.n$PC1 <- pca.results$PC1
df.n$PC2 <- pca.results$PC2
df2<-subset(df.n, PC1<8 & PC2<8)
nrow(df2)
## [1] 8851
The resulting dataset consists of 8851 customers. Since using the principal components themselves in the clustering would introduce information leakage (the components are by definition correlated with the original variables), they are removed before clustering.
df3 <- subset(df2, select = -c(PC1,PC2))
# Customer segmentation
Now we can proceed to the clustering proper. First, the optimal number of clusters should be checked.
fviz_nbclust(df3, FUNcluster=kmeans)+ theme_classic()
According to the silhouette criterion, the optimal number of clusters is 7; however, the silhouette is also high for 2 through 4 clusters. In order to fully explore the possible segmentations, all of these options will be examined in the following part.
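Before doing so, the silhouette-based choice can be cross-checked with the elbow (total within-cluster sum of squares) criterion, which the same factoextra function also supports; a sketch:
# elbow method: look for a bend in the total within-cluster sum of squares
fviz_nbclust(df3, FUNcluster=kmeans, method="wss") + theme_classic()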
km2<-eclust(df3, "kmeans", hc_metric="euclidean",k=2, seed = 1234)
With 2 clusters and the Euclidean distance measure, the clusters significantly overlap each other, which is definitely not desirable, so perhaps the situation will improve with more clusters:
km3<-eclust(df3, "kmeans", hc_metric="euclidean",k=3, seed = 1234)
Here the results seem to improve, as just by looking at the clusters one can notice less overlap. For the sake of brevity, I will now go straight to 7 clusters.
km7<-eclust(df3, "kmeans", hc_metric="euclidean",k=7, seed = 1234)
The output is not very readable, so in order to provide a more understandable plot, 1000 observations were sampled for visualization.
# draw a random sample of 1000 customers purely for a clearer plot
samp <- df3[sample(nrow(df3), 1000), ]
km7.samp<-eclust(samp, "kmeans", hc_metric="euclidean",k=7, seed = 1234)
To assess the quality of each cluster, the silhouette of each of them is analyzed.
fviz_silhouette(km7.samp)
## cluster size ave.sil.width
## 1 1 40 0.14
## 2 2 111 0.18
## 3 3 40 0.10
## 4 4 231 0.22
## 5 5 124 0.02
## 6 6 290 0.39
## 7 7 164 0.20
Though the sampled results are flipped along both axes, they are essentially the same as the full clustering. From now on I will refer to each cluster by its number in the sampled clustering. Right off the bat, we can see several things about the results:
- clusters 4 and 2 are the most clearly separated from the other clusters
- the remaining clusters are quite similar to each other
- cluster 6 contains the most clients (290), with cluster 4 being the second most populous (231)
- cluster 6 also has the highest average silhouette width (0.39), so it is most likely of the highest quality
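For reference, the cluster sizes can also be tabulated directly from the clustering objects; note that the labels of the full clustering need not match those of the sampled one:
# cluster sizes in the sampled clustering (matching the silhouette table above)
table(km7.samp$cluster)
# cluster sizes in the full clustering; labels may be permuted relative to the sample
table(km7$cluster)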
In order to get some insights from the clustering results, we can look at boxplots of each variable within each cluster:
groupBWplot(samp[,1:3], km7.samp$cluster, alpha=0.05)
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
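The same comparison can also be made numerically by looking at the per-cluster means of these standardized variables; a quick sketch using base R (the first three columns are BALANCE, BALANCE_FREQUENCY and PURCHASES):
# average standardized value of the first three variables within each cluster
aggregate(samp[, 1:3], by = list(cluster = km7.samp$cluster), FUN = mean)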
The results presented are for the first 3 variables, as with more of them the plot becomes unreadable. We can notice that, as far as balance goes, clusters 2 and 4 contain the clients with the highest existing balance, so it might be in the bank's best interest to offer them some investment products. Cluster 5 contains the customers who update their balance the least frequently, so it may contain people who are not used to using their credit card, and the bank can therefore target them with a campaign encouraging more frequent use. As you can see, a lot of useful information can be extracted from even these 3 variables. In order to look into the results more deeply, one might want to see per-cluster descriptive statistics of further variables:
# attach the cluster assignment as a 16th column and summarise the purchase
# frequency variables (columns 7:9) per cluster with describeBy() from psych
samp.cl<-data.frame(cbind(samp, km7.samp$cluster))
colnames(samp.cl)[16]<-"clust"
describeBy(samp.cl[,7:9], samp.cl[,16])
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min
## PURCHASES_FREQUENCY 1 40 1.07 0.36 1.27 1.15 0.00 -0.39
## ONEOFF_PURCHASES_FREQUENCY 2 40 2.07 0.72 2.39 2.19 0.41 -0.12
## PURCHASES_INSTALLMENTS_FREQUENCY 3 40 0.37 0.99 0.55 0.38 1.55 -0.92
## max range skew kurtosis se
## PURCHASES_FREQUENCY 1.27 1.66 -2.21 5.15 0.06
## ONEOFF_PURCHASES_FREQUENCY 2.67 2.79 -1.17 0.65 0.11
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60 2.52 -0.18 -1.59 0.16
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min
## PURCHASES_FREQUENCY 1 111 0.91 0.42 1.06 0.97 0.31 -0.11
## ONEOFF_PURCHASES_FREQUENCY 2 111 1.77 0.76 1.84 1.80 1.24 0.44
## PURCHASES_INSTALLMENTS_FREQUENCY 3 111 0.13 0.90 0.13 0.08 1.24 -0.92
## max range skew kurtosis se
## PURCHASES_FREQUENCY 1.27 1.38 -0.82 -0.66 0.04
## ONEOFF_PURCHASES_FREQUENCY 2.67 2.23 -0.16 -1.46 0.07
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60 2.52 0.27 -1.33 0.09
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min
## PURCHASES_FREQUENCY 1 40 1.23 0.12 1.27 1.26 0.00 0.65
## ONEOFF_PURCHASES_FREQUENCY 2 40 1.51 0.95 1.56 1.59 1.24 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY 3 40 1.42 0.39 1.60 1.52 0.00 -0.29
## max range skew kurtosis se
## PURCHASES_FREQUENCY 1.27 0.62 -3.38 12.72 0.02
## ONEOFF_PURCHASES_FREQUENCY 2.67 3.35 -0.52 -0.71 0.15
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60 1.89 -2.81 8.28 0.06
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min
## PURCHASES_FREQUENCY 1 231 0.99 0.35 1.27 1.04 0.00 -0.18
## ONEOFF_PURCHASES_FREQUENCY 2 231 -0.38 0.45 -0.68 -0.46 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY 3 231 1.19 0.42 1.28 1.24 0.47 0.13
## max range skew kurtosis se
## PURCHASES_FREQUENCY 1.27 1.45 -0.97 -0.18 0.02
## ONEOFF_PURCHASES_FREQUENCY 1.28 1.96 1.36 0.80 0.03
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60 1.47 -0.79 -0.34 0.03
## ------------------------------------------------------------
## group: 5
## vars n mean sd median trimmed mad min
## PURCHASES_FREQUENCY 1 124 -0.61 0.78 -1.01 -0.74 0.31 -1.22
## ONEOFF_PURCHASES_FREQUENCY 2 124 -0.21 0.77 -0.68 -0.39 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY 3 124 -0.57 0.64 -0.92 -0.71 0.00 -0.92
## max range skew kurtosis se
## PURCHASES_FREQUENCY 1.27 2.49 1.05 -0.26 0.07
## ONEOFF_PURCHASES_FREQUENCY 2.67 3.35 2.07 3.88 0.07
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60 2.52 1.75 2.15 0.06
## ------------------------------------------------------------
## group: 6
## vars n mean sd median trimmed mad min
## PURCHASES_FREQUENCY 1 290 -0.87 0.42 -1.01 -0.93 0.31 -1.22
## ONEOFF_PURCHASES_FREQUENCY 2 290 -0.42 0.40 -0.68 -0.50 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY 3 290 -0.75 0.33 -0.92 -0.84 0.00 -0.92
## max range skew kurtosis se
## PURCHASES_FREQUENCY 0.44 1.66 0.96 -0.25 0.02
## ONEOFF_PURCHASES_FREQUENCY 1.00 1.68 1.59 1.74 0.02
## PURCHASES_INSTALLMENTS_FREQUENCY 0.34 1.26 1.99 2.73 0.02
## ------------------------------------------------------------
## group: 7
## vars n mean sd median trimmed mad min
## PURCHASES_FREQUENCY 1 164 -0.51 0.62 -0.60 -0.58 0.62 -1.22
## ONEOFF_PURCHASES_FREQUENCY 2 164 -0.38 0.46 -0.68 -0.48 0.00 -0.68
## PURCHASES_INSTALLMENTS_FREQUENCY 3 164 -0.45 0.62 -0.71 -0.55 0.31 -0.92
## max range skew kurtosis se
## PURCHASES_FREQUENCY 1.27 2.49 0.87 0.24 0.05
## ONEOFF_PURCHASES_FREQUENCY 1.45 2.13 1.78 2.57 0.04
## PURCHASES_INSTALLMENTS_FREQUENCY 1.60 2.52 1.40 1.61 0.05
Finally, it is also possible to verify which variables drive the clustering, i.e. which ones matter most in deciding which cluster a given client belongs to. In order to do that, a different package is used, so the clustering has to be performed again.
# kcca() comes from the flexclust package, FeatureImpCluster() from the
# FeatureImpCluster package and as.data.table() from data.table
km7.imp<-kcca(samp, k=7)
FeatureImp_km<-FeatureImpCluster(km7.imp, as.data.table(samp))
plot(FeatureImp_km)
FeatureImpCluster measures importance by permuting each variable and checking how often observations would then land in a different cluster, so the variables with the highest misclassification rate are the ones which drive the clustering. Here the purchase-related information is the most significant, and therefore when analyzing the results one must keep in mind that they may describe purchasing behaviour well, and the other variables less so. It may be useful to perform a separate clustering for the non-purchase-related data.
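As a sketch of that idea, one could keep only the columns whose names do not mention purchases and re-run the same procedure; the choice of k = 3 below is purely illustrative and would have to be validated with fviz_nbclust as before:
# drop all purchase-related columns by name
non.purch <- df3[, !grepl("PURCHASES", colnames(df3))]
# k = 3 is an illustrative choice, not a tuned one
km.np <- eclust(non.purch, "kmeans", hc_metric="euclidean", k=3, seed = 1234)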
Although their uses might not be instantly obvious, the unsupervised learning methods presented in this analysis are invaluable for various businesses. Here the utility of PCA and clustering was shown, with PCA playing a mostly technical, supporting role in the main clustering analysis. The clustering allowed for efficient customer segmentation and, even with only basic knowledge of the banking sector, let us derive some practical insights with real value attached to them. For this dataset it might be more practical to use clustering algorithms better suited to large quantities of data, such as CLARA, as the computations took some time.
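For completeness, a minimal sketch of what a CLARA-based segmentation could look like, using clara() from the cluster package (the parameters are indicative, not tuned):
library(cluster)
# CLARA repeatedly applies PAM to subsamples, which scales to large datasets;
# "samples" controls how many subsamples are drawn
clara7 <- clara(df3, k = 7, metric = "euclidean", samples = 50, pamLike = TRUE)
# cluster sizes and the average silhouette width of the best subsample
table(clara7$clustering)
clara7$silinfo$avg.width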