Dimensionality reduction techniques and clustering

The dataset used in the analysis consists of behavioral, unlabeled data on credit card usage. The main goal is to identify customer segments that fit the data well by means of clustering. Given the dataset's characteristics, namely high dimensionality and the presence of correlated variables, I will compare clustering on the original data with the results obtained after applying a dimensionality reduction algorithm that also removes the correlation between variables.

Dataset description

The chosen dataset contains 8,950 observations and 18 variables. It was provided by the Kaggle user arjunbhasin2013. The table below previews the first ten observations.

library(readr)
library(kableExtra)
# reading the dataset
data <- read.csv("C:/Users/user/Desktop/NewProject/CC GENERAL.csv")
# displaying the dataset
kable(head(data[-1], n = 10)) %>%
  kable_styling(bootstrap_options = c("hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "90%")
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
40.90075 0.818182 95.40 0.00 95.40 0.000 0.166667 0.000000 0.083333 0.000000 0 2 1000 201.8021 139.5098 0.000000 12
3202.46742 0.909091 0.00 0.00 0.00 6442.945 0.000000 0.000000 0.000000 0.250000 4 0 7000 4103.0326 1072.3402 0.222222 12
2495.14886 1.000000 773.17 773.17 0.00 0.000 1.000000 1.000000 0.000000 0.000000 0 12 7500 622.0667 627.2848 0.000000 12
1666.67054 0.636364 1499.00 1499.00 0.00 205.788 0.083333 0.083333 0.000000 0.083333 1 1 7500 0.0000 NA 0.000000 12
817.71434 1.000000 16.00 16.00 0.00 0.000 0.083333 0.083333 0.000000 0.000000 0 1 1200 678.3348 244.7912 0.000000 12
1809.82875 1.000000 1333.28 0.00 1333.28 0.000 0.666667 0.000000 0.583333 0.000000 0 8 1800 1400.0578 2407.2460 0.000000 12
627.26081 1.000000 7091.01 6402.63 688.38 0.000 1.000000 1.000000 1.000000 0.000000 0 64 13500 6354.3143 198.0659 1.000000 12
1823.65274 1.000000 436.20 0.00 436.20 0.000 1.000000 0.000000 1.000000 0.000000 0 12 2300 679.0651 532.0340 0.000000 12
1014.92647 1.000000 861.49 661.49 200.00 0.000 0.333333 0.083333 0.250000 0.000000 0 5 7000 688.2786 311.9634 0.000000 12
152.22598 0.545455 1281.60 1281.60 0.00 0.000 0.166667 0.166667 0.000000 0.000000 0 3 11000 1164.7706 100.3023 0.000000 12

Exploratory data analysis

In the first step, I will focus on basic statistics of the features, search for missing data and potential outliers, and examine the correlations between variables.

# shortening variables' names
colnames(data) <- c("CUST_ID", "BAL","BAL_FREQ", "PURCH", "ONEOFF_PURCH", "INST_PURCH", "CASH_ADV", "PURCH_FREQ","ONEOFF_PURCH_FREQ", "PURCH_INST_FREQ", "CASH_ADV_FREQ","CASH_ADV_TRX", "PURCH_TRX", "CREDIT_LIMIT", "PAY", "MIN_PAY", "PRC_FULL_PAY", "TENURE")

# displaying basic statistics
kable(summary(data[,-1])) %>% 
  kable_styling(bootstrap_options = c("hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "100%")
BAL BAL_FREQ PURCH ONEOFF_PURCH INST_PURCH CASH_ADV PURCH_FREQ ONEOFF_PURCH_FREQ PURCH_INST_FREQ CASH_ADV_FREQ CASH_ADV_TRX PURCH_TRX CREDIT_LIMIT PAY MIN_PAY PRC_FULL_PAY TENURE
Min. : 0.0 Min. :0.0000 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50 Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600 1st Qu.: 383.3 1st Qu.: 169.12 1st Qu.:0.0000 1st Qu.:12.00
Median : 873.4 Median :1.0000 Median : 361.28 Median : 38.0 Median : 89.0 Median : 0.0 Median :0.50000 Median :0.08333 Median :0.1667 Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000 Median : 856.9 Median : 312.34 Median :0.0000 Median :12.00
Mean : 1564.5 Mean :0.8773 Mean : 1003.20 Mean : 592.4 Mean : 411.1 Mean : 978.9 Mean :0.49035 Mean :0.20246 Mean :0.3644 Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494 Mean : 1733.1 Mean : 864.21 Mean :0.1537 Mean :11.52
3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667 3rd Qu.:0.30000 3rd Qu.:0.7500 3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500 3rd Qu.: 1901.1 3rd Qu.: 825.49 3rd Qu.:0.1429 3rd Qu.:12.00
Max. :19043.1 Max. :1.0000 Max. :49039.57 Max. :40761.2 Max. :22500.0 Max. :47137.2 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000 Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
NA NA NA NA NA NA NA NA NA NA NA NA NA’s :1 NA NA’s :313 NA NA

The dataset consists mostly of continuous variables, except for CASH_ADVANCE_TRX, PURCHASES_TRX and TENURE, which are integer-valued. There is only 1 NA value in CREDIT_LIMIT and 313 NA values in MINIMUM_PAYMENTS. Since there are few missing values and the dataset is large, these observations could simply be deleted.

Starting with the MINIMUM_PAYMENTS variable, note that the NA values occur only when the PAYMENTS variable is equal to zero. In this situation I will therefore simply replace the missing values with zeros.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# filtering observations for which PAYMENTS is equal to zero and comparing them with MINIMUM PAYMENTS variable
data1 <- data %>% 
           select(PAY, MIN_PAY) %>% 
           filter(PAY == 0)
# displaying first five observations after the filtering process
head(data1, n = 5)
##   PAY MIN_PAY
## 1   0      NA
## 2   0      NA
## 3   0      NA
## 4   0      NA
## 5   0      NA
data$MIN_PAY[is.na(data$MIN_PAY)] <- 0
summary(data$MIN_PAY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   163.0   289.6   834.0   788.7 76406.2

In the case of CREDIT_LIMIT, I will simply delete the observation with the missing value; since it is a single row out of almost 9,000, removing it will not noticeably affect the dataset.

data <- data[!(is.na(data$CREDIT_LIMIT)),]

In order to look for potential outliers, I will use a univariate approach and apply the IQR rule. This rule flags observations in both tails of a variable's distribution, i.e. values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

The following loop prints each variable together with the number of potential outliers among its values.

vars <- c(colnames(data)[-1])
Outliers <- c()
for(i in vars){
  max <- quantile(data[,i], 0.75) + (IQR(data[,i]) * 1.5 )
  min <- quantile(data[,i], 0.25) - (IQR(data[,i]) * 1.5 )
  idx <- which(data[,i] < min | data[,i] > max)
  print(paste(i, length(idx), sep=' ')) # printing variable and number of potential outliers 
  Outliers <- c(Outliers, idx) 
}
## [1] "BAL 695"
## [1] "BAL_FREQ 1492"
## [1] "PURCH 808"
## [1] "ONEOFF_PURCH 1013"
## [1] "INST_PURCH 867"
## [1] "CASH_ADV 1030"
## [1] "PURCH_FREQ 0"
## [1] "ONEOFF_PURCH_FREQ 782"
## [1] "PURCH_INST_FREQ 0"
## [1] "CASH_ADV_FREQ 525"
## [1] "CASH_ADV_TRX 804"
## [1] "PURCH_TRX 766"
## [1] "CREDIT_LIMIT 248"
## [1] "PAY 808"
## [1] "MIN_PAY 899"
## [1] "PRC_FULL_PAY 1474"
## [1] "TENURE 1365"

The number of potential outliers is high. Thus, a sensible approach is to use a clustering algorithm that is robust to outliers.

The last step of the exploratory analysis is the correlation matrix. Inspecting the relationships between variables is important because passing highly correlated or collinear variables to a clustering algorithm may distort the resulting clusters; how strongly depends on the magnitude of the correlation. High correlation between two variables means that they follow similar trends and are likely to carry similar information. If two variables are perfectly correlated, the concept they represent enters the data twice, and the final solution is likely to be skewed towards that concept; in effect, an artificial feature has been created.
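To see why this matters for distance-based methods, here is a toy sketch (my own illustration, not part of the original analysis): duplicating a single variable stretches all pairwise distances in its direction, which pulls the clustering towards the duplicated concept.

# Toy illustration: duplicating a variable doubles its weight in the Euclidean distance
set.seed(1)
toy    <- data.frame(x = rnorm(5), y = rnorm(5))
d_orig <- dist(toy)                     # distance based on x and y counted once
d_dup  <- dist(cbind(toy, x2 = toy$x))  # x effectively counted twice
round(as.matrix(d_orig), 2)
round(as.matrix(d_dup), 2)              # distances grow mainly for pairs that differ in x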

Two relationships are visible at a glance: PURCHASES is the sum of ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES, and an analogous dependence links PURCHASES_FREQUENCY with ONEOFF_PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY.

colnames(data)[-1] <- c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16", "V17")
library(corrplot)
## corrplot 0.84 loaded
# building a numeric matrix of the features (the CUST_ID column is excluded)
data_matrix <- data.matrix(data[,-1], rownames.force = NA)
M <- cor(data_matrix)
corrplot(M, method = "number", number.cex = 0.6, order = "AOE")

In the dataset we observe mainly positively correlated variables. As mentioned earlier, PURCHASES, ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES are highly correlated. High correlations also occur between CASH_ADV, CASH_ADV_FREQ and CASH_ADV_TRX, as well as between PURCH_INST_FREQ and PURCH_FREQ.

Clustering

Due to the outliers, I will rely on the k-medoids algorithm, as it is more robust to noise than the very popular k-means algorithm. The most common k-medoids algorithm is Partitioning Around Medoids (PAM). Although PAM performs well on small datasets, it does not scale to large ones, so for the analysed dataset it is more effective to use CLARA (Clustering Large Applications). CLARA draws samples of the data and applies PAM to each sample to find an optimal set of medoids. The quality of the resulting medoids is measured by the average dissimilarity between every object in the entire dataset and the medoid of its cluster, defined as the cost function. CLARA repeats the sampling and clustering a pre-specified number of times in order to minimize sampling bias; the final result corresponds to the set of medoids with the minimal cost.
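The same sample-then-PAM mechanism is also exposed by cluster::clara(); the short sketch below is purely illustrative (the sample size and standardisation are my own choices, not the settings used later with ClusterR).

library(cluster)
# clara() draws `samples` subsets of size `sampsize`, runs PAM on each,
# and keeps the medoid set with the lowest average dissimilarity on the full data
cl_demo <- clara(scale(data[,-1]), k = 2, metric = "euclidean",
                 samples = 5, sampsize = 500, pamLike = TRUE)
cl_demo$sample    # indices of the observations in the best sample
cl_demo$clusinfo  # per-cluster size and dissimilarity summary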

Bearing in mind the correlated variables, I will compare two approaches to see which one performs better. In the first, I will pass all variables to the clustering algorithm and run it with two distance metrics that take correlation into account: the Pearson correlation distance and the Mahalanobis distance. The second approach will be based on principal components: after reducing the dimensionality, I will run CLARA on the retained components.

For most of the analysis I will use the ClusterR package, since it provides all the algorithms and distance metrics I need. The package documentation is available at https://cran.r-project.org/web/packages/ClusterR/ClusterR.pdf.

First approach - passing all variables to the clustering algorithm

The first step before applying any clustering algorithm is finding the optimal number of clusters. Below, the average silhouette criterion is computed for both k-means and CLARA.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
a <- fviz_nbclust(data[,-1], FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(data[,-1], FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
grid.arrange(a, b, nrow=2)

According to the silhouette statistic, the optimal number of clusters is 2. Further analysis will therefore assume two customer segments.
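The same criterion can be checked by hand; the sketch below is my own cross-check, assuming the silinfo component documented for cluster::clara objects, and computes the average silhouette width for several candidate numbers of clusters.

library(cluster)
# average silhouette width of the best CLARA sample for k = 2..6
avg_sil <- sapply(2:6, function(k)
  clara(scale(data[,-1]), k, samples = 5, sampsize = 500)$silinfo$avg.width)
names(avg_sil) <- paste0("k=", 2:6)
round(avg_sil, 3)  # the largest value points to the preferred number of clusters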

The first algorithm will be CLARA with Pearson's correlation as the distance metric.

library(ClusterR)
## Loading required package: gtools
start = Sys.time()
clm_pearson <- Clara_Medoids(data_matrix, 2, distance_metric = c("pearson_correlation"), samples = 5, sample_size = 0.2, swap_phase = TRUE, verbose = T)
##  
## sample 1 -->  local (sample) dissimilarity : 228.768 -->  global dissimilarity : 1114.78
## sample 2 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
## sample 3 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
## sample 4 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
## sample 5 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
##  
##  ====================================================== 
##  --- best sample: 1 --->  global dissimilarity: 1114.78
##  ======================================================
end = Sys.time()
t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 8.390749 secs

CLARA with Pearson's correlation as the distance metric produced two clusters; in the best sample (20% of the data) the first cluster contains 870 customers and the second 920. The silhouette values for the first and second cluster are 0.725 and 0.356 respectively, giving an average silhouette of about 0.535, which indicates a reasonable, though not strong, cluster structure.
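The figures above can be read off the fitted object; the calls below are a sketch based on the element names listed in the ClusterR documentation (clusters, silhouette_matrix) and the package's Silhouette_Dissimilarity_Plot() helper.

# inspecting the Clara_Medoids result (element names per the ClusterR documentation)
table(clm_pearson$clusters)          # cluster sizes over the full dataset
head(clm_pearson$silhouette_matrix)  # per-observation silhouette information (best sample)
Silhouette_Dissimilarity_Plot(clm_pearson, silhouette = TRUE)  # silhouette plot per cluster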

The second algorithm will be CLARA with the Mahalanobis distance, again assuming two clusters.

start = Sys.time()
clm_mahal <- Clara_Medoids(data_matrix, 2, distance_metric = c("mahalanobis"), samples = 5, sample_size = 0.2, swap_phase = TRUE, verbose = T)
##  
## sample 1 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 2 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 3 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 4 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 5 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
##  
##  ====================================================== 
##  --- best sample: 1 --->  global dissimilarity: 32440.8
##  ======================================================
end = Sys.time()
t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 11.12679 secs

For CLARA with the Mahalanobis distance, the first cluster contains 1090 objects and the second 700 (again counted within the best sample). The silhouette statistic is low in both clusters, which yields a low average silhouette of 0.069. This algorithm clearly produces a worse clustering than the previous one.

Below are basic statistics of the two clusterings above. For each cluster the output reports the number of observations (over the full dataset), the maximum dissimilarity, the average dissimilarity and the isolation.

clm_pearson$clustering_stats
##   clusters number_obs max_dissimilarity average_dissimilarity isolation
## 1        2       4591         0.9496998            0.07162453  2.619020
## 2        3       4358         1.0135705            0.18034747  2.795158
clm_mahal$clustering_stats
##   clusters number_obs max_dissimilarity average_dissimilarity isolation
## 1        2       5597          61.71249              3.525857  27.17082
## 2        3       3352          41.43548              3.790758  18.24324

The obtained statistics confirm the earlier conclusions: of these two algorithms, CLARA with Pearson's correlation as the distance metric is clearly the better one.

The next step is to apply a dimensionality reduction algorithm and run CLARA on the obtained principal components.

Second approach - dimensionality reduction

Dimensionality reduction aims to reduce the number of variables in the dataset and, at the same time, the multicollinearity among them. There are many algorithms for this task, divided into supervised and unsupervised techniques. Since the paper is focused on unsupervised learning and the data carry no class labels, I will concentrate on Principal Component Analysis; other unsupervised techniques, such as Multidimensional Scaling, hierarchical methods or autoencoders, could serve the same purpose.

In order to reduce the dimensions, I will use PCA. Several R functions are available; the most popular are stats::prcomp() and stats::princomp(). The main difference between them is that prcomp() is based on the singular value decomposition of the data matrix, while princomp() uses the spectral decomposition of the covariance (or correlation) matrix. The R documentation recommends prcomp(), mainly for numerical accuracy.
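The SVD connection is easy to verify: for a centred and scaled data matrix, the standard deviations reported by prcomp() equal the singular values divided by sqrt(n - 1). A small sketch of mine:

# prcomp() is essentially an SVD of the centred (and here scaled) data matrix
X  <- scale(data[,-1])                          # centre and scale the 17 features
sv <- svd(X)$d / sqrt(nrow(X) - 1)              # singular values rescaled to standard deviations
pc <- prcomp(data[,-1], center = TRUE, scale. = TRUE)$sdev
all.equal(sv, pc)                               # TRUE up to numerical error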

In this analysis, I will use prcomp() function. Since the data are not on the same scale, argument scale = TRUE should be added.

data.pca <- prcomp(data[,-1], center = TRUE, scale = TRUE)
summary(data.pca)
## Importance of components:
##                          PC1    PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.154 1.8586 1.22482 1.1276 1.02707 0.98720 0.91101
## Proportion of Variance 0.273 0.2032 0.08825 0.0748 0.06205 0.05733 0.04882
## Cumulative Proportion  0.273 0.4762 0.56444 0.6392 0.70129 0.75862 0.80744
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.85733 0.80175 0.7236 0.63509 0.54907 0.49285 0.45478
## Proportion of Variance 0.04324 0.03781 0.0308 0.02373 0.01773 0.01429 0.01217
## Cumulative Proportion  0.85068 0.88849 0.9193 0.94301 0.96075 0.97504 0.98720
##                           PC15    PC16     PC17
## Standard deviation     0.41489 0.21307 0.003413
## Proportion of Variance 0.01013 0.00267 0.000000
## Cumulative Proportion  0.99733 1.00000 1.000000

The summary shows three statistics for each component: the standard deviation, the proportion of variance explained and the cumulative proportion. From the output we can see that PC1 explains about 27% of the variance, PC2 about 20%, PC3 about 9%, and so on. Moreover, besides producing the principal components, the algorithm also removes the correlation between features, since the components are orthogonal and therefore uncorrelated.
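The same proportions can be recomputed directly from the standard deviations stored in the prcomp object:

# variance explained, recomputed from data.pca$sdev
var_expl <- data.pca$sdev^2 / sum(data.pca$sdev^2)
round(var_expl[1:5], 3)        # 0.273, 0.203, 0.088, 0.075, 0.062 as in the summary above
round(cumsum(var_expl)[5], 3)  # about 0.70 of the variance in the first five components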

data_matrix <- data.matrix(data.pca$x, rownames.force = NA)
M <- cor(data_matrix)
corrplot(M, method = "number", number.cex = 0.75)

In order to reduce dimensionality, the task now is to choose the optimal number of components to retain, which is the topic of the next section.

Choosing the number of principal components to retain

According to Zwick and Velicer, there are at least five methods to determine the number of components to retain, among others Bartlett's test, Kaiser's criterion, the Minimum Average Partial (MAP) method, the scree test and parallel analysis (PA). Since the MAP method does not have a convenient R implementation, I will omit it.

The first method is Bartlett's test. According to Zwick and Velicer, the null hypothesis of the test is that the remaining P - m eigenvalues are equal. Each eigenvalue is excluded sequentially until the approximate chi-square test of the null hypothesis of equality fails to be rejected; the first m excluded components are retained. R provides the function nBartlett() from the nFactors package to conduct the test (https://cran.r-project.org/web/packages/nFactors/nFactors.pdf).

library(nFactors)
## Loading required package: lattice
## 
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
## 
##     parallel
nBartlett(data[,-1], nrow(data[,-1]), alpha=0.05, details=TRUE)
## bartlett anderson   lawley 
##       16       16       16

All three variants of the test (bartlett, anderson, lawley) indicate retaining 16 of the 17 components, so the test does not suggest any meaningful reduction of dimensionality.

The next method is Kaiser's criterion, which says that one should retain the components whose eigenvalues are greater than 1. To calculate the eigenvalues, I will use get_eigenvalue() from the factoextra package.

library(factoextra)
eig.val <- get_eigenvalue(data.pca)
eig.val
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  4.640880e+00     2.729930e+01                    27.29930
## Dim.2  3.454466e+00     2.032039e+01                    47.61968
## Dim.3  1.500177e+00     8.824573e+00                    56.44426
## Dim.4  1.271589e+00     7.479937e+00                    63.92419
## Dim.5  1.054863e+00     6.205076e+00                    70.12927
## Dim.6  9.745631e-01     5.732724e+00                    75.86199
## Dim.7  8.299380e-01     4.881988e+00                    80.74398
## Dim.8  7.350087e-01     4.323581e+00                    85.06756
## Dim.9  6.428031e-01     3.781195e+00                    88.84876
## Dim.10 5.236154e-01     3.080091e+00                    91.92885
## Dim.11 4.033368e-01     2.372569e+00                    94.30142
## Dim.12 3.014818e-01     1.773422e+00                    96.07484
## Dim.13 2.429048e-01     1.428852e+00                    97.50369
## Dim.14 2.068286e-01     1.216639e+00                    98.72033
## Dim.15 1.721326e-01     1.012545e+00                    99.73287
## Dim.16 4.539988e-02     2.670581e-01                    99.99993
## Dim.17 1.164913e-05     6.852429e-05                   100.00000

Since the first eigenvalue (4.64) is greater than 1, there is support to retain the first component. The same holds for the second (3.45), third (1.50), fourth (1.27) and fifth (1.05) components. According to the rule, the remaining components should not be used in further analysis.
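The rule can also be applied programmatically to the table above:

# Kaiser's criterion: count components with eigenvalue greater than 1
sum(eig.val[, "eigenvalue"] > 1)  # returns 5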

The next method is the scree test, which is based on a plot of the eigenvalues. In this case it leads to the same conclusion as Kaiser's criterion. There are several ways to draw the so-called scree plot; one of them is scree.plot() from the psy package.

library(psy)
scree.plot(eig.val[,1], title = "Scree Plot", type = "E",  simu = "P")

The last method is parallel analysis. R provides the function hornpa(), which conducts the analysis and returns easy-to-interpret results. To decide which components to retain, one compares the observed eigenvalues with the numbers in the column '0.95'. If an eigenvalue is higher than the corresponding simulated value, there is support to retain that component.

library(hornpa)
hornpa(k=17,size=8949,reps=500,seed=123)
## 
##  Parallel Analysis Results  
##  
## Method: pca 
## Number of variables: 17 
## Sample size: 8949 
## Number of correlation matrices: 500 
## Seed: 123 
## Percentile: 0.95 
##  
## Compare your observed eigenvalues from your original dataset to the 95 percentile in the table below generated using random data. If your eigenvalue is greater than the percentile indicated (not the mean), you have support to retain that factor/component. 
##  
##  Component  Mean  0.95
##          1 1.075 1.088
##          2 1.061 1.071
##          3 1.050 1.058
##          4 1.040 1.048
##          5 1.031 1.039
##          6 1.023 1.030
##          7 1.015 1.022
##          8 1.007 1.014
##          9 1.000 1.006
##         10 0.992 0.998
##         11 0.984 0.990
##         12 0.976 0.983
##         13 0.968 0.976
##         14 0.960 0.967
##         15 0.951 0.959
##         16 0.940 0.950
##         17 0.928 0.940

In our case, the method indicates that five principal components should be considered in the further analysis.
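Making the comparison explicit (the thresholds below are copied from the '0.95' column of the hornpa() output above):

# observed eigenvalues vs. the simulated 95th-percentile thresholds
observed  <- eig.val[1:6, "eigenvalue"]
threshold <- c(1.088, 1.071, 1.058, 1.048, 1.039, 1.030)
observed > threshold       # TRUE for the first five components, FALSE for the sixth
sum(observed > threshold)  # 5 components to retain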

To sum up, three of the methods proved conclusive, and all of them indicated retaining five principal components.

Graphical analysis

In order to make the graphical analysis easier, from now on I will examine the components two at a time.

Visualization of variables' loadings

The first plot shows the variable loadings, i.e. the coefficients that express the principal components as linear combinations of the original variables.

library(factoextra)
p12 <- fviz_pca_var(data.pca, col.var = "navy", repel = TRUE, axes = c(1, 2)) +
  labs(title="PCA", x="PC1", y="PC2")
p45 <- fviz_pca_var(data.pca, col.var = "navy", repel = TRUE, axes = c(4, 5)) +
  labs(title="PCA", x="PC4", y="PC5")
grid.arrange(p12, p45, nrow = 1)

The plots above show the correlations between the original variables as well as the strength of each variable's contribution to the given principal components. From the first plot (PC1, PC2) one can see, for example, that cash-advance frequency and the average amount per purchase transaction are positively correlated, while cash-advance frequency and the percentage of months with full payment of the due statement balance appear to be negatively correlated. Analogous conclusions can be drawn from the second plot (PC4, PC5): variables grouped together are positively correlated, while variables on opposite sides of the plot are negatively correlated. The length of a vector indicates how strongly the variable contributes to the displayed components. This is confirmed by the plots below, which present the contributions of the variables to PC1, PC2, PC3, PC4 and PC5.

library(gridExtra)
cont1 <- fviz_contrib(data.pca, "var", axes = 1)
cont2 <- fviz_contrib(data.pca, "var", axes = 2)
cont3 <- fviz_contrib(data.pca, "var", axes = 3)
cont4 <- fviz_contrib(data.pca, "var", axes = 4)
cont5 <- fviz_contrib(data.pca, "var", axes = 5)
grid.arrange(cont1, cont2, cont3, cont4, cont5)

The charts above clearly show which variables contribute the most to each component (a programmatic cross-check is sketched after the lists below).

PC1 consists of:
- Total purchase amount spent during last 12 months
- Average amount per purchase transaction
- Total amount of one-off purchases
- Total amount of installment purchases
- Frequency of purchases (percentage of months with at least one purchase)
- Frequency of one-off purchases
- Frequency of installment purchases
- Total payments (due amount paid by the customer to decrease their statement balance) in the period

PC2 consists of:
- Total cash-advance amount
- Cash-advance frequency
- Average amount per purchase transaction
- Monthly average balance (based on daily balance averages)
- Credit limit
- Total payments (due amount paid by the customer to decrease their statement balance) in the period

PC3 consists of:
- Frequency of installment purchases
- Ratio of last 12 months with balance
- Total amount of one-off purchases
- Frequency of purchases (percentage of months with at least one purchase)
- Total payments (due amount paid by the customer to decrease their statement balance) in the period
- Total purchase amount spent during last 12 months
- Total minimum payments due in the period

PC4 consists of:
- Number of months as a customer
- Percentage of months with full payment of the due statement balance
- Total minimum payments due in the period
- Average amount per cash-advance transaction
- Monthly average balance (based on daily balance averages)
- Total cash-advance amount
- Cash-advance frequency

PC5 consists of:
- Frequency of one-off purchases
- Ratio of last 12 months with balance
- Total minimum payments due in the period
- Total amount of installment purchases
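The lists above can be cross-checked programmatically; this is a sketch using factoextra's get_pca_var(), whose contrib element stores each variable's percentage contribution to each component (note that at this stage the variables carry the shortened names V1-V17).

# top contributing variables per component, extracted from the PCA object
var_contrib <- factoextra::get_pca_var(data.pca)$contrib
top_vars <- apply(var_contrib[, 1:5], 2, function(v)
  rownames(var_contrib)[order(v, decreasing = TRUE)[1:5]])
top_vars  # five strongest contributors to each of PC1-PC5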

One can also visualize individual observations and the strength of their contributions to the principal components, choosing which observations to display, for instance the top contributing individuals. The plot below shows the 50 top contributing observations.

Visualization of observations’ loadings

The plot shows the observations' scores on the first two principal components; the most contributing observations lie relatively far from the origin of the coordinate system.

fviz_pca_ind(data.pca, col.ind = "navy", repel = TRUE, select.ind = list(contrib=50), axes = c(1, 2)) +
  labs(title="PCA", x="PC1", y="PC2")

Clustering of principal components

Finally, after the analysis of the principal components, I will conduct clustering based on the five retained components. I will use the same function as before, Clara_Medoids, this time with the Euclidean distance.

pca_data <- data.pca$x[,1:5]
start = Sys.time()
clm_eucl <- Clara_Medoids(pca_data, 2, distance_metric = c("euclidean"), samples = 5, sample_size = 0.2, swap_phase = TRUE, verbose = T)
##  
## sample 1 -->  local (sample) dissimilarity : 4363.67 -->  global dissimilarity : 21728.4
## sample 2 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
## sample 3 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
## sample 4 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
## sample 5 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
##  
##  ====================================================== 
##  --- best sample: 1 --->  global dissimilarity: 21728.4
##  ======================================================
end = Sys.time()
t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 2.044902 secs

CLARA with the Euclidean distance also produced two clusters; in the best sample the first cluster contains 799 customers and the second 991. The silhouette values for the first and second cluster are 0.221 and 0.306 respectively, giving an average silhouette of 0.268. This is about half the value obtained when clustering the full set of variables with CLARA and the Pearson correlation distance.

The result obtained with PCA is therefore not very satisfying: the average silhouette is lower than for clustering on the original variables. This is probably because the five chosen principal components explain only about 70% of the variance in the data; the presence of outliers may also play a role.
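To probe the first explanation, one can check how the average silhouette behaves as more components are retained. This is a diagnostic sketch of mine (again relying on cluster::clara and its documented silinfo component), not part of the original pipeline.

library(cluster)
# average silhouette width when clustering on an increasing number of components
sil_by_dim <- sapply(2:10, function(d)
  clara(data.pca$x[, 1:d], k = 2, samples = 5, sampsize = 500)$silinfo$avg.width)
names(sil_by_dim) <- paste0("PC1-", 2:10)
round(sil_by_dim, 3)  # if the values stay low, retaining more components alone does not help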

Conclusions

Comparing all the clustering runs, CLARA with Pearson's correlation as the distance metric achieved the best average silhouette, so in terms of cluster quality it is the preferred choice for segmenting the customers. However, considering run time, CLARA on the retained principal components is almost 4 times faster than CLARA on the whole set of variables, which is a substantial advantage for large datasets.

In order to choose the best algorithm overall, one has to specify the goal. If the aim is to describe every customer well, the task is harder and a more complex clustering method may be needed. If covering only part of the customer base is sufficient, the PCA-based algorithm can work, since it groups the most similar customers, who nonetheless make up over 50% of the population.