Dimensionality reduction techniques and clustering

The dataset used in the analysis consists of behavioral, unlabeled data on credit card usage. The main goal is to identify customer segments that fit the data well by means of clustering. Given the dataset's characteristics, namely high dimensionality and the presence of correlated variables, I will compare clustering on the original data with the results obtained after applying a dimensionality reduction algorithm that also removes the correlation between variables.

Dataset description

The chosen dataset contains 8,950 observations and 18 variables. It was provided by the Kaggle user arjunbhasin2013. The table below previews the first ten observations.

library(readr)
library(kableExtra)
# reading the dataset
data <- read.csv("C:/Users/user/Desktop/NewProject/CC GENERAL.csv")
# displaying the dataset
kable(head(data[-1], n = 10)) %>%
  kable_styling(bootstrap_options = c("hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "90%")
BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
40.90075 0.818182 95.40 0.00 95.40 0.000 0.166667 0.000000 0.083333 0.000000 0 2 1000 201.8021 139.5098 0.000000 12
3202.46742 0.909091 0.00 0.00 0.00 6442.945 0.000000 0.000000 0.000000 0.250000 4 0 7000 4103.0326 1072.3402 0.222222 12
2495.14886 1.000000 773.17 773.17 0.00 0.000 1.000000 1.000000 0.000000 0.000000 0 12 7500 622.0667 627.2848 0.000000 12
1666.67054 0.636364 1499.00 1499.00 0.00 205.788 0.083333 0.083333 0.000000 0.083333 1 1 7500 0.0000 NA 0.000000 12
817.71434 1.000000 16.00 16.00 0.00 0.000 0.083333 0.083333 0.000000 0.000000 0 1 1200 678.3348 244.7912 0.000000 12
1809.82875 1.000000 1333.28 0.00 1333.28 0.000 0.666667 0.000000 0.583333 0.000000 0 8 1800 1400.0578 2407.2460 0.000000 12
627.26081 1.000000 7091.01 6402.63 688.38 0.000 1.000000 1.000000 1.000000 0.000000 0 64 13500 6354.3143 198.0659 1.000000 12
1823.65274 1.000000 436.20 0.00 436.20 0.000 1.000000 0.000000 1.000000 0.000000 0 12 2300 679.0651 532.0340 0.000000 12
1014.92647 1.000000 861.49 661.49 200.00 0.000 0.333333 0.083333 0.250000 0.000000 0 5 7000 688.2786 311.9634 0.000000 12
152.22598 0.545455 1281.60 1281.60 0.00 0.000 0.166667 0.166667 0.000000 0.000000 0 3 11000 1164.7706 100.3023 0.000000 12

Exploratory data analysis

In the first step, I will focus on basic statistics of the features, search for missing data and potential outliers, and examine the correlations between variables.

# shortening variables' names
colnames(data) <- c("CUST_ID", "BAL","BAL_FREQ", "PURCH", "ONEOFF_PURCH", "INST_PURCH", "CASH_ADV", "PURCH_FREQ","ONEOFF_PURCH_FREQ", "PURCH_INST_FREQ", "CASH_ADV_FREQ","CASH_ADV_TRX", "PURCH_TRX", "CREDIT_LIMIT", "PAY", "MIN_PAY", "PRC_FULL_PAY", "TENURE")

# displaying basic statistics
kable(summary(data[,-1])) %>% 
  kable_styling(bootstrap_options = c("hover", "condensed", "responsive")) %>%
  scroll_box(width = "100%", height = "100%")
BAL BAL_FREQ PURCH ONEOFF_PURCH INST_PURCH CASH_ADV PURCH_FREQ ONEOFF_PURCH_FREQ PURCH_INST_FREQ CASH_ADV_FREQ CASH_ADV_TRX PURCH_TRX CREDIT_LIMIT PAY MIN_PAY PRC_FULL_PAY TENURE
Min. : 0.0 Min. :0.0000 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. : 50 Min. : 0.0 Min. : 0.02 Min. :0.0000 Min. : 6.00
1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.08333 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1600 1st Qu.: 383.3 1st Qu.: 169.12 1st Qu.:0.0000 1st Qu.:12.00
Median : 873.4 Median :1.0000 Median : 361.28 Median : 38.0 Median : 89.0 Median : 0.0 Median :0.50000 Median :0.08333 Median :0.1667 Median :0.0000 Median : 0.000 Median : 7.00 Median : 3000 Median : 856.9 Median : 312.34 Median :0.0000 Median :12.00
Mean : 1564.5 Mean :0.8773 Mean : 1003.20 Mean : 592.4 Mean : 411.1 Mean : 978.9 Mean :0.49035 Mean :0.20246 Mean :0.3644 Mean :0.1351 Mean : 3.249 Mean : 14.71 Mean : 4494 Mean : 1733.1 Mean : 864.21 Mean :0.1537 Mean :11.52
3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8 3rd Qu.:0.91667 3rd Qu.:0.30000 3rd Qu.:0.7500 3rd Qu.:0.2222 3rd Qu.: 4.000 3rd Qu.: 17.00 3rd Qu.: 6500 3rd Qu.: 1901.1 3rd Qu.: 825.49 3rd Qu.:0.1429 3rd Qu.:12.00
Max. :19043.1 Max. :1.0000 Max. :49039.57 Max. :40761.2 Max. :22500.0 Max. :47137.2 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.5000 Max. :123.000 Max. :358.00 Max. :30000 Max. :50721.5 Max. :76406.21 Max. :1.0000 Max. :12.00
NA NA NA NA NA NA NA NA NA NA NA NA NA’s :1 NA NA’s :313 NA NA

The dataset consists mostly of continuous variables, except for CASH_ADVANCE_TRX, PURCHASES_TRX and TENURE, which are integer-valued. There is only 1 NA value in CREDIT_LIMIT and 313 NA values in MINIMUM_PAYMENTS. Since there are few missing values and the dataset is large, these observations could simply be deleted.

Starting with the MINIMUM_PAYMENTS variable, note that the NA values occur only when the PAYMENTS variable is equal to zero. In this situation I will therefore simply replace the missing values with zeros.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# filtering observations for which PAYMENTS is equal to zero and comparing them with MINIMUM PAYMENTS variable
data1 <- data %>% 
           select(PAY, MIN_PAY) %>% 
           filter(PAY == 0)
# displaying first five observations after the filtering process
head(data1, n = 5)
##   PAY MIN_PAY
## 1   0      NA
## 2   0      NA
## 3   0      NA
## 4   0      NA
## 5   0      NA
data$MIN_PAY[is.na(data$MIN_PAY)] <- 0
summary(data$MIN_PAY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   163.0   289.6   834.0   788.7 76406.2

In the case of CREDIT_LIMIT, I will simply delete the observation with the missing value; since it is a single row out of almost 9,000, removing it will not noticeably affect the dataset.

data <- data[!(is.na(data$CREDIT_LIMIT)),]

In order to look for potential outliers, I will use a univariate approach and apply the IQR rule. This rule flags observations in both tails of a variable's distribution, i.e. values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

The following loop prints each variable together with the number of potential outliers among its values.

vars <- c(colnames(data)[-1])
Outliers <- c()
for(i in vars){
  max <- quantile(data[,i], 0.75) + (IQR(data[,i]) * 1.5 )
  min <- quantile(data[,i], 0.25) - (IQR(data[,i]) * 1.5 )
  idx <- which(data[,i] < min | data[,i] > max)
  print(paste(i, length(idx), sep=' ')) # printing variable and number of potential outliers 
  Outliers <- c(Outliers, idx) 
}
## [1] "BAL 695"
## [1] "BAL_FREQ 1492"
## [1] "PURCH 808"
## [1] "ONEOFF_PURCH 1013"
## [1] "INST_PURCH 867"
## [1] "CASH_ADV 1030"
## [1] "PURCH_FREQ 0"
## [1] "ONEOFF_PURCH_FREQ 782"
## [1] "PURCH_INST_FREQ 0"
## [1] "CASH_ADV_FREQ 525"
## [1] "CASH_ADV_TRX 804"
## [1] "PURCH_TRX 766"
## [1] "CREDIT_LIMIT 248"
## [1] "PAY 808"
## [1] "MIN_PAY 899"
## [1] "PRC_FULL_PAY 1474"
## [1] "TENURE 1365"

The number of potential outliers is high. Thus, a sensible approach is to use a clustering algorithm that is robust to outliers.

The last step of the exploratory analysis is the correlation matrix. Inspecting the relationships between variables is important because passing highly correlated or collinear variables to a clustering algorithm may distort the resulting clusters; how strongly depends on the magnitude of the correlation. High correlation between two variables means that they follow similar trends and are likely to carry similar information. If two variables are perfectly correlated, the concept they represent enters the data twice, and the final solution is likely to be skewed towards that concept; in effect, an artificial feature has been created.
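To see why this matters for distance-based methods, here is a toy sketch (my own illustration, not part of the original analysis): duplicating a single variable stretches all pairwise distances in its direction, which pulls the clustering towards the duplicated concept.

# Toy illustration: duplicating a variable doubles its weight in the Euclidean distance
set.seed(1)
toy    <- data.frame(x = rnorm(5), y = rnorm(5))
d_orig <- dist(toy)                     # distance based on x and y counted once
d_dup  <- dist(cbind(toy, x2 = toy$x))  # x effectively counted twice
round(as.matrix(d_orig), 2)
round(as.matrix(d_dup), 2)              # distances grow mainly for pairs that differ in x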

Two relationships are visible at a glance: PURCHASES is the sum of ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES, and an analogous dependence links PURCHASES_FREQUENCY with ONEOFF_PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY.

colnames(data)[-1] <- c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16", "V17")
library(corrplot)
## corrplot 0.84 loaded
# building a numeric matrix of the features (the CUST_ID column is excluded)
data_matrix <- data.matrix(data[,-1], rownames.force = NA)
M <- cor(data_matrix)
corrplot(M, method = "number", number.cex = 0.6, order = "AOE")

In the dataset we observe mainly positively correlated variables. As mentioned earlier, PURCHASES, ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES are highly correlated. High correlations also occur between CASH_ADV, CASH_ADV_FREQ and CASH_ADV_TRX, as well as between PURCH_INST_FREQ and PURCH_FREQ.

Clustering

Due to the outliers, I will rely on the k-medoids algorithm, as it is more robust to noise than the very popular k-means algorithm. The most common k-medoids algorithm is Partitioning Around Medoids (PAM). Although PAM performs well on small datasets, it does not scale to large ones, so for the analysed dataset it is more effective to use CLARA (Clustering Large Applications). CLARA draws samples of the data and applies PAM to each sample to find an optimal set of medoids. The quality of the resulting medoids is measured by the average dissimilarity between every object in the entire dataset and the medoid of its cluster, defined as the cost function. CLARA repeats the sampling and clustering a pre-specified number of times in order to minimize sampling bias; the final result corresponds to the set of medoids with the minimal cost.
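The same sample-then-PAM mechanism is also exposed by cluster::clara(); the short sketch below is purely illustrative (the sample size and standardisation are my own choices, not the settings used later with ClusterR).

library(cluster)
# clara() draws `samples` subsets of size `sampsize`, runs PAM on each,
# and keeps the medoid set with the lowest average dissimilarity on the full data
cl_demo <- clara(scale(data[,-1]), k = 2, metric = "euclidean",
                 samples = 5, sampsize = 500, pamLike = TRUE)
cl_demo$sample    # indices of the observations in the best sample
cl_demo$clusinfo  # per-cluster size and dissimilarity summary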

Bearing in mind the correlated variables, I will compare two approaches to see which one performs better. In the first, I will pass all variables to the clustering algorithm and run it with two distance metrics that take correlation into account: the Pearson correlation distance and the Mahalanobis distance. The second approach will be based on principal components: after reducing the dimensionality, I will run CLARA on the retained components.

For most of the analysis I will use the ClusterR package, since it provides all the algorithms and distance metrics I need. The package documentation is available at https://cran.r-project.org/web/packages/ClusterR/ClusterR.pdf.

First approach - passing all variables to the clustering algorithm

The first step before applying any clustering algorithm is finding the optimal number of clusters. Below, the average silhouette criterion is computed for both k-means and CLARA.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
a <- fviz_nbclust(data[,-1], FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(data[,-1], FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
grid.arrange(a, b, nrow=2)

According to the silhouette statistic, the optimal number of clusters is 2. Further analysis will therefore assume two customer segments.
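The same criterion can be checked by hand; the sketch below is my own cross-check, assuming the silinfo component documented for cluster::clara objects, and computes the average silhouette width for several candidate numbers of clusters.

library(cluster)
# average silhouette width of the best CLARA sample for k = 2..6
avg_sil <- sapply(2:6, function(k)
  clara(scale(data[,-1]), k, samples = 5, sampsize = 500)$silinfo$avg.width)
names(avg_sil) <- paste0("k=", 2:6)
round(avg_sil, 3)  # the largest value points to the preferred number of clusters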

The first algorithm will be CLARA with Pearson's correlation as the distance metric.

library(ClusterR)
## Loading required package: gtools
start = Sys.time()
clm_pearson <- Clara_Medoids(data_matrix, 2, distance_metric = c("pearson_correlation"), samples = 5, sample_size = 0.2, swap_phase = TRUE, verbose = T)
##  
## sample 1 -->  local (sample) dissimilarity : 228.768 -->  global dissimilarity : 1114.78
## sample 2 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
## sample 3 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
## sample 4 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
## sample 5 -->  local (sample) dissimilarity : 229.83 -->  global dissimilarity : 1121.28
##  
##  ====================================================== 
##  --- best sample: 1 --->  global dissimilarity: 1114.78
##  ======================================================
end = Sys.time()
t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 8.390749 secs

CLARA with Pearson's correlation as the distance metric produced two clusters; in the best sample (20% of the data) the first cluster contains 870 customers and the second 920. The silhouette values for the first and second cluster are 0.725 and 0.356 respectively, giving an average silhouette of about 0.535, which indicates a reasonable, though not strong, cluster structure.
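The figures above can be read off the fitted object; the calls below are a sketch based on the element names listed in the ClusterR documentation (clusters, silhouette_matrix) and the package's Silhouette_Dissimilarity_Plot() helper.

# inspecting the Clara_Medoids result (element names per the ClusterR documentation)
table(clm_pearson$clusters)          # cluster sizes over the full dataset
head(clm_pearson$silhouette_matrix)  # per-observation silhouette information (best sample)
Silhouette_Dissimilarity_Plot(clm_pearson, silhouette = TRUE)  # silhouette plot per cluster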

The second algorithm will be CLARA with the Mahalanobis distance, again assuming two clusters.

start = Sys.time()
clm_mahal <- Clara_Medoids(data_matrix, 2, distance_metric = c("mahalanobis"), samples = 5, sample_size = 0.2, swap_phase = TRUE, verbose = T)
##  
## sample 1 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 2 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 3 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 4 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
## sample 5 -->  local (sample) dissimilarity : 6557.47 -->  global dissimilarity : 32440.8
##  
##  ====================================================== 
##  --- best sample: 1 --->  global dissimilarity: 32440.8
##  ======================================================
end = Sys.time()
t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 11.12679 secs

For CLARA with the Mahalanobis distance, the first cluster contains 1090 objects and the second 700 (again counted within the best sample). The silhouette statistic is low in both clusters, which yields a low average silhouette of 0.069. This algorithm clearly produces a worse clustering than the previous one.

Below are basic statistics of the two clusterings above. For each cluster the output reports the number of observations (over the full dataset), the maximum dissimilarity, the average dissimilarity and the isolation.

clm_pearson$clustering_stats
##   clusters number_obs max_dissimilarity average_dissimilarity isolation
## 1        2       4591         0.9496998            0.07162453  2.619020
## 2        3       4358         1.0135705            0.18034747  2.795158
clm_mahal$clustering_stats
##   clusters number_obs max_dissimilarity average_dissimilarity isolation
## 1        2       5597          61.71249              3.525857  27.17082
## 2        3       3352          41.43548              3.790758  18.24324

The obtained statistics confirm the earlier conclusions: of these two algorithms, CLARA with Pearson's correlation as the distance metric is clearly the better one.

The next step is to apply a dimensionality reduction algorithm and run CLARA on the obtained principal components.

Second approach - dimensionality reduction

Dimensionality reduction aims to reduce the number of variables in the dataset and, at the same time, the multicollinearity among them. There are many algorithms for this task, divided into supervised and unsupervised techniques. Since the paper is focused on unsupervised learning and the data carry no class labels, I will concentrate on Principal Component Analysis; other unsupervised techniques, such as Multidimensional Scaling, hierarchical methods or autoencoders, could serve the same purpose.

In order to reduce the dimensions, I will use PCA. Several R functions are available; the most popular are stats::prcomp() and stats::princomp(). The main difference between them is that prcomp() is based on the singular value decomposition of the data matrix, while princomp() uses the spectral decomposition of the covariance (or correlation) matrix. The R documentation recommends prcomp(), mainly for numerical accuracy.
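The SVD connection is easy to verify: for a centred and scaled data matrix, the standard deviations reported by prcomp() equal the singular values divided by sqrt(n - 1). A small sketch of mine:

# prcomp() is essentially an SVD of the centred (and here scaled) data matrix
X  <- scale(data[,-1])                          # centre and scale the 17 features
sv <- svd(X)$d / sqrt(nrow(X) - 1)              # singular values rescaled to standard deviations
pc <- prcomp(data[,-1], center = TRUE, scale. = TRUE)$sdev
all.equal(sv, pc)                               # TRUE up to numerical error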

In this analysis, I will use prcomp() function. Since the data are not on the same scale, argument scale = TRUE should be added.

data.pca <- prcomp(data[,-1], center = TRUE, scale = TRUE)
summary(data.pca)
## Importance of components:
##                          PC1    PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.154 1.8586 1.22482 1.1276 1.02707 0.98720 0.91101
## Proportion of Variance 0.273 0.2032 0.08825 0.0748 0.06205 0.05733 0.04882
## Cumulative Proportion  0.273 0.4762 0.56444 0.6392 0.70129 0.75862 0.80744
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.85733 0.80175 0.7236 0.63509 0.54907 0.49285 0.45478
## Proportion of Variance 0.04324 0.03781 0.0308 0.02373 0.01773 0.01429 0.01217
## Cumulative Proportion  0.85068 0.88849 0.9193 0.94301 0.96075 0.97504 0.98720
##                           PC15    PC16     PC17
## Standard deviation     0.41489 0.21307 0.003413
## Proportion of Variance 0.01013 0.00267 0.000000
## Cumulative Proportion  0.99733 1.00000 1.000000

The summary shows three statistics for each component: the standard deviation, the proportion of variance explained and the cumulative proportion. From the output we can see that PC1 explains about 27% of the variance, PC2 about 20%, PC3 about 9%, and so on. Moreover, besides producing the principal components, the algorithm also removes the correlation between features, since the components are orthogonal and therefore uncorrelated.
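The same proportions can be recomputed directly from the standard deviations stored in the prcomp object:

# variance explained, recomputed from data.pca$sdev
var_expl <- data.pca$sdev^2 / sum(data.pca$sdev^2)
round(var_expl[1:5], 3)        # 0.273, 0.203, 0.088, 0.075, 0.062 as in the summary above
round(cumsum(var_expl)[5], 3)  # about 0.70 of the variance in the first five components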

data_matrix <- data.matrix(data.pca$x, rownames.force = NA)
M <- cor(data_matrix)
corrplot(M, method = "number", number.cex = 0.75)

In order to reduce dimensionality, the task now is to choose the optimal number of components to retain, which is the topic of the next section.

Choosing the number of principal components to retain

According to Zwick and Velicer, there are at least five methods to determine the number of components to retain, among others Bartlett's test, Kaiser's criterion, the Minimum Average Partial (MAP) method, the scree test and parallel analysis (PA). Since the MAP method does not have a convenient R implementation, I will omit it.

The first method is Bartlett's test. According to Zwick and Velicer, the null hypothesis of the test is that the remaining P - m eigenvalues are equal. Each eigenvalue is excluded sequentially until the approximate chi-square test of the null hypothesis of equality fails to be rejected; the first m excluded components are retained. R provides the function nBartlett() from the nFactors package to conduct the test (https://cran.r-project.org/web/packages/nFactors/nFactors.pdf).

library(nFactors)
## Loading required package: lattice
## 
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
## 
##     parallel
nBartlett(data[,-1], nrow(data[,-1]), alpha=0.05, details=TRUE)
## bartlett anderson   lawley 
##       16       16       16

All three variants of the test (bartlett, anderson, lawley) indicate retaining 16 of the 17 components, so the test does not suggest any meaningful reduction of dimensionality.

The next method is Kaiser's criterion, which says that one should retain the components whose eigenvalues are greater than 1. To calculate the eigenvalues, I will use get_eigenvalue() from the factoextra package.

library(factoextra)
eig.val <- get_eigenvalue(data.pca)
eig.val
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  4.640880e+00     2.729930e+01                    27.29930
## Dim.2  3.454466e+00     2.032039e+01                    47.61968
## Dim.3  1.500177e+00     8.824573e+00                    56.44426
## Dim.4  1.271589e+00     7.479937e+00                    63.92419
## Dim.5  1.054863e+00     6.205076e+00                    70.12927
## Dim.6  9.745631e-01     5.732724e+00                    75.86199
## Dim.7  8.299380e-01     4.881988e+00                    80.74398
## Dim.8  7.350087e-01     4.323581e+00                    85.06756
## Dim.9  6.428031e-01     3.781195e+00                    88.84876
## Dim.10 5.236154e-01     3.080091e+00                    91.92885
## Dim.11 4.033368e-01     2.372569e+00                    94.30142
## Dim.12 3.014818e-01     1.773422e+00                    96.07484
## Dim.13 2.429048e-01     1.428852e+00                    97.50369
## Dim.14 2.068286e-01     1.216639e+00                    98.72033
## Dim.15 1.721326e-01     1.012545e+00                    99.73287
## Dim.16 4.539988e-02     2.670581e-01                    99.99993
## Dim.17 1.164913e-05     6.852429e-05                   100.00000

Since the first eigenvalue (4.64) is greater than 1, there is support to retain the first component. The same holds for the second (3.45), third (1.50), fourth (1.27) and fifth (1.05) components. According to the rule, the remaining components should not be used in further analysis.
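The rule can also be applied programmatically to the table above:

# Kaiser's criterion: count components with eigenvalue greater than 1
sum(eig.val[, "eigenvalue"] > 1)  # returns 5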

The next method is the scree test, which is based on a plot of the eigenvalues. In this case it leads to the same conclusion as Kaiser's criterion. There are several ways to draw the so-called scree plot; one of them is scree.plot() from the psy package.

library(psy)
scree.plot(eig.val[,1], title = "Scree Plot", type = "E",  simu = "P")

The last method is parallel analysis. R provides the function hornpa(), which conducts the analysis and returns easy-to-interpret results. To decide which components to retain, one compares the observed eigenvalues with the numbers in the column '0.95'. If an eigenvalue is higher than the corresponding simulated value, there is support to retain that component.

library(hornpa)
hornpa(k=17,size=8949,reps=500,seed=123)
## 
##  Parallel Analysis Results  
##  
## Method: pca 
## Number of variables: 17 
## Sample size: 8949 
## Number of correlation matrices: 500 
## Seed: 123 
## Percentile: 0.95 
##  
## Compare your observed eigenvalues from your original dataset to the 95 percentile in the table below generated using random data. If your eigenvalue is greater than the percentile indicated (not the mean), you have support to retain that factor/component. 
##  
##  Component  Mean  0.95
##          1 1.075 1.088
##          2 1.061 1.071
##          3 1.050 1.058
##          4 1.040 1.048
##          5 1.031 1.039
##          6 1.023 1.030
##          7 1.015 1.022
##          8 1.007 1.014
##          9 1.000 1.006
##         10 0.992 0.998
##         11 0.984 0.990
##         12 0.976 0.983
##         13 0.968 0.976
##         14 0.960 0.967
##         15 0.951 0.959
##         16 0.940 0.950
##         17 0.928 0.940

In our case, the method indicates that five principal components should be considered in the further analysis.
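Making the comparison explicit (the thresholds below are copied from the '0.95' column of the hornpa() output above):

# observed eigenvalues vs. the simulated 95th-percentile thresholds
observed  <- eig.val[1:6, "eigenvalue"]
threshold <- c(1.088, 1.071, 1.058, 1.048, 1.039, 1.030)
observed > threshold       # TRUE for the first five components, FALSE for the sixth
sum(observed > threshold)  # 5 components to retain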

To sum up, three of the methods proved conclusive, and all of them indicated retaining five principal components.

Graphical analysis

In order to make the graphical analysis easier, from now on I will examine the components two at a time.

Visualization of variables' loadings

The first plot shows the variable loadings, i.e. the coefficients that express the principal components as linear combinations of the original variables.

library(factoextra)
p12 <- fviz_pca_var(data.pca, col.var = "navy", repel = TRUE, axes = c(1, 2)) +
  labs(title="PCA", x="PC1", y="PC2")
p45 <- fviz_pca_var(data.pca, col.var = "navy", repel = TRUE, axes = c(4, 5)) +
  labs(title="PCA", x="PC4", y="PC5")
grid.arrange(p12, p45, nrow = 1)

The plots above show the correlations between the original variables as well as the strength of each variable's contribution to the given principal components. From the first plot (PC1, PC2) one can see, for example, that cash-advance frequency and the average amount per purchase transaction are positively correlated, while cash-advance frequency and the percentage of months with full payment of the due statement balance appear to be negatively correlated. Analogous conclusions can be drawn from the second plot (PC4, PC5): variables grouped together are positively correlated, while variables on opposite sides of the plot are negatively correlated. The length of a vector indicates how strongly the variable contributes to the displayed components. This is confirmed by the plots below, which present the contributions of the variables to PC1, PC2, PC3, PC4 and PC5.

library(gridExtra)
cont1 <- fviz_contrib(data.pca, "var", axes = 1)
cont2 <- fviz_contrib(data.pca, "var", axes = 2)
cont3 <- fviz_contrib(data.pca, "var", axes = 3)
cont4 <- fviz_contrib(data.pca, "var", axes = 4)
cont5 <- fviz_contrib(data.pca, "var", axes = 5)
grid.arrange(cont1, cont2, cont3, cont4, cont5)

The charts above clearly show which variables contribute the most to each component (a programmatic cross-check is sketched after the lists below).

PC1 consists of:
- Total purchase amount spent during last 12 months
- Average amount per purchase transaction
- Total amount of one-off purchases
- Total amount of installment purchases
- Frequency of purchases (percentage of months with at least one purchase)
- Frequency of one-off purchases
- Frequency of installment purchases
- Total payments (due amount paid by the customer to decrease their statement balance) in the period

PC2 consists of:
- Total cash-advance amount
- Cash-advance frequency
- Average amount per purchase transaction
- Monthly average balance (based on daily balance averages)
- Credit limit
- Total payments (due amount paid by the customer to decrease their statement balance) in the period

PC3 consists of:
- Frequency of installment purchases
- Ratio of last 12 months with balance
- Total amount of one-off purchases
- Frequency of purchases (percentage of months with at least one purchase)
- Total payments (due amount paid by the customer to decrease their statement balance) in the period
- Total purchase amount spent during last 12 months
- Total minimum payments due in the period

PC4 consists of:
- Number of months as a customer
- Percentage of months with full payment of the due statement balance
- Total minimum payments due in the period
- Average amount per cash-advance transaction
- Monthly average balance (based on daily balance averages)
- Total cash-advance amount
- Cash-advance frequency

PC5 consists of:
- Frequency of one-off purchases
- Ratio of last 12 months with balance
- Total minimum payments due in the period
- Total amount of installment purchases
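The lists above can be cross-checked programmatically; this is a sketch using factoextra's get_pca_var(), whose contrib element stores each variable's percentage contribution to each component (note that at this stage the variables carry the shortened names V1-V17).

# top contributing variables per component, extracted from the PCA object
var_contrib <- factoextra::get_pca_var(data.pca)$contrib
top_vars <- apply(var_contrib[, 1:5], 2, function(v)
  rownames(var_contrib)[order(v, decreasing = TRUE)[1:5]])
top_vars  # five strongest contributors to each of PC1-PC5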

One can also visualize individual observations and the strength of their contributions to the principal components, choosing which observations to display, for instance the top contributing individuals. The plot below shows the 50 top contributing observations.

Visualization of observations’ loadings

The plot shows the observations' scores on the first two principal components; the most contributing observations lie relatively far from the origin of the coordinate system.

fviz_pca_ind(data.pca, col.ind = "navy", repel = TRUE, select.ind = list(contrib=50), axes = c(1, 2)) +
  labs(title="PCA", x="PC1", y="PC2")

Clustering of principal components

Finally, after the analysis of the principal components, I will conduct clustering based on the five retained components. I will use the same function as before, Clara_Medoids, this time with the Euclidean distance.

pca_data <- data.pca$x[,1:5]
start = Sys.time()
clm_eucl <- Clara_Medoids(pca_data, 2, distance_metric = c("euclidean"), samples = 5, sample_size = 0.2, swap_phase = TRUE, verbose = T)
##  
## sample 1 -->  local (sample) dissimilarity : 4363.67 -->  global dissimilarity : 21728.4
## sample 2 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
## sample 3 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
## sample 4 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
## sample 5 -->  local (sample) dissimilarity : 4369.08 -->  global dissimilarity : 21808.9
##  
##  ====================================================== 
##  --- best sample: 1 --->  global dissimilarity: 21728.4
##  ======================================================
end = Sys.time()
t = end - start
cat('time to complete :', t, attributes(t)$units, '\n')
## time to complete : 2.044902 secs

CLARA with the Euclidean distance also produced two clusters; in the best sample the first cluster contains 799 customers and the second 991. The silhouette values for the first and second cluster are 0.221 and 0.306 respectively, giving an average silhouette of 0.268. This is about half the value obtained when clustering the full set of variables with CLARA and the Pearson correlation distance.

The result obtained with PCA is therefore not very satisfying: the average silhouette is lower than for clustering on the original variables. This is probably because the five chosen principal components explain only about 70% of the variance in the data; the presence of outliers may also play a role.
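To probe the first explanation, one can check how the average silhouette behaves as more components are retained. This is a diagnostic sketch of mine (again relying on cluster::clara and its documented silinfo component), not part of the original pipeline.

library(cluster)
# average silhouette width when clustering on an increasing number of components
sil_by_dim <- sapply(2:10, function(d)
  clara(data.pca$x[, 1:d], k = 2, samples = 5, sampsize = 500)$silinfo$avg.width)
names(sil_by_dim) <- paste0("PC1-", 2:10)
round(sil_by_dim, 3)  # if the values stay low, retaining more components alone does not help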

Conclusions

Comparing all the clustering runs, CLARA with Pearson's correlation as the distance metric achieved the best average silhouette, so in terms of cluster quality it is the preferred choice for segmenting the customers. However, considering run time, CLARA on the retained principal components is almost 4 times faster than CLARA on the whole set of variables, which is a substantial advantage for large datasets.

In order to choose the best algorithm overall, one has to specify the goal. If the aim is to describe every customer well, the task is harder and a more complex clustering method may be needed. If covering only part of the customer base is sufficient, the PCA-based algorithm can work, since it groups the most similar customers, who nonetheless make up over 50% of the population.